Re: [vpp-dev] mheap performance issue and fixup
Hi Dave,

I noticed you made some improvements to mheap in the latest code. I'm afraid it will not work as you expected. Actually the same idea came to my mind, and then I realized it doesn't work well. Here is your code:

  if (align > MHEAP_ELT_OVERHEAD_BYTES)
    n_user_data_bytes = clib_max (n_user_data_bytes, align - MHEAP_ELT_OVERHEAD_BYTES);

Instead of allocating a very small object, you round it up to something a bit bigger. Let's take a typical case: you want 8 bytes, aligned to 64B. Then you are actually allocating from the free bin that starts at 64-8=56 bytes. This is problematic if none of the mheap elements in the 56B free bin happens to be 64B aligned. It becomes a performance issue when this 56B free bin holds a lot of elements, because you have to go through them one by one. Falling back to a bigger free bin does not help either, because the requested block size has already been rounded up.

As you can see in my patch, I instead add a modifier to the requested block size when looking up an appropriate free bin, in mheap_get_search_free_list (a small numeric sketch is appended after this message):

  /* kingwel, lookup a free bin which is big enough to hold everything: align + align_offset + lo_free_size + overhead */
  word modifier = (align > MHEAP_USER_DATA_WORD_BYTES ? align + align_offset + sizeof (mheap_elt_t) : 0);
  bin = user_data_size_to_bin_index (n_user_bytes + modifier);

This ensures we can always locate an element without going through the whole list. I also take the align_offset into consideration.

BTW, I probably need more time to work out a test case in test_mheap to prove my fix. It seems the latest code doesn't generate test_mheap any more; I have to figure that out first.

Regards,
Kingwel
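To make the comparison above concrete, here is a small standalone sketch, not the actual vppinfra code: MHEAP_ELT_OVERHEAD_BYTES, MHEAP_USER_DATA_WORD_BYTES and sizeof (mheap_elt_t) are simply assumed to be 8 bytes for illustration. It plugs the 8-byte / 64B-aligned example into both formulas.

  #include <stdio.h>

  /* Assumed values, for illustration only -- the real definitions live in
     vppinfra and may differ.                                               */
  #define MHEAP_ELT_OVERHEAD_BYTES   8	/* assumed per-element header size   */
  #define MHEAP_USER_DATA_WORD_BYTES 8	/* assumed user-data word size       */
  #define MHEAP_ELT_T_SIZE           8	/* stand-in for sizeof (mheap_elt_t) */

  int
  main (void)
  {
    unsigned n_user_data_bytes = 8;	/* caller wants 8 bytes ...          */
    unsigned align = 64;		/* ... aligned to a 64B cache line   */
    unsigned align_offset = 4;		/* e.g. the vector header offset     */

    /* Round-up approach: search the bin for max (8, 64 - 8) = 56 bytes.
       Every 56B free element still has to be checked for 64B alignment,
       so a long scan of that bin is still possible.                        */
    unsigned round_up = n_user_data_bytes;
    if (align > MHEAP_ELT_OVERHEAD_BYTES
        && round_up < align - MHEAP_ELT_OVERHEAD_BYTES)
      round_up = align - MHEAP_ELT_OVERHEAD_BYTES;

    /* Modifier approach: search a bin big enough for the request plus
       align + align_offset + element header, so any element found there
       can be split to satisfy the alignment on the first attempt.          */
    unsigned modifier = (align > MHEAP_USER_DATA_WORD_BYTES)
      ? align + align_offset + MHEAP_ELT_T_SIZE : 0;
    unsigned with_modifier = n_user_data_bytes + modifier;

    printf ("round-up search size: %u bytes\n", round_up);	/* 56 */
    printf ("modifier search size: %u bytes\n", with_modifier);	/* 84 */
    return 0;
  }

With the round-up approach the search stays in the 56B bin, where nothing guarantees 64B alignment; with the modifier the search jumps to a bin large enough (84B here) that whatever it returns can be split to satisfy the alignment.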
Re: [vpp-dev] mheap performance issue and fixup
Allocating a large number of 16 byte objects @ 64 byte alignment will never work very well. If you pad the object such that the mheap header plus the object is exactly 64 bytes, the issue may go away. With that hint, however, I'll go build a test vector. It sounds like the mheap required size calculation might be a brick shy of a load.

D.
Re: [vpp-dev] mheap performance issue and fixup
No problem. I'll do that later. Actually there has already been a discussion about mheap performance which describes the issue we are talking about. Please check it again: https://lists.fd.io/g/vpp-dev/topic/10642197#6399
Re: [vpp-dev] mheap performance issue and fixup
+1. It would be super-helpful if you were to add test cases to .../src/vppinfra/test_mheap.c, and push a draft patch so we can reproduce / fix the problem(s). Thanks...

Dave
Re: [vpp-dev] mheap performance issue and fixup
Dear Kingwel,

We finally managed to look at your mheap patches, sorry for the delay. Still, we are not 100% convinced that there is a bug (or bugs) in the mheap code. Please note that the mheap code is stable, not changed frequently, and has been used for years.

It will really help if you can provide test vectors for each issue you observed. It will be much easier to understand the problem and confirm the fix if we are able to reproduce it in a controlled environment.

thanks,
Damjan
Re: [vpp-dev] mheap performance issue and fixup
Did you delete the shared-memory handles? These two files:

  /dev/shm/global_vm
  /dev/shm/vpe-api

The mheap header structure has changed, so you have to let VPP re-create the shared memory heap (a minimal cleanup sketch follows this message). Sorry, I forgot to mention that before.

From: 薛欣颖 [mailto:xy...@fiberhome.com]
Sent: Monday, April 23, 2018 3:25 PM
To: Kingwel Xie <kingwel@ericsson.com>; Damjan Marion <damar...@cisco.com>; nranns <nra...@cisco.com>
Cc: vpp-dev <vpp-dev@lists.fd.io>
Subject: Re: Re: [vpp-dev] mheap performance issue and fixup

Hi Kingwel,

After I merged the three patches, there is a SIGSEGV when I start up VPP (not every time). The error didn't appear before. Is there anything I can do to fix it?

Program received signal SIGSEGV, Segmentation fault.
clib_mem_alloc_aligned_at_offset (size=54, align=4, align_offset=4, os_out_of_memory_on_failure=1) at /home/vpp/build-data/../src/vppinfra/mem.h:90
90        cpu = os_get_thread_index ();
(gdb) bt
#0  clib_mem_alloc_aligned_at_offset (size=54, align=4, align_offset=4, os_out_of_memory_on_failure=1) at /home/vpp/build-data/../src/vppinfra/mem.h:90
#1  0x7697fcde in vec_resize_allocate_memory (v=0x0, length_increment=50, data_bytes=54, header_bytes=4, data_align=4) at /home/vpp/build-data/../src/vppinfra/vec.c:59
#2  0x769313c7 in _vec_resize (v=0x0, length_increment=50, data_bytes=50, header_bytes=0, data_align=0) at /home/vpp/build-data/../src/vppinfra/vec.h:142
#3  0x769322bb in do_percent (_s=0x7fffb6cee348, fmt=0x76995c90 "%s:%d (%s) assertion `%s' fails", va=0x7fffb6cee3e0) at /home/vpp/build-data/../src/vppinfra/format.c:339
#4  0x76932703 in va_format (s=0x0, fmt=0x76995c90 "%s:%d (%s) assertion `%s' fails", va=0x7fffb6cee3e0) at /home/vpp/build-data/../src/vppinfra/format.c:402
#5  0x7692ce4e in _clib_error (how_to_die=2, function_name=0x0, line_number=0, fmt=0x76995c90 "%s:%d (%s) assertion `%s' fails") at /home/vpp/build-data/../src/vppinfra/error.c:127
#6  0x769496a3 in mheap_get_search_free_bin (v=0x3000a000, bin=12, n_user_data_bytes_arg=0x7fffb6cee6b0, align=4, align_offset=0) at /home/vpp/build-data/../src/vppinfra/mheap.c:401
#7  0x76949e86 in mheap_get_search_free_list (v=0x3000a000, n_user_bytes_arg=0x7fffb6cee6b0, align=4, align_offset=0) at /home/vpp/build-data/../src/vppinfra/mheap.c:569
#8  0x7694a326 in mheap_get_aligned (v=0x3000a000, n_user_data_bytes=56, align=4, align_offset=0, offset_return=0x7fffb6cee758) at /home/vpp/build-data/../src/vppinfra/mheap.c:700
#9  0x7697f91e in clib_mem_alloc_aligned_at_offset (size=54, align=4, align_offset=4, os_out_of_memory_on_failure=1) at /home/vpp/build-data/../src/vppinfra/mem.h:92
#10 0x7697fcde in vec_resize_allocate_memory (v=0x0, length_increment=50, data_bytes=54, header_bytes=4, data_align=4) at /home/vpp/build-data/../src/vppinfra/vec.c:59
#11 0x769313c7 in _vec_resize (v=0x0, length_increment=50, data_bytes=50, header_bytes=0, data_align=0) at /home/vpp/build-data/../src/vppinfra/vec.h:142
#12 0x769322bb in do_percent (_s=0x7fffb6ceea78, fmt=0x76995c90 "%s:%d (%s) assertion `%s' fails", va=0x7fffb6ceeb10) at /home/vpp/build-data/../src/vppinfra/format.c:339
#13 0x76932703 in va_format (s=0x0, fmt=0x76995c90 "%s:%d (%s) assertion `%s' fails", va=0x7fffb6ceeb10) at /home/vpp/build-data/../src/vppinfra/format.c:402
#14 0x7692ce4e in _clib_error (how_to_die=2, function_name=0x0, line_number=0, fmt=0x76995c90 "%s:%d (%s) assertion `%s' fails") at /home/vpp/build-data/../src/vppinfra/error.c:127
#15 0x769496a3 in mheap_get_search_free_bin (v=0x3000a000, bin=12, n_user_data_bytes_arg=0x7fffb6ceede0, align=4, align_offset=0) at /home/vpp/build-data/../src/vppinfra/mheap.c:401
#16 0x76949e86 in mheap_get_search_free_list (v=0x3000a000, n_user_bytes_arg=0x7fffb6ceede0, align=4, align_offset=0) at /home/vpp/build-data/../src/vppinfra/mheap.c:569
#17 0x7694a326 in mheap_get_aligned (v=0x3000a000, n_user_data_bytes=56, align=4, align_offset=0, offset_return=0x7fffb6ceee88) at /home/vpp/build-data/../src/vppinfra/mheap.c:700
#18 0x7697f91e in clib_mem_alloc_aligned_at_offset (size=54, align=4, align_offset=4, os_out_of_memory_on_failure=1) at /home/vpp/build-data/../src/vppinfra/mem.h:92
#19 0x7697fcde in vec_resize_allocate_memory (v=0x0, length_increment=50, data_bytes=54, header_bytes=4, data_align=4) at /home/vpp/build-data/../src/vppinfra/vec.c:59
#20 0x769313c7 in _vec_resize (v=0x0, length_increment=50, data_bytes=50, header_bytes=0, data_align=0) at /home/vpp/build-data/../src/vppinfra/vec.h:142
#21 0x769322bb in do_percent (_s=0x7fffb6cef1a8, fmt=0x76995c90 "%s:%d (%s) assertion `%s' fails", va=0x7fffb6cef240) at /home/vpp/build-data/../src/vppinfra/format.c:339
#22 0x0
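For reference, a minimal sketch of the cleanup step mentioned at the top of this message, assuming the default file names given there (equivalently, just remove the two files from the shell while VPP is stopped):

  #include <stdio.h>
  #include <unistd.h>

  /* Remove the stale shared-memory segments so that VPP re-creates them with
     the new mheap header layout on next start.  Run only while VPP is
     stopped; the paths are the defaults named in the message above.         */
  int
  main (void)
  {
    const char *files[] = { "/dev/shm/global_vm", "/dev/shm/vpe-api" };
    for (int i = 0; i < 2; i++)
      if (unlink (files[i]) != 0)
        perror (files[i]);
    return 0;
  }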
Re: [vpp-dev] mheap performance issue and fixup
./src/vppinfra/format.c:402
#132 0x76932876 in format (s=0x0, fmt=0x75f1b18b "%s%c") at /home/vpp/build-data/../src/vppinfra/format.c:421
#133 0x75f0ce10 in shm_name_from_svm_map_region_args (a=0x7fffb6cf4cf0) at /home/vpp/build-data/../src/svm/svm.c:525
#134 0x75f0d60d in svm_map_region (a=0x7fffb6cf4cf0) at /home/vpp/build-data/../src/svm/svm.c:658
#135 0x75f0e663 in svm_region_find_or_create (a=0x7fffb6cf4cf0) at /home/vpp/build-data/../src/svm/svm.c:995
#136 0x77938e9a in vl_map_shmem (region_name=0x779554c7 "/vpe-api", is_vlib=1) at /home/vpp/build-data/../src/vlibmemory/memory_shared.c:514
#137 0x779413fc in memory_api_init (region_name=0x779554c7 "/vpe-api") at /home/vpp/build-data/../src/vlibmemory/memory_vlib.c:651
#138 0x77942ed0 in memclnt_process (vm=0x77926840, node=0x7fffb6cec000, f=0x0) at /home/vpp/build-data/../src/vlibmemory/memory_vlib.c:952
#139 0x776a603c in vlib_process_bootstrap (_a=140736237530192) at /home/vpp/build-data/../src/vlib/main.c:1253
#140 0x76941570 in clib_calljmp () at /home/vpp/build-data/../src/vppinfra/longjmp.S:128
#141 0x7fffb571ec20 in ?? ()
#142 0x776a6179 in vlib_process_startup (vm=0x77926840, p=0x7fffb6cec000, f=0x0) at /home/vpp/build-data/../src/vlib/main.c:1275
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb)

Thanks,
Xyxue
Re: [vpp-dev] mheap performance issue and fixup
Thanks, I added -2 until it is discussed. Dave is back from vacation next week and he is most familiar with that code...

--
Damjan
Re: [vpp-dev] mheap performance issue and fixup
Hi Kingwel,

Thank you very much for your help.

Thanks,
Xyxue
Re: [vpp-dev] mheap performance issue and fixup
Hi,

Finally I managed to create 3 patches to include all modifications to mheap. Please check below for details. I'll do some other patches later…

https://gerrit.fd.io/r/11950
https://gerrit.fd.io/r/11952
https://gerrit.fd.io/r/11957

Hi Xue, you need at least the first one for your test.

Regards,
Kingwel
Re: [vpp-dev] mheap performance issue and fixup
Got it. Will look into it. It will take a few days… I'll ask someone on the team to commit the code, then ask for your review.
Re: [vpp-dev] mheap performance issue and fixup
Hi Kingwel,

The instructions are here: https://wiki.fd.io/view/VPP/Pulling,_Building,_Running,_Hacking_and_Pushing_VPP_Code#Pushing

You can also file a bug report here: https://jira.fd.io, but we don't insist on bug reports when making changes to code on the master branch.

Regards,
Neale
Re: [vpp-dev] mheap performance issue and fixup
Hi Damjan, We will do it ASAP. Actually we are quite new to VPP and don't yet know how to file a bug report or contribute code. Regards, Kingwel
Re: [vpp-dev] mheap performance issue and fixup
Dear Kingwel, Thank you for your email. It would be really appreciated if you could submit your changes to Gerrit, preferably with each point in a separate patch. That will be the best place to discuss those changes... Thanks in advance, -- Damjan
[vpp-dev] mheap performance issue and fixup
Hi all, We recently worked on GTPU tunnels and our target is to create 2M tunnels. It is not as easy as it looks, and it took us quite some time to figure it out. The biggest problem we found is in mheap, which as you know is the low-level memory management layer of VPP. We believe it makes sense to share what we found and what we've done to improve the performance of mheap.

First of all, mheap is fast. It has a well-designed small-object cache and multi-level free lists to speed up get/put. However, as discussed on this mailing list before, it has a performance issue when dealing with align/align_offset allocations. We traced the problem to the 'rewrite' pointer in gtp_tunnel_t. This rewrite is a vector and is required to be aligned to a 64B cache line, therefore with a 4-byte align offset. The free list ends up very long, with many mheap_elts, but unfortunately without a single element that satisfies all 3 prerequisites: size, align, and align offset. In this case, each allocation has to traverse all elements until it reaches the end of the list. As a result, with 'show memory verbose' you might observe each allocation costing far more than the 200~300 clocks/call it should take in general, which indicates the allocation takes too long. You should also have noticed that 'per-attempt' is quite high, even more than 100.

The immediate fix is straightforward: as discussed on this mailing list before, allocate 'rewrite' from a pool instead of from mheap. Frankly speaking, that looks like a workaround rather than a real fix, so we spent some time fixing the problem thoroughly. The idea is to add a few more bytes to the originally requested block size so that mheap will always look in a bigger free bin, where a suitable block can most likely be located. Now the question becomes: how big is this extra size? It should be at least align + align_offset, which is not hard to understand. But after careful analysis we think it is better to be like this, see the code below (mheap.c:545):

word modifier = (align > MHEAP_USER_DATA_WORD_BYTES ?
                 align + align_offset + sizeof (mheap_elt_t) : 0);
bin = user_data_size_to_bin_index (n_user_bytes + modifier);

The extra sizeof (mheap_elt_t) is there to avoid lo_free_size being too small to hold a complete free element. You will understand it once you know how mheap_get_search_free_bin works; I am not going to go through the details here. In short, every free-list lookup will locate a suitable element; in other words, the free-list hit rate will be almost 100% and 'per-attempt' will always be around 1.

The test result looks very promising. Please see below, after adding 2M GTPU tunnels and 2M routing entries:

Thread 0 vpp_main
13689507 objects, 3048367k of 3505932k used, 243663k free, 243656k reclaimed, 106951k overhead, 4194300k capacity
  alloc. from small object cache: 47325868 hits 65271210 attempts (72.51%) replacements 8266122
  alloc. from free-list: 21879233 attempts, 21877898 hits (99.99%), 21882794 considered (per-attempt 1.00)
  alloc. low splits: 13355414, high splits: 512984, combined: 281968
  alloc. from vector-expand: 81907
  allocs: 69285673 276.00 clocks/call
  frees: 55596166 173.09 clocks/call
Free list:
  bin 3: 20(82220170 48) total 1
  bin 273: 28340k(80569efc 60) total 1
  bin 276: 215323k(8c88df6c 44) total 1
Total count in free bin: 3

You can see, as pointed out before, that the hit rate is very high, >99.9%, and per-attempt is ~1. Furthermore, the total number of elements in the free list is only 3.
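As a rough standalone illustration of the free-bin lookup adjustment above (this is not the real mheap code; the linear bin spacing, the demo_ names, and the 8-byte element overhead are made up for the example), the point is simply that adding align + align_offset + element overhead to the requested size moves the search into a bin where any element it finds can be trimmed to satisfy the constraints:

#include <stdio.h>

typedef unsigned long uword;

/* Hypothetical linear binning: bin i holds free chunks of at least i * 8
 * user-data bytes. The real user_data_size_to_bin_index uses mheap's own
 * bin spacing. */
static uword
demo_size_to_bin (uword n_user_bytes)
{
  return (n_user_bytes + 7) / 8;
}

/* Sketch of the lookup adjustment: for an aligned (and possibly offset)
 * request, search a bin large enough for size + align + align_offset +
 * element overhead, so the first element found can always be split to
 * meet the alignment constraints. */
static uword
demo_search_bin (uword n_user_bytes, uword align, uword align_offset,
                 uword elt_overhead_bytes, uword user_data_word_bytes)
{
  uword modifier = align > user_data_word_bytes
    ? align + align_offset + elt_overhead_bytes
    : 0;
  return demo_size_to_bin (n_user_bytes + modifier);
}

int
main (void)
{
  /* E.g. an 8-byte object, 64-byte aligned, 4-byte align offset, with a
   * hypothetical 8-byte element overhead. */
  printf ("plain bin %lu, adjusted bin %lu\n",
          demo_size_to_bin (8), demo_search_bin (8, 64, 4, 8, 8));
  return 0;
}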
Apart from what we discussed above, we also made some other improvements / bug fixes to mheap:

1. Bug fix: the macros MHEAP_ELT_OVERHEAD_BYTES & MHEAP_MIN_USER_DATA_BYTES are wrongly defined. In fact MHEAP_ELT_OVERHEAD_BYTES should be (STRUCT_OFFSET_OF (mheap_elt_t, user_data)).
2. mheap_bytes_overhead wrongly calculates the total overhead; it should be the number of elements * MHEAP_ELT_OVERHEAD_BYTES.
3. Do not split off a new element if hi_free_size is smaller than 4 times MHEAP_MIN_USER_DATA_BYTES. This is to avoid memory fragmentation.
4. Bug fix: register_node.c:336 wrongly checks the vector memory; it should be like this: clib_mem_is_heap_object (vec_header (r->name, 0)) (see the sketch after this message).
5. Bug fix: dpo_stack_from_node in dpo.c leaks parent_indices.
6. Some fixes and improvements to format_mheap to show more information about the heap.

The code including all fixes is tentatively in our private code base. It can of course be shared if wanted. We would really appreciate any comments! Regards, Kingwel
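To illustrate item 4 above (a standalone sketch with a stand-in header struct, not the real clib vector header or clib_mem_is_heap_object): a clib-style vector pointer points at the element array, while the underlying heap object starts at the header placed in front of it, so any "is this a heap object" check has to be made on the header address rather than on the vector pointer.

#include <stdio.h>
#include <stdlib.h>

/* Stand-in for the header that clib places in front of vector data. */
typedef struct
{
  size_t len;                   /* number of elements in use */
} demo_vec_header_t;

/* Allocate a "vector" of n bytes and return a pointer to its data,
 * mimicking how a clib vector pointer refers to the element array. */
static char *
demo_vec_new (size_t n)
{
  demo_vec_header_t *h = malloc (sizeof (*h) + n);
  h->len = n;
  return (char *) (h + 1);
}

/* The address the allocator actually knows about is the header. */
static demo_vec_header_t *
demo_vec_header (char *v)
{
  return ((demo_vec_header_t *) v) - 1;
}

int
main (void)
{
  char *name = demo_vec_new (8);
  printf ("data pointer %p, heap object really starts at %p\n",
          (void *) name, (void *) demo_vec_header (name));
  free (demo_vec_header (name));        /* free must also use the header address */
  return 0;
}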
Re: [vpp-dev] mheap performance
Dear Jacek, Oh, heck, we don't need to use a sledgehammer to kill a fly. It will take five minutes to fix this problem. Copying Ole Troan for his input, and / or to simply fix the problem as follows:

- Make a set of pools whose elements are n * CLIB_CACHE_LINE_BYTES in size. It's easy enough to dynamically create a fresh pool if [all of a sudden] you need k * CLIB_CACHE_LINE_BYTES.
- Allocate d->rules from the appropriate pool by rounding 1 << psid_length up to a multiple of a cache line (see the sketch after this message).

At that point, the memory allocator will instantly behave itself. If necessary, you can preallocate the rule pools, see also pool.h. Absent data to the contrary, it's reasonably likely that cache-line alignment of d->rules is unnecessary in the first place. Have you tried dropping the alignment constraint? Thanks… Dave
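Purely as a sketch of the pool-per-size-class idea above (plain C with invented demo_ names, not the vppinfra pool.h API; the rule entry type and psid_length value are made up): round the requested size up to whole cache lines and recycle freed tables on a per-class free list, so alignment never forces a search.

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define DEMO_CACHE_LINE_BYTES 64
#define DEMO_MAX_CLASSES      64        /* supports objects up to 64 cache lines */

/* One free list per size class; class i serves objects of i + 1 cache lines. */
typedef struct demo_free_elt { struct demo_free_elt *next; } demo_free_elt_t;
static demo_free_elt_t *demo_pools[DEMO_MAX_CLASSES];

static size_t
demo_n_lines (size_t n_bytes)
{
  return (n_bytes + DEMO_CACHE_LINE_BYTES - 1) / DEMO_CACHE_LINE_BYTES;
}

static void *
demo_pool_alloc (size_t n_bytes)
{
  size_t n_lines = demo_n_lines (n_bytes);
  assert (n_lines >= 1 && n_lines <= DEMO_MAX_CLASSES);
  demo_free_elt_t **head = &demo_pools[n_lines - 1];
  if (*head)
    {
      void *rv = *head;
      *head = (*head)->next;
      return rv;
    }
  /* Fresh, cache-line-aligned element; every element in a class has the
   * same size, so there is never a fit/alignment search. */
  return aligned_alloc (DEMO_CACHE_LINE_BYTES, n_lines * DEMO_CACHE_LINE_BYTES);
}

static void
demo_pool_free (void *p, size_t n_bytes)
{
  demo_free_elt_t *e = p;
  e->next = demo_pools[demo_n_lines (n_bytes) - 1];
  demo_pools[demo_n_lines (n_bytes) - 1] = e;
}

int
main (void)
{
  /* e.g. a MAP rule table with (1 << psid_length) two-byte entries. */
  size_t psid_length = 6;
  size_t n_bytes = (1u << psid_length) * sizeof (unsigned short);
  void *rules = demo_pool_alloc (n_bytes);
  demo_pool_free (rules, n_bytes);
  printf ("allocated and recycled one %zu-byte rule table\n", n_bytes);
  return 0;
}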
Re: [vpp-dev] mheap performance
Hi Ole, I need roughly the amount of tunnels I ended up testing with: ~300-500k. I thought about extending the generic VPP API with a (blocking?) method accepting a burst (array) of messages and returning a burst of responses. One of the challenges in map is that to configure a tunnel I need two request messages, with the second one depending on the first one's result (index). It would be much smoother if I could just send a single message to do both the domain and rule(/s) addition... Best Regards, Jacek.
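Purely hypothetical (no such message exists in the VPP binary API; the struct and field names below are invented just to make the idea concrete), a combined request along these lines would carry the domain plus its rules in one message and remove the dependency on the returned domain index:

/* Invented, illustration-only layout for a combined "add domain + rules"
 * request; this is not part of the real VPP MAP API. */
typedef struct
{
  unsigned char ip6_prefix[16];
  unsigned char ip4_prefix[4];
  unsigned char ea_bits_len;
  unsigned char psid_offset;
  unsigned char psid_length;
  unsigned int n_rules;                 /* number of rules appended below */
  struct
  {
    unsigned short psid;
    unsigned char ip6_dst[16];
  } rules[];                            /* rules carried in the same message */
} demo_map_add_domain_with_rules_t;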
Re: [vpp-dev] mheap performance
Hi Dave, The perf backtrace (taken from the "control-only" lcore 0) is as follows:

- 91.87% vpp_main  libvppinfra.so.0.0.0  [.] mheap_get_aligned
   - mheap_get_aligned
      - 99.48% map_add_del_psid
           vl_api_map_add_del_rule_t_handler
           vl_msg_api_handler_with_vm_node
           memclnt_process
           vlib_process_bootstrap
           clib_calljmp

Using DPDK's rte_malloc_socket(), CPU consumption drops to around 0.5%. From my (somewhat brief) mheap code analysis, it looks like mheap might not take alignment into account when looking for free space to allocate a structure. So, in my case, when I keep allocating 16B objects with 64B alignment, it starts to examine each hole left by a previous object's alignment padding and only then realizes it cannot be used because of alignment. But of course I might be wrong and the root cause is entirely elsewhere... In my test, I'm just adding 300,000 tunnels (one domain + one rule). Unfortunately, rte_malloc() provides only aligned memory allocation, not aligned-at-offset. Theoretically we could provide a wrapper around it, but that would need some careful coding and a lot of testing. I made an attempt to quickly replace mheap globally, but of course it ended up in utter failure. Right now, I have added the concept of an external allocator to clib (via function pointers), and I'm enabling it only upon DPDK plugin initialization. However, such an approach requires using it directly instead of the clib alloc (e.g. I did it for rule adding). While it does not add a dependency on DPDK, I'm not fully satisfied, because it would need manual replacement of all allocation calls. If you want, I can share the patch. Best Regards, Jacek.
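A minimal sketch of the "installable function pointers defaulting to clib" idea (standalone C with invented names; not the actual patch and not the clib API): the allocation entry points sit behind a small vtable that defaults to the built-in allocator and can be overridden once, e.g. at DPDK plugin init.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical allocator vtable; the default stands in for the clib
 * allocator, and a plugin could install e.g. an rte_malloc-based one. */
typedef struct
{
  void *(*alloc_aligned_at_offset) (size_t size, size_t align, size_t align_offset);
  void (*free_fn) (void *p);
} demo_allocator_t;

static void *
demo_default_alloc (size_t size, size_t align, size_t align_offset)
{
  (void) align;
  (void) align_offset;          /* ignored by this toy default */
  return malloc (size);
}

static void
demo_default_free (void *p)
{
  free (p);
}

static demo_allocator_t demo_allocator = {
  .alloc_aligned_at_offset = demo_default_alloc,
  .free_fn = demo_default_free,
};

/* A plugin would call this once at init time to install its allocator. */
static void
demo_install_allocator (const demo_allocator_t *a)
{
  demo_allocator = *a;
}

int
main (void)
{
  /* No-op here; a real plugin would pass its own table. */
  demo_install_allocator (&demo_allocator);
  void *p = demo_allocator.alloc_aligned_at_offset (64, 64, 4);
  demo_allocator.free_fn (p);
  printf ("default allocator used; a plugin could swap it at init\n");
  return 0;
}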
Re: [vpp-dev] mheap performance
Jacek, It's also been on my list for a while to add a better bulk add for MAP domains / rules. Any idea of the scale you are looking at here? Best regards, Ole
Re: [vpp-dev] mheap performance
Dear Jacek, Use of the clib memory allocator is mainly historical. It's elegant in a couple of ways - including built-in leak-finding - but it has been known to backfire in terms of performance. Individual mheaps are limited to 4gb in a [typical] 32-bit vector length image. Note that the idiosyncratic mheap API functions "tell me how long this object really is" and "allocate N bytes aligned to a boundary at a certain offset" are used all over the place. I wouldn't mind replacing it - so long as we don't create a hard dependency on the dpdk - but before we go there...: Tell me a bit about the scenario at hand. What are we repeatedly allocating / freeing? That's almost never necessary... Can you easily share the offending backtrace? Thanks… Dave

From: vpp-dev-boun...@lists.fd.io [mailto:vpp-dev-boun...@lists.fd.io] On Behalf Of Jacek Siuda
Sent: Tuesday, September 5, 2017 9:08 AM
To: vpp-dev@lists.fd.io
Subject: [vpp-dev] mheap performance

Hi, I'm conducting a tunnel test using the VPP (vnet) map with the following parameters: ea_bits_len=0, psid_offset=16, psid=length, a single rule for each domain; total number of tunnels: 300k, total number of control messages: 600k.

My problem is with simply adding tunnels. After adding more than ~150k-200k, performance drops significantly: the first 100k is added in ~3s (on the asynchronous C client), the next 100k in another ~5s, but the last 100k takes ~37s to add; in total: ~45s. Python clients perform even worse: 32 minutes(!) for 300k tunnels with the synchronous (blocking) version and ~95s with the asynchronous one. The Python clients are expected to perform a bit worse according to the VPP docs, but I was worried by the non-linear time of a single tunnel addition that is visible even on the C client.

While investigating this using perf, I found the culprit: it is the memory allocation done for the ip address by the rule addition request. The memory is allocated by clib, which is using the mheap library (~98% of cpu consumption). I looked into mheap and it looks a bit complicated for allocating a short object. I've done a short experiment by replacing (in vnet/map/ only) the clib allocation with DPDK rte_malloc() and achieved way better performance: 300k tunnels in ~5-6s with the same C client, and respectively ~70s and ~30-40s with the Python clients. Also, I haven't noticed any negative impact on packet throughput with my experimental allocator.

So, here are my questions:
1) Has anyone else reported performance penalties for using the mheap library? I've searched the list archive and could not find any related questions.
2) Why was the mheap library chosen for clib? Are there any performance benefits in some scenarios?
3) Are there any (long- or short-term) plans to replace memory management in clib with some other library?
4) I wonder, if I'd like to upstream my solution, how should I approach customization of memory allocation so it would be accepted by the community? Installable function pointers defaulting to clib?

Best Regards, Jacek Siuda.
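To make the "aligned to a boundary at a certain offset" requirement concrete (a standalone toy, not the clib or mheap implementation; the over-allocate-and-slide approach and the 4-byte header are only for illustration): the typical user is a vector whose small header sits immediately before the data, and it is the data, not the start of the allocation, that must land on a cache-line boundary.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Toy "allocate n bytes such that (pointer + align_offset) is aligned to
 * `align`": over-allocate and slide forward. Note this loses the original
 * malloc pointer, so the block cannot be freed properly; a real allocator
 * like mheap satisfies the constraint without wasting the slack, which is
 * exactly what the free-bin discussion earlier in the thread is about. */
static void *
demo_alloc_aligned_at_offset (size_t n, size_t align, size_t align_offset)
{
  uint8_t *base = malloc (n + align);
  uintptr_t p = (uintptr_t) base + align_offset;
  uintptr_t aligned = (p + align - 1) & ~(uintptr_t) (align - 1);
  return (void *) (aligned - align_offset);
}

int
main (void)
{
  size_t header_bytes = 4;      /* e.g. a small vector header before the data */
  void *obj = demo_alloc_aligned_at_offset (4 + 64, 64, header_bytes);
  printf ("object starts at %p, data at %p (64-byte aligned)\n",
          obj, (void *) ((uint8_t *) obj + header_bytes));
  return 0;
}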