[vpp-dev] VPP map fragmentation issue
Hi,

We have a problem with fragmentation in map while running the following test:

1. Create an IPv4 packet that won't fit into the MTU (but is smaller than 2*MTU).
2. Send it through an interface that does automatic fragmentation - this produces two fragments: one that exactly fills the MTU and another that is smaller.
3. VPP seems to receive both and is expected to encapsulate them into IPv6, then forward them further.

However, what we see forwarded is only the smaller packet (encapsulated properly, though). What I suspect is that the larger fragment gets encapsulated too, but then it no longer fits the MTU and gets dropped. This happens regardless of any map fragmentation/reassembly setting.

Any idea how I could debug this further? Any logs? Maybe we still need to configure something more - not on the map level?

Best Regards,
Jacek Siuda.

___
vpp-dev mailing list
vpp-dev@lists.fd.io
https://lists.fd.io/mailman/listinfo/vpp-dev
Re: [vpp-dev] mheap performance
Hi Ole,

I need roughly the number of tunnels I ended up testing with - ~300-500k. I thought about extending the generic VPP API with a (blocking?) method accepting a burst (array) of messages and returning a burst of responses. One of the challenges in map is that configuring a tunnel takes two request messages, with the second one depending on the first's result (index). It would be much smoother if I could just send a single message to do both the domain and rule(/s) addition...

Best Regards,
Jacek.

2017-09-05 17:06 GMT+02:00 Ole Troan :
> Jacek,
>
> It's also been on my list for a while to add a better bulk add for MAP domains / rules.
> Any idea of the scale you are looking at here?
>
> Best regards,
> Ole
>
> > On 5 Sep 2017, at 15:07, Jacek Siuda wrote:
> >
> > Hi,
> >
> > I'm conducting a tunnel test using VPP (vnet) map with the following parameters: ea_bits_len=0, psid_offset=16, psid_length, single rule for each domain; total number of tunnels: 300k, total number of control messages: 600k.
> >
> > My problem is simply with adding tunnels. After adding more than ~150k-200k, performance drops significantly: the first 100k are added in ~3s (with an asynchronous C client), the next 100k in another ~5s, but the last 100k take ~37s to add; ~45s in total. Python clients perform even worse: 32 minutes(!) for 300k tunnels with the synchronous (blocking) version and ~95s with the asynchronous one. The Python clients are expected to perform somewhat worse according to the VPP docs, but I was worried by the non-linear time of single tunnel addition, which is visible even with the C client.
> >
> > While investigating this using perf, I found the culprit: it is the memory allocation done for the ip address by the rule addition request. The memory is allocated by clib, which uses the mheap library (~98% of cpu consumption). I looked into mheap and it looks a bit complicated for allocating a short object.
> > I've done a short experiment by replacing (in vnet/map/ only) the clib allocation with DPDK rte_malloc() and achieved much better performance: 300k tunnels in ~5-6s with the same C client, and respectively ~70s and ~30-40s with the Python clients. Also, I haven't noticed any negative impact on packet throughput with my experimental allocator.
> >
> > So, here are my questions:
> > 1) Has anyone else reported performance penalties for using the mheap library? I've searched the list archive and could not find any related questions.
> > 2) Why was the mheap library chosen for clib? Are there performance benefits in some scenarios?
> > 3) Are there any (long- or short-term) plans to replace memory management in clib with some other library?
> > 4) If I'd like to upstream my solution, how should I approach the customization of memory allocation so that it would be accepted by the community? Installable function pointers defaulting to clib?
> >
> > Best Regards,
> > Jacek Siuda.
Re: [vpp-dev] mheap performance
Hi Dave,

The perf backtrace (taken from "control-only" lcore 0) is as follows:

- 91.87% vpp_main libvppinfra.so.0.0.0 [.] mheap_get_aligned
  - mheap_get_aligned
    - 99.48% map_add_del_psid
        vl_api_map_add_del_rule_t_handler
        vl_msg_api_handler_with_vm_node
        memclnt_process
        vlib_process_bootstrap
        clib_calljmp

Using DPDK's rte_malloc_socket(), CPU consumption drops to around 0.5%.

From my (somewhat brief) analysis of the mheap code, it looks like mheap might not take alignment into account when looking for free space. So, in my case, when I keep allocating 16B objects with 64B alignment, it examines each hole left behind by the previous objects' alignment padding, only to then realize that none of them can be used because of the alignment requirement. But of course I might be wrong and the root cause may lie entirely elsewhere... In my test, I'm just adding 300,000 tunnels (one domain + one rule each).

Unfortunately, rte_malloc() provides only aligned memory allocation, not aligned-at-offset. Theoretically we could provide a wrapper around it, but that would need some careful coding and a lot of testing. I made an attempt to quickly replace mheap globally, but of course it ended in utter failure. Right now, I have added the concept of an external allocator to clib (via function pointers), which I enable only upon DPDK plugin initialization. However, this approach requires calling it directly instead of the clib alloc (e.g. I did so upon rule addition). While it does not add a dependency on DPDK, I'm not fully satisfied, because it would require manual replacement of all allocation calls. If you want, I can share the patch.

Best Regards,
Jacek.

2017-09-05 15:30 GMT+02:00 Dave Barach (dbarach) :
> Dear Jacek,
>
> Use of the clib memory allocator is mainly historical. It's elegant in a couple of ways - including built-in leak-finding - but it has been known to backfire in terms of performance. Individual mheaps are limited to 4gb in a [typical] 32-bit vector length image.
>
> Note that the idiosyncratic mheap API functions "tell me how long this object really is" and "allocate N bytes aligned to a boundary at a certain offset" are used all over the place.
>
> I wouldn't mind replacing it - so long as we don't create a hard dependency on the dpdk - but before we go there...: Tell me a bit about the scenario at hand. What are we repeatedly allocating / freeing? That's almost never necessary...
>
> Can you easily share the offending backtrace?
>
> Thanks… Dave
>
> *From:* Jacek Siuda
> *Sent:* Tuesday, September 5, 2017 9:08 AM
> *To:* vpp-dev@lists.fd.io
> *Subject:* [vpp-dev] mheap performance
>
> [...]
[vpp-dev] mheap performance
Hi,

I'm conducting a tunnel test using VPP (vnet) map with the following parameters: ea_bits_len=0, psid_offset=16, psid_length, single rule for each domain; total number of tunnels: 300k, total number of control messages: 600k.

My problem is simply with adding tunnels. After adding more than ~150k-200k, performance drops significantly: the first 100k are added in ~3s (with an asynchronous C client), the next 100k in another ~5s, but the last 100k take ~37s to add; ~45s in total. Python clients perform even worse: 32 minutes(!) for 300k tunnels with the synchronous (blocking) version and ~95s with the asynchronous one. The Python clients are expected to perform somewhat worse according to the VPP docs, but I was worried by the non-linear time of single tunnel addition, which is visible even with the C client.

While investigating this using perf, I found the culprit: it is the memory allocation done for the ip address by the rule addition request. The memory is allocated by clib, which uses the mheap library (~98% of cpu consumption). I looked into mheap and it looks a bit complicated for allocating a short object.

I've done a short experiment by replacing (in vnet/map/ only) the clib allocation with DPDK rte_malloc() and achieved much better performance: 300k tunnels in ~5-6s with the same C client, and respectively ~70s and ~30-40s with the Python clients. Also, I haven't noticed any negative impact on packet throughput with my experimental allocator.

So, here are my questions:
1) Has anyone else reported performance penalties for using the mheap library? I've searched the list archive and could not find any related questions.
2) Why was the mheap library chosen for clib? Are there performance benefits in some scenarios?
3) Are there any (long- or short-term) plans to replace memory management in clib with some other library?
4) If I'd like to upstream my solution, how should I approach the customization of memory allocation so that it would be accepted by the community? Installable function pointers defaulting to clib?

Best Regards,
Jacek Siuda.