Re: [vpp-dev] Packet loss on use of API & cmdline
Dear Colin,

That makes total sense... Tunnels are modelled as "magic interfaces" [especially] in the encap direction. Each tunnel has an output node, which means that the [first] FIB entry will need to add a graph arc. A bit of "show vlib graph" action will confirm that...

Thanks… Dave

-----Original Message-----
From: Colin Tregenza Dancer [mailto:c...@metaswitch.com]
Sent: Friday, September 1, 2017 11:01 AM
To: Dave Barach (dbarach) <dbar...@cisco.com>; Ole Troan <otr...@employees.org>; Neale Ranns (nranns) <nra...@cisco.com>
Cc: vpp-dev@lists.fd.io
Subject: RE: [vpp-dev] Packet loss on use of API & cmdline

I think there is something special in this case related to the fact that we're adding a new tunnel / subnet before we issue our 63 ip_neighbor_add_del calls, because it is only the first ip_neighbor_add_del call which updates the nodes, with all of the others just doing a rewrite. I'll mail you guys the full (long) trace offline so you can see the overall sequence.

Cheers, Colin.

-----Original Message-----
From: Dave Barach (dbarach) [mailto:dbar...@cisco.com]
Sent: 01 September 2017 15:15
To: Colin Tregenza Dancer <c...@metaswitch.com>; Ole Troan <otr...@employees.org>; Neale Ranns (nranns) <nra...@cisco.com>
Cc: vpp-dev@lists.fd.io
Subject: RE: [vpp-dev] Packet loss on use of API & cmdline

Dear Colin,

Of all of these, ip_neighbor_add_del seems like the one to tune right away. The API message handler itself can be marked mp-safe right away. Both the ip4 and the ip6 underlying routines are thread-aware (mumble RPC mumble).

We should figure out why the FIB thinks it needs to pull the node runtime update lever. AFAICT, adding ip arp / ip6 neighbor adjacencies shouldn't require a node runtime update, at least not in the typical case.

Copying Neale Ranns. I don't expect to hear back immediately; he's on PTO until 9/11.
Thanks… Dave

-----Original Message-----
From: Colin Tregenza Dancer [mailto:c...@metaswitch.com]
Sent: Friday, September 1, 2017 8:51 AM
To: Dave Barach (dbarach) <dbar...@cisco.com>; Ole Troan <otr...@employees.org>
Cc: vpp-dev@lists.fd.io
Subject: RE: [vpp-dev] Packet loss on use of API & cmdline

Hi Dave,

Thanks for looking at this. I get repeated vlib_node_runtime_update() calls when I use the API functions: ip_neighbor_add_del, gre_add_del_tunnel, create_loopback, sw_interface_set_l2_bridge & sw_interface_add_del_address (though there may be others which I'm not currently calling).

To illustrate, I've included below a formatted version of my barrier trace from when I make an ip_neighbor_add_del API call (raw traces for the other commands are included at the end). At the point this call was made there were 3 worker threads, ~425 nodes in the system, and a load of ~3Mpps saturating two 10G NICs. It shows the API function name, followed by a tree of the recursive calls to barrier_sync/release. On each line I show the calling function name, current recursion depth, and elapsed timing from the point the barrier was actually closed.

[50]: ip_neighbor_add_del
  <2(80us)adj_nbr_update_rewrite_internal
    <3(82us)vlib_node_runtime_update{(86us)} (86us)>
    <3(87us)vlib_node_runtime_update{(90us)} (90us)>
    <3(91us)vlib_node_runtime_update{(94us)} (95us)>
  (95us)>
  <2(119us)adj_nbr_update_rewrite_internal (120us)>
  (135us)>
(136us)>
{(137us)vlib_worker_thread_node_runtime_update [179us] [256us] worker=1 worker=2 worker=3 (480us)}

This trace is taken on my dev branch, where I am delaying the worker thread updates till just before the barrier release.
In the vlib_node_runtime_update functions, the timestamp within the {} braces shows the point at which the rework_required flag is set (instead of the mainline behaviour of repeatedly invoking vlib_worker_thread_node_runtime_update()).

At the end you can also see the additional profiling stamps I've added at various points within vlib_worker_thread_node_runtime_update(). The first two stamps are after the two stats sync loops, then there are three lines of tracing for the invocations of the function I've added to contain the code for the per-worker re-fork. Those function calls are further profiled at various points, where the gap between B & C is where the clone node alloc/copying is occurring, and between C & D is where the old clone nodes are being freed. As you might guess from the short C-D gap, this branch also includes my optimization to allocate/free all the clone nodes in a single block.

Having successfully tested the move of the per thread re-fork into a separate function, I'm about to try the "collective brainsurgery" version, where I will get the workers to re-fork their own clone (with the barrier still held) rather than having it done sequentially by main. I'll let you know how it goes...

Colin.
_Raw traces of other calls_

Sep 1 12:57:38 pocvmhost vpp[6315]: [155]: gre_add_del_tunnel
Sep 1 12:57:38 pocvmhost vpp[6315]: <vl_msg_api_barrier_sync<1(53us)vlib_node_runtime_update{(86us)}(87us)><1(96us)vlib_node_runtime_update{(99us)}(99us)><1(100us)vlib_node_runtime_update{(102us)}(103us)><1(227us)vlib_node_runtime_update{(232us)}(233us)><1(235us)vlib_node_runtime_update{(237us)}(238us)><1(308us)vlib_node_runtime_update{(313us)}(314us)><1(316us)adj_nbr_update_rewrite_internal(317us)><1(349us)adj_nbr_update_rewrite_internal(350us)>(353us)>{(354us)vlib_worker_thread_node_runtime_update[394us][462us]worker=1[423:425]worker=2[423:425]worker=3[423:425](708us)}
Sep 1 12:57:38 pocvmhost vpp[6315]: Barrier(us) # 42822 - O 300 D 5 C 708 U 0 - nested 8
Sep 1 12:57:38 pocvmhost vpp[6315]: [13]: sw_interface_set_flags
Sep 1 12:57:38 pocvmhost vpp[6315]: <vl_msg_api_barrier_sync(45us)>
Sep 1 12:57:38 pocvmhost vpp[6315]: Barrier(us) # 42823 - O 1143 D70 C 46 U 0 - nested 0
Sep 1 12:57:38 pocvmhost vpp[6315]: [85]: create_loopback
Sep 1 12:57:38 pocvmhost vpp
t vpp[6315]: Barrier(us) # 42825 - O 1140 D10 C 259 U 0 - nested 1
Sep 1 12:57:38 pocvmhost vpp[6315]: [16]: sw_interface_add_del_address
Sep 1 12:57:38 pocvmhost vpp[6315]: <vl_msg_api_barrier_sync<1(15us)vlib_node_runtime_update{(20us)}(21us)>(70us)>{(71us)vlib_worker_thread_node_runtime_update[87us][115us]worker=1[427:427]worker=2[427:427]worker=3[427:427](307us)}

-----Original Message-----
From: Dave Barach (dbarach) [mailto:dbar...@cisco.com]
Sent: 01 September 2017 13:00
To: Colin Tregenza Dancer <c...@metaswitch.com>; Ole Troan <otr...@employees.org>
Cc: vpp-dev@lists.fd.io
Subject: RE: [vpp-dev] Packet loss on use of API & cmdline

Dear Colin,

Please describe the scenario which leads to vlib_node_runtime_update(). I wouldn't mind having a good long stare at the situation.

I do like the parallel data structure update approach that you've described, tempered with the realization that it amounts to "collective brain surgery." I had more than enough trouble making the data structure fork-and-update code work reliably in the first place.

Thanks… Dave

-----Original Message-----
From: vpp-dev-boun...@lists.fd.io [mailto:vpp-dev-boun...@lists.fd.io] On Behalf Of Colin Tregenza Dancer via vpp-dev
Sent: Friday, September 1, 2017 6:12 AM
To: Ole Troan <otr...@employees.org>
Cc: vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] Packet loss on use of API & cmdline

Hi Ole,

Thanks for the quick reply. I did think about making all the commands we use is_mp_safe, but was both concerned about the extent of the work, and the potential for introducing subtle bugs. I also didn't think it would help my key problem, which is the multi-ms commands which make multiple calls to vlib_node_runtime_update(), not least because it seemed likely that I'd need to hold the barrier across the multiple node changes in a single API call (to avoid inconsistent intermediate states).

Do you have any thoughts on the change to call vlib_worker_thread_node_runtime_update() a single time just before releasing the barrier?
It seems to work fine, but I'm keen to get input from someone who has been working on the codebase for longer.

More generally, even with my changes, vlib_worker_thread_node_runtime_update() is the single function which holds the barrier for longer than all other elements, and is the one which therefore most runs the risk of causing Rx overflow. Detailed profiling showed that for my setup, ~40-50% of the time is taken in "/* re-fork nodes */" with the memory functions used to allocate the new clone nodes, and free the old clones. Given that we know the number of nodes at the start of the loop, and given that (as far as I can tell) new clone nodes aren't altered between calls to the update function, I tried a change to allocate/free all the nodes as a single block (whilst still cloning and inserting them as before). I needed to make a matching change in the "/* fork nodes */" code in start_workers() (and probably need to make a matching change in the termination code), but in testing this almost halves the execution time of vlib_worker_thread_node_runtime_update() without any obvious problems.

Having said that, the execution time of the node cloning remains O(M.N), where M is the number of threads and N the number of nodes. This is reflected in the fact that when I try on a larger system (i.e. more workers and more nodes) I again suffer packet loss because this one function is holding the barrier for multiple ms.

The change I'm currently working on is to try and reduce the delay to O(N) by getting the worker threads to clone their own data structures in parallel. I'm doing this by extending their busy wait on the barrier to also include looking for a flag telling them to rebuild their data structures. When the main thread is about to release the barrier and decides it needs a rebuild, I was going to get it to do the relatively quick stats scraping, then set the flag telling the workers to rebuild their clones.
The rebuild will then happen on all the workers in parallel (which looking at the code seems to be safe), and only when all the cloning is done will the main thread actually release the barrier. I hope to get results from this soon, and will let you know how it goes, but again I'm very keen to get other people's views.

Cheers, Colin.

_______________________________________________
vpp-dev mailing list
vpp-dev@lists.fd.io
https://lists.fd.io/mailman/listinfo/vpp-dev
-----Original Message-----
From: Ole Troan [mailto:otr...@employees.org]
Sent: 01 September 2017 09:37
To: Colin Tregenza Dancer <c...@metaswitch.com>
Cc: Neale Ranns (nranns) <nra...@cisco.com>; Florin Coras <fcoras.li...@gmail.com>; vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] Packet loss on use of API & cmdline

Colin,

Good investigation! A good first step would be to make all APIs and CLIs thread safe. When an API/CLI is thread safe, that must be flagged through the is_mp_safe flag. It is quite likely that many already are, but haven't been flagged as such.

Best regards,
Ole

> On 31 Aug 2017, at 19:07, Colin Tregenza Dancer via vpp-dev <vpp-dev@lists.fd.io> wrote:
>
> I've been doing quite a bit of investigation since my last email, in particular adding instrumentation on barrier calls to report open/lowering/closed/raising times, along with calling trees and nesting levels.
>
> As a result I believe I now have a clearer understanding of what's leading to the packet loss I'm observing when using the API, along with some code changes which in my testing reliably eliminate the 500K packet loss I was previously observing.
>
> Would either of you (or anyone else on the list) be able to offer their opinions on my understanding of the causes, along with my proposed solutions?
>
> Thanks in advance,
>
> Colin.
> -
> In terms of observed barrier hold times, I'm seeing two main issues related to API calls:
>
> • When I issue a long string of async API commands, there is no logic (at least in the version of VPP I'm using) to space out their processing. As a result, if there is a queue of requests, the barrier is opened for just a few us between API calls, before lowering again. This is enough to start one burst of packet processing per worker thread (I can see the barrier lower ends
With my current setup (a fairly modest 2Mpps of background traffic each way between a pair of 10G ports on an Intel X520 NIC, with baremetal Ubuntu 16, vpp 17.01 and a couple of cores per NIC), I observed a range of different packet loss scenarios:

* 1K-80K packets lost if I issue any of a range of stats/info commands from the telnet command line: "show hard", "show int", "show ip arp", "show ip fib", "show fib path". (I haven't yet tried the same calls via the API, but from code reading would expect similar results.)
* Issuing an "ip route add" / "ip route del" pair from the telnet command line, I see 0.5K-30K packets dropped, mainly on the del.
* Using the API, if I issue a close sequence of commands to create a new GRE tunnel and set up individual forwarding entries for 64 endpoints at the other end of that tunnel, I see 100K-500K packets dropped.

Cheers, Colin.

P.S. Have fun on the beach!

From: Neale Ranns (nranns) [mailto:nra...@cisco.com]
Sent: 22 August 2017 14:35
To: Colin Tregenza Dancer <c...@metaswitch.com>; Florin Coras <fcoras.li...@gmail.com>
Cc: vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] Packet loss on use of API & cmdline

Hi Colin,

Your comments were not taken as criticism ☺ constructive comments are always greatly appreciated. Apart from the non-MP safe APIs Florin mentioned, and the route add/del cases I covered, the consensus is certainly that packet loss should not occur during a 'typical' update and we will do what we can to address it. Could you give us* some specific examples of the operations you do where you see packet loss?

Thanks,
Neale

*I say us not me as I'm about to hit the beach for a couple of weeks.
From: Colin Tregenza Dancer <c...@metaswitch.com>
Date: Tuesday, 22 August 2017 at 14:24
To: "Neale Ranns (nranns)" <nra...@cisco.com>, Florin Coras <fcoras.li...@gmail.com>
Cc: "vpp-dev@lists.fd.io" <vpp-dev@lists.fd.io>
Subject: RE: [vpp-dev] Packet loss on use of API & cmdline

Hi neale,

Thanks for the reply, and please don't take my comments as a criticism of what I think is a great project. I'm just trying to understand whether the packet loss I'm observing when I do things like add new tunnels, set up routes, etc., is generally viewed as acceptable, or whether it's an area where there is an interest in changing.

Specifically I'm looking at a range of tunnel/gateway applications, and am finding that whilst static operation is great from a packet loss perspective, when I add/remove tunnels, routes, etc (something which in my application is to be expected on a regular basis) the existing flows undergo significant packet loss. For comparison, with most hardware based routers/gateways this doesn't occur, and existing flows continue unaffected.

Cheers, Colin.

From: Neale Ranns (nranns) [mailto:nra...@cisco.com]
Sent: 22 August 2017 13:44
To: Colin Tregenza Dancer <c...@metaswitch.com>; Florin Coras <fcoras.li...@gmail.com>
Cc: vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] Packet loss on use of API & cmdline

Hi Colin,

The instances of barrier syncs you have correctly identified occur only in the exceptional cases of route addition/deletion and not in the typical case.

* adj_last_lock_gone() is called when that adjacency is no longer required, i.e.
we are removing the last route, or probably the ARP entry, for a neighbour we presumably no longer have.
* adj_nbr_update_rewrite_internal() is called when the adjacency transitions from incomplete (no associated MAC rewrite) to complete.
* The fix for 892 occurs when a route is added that is the first to create a new edge/arc in the VLIB node graph. In the case of that JIRA ticket, it was the first recursive route. Edges are never removed, so this is a once per-reboot event.

But in the typical case of adding routes, e.g. a BGP/OSPF convergence event, the adjacencies are present and complete and the VLIB graph is already set up, so the routes will be added in a lock/barrier free manner. Pre-building the VLIB graph of all possibilities is wasteful IMHO, and given the one-time only lock, an acceptable trade-off.

Adjacencies are more complicated. The state of the adjacency, incomplete or complete, determines the VLIB node the packet should go to. So one needs to atomically change the state of the adjacency and the state of the routes that use it - hence the barrier. We could solve that with indirection, but it would be indirection in the data-path and that costs cycles. So, again, given the relative rarity of such an adjacency state change, the trade-off was to barrier sync.

Hth,
neale
Re: [vpp-dev] Packet loss on use of API & cmdline
Hi Colin,

Your comments were not taken as criticism ☺ constructive comments are always greatly appreciated. Apart from the non-MP-safe APIs Florin mentioned, and the route add/del cases I covered, the consensus is certainly that packet loss should not occur during a ‘typical’ update, and we will do what we can to address it. Could you give us* some specific examples of the operations you do where you see packet loss?

Thanks, Neale

*I say us, not me, as I’m about to hit the beach for a couple of weeks.

From: Colin Tregenza Dancer <c...@metaswitch.com>
Date: Tuesday, 22 August 2017 at 14:24
To: "Neale Ranns (nranns)" <nra...@cisco.com>, Florin Coras <fcoras.li...@gmail.com>
Cc: "vpp-dev@lists.fd.io" <vpp-dev@lists.fd.io>
Subject: RE: [vpp-dev] Packet loss on use of API & cmdline

Hi neale,

Thanks for the reply, and please don’t take my comments as a criticism of what I think is a great project. I’m just trying to understand whether the packet loss I’m observing when I do things like add new tunnels, set up routes, etc. is generally viewed as acceptable, or whether it’s an area where there is an interest in changing. Specifically, I’m looking at a range of tunnel/gateway applications, and am finding that whilst static operation is great from a packet loss perspective, when I add/remove tunnels, routes, etc. (something which in my application is to be expected on a regular basis) the existing flows undergo significant packet loss. For comparison, with most hardware-based routers/gateways this doesn’t occur, and existing flows continue unaffected.

Cheers, Colin.

From: Neale Ranns (nranns) [mailto:nra...@cisco.com]
Sent: 22 August 2017 13:44
To: Colin Tregenza Dancer <c...@metaswitch.com>; Florin Coras <fcoras.li...@gmail.com>
Cc: vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] Packet loss on use of API & cmdline

Hi Colin,

The instances of barrier sync you have correctly identified occur only in the exceptional cases of route addition/deletion, not in the typical case.
- adj_last_lock_gone() is called when that adjacency is no longer required, i.e. we are removing the last route, or probably the ARP entry, for a neighbour we presumably no longer have.
- adj_nbr_update_rewrite_internal() is called when the adjacency transitions from incomplete (no associated MAC rewrite) to complete.
- The fix for 892 occurs when a route is added that is the first to create a new edge/arc in the VLIB node graph. In the case of that JIRA ticket, it was the first recursive route. Edges are never removed, so this is a once-per-reboot event.

But in the typical case of adding routes, e.g. a BGP/OSPF convergence event, the adjacencies are present and complete and the VLIB graph is already set up, so the routes will be added in a lock/barrier-free manner. Pre-building the VLIB graph of all possibilities is wasteful IMHO, and given the one-time-only lock, an acceptable trade-off. Adjacencies are more complicated. The state of the adjacency, incomplete or complete, determines the VLIB node the packet should go to. So one needs to atomically change the state of the adjacency and the state of the routes that use it - hence the barrier. We could solve that with indirection, but it would be indirection in the data-path, and that costs cycles. So, again, given the relative rarity of such an adjacency state change, the trade-off was to barrier sync.

Hth, neale

From: <vpp-dev-boun...@lists.fd.io> on behalf of Colin Tregenza Dancer via vpp-dev <vpp-dev@lists.fd.io>
Reply-To: Colin Tregenza Dancer <c...@metaswitch.com>
Date: Tuesday, 22 August 2017 at 12:25
To: Florin Coras <fcoras.li...@gmail.com>
Cc: "vpp-dev@lists.fd.io" <vpp-dev@lists.fd.io>
Subject: Re: [vpp-dev] Packet loss on use of API & cmdline

Hi Florin,

Thanks for the quick, and very useful, reply.
I’d been looking at the mp_safe flags, and had concluded that I’d need the calls I was interested in to be at least marked mp_safe. However, I was thinking that wasn’t sufficient, as it appeared that some calls marked as mp_safe invoke barrier_sync lower down the call stacks. For instance, the internal functions adj_last_lock_gone(), adj_nbr_update_rewrite_internal() and vlib_node_serialize() all seem to call vlib_worker_thread_barrier_sync(), and the fix for defect 892 https://jira.fd.io/browse/VPP-892?gerritReviewStatus=All#gerrit-reviews-left-panel involves adding barrier calls in code related to the mp_safe ADD_DEL_ROUTE (which fits with packet loss I’d observed during testing of deleting routes). I think the raw lossless packet processing which vpp has achieved on static configs is truly amazing, but I guess what I’m trying to understand is whether it is viewed as important to achieve similar behaviour when the system is being reconfigured.
Re: [vpp-dev] Packet loss on use of API & cmdline
Hi neale,

Thanks for the reply, and please don’t take my comments as a criticism of what I think is a great project. I’m just trying to understand whether the packet loss I’m observing when I do things like add new tunnels, set up routes, etc. is generally viewed as acceptable, or whether it’s an area where there is an interest in changing. Specifically, I’m looking at a range of tunnel/gateway applications, and am finding that whilst static operation is great from a packet loss perspective, when I add/remove tunnels, routes, etc. (something which in my application is to be expected on a regular basis) the existing flows undergo significant packet loss. For comparison, with most hardware-based routers/gateways this doesn’t occur, and existing flows continue unaffected.

Cheers, Colin.

From: Neale Ranns (nranns) [mailto:nra...@cisco.com]
Sent: 22 August 2017 13:44
To: Colin Tregenza Dancer <c...@metaswitch.com>; Florin Coras <fcoras.li...@gmail.com>
Cc: vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] Packet loss on use of API & cmdline

Hi Colin,

The instances of barrier sync you have correctly identified occur only in the exceptional cases of route addition/deletion, not in the typical case.

* adj_last_lock_gone() is called when that adjacency is no longer required, i.e. we are removing the last route, or probably the ARP entry, for a neighbour we presumably no longer have.
* adj_nbr_update_rewrite_internal() is called when the adjacency transitions from incomplete (no associated MAC rewrite) to complete.
* The fix for 892 occurs when a route is added that is the first to create a new edge/arc in the VLIB node graph. In the case of that JIRA ticket, it was the first recursive route. Edges are never removed, so this is a once-per-reboot event.

But in the typical case of adding routes, e.g. a BGP/OSPF convergence event, the adjacencies are present and complete and the VLIB graph is already set up, so the routes will be added in a lock/barrier-free manner. Pre-building the VLIB graph of all possibilities is wasteful IMHO, and given the one-time-only lock, an acceptable trade-off. Adjacencies are more complicated. The state of the adjacency, incomplete or complete, determines the VLIB node the packet should go to. So one needs to atomically change the state of the adjacency and the state of the routes that use it - hence the barrier. We could solve that with indirection, but it would be indirection in the data-path, and that costs cycles. So, again, given the relative rarity of such an adjacency state change, the trade-off was to barrier sync.

Hth, neale

From: <vpp-dev-boun...@lists.fd.io> on behalf of Colin Tregenza Dancer via vpp-dev <vpp-dev@lists.fd.io>
Reply-To: Colin Tregenza Dancer <c...@metaswitch.com>
Date: Tuesday, 22 August 2017 at 12:25
To: Florin Coras <fcoras.li...@gmail.com>
Cc: "vpp-dev@lists.fd.io" <vpp-dev@lists.fd.io>
Subject: Re: [vpp-dev] Packet loss on use of API & cmdline

Hi Florin,

Thanks for the quick, and very useful, reply.

I’d been looking at the mp_safe flags, and had concluded that I’d need the calls I was interested in to be at least marked mp_safe. However, I was thinking that wasn’t sufficient, as it appeared that some calls marked as mp_safe invoke barrier_sync lower down the call stacks. For instance, the internal functions adj_last_lock_gone(), adj_nbr_update_rewrite_internal() and vlib_node_serialize() all seem to call vlib_worker_thread_barrier_sync(), and the fix for defect 892 https://jira.fd.io/browse/VPP-892?gerritReviewStatus=All#gerrit-reviews-left-panel involves adding barrier calls in code related to the mp_safe ADD_DEL_ROUTE (which fits with packet loss I’d observed during testing of deleting routes).

I think the raw lossless packet processing which vpp has achieved on static configs is truly amazing, but I guess what I’m trying to understand is whether it is viewed as important to achieve similar behaviour when the system is being reconfigured. Personally I think many of the potential uses of a software dataplane include the need to do limited-impact dynamic reconfiguration; however, maybe the kind of applications I have in mind are in a minority? More than anything, given the number of areas which would likely be touched by the required changes, I wanted to understand if there is a consensus that such change was even needed?

Thanks in advance for any insight you (or others) can offer.

Cheers, Colin.

From: Florin Coras [mailto:fcoras.li...@gmail.com]
Sent: 22 August 2017 09:40
To: Colin Tregenza Dancer <c...@metaswitch.com>
Cc: vpp-dev@lists.fd.io
Re: [vpp-dev] Packet loss on use of API & cmdline
Hi Colin,

The instances of barrier sync you have correctly identified occur only in the exceptional cases of route addition/deletion, not in the typical case.

- adj_last_lock_gone() is called when that adjacency is no longer required, i.e. we are removing the last route, or probably the ARP entry, for a neighbour we presumably no longer have.
- adj_nbr_update_rewrite_internal() is called when the adjacency transitions from incomplete (no associated MAC rewrite) to complete.
- The fix for 892 occurs when a route is added that is the first to create a new edge/arc in the VLIB node graph. In the case of that JIRA ticket, it was the first recursive route. Edges are never removed, so this is a once-per-reboot event.

But in the typical case of adding routes, e.g. a BGP/OSPF convergence event, the adjacencies are present and complete and the VLIB graph is already set up, so the routes will be added in a lock/barrier-free manner. Pre-building the VLIB graph of all possibilities is wasteful IMHO, and given the one-time-only lock, an acceptable trade-off. Adjacencies are more complicated. The state of the adjacency, incomplete or complete, determines the VLIB node the packet should go to. So one needs to atomically change the state of the adjacency and the state of the routes that use it - hence the barrier. We could solve that with indirection, but it would be indirection in the data-path, and that costs cycles. So, again, given the relative rarity of such an adjacency state change, the trade-off was to barrier sync.

Hth, neale

From: <vpp-dev-boun...@lists.fd.io> on behalf of Colin Tregenza Dancer via vpp-dev <vpp-dev@lists.fd.io>
Reply-To: Colin Tregenza Dancer <c...@metaswitch.com>
Date: Tuesday, 22 August 2017 at 12:25
To: Florin Coras <fcoras.li...@gmail.com>
Cc: "vpp-dev@lists.fd.io" <vpp-dev@lists.fd.io>
Subject: Re: [vpp-dev] Packet loss on use of API & cmdline

Hi Florin,

Thanks for the quick, and very useful, reply.

I’d been looking at the mp_safe flags, and had concluded that I’d need the calls I was interested in to be at least marked mp_safe. However, I was thinking that wasn’t sufficient, as it appeared that some calls marked as mp_safe invoke barrier_sync lower down the call stacks. For instance, the internal functions adj_last_lock_gone(), adj_nbr_update_rewrite_internal() and vlib_node_serialize() all seem to call vlib_worker_thread_barrier_sync(), and the fix for defect 892 https://jira.fd.io/browse/VPP-892?gerritReviewStatus=All#gerrit-reviews-left-panel involves adding barrier calls in code related to the mp_safe ADD_DEL_ROUTE (which fits with packet loss I’d observed during testing of deleting routes).

I think the raw lossless packet processing which vpp has achieved on static configs is truly amazing, but I guess what I’m trying to understand is whether it is viewed as important to achieve similar behaviour when the system is being reconfigured. Personally I think many of the potential uses of a software dataplane include the need to do limited-impact dynamic reconfiguration; however, maybe the kind of applications I have in mind are in a minority? More than anything, given the number of areas which would likely be touched by the required changes, I wanted to understand if there is a consensus that such change was even needed?

Thanks in advance for any insight you (or others) can offer.

Cheers, Colin.

From: Florin Coras [mailto:fcoras.li...@gmail.com]
Sent: 22 August 2017 09:40
To: Colin Tregenza Dancer <c...@metaswitch.com>
Cc: vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] Packet loss on use of API & cmdline

Hi Colin,

Your assumption was right. More often than not, a binary API/CLI call results in a vlib_worker_thread_barrier_sync because most handlers and CLI commands are not mp-safe. As a consequence, vpp may experience packet loss.

One way around this issue, for binary APIs, is to make sure the handler you’re interested in is thread safe and then mark it is_mp_safe in api_main. See, for instance, VL_API_IP_ADD_DEL_ROUTE.

Hope this helps,
Florin

On Aug 22, 2017, at 1:11 AM, Colin Tregenza Dancer via vpp-dev <vpp-dev@lists.fd.io> wrote:

I might have just missed it, but looking through the ongoing regression tests I can’t see anything that explicitly tests for packet loss during CLI/API commands, so I’m wondering whether minimization of packet loss during configuration is viewed as a goal for vpp?

Many/most of the real world applications I’ve been exploring require the ability to reconfigure live systems without impacting the existing flows related to stable elements (route updates, tunnel add/remove, VM addition/removal), and it would be great to understand how this fits with vpp use cases.

Thanks again,

Colin.

From: vpp-dev-boun...@lists.fd.io [mailto:vpp-dev-boun...@lists.fd.io]
Re: [vpp-dev] Packet loss on use of API & cmdline
Hi Florin,

Thanks for the quick, and very useful, reply.

I’d been looking at the mp_safe flags, and had concluded that I’d need the calls I was interested in to be at least marked mp_safe. However, I was thinking that wasn’t sufficient, as it appeared that some calls marked as mp_safe invoke barrier_sync lower down the call stacks. For instance, the internal functions adj_last_lock_gone(), adj_nbr_update_rewrite_internal() and vlib_node_serialize() all seem to call vlib_worker_thread_barrier_sync(), and the fix for defect 892 https://jira.fd.io/browse/VPP-892?gerritReviewStatus=All#gerrit-reviews-left-panel involves adding barrier calls in code related to the mp_safe ADD_DEL_ROUTE (which fits with packet loss I’d observed during testing of deleting routes).

I think the raw lossless packet processing which vpp has achieved on static configs is truly amazing, but I guess what I’m trying to understand is whether it is viewed as important to achieve similar behaviour when the system is being reconfigured. Personally I think many of the potential uses of a software dataplane include the need to do limited-impact dynamic reconfiguration; however, maybe the kind of applications I have in mind are in a minority? More than anything, given the number of areas which would likely be touched by the required changes, I wanted to understand if there is a consensus that such change was even needed?

Thanks in advance for any insight you (or others) can offer.

Cheers, Colin.

From: Florin Coras [mailto:fcoras.li...@gmail.com]
Sent: 22 August 2017 09:40
To: Colin Tregenza Dancer <c...@metaswitch.com>
Cc: vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] Packet loss on use of API & cmdline

Hi Colin,

Your assumption was right. More often than not, a binary API/CLI call results in a vlib_worker_thread_barrier_sync because most handlers and CLI commands are not mp-safe. As a consequence, vpp may experience packet loss.

One way around this issue, for binary APIs, is to make sure the handler you’re interested in is thread safe and then mark it is_mp_safe in api_main. See, for instance, VL_API_IP_ADD_DEL_ROUTE.

Hope this helps,
Florin

On Aug 22, 2017, at 1:11 AM, Colin Tregenza Dancer via vpp-dev <vpp-dev@lists.fd.io> wrote:

I might have just missed it, but looking through the ongoing regression tests I can’t see anything that explicitly tests for packet loss during CLI/API commands, so I’m wondering whether minimization of packet loss during configuration is viewed as a goal for vpp?

Many/most of the real world applications I’ve been exploring require the ability to reconfigure live systems without impacting the existing flows related to stable elements (route updates, tunnel add/remove, VM addition/removal), and it would be great to understand how this fits with vpp use cases.

Thanks again,

Colin.

From: vpp-dev-boun...@lists.fd.io [mailto:vpp-dev-boun...@lists.fd.io] On Behalf Of Colin Tregenza Dancer via vpp-dev
Sent: 19 August 2017 12:17
To: vpp-dev@lists.fd.io
Subject: [vpp-dev] Packet loss on use of API & cmdline

Hi,

I’ve been doing some prototyping and load testing of the vpp dataplane, and have observed packet loss when I issue API requests or use the debug command line. Is this to be expected given the use of the worker_thread_barrier, or might there be some way I could improve matters?

Currently I’m running a fairly modest 2Mpps throughput between a pair of 10G ports on an Intel X520 NIC, with baremetal Ubuntu 16, & vpp 17.01.

Thanks in advance,

Colin.

_______________________________________________
vpp-dev mailing list
vpp-dev@lists.fd.io
https://lists.fd.io/mailman/listinfo/vpp-dev
Re: [vpp-dev] Packet loss on use of API & cmdline
Hi Colin,

Your assumption was right. More often than not, a binary API/CLI call results in a vlib_worker_thread_barrier_sync because most handlers and CLI commands are not mp-safe. As a consequence, vpp may experience packet loss.

One way around this issue, for binary APIs, is to make sure the handler you’re interested in is thread safe and then mark it is_mp_safe in api_main. See, for instance, VL_API_IP_ADD_DEL_ROUTE.

Hope this helps,
Florin

> On Aug 22, 2017, at 1:11 AM, Colin Tregenza Dancer via vpp-dev <vpp-dev@lists.fd.io> wrote:
>
> I might have just missed it, but looking through the ongoing regression tests I can’t see anything that explicitly tests for packet loss during CLI/API commands, so I’m wondering whether minimization of packet loss during configuration is viewed as a goal for vpp?
>
> Many/most of the real world applications I’ve been exploring require the ability to reconfigure live systems without impacting the existing flows related to stable elements (route updates, tunnel add/remove, VM addition/removal), and it would be great to understand how this fits with vpp use cases.
>
> Thanks again,
>
> Colin.
>
> From: vpp-dev-boun...@lists.fd.io [mailto:vpp-dev-boun...@lists.fd.io] On Behalf Of Colin Tregenza Dancer via vpp-dev
> Sent: 19 August 2017 12:17
> To: vpp-dev@lists.fd.io
> Subject: [vpp-dev] Packet loss on use of API & cmdline
>
> Hi,
>
> I’ve been doing some prototyping and load testing of the vpp dataplane, and have observed packet loss when I issue API requests or use the debug command line. Is this to be expected given the use of the worker_thread_barrier, or might there be some way I could improve matters?
>
> Currently I’m running a fairly modest 2Mpps throughput between a pair of 10G ports on an Intel X520 NIC, with baremetal Ubuntu 16, & vpp 17.01.
>
> Thanks in advance,
>
> Colin.
> _______________________________________________
> vpp-dev mailing list
> vpp-dev@lists.fd.io
> https://lists.fd.io/mailman/listinfo/vpp-dev

_______________________________________________
vpp-dev mailing list
vpp-dev@lists.fd.io
https://lists.fd.io/mailman/listinfo/vpp-dev
Re: [vpp-dev] Packet loss on use of API & cmdline
I might have just missed it, but looking through the ongoing regression tests I can't see anything that explicitly tests for packet loss during CLI/API commands, so I'm wondering whether minimization of packet loss during configuration is viewed as a goal for vpp?

Many/most of the real world applications I've been exploring require the ability to reconfigure live systems without impacting the existing flows related to stable elements (route updates, tunnel add/remove, VM addition/removal), and it would be great to understand how this fits with vpp use cases.

Thanks again,

Colin.

From: vpp-dev-boun...@lists.fd.io [mailto:vpp-dev-boun...@lists.fd.io] On Behalf Of Colin Tregenza Dancer via vpp-dev
Sent: 19 August 2017 12:17
To: vpp-dev@lists.fd.io
Subject: [vpp-dev] Packet loss on use of API & cmdline

Hi,

I've been doing some prototyping and load testing of the vpp dataplane, and have observed packet loss when I issue API requests or use the debug command line. Is this to be expected given the use of the worker_thread_barrier, or might there be some way I could improve matters?

Currently I'm running a fairly modest 2Mpps throughput between a pair of 10G ports on an Intel X520 NIC, with baremetal Ubuntu 16, & vpp 17.01.

Thanks in advance,

Colin.

_______________________________________________
vpp-dev mailing list
vpp-dev@lists.fd.io
https://lists.fd.io/mailman/listinfo/vpp-dev
[vpp-dev] Packet loss on use of API & cmdline
Hi,

I've been doing some prototyping and load testing of the vpp dataplane, and have observed packet loss when I issue API requests or use the debug command line. Is this to be expected given the use of the worker_thread_barrier, or might there be some way I could improve matters?

Currently I'm running a fairly modest 2Mpps throughput between a pair of 10G ports on an Intel X520 NIC, with baremetal Ubuntu 16, & vpp 17.01.

Thanks in advance,

Colin.

_______________________________________________
vpp-dev mailing list
vpp-dev@lists.fd.io
https://lists.fd.io/mailman/listinfo/vpp-dev