I think there is something special in this case related to the fact that we're adding a new tunnel / subnet before we issue our 63 ip_neighbor_add_del calls, because only the first ip_neighbor_add_del call updates the nodes, with all of the others just doing a rewrite.
I'll mail you guys the full (long) trace offline so you can see the overall sequence.

Cheers,

Colin.

-----Original Message-----
From: Dave Barach (dbarach) [mailto:dbar...@cisco.com]
Sent: 01 September 2017 15:15
To: Colin Tregenza Dancer <c...@metaswitch.com>; Ole Troan <otr...@employees.org>; Neale Ranns (nranns) <nra...@cisco.com>
Cc: vpp-dev@lists.fd.io
Subject: RE: [vpp-dev] Packet loss on use of API & cmdline

Dear Colin,

Of all of these, ip_neighbor_add_del seems like the one to tune right away. The API message handler itself can be marked mp-safe right away. Both the ip4 and the ip6 underlying routines are thread-aware (mumble RPC mumble).

We should figure out why the FIB thinks it needs to pull the node runtime update lever. AFAICT, adding ip arp / ip6 neighbor adjacencies shouldn't require a node runtime update, at least not in the typical case.

Copying Neale Ranns. I don't expect to hear back immediately; he's on PTO until 9/11.

Thanks… Dave

-----Original Message-----
From: Colin Tregenza Dancer [mailto:c...@metaswitch.com]
Sent: Friday, September 1, 2017 8:51 AM
To: Dave Barach (dbarach) <dbar...@cisco.com>; Ole Troan <otr...@employees.org>
Cc: vpp-dev@lists.fd.io
Subject: RE: [vpp-dev] Packet loss on use of API & cmdline

Hi Dave,

Thanks for looking at this.

I get repeated vlib_node_runtime_update() calls when I use the API functions ip_neighbor_add_del, gre_add_del_tunnel, create_loopback, sw_interface_set_l2_bridge & sw_interface_add_del_address (though there may be others which I'm not currently calling).

To illustrate, I've included below a formatted version of my barrier trace from when I make an ip_neighbor_add_del API call (raw traces for the other commands are included at the end). At the point this call was made there were 3 worker threads, ~425 nodes in the system, and a load of ~3Mpps saturating two 10G NICs. It shows the API function name, followed by a tree of the recursive calls to barrier_sync/release.
On each line I show the calling function name, current recursion depth, and elapsed timing from the point the barrier was actually closed.

[50]: ip_neighbor_add_del
<vl_msg_api_barrier_sync
  <1(3us)vl_api_rpc_call_main_thread
    <2(50us)vlib_node_runtime_update{(69us)} (69us)>
    <2(80us)adj_nbr_update_rewrite_internal
      <3(82us)vlib_node_runtime_update{(86us)} (86us)>
      <3(87us)vlib_node_runtime_update{(90us)} (90us)>
      <3(91us)vlib_node_runtime_update{(94us)} (95us)>
    (95us)>
    <2(119us)adj_nbr_update_rewrite_internal (120us)>
  (135us)>
(136us)>
{(137us)vlib_worker_thread_node_runtime_update [179us] [256us]
 worker=1<A256us><B267us><C338us><D339us><E346us><F348us><G348us>
 worker=2<A349us><B356us><C406us><D407us><E416us><F416us><G417us>
 worker=3<A418us><B426us><C472us><D472us><E478us><F479us><G479us>
(480us)}

This trace is taken on my dev branch, where I am delaying the worker thread updates till just before the barrier release. In the vlib_node_runtime_update functions, the timestamps within the {} braces show the point at which the rework_required flag is set (instead of the mainline behaviour of repeatedly invoking vlib_worker_thread_node_runtime_update()).

At the end you can also see the additional profiling stamps I've added at various points within vlib_worker_thread_node_runtime_update(). The first two stamps are after the two stats sync loops, then there are three lines of tracing for the invocations of the function I've added to contain the code for the per-worker re-fork. Those function calls are further profiled at various points, where the gap between B & C is where the clone node alloc/copying is occurring, and between C & D is where the old clone nodes are being freed. As you might guess from the short C-D gap, this branch also includes my optimization to allocate/free all the clone nodes in a single block.
Having successfully tested the move of the per-thread re-fork into a separate function, I'm about to try the "collective brain surgery" version, where I will get the workers to re-fork their own clone (with the barrier still held) rather than having it done sequentially by main.

I'll let you know how it goes...

Colin.

_Raw traces of other calls_

Sep 1 12:57:38 pocvmhost vpp[6315]: [155]: gre_add_del_tunnel
Sep 1 12:57:38 pocvmhost vpp[6315]: <vl_msg_api_barrier_sync<1(53us)vlib_node_runtime_update{(86us)}(87us)><1(96us)vlib_node_runtime_update{(99us)}(99us)><1(100us)vlib_node_runtime_update{(102us)}(103us)><1(227us)vlib_node_runtime_update{(232us)}(233us)><1(235us)vlib_node_runtime_update{(237us)}(238us)><1(308us)vlib_node_runtime_update{(313us)}(314us)><1(316us)adj_nbr_update_rewrite_internal(317us)><1(349us)adj_nbr_update_rewrite_internal(350us)>(353us)>{(354us)vlib_worker_thread_node_runtime_update[394us][462us]worker=1<A463us><B479us>[423:425]<C550us><D551us><E562us><F563us><G563us>worker=2<A564us><B577us>[423:425]<C618us><D618us><E626us><F626us><G627us>worker=3<A628us><B643us>[423:425]<C698us><D698us><E706us><F707us><G707us>(708us)}
Sep 1 12:57:38 pocvmhost vpp[6315]: Barrier(us) # 42822 - O 300 D 5 C 708 U 0 - nested 8
Sep 1 12:57:38 pocvmhost vpp[6315]: [13]: sw_interface_set_flags
Sep 1 12:57:38 pocvmhost vpp[6315]: <vl_msg_api_barrier_sync(45us)>
Sep 1 12:57:38 pocvmhost vpp[6315]: Barrier(us) # 42823 - O 1143 D 70 C 46 U 0 - nested 0
Sep 1 12:57:38 pocvmhost vpp[6315]: [85]: create_loopback
Sep 1 12:57:38 pocvmhost vpp[6315]: <vl_msg_api_barrier_sync<1(13us)vlib_node_runtime_update{(17us)}(17us)><1(25us)vlib_node_runtime_update{(27us)}(27us)><1(28us)vlib_node_runtime_update{(30us)}(30us)><1(44us)vlib_node_runtime_update{(47us)}(48us)><1(50us)vlib_node_runtime_update{(51us)}(52us)>(54us)>{(54us)vlib_worker_thread_node_runtime_update[70us][103us]worker=1<A104us><B114us>[425:427]<C168us><D169us><E181us><F182us><G183us>worker=2<A184us><B190us>[425:427]<C243us><D244us><E251us><F252us><G252us>worker=3<A253us><B260us>[425:427]<C299us><D299us><E305us><F306us><G306us>(307us)}
Sep 1 12:57:38 pocvmhost vpp[6315]: Barrier(us) # 42824 - O 300 D 67 C 307 U 0 - nested 5
Sep 1 12:57:38 pocvmhost vpp[6315]: [113]: sw_interface_set_l2_bridge
Sep 1 12:57:38 pocvmhost vpp[6315]: <vl_msg_api_barrier_sync<1(17us)vlib_node_runtime_update{(19us)}(19us)>(25us)>{(25us)vlib_worker_thread_node_runtime_update[43us][76us]worker=1<A77us><B85us>[427:427]<C138us><D139us><E146us><F147us><G148us>worker=2<A149us><B157us>[427:427]<C193us><D194us><E202us><F203us><G203us>worker=3<A204us><B211us>[427:427]<C251us><D251us><E257us><F258us><G258us>(259us)}
Sep 1 12:57:38 pocvmhost vpp[6315]: Barrier(us) # 42825 - O 1140 D 10 C 259 U 0 - nested 1
Sep 1 12:57:38 pocvmhost vpp[6315]: [16]: sw_interface_add_del_address
Sep 1 12:57:38 pocvmhost vpp[6315]: <vl_msg_api_barrier_sync<1(15us)vlib_node_runtime_update{(20us)}(21us)>(70us)>{(71us)vlib_worker_thread_node_runtime_update[87us][115us]worker=1<A116us><B123us>[427:427]<C180us><D181us><E187us><F188us><G188us>worker=2<A189us><B197us>[427:427]<C230us><D231us><E238us><F238us><G239us>worker=3<A239us><B246us>[427:427]<C297us><D298us><E305us><F305us><G306us>(307us)}

-----Original Message-----
From: Dave Barach (dbarach) [mailto:dbar...@cisco.com]
Sent: 01 September 2017 13:00
To: Colin Tregenza Dancer <c...@metaswitch.com>; Ole Troan <otr...@employees.org>
Cc: vpp-dev@lists.fd.io
Subject: RE: [vpp-dev] Packet loss on use of API & cmdline

Dear Colin,

Please describe the scenario which leads to vlib_node_runtime_update(). I wouldn't mind having a good long stare at the situation.

I do like the parallel data structure update approach that you've described, tempered with the realization that it amounts to "collective brain surgery." I had more than enough trouble making the data structure fork-and-update code work reliably in the first place.

Thanks… Dave

-----Original Message-----
From: vpp-dev-boun...@lists.fd.io [mailto:vpp-dev-boun...@lists.fd.io] On Behalf Of Colin Tregenza Dancer via vpp-dev
Sent: Friday, September 1, 2017 6:12 AM
To: Ole Troan <otr...@employees.org>
Cc: vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] Packet loss on use of API & cmdline

Hi Ole,

Thanks for the quick reply.

I did think about making all the commands we use is_mp_safe, but was both concerned about the extent of the work and the potential for introducing subtle bugs. I also didn't think it would help my key problem, which is the multi-ms commands that make multiple calls to vlib_node_runtime_update(), not least because it seemed likely that I'd need to hold the barrier across the multiple node changes in a single API call (to avoid inconsistent intermediate states).

Do you have any thoughts on the change to call vlib_worker_thread_node_runtime_update() a single time just before releasing the barrier? It seems to work fine, but I'm keen to get input from someone who has been working on the codebase for longer.

More generally, even with my changes, vlib_worker_thread_node_runtime_update() is the single function which holds the barrier for longer than all other elements, and is therefore the one which most runs the risk of causing Rx overflow. Detailed profiling showed that for my setup, ~40-50% of the time is taken in "/* re-fork nodes */" with the memory functions used to allocate the new clone nodes, and free the old clones.
Given that we know the number of nodes at the start of the loop, and given that (as far as I can tell) the new clone nodes aren't altered between calls to the update function, I tried a change to allocate/free all the nodes as a single block (whilst still cloning and inserting them as before). I needed to make a matching change in the "/* fork nodes */" code in start_workers() (and probably need to make a matching change in the termination code), but in testing this almost halves the execution time of vlib_worker_thread_node_runtime_update() without any obvious problems.

Having said that, the execution time of the node cloning remains O(M.N), where M is the number of threads and N the number of nodes. This is reflected in the fact that when I try on a larger system (i.e. more workers and more nodes) I again suffer packet loss because this one function is holding the barrier for multiple ms.

The change I'm currently working on is to try and reduce the delay to O(N) by getting the worker threads to clone their own data structures in parallel. I'm doing this by extending their busy wait on the barrier to also include looking for a flag telling them to rebuild their data structures. When the main thread is about to release the barrier and decides it needs a rebuild, it will do the relatively quick stats scraping, then set the flag telling the workers to rebuild their clones. The rebuild will then happen on all the workers in parallel (which looking at the code seems to be safe), and only when all the cloning is done will the main thread actually release the barrier.

I hope to get results from this soon, and will let you know how it goes, but again I'm very keen to get other people's views.

Cheers,

Colin.

_______________________________________________
vpp-dev mailing list
vpp-dev@lists.fd.io
https://lists.fd.io/mailman/listinfo/vpp-dev