I think there is something special in this case related to the fact that we're adding a new tunnel / subnet before we issue our 63 ip_neighbor_add_del calls, because only the first ip_neighbor_add_del call updates the nodes, with all of the others just doing a rewrite.
I'll mail you guys the full (long) trace offline so you can see the overall sequence.

Cheers,

Colin.

-----Original Message-----
From: Dave Barach (dbarach) [mailto:dbar...@cisco.com]
Sent: 01 September 2017 15:15
To: Colin Tregenza Dancer <c...@metaswitch.com>; Ole Troan <otr...@employees.org>; Neale Ranns (nranns) <nra...@cisco.com>
Cc: vpp-dev@lists.fd.io
Subject: RE: [vpp-dev] Packet loss on use of API & cmdline

Dear Colin,

Of all of these, ip_neighbor_add_del seems like the one to tune right away. The API message handler itself can be marked mp-safe right away. Both the ip4 and the ip6 underlying routines are thread-aware (mumble RPC mumble).

We should figure out why the FIB thinks it needs to pull the node runtime update lever. AFAICT, adding ip arp / ip6 neighbor adjacencies shouldn't require a node runtime update, at least not in the typical case.

Copying Neale Ranns. I don't expect to hear back immediately; he's on PTO until 9/11.

Thanks… Dave

-----Original Message-----
From: Colin Tregenza Dancer [mailto:c...@metaswitch.com]
Sent: Friday, September 1, 2017 8:51 AM
To: Dave Barach (dbarach) <dbar...@cisco.com>; Ole Troan <otr...@employees.org>
Cc: vpp-dev@lists.fd.io
Subject: RE: [vpp-dev] Packet loss on use of API & cmdline

Hi Dave,

Thanks for looking at this.

I get repeated vlib_node_runtime_update() calls when I use the API functions ip_neighbor_add_del, gre_add_del_tunnel, create_loopback, sw_interface_set_l2_bridge & sw_interface_add_del_address (though there may be others which I'm not currently calling).

To illustrate, I've included below a formatted version of my barrier trace from when I make an ip_neighbor_add_del API call (raw traces for the other commands are included at the end). At the point this call was made there were 3 worker threads, ~425 nodes in the system, and a load of ~3Mpps saturating two 10G NICs. It shows the API function name, followed by a tree of the recursive calls to barrier_sync/release.
On each line I show the calling function name, current recursion depth, and elapsed timing from the point the barrier was actually closed.

[50]: ip_neighbor_add_del
<vl_msg_api_barrier_sync
  <1(3us)vl_api_rpc_call_main_thread
    <2(50us)vlib_node_runtime_update{(69us)} (69us)>
    <2(80us)adj_nbr_update_rewrite_internal
      <3(82us)vlib_node_runtime_update{(86us)} (86us)>
      <3(87us)vlib_node_runtime_update{(90us)} (90us)>
      <3(91us)vlib_node_runtime_update{(94us)} (95us)>
    (95us)>
    <2(119us)adj_nbr_update_rewrite_internal (120us)>
  (135us)>
(136us)>
{(137us)vlib_worker_thread_node_runtime_update [179us] [256us]
 worker=1<A256us><B267us><C338us><D339us><E346us><F348us><G348us>
 worker=2<A349us><B356us><C406us><D407us><E416us><F416us><G417us>
 worker=3<A418us><B426us><C472us><D472us><E478us><F479us><G479us>
(480us)}

This trace is taken on my dev branch, where I am delaying the worker thread updates till just before the barrier release. In the vlib_node_runtime_update functions, the timestamps within the {} braces show the point at which the rework_required flag is set (instead of the mainline behaviour of repeatedly invoking vlib_worker_thread_node_runtime_update()).

At the end you can also see the additional profiling stamps I've added at various points within vlib_worker_thread_node_runtime_update(). The first two stamps are after the two stats sync loops, then there are three lines of tracing for the invocations of the function I've added to contain the code for the per-worker re-fork. Those function calls are further profiled at various points, where the gap between B & C is where the clone node alloc/copying is occurring, and between C & D is where the old clone nodes are being freed. As you might guess from the short C-D gap, this branch also includes my optimization to allocate/free all the clone nodes in a single block.
Having successfully tested the move of the per-thread re-fork into a separate function, I'm about to try the "collective brain surgery" version, where I will get the workers to re-fork their own clone (with the barrier still held) rather than having it done sequentially by main.

I'll let you know how it goes...

Colin.

_Raw traces of other calls_

Sep 1 12:57:38 pocvmhost vpp[6315]: [155]: gre_add_del_tunnel
Sep 1 12:57:38 pocvmhost vpp[6315]: <vl_msg_api_barrier_sync<1(53us)vlib_node_runtime_update{(86us)}(87us)><1(96us)vlib_node_runtime_update{(99us)}(99us)><1(100us)vlib_node_runtime_update{(102us)}(103us)><1(227us)vlib_node_runtime_update{(232us)}(233us)><1(235us)vlib_node_runtime_update{(237us)}(238us)><1(308us)vlib_node_runtime_update{(313us)}(314us)><1(316us)adj_nbr_update_rewrite_internal(317us)><1(349us)adj_nbr_update_rewrite_internal(350us)>(353us)>{(354us)vlib_worker_thread_node_runtime_update[394us][462us]worker=1<A463us><B479us>[423:425]<C550us><D551us><E562us><F563us><G563us>worker=2<A564us><B577us>[423:425]<C618us><D618us><E626us><F626us><G627us>worker=3<A628us><B643us>[423:425]<C698us><D698us><E706us><F707us><G707us>(708us)}
Sep 1 12:57:38 pocvmhost vpp[6315]: Barrier(us) # 42822 - O 300 D 5 C 708 U 0 - nested 8
Sep 1 12:57:38 pocvmhost vpp[6315]: [13]: sw_interface_set_flags
Sep 1 12:57:38 pocvmhost vpp[6315]: <vl_msg_api_barrier_sync(45us)>
Sep 1 12:57:38 pocvmhost vpp[6315]: Barrier(us) # 42823 - O 1143 D 70 C 46 U 0 - nested 0
Sep 1 12:57:38 pocvmhost vpp[6315]: [85]: create_loopback
Sep 1 12:57:38 pocvmhost vpp[6315]: <vl_msg_api_barrier_sync<1(13us)vlib_node_runtime_update{(17us)}(17us)><1(25us)vlib_node_runtime_update{(27us)}(27us)><1(28us)vlib_node_runtime_update{(30us)}(30us)><1(44us)vlib_node_runtime_update{(47us)}(48us)><1(50us)vlib_node_runtime_update{(51us)}(52us)>(54us)>{(54us)vlib_worker_thread_node_runtime_update[70us][103us]worker=1<A104us><B114us>[425:427]<C168us><D169us><E181us><F182us><G183us>worker=2<A184us><B190us>[425:427]<C243us><D244us><E251us><F252us><G252us>worker=3<A253us><B260us>[425:427]<C299us><D299us><E305us><F306us><G306us>(307us)}
Sep 1 12:57:38 pocvmhost vpp[6315]: Barrier(us) # 42824 - O 300 D 67 C 307 U 0 - nested 5
Sep 1 12:57:38 pocvmhost vpp[6315]: [113]: sw_interface_set_l2_bridge
Sep 1 12:57:38 pocvmhost vpp[6315]: <vl_msg_api_barrier_sync<1(17us)vlib_node_runtime_update{(19us)}(19us)>(25us)>{(25us)vlib_worker_thread_node_runtime_update[43us][76us]worker=1<A77us><B85us>[427:427]<C138us><D139us><E146us><F147us><G148us>worker=2<A149us><B157us>[427:427]<C193us><D194us><E202us><F203us><G203us>worker=3<A204us><B211us>[427:427]<C251us><D251us><E257us><F258us><G258us>(259us)}
Sep 1 12:57:38 pocvmhost vpp[6315]: Barrier(us) # 42825 - O 1140 D 10 C 259 U 0 - nested 1
Sep 1 12:57:38 pocvmhost vpp[6315]: [16]: sw_interface_add_del_address
Sep 1 12:57:38 pocvmhost vpp[6315]: <vl_msg_api_barrier_sync<1(15us)vlib_node_runtime_update{(20us)}(21us)>(70us)>{(71us)vlib_worker_thread_node_runtime_update[87us][115us]worker=1<A116us><B123us>[427:427]<C180us><D181us><E187us><F188us><G188us>worker=2<A189us><B197us>[427:427]<C230us><D231us><E238us><F238us><G239us>worker=3<A239us><B246us>[427:427]<C297us><D298us><E305us><F305us><G306us>(307us)}

-----Original Message-----
From: Dave Barach (dbarach) [mailto:dbar...@cisco.com]
Sent: 01 September 2017 13:00
To: Colin Tregenza Dancer <c...@metaswitch.com>; Ole Troan <otr...@employees.org>
Cc: vpp-dev@lists.fd.io
Subject: RE: [vpp-dev] Packet loss on use of API & cmdline

Dear Colin,

Please describe the scenario which leads to vlib_node_runtime_update(). I wouldn't mind having a good long stare at the situation.

I do like the parallel data structure update approach that you've described, tempered with the realization that it amounts to "collective brain surgery." I had more than enough trouble making the data structure fork-and-update code work reliably in the first place.

Thanks… Dave

-----Original Message-----
From: vpp-dev-boun...@lists.fd.io [mailto:vpp-dev-boun...@lists.fd.io] On Behalf Of Colin Tregenza Dancer via vpp-dev
Sent: Friday, September 1, 2017 6:12 AM
To: Ole Troan <otr...@employees.org>
Cc: vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] Packet loss on use of API & cmdline

Hi Ole,

Thanks for the quick reply.

I did think about making all the commands we use is_mp_safe, but was both concerned about the extent of the work and the potential for introducing subtle bugs. I also didn't think it would help my key problem, which is the multi-ms commands that make multiple calls to vlib_node_runtime_update(), not least because it seemed likely that I'd need to hold the barrier across the multiple node changes in a single API call (to avoid inconsistent intermediate states).

Do you have any thoughts on the change to call vlib_worker_thread_node_runtime_update() a single time just before releasing the barrier? It seems to work fine, but I'm keen to get input from someone who has been working on the codebase for longer.

More generally, even with my changes, vlib_worker_thread_node_runtime_update() is the single function which holds the barrier for longer than all other elements, and is therefore the one which most runs the risk of causing Rx overflow. Detailed profiling showed that for my setup, ~40-50% of the time is taken in "/* re-fork nodes */" with the memory functions used to allocate the new clone nodes, and free the old clones.
Given that we know the number of nodes at the start of the loop, and given that (as far as I can tell) the new clone nodes aren't altered between calls to the update function, I tried a change to allocate/free all the nodes as a single block (whilst still cloning and inserting them as before). I needed to make a matching change in the "/* fork nodes */" code in start_workers() (and probably need to make a matching change in the termination code), but in testing this almost halves the execution time of vlib_worker_thread_node_runtime_update() without any obvious problems.

Having said that, the execution time of the node cloning remains O(M.N), where M is the number of threads and N the number of nodes. This is reflected in the fact that when I try on a larger system (i.e. more workers and more nodes) I again suffer packet loss because this one function is holding the barrier for multiple ms.

The change I'm currently working on is to try and reduce the delay to O(N) by getting the worker threads to clone their own data structures in parallel. I'm doing this by extending their busy wait on the barrier to also include looking for a flag telling them to rebuild their data structures. When the main thread is about to release the barrier and decides it needs a rebuild, it will do the relatively quick stats scraping, then set the flag telling the workers to rebuild their clones. The rebuild will then happen on all the workers in parallel (which looking at the code seems to be safe), and only when all the cloning is done will the main thread actually release the barrier.

I hope to get results from this soon, and will let you know how it goes, but again I'm very keen to get other people's views.

Cheers,

Colin.

_______________________________________________
vpp-dev mailing list
vpp-dev@lists.fd.io
https://lists.fd.io/mailman/listinfo/vpp-dev