Dear Colin,

Please describe the scenario which leads to vlib_node_runtime_update(). I 
wouldn't mind having a good long stare at the situation. 

I do like the parallel data structure update approach that you've described, 
tempered with the realization that it amounts to "collective brain surgery." I 
had more than enough trouble making the data structure fork-and-update code 
work reliably in the first place. 

Thanks… Dave

-----Original Message-----
From: vpp-dev-boun...@lists.fd.io [mailto:vpp-dev-boun...@lists.fd.io] On 
Behalf Of Colin Tregenza Dancer via vpp-dev
Sent: Friday, September 1, 2017 6:12 AM
To: Ole Troan <otr...@employees.org>
Cc: vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] Packet loss on use of API & cmdline

Hi Ole,

Thanks for the quick reply.

I did think about marking all the commands we use as is_mp_safe, but was 
concerned both about the extent of the work and the potential for introducing 
subtle bugs.  I also didn't think it would help with my key problem, which is 
the multi-ms commands which make multiple calls to vlib_node_runtime_update(), 
not least because it seemed likely that I'd need to hold the barrier across the 
multiple node changes in a single API call (to avoid inconsistent intermediate 
states).

Do you have any thoughts on the change to call 
vlib_worker_thread_node_runtime_update() a single time just before releasing 
the barrier?  It seems to work fine, but I'm keen to get input from someone who 
has been working on the codebase for longer.
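
To make the idea concrete, here's a minimal self-contained model of what I 
mean.  None of these names are the real vlib symbols, and the counter 
mechanism is my own sketch, not the actual code:

```c
/* Sketch only: defer the per-thread clone rebuild until the outermost
 * barrier release, instead of rebuilding on every node change. */
#include <assert.h>

static int barrier_depth;          /* nesting level of barrier syncs  */
static int runtime_update_pending; /* deferred-update counter         */
static int rebuild_count;          /* rebuilds actually performed     */

static void rebuild_worker_clones (void) { rebuild_count++; }

void barrier_sync (void)
{
  if (barrier_depth++ == 0)        /* outermost sync: new API call    */
    runtime_update_pending = 0;
}

/* Stands in for vlib_node_runtime_update(): instead of rebuilding the
 * per-thread structures immediately, just record that a rebuild is due. */
void node_runtime_update (void)
{
  runtime_update_pending++;
}

void barrier_release (void)
{
  if (--barrier_depth == 0 && runtime_update_pending)
    rebuild_worker_clones ();      /* one rebuild per API call        */
}
```

With several node changes inside one sync/release pair, the workers get 
updated once, rather than once per change.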


More generally, even with my changes, vlib_worker_thread_node_runtime_update() 
is the single function which holds the barrier longest, and is therefore the 
one which most risks causing Rx overflow.  

Detailed profiling showed that in my setup, ~40-50% of the time is spent in 
"/* re-fork nodes */", in the memory functions used to allocate the new clone 
nodes and free the old clones.  Given that we know the number of nodes at the 
start of the loop, and given that (as far as I can tell) the clone nodes aren't 
altered between calls to the update function, I tried a change to allocate/free 
all the nodes as a single block (whilst still cloning and inserting them as 
before).  I needed to make a matching change in the "/* fork nodes */" code in 
start_workers() (and probably need to make a matching change in the 
termination code), but in testing this almost halves the execution time of 
vlib_worker_thread_node_runtime_update() without any obvious problems. 
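
As a rough illustration of the allocation change (simplified stand-in types; 
the real change is of course against the vlib node runtime structures):

```c
/* Sketch of replacing one-allocation-per-node cloning with a single
 * contiguous block, since the node count is known up front. */
#include <stdlib.h>
#include <string.h>

typedef struct { int index; /* ... per-node runtime state ... */ } node_clone_t;

/* Old shape: one malloc per node when re-forking, one free per old clone. */
node_clone_t **clone_per_node (const node_clone_t *src, int n_nodes)
{
  node_clone_t **clones = calloc (n_nodes, sizeof (node_clone_t *));
  for (int i = 0; i < n_nodes; i++)
    {
      clones[i] = malloc (sizeof (node_clone_t));
      *clones[i] = src[i];
    }
  return clones;
}

/* New shape: allocate one block, clone into it, and free the whole old
 * block with a single call on the next update. */
node_clone_t *clone_single_block (const node_clone_t *src, int n_nodes)
{
  node_clone_t *block = malloc (n_nodes * sizeof (node_clone_t));
  memcpy (block, src, n_nodes * sizeof (node_clone_t));
  return block;
}
```

The cloning and insertion logic stays as before; only the number of 
allocator/free calls per update drops from O(N) to O(1).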

Having said that, the execution time of the node cloning remains O(M·N), where 
M is the number of threads and N the number of nodes.  This is reflected in the 
fact that when I try a larger system (i.e. more workers and more nodes) I 
again suffer packet loss, because this one function holds the barrier for 
multiple ms.

The change I'm currently working on is to try to reduce the delay to O(N) by 
getting the worker threads to clone their own data structures in parallel.  I'm 
doing this by extending their busy wait on the barrier to also look for a flag 
telling them to rebuild their data structures.  When the main thread is about 
to release the barrier and decides it needs a rebuild, it does the relatively 
quick stats scraping, then sets the flag telling the workers to rebuild their 
clones.  The rebuild then happens on all the workers in parallel (which, 
looking at the code, seems to be safe), and only when all the cloning is done 
does the main thread actually release the barrier.  
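
In outline, the scheme looks something like the following pthreads sketch 
(made-up names and a deliberately simplified handshake, not the actual vlib 
threading code):

```c
/* Sketch: while held at the barrier, each worker also polls a rebuild
 * flag; the main thread sets the flag, waits for every worker to finish
 * rebuilding its own clones, and only then opens the barrier. */
#include <pthread.h>
#include <stdatomic.h>

#define N_WORKERS 4

static atomic_int barrier_closed = 1;
static atomic_int rebuild_requested;
static atomic_int rebuilds_done;

static void rebuild_own_clones (int worker) { (void) worker; }

static void *worker_main (void *arg)
{
  int id = (int) (long) arg;
  while (atomic_load (&barrier_closed))      /* busy-wait on the barrier */
    if (atomic_load (&rebuild_requested))
      {
        rebuild_own_clones (id);             /* O(N) work, in parallel   */
        atomic_fetch_add (&rebuilds_done, 1);
        while (atomic_load (&rebuild_requested))
          ;                                  /* hold until phase is over */
      }
  return 0;
}

void main_thread_release (void)
{
  atomic_store (&rebuilds_done, 0);
  atomic_store (&rebuild_requested, 1);      /* tell workers to rebuild  */
  while (atomic_load (&rebuilds_done) < N_WORKERS)
    ;                                        /* wait for all the clones  */
  atomic_store (&rebuild_requested, 0);
  atomic_store (&barrier_closed, 0);         /* now open the barrier     */
}
```

The inner wait after each worker's rebuild stops a fast worker re-entering 
the rebuild before the main thread has cleared the flag.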

I hope to get results from this soon, and will let you know how it goes, but 
again I'm very keen to get other people's views.

Cheers,

Colin.

-----Original Message-----
From: Ole Troan [mailto:otr...@employees.org] 
Sent: 01 September 2017 09:37
To: Colin Tregenza Dancer <c...@metaswitch.com>
Cc: Neale Ranns (nranns) <nra...@cisco.com>; Florin Coras 
<fcoras.li...@gmail.com>; vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] Packet loss on use of API & cmdline

Colin,

Good investigation!

A good first step would be to make all APIs and CLIs thread safe.
When an API/CLI is thread safe, that must be flagged through the is_mp_safe 
flag.
It is quite likely that many already are, but haven't been flagged as such.

Best regards,
Ole


> On 31 Aug 2017, at 19:07, Colin Tregenza Dancer via vpp-dev 
> <vpp-dev@lists.fd.io> wrote:
> 
> I’ve been doing quite a bit of investigation since my last email, in 
> particular adding instrumentation on barrier calls to report 
> open/lowering/closed/raising times, along with calling trees and nesting 
> levels.
> 
> As a result I believe I now have a clearer understanding of what’s leading to 
> the packet loss I’m observing when using the API, along with some code 
> changes which in my testing reliably eliminate the 500K packet loss I was 
> previously observing.
> 
> Would either of you (or anyone else on the list) be able to offer their 
> opinions on my understanding of the causes, along with my proposed solutions?
> 
> Thanks in advance,
> 
> Colin.
> ---------
> In terms of observed barrier hold times, I’m seeing two main issues related 
> to API calls:
> 
>       • When I issue a long string of async API commands, there is no logic 
> (at least in the version of VPP I’m using) to space out their processing.  As 
> a result, if there is a queue of requests, the barrier is opened for just a 
> few us between API calls, before lowering again.  This is enough to start one 
> burst of packet processing per worker thread (I can see the barrier lower 
> ends up taking ~100us), but over time not enough to keep up with the input 
> traffic.
> 
>       • Whilst many API calls close the barrier for between a few 10’s of 
> microseconds and a few hundred microseconds, there are a number of calls 
> where this extends from 500us+ into the multiple ms range (which obviously 
> causes the Rx ring buffers to overflow).  The particular API calls where I’ve 
> seen this include:  ip_neighbor_add_del, gre_add_del_tunnel, create_loopback, 
> sw_interface_set_l2_bridge & sw_interface_add_del_address (though there may 
> be others which I’m not currently calling).
> 
> Digging into the call stacks, I can see that in each case there are multiple 
> calls to vlib_node_runtime_update()  (I assume one for each node changed), 
> and each of these calls invokes vlib_worker_thread_node_runtime_update() just 
> before returning (I assume to sync the per thread datastructures with the 
> updated graph).  The observed execution time for 
> vlib_worker_thread_node_runtime_update() seems to vary with load, config 
> size, etc, but times of between 400us and 800us per call are not atypical in 
> my setup.  If there are 5 or 6 invocations of this function per API call, we 
> therefore rapidly get to a situation where the barrier is held for multiple 
> ms.
> 
> The two workarounds I’ve been using are both changes to vlib/vlib/threads.c :
> 
>       • When closing the barrier in vlib_worker_thread_barrier_sync (but not 
> for recursive invocations), if it hasn’t been open for at least a certain 
> minimum period of time (I’ve been running with 300us), then spin until this 
> minimum is reached, before closing.  This ensures that whatever the source of 
> the barrier sync (API, command line, etc), the datapath is always allowed a 
> fair fraction of time to run. (I’ve got in mind various adaptive ways of 
> setting the delay, including a rolling measure of open period over, say, the 
> last 1ms, and/or Rx ring state, but for initial testing a fixed value seemed 
> easiest.)
> 
>       • From my (potentially superficial) code read, it looks like 
> vlib_worker_thread_node_runtime_update() could be called once to update the 
> workers with multiple node changes (as long as the barrier remains closed 
> between changes), rather than having to be called for each individual change.
> 
> I have therefore tweaked vlib_worker_thread_node_runtime_update(), so that 
> instead of doing the update to the per thread data structures, by default it 
> simply increments a count and returns.  The count is cleared each time the 
> barrier is closed in vlib_worker_thread_barrier_sync()  (but not for 
> recursive invocations), and if it is non-zero when 
> vlib_worker_thread_barrier_release() is about to open the barrier, then 
> vlib_worker_thread_node_runtime_update() is called with a flag which causes it to 
> actually do the updating.  This means that the per thread data structures are 
> only updated once per API call, rather than for each individual node change.
> 
> In my testing this change has reduced the period for which the problem API 
> calls close the barrier, from multiple ms to sub-ms (generally under 500us).  
> I have not yet observed any negative consequences (though I fully accept I 
> might well have missed something).
> 
> Together these two changes eliminate the packet loss I was seeing when using 
> the API under load.
> 
> Views?
> 
> (Whilst the API packet loss is currently most important to me, I believe I 
> may have also tracked down the cause of the packet loss when issuing debug 
> commands.  It seems as if the debug commands which produce output can block 
> whilst the data is flushed, and if this occurs with the barrier down, then we 
> get similar overflow on the Rx rings.  Having said that, because the API 
> problems are more critical, I’ve not yet tried any workarounds.)
> 
> From: vpp-dev-boun...@lists.fd.io [mailto:vpp-dev-boun...@lists.fd.io] On 
> Behalf Of Colin Tregenza Dancer via vpp-dev
> Sent: 22 August 2017 15:05
> To: Neale Ranns (nranns) <nra...@cisco.com>
> Cc: vpp-dev@lists.fd.io
> Subject: Re: [vpp-dev] Packet loss on use of API & cmdline
> 
> With my current setup (a fairly modest 2Mpps of background traffic each way 
> between a pair of 10G ports on an Intel X520 NIC, with baremetal Ubuntu 16, 
> vpp 17.01 and a couple of cores per NIC), I observed a range of different 
> packet loss scenarios:
> 
>       • 1K-80K packets lost if I issue any of a range of stats/info commands 
> from the telnet command line: “show hard”, “show int”, “show ip arp”, “show 
> ip fib”, “show fib path”.   (I haven’t yet tried the same calls via the API, 
> but from code reading would expect similar results.)
>       • Issuing an “ip route add” / “ip route del” pair from the telnet 
> command line, I see 0.5K-30K packets dropped, mainly on the del.
>       • Using the API, if I issue a close sequence of commands to create a 
> new GRE tunnel and setup individual forwarding entries for 64 endpoints at 
> the other end of that tunnel, I see 100K-500K packets dropped.
> 
> Cheers,
> 
> Colin.
> 
> P.S. Have fun on the beach!
> 
> 
> From: Neale Ranns (nranns) [mailto:nra...@cisco.com]
> Sent: 22 August 2017 14:35
> To: Colin Tregenza Dancer <c...@metaswitch.com>; Florin Coras 
> <fcoras.li...@gmail.com>
> Cc: vpp-dev@lists.fd.io
> Subject: Re: [vpp-dev] Packet loss on use of API & cmdline
> 
> 
> Hi Colin,
> 
> Your comments were not taken as criticism :-) constructive comments are always 
> greatly appreciated.
> 
> Apart from the non-MP safe APIs Florin mentioned, and the route add/del cases 
> I covered, the consensus is certainly that packet loss should not occur 
> during a ‘typical’ update and we will do what we can to address it.
> Could you give us* some specific examples of the operations you do where you 
> see packet loss?
> 
> Thanks,
> Neale
> 
> *I say us not me as I’m about to hit the beach for a couple of weeks.
> _______________________________________________
> vpp-dev mailing list
> vpp-dev@lists.fd.io
> https://lists.fd.io/mailman/listinfo/vpp-dev
