On 3/26/19 3:39 PM, David Ahern wrote: > On 3/26/19 9:30 AM, Dmitry Safonov wrote: >> Fib trie has a hard-coded sync_pages limit to call synchronise_rcu(). >> The limit is 128 pages or 512Kb (considering common case with 4Kb >> pages). >> >> Unfortunately, at Arista we have use-scenarios with full view software >> forwarding. At the scale of 100K and more routes even on 2 core boxes >> the hard-coded limit starts actively shooting in the leg: lockup >> detector notices that rtnl_lock is held for seconds. >> First reason is previously broken MAX_WORK, that didn't limit pending >> balancing work. While fixing it, I've noticed that the bottle-neck is >> actually in the number of synchronise_rcu() calls. >> >> I've tried to fix it with a patch to decrement number of tnodes in rcu >> callback, but it hasn't much affected performance. >> >> One possible way to "fix" it - provide another sysctl to control >> sync_pages, but in my POV it's nasty - exposing another realisation >> detail into user-space. > > well, that was accepted last week. ;-) > > commit 9ab948a91b2c2abc8e82845c0e61f4b1683e3a4f > Author: David Ahern <dsah...@gmail.com> > Date: Wed Mar 20 09:18:59 2019 -0700 > > ipv4: Allow amount of dirty memory from fib resizing to be controllable > > > Can you see how that change (should backport easily) affects your test > case? From my perspective 16MB was the sweet spot.
FWIW, I would like to +Cc Paul here. TLDR; we're looking with David into ways to improve a hardcoded limit tnode_free_size at net/ipv4/fib_trie.c: currently it's way too low (512Kb). David created a patch to provide sysctl that controls the limit and it would solve a problem for both of us. In parallel, I thought that exposing this to userspace is not much fun and added a shrinker with synchronize_rcu(). I'm not any sure that the latter is actually a sane solution.. Is there any guarantee that memory to-be freed by call_rcu() will get freed in OOM conditions? Might there be a chance that we don't need any limit here at all? Worth to mention that I don't argue David's patch as I pointed that it would (will) solve the problem for us both, but with good intentions wondering if we can do something here rather a new sysctl knob. Thanks, Dmitry