Re: Long delays creating a netns after deleting one (possibly RCU related)
On Fri, 2016-11-18 at 16:38 -0800, Jarno Rajahalme wrote:
> This fixes the problem for me, so for whatever it's worth:
>
> Tested-by: Jarno Rajahalme

Thanks for testing!

https://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=e88a2766143a27bfe6704b4493b214de4094cf29
Re: Long delays creating a netns after deleting one (possibly RCU related)
> On Nov 14, 2016, at 3:09 PM, Eric Dumazet wrote:
>
> On Mon, 2016-11-14 at 14:46 -0800, Eric Dumazet wrote:
>> On Mon, 2016-11-14 at 16:12 -0600, Eric W. Biederman wrote:
>>
>>> synchronize_rcu_expedited is not enough if you have multiple network
>>> devices in play.
>>>
>>> Looking at the code it comes down to this commit, and it appears there
>>> is a promise to add rcu grace period combining by Eric Dumazet.
>>>
>>> Eric, since people are hitting noticeable stalls because of the rcu grace
>>> period taking a long time, do you think you could look at this code path
>>> a bit more?
>>>
>>> commit 93d05d4a320cb16712bb3d57a9658f395d8cecb9
>>> Author: Eric Dumazet
>>> Date: Wed Nov 18 06:31:03 2015 -0800
>>
>> Absolutely, I will take a look asap.
>
> The worst offender should be fixed by the following patch.
>
> busy poll needs to poll the physical device, not a virtual one...
>
> diff --git a/include/net/gro_cells.h b/include/net/gro_cells.h
> index d15214d673b2e8e08fd6437b572278fb1359f10d..2a1abbf8da74368cd01adc40cef6c0644e059ef2 100644
> --- a/include/net/gro_cells.h
> +++ b/include/net/gro_cells.h
> @@ -68,6 +68,9 @@ static inline int gro_cells_init(struct gro_cells *gcells, struct net_device *dev)
>  		struct gro_cell *cell = per_cpu_ptr(gcells->cells, i);
>
>  		__skb_queue_head_init(&cell->napi_skbs);
> +
> +		set_bit(NAPI_STATE_NO_BUSY_POLL, &cell->napi.state);
> +
>  		netif_napi_add(dev, &cell->napi, gro_cell_poll, 64);
>  		napi_enable(&cell->napi);
>  	}

This fixes the problem for me, so for whatever it's worth:

Tested-by: Jarno Rajahalme
Re: Long delays creating a netns after deleting one (possibly RCU related)
> On Nov 14, 2016, at 3:09 PM, Eric Dumazet wrote:
>
> On Mon, 2016-11-14 at 14:46 -0800, Eric Dumazet wrote:
>> On Mon, 2016-11-14 at 16:12 -0600, Eric W. Biederman wrote:
>>
>>> synchronize_rcu_expedited is not enough if you have multiple network
>>> devices in play.
>>>
>>> Looking at the code it comes down to this commit, and it appears there
>>> is a promise to add rcu grace period combining by Eric Dumazet.
>>>
>>> Eric, since people are hitting noticeable stalls because of the rcu grace
>>> period taking a long time, do you think you could look at this code path
>>> a bit more?
>>>
>>> commit 93d05d4a320cb16712bb3d57a9658f395d8cecb9
>>> Author: Eric Dumazet
>>> Date: Wed Nov 18 06:31:03 2015 -0800
>>
>> Absolutely, I will take a look asap.
>
> The worst offender should be fixed by the following patch.
>
> busy poll needs to poll the physical device, not a virtual one...
>
> diff --git a/include/net/gro_cells.h b/include/net/gro_cells.h
> index d15214d673b2e8e08fd6437b572278fb1359f10d..2a1abbf8da74368cd01adc40cef6c0644e059ef2 100644
> --- a/include/net/gro_cells.h
> +++ b/include/net/gro_cells.h
> @@ -68,6 +68,9 @@ static inline int gro_cells_init(struct gro_cells *gcells, struct net_device *dev)
>  		struct gro_cell *cell = per_cpu_ptr(gcells->cells, i);
>
>  		__skb_queue_head_init(&cell->napi_skbs);
> +
> +		set_bit(NAPI_STATE_NO_BUSY_POLL, &cell->napi.state);
> +
>  		netif_napi_add(dev, &cell->napi, gro_cell_poll, 64);
>  		napi_enable(&cell->napi);
>  	}

This solved a ~20 second slowdown between OVS datapath unit tests for me.

Jarno
Re: Long delays creating a netns after deleting one (possibly RCU related)
On Mon, 2016-11-14 at 14:46 -0800, Eric Dumazet wrote:
> On Mon, 2016-11-14 at 16:12 -0600, Eric W. Biederman wrote:
>
> > synchronize_rcu_expedited is not enough if you have multiple network
> > devices in play.
> >
> > Looking at the code it comes down to this commit, and it appears there
> > is a promise to add rcu grace period combining by Eric Dumazet.
> >
> > Eric, since people are hitting noticeable stalls because of the rcu grace
> > period taking a long time, do you think you could look at this code path
> > a bit more?
> >
> > commit 93d05d4a320cb16712bb3d57a9658f395d8cecb9
> > Author: Eric Dumazet
> > Date: Wed Nov 18 06:31:03 2015 -0800
>
> Absolutely, I will take a look asap.

The worst offender should be fixed by the following patch.

busy poll needs to poll the physical device, not a virtual one...

diff --git a/include/net/gro_cells.h b/include/net/gro_cells.h
index d15214d673b2e8e08fd6437b572278fb1359f10d..2a1abbf8da74368cd01adc40cef6c0644e059ef2 100644
--- a/include/net/gro_cells.h
+++ b/include/net/gro_cells.h
@@ -68,6 +68,9 @@ static inline int gro_cells_init(struct gro_cells *gcells, struct net_device *dev)
 		struct gro_cell *cell = per_cpu_ptr(gcells->cells, i);
 
 		__skb_queue_head_init(&cell->napi_skbs);
+
+		set_bit(NAPI_STATE_NO_BUSY_POLL, &cell->napi.state);
+
 		netif_napi_add(dev, &cell->napi, gro_cell_poll, 64);
 		napi_enable(&cell->napi);
 	}
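[Note: why one bit is enough here -- in the 4.9-era core, napi_hash_add() returns immediately if NAPI_STATE_NO_BUSY_POLL is already set, and netif_napi_del() only pays a synchronize_net() for a NAPI that was actually hashed for busy polling. Since gro_cells allocates one NAPI per possible CPU, tunnel teardown previously waited for one RCU grace period per CPU; with the bit pre-set it waits for none. The following is a self-contained userspace model of that accounting, not kernel code; the names merely mirror the kernel's.]

#include <stdio.h>
#include <stdbool.h>

/* Userspace model only: count how many simulated RCU grace periods a
 * teardown of NCELLS per-cpu gro_cell NAPIs would pay. */

#define NCELLS 4

struct napi_model { bool hashed; };

static int grace_periods;

static void synchronize_net_model(void) { grace_periods++; }

/* Models netif_napi_add(): a NAPI with NO_BUSY_POLL pre-set is never
 * inserted into the busy-poll hash. */
static void napi_add_model(struct napi_model *n, bool no_busy_poll)
{
	n->hashed = !no_busy_poll;
}

/* Models netif_napi_del(): only a hashed NAPI forces a grace period. */
static void napi_del_model(struct napi_model *n)
{
	if (n->hashed) {
		n->hashed = false;
		synchronize_net_model();
	}
}

int main(void)
{
	struct napi_model cells[NCELLS] = { { false } };
	int i;

	/* Before the patch: every gro_cell NAPI gets hashed. */
	for (i = 0; i < NCELLS; i++)
		napi_add_model(&cells[i], false);
	for (i = 0; i < NCELLS; i++)
		napi_del_model(&cells[i]);
	printf("without NO_BUSY_POLL: %d grace periods\n", grace_periods);

	/* After the patch: the bit is set up front, nothing is hashed. */
	grace_periods = 0;
	for (i = 0; i < NCELLS; i++)
		napi_add_model(&cells[i], true);
	for (i = 0; i < NCELLS; i++)
		napi_del_model(&cells[i]);
	printf("with NO_BUSY_POLL:    %d grace periods\n", grace_periods);
	return 0;
}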
Re: Long delays creating a netns after deleting one (possibly RCU related)
On Mon, 2016-11-14 at 16:12 -0600, Eric W. Biederman wrote:

> synchronize_rcu_expedited is not enough if you have multiple network
> devices in play.
>
> Looking at the code it comes down to this commit, and it appears there
> is a promise to add rcu grace period combining by Eric Dumazet.
>
> Eric, since people are hitting noticeable stalls because of the rcu grace
> period taking a long time, do you think you could look at this code path
> a bit more?
>
> commit 93d05d4a320cb16712bb3d57a9658f395d8cecb9
> Author: Eric Dumazet
> Date: Wed Nov 18 06:31:03 2015 -0800

Absolutely, I will take a look asap.

Thanks Eric.
Re: Long delays creating a netns after deleting one (possibly RCU related)
"Paul E. McKenney"writes: > On Mon, Nov 14, 2016 at 09:44:35AM -0800, Cong Wang wrote: >> On Mon, Nov 14, 2016 at 8:24 AM, Paul E. McKenney >> wrote: >> > On Sun, Nov 13, 2016 at 10:47:01PM -0800, Cong Wang wrote: >> >> On Fri, Nov 11, 2016 at 4:55 PM, Cong Wang >> >> wrote: >> >> > On Fri, Nov 11, 2016 at 4:23 PM, Paul E. McKenney >> >> > wrote: >> >> >> >> >> >> Ah! This net_mutex is different than RTNL. Should synchronize_net() >> >> >> be >> >> >> modified to check for net_mutex being held in addition to the current >> >> >> checks for RTNL being held? >> >> >> >> >> > >> >> > Good point! >> >> > >> >> > Like commit be3fc413da9eb17cce0991f214ab0, checking >> >> > for net_mutex for this case seems to be an optimization, I assume >> >> > synchronize_rcu_expedited() and synchronize_rcu() have the same >> >> > behavior... >> >> >> >> Thinking a bit more, I think commit be3fc413da9eb17cce0991f >> >> gets wrong on rtnl_is_locked(), the lock could be locked by other >> >> process not by the current one, therefore it should be >> >> lockdep_rtnl_is_held() which, however, is defined only when LOCKDEP >> >> is enabled... Sigh. >> >> >> >> I don't see any better way than letting callers decide if they want the >> >> expedited version or not, but this requires changes of all callers of >> >> synchronize_net(). Hm. >> > >> > I must confess that I don't understand how it would help to use an >> > expedited grace period when some other process is holding RTNL. >> > In contrast, I do well understand how it helps when the current process >> > is holding RTNL. >> >> Yeah, this is exactly my point. And same for ASSERT_RTNL() which checks >> rtnl_is_locked(), clearly we need to assert "it is held by the current >> process" >> rather than "it is locked by whatever process". >> >> But given *_is_held() is always defined by LOCKDEP, so we probably need >> mutex to provide such a helper directly, mutex->owner is not always defined >> either. :-/ > > There is always the option of making acquisition and release set a per-task > variable that can be tested. (Where did I put that asbestos suit, anyway?) > > Thanx, Paul synchronize_rcu_expidited is not enough if you have multiple network devices in play. Looking at the code it comes down to this commit, and it appears there is a promise add rcu grace period combining by Eric Dumazet. Eric since people are hitting noticable stalls because of the rcu grace period taking a long time do you think you could look at this code path a bit more? commit 93d05d4a320cb16712bb3d57a9658f395d8cecb9 Author: Eric Dumazet Date: Wed Nov 18 06:31:03 2015 -0800 net: provide generic busy polling to all NAPI drivers NAPI drivers no longer need to observe a particular protocol to benefit from busy polling (CONFIG_NET_RX_BUSY_POLL=y) napi_hash_add() and napi_hash_del() are automatically called from core networking stack, respectively from netif_napi_add() and netif_napi_del() This patch depends on free_netdev() and netif_napi_del() being called from process context, which seems to be the norm. Drivers might still prefer to call napi_hash_del() on their own, since they might combine all the rcu grace periods into a single one, knowing their NAPI structures lifetime, while core networking stack has no idea of a possible combining. Once this patch proves to not bring serious regressions, we will cleanup drivers to either remove napi_hash_del() or provide appropriate rcu grace periods combining. Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller Eric
Re: Long delays creating a netns after deleting one (possibly RCU related)
On Mon, Nov 14, 2016 at 09:44:35AM -0800, Cong Wang wrote:
> On Mon, Nov 14, 2016 at 8:24 AM, Paul E. McKenney wrote:
> > On Sun, Nov 13, 2016 at 10:47:01PM -0800, Cong Wang wrote:
> >> On Fri, Nov 11, 2016 at 4:55 PM, Cong Wang wrote:
> >> > On Fri, Nov 11, 2016 at 4:23 PM, Paul E. McKenney wrote:
> >> >>
> >> >> Ah!  This net_mutex is different than RTNL.  Should synchronize_net() be
> >> >> modified to check for net_mutex being held in addition to the current
> >> >> checks for RTNL being held?
> >> >>
> >> >
> >> > Good point!
> >> >
> >> > Like commit be3fc413da9eb17cce0991f214ab0, checking
> >> > for net_mutex for this case seems to be an optimization, I assume
> >> > synchronize_rcu_expedited() and synchronize_rcu() have the same
> >> > behavior...
> >>
> >> Thinking a bit more, I think commit be3fc413da9eb17cce0991f
> >> gets rtnl_is_locked() wrong: the lock could be locked by another
> >> process, not by the current one, therefore it should be
> >> lockdep_rtnl_is_held() which, however, is defined only when LOCKDEP
> >> is enabled... Sigh.
> >>
> >> I don't see any better way than letting callers decide if they want the
> >> expedited version or not, but this requires changes of all callers of
> >> synchronize_net(). Hm.
> >
> > I must confess that I don't understand how it would help to use an
> > expedited grace period when some other process is holding RTNL.
> > In contrast, I do well understand how it helps when the current process
> > is holding RTNL.
>
> Yeah, this is exactly my point. And the same goes for ASSERT_RTNL(), which
> checks rtnl_is_locked(); clearly we need to assert "it is held by the
> current process" rather than "it is locked by whatever process".
>
> But given that *_is_held() is defined only with LOCKDEP, we probably need
> the mutex API to provide such a helper directly; mutex->owner is not always
> defined either. :-/

There is always the option of making acquisition and release set a per-task
variable that can be tested.  (Where did I put that asbestos suit, anyway?)

							Thanx, Paul
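[Note: concretely, the per-task-variable idea could look like the sketch below. It is entirely hypothetical: the net_mutex_held flag (assumed added to struct task_struct) and the wrapper names are invented, and nothing like this was merged.]

/* Hypothetical sketch only. Assumes a new field in struct task_struct:
 *
 *	bool net_mutex_held;
 */
static inline void net_ns_lock(void)
{
	mutex_lock(&net_mutex);
	current->net_mutex_held = true;		/* hypothetical field */
}

static inline void net_ns_unlock(void)
{
	current->net_mutex_held = false;	/* hypothetical field */
	mutex_unlock(&net_mutex);
}

/* synchronize_net() could then ask "does *this* task hold net_mutex?"
 * without lockdep and without peeking at mutex->owner: */
void synchronize_net(void)
{
	might_sleep();
	if (rtnl_is_locked() || current->net_mutex_held)
		synchronize_rcu_expedited();
	else
		synchronize_rcu();
}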
Re: Long delays creating a netns after deleting one (possibly RCU related)
On Mon, Nov 14, 2016 at 8:24 AM, Paul E. McKenney wrote:
> On Sun, Nov 13, 2016 at 10:47:01PM -0800, Cong Wang wrote:
>> On Fri, Nov 11, 2016 at 4:55 PM, Cong Wang wrote:
>> > On Fri, Nov 11, 2016 at 4:23 PM, Paul E. McKenney wrote:
>> >>
>> >> Ah!  This net_mutex is different than RTNL.  Should synchronize_net() be
>> >> modified to check for net_mutex being held in addition to the current
>> >> checks for RTNL being held?
>> >>
>> >
>> > Good point!
>> >
>> > Like commit be3fc413da9eb17cce0991f214ab0, checking
>> > for net_mutex for this case seems to be an optimization, I assume
>> > synchronize_rcu_expedited() and synchronize_rcu() have the same
>> > behavior...
>>
>> Thinking a bit more, I think commit be3fc413da9eb17cce0991f
>> gets rtnl_is_locked() wrong: the lock could be locked by another
>> process, not by the current one, therefore it should be
>> lockdep_rtnl_is_held() which, however, is defined only when LOCKDEP
>> is enabled... Sigh.
>>
>> I don't see any better way than letting callers decide if they want the
>> expedited version or not, but this requires changes of all callers of
>> synchronize_net(). Hm.
>
> I must confess that I don't understand how it would help to use an
> expedited grace period when some other process is holding RTNL.
> In contrast, I do well understand how it helps when the current process
> is holding RTNL.

Yeah, this is exactly my point. And the same goes for ASSERT_RTNL(), which
checks rtnl_is_locked(); clearly we need to assert "it is held by the
current process" rather than "it is locked by whatever process".

But given that *_is_held() is defined only with LOCKDEP, we probably need
the mutex API to provide such a helper directly; mutex->owner is not always
defined either. :-/
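[Note: for reference, the distinction under discussion, paraphrased from net/core/rtnetlink.c of that era (slightly simplified): rtnl_is_locked() only reports that somebody holds the mutex, while the lockdep variant consults the current task's held-lock list and therefore exists only with lockdep enabled.]

int rtnl_is_locked(void)
{
	/* True if *any* task holds RTNL, not necessarily this one. */
	return mutex_is_locked(&rtnl_mutex);
}

#ifdef CONFIG_PROVE_LOCKING
bool lockdep_rtnl_is_held(void)
{
	/* Lockdep tracks held locks per task, so this really means
	 * "held by the current task" -- but it is compiled out without
	 * lockdep. */
	return lockdep_is_held(&rtnl_mutex);
}
#endif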
Re: Long delays creating a netns after deleting one (possibly RCU related)
Hi Cong,

On Sat, Nov 12, 2016, at 01:55, Cong Wang wrote:
> On Fri, Nov 11, 2016 at 4:23 PM, Paul E. McKenney wrote:
> >
> > Ah!  This net_mutex is different than RTNL.  Should synchronize_net() be
> > modified to check for net_mutex being held in addition to the current
> > checks for RTNL being held?
> >
>
> Good point!
>
> Like commit be3fc413da9eb17cce0991f214ab0, checking
> for net_mutex for this case seems to be an optimization, I assume
> synchronize_rcu_expedited() and synchronize_rcu() have the same
> behavior...
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index eaad4c2..3415b6b 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -7762,7 +7762,7 @@ EXPORT_SYMBOL(free_netdev);
>  void synchronize_net(void)
>  {
>  	might_sleep();
> -	if (rtnl_is_locked())
> +	if (rtnl_is_locked() || lockdep_is_held(&net_mutex))
>  		synchronize_rcu_expedited();

I don't think we should depend on lockdep for this check but rather use
mutex_is_locked here (I think it would fail to build like this without
CONFIG_LOCKDEP).

Bye,
Hannes
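[Note: Hannes' suggestion would read roughly as below (my rendering, not a patch he posted). mutex_is_locked() is always available, so this builds without lockdep; but like rtnl_is_locked() it answers "locked by somebody", not "held by the current task", which is exactly the weakness Cong points out above.]

void synchronize_net(void)
{
	might_sleep();
	/* mutex_is_locked() builds with or without CONFIG_LOCKDEP, but
	 * it is also true when some *other* task holds net_mutex. */
	if (rtnl_is_locked() || mutex_is_locked(&net_mutex))
		synchronize_rcu_expedited();
	else
		synchronize_rcu();
}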
Re: Long delays creating a netns after deleting one (possibly RCU related)
On Sun, Nov 13, 2016 at 10:47:01PM -0800, Cong Wang wrote:
> On Fri, Nov 11, 2016 at 4:55 PM, Cong Wang wrote:
> > On Fri, Nov 11, 2016 at 4:23 PM, Paul E. McKenney wrote:
> >>
> >> Ah!  This net_mutex is different than RTNL.  Should synchronize_net() be
> >> modified to check for net_mutex being held in addition to the current
> >> checks for RTNL being held?
> >>
> >
> > Good point!
> >
> > Like commit be3fc413da9eb17cce0991f214ab0, checking
> > for net_mutex for this case seems to be an optimization, I assume
> > synchronize_rcu_expedited() and synchronize_rcu() have the same
> > behavior...
>
> Thinking a bit more, I think commit be3fc413da9eb17cce0991f
> gets rtnl_is_locked() wrong: the lock could be locked by another
> process, not by the current one, therefore it should be
> lockdep_rtnl_is_held() which, however, is defined only when LOCKDEP
> is enabled... Sigh.
>
> I don't see any better way than letting callers decide if they want the
> expedited version or not, but this requires changes of all callers of
> synchronize_net(). Hm.

I must confess that I don't understand how it would help to use an
expedited grace period when some other process is holding RTNL.
In contrast, I do well understand how it helps when the current process
is holding RTNL.

So what am I missing here?

							Thanx, Paul
Re: Long delays creating a netns after deleting one (possibly RCU related)
On Fri, Nov 11, 2016 at 4:55 PM, Cong Wang wrote:
> On Fri, Nov 11, 2016 at 4:23 PM, Paul E. McKenney wrote:
>>
>> Ah!  This net_mutex is different than RTNL.  Should synchronize_net() be
>> modified to check for net_mutex being held in addition to the current
>> checks for RTNL being held?
>>
>
> Good point!
>
> Like commit be3fc413da9eb17cce0991f214ab0, checking
> for net_mutex for this case seems to be an optimization, I assume
> synchronize_rcu_expedited() and synchronize_rcu() have the same
> behavior...

Thinking a bit more, I think commit be3fc413da9eb17cce0991f
gets rtnl_is_locked() wrong: the lock could be locked by another
process, not by the current one, therefore it should be
lockdep_rtnl_is_held() which, however, is defined only when LOCKDEP
is enabled... Sigh.

I don't see any better way than letting callers decide if they want the
expedited version or not, but this requires changes of all callers of
synchronize_net(). Hm.
Re: Long delays creating a netns after deleting one (possibly RCU related)
On Fri, Nov 11, 2016 at 4:23 PM, Paul E. McKenney wrote:
>
> Ah!  This net_mutex is different than RTNL.  Should synchronize_net() be
> modified to check for net_mutex being held in addition to the current
> checks for RTNL being held?
>

Good point!

Like commit be3fc413da9eb17cce0991f214ab0, checking
for net_mutex for this case seems to be an optimization, I assume
synchronize_rcu_expedited() and synchronize_rcu() have the same
behavior...

diff --git a/net/core/dev.c b/net/core/dev.c
index eaad4c2..3415b6b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7762,7 +7762,7 @@ EXPORT_SYMBOL(free_netdev);
 void synchronize_net(void)
 {
 	might_sleep();
-	if (rtnl_is_locked())
+	if (rtnl_is_locked() || lockdep_is_held(&net_mutex))
 		synchronize_rcu_expedited();
 	else
 		synchronize_rcu();
Re: Long delays creating a netns after deleting one (possibly RCU related)
On Fri, Nov 11, 2016 at 01:11:01PM +, Rolf Neugebauer wrote:
> On Thu, Nov 10, 2016 at 9:24 PM, Paul E. McKenney wrote:
> > On Thu, Nov 10, 2016 at 09:37:47AM -0800, Cong Wang wrote:
> >> (Cc'ing Paul)
> >>
> >> On Wed, Nov 9, 2016 at 7:42 AM, Rolf Neugebauer wrote:
> >> > Hi
> >> >
> >> > We noticed some long delays starting docker containers on some newer
> >> > kernels (starting with 4.5.x and still present in 4.9-rc4; 4.4.x is
> >> > fine). We narrowed this down to the creation of a network namespace
> >> > being delayed directly after removing another one (details and
> >> > reproduction below). We have seen delays of up to 60s on some systems.
> >> >
> >> > - The delay is proportional to the number of CPUs (online or offline).
> >> > We first discovered it with a Hyper-V Linux VM. Hyper-V advertises up
> >> > to 240 offline vCPUs even if one configures the VM with only, say, 2
> >> > vCPUs. We see a linear increase in delay when we change NR_CPUS in the
> >> > kernel config.
> >> >
> >> > - The delay is also dependent on some tunnel network interfaces being
> >> > present (which we had compiled in, in one of our kernel configs).
> >> >
> >> > - We can reproduce this issue with stock kernels from
> >> > http://kernel.ubuntu.com/~kernel-ppa/mainline/ running in Hyper-V VMs
> >> > as well as other hypervisors like qemu and hyperkit where we have good
> >> > control over the number of CPUs.
> >> >
> >> > A simple test is:
> >> > modprobe ipip
> >> > modprobe ip_gre
> >> > modprobe ip_vti
> >> > echo -n "add netns foo ===> "; /usr/bin/time -f "%E" ip netns add foo
> >> > echo -n "del netns foo ===> "; /usr/bin/time -f "%E" ip netns delete foo
> >> > echo -n "add netns bar ===> "; /usr/bin/time -f "%E" ip netns add bar
> >> > echo -n "del netns bar ===> "; /usr/bin/time -f "%E" ip netns delete bar
> >> >
> >> > with an output like:
> >> > add netns foo ===> 0:00.00
> >> > del netns foo ===> 0:00.01
> >> > add netns bar ===> 0:08.53
> >> > del netns bar ===> 0:00.01
> >> >
> >> > This is on a 4.9-rc4 kernel from the above URL configured with
> >> > NR_CPUS=256 running in a Hyper-V VM (kernel config attached).
> >> >
> >> > Below is a dump of the work queues while the second 'ip add netns' is
> >> > hanging. The state of the work queues does not seem to change while
> >> > the command is delayed and the pattern shown is consistent across
> >> > different kernel versions.
> >> >
> >> > Is this a known issue and/or is someone working on a fix?
> >>
> >> Not to me.
> >>
> >> > [ 610.356272] sysrq: SysRq : Show Blocked State
> >> > [ 610.356742]   task                        PC stack   pid father
> >> > [ 610.357252] kworker/u480:1  D    0  1994      2 0x
> >> > [ 610.357752] Workqueue: netns cleanup_net
> >> > [ 610.358239]  9892f1065800 9892ee1e1e00 9892f8e59340
> >> > [ 610.358705]  9892f4526900 bf0104b5ba88 be486df3 bf0104b5ba60
> >> > [ 610.359168]  00ffbdcbe663 9892f8e59340 000100012e70 9892ee1e1e00
> >> > [ 610.359677] Call Trace:
> >> > [ 610.360169]  [] ? __schedule+0x233/0x6e0
> >> > [ 610.360723]  [] schedule+0x36/0x80
> >> > [ 610.361194]  [] schedule_timeout+0x22a/0x3f0
> >> > [ 610.361789]  [] ? __schedule+0x23b/0x6e0
> >> > [ 610.362260]  [] wait_for_completion+0xb4/0x140
> >> > [ 610.362736]  [] ? wake_up_q+0x80/0x80
> >> > [ 610.363306]  [] __wait_rcu_gp+0xc8/0xf0
> >> > [ 610.363782]  [] synchronize_sched+0x5c/0x80
> >> > [ 610.364137]  [] ? call_rcu_bh+0x20/0x20
> >> > [ 610.364742]  [] ? trace_raw_output_rcu_utilization+0x60/0x60
> >> > [ 610.365337]  [] synchronize_net+0x1c/0x30
> >>
> >> This is a worker which holds the net_mutex and is waiting for
> >> an RCU grace period to elapse.

Ah!  This net_mutex is different than RTNL.  Should synchronize_net()
be modified to check for net_mutex being held in addition to the current
checks for RTNL being held?

							Thanx, Paul

> >> > [ 610.365846]  [] netif_napi_del+0x23/0x80
> >> > [ 610.367494]  [] ip_tunnel_dev_free+0x68/0xf0 [ip_tunnel]
> >> > [ 610.368007]  [] netdev_run_todo+0x230/0x330
> >> > [ 610.368454]  [] rtnl_unlock+0xe/0x10
> >> > [ 610.369001]  [] ip_tunnel_delete_net+0xdf/0x120 [ip_tunnel]
> >> > [ 610.369500]  [] ipip_exit_net+0x2c/0x30 [ipip]
> >> > [ 610.369997]  [] ops_exit_list.isra.4+0x38/0x60
> >> > [ 610.370636]  [] cleanup_net+0x1c4/0x2b0
> >> > [ 610.371130]  [] process_one_work+0x1fc/0x4b0
> >> > [ 610.371812]  [] worker_thread+0x4b/0x500
> >> > [ 610.373074]  [] ? process_one_work+0x4b0/0x4b0
> >> > [ 610.373622]  [] ? process_one_work+0x4b0/0x4b0
> >> > [ 610.374100]  [] kthread+0xd9/0xf0
> >> > [ 610.374574]  [] ? kthread_park+0x60/0x60
> >> > [ 610.375198]  [] ret_from_fork+0x25/0x30
> >> > [ 610.375678] ip              D    0  2149   2148
Re: Long delays creating a netns after deleting one (possibly RCU related)
On Thu, Nov 10, 2016 at 9:24 PM, Paul E. McKenney wrote:
> On Thu, Nov 10, 2016 at 09:37:47AM -0800, Cong Wang wrote:
>> (Cc'ing Paul)
>>
>> On Wed, Nov 9, 2016 at 7:42 AM, Rolf Neugebauer wrote:
>> > Hi
>> >
>> > We noticed some long delays starting docker containers on some newer
>> > kernels (starting with 4.5.x and still present in 4.9-rc4; 4.4.x is
>> > fine). We narrowed this down to the creation of a network namespace
>> > being delayed directly after removing another one (details and
>> > reproduction below). We have seen delays of up to 60s on some systems.
>> >
>> > - The delay is proportional to the number of CPUs (online or offline).
>> > We first discovered it with a Hyper-V Linux VM. Hyper-V advertises up
>> > to 240 offline vCPUs even if one configures the VM with only, say, 2
>> > vCPUs. We see a linear increase in delay when we change NR_CPUS in the
>> > kernel config.
>> >
>> > - The delay is also dependent on some tunnel network interfaces being
>> > present (which we had compiled in, in one of our kernel configs).
>> >
>> > - We can reproduce this issue with stock kernels from
>> > http://kernel.ubuntu.com/~kernel-ppa/mainline/ running in Hyper-V VMs
>> > as well as other hypervisors like qemu and hyperkit where we have good
>> > control over the number of CPUs.
>> >
>> > A simple test is:
>> > modprobe ipip
>> > modprobe ip_gre
>> > modprobe ip_vti
>> > echo -n "add netns foo ===> "; /usr/bin/time -f "%E" ip netns add foo
>> > echo -n "del netns foo ===> "; /usr/bin/time -f "%E" ip netns delete foo
>> > echo -n "add netns bar ===> "; /usr/bin/time -f "%E" ip netns add bar
>> > echo -n "del netns bar ===> "; /usr/bin/time -f "%E" ip netns delete bar
>> >
>> > with an output like:
>> > add netns foo ===> 0:00.00
>> > del netns foo ===> 0:00.01
>> > add netns bar ===> 0:08.53
>> > del netns bar ===> 0:00.01
>> >
>> > This is on a 4.9-rc4 kernel from the above URL configured with
>> > NR_CPUS=256 running in a Hyper-V VM (kernel config attached).
>> >
>> > Below is a dump of the work queues while the second 'ip add netns' is
>> > hanging. The state of the work queues does not seem to change while
>> > the command is delayed and the pattern shown is consistent across
>> > different kernel versions.
>> >
>> > Is this a known issue and/or is someone working on a fix?
>>
>> Not to me.
>>
>> > [ 610.356272] sysrq: SysRq : Show Blocked State
>> > [ 610.356742]   task                        PC stack   pid father
>> > [ 610.357252] kworker/u480:1  D    0  1994      2 0x
>> > [ 610.357752] Workqueue: netns cleanup_net
>> > [ 610.358239]  9892f1065800 9892ee1e1e00 9892f8e59340
>> > [ 610.358705]  9892f4526900 bf0104b5ba88 be486df3 bf0104b5ba60
>> > [ 610.359168]  00ffbdcbe663 9892f8e59340 000100012e70 9892ee1e1e00
>> > [ 610.359677] Call Trace:
>> > [ 610.360169]  [] ? __schedule+0x233/0x6e0
>> > [ 610.360723]  [] schedule+0x36/0x80
>> > [ 610.361194]  [] schedule_timeout+0x22a/0x3f0
>> > [ 610.361789]  [] ? __schedule+0x23b/0x6e0
>> > [ 610.362260]  [] wait_for_completion+0xb4/0x140
>> > [ 610.362736]  [] ? wake_up_q+0x80/0x80
>> > [ 610.363306]  [] __wait_rcu_gp+0xc8/0xf0
>> > [ 610.363782]  [] synchronize_sched+0x5c/0x80
>> > [ 610.364137]  [] ? call_rcu_bh+0x20/0x20
>> > [ 610.364742]  [] ? trace_raw_output_rcu_utilization+0x60/0x60
>> > [ 610.365337]  [] synchronize_net+0x1c/0x30
>>
>> This is a worker which holds the net_mutex and is waiting for
>> an RCU grace period to elapse.
>>
>> > [ 610.365846]  [] netif_napi_del+0x23/0x80
>> > [ 610.367494]  [] ip_tunnel_dev_free+0x68/0xf0 [ip_tunnel]
>> > [ 610.368007]  [] netdev_run_todo+0x230/0x330
>> > [ 610.368454]  [] rtnl_unlock+0xe/0x10
>> > [ 610.369001]  [] ip_tunnel_delete_net+0xdf/0x120 [ip_tunnel]
>> > [ 610.369500]  [] ipip_exit_net+0x2c/0x30 [ipip]
>> > [ 610.369997]  [] ops_exit_list.isra.4+0x38/0x60
>> > [ 610.370636]  [] cleanup_net+0x1c4/0x2b0
>> > [ 610.371130]  [] process_one_work+0x1fc/0x4b0
>> > [ 610.371812]  [] worker_thread+0x4b/0x500
>> > [ 610.373074]  [] ? process_one_work+0x4b0/0x4b0
>> > [ 610.373622]  [] ? process_one_work+0x4b0/0x4b0
>> > [ 610.374100]  [] kthread+0xd9/0xf0
>> > [ 610.374574]  [] ? kthread_park+0x60/0x60
>> > [ 610.375198]  [] ret_from_fork+0x25/0x30
>> > [ 610.375678] ip              D    0  2149   2148 0x
>> > [ 610.376185]  9892f0a99000 9892f0a66900
>> > [ 610.376185]  9892f8e59340
>> > [ 610.376185]  9892f4526900 bf0101173db8 be486df3
>> > [ 610.376753]  0005fecffd76
>> > [ 610.376762]  00ff9892f11d9820 9892f8e59340 9892 9892f0a66900
>> > [ 610.377274] Call Trace:
>> > [ 610.377789]  [] ? __schedule+0x233/0x6e0
>> > [ 610.378306]  [] schedule+0x36/0x80
>> > [ 610.378992]  [] schedule_preempt_disabled+0xe/0x10
Re: Long delays creating a netns after deleting one (possibly RCU related)
On Thu, Nov 10, 2016 at 09:37:47AM -0800, Cong Wang wrote:
> (Cc'ing Paul)
>
> On Wed, Nov 9, 2016 at 7:42 AM, Rolf Neugebauer wrote:
> > Hi
> >
> > We noticed some long delays starting docker containers on some newer
> > kernels (starting with 4.5.x and still present in 4.9-rc4; 4.4.x is
> > fine). We narrowed this down to the creation of a network namespace
> > being delayed directly after removing another one (details and
> > reproduction below). We have seen delays of up to 60s on some systems.
> >
> > - The delay is proportional to the number of CPUs (online or offline).
> > We first discovered it with a Hyper-V Linux VM. Hyper-V advertises up
> > to 240 offline vCPUs even if one configures the VM with only, say, 2
> > vCPUs. We see a linear increase in delay when we change NR_CPUS in the
> > kernel config.
> >
> > - The delay is also dependent on some tunnel network interfaces being
> > present (which we had compiled in, in one of our kernel configs).
> >
> > - We can reproduce this issue with stock kernels from
> > http://kernel.ubuntu.com/~kernel-ppa/mainline/ running in Hyper-V VMs
> > as well as other hypervisors like qemu and hyperkit where we have good
> > control over the number of CPUs.
> >
> > A simple test is:
> > modprobe ipip
> > modprobe ip_gre
> > modprobe ip_vti
> > echo -n "add netns foo ===> "; /usr/bin/time -f "%E" ip netns add foo
> > echo -n "del netns foo ===> "; /usr/bin/time -f "%E" ip netns delete foo
> > echo -n "add netns bar ===> "; /usr/bin/time -f "%E" ip netns add bar
> > echo -n "del netns bar ===> "; /usr/bin/time -f "%E" ip netns delete bar
> >
> > with an output like:
> > add netns foo ===> 0:00.00
> > del netns foo ===> 0:00.01
> > add netns bar ===> 0:08.53
> > del netns bar ===> 0:00.01
> >
> > This is on a 4.9-rc4 kernel from the above URL configured with
> > NR_CPUS=256 running in a Hyper-V VM (kernel config attached).
> >
> > Below is a dump of the work queues while the second 'ip add netns' is
> > hanging. The state of the work queues does not seem to change while
> > the command is delayed and the pattern shown is consistent across
> > different kernel versions.
> >
> > Is this a known issue and/or is someone working on a fix?
>
> Not to me.
>
> > [ 610.356272] sysrq: SysRq : Show Blocked State
> > [ 610.356742]   task                        PC stack   pid father
> > [ 610.357252] kworker/u480:1  D    0  1994      2 0x
> > [ 610.357752] Workqueue: netns cleanup_net
> > [ 610.358239]  9892f1065800 9892ee1e1e00 9892f8e59340
> > [ 610.358705]  9892f4526900 bf0104b5ba88 be486df3 bf0104b5ba60
> > [ 610.359168]  00ffbdcbe663 9892f8e59340 000100012e70 9892ee1e1e00
> > [ 610.359677] Call Trace:
> > [ 610.360169]  [] ? __schedule+0x233/0x6e0
> > [ 610.360723]  [] schedule+0x36/0x80
> > [ 610.361194]  [] schedule_timeout+0x22a/0x3f0
> > [ 610.361789]  [] ? __schedule+0x23b/0x6e0
> > [ 610.362260]  [] wait_for_completion+0xb4/0x140
> > [ 610.362736]  [] ? wake_up_q+0x80/0x80
> > [ 610.363306]  [] __wait_rcu_gp+0xc8/0xf0
> > [ 610.363782]  [] synchronize_sched+0x5c/0x80
> > [ 610.364137]  [] ? call_rcu_bh+0x20/0x20
> > [ 610.364742]  [] ? trace_raw_output_rcu_utilization+0x60/0x60
> > [ 610.365337]  [] synchronize_net+0x1c/0x30
>
> This is a worker which holds the net_mutex and is waiting for
> an RCU grace period to elapse.
>
> > [ 610.365846]  [] netif_napi_del+0x23/0x80
> > [ 610.367494]  [] ip_tunnel_dev_free+0x68/0xf0 [ip_tunnel]
> > [ 610.368007]  [] netdev_run_todo+0x230/0x330
> > [ 610.368454]  [] rtnl_unlock+0xe/0x10
> > [ 610.369001]  [] ip_tunnel_delete_net+0xdf/0x120 [ip_tunnel]
> > [ 610.369500]  [] ipip_exit_net+0x2c/0x30 [ipip]
> > [ 610.369997]  [] ops_exit_list.isra.4+0x38/0x60
> > [ 610.370636]  [] cleanup_net+0x1c4/0x2b0
> > [ 610.371130]  [] process_one_work+0x1fc/0x4b0
> > [ 610.371812]  [] worker_thread+0x4b/0x500
> > [ 610.373074]  [] ? process_one_work+0x4b0/0x4b0
> > [ 610.373622]  [] ? process_one_work+0x4b0/0x4b0
> > [ 610.374100]  [] kthread+0xd9/0xf0
> > [ 610.374574]  [] ? kthread_park+0x60/0x60
> > [ 610.375198]  [] ret_from_fork+0x25/0x30
> > [ 610.375678] ip              D    0  2149   2148 0x
> > [ 610.376185]  9892f0a99000 9892f0a66900
> > [ 610.376185]  9892f8e59340
> > [ 610.376185]  9892f4526900 bf0101173db8 be486df3
> > [ 610.376753]  0005fecffd76
> > [ 610.376762]  00ff9892f11d9820 9892f8e59340 9892 9892f0a66900
> > [ 610.377274] Call Trace:
> > [ 610.377789]  [] ? __schedule+0x233/0x6e0
> > [ 610.378306]  [] schedule+0x36/0x80
> > [ 610.378992]  [] schedule_preempt_disabled+0xe/0x10
> > [ 610.379514]  [] __mutex_lock_slowpath+0xb9/0x130
> > [ 610.380031]  [] ? __kmalloc+0x162/0x1e0
> > [ 610.380556]  [] mutex_lock+0x1f/0x30
> > [ 610.381135]  []
Re: Long delays creating a netns after deleting one (possibly RCU related)
(Cc'ing Paul)

On Wed, Nov 9, 2016 at 7:42 AM, Rolf Neugebauer wrote:
> Hi
>
> We noticed some long delays starting docker containers on some newer
> kernels (starting with 4.5.x and still present in 4.9-rc4; 4.4.x is
> fine). We narrowed this down to the creation of a network namespace
> being delayed directly after removing another one (details and
> reproduction below). We have seen delays of up to 60s on some systems.
>
> - The delay is proportional to the number of CPUs (online or offline).
> We first discovered it with a Hyper-V Linux VM. Hyper-V advertises up
> to 240 offline vCPUs even if one configures the VM with only, say, 2
> vCPUs. We see a linear increase in delay when we change NR_CPUS in the
> kernel config.
>
> - The delay is also dependent on some tunnel network interfaces being
> present (which we had compiled in, in one of our kernel configs).
>
> - We can reproduce this issue with stock kernels from
> http://kernel.ubuntu.com/~kernel-ppa/mainline/ running in Hyper-V VMs
> as well as other hypervisors like qemu and hyperkit where we have good
> control over the number of CPUs.
>
> A simple test is:
> modprobe ipip
> modprobe ip_gre
> modprobe ip_vti
> echo -n "add netns foo ===> "; /usr/bin/time -f "%E" ip netns add foo
> echo -n "del netns foo ===> "; /usr/bin/time -f "%E" ip netns delete foo
> echo -n "add netns bar ===> "; /usr/bin/time -f "%E" ip netns add bar
> echo -n "del netns bar ===> "; /usr/bin/time -f "%E" ip netns delete bar
>
> with an output like:
> add netns foo ===> 0:00.00
> del netns foo ===> 0:00.01
> add netns bar ===> 0:08.53
> del netns bar ===> 0:00.01
>
> This is on a 4.9-rc4 kernel from the above URL configured with
> NR_CPUS=256 running in a Hyper-V VM (kernel config attached).
>
> Below is a dump of the work queues while the second 'ip add netns' is
> hanging. The state of the work queues does not seem to change while
> the command is delayed and the pattern shown is consistent across
> different kernel versions.
>
> Is this a known issue and/or is someone working on a fix?

Not to me.

> [ 610.356272] sysrq: SysRq : Show Blocked State
> [ 610.356742]   task                        PC stack   pid father
> [ 610.357252] kworker/u480:1  D    0  1994      2 0x
> [ 610.357752] Workqueue: netns cleanup_net
> [ 610.358239]  9892f1065800 9892ee1e1e00 9892f8e59340
> [ 610.358705]  9892f4526900 bf0104b5ba88 be486df3 bf0104b5ba60
> [ 610.359168]  00ffbdcbe663 9892f8e59340 000100012e70 9892ee1e1e00
> [ 610.359677] Call Trace:
> [ 610.360169]  [] ? __schedule+0x233/0x6e0
> [ 610.360723]  [] schedule+0x36/0x80
> [ 610.361194]  [] schedule_timeout+0x22a/0x3f0
> [ 610.361789]  [] ? __schedule+0x23b/0x6e0
> [ 610.362260]  [] wait_for_completion+0xb4/0x140
> [ 610.362736]  [] ? wake_up_q+0x80/0x80
> [ 610.363306]  [] __wait_rcu_gp+0xc8/0xf0
> [ 610.363782]  [] synchronize_sched+0x5c/0x80
> [ 610.364137]  [] ? call_rcu_bh+0x20/0x20
> [ 610.364742]  [] ? trace_raw_output_rcu_utilization+0x60/0x60
> [ 610.365337]  [] synchronize_net+0x1c/0x30

This is a worker which holds the net_mutex and is waiting for
an RCU grace period to elapse.

> [ 610.365846]  [] netif_napi_del+0x23/0x80
> [ 610.367494]  [] ip_tunnel_dev_free+0x68/0xf0 [ip_tunnel]
> [ 610.368007]  [] netdev_run_todo+0x230/0x330
> [ 610.368454]  [] rtnl_unlock+0xe/0x10
> [ 610.369001]  [] ip_tunnel_delete_net+0xdf/0x120 [ip_tunnel]
> [ 610.369500]  [] ipip_exit_net+0x2c/0x30 [ipip]
> [ 610.369997]  [] ops_exit_list.isra.4+0x38/0x60
> [ 610.370636]  [] cleanup_net+0x1c4/0x2b0
> [ 610.371130]  [] process_one_work+0x1fc/0x4b0
> [ 610.371812]  [] worker_thread+0x4b/0x500
> [ 610.373074]  [] ? process_one_work+0x4b0/0x4b0
> [ 610.373622]  [] ? process_one_work+0x4b0/0x4b0
> [ 610.374100]  [] kthread+0xd9/0xf0
> [ 610.374574]  [] ? kthread_park+0x60/0x60
> [ 610.375198]  [] ret_from_fork+0x25/0x30
> [ 610.375678] ip              D    0  2149   2148 0x
> [ 610.376185]  9892f0a99000 9892f0a66900
> [ 610.376185]  9892f8e59340
> [ 610.376185]  9892f4526900 bf0101173db8 be486df3
> [ 610.376753]  0005fecffd76
> [ 610.376762]  00ff9892f11d9820 9892f8e59340 9892 9892f0a66900
> [ 610.377274] Call Trace:
> [ 610.377789]  [] ? __schedule+0x233/0x6e0
> [ 610.378306]  [] schedule+0x36/0x80
> [ 610.378992]  [] schedule_preempt_disabled+0xe/0x10
> [ 610.379514]  [] __mutex_lock_slowpath+0xb9/0x130
> [ 610.380031]  [] ? __kmalloc+0x162/0x1e0
> [ 610.380556]  [] mutex_lock+0x1f/0x30
> [ 610.381135]  [] copy_net_ns+0x9f/0x170
> [ 610.381647]  [] create_new_namespaces+0x11b/0x200
> [ 610.382249]  [] unshare_nsproxy_namespaces+0x5a/0xb0
> [ 610.382818]  [] SyS_unshare+0x1cd/0x360
> [ 610.383319]  [] entry_SYSCALL_64_fastpath+0x1e/0xad

This process is apparently waiting for the net_mutex held by the previous