It looks like this: Message from syslogd@flamingo at Apr 26 00:45:00 ... kernel:unregister_netdevice: waiting for lo to become free. Usage count = 4
They seem to coincide with net namespace teardown. The message is emitted by netdev_wait_allrefs(). Forced a kdump in netdev_run_todo, but found that the refcount on the lo device was already 0 at the time we got to the panic. Used bcc to check the blocking in netdev_run_todo. The only places where we're off cpu there are in the rcu_barrier() and msleep() calls. That behavior is expected. The msleep time coincides with the amount of time we spend waiting for the refcount to reach zero; the rcu_barrier() wait times are not excessive. After looking through the list of callbacks that the netdevice notifiers invoke in this path, it appears that the dst_dev_event is the most interesting. The dst_ifdown path places a hold on the loopback_dev as part of releasing the dev associated with the original dst cache entry. Most of our notifier callbacks are straight-forward, but this one a) looks complex, and b) places a hold on the network interface in question. I constructed a new bcc script that watches various events in the liftime of a dst cache entry. Note that dst_ifdown will take a hold on the loopback device until the invalidated dst entry gets freed. [ __dst_free] on DST: ffff883ccabb7900 IF tap1008300eth0 invoked at 1282115677036183 __dst_free rcu_nocb_kthread kthread ret_from_fork --- [ dst_gc_task] completed in 4: start: 1282115783057516 end 1282115783062462 [ dst_gc_task] completed in 5: start: 1282116447054101 end 1282116447059392 [ dst_gc_task] completed in 5: start: 1282118111030391 end 1282118111036041 [ dst_gc_task] completed in 6: start: 1282121247074938 end 1282121247081141 [ dst_gc_task] completed in 5: start: 1282126815053252 end 1282126815058751 [ dst_gc_task] completed in 6: start: 1282135007055673 end 1282135007061877 [ dst_gc_task] completed in 6: start: 1282145759021745 end 1282145759027830 [ dst_release] on DST: ffff883ccabb7900 IF tap1008300eth0 invoked at 1282147838674257 dst_release ip_queue_xmit tcp_transmit_skb tcp_write_xmit __tcp_push_pending_frames tcp_send_fin tcp_close inet_release sock_release sock_close __fput ____fput task_work_run do_exit do_group_exit SyS_exit_group do_syscall_64 return_from_SYSCALL_64 [ dst_ifdown] on DST: ffff883ccabb7900 IF tap1008300eth0 invoked at 1282148067377680 dst_ifdown notifier_call_chain raw_notifier_call_chain call_netdevice_notifiers_info netdev_run_todo kretprobe_trampoline default_device_exit_batch ops_exit_list.isra.4 cleanup_net process_one_work worker_thread kthread ret_from_fork --- [ dst_ifdown] on DST: ffff883ccabb7900 IF lo invoked at 1282148067411269 dst_ifdown notifier_call_chain raw_notifier_call_chain call_netdevice_notifiers_info netdev_run_todo kretprobe_trampoline default_device_exit_batch ops_exit_list.isra.4 cleanup_net process_one_work worker_thread kthread ret_from_fork --- <...> [ dst_destroy] on DST: ffff883ccabb7900 IF lo invoked at 1282160607074022 dst_destroy kretprobe_trampoline worker_thread kthread ret_from_fork --- [ dst_gc_task] completed in 30: start: 1282160607047161 end 1282160607077572 The way this works is that if there's still a reference on the dst entry at the time we try to free it, it gets placed in the gc list by __dst_free and the dst_destroy() call is invoked by the gc task once the refcount is 0. If the gc task processes a 10th or less of its entries on a single pass, it inreases the amount of time it waits between gc intervals. Looking at the gc_task intervals, they started at 663ms when we invoked __dst_free(). After that, they increased to 1663, 3136, 5567, 8191, 10751, and 14848. The release that set the refcnt to 0 on our dst entry occurred after the gc_task was enqueued for 14 second interval so we had to wait longer than the warning time in wait_allrefs in order for the dst entry to get free'd and the hold on 'lo' to be released. A simple solution to this problem is to have dst_dev_event() reset the gc timer, which causes us to process this list shortly after the gc_mutex is relased when dst_dev_event() completes. Signed-off-by: Krister Johansen <k...@templeofstupid.com> --- net/core/dst.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/net/core/dst.c b/net/core/dst.c index 6192f11..13ba4a0 100644 --- a/net/core/dst.c +++ b/net/core/dst.c @@ -469,6 +469,20 @@ static int dst_dev_event(struct notifier_block *this, unsigned long event, spin_lock_bh(&dst_garbage.lock); dst = dst_garbage.list; dst_garbage.list = NULL; + /* The code in dst_ifdown places a hold on the loopback device. + * If the gc entry processing is set to expire after a lengthy + * interval, this hold can cause netdev_wait_allrefs() to hang + * out and wait for a long time -- until the the loopback + * interface is released. If we're really unlucky, it'll emit + * pr_emerg messages to console too. Reset the interval here, + * so dst cleanups occur in a more timely fashion. + */ + if (dst_garbage.timer_inc > DST_GC_INC) { + dst_garbage.timer_inc = DST_GC_INC; + dst_garbage.timer_expires = DST_GC_MIN; + mod_delayed_work(system_wq, &dst_gc_work, + dst_garbage.timer_expires); + } spin_unlock_bh(&dst_garbage.lock); if (last) -- 2.7.4