My analysis so far of the problem:

1. container A has an outstanding TCP connection (thus, a socket and dst
which hold a reference on the container's "lo" interface). When the
container is stopped, the TCP connection takes ~2 minutes to time out
(with default settings).

2. when container A is being removed, its net namespace is removed via
net/core/net_namespace.c function cleanup_net(). This takes two locks -
first, the net_mutex, and then the rtnl mutex (i.e. rtnl_lock). It then
cleans up the net ns and calls rtnl_unlock(). However, rtnl_unlock()
internally waits for all the namespace's interfaces to be freed, which
requires all their references to reach 0. This must wait until the above
TCP connection times out, before it releases its reference. So, at this
point the container is hung inside rtnl_unlock() waiting, and it still
holds the net_mutex. It doesn't release the net_mutex until its lo
interface finally is destroyed after the TCP connection times out.

3. When a new container is started, part of its startup is to call
net/core/net_namespace.c function copy_net_ns() to copy the caller's net
namespace to the new container net namespace. However, this function
locks the net_mutex. Since the previous container still is holding the
net_mutex as explained in #2 above, this new container creation blocks
until #2 releases the mutex.
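
The chain in steps 1-3 can be modelled in userspace. The following is a
minimal sketch, not kernel code: Python threading locks stand in for
net_mutex and the rtnl mutex, an Event stands in for the last reference
on "lo" being dropped, and the function names only mirror the kernel
functions named above. The ~2 minute TCP timeout is shortened to a
fraction of a second (the tcp_timeout parameter is an assumption made
for the demo).

```python
import threading
import time


def simulate_hang(tcp_timeout=0.3):
    """Return the seconds a stand-in for copy_net_ns() waited on net_mutex."""
    net_mutex = threading.Lock()    # stand-in for the kernel's net_mutex
    rtnl_mutex = threading.Lock()   # stand-in for the rtnl mutex
    lo_free = threading.Event()     # set once the last reference on "lo" is gone
    in_unlock = threading.Event()   # set once cleanup is inside rtnl_unlock()

    def rtnl_unlock():
        # Model of the behaviour described in step 2: the lock itself is
        # dropped, but the call does not return until "lo" is free.
        rtnl_mutex.release()
        in_unlock.set()
        lo_free.wait()

    def cleanup_net():
        with net_mutex:             # held for the whole cleanup
            rtnl_mutex.acquire()
            # ... namespace teardown would happen here ...
            rtnl_unlock()

    def tcp_timeout_fires():
        time.sleep(tcp_timeout)
        lo_free.set()               # connection gone, reference released

    cleaner = threading.Thread(target=cleanup_net)
    cleaner.start()
    in_unlock.wait()                # old container is now stuck in step 2
    timer = threading.Thread(target=tcp_timeout_fires)
    timer.start()

    start = time.monotonic()
    with net_mutex:                 # step 3: copy_net_ns() for a new container
        blocked = time.monotonic() - start

    cleaner.join()
    timer.join()
    return blocked


if __name__ == "__main__":
    print(f"new container blocked for {simulate_hang():.2f}s")
```

In this model the new container is blocked for roughly the whole
tcp_timeout, which is the behaviour described in #3.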


There are a few ways to possibly address this: 

a) when a container is being removed, all its TCP connections should
abort themselves. Currently, TCP connections don't directly register for
interface unregister events - they explicitly want to stick around, so if
an interface is taken down and then brought back up, the TCP connection
remains, and the communication riding on top of the connection isn't
interrupted. However, the way TCP does this is to move all its dst
references off the interface that is unregistering, to the loopback
interface. This works for the initial network namespace, where the
loopback interface is always present and never removed. However, this
doesn't work for non-default network namespaces - like containers -
where the loopback interface is unregistered when the container is being
removed. So this aspect of TCP may need to change, to correctly handle
containers.  This also may not cover all causes of this hang, since
sockets handle more than just TCP connections.

b) when a container is being removed, instead of holding the net_mutex
while cleaning up (and calling rtnl_unlock), it could release the
net_mutex first (after removing all netns marked for cleanup from the
pernet list), then call rtnl_unlock. This needs examination to make sure
it would not introduce any races or other problems.

c) rtnl_unlock could be simplified - currently, it includes significant
side-effects, which include the long delay while waiting to actually
remove all references to the namespace's interfaces (including
loopback). Instead of blocking rtnl_unlock() to do all this cleanup, the
cleanup could be deferred. This also would need investigation to make
sure no caller is expecting to be able to free resources that may be
accessed later from the cleanup action (which I believe is the case).
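
To illustrate why option (b) would unblock new containers, here is a
minimal userspace sketch, again not kernel code. It assumes the same
stand-ins as the description above (threading locks for net_mutex and
the rtnl mutex, an Event for the last "lo" reference going away); the
release_early flag and the function name are hypothetical. It only
shows the lock ordering and says nothing about the races that would
still need examination.

```python
import threading
import time


def simulate_option_b(release_early, tcp_timeout=0.3):
    """Return how long a new container waited for net_mutex, in seconds."""
    net_mutex = threading.Lock()    # stand-in for the kernel's net_mutex
    rtnl_mutex = threading.Lock()   # stand-in for the rtnl mutex
    lo_free = threading.Event()     # set once the last reference on "lo" is gone
    in_unlock = threading.Event()   # set once cleanup is inside rtnl_unlock()

    def rtnl_unlock():
        # Modelled as described above: drops the lock, then waits for "lo".
        rtnl_mutex.release()
        in_unlock.set()
        lo_free.wait()

    def cleanup_net():
        net_mutex.acquire()
        rtnl_mutex.acquire()
        # ... remove the dying namespaces from the pernet list here ...
        if release_early:
            net_mutex.release()     # option (b): drop net_mutex first
            rtnl_unlock()
        else:
            rtnl_unlock()           # current behaviour: wait with it held
            net_mutex.release()

    def tcp_timeout_fires():
        time.sleep(tcp_timeout)
        lo_free.set()

    cleaner = threading.Thread(target=cleanup_net)
    cleaner.start()
    in_unlock.wait()                # cleanup is now waiting for "lo"
    timer = threading.Thread(target=tcp_timeout_fires)
    timer.start()

    start = time.monotonic()
    with net_mutex:                 # copy_net_ns() for the new container
        blocked = time.monotonic() - start

    cleaner.join()
    timer.join()
    return blocked


if __name__ == "__main__":
    print(f"net_mutex held during wait: {simulate_option_b(False):.2f}s")
    print(f"net_mutex released early:   {simulate_option_b(True):.2f}s")
```

With release_early=False the new container waits for about the full
timeout; with release_early=True it gets the mutex almost immediately
while the old namespace is still waiting for its references to drop.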

As this is a complex problem there are likely other options to fix it as
well.  This issue also has been around ever since namespaces were
introduced to the kernel, as far as I can tell, but it's not a commonly
seen issue because in most cases socket connections are shut down before
stopping the container.  The socket connection causing the problem here
is a kernel socket, which is different than normal userspace-created
sockets; that may make a difference though I haven't investigated that
angle yet.

https://bugs.launchpad.net/bugs/1711407

Title:
  unregister_netdevice: waiting for lo to become free

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Trusty:
  New
Status in linux source package in Xenial:
  New
Status in linux source package in Zesty:
  New
Status in linux source package in Artful:
  Confirmed
Status in linux source package in Bionic:
  New

Bug description:
  This is a "continuation" of bug 1403152, as that bug has been marked
  "fix released" and recent reports of failure may (or may not) be a new
  bug.  Any further reports of the problem should please be reported
  here instead of that bug.

  --

  [Impact]

  When shutting down and starting containers the container network
  namespace may experience a dst reference counting leak which results
  in this message repeated in the logs:

      unregister_netdevice: waiting for lo to become free. Usage count = 1

  This can cause issues when trying to create a new network namespace
  and thus block a user from creating new containers.
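
  A minimal sketch for spotting this message in captured kernel log
  text. It assumes the message format is exactly as quoted in this
  section; the function name is hypothetical, and how the log text is
  obtained is left to the caller.

```python
import re

# Matches the message quoted in the bug description, for any device name.
PATTERN = re.compile(
    r"unregister_netdevice: waiting for (?P<dev>\S+) to become free\. "
    r"Usage count = (?P<count>-?\d+)"
)


def find_stuck_devices(log_text):
    """Return (device, usage_count) pairs for every matching log line."""
    return [
        (m.group("dev"), int(m.group("count")))
        for m in PATTERN.finditer(log_text)
    ]


if __name__ == "__main__":
    sample = (
        "unregister_netdevice: waiting for lo to become free. "
        "Usage count = 1\n"
    )
    print(find_stuck_devices(sample))
```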

  [Test Case]

  See comment 16, reproducer provided at
  https://github.com/fho/docker-samba-loop
