I've sent an inquiry to the upstream maintainers for assistance. I've also taken a stab at a fix, which I think should prevent the hang, though I'm not sure whether it might cause other problems. The patch and test build are here:
http://people.canonical.com/~sforshee/lp1598285/

I currently have a setup running to try to reproduce the bug; once I've confirmed a reproduction, I can test my fix.

** Changed in: linux (Ubuntu)
       Status: Incomplete => In Progress

** Changed in: linux (Ubuntu)
     Assignee: (unassigned) => Seth Forshee (sforshee)

https://bugs.launchpad.net/bugs/1598285

Title:
  possible deadlock while using the cgroup freezer on a container with
  NFS-based workload

Status in linux package in Ubuntu:
  In Progress

Bug description:
  Hi guys,

  For background: I'm running a container with an NFS filesystem bind
  mounted into it. The workload I'm running is iozone, a filesystem
  benchmarking tool. While the workload is running, I attempt to freeze
  the container, which gets stuck in the FREEZING state. After a while,
  I get:

  Jul  1 01:45:14 juju-19f8e3-15 kernel: [206520.104156] INFO: task iozone:20035 blocked for more than 120 seconds.
  Jul  1 01:45:14 juju-19f8e3-15 kernel: [206520.111056]       Tainted: P           O    4.4.0-24-generic #43-Ubuntu
  Jul  1 01:45:14 juju-19f8e3-15 kernel: [206520.118053] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  Jul  1 01:45:14 juju-19f8e3-15 kernel: [206520.126110] iozone          D ffff880015673e18     0 20035  20005 0x00000104
  Jul  1 01:45:14 juju-19f8e3-15 kernel: [206520.126116]  ffff880015673e18 ffff880000000010 ffff880045a21b80 ffff880037776e00
  Jul  1 01:45:14 juju-19f8e3-15 kernel: [206520.126118]  ffff880015674000 ffff8800179d6e54 ffff880037776e00 00000000ffffffff
  Jul  1 01:45:14 juju-19f8e3-15 kernel: [206520.126120]  ffff8800179d6e58 ffff880015673e30 ffffffff81821b15 ffff8800179d6e50
  Jul  1 01:45:14 juju-19f8e3-15 kernel: [206520.126121] Call Trace:
  Jul  1 01:45:14 juju-19f8e3-15 kernel: [206520.126129]  [<ffffffff81821b15>] schedule+0x35/0x80
  Jul  1 01:45:14 juju-19f8e3-15 kernel: [206520.126131]  [<ffffffff81821dbe>] schedule_preempt_disabled+0xe/0x10
  Jul  1 01:45:14 juju-19f8e3-15 kernel: [206520.126134]  [<ffffffff818239f9>] __mutex_lock_slowpath+0xb9/0x130
  Jul  1 01:45:14 juju-19f8e3-15 kernel: [206520.126136]  [<ffffffff81823a8f>] mutex_lock+0x1f/0x30
  Jul  1 01:45:14 juju-19f8e3-15 kernel: [206520.126139]  [<ffffffff8121d00b>] do_unlinkat+0x12b/0x2d0
  Jul  1 01:45:14 juju-19f8e3-15 kernel: [206520.126142]  [<ffffffff8121dc16>] SyS_unlink+0x16/0x20
  Jul  1 01:45:14 juju-19f8e3-15 kernel: [206520.126146]  [<ffffffff81825bf2>] entry_SYSCALL_64_fastpath+0x16/0x71

  It looks like the task is actually stuck in generic fs code, not in
  anything NFS-specific, but perhaps that's a relevant detail. Anyway:

  ubuntu@juju-19f8e3-15:~$ sudo cat /proc/20035/stack
  [<ffffffff8121d00b>] do_unlinkat+0x12b/0x2d0
  [<ffffffff8121dc16>] SyS_unlink+0x16/0x20
  [<ffffffff81825bf2>] entry_SYSCALL_64_fastpath+0x16/0x71
  [<ffffffffffffffff>] 0xffffffffffffffff

  The container and host are both xenial:

  ubuntu@juju-19f8e3-15:~$ uname -a
  Linux juju-19f8e3-15 4.4.0-24-generic #43-Ubuntu SMP Wed Jun 8 19:27:37 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

  Finally, I don't have a good reproducer for this. It's pretty rare: I'm
  running this benchmark in a loop, and over thousands of runs I've seen
  this exactly once. I'll leave these hosts up for a bit in case there
  are any other interesting bits of info to collect.
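  In case it's useful, the shape of the scenario expressed as a minimal
  C sketch might look like the following. To be clear, this is an
  illustration, not a confirmed reproducer; the freezer cgroup path
  (/sys/fs/cgroup/freezer/lxc/test) and the NFS mount point (/mnt/nfs)
  are assumptions and would need adjusting for a real container setup.

  /* Sketch: churn create/unlink on an NFS mount from a task in a
   * freezer cgroup, then freeze the cgroup and watch whether
   * freezer.state ever leaves FREEZING (cgroup v1 interface). */
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/types.h>
  #include <unistd.h>

  #define FREEZER_DIR "/sys/fs/cgroup/freezer/lxc/test" /* assumed path */
  #define NFS_DIR     "/mnt/nfs"                        /* assumed mount */

  static void write_file(const char *path, const char *val)
  {
      FILE *f = fopen(path, "w");
      if (!f) { perror(path); exit(1); }
      fputs(val, f);
      fclose(f);
  }

  /* Child: create/write/unlink in a loop, like iozone's file cleanup. */
  static void churn(void)
  {
      char path[256];
      snprintf(path, sizeof(path), "%s/scratch.%d", NFS_DIR, getpid());
      for (;;) {
          int fd = open(path, O_CREAT | O_WRONLY, 0644);
          if (fd >= 0) {
              write(fd, "x", 1);
              close(fd);
          }
          unlink(path);
      }
  }

  int main(void)
  {
      char buf[64], state[32];
      pid_t pid = fork();

      if (pid == 0)
          churn();

      /* Move the child into the freezer cgroup (cgroup v1 "tasks"). */
      snprintf(buf, sizeof(buf), "%s/tasks", FREEZER_DIR);
      snprintf(state, sizeof(state), "%d\n", (int)pid);
      write_file(buf, state);

      sleep(1);
      snprintf(buf, sizeof(buf), "%s/freezer.state", FREEZER_DIR);
      write_file(buf, "FROZEN\n");

      /* A healthy freeze reaches FROZEN quickly; sitting in FREEZING
       * indefinitely is the hang described above. */
      for (;;) {
          FILE *f = fopen(buf, "r");
          if (!f || !fgets(state, sizeof(state), f))
              exit(1);
          fclose(f);
          printf("freezer.state: %s", state);
          if (strncmp(state, "FROZEN", 6) == 0)
              break;
          sleep(1);
      }
      write_file(buf, "THAWED\n");
      return 0;
  }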