Bug#754354: Something holds dentry-related mutex forever in Wheezy amd64 kernel
On Sunday 28 September 2014 at 15:01:21 +0100, Ben Hutchings wrote: > On Sat, 2014-09-27 at 19:41 +0100, Mike Crowe wrote: > > I compiled my own version of the Debian 3.2.60-1+deb7u3 kernel with > > CONFIG_LOCKDEP and panic on hung task enabled. > > > > >From the crash dump: > > > > [25202.156175] INFO: task nfsd:3247 blocked for more than 900 seconds. > > [25202.162565] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables > > this message. > > [25202.170432] nfsdD 88080aa0eca8 0 3247 2 > > 0x > > [25202.170444] 88080a8e19f0 0046 0006 > > 8808 > > [25202.170458] 88080aa0e9c0 88080a8e1fd8 88080a8e1fd8 > > 001d4040 > > [25202.170472] 88040e9926c0 88080aa0e9c0 8138d6da > > 0001a04c47dd > > [25202.170488] Call Trace: > > [25202.170504] [] ? __mutex_lock_common+0x236/0x379 > > [25202.170531] [] ? fh_lock_nested+0x4d/0x61 [nfsd] > > [25202.170542] [] schedule+0x55/0x57 > > [25202.170552] [] __mutex_lock_common+0x243/0x379 > > [25202.170569] [] ? fh_lock_nested+0x4d/0x61 [nfsd] > > [25202.170581] [] mutex_lock_nested+0x2a/0x31 > > [25202.170598] [] fh_lock_nested+0x4d/0x61 [nfsd] > > [25202.170610] [] ? sched_clock+0x9/0xd > > [25202.170626] [] nfsd_lookup_dentry+0x196/0x227 [nfsd] > > [25202.170646] [] nfsd4_secinfo.part.15+0x26/0x9e [nfsd] > > [25202.170666] [] nfsd4_secinfo+0x4d/0x5b [nfsd] > > [25202.170688] [] nfsd4_proc_compound+0x265/0x43e [nfsd] > > [25202.170703] [] nfsd_dispatch+0xe2/0x1c8 [nfsd] > > [25202.170734] [] svc_process_common+0x2cf/0x4d0 [sunrpc] > > [25202.170759] [] svc_process+0x118/0x136 [sunrpc] > > [25202.170773] [] nfsd+0xeb/0x131 [nfsd] > > [25202.170796] [] ? 0xa04c0fff > > [25202.170806] [] kthread+0xa3/0xab > > [25202.170815] [] kernel_thread_helper+0x4/0x10 > > [25202.170823] [] ? retint_restore_args+0x13/0x13 > > [25202.170830] [] ? __init_kthread_worker+0x53/0x53 > > [25202.170837] [] ? gs_change+0x13/0x13 > > [25202.170842] 1 lock held by nfsd/3247: > > [25202.170845] #0: (&sb->s_type->i_mutex_key#13){+.+.+.}, at: > > [] fh_lock_nested+0x4d/0x61 [nfsd] > > [25202.170870] Kernel panic - not syncing: hung_task: blocked tasks [snip] > nfsd is trying to lock two objects in the same class: specifically, it > locks a file handle and then the file handle for the file's parent. > It's generally safe to do this so long as they're always taken in that > order. lockdep should complain (much more verbosely) if this is not > done consistently. That makes sense. So is there any clue as to why it's blocking inside the second mutex_lock_nested? > I'm afraid this doesn't explain what's going wrong. But if there are > any more messages from lockdep further up the log (like, 15 minutes > earlier), they might do. Unfortunately not, the previous line in the log is the last message from boot time: [ 38.624072] vnet0: no IPv6 routers present Is there a way I can persuade crash(8) to tell me which process currently has the lock in question? Do you have any advice as to any more debug stuff I should try turning on when compiling the kernel? Thanks for your help. Mike. signature.asc Description: Digital signature
Bug#754354: Something holds dentry-related mutex forever in Wheezy amd64 kernel
On Sat, 2014-09-27 at 19:41 +0100, Mike Crowe wrote: > I compiled my own version of the Debian 3.2.60-1+deb7u3 kernel with > CONFIG_LOCKDEP and panic on hung task enabled. > > >From the crash dump: > > [25202.156175] INFO: task nfsd:3247 blocked for more than 900 seconds. > [25202.162565] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables > this message. > [25202.170432] nfsdD 88080aa0eca8 0 3247 2 > 0x > [25202.170444] 88080a8e19f0 0046 0006 > 8808 > [25202.170458] 88080aa0e9c0 88080a8e1fd8 88080a8e1fd8 > 001d4040 > [25202.170472] 88040e9926c0 88080aa0e9c0 8138d6da > 0001a04c47dd > [25202.170488] Call Trace: > [25202.170504] [] ? __mutex_lock_common+0x236/0x379 > [25202.170531] [] ? fh_lock_nested+0x4d/0x61 [nfsd] > [25202.170542] [] schedule+0x55/0x57 > [25202.170552] [] __mutex_lock_common+0x243/0x379 > [25202.170569] [] ? fh_lock_nested+0x4d/0x61 [nfsd] > [25202.170581] [] mutex_lock_nested+0x2a/0x31 > [25202.170598] [] fh_lock_nested+0x4d/0x61 [nfsd] > [25202.170610] [] ? sched_clock+0x9/0xd > [25202.170626] [] nfsd_lookup_dentry+0x196/0x227 [nfsd] > [25202.170646] [] nfsd4_secinfo.part.15+0x26/0x9e [nfsd] > [25202.170666] [] nfsd4_secinfo+0x4d/0x5b [nfsd] > [25202.170688] [] nfsd4_proc_compound+0x265/0x43e [nfsd] > [25202.170703] [] nfsd_dispatch+0xe2/0x1c8 [nfsd] > [25202.170734] [] svc_process_common+0x2cf/0x4d0 [sunrpc] > [25202.170759] [] svc_process+0x118/0x136 [sunrpc] > [25202.170773] [] nfsd+0xeb/0x131 [nfsd] > [25202.170796] [] ? 0xa04c0fff > [25202.170806] [] kthread+0xa3/0xab > [25202.170815] [] kernel_thread_helper+0x4/0x10 > [25202.170823] [] ? retint_restore_args+0x13/0x13 > [25202.170830] [] ? __init_kthread_worker+0x53/0x53 > [25202.170837] [] ? gs_change+0x13/0x13 > [25202.170842] 1 lock held by nfsd/3247: > [25202.170845] #0: (&sb->s_type->i_mutex_key#13){+.+.+.}, at: > [] fh_lock_nested+0x4d/0x61 [nfsd] > [25202.170870] Kernel panic - not syncing: hung_task: blocked tasks > > I'm no expert at interpreting lockdep output but I think this is saying > that nfsd is taking a nested lock and then deadlocks trying to take it > again (which presumably shouldn't happen because it is nested.) nfsd is trying to lock two objects in the same class: specifically, it locks a file handle and then the file handle for the file's parent. It's generally safe to do this so long as they're always taken in that order. lockdep should complain (much more verbosely) if this is not done consistently. I'm afraid this doesn't explain what's going wrong. But if there are any more messages from lockdep further up the log (like, 15 minutes earlier), they might do. Ben. -- Ben Hutchings This sentence contradicts itself - no actually it doesn't. signature.asc Description: This is a digitally signed message part
Bug#754354: Something holds dentry-related mutex forever in Wheezy amd64 kernel
I compiled my own version of the Debian 3.2.60-1+deb7u3 kernel with CONFIG_LOCKDEP and panic on hung task enabled. >From the crash dump: [25202.156175] INFO: task nfsd:3247 blocked for more than 900 seconds. [25202.162565] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [25202.170432] nfsdD 88080aa0eca8 0 3247 2 0x [25202.170444] 88080a8e19f0 0046 0006 8808 [25202.170458] 88080aa0e9c0 88080a8e1fd8 88080a8e1fd8 001d4040 [25202.170472] 88040e9926c0 88080aa0e9c0 8138d6da 0001a04c47dd [25202.170488] Call Trace: [25202.170504] [] ? __mutex_lock_common+0x236/0x379 [25202.170531] [] ? fh_lock_nested+0x4d/0x61 [nfsd] [25202.170542] [] schedule+0x55/0x57 [25202.170552] [] __mutex_lock_common+0x243/0x379 [25202.170569] [] ? fh_lock_nested+0x4d/0x61 [nfsd] [25202.170581] [] mutex_lock_nested+0x2a/0x31 [25202.170598] [] fh_lock_nested+0x4d/0x61 [nfsd] [25202.170610] [] ? sched_clock+0x9/0xd [25202.170626] [] nfsd_lookup_dentry+0x196/0x227 [nfsd] [25202.170646] [] nfsd4_secinfo.part.15+0x26/0x9e [nfsd] [25202.170666] [] nfsd4_secinfo+0x4d/0x5b [nfsd] [25202.170688] [] nfsd4_proc_compound+0x265/0x43e [nfsd] [25202.170703] [] nfsd_dispatch+0xe2/0x1c8 [nfsd] [25202.170734] [] svc_process_common+0x2cf/0x4d0 [sunrpc] [25202.170759] [] svc_process+0x118/0x136 [sunrpc] [25202.170773] [] nfsd+0xeb/0x131 [nfsd] [25202.170796] [] ? 0xa04c0fff [25202.170806] [] kthread+0xa3/0xab [25202.170815] [] kernel_thread_helper+0x4/0x10 [25202.170823] [] ? retint_restore_args+0x13/0x13 [25202.170830] [] ? __init_kthread_worker+0x53/0x53 [25202.170837] [] ? gs_change+0x13/0x13 [25202.170842] 1 lock held by nfsd/3247: [25202.170845] #0: (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [] fh_lock_nested+0x4d/0x61 [nfsd] [25202.170870] Kernel panic - not syncing: hung_task: blocked tasks I'm no expert at interpreting lockdep output but I think this is saying that nfsd is taking a nested lock and then deadlocks trying to take it again (which presumably shouldn't happen because it is nested.) Mike. -- To UNSUBSCRIBE, email to debian-bugs-rc-requ...@lists.debian.org with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Bug#754354: Something holds dentry-related mutex forever in Wheezy amd64 kernel
On Monday 21 July 2014 at 15:01:31 +0100, Mike Crowe wrote: > It is possible that I'm seeing the same problem. Our AMD Opteron 4386 (16 > cores) machine is also getting stuck with lots of hung tasks. [snip] > PID: 4087 TASK: 88040ea63840 CPU: 2 COMMAND: "nfsd" > #0 [8804034b9c00] __schedule at 8134f195 > #1 [8804034b9c88] __mutex_lock_common.isra.5 at 8134fb74 > #2 [8804034b9cf8] mutex_lock at 8134fa62 > #3 [8804034b9d18] fh_lock_nested.isra.6 at a043d63c [nfsd] > #4 [8804034b9d28] nfsd_lookup_dentry at a043df1a [nfsd] > #5 [8804034b9d98] nfsd4_secinfo.part.15 at a0447692 [nfsd] > #6 [8804034b9dc8] nfsd4_proc_compound at a04468d6 [nfsd] > #7 [8804034b9e18] nfsd_dispatch at a043a7cd [nfsd] > #8 [8804034b9e48] svc_process_common at a0336c3f [sunrpc] > #9 [8804034b9eb8] svc_process at a0337050 [sunrpc] > #10 [8804034b9ed8] nfsd at a043a0e3 [nfsd] > #11 [8804034b9ef8] kthread at 8105f701 > #12 [8804034b9f48] kernel_thread_helper at 813576f4 > ii linux-image-amd643.2+46 > ii nfs-kernel-server1:1.2.6-4 That version information wasn't very useful, was it. :( I believe that this crash was from linux-image-3.2.0-4-amd64 3.2.60-1+deb7u1 I've just had the same failure happen with 3.2.60-1+deb7u3. Mike. -- To UNSUBSCRIBE, email to debian-bugs-rc-requ...@lists.debian.org with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Bug#754354: Something holds dentry-related mutex forever in Wheezy amd64 kernel
It is possible that I'm seeing the same problem. Our AMD Opteron 4386 (16 cores) machine is also getting stuck with lots of hung tasks. Although it responds to ping, and even a KVM virtual machine running on it appears to continue working correctly, the host itself is locked up. This happens once a week - probably when the machine is under the most direct CPU load and NFS load. Once the machine is in this state I can type in a username at the login prompt but no password prompt ever appears. I forced a crashdump and it contained hundreds of tasks with backtraces involving a mutex_lock in walk_component or nfsd_lookup_dentry which look similar to Alexander's: PID: 499TASK: 880490a29080 CPU: 11 COMMAND: "nrpe" #0 [880454e099a8] __schedule at 8134f195 #1 [880454e09a30] __mutex_lock_common.isra.5 at 8134fb74 #2 [880454e09aa0] mutex_lock at 8134fa62 #3 [880454e09ac0] walk_component at 81103868 #4 [880454e09b30] link_path_walk at 811040c1 #5 [880454e09bc0] path_openat at 8110611d #6 [880454e09c50] do_filp_open at 8110646d #7 [880454e09d20] open_exec at 810fed80 #8 [880454e09d40] load_elf_binary at 81135939 #9 [880454e09e50] search_binary_handler at 810ff7fd #10 [880454e09ea0] do_execve_common.isra.24 at 81100551 #11 [880454e09f10] sys_execve at 81014dd2 #12 [880454e09f50] stub_execve at 813559ec RIP: 7fcc8991ca87 RSP: 7fffe8b91ef8 RFLAGS: 0246 RAX: 003b RBX: 0003 RCX: RDX: 0164d180 RSI: 7fffe8b91f10 RDI: 7fcc899bc3ad RBP: 0003 R8: R9: 01f2 R10: 7fcc8a88f9d0 R11: 0246 R12: 7fffe8b91f10 R13: 0400 R14: 0001 R15: 7fffe8b91f10 ORIG_RAX: 003b CS: 0033 SS: 002b and: PID: 4087 TASK: 88040ea63840 CPU: 2 COMMAND: "nfsd" #0 [8804034b9c00] __schedule at 8134f195 #1 [8804034b9c88] __mutex_lock_common.isra.5 at 8134fb74 #2 [8804034b9cf8] mutex_lock at 8134fa62 #3 [8804034b9d18] fh_lock_nested.isra.6 at a043d63c [nfsd] #4 [8804034b9d28] nfsd_lookup_dentry at a043df1a [nfsd] #5 [8804034b9d98] nfsd4_secinfo.part.15 at a0447692 [nfsd] #6 [8804034b9dc8] nfsd4_proc_compound at a04468d6 [nfsd] #7 [8804034b9e18] nfsd_dispatch at a043a7cd [nfsd] #8 [8804034b9e48] svc_process_common at a0336c3f [sunrpc] #9 [8804034b9eb8] svc_process at a0337050 [sunrpc] #10 [8804034b9ed8] nfsd at a043a0e3 [nfsd] #11 [8804034b9ef8] kthread at 8105f701 #12 [8804034b9f48] kernel_thread_helper at 813576f4 and: PID: 5013 TASK: 880805c8b180 CPU: 8 COMMAND: "getty" #0 [88080cb8b9a8] __schedule at 8134f195 #1 [88080cb8ba30] __mutex_lock_common.isra.5 at 8134fb74 #2 [88080cb8baa0] mutex_lock at 8134fa62 #3 [88080cb8bac0] walk_component at 81103868 #4 [88080cb8bb30] link_path_walk at 811040c1 #5 [88080cb8bbc0] path_openat at 8110611d #6 [88080cb8bc50] do_filp_open at 8110646d #7 [88080cb8bd20] open_exec at 810fed80 #8 [88080cb8bd40] load_elf_binary at 81135939 #9 [88080cb8be50] search_binary_handler at 810ff7fd #10 [88080cb8bea0] do_execve_common.isra.24 at 81100551 #11 [88080cb8bf10] sys_execve at 81014dd2 #12 [88080cb8bf50] stub_execve at 813559ec RIP: 7f0d1ed74a87 RSP: 7fffab157528 RFLAGS: 0206 RAX: 003b RBX: RCX: RDX: 7fffab159ee8 RSI: 7fffab157600 RDI: 00405d7c RBP: 0003 R8: R9: R10: R11: 0206 R12: 006075a0 R13: 011da750 R14: R15: ORIG_RAX: 003b CS: 0033 SS: 002b ii linux-image-amd643.2+46 ii nfs-kernel-server1:1.2.6-4 Mike. -- To UNSUBSCRIBE, email to debian-bugs-rc-requ...@lists.debian.org with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org