Bug#754354: Something holds dentry-related mutex forever in Wheezy amd64 kernel

2014-09-28 Thread Mike Crowe
On Sunday 28 September 2014 at 15:01:21 +0100, Ben Hutchings wrote:
> On Sat, 2014-09-27 at 19:41 +0100, Mike Crowe wrote:
> > I compiled my own version of the Debian 3.2.60-1+deb7u3 kernel with
> > CONFIG_LOCKDEP and panic on hung task enabled.
> > 
> > >From the crash dump:
> > 
> > [25202.156175] INFO: task nfsd:3247 blocked for more than 900 seconds.
> > [25202.162565] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> > this message.
> > [25202.170432] nfsdD 88080aa0eca8 0  3247  2 
> > 0x
> > [25202.170444]  88080a8e19f0 0046 0006 
> > 8808
> > [25202.170458]  88080aa0e9c0 88080a8e1fd8 88080a8e1fd8 
> > 001d4040
> > [25202.170472]  88040e9926c0 88080aa0e9c0 8138d6da 
> > 0001a04c47dd
> > [25202.170488] Call Trace:
> > [25202.170504]  [] ? __mutex_lock_common+0x236/0x379
> > [25202.170531]  [] ? fh_lock_nested+0x4d/0x61 [nfsd]
> > [25202.170542]  [] schedule+0x55/0x57
> > [25202.170552]  [] __mutex_lock_common+0x243/0x379
> > [25202.170569]  [] ? fh_lock_nested+0x4d/0x61 [nfsd]
> > [25202.170581]  [] mutex_lock_nested+0x2a/0x31
> > [25202.170598]  [] fh_lock_nested+0x4d/0x61 [nfsd]
> > [25202.170610]  [] ? sched_clock+0x9/0xd
> > [25202.170626]  [] nfsd_lookup_dentry+0x196/0x227 [nfsd]
> > [25202.170646]  [] nfsd4_secinfo.part.15+0x26/0x9e [nfsd]
> > [25202.170666]  [] nfsd4_secinfo+0x4d/0x5b [nfsd]
> > [25202.170688]  [] nfsd4_proc_compound+0x265/0x43e [nfsd]
> > [25202.170703]  [] nfsd_dispatch+0xe2/0x1c8 [nfsd]
> > [25202.170734]  [] svc_process_common+0x2cf/0x4d0 [sunrpc]
> > [25202.170759]  [] svc_process+0x118/0x136 [sunrpc]
> > [25202.170773]  [] nfsd+0xeb/0x131 [nfsd]
> > [25202.170796]  [] ? 0xa04c0fff
> > [25202.170806]  [] kthread+0xa3/0xab
> > [25202.170815]  [] kernel_thread_helper+0x4/0x10
> > [25202.170823]  [] ? retint_restore_args+0x13/0x13
> > [25202.170830]  [] ? __init_kthread_worker+0x53/0x53
> > [25202.170837]  [] ? gs_change+0x13/0x13
> > [25202.170842] 1 lock held by nfsd/3247:
> > [25202.170845]  #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: 
> > [] fh_lock_nested+0x4d/0x61 [nfsd]
> > [25202.170870] Kernel panic - not syncing: hung_task: blocked tasks

[snip]

> nfsd is trying to lock two objects in the same class: specifically, it
> locks a file handle and then the file handle for the file's parent.
> It's generally safe to do this so long as they're always taken in that
> order.  lockdep should complain (much more verbosely) if this is not
> done consistently.

That makes sense. So is there any clue as to why it's blocking inside the
second mutex_lock_nested?

> I'm afraid this doesn't explain what's going wrong.  But if there are
> any more messages from lockdep further up the log (like, 15 minutes
> earlier), they might do.

Unfortunately not, the previous line in the log is the last message from boot 
time:

[   38.624072] vnet0: no IPv6 routers present

Is there a way I can persuade crash(8) to tell me which process currently
has the lock in question?

Do you have any advice as to any more debug stuff I should try turning on
when compiling the kernel?

Thanks for your help.

Mike.



signature.asc
Description: Digital signature


Bug#754354: Something holds dentry-related mutex forever in Wheezy amd64 kernel

2014-09-28 Thread Ben Hutchings
On Sat, 2014-09-27 at 19:41 +0100, Mike Crowe wrote:
> I compiled my own version of the Debian 3.2.60-1+deb7u3 kernel with
> CONFIG_LOCKDEP and panic on hung task enabled.
> 
> >From the crash dump:
> 
> [25202.156175] INFO: task nfsd:3247 blocked for more than 900 seconds.
> [25202.162565] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
> [25202.170432] nfsdD 88080aa0eca8 0  3247  2 
> 0x
> [25202.170444]  88080a8e19f0 0046 0006 
> 8808
> [25202.170458]  88080aa0e9c0 88080a8e1fd8 88080a8e1fd8 
> 001d4040
> [25202.170472]  88040e9926c0 88080aa0e9c0 8138d6da 
> 0001a04c47dd
> [25202.170488] Call Trace:
> [25202.170504]  [] ? __mutex_lock_common+0x236/0x379
> [25202.170531]  [] ? fh_lock_nested+0x4d/0x61 [nfsd]
> [25202.170542]  [] schedule+0x55/0x57
> [25202.170552]  [] __mutex_lock_common+0x243/0x379
> [25202.170569]  [] ? fh_lock_nested+0x4d/0x61 [nfsd]
> [25202.170581]  [] mutex_lock_nested+0x2a/0x31
> [25202.170598]  [] fh_lock_nested+0x4d/0x61 [nfsd]
> [25202.170610]  [] ? sched_clock+0x9/0xd
> [25202.170626]  [] nfsd_lookup_dentry+0x196/0x227 [nfsd]
> [25202.170646]  [] nfsd4_secinfo.part.15+0x26/0x9e [nfsd]
> [25202.170666]  [] nfsd4_secinfo+0x4d/0x5b [nfsd]
> [25202.170688]  [] nfsd4_proc_compound+0x265/0x43e [nfsd]
> [25202.170703]  [] nfsd_dispatch+0xe2/0x1c8 [nfsd]
> [25202.170734]  [] svc_process_common+0x2cf/0x4d0 [sunrpc]
> [25202.170759]  [] svc_process+0x118/0x136 [sunrpc]
> [25202.170773]  [] nfsd+0xeb/0x131 [nfsd]
> [25202.170796]  [] ? 0xa04c0fff
> [25202.170806]  [] kthread+0xa3/0xab
> [25202.170815]  [] kernel_thread_helper+0x4/0x10
> [25202.170823]  [] ? retint_restore_args+0x13/0x13
> [25202.170830]  [] ? __init_kthread_worker+0x53/0x53
> [25202.170837]  [] ? gs_change+0x13/0x13
> [25202.170842] 1 lock held by nfsd/3247:
> [25202.170845]  #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: 
> [] fh_lock_nested+0x4d/0x61 [nfsd]
> [25202.170870] Kernel panic - not syncing: hung_task: blocked tasks
> 
> I'm no expert at interpreting lockdep output but I think this is saying
> that nfsd is taking a nested lock and then deadlocks trying to take it
> again (which presumably shouldn't happen because it is nested.)

nfsd is trying to lock two objects in the same class: specifically, it
locks a file handle and then the file handle for the file's parent.
It's generally safe to do this so long as they're always taken in that
order.  lockdep should complain (much more verbosely) if this is not
done consistently.

I'm afraid this doesn't explain what's going wrong.  But if there are
any more messages from lockdep further up the log (like, 15 minutes
earlier), they might do.

Ben.

-- 
Ben Hutchings
This sentence contradicts itself - no actually it doesn't.


signature.asc
Description: This is a digitally signed message part


Bug#754354: Something holds dentry-related mutex forever in Wheezy amd64 kernel

2014-09-27 Thread Mike Crowe
I compiled my own version of the Debian 3.2.60-1+deb7u3 kernel with
CONFIG_LOCKDEP and panic on hung task enabled.

>From the crash dump:

[25202.156175] INFO: task nfsd:3247 blocked for more than 900 seconds.
[25202.162565] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[25202.170432] nfsdD 88080aa0eca8 0  3247  2 0x
[25202.170444]  88080a8e19f0 0046 0006 
8808
[25202.170458]  88080aa0e9c0 88080a8e1fd8 88080a8e1fd8 
001d4040
[25202.170472]  88040e9926c0 88080aa0e9c0 8138d6da 
0001a04c47dd
[25202.170488] Call Trace:
[25202.170504]  [] ? __mutex_lock_common+0x236/0x379
[25202.170531]  [] ? fh_lock_nested+0x4d/0x61 [nfsd]
[25202.170542]  [] schedule+0x55/0x57
[25202.170552]  [] __mutex_lock_common+0x243/0x379
[25202.170569]  [] ? fh_lock_nested+0x4d/0x61 [nfsd]
[25202.170581]  [] mutex_lock_nested+0x2a/0x31
[25202.170598]  [] fh_lock_nested+0x4d/0x61 [nfsd]
[25202.170610]  [] ? sched_clock+0x9/0xd
[25202.170626]  [] nfsd_lookup_dentry+0x196/0x227 [nfsd]
[25202.170646]  [] nfsd4_secinfo.part.15+0x26/0x9e [nfsd]
[25202.170666]  [] nfsd4_secinfo+0x4d/0x5b [nfsd]
[25202.170688]  [] nfsd4_proc_compound+0x265/0x43e [nfsd]
[25202.170703]  [] nfsd_dispatch+0xe2/0x1c8 [nfsd]
[25202.170734]  [] svc_process_common+0x2cf/0x4d0 [sunrpc]
[25202.170759]  [] svc_process+0x118/0x136 [sunrpc]
[25202.170773]  [] nfsd+0xeb/0x131 [nfsd]
[25202.170796]  [] ? 0xa04c0fff
[25202.170806]  [] kthread+0xa3/0xab
[25202.170815]  [] kernel_thread_helper+0x4/0x10
[25202.170823]  [] ? retint_restore_args+0x13/0x13
[25202.170830]  [] ? __init_kthread_worker+0x53/0x53
[25202.170837]  [] ? gs_change+0x13/0x13
[25202.170842] 1 lock held by nfsd/3247:
[25202.170845]  #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: 
[] fh_lock_nested+0x4d/0x61 [nfsd]
[25202.170870] Kernel panic - not syncing: hung_task: blocked tasks

I'm no expert at interpreting lockdep output but I think this is saying
that nfsd is taking a nested lock and then deadlocks trying to take it
again (which presumably shouldn't happen because it is nested.)

Mike.


-- 
To UNSUBSCRIBE, email to debian-bugs-rc-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org



Bug#754354: Something holds dentry-related mutex forever in Wheezy amd64 kernel

2014-07-30 Thread Mike Crowe
On Monday 21 July 2014 at 15:01:31 +0100, Mike Crowe wrote:
> It is possible that I'm seeing the same problem. Our AMD Opteron 4386 (16
> cores) machine is also getting stuck with lots of hung tasks.

[snip]

> PID: 4087   TASK: 88040ea63840  CPU: 2   COMMAND: "nfsd"
>  #0 [8804034b9c00] __schedule at 8134f195
>  #1 [8804034b9c88] __mutex_lock_common.isra.5 at 8134fb74
>  #2 [8804034b9cf8] mutex_lock at 8134fa62
>  #3 [8804034b9d18] fh_lock_nested.isra.6 at a043d63c [nfsd]
>  #4 [8804034b9d28] nfsd_lookup_dentry at a043df1a [nfsd]
>  #5 [8804034b9d98] nfsd4_secinfo.part.15 at a0447692 [nfsd]
>  #6 [8804034b9dc8] nfsd4_proc_compound at a04468d6 [nfsd]
>  #7 [8804034b9e18] nfsd_dispatch at a043a7cd [nfsd]
>  #8 [8804034b9e48] svc_process_common at a0336c3f [sunrpc]
>  #9 [8804034b9eb8] svc_process at a0337050 [sunrpc]
> #10 [8804034b9ed8] nfsd at a043a0e3 [nfsd]
> #11 [8804034b9ef8] kthread at 8105f701
> #12 [8804034b9f48] kernel_thread_helper at 813576f4

> ii  linux-image-amd643.2+46
> ii  nfs-kernel-server1:1.2.6-4

That version information wasn't very useful, was it. :( I believe that this
crash was from linux-image-3.2.0-4-amd64 3.2.60-1+deb7u1

I've just had the same failure happen with 3.2.60-1+deb7u3.

Mike.


-- 
To UNSUBSCRIBE, email to debian-bugs-rc-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org



Bug#754354: Something holds dentry-related mutex forever in Wheezy amd64 kernel

2014-07-21 Thread Mike Crowe
It is possible that I'm seeing the same problem. Our AMD Opteron 4386 (16
cores) machine is also getting stuck with lots of hung tasks.

Although it responds to ping, and even a KVM virtual machine running on it
appears to continue working correctly, the host itself is locked up. This
happens once a week - probably when the machine is under the most direct
CPU load and NFS load.

Once the machine is in this state I can type in a username at the login
prompt but no password prompt ever appears.

I forced a crashdump and it contained hundreds of tasks with backtraces
involving a mutex_lock in walk_component or nfsd_lookup_dentry which look
similar to Alexander's:

PID: 499TASK: 880490a29080  CPU: 11  COMMAND: "nrpe"
 #0 [880454e099a8] __schedule at 8134f195
 #1 [880454e09a30] __mutex_lock_common.isra.5 at 8134fb74
 #2 [880454e09aa0] mutex_lock at 8134fa62
 #3 [880454e09ac0] walk_component at 81103868
 #4 [880454e09b30] link_path_walk at 811040c1
 #5 [880454e09bc0] path_openat at 8110611d
 #6 [880454e09c50] do_filp_open at 8110646d
 #7 [880454e09d20] open_exec at 810fed80
 #8 [880454e09d40] load_elf_binary at 81135939
 #9 [880454e09e50] search_binary_handler at 810ff7fd
#10 [880454e09ea0] do_execve_common.isra.24 at 81100551
#11 [880454e09f10] sys_execve at 81014dd2
#12 [880454e09f50] stub_execve at 813559ec
RIP: 7fcc8991ca87  RSP: 7fffe8b91ef8  RFLAGS: 0246
RAX: 003b  RBX: 0003  RCX: 
RDX: 0164d180  RSI: 7fffe8b91f10  RDI: 7fcc899bc3ad
RBP: 0003   R8:    R9: 01f2
R10: 7fcc8a88f9d0  R11: 0246  R12: 7fffe8b91f10
R13: 0400  R14: 0001  R15: 7fffe8b91f10
ORIG_RAX: 003b  CS: 0033  SS: 002b

and:

PID: 4087   TASK: 88040ea63840  CPU: 2   COMMAND: "nfsd"
 #0 [8804034b9c00] __schedule at 8134f195
 #1 [8804034b9c88] __mutex_lock_common.isra.5 at 8134fb74
 #2 [8804034b9cf8] mutex_lock at 8134fa62
 #3 [8804034b9d18] fh_lock_nested.isra.6 at a043d63c [nfsd]
 #4 [8804034b9d28] nfsd_lookup_dentry at a043df1a [nfsd]
 #5 [8804034b9d98] nfsd4_secinfo.part.15 at a0447692 [nfsd]
 #6 [8804034b9dc8] nfsd4_proc_compound at a04468d6 [nfsd]
 #7 [8804034b9e18] nfsd_dispatch at a043a7cd [nfsd]
 #8 [8804034b9e48] svc_process_common at a0336c3f [sunrpc]
 #9 [8804034b9eb8] svc_process at a0337050 [sunrpc]
#10 [8804034b9ed8] nfsd at a043a0e3 [nfsd]
#11 [8804034b9ef8] kthread at 8105f701
#12 [8804034b9f48] kernel_thread_helper at 813576f4

and:

PID: 5013   TASK: 880805c8b180  CPU: 8   COMMAND: "getty"
 #0 [88080cb8b9a8] __schedule at 8134f195
 #1 [88080cb8ba30] __mutex_lock_common.isra.5 at 8134fb74
 #2 [88080cb8baa0] mutex_lock at 8134fa62
 #3 [88080cb8bac0] walk_component at 81103868
 #4 [88080cb8bb30] link_path_walk at 811040c1
 #5 [88080cb8bbc0] path_openat at 8110611d
 #6 [88080cb8bc50] do_filp_open at 8110646d
 #7 [88080cb8bd20] open_exec at 810fed80
 #8 [88080cb8bd40] load_elf_binary at 81135939
 #9 [88080cb8be50] search_binary_handler at 810ff7fd
#10 [88080cb8bea0] do_execve_common.isra.24 at 81100551
#11 [88080cb8bf10] sys_execve at 81014dd2
#12 [88080cb8bf50] stub_execve at 813559ec
RIP: 7f0d1ed74a87  RSP: 7fffab157528  RFLAGS: 0206
RAX: 003b  RBX:   RCX: 
RDX: 7fffab159ee8  RSI: 7fffab157600  RDI: 00405d7c
RBP: 0003   R8:    R9: 
R10:   R11: 0206  R12: 006075a0
R13: 011da750  R14:   R15: 
ORIG_RAX: 003b  CS: 0033  SS: 002b

ii  linux-image-amd643.2+46
ii  nfs-kernel-server1:1.2.6-4

Mike.


-- 
To UNSUBSCRIBE, email to debian-bugs-rc-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org