Re: NFS deadlock on 9.2-Beta1
On Sun, Aug 25, 2013 at 7:42 AM, Adrian Chadd wrote:

> Does -HEAD have this same problem?

If I understood kib@ correctly, this is fixed in -HEAD by r253927.

> If so, we should likely just revert the patch entirely from -HEAD and -9
> until it's resolved.

It was not too difficult to prepare a releng/9.2 build with r254754 reverted
(reverting the revert) and then applying kib@'s suggested backport fix. So far
that is running on 9 nodes with no reported problems, but only since last
night. We were hesitant to do the significant work involved to push it out to
dozens of nodes if nobody was going to consider it for 9.2 anyway.

Thanks!
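[Editor's note: for readers who want to reproduce that kind of local test build, a rough sketch follows. The revision numbers are the ones discussed in this thread, but the repository URL, working-copy path, patch file name, and kernel config are illustrative assumptions, not details taken from the message above.]

host # svn checkout svn://svn.freebsd.org/base/releng/9.2 /usr/src
host # cd /usr/src
host # svn merge -c -254754 ^/releng/9.2 .    # back out r254754 (the revert), restoring the r250907 changes;
                                              # use ^/stable/9 instead if that is where r254754 landed
host # svn patch /root/kib-backport-r253927.diff   # hypothetical file holding kib@'s suggested backport fix
host # make buildkernel KERNCONF=GENERIC
host # make installkernel KERNCONF=GENERIC && shutdown -r now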
Re: NFS deadlock on 9.2-Beta1
Michael Tratz wrote: > > On Aug 15, 2013, at 2:39 PM, Rick Macklem > wrote: > > > Michael Tratz wrote: > >> > >> On Jul 27, 2013, at 11:25 PM, Konstantin Belousov > >> wrote: > >> > >>> On Sat, Jul 27, 2013 at 03:13:05PM -0700, Michael Tratz wrote: > Let's assume the pid which started the deadlock is 14001 (it > will > be a different pid when we get the results, because the machine > has been restarted) > > I type: > > show proc 14001 > > I get the thread numbers from that output and type: > > show thread x > > for each one. > > And a trace for each thread with the command? > > tr > > Anything else I should try to get or do? Or is that not the data > at all you are looking for? > > >>> Yes, everything else which is listed in the 'debugging deadlocks' > >>> page > >>> must be provided, otherwise the deadlock cannot be tracked. > >>> > >>> The investigator should be able to see the whole deadlock chain > >>> (loop) > >>> to make any useful advance. > >> > >> Ok, I have made some excellent progress in debugging the NFS > >> deadlock. > >> > >> Rick! You are genius. :-) You found the right commit r250907 > >> (dated > >> May 22) is the definitely the problem. > >> > >> Here is how I did the testing: One machine received a kernel > >> before > >> r250907, the second machine received a kernel after r250907. Sure > >> enough within a few hours the machine with r250907 went into the > >> usual deadlock state. The machine without that commit kept on > >> working fine. Then I went back to the latest revision (r253726), > >> but > >> leaving r250907 out. The machines have been running happy and rock > >> solid without any deadlocks. I have expanded the testing to 3 > >> machines now and no reports of any issues. > >> > >> I guess now Konstantin has to figure out why that commit is > >> causing > >> the deadlock. Lovely! :-) I will get that information as soon as > >> possible. I'm a little behind with normal work load, but I expect > >> to > >> have the data by Tuesday evening or Wednesday. > >> > > Have you been able to pass the debugging info on to Kostik? > > > > It would be really nice to get this fixed for FreeBSD9.2. > > > > Thanks for your help with this, rick > > Sorry Rick, I wasn't able to get you guys that info quickly enough. I > thought I would have enough time, before my own wedding and > honeymoon came along, but everything went a little crazy and > stressful. I didn't think it would be this nuts. :-) > > I'm caught up with everything and from what I can see from the > discussions is that we know now what the problem is. > > I can report that the machines which I have had without r250907 have > been running without any problems for 27+ days. > > If you need me to test any new patches, please let me know. If I > should test with the partial merge of r253927 I'll be happy to do > so. > It's up to you, but you might want to wait until the other tester (J. David?) reports back on success/failure. Thanks for your help with this, rick > Thanks, > > Michael > > > > > ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
Re: NFS deadlock on 9.2-Beta1
Hi, Does -HEAD have this same problem? If so, we should likely just revert the patch entirely from -HEAD and -9 until it's resolved. -adrian On 24 August 2013 23:51, Michael Tratz wrote: > > On Aug 15, 2013, at 2:39 PM, Rick Macklem wrote: > > > Michael Tratz wrote: > >> > >> On Jul 27, 2013, at 11:25 PM, Konstantin Belousov > >> wrote: > >> > >>> On Sat, Jul 27, 2013 at 03:13:05PM -0700, Michael Tratz wrote: > Let's assume the pid which started the deadlock is 14001 (it will > be a different pid when we get the results, because the machine > has been restarted) > > I type: > > show proc 14001 > > I get the thread numbers from that output and type: > > show thread x > > for each one. > > And a trace for each thread with the command? > > tr > > Anything else I should try to get or do? Or is that not the data > at all you are looking for? > > >>> Yes, everything else which is listed in the 'debugging deadlocks' > >>> page > >>> must be provided, otherwise the deadlock cannot be tracked. > >>> > >>> The investigator should be able to see the whole deadlock chain > >>> (loop) > >>> to make any useful advance. > >> > >> Ok, I have made some excellent progress in debugging the NFS > >> deadlock. > >> > >> Rick! You are genius. :-) You found the right commit r250907 (dated > >> May 22) is the definitely the problem. > >> > >> Here is how I did the testing: One machine received a kernel before > >> r250907, the second machine received a kernel after r250907. Sure > >> enough within a few hours the machine with r250907 went into the > >> usual deadlock state. The machine without that commit kept on > >> working fine. Then I went back to the latest revision (r253726), but > >> leaving r250907 out. The machines have been running happy and rock > >> solid without any deadlocks. I have expanded the testing to 3 > >> machines now and no reports of any issues. > >> > >> I guess now Konstantin has to figure out why that commit is causing > >> the deadlock. Lovely! :-) I will get that information as soon as > >> possible. I'm a little behind with normal work load, but I expect to > >> have the data by Tuesday evening or Wednesday. > >> > > Have you been able to pass the debugging info on to Kostik? > > > > It would be really nice to get this fixed for FreeBSD9.2. > > > > Thanks for your help with this, rick > > Sorry Rick, I wasn't able to get you guys that info quickly enough. I > thought I would have enough time, before my own wedding and honeymoon came > along, but everything went a little crazy and stressful. I didn't think it > would be this nuts. :-) > > I'm caught up with everything and from what I can see from the discussions > is that we know now what the problem is. > > I can report that the machines which I have had without r250907 have been > running without any problems for 27+ days. > > If you need me to test any new patches, please let me know. If I should > test with the partial merge of r253927 I'll be happy to do so. > > Thanks, > > Michael > > > > > ___ > freebsd-stable@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-stable > To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org" > ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
Re: NFS deadlock on 9.2-Beta1
On Aug 15, 2013, at 2:39 PM, Rick Macklem wrote: > Michael Tratz wrote: >> >> On Jul 27, 2013, at 11:25 PM, Konstantin Belousov >> wrote: >> >>> On Sat, Jul 27, 2013 at 03:13:05PM -0700, Michael Tratz wrote: Let's assume the pid which started the deadlock is 14001 (it will be a different pid when we get the results, because the machine has been restarted) I type: show proc 14001 I get the thread numbers from that output and type: show thread x for each one. And a trace for each thread with the command? tr Anything else I should try to get or do? Or is that not the data at all you are looking for? >>> Yes, everything else which is listed in the 'debugging deadlocks' >>> page >>> must be provided, otherwise the deadlock cannot be tracked. >>> >>> The investigator should be able to see the whole deadlock chain >>> (loop) >>> to make any useful advance. >> >> Ok, I have made some excellent progress in debugging the NFS >> deadlock. >> >> Rick! You are genius. :-) You found the right commit r250907 (dated >> May 22) is the definitely the problem. >> >> Here is how I did the testing: One machine received a kernel before >> r250907, the second machine received a kernel after r250907. Sure >> enough within a few hours the machine with r250907 went into the >> usual deadlock state. The machine without that commit kept on >> working fine. Then I went back to the latest revision (r253726), but >> leaving r250907 out. The machines have been running happy and rock >> solid without any deadlocks. I have expanded the testing to 3 >> machines now and no reports of any issues. >> >> I guess now Konstantin has to figure out why that commit is causing >> the deadlock. Lovely! :-) I will get that information as soon as >> possible. I'm a little behind with normal work load, but I expect to >> have the data by Tuesday evening or Wednesday. >> > Have you been able to pass the debugging info on to Kostik? > > It would be really nice to get this fixed for FreeBSD9.2. > > Thanks for your help with this, rick Sorry Rick, I wasn't able to get you guys that info quickly enough. I thought I would have enough time, before my own wedding and honeymoon came along, but everything went a little crazy and stressful. I didn't think it would be this nuts. :-) I'm caught up with everything and from what I can see from the discussions is that we know now what the problem is. I can report that the machines which I have had without r250907 have been running without any problems for 27+ days. If you need me to test any new patches, please let me know. If I should test with the partial merge of r253927 I'll be happy to do so. Thanks, Michael ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
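[Editor's note: the side-by-side test described above (one kernel from just before r250907, one from after, and finally the newest tree with only r250907 left out) can be set up with ordinary svn checkouts. The sketch below is an assumption-laden illustration; the repository URL and destination directories are not from the original message.]

host # svn checkout -r 250906 svn://svn.freebsd.org/base/stable/9 /usr/src.pre-r250907
host # svn checkout -r 250907 svn://svn.freebsd.org/base/stable/9 /usr/src.with-r250907
host # svn checkout -r 253726 svn://svn.freebsd.org/base/stable/9 /usr/src.r253726
host # cd /usr/src.r253726 && svn merge -c -250907 ^/stable/9 .   # latest tree with only r250907 backed out

Each tree is then built and installed with the usual make buildkernel / installkernel steps.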
Re: NFS deadlock on 9.2-Beta1
Kostik wrote: > On Sat, Aug 24, 2013 at 01:08:05PM -0400, J David wrote: > > The requested information about the deadlock was finally obtained > > and > > provided off-list to the requested parties due to size. > > Thank you, the problem is clear now. > > The problematic process backtrace is > > Tracing command httpd pid 86383 tid 100138 td 0xfe000b7b2900 > sched_switch() at sched_switch+0x234/frame 0xff834c442360 > mi_switch() at mi_switch+0x15c/frame 0xff834c4423a0 > sleepq_switch() at sleepq_switch+0x17d/frame 0xff834c4423e0 > sleepq_wait() at sleepq_wait+0x43/frame 0xff834c442410 > sleeplk() at sleeplk+0x11a/frame 0xff834c442460 > __lockmgr_args() at __lockmgr_args+0x9a9/frame 0xff834c442580 > nfs_lock1() at nfs_lock1+0x87/frame 0xff834c4425b0 > VOP_LOCK1_APV() at VOP_LOCK1_APV+0xbe/frame 0xff834c4425e0 > _vn_lock() at _vn_lock+0x63/frame 0xff834c442640 > ncl_upgrade_vnlock() at ncl_upgrade_vnlock+0x5e/frame > 0xff834c442670 > ncl_bioread() at ncl_bioread+0x195/frame 0xff834c4427e0 > VOP_READ_APV() at VOP_READ_APV+0xd1/frame 0xff834c442810 > vn_rdwr() at vn_rdwr+0x2bc/frame 0xff834c4428d0 > kern_sendfile() at kern_sendfile+0xa90/frame 0xff834c442ac0 > do_sendfile() at do_sendfile+0x92/frame 0xff834c442b20 > amd64_syscall() at amd64_syscall+0x259/frame 0xff834c442c30 > Xfast_syscall() at Xfast_syscall+0xfb/frame 0xff834c442c30 > --- syscall (393, FreeBSD ELF64, sys_sendfile), rip = 0x801b24f4c, > rsp = 0x7fffce98, rbp = 0x7fffd1d0 --- > > It tries to do the upgrade of the nfs vnode lock, and for this, the > lock > is dropped and re-acquired. Since this happens with the vnode vm > object' > page busied, we get a reversal between vnode lock and page busy > state. > So effectively, my suspicion that NFS read path drops vnode lock was > true, and in fact I knew about the upgrade. > Ouch. I had forgotten that LK_UPGRADE could result in the shared lock being dropped. I'll admit I've never liked the lock upgrade in nfs_read(), but I'm not sure how to avoid it. I just looked at the commit log message for r138469, which is where this appeared in the old NFS client. (The new NFS client just cloned this code.) It basically notes that with a shared lock, new pages can be faulted in for the vnode while vinvalbuf() is in progress, causing it to fail (I suspect "fail" means never completed?). At the very least, I don't think the lock upgrade is needed unless a call to vinvalbuf() is going to be done. (I'm wondering is a dedicated lock used to serialize this case might be better than using a node LK_UPGRADE?) I think I'll take a closer look at the vinvalbuf() code in head. Do others have any comments on this? (I added jhb@ to the cc list, since he may be familiar with this?) But none of this can happen quickly, so it wouldn't be feasible for stable/9 or even 10.0 at this point in time. rick > I think the easiest route is to a partial merge of the r253927 from > HEAD. > > Index: fs > === > --- fs(revision 254800) > +++ fs(working copy) > > Property changes on: fs > ___ > Modified: svn:mergeinfo >Merged /head/sys/fs:r253927 > Index: kern/uipc_syscalls.c > === > --- kern/uipc_syscalls.c (revision 254800) > +++ kern/uipc_syscalls.c (working copy) > @@ -2124,11 +2124,6 @@ > else { > ssize_t resid; > > - /* > - * Ensure that our page is still around > - * when the I/O completes. 
> - */ > - vm_page_io_start(pg); > VM_OBJECT_UNLOCK(obj); > > /* > @@ -2144,10 +2139,8 @@ > IO_VMIO | ((MAXBSIZE / bsize) << > IO_SEQSHIFT), > td->td_ucred, NOCRED, &resid, td); > VFS_UNLOCK_GIANT(vfslocked); > - VM_OBJECT_LOCK(obj); > - vm_page_io_finish(pg); > - if (!error) > - VM_OBJECT_UNLOCK(obj); > + if (error) > + VM_OBJECT_LOCK(obj); > mbstat.sf_iocnt++; > } > if (error) { > Index: . > === > --- . (revision 254800) > +++ . (working copy) > > Property changes on: . > ___ > Modified: svn:mergeinfo >Merged /head/sys:r253927 > ___ freebsd-stable@freebsd.org
Re: NFS deadlock on 9.2-Beta1
On Sat, Aug 24, 2013 at 4:55 PM, Konstantin Belousov wrote:

> On Sat, Aug 24, 2013 at 04:11:09PM -0400, J David wrote:
>> On Sat, Aug 24, 2013 at 3:38 PM, Konstantin Belousov wrote:
>> > No, at least not without reverting the r254754 first. The IGN_SBUSY patch
>> > is not critical there.
>>
>> There is lots of other stuff in r250907 / reverted by r254754. Some of it
>> looks important for sendfile() performance. If testing this extensively in
>> the next few days could help get that work back into 9.2 we are happy to do
>> it, but if it's too late then we can leave it for those on stable/9.
>
> The revert in r254754 is only a workaround for your workload; it does not
> fix the real issue, which can be reproduced by other means.
>
> I am not sure whether re@ would allow merging the proper fix, since we are
> already somewhere in RC3.

Well, let's ask them. :)

Thanks!
Re: NFS deadlock on 9.2-Beta1
On Sat, Aug 24, 2013 at 04:11:09PM -0400, J David wrote:

> On Sat, Aug 24, 2013 at 3:38 PM, Konstantin Belousov wrote:
> > No, at least not without reverting the r254754 first. The IGN_SBUSY patch
> > is not critical there.
>
> There is lots of other stuff in r250907 / reverted by r254754. Some of it
> looks important for sendfile() performance. If testing this extensively in
> the next few days could help get that work back into 9.2 we are happy to do
> it, but if it's too late then we can leave it for those on stable/9.

The revert in r254754 is only a workaround for your workload; it does not fix
the real issue, which can be reproduced by other means.

I am not sure whether re@ would allow merging the proper fix, since we are
already somewhere in RC3.
Re: NFS deadlock on 9.2-Beta1
On Sat, Aug 24, 2013 at 3:38 PM, Konstantin Belousov wrote: > No, at least not without reverting the r254754 first. The IGN_SBUSY patch > is not critical there. There is lots of other stuff in r250907 / reverted by r254754. Some of it looks important for sendfile() performance. If testing this extensively in the next few days could help get that work back into 9.2 we are happy to do it, but if it's too late then we can leave it for those on stable/9. Thanks! ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
Re: NFS deadlock on 9.2-Beta1
On Sat, Aug 24, 2013 at 02:03:50PM -0400, J David wrote: > On Sat, Aug 24, 2013 at 1:41 PM, Konstantin Belousov > wrote: > > I think the easiest route is to a partial merge of the r253927 from HEAD. > > Is it helpful if we restart testing releng/9.2 using your suggested > fix? And if so, the IGN_SBUSY patch you posted earlier be applied as > well or no? No, at least not without reverting the r254754 first. The IGN_SBUSY patch is not critical there. > > If it ran successfully on a bunch of machines for next few days, maybe > that would still be in time to be useful feedback for 9.2. > > Thanks! pgp3JVF_QAbg0.pgp Description: PGP signature
Re: NFS deadlock on 9.2-Beta1
On Sat, Aug 24, 2013 at 1:41 PM, Konstantin Belousov wrote: > I think the easiest route is to a partial merge of the r253927 from HEAD. Is it helpful if we restart testing releng/9.2 using your suggested fix? And if so, the IGN_SBUSY patch you posted earlier be applied as well or no? If it ran successfully on a bunch of machines for next few days, maybe that would still be in time to be useful feedback for 9.2. Thanks! ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
Re: NFS deadlock on 9.2-Beta1
On Sat, Aug 24, 2013 at 01:08:05PM -0400, J David wrote:

> The requested information about the deadlock was finally obtained and
> provided off-list to the requested parties due to size.

Thank you, the problem is clear now.

The problematic process backtrace is:

Tracing command httpd pid 86383 tid 100138 td 0xfe000b7b2900
sched_switch() at sched_switch+0x234/frame 0xff834c442360
mi_switch() at mi_switch+0x15c/frame 0xff834c4423a0
sleepq_switch() at sleepq_switch+0x17d/frame 0xff834c4423e0
sleepq_wait() at sleepq_wait+0x43/frame 0xff834c442410
sleeplk() at sleeplk+0x11a/frame 0xff834c442460
__lockmgr_args() at __lockmgr_args+0x9a9/frame 0xff834c442580
nfs_lock1() at nfs_lock1+0x87/frame 0xff834c4425b0
VOP_LOCK1_APV() at VOP_LOCK1_APV+0xbe/frame 0xff834c4425e0
_vn_lock() at _vn_lock+0x63/frame 0xff834c442640
ncl_upgrade_vnlock() at ncl_upgrade_vnlock+0x5e/frame 0xff834c442670
ncl_bioread() at ncl_bioread+0x195/frame 0xff834c4427e0
VOP_READ_APV() at VOP_READ_APV+0xd1/frame 0xff834c442810
vn_rdwr() at vn_rdwr+0x2bc/frame 0xff834c4428d0
kern_sendfile() at kern_sendfile+0xa90/frame 0xff834c442ac0
do_sendfile() at do_sendfile+0x92/frame 0xff834c442b20
amd64_syscall() at amd64_syscall+0x259/frame 0xff834c442c30
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xff834c442c30
--- syscall (393, FreeBSD ELF64, sys_sendfile), rip = 0x801b24f4c, rsp = 0x7fffce98, rbp = 0x7fffd1d0 ---

It tries to do the upgrade of the nfs vnode lock, and for this, the lock is
dropped and re-acquired. Since this happens with the vnode vm object's page
busied, we get a reversal between the vnode lock and the page busy state.
So effectively, my suspicion that the NFS read path drops the vnode lock was
true, and in fact I knew about the upgrade.

I think the easiest route is a partial merge of r253927 from HEAD.

Index: fs
===================================================================
--- fs	(revision 254800)
+++ fs	(working copy)

Property changes on: fs
___________________________________________________________________
Modified: svn:mergeinfo
   Merged /head/sys/fs:r253927

Index: kern/uipc_syscalls.c
===================================================================
--- kern/uipc_syscalls.c	(revision 254800)
+++ kern/uipc_syscalls.c	(working copy)
@@ -2124,11 +2124,6 @@
 		else {
 			ssize_t resid;
 
-			/*
-			 * Ensure that our page is still around
-			 * when the I/O completes.
-			 */
-			vm_page_io_start(pg);
 			VM_OBJECT_UNLOCK(obj);
 
 			/*
@@ -2144,10 +2139,8 @@
 			    IO_VMIO | ((MAXBSIZE / bsize) << IO_SEQSHIFT),
 			    td->td_ucred, NOCRED, &resid, td);
 			VFS_UNLOCK_GIANT(vfslocked);
-			VM_OBJECT_LOCK(obj);
-			vm_page_io_finish(pg);
-			if (!error)
-				VM_OBJECT_UNLOCK(obj);
+			if (error)
+				VM_OBJECT_LOCK(obj);
 			mbstat.sf_iocnt++;
 		}
 		if (error) {
Index: .
===================================================================
--- .	(revision 254800)
+++ .	(working copy)

Property changes on: .
___________________________________________________________________
Modified: svn:mergeinfo
   Merged /head/sys:r253927
Re: NFS deadlock on 9.2-Beta1
The requested information about the deadlock was finally obtained and provided off-list to the requested parties due to size. ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
Re: NFS deadlock on 9.2-Beta1
J. David wrote: > One deadlocked process cropped up overnight, but I managed to panic > the box before getting too much debugging info. :( > > The process was in state T instead of D, which I guess must be a side > effect of some of the debugging code compiled in. > > Here are the details I was able to capture: > > db> show proc 7692 > Process 7692 (httpd) at 0xfe0158793000: > state: NORMAL > uid: 25000 gids: 25000 > parent: pid 1 at 0xfe00039c3950 > ABI: FreeBSD ELF64 > arguments: /nfsn/apps/tapache22/bin/httpd > threads: 3 > 100674 D newnfs 0xfe021cdd9848 httpd > 100597 D pgrbwt 0xfe02fda788b8 httpd > 100910 s httpd > > db> show thread 100674 > Thread 100674 at 0xfe0108c79480: > proc (pid 7692): 0xfe0158793000 > name: httpd > stack: 0xff834c80f000-0xff834c812fff > flags: 0x2a804 pflags: 0 > state: INHIBITED: {SLEEPING} > wmesg: newnfs wchan: 0xfe021cdd9848 > priority: 96 > container lock: sleepq chain (0x813c03c8) > > db> tr 100674 > Tracing pid 7692 tid 100674 td 0xfe0108c79480 > sched_switch() at sched_switch+0x234/frame 0xff834c812360 > mi_switch() at mi_switch+0x15c/frame 0xff834c8123a0 > sleepq_switch() at sleepq_switch+0x17d/frame 0xff834c8123e0 > sleepq_wait() at sleepq_wait+0x43/frame 0xff834c812410 > sleeplk() at sleeplk+0x11a/frame 0xff834c812460 > __lockmgr_args() at __lockmgr_args+0x9a9/frame 0xff834c812580 > nfs_lock1() at nfs_lock1+0x87/frame 0xff834c8125b0 > VOP_LOCK1_APV() at VOP_LOCK1_APV+0xbe/frame 0xff834c8125e0 > _vn_lock() at _vn_lock+0x63/frame 0xff834c812640 > ncl_upgrade_vnlock() at ncl_upgrade_vnlock+0x5e/frame > 0xff834c812670 > ncl_bioread() at ncl_bioread+0x195/frame 0xff834c8127e0 > VOP_READ_APV() at VOP_READ_APV+0xd1/frame 0xff834c812810 > vn_rdwr() at vn_rdwr+0x2bc/frame 0xff834c8128d0 > kern_sendfile() at kern_sendfile+0xa90/frame 0xff834c812ac0 > do_sendfile() at do_sendfile+0x92/frame 0xff834c812b20 > amd64_syscall() at amd64_syscall+0x259/frame 0xff834c812c30 > Xfast_syscall() at Xfast_syscall+0xfb/frame 0xff834c812c30 > --- syscall (393, FreeBSD ELF64, sys_sendfile), rip = 0x801b26f4c, > rsp > = 0x7e9f43c8, rbp = 0x7e9f4700 --- > > db> show lockchain 100674 > thread 100674 (pid 7692, httpd) inhibited > > db> show thread 100597 > Thread 100597 at 0xfe021c976000: > proc (pid 7692): 0xfe0158793000 > name: httpd > stack: 0xff834c80a000-0xff834c80dfff > flags: 0x28804 pflags: 0 > state: INHIBITED: {SLEEPING} > wmesg: pgrbwt wchan: 0xfe02fda788b8 > priority: 84 > container lock: sleepq chain (0x813c0148) > > db> tr 100597 > Tracing pid 7692 tid 100597 td 0xfe021c976000 > sched_switch() at sched_switch+0x234/frame 0xff834c80d750 > mi_switch() at mi_switch+0x15c/frame 0xff834c80d790 > sleepq_switch() at sleepq_switch+0x17d/frame 0xff834c80d7d0 > sleepq_wait() at sleepq_wait+0x43/frame 0xff834c80d800 > _sleep() at _sleep+0x30f/frame 0xff834c80d890 > vm_page_grab() at vm_page_grab+0x120/frame 0xff834c80d8d0 > kern_sendfile() at kern_sendfile+0x992/frame 0xff834c80dac0 > do_sendfile() at do_sendfile+0x92/frame 0xff834c80db20 > amd64_syscall() at amd64_syscall+0x259/frame 0xff834c80dc30 > Xfast_syscall() at Xfast_syscall+0xfb/frame 0xff834c80dc30 > --- syscall (393, FreeBSD ELF64, sys_sendfile), rip = 0x801b26f4c, > rsp > = 0x7ebf53c8, rbp = 0x7ebf5700 --- > > db> show lockchain 100597 > thread 100597 (pid 7692, httpd) inhibited > > The "inhibited" is not something I'm familiar with and didn't match > the example output; I thought that maybe the T state was overpowering > the locks, and that maybe I should continue the system and then -CONT > the process. 
> However, a few seconds after I issued "c" at the DDB prompt, the system
> panicked in the console driver ("mtx_lock_spin: recursed on non-recursive
> mutex cnputs_mtx @ /usr/src/sys/kern/kern_cons.c:500"), so I guess that's
> not a thing to do. :(
>
> Sorry my stupidity and ignorance is dragging this out. :( This is all well
> outside my comfort zone, but next time I'll get it for sure.
>
No problem. Thanks for trying to capture this stuff. Unfortunately, what you
have above doesn't tell me anything more about the problem.

The main question to me is "Why is the thread stuck in "pgrbwt" permanently?".
To figure this out, we need the info on all threads on the system. In
particular, the status (the output of "ps axHl" would be a start, before going
into the debugger) of the "nfsiod" threads might point to the cause, although
it may involve other threads as well.

If you are running a serial console, just start "script" and then type the
command "ps axHl", followed by going into the debugger and doing the commands
here (basically everything with "all"):
http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
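[Editor's note: as a concrete illustration of that capture procedure, a sketch of a session is shown below. The DDB command list is meant to approximate what the deadlock-debugging chapter linked above asks for; it is reproduced from memory rather than copied from the handbook, so the handbook remains the authoritative list, and the transcript file name is an arbitrary choice.]

host # script /var/tmp/deadlock-info.txt
host # ps axHl
host # sysctl debug.kdb.enter=1      (or send a break over the serial console)
db> ps
db> show allpcpu
db> show alllocks
db> show lockedvnods
db> show allchains
db> alltrace
db> call doadump
db> reset

The script session ends when the machine reboots; the captured transcript is left in /var/tmp/deadlock-info.txt.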
Re: NFS deadlock on 9.2-Beta1
One deadlocked process cropped up overnight, but I managed to panic the box before getting too much debugging info. :( The process was in state T instead of D, which I guess must be a side effect of some of the debugging code compiled in. Here are the details I was able to capture: db> show proc 7692 Process 7692 (httpd) at 0xfe0158793000: state: NORMAL uid: 25000 gids: 25000 parent: pid 1 at 0xfe00039c3950 ABI: FreeBSD ELF64 arguments: /nfsn/apps/tapache22/bin/httpd threads: 3 100674 D newnfs 0xfe021cdd9848 httpd 100597 D pgrbwt 0xfe02fda788b8 httpd 100910 s httpd db> show thread 100674 Thread 100674 at 0xfe0108c79480: proc (pid 7692): 0xfe0158793000 name: httpd stack: 0xff834c80f000-0xff834c812fff flags: 0x2a804 pflags: 0 state: INHIBITED: {SLEEPING} wmesg: newnfs wchan: 0xfe021cdd9848 priority: 96 container lock: sleepq chain (0x813c03c8) db> tr 100674 Tracing pid 7692 tid 100674 td 0xfe0108c79480 sched_switch() at sched_switch+0x234/frame 0xff834c812360 mi_switch() at mi_switch+0x15c/frame 0xff834c8123a0 sleepq_switch() at sleepq_switch+0x17d/frame 0xff834c8123e0 sleepq_wait() at sleepq_wait+0x43/frame 0xff834c812410 sleeplk() at sleeplk+0x11a/frame 0xff834c812460 __lockmgr_args() at __lockmgr_args+0x9a9/frame 0xff834c812580 nfs_lock1() at nfs_lock1+0x87/frame 0xff834c8125b0 VOP_LOCK1_APV() at VOP_LOCK1_APV+0xbe/frame 0xff834c8125e0 _vn_lock() at _vn_lock+0x63/frame 0xff834c812640 ncl_upgrade_vnlock() at ncl_upgrade_vnlock+0x5e/frame 0xff834c812670 ncl_bioread() at ncl_bioread+0x195/frame 0xff834c8127e0 VOP_READ_APV() at VOP_READ_APV+0xd1/frame 0xff834c812810 vn_rdwr() at vn_rdwr+0x2bc/frame 0xff834c8128d0 kern_sendfile() at kern_sendfile+0xa90/frame 0xff834c812ac0 do_sendfile() at do_sendfile+0x92/frame 0xff834c812b20 amd64_syscall() at amd64_syscall+0x259/frame 0xff834c812c30 Xfast_syscall() at Xfast_syscall+0xfb/frame 0xff834c812c30 --- syscall (393, FreeBSD ELF64, sys_sendfile), rip = 0x801b26f4c, rsp = 0x7e9f43c8, rbp = 0x7e9f4700 --- db> show lockchain 100674 thread 100674 (pid 7692, httpd) inhibited db> show thread 100597 Thread 100597 at 0xfe021c976000: proc (pid 7692): 0xfe0158793000 name: httpd stack: 0xff834c80a000-0xff834c80dfff flags: 0x28804 pflags: 0 state: INHIBITED: {SLEEPING} wmesg: pgrbwt wchan: 0xfe02fda788b8 priority: 84 container lock: sleepq chain (0x813c0148) db> tr 100597 Tracing pid 7692 tid 100597 td 0xfe021c976000 sched_switch() at sched_switch+0x234/frame 0xff834c80d750 mi_switch() at mi_switch+0x15c/frame 0xff834c80d790 sleepq_switch() at sleepq_switch+0x17d/frame 0xff834c80d7d0 sleepq_wait() at sleepq_wait+0x43/frame 0xff834c80d800 _sleep() at _sleep+0x30f/frame 0xff834c80d890 vm_page_grab() at vm_page_grab+0x120/frame 0xff834c80d8d0 kern_sendfile() at kern_sendfile+0x992/frame 0xff834c80dac0 do_sendfile() at do_sendfile+0x92/frame 0xff834c80db20 amd64_syscall() at amd64_syscall+0x259/frame 0xff834c80dc30 Xfast_syscall() at Xfast_syscall+0xfb/frame 0xff834c80dc30 --- syscall (393, FreeBSD ELF64, sys_sendfile), rip = 0x801b26f4c, rsp = 0x7ebf53c8, rbp = 0x7ebf5700 --- db> show lockchain 100597 thread 100597 (pid 7692, httpd) inhibited The "inhibited" is not something I'm familiar with and didn't match the example output; I thought that maybe the T state was overpowering the locks, and that maybe I should continue the system and then -CONT the process. 
However, a few seconds after I issued "c" at the DDB prompt, the system panicked in the console driver ("mtx_lock_spin: recursed on non-recursive mutex cnputs_mtx @ /usr/src/sys/kern/kern_cons.c:500"), so I guess that's not a thing to do. :( Sorry my stupidity and ignorance is dragging this out. :( This is all well outside my comfort zone, but next time I'll get it for sure. Thanks! ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
Re: NFS deadlock on 9.2-Beta1
On Wed, Aug 21, 2013 at 09:08:10PM -0400, Rick Macklem wrote: > Kostik wrote: > > On Tue, Aug 20, 2013 at 06:18:16PM -0400, Rick Macklem wrote: > > > J David wrote: > > > > On Thu, Aug 15, 2013 at 5:39 PM, Rick Macklem > > > > > > > > wrote: > > > > > Have you been able to pass the debugging info on to Kostik? > > > > > > > > > > It would be really nice to get this fixed for FreeBSD9.2. > > > > > > > > You're probably not talking to me, but headway here is slow. At > > > > our > > > > location, we have been continuing to test releng/9.2 extensively, > > > > but > > > > with r250907 reverted. Since reverting it solves the issue, and > > > > since > > > > there haven't been any further changes to releng/9.2 that might > > > > also > > > > resolve this issue, re-applying r250907 is perceived here as > > > > un-fixing > > > > a problem. Enthusiasm for doing so is correspondingly low, even > > > > if > > > > the purpose is to gather debugging info. :( > > > > > > > > However, after finally having clearance to test releng/9.2 > > > > r254540 > > > > with r250907 included and with DDB on five nodes. The problem > > > > cropped > > > > up in about an hour. Two threads in one process deadlocked, was > > > > perfect. Got it into DDB and saw the stack trace was scrolling > > > > off > > > > so > > > > there was no way to copy it by hand. Also, the machine's disk is > > > > smaller than physical RAM, so no dump file. :( > > > > > > > > Here's what is available so far: > > > > > > > > db> show proc 33362 > > > > > > > > Process 33362 (httpd) at 0xcd225b50: > > > > > > > > state: NORMAL > > > > > > > > uid: 25000 gids: 25000 > > > > > > > > parent: pid 25104 at 0xc95f92d4 > > > > > > > > ABI: FreeBSD ELF32 > > > > > > > > arguments: /usr/local/libexec/httpd > > > > > > > > threads: 3 > > > > > > > > 100405 D newnfs 0xc9b875e4 httpd > > > > > > > Ok, so this one is waiting for an NFS vnode lock. > > > > > > > 100393 D pgrbwt 0xc43a30c0 httpd > > > > > > > This one is sleeping in vm_page_grab() { which I suspect has > > > been called from kern_sendfile() with a shared vnode lock held, > > > from what I saw on the previous debug info }. > > > > > > > 100755 S uwait 0xc84b7c80 httpd > > > > > > > > > > > > Not much to go on. :( Maybe these five can be configured with > > > > serial > > > > consoles. > > > > > > > > So, inquiries are continuing, but the answer to "does this still > > > > happen on 9.2-RC2?" is definitely yes. > > > > > > > Since r250027 moves a vn_lock() to before the vm_page_grab() call > > > in > > > kern_sendfile(), I suspect that is the cause of the deadlock. > > > (r250027 > > > is one of the 3 commits MFC'd by r250907) > > > > > > I don't know if it would be safe to VOP_UNLOCK() the vnode after > > > VOP_GETATTR() > > > and then put the vn_lock() call that comes after vm_page_grab() > > > back in or whether > > > r250027 should be reverted (getting rid of the VOP_GETATTR() and > > > going back to > > > using the size in the vm stuff). > > > > > > Hopefully Kostik will know what is best to do with it now, rick > > > > I already described what to do with this. I need the debugging > > information to see what is going on. Without the data, it is only > > wasted time of everybody involved. > > > Sorry, I didn't make what I was asking clear. I was referring specifically > to stopping the hang from occurring in the soon to be released 9.2. > > I think you indirectly answered the question, in that you don't know > of a fix for the hangs without more debugging information. 
This > implies that reverting r250907 is the main option to resolve this > for the 9.2 release (unless more debugging info arrives very soon), > since that is the only fix that has been confirmed to work. > Does this sound reasonable? I do not object against reverting it for 9.2. Please go ahead. On the other hand, I do not want to revert it in stable/9, at least until the cause is understood. > > > Some technical notes. The sendfile() uses shared lock for the > > duration > > of vnode i/o, so any thread which is sleeping on the vnode lock > > cannot > > be in the sendfile path, at least for UFS and NFS which do support > > true > > shared locks. > > > > The right lock order is vnode lock -> page busy wait. From this PoV, > > the ordering in the sendfile is correct. Rick, are you aware of any > > situation where the VOP_READ in nfs client could drop vnode lock > > and then re-acquire it ? I was not able to find this from the code > > inspection. But, if such situation exists, it would be problematic in > > 9. > > > I am not aware of a case where nfs_read() drops/re-acquires the vnode > lock. > > However, readaheads will still be in progress when nfs_read() returns, > so those can still be in progress after the vnode lock is dropped. > > vfs_busy_pages() will have been called on the page(s) that readahead > is in progress on (I think that means the shared busy bit will be set, > if I understo
Re: NFS deadlock on 9.2-Beta1
Now that a kernel with INVARIANTS/WITNESS is finally available on a machine
with serial console I am having terrible trouble provoking this to happen.
(Machine grinds to a halt if I put the usual test load on it due to all the
debug code in the kernel.)

Did get this interesting LOR, though it did not cause a deadlock:

lock order reversal:
 1st 0xfe000adb9f30 so_snd_sx (so_snd_sx) @ /usr/src/sys/kern/uipc_sockbuf.c:145
 2nd 0xfe000aa5b098 newnfs (newnfs) @ /usr/src/sys/kern/uipc_syscalls.c:2062
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xff834c3995c0
kdb_backtrace() at kdb_backtrace+0x39/frame 0xff834c399670
witness_checkorder() at witness_checkorder+0xc0a/frame 0xff834c3996f0
__lockmgr_args() at __lockmgr_args+0x390/frame 0xff834c399810
nfs_lock1() at nfs_lock1+0x87/frame 0xff834c399840
VOP_LOCK1_APV() at VOP_LOCK1_APV+0xbe/frame 0xff834c399870
_vn_lock() at _vn_lock+0x63/frame 0xff834c3998d0
kern_sendfile() at kern_sendfile+0x812/frame 0xff834c399ac0
do_sendfile() at do_sendfile+0x92/frame 0xff834c399b20
amd64_syscall() at amd64_syscall+0x259/frame 0xff834c399c30
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xff834c399c30
--- syscall (393, FreeBSD ELF64, sys_sendfile), rip = 0x801b24f4c, rsp = 0x7fffcf58, rbp = 0x7fffd290 ---

Once the real deal pops up, collecting the full requested info should be no
problem, but it could take awhile to happen with only one machine that can't
run the full test battery. So if a "real" fix is dependent on this, reverting
r250907 for 9.2-RELEASE is probably the way to go. With that configuration,
releng/9.2 continues to be pretty solid for us.

Thanks!

(Since this doesn't contain the requested info, I heavily trimmed the Cc:
list. It is not my intention to waste the time of everybody involved.)
Re: NFS deadlock on 9.2-Beta1
Kostik wrote: > On Tue, Aug 20, 2013 at 06:18:16PM -0400, Rick Macklem wrote: > > J David wrote: > > > On Thu, Aug 15, 2013 at 5:39 PM, Rick Macklem > > > > > > wrote: > > > > Have you been able to pass the debugging info on to Kostik? > > > > > > > > It would be really nice to get this fixed for FreeBSD9.2. > > > > > > You're probably not talking to me, but headway here is slow. At > > > our > > > location, we have been continuing to test releng/9.2 extensively, > > > but > > > with r250907 reverted. Since reverting it solves the issue, and > > > since > > > there haven't been any further changes to releng/9.2 that might > > > also > > > resolve this issue, re-applying r250907 is perceived here as > > > un-fixing > > > a problem. Enthusiasm for doing so is correspondingly low, even > > > if > > > the purpose is to gather debugging info. :( > > > > > > However, after finally having clearance to test releng/9.2 > > > r254540 > > > with r250907 included and with DDB on five nodes. The problem > > > cropped > > > up in about an hour. Two threads in one process deadlocked, was > > > perfect. Got it into DDB and saw the stack trace was scrolling > > > off > > > so > > > there was no way to copy it by hand. Also, the machine's disk is > > > smaller than physical RAM, so no dump file. :( > > > > > > Here's what is available so far: > > > > > > db> show proc 33362 > > > > > > Process 33362 (httpd) at 0xcd225b50: > > > > > > state: NORMAL > > > > > > uid: 25000 gids: 25000 > > > > > > parent: pid 25104 at 0xc95f92d4 > > > > > > ABI: FreeBSD ELF32 > > > > > > arguments: /usr/local/libexec/httpd > > > > > > threads: 3 > > > > > > 100405 D newnfs 0xc9b875e4 httpd > > > > > Ok, so this one is waiting for an NFS vnode lock. > > > > > 100393 D pgrbwt 0xc43a30c0 httpd > > > > > This one is sleeping in vm_page_grab() { which I suspect has > > been called from kern_sendfile() with a shared vnode lock held, > > from what I saw on the previous debug info }. > > > > > 100755 S uwait 0xc84b7c80 httpd > > > > > > > > > Not much to go on. :( Maybe these five can be configured with > > > serial > > > consoles. > > > > > > So, inquiries are continuing, but the answer to "does this still > > > happen on 9.2-RC2?" is definitely yes. > > > > > Since r250027 moves a vn_lock() to before the vm_page_grab() call > > in > > kern_sendfile(), I suspect that is the cause of the deadlock. > > (r250027 > > is one of the 3 commits MFC'd by r250907) > > > > I don't know if it would be safe to VOP_UNLOCK() the vnode after > > VOP_GETATTR() > > and then put the vn_lock() call that comes after vm_page_grab() > > back in or whether > > r250027 should be reverted (getting rid of the VOP_GETATTR() and > > going back to > > using the size in the vm stuff). > > > > Hopefully Kostik will know what is best to do with it now, rick > > I already described what to do with this. I need the debugging > information to see what is going on. Without the data, it is only > wasted time of everybody involved. > Sorry, I didn't make what I was asking clear. I was referring specifically to stopping the hang from occurring in the soon to be released 9.2. I think you indirectly answered the question, in that you don't know of a fix for the hangs without more debugging information. This implies that reverting r250907 is the main option to resolve this for the 9.2 release (unless more debugging info arrives very soon), since that is the only fix that has been confirmed to work. Does this sound reasonable? > Some technical notes. 
The sendfile() uses shared lock for the > duration > of vnode i/o, so any thread which is sleeping on the vnode lock > cannot > be in the sendfile path, at least for UFS and NFS which do support > true > shared locks. > > The right lock order is vnode lock -> page busy wait. From this PoV, > the ordering in the sendfile is correct. Rick, are you aware of any > situation where the VOP_READ in nfs client could drop vnode lock > and then re-acquire it ? I was not able to find this from the code > inspection. But, if such situation exists, it would be problematic in > 9. > I am not aware of a case where nfs_read() drops/re-acquires the vnode lock. However, readaheads will still be in progress when nfs_read() returns, so those can still be in progress after the vnode lock is dropped. vfs_busy_pages() will have been called on the page(s) that readahead is in progress on (I think that means the shared busy bit will be set, if I understood vfs_busy_pages()). When the readahead is completed, bufdone() is called, so I don't understand why the page wouldn't become unbusied (waking up the thread sleeping on "pgrbwt"). I can't see why not being able to acquire the vnode lock would affect this, but my hunch is that it somehow does have this effect, since that is the only way I can see that r250907 would cause the hangs. > Last note. The HEAD dropped pre-busying pages in the sendfile() > syscall. > As I understand
Re: NFS deadlock on 9.2-Beta1
On Wed, Aug 21, 2013 at 08:03:35PM +0200, Yamagi Burmeister wrote: > Could the problem be related to this deadlock / LOR? - > http://lists.freebsd.org/pipermail/freebsd-fs/2013-August/018052.html This is not related. > > My test setup is still in place. Will test with r250907 reverted > tomorrow morning and report back. Additional informations could be > provided if necessary. I just need to know what exactly. Just follow the http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html and collect all the information listed there after the apache/sendfile/nfs deadlock is reproduced. pgpkkcUXOfUil.pgp Description: PGP signature
Re: NFS deadlock on 9.2-Beta1
On Wed, 21 Aug 2013 16:10:32 +0300 Konstantin Belousov wrote: > I already described what to do with this. I need the debugging > information to see what is going on. Without the data, it is only > wasted time of everybody involved. > > Some technical notes. The sendfile() uses shared lock for the duration > of vnode i/o, so any thread which is sleeping on the vnode lock cannot > be in the sendfile path, at least for UFS and NFS which do support true > shared locks. > > The right lock order is vnode lock -> page busy wait. From this PoV, > the ordering in the sendfile is correct. Rick, are you aware of any > situation where the VOP_READ in nfs client could drop vnode lock > and then re-acquire it ? I was not able to find this from the code > inspection. But, if such situation exists, it would be problematic in 9. > > Last note. The HEAD dropped pre-busying pages in the sendfile() syscall. > As I understand, this is because new Attilio' busy implementation cannot > support both busy and sbusy states simultaneously, and vfs_busy_pages()/ > vfs_drain_busy_pages() actually created such situation. I think that > because the sbusy is removed from the sendfile(), and the vm object > lock is dropped, there is no sense to require vm_page_grab() to wait > for the busy state to clean. It is done by buffer cache or filesystem > code later. See the patch at the end. > > Still, I do not know what happens in the supposedly reported deadlock. > > diff --git a/sys/kern/uipc_syscalls.c b/sys/kern/uipc_syscalls.c > index 4797444..b974f53 100644 > --- a/sys/kern/uipc_syscalls.c > +++ b/sys/kern/uipc_syscalls.c > @@ -2230,7 +2230,8 @@ retry_space: > pindex = OFF_TO_IDX(off); > VM_OBJECT_WLOCK(obj); > pg = vm_page_grab(obj, pindex, VM_ALLOC_NOBUSY | > - VM_ALLOC_NORMAL | VM_ALLOC_WIRED | VM_ALLOC_RETRY); > + VM_ALLOC_IGN_SBUSY | VM_ALLOC_NORMAL | > + VM_ALLOC_WIRED | VM_ALLOC_RETRY); > > /* >* Check if page is valid for what we need, Could the problem be related to this deadlock / LOR? - http://lists.freebsd.org/pipermail/freebsd-fs/2013-August/018052.html My test setup is still in place. Will test with r250907 reverted tomorrow morning and report back. Additional informations could be provided if necessary. I just need to know what exactly. Ciao, Yamagi -- Homepage: www.yamagi.org XMPP: yam...@yamagi.org GnuPG/GPG: 0xEFBCCBCB pgpiQ7wf0tTdP.pgp Description: PGP signature
Re: NFS deadlock on 9.2-Beta1
On Tue, Aug 20, 2013 at 06:18:16PM -0400, Rick Macklem wrote: > J David wrote: > > On Thu, Aug 15, 2013 at 5:39 PM, Rick Macklem > > wrote: > > > Have you been able to pass the debugging info on to Kostik? > > > > > > It would be really nice to get this fixed for FreeBSD9.2. > > > > You're probably not talking to me, but headway here is slow. At our > > location, we have been continuing to test releng/9.2 extensively, but > > with r250907 reverted. Since reverting it solves the issue, and > > since > > there haven't been any further changes to releng/9.2 that might also > > resolve this issue, re-applying r250907 is perceived here as > > un-fixing > > a problem. Enthusiasm for doing so is correspondingly low, even if > > the purpose is to gather debugging info. :( > > > > However, after finally having clearance to test releng/9.2 r254540 > > with r250907 included and with DDB on five nodes. The problem > > cropped > > up in about an hour. Two threads in one process deadlocked, was > > perfect. Got it into DDB and saw the stack trace was scrolling off > > so > > there was no way to copy it by hand. Also, the machine's disk is > > smaller than physical RAM, so no dump file. :( > > > > Here's what is available so far: > > > > db> show proc 33362 > > > > Process 33362 (httpd) at 0xcd225b50: > > > > state: NORMAL > > > > uid: 25000 gids: 25000 > > > > parent: pid 25104 at 0xc95f92d4 > > > > ABI: FreeBSD ELF32 > > > > arguments: /usr/local/libexec/httpd > > > > threads: 3 > > > > 100405 D newnfs 0xc9b875e4 httpd > > > Ok, so this one is waiting for an NFS vnode lock. > > > 100393 D pgrbwt 0xc43a30c0 httpd > > > This one is sleeping in vm_page_grab() { which I suspect has > been called from kern_sendfile() with a shared vnode lock held, > from what I saw on the previous debug info }. > > > 100755 S uwait 0xc84b7c80 httpd > > > > > > Not much to go on. :( Maybe these five can be configured with serial > > consoles. > > > > So, inquiries are continuing, but the answer to "does this still > > happen on 9.2-RC2?" is definitely yes. > > > Since r250027 moves a vn_lock() to before the vm_page_grab() call in > kern_sendfile(), I suspect that is the cause of the deadlock. (r250027 > is one of the 3 commits MFC'd by r250907) > > I don't know if it would be safe to VOP_UNLOCK() the vnode after VOP_GETATTR() > and then put the vn_lock() call that comes after vm_page_grab() back in or > whether > r250027 should be reverted (getting rid of the VOP_GETATTR() and going back to > using the size in the vm stuff). > > Hopefully Kostik will know what is best to do with it now, rick I already described what to do with this. I need the debugging information to see what is going on. Without the data, it is only wasted time of everybody involved. Some technical notes. The sendfile() uses shared lock for the duration of vnode i/o, so any thread which is sleeping on the vnode lock cannot be in the sendfile path, at least for UFS and NFS which do support true shared locks. The right lock order is vnode lock -> page busy wait. From this PoV, the ordering in the sendfile is correct. Rick, are you aware of any situation where the VOP_READ in nfs client could drop vnode lock and then re-acquire it ? I was not able to find this from the code inspection. But, if such situation exists, it would be problematic in 9. Last note. The HEAD dropped pre-busying pages in the sendfile() syscall. 
As I understand, this is because Attilio's new busy implementation cannot
support both busy and sbusy states simultaneously, and vfs_busy_pages()/
vfs_drain_busy_pages() actually created such a situation. I think that
because the sbusy is removed from the sendfile(), and the vm object lock is
dropped, there is no sense to require vm_page_grab() to wait for the busy
state to clear. It is done by buffer cache or filesystem code later. See the
patch at the end.

Still, I do not know what happens in the supposedly reported deadlock.

diff --git a/sys/kern/uipc_syscalls.c b/sys/kern/uipc_syscalls.c
index 4797444..b974f53 100644
--- a/sys/kern/uipc_syscalls.c
+++ b/sys/kern/uipc_syscalls.c
@@ -2230,7 +2230,8 @@ retry_space:
 			pindex = OFF_TO_IDX(off);
 			VM_OBJECT_WLOCK(obj);
 			pg = vm_page_grab(obj, pindex, VM_ALLOC_NOBUSY |
-			    VM_ALLOC_NORMAL | VM_ALLOC_WIRED | VM_ALLOC_RETRY);
+			    VM_ALLOC_IGN_SBUSY | VM_ALLOC_NORMAL |
+			    VM_ALLOC_WIRED | VM_ALLOC_RETRY);
 
 			/*
 			 * Check if page is valid for what we need,
Re: NFS deadlock on 9.2-Beta1
On 8/20/13, J David wrote:
> On Thu, Aug 15, 2013 at 5:39 PM, Rick Macklem wrote:
>> Have you been able to pass the debugging info on to Kostik?
>>
>> It would be really nice to get this fixed for FreeBSD9.2.
>
> You're probably not talking to me, but headway here is slow. At our
> location, we have been continuing to test releng/9.2 extensively, but
> with r250907 reverted. Since reverting it solves the issue, and since
> there haven't been any further changes to releng/9.2 that might also
> resolve this issue, re-applying r250907 is perceived here as un-fixing
> a problem. Enthusiasm for doing so is correspondingly low, even if
> the purpose is to gather debugging info. :(
>
> However, after finally having clearance to test releng/9.2 r254540
> with r250907 included and with DDB on five nodes. The problem cropped
> up in about an hour. Two threads in one process deadlocked, was
> perfect. Got it into DDB and saw the stack trace was scrolling off so
> there was no way to copy it by hand. Also, the machine's disk is
> smaller than physical RAM, so no dump file. :(
>
> Here's what is available so far:
>
> db> show proc 33362
> Process 33362 (httpd) at 0xcd225b50:
> state: NORMAL
> uid: 25000 gids: 25000
> parent: pid 25104 at 0xc95f92d4
> ABI: FreeBSD ELF32
> arguments: /usr/local/libexec/httpd
> threads: 3
> 100405 D newnfs 0xc9b875e4 httpd
> 100393 D pgrbwt 0xc43a30c0 httpd
> 100755 S uwait 0xc84b7c80 httpd
>
> Not much to go on. :( Maybe these five can be configured with serial
> consoles.

try this with serial console:

host # script debug-output-file
host # cu -s 9600 -l /dev/ttyU0
~^B
KDB: enter: Break to debugger
[ thread pid 11 tid 15 ]
Stopped at kdb_alt_break_internal+0x17f: movq $0,kdb_why
db> show msgbuf
...
~.
^D

> So, inquiries are continuing, but the answer to "does this still
> happen on 9.2-RC2?" is definitely yes.
>
> Thanks!
Re: NFS deadlock on 9.2-Beta1
J David wrote: > On Thu, Aug 15, 2013 at 5:39 PM, Rick Macklem > wrote: > > Have you been able to pass the debugging info on to Kostik? > > > > It would be really nice to get this fixed for FreeBSD9.2. > > You're probably not talking to me, but headway here is slow. At our > location, we have been continuing to test releng/9.2 extensively, but > with r250907 reverted. Since reverting it solves the issue, and > since > there haven't been any further changes to releng/9.2 that might also > resolve this issue, re-applying r250907 is perceived here as > un-fixing > a problem. Enthusiasm for doing so is correspondingly low, even if > the purpose is to gather debugging info. :( > > However, after finally having clearance to test releng/9.2 r254540 > with r250907 included and with DDB on five nodes. The problem > cropped > up in about an hour. Two threads in one process deadlocked, was > perfect. Got it into DDB and saw the stack trace was scrolling off > so > there was no way to copy it by hand. Also, the machine's disk is > smaller than physical RAM, so no dump file. :( > > Here's what is available so far: > > db> show proc 33362 > > Process 33362 (httpd) at 0xcd225b50: > > state: NORMAL > > uid: 25000 gids: 25000 > > parent: pid 25104 at 0xc95f92d4 > > ABI: FreeBSD ELF32 > > arguments: /usr/local/libexec/httpd > > threads: 3 > > 100405 D newnfs 0xc9b875e4 httpd > Ok, so this one is waiting for an NFS vnode lock. > 100393 D pgrbwt 0xc43a30c0 httpd > This one is sleeping in vm_page_grab() { which I suspect has been called from kern_sendfile() with a shared vnode lock held, from what I saw on the previous debug info }. > 100755 S uwait 0xc84b7c80 httpd > > > Not much to go on. :( Maybe these five can be configured with serial > consoles. > > So, inquiries are continuing, but the answer to "does this still > happen on 9.2-RC2?" is definitely yes. > Since r250027 moves a vn_lock() to before the vm_page_grab() call in kern_sendfile(), I suspect that is the cause of the deadlock. (r250027 is one of the 3 commits MFC'd by r250907) I don't know if it would be safe to VOP_UNLOCK() the vnode after VOP_GETATTR() and then put the vn_lock() call that comes after vm_page_grab() back in or whether r250027 should be reverted (getting rid of the VOP_GETATTR() and going back to using the size in the vm stuff). Hopefully Kostik will know what is best to do with it now, rick > Thanks! > ___ > freebsd-stable@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-stable > To unsubscribe, send any mail to > "freebsd-stable-unsubscr...@freebsd.org" > ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
Re: NFS deadlock on 9.2-Beta1
On Thu, Aug 15, 2013 at 5:39 PM, Rick Macklem wrote: > Have you been able to pass the debugging info on to Kostik? > > It would be really nice to get this fixed for FreeBSD9.2. You're probably not talking to me, but headway here is slow. At our location, we have been continuing to test releng/9.2 extensively, but with r250907 reverted. Since reverting it solves the issue, and since there haven't been any further changes to releng/9.2 that might also resolve this issue, re-applying r250907 is perceived here as un-fixing a problem. Enthusiasm for doing so is correspondingly low, even if the purpose is to gather debugging info. :( However, after finally having clearance to test releng/9.2 r254540 with r250907 included and with DDB on five nodes. The problem cropped up in about an hour. Two threads in one process deadlocked, was perfect. Got it into DDB and saw the stack trace was scrolling off so there was no way to copy it by hand. Also, the machine's disk is smaller than physical RAM, so no dump file. :( Here's what is available so far: db> show proc 33362 Process 33362 (httpd) at 0xcd225b50: state: NORMAL uid: 25000 gids: 25000 parent: pid 25104 at 0xc95f92d4 ABI: FreeBSD ELF32 arguments: /usr/local/libexec/httpd threads: 3 100405 D newnfs 0xc9b875e4 httpd 100393 D pgrbwt 0xc43a30c0 httpd 100755 S uwait 0xc84b7c80 httpd Not much to go on. :( Maybe these five can be configured with serial consoles. So, inquiries are continuing, but the answer to "does this still happen on 9.2-RC2?" is definitely yes. Thanks! ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
Re: NFS deadlock on 9.2-Beta1
Michael Tratz wrote: > > On Jul 27, 2013, at 11:25 PM, Konstantin Belousov > wrote: > > > On Sat, Jul 27, 2013 at 03:13:05PM -0700, Michael Tratz wrote: > >> Let's assume the pid which started the deadlock is 14001 (it will > >> be a different pid when we get the results, because the machine > >> has been restarted) > >> > >> I type: > >> > >> show proc 14001 > >> > >> I get the thread numbers from that output and type: > >> > >> show thread x > >> > >> for each one. > >> > >> And a trace for each thread with the command? > >> > >> tr > >> > >> Anything else I should try to get or do? Or is that not the data > >> at all you are looking for? > >> > > Yes, everything else which is listed in the 'debugging deadlocks' > > page > > must be provided, otherwise the deadlock cannot be tracked. > > > > The investigator should be able to see the whole deadlock chain > > (loop) > > to make any useful advance. > > Ok, I have made some excellent progress in debugging the NFS > deadlock. > > Rick! You are genius. :-) You found the right commit r250907 (dated > May 22) is the definitely the problem. > > Here is how I did the testing: One machine received a kernel before > r250907, the second machine received a kernel after r250907. Sure > enough within a few hours the machine with r250907 went into the > usual deadlock state. The machine without that commit kept on > working fine. Then I went back to the latest revision (r253726), but > leaving r250907 out. The machines have been running happy and rock > solid without any deadlocks. I have expanded the testing to 3 > machines now and no reports of any issues. > > I guess now Konstantin has to figure out why that commit is causing > the deadlock. Lovely! :-) I will get that information as soon as > possible. I'm a little behind with normal work load, but I expect to > have the data by Tuesday evening or Wednesday. > Have you been able to pass the debugging info on to Kostik? It would be really nice to get this fixed for FreeBSD9.2. Thanks for your help with this, rick > Thanks again!! > > Michael > > ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
Re: NFS deadlock on 9.2-Beta1
On Mon, Aug 5, 2013 at 12:06 PM, Mark Saad wrote: > Is there any updates on this issue ? Has anyone tested it or see it happen > on the release candidate ? It's a bit premature for that; the RC has been out for a few hours. We put BETA2 on 25 nodes and only saw the problem on five after 24 hours. At that point we switched to a build that reverts the patch that causes the deadlock and no node on which that was done (at this point, all of them) has had the problem since. We'll get some machines on releng/9.2 today, but I didn't see anything in the release candidate announcement to indicate that relevant changes had been made. Is there anything in the release candidate that is believed to address this issue? If so, let us know which svn revision it's in and we'll try to accelerate test deployment. Thanks! ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
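For anyone wanting to check that themselves, svn can list which head revisions have already been merged into the branch; a rough sketch, using r250907 from this thread as the example revision to look for (whether a particular change shows up depends on merge tracking having been used for the MFC):

# sketch: revisions merged from head into releng/9.2, filtered to one of interest
svn mergeinfo --show-revs merged \
    svn://svn.freebsd.org/base/head \
    svn://svn.freebsd.org/base/releng/9.2 | grep -w r250907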
Re: NFS deadlock on 9.2-Beta1
On Jul 29, 2013, at 10:48 PM, J David wrote: > If it is helpful, we have 25 nodes testing the 9.2-BETA1 build and > without especially trying to exercise this bug, we found > sendfile()-using processes deadlocked in WCHAN newnfs on 5 of the 25 > nodes. The ones with highest uptime (about 3 days) seem most > affected, so it does seem like a "sooner or later" type of thing. > Hopefully the fix is easy and it won't be an issue, but it definitely > does seem like a problem 9.2-RELEASE would be better off without. > > Unfortunately we are not in a position to capture the requested > debugging information at this time; none of those nodes are running a > debug version of the kernel. If Michael is unable to get the > information as he hopes, we can try to do that, possibly over the > weekend. For the time being, we will convert half the machines to > rollback r250907 to try to confirm that resolves the issue. > > Thanks all! If one has to encounter a problem like this, it is nice > to come to the list and find the research already so well underway! > __ All Is there any updates on this issue ? Has anyone tested it or see it happen on the release candidate ? --- Mark saad | mark.s...@longcount.org ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
Re: NFS deadlock on 9.2-Beta1
If it is helpful, we have 25 nodes testing the 9.2-BETA1 build and without especially trying to exercise this bug, we found sendfile()-using processes deadlocked in WCHAN newnfs on 5 of the 25 nodes. The ones with highest uptime (about 3 days) seem most affected, so it does seem like a "sooner or later" type of thing. Hopefully the fix is easy and it won't be an issue, but it definitely does seem like a problem 9.2-RELEASE would be better off without. Unfortunately we are not in a position to capture the requested debugging information at this time; none of those nodes are running a debug version of the kernel. If Michael is unable to get the information as he hopes, we can try to do that, possibly over the weekend. For the time being, we will convert half the machines to rollback r250907 to try to confirm that resolves the issue. Thanks all! If one has to encounter a problem like this, it is nice to come to the list and find the research already so well underway! ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
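As a sketch of how that survey can be scripted across nodes (assumes a stock ps(1); "newnfs" is the wait channel reported above):

# list every kernel-visible thread currently sleeping on the "newnfs" wait channel
ps -axH -o pid,state,wchan,ucomm | awk '$3 == "newnfs"'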
Re: NFS deadlock on 9.2-Beta1
Michael Tratz wrote: > > On Jul 27, 2013, at 11:25 PM, Konstantin Belousov > wrote: > > > On Sat, Jul 27, 2013 at 03:13:05PM -0700, Michael Tratz wrote: > >> Let's assume the pid which started the deadlock is 14001 (it will > >> be a different pid when we get the results, because the machine > >> has been restarted) > >> > >> I type: > >> > >> show proc 14001 > >> > >> I get the thread numbers from that output and type: > >> > >> show thread x > >> > >> for each one. > >> > >> And a trace for each thread with the command? > >> > >> tr > >> > >> Anything else I should try to get or do? Or is that not the data > >> at all you are looking for? > >> > > Yes, everything else which is listed in the 'debugging deadlocks' > > page > > must be provided, otherwise the deadlock cannot be tracked. > > > > The investigator should be able to see the whole deadlock chain > > (loop) > > to make any useful advance. > > Ok, I have made some excellent progress in debugging the NFS > deadlock. > > Rick! You are genius. :-) You found the right commit r250907 (dated > May 22) is the definitely the problem. > Nowhere close, take my word for it;-) (At least you put a smiley after it.) (I've never actually even been employed as a software developer, but that's off topic.) I just got lucky (basically there wasn't any other commit that seemed it might cause this). But, the good news is that it is partially isolated. Hopefully the debugging stuff you get for Kostik will allow him (I suspect he is a genius) to solve the problem. (If I was going to take another "shot in the dark", I'd guess its r250027 moving the vn_lock() call. Maybe calling vm_page_grab() with the shared vnode lock held?) I've added re@ to the cc list, since I think this might be a show stopper for 9.2? Thanks for reporting this and all your help with tracking it down, rick > Here is how I did the testing: One machine received a kernel before > r250907, the second machine received a kernel after r250907. Sure > enough within a few hours the machine with r250907 went into the > usual deadlock state. The machine without that commit kept on > working fine. Then I went back to the latest revision (r253726), but > leaving r250907 out. The machines have been running happy and rock > solid without any deadlocks. I have expanded the testing to 3 > machines now and no reports of any issues. > > I guess now Konstantin has to figure out why that commit is causing > the deadlock. Lovely! :-) I will get that information as soon as > possible. I'm a little behind with normal work load, but I expect to > have the data by Tuesday evening or Wednesday. > > Thanks again!! > > Michael > > ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
Re: NFS deadlock on 9.2-Beta1
On Jul 27, 2013, at 11:25 PM, Konstantin Belousov wrote: > On Sat, Jul 27, 2013 at 03:13:05PM -0700, Michael Tratz wrote: >> Let's assume the pid which started the deadlock is 14001 (it will be a >> different pid when we get the results, because the machine has been >> restarted) >> >> I type: >> >> show proc 14001 >> >> I get the thread numbers from that output and type: >> >> show thread x >> >> for each one. >> >> And a trace for each thread with the command? >> >> tr >> >> Anything else I should try to get or do? Or is that not the data at all you >> are looking for? >> > Yes, everything else which is listed in the 'debugging deadlocks' page > must be provided, otherwise the deadlock cannot be tracked. > > The investigator should be able to see the whole deadlock chain (loop) > to make any useful advance. Ok, I have made some excellent progress in debugging the NFS deadlock. Rick! You are genius. :-) You found the right commit r250907 (dated May 22) is the definitely the problem. Here is how I did the testing: One machine received a kernel before r250907, the second machine received a kernel after r250907. Sure enough within a few hours the machine with r250907 went into the usual deadlock state. The machine without that commit kept on working fine. Then I went back to the latest revision (r253726), but leaving r250907 out. The machines have been running happy and rock solid without any deadlocks. I have expanded the testing to 3 machines now and no reports of any issues. I guess now Konstantin has to figure out why that commit is causing the deadlock. Lovely! :-) I will get that information as soon as possible. I'm a little behind with normal work load, but I expect to have the data by Tuesday evening or Wednesday. Thanks again!! Michael ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
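For anyone repeating that kind of bisection, each test kernel can be produced from a plain checkout plus a reverse merge of the suspect commit; a sketch only, with the branch URL, revisions, and kernel config taken from this thread as assumptions rather than the exact commands used:

# sketch: sources at r253726 with r250907 backed out locally
svn checkout -r 253726 svn://svn.freebsd.org/base/head /usr/src
cd /usr/src
svn merge -c -250907 .        # reverse-merge (revert) the suspect commit
make buildkernel KERNCONF=GENERIC
make installkernel KERNCONF=GENERIC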
Re: NFS deadlock on 9.2-Beta1
On Sat, Jul 27, 2013 at 03:13:05PM -0700, Michael Tratz wrote: > Let's assume the pid which started the deadlock is 14001 (it will be a > different pid when we get the results, because the machine has been restarted) > > I type: > > show proc 14001 > > I get the thread numbers from that output and type: > > show thread x > > for each one. > > And a trace for each thread with the command? > > tr > > Anything else I should try to get or do? Or is that not the data at all you > are looking for? > Yes, everything else which is listed in the 'debugging deadlocks' page must be provided, otherwise the deadlock cannot be tracked. The investigator should be able to see the whole deadlock chain (loop) to make any useful advance.
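For reference, the DDB session that page asks for looks roughly like the following, in addition to the per-pid commands above (command names per ddb(4); the handbook's exact list may differ slightly):

db> ps
db> show allpcpu
db> show alllocks
db> show lockedvnods
db> alltrace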
Re: NFS deadlock on 9.2-Beta1
On Jul 27, 2013, at 1:58 PM, Konstantin Belousov wrote: > On Sat, Jul 27, 2013 at 04:20:49PM -0400, Rick Macklem wrote: >> Michael Tratz wrote: >>> >>> On Jul 24, 2013, at 5:25 PM, Rick Macklem >>> wrote: >>> Michael Tratz wrote: > Two machines (NFS Server: running ZFS / Client: disk-less), both > are > running FreeBSD r253506. The NFS client starts to deadlock > processes > within a few hours. It usually gets worse from there on. The > processes stay in "D" state. I haven't been able to reproduce it > when I want it to happen. I only have to wait a few hours until > the > deadlocks occur when traffic to the client machine starts to pick > up. The only way to fix the deadlocks is to reboot the client. > Even > an ls to the path which is deadlocked, will deadlock ls itself. > It's > totally random what part of the file system gets deadlocked. The > NFS > server itself has no problem at all to access the files/path when > something is deadlocked on the client. > > Last night I decided to put an older kernel on the system r252025 > (June 20th). The NFS server stayed untouched. So far 0 deadlocks > on > the client machine (it should have deadlocked by now). FreeBSD is > working hard like it always does. :-) There are a few changes to > the > NFS code from the revision which seems to work until Beta1. I > haven't tried to narrow it down if one of those commits are > causing > the problem. Maybe someone has an idea what could be wrong and I > can > test a patch or if it's something else, because I'm not a kernel > expert. :-) > Well, the only NFS client change committed between r252025 and r253506 is r253124. It fixes a file corruption problem caused by a previous commit that delayed the vnode_pager_setsize() call until after the nfs node mutex lock was unlocked. If you can test with only r253124 reverted to see if that gets rid of the hangs, it would be useful, although from the procstats, I doubt it. > I have run several procstat -kk on the processes including the ls > which deadlocked. You can see them here: > > http://pastebin.com/1RPnFT6r All the processes you show seem to be stuck waiting for a vnode lock or in __utmx_op_wait. (I`m not sure what the latter means.) What is missing is what processes are holding the vnode locks and what they are stuck on. A starting point might be ``ps axhl``, to see what all the threads are doing (particularily the WCHAN for them all). If you can drop into the debugger when the NFS mounts are hung and do a ```show alllocks`` that could help. See: http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html I`ll admit I`d be surprised if r253124 caused this, but who knows. If there have been changes to your network device driver between r252025 and r253506, I`d try reverting those. (If an RPC gets stuck waiting for a reply while holding a vnode lock, that would do it.) Good luck with it and maybe someone else can think of a commit between r252025 and r253506 that could cause vnode locking or network problems. rick > > I have tried to mount the file system with and without nolockd. It > didn't make a difference. Other than that it is mounted with: > > rw,nfsv3,tcp,noatime,rsize=32768,wsize=32768 > > Let me know if you need me to do something else or if some other > output is required. I would have to go back to the problem kernel > and wait until the deadlock occurs to get that information. > >>> >>> Thanks Rick and Steven for your quick replies. >>> >>> I spoke too soon regarding r252025 fixing the problem. 
The same issue >>> started to show up after about 1 day and a few hours of uptime. >>> >>> "ps axhl" shows all those stuck processes in newnfs >>> >>> I recompiled the GENERIC kernel for Beta1 with the debugging options: >>> >>> http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html >>> >>> ps and debugging output: >>> >>> http://pastebin.com/1v482Dfw >>> >>> (I only listed processes matching newnfs, if you need the whole list, >>> please let me know) >>> >> Is your "show alllocks" complete? If not, a complete list of locks >> would definitely help. As for "ps axhl", a complete list of processes/threads >> might be useful, but not as much, I think. >> >>> The first PID showing up having that problem is 14001. Certainly the >>> "show alllocks" command shows interesting information for that PID. >>> I looked through the commit history for those files mentioned in the >>> output to see if there is something obvious to me. But I don't know. >>> :-) >>> I hope that information helps you to
Re: NFS deadlock on 9.2-Beta1
On Jul 27, 2013, at 1:20 PM, Rick Macklem wrote: > Michael Tratz wrote: >> >> On Jul 24, 2013, at 5:25 PM, Rick Macklem >> wrote: >> >>> Michael Tratz wrote: Two machines (NFS Server: running ZFS / Client: disk-less), both are running FreeBSD r253506. The NFS client starts to deadlock processes within a few hours. It usually gets worse from there on. The processes stay in "D" state. I haven't been able to reproduce it when I want it to happen. I only have to wait a few hours until the deadlocks occur when traffic to the client machine starts to pick up. The only way to fix the deadlocks is to reboot the client. Even an ls to the path which is deadlocked, will deadlock ls itself. It's totally random what part of the file system gets deadlocked. The NFS server itself has no problem at all to access the files/path when something is deadlocked on the client. Last night I decided to put an older kernel on the system r252025 (June 20th). The NFS server stayed untouched. So far 0 deadlocks on the client machine (it should have deadlocked by now). FreeBSD is working hard like it always does. :-) There are a few changes to the NFS code from the revision which seems to work until Beta1. I haven't tried to narrow it down if one of those commits are causing the problem. Maybe someone has an idea what could be wrong and I can test a patch or if it's something else, because I'm not a kernel expert. :-) >>> Well, the only NFS client change committed between r252025 and >>> r253506 >>> is r253124. It fixes a file corruption problem caused by a previous >>> commit that delayed the vnode_pager_setsize() call until after the >>> nfs node mutex lock was unlocked. >>> >>> If you can test with only r253124 reverted to see if that gets rid >>> of >>> the hangs, it would be useful, although from the procstats, I doubt >>> it. >>> I have run several procstat -kk on the processes including the ls which deadlocked. You can see them here: http://pastebin.com/1RPnFT6r >>> >>> All the processes you show seem to be stuck waiting for a vnode >>> lock >>> or in __utmx_op_wait. (I`m not sure what the latter means.) >>> >>> What is missing is what processes are holding the vnode locks and >>> what they are stuck on. >>> >>> A starting point might be ``ps axhl``, to see what all the threads >>> are doing (particularily the WCHAN for them all). If you can drop >>> into >>> the debugger when the NFS mounts are hung and do a ```show >>> alllocks`` >>> that could help. See: >>> http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html >>> >>> I`ll admit I`d be surprised if r253124 caused this, but who knows. >>> >>> If there have been changes to your network device driver between >>> r252025 and r253506, I`d try reverting those. (If an RPC gets stuck >>> waiting for a reply while holding a vnode lock, that would do it.) >>> >>> Good luck with it and maybe someone else can think of a commit >>> between r252025 and r253506 that could cause vnode locking or >>> network >>> problems. >>> >>> rick >>> I have tried to mount the file system with and without nolockd. It didn't make a difference. Other than that it is mounted with: rw,nfsv3,tcp,noatime,rsize=32768,wsize=32768 Let me know if you need me to do something else or if some other output is required. I would have to go back to the problem kernel and wait until the deadlock occurs to get that information. >> >> Thanks Rick and Steven for your quick replies. >> >> I spoke too soon regarding r252025 fixing the problem. 
The same issue >> started to show up after about 1 day and a few hours of uptime. >> >> "ps axhl" shows all those stuck processes in newnfs >> >> I recompiled the GENERIC kernel for Beta1 with the debugging options: >> >> http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html >> >> ps and debugging output: >> >> http://pastebin.com/1v482Dfw >> >> (I only listed processes matching newnfs, if you need the whole list, >> please let me know) >> > Is your "show alllocks" complete? If not, a complete list of locks > would definitely help. As for "ps axhl", a complete list of processes/threads > might be useful, but not as much, I think. Yes that was the entire output for show alllocks. > >> The first PID showing up having that problem is 14001. Certainly the >> "show alllocks" command shows interesting information for that PID. >> I looked through the commit history for those files mentioned in the >> output to see if there is something obvious to me. But I don't know. >> :-) >> I hope that information helps you to dig deeper into the issue what >> might be causing those deadlocks. >> > Well, pid 14001 is interesting in that it holds both the sleep
Re: NFS deadlock on 9.2-Beta1
On Sat, Jul 27, 2013 at 04:20:49PM -0400, Rick Macklem wrote: > Michael Tratz wrote: > > > > On Jul 24, 2013, at 5:25 PM, Rick Macklem > > wrote: > > > > > Michael Tratz wrote: > > >> Two machines (NFS Server: running ZFS / Client: disk-less), both > > >> are > > >> running FreeBSD r253506. The NFS client starts to deadlock > > >> processes > > >> within a few hours. It usually gets worse from there on. The > > >> processes stay in "D" state. I haven't been able to reproduce it > > >> when I want it to happen. I only have to wait a few hours until > > >> the > > >> deadlocks occur when traffic to the client machine starts to pick > > >> up. The only way to fix the deadlocks is to reboot the client. > > >> Even > > >> an ls to the path which is deadlocked, will deadlock ls itself. > > >> It's > > >> totally random what part of the file system gets deadlocked. The > > >> NFS > > >> server itself has no problem at all to access the files/path when > > >> something is deadlocked on the client. > > >> > > >> Last night I decided to put an older kernel on the system r252025 > > >> (June 20th). The NFS server stayed untouched. So far 0 deadlocks > > >> on > > >> the client machine (it should have deadlocked by now). FreeBSD is > > >> working hard like it always does. :-) There are a few changes to > > >> the > > >> NFS code from the revision which seems to work until Beta1. I > > >> haven't tried to narrow it down if one of those commits are > > >> causing > > >> the problem. Maybe someone has an idea what could be wrong and I > > >> can > > >> test a patch or if it's something else, because I'm not a kernel > > >> expert. :-) > > >> > > > Well, the only NFS client change committed between r252025 and > > > r253506 > > > is r253124. It fixes a file corruption problem caused by a previous > > > commit that delayed the vnode_pager_setsize() call until after the > > > nfs node mutex lock was unlocked. > > > > > > If you can test with only r253124 reverted to see if that gets rid > > > of > > > the hangs, it would be useful, although from the procstats, I doubt > > > it. > > > > > >> I have run several procstat -kk on the processes including the ls > > >> which deadlocked. You can see them here: > > >> > > >> http://pastebin.com/1RPnFT6r > > > > > > All the processes you show seem to be stuck waiting for a vnode > > > lock > > > or in __utmx_op_wait. (I`m not sure what the latter means.) > > > > > > What is missing is what processes are holding the vnode locks and > > > what they are stuck on. > > > > > > A starting point might be ``ps axhl``, to see what all the threads > > > are doing (particularily the WCHAN for them all). If you can drop > > > into > > > the debugger when the NFS mounts are hung and do a ```show > > > alllocks`` > > > that could help. See: > > > http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html > > > > > > I`ll admit I`d be surprised if r253124 caused this, but who knows. > > > > > > If there have been changes to your network device driver between > > > r252025 and r253506, I`d try reverting those. (If an RPC gets stuck > > > waiting for a reply while holding a vnode lock, that would do it.) > > > > > > Good luck with it and maybe someone else can think of a commit > > > between r252025 and r253506 that could cause vnode locking or > > > network > > > problems. > > > > > > rick > > > > > >> > > >> I have tried to mount the file system with and without nolockd. It > > >> didn't make a difference. 
Other than that it is mounted with: > > >> > > >> rw,nfsv3,tcp,noatime,rsize=32768,wsize=32768 > > >> > > >> Let me know if you need me to do something else or if some other > > >> output is required. I would have to go back to the problem kernel > > >> and wait until the deadlock occurs to get that information. > > >> > > > > Thanks Rick and Steven for your quick replies. > > > > I spoke too soon regarding r252025 fixing the problem. The same issue > > started to show up after about 1 day and a few hours of uptime. > > > > "ps axhl" shows all those stuck processes in newnfs > > > > I recompiled the GENERIC kernel for Beta1 with the debugging options: > > > > http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html > > > > ps and debugging output: > > > > http://pastebin.com/1v482Dfw > > > > (I only listed processes matching newnfs, if you need the whole list, > > please let me know) > > > Is your "show alllocks" complete? If not, a complete list of locks > would definitely help. As for "ps axhl", a complete list of processes/threads > might be useful, but not as much, I think. > > > The first PID showing up having that problem is 14001. Certainly the > > "show alllocks" command shows interesting information for that PID. > > I looked through the commit history for those files mentioned in the > > output to see if there is something obvious to me. But I don't know. > > :-) > > I hope that inf
Re: NFS deadlock on 9.2-Beta1
Michael Tratz wrote: > > On Jul 24, 2013, at 5:25 PM, Rick Macklem > wrote: > > > Michael Tratz wrote: > >> Two machines (NFS Server: running ZFS / Client: disk-less), both > >> are > >> running FreeBSD r253506. The NFS client starts to deadlock > >> processes > >> within a few hours. It usually gets worse from there on. The > >> processes stay in "D" state. I haven't been able to reproduce it > >> when I want it to happen. I only have to wait a few hours until > >> the > >> deadlocks occur when traffic to the client machine starts to pick > >> up. The only way to fix the deadlocks is to reboot the client. > >> Even > >> an ls to the path which is deadlocked, will deadlock ls itself. > >> It's > >> totally random what part of the file system gets deadlocked. The > >> NFS > >> server itself has no problem at all to access the files/path when > >> something is deadlocked on the client. > >> > >> Last night I decided to put an older kernel on the system r252025 > >> (June 20th). The NFS server stayed untouched. So far 0 deadlocks > >> on > >> the client machine (it should have deadlocked by now). FreeBSD is > >> working hard like it always does. :-) There are a few changes to > >> the > >> NFS code from the revision which seems to work until Beta1. I > >> haven't tried to narrow it down if one of those commits are > >> causing > >> the problem. Maybe someone has an idea what could be wrong and I > >> can > >> test a patch or if it's something else, because I'm not a kernel > >> expert. :-) > >> > > Well, the only NFS client change committed between r252025 and > > r253506 > > is r253124. It fixes a file corruption problem caused by a previous > > commit that delayed the vnode_pager_setsize() call until after the > > nfs node mutex lock was unlocked. > > > > If you can test with only r253124 reverted to see if that gets rid > > of > > the hangs, it would be useful, although from the procstats, I doubt > > it. > > > >> I have run several procstat -kk on the processes including the ls > >> which deadlocked. You can see them here: > >> > >> http://pastebin.com/1RPnFT6r > > > > All the processes you show seem to be stuck waiting for a vnode > > lock > > or in __utmx_op_wait. (I`m not sure what the latter means.) > > > > What is missing is what processes are holding the vnode locks and > > what they are stuck on. > > > > A starting point might be ``ps axhl``, to see what all the threads > > are doing (particularily the WCHAN for them all). If you can drop > > into > > the debugger when the NFS mounts are hung and do a ```show > > alllocks`` > > that could help. See: > > http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html > > > > I`ll admit I`d be surprised if r253124 caused this, but who knows. > > > > If there have been changes to your network device driver between > > r252025 and r253506, I`d try reverting those. (If an RPC gets stuck > > waiting for a reply while holding a vnode lock, that would do it.) > > > > Good luck with it and maybe someone else can think of a commit > > between r252025 and r253506 that could cause vnode locking or > > network > > problems. > > > > rick > > > >> > >> I have tried to mount the file system with and without nolockd. It > >> didn't make a difference. Other than that it is mounted with: > >> > >> rw,nfsv3,tcp,noatime,rsize=32768,wsize=32768 > >> > >> Let me know if you need me to do something else or if some other > >> output is required. 
I would have to go back to the problem kernel > >> and wait until the deadlock occurs to get that information. > >> > > Thanks Rick and Steven for your quick replies. > > I spoke too soon regarding r252025 fixing the problem. The same issue > started to show up after about 1 day and a few hours of uptime. > > "ps axhl" shows all those stuck processes in newnfs > > I recompiled the GENERIC kernel for Beta1 with the debugging options: > > http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html > > ps and debugging output: > > http://pastebin.com/1v482Dfw > > (I only listed processes matching newnfs, if you need the whole list, > please let me know) > Is your "show alllocks" complete? If not, a complete list of locks would definitely help. As for "ps axhl", a complete list of processes/threads might be useful, but not as much, I think. > The first PID showing up having that problem is 14001. Certainly the > "show alllocks" command shows interesting information for that PID. > I looked through the commit history for those files mentioned in the > output to see if there is something obvious to me. But I don't know. > :-) > I hope that information helps you to dig deeper into the issue what > might be causing those deadlocks. > Well, pid 14001 is interesting in that it holds both the sleep lock acquired by sblock() and an NFS vnode lock, but is waiting for another NFS vnode lock, if I read the pastebin stuff correctly. I suspect that t
Re: NFS deadlock on 9.2-Beta1
> > On Jul 24, 2013, at 5:25 PM, Rick Macklem wrote: > > > Michael Tratz wrote: > >> Two machines (NFS Server: running ZFS / Client: disk-less), both are > >> running FreeBSD r253506. The NFS client starts to deadlock processes > >> within a few hours. It usually gets worse from there on. The > >> processes stay in "D" state. I haven't been able to reproduce it > >> when I want it to happen. I only have to wait a few hours until the > >> deadlocks occur when traffic to the client machine starts to pick > >> up. The only way to fix the deadlocks is to reboot the client. Even > >> an ls to the path which is deadlocked, will deadlock ls itself. It's > >> totally random what part of the file system gets deadlocked. The NFS > >> server itself has no problem at all to access the files/path when > >> something is deadlocked on the client. > >> > >> Last night I decided to put an older kernel on the system r252025 > >> (June 20th). The NFS server stayed untouched. So far 0 deadlocks on > >> the client machine (it should have deadlocked by now). FreeBSD is > >> working hard like it always does. :-) There are a few changes to the > >> NFS code from the revision which seems to work until Beta1. I > >> haven't tried to narrow it down if one of those commits are causing > >> the problem. Maybe someone has an idea what could be wrong and I can > >> test a patch or if it's something else, because I'm not a kernel > >> expert. :-) > >> > > Well, the only NFS client change committed between r252025 and r253506 > > is r253124. It fixes a file corruption problem caused by a previous > > commit that delayed the vnode_pager_setsize() call until after the > > nfs node mutex lock was unlocked. > > > > If you can test with only r253124 reverted to see if that gets rid of > > the hangs, it would be useful, although from the procstats, I doubt it. > > > >> I have run several procstat -kk on the processes including the ls > >> which deadlocked. You can see them here: > >> > >> http://pastebin.com/1RPnFT6r > > > > All the processes you show seem to be stuck waiting for a vnode lock > > or in __utmx_op_wait. (I`m not sure what the latter means.) > > > > What is missing is what processes are holding the vnode locks and > > what they are stuck on. > > > > A starting point might be ``ps axhl``, to see what all the threads > > are doing (particularily the WCHAN for them all). If you can drop into > > the debugger when the NFS mounts are hung and do a ```show alllocks`` > > that could help. See: > > http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html > > > > I`ll admit I`d be surprised if r253124 caused this, but who knows. > > > > If there have been changes to your network device driver between > > r252025 and r253506, I`d try reverting those. (If an RPC gets stuck > > waiting for a reply while holding a vnode lock, that would do it.) > > > > Good luck with it and maybe someone else can think of a commit > > between r252025 and r253506 that could cause vnode locking or network > > problems. > > > > rick > > > >> > >> I have tried to mount the file system with and without nolockd. It > >> didn't make a difference. Other than that it is mounted with: > >> > >> rw,nfsv3,tcp,noatime,rsize=32768,wsize=32768 > >> > >> Let me know if you need me to do something else or if some other > >> output is required. I would have to go back to the problem kernel > >> and wait until the deadlock occurs to get that information. > >> > > Thanks Rick and Steven for your quick replies. 
> > I spoke too soon regarding r252025 fixing the problem. The same issue started > to show up after about 1 day and a few hours of uptime. > > "ps axhl" shows all those stuck processes in newnfs > > I recompiled the GENERIC kernel for Beta1 with the debugging options: > > http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html > > ps and debugging output: > > http://pastebin.com/1v482Dfw > > (I only listed processes matching newnfs, if you need the whole list, please > let me know) > > The first PID showing up having that problem is 14001. Certainly the "show > alllocks" command shows interesting information for that PID. > I looked through the commit history for those files mentioned in the output > to see if there is something obvious to me. But I don't know. :-) > I hope that information helps you to dig deeper into the issue what might be > causing those deadlocks. > > I did include the pciconf -lv, because you mentioned network device drivers. > It's Intel igb. The same hardware is running a kernel from January 19th, 2013 > also as an NFS client. That machine is rock solid. No problems at all. > > I also went to r251611. That's before r251641 (The NFS FHA changes). Same > problem. Here is another debugging output from that kernel: > > http://pastebin.com/ryv8BYc4 > > If I should test something else or provide some other output, please let me > know.
Re: NFS deadlock on 9.2-Beta1
On Jul 24, 2013, at 5:25 PM, Rick Macklem wrote: > Michael Tratz wrote: >> Two machines (NFS Server: running ZFS / Client: disk-less), both are >> running FreeBSD r253506. The NFS client starts to deadlock processes >> within a few hours. It usually gets worse from there on. The >> processes stay in "D" state. I haven't been able to reproduce it >> when I want it to happen. I only have to wait a few hours until the >> deadlocks occur when traffic to the client machine starts to pick >> up. The only way to fix the deadlocks is to reboot the client. Even >> an ls to the path which is deadlocked, will deadlock ls itself. It's >> totally random what part of the file system gets deadlocked. The NFS >> server itself has no problem at all to access the files/path when >> something is deadlocked on the client. >> >> Last night I decided to put an older kernel on the system r252025 >> (June 20th). The NFS server stayed untouched. So far 0 deadlocks on >> the client machine (it should have deadlocked by now). FreeBSD is >> working hard like it always does. :-) There are a few changes to the >> NFS code from the revision which seems to work until Beta1. I >> haven't tried to narrow it down if one of those commits are causing >> the problem. Maybe someone has an idea what could be wrong and I can >> test a patch or if it's something else, because I'm not a kernel >> expert. :-) >> > Well, the only NFS client change committed between r252025 and r253506 > is r253124. It fixes a file corruption problem caused by a previous > commit that delayed the vnode_pager_setsize() call until after the > nfs node mutex lock was unlocked. > > If you can test with only r253124 reverted to see if that gets rid of > the hangs, it would be useful, although from the procstats, I doubt it. > >> I have run several procstat -kk on the processes including the ls >> which deadlocked. You can see them here: >> >> http://pastebin.com/1RPnFT6r > > All the processes you show seem to be stuck waiting for a vnode lock > or in __utmx_op_wait. (I`m not sure what the latter means.) > > What is missing is what processes are holding the vnode locks and > what they are stuck on. > > A starting point might be ``ps axhl``, to see what all the threads > are doing (particularily the WCHAN for them all). If you can drop into > the debugger when the NFS mounts are hung and do a ```show alllocks`` > that could help. See: > http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html > > I`ll admit I`d be surprised if r253124 caused this, but who knows. > > If there have been changes to your network device driver between > r252025 and r253506, I`d try reverting those. (If an RPC gets stuck > waiting for a reply while holding a vnode lock, that would do it.) > > Good luck with it and maybe someone else can think of a commit > between r252025 and r253506 that could cause vnode locking or network > problems. > > rick > >> >> I have tried to mount the file system with and without nolockd. It >> didn't make a difference. Other than that it is mounted with: >> >> rw,nfsv3,tcp,noatime,rsize=32768,wsize=32768 >> >> Let me know if you need me to do something else or if some other >> output is required. I would have to go back to the problem kernel >> and wait until the deadlock occurs to get that information. >> Thanks Rick and Steven for your quick replies. I spoke too soon regarding r252025 fixing the problem. The same issue started to show up after about 1 day and a few hours of uptime. 
"ps axhl" shows all those stuck processes in newnfs I recompiled the GENERIC kernel for Beta1 with the debugging options: http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html ps and debugging output: http://pastebin.com/1v482Dfw (I only listed processes matching newnfs, if you need the whole list, please let me know) The first PID showing up having that problem is 14001. Certainly the "show alllocks" command shows interesting information for that PID. I looked through the commit history for those files mentioned in the output to see if there is something obvious to me. But I don't know. :-) I hope that information helps you to dig deeper into the issue what might be causing those deadlocks. I did include the pciconf -lv, because you mentioned network device drivers. It's Intel igb. The same hardware is running a kernel from January 19th, 2013 also as an NFS client. That machine is rock solid. No problems at all. I also went to r251611. That's before r251641 (The NFS FHA changes). Same problem. Here is another debugging output from that kernel: http://pastebin.com/ryv8BYc4 If I should test something else or provide some other output, please let me know. Again thank you! Michael ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-sta
Re: NFS deadlock on 9.2-Beta1
- Original Message - From: "Rick Macklem" To: "Michael Tratz" Cc: Sent: Thursday, July 25, 2013 1:25 AM Subject: Re: NFS deadlock on 9.2-Beta1 Michael Tratz wrote: Two machines (NFS Server: running ZFS / Client: disk-less), both are running FreeBSD r253506. The NFS client starts to deadlock processes within a few hours. It usually gets worse from there on. The processes stay in "D" state. I haven't been able to reproduce it when I want it to happen. I only have to wait a few hours until the deadlocks occur when traffic to the client machine starts to pick up. The only way to fix the deadlocks is to reboot the client. Even an ls to the path which is deadlocked, will deadlock ls itself. It's totally random what part of the file system gets deadlocked. The NFS server itself has no problem at all to access the files/path when something is deadlocked on the client. Last night I decided to put an older kernel on the system r252025 (June 20th). The NFS server stayed untouched. So far 0 deadlocks on the client machine (it should have deadlocked by now). FreeBSD is working hard like it always does. :-) There are a few changes to the NFS code from the revision which seems to work until Beta1. I haven't tried to narrow it down if one of those commits are causing the problem. Maybe someone has an idea what could be wrong and I can test a patch or if it's something else, because I'm not a kernel expert. :-) Well, the only NFS client change committed between r252025 and r253506 is r253124. It fixes a file corruption problem caused by a previous commit that delayed the vnode_pager_setsize() call until after the nfs node mutex lock was unlocked. If you can test with only r253124 reverted to see if that gets rid of the hangs, it would be useful, although from the procstats, I doubt it. I have run several procstat -kk on the processes including the ls which deadlocked. You can see them here: http://pastebin.com/1RPnFT6r All the processes you show seem to be stuck waiting for a vnode lock or in __utmx_op_wait. (I`m not sure what the latter means.) What is missing is what processes are holding the vnode locks and what they are stuck on. A starting point might be ``ps axhl``, to see what all the threads are doing (particularily the WCHAN for them all). If you can drop into the debugger when the NFS mounts are hung and do a ```show alllocks`` that could help. See: http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html I`ll admit I`d be surprised if r253124 caused this, but who knows. If there have been changes to your network device driver between r252025 and r253506, I`d try reverting those. (If an RPC gets stuck waiting for a reply while holding a vnode lock, that would do it.) Good luck with it and maybe someone else can think of a commit between r252025 and r253506 that could cause vnode locking or network problems. You could break to the debugger when it happens and run: show sleepchain and show lockchain to see whats waiting on what. Regards Steve This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it. In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337 or return the E.mail to postmas...@multiplay.co.uk. 
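A short sketch of what that looks like in practice; the thread id is hypothetical (use the tid of a thread stuck in "D" state, e.g. from "show proc"):

# from a root shell on the hung client, force entry into the kernel debugger
sysctl debug.kdb.enter=1

db> show sleepchain 100405    # hypothetical tid
db> show lockchain 100405
db> c                         # resume the system when done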
___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
Re: NFS deadlock on 9.2-Beta1
Michael Tratz wrote: > Two machines (NFS Server: running ZFS / Client: disk-less), both are > running FreeBSD r253506. The NFS client starts to deadlock processes > within a few hours. It usually gets worse from there on. The > processes stay in "D" state. I haven't been able to reproduce it > when I want it to happen. I only have to wait a few hours until the > deadlocks occur when traffic to the client machine starts to pick > up. The only way to fix the deadlocks is to reboot the client. Even > an ls to the path which is deadlocked, will deadlock ls itself. It's > totally random what part of the file system gets deadlocked. The NFS > server itself has no problem at all to access the files/path when > something is deadlocked on the client. > > Last night I decided to put an older kernel on the system r252025 > (June 20th). The NFS server stayed untouched. So far 0 deadlocks on > the client machine (it should have deadlocked by now). FreeBSD is > working hard like it always does. :-) There are a few changes to the > NFS code from the revision which seems to work until Beta1. I > haven't tried to narrow it down if one of those commits are causing > the problem. Maybe someone has an idea what could be wrong and I can > test a patch or if it's something else, because I'm not a kernel > expert. :-) > Well, the only NFS client change committed between r252025 and r253506 is r253124. It fixes a file corruption problem caused by a previous commit that delayed the vnode_pager_setsize() call until after the nfs node mutex lock was unlocked. If you can test with only r253124 reverted to see if that gets rid of the hangs, it would be useful, although from the procstats, I doubt it. > I have run several procstat -kk on the processes including the ls > which deadlocked. You can see them here: > > http://pastebin.com/1RPnFT6r All the processes you show seem to be stuck waiting for a vnode lock or in __utmx_op_wait. (I`m not sure what the latter means.) What is missing is what processes are holding the vnode locks and what they are stuck on. A starting point might be ``ps axhl``, to see what all the threads are doing (particularily the WCHAN for them all). If you can drop into the debugger when the NFS mounts are hung and do a ```show alllocks`` that could help. See: http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html I`ll admit I`d be surprised if r253124 caused this, but who knows. If there have been changes to your network device driver between r252025 and r253506, I`d try reverting those. (If an RPC gets stuck waiting for a reply while holding a vnode lock, that would do it.) Good luck with it and maybe someone else can think of a commit between r252025 and r253506 that could cause vnode locking or network problems. rick > > I have tried to mount the file system with and without nolockd. It > didn't make a difference. Other than that it is mounted with: > > rw,nfsv3,tcp,noatime,rsize=32768,wsize=32768 > > Let me know if you need me to do something else or if some other > output is required. I would have to go back to the problem kernel > and wait until the deadlock occurs to get that information. 
> > Thanks for your help, > > Michael > > > ___ > freebsd-stable@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-stable > To unsubscribe, send any mail to > "freebsd-stable-unsubscr...@freebsd.org" > ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
NFS deadlock on 9.2-Beta1
Two machines (NFS Server: running ZFS / Client: disk-less), both are running FreeBSD r253506. The NFS client starts to deadlock processes within a few hours. It usually gets worse from there on. The processes stay in "D" state. I haven't been able to reproduce it when I want it to happen. I only have to wait a few hours until the deadlocks occur when traffic to the client machine starts to pick up. The only way to fix the deadlocks is to reboot the client. Even an ls to the path which is deadlocked, will deadlock ls itself. It's totally random what part of the file system gets deadlocked. The NFS server itself has no problem at all to access the files/path when something is deadlocked on the client. Last night I decided to put an older kernel on the system r252025 (June 20th). The NFS server stayed untouched. So far 0 deadlocks on the client machine (it should have deadlocked by now). FreeBSD is working hard like it always does. :-) There are a few changes to the NFS code from the revision which seems to work until Beta1. I haven't tried to narrow it down if one of those commits are causing the problem. Maybe someone has an idea what could be wrong and I can test a patch or if it's something else, because I'm not a kernel expert. :-) I have run several procstat -kk on the processes including the ls which deadlocked. You can see them here: http://pastebin.com/1RPnFT6r I have tried to mount the file system with and without nolockd. It didn't make a difference. Other than that it is mounted with: rw,nfsv3,tcp,noatime,rsize=32768,wsize=32768 Let me know if you need me to do something else or if some other output is required. I would have to go back to the problem kernel and wait until the deadlock occurs to get that information. Thanks for your help, Michael ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
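For reference, an fstab entry matching those options and the procstat invocation mentioned above would look roughly like this; the server name, export path, and mount point are placeholders:

# hypothetical /etc/fstab line with the options quoted above
nfsserver:/export/data  /data  nfs  rw,nfsv3,tcp,noatime,rsize=32768,wsize=32768  0  0

# kernel stacks for every thread of a stuck process (14001 is the example pid from this thread)
procstat -kk 14001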