Re: NFS deadlock on 9.2-Beta1

2013-08-25 Thread Michael Tratz

On Aug 15, 2013, at 2:39 PM, Rick Macklem rmack...@uoguelph.ca wrote:

 Michael Tratz wrote:
 
 On Jul 27, 2013, at 11:25 PM, Konstantin Belousov
 kostik...@gmail.com wrote:
 
 On Sat, Jul 27, 2013 at 03:13:05PM -0700, Michael Tratz wrote:
 Let's assume the pid which started the deadlock is 14001 (it will
 be a different pid when we get the results, because the machine
 has been restarted)
 
 I type:
 
 show proc 14001
 
 I get the thread numbers from that output and type:
 
 show thread x
 
 for each one.
 
 And a trace for each thread with the command?
 
 tr 
 
 Anything else I should try to get or do? Or is that not the data
 at all you are looking for?
 
 Yes, everything else which is listed in the 'debugging deadlocks'
 page
 must be provided, otherwise the deadlock cannot be tracked.
 
 The investigator should be able to see the whole deadlock chain
 (loop)
 to make any useful advance.
 
 Ok, I have made some excellent progress in debugging the NFS
 deadlock.
 
 Rick! You are a genius. :-) You found the right commit: r250907 (dated
 May 22) is definitely the problem.
 
 Here is how I did the testing: One machine received a kernel before
 r250907, the second machine received a kernel after r250907. Sure
 enough within a few hours the machine with r250907 went into the
 usual deadlock state. The machine without that commit kept on
 working fine. Then I went back to the latest revision (r253726), but
 leaving r250907 out. The machines have been running happy and rock
 solid without any deadlocks. I have expanded the testing to 3
 machines now and no reports of any issues.
 
 I guess now Konstantin has to figure out why that commit is causing
 the deadlock. Lovely! :-) I will get that information as soon as
 possible. I'm a little behind with normal work load, but I expect to
 have the data by Tuesday evening or Wednesday.
 
 Have you been able to pass the debugging info on to Kostik?
 
 It would be really nice to get this fixed for FreeBSD9.2.
 
 Thanks for your help with this, rick

Sorry Rick, I wasn't able to get you guys that info quickly enough. I thought I 
would have enough time, before my own wedding and honeymoon came along, but 
everything went a little crazy and stressful. I didn't think it would be this 
nuts. :-)

I'm caught up with everything now, and from what I can see in the discussions,
we know what the problem is.

I can report that the machines which I have had without r250907 have been 
running without any problems for 27+ days.

If you need me to test any new patches, please let me know. If I should test 
with the partial merge of r253927 I'll be happy to do so.

Thanks,

Michael






Re: NFS deadlock on 9.2-Beta1

2013-08-25 Thread Adrian Chadd
Hi,

Does -HEAD have this same problem?

If so, we should likely just revert the patch entirely from -HEAD and -9
until it's resolved.



-adrian



On 24 August 2013 23:51, Michael Tratz mich...@esosoft.com wrote:


 On Aug 15, 2013, at 2:39 PM, Rick Macklem rmack...@uoguelph.ca wrote:

  Michael Tratz wrote:
 
  On Jul 27, 2013, at 11:25 PM, Konstantin Belousov
  kostik...@gmail.com wrote:
 
  On Sat, Jul 27, 2013 at 03:13:05PM -0700, Michael Tratz wrote:
  Let's assume the pid which started the deadlock is 14001 (it will
  be a different pid when we get the results, because the machine
  has been restarted)
 
  I type:
 
  show proc 14001
 
  I get the thread numbers from that output and type:
 
  show thread x
 
  for each one.
 
  And a trace for each thread with the command?
 
  tr 
 
  Anything else I should try to get or do? Or is that not the data
  at all you are looking for?
 
  Yes, everything else which is listed in the 'debugging deadlocks'
  page
  must be provided, otherwise the deadlock cannot be tracked.
 
  The investigator should be able to see the whole deadlock chain
  (loop)
  to make any useful advance.
 
  Ok, I have made some excellent progress in debugging the NFS
  deadlock.
 
  Rick! You are a genius. :-) You found the right commit: r250907 (dated
  May 22) is definitely the problem.
 
  Here is how I did the testing: One machine received a kernel before
  r250907, the second machine received a kernel after r250907. Sure
  enough within a few hours the machine with r250907 went into the
  usual deadlock state. The machine without that commit kept on
  working fine. Then I went back to the latest revision (r253726), but
  leaving r250907 out. The machines have been running happy and rock
  solid without any deadlocks. I have expanded the testing to 3
  machines now and no reports of any issues.
 
  I guess now Konstantin has to figure out why that commit is causing
  the deadlock. Lovely! :-) I will get that information as soon as
  possible. I'm a little behind with normal work load, but I expect to
  have the data by Tuesday evening or Wednesday.
 
  Have you been able to pass the debugging info on to Kostik?
 
  It would be really nice to get this fixed for FreeBSD9.2.
 
  Thanks for your help with this, rick

 Sorry Rick, I wasn't able to get you guys that info quickly enough. I
 thought I would have enough time, before my own wedding and honeymoon came
 along, but everything went a little crazy and stressful. I didn't think it
 would be this nuts. :-)

 I'm caught up with everything now, and from what I can see in the
 discussions, we know what the problem is.

 I can report that the machines which I have had without r250907 have been
 running without any problems for 27+ days.

 If you need me to test any new patches, please let me know. If I should
 test with the partial merge of r253927 I'll be happy to do so.

 Thanks,

 Michael







Re: NFS deadlock on 9.2-Beta1

2013-08-25 Thread Rick Macklem
Michael Tratz wrote:
 
 On Aug 15, 2013, at 2:39 PM, Rick Macklem rmack...@uoguelph.ca
 wrote:
 
  Michael Tratz wrote:
  
  On Jul 27, 2013, at 11:25 PM, Konstantin Belousov
  kostik...@gmail.com wrote:
  
  On Sat, Jul 27, 2013 at 03:13:05PM -0700, Michael Tratz wrote:
  Let's assume the pid which started the deadlock is 14001 (it
  will
  be a different pid when we get the results, because the machine
  has been restarted)
  
  I type:
  
  show proc 14001
  
  I get the thread numbers from that output and type:
  
  show thread x
  
  for each one.
  
  And a trace for each thread with the command?
  
  tr 
  
  Anything else I should try to get or do? Or is that not the data
  at all you are looking for?
  
  Yes, everything else which is listed in the 'debugging deadlocks'
  page
  must be provided, otherwise the deadlock cannot be tracked.
  
  The investigator should be able to see the whole deadlock chain
  (loop)
  to make any useful advance.
  
  Ok, I have made some excellent progress in debugging the NFS
  deadlock.
  
  Rick! You are a genius. :-) You found the right commit: r250907
  (dated
  May 22) is definitely the problem.
  
  Here is how I did the testing: One machine received a kernel
  before
  r250907, the second machine received a kernel after r250907. Sure
  enough within a few hours the machine with r250907 went into the
  usual deadlock state. The machine without that commit kept on
  working fine. Then I went back to the latest revision (r253726),
  but
  leaving r250907 out. The machines have been running happy and rock
  solid without any deadlocks. I have expanded the testing to 3
  machines now and no reports of any issues.
  
  I guess now Konstantin has to figure out why that commit is
  causing
  the deadlock. Lovely! :-) I will get that information as soon as
  possible. I'm a little behind with normal work load, but I expect
  to
  have the data by Tuesday evening or Wednesday.
  
  Have you been able to pass the debugging info on to Kostik?
  
  It would be really nice to get this fixed for FreeBSD9.2.
  
  Thanks for your help with this, rick
 
 Sorry Rick, I wasn't able to get you guys that info quickly enough. I
 thought I would have enough time, before my own wedding and
 honeymoon came along, but everything went a little crazy and
 stressful. I didn't think it would be this nuts. :-)
 
 I'm caught up with everything now, and from what I can see in the
 discussions, we know what the problem is.
 
 I can report that the machines which I have had without r250907 have
 been running without any problems for 27+ days.
 
 If you need me to test any new patches, please let me know. If I
 should test with the partial merge of r253927 I'll be happy to do
 so.
 
It's up to you, but you might want to wait until the other tester (J. David?)
reports back on success/failure.

Thanks for your help with this, rick

 Thanks,
 
 Michael
 
 
 
 
 


Re: NFS deadlock on 9.2-Beta1

2013-08-25 Thread J David
On Sun, Aug 25, 2013 at 7:42 AM, Adrian Chadd adr...@freebsd.org wrote:
 Does -HEAD have this same problem?

If I understood kib@ correctly, this is fixed in -HEAD by r253927.

 If so, we should likely just revert the patch entirely from -HEAD and -9
 until it's resolved.

It was not too difficult to prepare a releng/9.2 build with r254754
reverted (reverting the revert) and then applying kib@'s suggested
backport fix.

So far that is running on 9 nodes with no reported problems, but only
since last night.  We were hesitant to do the significant work
involved to push it out to dozens of nodes if nobody was going to
consider it for 9.2 anyway.

Thanks!


Re: NFS deadlock on 9.2-Beta1

2013-08-24 Thread J David
The requested information about the deadlock was finally obtained and
provided off-list to the requested parties due to size.


Re: NFS deadlock on 9.2-Beta1

2013-08-24 Thread Konstantin Belousov
On Sat, Aug 24, 2013 at 01:08:05PM -0400, J David wrote:
 The requested information about the deadlock was finally obtained and
 provided off-list to the requested parties due to size.

Thank you, the problem is clear now.

The problematic process backtrace is

Tracing command httpd pid 86383 tid 100138 td 0xfe000b7b2900
sched_switch() at sched_switch+0x234/frame 0xff834c442360
mi_switch() at mi_switch+0x15c/frame 0xff834c4423a0
sleepq_switch() at sleepq_switch+0x17d/frame 0xff834c4423e0
sleepq_wait() at sleepq_wait+0x43/frame 0xff834c442410
sleeplk() at sleeplk+0x11a/frame 0xff834c442460
__lockmgr_args() at __lockmgr_args+0x9a9/frame 0xff834c442580
nfs_lock1() at nfs_lock1+0x87/frame 0xff834c4425b0
VOP_LOCK1_APV() at VOP_LOCK1_APV+0xbe/frame 0xff834c4425e0
_vn_lock() at _vn_lock+0x63/frame 0xff834c442640
ncl_upgrade_vnlock() at ncl_upgrade_vnlock+0x5e/frame 0xff834c442670
ncl_bioread() at ncl_bioread+0x195/frame 0xff834c4427e0
VOP_READ_APV() at VOP_READ_APV+0xd1/frame 0xff834c442810
vn_rdwr() at vn_rdwr+0x2bc/frame 0xff834c4428d0
kern_sendfile() at kern_sendfile+0xa90/frame 0xff834c442ac0
do_sendfile() at do_sendfile+0x92/frame 0xff834c442b20
amd64_syscall() at amd64_syscall+0x259/frame 0xff834c442c30
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xff834c442c30
--- syscall (393, FreeBSD ELF64, sys_sendfile), rip = 0x801b24f4c, rsp = 
0x7fffce98, rbp = 0x7fffd1d0 ---

It tries to do the upgrade of the nfs vnode lock, and for this, the lock
is dropped and re-acquired. Since this happens with the vnode's vm object
page busied, we get a reversal between vnode lock and page busy state.
So effectively, my suspicion that the NFS read path drops the vnode lock was
true, and in fact I knew about the upgrade.
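
A minimal userspace sketch of that reversal may help readers follow the two
backtraces in this thread. It is purely illustrative (not the kernel code): a
pthread rwlock stands in for the NFS vnode lock, a flag plus condvar stand in
for the page busy state, and the two threads play the roles of the two
sendfile() threads described here (one busies the page and then hits the
shared-to-exclusive upgrade, the other takes the shared vnode lock and waits
for the page to become unbusied). The program deliberately hangs; that hang
is the deadlock.

/*
 * Illustrative analogue of the reported deadlock; NOT FreeBSD kernel code.
 * Build: cc -o lkdemo lkdemo.c -lpthread   (hangs on purpose)
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t vnode_lock = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t busy_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t busy_cv = PTHREAD_COND_INITIALIZER;
static int page_busy;

/* Thread A: busies the page, then does the shared->exclusive "upgrade",
 * which means dropping the shared lock and waiting for the exclusive one. */
static void *
upgrader(void *arg)
{
    pthread_rwlock_rdlock(&vnode_lock);     /* shared vnode lock */
    pthread_mutex_lock(&busy_mtx);
    page_busy = 1;                          /* vm_page_io_start() analogue */
    pthread_mutex_unlock(&busy_mtx);

    pthread_rwlock_unlock(&vnode_lock);     /* the upgrade drops the lock... */
    sleep(1);                               /* window for thread B to get in */
    printf("A: page busy, waiting for exclusive vnode lock\n");
    pthread_rwlock_wrlock(&vnode_lock);     /* ...and blocks here forever */
    return (NULL);
}

/* Thread B: takes the shared vnode lock and then sleeps waiting for the
 * busy page, like the thread stuck in "pgrbwt". */
static void *
grabber(void *arg)
{
    usleep(500000);                         /* start after A busied the page */
    pthread_rwlock_rdlock(&vnode_lock);
    printf("B: holding shared vnode lock, waiting for page to unbusy\n");
    pthread_mutex_lock(&busy_mtx);
    while (page_busy)
        pthread_cond_wait(&busy_cv, &busy_mtx);  /* sleeps forever */
    pthread_mutex_unlock(&busy_mtx);
    pthread_rwlock_unlock(&vnode_lock);
    return (NULL);
}

int
main(void)
{
    pthread_t a, b;

    pthread_create(&a, NULL, upgrader, NULL);
    pthread_create(&b, NULL, grabber, NULL);
    pthread_join(a, NULL);                  /* never returns */
    pthread_join(b, NULL);
    return (0);
}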

I think the easiest route is a partial merge of r253927 from HEAD.

Index: fs
===
--- fs  (revision 254800)
+++ fs  (working copy)

Property changes on: fs
___
Modified: svn:mergeinfo
   Merged /head/sys/fs:r253927
Index: kern/uipc_syscalls.c
===
--- kern/uipc_syscalls.c(revision 254800)
+++ kern/uipc_syscalls.c(working copy)
@@ -2124,11 +2124,6 @@
else {
ssize_t resid;
 
-   /*
-* Ensure that our page is still around
-* when the I/O completes.
-*/
-   vm_page_io_start(pg);
VM_OBJECT_UNLOCK(obj);
 
/*
@@ -2144,10 +2139,8 @@
IO_VMIO | ((MAXBSIZE / bsize) << IO_SEQSHIFT),
td->td_ucred, NOCRED, &resid, td);
VFS_UNLOCK_GIANT(vfslocked);
-   VM_OBJECT_LOCK(obj);
-   vm_page_io_finish(pg);
-   if (!error)
-   VM_OBJECT_UNLOCK(obj);
+   if (error)
+   VM_OBJECT_LOCK(obj);
mbstat.sf_iocnt++;
}
if (error) {
Index: .
===
--- .   (revision 254800)
+++ .   (working copy)

Property changes on: .
___
Modified: svn:mergeinfo
   Merged /head/sys:r253927




Re: NFS deadlock on 9.2-Beta1

2013-08-24 Thread J David
On Sat, Aug 24, 2013 at 1:41 PM, Konstantin Belousov
kostik...@gmail.com wrote:
 I think the easiest route is to a partial merge of the r253927 from HEAD.

Is it helpful if we restart testing releng/9.2 using your suggested
fix?  And if so, should the IGN_SBUSY patch you posted earlier be applied
as well or no?

If it ran successfully on a bunch of machines for the next few days, maybe
that would still be in time to be useful feedback for 9.2.

Thanks!


Re: NFS deadlock on 9.2-Beta1

2013-08-24 Thread Konstantin Belousov
On Sat, Aug 24, 2013 at 02:03:50PM -0400, J David wrote:
 On Sat, Aug 24, 2013 at 1:41 PM, Konstantin Belousov
 kostik...@gmail.com wrote:
  I think the easiest route is to a partial merge of the r253927 from HEAD.
 
 Is it helpful if we restart testing releng/9.2 using your suggested
 fix?  And if so, the IGN_SBUSY patch you posted earlier be applied as
 well or no?
No, at least not without reverting the r254754 first.  The IGN_SBUSY patch
is not critical there.

 
 If it ran successfully on a bunch of machines for next few days, maybe
 that would still be in time to be useful feedback for 9.2.
 
 Thanks!




Re: NFS deadlock on 9.2-Beta1

2013-08-24 Thread J David
On Sat, Aug 24, 2013 at 3:38 PM, Konstantin Belousov
kostik...@gmail.com wrote:
 No, at least not without reverting the r254754 first.  The IGN_SBUSY patch
 is not critical there.

There is a lot of other stuff in r250907 / reverted by r254754.  Some
of it looks important for sendfile() performance.  If testing this
extensively in the next few days could help get that work back into
9.2, we are happy to do it, but if it's too late then we can leave it
for those on stable/9.

Thanks!


Re: NFS deadlock on 9.2-Beta1

2013-08-24 Thread Konstantin Belousov
On Sat, Aug 24, 2013 at 04:11:09PM -0400, J David wrote:
 On Sat, Aug 24, 2013 at 3:38 PM, Konstantin Belousov
 kostik...@gmail.com wrote:
  No, at least not without reverting the r254754 first.  The IGN_SBUSY patch
  is not critical there.
 
 There is lots of other stuff in r250907 / reverted by r254754.  Some
 of it looks important for sendfile() performance.  If testing this
 extensively in the next few days could help get that work back into
 9.2 we are happy to do it, but if it's too late then we can leave it
 for those on stable/9.

The revert in r254754 is only a workaround for your workload; it does
not fix the real issue, which can be reproduced by other means.

I am not sure whether re@ would allow merging the proper fix, since we are
already somewhere in RC3.




Re: NFS deadlock on 9.2-Beta1

2013-08-24 Thread J David
On Sat, Aug 24, 2013 at 4:55 PM, Konstantin Belousov
kostik...@gmail.com wrote:
 On Sat, Aug 24, 2013 at 04:11:09PM -0400, J David wrote:
 On Sat, Aug 24, 2013 at 3:38 PM, Konstantin Belousov
 kostik...@gmail.com wrote:
  No, at least not without reverting the r254754 first.  The IGN_SBUSY patch
  is not critical there.

 There is lots of other stuff in r250907 / reverted by r254754.  Some
 of it looks important for sendfile() performance.  If testing this
 extensively in the next few days could help get that work back into
 9.2 we are happy to do it, but if it's too late then we can leave it
 for those on stable/9.

 The revert in r254754 is only a workaround for your workload; it does
 not fix the real issue, which can be reproduced by other means.

 I am not sure whether re@ would allow merging the proper fix, since we are
 already somewhere in RC3.

Well, let's ask them. :)

Thanks!


Re: NFS deadlock on 9.2-Beta1

2013-08-24 Thread Rick Macklem
Kostik wrote:
 On Sat, Aug 24, 2013 at 01:08:05PM -0400, J David wrote:
  The requested information about the deadlock was finally obtained
  and
  provided off-list to the requested parties due to size.
 
 Thank you, the problem is clear now.
 
 The problematic process backtrace is
 
 Tracing command httpd pid 86383 tid 100138 td 0xfe000b7b2900
 sched_switch() at sched_switch+0x234/frame 0xff834c442360
 mi_switch() at mi_switch+0x15c/frame 0xff834c4423a0
 sleepq_switch() at sleepq_switch+0x17d/frame 0xff834c4423e0
 sleepq_wait() at sleepq_wait+0x43/frame 0xff834c442410
 sleeplk() at sleeplk+0x11a/frame 0xff834c442460
 __lockmgr_args() at __lockmgr_args+0x9a9/frame 0xff834c442580
 nfs_lock1() at nfs_lock1+0x87/frame 0xff834c4425b0
 VOP_LOCK1_APV() at VOP_LOCK1_APV+0xbe/frame 0xff834c4425e0
 _vn_lock() at _vn_lock+0x63/frame 0xff834c442640
 ncl_upgrade_vnlock() at ncl_upgrade_vnlock+0x5e/frame
 0xff834c442670
 ncl_bioread() at ncl_bioread+0x195/frame 0xff834c4427e0
 VOP_READ_APV() at VOP_READ_APV+0xd1/frame 0xff834c442810
 vn_rdwr() at vn_rdwr+0x2bc/frame 0xff834c4428d0
 kern_sendfile() at kern_sendfile+0xa90/frame 0xff834c442ac0
 do_sendfile() at do_sendfile+0x92/frame 0xff834c442b20
 amd64_syscall() at amd64_syscall+0x259/frame 0xff834c442c30
 Xfast_syscall() at Xfast_syscall+0xfb/frame 0xff834c442c30
 --- syscall (393, FreeBSD ELF64, sys_sendfile), rip = 0x801b24f4c,
 rsp = 0x7fffce98, rbp = 0x7fffd1d0 ---
 
 It tries to do the upgrade of the nfs vnode lock, and for this, the
 lock is dropped and re-acquired. Since this happens with the vnode's
 vm object page busied, we get a reversal between vnode lock and page
 busy state. So effectively, my suspicion that the NFS read path drops
 the vnode lock was true, and in fact I knew about the upgrade.
 
Ouch. I had forgotten that LK_UPGRADE could result in the shared lock
being dropped.

I'll admit I've never liked the lock upgrade in nfs_read(), but I'm
not sure how to avoid it. I just looked at the commit log message
for r138469, which is where this appeared in the old NFS client.
(The new NFS client just cloned this code.)

It basically notes that with a shared lock, new pages can be faulted
in for the vnode while vinvalbuf() is in progress, causing it to fail
(I suspect "fail" means it never completes?).

At the very least, I don't think the lock upgrade is needed unless
a call to vinvalbuf() is going to be done. (I'm wondering if a dedicated
lock used to serialize this case might be better than using a vnode LK_UPGRADE?)
I think I'll take a closer look at the vinvalbuf() code in head.

Do others have any comments on this? (I added jhb@ to the cc list, since he
may be familiar with this?)

But none of this can happen quickly, so it wouldn't be feasible for stable/9
or even 10.0 at this point in time.

rick

 I think the easiest route is a partial merge of r253927 from
 HEAD.
 
 Index: fs
 ===
 --- fs(revision 254800)
 +++ fs(working copy)
 
 Property changes on: fs
 ___
 Modified: svn:mergeinfo
Merged /head/sys/fs:r253927
 Index: kern/uipc_syscalls.c
 ===
 --- kern/uipc_syscalls.c  (revision 254800)
 +++ kern/uipc_syscalls.c  (working copy)
 @@ -2124,11 +2124,6 @@
   else {
   ssize_t resid;
  
 - /*
 -  * Ensure that our page is still around
 -  * when the I/O completes.
 -  */
 - vm_page_io_start(pg);
   VM_OBJECT_UNLOCK(obj);
  
   /*
 @@ -2144,10 +2139,8 @@
  IO_VMIO | ((MAXBSIZE / bsize) << IO_SEQSHIFT),
  td->td_ucred, NOCRED, &resid, td);
   VFS_UNLOCK_GIANT(vfslocked);
 - VM_OBJECT_LOCK(obj);
 - vm_page_io_finish(pg);
 - if (!error)
 - VM_OBJECT_UNLOCK(obj);
 + if (error)
 + VM_OBJECT_LOCK(obj);
   mbstat.sf_iocnt++;
   }
   if (error) {
 Index: .
 ===
 --- . (revision 254800)
 +++ . (working copy)
 
 Property changes on: .
 ___
 Modified: svn:mergeinfo
Merged /head/sys:r253927
 

Re: NFS deadlock on 9.2-Beta1

2013-08-23 Thread Rick Macklem
J. David wrote:
 One deadlocked process cropped up overnight, but I managed to panic
 the box before getting too much debugging info. :(
 
 The process was in state T instead of D, which I guess must be a side
 effect of some of the debugging code compiled in.
 
 Here are the details I was able to capture:
 
 db  show proc 7692
 Process 7692 (httpd) at 0xfe0158793000:
  state: NORMAL
  uid: 25000  gids: 25000
  parent: pid 1 at 0xfe00039c3950
  ABI: FreeBSD ELF64
  arguments: /nfsn/apps/tapache22/bin/httpd
  threads: 3
 100674   D   newnfs   0xfe021cdd9848 httpd
 100597   D   pgrbwt   0xfe02fda788b8 httpd
 100910   s   httpd
 
 db show thread 100674
 Thread 100674 at 0xfe0108c79480:
  proc (pid 7692): 0xfe0158793000
  name: httpd
  stack: 0xff834c80f000-0xff834c812fff
  flags: 0x2a804  pflags: 0
  state: INHIBITED: {SLEEPING}
  wmesg: newnfs  wchan: 0xfe021cdd9848
  priority: 96
  container lock: sleepq chain (0x813c03c8)
 
 db tr 100674
 Tracing pid 7692 tid 100674 td 0xfe0108c79480
 sched_switch() at sched_switch+0x234/frame 0xff834c812360
 mi_switch() at mi_switch+0x15c/frame 0xff834c8123a0
 sleepq_switch() at sleepq_switch+0x17d/frame 0xff834c8123e0
 sleepq_wait() at sleepq_wait+0x43/frame 0xff834c812410
 sleeplk() at sleeplk+0x11a/frame 0xff834c812460
 __lockmgr_args() at __lockmgr_args+0x9a9/frame 0xff834c812580
 nfs_lock1() at nfs_lock1+0x87/frame 0xff834c8125b0
 VOP_LOCK1_APV() at VOP_LOCK1_APV+0xbe/frame 0xff834c8125e0
 _vn_lock() at _vn_lock+0x63/frame 0xff834c812640
 ncl_upgrade_vnlock() at ncl_upgrade_vnlock+0x5e/frame
 0xff834c812670
 ncl_bioread() at ncl_bioread+0x195/frame 0xff834c8127e0
 VOP_READ_APV() at VOP_READ_APV+0xd1/frame 0xff834c812810
 vn_rdwr() at vn_rdwr+0x2bc/frame 0xff834c8128d0
 kern_sendfile() at kern_sendfile+0xa90/frame 0xff834c812ac0
 do_sendfile() at do_sendfile+0x92/frame 0xff834c812b20
 amd64_syscall() at amd64_syscall+0x259/frame 0xff834c812c30
 Xfast_syscall() at Xfast_syscall+0xfb/frame 0xff834c812c30
 --- syscall (393, FreeBSD ELF64, sys_sendfile), rip = 0x801b26f4c,
 rsp
 = 0x7e9f43c8, rbp = 0x7e9f4700 ---
 
 db show lockchain 100674
 thread 100674 (pid 7692, httpd) inhibited
 
 db show thread 100597
 Thread 100597 at 0xfe021c976000:
  proc (pid 7692): 0xfe0158793000
  name: httpd
  stack: 0xff834c80a000-0xff834c80dfff
  flags: 0x28804  pflags: 0
  state: INHIBITED: {SLEEPING}
  wmesg: pgrbwt  wchan: 0xfe02fda788b8
  priority: 84
  container lock: sleepq chain (0x813c0148)
 
 db tr 100597
 Tracing pid 7692 tid 100597 td 0xfe021c976000
 sched_switch() at sched_switch+0x234/frame 0xff834c80d750
 mi_switch() at mi_switch+0x15c/frame 0xff834c80d790
 sleepq_switch() at sleepq_switch+0x17d/frame 0xff834c80d7d0
 sleepq_wait() at sleepq_wait+0x43/frame 0xff834c80d800
 _sleep() at _sleep+0x30f/frame 0xff834c80d890
 vm_page_grab() at vm_page_grab+0x120/frame 0xff834c80d8d0
 kern_sendfile() at kern_sendfile+0x992/frame 0xff834c80dac0
 do_sendfile() at do_sendfile+0x92/frame 0xff834c80db20
 amd64_syscall() at amd64_syscall+0x259/frame 0xff834c80dc30
 Xfast_syscall() at Xfast_syscall+0xfb/frame 0xff834c80dc30
 --- syscall (393, FreeBSD ELF64, sys_sendfile), rip = 0x801b26f4c,
 rsp
 = 0x7ebf53c8, rbp = 0x7ebf5700 ---
 
 db show lockchain 100597
 thread 100597 (pid 7692, httpd) inhibited
 
 The inhibited is not something I'm familiar with and didn't match
 the example output; I thought that maybe the T state was overpowering
 the locks, and that maybe I should continue the system and then -CONT
 the process.  However, a few seconds after I issued c at the DDB
 prompt, the system panicked in the console driver (mtx_lock_spin:
 recursed on non-recursive mutex cnputs_mtx @
 /usr/src/sys/kern/kern_cons.c:500), so I guess that's not a thing to
 do. :(
 
 Sorry my stupidity and ignorance is dragging this out. :(  This is
 all
 well outside my comfort zone, but next time I'll get it for sure.
 
No problem. Thanks for trying to capture this stuff.

Unfortunately, what you have above doesn't tell me anything more about
the problem.
The main question to me is: why is the thread stuck in "pgrbwt" permanently?

To figure this out, we need the info on all threads on the system. In 
particular,
the status (the output of ps axHl would be a start, before going into the
debugger) of the nfsiod threads might point to the cause, although it may
involve other threads as well.

If you are running a serial console, just start script(1), type "ps axHl",
then go into the debugger and run the commands listed here (basically
every command with "all" in it):
http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
(I don't know why your console crashed. Hopefully you can 

Re: NFS deadlock on 9.2-Beta1

2013-08-22 Thread J David
Now that a kernel with INVARIANTS/WITNESS is finally available on a
machine with serial console I am having terrible trouble provoking
this to happen.  (Machine grinds to a halt if I put the usual test
load on it due to all the debug code in the kernel.)

Did get this interesting LOR, though it did not cause a deadlock:

lock order reversal:
 1st 0xfe000adb9f30 so_snd_sx (so_snd_sx) @
/usr/src/sys/kern/uipc_sockbuf.c:145
 2nd 0xfe000aa5b098 newnfs (newnfs) @ /usr/src/sys/kern/uipc_syscalls.c:2062
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xff834c3995c0
kdb_backtrace() at kdb_backtrace+0x39/frame 0xff834c399670
witness_checkorder() at witness_checkorder+0xc0a/frame 0xff834c3996f0
__lockmgr_args() at __lockmgr_args+0x390/frame 0xff834c399810
nfs_lock1() at nfs_lock1+0x87/frame 0xff834c399840
VOP_LOCK1_APV() at VOP_LOCK1_APV+0xbe/frame 0xff834c399870
_vn_lock() at _vn_lock+0x63/frame 0xff834c3998d0
kern_sendfile() at kern_sendfile+0x812/frame 0xff834c399ac0
do_sendfile() at do_sendfile+0x92/frame 0xff834c399b20
amd64_syscall() at amd64_syscall+0x259/frame 0xff834c399c30
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xff834c399c30
--- syscall (393, FreeBSD ELF64, sys_sendfile), rip = 0x801b24f4c, rsp
= 0x7fffcf58, rbp = 0x7fffd290 ---

Once the real deal pops up, collecting the full requested info should
be no problem, but it could take awhile to happen with only one
machine that can't run the full test battery.  So if a real fix is
dependent on this, reverting r250907 for 9.2-RELEASE is probably the
way to go. With that configuration, releng/9.2 continues to be pretty
solid for us.

Thanks!

(Since this doesn't contain the requested info, I heavily trimmed the
Cc: list.  It is not my intention to waste the time of everybody
involved.)


Re: NFS deadlock on 9.2-Beta1

2013-08-22 Thread Konstantin Belousov
On Wed, Aug 21, 2013 at 09:08:10PM -0400, Rick Macklem wrote:
 Kostik wrote:
  On Tue, Aug 20, 2013 at 06:18:16PM -0400, Rick Macklem wrote:
   J David wrote:
On Thu, Aug 15, 2013 at 5:39 PM, Rick Macklem
rmack...@uoguelph.ca
wrote:
 Have you been able to pass the debugging info on to Kostik?

 It would be really nice to get this fixed for FreeBSD9.2.

You're probably not talking to me, but headway here is slow.  At
our
location, we have been continuing to test releng/9.2 extensively,
but
with r250907 reverted.  Since reverting it solves the issue, and
since
there haven't been any further changes to releng/9.2 that might
also
resolve this issue, re-applying r250907 is perceived here as
un-fixing
a problem.  Enthusiasm for doing so is correspondingly low, even
if
the purpose is to gather debugging info. :(

However, after finally having clearance to test releng/9.2
r254540
with r250907 included and with DDB on five nodes.  The problem
cropped
up in about an hour.  Two threads in one process deadlocked, was
perfect.  Got it into DDB and saw the stack trace was scrolling
off
so
there was no way to copy it by hand.  Also, the machine's disk is
smaller than physical RAM, so no dump file. :(

Here's what is available so far:

db show proc 33362

Process 33362 (httpd) at 0xcd225b50:

 state: NORMAL

 uid: 25000 gids: 25000

 parent: pid 25104 at 0xc95f92d4

 ABI: FreeBSD ELF32

 arguments: /usr/local/libexec/httpd

 threads: 3

100405 D newnfs 0xc9b875e4 httpd

   Ok, so this one is waiting for an NFS vnode lock.
   
100393 D pgrbwt 0xc43a30c0 httpd

   This one is sleeping in vm_page_grab() { which I suspect has
   been called from kern_sendfile() with a shared vnode lock held,
   from what I saw on the previous debug info }.
   
100755 S uwait 0xc84b7c80 httpd


Not much to go on. :(  Maybe these five can be configured with
serial
consoles.

So, inquiries are continuing, but the answer to does this still
happen on 9.2-RC2? is definitely yes.

   Since r250027 moves a vn_lock() to before the vm_page_grab() call
   in
   kern_sendfile(), I suspect that is the cause of the deadlock.
   (r250027
   is one of the 3 commits MFC'd by r250907)
   
   I don't know if it would be safe to VOP_UNLOCK() the vnode after
   VOP_GETATTR()
   and then put the vn_lock() call that comes after vm_page_grab()
   back in or whether
   r250027 should be reverted (getting rid of the VOP_GETATTR() and
   going back to
   using the size in the vm stuff).
   
   Hopefully Kostik will know what is best to do with it now, rick
  
  I already described what to do with this.  I need the debugging
  information to see what is going on.  Without the data, it is only
  wasted time of everybody involved.
  
 Sorry, I didn't make what I was asking clear. I was referring specifically
 to stopping the hang from occurring in the soon to be released 9.2.
 
 I think you indirectly answered the question, in that you don't know
 of a fix for the hangs without more debugging information. This
 implies that reverting r250907 is the main option to resolve this
 for the 9.2 release (unless more debugging info arrives very soon),
 since that is the only fix that has been confirmed to work.
 Does this sound reasonable?
I do not object to reverting it for 9.2.  Please go ahead.

On the other hand, I do not want to revert it in stable/9, at least
until the cause is understood.

 
  Some technical notes.  The sendfile() uses shared lock for the
  duration
  of vnode i/o, so any thread which is sleeping on the vnode lock
  cannot
  be in the sendfile path, at least for UFS and NFS which do support
  true
  shared locks.
  
  The right lock order is vnode lock -> page busy wait. From this PoV,
  the ordering in the sendfile is correct. Rick, are you aware of any
  situation where the VOP_READ in nfs client could drop vnode lock
  and then re-acquire it ? I was not able to find this from the code
  inspection. But, if such situation exists, it would be problematic in
  9.
  
 I am not aware of a case where nfs_read() drops/re-acquires the vnode
 lock.
 
 However, readaheads will still be in progress when nfs_read() returns,
 so those can still be in progress after the vnode lock is dropped.
 
 vfs_busy_pages() will have been called on the page(s) that readahead
 is in progress on (I think that means the shared busy bit will be set,
 if I understood vfs_busy_pages()). When the readahead is completed,
 bufdone() is called, so I don't understand why the page wouldn't become
 unbusied (waking up the thread sleeping on pgrbwt).
Exactly, this is the issue which I also do not understand.

 I can't see why not being able to acquire the vnode lock would affect
 this, but my hunch is that it somehow does have this 

Re: NFS deadlock on 9.2-Beta1

2013-08-22 Thread J David
One deadlocked process cropped up overnight, but I managed to panic
the box before getting too much debugging info. :(

The process was in state T instead of D, which I guess must be a side
effect of some of the debugging code compiled in.

Here are the details I was able to capture:

db> show proc 7692
Process 7692 (httpd) at 0xfe0158793000:
 state: NORMAL
 uid: 25000  gids: 25000
 parent: pid 1 at 0xfe00039c3950
 ABI: FreeBSD ELF64
 arguments: /nfsn/apps/tapache22/bin/httpd
 threads: 3
100674   D   newnfs   0xfe021cdd9848 httpd
100597   D   pgrbwt   0xfe02fda788b8 httpd
100910   s   httpd

db> show thread 100674
Thread 100674 at 0xfe0108c79480:
 proc (pid 7692): 0xfe0158793000
 name: httpd
 stack: 0xff834c80f000-0xff834c812fff
 flags: 0x2a804  pflags: 0
 state: INHIBITED: {SLEEPING}
 wmesg: newnfs  wchan: 0xfe021cdd9848
 priority: 96
 container lock: sleepq chain (0x813c03c8)

db> tr 100674
Tracing pid 7692 tid 100674 td 0xfe0108c79480
sched_switch() at sched_switch+0x234/frame 0xff834c812360
mi_switch() at mi_switch+0x15c/frame 0xff834c8123a0
sleepq_switch() at sleepq_switch+0x17d/frame 0xff834c8123e0
sleepq_wait() at sleepq_wait+0x43/frame 0xff834c812410
sleeplk() at sleeplk+0x11a/frame 0xff834c812460
__lockmgr_args() at __lockmgr_args+0x9a9/frame 0xff834c812580
nfs_lock1() at nfs_lock1+0x87/frame 0xff834c8125b0
VOP_LOCK1_APV() at VOP_LOCK1_APV+0xbe/frame 0xff834c8125e0
_vn_lock() at _vn_lock+0x63/frame 0xff834c812640
ncl_upgrade_vnlock() at ncl_upgrade_vnlock+0x5e/frame 0xff834c812670
ncl_bioread() at ncl_bioread+0x195/frame 0xff834c8127e0
VOP_READ_APV() at VOP_READ_APV+0xd1/frame 0xff834c812810
vn_rdwr() at vn_rdwr+0x2bc/frame 0xff834c8128d0
kern_sendfile() at kern_sendfile+0xa90/frame 0xff834c812ac0
do_sendfile() at do_sendfile+0x92/frame 0xff834c812b20
amd64_syscall() at amd64_syscall+0x259/frame 0xff834c812c30
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xff834c812c30
--- syscall (393, FreeBSD ELF64, sys_sendfile), rip = 0x801b26f4c, rsp
= 0x7e9f43c8, rbp = 0x7e9f4700 ---

db> show lockchain 100674
thread 100674 (pid 7692, httpd) inhibited

db> show thread 100597
Thread 100597 at 0xfe021c976000:
 proc (pid 7692): 0xfe0158793000
 name: httpd
 stack: 0xff834c80a000-0xff834c80dfff
 flags: 0x28804  pflags: 0
 state: INHIBITED: {SLEEPING}
 wmesg: pgrbwt  wchan: 0xfe02fda788b8
 priority: 84
 container lock: sleepq chain (0x813c0148)

db> tr 100597
Tracing pid 7692 tid 100597 td 0xfe021c976000
sched_switch() at sched_switch+0x234/frame 0xff834c80d750
mi_switch() at mi_switch+0x15c/frame 0xff834c80d790
sleepq_switch() at sleepq_switch+0x17d/frame 0xff834c80d7d0
sleepq_wait() at sleepq_wait+0x43/frame 0xff834c80d800
_sleep() at _sleep+0x30f/frame 0xff834c80d890
vm_page_grab() at vm_page_grab+0x120/frame 0xff834c80d8d0
kern_sendfile() at kern_sendfile+0x992/frame 0xff834c80dac0
do_sendfile() at do_sendfile+0x92/frame 0xff834c80db20
amd64_syscall() at amd64_syscall+0x259/frame 0xff834c80dc30
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xff834c80dc30
--- syscall (393, FreeBSD ELF64, sys_sendfile), rip = 0x801b26f4c, rsp
= 0x7ebf53c8, rbp = 0x7ebf5700 ---

db> show lockchain 100597
thread 100597 (pid 7692, httpd) inhibited

The "inhibited" is not something I'm familiar with and didn't match
the example output; I thought that maybe the T state was overpowering
the locks, and that maybe I should continue the system and then -CONT
the process.  However, a few seconds after I issued "c" at the DDB
prompt, the system panicked in the console driver (mtx_lock_spin:
recursed on non-recursive mutex cnputs_mtx @
/usr/src/sys/kern/kern_cons.c:500), so I guess that's not a thing to
do. :(

Sorry my stupidity and ignorance is dragging this out. :(  This is all
well outside my comfort zone, but next time I'll get it for sure.

Thanks!


Re: NFS deadlock on 9.2-Beta1

2013-08-21 Thread Konstantin Belousov
On Tue, Aug 20, 2013 at 06:18:16PM -0400, Rick Macklem wrote:
 J David wrote:
  On Thu, Aug 15, 2013 at 5:39 PM, Rick Macklem rmack...@uoguelph.ca
  wrote:
   Have you been able to pass the debugging info on to Kostik?
  
   It would be really nice to get this fixed for FreeBSD9.2.
  
  You're probably not talking to me, but headway here is slow.  At our
  location, we have been continuing to test releng/9.2 extensively, but
  with r250907 reverted.  Since reverting it solves the issue, and
  since
  there haven't been any further changes to releng/9.2 that might also
  resolve this issue, re-applying r250907 is perceived here as
  un-fixing
  a problem.  Enthusiasm for doing so is correspondingly low, even if
  the purpose is to gather debugging info. :(
  
  However, after finally having clearance to test releng/9.2 r254540
  with r250907 included and with DDB on five nodes.  The problem
  cropped
  up in about an hour.  Two threads in one process deadlocked, was
  perfect.  Got it into DDB and saw the stack trace was scrolling off
  so
  there was no way to copy it by hand.  Also, the machine's disk is
  smaller than physical RAM, so no dump file. :(
  
  Here's what is available so far:
  
  db show proc 33362
  
  Process 33362 (httpd) at 0xcd225b50:
  
   state: NORMAL
  
   uid: 25000 gids: 25000
  
   parent: pid 25104 at 0xc95f92d4
  
   ABI: FreeBSD ELF32
  
   arguments: /usr/local/libexec/httpd
  
   threads: 3
  
  100405 D newnfs 0xc9b875e4 httpd
  
 Ok, so this one is waiting for an NFS vnode lock.
 
  100393 D pgrbwt 0xc43a30c0 httpd
  
 This one is sleeping in vm_page_grab() { which I suspect has
 been called from kern_sendfile() with a shared vnode lock held,
 from what I saw on the previous debug info }.
 
  100755 S uwait 0xc84b7c80 httpd
  
  
  Not much to go on. :(  Maybe these five can be configured with serial
  consoles.
  
  So, inquiries are continuing, but the answer to does this still
  happen on 9.2-RC2? is definitely yes.
  
 Since r250027 moves a vn_lock() to before the vm_page_grab() call in
 kern_sendfile(), I suspect that is the cause of the deadlock. (r250027
 is one of the 3 commits MFC'd by r250907)
 
 I don't know if it would be safe to VOP_UNLOCK() the vnode after VOP_GETATTR()
 and then put the vn_lock() call that comes after vm_page_grab() back in or 
 whether
 r250027 should be reverted (getting rid of the VOP_GETATTR() and going back to
 using the size in the vm stuff).
 
 Hopefully Kostik will know what is best to do with it now, rick

I already described what to do with this.  I need the debugging
information to see what is going on.  Without the data, it is only
wasted time of everybody involved.

Some technical notes.  The sendfile() syscall uses a shared lock for the
duration of the vnode i/o, so any thread which is sleeping on the vnode lock
cannot be in the sendfile path, at least for UFS and NFS, which do support
true shared locks.

The right lock order is vnode lock -> page busy wait. From this PoV,
the ordering in the sendfile is correct. Rick, are you aware of any
situation where the VOP_READ in the nfs client could drop the vnode lock
and then re-acquire it? I was not able to find this from code
inspection. But, if such a situation exists, it would be problematic in 9.

Last note.  The HEAD dropped pre-busying pages in the sendfile() syscall.
As I understand it, this is because Attilio's new busy implementation cannot
support both busy and sbusy states simultaneously, and vfs_busy_pages()/
vfs_drain_busy_pages() actually created such a situation. I think that
because the sbusy is removed from the sendfile(), and the vm object
lock is dropped, there is no sense in requiring vm_page_grab() to wait
for the busy state to clear.  It is done by buffer cache or filesystem
code later. See the patch at the end.

Still, I do not know what happens in the supposedly reported deadlock.

diff --git a/sys/kern/uipc_syscalls.c b/sys/kern/uipc_syscalls.c
index 4797444..b974f53 100644
--- a/sys/kern/uipc_syscalls.c
+++ b/sys/kern/uipc_syscalls.c
@@ -2230,7 +2230,8 @@ retry_space:
pindex = OFF_TO_IDX(off);
VM_OBJECT_WLOCK(obj);
pg = vm_page_grab(obj, pindex, VM_ALLOC_NOBUSY |
-   VM_ALLOC_NORMAL | VM_ALLOC_WIRED | VM_ALLOC_RETRY);
+   VM_ALLOC_IGN_SBUSY | VM_ALLOC_NORMAL |
+   VM_ALLOC_WIRED | VM_ALLOC_RETRY);
 
/*
 * Check if page is valid for what we need,




Re: NFS deadlock on 9.2-Beta1

2013-08-21 Thread Yamagi Burmeister
On Wed, 21 Aug 2013 16:10:32 +0300
Konstantin Belousov kostik...@gmail.com wrote:

 I already described what to do with this.  I need the debugging
 information to see what is going on.  Without the data, it is only
 wasted time of everybody involved.
 
 Some technical notes.  The sendfile() uses shared lock for the duration
 of vnode i/o, so any thread which is sleeping on the vnode lock cannot
 be in the sendfile path, at least for UFS and NFS which do support true
 shared locks.
 
 The right lock order is vnode lock -> page busy wait. From this PoV,
 the ordering in the sendfile is correct. Rick, are you aware of any
 situation where the VOP_READ in nfs client could drop vnode lock
 and then re-acquire it ? I was not able to find this from the code
 inspection. But, if such situation exists, it would be problematic in 9.
 
 Last note.  The HEAD dropped pre-busying pages in the sendfile() syscall.
 As I understand, this is because new Attilio' busy implementation cannot
 support both busy and sbusy states simultaneously, and vfs_busy_pages()/
 vfs_drain_busy_pages() actually created such situation. I think that
 because the sbusy is removed from the sendfile(), and the vm object
 lock is dropped, there is no sense to require vm_page_grab() to wait
 for the busy state to clean.  It is done by buffer cache or filesystem
 code later. See the patch at the end.
 
 Still, I do not know what happens in the supposedly reported deadlock.
 
 diff --git a/sys/kern/uipc_syscalls.c b/sys/kern/uipc_syscalls.c
 index 4797444..b974f53 100644
 --- a/sys/kern/uipc_syscalls.c
 +++ b/sys/kern/uipc_syscalls.c
 @@ -2230,7 +2230,8 @@ retry_space:
   pindex = OFF_TO_IDX(off);
   VM_OBJECT_WLOCK(obj);
   pg = vm_page_grab(obj, pindex, VM_ALLOC_NOBUSY |
 - VM_ALLOC_NORMAL | VM_ALLOC_WIRED | VM_ALLOC_RETRY);
 + VM_ALLOC_IGN_SBUSY | VM_ALLOC_NORMAL |
 + VM_ALLOC_WIRED | VM_ALLOC_RETRY);
  
   /*
* Check if page is valid for what we need,

Could the problem be related to this deadlock / LOR? -
http://lists.freebsd.org/pipermail/freebsd-fs/2013-August/018052.html

My test setup is still in place. I will test with r250907 reverted
tomorrow morning and report back. Additional information can be
provided if necessary. I just need to know exactly what.

Ciao,
Yamagi

-- 
Homepage:  www.yamagi.org
XMPP:  yam...@yamagi.org
GnuPG/GPG: 0xEFBCCBCB




Re: NFS deadlock on 9.2-Beta1

2013-08-21 Thread Konstantin Belousov
On Wed, Aug 21, 2013 at 08:03:35PM +0200, Yamagi Burmeister wrote:
 Could the problem be related to this deadlock / LOR? -
 http://lists.freebsd.org/pipermail/freebsd-fs/2013-August/018052.html
This is not related.

 
 My test setup is still in place. Will test with r250907 reverted
 tomorrow morning and report back. Additional informations could be
 provided if necessary. I just need to know what exactly.

Just follow the
http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
and collect all the information listed there after the apache/sendfile/nfs
deadlock is reproduced.




Re: NFS deadlock on 9.2-Beta1

2013-08-21 Thread Rick Macklem
Kostik wrote:
 On Tue, Aug 20, 2013 at 06:18:16PM -0400, Rick Macklem wrote:
  J David wrote:
   On Thu, Aug 15, 2013 at 5:39 PM, Rick Macklem
   rmack...@uoguelph.ca
   wrote:
Have you been able to pass the debugging info on to Kostik?
   
It would be really nice to get this fixed for FreeBSD9.2.
   
   You're probably not talking to me, but headway here is slow.  At
   our
   location, we have been continuing to test releng/9.2 extensively,
   but
   with r250907 reverted.  Since reverting it solves the issue, and
   since
   there haven't been any further changes to releng/9.2 that might
   also
   resolve this issue, re-applying r250907 is perceived here as
   un-fixing
   a problem.  Enthusiasm for doing so is correspondingly low, even
   if
   the purpose is to gather debugging info. :(
   
   However, after finally having clearance to test releng/9.2
   r254540
   with r250907 included and with DDB on five nodes.  The problem
   cropped
   up in about an hour.  Two threads in one process deadlocked, was
   perfect.  Got it into DDB and saw the stack trace was scrolling
   off
   so
   there was no way to copy it by hand.  Also, the machine's disk is
   smaller than physical RAM, so no dump file. :(
   
   Here's what is available so far:
   
   db show proc 33362
   
   Process 33362 (httpd) at 0xcd225b50:
   
state: NORMAL
   
uid: 25000 gids: 25000
   
parent: pid 25104 at 0xc95f92d4
   
ABI: FreeBSD ELF32
   
arguments: /usr/local/libexec/httpd
   
threads: 3
   
   100405 D newnfs 0xc9b875e4 httpd
   
  Ok, so this one is waiting for an NFS vnode lock.
  
   100393 D pgrbwt 0xc43a30c0 httpd
   
  This one is sleeping in vm_page_grab() { which I suspect has
  been called from kern_sendfile() with a shared vnode lock held,
  from what I saw on the previous debug info }.
  
   100755 S uwait 0xc84b7c80 httpd
   
   
   Not much to go on. :(  Maybe these five can be configured with
   serial
   consoles.
   
   So, inquiries are continuing, but the answer to does this still
   happen on 9.2-RC2? is definitely yes.
   
  Since r250027 moves a vn_lock() to before the vm_page_grab() call
  in
  kern_sendfile(), I suspect that is the cause of the deadlock.
  (r250027
  is one of the 3 commits MFC'd by r250907)
  
  I don't know if it would be safe to VOP_UNLOCK() the vnode after
  VOP_GETATTR()
  and then put the vn_lock() call that comes after vm_page_grab()
  back in or whether
  r250027 should be reverted (getting rid of the VOP_GETATTR() and
  going back to
  using the size in the vm stuff).
  
  Hopefully Kostik will know what is best to do with it now, rick
 
 I already described what to do with this.  I need the debugging
 information to see what is going on.  Without the data, it is only
 wasted time of everybody involved.
 
Sorry, I didn't make what I was asking clear. I was referring specifically
to stopping the hang from occurring in the soon to be released 9.2.

I think you indirectly answered the question, in that you don't know
of a fix for the hangs without more debugging information. This
implies that reverting r250907 is the main option to resolve this
for the 9.2 release (unless more debugging info arrives very soon),
since that is the only fix that has been confirmed to work.
Does this sound reasonable?

 Some technical notes.  The sendfile() uses shared lock for the
 duration
 of vnode i/o, so any thread which is sleeping on the vnode lock
 cannot
 be in the sendfile path, at least for UFS and NFS which do support
 true
 shared locks.
 
 The right lock order is vnode lock -> page busy wait. From this PoV,
 the ordering in the sendfile is correct. Rick, are you aware of any
 situation where the VOP_READ in nfs client could drop vnode lock
 and then re-acquire it ? I was not able to find this from the code
 inspection. But, if such situation exists, it would be problematic in
 9.
 
I am not aware of a case where nfs_read() drops/re-acquires the vnode
lock.

However, readaheads will still be in progress when nfs_read() returns,
so those can still be in progress after the vnode lock is dropped.

vfs_busy_pages() will have been called on the page(s) that readahead
is in progress on (I think that means the shared busy bit will be set,
if I understood vfs_busy_pages()). When the readahead is completed,
bufdone() is called, so I don't understand why the page wouldn't become
unbusied (waking up the thread sleeping on pgrbwt).
I can't see why not being able to acquire the vnode lock would affect
this, but my hunch is that it somehow does have this effect, since that
is the only way I can see that r250907 would cause the hangs.
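
(As a purely illustrative userspace sketch, not kernel code, the normal
busy/unbusy handshake described above looks roughly like this: an async
"read-ahead" worker clears the busy flag when its I/O completes and wakes any
waiter, the way bufdone() is expected to wake a thread sleeping in
vm_page_grab(). The names and the condvar are stand-ins, not the real
primitives.)

/*
 * Normal (non-deadlocked) flow: completion unbusies the page and wakes
 * the sleeper.  Illustrative only; NOT FreeBSD kernel code.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t busy_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t busy_cv = PTHREAD_COND_INITIALIZER;
static int page_busy = 1;           /* vfs_busy_pages() analogue already ran */

/* Read-ahead "I/O": finishes after a while, unbusies the page, wakes waiters. */
static void *
readahead_io(void *arg)
{
    usleep(200000);                 /* the I/O itself */
    pthread_mutex_lock(&busy_mtx);
    page_busy = 0;                  /* bufdone() analogue: unbusy the page... */
    pthread_cond_broadcast(&busy_cv);   /* ...and wake any "pgrbwt" sleeper */
    pthread_mutex_unlock(&busy_mtx);
    return (NULL);
}

int
main(void)
{
    pthread_t io;

    pthread_create(&io, NULL, readahead_io, NULL);

    /* vm_page_grab() analogue: sleep until the page is unbusied. */
    pthread_mutex_lock(&busy_mtx);
    while (page_busy)
        pthread_cond_wait(&busy_cv, &busy_mtx);
    pthread_mutex_unlock(&busy_mtx);
    printf("page unbusied, grab proceeds\n");

    pthread_join(io, NULL);
    return (0);
}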

 Last note.  The HEAD dropped pre-busying pages in the sendfile()
 syscall.
 As I understand, this is because new Attilio' busy implementation
 cannot
 support both busy and sbusy states simultaneously, and
 vfs_busy_pages()/
 vfs_drain_busy_pages() actually created such situation. I think that
 because the sbusy is removed from the 

Re: NFS deadlock on 9.2-Beta1

2013-08-20 Thread Rick Macklem
J David wrote:
 On Thu, Aug 15, 2013 at 5:39 PM, Rick Macklem rmack...@uoguelph.ca
 wrote:
  Have you been able to pass the debugging info on to Kostik?
 
  It would be really nice to get this fixed for FreeBSD9.2.
 
 You're probably not talking to me, but headway here is slow.  At our
 location, we have been continuing to test releng/9.2 extensively, but
 with r250907 reverted.  Since reverting it solves the issue, and
 since
 there haven't been any further changes to releng/9.2 that might also
 resolve this issue, re-applying r250907 is perceived here as
 un-fixing
 a problem.  Enthusiasm for doing so is correspondingly low, even if
 the purpose is to gather debugging info. :(
 
 However, after finally having clearance to test releng/9.2 r254540
 with r250907 included and with DDB on five nodes.  The problem
 cropped
 up in about an hour.  Two threads in one process deadlocked, was
 perfect.  Got it into DDB and saw the stack trace was scrolling off
 so
 there was no way to copy it by hand.  Also, the machine's disk is
 smaller than physical RAM, so no dump file. :(
 
 Here's what is available so far:
 
 db show proc 33362
 
 Process 33362 (httpd) at 0xcd225b50:
 
  state: NORMAL
 
  uid: 25000 gids: 25000
 
  parent: pid 25104 at 0xc95f92d4
 
  ABI: FreeBSD ELF32
 
  arguments: /usr/local/libexec/httpd
 
  threads: 3
 
 100405 D newnfs 0xc9b875e4 httpd
 
Ok, so this one is waiting for an NFS vnode lock.

 100393 D pgrbwt 0xc43a30c0 httpd
 
This one is sleeping in vm_page_grab() { which I suspect has
been called from kern_sendfile() with a shared vnode lock held,
from what I saw on the previous debug info }.

 100755 S uwait 0xc84b7c80 httpd
 
 
 Not much to go on. :(  Maybe these five can be configured with serial
 consoles.
 
 So, inquiries are continuing, but the answer to does this still
 happen on 9.2-RC2? is definitely yes.
 
Since r250027 moves a vn_lock() to before the vm_page_grab() call in
kern_sendfile(), I suspect that is the cause of the deadlock. (r250027
is one of the 3 commits MFC'd by r250907)

I don't know if it would be safe to VOP_UNLOCK() the vnode after VOP_GETATTR()
and then put the vn_lock() call that comes after vm_page_grab() back in or 
whether
r250027 should be reverted (getting rid of the VOP_GETATTR() and going back to
using the size in the vm stuff).

Hopefully Kostik will know what is best to do with it now, rick

 Thanks!
 


Re: NFS deadlock on 9.2-Beta1

2013-08-20 Thread Oliver Pinter
On 8/20/13, J David j.david.li...@gmail.com wrote:
 On Thu, Aug 15, 2013 at 5:39 PM, Rick Macklem rmack...@uoguelph.ca wrote:
 Have you been able to pass the debugging info on to Kostik?

 It would be really nice to get this fixed for FreeBSD9.2.

 You're probably not talking to me, but headway here is slow.  At our
 location, we have been continuing to test releng/9.2 extensively, but
 with r250907 reverted.  Since reverting it solves the issue, and since
 there haven't been any further changes to releng/9.2 that might also
 resolve this issue, re-applying r250907 is perceived here as un-fixing
 a problem.  Enthusiasm for doing so is correspondingly low, even if
 the purpose is to gather debugging info. :(

 However, after finally having clearance to test releng/9.2 r254540
 with r250907 included and with DDB on five nodes.  The problem cropped
 up in about an hour.  Two threads in one process deadlocked, was
 perfect.  Got it into DDB and saw the stack trace was scrolling off so
 there was no way to copy it by hand.  Also, the machine's disk is
 smaller than physical RAM, so no dump file. :(

 Here's what is available so far:

 db show proc 33362

 Process 33362 (httpd) at 0xcd225b50:

  state: NORMAL

  uid: 25000 gids: 25000

  parent: pid 25104 at 0xc95f92d4

  ABI: FreeBSD ELF32

  arguments: /usr/local/libexec/httpd

  threads: 3

 100405 D newnfs 0xc9b875e4 httpd

 100393 D pgrbwt 0xc43a30c0 httpd

 100755 S uwait 0xc84b7c80 httpd


 Not much to go on. :(  Maybe these five can be configured with serial
 consoles.

try this with serial console:

host # script debug-output-file
host # cu -s 9600 -l /dev/ttyU0
~^B
KDB: enter: Break to debugger
[ thread pid 11 tid 15 ]
Stopped at  kdb_alt_break_internal+0x17f:   movq    $0,kdb_why
db> show msgbuf

...
~.
^D


 So, inquiries are continuing, but the answer to does this still
 happen on 9.2-RC2? is definitely yes.

 Thanks!



Re: NFS deadlock on 9.2-Beta1

2013-08-19 Thread J David
On Thu, Aug 15, 2013 at 5:39 PM, Rick Macklem rmack...@uoguelph.ca wrote:
 Have you been able to pass the debugging info on to Kostik?

 It would be really nice to get this fixed for FreeBSD9.2.

You're probably not talking to me, but headway here is slow.  At our
location, we have been continuing to test releng/9.2 extensively, but
with r250907 reverted.  Since reverting it solves the issue, and since
there haven't been any further changes to releng/9.2 that might also
resolve this issue, re-applying r250907 is perceived here as un-fixing
a problem.  Enthusiasm for doing so is correspondingly low, even if
the purpose is to gather debugging info. :(

However, we finally got clearance to test releng/9.2 r254540
with r250907 included and with DDB on five nodes.  The problem cropped
up in about an hour.  Two threads in one process deadlocked, which was
perfect.  Got it into DDB and saw the stack trace was scrolling off, so
there was no way to copy it by hand.  Also, the machine's disk is
smaller than physical RAM, so no dump file. :(

Here's what is available so far:

db> show proc 33362

Process 33362 (httpd) at 0xcd225b50:

 state: NORMAL

 uid: 25000 gids: 25000

 parent: pid 25104 at 0xc95f92d4

 ABI: FreeBSD ELF32

 arguments: /usr/local/libexec/httpd

 threads: 3

100405 D newnfs 0xc9b875e4 httpd

100393 D pgrbwt 0xc43a30c0 httpd

100755 S uwait 0xc84b7c80 httpd


Not much to go on. :(  Maybe these five can be configured with serial consoles.

So, inquiries are continuing, but the answer to "does this still
happen on 9.2-RC2?" is definitely yes.

Thanks!


Re: NFS deadlock on 9.2-Beta1

2013-08-15 Thread Rick Macklem
Michael Tratz wrote:
 
 On Jul 27, 2013, at 11:25 PM, Konstantin Belousov
 kostik...@gmail.com wrote:
 
  On Sat, Jul 27, 2013 at 03:13:05PM -0700, Michael Tratz wrote:
  Let's assume the pid which started the deadlock is 14001 (it will
  be a different pid when we get the results, because the machine
  has been restarted)
  
  I type:
  
  show proc 14001
  
  I get the thread numbers from that output and type:
  
  show thread x
  
  for each one.
  
  And a trace for each thread with the command?
  
  tr 
  
  Anything else I should try to get or do? Or is that not the data
  at all you are looking for?
  
  Yes, everything else which is listed in the 'debugging deadlocks'
  page
  must be provided, otherwise the deadlock cannot be tracked.
  
  The investigator should be able to see the whole deadlock chain
  (loop)
  to make any useful advance.
 
 Ok, I have made some excellent progress in debugging the NFS
 deadlock.
 
 Rick! You are a genius. :-) You found the right commit: r250907 (dated
 May 22) is definitely the problem.
 
 Here is how I did the testing: One machine received a kernel before
 r250907, the second machine received a kernel after r250907. Sure
 enough within a few hours the machine with r250907 went into the
 usual deadlock state. The machine without that commit kept on
 working fine. Then I went back to the latest revision (r253726), but
 leaving r250907 out. The machines have been running happy and rock
 solid without any deadlocks. I have expanded the testing to 3
 machines now and no reports of any issues.
 
 I guess now Konstantin has to figure out why that commit is causing
 the deadlock. Lovely! :-) I will get that information as soon as
 possible. I'm a little behind with normal work load, but I expect to
 have the data by Tuesday evening or Wednesday.
 
Have you been able to pass the debugging info on to Kostik?

It would be really nice to get this fixed for FreeBSD 9.2.

Thanks for your help with this, rick

 Thanks again!!
 
 Michael
 
 


Re: NFS deadlock on 9.2-Beta1

2013-08-05 Thread Mark Saad


On Jul 29, 2013, at 10:48 PM, J David j.david.li...@gmail.com wrote:

 If it is helpful, we have 25 nodes testing the 9.2-BETA1 build and
 without especially trying to exercise this bug, we found
 sendfile()-using processes deadlocked in WCHAN newnfs on 5 of the 25
 nodes.  The ones with highest uptime (about 3 days) seem most
 affected, so it does seem like a sooner or later type of thing.
 Hopefully the fix is easy and it won't be an issue, but it definitely
 does seem like a problem 9.2-RELEASE would be better off without.
 
 Unfortunately we are not in a position to capture the requested
 debugging information at this time; none of those nodes are running a
 debug version of the kernel.  If Michael is unable to get the
 information as he hopes, we can try to do that, possibly over the
 weekend.  For the time being, we will convert half the machines to
 rollback r250907 to try to confirm that resolves the issue.
 
 Thanks all!  If one has to encounter a problem like this, it is nice
 to come to the list and find the research already so well underway!

All,
  Are there any updates on this issue? Has anyone tested it or seen it happen on
the release candidate?
---
Mark Saad | mark.s...@longcount.org



Re: NFS deadlock on 9.2-Beta1

2013-08-05 Thread J David
On Mon, Aug 5, 2013 at 12:06 PM, Mark Saad nones...@longcount.org wrote:
   Are there any updates on this issue? Has anyone tested it or seen it happen
 on the release candidate?

It's a bit premature for that; the RC has been out for a few hours.
We put BETA2 on 25 nodes and only saw the problem on five after 24
hours.  At that point we switched to a build that reverts the patch
that causes the deadlock and no node on which that was done (at this
point, all of them) has had the problem since.

We'll get some machines on releng/9.2 today, but I didn't see anything
in the release candidate announcement to indicate that relevant
changes had been made.

Is there anything in the release candidate that is believed to address
this issue?  If so, let us know which svn revision it's in and we'll
try to accelerate test deployment.

Thanks!


Re: NFS deadlock on 9.2-Beta1

2013-07-29 Thread Michael Tratz

On Jul 27, 2013, at 11:25 PM, Konstantin Belousov kostik...@gmail.com wrote:

 On Sat, Jul 27, 2013 at 03:13:05PM -0700, Michael Tratz wrote:
 Let's assume the pid which started the deadlock is 14001 (it will be a 
 different pid when we get the results, because the machine has been 
 restarted)
 
 I type:
 
 show proc 14001
 
 I get the thread numbers from that output and type:
 
 show thread x
 
 for each one.
 
 And a trace for each thread with the command?
 
 tr 
 
 Anything else I should try to get or do? Or is that not the data at all you 
 are looking for?
 
 Yes, everything else which is listed in the 'debugging deadlocks' page
 must be provided, otherwise the deadlock cannot be tracked.
 
 The investigator should be able to see the whole deadlock chain (loop)
 to make any useful advance.

Ok, I have made some excellent progress in debugging the NFS deadlock.

Rick! You are a genius. :-) You found the right commit: r250907 (dated May 22)
is definitely the problem.

Here is how I did the testing: one machine received a kernel from before
r250907, the second machine received a kernel from after r250907. Sure enough,
within a few hours the machine with r250907 went into the usual deadlock state.
The machine without that commit kept working fine. Then I went back to the
latest revision (r253726), but with r250907 left out. The machines have been
running happily and rock solid without any deadlocks. I have expanded the
testing to 3 machines now, with no reports of any issues.
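
(For anyone who wants to repeat this kind of A/B test, backing a single
revision out of a checkout is a one-liner; a minimal sketch, assuming a
subversion working copy of a tree that already contains r250907 and a GENERIC
kernel config, so adjust paths and KERNCONF to taste:)

host # cd /usr/src
host # svn merge -c -250907 .        # reverse-merge just that one revision
host # make buildkernel KERNCONF=GENERIC
host # make installkernel KERNCONF=GENERIC && shutdown -r now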

I guess now Konstantin has to figure out why that commit is causing the 
deadlock. Lovely! :-) I will get that information as soon as possible. I'm a 
little behind with normal work load, but I expect to have the data by Tuesday 
evening or Wednesday.

Thanks again!!

Michael



Re: NFS deadlock on 9.2-Beta1

2013-07-29 Thread Rick Macklem
Michael Tratz wrote:
 
 On Jul 27, 2013, at 11:25 PM, Konstantin Belousov
 kostik...@gmail.com wrote:
 
  On Sat, Jul 27, 2013 at 03:13:05PM -0700, Michael Tratz wrote:
  Let's assume the pid which started the deadlock is 14001 (it will
  be a different pid when we get the results, because the machine
  has been restarted)
  
  I type:
  
  show proc 14001
  
  I get the thread numbers from that output and type:
  
  show thread x
  
  for each one.
  
  And a trace for each thread with the command?
  
  tr 
  
  Anything else I should try to get or do? Or is that not the data
  at all you are looking for?
  
  Yes, everything else which is listed in the 'debugging deadlocks'
  page
  must be provided, otherwise the deadlock cannot be tracked.
  
  The investigator should be able to see the whole deadlock chain
  (loop)
  to make any useful advance.
 
 Ok, I have made some excellent progress in debugging the NFS
 deadlock.
 
 Rick! You are a genius. :-) You found the right commit: r250907 (dated
 May 22) is definitely the problem.
 
Nowhere close, take my word for it ;-) (At least you put a smiley after it.)
(I've never actually even been employed as a software developer, but that's off 
topic.)

I just got lucky (basically there wasn't any other commit that seemed like it
might have caused this).

But, the good news is that it is partially isolated. Hopefully the debugging 
stuff
you get for Kostik will allow him (I suspect he is a genius) to solve the 
problem.
(If I were going to take another shot in the dark, I'd guess it's r250027 moving
the vn_lock() call. Maybe calling vm_page_grab() with the shared vnode lock
held?)

I've added re@ to the cc list, since I think this might be a show stopper for 
9.2?

Thanks for reporting this and all your help with tracking it down, rick

 Here is how I did the testing: One machine received a kernel before
 r250907, the second machine received a kernel after r250907. Sure
 enough within a few hours the machine with r250907 went into the
 usual deadlock state. The machine without that commit kept on
 working fine. Then I went back to the latest revision (r253726), but
 leaving r250907 out. The machines have been running happy and rock
 solid without any deadlocks. I have expanded the testing to 3
 machines now and no reports of any issues.
 
 I guess now Konstantin has to figure out why that commit is causing
 the deadlock. Lovely! :-) I will get that information as soon as
 possible. I'm a little behind with normal work load, but I expect to
 have the data by Tuesday evening or Wednesday.
 
 Thanks again!!
 
 Michael
 
 


Re: NFS deadlock on 9.2-Beta1

2013-07-29 Thread J David
If it is helpful, we have 25 nodes testing the 9.2-BETA1 build, and
without especially trying to exercise this bug we found
sendfile()-using processes deadlocked in WCHAN newnfs on 5 of the 25
nodes.  The ones with the highest uptime (about 3 days) seem most
affected, so it does seem like a "sooner or later" type of thing.
Hopefully the fix is easy and it won't be an issue, but it definitely
does seem like a problem 9.2-RELEASE would be better off without.

Unfortunately we are not in a position to capture the requested
debugging information at this time; none of those nodes are running a
debug version of the kernel.  If Michael is unable to get the
information as he hopes, we can try to do that, possibly over the
weekend.  For the time being, we will convert half the machines to a
build that rolls back r250907, to try to confirm that resolves the issue.

Thanks all!  If one has to encounter a problem like this, it is nice
to come to the list and find the research already so well underway!


Re: NFS deadlock on 9.2-Beta1

2013-07-28 Thread Konstantin Belousov
On Sat, Jul 27, 2013 at 03:13:05PM -0700, Michael Tratz wrote:
 Let's assume the pid which started the deadlock is 14001 (it will be a 
 different pid when we get the results, because the machine has been restarted)
 
 I type:
 
 show proc 14001
 
 I get the thread numbers from that output and type:
 
 show thread x
 
 for each one.
 
 And a trace for each thread with the command?
 
 tr 
 
 Anything else I should try to get or do? Or is that not the data at all you 
 are looking for?
 
Yes, everything else which is listed in the 'debugging deadlocks' page
must be provided, otherwise the deadlock cannot be tracked.

The investigator should be able to see the whole deadlock chain (loop)
to make any useful advance.
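
(For reference, the 'debugging deadlocks' page referenced throughout this
thread boils down to collecting roughly the following from DDB; this list is
from memory, so treat it as an approximation and defer to the handbook page
for the authoritative set:)

db> ps
db> show allpcpu
db> show alllocks
db> show lockedvnods
db> alltrace

plus a crash dump, if the machine can write one, so the state can be examined
after the fact.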




Re: NFS deadlock on 9.2-Beta1

2013-07-27 Thread Daniel Braniss
 
 On Jul 24, 2013, at 5:25 PM, Rick Macklem rmack...@uoguelph.ca wrote:
 
  Michael Tratz wrote:
  Two machines (NFS Server: running ZFS / Client: disk-less), both are
  running FreeBSD r253506. The NFS client starts to deadlock processes
  within a few hours. It usually gets worse from there on. The
  processes stay in D state. I haven't been able to reproduce it
  when I want it to happen. I only have to wait a few hours until the
  deadlocks occur when traffic to the client machine starts to pick
  up. The only way to fix the deadlocks is to reboot the client. Even
  an ls to the path which is deadlocked, will deadlock ls itself. It's
  totally random what part of the file system gets deadlocked. The NFS
  server itself has no problem at all to access the files/path when
  something is deadlocked on the client.
  
  Last night I decided to put an older kernel on the system r252025
  (June 20th). The NFS server stayed untouched. So far 0 deadlocks on
  the client machine (it should have deadlocked by now). FreeBSD is
  working hard like it always does. :-) There are a few changes to the
  NFS code from the revision which seems to work until Beta1. I
  haven't tried to narrow it down if one of those commits are causing
  the problem. Maybe someone has an idea what could be wrong and I can
  test a patch or if it's something else, because I'm not a kernel
  expert. :-)
  
  Well, the only NFS client change committed between r252025 and r253506
  is r253124. It fixes a file corruption problem caused by a previous
  commit that delayed the vnode_pager_setsize() call until after the
  nfs node mutex lock was unlocked.
  
  If you can test with only r253124 reverted to see if that gets rid of
  the hangs, it would be useful, although from the procstats, I doubt it.
  
  I have run several procstat -kk on the processes including the ls
  which deadlocked. You can see them here:
  
  http://pastebin.com/1RPnFT6r
  
  All the processes you show seem to be stuck waiting for a vnode lock
  or in __utmx_op_wait. (I`m not sure what the latter means.)
  
  What is missing is what processes are holding the vnode locks and
  what they are stuck on.
  
  A starting point might be ``ps axhl``, to see what all the threads
  are doing (particularily the WCHAN for them all). If you can drop into
  the debugger when the NFS mounts are hung and do a ```show alllocks``
  that could help. See:
  http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
  
  I`ll admit I`d be surprised if r253124 caused this, but who knows.
  
  If there have been changes to your network device driver between
  r252025 and r253506, I`d try reverting those. (If an RPC gets stuck
  waiting for a reply while holding a vnode lock, that would do it.)
  
  Good luck with it and maybe someone else can think of a commit
  between r252025 and r253506 that could cause vnode locking or network
  problems.
  
  rick
  
  
  I have tried to mount the file system with and without nolockd. It
  didn't make a difference. Other than that it is mounted with:
  
  rw,nfsv3,tcp,noatime,rsize=32768,wsize=32768
  
  Let me know if you need me to do something else or if some other
  output is required. I would have to go back to the problem kernel
  and wait until the deadlock occurs to get that information.
  
 
 Thanks Rick and Steven for your quick replies.
 
 I spoke too soon regarding r252025 fixing the problem. The same issue started 
 to show up after about 1 day and a few hours of uptime.
 
 ps axhl shows all those stuck processes in newnfs
 
 I recompiled the GENERIC kernel for Beta1 with the debugging options:
 
 http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
 
 ps and debugging output:
 
 http://pastebin.com/1v482Dfw
 
 (I only listed processes matching newnfs, if you need the whole list, please 
 let me know)
 
 The first PID showing up having that problem is 14001. Certainly the show 
 alllocks command shows interesting information for that PID.
 I looked through the commit history for those files mentioned in the output 
 to see if there is something obvious to me. But I don't know. :-)
 I hope that information helps you to dig deeper into the issue what might be 
 causing those deadlocks.
 
 I did include the pciconf -lv, because you mentioned network device drivers. 
 It's Intel igb. The same hardware is running a kernel from January 19th, 2013 
 also as an NFS client. That machine is rock solid. No problems at all.
 
 I also went to r251611. That's before r251641 (The NFS FHA changes). Same 
 problem. Here is another debugging output from that kernel:
 
 http://pastebin.com/ryv8BYc4
 
 If I should test something else or provide some other output, please let me 
 know.
 
 Again thank you!
 
 Michael

just a quick 'me too': it usually happens on our ftp server, and it's been
happening for a long time. It's diskless, and it happens randomly, so it's
difficult to 

Re: NFS deadlock on 9.2-Beta1

2013-07-27 Thread Rick Macklem
Michael Tratz wrote:
 
 On Jul 24, 2013, at 5:25 PM, Rick Macklem rmack...@uoguelph.ca
 wrote:
 
  Michael Tratz wrote:
  Two machines (NFS Server: running ZFS / Client: disk-less), both
  are
  running FreeBSD r253506. The NFS client starts to deadlock
  processes
  within a few hours. It usually gets worse from there on. The
  processes stay in D state. I haven't been able to reproduce it
  when I want it to happen. I only have to wait a few hours until
  the
  deadlocks occur when traffic to the client machine starts to pick
  up. The only way to fix the deadlocks is to reboot the client.
  Even
  an ls to the path which is deadlocked, will deadlock ls itself.
  It's
  totally random what part of the file system gets deadlocked. The
  NFS
  server itself has no problem at all to access the files/path when
  something is deadlocked on the client.
  
  Last night I decided to put an older kernel on the system r252025
  (June 20th). The NFS server stayed untouched. So far 0 deadlocks
  on
  the client machine (it should have deadlocked by now). FreeBSD is
  working hard like it always does. :-) There are a few changes to
  the
  NFS code from the revision which seems to work until Beta1. I
  haven't tried to narrow it down if one of those commits are
  causing
  the problem. Maybe someone has an idea what could be wrong and I
  can
  test a patch or if it's something else, because I'm not a kernel
  expert. :-)
  
  Well, the only NFS client change committed between r252025 and
  r253506
  is r253124. It fixes a file corruption problem caused by a previous
  commit that delayed the vnode_pager_setsize() call until after the
  nfs node mutex lock was unlocked.
  
  If you can test with only r253124 reverted to see if that gets rid
  of
  the hangs, it would be useful, although from the procstats, I doubt
  it.
  
  I have run several procstat -kk on the processes including the ls
  which deadlocked. You can see them here:
  
  http://pastebin.com/1RPnFT6r
  
  All the processes you show seem to be stuck waiting for a vnode
  lock
  or in __utmx_op_wait. (I`m not sure what the latter means.)
  
  What is missing is what processes are holding the vnode locks and
  what they are stuck on.
  
  A starting point might be ``ps axhl``, to see what all the threads
  are doing (particularily the WCHAN for them all). If you can drop
  into
  the debugger when the NFS mounts are hung and do a ```show
  alllocks``
  that could help. See:
  http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
  
  I`ll admit I`d be surprised if r253124 caused this, but who knows.
  
  If there have been changes to your network device driver between
  r252025 and r253506, I`d try reverting those. (If an RPC gets stuck
  waiting for a reply while holding a vnode lock, that would do it.)
  
  Good luck with it and maybe someone else can think of a commit
  between r252025 and r253506 that could cause vnode locking or
  network
  problems.
  
  rick
  
  
  I have tried to mount the file system with and without nolockd. It
  didn't make a difference. Other than that it is mounted with:
  
  rw,nfsv3,tcp,noatime,rsize=32768,wsize=32768
  
  Let me know if you need me to do something else or if some other
  output is required. I would have to go back to the problem kernel
  and wait until the deadlock occurs to get that information.
  
 
 Thanks Rick and Steven for your quick replies.
 
 I spoke too soon regarding r252025 fixing the problem. The same issue
 started to show up after about 1 day and a few hours of uptime.
 
 ps axhl shows all those stuck processes in newnfs
 
 I recompiled the GENERIC kernel for Beta1 with the debugging options:
 
 http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
 
 ps and debugging output:
 
 http://pastebin.com/1v482Dfw
 
 (I only listed processes matching newnfs, if you need the whole list,
 please let me know)
 
Is your show alllocks complete? If not, a complete list of locks
would definitely help. As for ps axhl, a complete list of processes/threads
might be useful, but not as much, I think.

 The first PID showing up having that problem is 14001. Certainly the
 show alllocks command shows interesting information for that PID.
 I looked through the commit history for those files mentioned in the
 output to see if there is something obvious to me. But I don't know.
 :-)
 I hope that information helps you to dig deeper into the issue what
 might be causing those deadlocks.
 
Well, pid 14001 is interesting in that it holds both the sleep lock
acquired by sblock() and an NFS vnode lock, but is waiting for another
NFS vnode lock, if I read the pastebin stuff correctly.

I suspect that this process is somewhere in kern_sendfile(), since that
seems to be where sblock() gets called before vn_lock().

It's just a shot in the dark, but if you can revert
r250907 (dated May 22), it might be worth a try. It adds a bunch of
stuff 

Re: NFS deadlock on 9.2-Beta1

2013-07-27 Thread Konstantin Belousov
On Sat, Jul 27, 2013 at 04:20:49PM -0400, Rick Macklem wrote:
 Michael Tratz wrote:
  
  On Jul 24, 2013, at 5:25 PM, Rick Macklem rmack...@uoguelph.ca
  wrote:
  
   Michael Tratz wrote:
   Two machines (NFS Server: running ZFS / Client: disk-less), both
   are
   running FreeBSD r253506. The NFS client starts to deadlock
   processes
   within a few hours. It usually gets worse from there on. The
   processes stay in D state. I haven't been able to reproduce it
   when I want it to happen. I only have to wait a few hours until
   the
   deadlocks occur when traffic to the client machine starts to pick
   up. The only way to fix the deadlocks is to reboot the client.
   Even
   an ls to the path which is deadlocked, will deadlock ls itself.
   It's
   totally random what part of the file system gets deadlocked. The
   NFS
   server itself has no problem at all to access the files/path when
   something is deadlocked on the client.
   
   Last night I decided to put an older kernel on the system r252025
   (June 20th). The NFS server stayed untouched. So far 0 deadlocks
   on
   the client machine (it should have deadlocked by now). FreeBSD is
   working hard like it always does. :-) There are a few changes to
   the
   NFS code from the revision which seems to work until Beta1. I
   haven't tried to narrow it down if one of those commits are
   causing
   the problem. Maybe someone has an idea what could be wrong and I
   can
   test a patch or if it's something else, because I'm not a kernel
   expert. :-)
   
   Well, the only NFS client change committed between r252025 and
   r253506
   is r253124. It fixes a file corruption problem caused by a previous
   commit that delayed the vnode_pager_setsize() call until after the
   nfs node mutex lock was unlocked.
   
   If you can test with only r253124 reverted to see if that gets rid
   of
   the hangs, it would be useful, although from the procstats, I doubt
   it.
   
   I have run several procstat -kk on the processes including the ls
   which deadlocked. You can see them here:
   
   http://pastebin.com/1RPnFT6r
   
   All the processes you show seem to be stuck waiting for a vnode
   lock
   or in __utmx_op_wait. (I`m not sure what the latter means.)
   
   What is missing is what processes are holding the vnode locks and
   what they are stuck on.
   
   A starting point might be ``ps axhl``, to see what all the threads
   are doing (particularily the WCHAN for them all). If you can drop
   into
   the debugger when the NFS mounts are hung and do a ```show
   alllocks``
   that could help. See:
   http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
   
   I`ll admit I`d be surprised if r253124 caused this, but who knows.
   
   If there have been changes to your network device driver between
   r252025 and r253506, I`d try reverting those. (If an RPC gets stuck
   waiting for a reply while holding a vnode lock, that would do it.)
   
   Good luck with it and maybe someone else can think of a commit
   between r252025 and r253506 that could cause vnode locking or
   network
   problems.
   
   rick
   
   
   I have tried to mount the file system with and without nolockd. It
   didn't make a difference. Other than that it is mounted with:
   
   rw,nfsv3,tcp,noatime,rsize=32768,wsize=32768
   
   Let me know if you need me to do something else or if some other
   output is required. I would have to go back to the problem kernel
   and wait until the deadlock occurs to get that information.
   
  
  Thanks Rick and Steven for your quick replies.
  
  I spoke too soon regarding r252025 fixing the problem. The same issue
  started to show up after about 1 day and a few hours of uptime.
  
  ps axhl shows all those stuck processes in newnfs
  
  I recompiled the GENERIC kernel for Beta1 with the debugging options:
  
  http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
  
  ps and debugging output:
  
  http://pastebin.com/1v482Dfw
  
  (I only listed processes matching newnfs, if you need the whole list,
  please let me know)
  
 Is your show alllocks complete? If not, a complete list of locks
 would definitely help. As for ps axhl, a complete list of processes/threads
 might be useful, but not as much, I think.
 
  The first PID showing up having that problem is 14001. Certainly the
  show alllocks command shows interesting information for that PID.
  I looked through the commit history for those files mentioned in the
  output to see if there is something obvious to me. But I don't know.
  :-)
  I hope that information helps you to dig deeper into the issue what
  might be causing those deadlocks.
  
 Well, pid 14001 is interesting in that it holds both the sleep lock
 acquired by sblock() and an NFS vnode lock, but is waiting for another
 NFS vnode lock, if I read the pastebin stuff correctly.
 
 I suspect that this process is somewhere in kern_sendfile(), since 

Re: NFS deadlock on 9.2-Beta1

2013-07-27 Thread Michael Tratz

On Jul 27, 2013, at 1:20 PM, Rick Macklem rmack...@uoguelph.ca wrote:

 Michael Tratz wrote:
 
 On Jul 24, 2013, at 5:25 PM, Rick Macklem rmack...@uoguelph.ca
 wrote:
 
 Michael Tratz wrote:
 Two machines (NFS Server: running ZFS / Client: disk-less), both
 are
 running FreeBSD r253506. The NFS client starts to deadlock
 processes
 within a few hours. It usually gets worse from there on. The
 processes stay in D state. I haven't been able to reproduce it
 when I want it to happen. I only have to wait a few hours until
 the
 deadlocks occur when traffic to the client machine starts to pick
 up. The only way to fix the deadlocks is to reboot the client.
 Even
 an ls to the path which is deadlocked, will deadlock ls itself.
 It's
 totally random what part of the file system gets deadlocked. The
 NFS
 server itself has no problem at all to access the files/path when
 something is deadlocked on the client.
 
 Last night I decided to put an older kernel on the system r252025
 (June 20th). The NFS server stayed untouched. So far 0 deadlocks
 on
 the client machine (it should have deadlocked by now). FreeBSD is
 working hard like it always does. :-) There are a few changes to
 the
 NFS code from the revision which seems to work until Beta1. I
 haven't tried to narrow it down if one of those commits are
 causing
 the problem. Maybe someone has an idea what could be wrong and I
 can
 test a patch or if it's something else, because I'm not a kernel
 expert. :-)
 
 Well, the only NFS client change committed between r252025 and
 r253506
 is r253124. It fixes a file corruption problem caused by a previous
 commit that delayed the vnode_pager_setsize() call until after the
 nfs node mutex lock was unlocked.
 
 If you can test with only r253124 reverted to see if that gets rid
 of
 the hangs, it would be useful, although from the procstats, I doubt
 it.
 
 I have run several procstat -kk on the processes including the ls
 which deadlocked. You can see them here:
 
 http://pastebin.com/1RPnFT6r
 
 All the processes you show seem to be stuck waiting for a vnode
 lock
 or in __utmx_op_wait. (I`m not sure what the latter means.)
 
 What is missing is what processes are holding the vnode locks and
 what they are stuck on.
 
 A starting point might be ``ps axhl``, to see what all the threads
 are doing (particularily the WCHAN for them all). If you can drop
 into
 the debugger when the NFS mounts are hung and do a ```show
 alllocks``
 that could help. See:
 http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
 
 I`ll admit I`d be surprised if r253124 caused this, but who knows.
 
 If there have been changes to your network device driver between
 r252025 and r253506, I`d try reverting those. (If an RPC gets stuck
 waiting for a reply while holding a vnode lock, that would do it.)
 
 Good luck with it and maybe someone else can think of a commit
 between r252025 and r253506 that could cause vnode locking or
 network
 problems.
 
 rick
 
 
 I have tried to mount the file system with and without nolockd. It
 didn't make a difference. Other than that it is mounted with:
 
 rw,nfsv3,tcp,noatime,rsize=32768,wsize=32768
 
 Let me know if you need me to do something else or if some other
 output is required. I would have to go back to the problem kernel
 and wait until the deadlock occurs to get that information.
 
 
 Thanks Rick and Steven for your quick replies.
 
 I spoke too soon regarding r252025 fixing the problem. The same issue
 started to show up after about 1 day and a few hours of uptime.
 
 ps axhl shows all those stuck processes in newnfs
 
 I recompiled the GENERIC kernel for Beta1 with the debugging options:
 
 http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
 
 ps and debugging output:
 
 http://pastebin.com/1v482Dfw
 
 (I only listed processes matching newnfs, if you need the whole list,
 please let me know)
 
 Is your show alllocks complete? If not, a complete list of locks
 would definitely help. As for ps axhl, a complete list of processes/threads
 might be useful, but not as much, I think.

Yes, that was the entire output of show alllocks.

 
 The first PID showing up having that problem is 14001. Certainly the
 show alllocks command shows interesting information for that PID.
 I looked through the commit history for those files mentioned in the
 output to see if there is something obvious to me. But I don't know.
 :-)
 I hope that information helps you to dig deeper into the issue what
 might be causing those deadlocks.
 
 Well, pid 14001 is interesting in that it holds both the sleep lock
 acquired by sblock() and an NFS vnode lock, but is waiting for another
 NFS vnode lock, if I read the pastebin stuff correctly.
 
 I suspect that this process is somewhere in kern_sendfile(), since that
 seems to be where sblock() gets called before vn_lock().
 
 It's just a shot in the dark, but if you can revert
 r250907 (dated May 

Re: NFS deadlock on 9.2-Beta1

2013-07-27 Thread Michael Tratz

On Jul 27, 2013, at 1:58 PM, Konstantin Belousov kostik...@gmail.com wrote:

 On Sat, Jul 27, 2013 at 04:20:49PM -0400, Rick Macklem wrote:
 Michael Tratz wrote:
 
 On Jul 24, 2013, at 5:25 PM, Rick Macklem rmack...@uoguelph.ca
 wrote:
 
 Michael Tratz wrote:
 Two machines (NFS Server: running ZFS / Client: disk-less), both
 are
 running FreeBSD r253506. The NFS client starts to deadlock
 processes
 within a few hours. It usually gets worse from there on. The
 processes stay in D state. I haven't been able to reproduce it
 when I want it to happen. I only have to wait a few hours until
 the
 deadlocks occur when traffic to the client machine starts to pick
 up. The only way to fix the deadlocks is to reboot the client.
 Even
 an ls to the path which is deadlocked, will deadlock ls itself.
 It's
 totally random what part of the file system gets deadlocked. The
 NFS
 server itself has no problem at all to access the files/path when
 something is deadlocked on the client.
 
 Last night I decided to put an older kernel on the system r252025
 (June 20th). The NFS server stayed untouched. So far 0 deadlocks
 on
 the client machine (it should have deadlocked by now). FreeBSD is
 working hard like it always does. :-) There are a few changes to
 the
 NFS code from the revision which seems to work until Beta1. I
 haven't tried to narrow it down if one of those commits are
 causing
 the problem. Maybe someone has an idea what could be wrong and I
 can
 test a patch or if it's something else, because I'm not a kernel
 expert. :-)
 
 Well, the only NFS client change committed between r252025 and
 r253506
 is r253124. It fixes a file corruption problem caused by a previous
 commit that delayed the vnode_pager_setsize() call until after the
 nfs node mutex lock was unlocked.
 
 If you can test with only r253124 reverted to see if that gets rid
 of
 the hangs, it would be useful, although from the procstats, I doubt
 it.
 
 I have run several procstat -kk on the processes including the ls
 which deadlocked. You can see them here:
 
 http://pastebin.com/1RPnFT6r
 
 All the processes you show seem to be stuck waiting for a vnode
 lock
 or in __utmx_op_wait. (I`m not sure what the latter means.)
 
 What is missing is what processes are holding the vnode locks and
 what they are stuck on.
 
 A starting point might be ``ps axhl``, to see what all the threads
 are doing (particularily the WCHAN for them all). If you can drop
 into
 the debugger when the NFS mounts are hung and do a ```show
 alllocks``
 that could help. See:
 http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
 
 I`ll admit I`d be surprised if r253124 caused this, but who knows.
 
 If there have been changes to your network device driver between
 r252025 and r253506, I`d try reverting those. (If an RPC gets stuck
 waiting for a reply while holding a vnode lock, that would do it.)
 
 Good luck with it and maybe someone else can think of a commit
 between r252025 and r253506 that could cause vnode locking or
 network
 problems.
 
 rick
 
 
 I have tried to mount the file system with and without nolockd. It
 didn't make a difference. Other than that it is mounted with:
 
 rw,nfsv3,tcp,noatime,rsize=32768,wsize=32768
 
 Let me know if you need me to do something else or if some other
 output is required. I would have to go back to the problem kernel
 and wait until the deadlock occurs to get that information.
 
 
 Thanks Rick and Steven for your quick replies.
 
 I spoke too soon regarding r252025 fixing the problem. The same issue
 started to show up after about 1 day and a few hours of uptime.
 
 ps axhl shows all those stuck processes in newnfs
 
 I recompiled the GENERIC kernel for Beta1 with the debugging options:
 
 http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
 
 ps and debugging output:
 
 http://pastebin.com/1v482Dfw
 
 (I only listed processes matching newnfs, if you need the whole list,
 please let me know)
 
 Is your show alllocks complete? If not, a complete list of locks
 would definitely help. As for ps axhl, a complete list of processes/threads
 might be useful, but not as much, I think.
 
 The first PID showing up having that problem is 14001. Certainly the
 show alllocks command shows interesting information for that PID.
 I looked through the commit history for those files mentioned in the
 output to see if there is something obvious to me. But I don't know.
 :-)
 I hope that information helps you to dig deeper into the issue what
 might be causing those deadlocks.
 
 Well, pid 14001 is interesting in that it holds both the sleep lock
 acquired by sblock() and an NFS vnode lock, but is waiting for another
 NFS vnode lock, if I read the pastebin stuff correctly.
 
 I suspect that this process is somewhere in kern_sendfile(), since that
 seems to be where sblock() gets called before vn_lock().
 14001 is multithreaded, two of its threads own shared vnode 

Re: NFS deadlock on 9.2-Beta1

2013-07-25 Thread Michael Tratz

On Jul 24, 2013, at 5:25 PM, Rick Macklem rmack...@uoguelph.ca wrote:

 Michael Tratz wrote:
 Two machines (NFS Server: running ZFS / Client: disk-less), both are
 running FreeBSD r253506. The NFS client starts to deadlock processes
 within a few hours. It usually gets worse from there on. The
 processes stay in D state. I haven't been able to reproduce it
 when I want it to happen. I only have to wait a few hours until the
 deadlocks occur when traffic to the client machine starts to pick
 up. The only way to fix the deadlocks is to reboot the client. Even
 an ls to the path which is deadlocked, will deadlock ls itself. It's
 totally random what part of the file system gets deadlocked. The NFS
 server itself has no problem at all to access the files/path when
 something is deadlocked on the client.
 
 Last night I decided to put an older kernel on the system r252025
 (June 20th). The NFS server stayed untouched. So far 0 deadlocks on
 the client machine (it should have deadlocked by now). FreeBSD is
 working hard like it always does. :-) There are a few changes to the
 NFS code from the revision which seems to work until Beta1. I
 haven't tried to narrow it down if one of those commits are causing
 the problem. Maybe someone has an idea what could be wrong and I can
 test a patch or if it's something else, because I'm not a kernel
 expert. :-)
 
 Well, the only NFS client change committed between r252025 and r253506
 is r253124. It fixes a file corruption problem caused by a previous
 commit that delayed the vnode_pager_setsize() call until after the
 nfs node mutex lock was unlocked.
 
 If you can test with only r253124 reverted to see if that gets rid of
 the hangs, it would be useful, although from the procstats, I doubt it.
 
 I have run several procstat -kk on the processes including the ls
 which deadlocked. You can see them here:
 
 http://pastebin.com/1RPnFT6r
 
 All the processes you show seem to be stuck waiting for a vnode lock
 or in __utmx_op_wait. (I`m not sure what the latter means.)
 
 What is missing is what processes are holding the vnode locks and
 what they are stuck on.
 
 A starting point might be ``ps axhl``, to see what all the threads
 are doing (particularily the WCHAN for them all). If you can drop into
 the debugger when the NFS mounts are hung and do a ```show alllocks``
 that could help. See:
 http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
 
 I`ll admit I`d be surprised if r253124 caused this, but who knows.
 
 If there have been changes to your network device driver between
 r252025 and r253506, I`d try reverting those. (If an RPC gets stuck
 waiting for a reply while holding a vnode lock, that would do it.)
 
 Good luck with it and maybe someone else can think of a commit
 between r252025 and r253506 that could cause vnode locking or network
 problems.
 
 rick
 
 
 I have tried to mount the file system with and without nolockd. It
 didn't make a difference. Other than that it is mounted with:
 
 rw,nfsv3,tcp,noatime,rsize=32768,wsize=32768
 
 Let me know if you need me to do something else or if some other
 output is required. I would have to go back to the problem kernel
 and wait until the deadlock occurs to get that information.
 

Thanks Rick and Steven for your quick replies.

I spoke too soon regarding r252025 fixing the problem. The same issue started 
to show up after about 1 day and a few hours of uptime.

ps axhl shows all those stuck processes in newnfs

I recompiled the GENERIC kernel for Beta1 with the debugging options:

http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
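
(For reference, the lock-debugging options that page suggests adding to the
kernel config are roughly the following; quoted from memory, so double-check
the handbook before relying on the exact list:)

options KDB
options DDB
options INVARIANTS
options INVARIANT_SUPPORT
options WITNESS
options WITNESS_SKIPSPIN
options DEBUG_LOCKS
options DEBUG_VFS_LOCKS
options DIAGNOSTIC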

ps and debugging output:

http://pastebin.com/1v482Dfw

(I only listed processes matching newnfs, if you need the whole list, please 
let me know)

The first PID showing that problem is 14001. Certainly the show alllocks
output has interesting information for that PID.
I looked through the commit history for the files mentioned in that output to
see if something obvious stood out, but I couldn't tell. :-)
I hope that information helps you dig deeper into what might be causing those
deadlocks.

I did include the pciconf -lv output, because you mentioned network device
drivers. It's Intel igb. The same hardware is running a kernel from January
19th, 2013, also as an NFS client. That machine is rock solid. No problems at
all.

I also went to r251611. That's before r251641 (The NFS FHA changes). Same 
problem. Here is another debugging output from that kernel:

http://pastebin.com/ryv8BYc4

If I should test something else or provide some other output, please let me 
know.

Again thank you!

Michael




Re: NFS deadlock on 9.2-Beta1

2013-07-24 Thread Rick Macklem
Michael Tratz wrote:
 Two machines (NFS Server: running ZFS / Client: disk-less), both are
 running FreeBSD r253506. The NFS client starts to deadlock processes
 within a few hours. It usually gets worse from there on. The
 processes stay in D state. I haven't been able to reproduce it
 when I want it to happen. I only have to wait a few hours until the
 deadlocks occur when traffic to the client machine starts to pick
 up. The only way to fix the deadlocks is to reboot the client. Even
 an ls to the path which is deadlocked, will deadlock ls itself. It's
 totally random what part of the file system gets deadlocked. The NFS
 server itself has no problem at all to access the files/path when
 something is deadlocked on the client.
 
 Last night I decided to put an older kernel on the system r252025
 (June 20th). The NFS server stayed untouched. So far 0 deadlocks on
 the client machine (it should have deadlocked by now). FreeBSD is
 working hard like it always does. :-) There are a few changes to the
 NFS code from the revision which seems to work until Beta1. I
 haven't tried to narrow it down if one of those commits are causing
 the problem. Maybe someone has an idea what could be wrong and I can
 test a patch or if it's something else, because I'm not a kernel
 expert. :-)
 
Well, the only NFS client change committed between r252025 and r253506
is r253124. It fixes a file corruption problem caused by a previous
commit that delayed the vnode_pager_setsize() call until after the
nfs node mutex lock was unlocked.

If you can test with only r253124 reverted to see if that gets rid of
the hangs, it would be useful, although from the procstats, I doubt it.

 I have run several procstat -kk on the processes including the ls
 which deadlocked. You can see them here:
 
 http://pastebin.com/1RPnFT6r

All the processes you show seem to be stuck waiting for a vnode lock
or in __umtx_op_wait. (I'm not sure what the latter means.)

What is missing is what processes are holding the vnode locks and
what they are stuck on.

A starting point might be ``ps axhl``, to see what all the threads
are doing (particularly the WCHAN for them all). If you can drop into
the debugger when the NFS mounts are hung and do a ``show alllocks``
that could help. See:
http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
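
(In practice that first pass can be as simple as the following; the pid is a
placeholder for whichever process shows up stuck, and the procstat -kk output
is the same kind of kernel stack already being collected in this thread:)

host # ps axhl | grep newnfs          # threads sleeping on the NFS wait channel
host # procstat -kk <pid>             # kernel stack of one stuck process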

I'll admit I'd be surprised if r253124 caused this, but who knows.

If there have been changes to your network device driver between
r252025 and r253506, I'd try reverting those. (If an RPC gets stuck
waiting for a reply while holding a vnode lock, that would do it.)

Good luck with it and maybe someone else can think of a commit
between r252025 and r253506 that could cause vnode locking or network
problems.

rick

 
 I have tried to mount the file system with and without nolockd. It
 didn't make a difference. Other than that it is mounted with:
 
 rw,nfsv3,tcp,noatime,rsize=32768,wsize=32768
 
 Let me know if you need me to do something else or if some other
 output is required. I would have to go back to the problem kernel
 and wait until the deadlock occurs to get that information.
 
 Thanks for your help,
 
 Michael
 
 


Re: NFS deadlock on 9.2-Beta1

2013-07-24 Thread Steven Hartland


- Original Message - 
From: Rick Macklem rmack...@uoguelph.ca

To: Michael Tratz mich...@esosoft.com
Cc: freebsd-stable@freebsd.org
Sent: Thursday, July 25, 2013 1:25 AM
Subject: Re: NFS deadlock on 9.2-Beta1



Michael Tratz wrote:

Two machines (NFS Server: running ZFS / Client: disk-less), both are
running FreeBSD r253506. The NFS client starts to deadlock processes
within a few hours. It usually gets worse from there on. The
processes stay in D state. I haven't been able to reproduce it
when I want it to happen. I only have to wait a few hours until the
deadlocks occur when traffic to the client machine starts to pick
up. The only way to fix the deadlocks is to reboot the client. Even
an ls to the path which is deadlocked, will deadlock ls itself. It's
totally random what part of the file system gets deadlocked. The NFS
server itself has no problem at all to access the files/path when
something is deadlocked on the client.

Last night I decided to put an older kernel on the system r252025
(June 20th). The NFS server stayed untouched. So far 0 deadlocks on
the client machine (it should have deadlocked by now). FreeBSD is
working hard like it always does. :-) There are a few changes to the
NFS code from the revision which seems to work until Beta1. I
haven't tried to narrow it down if one of those commits are causing
the problem. Maybe someone has an idea what could be wrong and I can
test a patch or if it's something else, because I'm not a kernel
expert. :-)


Well, the only NFS client change committed between r252025 and r253506
is r253124. It fixes a file corruption problem caused by a previous
commit that delayed the vnode_pager_setsize() call until after the
nfs node mutex lock was unlocked.

If you can test with only r253124 reverted to see if that gets rid of
the hangs, it would be useful, although from the procstats, I doubt it.


I have run several procstat -kk on the processes including the ls
which deadlocked. You can see them here:

http://pastebin.com/1RPnFT6r


All the processes you show seem to be stuck waiting for a vnode lock
or in __utmx_op_wait. (I`m not sure what the latter means.)

What is missing is what processes are holding the vnode locks and
what they are stuck on.

A starting point might be ``ps axhl``, to see what all the threads
are doing (particularily the WCHAN for them all). If you can drop into
the debugger when the NFS mounts are hung and do a ```show alllocks``
that could help. See:
http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html

I`ll admit I`d be surprised if r253124 caused this, but who knows.

If there have been changes to your network device driver between
r252025 and r253506, I`d try reverting those. (If an RPC gets stuck
waiting for a reply while holding a vnode lock, that would do it.)

Good luck with it and maybe someone else can think of a commit
between r252025 and r253506 that could cause vnode locking or network
problems.


You could break to the debugger when it happens and run:
show sleepchain
and
show lockchain
to see what's waiting on what.
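
(Run against a thread id taken from a show proc listing like the ones that
appear elsewhere in this thread, that would look something like this; the tid
is purely illustrative:)

db> show lockchain 100405
db> show sleepchain 100405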

   Regards
   Steve

