Re: NFS deadlock on 9.2-Beta1

2013-08-25 Thread J David
On Sun, Aug 25, 2013 at 7:42 AM, Adrian Chadd  wrote:
> Does -HEAD have this same problem?

If I understood kib@ correctly, this is fixed in -HEAD by r253927.

> If so, we should likely just revert the patch entirely from -HEAD and -9
> until it's resolved.

It was not too difficult to prepare a releng/9.2 build with r254754
reverted (reverting the revert) and kib@'s suggested backport fix
applied.

So far that is running on 9 nodes with no reported problems, but only
since last night.  We were hesitant to do the significant work
involved to push it out to dozens of nodes if nobody was going to
consider it for 9.2 anyway.

Thanks!


Re: NFS deadlock on 9.2-Beta1

2013-08-25 Thread Rick Macklem
Michael Tratz wrote:
> 
> On Aug 15, 2013, at 2:39 PM, Rick Macklem 
> wrote:
> 
> > Michael Tratz wrote:
> >> 
> >> On Jul 27, 2013, at 11:25 PM, Konstantin Belousov
> >>  wrote:
> >> 
> >>> On Sat, Jul 27, 2013 at 03:13:05PM -0700, Michael Tratz wrote:
>  Let's assume the pid which started the deadlock is 14001 (it
>  will
>  be a different pid when we get the results, because the machine
>  has been restarted)
>  
>  I type:
>  
>  show proc 14001
>  
>  I get the thread numbers from that output and type:
>  
>  show thread x
>  
>  for each one.
>  
>  And a trace for each thread with the command?
>  
>  tr 
>  
>  Anything else I should try to get or do? Or is that not the data
>  at all you are looking for?
>  
> >>> Yes, everything else which is listed in the 'debugging deadlocks'
> >>> page
> >>> must be provided, otherwise the deadlock cannot be tracked.
> >>> 
> >>> The investigator should be able to see the whole deadlock chain
> >>> (loop)
> >>> to make any useful advance.
> >> 
> >> Ok, I have made some excellent progress in debugging the NFS
> >> deadlock.
> >> 
> >> Rick! You are a genius. :-) You found the right commit: r250907
> >> (dated May 22) is definitely the problem.
> >> 
> >> Here is how I did the testing: One machine received a kernel
> >> before
> >> r250907, the second machine received a kernel after r250907. Sure
> >> enough within a few hours the machine with r250907 went into the
> >> usual deadlock state. The machine without that commit kept on
> >> working fine. Then I went back to the latest revision (r253726),
> >> but
> >> leaving r250907 out. The machines have been running happy and rock
> >> solid without any deadlocks. I have expanded the testing to 3
> >> machines now and no reports of any issues.
> >> 
> >> I guess now Konstantin has to figure out why that commit is
> >> causing
> >> the deadlock. Lovely! :-) I will get that information as soon as
> >> possible. I'm a little behind with normal work load, but I expect
> >> to
> >> have the data by Tuesday evening or Wednesday.
> >> 
> > Have you been able to pass the debugging info on to Kostik?
> > 
> > It would be really nice to get this fixed for FreeBSD9.2.
> > 
> > Thanks for your help with this, rick
> 
> Sorry Rick, I wasn't able to get you guys that info quickly enough. I
> thought I would have enough time, before my own wedding and
> honeymoon came along, but everything went a little crazy and
> stressful. I didn't think it would be this nuts. :-)
> 
> I'm caught up with everything, and from what I can see in the
> discussions, we now know what the problem is.
> 
> I can report that the machines which I have had without r250907 have
> been running without any problems for 27+ days.
> 
> If you need me to test any new patches, please let me know. If I
> should test with the partial merge of r253927 I'll be happy to do
> so.
> 
It's up to you, but you might want to wait until the other tester (J. David?)
reports back on success/failure.

Thanks for your help with this, rick

> Thanks,
> 
> Michael
> 
> 
> 
> 
> 


Re: NFS deadlock on 9.2-Beta1

2013-08-25 Thread Adrian Chadd
Hi,

Does -HEAD have this same problem?

If so, we should likely just revert the patch entirely from -HEAD and -9
until it's resolved.



-adrian



On 24 August 2013 23:51, Michael Tratz  wrote:

>
> On Aug 15, 2013, at 2:39 PM, Rick Macklem  wrote:
>
> > Michael Tratz wrote:
> >>
> >> On Jul 27, 2013, at 11:25 PM, Konstantin Belousov
> >>  wrote:
> >>
> >>> On Sat, Jul 27, 2013 at 03:13:05PM -0700, Michael Tratz wrote:
>  Let's assume the pid which started the deadlock is 14001 (it will
>  be a different pid when we get the results, because the machine
>  has been restarted)
> 
>  I type:
> 
>  show proc 14001
> 
>  I get the thread numbers from that output and type:
> 
>  show thread x
> 
>  for each one.
> 
>  And a trace for each thread with the command?
> 
>  tr 
> 
>  Anything else I should try to get or do? Or is that not the data
>  at all you are looking for?
> 
> >>> Yes, everything else which is listed in the 'debugging deadlocks'
> >>> page
> >>> must be provided, otherwise the deadlock cannot be tracked.
> >>>
> >>> The investigator should be able to see the whole deadlock chain
> >>> (loop)
> >>> to make any useful advance.
> >>
> >> Ok, I have made some excellent progress in debugging the NFS
> >> deadlock.
> >>
> >> Rick! You are a genius. :-) You found the right commit: r250907 (dated
> >> May 22) is definitely the problem.
> >>
> >> Here is how I did the testing: One machine received a kernel before
> >> r250907, the second machine received a kernel after r250907. Sure
> >> enough within a few hours the machine with r250907 went into the
> >> usual deadlock state. The machine without that commit kept on
> >> working fine. Then I went back to the latest revision (r253726), but
> >> leaving r250907 out. The machines have been running happy and rock
> >> solid without any deadlocks. I have expanded the testing to 3
> >> machines now and no reports of any issues.
> >>
> >> I guess now Konstantin has to figure out why that commit is causing
> >> the deadlock. Lovely! :-) I will get that information as soon as
> >> possible. I'm a little behind with normal work load, but I expect to
> >> have the data by Tuesday evening or Wednesday.
> >>
> > Have you been able to pass the debugging info on to Kostik?
> >
> > It would be really nice to get this fixed for FreeBSD9.2.
> >
> > Thanks for your help with this, rick
>
> Sorry Rick, I wasn't able to get you guys that info quickly enough. I
> thought I would have enough time, before my own wedding and honeymoon came
> along, but everything went a little crazy and stressful. I didn't think it
> would be this nuts. :-)
>
> I'm caught up with everything, and from what I can see in the discussions,
> we now know what the problem is.
>
> I can report that the machines which I have had without r250907 have been
> running without any problems for 27+ days.
>
> If you need me to test any new patches, please let me know. If I should
> test with the partial merge of r253927 I'll be happy to do so.
>
> Thanks,
>
> Michael
>
>
>
>


Re: NFS deadlock on 9.2-Beta1

2013-08-25 Thread Michael Tratz

On Aug 15, 2013, at 2:39 PM, Rick Macklem  wrote:

> Michael Tratz wrote:
>> 
>> On Jul 27, 2013, at 11:25 PM, Konstantin Belousov
>>  wrote:
>> 
>>> On Sat, Jul 27, 2013 at 03:13:05PM -0700, Michael Tratz wrote:
 Let's assume the pid which started the deadlock is 14001 (it will
 be a different pid when we get the results, because the machine
 has been restarted)
 
 I type:
 
 show proc 14001
 
 I get the thread numbers from that output and type:
 
 show thread x
 
 for each one.
 
 And a trace for each thread with the command?
 
 tr 
 
 Anything else I should try to get or do? Or is that not the data
 at all you are looking for?
 
>>> Yes, everything else which is listed in the 'debugging deadlocks'
>>> page
>>> must be provided, otherwise the deadlock cannot be tracked.
>>> 
>>> The investigator should be able to see the whole deadlock chain
>>> (loop)
>>> to make any useful advance.
>> 
>> Ok, I have made some excellent progress in debugging the NFS
>> deadlock.
>> 
>> Rick! You are a genius. :-) You found the right commit: r250907 (dated
>> May 22) is definitely the problem.
>> 
>> Here is how I did the testing: One machine received a kernel before
>> r250907, the second machine received a kernel after r250907. Sure
>> enough within a few hours the machine with r250907 went into the
>> usual deadlock state. The machine without that commit kept on
>> working fine. Then I went back to the latest revision (r253726), but
>> leaving r250907 out. The machines have been running happy and rock
>> solid without any deadlocks. I have expanded the testing to 3
>> machines now and no reports of any issues.
>> 
>> I guess now Konstantin has to figure out why that commit is causing
>> the deadlock. Lovely! :-) I will get that information as soon as
>> possible. I'm a little behind with normal work load, but I expect to
>> have the data by Tuesday evening or Wednesday.
>> 
> Have you been able to pass the debugging info on to Kostik?
> 
> It would be really nice to get this fixed for FreeBSD9.2.
> 
> Thanks for your help with this, rick

Sorry Rick, I wasn't able to get you guys that info quickly enough. I thought I 
would have enough time, before my own wedding and honeymoon came along, but 
everything went a little crazy and stressful. I didn't think it would be this 
nuts. :-)

I'm caught up with everything, and from what I can see in the discussions,
we now know what the problem is.

I can report that the machines which I have had without r250907 have been 
running without any problems for 27+ days.

If you need me to test any new patches, please let me know. If I should test 
with the partial merge of r253927 I'll be happy to do so.

Thanks,

Michael






Re: NFS deadlock on 9.2-Beta1

2013-08-24 Thread Rick Macklem
Kostik wrote:
> On Sat, Aug 24, 2013 at 01:08:05PM -0400, J David wrote:
> > The requested information about the deadlock was finally obtained
> > and
> > provided off-list to the requested parties due to size.
> 
> Thank you, the problem is clear now.
> 
> The problematic process backtrace is
> 
> Tracing command httpd pid 86383 tid 100138 td 0xfe000b7b2900
> sched_switch() at sched_switch+0x234/frame 0xff834c442360
> mi_switch() at mi_switch+0x15c/frame 0xff834c4423a0
> sleepq_switch() at sleepq_switch+0x17d/frame 0xff834c4423e0
> sleepq_wait() at sleepq_wait+0x43/frame 0xff834c442410
> sleeplk() at sleeplk+0x11a/frame 0xff834c442460
> __lockmgr_args() at __lockmgr_args+0x9a9/frame 0xff834c442580
> nfs_lock1() at nfs_lock1+0x87/frame 0xff834c4425b0
> VOP_LOCK1_APV() at VOP_LOCK1_APV+0xbe/frame 0xff834c4425e0
> _vn_lock() at _vn_lock+0x63/frame 0xff834c442640
> ncl_upgrade_vnlock() at ncl_upgrade_vnlock+0x5e/frame
> 0xff834c442670
> ncl_bioread() at ncl_bioread+0x195/frame 0xff834c4427e0
> VOP_READ_APV() at VOP_READ_APV+0xd1/frame 0xff834c442810
> vn_rdwr() at vn_rdwr+0x2bc/frame 0xff834c4428d0
> kern_sendfile() at kern_sendfile+0xa90/frame 0xff834c442ac0
> do_sendfile() at do_sendfile+0x92/frame 0xff834c442b20
> amd64_syscall() at amd64_syscall+0x259/frame 0xff834c442c30
> Xfast_syscall() at Xfast_syscall+0xfb/frame 0xff834c442c30
> --- syscall (393, FreeBSD ELF64, sys_sendfile), rip = 0x801b24f4c,
> rsp = 0x7fffce98, rbp = 0x7fffd1d0 ---
> 
> It tries to do the upgrade of the nfs vnode lock, and for this, the lock
> is dropped and re-acquired. Since this happens with a page of the vnode's
> vm object busied, we get a reversal between the vnode lock and the page
> busy state. So effectively, my suspicion that the NFS read path drops the
> vnode lock was true, and in fact I knew about the upgrade.
> 
Ouch. I had forgotten that LK_UPGRADE could result in the shared lock
being dropped.

I'll admit I've never liked the lock upgrade in nfs_read(), but I'm
not sure how to avoid it. I just looked at the commit log message
for r138469, which is where this appeared in the old NFS client.
(The new NFS client just cloned this code.)

It basically notes that with a shared lock, new pages can be faulted
in for the vnode while vinvalbuf() is in progress, causing it to fail
(I suspect "fail" means never completed?).

At the very least, I don't think the lock upgrade is needed unless
a call to vinvalbuf() is going to be done. (I'm wondering if a dedicated
lock used to serialize this case might be better than using a vnode LK_UPGRADE?)
I think I'll take a closer look at the vinvalbuf() code in head.
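
To make that idea a bit more concrete, here is a very rough, untested sketch.
The "n_iolock" field is hypothetical (it does not exist in the nfsnode today);
the point is just an sx(9) lock that the invalidation path takes exclusively
and the paths that instantiate new buffers/pages for the vnode take shared,
instead of doing the LK_UPGRADE:

	struct sx	n_iolock;	/* hypothetical; would live in struct nfsnode */

	/* Where the invalidation happens (today done under the upgraded vnode lock): */
	sx_xlock(&np->n_iolock);
	error = ncl_vinvalbuf(vp, V_SAVE, td, 1);
	sx_xunlock(&np->n_iolock);

	/* Where new buffers/pages get instantiated for the vnode (read/getpages): */
	sx_slock(&np->n_iolock);
	/* ... fault in / read the buffer with the vnode lock still held shared ... */
	sx_sunlock(&np->n_iolock);

Whether that actually closes the window vinvalbuf() cares about is exactly
what I want to check when I look at the code, so treat it as a sketch of the
idea, not a patch.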

Do others have any comments on this? (I added jhb@ to the cc list, since he
may be familiar with this?)

But none of this can happen quickly, so it wouldn't be feasible for stable/9
or even 10.0 at this point in time.

rick

> I think the easiest route is a partial merge of r253927 from HEAD.
> 
> Index: fs
> ===
> --- fs(revision 254800)
> +++ fs(working copy)
> 
> Property changes on: fs
> ___
> Modified: svn:mergeinfo
>Merged /head/sys/fs:r253927
> Index: kern/uipc_syscalls.c
> ===
> --- kern/uipc_syscalls.c  (revision 254800)
> +++ kern/uipc_syscalls.c  (working copy)
> @@ -2124,11 +2124,6 @@
>   else {
>   ssize_t resid;
>  
> - /*
> -  * Ensure that our page is still around
> -  * when the I/O completes.
> -  */
> - vm_page_io_start(pg);
>   VM_OBJECT_UNLOCK(obj);
>  
>   /*
> @@ -2144,10 +2139,8 @@
>   IO_VMIO | ((MAXBSIZE / bsize) << 
> IO_SEQSHIFT),
>   td->td_ucred, NOCRED, &resid, td);
>   VFS_UNLOCK_GIANT(vfslocked);
> - VM_OBJECT_LOCK(obj);
> - vm_page_io_finish(pg);
> - if (!error)
> - VM_OBJECT_UNLOCK(obj);
> + if (error)
> + VM_OBJECT_LOCK(obj);
>   mbstat.sf_iocnt++;
>   }
>   if (error) {
> Index: .
> ===
> --- . (revision 254800)
> +++ . (working copy)
> 
> Property changes on: .
> ___
> Modified: svn:mergeinfo
>Merged /head/sys:r253927
> 

Re: NFS deadlock on 9.2-Beta1

2013-08-24 Thread J David
On Sat, Aug 24, 2013 at 4:55 PM, Konstantin Belousov
 wrote:
> On Sat, Aug 24, 2013 at 04:11:09PM -0400, J David wrote:
>> On Sat, Aug 24, 2013 at 3:38 PM, Konstantin Belousov
>>  wrote:
>> > No, at least not without reverting the r254754 first.  The IGN_SBUSY patch
>> > is not critical there.
>>
>> There is lots of other stuff in r250907 / reverted by r254754.  Some
>> of it looks important for sendfile() performance.  If testing this
>> extensively in the next few days could help get that work back into
>> 9.2 we are happy to do it, but if it's too late then we can leave it
>> for those on stable/9.
>
> The revert in r254754 is only a workaround for your workload; it does
> not fix the real issue, which can be reproduced by other means.
>
> I am not sure whether re@ would allow merging the proper fix, since we
> are already somewhere in RC3.

Well, let's ask them. :)

Thanks!


Re: NFS deadlock on 9.2-Beta1

2013-08-24 Thread Konstantin Belousov
On Sat, Aug 24, 2013 at 04:11:09PM -0400, J David wrote:
> On Sat, Aug 24, 2013 at 3:38 PM, Konstantin Belousov
>  wrote:
> > No, at least not without reverting the r254754 first.  The IGN_SBUSY patch
> > is not critical there.
> 
> There is lots of other stuff in r250907 / reverted by r254754.  Some
> of it looks important for sendfile() performance.  If testing this
> extensively in the next few days could help get that work back into
> 9.2 we are happy to do it, but if it's too late then we can leave it
> for those on stable/9.

The revert in r254754 is only a workaround for your workload; it does
not fix the real issue, which can be reproduced by other means.

I am not sure whether re@ would allow merging the proper fix, since we are
already somewhere in RC3.




Re: NFS deadlock on 9.2-Beta1

2013-08-24 Thread J David
On Sat, Aug 24, 2013 at 3:38 PM, Konstantin Belousov
 wrote:
> No, at least not without reverting the r254754 first.  The IGN_SBUSY patch
> is not critical there.

There is lots of other stuff in r250907 / reverted by r254754.  Some
of it looks important for sendfile() performance.  If testing this
extensively in the next few days could help get that work back into
9.2 we are happy to do it, but if it's too late then we can leave it
for those on stable/9.

Thanks!


Re: NFS deadlock on 9.2-Beta1

2013-08-24 Thread Konstantin Belousov
On Sat, Aug 24, 2013 at 02:03:50PM -0400, J David wrote:
> On Sat, Aug 24, 2013 at 1:41 PM, Konstantin Belousov
>  wrote:
> > I think the easiest route is a partial merge of r253927 from HEAD.
> 
> Is it helpful if we restart testing releng/9.2 using your suggested
> fix?  And if so, should the IGN_SBUSY patch you posted earlier be applied
> as well, or not?
No, at least not without reverting the r254754 first.  The IGN_SBUSY patch
is not critical there.

> 
> If it ran successfully on a bunch of machines for the next few days, maybe
> that would still be in time to provide useful feedback for 9.2.
> 
> Thanks!




Re: NFS deadlock on 9.2-Beta1

2013-08-24 Thread J David
On Sat, Aug 24, 2013 at 1:41 PM, Konstantin Belousov
 wrote:
> I think the easiest route is a partial merge of r253927 from HEAD.

Is it helpful if we restart testing releng/9.2 using your suggested
fix?  And if so, should the IGN_SBUSY patch you posted earlier be applied
as well, or not?

If it ran successfully on a bunch of machines for the next few days, maybe
that would still be in time to provide useful feedback for 9.2.

Thanks!


Re: NFS deadlock on 9.2-Beta1

2013-08-24 Thread Konstantin Belousov
On Sat, Aug 24, 2013 at 01:08:05PM -0400, J David wrote:
> The requested information about the deadlock was finally obtained and
> provided off-list to the requested parties due to size.

Thank you, the problem is clear now.

The problematic process backtrace is

Tracing command httpd pid 86383 tid 100138 td 0xfe000b7b2900
sched_switch() at sched_switch+0x234/frame 0xff834c442360
mi_switch() at mi_switch+0x15c/frame 0xff834c4423a0
sleepq_switch() at sleepq_switch+0x17d/frame 0xff834c4423e0
sleepq_wait() at sleepq_wait+0x43/frame 0xff834c442410
sleeplk() at sleeplk+0x11a/frame 0xff834c442460
__lockmgr_args() at __lockmgr_args+0x9a9/frame 0xff834c442580
nfs_lock1() at nfs_lock1+0x87/frame 0xff834c4425b0
VOP_LOCK1_APV() at VOP_LOCK1_APV+0xbe/frame 0xff834c4425e0
_vn_lock() at _vn_lock+0x63/frame 0xff834c442640
ncl_upgrade_vnlock() at ncl_upgrade_vnlock+0x5e/frame 0xff834c442670
ncl_bioread() at ncl_bioread+0x195/frame 0xff834c4427e0
VOP_READ_APV() at VOP_READ_APV+0xd1/frame 0xff834c442810
vn_rdwr() at vn_rdwr+0x2bc/frame 0xff834c4428d0
kern_sendfile() at kern_sendfile+0xa90/frame 0xff834c442ac0
do_sendfile() at do_sendfile+0x92/frame 0xff834c442b20
amd64_syscall() at amd64_syscall+0x259/frame 0xff834c442c30
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xff834c442c30
--- syscall (393, FreeBSD ELF64, sys_sendfile), rip = 0x801b24f4c, rsp = 
0x7fffce98, rbp = 0x7fffd1d0 ---

It tries to do the upgrade of the nfs vnode lock, and for this, the lock
is dropped and re-acquired. Since this happens with a page of the vnode's
vm object busied, we get a reversal between the vnode lock and the page
busy state. So effectively, my suspicion that the NFS read path drops the
vnode lock was true, and in fact I knew about the upgrade.
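
To make the reversal concrete, here is a purely illustrative userland sketch
(my analogy only, not kernel code): one pthread mutex stands in for the vnode
lock, the other for the page busy state, and the two threads mimic tid 100674
and tid 100597 from the traces.  It simplifies things (the real thread held
the shared vnode lock before the upgrade dropped it), but it hangs the same
way:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t vnode_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t page_busy = PTHREAD_MUTEX_INITIALIZER;

/* Like tid 100674: has the page busied, then wants the (exclusive) vnode lock. */
static void *
sendfile_reader(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&page_busy);		/* vm_page_io_start() analogue */
	sleep(1);				/* let the other thread take the vnode lock */
	printf("reader: page busied, waiting for vnode lock\n");
	pthread_mutex_lock(&vnode_lock);	/* ncl_upgrade_vnlock() analogue: blocks forever */
	pthread_mutex_unlock(&vnode_lock);
	pthread_mutex_unlock(&page_busy);
	return (NULL);
}

/* Like tid 100597: holds the vnode lock, then wants the busied page. */
static void *
sendfile_grabber(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&vnode_lock);	/* vn_lock(LK_SHARED) analogue */
	sleep(1);
	printf("grabber: vnode locked, waiting for page\n");
	pthread_mutex_lock(&page_busy);		/* vm_page_grab() "pgrbwt" analogue: blocks forever */
	pthread_mutex_unlock(&page_busy);
	pthread_mutex_unlock(&vnode_lock);
	return (NULL);
}

int
main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, sendfile_reader, NULL);
	pthread_create(&t2, NULL, sendfile_grabber, NULL);
	pthread_join(t1, NULL);			/* never returns: AB-BA deadlock */
	pthread_join(t2, NULL);
	return (0);
}

Build it with "cc -pthread" and each thread blocks forever right after its
printf, which is the same pair of sleeps visible in the backtraces.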

I think the easiest route is a partial merge of r253927 from HEAD.

Index: fs
===
--- fs  (revision 254800)
+++ fs  (working copy)

Property changes on: fs
___
Modified: svn:mergeinfo
   Merged /head/sys/fs:r253927
Index: kern/uipc_syscalls.c
===
--- kern/uipc_syscalls.c(revision 254800)
+++ kern/uipc_syscalls.c(working copy)
@@ -2124,11 +2124,6 @@
else {
ssize_t resid;
 
-   /*
-* Ensure that our page is still around
-* when the I/O completes.
-*/
-   vm_page_io_start(pg);
VM_OBJECT_UNLOCK(obj);
 
/*
@@ -2144,10 +2139,8 @@
IO_VMIO | ((MAXBSIZE / bsize) << 
IO_SEQSHIFT),
td->td_ucred, NOCRED, &resid, td);
VFS_UNLOCK_GIANT(vfslocked);
-   VM_OBJECT_LOCK(obj);
-   vm_page_io_finish(pg);
-   if (!error)
-   VM_OBJECT_UNLOCK(obj);
+   if (error)
+   VM_OBJECT_LOCK(obj);
mbstat.sf_iocnt++;
}
if (error) {
Index: .
===
--- .   (revision 254800)
+++ .   (working copy)

Property changes on: .
___
Modified: svn:mergeinfo
   Merged /head/sys:r253927




Re: NFS deadlock on 9.2-Beta1

2013-08-24 Thread J David
The requested information about the deadlock was finally obtained and
provided off-list to the requested parties due to size.


Re: NFS deadlock on 9.2-Beta1

2013-08-23 Thread Rick Macklem
J. David wrote:
> One deadlocked process cropped up overnight, but I managed to panic
> the box before getting too much debugging info. :(
> 
> The process was in state T instead of D, which I guess must be a side
> effect of some of the debugging code compiled in.
> 
> Here are the details I was able to capture:
> 
> db>  show proc 7692
> Process 7692 (httpd) at 0xfe0158793000:
>  state: NORMAL
>  uid: 25000  gids: 25000
>  parent: pid 1 at 0xfe00039c3950
>  ABI: FreeBSD ELF64
>  arguments: /nfsn/apps/tapache22/bin/httpd
>  threads: 3
> 100674   D   newnfs   0xfe021cdd9848 httpd
> 100597   D   pgrbwt   0xfe02fda788b8 httpd
> 100910   s   httpd
> 
> db> show thread 100674
> Thread 100674 at 0xfe0108c79480:
>  proc (pid 7692): 0xfe0158793000
>  name: httpd
>  stack: 0xff834c80f000-0xff834c812fff
>  flags: 0x2a804  pflags: 0
>  state: INHIBITED: {SLEEPING}
>  wmesg: newnfs  wchan: 0xfe021cdd9848
>  priority: 96
>  container lock: sleepq chain (0x813c03c8)
> 
> db> tr 100674
> Tracing pid 7692 tid 100674 td 0xfe0108c79480
> sched_switch() at sched_switch+0x234/frame 0xff834c812360
> mi_switch() at mi_switch+0x15c/frame 0xff834c8123a0
> sleepq_switch() at sleepq_switch+0x17d/frame 0xff834c8123e0
> sleepq_wait() at sleepq_wait+0x43/frame 0xff834c812410
> sleeplk() at sleeplk+0x11a/frame 0xff834c812460
> __lockmgr_args() at __lockmgr_args+0x9a9/frame 0xff834c812580
> nfs_lock1() at nfs_lock1+0x87/frame 0xff834c8125b0
> VOP_LOCK1_APV() at VOP_LOCK1_APV+0xbe/frame 0xff834c8125e0
> _vn_lock() at _vn_lock+0x63/frame 0xff834c812640
> ncl_upgrade_vnlock() at ncl_upgrade_vnlock+0x5e/frame
> 0xff834c812670
> ncl_bioread() at ncl_bioread+0x195/frame 0xff834c8127e0
> VOP_READ_APV() at VOP_READ_APV+0xd1/frame 0xff834c812810
> vn_rdwr() at vn_rdwr+0x2bc/frame 0xff834c8128d0
> kern_sendfile() at kern_sendfile+0xa90/frame 0xff834c812ac0
> do_sendfile() at do_sendfile+0x92/frame 0xff834c812b20
> amd64_syscall() at amd64_syscall+0x259/frame 0xff834c812c30
> Xfast_syscall() at Xfast_syscall+0xfb/frame 0xff834c812c30
> --- syscall (393, FreeBSD ELF64, sys_sendfile), rip = 0x801b26f4c,
> rsp
> = 0x7e9f43c8, rbp = 0x7e9f4700 ---
> 
> db> show lockchain 100674
> thread 100674 (pid 7692, httpd) inhibited
> 
> db> show thread 100597
> Thread 100597 at 0xfe021c976000:
>  proc (pid 7692): 0xfe0158793000
>  name: httpd
>  stack: 0xff834c80a000-0xff834c80dfff
>  flags: 0x28804  pflags: 0
>  state: INHIBITED: {SLEEPING}
>  wmesg: pgrbwt  wchan: 0xfe02fda788b8
>  priority: 84
>  container lock: sleepq chain (0x813c0148)
> 
> db> tr 100597
> Tracing pid 7692 tid 100597 td 0xfe021c976000
> sched_switch() at sched_switch+0x234/frame 0xff834c80d750
> mi_switch() at mi_switch+0x15c/frame 0xff834c80d790
> sleepq_switch() at sleepq_switch+0x17d/frame 0xff834c80d7d0
> sleepq_wait() at sleepq_wait+0x43/frame 0xff834c80d800
> _sleep() at _sleep+0x30f/frame 0xff834c80d890
> vm_page_grab() at vm_page_grab+0x120/frame 0xff834c80d8d0
> kern_sendfile() at kern_sendfile+0x992/frame 0xff834c80dac0
> do_sendfile() at do_sendfile+0x92/frame 0xff834c80db20
> amd64_syscall() at amd64_syscall+0x259/frame 0xff834c80dc30
> Xfast_syscall() at Xfast_syscall+0xfb/frame 0xff834c80dc30
> --- syscall (393, FreeBSD ELF64, sys_sendfile), rip = 0x801b26f4c,
> rsp
> = 0x7ebf53c8, rbp = 0x7ebf5700 ---
> 
> db> show lockchain 100597
> thread 100597 (pid 7692, httpd) inhibited
> 
> The "inhibited" is not something I'm familiar with and didn't match
> the example output; I thought that maybe the T state was overpowering
> the locks, and that maybe I should continue the system and then -CONT
> the process.  However, a few seconds after I issued "c" at the DDB
> prompt, the system panicked in the console driver ("mtx_lock_spin:
> recursed on non-recursive mutex cnputs_mtx @
> /usr/src/sys/kern/kern_cons.c:500"), so I guess that's not a thing to
> do. :(
> 
> Sorry my stupidity and ignorance are dragging this out. :(  This is all
> well outside my comfort zone, but next time I'll get it for sure.
> 
No problem. Thanks for trying to capture this stuff.

Unfortunately, what you have above doesn't tell me anything more about
the problem.
The main question to me is "Why is the thread stuck in "pgrbwt" permanently?".

To figure this out, we need the info on all threads on the system. In 
particular,
the status (the output of "ps axHl" would be a start, before going into the
debugger) of the "nfsiod" threads might point to the cause, although it may
involve other threads as well.

If you are running a serial console, just start "script" and then type the
command "ps axHl", followed by going into the debugger and doing the commands
here (basically everything with "all"):
http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html

Re: NFS deadlock on 9.2-Beta1

2013-08-22 Thread J David
One deadlocked process cropped up overnight, but I managed to panic
the box before getting too much debugging info. :(

The process was in state T instead of D, which I guess must be a side
effect of some of the debugging code compiled in.

Here are the details I was able to capture:

db>  show proc 7692
Process 7692 (httpd) at 0xfe0158793000:
 state: NORMAL
 uid: 25000  gids: 25000
 parent: pid 1 at 0xfe00039c3950
 ABI: FreeBSD ELF64
 arguments: /nfsn/apps/tapache22/bin/httpd
 threads: 3
100674   D   newnfs   0xfe021cdd9848 httpd
100597   D   pgrbwt   0xfe02fda788b8 httpd
100910   s   httpd

db> show thread 100674
Thread 100674 at 0xfe0108c79480:
 proc (pid 7692): 0xfe0158793000
 name: httpd
 stack: 0xff834c80f000-0xff834c812fff
 flags: 0x2a804  pflags: 0
 state: INHIBITED: {SLEEPING}
 wmesg: newnfs  wchan: 0xfe021cdd9848
 priority: 96
 container lock: sleepq chain (0x813c03c8)

db> tr 100674
Tracing pid 7692 tid 100674 td 0xfe0108c79480
sched_switch() at sched_switch+0x234/frame 0xff834c812360
mi_switch() at mi_switch+0x15c/frame 0xff834c8123a0
sleepq_switch() at sleepq_switch+0x17d/frame 0xff834c8123e0
sleepq_wait() at sleepq_wait+0x43/frame 0xff834c812410
sleeplk() at sleeplk+0x11a/frame 0xff834c812460
__lockmgr_args() at __lockmgr_args+0x9a9/frame 0xff834c812580
nfs_lock1() at nfs_lock1+0x87/frame 0xff834c8125b0
VOP_LOCK1_APV() at VOP_LOCK1_APV+0xbe/frame 0xff834c8125e0
_vn_lock() at _vn_lock+0x63/frame 0xff834c812640
ncl_upgrade_vnlock() at ncl_upgrade_vnlock+0x5e/frame 0xff834c812670
ncl_bioread() at ncl_bioread+0x195/frame 0xff834c8127e0
VOP_READ_APV() at VOP_READ_APV+0xd1/frame 0xff834c812810
vn_rdwr() at vn_rdwr+0x2bc/frame 0xff834c8128d0
kern_sendfile() at kern_sendfile+0xa90/frame 0xff834c812ac0
do_sendfile() at do_sendfile+0x92/frame 0xff834c812b20
amd64_syscall() at amd64_syscall+0x259/frame 0xff834c812c30
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xff834c812c30
--- syscall (393, FreeBSD ELF64, sys_sendfile), rip = 0x801b26f4c, rsp
= 0x7e9f43c8, rbp = 0x7e9f4700 ---

db> show lockchain 100674
thread 100674 (pid 7692, httpd) inhibited

db> show thread 100597
Thread 100597 at 0xfe021c976000:
 proc (pid 7692): 0xfe0158793000
 name: httpd
 stack: 0xff834c80a000-0xff834c80dfff
 flags: 0x28804  pflags: 0
 state: INHIBITED: {SLEEPING}
 wmesg: pgrbwt  wchan: 0xfe02fda788b8
 priority: 84
 container lock: sleepq chain (0x813c0148)

db> tr 100597
Tracing pid 7692 tid 100597 td 0xfe021c976000
sched_switch() at sched_switch+0x234/frame 0xff834c80d750
mi_switch() at mi_switch+0x15c/frame 0xff834c80d790
sleepq_switch() at sleepq_switch+0x17d/frame 0xff834c80d7d0
sleepq_wait() at sleepq_wait+0x43/frame 0xff834c80d800
_sleep() at _sleep+0x30f/frame 0xff834c80d890
vm_page_grab() at vm_page_grab+0x120/frame 0xff834c80d8d0
kern_sendfile() at kern_sendfile+0x992/frame 0xff834c80dac0
do_sendfile() at do_sendfile+0x92/frame 0xff834c80db20
amd64_syscall() at amd64_syscall+0x259/frame 0xff834c80dc30
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xff834c80dc30
--- syscall (393, FreeBSD ELF64, sys_sendfile), rip = 0x801b26f4c, rsp
= 0x7ebf53c8, rbp = 0x7ebf5700 ---

db> show lockchain 100597
thread 100597 (pid 7692, httpd) inhibited

The "inhibited" is not something I'm familiar with and didn't match
the example output; I thought that maybe the T state was overpowering
the locks, and that maybe I should continue the system and then -CONT
the process.  However, a few seconds after I issued "c" at the DDB
prompt, the system panicked in the console driver ("mtx_lock_spin:
recursed on non-recursive mutex cnputs_mtx @
/usr/src/sys/kern/kern_cons.c:500"), so I guess that's not a thing to
do. :(

Sorry my stupidity and ignorance are dragging this out. :(  This is all
well outside my comfort zone, but next time I'll get it for sure.

Thanks!


Re: NFS deadlock on 9.2-Beta1

2013-08-22 Thread Konstantin Belousov
On Wed, Aug 21, 2013 at 09:08:10PM -0400, Rick Macklem wrote:
> Kostik wrote:
> > On Tue, Aug 20, 2013 at 06:18:16PM -0400, Rick Macklem wrote:
> > > J David wrote:
> > > > On Thu, Aug 15, 2013 at 5:39 PM, Rick Macklem
> > > > 
> > > > wrote:
> > > > > Have you been able to pass the debugging info on to Kostik?
> > > > >
> > > > > It would be really nice to get this fixed for FreeBSD9.2.
> > > > 
> > > > You're probably not talking to me, but headway here is slow.  At
> > > > our
> > > > location, we have been continuing to test releng/9.2 extensively,
> > > > but
> > > > with r250907 reverted.  Since reverting it solves the issue, and
> > > > since
> > > > there haven't been any further changes to releng/9.2 that might
> > > > also
> > > > resolve this issue, re-applying r250907 is perceived here as
> > > > un-fixing
> > > > a problem.  Enthusiasm for doing so is correspondingly low, even
> > > > if
> > > > the purpose is to gather debugging info. :(
> > > > 
> > > > However, after finally having clearance to test releng/9.2
> > > > r254540
> > > > with r250907 included and with DDB on five nodes.  The problem
> > > > cropped
> > > > up in about an hour.  Two threads in one process deadlocked, was
> > > > perfect.  Got it into DDB and saw the stack trace was scrolling
> > > > off
> > > > so
> > > > there was no way to copy it by hand.  Also, the machine's disk is
> > > > smaller than physical RAM, so no dump file. :(
> > > > 
> > > > Here's what is available so far:
> > > > 
> > > > db> show proc 33362
> > > > 
> > > > Process 33362 (httpd) at 0xcd225b50:
> > > > 
> > > >  state: NORMAL
> > > > 
> > > >  uid: 25000 gids: 25000
> > > > 
> > > >  parent: pid 25104 at 0xc95f92d4
> > > > 
> > > >  ABI: FreeBSD ELF32
> > > > 
> > > >  arguments: /usr/local/libexec/httpd
> > > > 
> > > >  threads: 3
> > > > 
> > > > 100405 D newnfs 0xc9b875e4 httpd
> > > > 
> > > Ok, so this one is waiting for an NFS vnode lock.
> > > 
> > > > 100393 D pgrbwt 0xc43a30c0 httpd
> > > > 
> > > This one is sleeping in vm_page_grab() { which I suspect has
> > > been called from kern_sendfile() with a shared vnode lock held,
> > > from what I saw on the previous debug info }.
> > > 
> > > > 100755 S uwait 0xc84b7c80 httpd
> > > > 
> > > > 
> > > > Not much to go on. :(  Maybe these five can be configured with
> > > > serial
> > > > consoles.
> > > > 
> > > > So, inquiries are continuing, but the answer to "does this still
> > > > happen on 9.2-RC2?" is definitely yes.
> > > > 
> > > Since r250027 moves a vn_lock() to before the vm_page_grab() call
> > > in
> > > kern_sendfile(), I suspect that is the cause of the deadlock.
> > > (r250027
> > > is one of the 3 commits MFC'd by r250907)
> > > 
> > > I don't know if it would be safe to VOP_UNLOCK() the vnode after
> > > VOP_GETATTR()
> > > and then put the vn_lock() call that comes after vm_page_grab()
> > > back in or whether
> > > r250027 should be reverted (getting rid of the VOP_GETATTR() and
> > > going back to
> > > using the size in the vm stuff).
> > > 
> > > Hopefully Kostik will know what is best to do with it now, rick
> > 
> > I already described what to do with this.  I need the debugging
> > information to see what is going on.  Without the data, it is only
> > wasted time of everybody involved.
> > 
> Sorry, I didn't make what I was asking clear. I was referring specifically
> to stopping the hang from occurring in the soon to be released 9.2.
> 
> I think you indirectly answered the question, in that you don't know
> of a fix for the hangs without more debugging information. This
> implies that reverting r250907 is the main option to resolve this
> for the 9.2 release (unless more debugging info arrives very soon),
> since that is the only fix that has been confirmed to work.
> Does this sound reasonable?
I do not object to reverting it for 9.2.  Please go ahead.

On the other hand, I do not want to revert it in stable/9, at least
until the cause is understood.

> 
> > Some technical notes.  The sendfile() uses shared lock for the
> > duration
> > of vnode i/o, so any thread which is sleeping on the vnode lock
> > cannot
> > be in the sendfile path, at least for UFS and NFS which do support
> > true
> > shared locks.
> > 
> > The right lock order is vnode lock -> page busy wait. From this PoV,
> > the ordering in the sendfile is correct. Rick, are you aware of any
> > situation where the VOP_READ in nfs client could drop vnode lock
> > and then re-acquire it ? I was not able to find this from the code
> > inspection. But, if such situation exists, it would be problematic in
> > 9.
> > 
> I am not aware of a case where nfs_read() drops/re-acquires the vnode
> lock.
> 
> However, readaheads will still be in progress when nfs_read() returns,
> so those can still be in progress after the vnode lock is dropped.
> 
> vfs_busy_pages() will have been called on the page(s) that readahead
> is in progress on (I think that means the shared busy bit will be set,
> if I understood vfs_busy_pages()).

Re: NFS deadlock on 9.2-Beta1

2013-08-22 Thread J David
Now that a kernel with INVARIANTS/WITNESS is finally available on a
machine with a serial console, I am having terrible trouble provoking
this to happen.  (Machine grinds to a halt if I put the usual test
load on it due to all the debug code in the kernel.)

Did get this interesting LOR, though it did not cause a deadlock:

lock order reversal:
 1st 0xfe000adb9f30 so_snd_sx (so_snd_sx) @
/usr/src/sys/kern/uipc_sockbuf.c:145
 2nd 0xfe000aa5b098 newnfs (newnfs) @ /usr/src/sys/kern/uipc_syscalls.c:2062
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xff834c3995c0
kdb_backtrace() at kdb_backtrace+0x39/frame 0xff834c399670
witness_checkorder() at witness_checkorder+0xc0a/frame 0xff834c3996f0
__lockmgr_args() at __lockmgr_args+0x390/frame 0xff834c399810
nfs_lock1() at nfs_lock1+0x87/frame 0xff834c399840
VOP_LOCK1_APV() at VOP_LOCK1_APV+0xbe/frame 0xff834c399870
_vn_lock() at _vn_lock+0x63/frame 0xff834c3998d0
kern_sendfile() at kern_sendfile+0x812/frame 0xff834c399ac0
do_sendfile() at do_sendfile+0x92/frame 0xff834c399b20
amd64_syscall() at amd64_syscall+0x259/frame 0xff834c399c30
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xff834c399c30
--- syscall (393, FreeBSD ELF64, sys_sendfile), rip = 0x801b24f4c, rsp
= 0x7fffcf58, rbp = 0x7fffd290 ---

Once the real deal pops up, collecting the full requested info should
be no problem, but it could take a while to happen with only one
machine that can't run the full test battery.  So if a "real" fix is
dependent on this, reverting r250907 for 9.2-RELEASE is probably the
way to go. With that configuration, releng/9.2 continues to be pretty
solid for us.

Thanks!

(Since this doesn't contain the requested info, I heavily trimmed the
Cc: list.  It is not my intention to waste the time of everybody
involved.)


Re: NFS deadlock on 9.2-Beta1

2013-08-21 Thread Rick Macklem
Kostik wrote:
> On Tue, Aug 20, 2013 at 06:18:16PM -0400, Rick Macklem wrote:
> > J David wrote:
> > > On Thu, Aug 15, 2013 at 5:39 PM, Rick Macklem
> > > 
> > > wrote:
> > > > Have you been able to pass the debugging info on to Kostik?
> > > >
> > > > It would be really nice to get this fixed for FreeBSD9.2.
> > > 
> > > You're probably not talking to me, but headway here is slow.  At
> > > our
> > > location, we have been continuing to test releng/9.2 extensively,
> > > but
> > > with r250907 reverted.  Since reverting it solves the issue, and
> > > since
> > > there haven't been any further changes to releng/9.2 that might
> > > also
> > > resolve this issue, re-applying r250907 is perceived here as
> > > un-fixing
> > > a problem.  Enthusiasm for doing so is correspondingly low, even
> > > if
> > > the purpose is to gather debugging info. :(
> > > 
> > > However, after finally having clearance to test releng/9.2
> > > r254540
> > > with r250907 included and with DDB on five nodes.  The problem
> > > cropped
> > > up in about an hour.  Two threads in one process deadlocked, was
> > > perfect.  Got it into DDB and saw the stack trace was scrolling
> > > off
> > > so
> > > there was no way to copy it by hand.  Also, the machine's disk is
> > > smaller than physical RAM, so no dump file. :(
> > > 
> > > Here's what is available so far:
> > > 
> > > db> show proc 33362
> > > 
> > > Process 33362 (httpd) at 0xcd225b50:
> > > 
> > >  state: NORMAL
> > > 
> > >  uid: 25000 gids: 25000
> > > 
> > >  parent: pid 25104 at 0xc95f92d4
> > > 
> > >  ABI: FreeBSD ELF32
> > > 
> > >  arguments: /usr/local/libexec/httpd
> > > 
> > >  threads: 3
> > > 
> > > 100405 D newnfs 0xc9b875e4 httpd
> > > 
> > Ok, so this one is waiting for an NFS vnode lock.
> > 
> > > 100393 D pgrbwt 0xc43a30c0 httpd
> > > 
> > This one is sleeping in vm_page_grab() { which I suspect has
> > been called from kern_sendfile() with a shared vnode lock held,
> > from what I saw on the previous debug info }.
> > 
> > > 100755 S uwait 0xc84b7c80 httpd
> > > 
> > > 
> > > Not much to go on. :(  Maybe these five can be configured with
> > > serial
> > > consoles.
> > > 
> > > So, inquiries are continuing, but the answer to "does this still
> > > happen on 9.2-RC2?" is definitely yes.
> > > 
> > Since r250027 moves a vn_lock() to before the vm_page_grab() call
> > in
> > kern_sendfile(), I suspect that is the cause of the deadlock.
> > (r250027
> > is one of the 3 commits MFC'd by r250907)
> > 
> > I don't know if it would be safe to VOP_UNLOCK() the vnode after
> > VOP_GETATTR()
> > and then put the vn_lock() call that comes after vm_page_grab()
> > back in or whether
> > r250027 should be reverted (getting rid of the VOP_GETATTR() and
> > going back to
> > using the size in the vm stuff).
> > 
> > Hopefully Kostik will know what is best to do with it now, rick
> 
> I already described what to do with this.  I need the debugging
> information to see what is going on.  Without the data, it is only
> wasted time of everybody involved.
> 
Sorry, I didn't make what I was asking clear. I was referring specifically
to stopping the hang from occurring in the soon to be released 9.2.

I think you indirectly answered the question, in that you don't know
of a fix for the hangs without more debugging information. This
implies that reverting r250907 is the main option to resolve this
for the 9.2 release (unless more debugging info arrives very soon),
since that is the only fix that has been confirmed to work.
Does this sound reasonable?

> Some technical notes.  The sendfile() uses shared lock for the
> duration
> of vnode i/o, so any thread which is sleeping on the vnode lock
> cannot
> be in the sendfile path, at least for UFS and NFS which do support
> true
> shared locks.
> 
> The right lock order is vnode lock -> page busy wait. From this PoV,
> the ordering in the sendfile is correct. Rick, are you aware of any
> situation where the VOP_READ in nfs client could drop vnode lock
> and then re-acquire it ? I was not able to find this from the code
> inspection. But, if such situation exists, it would be problematic in
> 9.
> 
I am not aware of a case where nfs_read() drops/re-acquires the vnode
lock.

However, readaheads will still be in progress when nfs_read() returns,
so those can still be in progress after the vnode lock is dropped.

vfs_busy_pages() will have been called on the page(s) that readahead
is in progress on (I think that means the shared busy bit will be set,
if I understood vfs_busy_pages()). When the readahead is completed,
bufdone() is called, so I don't understand why the page wouldn't become
unbusied (waking up the thread sleeping on "pgrbwt").
I can't see why not being able to acquire the vnode lock would affect
this, but my hunch is that it somehow does have this effect, since that
is the only way I can see that r250907 would cause the hangs.

> Last note.  The HEAD dropped pre-busying pages in the sendfile()
> syscall.
> As I understand

Re: NFS deadlock on 9.2-Beta1

2013-08-21 Thread Konstantin Belousov
On Wed, Aug 21, 2013 at 08:03:35PM +0200, Yamagi Burmeister wrote:
> Could the problem be related to this deadlock / LOR? -
> http://lists.freebsd.org/pipermail/freebsd-fs/2013-August/018052.html
This is not related.

> 
> My test setup is still in place. Will test with r250907 reverted
> tomorrow morning and report back. Additional information can be
> provided if necessary; I just need to know what exactly is needed.

Just follow the
http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
and collect all the information listed there after the apache/sendfile/nfs
deadlock is reproduced.




Re: NFS deadlock on 9.2-Beta1

2013-08-21 Thread Yamagi Burmeister
On Wed, 21 Aug 2013 16:10:32 +0300
Konstantin Belousov  wrote:

> I already described what to do with this.  I need the debugging
> information to see what is going on.  Without the data, it is only
> wasted time of everybody involved.
> 
> Some technical notes.  The sendfile() uses shared lock for the duration
> of vnode i/o, so any thread which is sleeping on the vnode lock cannot
> be in the sendfile path, at least for UFS and NFS which do support true
> shared locks.
> 
> The right lock order is vnode lock -> page busy wait. From this PoV,
> the ordering in the sendfile is correct. Rick, are you aware of any
> situation where the VOP_READ in nfs client could drop vnode lock
> and then re-acquire it ? I was not able to find this from the code
> inspection. But, if such situation exists, it would be problematic in 9.
> 
> Last note.  The HEAD dropped pre-busying pages in the sendfile() syscall.
> As I understand, this is because Attilio's new busy implementation cannot
> support both busy and sbusy states simultaneously, and vfs_busy_pages()/
> vfs_drain_busy_pages() actually created such a situation. I think that
> because the sbusy is removed from sendfile(), and the vm object
> lock is dropped, there is no sense in requiring vm_page_grab() to wait
> for the busy state to clear.  It is done by the buffer cache or filesystem
> code later. See the patch at the end.
> 
> Still, I do not know what happens in the supposedly reported deadlock.
> 
> diff --git a/sys/kern/uipc_syscalls.c b/sys/kern/uipc_syscalls.c
> index 4797444..b974f53 100644
> --- a/sys/kern/uipc_syscalls.c
> +++ b/sys/kern/uipc_syscalls.c
> @@ -2230,7 +2230,8 @@ retry_space:
>   pindex = OFF_TO_IDX(off);
>   VM_OBJECT_WLOCK(obj);
>   pg = vm_page_grab(obj, pindex, VM_ALLOC_NOBUSY |
> - VM_ALLOC_NORMAL | VM_ALLOC_WIRED | VM_ALLOC_RETRY);
> + VM_ALLOC_IGN_SBUSY | VM_ALLOC_NORMAL |
> + VM_ALLOC_WIRED | VM_ALLOC_RETRY);
>  
>   /*
>* Check if page is valid for what we need,

Could the problem be related to this deadlock / LOR? -
http://lists.freebsd.org/pipermail/freebsd-fs/2013-August/018052.html

My test setup is still in place. Will test with r250907 reverted
tomorrow morning and report back. Additional information can be
provided if necessary; I just need to know what exactly is needed.

Ciao,
Yamagi

-- 
Homepage:  www.yamagi.org
XMPP:  yam...@yamagi.org
GnuPG/GPG: 0xEFBCCBCB




Re: NFS deadlock on 9.2-Beta1

2013-08-21 Thread Konstantin Belousov
On Tue, Aug 20, 2013 at 06:18:16PM -0400, Rick Macklem wrote:
> J David wrote:
> > On Thu, Aug 15, 2013 at 5:39 PM, Rick Macklem 
> > wrote:
> > > Have you been able to pass the debugging info on to Kostik?
> > >
> > > It would be really nice to get this fixed for FreeBSD9.2.
> > 
> > You're probably not talking to me, but headway here is slow.  At our
> > location, we have been continuing to test releng/9.2 extensively, but
> > with r250907 reverted.  Since reverting it solves the issue, and
> > since
> > there haven't been any further changes to releng/9.2 that might also
> > resolve this issue, re-applying r250907 is perceived here as
> > un-fixing
> > a problem.  Enthusiasm for doing so is correspondingly low, even if
> > the purpose is to gather debugging info. :(
> > 
> > However, after finally having clearance to test releng/9.2 r254540
> > with r250907 included and with DDB on five nodes.  The problem
> > cropped
> > up in about an hour.  Two threads in one process deadlocked, was
> > perfect.  Got it into DDB and saw the stack trace was scrolling off
> > so
> > there was no way to copy it by hand.  Also, the machine's disk is
> > smaller than physical RAM, so no dump file. :(
> > 
> > Here's what is available so far:
> > 
> > db> show proc 33362
> > 
> > Process 33362 (httpd) at 0xcd225b50:
> > 
> >  state: NORMAL
> > 
> >  uid: 25000 gids: 25000
> > 
> >  parent: pid 25104 at 0xc95f92d4
> > 
> >  ABI: FreeBSD ELF32
> > 
> >  arguments: /usr/local/libexec/httpd
> > 
> >  threads: 3
> > 
> > 100405 D newnfs 0xc9b875e4 httpd
> > 
> Ok, so this one is waiting for an NFS vnode lock.
> 
> > 100393 D pgrbwt 0xc43a30c0 httpd
> > 
> This one is sleeping in vm_page_grab() { which I suspect has
> been called from kern_sendfile() with a shared vnode lock held,
> from what I saw on the previous debug info }.
> 
> > 100755 S uwait 0xc84b7c80 httpd
> > 
> > 
> > Not much to go on. :(  Maybe these five can be configured with serial
> > consoles.
> > 
> > So, inquiries are continuing, but the answer to "does this still
> > happen on 9.2-RC2?" is definitely yes.
> > 
> Since r250027 moves a vn_lock() to before the vm_page_grab() call in
> kern_sendfile(), I suspect that is the cause of the deadlock. (r250027
> is one of the 3 commits MFC'd by r250907)
> 
> I don't know if it would be safe to VOP_UNLOCK() the vnode after VOP_GETATTR()
> and then put the vn_lock() call that comes after vm_page_grab() back in or 
> whether
> r250027 should be reverted (getting rid of the VOP_GETATTR() and going back to
> using the size in the vm stuff).
> 
> Hopefully Kostik will know what is best to do with it now, rick

I already described what to do with this.  I need the debugging
information to see what is going on.  Without the data, it is only
wasted time of everybody involved.

Some technical notes.  The sendfile() uses shared lock for the duration
of vnode i/o, so any thread which is sleeping on the vnode lock cannot
be in the sendfile path, at least for UFS and NFS which do support true
shared locks.

The right lock order is vnode lock -> page busy wait. From this PoV,
the ordering in the sendfile is correct. Rick, are you aware of any
situation where the VOP_READ in nfs client could drop vnode lock
and then re-acquire it ? I was not able to find this from the code
inspection. But, if such situation exists, it would be problematic in 9.

Last note.  The HEAD dropped pre-busying pages in the sendfile() syscall.
As I understand, this is because Attilio's new busy implementation cannot
support both busy and sbusy states simultaneously, and vfs_busy_pages()/
vfs_drain_busy_pages() actually created such a situation. I think that
because the sbusy is removed from sendfile(), and the vm object
lock is dropped, there is no sense in requiring vm_page_grab() to wait
for the busy state to clear.  It is done by the buffer cache or filesystem
code later. See the patch at the end.

Still, I do not know what happens in the supposedly reported deadlock.

diff --git a/sys/kern/uipc_syscalls.c b/sys/kern/uipc_syscalls.c
index 4797444..b974f53 100644
--- a/sys/kern/uipc_syscalls.c
+++ b/sys/kern/uipc_syscalls.c
@@ -2230,7 +2230,8 @@ retry_space:
pindex = OFF_TO_IDX(off);
VM_OBJECT_WLOCK(obj);
pg = vm_page_grab(obj, pindex, VM_ALLOC_NOBUSY |
-   VM_ALLOC_NORMAL | VM_ALLOC_WIRED | VM_ALLOC_RETRY);
+   VM_ALLOC_IGN_SBUSY | VM_ALLOC_NORMAL |
+   VM_ALLOC_WIRED | VM_ALLOC_RETRY);
 
/*
 * Check if page is valid for what we need,




Re: NFS deadlock on 9.2-Beta1

2013-08-20 Thread Oliver Pinter
On 8/20/13, J David  wrote:
> On Thu, Aug 15, 2013 at 5:39 PM, Rick Macklem  wrote:
>> Have you been able to pass the debugging info on to Kostik?
>>
>> It would be really nice to get this fixed for FreeBSD9.2.
>
> You're probably not talking to me, but headway here is slow.  At our
> location, we have been continuing to test releng/9.2 extensively, but
> with r250907 reverted.  Since reverting it solves the issue, and since
> there haven't been any further changes to releng/9.2 that might also
> resolve this issue, re-applying r250907 is perceived here as un-fixing
> a problem.  Enthusiasm for doing so is correspondingly low, even if
> the purpose is to gather debugging info. :(
>
> However, after finally having clearance to test releng/9.2 r254540
> with r250907 included and with DDB on five nodes.  The problem cropped
> up in about an hour.  Two threads in one process deadlocked, was
> perfect.  Got it into DDB and saw the stack trace was scrolling off so
> there was no way to copy it by hand.  Also, the machine's disk is
> smaller than physical RAM, so no dump file. :(
>
> Here's what is available so far:
>
> db> show proc 33362
>
> Process 33362 (httpd) at 0xcd225b50:
>
>  state: NORMAL
>
>  uid: 25000 gids: 25000
>
>  parent: pid 25104 at 0xc95f92d4
>
>  ABI: FreeBSD ELF32
>
>  arguments: /usr/local/libexec/httpd
>
>  threads: 3
>
> 100405 D newnfs 0xc9b875e4 httpd
>
> 100393 D pgrbwt 0xc43a30c0 httpd
>
> 100755 S uwait 0xc84b7c80 httpd
>
>
> Not much to go on. :(  Maybe these five can be configured with serial
> consoles.

try this with serial console:

host # script debug-output-file
host # cu -s 9600 -l /dev/ttyU0
~^B
KDB: enter: Break to debugger
[ thread pid 11 tid 15 ]
Stopped at  kdb_alt_break_internal+0x17f:   movq$0,kdb_why
db>   show msgbuf

...
~.
^D

>
> So, inquiries are continuing, but the answer to "does this still
> happen on 9.2-RC2?" is definitely yes.
>
> Thanks!


Re: NFS deadlock on 9.2-Beta1

2013-08-20 Thread Rick Macklem
J David wrote:
> On Thu, Aug 15, 2013 at 5:39 PM, Rick Macklem 
> wrote:
> > Have you been able to pass the debugging info on to Kostik?
> >
> > It would be really nice to get this fixed for FreeBSD9.2.
> 
> You're probably not talking to me, but headway here is slow.  At our
> location, we have been continuing to test releng/9.2 extensively, but
> with r250907 reverted.  Since reverting it solves the issue, and
> since
> there haven't been any further changes to releng/9.2 that might also
> resolve this issue, re-applying r250907 is perceived here as
> un-fixing
> a problem.  Enthusiasm for doing so is correspondingly low, even if
> the purpose is to gather debugging info. :(
> 
> However, after finally having clearance to test releng/9.2 r254540
> with r250907 included and with DDB on five nodes.  The problem
> cropped
> up in about an hour.  Two threads in one process deadlocked, was
> perfect.  Got it into DDB and saw the stack trace was scrolling off
> so
> there was no way to copy it by hand.  Also, the machine's disk is
> smaller than physical RAM, so no dump file. :(
> 
> Here's what is available so far:
> 
> db> show proc 33362
> 
> Process 33362 (httpd) at 0xcd225b50:
> 
>  state: NORMAL
> 
>  uid: 25000 gids: 25000
> 
>  parent: pid 25104 at 0xc95f92d4
> 
>  ABI: FreeBSD ELF32
> 
>  arguments: /usr/local/libexec/httpd
> 
>  threads: 3
> 
> 100405 D newnfs 0xc9b875e4 httpd
> 
Ok, so this one is waiting for an NFS vnode lock.

> 100393 D pgrbwt 0xc43a30c0 httpd
> 
This one is sleeping in vm_page_grab() { which I suspect has
been called from kern_sendfile() with a shared vnode lock held,
from what I saw on the previous debug info }.

> 100755 S uwait 0xc84b7c80 httpd
> 
> 
> Not much to go on. :(  Maybe these five can be configured with serial
> consoles.
> 
> So, inquiries are continuing, but the answer to "does this still
> happen on 9.2-RC2?" is definitely yes.
> 
Since r250027 moves a vn_lock() to before the vm_page_grab() call in
kern_sendfile(), I suspect that is the cause of the deadlock. (r250027
is one of the 3 commits MFC'd by r250907)

I don't know if it would be safe to VOP_UNLOCK() the vnode after VOP_GETATTR()
and then put the vn_lock() call that comes after vm_page_grab() back in, or
whether r250027 should be reverted (getting rid of the VOP_GETATTR() and going
back to using the size in the vm stuff).
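
To spell out the ordering difference as I read it (schematic only, from
memory, not the literal kern_sendfile() source):

/*
 * Before r250027, roughly:
 *
 *	pg = vm_page_grab(obj, pindex, ...);	// may sleep on page busy
 *	...
 *	vn_lock(vp, LK_SHARED | LK_RETRY);	// vnode lock taken afterwards
 *	error = vn_rdwr(UIO_READ, vp, ...);
 *	VOP_UNLOCK(vp, 0);
 *
 * After r250027 (part of the r250907 MFC), the vn_lock() is done earlier,
 * for the VOP_GETATTR() size check, so the grab happens with the shared
 * vnode lock already held:
 *
 *	vn_lock(vp, LK_SHARED | LK_RETRY);
 *	error = VOP_GETATTR(vp, &va, td->td_ucred);
 *	...
 *	pg = vm_page_grab(obj, pindex, ...);	// the "pgrbwt" sleep now
 *						// happens under the vnode lock
 */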

Hopefully Kostik will know what is best to do with it now, rick

> Thanks!


Re: NFS deadlock on 9.2-Beta1

2013-08-19 Thread J David
On Thu, Aug 15, 2013 at 5:39 PM, Rick Macklem  wrote:
> Have you been able to pass the debugging info on to Kostik?
>
> It would be really nice to get this fixed for FreeBSD9.2.

You're probably not talking to me, but headway here is slow.  At our
location, we have been continuing to test releng/9.2 extensively, but
with r250907 reverted.  Since reverting it solves the issue, and since
there haven't been any further changes to releng/9.2 that might also
resolve this issue, re-applying r250907 is perceived here as un-fixing
a problem.  Enthusiasm for doing so is correspondingly low, even if
the purpose is to gather debugging info. :(

However, we finally got clearance to test releng/9.2 r254540 with r250907
included and with DDB on five nodes.  The problem cropped up in about an
hour.  Two threads in one process deadlocked, which was perfect.  Got it
into DDB and saw the stack trace was scrolling off so there was no way to
copy it by hand.  Also, the machine's disk is smaller than physical RAM,
so no dump file. :(

Here's what is available so far:

db> show proc 33362

Process 33362 (httpd) at 0xcd225b50:

 state: NORMAL

 uid: 25000 gids: 25000

 parent: pid 25104 at 0xc95f92d4

 ABI: FreeBSD ELF32

 arguments: /usr/local/libexec/httpd

 threads: 3

100405 D newnfs 0xc9b875e4 httpd

100393 D pgrbwt 0xc43a30c0 httpd

100755 S uwait 0xc84b7c80 httpd


Not much to go on. :(  Maybe these five can be configured with serial consoles.

So, inquiries are continuing, but the answer to "does this still
happen on 9.2-RC2?" is definitely yes.
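
For what it's worth, if the serial-console idea pans out, the knobs involved
are the usual ones, and textdump(4) should let the DDB output land in the dump
area even though the disks are smaller than RAM.  A rough sketch (standard
settings, not yet verified on these particular nodes):

  # /boot/loader.conf -- mirror the console onto a serial port
  boot_multicons="YES"
  boot_serial="YES"
  comconsole_speed="115200"
  console="comconsole,vidconsole"

  # /etc/rc.conf -- pick a dump device; a textdump is tiny compared to RAM
  dumpdev="AUTO"

Then "textdump set" and "capture on" at the db> prompt before running the show
commands should preserve the output instead of letting it scroll away.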

Thanks!
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: NFS deadlock on 9.2-Beta1

2013-08-15 Thread Rick Macklem
Michael Tratz wrote:
> 
> On Jul 27, 2013, at 11:25 PM, Konstantin Belousov
>  wrote:
> 
> > On Sat, Jul 27, 2013 at 03:13:05PM -0700, Michael Tratz wrote:
> >> Let's assume the pid which started the deadlock is 14001 (it will
> >> be a different pid when we get the results, because the machine
> >> has been restarted)
> >> 
> >> I type:
> >> 
> >> show proc 14001
> >> 
> >> I get the thread numbers from that output and type:
> >> 
> >> show thread x
> >> 
> >> for each one.
> >> 
> >> And a trace for each thread with the command?
> >> 
> >> tr 
> >> 
> >> Anything else I should try to get or do? Or is that not the data
> >> at all you are looking for?
> >> 
> > Yes, everything else which is listed in the 'debugging deadlocks'
> > page
> > must be provided, otherwise the deadlock cannot be tracked.
> > 
> > The investigator should be able to see the whole deadlock chain
> > (loop)
> > to make any useful advance.
> 
> Ok, I have made some excellent progress in debugging the NFS
> deadlock.
> 
> Rick! You are genius. :-) You found the right commit r250907 (dated
> May 22) is the definitely the problem.
> 
> Here is how I did the testing: One machine received a kernel before
> r250907, the second machine received a kernel after r250907. Sure
> enough within a few hours the machine with r250907 went into the
> usual deadlock state. The machine without that commit kept on
> working fine. Then I went back to the latest revision (r253726), but
> leaving r250907 out. The machines have been running happy and rock
> solid without any deadlocks. I have expanded the testing to 3
> machines now and no reports of any issues.
> 
> I guess now Konstantin has to figure out why that commit is causing
> the deadlock. Lovely! :-) I will get that information as soon as
> possible. I'm a little behind with normal work load, but I expect to
> have the data by Tuesday evening or Wednesday.
> 
Have you been able to pass the debugging info on to Kostik?

It would be really nice to get this fixed for FreeBSD9.2.

Thanks for your help with this, rick

> Thanks again!!
> 
> Michael
> 
> 
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: NFS deadlock on 9.2-Beta1

2013-08-05 Thread J David
On Mon, Aug 5, 2013 at 12:06 PM, Mark Saad  wrote:
>   Is there any updates on this issue ? Has anyone tested it or see it happen 
> on the release candidate ?

It's a bit premature for that; the RC has been out for a few hours.
We put BETA2 on 25 nodes and only saw the problem on five after 24
hours.  At that point we switched to a build that reverts the patch
that causes the deadlock and no node on which that was done (at this
point, all of them) has had the problem since.

We'll get some machines on releng/9.2 today, but I didn't see anything
in the release candidate announcement to indicate that relevant
changes had been made.

Is there anything in the release candidate that is believed to address
this issue?  If so, let us know which svn revision it's in and we'll
try to accelerate test deployment.

Thanks!
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: NFS deadlock on 9.2-Beta1

2013-08-05 Thread Mark Saad


On Jul 29, 2013, at 10:48 PM, J David  wrote:

> If it is helpful, we have 25 nodes testing the 9.2-BETA1 build and
> without especially trying to exercise this bug, we found
> sendfile()-using processes deadlocked in WCHAN newnfs on 5 of the 25
> nodes.  The ones with highest uptime (about 3 days) seem most
> affected, so it does seem like a "sooner or later" type of thing.
> Hopefully the fix is easy and it won't be an issue, but it definitely
> does seem like a problem 9.2-RELEASE would be better off without.
> 
> Unfortunately we are not in a position to capture the requested
> debugging information at this time; none of those nodes are running a
> debug version of the kernel.  If Michael is unable to get the
> information as he hopes, we can try to do that, possibly over the
> weekend.  For the time being, we will convert half the machines to
> rollback r250907 to try to confirm that resolves the issue.
> 
> Thanks all!  If one has to encounter a problem like this, it is nice
> to come to the list and find the research already so well underway!
> __

All
  Are there any updates on this issue?  Has anyone tested it, or seen it happen on
the release candidate?
---
Mark Saad | mark.s...@longcount.org

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: NFS deadlock on 9.2-Beta1

2013-07-29 Thread J David
If it is helpful, we have 25 nodes testing the 9.2-BETA1 build and
without especially trying to exercise this bug, we found
sendfile()-using processes deadlocked in WCHAN newnfs on 5 of the 25
nodes.  The ones with highest uptime (about 3 days) seem most
affected, so it does seem like a "sooner or later" type of thing.
Hopefully the fix is easy and it won't be an issue, but it definitely
does seem like a problem 9.2-RELEASE would be better off without.

Unfortunately we are not in a position to capture the requested
debugging information at this time; none of those nodes are running a
debug version of the kernel.  If Michael is unable to get the
information as he hopes, we can try to do that, possibly over the
weekend.  For the time being, we will convert half the machines to a build
that rolls back r250907, to try to confirm that resolves the issue.
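
In case it helps anyone who wants to try the same thing, backing the commit
out is just a reverse merge plus a kernel rebuild; roughly the following, with
the source path and KERNCONF adjusted to taste:

  cd /usr/src                        # svn checkout of stable/9 or releng/9.2
  svn merge -c -250907 .             # reverse-merge the suspect commit
  make buildkernel KERNCONF=GENERIC
  make installkernel KERNCONF=GENERIC
  shutdown -r now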

Thanks all!  If one has to encounter a problem like this, it is nice
to come to the list and find the research already so well underway!
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: NFS deadlock on 9.2-Beta1

2013-07-29 Thread Rick Macklem
Michael Tratz wrote:
> 
> On Jul 27, 2013, at 11:25 PM, Konstantin Belousov
>  wrote:
> 
> > On Sat, Jul 27, 2013 at 03:13:05PM -0700, Michael Tratz wrote:
> >> Let's assume the pid which started the deadlock is 14001 (it will
> >> be a different pid when we get the results, because the machine
> >> has been restarted)
> >> 
> >> I type:
> >> 
> >> show proc 14001
> >> 
> >> I get the thread numbers from that output and type:
> >> 
> >> show thread x
> >> 
> >> for each one.
> >> 
> >> And a trace for each thread with the command?
> >> 
> >> tr 
> >> 
> >> Anything else I should try to get or do? Or is that not the data
> >> at all you are looking for?
> >> 
> > Yes, everything else which is listed in the 'debugging deadlocks'
> > page
> > must be provided, otherwise the deadlock cannot be tracked.
> > 
> > The investigator should be able to see the whole deadlock chain
> > (loop)
> > to make any useful advance.
> 
> Ok, I have made some excellent progress in debugging the NFS
> deadlock.
> 
> Rick! You are genius. :-) You found the right commit r250907 (dated
> May 22) is the definitely the problem.
> 
Nowhere close, take my word for it ;-) (At least you put a smiley after it.)
(I've never actually even been employed as a software developer, but that's
off topic.)

I just got lucky (basically there wasn't any other commit that seemed like it
might cause this).

But the good news is that it is partially isolated. Hopefully the debugging
stuff you get for Kostik will allow him (I suspect he is a genius) to solve
the problem. (If I were going to take another "shot in the dark", I'd guess
it's r250027 moving the vn_lock() call. Maybe calling vm_page_grab() with the
shared vnode lock held?)

I've added re@ to the cc list, since I think this might be a show stopper
for 9.2?

Thanks for reporting this and all your help with tracking it down, rick

> Here is how I did the testing: One machine received a kernel before
> r250907, the second machine received a kernel after r250907. Sure
> enough within a few hours the machine with r250907 went into the
> usual deadlock state. The machine without that commit kept on
> working fine. Then I went back to the latest revision (r253726), but
> leaving r250907 out. The machines have been running happy and rock
> solid without any deadlocks. I have expanded the testing to 3
> machines now and no reports of any issues.
> 
> I guess now Konstantin has to figure out why that commit is causing
> the deadlock. Lovely! :-) I will get that information as soon as
> possible. I'm a little behind with normal work load, but I expect to
> have the data by Tuesday evening or Wednesday.
> 
> Thanks again!!
> 
> Michael
> 
> 
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: NFS deadlock on 9.2-Beta1

2013-07-29 Thread Michael Tratz

On Jul 27, 2013, at 11:25 PM, Konstantin Belousov  wrote:

> On Sat, Jul 27, 2013 at 03:13:05PM -0700, Michael Tratz wrote:
>> Let's assume the pid which started the deadlock is 14001 (it will be a 
>> different pid when we get the results, because the machine has been 
>> restarted)
>> 
>> I type:
>> 
>> show proc 14001
>> 
>> I get the thread numbers from that output and type:
>> 
>> show thread x
>> 
>> for each one.
>> 
>> And a trace for each thread with the command?
>> 
>> tr 
>> 
>> Anything else I should try to get or do? Or is that not the data at all you 
>> are looking for?
>> 
> Yes, everything else which is listed in the 'debugging deadlocks' page
> must be provided, otherwise the deadlock cannot be tracked.
> 
> The investigator should be able to see the whole deadlock chain (loop)
> to make any useful advance.

Ok, I have made some excellent progress in debugging the NFS deadlock.

Rick! You are a genius. :-) You found the right commit: r250907 (dated May 22)
is definitely the problem.

Here is how I did the testing: One machine received a kernel before r250907,
the second machine received a kernel after r250907. Sure enough, within a few
hours the machine with r250907 went into the usual deadlock state. The machine
without that commit kept on working fine. Then I went back to the latest
revision (r253726), but left r250907 out. The machines have been running
happily and rock solid without any deadlocks. I have expanded the testing to 3
machines now, with no reports of any issues.

I guess now Konstantin has to figure out why that commit is causing the 
deadlock. Lovely! :-) I will get that information as soon as possible. I'm a 
little behind with normal work load, but I expect to have the data by Tuesday 
evening or Wednesday.

Thanks again!!

Michael

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: NFS deadlock on 9.2-Beta1

2013-07-27 Thread Konstantin Belousov
On Sat, Jul 27, 2013 at 03:13:05PM -0700, Michael Tratz wrote:
> Let's assume the pid which started the deadlock is 14001 (it will be a 
> different pid when we get the results, because the machine has been restarted)
> 
> I type:
> 
> show proc 14001
> 
> I get the thread numbers from that output and type:
> 
> show thread x
> 
> for each one.
> 
> And a trace for each thread with the command?
> 
> tr 
> 
> Anything else I should try to get or do? Or is that not the data at all you 
> are looking for?
> 
Yes, everything else which is listed in the 'debugging deadlocks' page
must be provided, otherwise the deadlock cannot be tracked.

The investigator should be able to see the whole deadlock chain (loop)
to make any useful advance.
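
For reference, the set of commands from that page is roughly the following
(the page itself is authoritative; this is only a reminder):

  db> ps
  db> show allpcpu
  db> show alllocks
  db> show lockedvnods
  db> alltrace
  db> call doadump        (only if a dump device is configured)

Without 'show alllocks' and 'show lockedvnods' in particular, the owners of
the contended vnode locks cannot be identified.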


pgp32XmLnF34j.pgp
Description: PGP signature


Re: NFS deadlock on 9.2-Beta1

2013-07-27 Thread Michael Tratz

On Jul 27, 2013, at 1:58 PM, Konstantin Belousov  wrote:

> On Sat, Jul 27, 2013 at 04:20:49PM -0400, Rick Macklem wrote:
>> Michael Tratz wrote:
>>> 
>>> On Jul 24, 2013, at 5:25 PM, Rick Macklem 
>>> wrote:
>>> 
 Michael Tratz wrote:
> Two machines (NFS Server: running ZFS / Client: disk-less), both
> are
> running FreeBSD r253506. The NFS client starts to deadlock
> processes
> within a few hours. It usually gets worse from there on. The
> processes stay in "D" state. I haven't been able to reproduce it
> when I want it to happen. I only have to wait a few hours until
> the
> deadlocks occur when traffic to the client machine starts to pick
> up. The only way to fix the deadlocks is to reboot the client.
> Even
> an ls to the path which is deadlocked, will deadlock ls itself.
> It's
> totally random what part of the file system gets deadlocked. The
> NFS
> server itself has no problem at all to access the files/path when
> something is deadlocked on the client.
> 
> Last night I decided to put an older kernel on the system r252025
> (June 20th). The NFS server stayed untouched. So far 0 deadlocks
> on
> the client machine (it should have deadlocked by now). FreeBSD is
> working hard like it always does. :-) There are a few changes to
> the
> NFS code from the revision which seems to work until Beta1. I
> haven't tried to narrow it down if one of those commits are
> causing
> the problem. Maybe someone has an idea what could be wrong and I
> can
> test a patch or if it's something else, because I'm not a kernel
> expert. :-)
> 
 Well, the only NFS client change committed between r252025 and
 r253506
 is r253124. It fixes a file corruption problem caused by a previous
 commit that delayed the vnode_pager_setsize() call until after the
 nfs node mutex lock was unlocked.
 
 If you can test with only r253124 reverted to see if that gets rid
 of
 the hangs, it would be useful, although from the procstats, I doubt
 it.
 
> I have run several procstat -kk on the processes including the ls
> which deadlocked. You can see them here:
> 
> http://pastebin.com/1RPnFT6r
 
 All the processes you show seem to be stuck waiting for a vnode
 lock
 or in __utmx_op_wait. (I`m not sure what the latter means.)
 
 What is missing is what processes are holding the vnode locks and
 what they are stuck on.
 
 A starting point might be ``ps axhl``, to see what all the threads
 are doing (particularily the WCHAN for them all). If you can drop
 into
 the debugger when the NFS mounts are hung and do a ```show
 alllocks``
 that could help. See:
 http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
 
 I`ll admit I`d be surprised if r253124 caused this, but who knows.
 
 If there have been changes to your network device driver between
 r252025 and r253506, I`d try reverting those. (If an RPC gets stuck
 waiting for a reply while holding a vnode lock, that would do it.)
 
 Good luck with it and maybe someone else can think of a commit
 between r252025 and r253506 that could cause vnode locking or
 network
 problems.
 
 rick
 
> 
> I have tried to mount the file system with and without nolockd. It
> didn't make a difference. Other than that it is mounted with:
> 
> rw,nfsv3,tcp,noatime,rsize=32768,wsize=32768
> 
> Let me know if you need me to do something else or if some other
> output is required. I would have to go back to the problem kernel
> and wait until the deadlock occurs to get that information.
> 
>>> 
>>> Thanks Rick and Steven for your quick replies.
>>> 
>>> I spoke too soon regarding r252025 fixing the problem. The same issue
>>> started to show up after about 1 day and a few hours of uptime.
>>> 
>>> "ps axhl" shows all those stuck processes in newnfs
>>> 
>>> I recompiled the GENERIC kernel for Beta1 with the debugging options:
>>> 
>>> http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
>>> 
>>> ps and debugging output:
>>> 
>>> http://pastebin.com/1v482Dfw
>>> 
>>> (I only listed processes matching newnfs, if you need the whole list,
>>> please let me know)
>>> 
>> Is your "show alllocks" complete? If not, a complete list of locks
>> would definitely help. As for "ps axhl", a complete list of processes/threads
>> might be useful, but not as much, I think.
>> 
>>> The first PID showing up having that problem is 14001. Certainly the
>>> "show alllocks" command shows interesting information for that PID.
>>> I looked through the commit history for those files mentioned in the
>>> output to see if there is something obvious to me. But I don't know.
>>> :-)
>>> I hope that information helps you to

Re: NFS deadlock on 9.2-Beta1

2013-07-27 Thread Michael Tratz

On Jul 27, 2013, at 1:20 PM, Rick Macklem  wrote:

> Michael Tratz wrote:
>> 
>> On Jul 24, 2013, at 5:25 PM, Rick Macklem 
>> wrote:
>> 
>>> Michael Tratz wrote:
 Two machines (NFS Server: running ZFS / Client: disk-less), both
 are
 running FreeBSD r253506. The NFS client starts to deadlock
 processes
 within a few hours. It usually gets worse from there on. The
 processes stay in "D" state. I haven't been able to reproduce it
 when I want it to happen. I only have to wait a few hours until
 the
 deadlocks occur when traffic to the client machine starts to pick
 up. The only way to fix the deadlocks is to reboot the client.
 Even
 an ls to the path which is deadlocked, will deadlock ls itself.
 It's
 totally random what part of the file system gets deadlocked. The
 NFS
 server itself has no problem at all to access the files/path when
 something is deadlocked on the client.
 
 Last night I decided to put an older kernel on the system r252025
 (June 20th). The NFS server stayed untouched. So far 0 deadlocks
 on
 the client machine (it should have deadlocked by now). FreeBSD is
 working hard like it always does. :-) There are a few changes to
 the
 NFS code from the revision which seems to work until Beta1. I
 haven't tried to narrow it down if one of those commits are
 causing
 the problem. Maybe someone has an idea what could be wrong and I
 can
 test a patch or if it's something else, because I'm not a kernel
 expert. :-)
 
>>> Well, the only NFS client change committed between r252025 and
>>> r253506
>>> is r253124. It fixes a file corruption problem caused by a previous
>>> commit that delayed the vnode_pager_setsize() call until after the
>>> nfs node mutex lock was unlocked.
>>> 
>>> If you can test with only r253124 reverted to see if that gets rid
>>> of
>>> the hangs, it would be useful, although from the procstats, I doubt
>>> it.
>>> 
 I have run several procstat -kk on the processes including the ls
 which deadlocked. You can see them here:
 
 http://pastebin.com/1RPnFT6r
>>> 
>>> All the processes you show seem to be stuck waiting for a vnode
>>> lock
>>> or in __utmx_op_wait. (I`m not sure what the latter means.)
>>> 
>>> What is missing is what processes are holding the vnode locks and
>>> what they are stuck on.
>>> 
>>> A starting point might be ``ps axhl``, to see what all the threads
>>> are doing (particularily the WCHAN for them all). If you can drop
>>> into
>>> the debugger when the NFS mounts are hung and do a ```show
>>> alllocks``
>>> that could help. See:
>>> http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
>>> 
>>> I`ll admit I`d be surprised if r253124 caused this, but who knows.
>>> 
>>> If there have been changes to your network device driver between
>>> r252025 and r253506, I`d try reverting those. (If an RPC gets stuck
>>> waiting for a reply while holding a vnode lock, that would do it.)
>>> 
>>> Good luck with it and maybe someone else can think of a commit
>>> between r252025 and r253506 that could cause vnode locking or
>>> network
>>> problems.
>>> 
>>> rick
>>> 
 
 I have tried to mount the file system with and without nolockd. It
 didn't make a difference. Other than that it is mounted with:
 
 rw,nfsv3,tcp,noatime,rsize=32768,wsize=32768
 
 Let me know if you need me to do something else or if some other
 output is required. I would have to go back to the problem kernel
 and wait until the deadlock occurs to get that information.
 
>> 
>> Thanks Rick and Steven for your quick replies.
>> 
>> I spoke too soon regarding r252025 fixing the problem. The same issue
>> started to show up after about 1 day and a few hours of uptime.
>> 
>> "ps axhl" shows all those stuck processes in newnfs
>> 
>> I recompiled the GENERIC kernel for Beta1 with the debugging options:
>> 
>> http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
>> 
>> ps and debugging output:
>> 
>> http://pastebin.com/1v482Dfw
>> 
>> (I only listed processes matching newnfs, if you need the whole list,
>> please let me know)
>> 
> Is your "show alllocks" complete? If not, a complete list of locks
> would definitely help. As for "ps axhl", a complete list of processes/threads
> might be useful, but not as much, I think.

Yes, that was the entire output for "show alllocks".

> 
>> The first PID showing up having that problem is 14001. Certainly the
>> "show alllocks" command shows interesting information for that PID.
>> I looked through the commit history for those files mentioned in the
>> output to see if there is something obvious to me. But I don't know.
>> :-)
>> I hope that information helps you to dig deeper into the issue what
>> might be causing those deadlocks.
>> 
> Well, pid 14001 is interesting in that it holds both the sleep 

Re: NFS deadlock on 9.2-Beta1

2013-07-27 Thread Konstantin Belousov
On Sat, Jul 27, 2013 at 04:20:49PM -0400, Rick Macklem wrote:
> Michael Tratz wrote:
> > 
> > On Jul 24, 2013, at 5:25 PM, Rick Macklem 
> > wrote:
> > 
> > > Michael Tratz wrote:
> > >> Two machines (NFS Server: running ZFS / Client: disk-less), both
> > >> are
> > >> running FreeBSD r253506. The NFS client starts to deadlock
> > >> processes
> > >> within a few hours. It usually gets worse from there on. The
> > >> processes stay in "D" state. I haven't been able to reproduce it
> > >> when I want it to happen. I only have to wait a few hours until
> > >> the
> > >> deadlocks occur when traffic to the client machine starts to pick
> > >> up. The only way to fix the deadlocks is to reboot the client.
> > >> Even
> > >> an ls to the path which is deadlocked, will deadlock ls itself.
> > >> It's
> > >> totally random what part of the file system gets deadlocked. The
> > >> NFS
> > >> server itself has no problem at all to access the files/path when
> > >> something is deadlocked on the client.
> > >> 
> > >> Last night I decided to put an older kernel on the system r252025
> > >> (June 20th). The NFS server stayed untouched. So far 0 deadlocks
> > >> on
> > >> the client machine (it should have deadlocked by now). FreeBSD is
> > >> working hard like it always does. :-) There are a few changes to
> > >> the
> > >> NFS code from the revision which seems to work until Beta1. I
> > >> haven't tried to narrow it down if one of those commits are
> > >> causing
> > >> the problem. Maybe someone has an idea what could be wrong and I
> > >> can
> > >> test a patch or if it's something else, because I'm not a kernel
> > >> expert. :-)
> > >> 
> > > Well, the only NFS client change committed between r252025 and
> > > r253506
> > > is r253124. It fixes a file corruption problem caused by a previous
> > > commit that delayed the vnode_pager_setsize() call until after the
> > > nfs node mutex lock was unlocked.
> > > 
> > > If you can test with only r253124 reverted to see if that gets rid
> > > of
> > > the hangs, it would be useful, although from the procstats, I doubt
> > > it.
> > > 
> > >> I have run several procstat -kk on the processes including the ls
> > >> which deadlocked. You can see them here:
> > >> 
> > >> http://pastebin.com/1RPnFT6r
> > > 
> > > All the processes you show seem to be stuck waiting for a vnode
> > > lock
> > > or in __utmx_op_wait. (I`m not sure what the latter means.)
> > > 
> > > What is missing is what processes are holding the vnode locks and
> > > what they are stuck on.
> > > 
> > > A starting point might be ``ps axhl``, to see what all the threads
> > > are doing (particularily the WCHAN for them all). If you can drop
> > > into
> > > the debugger when the NFS mounts are hung and do a ```show
> > > alllocks``
> > > that could help. See:
> > > http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
> > > 
> > > I`ll admit I`d be surprised if r253124 caused this, but who knows.
> > > 
> > > If there have been changes to your network device driver between
> > > r252025 and r253506, I`d try reverting those. (If an RPC gets stuck
> > > waiting for a reply while holding a vnode lock, that would do it.)
> > > 
> > > Good luck with it and maybe someone else can think of a commit
> > > between r252025 and r253506 that could cause vnode locking or
> > > network
> > > problems.
> > > 
> > > rick
> > > 
> > >> 
> > >> I have tried to mount the file system with and without nolockd. It
> > >> didn't make a difference. Other than that it is mounted with:
> > >> 
> > >> rw,nfsv3,tcp,noatime,rsize=32768,wsize=32768
> > >> 
> > >> Let me know if you need me to do something else or if some other
> > >> output is required. I would have to go back to the problem kernel
> > >> and wait until the deadlock occurs to get that information.
> > >> 
> > 
> > Thanks Rick and Steven for your quick replies.
> > 
> > I spoke too soon regarding r252025 fixing the problem. The same issue
> > started to show up after about 1 day and a few hours of uptime.
> > 
> > "ps axhl" shows all those stuck processes in newnfs
> > 
> > I recompiled the GENERIC kernel for Beta1 with the debugging options:
> > 
> > http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
> > 
> > ps and debugging output:
> > 
> > http://pastebin.com/1v482Dfw
> > 
> > (I only listed processes matching newnfs, if you need the whole list,
> > please let me know)
> > 
> Is your "show alllocks" complete? If not, a complete list of locks
> would definitely help. As for "ps axhl", a complete list of processes/threads
> might be useful, but not as much, I think.
> 
> > The first PID showing up having that problem is 14001. Certainly the
> > "show alllocks" command shows interesting information for that PID.
> > I looked through the commit history for those files mentioned in the
> > output to see if there is something obvious to me. But I don't know.
> > :-)
> > I hope that inf

Re: NFS deadlock on 9.2-Beta1

2013-07-27 Thread Rick Macklem
Michael Tratz wrote:
> 
> On Jul 24, 2013, at 5:25 PM, Rick Macklem 
> wrote:
> 
> > Michael Tratz wrote:
> >> Two machines (NFS Server: running ZFS / Client: disk-less), both
> >> are
> >> running FreeBSD r253506. The NFS client starts to deadlock
> >> processes
> >> within a few hours. It usually gets worse from there on. The
> >> processes stay in "D" state. I haven't been able to reproduce it
> >> when I want it to happen. I only have to wait a few hours until
> >> the
> >> deadlocks occur when traffic to the client machine starts to pick
> >> up. The only way to fix the deadlocks is to reboot the client.
> >> Even
> >> an ls to the path which is deadlocked, will deadlock ls itself.
> >> It's
> >> totally random what part of the file system gets deadlocked. The
> >> NFS
> >> server itself has no problem at all to access the files/path when
> >> something is deadlocked on the client.
> >> 
> >> Last night I decided to put an older kernel on the system r252025
> >> (June 20th). The NFS server stayed untouched. So far 0 deadlocks
> >> on
> >> the client machine (it should have deadlocked by now). FreeBSD is
> >> working hard like it always does. :-) There are a few changes to
> >> the
> >> NFS code from the revision which seems to work until Beta1. I
> >> haven't tried to narrow it down if one of those commits are
> >> causing
> >> the problem. Maybe someone has an idea what could be wrong and I
> >> can
> >> test a patch or if it's something else, because I'm not a kernel
> >> expert. :-)
> >> 
> > Well, the only NFS client change committed between r252025 and
> > r253506
> > is r253124. It fixes a file corruption problem caused by a previous
> > commit that delayed the vnode_pager_setsize() call until after the
> > nfs node mutex lock was unlocked.
> > 
> > If you can test with only r253124 reverted to see if that gets rid
> > of
> > the hangs, it would be useful, although from the procstats, I doubt
> > it.
> > 
> >> I have run several procstat -kk on the processes including the ls
> >> which deadlocked. You can see them here:
> >> 
> >> http://pastebin.com/1RPnFT6r
> > 
> > All the processes you show seem to be stuck waiting for a vnode
> > lock
> > or in __utmx_op_wait. (I`m not sure what the latter means.)
> > 
> > What is missing is what processes are holding the vnode locks and
> > what they are stuck on.
> > 
> > A starting point might be ``ps axhl``, to see what all the threads
> > are doing (particularily the WCHAN for them all). If you can drop
> > into
> > the debugger when the NFS mounts are hung and do a ```show
> > alllocks``
> > that could help. See:
> > http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
> > 
> > I`ll admit I`d be surprised if r253124 caused this, but who knows.
> > 
> > If there have been changes to your network device driver between
> > r252025 and r253506, I`d try reverting those. (If an RPC gets stuck
> > waiting for a reply while holding a vnode lock, that would do it.)
> > 
> > Good luck with it and maybe someone else can think of a commit
> > between r252025 and r253506 that could cause vnode locking or
> > network
> > problems.
> > 
> > rick
> > 
> >> 
> >> I have tried to mount the file system with and without nolockd. It
> >> didn't make a difference. Other than that it is mounted with:
> >> 
> >> rw,nfsv3,tcp,noatime,rsize=32768,wsize=32768
> >> 
> >> Let me know if you need me to do something else or if some other
> >> output is required. I would have to go back to the problem kernel
> >> and wait until the deadlock occurs to get that information.
> >> 
> 
> Thanks Rick and Steven for your quick replies.
> 
> I spoke too soon regarding r252025 fixing the problem. The same issue
> started to show up after about 1 day and a few hours of uptime.
> 
> "ps axhl" shows all those stuck processes in newnfs
> 
> I recompiled the GENERIC kernel for Beta1 with the debugging options:
> 
> http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
> 
> ps and debugging output:
> 
> http://pastebin.com/1v482Dfw
> 
> (I only listed processes matching newnfs, if you need the whole list,
> please let me know)
> 
Is your "show alllocks" complete? If not, a complete list of locks
would definitely help. As for "ps axhl", a complete list of processes/threads
might be useful, but not as much, I think.

> The first PID showing up having that problem is 14001. Certainly the
> "show alllocks" command shows interesting information for that PID.
> I looked through the commit history for those files mentioned in the
> output to see if there is something obvious to me. But I don't know.
> :-)
> I hope that information helps you to dig deeper into the issue what
> might be causing those deadlocks.
> 
Well, pid 14001 is interesting in that it holds both the sleep lock
acquired by sblock() and an NFS vnode lock, but is waiting for another
NFS vnode lock, if I read the pastebin stuff correctly.

I suspect that t

Re: NFS deadlock on 9.2-Beta1

2013-07-26 Thread Daniel Braniss
> 
> On Jul 24, 2013, at 5:25 PM, Rick Macklem  wrote:
> 
> > Michael Tratz wrote:
> >> Two machines (NFS Server: running ZFS / Client: disk-less), both are
> >> running FreeBSD r253506. The NFS client starts to deadlock processes
> >> within a few hours. It usually gets worse from there on. The
> >> processes stay in "D" state. I haven't been able to reproduce it
> >> when I want it to happen. I only have to wait a few hours until the
> >> deadlocks occur when traffic to the client machine starts to pick
> >> up. The only way to fix the deadlocks is to reboot the client. Even
> >> an ls to the path which is deadlocked, will deadlock ls itself. It's
> >> totally random what part of the file system gets deadlocked. The NFS
> >> server itself has no problem at all to access the files/path when
> >> something is deadlocked on the client.
> >> 
> >> Last night I decided to put an older kernel on the system r252025
> >> (June 20th). The NFS server stayed untouched. So far 0 deadlocks on
> >> the client machine (it should have deadlocked by now). FreeBSD is
> >> working hard like it always does. :-) There are a few changes to the
> >> NFS code from the revision which seems to work until Beta1. I
> >> haven't tried to narrow it down if one of those commits are causing
> >> the problem. Maybe someone has an idea what could be wrong and I can
> >> test a patch or if it's something else, because I'm not a kernel
> >> expert. :-)
> >> 
> > Well, the only NFS client change committed between r252025 and r253506
> > is r253124. It fixes a file corruption problem caused by a previous
> > commit that delayed the vnode_pager_setsize() call until after the
> > nfs node mutex lock was unlocked.
> > 
> > If you can test with only r253124 reverted to see if that gets rid of
> > the hangs, it would be useful, although from the procstats, I doubt it.
> > 
> >> I have run several procstat -kk on the processes including the ls
> >> which deadlocked. You can see them here:
> >> 
> >> http://pastebin.com/1RPnFT6r
> > 
> > All the processes you show seem to be stuck waiting for a vnode lock
> > or in __utmx_op_wait. (I`m not sure what the latter means.)
> > 
> > What is missing is what processes are holding the vnode locks and
> > what they are stuck on.
> > 
> > A starting point might be ``ps axhl``, to see what all the threads
> > are doing (particularily the WCHAN for them all). If you can drop into
> > the debugger when the NFS mounts are hung and do a ```show alllocks``
> > that could help. See:
> > http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
> > 
> > I`ll admit I`d be surprised if r253124 caused this, but who knows.
> > 
> > If there have been changes to your network device driver between
> > r252025 and r253506, I`d try reverting those. (If an RPC gets stuck
> > waiting for a reply while holding a vnode lock, that would do it.)
> > 
> > Good luck with it and maybe someone else can think of a commit
> > between r252025 and r253506 that could cause vnode locking or network
> > problems.
> > 
> > rick
> > 
> >> 
> >> I have tried to mount the file system with and without nolockd. It
> >> didn't make a difference. Other than that it is mounted with:
> >> 
> >> rw,nfsv3,tcp,noatime,rsize=32768,wsize=32768
> >> 
> >> Let me know if you need me to do something else or if some other
> >> output is required. I would have to go back to the problem kernel
> >> and wait until the deadlock occurs to get that information.
> >> 
> 
> Thanks Rick and Steven for your quick replies.
> 
> I spoke too soon regarding r252025 fixing the problem. The same issue started 
> to show up after about 1 day and a few hours of uptime.
> 
> "ps axhl" shows all those stuck processes in newnfs
> 
> I recompiled the GENERIC kernel for Beta1 with the debugging options:
> 
> http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
> 
> ps and debugging output:
> 
> http://pastebin.com/1v482Dfw
> 
> (I only listed processes matching newnfs, if you need the whole list, please 
> let me know)
> 
> The first PID showing up having that problem is 14001. Certainly the "show 
> alllocks" command shows interesting information for that PID.
> I looked through the commit history for those files mentioned in the output 
> to see if there is something obvious to me. But I don't know. :-)
> I hope that information helps you to dig deeper into the issue what might be 
> causing those deadlocks.
> 
> I did include the pciconf -lv, because you mentioned network device drivers. 
> It's Intel igb. The same hardware is running a kernel from January 19th, 2013 
> also as an NFS client. That machine is rock solid. No problems at all.
> 
> I also went to r251611. That's before r251641 (The NFS FHA changes). Same 
> problem. Here is another debugging output from that kernel:
> 
> http://pastebin.com/ryv8BYc4
> 
> If I should test something else or provide some other output, please let me 
> know.

Re: NFS deadlock on 9.2-Beta1

2013-07-25 Thread Michael Tratz

On Jul 24, 2013, at 5:25 PM, Rick Macklem  wrote:

> Michael Tratz wrote:
>> Two machines (NFS Server: running ZFS / Client: disk-less), both are
>> running FreeBSD r253506. The NFS client starts to deadlock processes
>> within a few hours. It usually gets worse from there on. The
>> processes stay in "D" state. I haven't been able to reproduce it
>> when I want it to happen. I only have to wait a few hours until the
>> deadlocks occur when traffic to the client machine starts to pick
>> up. The only way to fix the deadlocks is to reboot the client. Even
>> an ls to the path which is deadlocked, will deadlock ls itself. It's
>> totally random what part of the file system gets deadlocked. The NFS
>> server itself has no problem at all to access the files/path when
>> something is deadlocked on the client.
>> 
>> Last night I decided to put an older kernel on the system r252025
>> (June 20th). The NFS server stayed untouched. So far 0 deadlocks on
>> the client machine (it should have deadlocked by now). FreeBSD is
>> working hard like it always does. :-) There are a few changes to the
>> NFS code from the revision which seems to work until Beta1. I
>> haven't tried to narrow it down if one of those commits are causing
>> the problem. Maybe someone has an idea what could be wrong and I can
>> test a patch or if it's something else, because I'm not a kernel
>> expert. :-)
>> 
> Well, the only NFS client change committed between r252025 and r253506
> is r253124. It fixes a file corruption problem caused by a previous
> commit that delayed the vnode_pager_setsize() call until after the
> nfs node mutex lock was unlocked.
> 
> If you can test with only r253124 reverted to see if that gets rid of
> the hangs, it would be useful, although from the procstats, I doubt it.
> 
>> I have run several procstat -kk on the processes including the ls
>> which deadlocked. You can see them here:
>> 
>> http://pastebin.com/1RPnFT6r
> 
> All the processes you show seem to be stuck waiting for a vnode lock
> or in __utmx_op_wait. (I`m not sure what the latter means.)
> 
> What is missing is what processes are holding the vnode locks and
> what they are stuck on.
> 
> A starting point might be ``ps axhl``, to see what all the threads
> are doing (particularily the WCHAN for them all). If you can drop into
> the debugger when the NFS mounts are hung and do a ```show alllocks``
> that could help. See:
> http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
> 
> I`ll admit I`d be surprised if r253124 caused this, but who knows.
> 
> If there have been changes to your network device driver between
> r252025 and r253506, I`d try reverting those. (If an RPC gets stuck
> waiting for a reply while holding a vnode lock, that would do it.)
> 
> Good luck with it and maybe someone else can think of a commit
> between r252025 and r253506 that could cause vnode locking or network
> problems.
> 
> rick
> 
>> 
>> I have tried to mount the file system with and without nolockd. It
>> didn't make a difference. Other than that it is mounted with:
>> 
>> rw,nfsv3,tcp,noatime,rsize=32768,wsize=32768
>> 
>> Let me know if you need me to do something else or if some other
>> output is required. I would have to go back to the problem kernel
>> and wait until the deadlock occurs to get that information.
>> 

Thanks Rick and Steven for your quick replies.

I spoke too soon regarding r252025 fixing the problem. The same issue started 
to show up after about 1 day and a few hours of uptime.

"ps axhl" shows all those stuck processes in newnfs

I recompiled the GENERIC kernel for Beta1 with the debugging options:

http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
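
Concretely, a debug config of roughly this shape does it (a sketch for anyone
wanting to reproduce the setup; adjust the arch path for the machine):

  # /usr/src/sys/i386/conf/DEBUG
  include GENERIC
  ident   DEBUG
  options KDB
  options DDB
  options INVARIANTS
  options INVARIANT_SUPPORT
  options WITNESS
  options WITNESS_SKIPSPIN
  options DEBUG_LOCKS
  options DEBUG_VFS_LOCKS
  options DIAGNOSTIC

followed by the usual:

  cd /usr/src
  make buildkernel KERNCONF=DEBUG && make installkernel KERNCONF=DEBUG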

ps and debugging output:

http://pastebin.com/1v482Dfw

(I only listed processes matching newnfs, if you need the whole list, please 
let me know)

The first PID showing that problem is 14001. Certainly the "show
alllocks" command shows interesting information for that PID.
I looked through the commit history for the files mentioned in the output to
see if anything was obvious to me, but I don't know. :-)
I hope that information helps you dig deeper into what might be
causing those deadlocks.

I did include the pciconf -lv output, because you mentioned network device
drivers. It's Intel igb. The same hardware is running a kernel from January
19th, 2013, also as an NFS client. That machine is rock solid. No problems at
all.

I also went to r251611. That's before r251641 (The NFS FHA changes). Same 
problem. Here is another debugging output from that kernel:

http://pastebin.com/ryv8BYc4

If I should test something else or provide some other output, please let me 
know.

Again thank you!

Michael


___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"

Re: NFS deadlock on 9.2-Beta1

2013-07-24 Thread Steven Hartland


- Original Message - 
From: "Rick Macklem" 

To: "Michael Tratz" 
Cc: 
Sent: Thursday, July 25, 2013 1:25 AM
Subject: Re: NFS deadlock on 9.2-Beta1



Michael Tratz wrote:

Two machines (NFS Server: running ZFS / Client: disk-less), both are
running FreeBSD r253506. The NFS client starts to deadlock processes
within a few hours. It usually gets worse from there on. The
processes stay in "D" state. I haven't been able to reproduce it
when I want it to happen. I only have to wait a few hours until the
deadlocks occur when traffic to the client machine starts to pick
up. The only way to fix the deadlocks is to reboot the client. Even
an ls to the path which is deadlocked, will deadlock ls itself. It's
totally random what part of the file system gets deadlocked. The NFS
server itself has no problem at all to access the files/path when
something is deadlocked on the client.

Last night I decided to put an older kernel on the system r252025
(June 20th). The NFS server stayed untouched. So far 0 deadlocks on
the client machine (it should have deadlocked by now). FreeBSD is
working hard like it always does. :-) There are a few changes to the
NFS code from the revision which seems to work until Beta1. I
haven't tried to narrow it down if one of those commits are causing
the problem. Maybe someone has an idea what could be wrong and I can
test a patch or if it's something else, because I'm not a kernel
expert. :-)


Well, the only NFS client change committed between r252025 and r253506
is r253124. It fixes a file corruption problem caused by a previous
commit that delayed the vnode_pager_setsize() call until after the
nfs node mutex lock was unlocked.

If you can test with only r253124 reverted to see if that gets rid of
the hangs, it would be useful, although from the procstats, I doubt it.


I have run several procstat -kk on the processes including the ls
which deadlocked. You can see them here:

http://pastebin.com/1RPnFT6r


All the processes you show seem to be stuck waiting for a vnode lock
or in __utmx_op_wait. (I`m not sure what the latter means.)

What is missing is what processes are holding the vnode locks and
what they are stuck on.

A starting point might be ``ps axhl``, to see what all the threads
are doing (particularily the WCHAN for them all). If you can drop into
the debugger when the NFS mounts are hung and do a ```show alllocks``
that could help. See:
http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html

I`ll admit I`d be surprised if r253124 caused this, but who knows.

If there have been changes to your network device driver between
r252025 and r253506, I`d try reverting those. (If an RPC gets stuck
waiting for a reply while holding a vnode lock, that would do it.)

Good luck with it and maybe someone else can think of a commit
between r252025 and r253506 that could cause vnode locking or network
problems.


You could break to the debugger when it happens and run:
show sleepchain
and
show lockchain
to see what's waiting on what.

   Regards
   Steve



___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: NFS deadlock on 9.2-Beta1

2013-07-24 Thread Rick Macklem
Michael Tratz wrote:
> Two machines (NFS Server: running ZFS / Client: disk-less), both are
> running FreeBSD r253506. The NFS client starts to deadlock processes
> within a few hours. It usually gets worse from there on. The
> processes stay in "D" state. I haven't been able to reproduce it
> when I want it to happen. I only have to wait a few hours until the
> deadlocks occur when traffic to the client machine starts to pick
> up. The only way to fix the deadlocks is to reboot the client. Even
> an ls to the path which is deadlocked, will deadlock ls itself. It's
> totally random what part of the file system gets deadlocked. The NFS
> server itself has no problem at all to access the files/path when
> something is deadlocked on the client.
> 
> Last night I decided to put an older kernel on the system r252025
> (June 20th). The NFS server stayed untouched. So far 0 deadlocks on
> the client machine (it should have deadlocked by now). FreeBSD is
> working hard like it always does. :-) There are a few changes to the
> NFS code from the revision which seems to work until Beta1. I
> haven't tried to narrow it down if one of those commits are causing
> the problem. Maybe someone has an idea what could be wrong and I can
> test a patch or if it's something else, because I'm not a kernel
> expert. :-)
> 
Well, the only NFS client change committed between r252025 and r253506
is r253124. It fixes a file corruption problem caused by a previous
commit that delayed the vnode_pager_setsize() call until after the
nfs node mutex lock was unlocked.

If you can test with only r253124 reverted to see if that gets rid of
the hangs, it would be useful, although from the procstats, I doubt it.

> I have run several procstat -kk on the processes including the ls
> which deadlocked. You can see them here:
> 
> http://pastebin.com/1RPnFT6r

All the processes you show seem to be stuck waiting for a vnode lock
or in __utmx_op_wait. (I'm not sure what the latter means.)

What is missing is what processes are holding the vnode locks and
what they are stuck on.

A starting point might be ``ps axhl``, to see what all the threads
are doing (particularly the WCHAN for them all). If you can drop into
the debugger when the NFS mounts are hung and do a ``show alllocks``,
that could help. See:
http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
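
Something along these lines on the client should also show which threads are
wedged and on what, without dropping into the debugger (untested, adjust to
taste; <pid> is whichever process is stuck):

  ps -axH -o pid,ppid,state,wchan,command    # every thread plus its wait channel
  procstat -kk <pid>                         # kernel stack of a stuck process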

I'll admit I'd be surprised if r253124 caused this, but who knows.

If there have been changes to your network device driver between
r252025 and r253506, I'd try reverting those. (If an RPC gets stuck
waiting for a reply while holding a vnode lock, that would do it.)

Good luck with it and maybe someone else can think of a commit
between r252025 and r253506 that could cause vnode locking or network
problems.

rick

> 
> I have tried to mount the file system with and without nolockd. It
> didn't make a difference. Other than that it is mounted with:
> 
> rw,nfsv3,tcp,noatime,rsize=32768,wsize=32768
> 
> Let me know if you need me to do something else or if some other
> output is required. I would have to go back to the problem kernel
> and wait until the deadlock occurs to get that information.
> 
> Thanks for your help,
> 
> Michael
> 
> 
> ___
> freebsd-stable@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to
> "freebsd-stable-unsubscr...@freebsd.org"
> 
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


NFS deadlock on 9.2-Beta1

2013-07-24 Thread Michael Tratz
Two machines (NFS server: running ZFS / client: disk-less), both are running
FreeBSD r253506. The NFS client starts to deadlock processes within a few
hours. It usually gets worse from there on. The processes stay in "D" state. I
haven't been able to reproduce it on demand; I only have to wait a few hours
until the deadlocks occur, once traffic to the client machine starts to pick
up. The only way to clear the deadlocks is to reboot the client. Even an ls on
the deadlocked path will deadlock ls itself. It's totally random which part of
the file system gets deadlocked. The NFS server itself has no problem at all
accessing the files/paths when something is deadlocked on the client.

Last night I decided to put an older kernel on the system, r252025 (June 20th).
The NFS server stayed untouched. So far 0 deadlocks on the client machine (it
should have deadlocked by now). FreeBSD is working hard like it always does.
:-) There are a few changes to the NFS code between the revision which seems to
work and Beta1. I haven't tried to narrow down whether one of those commits is
causing the problem. Maybe someone has an idea of what could be wrong so I can
test a patch, or whether it's something else, because I'm not a kernel expert. :-)

I have run procstat -kk several times on the processes, including the ls which
deadlocked. You can see the output here:

http://pastebin.com/1RPnFT6r

I have tried to mount the file system with and without nolockd. It didn't make 
a difference. Other than that it is mounted with:

rw,nfsv3,tcp,noatime,rsize=32768,wsize=32768
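
That is, the equivalent of a mount invocation along the lines of the following
(server and path names here are just placeholders):

  mount -t nfs -o rw,nfsv3,tcp,noatime,rsize=32768,wsize=32768 server:/export /mnt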

Let me know if you need me to do something else or if some other output is 
required. I would have to go back to the problem kernel and wait until the 
deadlock occurs to get that information.

Thanks for your help,

Michael


___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"