Re: 9.1-stable crashes while copying data from a NFS mounted directory

2013-01-28 Thread Christian Gusenbauer
On Monday 28 January 2013 07:35:31 YongHyeon PYUN wrote:
 On Fri, Jan 25, 2013 at 06:09:50PM +0100, Christian Gusenbauer wrote:
  On Friday 25 January 2013 05:50:48 YongHyeon PYUN wrote:
   On Fri, Jan 25, 2013 at 01:30:43PM +0900, YongHyeon PYUN wrote:
On Thu, Jan 24, 2013 at 05:21:50PM -0500, John Baldwin wrote:
 On Thursday, January 24, 2013 4:22:12 pm Konstantin Belousov wrote:
  On Thu, Jan 24, 2013 at 09:50:52PM +0100, Christian Gusenbauer wrote:
   On Thursday 24 January 2013 20:37:09 Konstantin Belousov wrote:
On Thu, Jan 24, 2013 at 07:50:49PM +0100, Christian Gusenbauer wrote:
 On Thursday 24 January 2013 19:07:23 Konstantin Belousov wrote:
  On Thu, Jan 24, 2013 at 08:03:59PM +0200, Konstantin Belousov wrote:
   On Thu, Jan 24, 2013 at 06:05:57PM +0100, Christian Gusenbauer wrote:
Hi!

I'm using 9.1 stable svn revision 245605 and I get
the panic below if I execute the following commands
(as single user):

# swapon -a
# dumpon /dev/ada0s3b
# mount -u /
# ifconfig age0 inet 192.168.2.2 mtu 6144 up
# mount -t nfs -o rsize=32768 data:/multimedia /mnt
# cp /mnt/Movies/test/a.m2ts /tmp

then the system panics almost immediately. I'll
attach the stack trace.

Note, that I'm using jumbo frames (6144 byte) on a
1Gbit network, maybe that's the cause for the panic,
because the bcopy (see stack frame #15) fails.

Any clues?
   
   I tried a similar operation with the nfs mount of
   rsize=32768 and mtu 6144, but the machine runs HEAD and
   em instead of age. I was unable to reproduce the panic
   on the copy of the 5GB file from nfs mount.
 
 Hmmm, I did a quick test. If I do not change the MTU, so
 just configuring age0 with
 
 # ifconfig age0 inet 192.168.2.2 up
 
 then I can copy all files from the mounted directory
 without any problems, too. So it's probably age0 related?

From your backtrace and the buffer printout, I see something
strange. The buffer data address is 0xff8171418000,
while the kernel faulted on an attempt to write at
0xff8171413000, which is lower than the buffer data
pointer, during the bcopy into the buffer.

The other data suggests that the data from the server
response did not overflow. So might mbuf_len(mp) have
returned a negative number? I am not sure whether that is
possible at all.

Try this debugging patch, please. You need to add INVARIANTS
etc to the kernel config.

diff --git a/sys/fs/nfs/nfs_commonsubs.c b/sys/fs/nfs/nfs_commonsubs.c
index efc0786..9a6bda5 100644
--- a/sys/fs/nfs/nfs_commonsubs.c
+++ b/sys/fs/nfs/nfs_commonsubs.c
@@ -218,6 +218,7 @@ nfsm_mbufuio(struct nfsrv_descript *nd, struct uio *uiop, int siz)
         }
         mbufcp = NFSMTOD(mp, caddr_t);
         len = mbuf_len(mp);
+        KASSERT(len > 0, ("len %d", len));
     }
     xfer = (left > len) ? len : left;
 #ifdef notdef
@@ -239,6 +240,8 @@ nfsm_mbufuio(struct nfsrv_descript *nd, struct uio *uiop, int siz)
     uiop->uio_resid -= xfer;
 }
 if (uiop->uio_iov->iov_len <= siz) {
+    KASSERT(uiop->uio_iovcnt > 1, ("uio_iovcnt %d",
+        uiop->uio_iovcnt));
     uiop->uio_iovcnt--;
     uiop->uio_iov++;
 } else {

I thought that the server had returned a response that was too
long, but that seems not to be the case from your data. Still,
I think the patch below might be due.

diff --git a/sys/fs/nfsclient/nfs_clrpcops.c b/sys/fs/nfsclient/nfs_clrpcops.c
index be0476a..a89b907 100644
--- a/sys/fs/nfsclient/nfs_clrpcops.c
+++ b/sys/fs/nfsclient/nfs_clrpcops.c
@@ -1444,7 +1444,7 @@ nfsrpc_readrpc(vnode_t vp, struct uio *uiop, struct ucred *cred,
     NFSM_DISSECT(tl, u_int32_t *, NFSX_UNSIGNED);
     eof = fxdr_unsigned(int, *tl);
 }
-NFSM_STRSIZ(retlen, rsize);
+NFSM_STRSIZ(retlen, len);

error = nfsm_mbufuio(nd, 

Re: 9.1-stable crashes while copying data from a NFS mounted directory

2013-01-27 Thread YongHyeon PYUN
On Fri, Jan 25, 2013 at 06:09:50PM +0100, Christian Gusenbauer wrote:
 On Friday 25 January 2013 05:50:48 YongHyeon PYUN wrote:
  On Fri, Jan 25, 2013 at 01:30:43PM +0900, YongHyeon PYUN wrote:
   On Thu, Jan 24, 2013 at 05:21:50PM -0500, John Baldwin wrote:
On Thursday, January 24, 2013 4:22:12 pm Konstantin Belousov wrote:
 On Thu, Jan 24, 2013 at 09:50:52PM +0100, Christian Gusenbauer wrote:
  On Thursday 24 January 2013 20:37:09 Konstantin Belousov wrote:
   On Thu, Jan 24, 2013 at 07:50:49PM +0100, Christian Gusenbauer wrote:
On Thursday 24 January 2013 19:07:23 Konstantin Belousov wrote:
 On Thu, Jan 24, 2013 at 08:03:59PM +0200, Konstantin Belousov wrote:
  On Thu, Jan 24, 2013 at 06:05:57PM +0100, Christian Gusenbauer wrote:
   Hi!
   
   I'm using 9.1 stable svn revision 245605 and I get the
   panic below if I execute the following commands (as
   single user):
   
   # swapon -a
   # dumpon /dev/ada0s3b
   # mount -u /
   # ifconfig age0 inet 192.168.2.2 mtu 6144 up
   # mount -t nfs -o rsize=32768 data:/multimedia /mnt
   # cp /mnt/Movies/test/a.m2ts /tmp
   
   then the system panics almost immediately. I'll attach
   the stack trace.
   
   Note, that I'm using jumbo frames (6144 byte) on a 1Gbit
   network, maybe that's the cause for the panic, because
   the bcopy (see stack frame #15) fails.
   
   Any clues?
  
  I tried a similar operation with the nfs mount of
  rsize=32768 and mtu 6144, but the machine runs HEAD and em
  instead of age. I was unable to reproduce the panic on the
  copy of the 5GB file from nfs mount.

Hmmm, I did a quick test. If I do not change the MTU, so just
configuring age0 with

# ifconfig age0 inet 192.168.2.2 up

then I can copy all files from the mounted directory without
any problems, too. So it's probably age0 related?
   
   From your backtrace and the buffer printout, I see something
   strange. The buffer data address is 0xff8171418000,
   while the kernel faulted on an attempt to write at
   0xff8171413000, which is lower than the buffer data
   pointer, during the bcopy into the buffer.
   
   The other data suggests that the data from the server response
   did not overflow. So might mbuf_len(mp) have returned a
   negative number? I am not sure whether that is possible at all.
   
   Try this debugging patch, please. You need to add INVARIANTS etc
   to the kernel config.
   
   diff --git a/sys/fs/nfs/nfs_commonsubs.c b/sys/fs/nfs/nfs_commonsubs.c
   index efc0786..9a6bda5 100644
   --- a/sys/fs/nfs/nfs_commonsubs.c
   +++ b/sys/fs/nfs/nfs_commonsubs.c
   @@ -218,6 +218,7 @@ nfsm_mbufuio(struct nfsrv_descript *nd, struct uio *uiop, int siz)
            }
            mbufcp = NFSMTOD(mp, caddr_t);
            len = mbuf_len(mp);
   +        KASSERT(len > 0, ("len %d", len));
        }
        xfer = (left > len) ? len : left;
    #ifdef notdef
   @@ -239,6 +240,8 @@ nfsm_mbufuio(struct nfsrv_descript *nd, struct uio *uiop, int siz)
        uiop->uio_resid -= xfer;
    }
    if (uiop->uio_iov->iov_len <= siz) {
   +    KASSERT(uiop->uio_iovcnt > 1, ("uio_iovcnt %d",
   +        uiop->uio_iovcnt));
        uiop->uio_iovcnt--;
        uiop->uio_iov++;
    } else {
   
   I thought that the server had returned a response that was too
   long, but that seems not to be the case from your data. Still,
   I think the patch below might be due.
   
   diff --git a/sys/fs/nfsclient/nfs_clrpcops.c b/sys/fs/nfsclient/nfs_clrpcops.c
   index be0476a..a89b907 100644
   --- a/sys/fs/nfsclient/nfs_clrpcops.c
   +++ b/sys/fs/nfsclient/nfs_clrpcops.c
   @@ -1444,7 +1444,7 @@ nfsrpc_readrpc(vnode_t vp, struct uio *uiop, struct ucred *cred,
        NFSM_DISSECT(tl, u_int32_t *, NFSX_UNSIGNED);
        eof = fxdr_unsigned(int, *tl);
    }
   -NFSM_STRSIZ(retlen, rsize);
   +NFSM_STRSIZ(retlen, len);
    error = nfsm_mbufuio(nd, uiop, retlen);
    if (error)
        goto nfsmout;
  
  I applied your patches and now I get a
  
  panic: len -4
  cpuid = 1
  KDB: enter: panic
  Dumping 377 out of 6116 MB:..5%..13%..22%..34%..43%..51%..64%..73%..81%..94%
 
 This means that 

Re: 9.1-stable crashes while copying data from a NFS mounted directory

2013-01-25 Thread Christian Gusenbauer
On Friday 25 January 2013 05:50:48 YongHyeon PYUN wrote:
 On Fri, Jan 25, 2013 at 01:30:43PM +0900, YongHyeon PYUN wrote:
  On Thu, Jan 24, 2013 at 05:21:50PM -0500, John Baldwin wrote:
   On Thursday, January 24, 2013 4:22:12 pm Konstantin Belousov wrote:
On Thu, Jan 24, 2013 at 09:50:52PM +0100, Christian Gusenbauer wrote:
 On Thursday 24 January 2013 20:37:09 Konstantin Belousov wrote:
  On Thu, Jan 24, 2013 at 07:50:49PM +0100, Christian Gusenbauer wrote:
   On Thursday 24 January 2013 19:07:23 Konstantin Belousov wrote:
On Thu, Jan 24, 2013 at 08:03:59PM +0200, Konstantin Belousov wrote:
 On Thu, Jan 24, 2013 at 06:05:57PM +0100, Christian Gusenbauer wrote:
  Hi!
  
  I'm using 9.1 stable svn revision 245605 and I get the
  panic below if I execute the following commands (as
  single user):
  
  # swapon -a
  # dumpon /dev/ada0s3b
  # mount -u /
  # ifconfig age0 inet 192.168.2.2 mtu 6144 up
  # mount -t nfs -o rsize=32768 data:/multimedia /mnt
  # cp /mnt/Movies/test/a.m2ts /tmp
  
  then the system panics almost immediately. I'll attach
  the stack trace.
  
  Note, that I'm using jumbo frames (6144 byte) on a 1Gbit
  network, maybe that's the cause for the panic, because
  the bcopy (see stack frame #15) fails.
  
  Any clues?
 
 I tried a similar operation with the nfs mount of
 rsize=32768 and mtu 6144, but the machine runs HEAD and em
 instead of age. I was unable to reproduce the panic on the
 copy of the 5GB file from nfs mount.
   
   Hmmm, I did a quick test. If I do not change the MTU, so just
   configuring age0 with
   
   # ifconfig age0 inet 192.168.2.2 up
   
   then I can copy all files from the mounted directory without
   any problems, too. So it's probably age0 related?
  
  From your backtrace and the buffer printout, I see something
  strange. The buffer data address is 0xff8171418000,
  while the kernel faulted on an attempt to write at
  0xff8171413000, which is lower than the buffer data
  pointer, during the bcopy into the buffer.
  
  The other data suggests that the data from the server response
  did not overflow. So might mbuf_len(mp) have returned a
  negative number? I am not sure whether that is possible at all.
  
  Try this debugging patch, please. You need to add INVARIANTS etc
  to the kernel config.
  
  diff --git a/sys/fs/nfs/nfs_commonsubs.c b/sys/fs/nfs/nfs_commonsubs.c
  index efc0786..9a6bda5 100644
  --- a/sys/fs/nfs/nfs_commonsubs.c
  +++ b/sys/fs/nfs/nfs_commonsubs.c
  @@ -218,6 +218,7 @@ nfsm_mbufuio(struct nfsrv_descript *nd, struct uio *uiop, int siz)
           }
           mbufcp = NFSMTOD(mp, caddr_t);
           len = mbuf_len(mp);
  +        KASSERT(len > 0, ("len %d", len));
       }
       xfer = (left > len) ? len : left;
   #ifdef notdef
  @@ -239,6 +240,8 @@ nfsm_mbufuio(struct nfsrv_descript *nd, struct uio *uiop, int siz)
       uiop->uio_resid -= xfer;
   }
   if (uiop->uio_iov->iov_len <= siz) {
  +    KASSERT(uiop->uio_iovcnt > 1, ("uio_iovcnt %d",
  +        uiop->uio_iovcnt));
       uiop->uio_iovcnt--;
       uiop->uio_iov++;
   } else {
  
  I thought that the server had returned a response that was too
  long, but that seems not to be the case from your data. Still,
  I think the patch below might be due.
  
  diff --git a/sys/fs/nfsclient/nfs_clrpcops.c b/sys/fs/nfsclient/nfs_clrpcops.c
  index be0476a..a89b907 100644
  --- a/sys/fs/nfsclient/nfs_clrpcops.c
  +++ b/sys/fs/nfsclient/nfs_clrpcops.c
  @@ -1444,7 +1444,7 @@ nfsrpc_readrpc(vnode_t vp, struct uio *uiop, struct ucred *cred,
       NFSM_DISSECT(tl, u_int32_t *, NFSX_UNSIGNED);
       eof = fxdr_unsigned(int, *tl);
   }
  -NFSM_STRSIZ(retlen, rsize);
  +NFSM_STRSIZ(retlen, len);
   error = nfsm_mbufuio(nd, uiop, retlen);
   if (error)
       goto nfsmout;
 
 I applied your patches and now I get a
 
 panic: len -4
 cpuid = 1
 KDB: enter: panic
 Dumping 377 out of 6116 MB:..5%..13%..22%..34%..43%..51%..64%..73%..81%..94%

This means that the age driver either produced a corrupted mbuf chain
or stored a wrong, negative value in the mbuf len field. I am quite
certain that the issue 

Re: 9.1-stable crashes while copying data from a NFS mounted directory

2013-01-24 Thread Konstantin Belousov
On Thu, Jan 24, 2013 at 09:50:52PM +0100, Christian Gusenbauer wrote:
 On Thursday 24 January 2013 20:37:09 Konstantin Belousov wrote:
  On Thu, Jan 24, 2013 at 07:50:49PM +0100, Christian Gusenbauer wrote:
   On Thursday 24 January 2013 19:07:23 Konstantin Belousov wrote:
On Thu, Jan 24, 2013 at 08:03:59PM +0200, Konstantin Belousov wrote:
 On Thu, Jan 24, 2013 at 06:05:57PM +0100, Christian Gusenbauer wrote:
  Hi!
  
  I'm using 9.1 stable svn revision 245605 and I get the panic below
  if I execute the following commands (as single user):
  
  # swapon -a
  # dumpon /dev/ada0s3b
  # mount -u /
  # ifconfig age0 inet 192.168.2.2 mtu 6144 up
  # mount -t nfs -o rsize=32768 data:/multimedia /mnt
  # cp /mnt/Movies/test/a.m2ts /tmp
  
  then the system panics almost immediately. I'll attach the stack
  trace.
  
  Note, that I'm using jumbo frames (6144 byte) on a 1Gbit network,
  maybe that's the cause for the panic, because the bcopy (see stack
  frame #15) fails.
  
  Any clues?
 
 I tried a similar operation with the nfs mount of rsize=32768 and mtu
 6144, but the machine runs HEAD and em instead of age. I was unable
 to reproduce the panic on the copy of the 5GB file from nfs mount.
   
   Hmmm, I did a quick test. If I do not change the MTU, so just configuring
   age0 with
   
   # ifconfig age0 inet 192.168.2.2 up
   
   then I can copy all files from the mounted directory without any
   problems, too. So it's probably age0 related?
  
  From your backtrace and the buffer printout, I see something strange.
  The buffer data address is 0xff8171418000, while the kernel faulted
  on an attempt to write at 0xff8171413000, which is lower than
  the buffer data pointer, during the bcopy into the buffer.
  
  The other data suggests that the data from the server response did not
  overflow. So might mbuf_len(mp) have returned a negative number?
  I am not sure whether that is possible at all.
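
To illustrate why a negative mbuf length would push the write below the buffer start, here is a minimal userland sketch of the cursor arithmetic in the copy loop. It is not the kernel code: `struct copy_state`, `copy_step()` and `xfer_as_size()` are hypothetical names, and the actual copy is left out. For reference, the two addresses quoted above differ by 0x5000 (20480) bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the cursors kept by the copy loop. */
struct copy_state {
	ptrdiff_t dst_off;	/* destination cursor, relative to buffer start */
	int left;		/* bytes still wanted for the current iovec */
	int len;		/* bytes left in the current mbuf */
};

/*
 * One pass of the loop body: pick the transfer size the same way as
 * "xfer = (left > len) ? len : left" and advance the cursors.
 * The real code would copy xfer bytes at this point.
 */
static int
copy_step(struct copy_state *s)
{
	int xfer;

	assert(s != NULL);
	xfer = (s->left > s->len) ? s->len : s->left;
	s->left -= xfer;
	s->len -= xfer;
	s->dst_off += xfer;
	return (xfer);
}

/* What a copy routine taking a size_t length would see for xfer. */
static size_t
xfer_as_size(int xfer)
{
	return ((size_t)xfer);
}
```

With left = 32768 and len = -4, copy_step() returns -4, leaves the destination cursor 4 bytes below the buffer start, and grows left to 32772. Converted to size_t, that length is enormous, which would be consistent with a copy that runs far outside the buffer.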
  
  Try this debugging patch, please. You need to add INVARIANTS etc to the
  kernel config.
  
  diff --git a/sys/fs/nfs/nfs_commonsubs.c b/sys/fs/nfs/nfs_commonsubs.c
  index efc0786..9a6bda5 100644
  --- a/sys/fs/nfs/nfs_commonsubs.c
  +++ b/sys/fs/nfs/nfs_commonsubs.c
  @@ -218,6 +218,7 @@ nfsm_mbufuio(struct nfsrv_descript *nd, struct uio *uiop, int siz)
           }
           mbufcp = NFSMTOD(mp, caddr_t);
           len = mbuf_len(mp);
  +        KASSERT(len > 0, ("len %d", len));
       }
       xfer = (left > len) ? len : left;
   #ifdef notdef
  @@ -239,6 +240,8 @@ nfsm_mbufuio(struct nfsrv_descript *nd, struct uio *uiop, int siz)
       uiop->uio_resid -= xfer;
   }
   if (uiop->uio_iov->iov_len <= siz) {
  +    KASSERT(uiop->uio_iovcnt > 1, ("uio_iovcnt %d",
  +        uiop->uio_iovcnt));
       uiop->uio_iovcnt--;
       uiop->uio_iov++;
   } else {
  
  I thought that the server had returned a response that was too long, but
  that seems not to be the case from your data. Still, I think the patch
  below might be due.
  
  diff --git a/sys/fs/nfsclient/nfs_clrpcops.c b/sys/fs/nfsclient/nfs_clrpcops.c
  index be0476a..a89b907 100644
  --- a/sys/fs/nfsclient/nfs_clrpcops.c
  +++ b/sys/fs/nfsclient/nfs_clrpcops.c
  @@ -1444,7 +1444,7 @@ nfsrpc_readrpc(vnode_t vp, struct uio *uiop, struct ucred *cred,
       NFSM_DISSECT(tl, u_int32_t *, NFSX_UNSIGNED);
       eof = fxdr_unsigned(int, *tl);
   }
  -NFSM_STRSIZ(retlen, rsize);
  +NFSM_STRSIZ(retlen, len);
   error = nfsm_mbufuio(nd, uiop, retlen);
   if (error)
       goto nfsmout;
 
 I applied your patches and now I get a
 
 panic: len -4
 cpuid = 1
 KDB: enter: panic
 Dumping 377 out of 6116 MB:..5%..13%..22%..34%..43%..51%..64%..73%..81%..94%
 
This means that the age driver either produced a corrupted mbuf chain
or stored a wrong, negative value in the mbuf len field. I am quite
certain that the issue is in the driver.
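
A check along the following lines could help confirm that conclusion before the data reaches the NFS code. This is only a userland sketch under stated assumptions: `struct fake_mbuf`, `chain_first_bad()` and `chain_check()` are hypothetical stand-ins, not the kernel's struct mbuf or any existing driver routine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for one buffer in a chain. */
struct fake_mbuf {
	struct fake_mbuf *next;	/* next buffer in the chain, or NULL */
	int len;		/* number of valid bytes in this buffer */
};

/*
 * Walk the chain and return the index of the first buffer whose
 * length is negative, or -1 if every length is non-negative.
 */
static int
chain_first_bad(const struct fake_mbuf *m)
{
	int idx;

	for (idx = 0; m != NULL; m = m->next, idx++) {
		if (m->len < 0)
			return (idx);
	}
	return (-1);
}

/* Userland analogue of an assertion: abort if the chain is corrupted. */
static void
chain_check(const struct fake_mbuf *m)
{
	assert(chain_first_bad(m) == -1);
}
```

For a chain with lengths 1448, 1448 and -4, chain_first_bad() returns 2, pointing at the last buffer as the one with the bad length.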

I added net@ to Cc:; hopefully you can get help there.
 
 #0  doadump (textdump=0)
 at /spare/tmp/src-stable9/sys/kern/kern_shutdown.c:265
 265 if (textdump && textdump_pending) {
 (kgdb) #0  doadump (textdump=0)
 at /spare/tmp/src-stable9/sys/kern/kern_shutdown.c:265
 #1  0x802a7490 in db_dump (dummy=<value optimized out>,
 dummy2=<value optimized out>, dummy3=<value optimized out>,
 dummy4=<value optimized out>)
 at /spare/tmp/src-stable9/sys/ddb/db_command.c:538
 #2  0x802a6a7e in db_command (last_cmdp=0x808ca140,
 cmd_table=<value optimized out>, dopager=1)
 at /spare/tmp/src-stable9/sys/ddb/db_command.c:449
 #3  0x802a6cd0 in db_command_loop ()
 at /spare/tmp/src-stable9/sys/ddb/db_command.c:502
 #4  0x802a8e29 in db_trap