Re: 9.1-stable crashes while copying data from a NFS mounted directory
On Monday 28 January 2013 07:35:31 YongHyeon PYUN wrote:
On Fri, Jan 25, 2013 at 06:09:50PM +0100, Christian Gusenbauer wrote:
On Friday 25 January 2013 05:50:48 YongHyeon PYUN wrote:
On Fri, Jan 25, 2013 at 01:30:43PM +0900, YongHyeon PYUN wrote:
On Thu, Jan 24, 2013 at 05:21:50PM -0500, John Baldwin wrote:
On Thursday, January 24, 2013 4:22:12 pm Konstantin Belousov wrote:
On Thu, Jan 24, 2013 at 09:50:52PM +0100, Christian Gusenbauer wrote:
On Thursday 24 January 2013 20:37:09 Konstantin Belousov wrote:
On Thu, Jan 24, 2013 at 07:50:49PM +0100, Christian Gusenbauer wrote:
On Thursday 24 January 2013 19:07:23 Konstantin Belousov wrote:
On Thu, Jan 24, 2013 at 08:03:59PM +0200, Konstantin Belousov wrote:
On Thu, Jan 24, 2013 at 06:05:57PM +0100, Christian Gusenbauer wrote:

Hi!

I'm using 9.1 stable svn revision 245605 and I get the panic below if I execute the following commands (as single user):

# swapon -a
# dumpon /dev/ada0s3b
# mount -u /
# ifconfig age0 inet 192.168.2.2 mtu 6144 up
# mount -t nfs -o rsize=32768 data:/multimedia /mnt
# cp /mnt/Movies/test/a.m2ts /tmp

then the system panics almost immediately. I'll attach the stack trace.

Note that I'm using jumbo frames (6144 bytes) on a 1 Gbit network; maybe that's the cause of the panic, because the bcopy (see stack frame #15) fails.

Any clues?

I tried a similar operation with an NFS mount using rsize=32768 and mtu 6144, but the machine runs HEAD and uses em instead of age. I was unable to reproduce the panic when copying a 5GB file from the NFS mount.

Hmmm, I did a quick test. If I do not change the MTU, so just configuring age0 with

# ifconfig age0 inet 192.168.2.2 up

then I can copy all files from the mounted directory without any problems, too. So it's probably age0 related?

From your backtrace and the buffer printout, I see a somewhat strange thing. The buffer data address is 0xff8171418000, while the kernel faulted on an attempt to write at 0xff8171413000, which is lower than the buffer data pointer, during the bcopy to the buffer.

The other data suggests that there was no overflow of the data from the server response. So it might be that mbuf_len(mp) returned a negative number? I am not sure whether that is possible at all.

Try this debugging patch, please. You need to add INVARIANTS etc. to the kernel config.

diff --git a/sys/fs/nfs/nfs_commonsubs.c b/sys/fs/nfs/nfs_commonsubs.c
index efc0786..9a6bda5 100644
--- a/sys/fs/nfs/nfs_commonsubs.c
+++ b/sys/fs/nfs/nfs_commonsubs.c
@@ -218,6 +218,7 @@ nfsm_mbufuio(struct nfsrv_descript *nd, struct uio *uiop, int siz)
 				}
 				mbufcp = NFSMTOD(mp, caddr_t);
 				len = mbuf_len(mp);
+				KASSERT(len > 0, ("len %d", len));
 			}
 			xfer = (left > len) ? len : left;
 #ifdef notdef
@@ -239,6 +240,8 @@ nfsm_mbufuio(struct nfsrv_descript *nd, struct uio *uiop, int siz)
 			uiop->uio_resid -= xfer;
 		}
 		if (uiop->uio_iov->iov_len <= siz) {
+			KASSERT(uiop->uio_iovcnt > 1, ("uio_iovcnt %d",
+			    uiop->uio_iovcnt));
 			uiop->uio_iovcnt--;
 			uiop->uio_iov++;
 		} else {

I thought that the server had returned too long a response, but that seems not to be the case from your data. Still, I think the patch below might be due.

diff --git a/sys/fs/nfsclient/nfs_clrpcops.c b/sys/fs/nfsclient/nfs_clrpcops.c
index be0476a..a89b907 100644
--- a/sys/fs/nfsclient/nfs_clrpcops.c
+++ b/sys/fs/nfsclient/nfs_clrpcops.c
@@ -1444,7 +1444,7 @@ nfsrpc_readrpc(vnode_t vp, struct uio *uiop, struct ucred *cred,
 			NFSM_DISSECT(tl, u_int32_t *, NFSX_UNSIGNED);
 			eof = fxdr_unsigned(int, *tl);
 		}
-		NFSM_STRSIZ(retlen, rsize);
+		NFSM_STRSIZ(retlen, len);
 		error = nfsm_mbufuio(nd,
Re: 9.1-stable crashes while copying data from a NFS mounted directory
On Fri, Jan 25, 2013 at 06:09:50PM +0100, Christian Gusenbauer wrote:
On Friday 25 January 2013 05:50:48 YongHyeon PYUN wrote:
On Fri, Jan 25, 2013 at 01:30:43PM +0900, YongHyeon PYUN wrote:
On Thu, Jan 24, 2013 at 05:21:50PM -0500, John Baldwin wrote:
On Thursday, January 24, 2013 4:22:12 pm Konstantin Belousov wrote:
On Thu, Jan 24, 2013 at 09:50:52PM +0100, Christian Gusenbauer wrote:
On Thursday 24 January 2013 20:37:09 Konstantin Belousov wrote:
On Thu, Jan 24, 2013 at 07:50:49PM +0100, Christian Gusenbauer wrote:
On Thursday 24 January 2013 19:07:23 Konstantin Belousov wrote:
On Thu, Jan 24, 2013 at 08:03:59PM +0200, Konstantin Belousov wrote:
On Thu, Jan 24, 2013 at 06:05:57PM +0100, Christian Gusenbauer wrote:

Hi!

I'm using 9.1 stable svn revision 245605 and I get the panic below if I execute the following commands (as single user):

# swapon -a
# dumpon /dev/ada0s3b
# mount -u /
# ifconfig age0 inet 192.168.2.2 mtu 6144 up
# mount -t nfs -o rsize=32768 data:/multimedia /mnt
# cp /mnt/Movies/test/a.m2ts /tmp

then the system panics almost immediately. I'll attach the stack trace.

Note that I'm using jumbo frames (6144 bytes) on a 1 Gbit network; maybe that's the cause of the panic, because the bcopy (see stack frame #15) fails.

Any clues?

I tried a similar operation with an NFS mount using rsize=32768 and mtu 6144, but the machine runs HEAD and uses em instead of age. I was unable to reproduce the panic when copying a 5GB file from the NFS mount.

Hmmm, I did a quick test. If I do not change the MTU, so just configuring age0 with

# ifconfig age0 inet 192.168.2.2 up

then I can copy all files from the mounted directory without any problems, too. So it's probably age0 related?

From your backtrace and the buffer printout, I see a somewhat strange thing. The buffer data address is 0xff8171418000, while the kernel faulted on an attempt to write at 0xff8171413000, which is lower than the buffer data pointer, during the bcopy to the buffer.

The other data suggests that there was no overflow of the data from the server response. So it might be that mbuf_len(mp) returned a negative number? I am not sure whether that is possible at all.

Try this debugging patch, please. You need to add INVARIANTS etc. to the kernel config.

diff --git a/sys/fs/nfs/nfs_commonsubs.c b/sys/fs/nfs/nfs_commonsubs.c
index efc0786..9a6bda5 100644
--- a/sys/fs/nfs/nfs_commonsubs.c
+++ b/sys/fs/nfs/nfs_commonsubs.c
@@ -218,6 +218,7 @@ nfsm_mbufuio(struct nfsrv_descript *nd, struct uio *uiop, int siz)
 				}
 				mbufcp = NFSMTOD(mp, caddr_t);
 				len = mbuf_len(mp);
+				KASSERT(len > 0, ("len %d", len));
 			}
 			xfer = (left > len) ? len : left;
 #ifdef notdef
@@ -239,6 +240,8 @@ nfsm_mbufuio(struct nfsrv_descript *nd, struct uio *uiop, int siz)
 			uiop->uio_resid -= xfer;
 		}
 		if (uiop->uio_iov->iov_len <= siz) {
+			KASSERT(uiop->uio_iovcnt > 1, ("uio_iovcnt %d",
+			    uiop->uio_iovcnt));
 			uiop->uio_iovcnt--;
 			uiop->uio_iov++;
 		} else {

I thought that the server had returned too long a response, but that seems not to be the case from your data. Still, I think the patch below might be due.

diff --git a/sys/fs/nfsclient/nfs_clrpcops.c b/sys/fs/nfsclient/nfs_clrpcops.c
index be0476a..a89b907 100644
--- a/sys/fs/nfsclient/nfs_clrpcops.c
+++ b/sys/fs/nfsclient/nfs_clrpcops.c
@@ -1444,7 +1444,7 @@ nfsrpc_readrpc(vnode_t vp, struct uio *uiop, struct ucred *cred,
 			NFSM_DISSECT(tl, u_int32_t *, NFSX_UNSIGNED);
 			eof = fxdr_unsigned(int, *tl);
 		}
-		NFSM_STRSIZ(retlen, rsize);
+		NFSM_STRSIZ(retlen, len);
 		error = nfsm_mbufuio(nd, uiop, retlen);
 		if (error)
 			goto nfsmout;

I applied your patches and now I get a panic:

len -4
cpuid = 1
KDB: enter: panic
Dumping 377 out of 6116 MB:..5%..13%..22%..34%..43%..51%..64%..73%..81%..94%

This means that
Re: 9.1-stable crashes while copying data from a NFS mounted directory
On Friday 25 January 2013 05:50:48 YongHyeon PYUN wrote:
On Fri, Jan 25, 2013 at 01:30:43PM +0900, YongHyeon PYUN wrote:
On Thu, Jan 24, 2013 at 05:21:50PM -0500, John Baldwin wrote:
On Thursday, January 24, 2013 4:22:12 pm Konstantin Belousov wrote:
On Thu, Jan 24, 2013 at 09:50:52PM +0100, Christian Gusenbauer wrote:
On Thursday 24 January 2013 20:37:09 Konstantin Belousov wrote:
On Thu, Jan 24, 2013 at 07:50:49PM +0100, Christian Gusenbauer wrote:
On Thursday 24 January 2013 19:07:23 Konstantin Belousov wrote:
On Thu, Jan 24, 2013 at 08:03:59PM +0200, Konstantin Belousov wrote:
On Thu, Jan 24, 2013 at 06:05:57PM +0100, Christian Gusenbauer wrote:

Hi!

I'm using 9.1 stable svn revision 245605 and I get the panic below if I execute the following commands (as single user):

# swapon -a
# dumpon /dev/ada0s3b
# mount -u /
# ifconfig age0 inet 192.168.2.2 mtu 6144 up
# mount -t nfs -o rsize=32768 data:/multimedia /mnt
# cp /mnt/Movies/test/a.m2ts /tmp

then the system panics almost immediately. I'll attach the stack trace.

Note that I'm using jumbo frames (6144 bytes) on a 1 Gbit network; maybe that's the cause of the panic, because the bcopy (see stack frame #15) fails.

Any clues?

I tried a similar operation with an NFS mount using rsize=32768 and mtu 6144, but the machine runs HEAD and uses em instead of age. I was unable to reproduce the panic when copying a 5GB file from the NFS mount.

Hmmm, I did a quick test. If I do not change the MTU, so just configuring age0 with

# ifconfig age0 inet 192.168.2.2 up

then I can copy all files from the mounted directory without any problems, too. So it's probably age0 related?

From your backtrace and the buffer printout, I see a somewhat strange thing. The buffer data address is 0xff8171418000, while the kernel faulted on an attempt to write at 0xff8171413000, which is lower than the buffer data pointer, during the bcopy to the buffer.

The other data suggests that there was no overflow of the data from the server response. So it might be that mbuf_len(mp) returned a negative number? I am not sure whether that is possible at all.

Try this debugging patch, please. You need to add INVARIANTS etc. to the kernel config.

diff --git a/sys/fs/nfs/nfs_commonsubs.c b/sys/fs/nfs/nfs_commonsubs.c
index efc0786..9a6bda5 100644
--- a/sys/fs/nfs/nfs_commonsubs.c
+++ b/sys/fs/nfs/nfs_commonsubs.c
@@ -218,6 +218,7 @@ nfsm_mbufuio(struct nfsrv_descript *nd, struct uio *uiop, int siz)
 				}
 				mbufcp = NFSMTOD(mp, caddr_t);
 				len = mbuf_len(mp);
+				KASSERT(len > 0, ("len %d", len));
 			}
 			xfer = (left > len) ? len : left;
 #ifdef notdef
@@ -239,6 +240,8 @@ nfsm_mbufuio(struct nfsrv_descript *nd, struct uio *uiop, int siz)
 			uiop->uio_resid -= xfer;
 		}
 		if (uiop->uio_iov->iov_len <= siz) {
+			KASSERT(uiop->uio_iovcnt > 1, ("uio_iovcnt %d",
+			    uiop->uio_iovcnt));
 			uiop->uio_iovcnt--;
 			uiop->uio_iov++;
 		} else {

I thought that the server had returned too long a response, but that seems not to be the case from your data. Still, I think the patch below might be due.

diff --git a/sys/fs/nfsclient/nfs_clrpcops.c b/sys/fs/nfsclient/nfs_clrpcops.c
index be0476a..a89b907 100644
--- a/sys/fs/nfsclient/nfs_clrpcops.c
+++ b/sys/fs/nfsclient/nfs_clrpcops.c
@@ -1444,7 +1444,7 @@ nfsrpc_readrpc(vnode_t vp, struct uio *uiop, struct ucred *cred,
 			NFSM_DISSECT(tl, u_int32_t *, NFSX_UNSIGNED);
 			eof = fxdr_unsigned(int, *tl);
 		}
-		NFSM_STRSIZ(retlen, rsize);
+		NFSM_STRSIZ(retlen, len);
 		error = nfsm_mbufuio(nd, uiop, retlen);
 		if (error)
 			goto nfsmout;

I applied your patches and now I get a panic:

len -4
cpuid = 1
KDB: enter: panic
Dumping 377 out of 6116 MB:..5%..13%..22%..34%..43%..51%..64%..73%..81%..94%

This means that the age driver either produced a corrupted mbuf chain, or filled a wrong negative value into the mbuf len field. I am quite certain that the issue
Re: 9.1-stable crashes while copying data from a NFS mounted directory
On Thu, Jan 24, 2013 at 09:50:52PM +0100, Christian Gusenbauer wrote:
On Thursday 24 January 2013 20:37:09 Konstantin Belousov wrote:
On Thu, Jan 24, 2013 at 07:50:49PM +0100, Christian Gusenbauer wrote:
On Thursday 24 January 2013 19:07:23 Konstantin Belousov wrote:
On Thu, Jan 24, 2013 at 08:03:59PM +0200, Konstantin Belousov wrote:
On Thu, Jan 24, 2013 at 06:05:57PM +0100, Christian Gusenbauer wrote:

Hi!

I'm using 9.1 stable svn revision 245605 and I get the panic below if I execute the following commands (as single user):

# swapon -a
# dumpon /dev/ada0s3b
# mount -u /
# ifconfig age0 inet 192.168.2.2 mtu 6144 up
# mount -t nfs -o rsize=32768 data:/multimedia /mnt
# cp /mnt/Movies/test/a.m2ts /tmp

then the system panics almost immediately. I'll attach the stack trace.

Note that I'm using jumbo frames (6144 bytes) on a 1 Gbit network; maybe that's the cause of the panic, because the bcopy (see stack frame #15) fails.

Any clues?

I tried a similar operation with an NFS mount using rsize=32768 and mtu 6144, but the machine runs HEAD and uses em instead of age. I was unable to reproduce the panic when copying a 5GB file from the NFS mount.

Hmmm, I did a quick test. If I do not change the MTU, so just configuring age0 with

# ifconfig age0 inet 192.168.2.2 up

then I can copy all files from the mounted directory without any problems, too. So it's probably age0 related?

From your backtrace and the buffer printout, I see a somewhat strange thing. The buffer data address is 0xff8171418000, while the kernel faulted on an attempt to write at 0xff8171413000, which is lower than the buffer data pointer, during the bcopy to the buffer.

The other data suggests that there was no overflow of the data from the server response. So it might be that mbuf_len(mp) returned a negative number? I am not sure whether that is possible at all.

Try this debugging patch, please. You need to add INVARIANTS etc. to the kernel config.

diff --git a/sys/fs/nfs/nfs_commonsubs.c b/sys/fs/nfs/nfs_commonsubs.c
index efc0786..9a6bda5 100644
--- a/sys/fs/nfs/nfs_commonsubs.c
+++ b/sys/fs/nfs/nfs_commonsubs.c
@@ -218,6 +218,7 @@ nfsm_mbufuio(struct nfsrv_descript *nd, struct uio *uiop, int siz)
 				}
 				mbufcp = NFSMTOD(mp, caddr_t);
 				len = mbuf_len(mp);
+				KASSERT(len > 0, ("len %d", len));
 			}
 			xfer = (left > len) ? len : left;
 #ifdef notdef
@@ -239,6 +240,8 @@ nfsm_mbufuio(struct nfsrv_descript *nd, struct uio *uiop, int siz)
 			uiop->uio_resid -= xfer;
 		}
 		if (uiop->uio_iov->iov_len <= siz) {
+			KASSERT(uiop->uio_iovcnt > 1, ("uio_iovcnt %d",
+			    uiop->uio_iovcnt));
 			uiop->uio_iovcnt--;
 			uiop->uio_iov++;
 		} else {

I thought that the server had returned too long a response, but that seems not to be the case from your data. Still, I think the patch below might be due.

diff --git a/sys/fs/nfsclient/nfs_clrpcops.c b/sys/fs/nfsclient/nfs_clrpcops.c
index be0476a..a89b907 100644
--- a/sys/fs/nfsclient/nfs_clrpcops.c
+++ b/sys/fs/nfsclient/nfs_clrpcops.c
@@ -1444,7 +1444,7 @@ nfsrpc_readrpc(vnode_t vp, struct uio *uiop, struct ucred *cred,
 			NFSM_DISSECT(tl, u_int32_t *, NFSX_UNSIGNED);
 			eof = fxdr_unsigned(int, *tl);
 		}
-		NFSM_STRSIZ(retlen, rsize);
+		NFSM_STRSIZ(retlen, len);
 		error = nfsm_mbufuio(nd, uiop, retlen);
 		if (error)
 			goto nfsmout;

I applied your patches and now I get a panic:

len -4
cpuid = 1
KDB: enter: panic
Dumping 377 out of 6116 MB:..5%..13%..22%..34%..43%..51%..64%..73%..81%..94%

This means that the age driver either produced a corrupted mbuf chain, or filled a wrong negative value into the mbuf len field. I am quite certain that the issue is in the driver.

I added the net@ to Cc:, hopefully you could get help there.
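The suspicion above, that a segment length going negative sends the copy outside the destination buffer, can be illustrated with a small user-space sketch. This is not the kernel code: struct seg and chain_copy() are hypothetical stand-ins for an mbuf chain and for the copy loop in nfsm_mbufuio(), and the check on sp->len only plays the role that the KASSERT plays in the debugging patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an mbuf: one segment of a chained buffer. */
struct seg {
	const char *data;
	int len;		/* signed, so a bogus value such as -4 is representable */
	struct seg *next;
};

/*
 * Copy siz bytes from the chain starting at sp into dst.
 * Returns the number of bytes copied, or -1 if a segment carries a
 * negative length or the chain ends before siz bytes were found.
 */
static int
chain_copy(const struct seg *sp, char *dst, int siz)
{
	int copied = 0;

	assert(dst != NULL);
	while (siz > 0) {
		if (sp == NULL)
			return (-1);	/* chain shorter than requested */
		if (sp->len < 0)
			return (-1);	/* the condition the debugging assertion traps */
		/* Same clamp as in the patch context: never take more than the segment holds. */
		int xfer = (siz > sp->len) ? sp->len : siz;
		memcpy(dst + copied, sp->data, (size_t)xfer);
		copied += xfer;
		siz -= xfer;
		sp = sp->next;
	}
	return (copied);
}
```

Without the sp->len check, a segment reporting len = -4 would make xfer negative, and the cast to size_t would turn it into an enormous copy size, which is the kind of out-of-bounds access the panic suggests.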
#0  doadump (textdump=0) at /spare/tmp/src-stable9/sys/kern/kern_shutdown.c:265
265		if (textdump && textdump_pending) {
(kgdb) #0  doadump (textdump=0) at /spare/tmp/src-stable9/sys/kern/kern_shutdown.c:265
#1  0x802a7490 in db_dump (dummy=<value optimized out>, dummy2=<value optimized out>, dummy3=<value optimized out>, dummy4=<value optimized out>) at /spare/tmp/src-stable9/sys/ddb/db_command.c:538
#2  0x802a6a7e in db_command (last_cmdp=0x808ca140, cmd_table=<value optimized out>, dopager=1) at /spare/tmp/src-stable9/sys/ddb/db_command.c:449
#3  0x802a6cd0 in db_command_loop () at /spare/tmp/src-stable9/sys/ddb/db_command.c:502
#4  0x802a8e29 in db_trap