Re: file system deadlock - the whole story?
On Wed, 19 Jul 2006, Scott Long wrote:

>> Now, uranus has all the various kernel debugging enabled right now, and a serial console, so we're good for the debugging side of things ... and I believe that I can fairly easily "recreate" the issue by just moving a whack of vServers onto that machine to give it the load that seems to kill it ... *and* uranus is one of my newer machines, so the card that is in it is fairly new ... but, since I have a full BIOS serial console working on it, I should be able to get full model # and firmware version, which I take it will help some?
>
> What exact version of FreeBSD are you dealing with?

6-STABLE from ~Jun 28th ... but, I can upgrade it to the latest -STABLE if you feel that that might either help, or at least make debugging easier ...

Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email . [EMAIL PROTECTED] MSN . [EMAIL PROTECTED]
Yahoo . yscrappy Skype: hub.org ICQ . 7615664
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: file system deadlock - the whole story?
On Wed, 19 Jul 2006, User Freebsd wrote:

> 'k, first question is will the core file provide any insight into this? ie. provide further confirmation that it looks like the driver vs file system?

Quite possibly, yes.

> second question, who is currently maintaining the iir driver? I've CC'd Achim in this, as he's listed in the man page as being the maintainer ...

The last big change I see in there was from Scott Long in March, 2006, so quite recently. I'd give Scott a ping and ask him if he'd be interested in helping determine whether the source of a file system wedge might be an iir driver bug of some sort.

Robert N M Watson
Computer Laboratory
University of Cambridge
Re: file system deadlock - the whole story?
User Freebsd wrote:

> On Wed, 19 Jul 2006, Robert Watson wrote:
>
>> On Wed, 19 Jul 2006, User Freebsd wrote:
>>
>>>> Yes, this was going to be my next question -- if you're seeing wedges under load and there's a common controller in use, maybe we're looking at a driver bug. Bugs of that sort typically look a lot like what you describe: an I/O is "lost" and so everything that depends on the I/O wedges waiting for it, leading to a lot of processes hanging around waiting for vnode locks, etc.
>>>
>>> 'k, but how do we debug *that*? :( If it was one, I'd suspect hardware ... but *three*, and only acting up *after* upgrading to FreeBSD 6.x, and only acting up under load ...
>>
>> There are two normal approaches:
>>
>> (1) Switch controllers and see if the problem goes away, then blame the controller that was replaced. :-)
>>
>> (2) Debug the driver when the system is in the wedged state. When Scott Long helped me out with an identical problem with the 3ware driver a few years ago, he basically added debugging output for the driver in the debugger to list the state of outstanding I/Os, count the number of in-bound, out-bound I/Os, etc, to try and find where the missing one was leaked. My impression is that once he had confirmed the presence of the problem, it was fairly easy to fix, but that confirming it required quite a bit of paperwork.
>
> 'k, first question is will the core file provide any insight into this? ie. provide further confirmation that it looks like the driver vs file system?
>
> second question, who is currently maintaining the iir driver? I've CC'd Achim in this, as he's listed in the man page as being the maintainer ...
>
> Now, uranus has all the various kernel debugging enabled right now, and a serial console, so we're good for the debugging side of things ... and I believe that I can fairly easily "recreate" the issue by just moving a whack of vServers onto that machine to give it the load that seems to kill it ... *and* uranus is one of my newer machines, so the card that is in it is fairly new ... but, since I have a full BIOS serial console working on it, I should be able to get full model # and firmware version, which I take it will help some?

What exact version of FreeBSD are you dealing with?

Scott
Re: file system deadlock - the whole story?
On Wed, 19 Jul 2006, Robert Watson wrote:

> On Wed, 19 Jul 2006, User Freebsd wrote:
>
>>> Yes, this was going to be my next question -- if you're seeing wedges under load and there's a common controller in use, maybe we're looking at a driver bug. Bugs of that sort typically look a lot like what you describe: an I/O is "lost" and so everything that depends on the I/O wedges waiting for it, leading to a lot of processes hanging around waiting for vnode locks, etc.
>>
>> 'k, but how do we debug *that*? :( If it was one, I'd suspect hardware ... but *three*, and only acting up *after* upgrading to FreeBSD 6.x, and only acting up under load ...
>
> There are two normal approaches:
>
> (1) Switch controllers and see if the problem goes away, then blame the controller that was replaced. :-)
>
> (2) Debug the driver when the system is in the wedged state. When Scott Long helped me out with an identical problem with the 3ware driver a few years ago, he basically added debugging output for the driver in the debugger to list the state of outstanding I/Os, count the number of in-bound, out-bound I/Os, etc, to try and find where the missing one was leaked. My impression is that once he had confirmed the presence of the problem, it was fairly easy to fix, but that confirming it required quite a bit of paperwork.

'k, first question is will the core file provide any insight into this? ie. provide further confirmation that it looks like the driver vs file system?

second question, who is currently maintaining the iir driver? I've CC'd Achim in this, as he's listed in the man page as being the maintainer ...

Now, uranus has all the various kernel debugging enabled right now, and a serial console, so we're good for the debugging side of things ... and I believe that I can fairly easily "recreate" the issue by just moving a whack of vServers onto that machine to give it the load that seems to kill it ... *and* uranus is one of my newer machines, so the card that is in it is fairly new ... but, since I have a full BIOS serial console working on it, I should be able to get full model # and firmware version, which I take it will help some?

Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Re: file system deadlock - the whole story?
On Wed, 19 Jul 2006, Kostik Belousov wrote:

> On Wed, Jul 19, 2006 at 11:23:21AM -0300, User Freebsd wrote:
>
>> On Wed, 19 Jul 2006, Robert Watson wrote:
>>
>>> On Wed, 19 Jul 2006, User Freebsd wrote:
>>>
>>>> Also note that under FreeBSD 4.x, all three of these machines were pretty much my more solid machines, with even more vServers running on them than I'm able to run with 6.x ... once I got rid of using unionfs, stability skyrocketed :(
>>>>
>>>> Hr ... but, your 'controller driver' comment ... that is one common thing amongst all three servers ... they are all running the iir driver ... not sure the *exact* controller, but pluto (older Dual-PIII) shows it as:
>>>
>>> Yes, this was going to be my next question -- if you're seeing wedges under load and there's a common controller in use, maybe we're looking at a driver bug. Bugs of that sort typically look a lot like what you describe: an I/O is "lost" and so everything that depends on the I/O wedges waiting for it, leading to a lot of processes hanging around waiting for vnode locks, etc.
>>
>> 'k, but how do we debug *that*? :( If it was one, I'd suspect hardware ... but *three*, and only acting up *after* upgrading to FreeBSD 6.x, and only acting up under load ...
>
> The obvious step would be to replace the controller with a different kind.

Unfortunately, that one isn't an option ... these aren't local machines that I can easily swap hardware in :(

Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Re: file system deadlock - the whole story?
On Wed, 19 Jul 2006, User Freebsd wrote:

>> Yes, this was going to be my next question -- if you're seeing wedges under load and there's a common controller in use, maybe we're looking at a driver bug. Bugs of that sort typically look a lot like what you describe: an I/O is "lost" and so everything that depends on the I/O wedges waiting for it, leading to a lot of processes hanging around waiting for vnode locks, etc.
>
> 'k, but how do we debug *that*? :( If it was one, I'd suspect hardware ... but *three*, and only acting up *after* upgrading to FreeBSD 6.x, and only acting up under load ...

There are two normal approaches:

(1) Switch controllers and see if the problem goes away, then blame the controller that was replaced. :-)

(2) Debug the driver when the system is in the wedged state. When Scott Long helped me out with an identical problem with the 3ware driver a few years ago, he basically added debugging output for the driver in the debugger to list the state of outstanding I/Os, count the number of in-bound, out-bound I/Os, etc, to try and find where the missing one was leaked. My impression is that once he had confirmed the presence of the problem, it was fairly easy to fix, but that confirming it required quite a bit of paperwork.

Robert N M Watson
Computer Laboratory
University of Cambridge
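Approach (2) comes down to bookkeeping: every request handed to the controller should eventually come back as a completion. Here is a minimal Python sketch of that accounting, for illustration only. The class and method names are hypothetical and are not taken from the iir or 3ware driver code; a real driver would keep such counters in C and print them from the kernel debugger.

```python
import time


class IoTracker:
    """Toy model of per-request accounting for finding a 'lost' I/O."""

    def __init__(self):
        self.submitted = 0      # requests handed to the controller
        self.completed = 0      # completions received back
        self.outstanding = {}   # request id -> submission timestamp

    def submit(self, req_id, now=None):
        self.submitted += 1
        self.outstanding[req_id] = time.monotonic() if now is None else now

    def complete(self, req_id):
        # A completion for an unknown id would itself point at a driver bug.
        if req_id not in self.outstanding:
            raise KeyError(f"completion for unknown request {req_id}")
        del self.outstanding[req_id]
        self.completed += 1

    def stuck(self, older_than, now=None):
        """Return ids of requests outstanding for more than older_than seconds."""
        now = time.monotonic() if now is None else now
        return sorted(r for r, t in self.outstanding.items() if now - t > older_than)


tracker = IoTracker()
for req in (1, 2, 3):
    tracker.submit(req, now=0.0)
tracker.complete(1)
tracker.complete(3)
# Request 2 never completes, so everything waiting on it would wedge.
print(tracker.submitted - tracker.completed, tracker.stuck(older_than=30, now=60.0))
```

If the submitted and completed counts differ on an otherwise idle, wedged system, the ids left in `outstanding` are the candidates for the leaked request.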
Re: file system deadlock - the whole story?
On Wed, Jul 19, 2006 at 11:23:21AM -0300, User Freebsd wrote:

> On Wed, 19 Jul 2006, Robert Watson wrote:
>
>> On Wed, 19 Jul 2006, User Freebsd wrote:
>>
>>> Also note that under FreeBSD 4.x, all three of these machines were pretty much my more solid machines, with even more vServers running on them than I'm able to run with 6.x ... once I got rid of using unionfs, stability skyrocketed :(
>>>
>>> Hr ... but, your 'controller driver' comment ... that is one common thing amongst all three servers ... they are all running the iir driver ... not sure the *exact* controller, but pluto (older Dual-PIII) shows it as:
>>
>> Yes, this was going to be my next question -- if you're seeing wedges under load and there's a common controller in use, maybe we're looking at a driver bug. Bugs of that sort typically look a lot like what you describe: an I/O is "lost" and so everything that depends on the I/O wedges waiting for it, leading to a lot of processes hanging around waiting for vnode locks, etc.
>
> 'k, but how do we debug *that*? :( If it was one, I'd suspect hardware ... but *three*, and only acting up *after* upgrading to FreeBSD 6.x, and only acting up under load ...

The obvious step would be to replace the controller with a different kind.
Re: file system deadlock - the whole story?
On Wed, 19 Jul 2006, Robert Watson wrote:

> On Wed, 19 Jul 2006, User Freebsd wrote:
>
>> Also note that under FreeBSD 4.x, all three of these machines were pretty much my more solid machines, with even more vServers running on them than I'm able to run with 6.x ... once I got rid of using unionfs, stability skyrocketed :(
>>
>> Hr ... but, your 'controller driver' comment ... that is one common thing amongst all three servers ... they are all running the iir driver ... not sure the *exact* controller, but pluto (older Dual-PIII) shows it as:
>
> Yes, this was going to be my next question -- if you're seeing wedges under load and there's a common controller in use, maybe we're looking at a driver bug. Bugs of that sort typically look a lot like what you describe: an I/O is "lost" and so everything that depends on the I/O wedges waiting for it, leading to a lot of processes hanging around waiting for vnode locks, etc.

'k, but how do we debug *that*? :( If it was one, I'd suspect hardware ... but *three*, and only acting up *after* upgrading to FreeBSD 6.x, and only acting up under load ...

Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Re: file system deadlock - the whole story?
On Wed, 19 Jul 2006, User Freebsd wrote:

> Also note that under FreeBSD 4.x, all three of these machines were pretty much my more solid machines, with even more vServers running on them than I'm able to run with 6.x ... once I got rid of using unionfs, stability skyrocketed :(
>
> Hr ... but, your 'controller driver' comment ... that is one common thing amongst all three servers ... they are all running the iir driver ... not sure the *exact* controller, but pluto (older Dual-PIII) shows it as:

Yes, this was going to be my next question -- if you're seeing wedges under load and there's a common controller in use, maybe we're looking at a driver bug. Bugs of that sort typically look a lot like what you describe: an I/O is "lost" and so everything that depends on the I/O wedges waiting for it, leading to a lot of processes hanging around waiting for vnode locks, etc.

Robert N M Watson
Computer Laboratory
University of Cambridge
Re: file system deadlock - the whole story?
On Wed, 19 Jul 2006, Kostik Belousov wrote:

> You did not provide the output of "show lockedbufs",

Added to my debug list ...

> but, even without that data, I doubt that the buf subsystem deadlocked by itself. I make a conjecture that the problem is either with your disk hardware (i.e., actual hard drive or disk controller), or in the controller driver.

The problem that I have with this theory is that it isn't just one server doing this, or one type of hardware ... all three of the servers that I've upgraded to FreeBSD 6.x are doing it at some point or another ... I'm just getting jupiter (older Dual-PIII server) rebooted now :(

Also note that under FreeBSD 4.x, all three of these machines were pretty much my more solid machines, with even more vServers running on them than I'm able to run with 6.x ... once I got rid of using unionfs, stability skyrocketed :(

Hr ... but, your 'controller driver' comment ... that is one common thing amongst all three servers ... they are all running the iir driver ... not sure the *exact* controller, but pluto (older Dual-PIII) shows it as:

iir0: mem 0xfc8f-0xfc8f3fff irq 30 at device 9.0 on pci1
iir0: [GIANT-LOCKED]

Beyond that controller, jupiter/pluto are Dual-PIII with 36G Seagate drives, uranus is a Dual-Xeon with 72G Seagate drives ...

> At least, you could show us the dmesg.

I'll have to get that for you after next reboot, as /var/run/dmesg.boot shows:

uranus# less /var/run/dmesg.boot
WARNING: /tmp was not properly dismounted
WARNING: /usr was not properly dismounted
WARNING: /var was not properly dismounted

And that's it :(

Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Re: file system deadlock - the whole story?
On Wed, 19 Jul 2006, Kostik Belousov wrote:

> On Wed, Jul 19, 2006 at 01:31:17AM -0300, User Freebsd wrote:
>
>> Kostik/Robert ... does this provide enough (any?) information concerning the deadlock situation(s) that are being reported? is there anything else I should do the next time it happens?
>>
>> I tried to submit a GnATs report on this also, but fear that the attachment was a wee bit too big :(
>
> Marc, thank you for the report. It does contain useful information, I'm looking into it. I see at least one obvious deadlock (your shell becomes unresponsive when you try auto-completion, right?).

Yup, that was when I first noticed it this time through, actually ...

>> On Tue, 18 Jul 2006, User Freebsd wrote:
>>
>>> 'k, had a bunch of fun tonight, but one of the results is that I was able to achieve file system deadlock, or so it appears ...
>>>
>>> Using the following from DDB:
>>>
>>> set $lines=0
>>> show pcpu
>>> show allpcpu
>>> ps
>>> trace
>>> alltrace
>>> show locks
>>> show alllocks
>>> show uma
>>> show malloc
>>> show lockedvnods
>>> call doadump
>>>
>>> I've been able to produce the attached output, as well as have a core dump that can hopefully be used to gather any that I may have missed this time *cross fingers*

Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Re: file system deadlock - the whole story?
On Tue, Jul 18, 2006 at 07:51:52AM -0300, User Freebsd wrote:

> 'k, had a bunch of fun tonight, but one of the results is that I was able to achieve file system deadlock, or so it appears ...
>
> Using the following from DDB:
>
> set $lines=0
> show pcpu
> show allpcpu
> ps
> trace
> alltrace
> show locks
> show alllocks
> show uma
> show malloc
> show lockedvnods
> call doadump
>
> I've been able to produce the attached output, as well as have a core dump that can hopefully be used to gather any that I may have missed this time *cross fingers*

Marc, I seriously doubt that the problem the machine is experiencing is a deadlock. At http://people.freebsd.org/~kib/e1.gif is the graph of the locking dependencies for the vnode locks. An edge from process a to process b means that process a holds a lock and process b is waiting for that lock. A black edge means a dependency through a vnode lock, a red edge a dependency through a buffer lock. As you see, the graph is acyclic.

Basically, there are two groups of blocked processes: one hierarchy is rooted in pid 66575, and this one includes shell 806. The second one is rooted in process 32. What are they doing?
Pid 66575:

Tracing command smtpd pid 66575 tid 101396 td 0xceb0a180
sched_switch(ceb0a180,0,1) at sched_switch+0x177
mi_switch(1,0) at mi_switch+0x270
sleepq_switch(dc5b5b20,c0661b60,0,c05fd078,20c) at sleepq_switch+0xc1
sleepq_wait(dc5b5b20,0,c0601d10,e59,8) at sleepq_wait+0x46
msleep(dc5b5b20,c06afde0,44,c061021d,0) at msleep+0x279
bwait(dc5b5b20,44,c061021d) at bwait+0x47
vnode_pager_generic_getpages(c8e85000,ed347c80,1000,0,c8e22000) at vnode_pager_generic_getpages+0x777
ffs_getpages(ed347bbc,c8e85000,0,ed347be8,c0597c41) at ffs_getpages+0x100
VOP_GETPAGES_APV(c063c100,ed347bbc) at VOP_GETPAGES_APV+0xa9
vnode_pager_getpages(c8e22000,ed347c80,1,0) at vnode_pager_getpages+0xa5
vm_fault(c88da4a0,280bb000,1,0,ceb0a180) at vm_fault+0x980
trap_pfault(ed347d38,1,280bb000,280bb000,0) at trap_pfault+0xce
trap(3b,3b,3b,8078d1c,807952c) at trap+0x1eb
calltrap() at calltrap+0x5
--- trap 0xc, eip = 0x280baffd, esp = 0xbfbfe894, ebp = 0xbfbfe8d8 ---

This process waits for the data to be paged in.
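Reading a trace like this means skipping the scheduler and sleep-queue frames at the top and looking at the first frame below them. The following Python sketch shows that reading step on the first frames of the trace above. It is a rough aid rather than a DDB feature, and the `SCHED` set of frames to skip is an assumption chosen for this example.

```python
import re

# Matches frames of the form "name(args) at name+0xoffset".
FRAME = re.compile(r"(\w+)\([^)]*\) at (\w+)\+0x[0-9a-f]+")


def frames(trace):
    """Return the function names in a DDB-style trace, innermost first."""
    return [m.group(2) for m in FRAME.finditer(trace)]


trace = """
sched_switch(ceb0a180,0,1) at sched_switch+0x177
mi_switch(1,0) at mi_switch+0x270
sleepq_switch(dc5b5b20,c0661b60,0,c05fd078,20c) at sleepq_switch+0xc1
sleepq_wait(dc5b5b20,0,c0601d10,e59,8) at sleepq_wait+0x46
msleep(dc5b5b20,c06afde0,44,c061021d,0) at msleep+0x279
bwait(dc5b5b20,44,c061021d) at bwait+0x47
vnode_pager_generic_getpages(c8e85000,ed347c80,1000,0,c8e22000) at vnode_pager_generic_getpages+0x777
"""

# Frames that only say "this thread is asleep" (assumed list for this sketch).
SCHED = {"sched_switch", "mi_switch", "sleepq_switch", "sleepq_wait", "msleep"}

names = frames(trace)
first_interesting = next(n for n in names if n not in SCHED)
print(first_interesting)
```

For this trace the first frame outside `SCHED` is `bwait`, called from `vnode_pager_generic_getpages`, which matches the reading that the process is waiting for a page-in to finish.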
Pid 32 (syncer):

Tracing command syncer pid 32 tid 100033 td 0xc8544780
sched_switch(c8544780,0,1) at sched_switch+0x177
mi_switch(1,0) at mi_switch+0x270
sleepq_switch(dc79fe68,c0661b60,0,c05fd078,20c) at sleepq_switch+0xc1
sleepq_wait(dc79fe68,0,c0601d10,e59,c06039a0) at sleepq_wait+0x46
msleep(dc79fe68,c06afde0,4c,c06024dc,0) at msleep+0x279
bwait(dc79fe68,4c,c06024dc) at bwait+0x47
bufwait(dc79fe68,1,0,0,0) at bufwait+0x1a
breadn(c8a0b414,6537700,0,4000,0) at breadn+0x266
bread(c8a0b414,6537700,0,4000,0) at bread+0x20
ffs_update(c9992000,0,6,0,0) at ffs_update+0x228
ffs_syncvnode(c9992000,3) at ffs_syncvnode+0x3be
ffs_sync(c8831400,3,c8544780,c8831400,2) at ffs_sync+0x209
sync_fsync(e817fcbc,c8a11ae0,c8a11bec,e817fcd8,c04ed586) at sync_fsync+0x126
VOP_FSYNC_APV(c0634220,e817fcbc) at VOP_FSYNC_APV+0x9b
sync_vnode(c8a11bec,c8544780) at sync_vnode+0x106
sched_sync(0,e817fd38,0,c04ed614,0) at sched_sync+0x1ed
fork_exit(c04ed614,0,e817fd38) at fork_exit+0xa0
fork_trampoline() at fork_trampoline+0x8
--- trap 0x1, eip = 0, esp = 0xe817fd6c, ebp = 0 ---

It also waits for data.

What happens with the blocks? syncer (pid 32) locked block 0xc8a0b414 and waits for data (as shown before). Process 33 (softdepflush) and umount (pid 73338) wait for this block.

You did not provide the output of "show lockedbufs", but, even without that data, I doubt that the buf subsystem deadlocked by itself. I make a conjecture that the problem is either with your disk hardware (i.e., actual hard drive or disk controller), or in the controller driver.

At least, you could show us the dmesg.
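The acyclicity argument in this message can be checked mechanically: a true deadlock shows up as a cycle in the holder-to-waiter graph, while a lost I/O leaves the graph acyclic with the roots waiting on data. Below is a minimal Python sketch of such a check using a depth-first search. The pids come from the message, but the edge set is a simplification: the direct edge from 66575 to 806 is an assumption, since the message only says that shell 806 belongs to the hierarchy rooted at 66575.

```python
def find_cycle(edges):
    """Return a list of pids forming a cycle, or None if the graph is acyclic.

    edges maps a lock holder's pid to the pids waiting on that lock.
    """
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on current path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in edges.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:
                # Back edge to a node on the current path: that is a cycle.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for start in list(edges):
        if color.get(start, WHITE) == WHITE:
            cycle = visit(start)
            if cycle:
                return cycle
    return None


# Simplified version of the graph described in the message (holder -> waiters).
observed = {32: [33, 73338], 66575: [806]}
print(find_cycle(observed))

# A hypothetical two-process deadlock, for contrast.
print(find_cycle({1: [2], 2: [1]}))
```

On the simplified graph the search finds no cycle, which is consistent with the roots (pids 32 and 66575) being stuck waiting for I/O rather than for each other.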
Re: file system deadlock - the whole story?
On Wed, Jul 19, 2006 at 01:31:17AM -0300, User Freebsd wrote:

> Kostik/Robert ... does this provide enough (any?) information concerning the deadlock situation(s) that are being reported? is there anything else I should do the next time it happens?
>
> I tried to submit a GnATs report on this also, but fear that the attachment was a wee bit too big :(

Marc, thank you for the report. It does contain useful information, I'm looking into it. I see at least one obvious deadlock (your shell becomes unresponsive when you try auto-completion, right?).

> On Tue, 18 Jul 2006, User Freebsd wrote:
>
>> 'k, had a bunch of fun tonight, but one of the results is that I was able to achieve file system deadlock, or so it appears ...
>>
>> Using the following from DDB:
>>
>> set $lines=0
>> show pcpu
>> show allpcpu
>> ps
>> trace
>> alltrace
>> show locks
>> show alllocks
>> show uma
>> show malloc
>> show lockedvnods
>> call doadump
>>
>> I've been able to produce the attached output, as well as have a core dump that can hopefully be used to gather any that I may have missed this time *cross fingers*
Re: file system deadlock - the whole story?
Kostik/Robert ... does this provide enough (any?) information concerning the deadlock situation(s) that are being reported? is there anything else I should do the next time it happens?

I tried to submit a GnATs report on this also, but fear that the attachment was a wee bit too big :(

On Tue, 18 Jul 2006, User Freebsd wrote:

> 'k, had a bunch of fun tonight, but one of the results is that I was able to achieve file system deadlock, or so it appears ...
>
> Using the following from DDB:
>
> set $lines=0
> show pcpu
> show allpcpu
> ps
> trace
> alltrace
> show locks
> show alllocks
> show uma
> show malloc
> show lockedvnods
> call doadump
>
> I've been able to produce the attached output, as well as have a core dump that can hopefully be used to gather any that I may have missed this time *cross fingers*

Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)