Re: System deadlock when using mksnap_ffs
On Thu, Nov 13, 2008 at 05:08:10PM +0100, Greg Byshenk wrote:
> On Wed, Nov 12, 2008 at 08:42:00PM -0800, Jeremy Chadwick wrote:
>> The rest of the below information is good -- but I'm confused about
>> something: is there anyone out there who can use mksnap_ffs on a
>> filesystem (/usr is a good test source) and NOT experience this
>> deadlocking problem?
>>
>> Literally *every* FreeBSD box I have root access to suffers from this
>> problem, so I'm a little baffled why we end-users need to keep
>> providing debugging output when it should be easy as pie for a
>> developer to do dump -0 -L -a -f /path/fs.dump /usr and watch their
>> system wedge.
>
> As an answer to the question (and additional information), I am
> experiencing the problem, but not on all filesystems.
>
> This is under FreeBSD 7.1-PRERELEASE #7: Thu Nov 6 11:29:52 CET 2008,
> amd64 (from sources csup'ed immediately prior to the build).
>
> I have four filesystems used for data storage:
>
> /dev/da1p1   96850470   7866026   81236408     9%    /export/mail
> /dev/da1p2 1937058312 972070320  810023328    55%    /export/home
> /dev/da1p3 1937058312  79027008 1703066640     4%    /export/misc
> /dev/da1p4 2598991534 271980564 2119091648    11%    /export/spare
>
> I can successfully mksnap_ffs the first (smaller) partition, but an
> attempt to do so on any of the others causes a lock.
>
> Note: this is a lockup, not a slowdown. The system becomes unresponsive
> to any input, and there is no hard drive activity, and this does not
> change over a period of more than 12 hours.

As a followup to my own post, after reading this discussion, I applied
the patches and rebuilt my system last night.

As of today, with the patched ffs_snapshot.c, I can now make snapshots
of all the filesystems listed above. It takes rather a long time, but
that is to be expected, I think, and the snapshots finish normally.

-- 
greg byshenk - [EMAIL PROTECTED] - Leiden, NL
_______________________________________________
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: System deadlock when using mksnap_ffs
Jeremy,

On Wed, Nov 12, 2008 at 08:42:00PM -0800, Jeremy Chadwick wrote:
> On Thu, Nov 13, 2008 at 12:41:02AM +0000, Tim Bishop wrote:
>> On Wed, Nov 12, 2008 at 09:47:35PM +0200, Kostik Belousov wrote:
>>> On Wed, Nov 12, 2008 at 05:58:26PM +0000, Tim Bishop wrote:
>>>> I run the mksnap_ffs command to take the snapshot and some time
>>>> later the system completely freezes up:
>>>>
>>>> paladin# cd /u2/.snap/
>>>> paladin# mksnap_ffs /u2 test.1
>>>
>>> You need to provide information described in the
>>> http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug.html
>>> and especially
>>> http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
>>
>> Ok, I've done that, and removed the patch that seemed to fix things.
>>
>> The first thing I notice after doing this on the console is that I
>> can still ctrl+t the process:
>>
>> load: 0.14 cmd: mksnap_ffs 2603 [newbuf] 0.00u 10.75s 0% 1160k
>>
>> But the top and ps I left running on other ttys have all stopped
>> responding.
>
> Then in my book, the patch didn't fix anything. :-) The system is
> still deadlocking; snapshot generation **should not** wedge the system
> hard like this.

You missed the part where I said I removed the patch. I did that so I
could provide details with it wedged.

I agree that there are still some fundamental speed issues with
snapshotting though. And I'm sure the FS itself will still be locked
out for a while during the snapshot. But with the patch at least the
whole thing doesn't lock up.

Tim.

-- 
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x5AE7D984
Re: System deadlock when using mksnap_ffs
On Wed, Nov 12, 2008 at 10:05:21PM -0800, Jeremy Chadwick wrote:
> On Wed, Nov 12, 2008 at 09:02:50PM -0800, David Wolfskill wrote:
>> On Wed, Nov 12, 2008 at 08:42:00PM -0800, Jeremy Chadwick wrote:
>> ...
>>>>>> On Wed, Nov 12, 2008 at 05:58:26PM +0000, Tim Bishop wrote:
>>>>>> I've been playing around with snapshots lately but I've got a
>>>>>> problem on one of my servers running 7-STABLE amd64:
>>>>>>
>>>>>> FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10 20:49:51 GMT 2008 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/PALADIN amd64
>>>>>>
>>>>>> I run the mksnap_ffs command to take the snapshot and some time
>>>>>> later the system completely freezes up:
>>>>>>
>>>>>> paladin# cd /u2/.snap/
>>>>>> paladin# mksnap_ffs /u2 test.1
>>>>>>
>>>>>> It only happens on this one filesystem, though, which might be to
>>>>>> do with its size. It's not over the 2TB marker, but it's pretty
>>>>>> close. It's also backed by a hardware RAID system, although a
>>>>>> smaller filesystem on the same RAID has no issues.
>> ...
>>> Then in my book, the patch didn't fix anything. :-) The system is
>>> still deadlocking; snapshot generation **should not** wedge the
>>> system hard like this.
>>>
>>> Also, during my own testing, I am always able to use Ctrl-T to get
>>> SIGINFO from the running process (mksnap_ffs). That behaviour does
>>> not change for me.
>>>
>>> The rest of the below information is good -- but I'm confused about
>>> something: is there anyone out there who can use mksnap_ffs on a
>>> filesystem (/usr is a good test source) and NOT experience this
>>> deadlocking problem?
>>
>> I hadn't ever tried until I saw your message. Granted, I'm using a
>> smaller file system (I doubt that I have a total of as much as 2 TB in
>> all my machines combined), and I'm running i386, vs. amd64.
>>
>> But it ran just fine. I wasn't able to test SIGINFO; it finished
>> before I had a chance. (I ran it under time(1); wall clock time was
>> 0.91 sec.)
>>
>>> Literally *every* FreeBSD box I have root access to suffers from this
>>> problem, so I'm a little baffled why we end-users need to keep
>>> providing debugging output when it should be easy as pie for a
>>> developer to do dump -0 -L -a -f /path/fs.dump /usr and watch their
>>> system wedge.
>>
>> Well, I routinely use dump/restore pipelines to copy file systems
>> around; never had a problem with it.
>> ...
>>
>> For reference:
>>
>> freebeast(7.1-P)[9] uname -a
>> FreeBSD freebeast.catwhisker.org 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #127: Wed Nov 12 05:16:20 PST 2008 [EMAIL PROTECTED]:/common/S3/obj/usr/src/sys/FREEBEAST i386
>> freebeast(7.1-P)[10] ls -la
>> total 4
>> drwxrwxr-x   2 root  operator  512 Nov 12 20:53 .
>> drwxr-xr-x  14 root  wheel     512 Jan 22  2008 ..
>> freebeast(7.1-P)[11] /usr/bin/time -l mksnap_ffs /S2/usr test.1
>>         0.91 real         0.00 user         0.05 sys
>>        976  maximum resident set size
>>          3  average shared memory size
>>        627  average unshared data size
>>        109  average unshared stack size
>>        104  page reclaims
>>          0  page faults
>>          0  swaps
>>          1  block input operations
>>        230  block output operations
>>          0  messages sent
>>          0  messages received
>>          0  signals received
>>        101  voluntary context switches
>>         34  involuntary context switches
>> freebeast(7.1-P)[12] ls -la
>> total 1460
>> drwxrwxr-x   2 root  operator         512 Nov 12 20:54 .
>> drwxr-xr-x  14 root  wheel            512 Jan 22  2008 ..
>> -r--r-----   1 root  operator  2410791056 Nov 12 20:54 test.1
>> freebeast(7.1-P)[13]
>
> David, thanks for chiming in. This is exactly what I was
> fearing/worried about.
>
> It would be greatly beneficial if we could figure out what triggers
> the slowdown for a lot of us, since for others (proof above)
> mksnap_ffs behaves as expected.
>
> Since I'm able to reproduce this pretty much everywhere, here's
> information:
>
> # df -ki /usr
> Filesystem  1024-blocks    Used     Avail Capacity  iused    ifree %iused  Mounted on
> /dev/ad4s1f   163815904 3835274 146875358     3%   254864 20941934    1%   /usr
> # cd /usr/.snap
> # /usr/bin/time -l mksnap_ffs /usr test.1
>
> after about 20 seconds, hitting Ctrl-T
>
> load: 1.90 cmd: mksnap_ffs 11719 [wdrain] 0.00u 0.07s 0% 1092k
>        23.25 real         0.00 user         0.00 sys
>
>       135.98 real         0.00 user         0.62 sys
>       1092  maximum resident set size
>          4  average shared memory size
>       1081  average unshared data size
>        135  average unshared stack size
>        101  page reclaims
>          0  page faults
>          0  swaps
>        895  block input operations
>      13444  block output operations
>          0  messages sent
>          0  messages received
>          0  signals received
>       6433  voluntary context switches
>        197  involuntary context switches
> # ls -l test.1
> -r--r----- 1 root operator 173203463240 Nov 12 21:42
Re: System deadlock when using mksnap_ffs
On Thu, Nov 13, 2008 at 12:26:42PM +0200, Kostik Belousov wrote:
> On Wed, Nov 12, 2008 at 08:42:00PM -0800, Jeremy Chadwick wrote:
>> On Thu, Nov 13, 2008 at 12:41:02AM +0000, Tim Bishop wrote:
>>> On Wed, Nov 12, 2008 at 09:47:35PM +0200, Kostik Belousov wrote:
>>>> On Wed, Nov 12, 2008 at 05:58:26PM +0000, Tim Bishop wrote:
>>>>> I've been playing around with snapshots lately but I've got a
>>>>> problem on one of my servers running 7-STABLE amd64:
>>>>>
>>>>> FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10 20:49:51 GMT 2008 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/PALADIN amd64
>>>>>
>>>>> I run the mksnap_ffs command to take the snapshot and some time
>>>>> later the system completely freezes up:
>>>>>
>>>>> paladin# cd /u2/.snap/
>>>>> paladin# mksnap_ffs /u2 test.1
>>>>>
>>>>> It only happens on this one filesystem, though, which might be to
>>>>> do with its size. It's not over the 2TB marker, but it's pretty
>>>>> close. It's also backed by a hardware RAID system, although a
>>>>> smaller filesystem on the same RAID has no issues.
>>>>>
>>>>> Filesystem   1K-blocks      Used     Avail Capacity  Mounted on
>>>>> /dev/da0s1a 2078881084 921821396 990749202    48%    /u2
>>>>>
>>>>> To clarify "completely freezes up": unresponsive to all services
>>>>> over the network, except ping. On the console I can switch between
>>>>> the ttys, but none of them respond. The only way out is to hit the
>>>>> reset button.
>>>>
>>>> You need to provide information described in the
>>>> http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug.html
>>>> and especially
>>>> http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
>>>
>>> Ok, I've done that, and removed the patch that seemed to fix things.
>>>
>>> The first thing I notice after doing this on the console is that I
>>> can still ctrl+t the process:
>>>
>>> load: 0.14 cmd: mksnap_ffs 2603 [newbuf] 0.00u 10.75s 0% 1160k
>>>
>>> But the top and ps I left running on other ttys have all stopped
>>> responding.
>>
>> Then in my book, the patch didn't fix anything. :-) The system is
>> still deadlocking; snapshot generation **should not** wedge the
>> system hard like this.
>
> You systematically mix two completely different issues:
> - the first one is the _deadlock_ experienced by Tim;

Re-read what he wrote. Quote:

  Ok, I've done that, and removed the patch that seemed to fix things.

  The first thing I notice after doing this on the console is that I
  can still ctrl+t the process:

  load: 0.14 cmd: mksnap_ffs 2603 [newbuf] 0.00u 10.75s 0% 1160k

  But the top and ps I left running on other ttys have all stopped
  responding.

If he can press Control-T, it means SIGINFO can be sent to the
mksnap_ffs process, and the process responds with that information. So,
the system is not deadlocked -- meaning, I believe what he experiences
is what others experience (the system becomes completely unusable during
mksnap_ffs running, but DOES NOT hang or lock up, it just becomes so
god-awful slow that processes on the machine literally sit and spin for
minutes at a time).

> - the second one is the slowdown during snapshot creation.
>
> In fact, I may count a third, where dump itself hangs as a usermode
> process but the kernel still operates normally.
>
> The posted patch should fix or paper over the first issue for
> practical purposes. The third issue was most likely fixed by the
> subr_sleepqueue race fix.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
Re: System deadlock when using mksnap_ffs
On Thu, Nov 13, 2008 at 02:45:14AM -0800, Jeremy Chadwick wrote:
> On Thu, Nov 13, 2008 at 12:26:42PM +0200, Kostik Belousov wrote:
>> On Wed, Nov 12, 2008 at 08:42:00PM -0800, Jeremy Chadwick wrote:
>>> On Thu, Nov 13, 2008 at 12:41:02AM +0000, Tim Bishop wrote:
>>>> On Wed, Nov 12, 2008 at 09:47:35PM +0200, Kostik Belousov wrote:
>>>>> On Wed, Nov 12, 2008 at 05:58:26PM +0000, Tim Bishop wrote:
>>>>>> I've been playing around with snapshots lately but I've got a
>>>>>> problem on one of my servers running 7-STABLE amd64:
>>>>>>
>>>>>> FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10 20:49:51 GMT 2008 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/PALADIN amd64
>>>>>>
>>>>>> I run the mksnap_ffs command to take the snapshot and some time
>>>>>> later the system completely freezes up:
>>>>>>
>>>>>> paladin# cd /u2/.snap/
>>>>>> paladin# mksnap_ffs /u2 test.1
>>>>>>
>>>>>> It only happens on this one filesystem, though, which might be to
>>>>>> do with its size. It's not over the 2TB marker, but it's pretty
>>>>>> close. It's also backed by a hardware RAID system, although a
>>>>>> smaller filesystem on the same RAID has no issues.
>>>>>>
>>>>>> Filesystem   1K-blocks      Used     Avail Capacity  Mounted on
>>>>>> /dev/da0s1a 2078881084 921821396 990749202    48%    /u2
>>>>>>
>>>>>> To clarify "completely freezes up": unresponsive to all services
>>>>>> over the network, except ping. On the console I can switch
>>>>>> between the ttys, but none of them respond. The only way out is
>>>>>> to hit the reset button.
>>>>>
>>>>> You need to provide information described in the
>>>>> http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug.html
>>>>> and especially
>>>>> http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
>>>>
>>>> Ok, I've done that, and removed the patch that seemed to fix things.
>>>>
>>>> The first thing I notice after doing this on the console is that I
>>>> can still ctrl+t the process:
>>>>
>>>> load: 0.14 cmd: mksnap_ffs 2603 [newbuf] 0.00u 10.75s 0% 1160k
>>>>
>>>> But the top and ps I left running on other ttys have all stopped
>>>> responding.
>>>
>>> Then in my book, the patch didn't fix anything. :-) The system is
>>> still deadlocking; snapshot generation **should not** wedge the
>>> system hard like this.
>>
>> You systematically mix two completely different issues:
>> - the first one is the _deadlock_ experienced by Tim;
>
> Re-read what he wrote. Quote:
>
>   Ok, I've done that, and removed the patch that seemed to fix things.
>
>   The first thing I notice after doing this on the console is that I
>   can still ctrl+t the process:
>
>   load: 0.14 cmd: mksnap_ffs 2603 [newbuf] 0.00u 10.75s 0% 1160k
>
>   But the top and ps I left running on other ttys have all stopped
>   responding.
>
> If he can press Control-T, it means SIGINFO can be sent to the
> mksnap_ffs process, and the process responds with that information. So,
> the system is not deadlocked -- meaning, I believe what he experiences
> is what others experience (the system becomes completely unusable during
> mksnap_ffs running, but DOES NOT hang or lock up, it just becomes so
> god-awful slow that processes on the machine literally sit and spin for
> minutes at a time).

Unless NOKERNINFO is specified in the local flags in the controlling
terminal's termios, the kernel prints a one-line summary as shown above.
This is done from the tty discipline input handler (or whatever it is in
the new tty code). No process cooperation is required.

On the other hand, actually delivering SIGINFO and getting output from
a process-installed handler does require the process to be either
executing in usermode or sleeping interruptibly.
Re: System deadlock when using mksnap_ffs
On Wed, Nov 12, 2008 at 08:42:00PM -0800, Jeremy Chadwick wrote:
> On Thu, Nov 13, 2008 at 12:41:02AM +0000, Tim Bishop wrote:
>> On Wed, Nov 12, 2008 at 09:47:35PM +0200, Kostik Belousov wrote:
>>> On Wed, Nov 12, 2008 at 05:58:26PM +0000, Tim Bishop wrote:
>>>> I've been playing around with snapshots lately but I've got a
>>>> problem on one of my servers running 7-STABLE amd64:
>>>>
>>>> FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10 20:49:51 GMT 2008 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/PALADIN amd64
>>>>
>>>> I run the mksnap_ffs command to take the snapshot and some time
>>>> later the system completely freezes up:
>>>>
>>>> paladin# cd /u2/.snap/
>>>> paladin# mksnap_ffs /u2 test.1
>>>>
>>>> It only happens on this one filesystem, though, which might be to
>>>> do with its size. It's not over the 2TB marker, but it's pretty
>>>> close. It's also backed by a hardware RAID system, although a
>>>> smaller filesystem on the same RAID has no issues.
>>>>
>>>> Filesystem   1K-blocks      Used     Avail Capacity  Mounted on
>>>> /dev/da0s1a 2078881084 921821396 990749202    48%    /u2
>>>>
>>>> To clarify "completely freezes up": unresponsive to all services
>>>> over the network, except ping. On the console I can switch between
>>>> the ttys, but none of them respond. The only way out is to hit the
>>>> reset button.
>>>
>>> You need to provide information described in the
>>> http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug.html
>>> and especially
>>> http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
>>
>> Ok, I've done that, and removed the patch that seemed to fix things.
>>
>> The first thing I notice after doing this on the console is that I
>> can still ctrl+t the process:
>>
>> load: 0.14 cmd: mksnap_ffs 2603 [newbuf] 0.00u 10.75s 0% 1160k
>>
>> But the top and ps I left running on other ttys have all stopped
>> responding.
>
> Then in my book, the patch didn't fix anything. :-) The system is
> still deadlocking; snapshot generation **should not** wedge the system
> hard like this.

You systematically mix two completely different issues:
- the first one is the _deadlock_ experienced by Tim;
- the second one is the slowdown during snapshot creation.

In fact, I may count a third, where dump itself hangs as a usermode
process but the kernel still operates normally.

The posted patch should fix or paper over the first issue for practical
purposes. The third issue was most likely fixed by the subr_sleepqueue
race fix.
Re: System deadlock when using mksnap_ffs
On Wed, Nov 12, 2008 at 08:42:00PM -0800, Jeremy Chadwick wrote:
> The rest of the below information is good -- but I'm confused about
> something: is there anyone out there who can use mksnap_ffs on a
> filesystem (/usr is a good test source) and NOT experience this
> deadlocking problem?
>
> Literally *every* FreeBSD box I have root access to suffers from this
> problem, so I'm a little baffled why we end-users need to keep
> providing debugging output when it should be easy as pie for a
> developer to do dump -0 -L -a -f /path/fs.dump /usr and watch their
> system wedge.

As an answer to the question (and additional information), I am
experiencing the problem, but not on all filesystems.

This is under FreeBSD 7.1-PRERELEASE #7: Thu Nov 6 11:29:52 CET 2008,
amd64 (from sources csup'ed immediately prior to the build).

I have four filesystems used for data storage:

/dev/da1p1   96850470   7866026   81236408     9%    /export/mail
/dev/da1p2 1937058312 972070320  810023328    55%    /export/home
/dev/da1p3 1937058312  79027008 1703066640     4%    /export/misc
/dev/da1p4 2598991534 271980564 2119091648    11%    /export/spare

I can successfully mksnap_ffs the first (smaller) partition, but an
attempt to do so on any of the others causes a lock.

Note: this is a lockup, not a slowdown. The system becomes unresponsive
to any input, and there is no hard drive activity, and this does not
change over a period of more than 12 hours.

-- 
greg byshenk - [EMAIL PROTECTED] - Leiden, NL
Re: System deadlock when using mksnap_ffs
Kostik Belousov writes:
| On Thu, Nov 13, 2008 at 02:45:14AM -0800, Jeremy Chadwick wrote:
[snip]
| > If he can press Control-T, it means SIGINFO can be sent to the
| > mksnap_ffs process, and the process responds with that information. So,
| > the system is not deadlocked -- meaning, I believe what he experiences
| > is what others experience (the system becomes completely unusable during
| > mksnap_ffs running, but DOES NOT hang or lock up, it just becomes so
| > god-awful slow that processes on the machine literally sit and spin for
| > minutes at a time).
|
| Unless NOKERNINFO is specified in the local flags in the controlling
| terminal termios, kernel prints one line summary as shown above. This is
| done from the tty discipline input handler (or whatever it is in new tty
| code). No process cooperation is required. On the other hand, actually
| delivering SIGINFO and getting output from the process-installed
| handler do require process to either executing usermode or sleeping
| interruptible.

Also note that deadlock is not just a locking issue; it can also arise
from other dependency chains. For example: you hit the maximum buffer
cache usage, so the buffer daemon needs to flush things out, but it
can't, since it needs a buffer itself, and it can't get one because
something has to be flushed first. Things get really bad when the buffer
daemon needs a buffer but can't get one! In theory it can go and use
emergency space reserved just for it to get out of this situation, but
if the buffer cache is fragmented such that all available buffers are
too small, then the buffer daemon is stuck on itself.

Note that everything works except for anything that touches the buffer
cache, such as a program coming off disk. A program in memory is okay.

To really get a good picture of this you need to look at the various
buffer cache variables via ddb (i.e. hi, low, running etc.). A while
back I wrote a debugging function to dump the state of things every
minute or so.

There are various loops you can get into. So then you start playing
whack-a-mole. Usually, because of the first bug, you can't hit the 2nd,
3rd and so on, which adds to the fun. Unfortunately there isn't one
magic bullet. These are not new problems, since we hit them in 4.X. I
did start to go over some of these issues with Tor but ran into ENOTIME
on my side :-(

Snapshots can take a very long time to make, depending on the amount of
stuff that has to be snapshotted, and during that time everything has to
be effectively locked out from the file system or the snapshot will be
wrong. This then leads to a need for a good journaling fs that can be
used on big disks (big isn't that big anymore).

Doug A.
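The fragmentation case described above (free buffers exist, but none is
large enough) can be shown with a toy check. This is only a sketch under
simplifying assumptions: can_get_buffer() and the flat array of free
sizes are invented for illustration and do not correspond to the real
buffer cache structures or to any kernel function.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of the "all available buffers are too small" case.
 * Returns 1 if a single free buffer of at least `need` bytes exists,
 * 0 otherwise. If `total` is not NULL, the sum of all free sizes is
 * stored there so the caller can see free space that is unusable.
 */
int
can_get_buffer(const size_t *free_sizes, size_t n, size_t need, size_t *total)
{
	size_t i, sum = 0;
	int found = 0;

	assert(free_sizes != NULL || n == 0);
	for (i = 0; i < n; i++) {
		sum += free_sizes[i];
		if (free_sizes[i] >= need)
			found = 1;	/* one buffer alone is big enough */
	}
	if (total != NULL)
		*total = sum;
	return (found);
}
```

With free sizes of 4096, 4096 and 8192 bytes and a request for 16384
bytes, the function returns 0 even though the total it reports is
exactly 16384: enough space overall, but nothing the requester can
actually use.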
Re: System deadlock when using mksnap_ffs
I'll just chime in briefly. I contacted Jeremy off the list about this
issue a few days ago.

I have one spare i386 box sitting here that I can happily test patches
against; if I can be of help, let me know.

uname -a
FreeBSD localhost.localdomain 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #0: Tue Nov 11 21:40:27 CST 2008 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/GENERIC i386

ident /boot/kernel/kernel | grep sleepqueue
$FreeBSD: src/sys/kern/subr_sleepqueue.c,v 1.39.2.5 2008/09/16 20:01:57 jhb Exp $

It suffers from the behaviour Jeremy described: the box is not
deadlocked during the snapshot, but I might as well walk away from it
because I can't use it.

I'd really like to see this get fixed; I rely on dump for backups.

Regards,
Pat

-- 
Jesus, can't I count on you people!?
--Oh Brother, Where Art Thou, George Clooney
System deadlock when using mksnap_ffs
I've been playing around with snapshots lately but I've got a problem on
one of my servers running 7-STABLE amd64:

FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10 20:49:51 GMT 2008 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/PALADIN amd64

I run the mksnap_ffs command to take the snapshot and some time later
the system completely freezes up:

paladin# cd /u2/.snap/
paladin# mksnap_ffs /u2 test.1

It only happens on this one filesystem, though, which might be to do
with its size. It's not over the 2TB marker, but it's pretty close. It's
also backed by a hardware RAID system, although a smaller filesystem on
the same RAID has no issues.

Filesystem   1K-blocks      Used     Avail Capacity  Mounted on
/dev/da0s1a 2078881084 921821396 990749202    48%    /u2

To clarify "completely freezes up": unresponsive to all services over
the network, except ping. On the console I can switch between the ttys,
but none of them respond. The only way out is to hit the reset button.

Any advice? I'm happy to help debug this further to get to the bottom of
it.

Thanks,

Tim.

-- 
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x5AE7D984
RE: System deadlock when using mksnap_ffs
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:owner-freebsd- [EMAIL PROTECTED] On Behalf Of Tim Bishop
> Sent: 12 November 2008 07:58 PM
> To: freebsd-stable@freebsd.org
> Cc: [EMAIL PROTECTED]
> Subject: System deadlock when using mksnap_ffs
>
> FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10 20:49:51 GMT 2008 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/PALADIN amd64
>
> I run the mksnap_ffs command to take the snapshot and some time later
> the system completely freezes up:

If the file system is UFS2 it's a known problem but should have been
fixed.

http://wiki.freebsd.org/JeremyChadwick/Commonly_reported_issues

ident /boot/kernel/kernel | grep subr_sleepqueue

version should be greater than 1.39.2.3?

Regards
--
David Peall :: IT Manager
e-Schools' Network :: http://www.esn.org.za/
Phone +27 (021) 674-9140
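For anyone wanting to script the "greater than 1.39.2.3" check, a small
comparison helper might look like the sketch below. The name revcmp() is
made up for illustration; it does a plain component-by-component numeric
comparison of dotted revision strings, which is only meaningful when
both revisions sit on the same CVS branch (here 1.39.2.x). Extracting
the revision from the ident output is left out.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical helper: compare two dotted revision strings such as
 * "1.39.2.5" and "1.39.2.3" numerically, component by component.
 * Returns <0, 0 or >0 in the manner of strcmp(). A missing trailing
 * component counts as 0, so "1.39" sorts before "1.39.2.3".
 */
int
revcmp(const char *a, const char *b)
{
	assert(a != NULL && b != NULL);
	while (*a != '\0' || *b != '\0') {
		char *ea, *eb;
		long na = strtol(a, &ea, 10);
		long nb = strtol(b, &eb, 10);

		if (na != nb)
			return (na < nb ? -1 : 1);
		if (ea == a && eb == b)
			return (0);	/* malformed input: no digits left to compare */
		/* Step over the '.' separating components, if present. */
		a = (*ea == '.') ? ea + 1 : ea;
		b = (*eb == '.') ? eb + 1 : eb;
	}
	return (0);
}
```

With the values from this thread, revcmp("1.39.2.5", "1.39.2.3") is
positive, i.e. the kernel in question already carries a
subr_sleepqueue.c newer than the revision named on the wiki page. A
numeric comparison matters here because a string comparison would sort
"1.39.2.10" before "1.39.2.3".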
Re: System deadlock when using mksnap_ffs
On Wed, Nov 12, 2008 at 06:22:35PM +0000, Tim Bishop wrote:
> On Wed, Nov 12, 2008 at 08:10:50PM +0200, David Peall wrote:
>>> FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10 20:49:51 GMT 2008 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/PALADIN amd64
>>>
>>> I run the mksnap_ffs command to take the snapshot and some time
>>> later the system completely freezes up:
>>
>> If the file system is UFS2 it's a known problem but should have been
>> fixed.
>>
>> http://wiki.freebsd.org/JeremyChadwick/Commonly_reported_issues
>>
>> ident /boot/kernel/kernel | grep subr_sleepqueue
>>
>> version should be greater than 1.39.2.3?
>
> Yes it's UFS2, and yes it's greater than 1.39.2.3:
>
> $FreeBSD: src/sys/kern/subr_sleepqueue.c,v 1.39.2.5 2008/09/16 20:01:57 jhb Exp $
>
> Are you sure the problem referenced on that page is the same? It talks
> about dog slow snapshotting, which I see on other filesystems and
> machines. But in this particular case the system is dead, and does not
> recover.

This problem gets brought up every few weeks on average, I think.

The problem still exists regardless of subr_sleepqueue.c 1.39.2.3. I
can still reproduce it on every FreeBSD box I have access to.

The last time I tried it was on 2008/10/24, on a RELENG_7 system built
from source csup'd on the same day. The result of
dump -L -0 -a -f /someplace/fs.dump /usr was the same: the system became
more or less unusable (meaning to the point where you might as well not
try to do anything with it because it's so incredibly slow, especially
with anything I/O-bound), and mksnap_ffs remained in the following state
for many, many minutes:

load: 0.00 cmd: mksnap_ffs 10480 [wdrain] 0.00u 0.06s 0% 1076k

Hitting ^C at this point took 4-5 *full minutes* to execute. While ^C
was (hopefully) executing, the process remained in wdrain state as well.
After the process was terminated fully, the system was again responsive.

That filesystem (/usr) I was dumping:

Filesystem  1K-blocks    Used     Avail Capacity  iused    ifree %iused  Mounted on
/dev/ad4s1f 163815904 3834316 146876316     3%   254752 20942046    1%   /usr

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
Re: System deadlock when using mksnap_ffs
On Wed, Nov 12, 2008 at 05:58:26PM +0000, Tim Bishop wrote:
> I've been playing around with snapshots lately but I've got a problem
> on one of my servers running 7-STABLE amd64:
>
> FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10 20:49:51 GMT 2008 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/PALADIN amd64
>
> I run the mksnap_ffs command to take the snapshot and some time later
> the system completely freezes up:
>
> paladin# cd /u2/.snap/
> paladin# mksnap_ffs /u2 test.1
>
> It only happens on this one filesystem, though, which might be to do
> with its size. It's not over the 2TB marker, but it's pretty close.
> It's also backed by a hardware RAID system, although a smaller
> filesystem on the same RAID has no issues.
>
> Filesystem   1K-blocks      Used     Avail Capacity  Mounted on
> /dev/da0s1a 2078881084 921821396 990749202    48%    /u2
>
> To clarify "completely freezes up": unresponsive to all services over
> the network, except ping. On the console I can switch between the
> ttys, but none of them respond. The only way out is to hit the reset
> button.
>
> Any advice? I'm happy to help debug this further to get to the bottom
> of it.

You need to provide information described in the
http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug.html
and especially
http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
Re: System deadlock when using mksnap_ffs
On Wed, Nov 12, 2008 at 05:58:26PM +0000, Tim Bishop wrote:
> I run the mksnap_ffs command to take the snapshot and some time later
> the system completely freezes up:
>
> paladin# cd /u2/.snap/
> paladin# mksnap_ffs /u2 test.1

Someone (not named because they choose not to reply to the list) gave me
the following patch:

--- sys/ufs/ffs/ffs_snapshot.c.orig	Wed Mar 22 09:42:31 2006
+++ sys/ufs/ffs/ffs_snapshot.c	Mon Nov 20 14:59:13 2006
@@ -282,6 +282,8 @@ restart:
 		if (error)
 			goto out;
 		bawrite(nbp);
+		if (cg % 10 == 0)
+			ffs_syncvnode(vp, MNT_WAIT);
 	}
 	/*
 	 * Copy all the cylinder group maps. Although the
@@ -303,6 +305,8 @@ restart:
 			goto out;
 		error = cgaccount(cg, vp, nbp, 1);
 		bawrite(nbp);
+		if (cg % 10 == 0)
+			ffs_syncvnode(vp, MNT_WAIT);
 		if (error)
 			goto out;
 	}

With the description:

  What can happen is on a big file system it will fill up the buffer
  cache with I/O and then run out. When the buffer cache fills up then
  no more disk I/O can happen :-( When you do a sync, it flushes that
  out to disk so things don't hang.

It seems to work too. But it seems more like a workaround than a fix?

Tim.

-- 
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x5AE7D984
Re: System deadlock when using mksnap_ffs
On Wed, Nov 12, 2008 at 07:49:28PM +0000, Tim Bishop wrote:
> On Wed, Nov 12, 2008 at 05:58:26PM +0000, Tim Bishop wrote:
>> I run the mksnap_ffs command to take the snapshot and some time later
>> the system completely freezes up:
>>
>> paladin# cd /u2/.snap/
>> paladin# mksnap_ffs /u2 test.1
>
> Someone (not named because they choose not to reply to the list) gave
> me the following patch:
>
> --- sys/ufs/ffs/ffs_snapshot.c.orig	Wed Mar 22 09:42:31 2006
> +++ sys/ufs/ffs/ffs_snapshot.c	Mon Nov 20 14:59:13 2006
> @@ -282,6 +282,8 @@ restart:
>  		if (error)
>  			goto out;
>  		bawrite(nbp);
> +		if (cg % 10 == 0)
> +			ffs_syncvnode(vp, MNT_WAIT);
>  	}
>  	/*
>  	 * Copy all the cylinder group maps. Although the
> @@ -303,6 +305,8 @@ restart:
>  			goto out;
>  		error = cgaccount(cg, vp, nbp, 1);
>  		bawrite(nbp);
> +		if (cg % 10 == 0)
> +			ffs_syncvnode(vp, MNT_WAIT);
>  		if (error)
>  			goto out;
>  	}
>
> With the description:
>
>   What can happen is on a big file system it will fill up the buffer
>   cache with I/O and then run out. When the buffer cache fills up then
>   no more disk I/O can happen :-( When you do a sync, it flushes that
>   out to disk so things don't hang.
>
> It seems to work too. But it seems more like a workaround than a fix?

It looks hackish, but in fact it is not that wrong, and I would even say
that it provides a reasonable workaround.

The usual way to prevent a wdrain deadlock is to issue a bwillwrite()
call before any vnode lock is taken. This is sufficient for most VFS
syscalls, which typically put a dozen or fewer dirty buffers into the
delayed write queue.

Snapshot creation does not call bwillwrite() at all, and then does a lot
of async writes, completely saturating the buffer cache with dirty
buffers. bwillwrite() cannot be called after the vnode is locked, and
just forcing a sync for the embryonic snapshot vnode is good enough.

The 10 counter is debatable, but the debate shall be postponed until the
patch goes into the tree. I ask the anonymous submitter to commit it.

Thanks!
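To make the effect of the "cg % 10 == 0" test concrete, here is a toy
model of the loop, not kernel code. The function peak_dirty() and its
parameters are invented for illustration, and the model is deliberately
pessimistic: it assumes no asynchronous write ever completes until
something explicitly waits for it, which is the saturation case
described above rather than normal behaviour.

```c
#include <assert.h>

/*
 * Hypothetical model of the patched snapshot loop. One asynchronous
 * write is queued per cylinder group; the return value is the largest
 * number of writes outstanding at any moment.
 * sync_every == 0 models the unpatched code (never wait); otherwise
 * all outstanding writes are waited for whenever cg % sync_every == 0,
 * mirroring the test added by the patch.
 */
int
peak_dirty(int ncg, int sync_every)
{
	int cg, dirty = 0, peak = 0;

	assert(ncg >= 0 && sync_every >= 0);
	for (cg = 0; cg < ncg; cg++) {
		dirty++;		/* stands in for bawrite(nbp) */
		if (dirty > peak)
			peak = dirty;
		if (sync_every != 0 && cg % sync_every == 0)
			dirty = 0;	/* stands in for ffs_syncvnode(vp, MNT_WAIT) */
	}
	return (peak);
}
```

In this model the unpatched loop lets the backlog grow with the number
of cylinder groups, so a large file system can exceed any fixed buffer
budget, while the patched loop caps the backlog at the sync interval
regardless of file system size. That is also why the exact value of the
counter is a tuning question and not a correctness one.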
Re: System deadlock when using mksnap_ffs
On Wed, Nov 12, 2008 at 08:10:50PM +0200, David Peall wrote:
> > FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10
> > 20:49:51 GMT 2008
> > [EMAIL PROTECTED]:/usr/obj/usr/src/sys/PALADIN amd64
> >
> > I run the mksnap_ffs command to take the snapshot and some time later
> > the system completely freezes up:
>
> If the file system is UFS2 it's a known problem but should have been
> fixed.
> http://wiki.freebsd.org/JeremyChadwick/Commonly_reported_issues
>
> ident /boot/kernel/kernel | grep subr_sleepqueue
> version should be greater than 1.39.2.3?

Yes it's UFS2, and yes it's greater than 1.39.2.3:

$FreeBSD: src/sys/kern/subr_sleepqueue.c,v 1.39.2.5 2008/09/16 20:01:57 jhb Exp $

Are you sure the problem referenced on that page is the same? It talks
about dog slow snapshotting, which I see on other filesystems and
machines. But in this particular case the system is dead, and does not
recover.

Tim.

--
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x5AE7D984
Re: System deadlock when using mksnap_ffs
On Wed, Nov 12, 2008 at 09:47:35PM +0200, Kostik Belousov wrote:
> On Wed, Nov 12, 2008 at 05:58:26PM +0000, Tim Bishop wrote:
> > I've been playing around with snapshots lately but I've got a problem on
> > one of my servers running 7-STABLE amd64:
> >
> > FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10
> > 20:49:51 GMT 2008
> > [EMAIL PROTECTED]:/usr/obj/usr/src/sys/PALADIN amd64
> >
> > I run the mksnap_ffs command to take the snapshot and some time later
> > the system completely freezes up:
> >
> > paladin# cd /u2/.snap/
> > paladin# mksnap_ffs /u2 test.1
> >
> > It only happens on this one filesystem, though, which might be to do
> > with its size. It's not over the 2TB marker, but it's pretty close. It's
> > also backed by a hardware RAID system, although a smaller filesystem on
> > the same RAID has no issues.
> >
> > Filesystem   1K-blocks      Used     Avail Capacity  Mounted on
> > /dev/da0s1a 2078881084 921821396 990749202    48%    /u2
> >
> > To clarify "completely freezes up": unresponsive to all services over
> > the network, except ping. On the console I can switch between the ttys,
> > but none of them respond. The only way out is to hit the reset button.
>
> You need to provide information described in the
> http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug.html
> and especially
> http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html

Ok, I've done that, and removed the patch that seemed to fix things.

The first thing I notice after doing this on the console is that I can
still ctrl+t the process:

load: 0.14 cmd: mksnap_ffs 2603 [newbuf] 0.00u 10.75s 0% 1160k

But the top and ps I left running on other ttys have all stopped
responding.

Also the following kernel message came out:

Expensive timeout(9) function: 0x802ce380(0xff000677ca50) 0.006121001 s

There is also still some disk I/O.

Dropping to ddb worked, but I don't have a serial console so I can't
paste the output.

ps shows mksnap_ffs in newbuf, as we already saw.

A trace of mksnap_ffs looks like this:

Tracing pid 2603 tid 100214 td 0xff0006a0e370
sched_switch() at sched_switch+0x2a1
mi_switch() at mi_switch+0x233
sleepq_switch() at sleepq_switch+0xe9
sleepq_wait() at sleepq_wait+0x44
_sleep() at _sleep+0x351
getnewbuf() at getnewbuf+0x2e1
getblk() at getblk+0x30d
setup_allocindir_phase2() at setup_allocindir_phase2+0x338
softdep_setup_allocindir_page() at softdep_setup_allocindir_page+0xa7
ffs_balloc_ufs2() at ffs_balloc_ufs2+0x121e
ffs_snapshot() at ffs_snapshot+0xc52
ffs_mount() at ffs_mount+0x735
vfs_donmount() at vfs_donmount+0xeb5
kernel_mount() at kernel_mount+0xa1
ffs_cmount() at ffs_cmount+0x92
mount() at mount+0x1cc
syscall() at syscall+0x1f6
Xfast_syscall() at Xfast_syscall+0xab
--- syscall (21, FreeBSD ELF64, mount), rip = 0x80068636c, rsp = 0x7fffe518, rbp = 0x8008447a0 ---

show pcpu shows cpuid 3 (quad core machine) in thread swi6: Giant taskq.
All the other cpus are idle.

show locks shows:

exclusive sleep mutex Giant r = 0 (0x806ae040) locked @ /usr/src/sys/kern/kern_intr.c:1087

There are two other locks shown by show all locks, one for sshd and one
for mysqld, both in kern/uipc_sockbuf.c.

show lockedvnods shows mksnap_ffs has a lock on da0s1a with ffs_vget at
the top of the stack.

Sorry for any typos. I'll sort out a serial cable if more is needed :-)

Tim.

--
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x5AE7D984
Re: System deadlock when using mksnap_ffs
On Thu, Nov 13, 2008 at 12:41:02AM +0000, Tim Bishop wrote:
> On Wed, Nov 12, 2008 at 09:47:35PM +0200, Kostik Belousov wrote:
> > On Wed, Nov 12, 2008 at 05:58:26PM +0000, Tim Bishop wrote:
> > > I've been playing around with snapshots lately but I've got a problem on
> > > one of my servers running 7-STABLE amd64:
> > >
> > > FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10
> > > 20:49:51 GMT 2008
> > > [EMAIL PROTECTED]:/usr/obj/usr/src/sys/PALADIN amd64
> > >
> > > I run the mksnap_ffs command to take the snapshot and some time later
> > > the system completely freezes up:
> > >
> > > paladin# cd /u2/.snap/
> > > paladin# mksnap_ffs /u2 test.1
> > >
> > > It only happens on this one filesystem, though, which might be to do
> > > with its size. It's not over the 2TB marker, but it's pretty close. It's
> > > also backed by a hardware RAID system, although a smaller filesystem on
> > > the same RAID has no issues.
> > >
> > > Filesystem   1K-blocks      Used     Avail Capacity  Mounted on
> > > /dev/da0s1a 2078881084 921821396 990749202    48%    /u2
> > >
> > > To clarify "completely freezes up": unresponsive to all services over
> > > the network, except ping. On the console I can switch between the ttys,
> > > but none of them respond. The only way out is to hit the reset button.
> >
> > You need to provide information described in the
> > http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug.html
> > and especially
> > http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
>
> Ok, I've done that, and removed the patch that seemed to fix things.
>
> The first thing I notice after doing this on the console is that I can
> still ctrl+t the process:
>
> load: 0.14 cmd: mksnap_ffs 2603 [newbuf] 0.00u 10.75s 0% 1160k
>
> But the top and ps I left running on other ttys have all stopped
> responding.

Then in my book, the patch didn't fix anything. :-) The system is
still deadlocking; snapshot generation **should not** wedge the system
hard like this.

Also, during my own testing, I am always able to use Ctrl-T to get
SIGINFO from the running process (mksnap_ffs). That behaviour does not
change for me.

The rest of the below information is good -- but I'm confused about
something: is there anyone out there who can use mksnap_ffs on a
filesystem (/usr is a good test source) and NOT experience this
deadlocking problem?

Literally *every* FreeBSD box I have root access to suffers from this
problem, so I'm a little baffled why we end-users need to keep providing
debugging output when it should be easy as pie for a developer to do
dump -0 -L -a -f /path/fs.dump /usr and watch their system wedge.

Also, a fellow on -fs just mentioned he's having this exact problem:

http://lists.freebsd.org/pipermail/freebsd-fs/2008-November/005324.html

> Also the following kernel message came out:
>
> Expensive timeout(9) function: 0x802ce380(0xff000677ca50) 0.006121001 s
>
> There is also still some disk I/O.
>
> Dropping to ddb worked, but I don't have a serial console so I can't
> paste the output.
>
> ps shows mksnap_ffs in newbuf, as we already saw.
>
> A trace of mksnap_ffs looks like this:
>
> Tracing pid 2603 tid 100214 td 0xff0006a0e370
> sched_switch() at sched_switch+0x2a1
> mi_switch() at mi_switch+0x233
> sleepq_switch() at sleepq_switch+0xe9
> sleepq_wait() at sleepq_wait+0x44
> _sleep() at _sleep+0x351
> getnewbuf() at getnewbuf+0x2e1
> getblk() at getblk+0x30d
> setup_allocindir_phase2() at setup_allocindir_phase2+0x338
> softdep_setup_allocindir_page() at softdep_setup_allocindir_page+0xa7
> ffs_balloc_ufs2() at ffs_balloc_ufs2+0x121e
> ffs_snapshot() at ffs_snapshot+0xc52
> ffs_mount() at ffs_mount+0x735
> vfs_donmount() at vfs_donmount+0xeb5
> kernel_mount() at kernel_mount+0xa1
> ffs_cmount() at ffs_cmount+0x92
> mount() at mount+0x1cc
> syscall() at syscall+0x1f6
> Xfast_syscall() at Xfast_syscall+0xab
> --- syscall (21, FreeBSD ELF64, mount), rip = 0x80068636c, rsp = 0x7fffe518, rbp = 0x8008447a0 ---
>
> show pcpu shows cpuid 3 (quad core machine) in thread swi6: Giant taskq.
> All the other cpus are idle.
>
> show locks shows:
>
> exclusive sleep mutex Giant r = 0 (0x806ae040) locked @ /usr/src/sys/kern/kern_intr.c:1087
>
> There are two other locks shown by show all locks, one for sshd and one
> for mysqld, both in kern/uipc_sockbuf.c.
>
> show lockedvnods shows mksnap_ffs has a lock on da0s1a with ffs_vget at
> the top of the stack.
>
> Sorry for any typos. I'll sort out a serial cable if more is needed :-)
>
> Tim.

--
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
Re: System deadlock when using mksnap_ffs
Jeremy Chadwick writes:
[snip]
| The rest of the below information is good -- but I'm confused about
| something: is there anyone out there who can use mksnap_ffs on a
| filesystem (/usr is a good test source) and NOT experience this
| deadlocking problem? Literally *every* FreeBSD box I have root access
| to suffers from this problem, so I'm a little baffled why we end-users
| need to keep providing debugging output when it should be easy as pie
| for a developer to do dump -0 -L -a -f /path/fs.dump /usr and watch
| their system wedge.

We can at work, but we have a bunch of other patches. There are a few
problems with the buffer cache:
    1) The buffer daemon can't use the space that is reserved for
       it, since to flush some stuff it needs to use more buffers.
    2) The buffer cache can get fragmented, which prevents the large
       I/O that the buffer daemon may need.
    3) Other issues ...

I have a fix for 1. It is pretty easy. I have a hack'ish fix for 2, in
that I make all requests use the max size so it can't get fragmented,
since there is no code to defrag and it isn't trivial to defrag the
memory. I have some fixes for some other issues, but there were some
review issues with them. I might just commit the fixes for 1 and 2. It
makes things better and there were no objections at the time. We have
the patches in shipping products.

I can try to do some experiments at work like you said, since I had
similar things working before and it is pretty easy to put in printf's
to see the issue.

Doug A.
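The idea behind the fix for problem 2 above (every request takes a
maximum-size unit, so free space can never end up scattered in pieces too
small to use) can be illustrated with a small sketch in C. This is only a
toy under stated assumptions, not the actual patch: struct slot_pool,
slot_alloc(), slot_free(), SLOT_COUNT and SLOT_SIZE are hypothetical names
and values chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SLOT_COUNT 8      /* hypothetical number of buffers in the pool */
#define SLOT_SIZE  65536  /* hypothetical maximum request size, in bytes */

/* Pool in which every allocation occupies one whole max-size slot. */
struct slot_pool {
    bool used[SLOT_COUNT];
};

/* Allocate a slot for a request of `size` bytes. The request always
 * consumes a full slot regardless of size, which wastes space for small
 * requests but means any free slot can satisfy a max-size request.
 * Returns the slot index, or -1 if every slot is in use. */
int slot_alloc(struct slot_pool *p, size_t size)
{
    assert(p != NULL);
    assert(size <= SLOT_SIZE);
    for (int i = 0; i < SLOT_COUNT; i++) {
        if (!p->used[i]) {
            p->used[i] = true;
            return i;
        }
    }
    return -1;
}

/* Return a slot to the pool. */
void slot_free(struct slot_pool *p, int i)
{
    assert(p != NULL);
    assert(i >= 0 && i < SLOT_COUNT);
    p->used[i] = false;
}
```

If the pool is filled with small requests and every other slot is then
freed, a maximum-size request still succeeds immediately. With
variable-size allocations the same free/used pattern could leave only
small, non-adjacent holes. The trade-off is the memory wasted on small
requests.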
Re: System deadlock when using mksnap_ffs
On Wed, Nov 12, 2008 at 08:42:00PM -0800, Jeremy Chadwick wrote:
> ...
> > > On Wed, Nov 12, 2008 at 05:58:26PM +0000, Tim Bishop wrote:
> > > > I've been playing around with snapshots lately but I've got a problem on
> > > > one of my servers running 7-STABLE amd64:
> > > >
> > > > FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10
> > > > 20:49:51 GMT 2008
> > > > [EMAIL PROTECTED]:/usr/obj/usr/src/sys/PALADIN amd64
> > > >
> > > > I run the mksnap_ffs command to take the snapshot and some time later
> > > > the system completely freezes up:
> > > >
> > > > paladin# cd /u2/.snap/
> > > > paladin# mksnap_ffs /u2 test.1
> > > >
> > > > It only happens on this one filesystem, though, which might be to do
> > > > with its size. It's not over the 2TB marker, but it's pretty close. It's
> > > > also backed by a hardware RAID system, although a smaller filesystem on
> > > > the same RAID has no issues.
> ...
> Then in my book, the patch didn't fix anything. :-) The system is
> still deadlocking; snapshot generation **should not** wedge the system
> hard like this.
>
> Also, during my own testing, I am always able to use Ctrl-T to get
> SIGINFO from the running process (mksnap_ffs). That behaviour does not
> change for me.
>
> The rest of the below information is good -- but I'm confused about
> something: is there anyone out there who can use mksnap_ffs on a
> filesystem (/usr is a good test source) and NOT experience this
> deadlocking problem?

I hadn't ever tried until I saw your message. Granted, I'm using a
smaller file system (I doubt that I have a total of as much as 2 TB in
all my machines combined), and I'm running i386, vs. amd64. But it ran
just fine. I wasn't able to test SIGINFO; it finished before I had a
chance. (I ran it under time(1); wall clock time was 0.91 sec.)

> Literally *every* FreeBSD box I have root access to suffers from this
> problem, so I'm a little baffled why we end-users need to keep providing
> debugging output when it should be easy as pie for a developer to do
> dump -0 -L -a -f /path/fs.dump /usr and watch their system wedge.

Well, I routinely use dump/restore pipelines to copy file systems
around; never had a problem with it.

> ...

For reference:

freebeast(7.1-P)[9] uname -a
FreeBSD freebeast.catwhisker.org 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #127: Wed Nov 12 05:16:20 PST 2008 [EMAIL PROTECTED]:/common/S3/obj/usr/src/sys/FREEBEAST i386
freebeast(7.1-P)[10] ls -la
total 4
drwxrwxr-x   2 root  operator  512 Nov 12 20:53 .
drwxr-xr-x  14 root  wheel     512 Jan 22  2008 ..
freebeast(7.1-P)[11] /usr/bin/time -l mksnap_ffs /S2/usr test.1
        0.91 real         0.00 user         0.05 sys
       976  maximum resident set size
         3  average shared memory size
       627  average unshared data size
       109  average unshared stack size
       104  page reclaims
         0  page faults
         0  swaps
         1  block input operations
       230  block output operations
         0  messages sent
         0  messages received
         0  signals received
       101  voluntary context switches
        34  involuntary context switches
freebeast(7.1-P)[12] ls -la
total 1460
drwxrwxr-x   2 root  operator         512 Nov 12 20:54 .
drwxr-xr-x  14 root  wheel            512 Jan 22  2008 ..
-r--r-----   1 root  operator  2410791056 Nov 12 20:54 test.1
freebeast(7.1-P)[13]

Peace,
david
--
David H. Wolfskill [EMAIL PROTECTED]
Depriving a girl or boy of an opportunity for education is evil.

See http://www.catwhisker.org/~david/publickey.gpg for my public key.
Re: System deadlock when using mksnap_ffs
Kostik Belousov writes:
| On Wed, Nov 12, 2008 at 07:49:28PM +0000, Tim Bishop wrote:
| > On Wed, Nov 12, 2008 at 05:58:26PM +0000, Tim Bishop wrote:
| > > I run the mksnap_ffs command to take the snapshot and some time later
| > > the system completely freezes up:
| > >
| > > paladin# cd /u2/.snap/
| > > paladin# mksnap_ffs /u2 test.1
| >
| > Someone (not named because they choose not to reply to the list) gave me
| > the following patch:
| >
| > --- sys/ufs/ffs/ffs_snapshot.c.orig Wed Mar 22 09:42:31 2006
| > +++ sys/ufs/ffs/ffs_snapshot.c Mon Nov 20 14:59:13 2006
| > @@ -282,6 +282,8 @@ restart:
| >          if (error)
| >              goto out;
| >          bawrite(nbp);
| > +        if (cg % 10 == 0)
| > +            ffs_syncvnode(vp, MNT_WAIT);
| >      }
| >      /*
| >       * Copy all the cylinder group maps. Although the
| > @@ -303,6 +305,8 @@ restart:
| >              goto out;
| >          error = cgaccount(cg, vp, nbp, 1);
| >          bawrite(nbp);
| > +        if (cg % 10 == 0)
| > +            ffs_syncvnode(vp, MNT_WAIT);
| >          if (error)
| >              goto out;
| >      }
| >
| > With the description:
| >
| >     What can happen is on a big file system it will fill up the buffer
| >     cache with I/O and then run out. When the buffer cache fills up then no
| >     more disk I/O can happen :-( When you do a sync, it flushes that out to
| >     disk so things don't hang.
| >
| > It seems to work too. But it seems more like a workaround than a fix?
|
| It looks hackish, but in fact it is not that wrong, and I would even say
| that it provides a reasonable workaround.
|
| The usual way to prevent a wdrain deadlock is to issue a bwillwrite() call
| before any vnode lock is taken. This is sufficient for most VFS syscalls,
| which typically put a dozen or fewer dirty buffers into the delayed write
| queue.
|
| Snapshot creation does not call bwillwrite() at all, and then does a lot
| of async writes, completely saturating the buffer cache with dirty buffers.
| bwillwrite() cannot be called after the vnode is locked, and just forcing
| a sync for the embryonic snapshot vnode is good enough.
|
| The 10 counter is debatable, but debate shall be postponed until the patch
| goes into the tree. I ask the anonymous submitter to commit it.
|
| Thanks!

I plan to commit it tomorrow since I sent it to Tim to test. The 10
can be tuned, but it has kept a bunch of machines at work up. Glad
people don't think it is too wrong :-) It probably could be made a
little more dynamic, but I wonder if it would show any real performance
difference, and it might risk more bugs.

Doug A.
Re: System deadlock when using mksnap_ffs
On Wed, Nov 12, 2008 at 09:02:50PM -0800, David Wolfskill wrote:
> On Wed, Nov 12, 2008 at 08:42:00PM -0800, Jeremy Chadwick wrote:
> > ...
> > > > On Wed, Nov 12, 2008 at 05:58:26PM +0000, Tim Bishop wrote:
> > > > > I've been playing around with snapshots lately but I've got a problem on
> > > > > one of my servers running 7-STABLE amd64:
> > > > >
> > > > > FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10
> > > > > 20:49:51 GMT 2008
> > > > > [EMAIL PROTECTED]:/usr/obj/usr/src/sys/PALADIN amd64
> > > > >
> > > > > I run the mksnap_ffs command to take the snapshot and some time later
> > > > > the system completely freezes up:
> > > > >
> > > > > paladin# cd /u2/.snap/
> > > > > paladin# mksnap_ffs /u2 test.1
> > > > >
> > > > > It only happens on this one filesystem, though, which might be to do
> > > > > with its size. It's not over the 2TB marker, but it's pretty close. It's
> > > > > also backed by a hardware RAID system, although a smaller filesystem on
> > > > > the same RAID has no issues.
> > ...
> > Then in my book, the patch didn't fix anything. :-) The system is
> > still deadlocking; snapshot generation **should not** wedge the system
> > hard like this.
> >
> > Also, during my own testing, I am always able to use Ctrl-T to get
> > SIGINFO from the running process (mksnap_ffs). That behaviour does not
> > change for me.
> >
> > The rest of the below information is good -- but I'm confused about
> > something: is there anyone out there who can use mksnap_ffs on a
> > filesystem (/usr is a good test source) and NOT experience this
> > deadlocking problem?
>
> I hadn't ever tried until I saw your message. Granted, I'm using a
> smaller file system (I doubt that I have a toital of as much as 2 TB in
> all my machines combined), and I'm running i386, vs. amd64. But it ran
> just fine. I wasn't able to test SIGINFO; it finished before I had a
> chance. (I ran it under time(1); wall clock time was 0.91 sec.)
>
> > Literally *every* FreeBSD box I have root access to suffers from this
> > problem, so I'm a little baffled why we end-users need to keep providing
> > debugging output when it should be easy as pie for a developer to do
> > dump -0 -L -a -f /path/fs.dump /usr and watch their system wedge.
>
> Well, I routinely use dump/restore pipelines to copy file systems
> around; never had a problem with it.
>
> > ...
>
> For reference:
>
> freebeast(7.1-P)[9] uname -a
> FreeBSD freebeast.catwhisker.org 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #127: Wed Nov 12 05:16:20 PST 2008 [EMAIL PROTECTED]:/common/S3/obj/usr/src/sys/FREEBEAST i386
> freebeast(7.1-P)[10] ls -la
> total 4
> drwxrwxr-x   2 root  operator  512 Nov 12 20:53 .
> drwxr-xr-x  14 root  wheel     512 Jan 22  2008 ..
> freebeast(7.1-P)[11] /usr/bin/time -l mksnap_ffs /S2/usr test.1
>         0.91 real         0.00 user         0.05 sys
>        976  maximum resident set size
>          3  average shared memory size
>        627  average unshared data size
>        109  average unshared stack size
>        104  page reclaims
>          0  page faults
>          0  swaps
>          1  block input operations
>        230  block output operations
>          0  messages sent
>          0  messages received
>          0  signals received
>        101  voluntary context switches
>         34  involuntary context switches
> freebeast(7.1-P)[12] ls -la
> total 1460
> drwxrwxr-x   2 root  operator         512 Nov 12 20:54 .
> drwxr-xr-x  14 root  wheel            512 Jan 22  2008 ..
> -r--r-----   1 root  operator  2410791056 Nov 12 20:54 test.1
> freebeast(7.1-P)[13]

David, thanks for chiming in. This is exactly what I was
fearing/worried about.

It would be greatly beneficial if we could figure out what triggers the
slowdown for a lot of us, since for others (proof above) mksnap_ffs
behaves as expected.

Since I'm able to reproduce this pretty much everywhere, here's
information:

# df -ki /usr
Filesystem  1024-blocks    Used     Avail Capacity  iused    ifree %iused  Mounted on
/dev/ad4s1f   163815904 3835274 146875358     3%   254864 20941934    1%   /usr
# cd /usr/.snap
# /usr/bin/time -l mksnap_ffs /usr test.1

after about 20 seconds, hitting Ctrl-T

load: 1.90 cmd: mksnap_ffs 11719 [wdrain] 0.00u 0.07s 0% 1092k
       23.25 real         0.00 user         0.00 sys
      135.98 real         0.00 user         0.62 sys
      1092  maximum resident set size
         4  average shared memory size
      1081  average unshared data size
       135  average unshared stack size
       101  page reclaims
         0  page faults
         0  swaps
       895  block input operations
     13444  block output operations
         0  messages sent
         0  messages received
         0  signals received
      6433  voluntary context switches
       197  involuntary context switches
# ls -l test.1
-r--r-----  1 root  operator  173203463240 Nov 12 21:42 test.1

David's filesystem is 2GBs, while mine is 16GB. His snap takes under 1
second, yet mine takes over 2 minutes. Possibly the large deviation is
explained by the amount of space used on the
Re: System deadlock when using mksnap_ffs
Quoting Jeremy Chadwick, who wrote on Wed, Nov 12, 2008 at 08:42:00PM -0800 ..
> On Thu, Nov 13, 2008 at 12:41:02AM +0000, Tim Bishop wrote:
> > On Wed, Nov 12, 2008 at 09:47:35PM +0200, Kostik Belousov wrote:
> > > On Wed, Nov 12, 2008 at 05:58:26PM +0000, Tim Bishop wrote:
> > > > I've been playing around with snapshots lately but I've got a problem on
> > > > one of my servers running 7-STABLE amd64:
> > > >
> > > > FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10
> > > > 20:49:51 GMT 2008
> > > > [EMAIL PROTECTED]:/usr/obj/usr/src/sys/PALADIN amd64
> > > >
> > > > I run the mksnap_ffs command to take the snapshot and some time later
> > > > the system completely freezes up:
> > > >
> > > > paladin# cd /u2/.snap/
> > > > paladin# mksnap_ffs /u2 test.1
> > > >
> > > > It only happens on this one filesystem, though, which might be to do
> > > > with its size. It's not over the 2TB marker, but it's pretty close. It's
> > > > also backed by a hardware RAID system, although a smaller filesystem on
> > > > the same RAID has no issues.
> > > >
> > > > Filesystem   1K-blocks      Used     Avail Capacity  Mounted on
> > > > /dev/da0s1a 2078881084 921821396 990749202    48%    /u2
> > > >
> > > > To clarify "completely freezes up": unresponsive to all services over
> > > > the network, except ping. On the console I can switch between the ttys,
> > > > but none of them respond. The only way out is to hit the reset button.
> > >
> > > You need to provide information described in the
> > > http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug.html
> > > and especially
> > > http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
> >
> > Ok, I've done that, and removed the patch that seemed to fix things.
> >
> > The first thing I notice after doing this on the console is that I can
> > still ctrl+t the process:
> >
> > load: 0.14 cmd: mksnap_ffs 2603 [newbuf] 0.00u 10.75s 0% 1160k
> >
> > But the top and ps I left running on other ttys have all stopped
> > responding.
>
> Then in my book, the patch didn't fix anything. :-) The system is
> still deadlocking; snapshot generation **should not** wedge the system
> hard like this.
>
> Also, during my own testing, I am always able to use Ctrl-T to get
> SIGINFO from the running process (mksnap_ffs). That behaviour does not
> change for me.
>
> The rest of the below information is good -- but I'm confused about
> something: is there anyone out there who can use mksnap_ffs on a
> filesystem (/usr is a good test source) and NOT experience this
> deadlocking problem?
>
> Literally *every* FreeBSD box I have root access to suffers from this
> problem, so I'm a little baffled why we end-users need to keep providing
> debugging output when it should be easy as pie for a developer to do
> dump -0 -L -a -f /path/fs.dump /usr and watch their system wedge.

dump -L on my RELENG_7 machine does not wedge it. So there must be
multiple factors influencing the snap creating problems or not.

Wilko
Re: System deadlock when using mksnap_ffs
On 2008-Nov-12 20:47:37 -0800, Doug Ambrisko [EMAIL PROTECTED] wrote:
> I plan to commit it tomorrow since I sent it to Tim to test. The 10
> can be tuned but it has kept a bunch of machines at work up. Glad
> people don't think it is that it is to wrong :-) It probably could be
> made a little more dynamic but I wonder if it would show any real
> performance difference and might risk more bugs.

FWIW, I've been running the patch since I first saw Doug post it in
Feb 2006 and don't recall ever having problems with mksnap_ffs since
applying it (I did before).

--
Peter Jeremy
Please excuse any delays as the result of my ISP's inability to implement
an MTA that is either RFC2821-compliant or matches their claimed behaviour.
Re: System deadlock when using mksnap_ffs
(moving my thread from -fs to -stable)

Before touching anything, here's a description of the symptoms I see...

Rather busy system, with quite a bit of filesystem activity occurring
while the snapshot is being made. Quad CPU amd64 box with 16GB of RAM,
6x10Krpm RAID array. Should be reasonably fast.

Filesystem  1K-blocks     Used     Avail Capacity   iused    ifree %iused  Mounted on
/dev/da0s1a 739339824 74357926 605834714    11%   1718540 93855474    2%   /

1.7 million inodes, 71G used of a 705G volume.

Here's a timeline of what I see when starting to make a new snapshot.
I've got a few windows running, showing top, iostat, etc.

Baseline disk activity before starting anything:

device     r/s   w/s    kr/s    kw/s wait svc_t  %b
da0       24.0   2.0   355.6    32.0    1  10.7  28

0m0s: Snapshot begins, using mount -u -o snapshot //.snap/weekly.0 /
Drives immediately jump to 100% busy as expected.

device     r/s   w/s    kr/s    kw/s wait svc_t  %b
da0      153.8   6.0  3378.6    95.9    2  16.9 100

The mount process is spending 100% of its time in biord.

2m10s: The mount process starts spending more and more time in
snaplk, alternating with biord.

device     r/s   w/s    kr/s    kw/s wait svc_t  %b
da0       77.9  67.9  1270.7  3754.2    1  10.7 100

12m15s: The first intermittent slowdowns start affecting other
processes on the system. Occasionally all active processes will get
stuck in snaplk or ufs for 5-10 seconds before resuming.

device     r/s   w/s    kr/s    kw/s wait svc_t  %b
da0       77.9  31.0  1150.8  1054.9    1  10.4 100

114m47s: Active processes are briefly stuck in suspfs.

115m22s: Mount is now in snaprdb. Active processes are now
completely stuck in snaplk. Still responsive to SIGINFO, top is
still running, etc. Just hangs any time anything needs the filesystem.

device     r/s   w/s    kr/s    kw/s wait svc_t  %b
da0      238.8   0.0  3820.1     0.0    1   4.1  99

143m19s: Mount now in wdrain.

143m34s: Finished. Snapshot logging shows
/: suspended 13.308 sec, redo 153 of 4058

Most processes were hung for 28 minutes.

Is this what others are seeing? It sounds like some of the complaints
are about it getting stuck in the wdrain state, not what I'm showing
here.

Another mildly annoying note: any process that touches .snap while a
snapshot is being generated gets stuck in ufs until it finishes. I
can understand wanting to keep operations in there in sync, but it would
be really nice if find / wouldn't get hung when it tries to descend into
.snap, for example.

ts5# cd /.snap
ts5# ls -l
^T
load: 0.17 cmd: ls 3696 [ufs] 0.00u 0.00s 0% 1496k
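As a quick check of the "hung for 28 minutes" figure in the timeline
above, the interval between the 115m22s stamp (processes completely stuck
in snaplk) and the 143m34s stamp (finished) can be worked out directly.
The helper names stamp_seconds() and elapsed_seconds() are made up for
this sketch, and it assumes those two stamps are the right endpoints for
the hang.

```c
#include <assert.h>

/* Convert a timeline stamp of the form "<minutes>m<seconds>s" to seconds. */
int stamp_seconds(int minutes, int seconds)
{
    assert(minutes >= 0);
    assert(seconds >= 0 && seconds < 60);
    return minutes * 60 + seconds;
}

/* Seconds elapsed between two stamps; the second must not be earlier. */
int elapsed_seconds(int m0, int s0, int m1, int s1)
{
    int start = stamp_seconds(m0, s0);
    int end = stamp_seconds(m1, s1);
    assert(end >= start);
    return end - start;
}
```

elapsed_seconds(115, 22, 143, 34) gives 1692 seconds, i.e. 28 minutes and
12 seconds, which matches the reported figure. The 13.308 seconds reported
by the snapshot logging is the filesystem suspension time, a much shorter
interval than the period during which processes were blocked.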