Re: System deadlock when using mksnap_ffs

2008-11-14 Thread Greg Byshenk
On Thu, Nov 13, 2008 at 05:08:10PM +0100, Greg Byshenk wrote:
 On Wed, Nov 12, 2008 at 08:42:00PM -0800, Jeremy Chadwick wrote:
  
  The rest of the below information is good -- but I'm confused about
  something: is there anyone out there who can use mksnap_ffs on a
  filesystem (/usr is a good test source) and NOT experience this
  deadlocking problem?  Literally *every* FreeBSD box I have root access
  to suffers from this problem, so I'm a little baffled why we end-users
  need to keep providing debugging output when it should be easy as pie
  for a developer to do dump -0 -L -a -f /path/fs.dump /usr and watch
  their system wedge.
 
 As an answer to the question (and additional information), I am 
 experiencing the problem, but not on all filesystems. 
 
 This is under FreeBSD 7.1-PRERELEASE #7: Thu Nov  6 11:29:52 CET 2008,
 amd64 (from sources csup'ed immediately prior to the build).
 
 I have four filesystems used for data storage:
 
 /dev/da1p1    96850470   7866026   81236408     9%    /export/mail
 /dev/da1p2  1937058312 972070320  810023328    55%    /export/home
 /dev/da1p3  1937058312  79027008 1703066640     4%    /export/misc
 /dev/da1p4  2598991534 271980564 2119091648    11%    /export/spare
 
 I can successfully mksnap_ffs the first (smaller) partition, but an
 attempt to do so on any of the others causes a lock.
 
 Note: this is a lockup, not a slow.  The system becomes unresponsive
 to any input, and there is no hard drive activity, and this does not
 change over a period of more than 12 hours.


As a followup to my own post: after reading this discussion, I applied
the patches and rebuilt my system last night.

As of today, with the patched ffs_snapshot.c, I can now make snapshots
of all the filesystems listed above.  It takes rather a long time, but
that is to be expected, I think, and the snapshots finish normally.


-- 
greg byshenk  -  [EMAIL PROTECTED]  -  Leiden, NL
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: System deadlock when using mksnap_ffs

2008-11-13 Thread Tim Bishop
Jeremy,

On Wed, Nov 12, 2008 at 08:42:00PM -0800, Jeremy Chadwick wrote:
 On Thu, Nov 13, 2008 at 12:41:02AM +0000, Tim Bishop wrote:
  On Wed, Nov 12, 2008 at 09:47:35PM +0200, Kostik Belousov wrote:
   On Wed, Nov 12, 2008 at 05:58:26PM +0000, Tim Bishop wrote:
I run the mksnap_ffs command to take the snapshot and some time later
the system completely freezes up:

paladin# cd /u2/.snap/
paladin# mksnap_ffs /u2 test.1
   
   You need to provide information described in the
   http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug.html
   and especially
   http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
  
  Ok, I've done that, and removed the patch that seemed to fix things.
  
  The first thing I notice after doing this on the console is that I can
  still ctrl+t the process:
  
  load: 0.14  cmd: mksnap_ffs 2603 [newbuf] 0.00u 10.75s 0% 1160k
  
  But the top and ps I left running on other ttys have all stopped
  responding.
 
 Then in my book, the patch didn't fix anything.  :-)  The system is
 still deadlocking; snapshot generation **should not** wedge the system
 hard like this.

You missed the part where I said I removed the patch. I did that so I
could provide details with it wedged.

I agree that there are still some fundamental speed issues with
snapshotting though. And I'm sure the FS itself will still be locked out
for a while during the snapshot. But with the patch at least the whole
thing doesn't lock up.

Tim.

-- 
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x5AE7D984


Re: System deadlock when using mksnap_ffs

2008-11-13 Thread Jeremy Chadwick
On Wed, Nov 12, 2008 at 10:05:21PM -0800, Jeremy Chadwick wrote:
 On Wed, Nov 12, 2008 at 09:02:50PM -0800, David Wolfskill wrote:
  On Wed, Nov 12, 2008 at 08:42:00PM -0800, Jeremy Chadwick wrote:
   ...
 On Wed, Nov 12, 2008 at 05:58:26PM +0000, Tim Bishop wrote:
  I've been playing around with snapshots lately but I've got a problem on
  one of my servers running 7-STABLE amd64:
  
  FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10
  20:49:51 GMT 2008 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/PALADIN  amd64
  
  I run the mksnap_ffs command to take the snapshot and some time later
  the system completely freezes up:
  
  paladin# cd /u2/.snap/
  paladin# mksnap_ffs /u2 test.1
  
  It only happens on this one filesystem, though, which might be to do
  with its size. It's not over the 2TB marker, but it's pretty close. It's
  also backed by a hardware RAID system, although a smaller filesystem on
  the same RAID has no issues.
   ...
   Then in my book, the patch didn't fix anything.  :-)  The system is
   still deadlocking; snapshot generation **should not** wedge the system
   hard like this.
   
   Also, during my own testing, I am always able to use Ctrl-T to get
   SIGINFO from the running process (mksnap_ffs).  That behaviour does not
   change for me.
   
   The rest of the below information is good -- but I'm confused about
   something: is there anyone out there who can use mksnap_ffs on a
   filesystem (/usr is a good test source) and NOT experience this
   deadlocking problem?
  
  I hadn't ever tried until I saw your message.  Granted, I'm using a
  smaller file system (I doubt that I have a total of as much as 2 TB in
  all my machines combined), and I'm running i386, vs. amd64.  But it ran
  just fine.  I wasn't able to test SIGINFO; it finished before I had a
  chance.  (I ran it under time(1); wall clock time was 0.91 sec.)
  
   Literally *every* FreeBSD box I have root access
   to suffers from this problem, so I'm a little baffled why we end-users
   need to keep providing debugging output when it should be easy as pie
   for a developer to do dump -0 -L -a -f /path/fs.dump /usr and watch
   their system wedge.
  
  Well, I routinely use dump/restore pipelines to copy file systems
  around; never had a problem with it.
  
   ...
  
  For reference:
  
  freebeast(7.1-P)[9] uname -a
  FreeBSD freebeast.catwhisker.org 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE 
  #127: Wed Nov 12 05:16:20 PST 2008 [EMAIL 
  PROTECTED]:/common/S3/obj/usr/src/sys/FREEBEAST  i386
  freebeast(7.1-P)[10] ls -la
  total 4
  drwxrwxr-x   2 root  operator  512 Nov 12 20:53 .
  drwxr-xr-x  14 root  wheel 512 Jan 22  2008 ..
  freebeast(7.1-P)[11] /usr/bin/time -l mksnap_ffs /S2/usr test.1
          0.91 real         0.00 user         0.05 sys
         976  maximum resident set size
           3  average shared memory size
         627  average unshared data size
         109  average unshared stack size
         104  page reclaims
           0  page faults
           0  swaps
           1  block input operations
         230  block output operations
           0  messages sent
           0  messages received
           0  signals received
         101  voluntary context switches
          34  involuntary context switches
  freebeast(7.1-P)[12] ls -la
  total 1460
  drwxrwxr-x   2 root  operator         512 Nov 12 20:54 .
  drwxr-xr-x  14 root  wheel            512 Jan 22  2008 ..
  -r--r-----   1 root  operator  2410791056 Nov 12 20:54 test.1
  freebeast(7.1-P)[13] 
 
 David, thanks for chiming in.  This is exactly what I was
 fearing/worried about.
 
 It would be greatly beneficial if we could figure out what triggers the
 slowdown for a lot of us, since for others (proof above) mksnap_ffs
 behaves as expected.
 
 Since I'm able to reproduce this pretty much everywhere, here's
 information:
 
 # df -ki /usr
 Filesystem  1024-blocks    Used     Avail Capacity  iused    ifree %iused  Mounted on
 /dev/ad4s1f   163815904 3835274 146875358     3%   254864 20941934    1%   /usr
 
 # cd /usr/.snap
 # /usr/bin/time -l mksnap_ffs /usr test.1
 
 after about 20 seconds, hitting Ctrl-T
 
 load: 1.90  cmd: mksnap_ffs 11719 [wdrain] 0.00u 0.07s 0% 1092k
        23.25 real         0.00 user         0.00 sys
 
       135.98 real         0.00 user         0.62 sys
        1092  maximum resident set size
           4  average shared memory size
        1081  average unshared data size
         135  average unshared stack size
         101  page reclaims
           0  page faults
           0  swaps
         895  block input operations
       13444  block output operations
           0  messages sent
           0  messages received
           0  signals received
        6433  voluntary context switches
         197  involuntary context switches
 # ls -l test.1
 -r--r-----  1 root  operator  173203463240 Nov 12 21:42

Re: System deadlock when using mksnap_ffs

2008-11-13 Thread Jeremy Chadwick
On Thu, Nov 13, 2008 at 12:26:42PM +0200, Kostik Belousov wrote:
 On Wed, Nov 12, 2008 at 08:42:00PM -0800, Jeremy Chadwick wrote:
  On Thu, Nov 13, 2008 at 12:41:02AM +0000, Tim Bishop wrote:
   On Wed, Nov 12, 2008 at 09:47:35PM +0200, Kostik Belousov wrote:
On Wed, Nov 12, 2008 at 05:58:26PM +0000, Tim Bishop wrote:
 I've been playing around with snapshots lately but I've got a problem on
 one of my servers running 7-STABLE amd64:
 
 FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10
 20:49:51 GMT 2008 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/PALADIN  amd64
 
 I run the mksnap_ffs command to take the snapshot and some time later
 the system completely freezes up:
 
 paladin# cd /u2/.snap/
 paladin# mksnap_ffs /u2 test.1
 
 It only happens on this one filesystem, though, which might be to do
 with its size. It's not over the 2TB marker, but it's pretty close. It's
 also backed by a hardware RAID system, although a smaller filesystem on
 the same RAID has no issues.
 
 Filesystem  1K-blocks      Used     Avail Capacity  Mounted on
 /dev/da0s1a 2078881084 921821396 990749202    48%    /u2
 
 To clarify "completely freezes up": unresponsive to all services over
 the network, except ping. On the console I can switch between the ttys,
 but none of them respond. The only way out is to hit the reset button.

You need to provide information described in the
http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug.html
and especially
http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
   
   Ok, I've done that, and removed the patch that seemed to fix things.
   
   The first thing I notice after doing this on the console is that I can
   still ctrl+t the process:
   
   load: 0.14  cmd: mksnap_ffs 2603 [newbuf] 0.00u 10.75s 0% 1160k
   
   But the top and ps I left running on other ttys have all stopped
   responding.
  
  Then in my book, the patch didn't fix anything.  :-)  The system is
  still deadlocking; snapshot generation **should not** wedge the system
  hard like this.
 You systematically mix two completely different issues:
 - first one is the _deadlock_ experienced by Tim;

Re-read what he wrote.  Quote:

Ok, I've done that, and removed the patch that seemed to fix things.

The first thing I notice after doing this on the console is that I can
still ctrl+t the process:

load: 0.14  cmd: mksnap_ffs 2603 [newbuf] 0.00u 10.75s 0% 1160k

But the top and ps I left running on other ttys have all stopped
responding.

If he can press Control-T, it means SIGINFO can be sent to the
mksnap_ffs process, and the process responds with that information.  So,
the system is not deadlocked -- meaning, I believe what he experiences
is what others experience (the system becomes completely unusable during
mksnap_ffs running, but DOES NOT hang or lock up, it just becomes so
god-awful slow that processes on the machine literally sit and spin for
minutes at a time).

 - second one is the slowdown during snapshot creation.
 In fact, I may count third, where dump itself hangs, as a usermode process,
 but kernel still normally operates.
 
 Patch posted should fix or paper over the first issue for practical means.
 Third issue most likely fixed by the subr_sleepqueue race fix.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |



Re: System deadlock when using mksnap_ffs

2008-11-13 Thread Kostik Belousov
On Thu, Nov 13, 2008 at 02:45:14AM -0800, Jeremy Chadwick wrote:
 On Thu, Nov 13, 2008 at 12:26:42PM +0200, Kostik Belousov wrote:
  On Wed, Nov 12, 2008 at 08:42:00PM -0800, Jeremy Chadwick wrote:
   On Thu, Nov 13, 2008 at 12:41:02AM +0000, Tim Bishop wrote:
On Wed, Nov 12, 2008 at 09:47:35PM +0200, Kostik Belousov wrote:
 On Wed, Nov 12, 2008 at 05:58:26PM +0000, Tim Bishop wrote:
  I've been playing around with snapshots lately but I've got a problem on
  one of my servers running 7-STABLE amd64:
  
  FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10
  20:49:51 GMT 2008 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/PALADIN  amd64
  
  I run the mksnap_ffs command to take the snapshot and some time later
  the system completely freezes up:
  
  paladin# cd /u2/.snap/
  paladin# mksnap_ffs /u2 test.1
  
  It only happens on this one filesystem, though, which might be to do
  with its size. It's not over the 2TB marker, but it's pretty close. It's
  also backed by a hardware RAID system, although a smaller filesystem on
  the same RAID has no issues.
  
  Filesystem  1K-blocks      Used     Avail Capacity  Mounted on
  /dev/da0s1a 2078881084 921821396 990749202    48%    /u2
  
  To clarify "completely freezes up": unresponsive to all services over
  the network, except ping. On the console I can switch between the ttys,
  but none of them respond. The only way out is to hit the reset button.
 
 You need to provide information described in the
 http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug.html
 and especially
 http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html

Ok, I've done that, and removed the patch that seemed to fix things.

The first thing I notice after doing this on the console is that I can
still ctrl+t the process:

load: 0.14  cmd: mksnap_ffs 2603 [newbuf] 0.00u 10.75s 0% 1160k

But the top and ps I left running on other ttys have all stopped
responding.
   
   Then in my book, the patch didn't fix anything.  :-)  The system is
   still deadlocking; snapshot generation **should not** wedge the system
   hard like this.
  You systematically mix two completely different issues:
  - first one is the _deadlock_ experienced by Tim;
 
 Re-read what he wrote.  Quote:
 
 Ok, I've done that, and removed the patch that seemed to fix things.
 
 The first thing I notice after doing this on the console is that I can
 still ctrl+t the process:
 
 load: 0.14  cmd: mksnap_ffs 2603 [newbuf] 0.00u 10.75s 0% 1160k
 
 But the top and ps I left running on other ttys have all stopped
 responding.
 
 If he can press Control-T, it means SIGINFO can be sent to the
 mksnap_ffs process, and the process responds with that information.  So,
 the system is not deadlocked -- meaning, I believe what he experiences
 is what others experience (the system becomes completely unusable during
 mksnap_ffs running, but DOES NOT hang or lock up, it just becomes so
 god-awful slow that processes on the machine literally sit and spin for
 minutes at a time).

Unless NOKERNINFO is specified in the local flags of the controlling
terminal's termios, the kernel prints a one-line summary as shown above.
This is done from the tty discipline input handler (or whatever it is in
the new tty code). No process cooperation is required. On the other hand,
actually delivering SIGINFO and getting output from the process-installed
handler does require the process to be either executing in usermode or
sleeping interruptibly.




Re: System deadlock when using mksnap_ffs

2008-11-13 Thread Kostik Belousov
On Wed, Nov 12, 2008 at 08:42:00PM -0800, Jeremy Chadwick wrote:
 On Thu, Nov 13, 2008 at 12:41:02AM +0000, Tim Bishop wrote:
  On Wed, Nov 12, 2008 at 09:47:35PM +0200, Kostik Belousov wrote:
   On Wed, Nov 12, 2008 at 05:58:26PM +0000, Tim Bishop wrote:
I've been playing around with snapshots lately but I've got a problem on
one of my servers running 7-STABLE amd64:

FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10 
20:49:51 GMT 2008 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/PALADIN  amd64

I run the mksnap_ffs command to take the snapshot and some time later
the system completely freezes up:

paladin# cd /u2/.snap/
paladin# mksnap_ffs /u2 test.1

It only happens on this one filesystem, though, which might be to do
with its size. It's not over the 2TB marker, but it's pretty close. It's
also backed by a hardware RAID system, although a smaller filesystem on
the same RAID has no issues.

Filesystem  1K-blocks      Used     Avail Capacity  Mounted on
/dev/da0s1a 2078881084 921821396 990749202    48%    /u2

To clarify "completely freezes up": unresponsive to all services over
the network, except ping. On the console I can switch between the ttys,
but none of them respond. The only way out is to hit the reset button.
   
   You need to provide information described in the
   http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug.html
   and especially
   http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
  
  Ok, I've done that, and removed the patch that seemed to fix things.
  
  The first thing I notice after doing this on the console is that I can
  still ctrl+t the process:
  
  load: 0.14  cmd: mksnap_ffs 2603 [newbuf] 0.00u 10.75s 0% 1160k
  
  But the top and ps I left running on other ttys have all stopped
  responding.
 
 Then in my book, the patch didn't fix anything.  :-)  The system is
 still deadlocking; snapshot generation **should not** wedge the system
 hard like this.
You systematically mix two completely different issues:
- the first one is the _deadlock_ experienced by Tim;
- the second one is the slowdown during snapshot creation.
In fact, I may count a third, where dump itself hangs as a usermode process
but the kernel still operates normally.

The patch posted should fix or paper over the first issue for practical
purposes. The third issue is most likely fixed by the subr_sleepqueue race fix.




Re: System deadlock when using mksnap_ffs

2008-11-13 Thread Greg Byshenk
On Wed, Nov 12, 2008 at 08:42:00PM -0800, Jeremy Chadwick wrote:
 
 The rest of the below information is good -- but I'm confused about
 something: is there anyone out there who can use mksnap_ffs on a
 filesystem (/usr is a good test source) and NOT experience this
 deadlocking problem?  Literally *every* FreeBSD box I have root access
 to suffers from this problem, so I'm a little baffled why we end-users
 need to keep providing debugging output when it should be easy as pie
 for a developer to do dump -0 -L -a -f /path/fs.dump /usr and watch
 their system wedge.

As an answer to the question (and additional information), I am 
experiencing the problem, but not on all filesystems. 

This is under FreeBSD 7.1-PRERELEASE #7: Thu Nov  6 11:29:52 CET 2008,
amd64 (from sources csup'ed immediately prior to the build).

I have four filesystems used for data storage:

/dev/da1p1    96850470   7866026   81236408     9%    /export/mail
/dev/da1p2  1937058312 972070320  810023328    55%    /export/home
/dev/da1p3  1937058312  79027008 1703066640     4%    /export/misc
/dev/da1p4  2598991534 271980564 2119091648    11%    /export/spare

I can successfully mksnap_ffs the first (smaller) partition, but an
attempt to do so on any of the others causes a lock.

Note: this is a lockup, not a slow.  The system becomes unresponsive
to any input, and there is no hard drive activity, and this does not
change over a period of more than 12 hours.

-- 
greg byshenk  -  [EMAIL PROTECTED]  -  Leiden, NL


Re: System deadlock when using mksnap_ffs

2008-11-13 Thread Doug Ambrisko
Kostik Belousov writes:
| On Thu, Nov 13, 2008 at 02:45:14AM -0800, Jeremy Chadwick wrote:
[snip]
|  If he can press Control-T, it means SIGINFO can be sent to the
|  mksnap_ffs process, and the process responds with that information.  So,
|  the system is not deadlocked -- meaning, I believe what he experiences
|  is what others experience (the system becomes completely unusable during
|  mksnap_ffs running, but DOES NOT hang or lock up, it just becomes so
|  god-awful slow that processes on the machine literally sit and spin for
|  minutes at a time).
| 
| Unless NOKERNINFO is specified in the local flags in the controlling
| terminal termios, kernel prints one line summary as shown above. This is
| done from the tty discipline input handler (or whatever it is in new tty
| code). No process cooperation is required. On the other hand, actually
| delivering SIGINFO and getting output from the process-installed
| handler do require process to either executing usermode or sleeping
| interruptible.

Also note that a deadlock is not just a locking issue; it can also arise
from other dependency chains. For example: the system hits the maximum
buffer cache usage, so the buffer daemon needs to flush things out, but
to do that it needs a buffer, and it can't get one until something has
been flushed. Things get really bad when the buffer daemon needs a buffer
but can't get one!  In theory it can use the emergency space reserved just
for it to get out of this situation, but if the buffer cache is fragmented
such that all available buffers are too small, then the buffer daemon is
stuck on itself.  Note that everything works except for anything that
touches the buffer cache, such as a program being loaded from disk.  A
program already in memory is okay.
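
The fragmentation case described above can be illustrated with a toy model in C. This is only a sketch under stated assumptions: free_total() and can_allocate() are hypothetical helpers, the buffer sizes are made up, and nothing here reflects how the real buffer cache tracks its free space. It only shows that plenty of free space in total does not help when no single free buffer is big enough for the request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Total free space across all free buffers (hypothetical helper). */
size_t
free_total(const size_t *free_sizes, size_t n)
{
	size_t i, total = 0;

	assert(free_sizes != NULL || n == 0);
	for (i = 0; i < n; i++)
		total += free_sizes[i];
	return (total);
}

/*
 * True if at least one free buffer can hold a request of `need` bytes.
 * A request cannot be split across several smaller buffers in this model.
 */
bool
can_allocate(const size_t *free_sizes, size_t n, size_t need)
{
	size_t i;

	assert(free_sizes != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (free_sizes[i] >= need)
			return (true);
	}
	return (false);
}
```

With four free 4096-byte buffers, free_total() reports 16384 bytes, yet can_allocate() fails for a single 16384-byte request. In this model that is the state in which the flusher would be stuck waiting on itself.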

To really get a good picture of this you need to look at the
various buffer cache variables via ddb (i.e. hi, low, running, etc.).
A while back I wrote a debugging function to dump the state of
things every minute or so.  There are various loops you can get into,
so then you start playing whack-a-mole.  Usually, due to the first
bug, you can't hit the 2nd, 3rd and so on, adding to the fun.
Unfortunately there isn't one magic bullet.

These are not new problems since we hit them in 4.X.  I did start
to go over some of this issue with Tor but ran into ENOTIME on my
side :-(

Snapshots can take a very long time to make, depending on the amount
of data they have to cover, and during that time the file system has to
be effectively locked against everything else or the snapshot will be
wrong.  This then leads to a need for a good journaling fs that
can be used on big disks (and big isn't that big anymore).

Doug A.


Re: System deadlock when using mksnap_ffs

2008-11-13 Thread Patrick Reich

I'll just chime in briefly.  I contacted Jeremy off the list
about this issue a few days ago.  I have one spare box i386
sitting here that I can happily test patches against; if I
can be of help, let me know.

 uname -a
FreeBSD localhost.localdomain 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #0:
Tue Nov 11 21:40:27 CST 2008
[EMAIL PROTECTED]:/usr/obj/usr/src/sys/GENERIC   i386

 ident /boot/kernel/kernel | grep sleepqueue
   $FreeBSD: src/sys/kern/subr_sleepqueue.c,v 1.39.2.5 2008/09/16
20:01:57 jhb Exp $

Suffers from the description given by Jeremy: the box is not deadlocked
during snapshot but I might as well walk away from it because I can't
use it.  I'd really like to see this get fixed; I rely on dump for
backups.

Regards,
Pat
-- 

Jesus, can't I count on you people!?
--Oh Brother, Where Art Thou, George Clooney



System deadlock when using mksnap_ffs

2008-11-12 Thread Tim Bishop
I've been playing around with snapshots lately but I've got a problem on
one of my servers running 7-STABLE amd64:

FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10 20:49:51 
GMT 2008 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/PALADIN  amd64

I run the mksnap_ffs command to take the snapshot and some time later
the system completely freezes up:

paladin# cd /u2/.snap/
paladin# mksnap_ffs /u2 test.1

It only happens on this one filesystem, though, which might be to do
with its size. It's not over the 2TB marker, but it's pretty close. It's
also backed by a hardware RAID system, although a smaller filesystem on
the same RAID has no issues.

Filesystem  1K-blocks      Used     Avail Capacity  Mounted on
/dev/da0s1a 2078881084 921821396 990749202    48%    /u2

To clarify "completely freezes up": unresponsive to all services over
the network, except ping. On the console I can switch between the ttys,
but none of them respond. The only way out is to hit the reset button.

Any advice? I'm happy to help debug this further to get to the bottom of
it.

Thanks,

Tim.

-- 
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x5AE7D984


RE: System deadlock when using mksnap_ffs

2008-11-12 Thread David Peall
 -Original Message-
 From: [EMAIL PROTECTED] [mailto:owner-freebsd-
 [EMAIL PROTECTED] On Behalf Of Tim Bishop
 Sent: 12 November 2008 07:58 PM
 To: freebsd-stable@freebsd.org
 Cc: [EMAIL PROTECTED]
 Subject: System deadlock when using mksnap_ffs

 FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10
 20:49:51 GMT 2008 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/PALADIN  amd64
 
 I run the mksnap_ffs command to take the snapshot and some time later
 the system completely freezes up:


If the file system is UFS2 it's a known problem but should have been
fixed.
http://wiki.freebsd.org/JeremyChadwick/Commonly_reported_issues

ident /boot/kernel/kernel | grep subr_sleepqueue

version should be greater than 1.39.2.3?
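
The check above compares dotted CVS revision numbers such as 1.39.2.5 and 1.39.2.3. A minimal sketch of that comparison in C follows. It assumes both revisions are on the same branch; revcmp() is a hypothetical helper written for illustration, not part of ident(1) or any FreeBSD tool.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Compare two dotted revision strings field by field, numerically.
 * Returns <0, 0 or >0 in the manner of strcmp(3).  If all shared fields
 * are equal, the revision with more fields is treated as the newer one.
 */
int
revcmp(const char *a, const char *b)
{
	assert(a != NULL && b != NULL);
	for (;;) {
		char *ea, *eb;
		unsigned long na = strtoul(a, &ea, 10);
		unsigned long nb = strtoul(b, &eb, 10);

		if (na != nb)
			return (na < nb ? -1 : 1);
		/* Stop as soon as either string has no further field. */
		if (*ea != '.' || *eb != '.')
			return ((*ea == '.') - (*eb == '.'));
		a = ea + 1;
		b = eb + 1;
	}
}
```

For the revisions mentioned in this thread, revcmp("1.39.2.5", "1.39.2.3") is positive, so a kernel reporting 1.39.2.5 is newer than 1.39.2.3.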

Regards

--
David Peall :: IT Manager
e-Schools' Network :: http://www.esn.org.za/
Phone +27 (021) 674-9140




Re: System deadlock when using mksnap_ffs

2008-11-12 Thread Jeremy Chadwick
On Wed, Nov 12, 2008 at 06:22:35PM +0000, Tim Bishop wrote:
 On Wed, Nov 12, 2008 at 08:10:50PM +0200, David Peall wrote:
   FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10
   20:49:51 GMT 2008 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/PALADIN  amd64
   
   I run the mksnap_ffs command to take the snapshot and some time later
   the system completely freezes up:
  
  If the file system is UFS2 it's a known problem but should have been
  fixed.
  http://wiki.freebsd.org/JeremyChadwick/Commonly_reported_issues
  
  ident /boot/kernel/kernel | grep subr_sleepqueue
  
  version should be greater than 1.39.2.3?
 
 Yes it's UFS2, and yes it's greater than 1.39.2.3:
 
 $FreeBSD: src/sys/kern/subr_sleepqueue.c,v 1.39.2.5 2008/09/16 20:01:57 jhb 
 Exp $
 
 Are you sure the problem referenced on that page is the same? It talks
 about dog slow snapshotting, which I see on other filesystems and
 machines. But in this particular case the system is dead, and does not
 recover.

This problem gets brought up every few weeks on average, I think.

The problem still exists regardless of subr_sleepqueue.c 1.39.2.3.  I
can still reproduce it on every FreeBSD box I have access to.  The
last time I tried it was on 2008/10/24, on a RELENG_7 system built
from source csup'd on the same day.

The result of dump -L -0 -a -f /someplace/fs.dump /usr was the same:
the system became more or less unusable (meaning to the point where you
might as well not try to do anything with it because it's so incredibly
slow, especially with anything I/O-bound), and mksnap_ffs remained in
the following state for many, many minutes:

load: 0.00  cmd: mksnap_ffs 10480 [wdrain] 0.00u 0.06s 0% 1076k

Hitting ^C at this point took 4-5 *full minutes* to execute.  While ^C
was (hopefully) executing, the process remained in wdrain state as well.
After the process was terminated fully, the system was again responsive.

That filesystem (/usr) I was dumping:

Filesystem  1K-blocks    Used     Avail Capacity  iused    ifree %iused  Mounted on
/dev/ad4s1f 163815904 3834316 146876316     3%   254752 20942046    1%   /usr

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |



Re: System deadlock when using mksnap_ffs

2008-11-12 Thread Kostik Belousov
On Wed, Nov 12, 2008 at 05:58:26PM +0000, Tim Bishop wrote:
 I've been playing around with snapshots lately but I've got a problem on
 one of my servers running 7-STABLE amd64:
 
 FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10 20:49:51 
 GMT 2008 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/PALADIN  amd64
 
 I run the mksnap_ffs command to take the snapshot and some time later
 the system completely freezes up:
 
 paladin# cd /u2/.snap/
 paladin# mksnap_ffs /u2 test.1
 
 It only happens on this one filesystem, though, which might be to do
 with its size. It's not over the 2TB marker, but it's pretty close. It's
 also backed by a hardware RAID system, although a smaller filesystem on
 the same RAID has no issues.
 
 Filesystem  1K-blocks      Used     Avail Capacity  Mounted on
 /dev/da0s1a 2078881084 921821396 990749202    48%    /u2
 
 To clarify "completely freezes up": unresponsive to all services over
 the network, except ping. On the console I can switch between the ttys,
 but none of them respond. The only way out is to hit the reset button.
 
 Any advice? I'm happy to help debug this further to get to the bottom of
 it.

You need to provide information described in the
http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug.html
and especially
http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html




Re: System deadlock when using mksnap_ffs

2008-11-12 Thread Tim Bishop
On Wed, Nov 12, 2008 at 05:58:26PM +0000, Tim Bishop wrote:
 I run the mksnap_ffs command to take the snapshot and some time later
 the system completely freezes up:
 
 paladin# cd /u2/.snap/
 paladin# mksnap_ffs /u2 test.1

Someone (not named because they choose not to reply to the list) gave me
the following patch:

--- sys/ufs/ffs/ffs_snapshot.c.orig	Wed Mar 22 09:42:31 2006
+++ sys/ufs/ffs/ffs_snapshot.c	Mon Nov 20 14:59:13 2006
@@ -282,6 +282,8 @@ restart:
 		if (error)
 			goto out;
 		bawrite(nbp);
+		if (cg % 10 == 0)
+			ffs_syncvnode(vp, MNT_WAIT);
 	}
 	/*
 	 * Copy all the cylinder group maps. Although the
@@ -303,6 +305,8 @@ restart:
 			goto out;
 		error = cgaccount(cg, vp, nbp, 1);
 		bawrite(nbp);
+		if (cg % 10 == 0)
+			ffs_syncvnode(vp, MNT_WAIT);
 		if (error)
 			goto out;
 	}

With the description:

What can happen is on a big file system it will fill up the buffer
cache with I/O and then run out.  When the buffer cache fills up then no
more disk I/O can happen :-(  When you do a sync, it flushes that out to
disk so things don't hang.

It seems to work too. But it seems more like a workaround than a fix?
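
The effect of the periodic sync in the patch can be illustrated with a toy user-space model in C. This is a sketch under loud assumptions, not kernel code: peak_dirty() is a hypothetical function, each cylinder group is assumed to queue exactly one dirty buffer, and nothing is assumed to reach disk until an explicit sync. The real buffer cache writes buffers back on its own, so the numbers only show the shape of the problem.

```c
#include <assert.h>

/*
 * Toy model of the snapshot loop.  Returns the peak number of dirty
 * buffers outstanding while walking `ncg` cylinder groups.  A value of 0
 * for `sync_every` models the unpatched loop, which never syncs.
 */
unsigned
peak_dirty(unsigned ncg, unsigned sync_every)
{
	unsigned cg, dirty = 0, peak = 0;

	for (cg = 0; cg < ncg; cg++) {
		dirty++;		/* stands in for bawrite(nbp) */
		if (dirty > peak)
			peak = dirty;
		/* Same test as the patch: sync when cg % N == 0. */
		if (sync_every != 0 && cg % sync_every == 0)
			dirty = 0;	/* stands in for ffs_syncvnode() */
	}
	assert(peak <= ncg);
	return (peak);
}
```

In this model, 1000 cylinder groups without a sync peak at 1000 outstanding dirty buffers, while syncing every 10th group caps the peak at 10. That is the sense in which the patch keeps the snapshot loop from saturating the buffer cache.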

Tim.

-- 
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x5AE7D984


Re: System deadlock when using mksnap_ffs

2008-11-12 Thread Kostik Belousov
On Wed, Nov 12, 2008 at 07:49:28PM +, Tim Bishop wrote:
 On Wed, Nov 12, 2008 at 05:58:26PM +, Tim Bishop wrote:
  I run the mksnap_ffs command to take the snapshot and some time later
  the system completely freezes up:
  
  paladin# cd /u2/.snap/
  paladin# mksnap_ffs /u2 test.1
 
 Someone (not named because they choose not to reply to the list) gave me
 the following patch:
 
 --- sys/ufs/ffs/ffs_snapshot.c.orig   Wed Mar 22 09:42:31 2006
 +++ sys/ufs/ffs/ffs_snapshot.cMon Nov 20 14:59:13 2006
 @@ -282,6 +282,8 @@ restart:
   if (error)
   goto out;
   bawrite(nbp);
 + if (cg % 10 == 0)
 + ffs_syncvnode(vp, MNT_WAIT);
   }
   /*
* Copy all the cylinder group maps. Although the
 @@ -303,6 +305,8 @@ restart:
   goto out;
   error = cgaccount(cg, vp, nbp, 1);
   bawrite(nbp);
 + if (cg % 10 == 0)
 + ffs_syncvnode(vp, MNT_WAIT);
   if (error)
   goto out;
   }
 
 With the description:
 
 What can happen is on a big file system it will fill up the buffer
 cache with I/O and then run out.  When the buffer cache fills up then no
 more disk I/O can happen :-(  When you do a sync, it flushes that out to
 disk so things don't hang.
 
 It seems to work too. But it seems more like a workaround than a fix?

It looks hackish, but in fact it is not that wrong, and I would even say
that it provides a reasonable workaround.

The usual way to prevent the wdrain deadlock is to issue a bwillwrite()
call before any vnode lock is taken. This is sufficient for most VFS
syscalls, which typically put a dozen or fewer dirty buffers into the
delayed write queue.

Snapshot creation does not call bwillwrite() at all, and then does a lot
of async writes, completely saturating the buffer cache with dirty buffers.
bwillwrite() cannot be called after the vnode is locked, so just forcing
a sync of the embryonic snapshot vnode is good enough.

The value of 10 for the counter is debatable, but that debate should be
postponed until the patch goes into the tree. I ask the anonymous submitter
to commit it. Thanks!
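
A toy user-space model of the effect described above may help. It is a
sketch only: struct toy_cache, toy_async_write(), toy_sync() and
toy_snapshot() are made-up names, not kernel interfaces, and the numbers
used in the note below are arbitrary. It shows that a loop issuing only
async writes eventually hits the dirty-buffer ceiling, while a forced sync
every N cylinder groups keeps the peak bounded.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model only: none of these names exist in the FreeBSD kernel.
 * "dirty" counts buffers queued by async writes that have not yet
 * reached the disk; "limit" stands in for the dirty-buffer ceiling.
 */
struct toy_cache {
    size_t dirty;   /* buffers currently dirty */
    size_t limit;   /* ceiling at which writers would stall */
    size_t peak;    /* highest dirty count observed */
};

/*
 * Queue one async write. Returns 0 where a real writer would go to
 * sleep waiting for buffers, 1 otherwise.
 */
static int
toy_async_write(struct toy_cache *c)
{
    if (c->dirty >= c->limit)
        return (0);
    c->dirty++;
    if (c->dirty > c->peak)
        c->peak = c->dirty;
    return (1);
}

/* Synchronous flush: pretend every dirty buffer has been written. */
static void
toy_sync(struct toy_cache *c)
{
    c->dirty = 0;
}

/*
 * Walk ncg cylinder groups, queueing bufs_per_cg writes for each.
 * If sync_every is non-zero, flush whenever cg % sync_every == 0,
 * mirroring the "cg % 10 == 0" test in the patch.
 * Returns the number of cylinder groups completed before a stall.
 */
static size_t
toy_snapshot(struct toy_cache *c, size_t ncg, size_t bufs_per_cg,
    size_t sync_every)
{
    size_t cg, i;

    assert(c != NULL && c->limit > 0);
    for (cg = 0; cg < ncg; cg++) {
        for (i = 0; i < bufs_per_cg; i++)
            if (!toy_async_write(c))
                return (cg);
        if (sync_every != 0 && cg % sync_every == 0)
            toy_sync(c);
    }
    return (ncg);
}
```

With a fresh cache whose limit is 100 and 4 writes per cylinder group,
toy_snapshot(&c, 1000, 4, 0) stalls after 25 groups, while
toy_snapshot(&c, 1000, 4, 10) completes all 1000 and leaves c.peak at 40.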




Re: System deadlock when using mksnap_ffs

2008-11-12 Thread Tim Bishop
On Wed, Nov 12, 2008 at 08:10:50PM +0200, David Peall wrote:
  FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10
  20:49:51 GMT 2008 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/PALADIN  amd64
  
  I run the mksnap_ffs command to take the snapshot and some time later
  the system completely freezes up:
 
 If the file system is UFS2 it's a known problem but should have been
 fixed.
 http://wiki.freebsd.org/JeremyChadwick/Commonly_reported_issues
 
 ident /boot/kernel/kernel | grep subr_sleepqueue
 
 version should be greater than 1.39.2.3?

Yes it's UFS2, and yes it's greater than 1.39.2.3:

$FreeBSD: src/sys/kern/subr_sleepqueue.c,v 1.39.2.5 2008/09/16 20:01:57 jhb Exp $

Are you sure the problem referenced on that page is the same? It talks
about dog slow snapshotting, which I see on other filesystems and
machines. But in this particular case the system is dead, and does not
recover.

Tim.

-- 
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x5AE7D984


Re: System deadlock when using mksnap_ffs

2008-11-12 Thread Tim Bishop
On Wed, Nov 12, 2008 at 09:47:35PM +0200, Kostik Belousov wrote:
 On Wed, Nov 12, 2008 at 05:58:26PM +, Tim Bishop wrote:
  I've been playing around with snapshots lately but I've got a problem on
  one of my servers running 7-STABLE amd64:
  
  FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10 
  20:49:51 GMT 2008 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/PALADIN  amd64
  
  I run the mksnap_ffs command to take the snapshot and some time later
  the system completely freezes up:
  
  paladin# cd /u2/.snap/
  paladin# mksnap_ffs /u2 test.1
  
  It only happens on this one filesystem, though, which might be to do
  with its size. It's not over the 2TB marker, but it's pretty close. It's
  also backed by a hardware RAID system, although a smaller filesystem on
  the same RAID has no issues.
  
  Filesystem  1K-blocks      Used     Avail Capacity  Mounted on
  /dev/da0s1a 2078881084 921821396 990749202    48%    /u2
  
  To clarify "completely freezes up": unresponsive to all services over
  the network, except ping. On the console I can switch between the ttys,
  but none of them respond. The only way out is to hit the reset button.
 
 You need to provide information described in the
 http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug.html
 and especially
 http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html

Ok, I've done that, and removed the patch that seemed to fix things.

The first thing I notice after doing this on the console is that I can
still ctrl+t the process:

load: 0.14  cmd: mksnap_ffs 2603 [newbuf] 0.00u 10.75s 0% 1160k

But the top and ps I left running on other ttys have all stopped
responding.
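
The bracketed field in the Ctrl-T (SIGINFO) line is the wait channel the
process is sleeping on, which is what distinguishes the newbuf case here
from wdrain or ufs. A small C sketch for pulling the fields out of such a
line follows. struct status_line, parse_status_line() and
status_line_waits_on() are hypothetical helper names, and the format
string is only inferred from the sample line above, not taken from the
kernel source.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Fields of one status line, as read from the sample above. */
struct status_line {
    double load;        /* load average */
    char   cmd[64];     /* command name */
    int    pid;         /* process id */
    char   wchan[32];   /* text between the square brackets */
    double utime;       /* user time, seconds */
    double stime;       /* system time, seconds */
    int    pcpu;        /* CPU percentage */
    long   mem_kb;      /* trailing memory figure, in kilobytes */
};

/* Returns 1 if all eight fields were parsed, 0 otherwise. */
static int
parse_status_line(const char *line, struct status_line *out)
{
    int n;

    assert(line != NULL && out != NULL);
    n = sscanf(line,
        "load: %lf cmd: %63s %d [%31[^]]] %lfu %lfs %d%% %ldk",
        &out->load, out->cmd, &out->pid, out->wchan,
        &out->utime, &out->stime, &out->pcpu, &out->mem_kb);
    return (n == 8);
}

/* True if the parsed line shows the process sleeping on wchan. */
static int
status_line_waits_on(const struct status_line *sl, const char *wchan)
{
    return (strcmp(sl->wchan, wchan) == 0);
}
```

Fed the line above, this yields pid 2603 and a wait channel of newbuf, so
status_line_waits_on(&sl, "wdrain") is false for this report.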

Also the following kernel message came out:

Expensive timeout(9) function: 0x802ce380(0xff000677ca50) 0.006121001 s

There is also still some disk I/O.

Dropping to ddb worked, but I don't have a serial console so I can't
paste the output.

ps shows mksnap_ffs in newbuf, as we already saw. A trace of mksnap_ffs
looks like this:

Tracing pid 2603 tid 100214 td 0xff0006a0e370
sched_switch() at sched_switch+0x2a1
mi_switch() at mi_switch+0x233
sleepq_switch() at sleepq_switch+0xe9
sleepq_wait() at sleepq_wait+0x44
_sleep() at _sleep+0x351
getnewbuf() at getnewbuf+0x2e1
getblk() at getblk+0x30d
setup_allocindir_phase2() at setup_allocindir_phase2+0x338
softdep_setup_allocindir_page() at softdep_setup_allocindir_page+0xa7
ffs_balloc_ufs2() at ffs_balloc_ufs2+0x121e
ffs_snapshot() at ffs_snapshot+0xc52
ffs_mount() at ffs_mount+0x735
vfs_donmount() at vfs_donmount+0xeb5
kernel_mount() at kernel_mount+0xa1
ffs_cmount() at ffs_cmount+0x92
mount() at mount+0x1cc
syscall() at syscall+0x1f6
Xfast_syscall() at Xfast_syscall+0xab
--- syscall (21, FreeBSD ELF64, mount), rip = 0x80068636c, rsp = 0x7fffe518, rbp = 0x8008447a0 ---

show pcpu shows cpuid 3 (quad core machine) in thread swi6: Giant taskq.
All the other cpus are idle.

show locks shows:

exclusive sleep mutex Giant r = 0 (0x806ae040) locked @ /usr/src/sys/kern/kern_intr.c:1087

There are two other locks shown by show all locks, one for sshd and one
for mysqld, both in kern/uipc_sockbuf.c.

show lockedvnods shows mksnap_ffs has a lock on da0s1a with ffs_vget at
the top of the stack.

Sorry for any typos. I'll sort out a serial cable if more is needed :-)

Tim.

-- 
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x5AE7D984


Re: System deadlock when using mksnap_ffs

2008-11-12 Thread Jeremy Chadwick
On Thu, Nov 13, 2008 at 12:41:02AM +, Tim Bishop wrote:
 On Wed, Nov 12, 2008 at 09:47:35PM +0200, Kostik Belousov wrote:
  On Wed, Nov 12, 2008 at 05:58:26PM +, Tim Bishop wrote:
   I've been playing around with snapshots lately but I've got a problem on
   one of my servers running 7-STABLE amd64:
   
   FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10 
   20:49:51 GMT 2008 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/PALADIN  amd64
   
   I run the mksnap_ffs command to take the snapshot and some time later
   the system completely freezes up:
   
   paladin# cd /u2/.snap/
   paladin# mksnap_ffs /u2 test.1
   
   It only happens on this one filesystem, though, which might be to do
   with its size. It's not over the 2TB marker, but it's pretty close. It's
   also backed by a hardware RAID system, although a smaller filesystem on
   the same RAID has no issues.
   
   Filesystem  1K-blocks      Used     Avail Capacity  Mounted on
   /dev/da0s1a 2078881084 921821396 990749202    48%    /u2
   
   To clarify "completely freezes up": unresponsive to all services over
   the network, except ping. On the console I can switch between the ttys,
   but none of them respond. The only way out is to hit the reset button.
  
  You need to provide information described in the
  http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug.html
  and especially
  http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
 
 Ok, I've done that, and removed the patch that seemed to fix things.
 
 The first thing I notice after doing this on the console is that I can
 still ctrl+t the process:
 
 load: 0.14  cmd: mksnap_ffs 2603 [newbuf] 0.00u 10.75s 0% 1160k
 
 But the top and ps I left running on other ttys have all stopped
 responding.

Then in my book, the patch didn't fix anything.  :-)  The system is
still deadlocking; snapshot generation **should not** wedge the system
hard like this.

Also, during my own testing, I am always able to use Ctrl-T to get
SIGINFO from the running process (mksnap_ffs).  That behaviour does not
change for me.

The rest of the below information is good -- but I'm confused about
something: is there anyone out there who can use mksnap_ffs on a
filesystem (/usr is a good test source) and NOT experience this
deadlocking problem?  Literally *every* FreeBSD box I have root access
to suffers from this problem, so I'm a little baffled why we end-users
need to keep providing debugging output when it should be easy as pie
for a developer to do dump -0 -L -a -f /path/fs.dump /usr and watch
their system wedge.

Also, a fellow on -fs just mentioned he's having this exact problem:

http://lists.freebsd.org/pipermail/freebsd-fs/2008-November/005324.html

 Also the following kernel message came out:
 
 Expensive timeout(9) function: 0x802ce380(0xff000677ca50) 
 0.006121001 s

 There is also still some disk I/O.
 
 Dropping to ddb worked, but I don't have a serial console so I can't
 paste the output.
 
 ps shows mksnap_ffs in newbuf, as we already saw. A trace of mksnap_ffs
 looks like this:
 
 Tracing pid 2603 tid 100214 td 0xff0006a0e370
 sched_switch() at sched_switch+0x2a1
 mi_switch() at mi_switch+0x233
 sleepq_switch() at sleepq_switch+0xe9
 sleepq_wait() at sleepq_wait+0x44
 _sleep() at _sleep+0x351
 getnewbuf() at getnewbuf+0x2e1
 getblk() at getblk+0x30d
 setup_allocindir_phase2() at setup_allocindir_phase2+0x338
 softdep_setup_allocindir_page() at softdep_setup_allocindir_page+0xa7
 ffs_balloc_ufs2() at ffs_balloc_ufs2+0x121e
 ffs_snapshot() at ffs_snapshot+0xc52
 ffs_mount() at ffs_mount+0x735
 vfs_donmount() at vfs_donmount+0xeb5
 kernel_mount() at kernel_mount+0xa1
 ffs_cmount() at ffs_cmount+0x92
 mount() at mount+0x1cc
 syscall() at syscall+0x1f6
 Xfast_syscall() at Xfast_syscall+0xab
 --- syscall (21, FreeBSD ELF64, mount), rip = 0x80068636c, rsp = 
 0x7fffe518, rbp = 0x8008447a0 ---
 
 show pcpu shows cpuid 3 (quad core machine) in thread swi6: Giant taskq.
 All the other cpus are idle.
 
 show locks shows:
 
 exclusive sleep mutex Giant r = 0 (0x806ae040) locked @ 
 /usr/src/sys/kern/kern_intr.c:1087
 
 There are two other locks shown by show all locks, one for sshd and one
 for mysqld, both in kern/uipc_sockbuf.c.
 
 show lockedvnods shows mksnap_ffs has a lock on da0s1a with ffs_vget at
 the top of the stack.
 
 Sorry for any typos. I'll sort out a serial cable if more is needed :-)
 
 Tim.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |



Re: System deadlock when using mksnap_ffs

2008-11-12 Thread Doug Ambrisko
Jeremy Chadwick writes:
[snip]
| The rest of the below information is good -- but I'm confused about
| something: is there anyone out there who can use mksnap_ffs on a
| filesystem (/usr is a good test source) and NOT experience this
| deadlocking problem?  Literally *every* FreeBSD box I have root access
| to suffers from this problem, so I'm a little baffled why we end-users
| need to keep providing debugging output when it should be easy as pie
| for a developer to do dump -0 -L -a -f /path/fs.dump /usr and watch
| their system wedge.

We can at work, but we have a bunch of other patches.  There are a
few problems with the buffer cache:
1)  The buffer daemon can't use the space that is reserved for it,
    since to flush some stuff it needs to use more buffers.
2)  The buffer cache can get fragmented, which prevents the large I/O
    that the buffer daemon may need.
3)  Other issues ...
I have a fix for 1.  It is pretty easy.  I have a hack'ish fix for 2,
in that I make all requests use the max size so the cache can't get
fragmented, since there is no code to defrag and it isn't trivial to
defrag the memory.  I have some fixes for some other issues, but there
were some review issues with them.  I might just commit the fixes for 1
and 2.  They make things better and there were no objections at the time.
We have the patches in shipping products.
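
The fragmentation point (problem 2) can be sketched with a toy first-fit
arena in C. This is an illustration under stated assumptions, not the
buffer cache code: ARENA_UNITS and the four helper names are invented
here, and a real buffer cache manages kernel virtual address space rather
than a byte map.

```c
#include <assert.h>
#include <stddef.h>

#define ARENA_UNITS 16  /* toy arena size, chosen arbitrarily */

/* Count free units in the map (0 = free, 1 = used). */
static size_t
free_units(const unsigned char *map)
{
    size_t i, n = 0;

    for (i = 0; i < ARENA_UNITS; i++)
        if (!map[i])
            n++;
    return (n);
}

/* Length of the longest run of contiguous free units. */
static size_t
largest_free_run(const unsigned char *map)
{
    size_t i, run = 0, best = 0;

    for (i = 0; i < ARENA_UNITS; i++) {
        run = map[i] ? 0 : run + 1;
        if (run > best)
            best = run;
    }
    return (best);
}

/* First-fit allocation of len contiguous units; returns start or -1. */
static int
arena_alloc(unsigned char *map, size_t len)
{
    size_t i, start, run = 0;

    assert(len > 0 && len <= ARENA_UNITS);
    for (i = 0; i < ARENA_UNITS; i++) {
        run = map[i] ? 0 : run + 1;
        if (run == len) {
            start = i + 1 - len;
            for (i = start; i < start + len; i++)
                map[i] = 1;
            return ((int)start);
        }
    }
    return (-1);
}

/* Release len units starting at start. */
static void
arena_free(unsigned char *map, size_t start, size_t len)
{
    size_t i;

    assert(start + len <= ARENA_UNITS);
    for (i = start; i < start + len; i++)
        map[i] = 0;
}
```

If the arena is filled with sixteen 1-unit allocations and every other one
is freed, 8 units are free but the largest run is 1, so a 4-unit request
fails. If every request is rounded up to 4 units instead, freeing any
allocation always leaves a hole that the next request fits into, at the
cost of wasted space for small requests.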

I can try to do some experiments at work like you said since I 
had similar things working before and it is pretty easy to put in
printf's to see the issue.

Doug A.


Re: System deadlock when using mksnap_ffs

2008-11-12 Thread David Wolfskill
On Wed, Nov 12, 2008 at 08:42:00PM -0800, Jeremy Chadwick wrote:
 ...
   On Wed, Nov 12, 2008 at 05:58:26PM +, Tim Bishop wrote:
I've been playing around with snapshots lately but I've got a problem on
one of my servers running 7-STABLE amd64:

FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10 
20:49:51 GMT 2008 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/PALADIN  amd64

I run the mksnap_ffs command to take the snapshot and some time later
the system completely freezes up:

paladin# cd /u2/.snap/
paladin# mksnap_ffs /u2 test.1

It only happens on this one filesystem, though, which might be to do
with its size. It's not over the 2TB marker, but it's pretty close. It's
also backed by a hardware RAID system, although a smaller filesystem on
the same RAID has no issues.
 ...
 Then in my book, the patch didn't fix anything.  :-)  The system is
 still deadlocking; snapshot generation **should not** wedge the system
 hard like this.
 
 Also, during my own testing, I am always able to use Ctrl-T to get
 SIGINFO from the running process (mksnap_ffs).  That behaviour does not
 change for me.
 
 The rest of the below information is good -- but I'm confused about
 something: is there anyone out there who can use mksnap_ffs on a
 filesystem (/usr is a good test source) and NOT experience this
 deadlocking problem?

I hadn't ever tried until I saw your message.  Granted, I'm using a
smaller file system (I doubt that I have a total of as much as 2 TB in
all my machines combined), and I'm running i386, vs. amd64.  But it ran
just fine.  I wasn't able to test SIGINFO; it finished before I had a
chance.  (I ran it under time(1); wall clock time was 0.91 sec.)

 Literally *every* FreeBSD box I have root access
 to suffers from this problem, so I'm a little baffled why we end-users
 need to keep providing debugging output when it should be easy as pie
 for a developer to do dump -0 -L -a -f /path/fs.dump /usr and watch
 their system wedge.

Well, I routinely use dump/restore pipelines to copy file systems
around; never had a problem with it.

 ...

For reference:

freebeast(7.1-P)[9] uname -a
FreeBSD freebeast.catwhisker.org 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #127: Wed Nov 12 05:16:20 PST 2008 [EMAIL PROTECTED]:/common/S3/obj/usr/src/sys/FREEBEAST  i386
freebeast(7.1-P)[10] ls -la
total 4
drwxrwxr-x   2 root  operator  512 Nov 12 20:53 .
drwxr-xr-x  14 root  wheel 512 Jan 22  2008 ..
freebeast(7.1-P)[11] /usr/bin/time -l mksnap_ffs /S2/usr test.1
        0.91 real         0.00 user         0.05 sys
       976  maximum resident set size
         3  average shared memory size
       627  average unshared data size
       109  average unshared stack size
       104  page reclaims
         0  page faults
         0  swaps
         1  block input operations
       230  block output operations
         0  messages sent
         0  messages received
         0  signals received
       101  voluntary context switches
        34  involuntary context switches
freebeast(7.1-P)[12] ls -la
total 1460
drwxrwxr-x   2 root  operator         512 Nov 12 20:54 .
drwxr-xr-x  14 root  wheel            512 Jan 22  2008 ..
-r--r-   1 root  operator  2410791056 Nov 12 20:54 test.1
freebeast(7.1-P)[13] 

Peace,
david
-- 
David H. Wolfskill  [EMAIL PROTECTED]
Depriving a girl or boy of an opportunity for education is evil.

See http://www.catwhisker.org/~david/publickey.gpg for my public key.




Re: System deadlock when using mksnap_ffs

2008-11-12 Thread Doug Ambrisko
Kostik Belousov writes:
| On Wed, Nov 12, 2008 at 07:49:28PM +, Tim Bishop wrote:
|  On Wed, Nov 12, 2008 at 05:58:26PM +, Tim Bishop wrote:
|   I run the mksnap_ffs command to take the snapshot and some time later
|   the system completely freezes up:
|   
|   paladin# cd /u2/.snap/
|   paladin# mksnap_ffs /u2 test.1
|  
|  Someone (not named because they choose not to reply to the list) gave me
|  the following patch:
|  
|  --- sys/ufs/ffs/ffs_snapshot.c.orig Wed Mar 22 09:42:31 2006
|  +++ sys/ufs/ffs/ffs_snapshot.c  Mon Nov 20 14:59:13 2006
|  @@ -282,6 +282,8 @@ restart:
|  if (error)
|  goto out;
|  bawrite(nbp);
|  +   if (cg % 10 == 0)
|  +   ffs_syncvnode(vp, MNT_WAIT);
|  }
|  /*
|   * Copy all the cylinder group maps. Although the
|  @@ -303,6 +305,8 @@ restart:
|  goto out;
|  error = cgaccount(cg, vp, nbp, 1);
|  bawrite(nbp);
|  +   if (cg % 10 == 0)
|  +   ffs_syncvnode(vp, MNT_WAIT);
|  if (error)
|  goto out;
|  }
|  
|  With the description:
|  
|  What can happen is on a big file system it will fill up the buffer
|  cache with I/O and then run out.  When the buffer cache fills up then no
|  more disk I/O can happen :-(  When you do a sync, it flushes that out to
|  disk so things don't hang.
|  
|  It seems to work too. But it seems more like a workaround than a fix?
| 
| It looks hackish, but in fact it is not that wrong, and I even say that
| it provides reasonable workaround.
| 
| The usual way to prevent wdrain deadlock is to issue bwillwrite() call
| before any vnode lock is taken. This is sufficient for most VFS syscalls
| that typically put dozen or less dirty buffers into delayed write
| queue.
| 
| Snapshot creation does not call bwillwrite() at all, and then does a lot
| of async writes, completely saturating buffer cache with dirty buffers.
| bwillwrite cannot be called after the vnode is locked, and just forcing
| a sync for the embryonic snapshot vnode is good enough.
| 
| The 10 counter is debatable, but debate shall be postponed until the patch
| goes into tree. I ask an anonymous submitter to commit it. Thanks !

I plan to commit it tomorrow since I sent it to Tim to test.  The 10 can
be tuned, but it has kept a bunch of machines at work up.  Glad people
don't think it is too wrong :-)  It probably could be made
a little more dynamic, but I wonder if it would show any real performance
difference, and it might risk more bugs.

Doug A.


Re: System deadlock when using mksnap_ffs

2008-11-12 Thread Jeremy Chadwick
On Wed, Nov 12, 2008 at 09:02:50PM -0800, David Wolfskill wrote:
 On Wed, Nov 12, 2008 at 08:42:00PM -0800, Jeremy Chadwick wrote:
  ...
On Wed, Nov 12, 2008 at 05:58:26PM +, Tim Bishop wrote:
 I've been playing around with snapshots lately but I've got a problem 
 on
 one of my servers running 7-STABLE amd64:
 
 FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10 
 20:49:51 GMT 2008 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/PALADIN  
 amd64
 
 I run the mksnap_ffs command to take the snapshot and some time later
 the system completely freezes up:
 
 paladin# cd /u2/.snap/
 paladin# mksnap_ffs /u2 test.1
 
 It only happens on this one filesystem, though, which might be to do
 with its size. It's not over the 2TB marker, but it's pretty close. 
 It's
 also backed by a hardware RAID system, although a smaller filesystem 
 on
 the same RAID has no issues.
  ...
  Then in my book, the patch didn't fix anything.  :-)  The system is
  still deadlocking; snapshot generation **should not** wedge the system
  hard like this.
  
  Also, during my own testing, I am always able to use Ctrl-T to get
  SIGINFO from the running process (mksnap_ffs).  That behaviour does not
  change for me.
  
  The rest of the below information is good -- but I'm confused about
  something: is there anyone out there who can use mksnap_ffs on a
  filesystem (/usr is a good test source) and NOT experience this
  deadlocking problem?
 
 I hadn't ever tried until I saw your message.  Granted, I'm using a
 smaller file system (I doubt that I have a total of as much as 2 TB in
 all my machines combined), and I'm running i386, vs. amd64.  But it ran
 just fine.  I wasn't able to test SIGINFO; it finished before I had a
 chance.  (I ran it under time(1); wall clock time was 0.91 sec.)
 
  Literally *every* FreeBSD box I have root access
  to suffers from this problem, so I'm a little baffled why we end-users
  need to keep providing debugging output when it should be easy as pie
  for a developer to do dump -0 -L -a -f /path/fs.dump /usr and watch
  their system wedge.
 
 Well, I routinely use dump/restore pipelines to copy file systems
 around; never had a problem with it.
 
  ...
 
 For reference:
 
 freebeast(7.1-P)[9] uname -a
 FreeBSD freebeast.catwhisker.org 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #127: 
 Wed Nov 12 05:16:20 PST 2008 [EMAIL 
 PROTECTED]:/common/S3/obj/usr/src/sys/FREEBEAST  i386
 freebeast(7.1-P)[10] ls -la
 total 4
 drwxrwxr-x   2 root  operator  512 Nov 12 20:53 .
 drwxr-xr-x  14 root  wheel 512 Jan 22  2008 ..
 freebeast(7.1-P)[11] /usr/bin/time -l mksnap_ffs /S2/usr test.1
 0.91 real 0.00 user 0.05 sys
976  maximum resident set size
  3  average shared memory size
627  average unshared data size
109  average unshared stack size
104  page reclaims
  0  page faults
  0  swaps
  1  block input operations
230  block output operations
  0  messages sent
  0  messages received
  0  signals received
101  voluntary context switches
 34  involuntary context switches
 freebeast(7.1-P)[12] ls -la
 total 1460
 drwxrwxr-x   2 root  operator 512 Nov 12 20:54 .
 drwxr-xr-x  14 root  wheel512 Jan 22  2008 ..
 -r--r-   1 root  operator  2410791056 Nov 12 20:54 test.1
 freebeast(7.1-P)[13] 

David, thanks for chiming in.  This is exactly what I was
fearing/worried about.

It would be greatly beneficial if we could figure out what triggers the
slowdown for a lot of us, since for others (proof above) mksnap_ffs
behaves as expected.

Since I'm able to reproduce this pretty much everywhere, here's
information:

# df -ki /usr
Filesystem  1024-blocks    Used     Avail Capacity  iused    ifree %iused  Mounted on
/dev/ad4s1f   163815904 3835274 146875358     3%   254864 20941934    1%   /usr

# cd /usr/.snap
# /usr/bin/time -l mksnap_ffs /usr test.1

after about 20 seconds, hitting Ctrl-T

load: 1.90  cmd: mksnap_ffs 11719 [wdrain] 0.00u 0.07s 0% 1092k
   23.25 real 0.00 user 0.00 sys

      135.98 real         0.00 user         0.62 sys
      1092  maximum resident set size
         4  average shared memory size
      1081  average unshared data size
       135  average unshared stack size
       101  page reclaims
         0  page faults
         0  swaps
       895  block input operations
     13444  block output operations
         0  messages sent
         0  messages received
         0  signals received
      6433  voluntary context switches
       197  involuntary context switches
# ls -l test.1
-r--r-  1 root  operator  173203463240 Nov 12 21:42 test.1

David's filesystem is about 2GB, while mine is about 160GB.  His snap takes
under 1 second, yet mine takes over 2 minutes.

Possibly the large deviation is explained by the amount of space used on
the 

Re: System deadlock when using mksnap_ffs

2008-11-12 Thread Wilko Bulte
Quoting Jeremy Chadwick, who wrote on Wed, Nov 12, 2008 at 08:42:00PM -0800 ..
 On Thu, Nov 13, 2008 at 12:41:02AM +, Tim Bishop wrote:
  On Wed, Nov 12, 2008 at 09:47:35PM +0200, Kostik Belousov wrote:
   On Wed, Nov 12, 2008 at 05:58:26PM +, Tim Bishop wrote:
I've been playing around with snapshots lately but I've got a problem on
one of my servers running 7-STABLE amd64:

FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10 
20:49:51 GMT 2008 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/PALADIN  amd64

I run the mksnap_ffs command to take the snapshot and some time later
the system completely freezes up:

paladin# cd /u2/.snap/
paladin# mksnap_ffs /u2 test.1

It only happens on this one filesystem, though, which might be to do
with its size. It's not over the 2TB marker, but it's pretty close. It's
also backed by a hardware RAID system, although a smaller filesystem on
the same RAID has no issues.

Filesystem  1K-blocks      Used     Avail Capacity  Mounted on
/dev/da0s1a 2078881084 921821396 990749202    48%    /u2

To clarify "completely freezes up": unresponsive to all services over
the network, except ping. On the console I can switch between the ttys,
but none of them respond. The only way out is to hit the reset button.
   
   You need to provide information described in the
   http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug.html
   and especially
   http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
  
  Ok, I've done that, and removed the patch that seemed to fix things.
  
  The first thing I notice after doing this on the console is that I can
  still ctrl+t the process:
  
  load: 0.14  cmd: mksnap_ffs 2603 [newbuf] 0.00u 10.75s 0% 1160k
  
  But the top and ps I left running on other ttys have all stopped
  responding.
 
 Then in my book, the patch didn't fix anything.  :-)  The system is
 still deadlocking; snapshot generation **should not** wedge the system
 hard like this.
 
 Also, during my own testing, I am always able to use Ctrl-T to get
 SIGINFO from the running process (mksnap_ffs).  That behaviour does not
 change for me.
 
 The rest of the below information is good -- but I'm confused about
 something: is there anyone out there who can use mksnap_ffs on a
 filesystem (/usr is a good test source) and NOT experience this
 deadlocking problem?  Literally *every* FreeBSD box I have root access
 to suffers from this problem, so I'm a little baffled why we end-users
 need to keep providing debugging output when it should be easy as pie
 for a developer to do dump -0 -L -a -f /path/fs.dump /usr and watch
 their system wedge.

dump -L on my RELENG_7 machine does not wedge it.  So there must be
multiple factors influencing whether snapshot creation causes problems or not.

Wilko


Re: System deadlock when using mksnap_ffs

2008-11-12 Thread Peter Jeremy
On 2008-Nov-12 20:47:37 -0800, Doug Ambrisko [EMAIL PROTECTED] wrote:
I plan to commit it tomorrow since I sent it to Tim to test.  The 10 can 
be tuned but it has kept a bunch of machines at work up.  Glad people 
don't think it is too wrong :-)  It probably could be made
a little more dynamic but I wonder if it would show any real performance
difference and might risk more bugs.

FWIW, I've been running the patch since I first saw Doug post it in
Feb 2006 and don't recall ever having problems with mksnap_ffs since
applying it (I did have them before).

-- 
Peter Jeremy
Please excuse any delays as the result of my ISP's inability to implement
an MTA that is either RFC2821-compliant or matches their claimed behaviour.




Re: System deadlock when using mksnap_ffs

2008-11-12 Thread Kevin Day


(moving my thread from -fs to -stable)


Before touching anything, here's a description of the symptoms I  
see... Rather busy system, with quite a bit of filesystem activity  
occurring while the snapshot is being made. Quad CPU amd64 box with  
16GB of ram, 6x10Krpm RAID array. Should be reasonably fast.


Filesystem   1K-blocks     Used     Avail Capacity   iused    ifree %iused  Mounted on
/dev/da0s1a  739339824 74357926 605834714    11%   1718540 93855474    2%   /

1.7 million inodes, 71G used of a 705G volume.
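
The percentages and the "71G of 705G" summary can be checked from the raw
df figures. The sketch below assumes that df's Capacity is
used / (used + avail) rounded to the nearest whole percent, that the G
suffix means 2^30 bytes, and that the Avail column reads 605834714; both
helper names are made up for illustration.

```c
#include <assert.h>

/*
 * Percentage of used out of (used + avail), rounded to nearest.
 * Assumed to match how df derives Capacity and %iused.
 */
static unsigned
df_percent(unsigned long long used, unsigned long long avail)
{
    unsigned long long total = used + avail;

    assert(total > 0);
    return ((unsigned)((used * 100 + total / 2) / total));
}

/* Convert a count of 1K-blocks to whole GiB, rounded to nearest. */
static unsigned long long
kib_to_gib(unsigned long long kib)
{
    const unsigned long long per_gib = 1024ULL * 1024ULL;

    return ((kib + per_gib / 2) / per_gib);
}
```

For this volume, df_percent(74357926, 605834714) gives 11 and
df_percent(1718540, 93855474) gives 2, while kib_to_gib() maps 74357926
and 739339824 to 71 and 705 respectively.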

Here's a timeline of what I see when starting to make a new snapshot.  
I've got a few windows running, showing top, iostat, etc.



Baseline disk activity before starting anything:

device     r/s   w/s    kr/s    kw/s wait svc_t   b
da0       24.0   2.0   355.6    32.0    1  10.7  28


0m0s: Snapshot begins, using "mount -u -o snapshot //.snap/weekly.0 /".
Drives immediately jump to 100% busy as expected.


device     r/s   w/s    kr/s    kw/s wait svc_t   b
da0      153.8   6.0  3378.6    95.9    2  16.9 100

the mount process is spending 100% of its time in biord.


2m10s: The mount process starts spending more and more time in  
snaplk, alternating with biord.


device     r/s   w/s    kr/s    kw/s wait svc_t   b
da0       77.9  67.9  1270.7  3754.2    1  10.7 100


12m15s: The first intermittent slowdowns start affecting other  
processes on the system. Occasionally all active processes will get  
stuck in snaplk or ufs for 5-10 seconds before resuming.


device     r/s   w/s    kr/s    kw/s wait svc_t   b
da0       77.9  31.0  1150.8  1054.9    1  10.4 100


114m47s: Active processes are briefly stuck in suspfs

115m22s: Mount is now in snaprdb; active processes are now
completely stuck in snaplk. Still responsive to SIGINFO, top is
still running, etc. The system just hangs any time anything needs the
filesystem.


device     r/s   w/s    kr/s    kw/s wait svc_t   b
da0      238.8   0.0  3820.1     0.0    1   4.1  99

143m19s: Mount now in wdrain.

143m34s: Finished.

Snapshot logging shows "/: suspended 13.308 sec, redo 153 of 4058".
Most processes were hung for 28 minutes.



Is this what others are seeing? It sounds like some of the complaints
are about it getting stuck in the wdrain state, not what I'm showing here.





Another mildly annoying note: any process that touches .snap while a
snapshot is being generated gets stuck in ufs until it finishes. I
can understand wanting to keep operations in there in sync, but it
would be really nice if find / wouldn't get hung when it tries to
descend into .snap, for example.


ts5# cd /.snap
ts5# ls -l
^T
load: 0.17  cmd: ls 3696 [ufs] 0.00u 0.00s 0% 1496k

