Re: ffs snapshots patch

2011-04-29 Thread Ignatios Souvatzis
On Fri, Apr 29, 2011 at 01:56:01PM +0200, Juergen Hannken-Illjes wrote:
 On Fri, Apr 29, 2011 at 01:48:39PM +0200, Manuel Bouyer wrote:
  With your last changes, things are much better now:
  /usr/bin/time fssconfig fss0 /home /home/snaps/snap0
    149.85 real 0.00 user 1.16 sys
  /home: suspended 0.040 sec, redo 0 of 2556
  /usr/bin/time fssconfig fss1 /home /home/snaps/snap1
    227.49 real 0.00 user 1.90 sys
  /home: suspended 0.040 sec, redo 0 of 2556
  /usr/bin/time fssconfig fss2 /home /home/snaps/snap2
    263.58 real 0.00 user 2.97 sys
  /home: suspended 0.040 sec, redo 0 of 2556
  /usr/bin/time fssconfig fss3 /home /home/snaps/snap3
    353.23 real 0.00 user 3.88 sys
  /home: suspended 0.040 sec, redo 0 of 2556
  
  Taking a snapshot will probably still require a lot of time on
  large filesystems with a dozen snapshots, but at least the server
  won't hang for a long time.
  thanks!
 
 Not really.  Any thread ending up in ffs_copyonwrite() or ffs_snapblkfree()
 will block.  If this server runs NFS, it is possible that all NFS
 server threads block.

Oh - I might have seen this on Monday - 5.99.47 on sparc64. All I saw
was [tstile], and the quickest way out after a couple of minutes was to
hard reboot the machine and let wapbl / fsck sort it out - and to move
back to the pre-snapshot rsync script.

Sorry, no core dump.

Regards,
-is


Re: ffs snapshots patch

2011-04-29 Thread Manuel Bouyer
On Fri, Apr 29, 2011 at 01:56:01PM +0200, Juergen Hannken-Illjes wrote:
 On Fri, Apr 29, 2011 at 01:48:39PM +0200, Manuel Bouyer wrote:
  [...]
 Not really.  Any thread ending up in ffs_copyonwrite() or ffs_snapblkfree()
 will block.  If this server runs NFS, it is possible that all NFS
 server threads block.

hmm, this is bad then, because an NFS server with home directories sees
mostly writes...

-- 
Manuel Bouyer bou...@antioche.eu.org
 NetBSD: 26 years of experience will always make the difference
--


Re: ffs snapshots patch

2011-04-28 Thread Manuel Bouyer
On Thu, Apr 28, 2011 at 11:48:55AM +0200, Juergen Hannken-Illjes wrote:
 On Wed, Apr 27, 2011 at 10:43:59AM +0200, Manuel Bouyer wrote:
  On Mon, Apr 18, 2011 at 09:36:25AM +0200, Juergen Hannken-Illjes wrote:
   [...]
Fixing 2) is trickier. To avoid the heavy writes to the snapshot file
with the fs suspended, the snapshot appears with its real length and
blocks at the time of creation, but is marked invalid (only the
inode block needs to be copied, and this can be done before suspending
the fs). Now BLK_SNAP should never be seen as a block number, and we skip
ffs_copyonwrite() if the write is to a snapshot inode.
   
   I strongly object here.  There are good reasons to expunge old snapshots.
   
   Even if it were done right, without deadlocks and locking-against-self,
   the resulting snapshot loses at least two properties:
   
   - A snapshot is considered stable.  Whenever you read a block you get
 the same contents.  Allowing old snapshots to exist but not running
 copy-on-write means these blocks will change their contents.
   
   - A snapshot will fsck clean.  It is impossible to change fsck_ffs
 to check a snapshot, as these old snapshots' indirect blocks will now
 contain garbage.
  
  Maybe we should relax these constraints then
 
 No.  We use snapshots (with -X) for fsck and dump.  This makes no sense
 if we cannot fsck a snapshot any more.

AFAIK dump will ignore snapshot files (or at least it should), so it's not a
problem if the snapshot's blocks change while we're working on a snapshot.
Also AFAIK, the above issue will only cause fsck to report missing blocks
in group maps and summary information. It's not a big deal either.

In their current form, snapshots are not usable even for this, because
it's not acceptable to suspend a file server for several tens of
seconds (if not minutes) to start a dump or fsck.

  /home: suspended 170.733 sec, redo 0 of 2556
  
  Even a 14s hang is still a long time for an NFS server (workstations will be
  frozen during that time). Even if we can make it shorter with some filesystem
  tuning, it still doesn't scale with the size of the filesystem and
  the number of snapshots (having 12 persistent snapshots on a filesystem is
  not an unreasonable number).
  Other OSes can do it with almost no freeze, so it should be possible
  (the snapshot may not be fsck-able, but I'm not sure that's the most
  important property of FS snapshots).
 
 The only other OS with ffs+snapshots is FreeBSD, which should behave similarly.
 Other file systems like ZFS, NilFS etc. will be faster and scale better as
 they are designed with instant snapshots in mind.

what about ext3fs ?



Re: ffs snapshots patch

2011-04-18 Thread Juergen Hannken-Illjes
On Sat, Apr 16, 2011 at 09:29:26PM +0200, Manuel Bouyer wrote:
 Hello,
 attached is a work in progress on ffs snapshots (as it's a work in progress,
 some debug and instrumentation code is still present in the
 patch, no need to comment on this part :).
 This work started when, while working on quotas, I noticed that
 taking a snapshot on a 500GB filesystem needs several minutes, and is
 O(n) in the number of persistent snapshots.
 Here are some timings on an otherwise idle 500GB filesystem (it's some brand of
 SATA2 3.5" drive attached to an AHCI controller, so it's a reasonable test
 bed for today):
 java# /usr/bin/time fssconfig fss0 /home /home/snaps/snap0
   260.53 real 0.00 user 1.15 sys
 /home: suspended 77.873 sec, redo 1184 of 2556
 java# /usr/bin/time fssconfig fss1 /home /home/snaps/snap1
   377.87 real 0.00 user 2.53 sys
 /home: suspended 206.078 sec, redo 1184 of 2556
 java# /usr/bin/time fssconfig fss2 /home /home/snaps/snap2
   508.23 real 0.00 user 4.28 sys
 /home: suspended 338.534 sec, redo 1184 of 2556
 java# /usr/bin/time fssconfig fss3 /home /home/snaps/snap3
   621.40 real 0.00 user 5.50 sys
 /home: suspended 431.154 sec, redo 1183 of 2556
 
 Suspending a filesystem for more than 7 minutes to take a snapshot makes
 persistent snapshots quite useless to me. I wonder how it would behave
 on a multi-terabyte filesystem.
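Redoing the arithmetic on the four `suspended` figures above shows roughly two extra minutes of suspension per additional persistent snapshot. The Python sketch below only performs that subtraction; the projection to 12 snapshots assumes the growth stays linear, which is an illustration and not a measurement.

```python
# Suspension times in seconds, copied from the fssconfig runs quoted above
# (fss0..fss3 on the same 500GB /home).
suspended = [77.873, 206.078, 338.534, 431.154]

# Extra suspension time added by each new persistent snapshot.
increments = [round(b - a, 3) for a, b in zip(suspended, suspended[1:])]

# Average growth per additional snapshot across the series.
avg_growth = (suspended[-1] - suspended[0]) / (len(suspended) - 1)

# ASSUMPTION: growth stays linear; this is an extrapolation, not a timing.
projected_12 = suspended[0] + 11 * avg_growth

print("increments (s):", increments)
print("average growth per snapshot: %.1f s" % avg_growth)
print("naive projection for 12 snapshots: %.0f s" % projected_12)
```

With these figures the average growth is about 118 s per snapshot, so a naive projection for 12 persistent snapshots is on the order of 23 minutes of suspension.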
 
 I looked at where the time is spent and found 2 major issues:
 1) cgaccount() works in 2 passes: first it copies the cgs before suspending the
   filesystem; then it is called again to copy only the cgs that have been
   modified between the first copy and the filesystem suspension.
   The problem is that to copy a cg we need to allocate blocks for the snapshot
   file, which may be in a cg we just copied. This is the cause of the high
   number of cg copies (almost half of them) with the filesystem suspended.
 
 2) while the filesystem is suspended, we want to expunge the snapshot files
   from the snapshot view (make them appear as 0-length files).
   With ~500GB sparse files this is a lot of work.
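For a sense of scale, the sketch below estimates how large the block map of a file spanning a whole 500 GB file system can get if every first-level indirect block is present. It assumes UFS2 with a 16 KiB block size (an assumption; the thread does not give the block size) and 64-bit block pointers, and it ignores the few higher-level indirect blocks, so it is an order-of-magnitude estimate, not a measure of what the expunge code actually reads or writes.

```python
# Back-of-the-envelope size of the block map of one full-length snapshot
# file.  ASSUMPTIONS (not from the thread): UFS2, 16 KiB blocks, 8-byte
# block pointers; fragments and double/triple indirect blocks are ignored.
KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3

fs_size = 500 * GIB      # a snapshot file is as long as the file system
block_size = 16 * KIB    # assumed file system block size
ptr_size = 8             # UFS2 block pointers are 64 bits wide

data_blocks = fs_size // block_size          # logical blocks in the file
ptrs_per_block = block_size // ptr_size      # entries per indirect block
# Ceiling division: first-level indirect blocks needed to map every block.
indirect_blocks = -(-data_blocks // ptrs_per_block)
indirect_mib = indirect_blocks * block_size // MIB

print(data_blocks, ptrs_per_block, indirect_blocks, indirect_mib)
```

Under these assumptions that is 16000 indirect blocks (250 MiB of block map) per full-length snapshot file, multiplied by every older snapshot that has to be expunged.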
 
 I fixed 1) by preallocating the needed blocks in snapshot_setup(). 

Good catch.  Committed.
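The redo effect in 1) can be reproduced with a toy model: copying each cg needs one snapshot block, and allocating that block modifies whichever cg it comes from, making an already-taken copy of that cg stale. The Python sketch below is hypothetical and is not cgaccount() or snapshot_setup(); `alloc_cg` is an arbitrary stand-in for the block allocator, and with this particular stand-in exactly half the cgs end up stale without preallocation.

```python
def snapshot_redo_count(ncg, prealloc, alloc_cg):
    """Toy model: number of cgs to recopy while the fs is suspended.

    ncg      -- number of cylinder groups
    prealloc -- allocate all snapshot blocks before the first pass
    alloc_cg -- maps the i-th allocation to the cg it is taken from
                (HYPOTHETICAL stand-in for the real block allocator)
    """
    up_to_date = [False] * ncg   # cg copy taken and still current

    def allocate(i):
        # Allocating a block changes that cg's maps, so any copy of it
        # taken earlier is now stale.
        up_to_date[alloc_cg(i)] = False

    if prealloc:
        for i in range(ncg):
            allocate(i)

    # Pass 1: file system still active; copy every cg.
    for cg in range(ncg):
        if not prealloc:
            allocate(cg)         # block that will hold this cg's copy
        up_to_date[cg] = True

    # Pass 2: file system suspended; every stale cg must be copied again.
    return up_to_date.count(False)


def clustered(i):
    # HYPOTHETICAL allocator: the i-th allocation lands in cg i // 2.
    return i // 2


print(snapshot_redo_count(2556, False, clustered))  # 1278
print(snapshot_redo_count(2556, True, clustered))   # 0
```

Preallocating moves every bitmap change before the first pass, so nothing is left for the suspended pass regardless of which cgs the allocator picks.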

 Fixing 2) is trickier. To avoid the heavy writes to the snapshot file
 with the fs suspended, the snapshot appears with its real length and
 blocks at the time of creation, but is marked invalid (only the
 inode block needs to be copied, and this can be done before suspending
 the fs). Now BLK_SNAP should never be seen as a block number, and we skip
 ffs_copyonwrite() if the write is to a snapshot inode.

I strongly object here.  There are good reasons to expunge old snapshots.

Even if it were done right, without deadlocks and locking-against-self,
the resulting snapshot loses at least two properties:

- A snapshot is considered stable.  Whenever you read a block you get
  the same contents.  Allowing old snapshots to exist but not running
  copy-on-write means these blocks will change their contents.

- A snapshot will fsck clean.  It is impossible to change fsck_ffs
  to check a snapshot, as these old snapshots' indirect blocks will now
  contain garbage.

You cannot copy blocks before suspension without rewriting them once
the file system is suspended.

The check in ffs_copyonwrite() will only work as long as the old
snapshot exists.  As soon as it gets removed we will run COW
on the blocks used by the old snapshot.
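The stability argument can be shown with a toy model in which every write first saves the old block contents into each snapshot that has not saved that block yet. The Python sketch below is hypothetical and far simpler than ffs_copyonwrite(); `skip_cow_blocks` is an invented stand-in for the proposed "skip COW for writes to a snapshot inode" check. A block exempted from copy-on-write changes underneath the newer snapshot, which is the loss of stability described above.

```python
class Snapshot:
    """Toy snapshot: keeps the pre-write contents of overwritten blocks."""

    def __init__(self, fs):
        self.fs = fs
        self.saved = {}  # block number -> contents at snapshot time

    def read(self, blkno):
        # Blocks not overwritten since the snapshot are read from the live fs.
        return self.saved.get(blkno, self.fs.blocks[blkno])


class FileSystem:
    """Toy file system doing copy-on-write into every snapshot on write."""

    def __init__(self, nblocks, skip_cow_blocks=()):
        self.blocks = [None] * nblocks
        self.snapshots = []
        # HYPOTHETICAL: blocks exempt from copy-on-write, standing in for
        # blocks owned by an older snapshot file.
        self.skip_cow_blocks = set(skip_cow_blocks)

    def take_snapshot(self):
        snap = Snapshot(self)
        self.snapshots.append(snap)
        return snap

    def write(self, blkno, data):
        if blkno not in self.skip_cow_blocks:
            for snap in self.snapshots:
                # Copy-on-write: save the old contents once per snapshot.
                snap.saved.setdefault(blkno, self.blocks[blkno])
        self.blocks[blkno] = data


# Block 3 plays the role of a block belonging to an older snapshot file.
fs = FileSystem(4, skip_cow_blocks={3})
fs.write(2, "a")
fs.write(3, "b")
snap = fs.take_snapshot()
fs.write(2, "A")
fs.write(3, "B")
print(snap.read(2))  # prints a: preserved by copy-on-write
print(snap.read(3))  # prints B: COW was skipped, so the snapshot changed
```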

 With these changes the times are much more reasonable:
 /usr/bin/time fssconfig fss0 /home /home/snaps/snap0
   299.68 real 0.00 user 1.10 sys
 /home: suspended 0.310 sec, redo 0 of 2556
 /usr/bin/time fssconfig fss1 /home /home/snaps/snap1
   188.10 real 0.00 user 0.86 sys
 /home: suspended 0.270 sec, redo 0 of 2556
 /usr/bin/time fssconfig fss2 /home /home/snaps/snap2
   169.78 real 0.00 user 0.95 sys
 /home: suspended 0.450 sec, redo 0 of 2556
 /usr/bin/time fssconfig fss3 /home /home/snaps/snap3
   172.39 real 0.00 user 0.99 sys
 /home: suspended 0.300 sec, redo 0 of 2556
 
 This seems to work; one issue with this patch is that the block
 count for the snapshot inode and the block summary information (the
 latter probably being a consequence of the former) appear wrong when
 running fsck against a snapshot.  I believe this is fixable, but
 I haven't yet found where the mismatch is coming from.
 
 comments ?
 
 PS: I'm away from computers for a week, so don't expect replies to
 your comments before next Sunday.
 

-- 
Juergen Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)