...

> My understanding of ZFS (in short: an upside-down
> tree) is that each block is referenced by its
> parent. So regardless of how many snapshots you take,
> each block is only ever referenced by one other, and
> I'm guessing that the pointer and checksum are both
> stored there.
>
> If that's the case, to move a block it's just a case
> of:
>
> - read the data
> - write to the new location
> - update the pointer in the parent block
Which changes the contents of the parent block (the changed data checksum alters it as well), and thus requires that this parent also be rewritten (using COW), which changes the pointer to it (and of course its checksum) in *its* parent block, which thus must also be rewritten... and finally a new copy of the superblock is written to reflect the new underlying tree structure - all this in a single batch-written 'transaction'.

The old version of each of these blocks need only be *saved* if a snapshot exists and the block hasn't already been updated since that snapshot was created. But all the blocks need to be COWed even if no snapshot exists (in which case the old versions are simply discarded).

...

> PS.
>
> > 1. You'd still need an initial defragmentation pass
> > to ensure that the file was reasonably piece-wise
> > contiguous to begin with.
>
> No, not necessarily. If you were using a zpool
> configured like this I'd hope you were planning on
> creating the file as a contiguous block in the first
> place :)

I'm not certain that you could ensure this if other updates in the system were occurring concurrently. Furthermore, the file may be extended dynamically as new data is inserted, and you'd like some mechanism that could restore reasonable contiguity to the result (which can be difficult to accomplish in the foreground if, for example, free space doesn't happen to exist on the disk right after the existing portion of the file).

...

> Any zpool with this option would probably be
> dedicated to the database file and nothing else. In
> fact, even with multiple databases I think I'd have a
> single pool per database.

It's nice if you can afford such dedicated resources, but it seems a bit cavalier to ignore users who just want decent performance from a database that has to share its resources with other activity.
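As an aside, the COW cascade described at the top of this message can be sketched in miniature. This is purely an illustration in Python - Block, cow_update, and the two-level tree are toy stand-ins, not real ZFS structures - but it shows why touching one data block forces a fresh copy of every ancestor up to the superblock while untouched siblings stay shared:

```python
# Toy model of copy-on-write path propagation: updating one data block
# forces a new copy of every ancestor up to the superblock. Illustrative
# only; real ZFS stores pointer+checksum in block-pointer structures.
import hashlib

class Block:
    def __init__(self, data=b"", children=None):
        self.data = data
        self.children = children or []   # "pointers" to child blocks

    def checksum(self):
        # A parent's checksum covers its children's checksums, so a
        # changed child necessarily changes every ancestor.
        h = hashlib.sha256(self.data)
        for c in self.children:
            h.update(c.checksum())
        return h.digest()

def cow_update(root, path, new_data):
    """Return a NEW root reflecting the update; the old tree is untouched.

    `path` is a list of child indices from the root down to the leaf.
    Every block along the path is rewritten (COW); blocks off the path
    are shared between the old and new trees.
    """
    if not path:
        return Block(new_data)                # new copy of the data block
    i, rest = path[0], path[1:]
    children = list(root.children)            # share untouched siblings
    children[i] = cow_update(children[i], rest, new_data)
    return Block(root.data, children)         # new copy of this ancestor

# Tiny tree: superblock -> indirect block -> two data blocks.
leaf_a = Block(b"old data")
leaf_b = Block(b"other data")
indirect = Block(children=[leaf_a, leaf_b])
superblock = Block(children=[indirect])

new_superblock = cow_update(superblock, [0, 0], b"new data")

# The old tree (a snapshot, in effect) is intact...
assert superblock.children[0].children[0].data == b"old data"
assert new_superblock.children[0].children[0].data == b"new data"
# ...and the unmodified sibling block is shared, not copied.
assert new_superblock.children[0].children[1] is leaf_b
```

Note that the old superblock and its path survive only if something (a snapshot) still references them; otherwise those old copies are simply freed after the transaction commits, exactly as described above.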
Your prompt response is probably what prevented me from editing my previous post after I re-read it and realized I had overlooked the fact that over-writing the old data complicates things. So I'll just post the revised portion here:

3. Now you must make the above transaction persistent, and then randomly over-write the old data block with the new data (since that data must be in place before you update the path to it below, and unfortunately, since its location is not arbitrary, you can't combine this update with either the transaction above or the transaction below).

4. You can't just slide in the new version of the block using the old version's existing set of ancestors because a) you just deallocated that path above (introducing additional mechanism to preserve it temporarily almost certainly would not be wise), b) the data block checksum changed, and c) in any event this new path should be *newer* than the path to the old version's new location that you just had to establish (if a snapshot exists, that's the path that should be propagated to it by the COW mechanism). However, this is just the normal situation whenever you update a data block (save for the fact that the block itself was already written above): all the *additional* overhead occurred in the previous steps.

So instead of a single full-path update that fragments the file, you have two full-path updates, a random write, and possibly a random read initially to fetch the old data. And you still need an initial defrag pass to establish initial contiguity. Furthermore, these additional resources are consumed at normal priority rather than at the reduced priority at which a background reorg can operate. On the plus side, though, the file would be kept contiguous all the time rather than just returned to contiguity whenever there was time to do so.

...

> Taking it a stage further, I wonder if this would
> work well with the prioritized write feature request
> (caching writes to a solid state disk)?
> http://www.genunix.org/wiki/index.php/OpenSolaris_Storage_Developer_Wish_List
>
> That could potentially mean there's very little
> slowdown:
>
> - Read the original block
> - Save that to solid state disk
> - Write the new block in the original location
> - Periodically stream writes from the solid state
>   disk to the main storage

I'm not sure this would confer much benefit if things in fact need to be handled as I described above. In particular, if a snapshot exists you almost certainly must establish the old version in its new location in the snapshot rather than just capture it in the log; if no snapshot exists you could capture the old version in the log and then discard it as soon as the new version becomes persistent, but I'm not sure how easily that (and especially recovering should a crash occur before the new version becomes persistent) could be integrated with the existing COW facilities.

- bill

This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss