...

> My understanding of ZFS (in short: an upside-down
> tree) is that each block is referenced by its
> parent. So regardless of how many snapshots you take,
> each block is only ever referenced by one other, and
> I'm guessing that the pointer and checksum are both
> stored there.
>
> If that's the case, to move a block it's just a case
> of:
>
> - read the data
> - write to the new location
> - update the pointer in the parent block
Which changes the contents of the parent block (the changed data checksum alters it as well), and thus requires that this parent also be rewritten (using COW), which changes the pointer to it (and of course its checksum) in *its* parent block, which thus must also be rewritten... and finally a new copy of the superblock is written to reflect the new underlying tree structure - all this in a single batch-written 'transaction'.

The old version of each of these blocks need only be *saved* if a snapshot exists and the block hasn't already been updated since that snapshot was created. But all the blocks need to be COWed even if no snapshot exists (in which case the old versions are simply discarded).

...

> PS.
>
> > 1. You'd still need an initial defragmentation pass
> > to ensure that the file was reasonably piece-wise
> > contiguous to begin with.
>
> No, not necessarily. If you were using a zpool
> configured like this I'd hope you were planning on
> creating the file as a contiguous block in the first
> place :)

I'm not certain that you could ensure this if other updates in the system were occurring concurrently. Furthermore, the file may be extended dynamically as new data is inserted, and you'd like some mechanism that could restore reasonable contiguity to the result (which can be difficult to accomplish in the foreground if, for example, free space doesn't happen to exist on the disk right after the existing portion of the file).

...

> Any zpool with this option would probably be
> dedicated to the database file and nothing else. In
> fact, even with multiple databases I think I'd have a
> single pool per database.

It's nice if you can afford such dedicated resources, but it seems a bit cavalier to ignore users who just want decent performance from a database that has to share its resources with other activity.
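As an aside, the COW cascade described at the top of this message can be sketched in miniature. This is purely an illustration in Python - Block, cow_update, and the two-level tree are toy stand-ins, not real ZFS structures - but it shows why touching one data block forces a fresh copy of every ancestor up to the superblock while untouched siblings stay shared:

```python
# Toy model of copy-on-write path propagation: updating one data block
# forces a new copy of every ancestor up to the superblock. Illustrative
# only; real ZFS stores pointer+checksum in block-pointer structures.
import hashlib

class Block:
    def __init__(self, data=b"", children=None):
        self.data = data
        self.children = children or []   # "pointers" to child blocks

    def checksum(self):
        # A parent's checksum covers its children's checksums, so a
        # changed child necessarily changes every ancestor.
        h = hashlib.sha256(self.data)
        for c in self.children:
            h.update(c.checksum())
        return h.digest()

def cow_update(root, path, new_data):
    """Return a NEW root reflecting the update; the old tree is untouched.

    `path` is a list of child indices from the root down to the leaf.
    Every block along the path is rewritten (COW); blocks off the path
    are shared between the old and new trees.
    """
    if not path:
        return Block(new_data)                # new copy of the data block
    i, rest = path[0], path[1:]
    children = list(root.children)            # share untouched siblings
    children[i] = cow_update(children[i], rest, new_data)
    return Block(root.data, children)         # new copy of this ancestor

# Tiny tree: superblock -> indirect block -> two data blocks.
leaf_a = Block(b"old data")
leaf_b = Block(b"other data")
indirect = Block(children=[leaf_a, leaf_b])
superblock = Block(children=[indirect])

new_superblock = cow_update(superblock, [0, 0], b"new data")

# The old tree (a snapshot, in effect) is intact...
assert superblock.children[0].children[0].data == b"old data"
assert new_superblock.children[0].children[0].data == b"new data"
# ...and the unmodified sibling block is shared, not copied.
assert new_superblock.children[0].children[1] is leaf_b
```

Note that the old superblock and its path survive only if something (a snapshot) still references them; otherwise those old copies are simply freed after the transaction commits, exactly as described above.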
Your prompt response is probably what prevented me from editing my previous post after I re-read it and realized I had overlooked the fact that over-writing the old data complicates things. So I'll just post the revised portion here:

3. Now you must make the above transaction persistent, and then randomly over-write the old data block with the new data (since that data must be in place before you update the path to it below, and unfortunately, since its location is not arbitrary, you can't combine this update with either the transaction above or the transaction below).

4. You can't just slide in the new version of the block using the old version's existing set of ancestors because a) you just deallocated that path above (introducing additional mechanism to preserve it temporarily almost certainly would not be wise), b) the data block checksum changed, and c) in any event this new path should be *newer* than the path to the old version's new location that you just had to establish (if a snapshot exists, that's the path that should be propagated to it by the COW mechanism). However, this is just the normal situation whenever you update a data block (save for the fact that the block itself was already written above): all the *additional* overhead occurred in the previous steps.

So instead of a single full-path update that fragments the file, you have two full-path updates, a random write, and possibly a random read initially to fetch the old data. And you still need an initial defrag pass to establish initial contiguity. Furthermore, these additional resources are consumed at normal priority rather than at the reduced priority at which a background reorg can operate. On the plus side, though, the file would be kept contiguous all the time rather than just returned to contiguity whenever there was time to do so.

...

> Taking it a stage further, I wonder if this would
> work well with the prioritized write feature request
> (caching writes to a solid state disk)?
> http://www.genunix.org/wiki/index.php/OpenSolaris_Storage_Developer_Wish_List
>
> That could potentially mean there's very little
> slowdown:
>
> - Read the original block
> - Save that to solid state disk
> - Write the new block in the original location
> - Periodically stream writes from the solid state
>   disk to the main storage

I'm not sure this would confer much benefit if things in fact need to be handled as I described above. In particular, if a snapshot exists you almost certainly must establish the old version in its new location in the snapshot rather than just capture it in the log; if no snapshot exists you could capture the old version in the log and then discard it as soon as the new version becomes persistent, but I'm not sure how easily that (and especially recovering should a crash occur before the new version becomes persistent) could be integrated with the existing COW facilities.

- bill

This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss