...

> - Nathan appears to have suggested a good workaround.
> Could ZFS be updated to have a 'contiguous' setting
> where blocks are kept together. This sacrifices
> write performance for read.
I had originally thought that this would be incompatible with ZFS's snapshot mechanism, but with a minor tweak it may not be.

...

> - Bill seems to understand the issue, and added some
> useful background (although in an entertaining but
> rather condescending way).

There is a bit of nearby history that led to that.

...

> One point that I haven't seen raised yet: I believe
> most databases will have had years of tuning based
> around the assumption that their data is saved
> contiguously on disk. They will be optimising their
> disk access based on that and this is not something
> we should ignore.

Ah - nothing like real, experienced user input.

I tend to agree with ZFS's general philosophy of attempting to minimize the number of knobs that need tuning, but this can lead to forgetting that higher-level software may have knobs of its own. My original assumption was that databases automatically attempt to leverage on-disk contiguity (which the more evolved ones certainly do when they control the on-disk layout themselves, and which one might suspect they try to do even when running on top of files, on the assumption that the file system is trying to preserve on-disk contiguity), but of course admins play a major role as well (e.g., in determining which indexes need not be created because sequential table scans can get the job done efficiently).

...

> I definitely don't think defragmentation is the
> solution (although that is needed in ZFS for other
> scenarios). If your database is under enough read
> strain to need the fix suggested here, your disks
> definitely do not have the time needed to scan and
> defrag the entire system.

Well, it's only this kind of randomly-updated/sequentially-scanned data that needs much defragmentation in the first place. Data that's written once and then only read needs at worst a single defragmentation pass (if the original writes got interrupted by a lot of other update activity); data that's not read sequentially (e.g., indirect blocks) needn't be defragmented at all; and neither does data that's seldom read and/or not very fragmented in the first place.

> It would seem to me that Nathan's suggestion right at
> the start of the thread is the way to go. It
> guarantees read performance for the database, and
> would seem to be relatively easy to implement at the
> zpool level. Yes it adds considerable overhead to
> writes, but that is a decision database
> administrators can make given the expected load.
>
> If I'm understanding Nathan right, saving a block of
> data would mean:
> - Reading the original block (may be cached if we're
>   lucky)
> - Saving that block to a new location
> - Saving the new data to the original location

1. You'd still need an initial defragmentation pass to ensure that the file was reasonably piece-wise contiguous to begin with.

2. You can't move the old version of the block without updating all its ancestors (since the pointer to it changes). When you update this path to the old version, you need to suppress the normal COW behavior if a snapshot exists, because it would otherwise leave the old path pointing to the old data location that you're just about to overwrite below. This presumably requires establishing the entire new path and deallocating the entire old path in a single transaction, but it may just be equivalent to a normal data block 'update' (one that doesn't happen to change any data in the block) when no snapshot exists. I don't *think* that there should be any new issues raised with other updates that may be combined in the same 'transaction', even if they may affect some of the same ancestral blocks.

3. You can't just slide the new version of the block in using the old version's existing set of ancestors, because a) you just deallocated that path above (and introducing additional mechanism to preserve it temporarily almost certainly would not be wise), b) the data block checksum changed, and c) in any event this new path should be *newer* than the path to the old version's new location that you just had to establish (if a snapshot exists, that's the path that should be propagated to it by the COW mechanism). However, this is just the normal situation whenever you update a data block: all the *additional* overhead occurred in the previous steps.

Given that doing the update twice, as described above, only adds to the bandwidth consumed (steps 2 and 3 should be able to be combined in a single transaction), the only additional disk seek would be the one required to re-read the original data if it wasn't cached. So you may well be correct that this approach would likely consume fewer resources than background defragmentation would (though, as noted above, you'd still need an initial defrag pass to establish initial contiguity), and while the additional resources would be consumed at normal rather than reduced priority, the file would be kept contiguous all the time rather than just returned to contiguity whenever there was time to do so.
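To make the sequence above a bit more concrete, here's a toy in-memory sketch (my own illustration with invented names - not ZFS code). Each live block has a fixed 'home' address inside a contiguous region; an update first copies the old version out and rewrites any snapshot's path to point at the copy (the suppressed-COW step in point 2) before the new data lands back in the home address. The real thing would of course have to rewrite the whole chain of indirect blocks and checksums in a single transaction; the sketch collapses all of that into a flat list of block pointers.

# Hypothetical, highly simplified model of the scheme described above -
# not ZFS code, and all names are invented.  The "disk" is a dict of
# address -> data; the live file keeps every block at a fixed home address
# inside one contiguous region, so sequential scans stay sequential no
# matter how randomly the file is updated.

class Pool:
    def __init__(self, nblocks, base=0):
        # Freshly "defragmented" file: blocks live at base .. base+nblocks-1.
        self.disk = {base + i: b"\0" for i in range(nblocks)}
        self.base = base
        self.live = [base + i for i in range(nblocks)]   # live block pointers
        self.snapshots = []                              # each: a list of addresses
        self.next_free = base + nblocks                  # bump allocator for copies

    def snapshot(self):
        # A snapshot just captures the current block pointers (shared, COW-style).
        self.snapshots.append(list(self.live))

    def write(self, blkno, data):
        home = self.base + blkno
        if any(snap[blkno] == home for snap in self.snapshots):
            # Point 2: a snapshot still references the home address, so first
            # copy the old version out and rewrite the snapshot paths to point
            # at the copy (the "suppressed COW" update), all notionally in one
            # transaction.
            relocated = self.next_free
            self.next_free += 1
            self.disk[relocated] = self.disk[home]
            for snap in self.snapshots:
                if snap[blkno] == home:
                    snap[blkno] = relocated
        # Point 3: the new data always lands back in the home address, so the
        # live file stays contiguous; from here on it's a normal block update.
        self.disk[home] = data
        self.live[blkno] = home


pool = Pool(4)
pool.write(2, b"v1")
pool.snapshot()
pool.write(2, b"v2")     # "v1" is copied out of the way; block 2 stays at home
assert pool.disk[pool.live[2]] == b"v2"
assert pool.disk[pool.snapshots[0][2]] == b"v1"

The delicate part in a real pool is rewriting the old version's path without the usual COW; the toy model gets away with simply mutating a list in place.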
...

> Taking it a stage further, I wonder if this would
> work well with the prioritized write feature request
> (caching writes to a solid state disk)?
> http://www.genunix.org/wiki/index.php/OpenSolaris_Storage_Developer_Wish_List
>
> That could potentially mean there's very little
> slowdown:
> - Read the original block
> - Save that to solid state disk
> - Write the new block in the original location
> - Periodically stream writes from the solid state
>   disk to the main storage

I don't think this applies (nor would it confer any benefit) if things in fact need to be handled as I described above.

- bill
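P.S. For completeness, here's the same toy model extended with the solid-state staging step quoted above (again invented names, not a real interface). It mostly illustrates the point: the snapshot-path rewrite from points 2 and 3 still has to happen up front, and the staging device only defers the copy-out of the displaced old version (a real implementation would also have to satisfy reads of that version from the staging device until the flush completes).

from collections import deque

class StagedPool(Pool):                      # reuses the toy Pool sketched above
    def __init__(self, nblocks, base=0):
        super().__init__(nblocks, base)
        self.ssd_log = deque()               # (final address, data) awaiting flush

    def write(self, blkno, data):
        home = self.base + blkno
        if any(snap[blkno] == home for snap in self.snapshots):
            relocated = self.next_free
            self.next_free += 1
            # Stage the displaced old version on the solid-state device rather
            # than seeking on the main pool right now ...
            self.ssd_log.append((relocated, self.disk[home]))
            # ... but the snapshot paths still have to be rewritten immediately,
            # exactly as in point 2 above.
            for snap in self.snapshots:
                if snap[blkno] == home:
                    snap[blkno] = relocated
        self.disk[home] = data
        self.live[blkno] = home

    def flush_log(self):
        # "Periodically stream writes from the solid state disk to the main
        # storage", as the quoted proposal puts it.
        while self.ssd_log:
            addr, data = self.ssd_log.popleft()
            self.disk[addr] = data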