...

> Currently what it does is to maintain files subject to small random
> writes contiguous to the level of the zfs recordsize.  Now after a
> significant run of random writes the files end up with a scattered
> on-disk layout.  This should work well for the transaction parts of
> the workload.

Absolutely (save for the fact that every database block write winds up writing 
all the block's ancestors as well, but that's a different discussion and one 
where ZFS looks only somewhat sub-optimal rather than completely uncompetitive 
when compared with different approaches).

> But the implications of using a small recordsize are that large
> sequential scans of files will make the disk heads very busy fetching
> or pre-fetching recordsized chunks.

Well, no:  that's only an implication if you *choose* not to arrange the 
individual blocks on the disk to support sequential access better - and that 
choice can have devastating implications for the kind of workload being 
discussed here.  (Another horrendous example would be a simple array of 
fixed-size records in a file that is accessed - and in particular updated - 
randomly by ordinal record number converted to a file offset, but is also 
scanned sequentially for bulk processing.)
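
For concreteness, here's a minimal sketch of that access pattern (in Python, 
with a made-up record size and scan chunk size - none of the numbers come 
from the post I'm replying to):

import os

RECORD_SIZE = 512          # fixed-size application records (illustrative)
SCAN_CHUNK  = 1 << 20      # read 1 MB at a time when scanning sequentially

def update_record(fd, record_number, payload):
    """Random update: ordinal record number converted to a file offset."""
    assert len(payload) == RECORD_SIZE
    os.pwrite(fd, payload, record_number * RECORD_SIZE)

def scan_all_records(fd):
    """Bulk processing: read the same file back sequentially in large chunks."""
    offset = 0
    while True:
        chunk = os.pread(fd, SCAN_CHUNK, offset)
        if not chunk:
            break
        for i in range(0, len(chunk), RECORD_SIZE):
            yield chunk[i:i + RECORD_SIZE]
        offset += len(chunk)

If the file system responds to the random updates by scattering those records 
all over the platters, the second function degenerates into random I/O even 
though the application is asking for a purely sequential read.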

Yes, choosing to reorganize files does have the kinds of snapshot implications 
that I've mentioned, but in most installations (save those entirely dedicated 
to databases) the situation under discussion here will typically involve only a 
small subset of the total data stored and thus reorganization shouldn't have 
severe consequences.

> Get more spindles and a good prefetch algorithm and you can reach
> whatever throughput you need.

At the cost of using about 25x (or 50x - 200x, if using RAID-Z) as many disks 
as you'd need for the same throughput if the blocks were laid out in 1 MB 
chunks rather than the 16 KB chunks in the example database.
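
To show where the "about 25x" comes from (and the "about 4x" for 128 KB blocks 
that comes up again below), here's a toy per-disk throughput model; the 
seek/rotation and transfer-rate figures are round numbers I'm assuming for a 
current disk, not anything taken from the quoted post:

# Rough model of per-disk scan throughput when the data is fetched in
# randomly-placed chunks of a given size.
ACCESS_MS = 8.0     # assumed average seek + rotational latency per access
XFER_MB_S = 80.0    # assumed media transfer rate once the head is positioned

def throughput_mb_s(chunk_kb):
    chunk_mb = chunk_kb / 1024.0
    service_ms = ACCESS_MS + (chunk_mb / XFER_MB_S) * 1000.0
    return chunk_mb / (service_ms / 1000.0)

for kb in (16, 128, 1024):
    print("%5d KB chunks: %5.1f MB/s per disk" % (kb, throughput_mb_s(kb)))

That prints roughly 1.9, 13 and 49 MB/s respectively: about a 25x gap between 
16 KB blocks and 1 MB chunks, and about 4x between 128 KB and 1 MB.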

> The problem is that your scanning ops will create heavy competition
> at the spindle level, thus impacting the transactional response time
> (once you have 150 IOPS on every spindle just prefetching data for
> your full table scans, the OLTP will suffer).

And this effect on OLTP performance would be dramatically reduced as well if 
you pulled the sequential scan off the disk in large chunks rather than in 
randomly-distributed individual blocks.
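
A quick way to see how much it helps: taking the 16 KB blocks of the example 
database, the quoted 150 IOPS of prefetch per spindle amounts to only about 
2.3 MB/s of scan traffic, so ask how many operations that same scan bandwidth 
costs at larger chunk sizes (a rough sketch that ignores the longer transfer 
time of the bigger chunks):

# How much of a spindle's random-access budget a table scan consumes, as a
# function of the chunk size in which the scanned data is laid out.
SCAN_MB_S_PER_SPINDLE = 150 * 16 / 1024.0   # ~2.3 MB/s, from the quoted figure

for chunk_kb in (16, 128, 1024):
    iops = SCAN_MB_S_PER_SPINDLE / (chunk_kb / 1024.0)
    print("%5d KB layout: %6.1f scan IOPS per spindle" % (chunk_kb, iops))

16 KB gives the quoted 150 IOPS, 128 KB about 19, and 1 MB chunks only about 
2.3 - leaving nearly the spindle's entire random-access capacity for the OLTP 
side of the workload.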

> Yes we do need data to characterise this but the physics are fairly
> clear.

Indeed they are - thanks for recognizing this better than some here have 
managed to.

>
> The BP suggesting a small recordsize needs to be updated.
>
> We need to strike a balance between random writes and sequential
> reads, which does imply using greater records than 8K/16K DB blocks.

As Jesus Cea observed in the recent "ZFS + DB + default blocksize" discussion, 
this then requires that every database block update first read in the larger 
ZFS record before performing the update rather than allowing the database block 
to be written directly.  The ZFS block *might* still be present in ZFS's cache, 
but if most RAM is dedicated to the database cache (which for many workloads 
makes more sense) the chances of this are reduced (not only is ZFS's cache 
smaller but the larger database cache will hold the small database blocks a 
*lot* longer than ZFS's cache will hold any associated large ZFS blocks, so a 
database block update can easily occur long after the associated ZFS block has 
been evicted).
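
To make that read-modify-write concrete, here's a rough sketch of what the 
update path looks like once the large record has fallen out of the 
file-system cache; the sizes and the toy dict cache are my own illustrative 
assumptions, not ZFS internals:

import os

RECORDSIZE   = 128 * 1024     # file-system record size (assumed)
DB_BLOCKSIZE = 8 * 1024       # database block size (assumed)

record_cache = {}             # records still resident in the FS cache

def write_db_block(fd, block_offset, block):
    rec_offset = (block_offset // RECORDSIZE) * RECORDSIZE
    record = record_cache.get(rec_offset)
    if record is None:
        # Cache miss: the whole 128 KB record has to be read in just so
        # that 8 KB of it can be changed.
        record = bytearray(os.pread(fd, RECORDSIZE, rec_offset))
    delta = block_offset - rec_offset
    record[delta:delta + DB_BLOCKSIZE] = block
    # The entire modified record then gets written back out (in a
    # copy-on-write file system, to a freshly allocated location).
    os.pwrite(fd, bytes(record), rec_offset)
    record_cache[rec_offset] = record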

Even ignoring the deleterious effect on random database update performance, you 
still can't get *good* performance for sequential scans this way because 
maximum-size 128 KB ZFS blocks laid out randomly are still a factor of about 4 
less efficient in doing this than 1 MB chunks would be (i.e., you'd need about 
4x as many disks - or 8x - 32x as many disks if using RAID-Z - to achieve 
comparable performance).

And it just doesn't have to be that way - as other modern Unix file systems 
recognized long ago.  You don't even need to embrace extent-based allocation as 
they did, but just rearrange your blocks sensibly - and to at least some degree 
you could do that while they're still cache-resident if you captured updates 
for a while in the ZIL (what's the largest update that you're willing to stuff 
in there now?) before batch-writing them back to less temporary locations.
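
For what it's worth, here's a rough sketch of the sort of capture-then-coalesce 
behaviour I have in mind; the class, its names and its in-place writes are 
purely illustrative and not a description of how ZFS or its ZIL actually work:

import os

class CoalescingWriter:
    def __init__(self, fd):
        self.fd = fd
        self.log = {}                      # logical offset -> pending data

    def update(self, offset, block):
        self.log[offset] = block           # captured, not yet placed on disk

    def flush(self):
        # Write the captured blocks back in logical order, merging runs of
        # adjacent blocks into single large contiguous writes.
        run_start, run = None, bytearray()
        for offset in sorted(self.log):
            if run_start is not None and offset == run_start + len(run):
                run += self.log[offset]
            else:
                if run_start is not None:
                    os.pwrite(self.fd, bytes(run), run_start)
                run_start, run = offset, bytearray(self.log[offset])
        if run_start is not None:
            os.pwrite(self.fd, bytes(run), run_start)
        self.log.clear()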

RAID-Z could be fixed as well, which would help a much wider range of workloads.

- bill
 
 