can you guess? wrote:
>> can you guess? wrote:
>>>> For very read intensive and position sensitive
>>>> applications, I guess this sort of capability
>>>> might make a difference?
>>> No question about it.  And sequential table scans
>>> in databases are among the most significant
>>> examples, because (unlike things like streaming
>>> video files which just get laid down initially and
>>> non-synchronously in a manner that at least
>>> potentially allows ZFS to accumulate them in large,
>>> contiguous chunks - though ISTR some discussion
>>> about just how well ZFS managed this when it was
>>> accommodating multiple such write streams in
>>> parallel) the tables are also subject to
>>> fine-grained, often-random update activity.
>>>
>>> Background defragmentation can help, though it
>>> generates a boatload of additional space overhead
>>> in any applicable snapshot.
>>
>> The reason that this is hard to characterize is that
>> there are
>> really two very different configurations used to
>> address different
>> performance requirements: cheap and fast.  It seems
>> that when most
>> people first consider this problem, they do so from
>> the cheap
>> perspective: single disk view.  Anyone who strives
>> for database
>> performance will choose the fast perspective:
>> stripes.
>>     
>
> And anyone who *really* understands the situation will do both.
>   

I'm not sure I follow.  Many people who run high-performance
databases use hardware RAID arrays, which often do not
expose single disks.

>> Note: data redundancy isn't really an issue for this
>> analysis, but consider it done in real life.  When
>> you have a striped storage device under a file
>> system, then the database or file system's view of
>> contiguous data is not contiguous on the media.
>
> The best solution is to make the data piece-wise contiguous on the media at 
> the appropriate granularity - which is largely determined by disk access 
> characteristics (the following assumes that the database table is large 
> enough to be spread across a lot of disks at moderately coarse granularity, 
> since otherwise it's often small enough to cache in the generous amounts of 
> RAM that are inexpensively available today).
>
> A single chunk on an (S)ATA disk today (the analysis is similar for 
> high-performance SCSI/FC/SAS disks) needn't exceed about 4 MB in size to 
> yield over 80% of the disk's maximum possible (fully-contiguous layout) 
> sequential streaming performance (after the overhead of an 'average' - 1/3 
> stroke - initial seek and partial rotation are figured in:  the latter could 
> be avoided by using a chunk size that's an integral multiple of the track 
> size, but on today's zoned disks that's a bit awkward).  A 1 MB chunk yields 
> around 50% of the maximum streaming performance.  ZFS's maximum 128 KB 'chunk 
> size', if effectively used as the disk chunk size as you seem to be suggesting, 
> yields only about 15% of the disk's maximum streaming performance (leaving 
> aside an additional degradation to a small fraction of even that should you 
> use RAID-Z).  And if you match the ZFS block size to a 16 KB database block 
> size and use that as the effective unit of distribution across the set of 
> disks, you'll obtain a mighty 2% of the potential streaming performance
> (again, we'll be
> charitable and ignore the further degradation if RAID-Z is used).
>
>   

You do not seem to be considering the track cache, which for
modern disks is 16-32 MBytes.  If those disks are in a RAID array,
then there are often larger read caches as well.  Expecting a seek and
read for each iop is a bad assumption.
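
For reference, the sort of back-of-the-envelope model that seems to be
behind the percentages quoted above can be written down in a few lines
of Python.  This is only a sketch: the seek, rotation, and media-rate
figures below are my own assumptions for a generic 7200 rpm SATA drive,
not measurements, and a track cache or read-ahead effectively enlarges
the transfer obtained per seek, which is exactly the point in dispute.

SEEK_MS     = 8.5      # assumed average (1/3-stroke) seek
HALF_ROT_MS = 4.2      # half a rotation at 7200 rpm
MEDIA_MB_S  = 60.0     # assumed sustained media transfer rate

def streaming_efficiency(chunk_kb):
    # fraction of pure streaming throughput when every chunk costs
    # one average seek plus half a rotation before it transfers
    transfer_ms = chunk_kb / 1024.0 / MEDIA_MB_S * 1000.0
    return transfer_ms / (transfer_ms + SEEK_MS + HALF_ROT_MS)

for kb in (16, 128, 1024, 4096):
    print("%5d KB chunk: %3.0f%% of streaming rate"
          % (kb, 100 * streaming_efficiency(kb)))

With these assumed parameters the model prints roughly 2%, 14%, 57%,
and 84%, in the same ballpark as the 2%/15%/50%/80% figures above.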

> Now, if your system is doing nothing else but sequentially scanning this one 
> database table, this may not be so bad:  you get truly awful disk utilization 
> (2% of its potential in the last case, ignoring RAID-Z), but you can still 
> read ahead through the entire disk set and obtain decent sequential scanning 
> performance by reading from all the disks in parallel.  But if your database 
> table scan is only one small part of a workload which is (perhaps the worst 
> case) performing many other such scans in parallel, your overall system 
> throughput will be only around 4% of what it could be had you used 1 MB 
> chunks (and the individual scan performances will also suck commensurately, 
> of course).
>
> Using 1 MB chunks still spreads out your database admirably for parallel 
> random-access throughput:  even if the table is only 1 GB in size (eminently 
> cachable in RAM, should that be preferable), that'll spread it out across 
> 1,000 disks (2,000, if you mirror it and load-balance to spread out the 
> accesses), and for much smaller database tables if they're accessed 
> sufficiently heavily for throughput to be an issue they'll be wholly 
> cache-resident.  Or another way to look at it is in terms of how many disks 
> you have in your system:  if it's less than the number of MB in your table 
> size, then the table will be spread across all of them regardless of what 
> chunk size is used, so you might as well use one that's large enough to give 
> you decent sequential scanning performance (and if your table is too small to 
> spread across all the disks, then it may well all wind up in cache anyway).
>
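
The spreading arithmetic in the paragraph above is easy to play with
directly.  A quick sketch (the table, chunk, and array sizes are
made-up examples, not anyone's measured configuration):

def spread(table_mb, chunk_mb, n_disks):
    # a table of N chunks can land on at most N distinct disks
    chunks = table_mb // chunk_mb
    return chunks, min(chunks, n_disks)

for table_mb, chunk_mb, n_disks in ((1024, 1, 48), (1024, 1, 2000), (102400, 4, 48)):
    chunks, used = spread(table_mb, chunk_mb, n_disks)
    print("%6d MB table / %d MB chunks on %4d disks -> %6d chunks across %4d disks"
          % (table_mb, chunk_mb, n_disks, chunks, used))

With 1 MB chunks a 1 GB table yields about 1,000 chunks, so any array
smaller than roughly 1,000 disks gets covered completely anyway, which
is the point being made above.
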
> ZFS's problem (well, the one specific to this issue, anyway) is that it tries 
> to use its 'block size' to cover two different needs:  performance for 
> moderately fine-grained updates (though its need to propagate those updates 
> upward to the root of the applicable tree significantly compromises this 
> effort), and decent disk utilization (I'm using that term to describe 
> throughput as a fraction of potential streaming throughput:  just 'keeping 
> the disks saturated' only describes where the system hits its throughput 
> wall, not how well its design does in pushing that wall back as far as 
> possible).  The two requirements conflict, and in ZFS's case the latter one 
> loses - badly.
>   

Real data would be greatly appreciated.  In my tests, I see
reasonable media bandwidth speeds for reads.
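
For anyone who wants to reproduce that kind of number, a trivial
sequential-read timing loop is enough to see where the media-bandwidth
wall sits.  The sketch below is only an illustration, not the
methodology behind the numbers referred to above: the path is a
hypothetical large file on the pool, and the caches need to be cold
for the result to mean anything.

import time

PATH  = "/tank/bigfile"    # hypothetical test file, several GB in size
CHUNK = 1 << 20            # issue 1 MB reads

total = 0
start = time.time()
with open(PATH, "rb", 0) as f:        # unbuffered binary reads
    while True:
        buf = f.read(CHUNK)
        if not buf:
            break
        total += len(buf)
elapsed = time.time() - start
print("%.0f MB in %.1f s = %.1f MB/s"
      % (total / 1e6, elapsed, total / 1e6 / elapsed))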

> Which is why background defragmentation could help, as I previously noted:  
> it could rearrange the table such that multiple virtually-sequential ZFS 
> blocks were placed contiguously on each disk (to reach 1 MB total, in the 
> current example) without affecting the ZFS block size per se.  But every 
> block so rearranged (and every tree ancestor of each such block) would then 
> leave an equal-sized residue in the most recent snapshot if one existed, 
> which gets expensive fast in terms of snapshot space overhead (which then is 
> proportional to the amount of reorganization performed as well as to the 
> amount of actual data updating).
>
>   
This comes up often from people who want > 128 kByte block sizes
for ZFS.  And yet we can demonstrate media bandwidth limits
relatively easily.  How would you reconcile the differences?
 -- richard

