Richard Elling wrote:

...

>>> there are
>>> really two very different configurations used to
>>> address different
>>> performance requirements: cheap and fast.  It seems
>>> that when most
>>> people first consider this problem, they do so from
>>> the cheap
>>> perspective: single disk view.  Anyone who strives
>>> for database
>>> performance will choose the fast perspective:
>>> stripes.
>>>     
>>
>> And anyone who *really* understands the situation will do both.
>>   
> 
> I'm not sure I follow.  Many people who do high performance
> databases use hardware RAID arrays which often do not
> expose single disks.

They don't have to expose single disks:  they just have to use reasonable chunk 
sizes on each disk, as I explained later.

Only very early (or very low-end) RAID used very small per-disk chunks (64 KB 
or less).  By the mid-'90s, chunk sizes on mid-range arrays had grown to 
128 - 256 KB per disk in order to improve disk utilization in the array.  From 
talking with one of its architects years ago my impression is that HP's (now 
somewhat aging) EVA series uses 1 MB as its chunk size (the same size I used as 
an example, though today one could argue for as much as 4 MB and soon perhaps 
even more).

The array chunk size is not the unit of update, just the unit of distribution 
across the array:  RAID-5 will happily update a single 4 KB file block within a 
given array chunk and the associated 4 KB of parity within the parity chunk.  
But the larger chunk size does allow files to retain the option of using 
logical contiguity to attain better streaming sequential performance, rather 
than splintering that logical contiguity at fine grain across multiple disks.
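To make that concrete, here's a sketch of the point (in Python, with an assumed 
rotating-parity layout and an illustrative 5-disk array; neither is taken from 
any particular product):

```python
# Sketch (not any particular array's layout): RAID-5 with rotating
# parity across N_DISKS disks and a per-disk chunk of CHUNK bytes.
# It illustrates that a small write touches one data chunk plus the
# stripe's parity chunk, regardless of how large CHUNK is.

N_DISKS = 5            # assumed array width (illustrative)
CHUNK = 1 << 20        # 1 MB per-disk chunk, as in the example above

def locate(offset):
    """Map a logical byte offset to (stripe, data_disk, parity_disk)."""
    chunk_no = offset // CHUNK
    stripe = chunk_no // (N_DISKS - 1)        # N-1 data chunks per stripe
    parity_disk = (N_DISKS - 1) - (stripe % N_DISKS)  # rotating parity
    data_slot = chunk_no % (N_DISKS - 1)
    # Data chunks occupy the non-parity slots, in order.
    data_disk = data_slot if data_slot < parity_disk else data_slot + 1
    return stripe, data_disk, parity_disk

# Two 4 KB updates that fall within the same 1 MB chunk land on the
# same (data, parity) disk pair: only 4 KB of data and 4 KB of parity
# move, even though the chunk is 1 MB.
print(locate(10 * (1 << 20)))
print(locate(10 * (1 << 20) + 4096))
```

The chunk size only decides how logical addresses are spread across spindles; 
the write path still operates at the block level.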

...

>> A single chunk on an (S)ATA disk today (the analysis is similar for 
>> high-performance SCSI/FC/SAS disks) needn't exceed about 4 MB in size 
>> to yield over 80% of the disk's maximum possible (fully-contiguous 
>> layout) sequential streaming performance (after the overhead of an 
>> 'average' - 1/3 stroke - initial seek and partial rotation are figured 
>> in:  the latter could be avoided by using a chunk size that's an 
>> integral multiple of the track size, but on today's zoned disks that's 
>> a bit awkward).  A 1 MB chunk yields around 50% of the maximum 
>> streaming performance.  ZFS's maximum 128 KB 'chunk size' if 
>> effectively used as the disk chunk size as you seem to be suggesting 
>> yields only about 15% of the disk's maximum streaming performance 
>> (leaving aside an additional degradation to a small fraction of even 
>> that should you use RAID-Z).  And if you match the ZFS block size to a 
>> 16 KB database block size and use that as the effective unit of 
>> distribution across the set of disks, you'll 
>> obtain a mighty 2% of the potential streaming performance (again, we'll 
>> be charitable and ignore the further degradation if RAID-Z is used).
>>
>>   
> 
> You do not seem to be considering the track cache, which for
> modern disks is 16-32 MBytes.  If those disks are in a RAID array,
> then there are often larger read caches as well.

Are you talking about hardware RAID in that last comment?  I thought ZFS was 
supposed to eliminate the need for that.

> Expecting a seek and
> read for each iop is a bad assumption.

The bad assumption is that the disks are otherwise idle and therefore have the 
luxury of filling up their track caches - especially when I explicitly assumed 
otherwise in the following paragraph in that post.  If the system is heavily 
loaded the disks will usually have other requests queued up (even if the next 
request comes in immediately rather than being queued at the disk itself, an 
even half-smart disk will abort any current read-ahead activity so that it can 
satisfy the new request).

Not that it would necessarily do much good for the case currently under 
discussion even if the disks weren't otherwise busy and they did fill up the 
track caches:  ZFS's COW policies tend to encourage data that's updated 
randomly at fine grain (as a database table often is) to be splattered across 
the storage rather than neatly arranged such that the next data requested from 
a given disk will just happen to reside right after the previous data requested 
from that disk.
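For anyone who wants to check the percentages I quoted, they fall out of simple 
seek-plus-rotation-plus-transfer arithmetic.  A sketch, assuming hypothetical 
drive parameters in the right ballpark for a current SATA disk (8 ms 
one-third-stroke seek, 7200 rpm so about 4.2 ms of average rotational delay, 
70 MB/s media rate); the exact figures shift with the assumptions, but the 
shape of the curve doesn't:

```python
# Back-of-envelope check of the chunk-size percentages quoted earlier.
# All drive parameters are assumptions, not measurements.

SEEK_MS = 8.0          # 'average' one-third-stroke seek
ROT_MS = 8.33 / 2      # half a rotation at 7200 rpm
MB_PER_S = 70.0        # assumed sustained media rate

def stream_efficiency(chunk_bytes):
    """Fraction of pure media bandwidth achieved when every chunk
    costs an average seek plus partial rotation before its transfer."""
    xfer_ms = chunk_bytes / (MB_PER_S * 1e6) * 1e3
    return xfer_ms / (SEEK_MS + ROT_MS + xfer_ms)

for size in (16 << 10, 128 << 10, 1 << 20, 4 << 20):
    print(f"{size >> 10:5d} KB chunk: {stream_efficiency(size):5.1%}")
```

With those numbers the four chunk sizes come out near 2%, 15%, 50%, and a bit 
over 80% of the disk's streaming rate, which is where the figures in the quoted 
paragraph came from.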

> 
>> Now, if your system is doing nothing else but sequentially scanning 
>> this one database table, this may not be so bad:  you get truly awful 
>> disk utilization (2% of its potential in the last case, ignoring 
>> RAID-Z), but you can still read ahead through the entire disk set and 
>> obtain decent sequential scanning performance by reading from all the 
>> disks in parallel.  But if your database table scan is only one small 
>> part of a workload which is (perhaps the worst case) performing many 
>> other such scans in parallel, your overall system throughput will be 
>> only around 4% of what it could be had you used 1 MB chunks (and the 
>> individual scan performances will also suck commensurately, of course).

...

> Real data would be greatly appreciated.  In my tests, I see
> reasonable media bandwidth speeds for reads.

You already said that you hadn't been studying databases (the source of the 
kind of random-update/streaming read access mix specifically under 
consideration here).  But while they may be one of the worst cases in this respect 
(especially given their tendency to want to perform synchronous rather than 
lazy writes), the underlying problem is hardly unique:  didn't I see a 
reference recently to streaming read performance issues with data that had been 
laid down by multiple concurrent sequential write streams?

> 
>> Which is why background defragmentation could help, as I previously 
>> noted:  it could rearrange the table such that multiple 
>> virtually-sequential ZFS blocks were placed contiguously on each disk 
>> (to reach 1 MB total, in the current example) without affecting the 
>> ZFS block size per se.  But every block so rearranged (and every tree 
>> ancestor of each such block) would then leave an equal-sized residue 
>> in the most recent snapshot if one existed, which gets expensive fast 
>> in terms of snapshot space overhead (which then is proportional to the 
>> amount of reorganization performed as well as to the amount of actual 
>> data updating).
>>
>>   
> This comes up often from people who want > 128 kByte block sizes
> for ZFS.

Using larger block sizes to solve this problem would just be piling one kludge 
on top of another.  Block size is not the right answer to streaming performance 
- you achieve it by arranging *multiple* blocks sensibly on the media, so that 
you can then use a block size that's otherwise appropriate for the application 
(e.g., 16 KB for a database that uses 16 KB blocks itself).
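To sketch what "arranging multiple blocks sensibly" means (the allocation 
policy below is purely illustrative, not ZFS's actual allocator): keep the 
16 KB block size the application wants, but lay logically consecutive blocks 
down in 1 MB per-disk runs instead of scattering each block individually:

```python
# Illustrative allocator: 16 KB blocks grouped into 1 MB contiguous
# runs per disk, with runs round-robined across the disks.  Disk count
# and sizes are assumptions for the sake of the example.

BLOCK = 16 << 10               # application/ZFS block size
RUN = 1 << 20                  # contiguous bytes per disk before switching
BLOCKS_PER_RUN = RUN // BLOCK  # 64 blocks per run
N_DISKS = 4

def placement(block_no):
    """Map a logical block number to (disk, byte offset on that disk)."""
    run_no = block_no // BLOCKS_PER_RUN
    disk = run_no % N_DISKS
    run_on_disk = run_no // N_DISKS
    offset = run_on_disk * RUN + (block_no % BLOCKS_PER_RUN) * BLOCK
    return disk, offset

# Blocks 0..63 sit contiguously on disk 0; block 64 starts disk 1.
print(placement(0), placement(63), placement(64))
```

A sequential scan then reads 1 MB per seek from each disk (and gets the 
streaming efficiency that implies) while the block size itself never changes.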

> And yet we can demonstrate media bandwidth limits
> relatively easily.  How would you reconcile the differences?

Perhaps the difference is that you're happier talking about workloads that make 
ZFS look good rather than actively looking for workloads that give ZFS fits.  
Start looking for what ZFS is *not* good at and you'll find it (and then be 
able to start thinking about how to fix it).

- bill
 
 
This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
