can you guess? wrote:
>> can you guess? wrote:
>>>> For very read intensive and position sensitive applications, I guess
>>>> this sort of capability might make a difference?
>>>
>>> No question about it.  And sequential table scans in databases are among
>>> the most significant examples, because (unlike things like streaming
>>> video files, which just get laid down initially and non-synchronously in
>>> a manner that at least potentially allows ZFS to accumulate them in
>>> large, contiguous chunks - though ISTR some discussion about just how
>>> well ZFS managed this when it was accommodating multiple such write
>>> streams in parallel) the tables are also subject to fine-grained,
>>> often-random update activity.
>>>
>>> Background defragmentation can help, though it generates a boatload of
>>> additional space overhead in any applicable snapshot.
>>
>> The reason that this is hard to characterize is that there are really
>> two very different configurations used to address different performance
>> requirements: cheap and fast.  It seems that when most people first
>> consider this problem, they do so from the cheap perspective: the
>> single-disk view.  Anyone who strives for database performance will
>> choose the fast perspective: stripes.
>
> And anyone who *really* understands the situation will do both.
I'm not sure I follow.  Many people who do high performance databases use
hardware RAID arrays which often do not expose single disks.

>> Note: data redundancy isn't really an issue for this analysis, but
>> consider it done in real life.  When you have a striped storage device
>> under a file system, then the database or file system's view of
>> contiguous data is not contiguous on the media.
>
> The best solution is to make the data piece-wise contiguous on the media
> at the appropriate granularity - which is largely determined by disk
> access characteristics (the following assumes that the database table is
> large enough to be spread across a lot of disks at moderately coarse
> granularity, since otherwise it's often small enough to cache in the
> generous amounts of RAM that are inexpensively available today).
>
> A single chunk on an (S)ATA disk today (the analysis is similar for
> high-performance SCSI/FC/SAS disks) needn't exceed about 4 MB in size to
> yield over 80% of the disk's maximum possible (fully-contiguous layout)
> sequential streaming performance (after the overhead of an 'average' -
> 1/3 stroke - initial seek and partial rotation are figured in: the latter
> could be avoided by using a chunk size that's an integral multiple of the
> track size, but on today's zoned disks that's a bit awkward).  A 1 MB
> chunk yields around 50% of the maximum streaming performance.  ZFS's
> maximum 128 KB 'chunk size', if effectively used as the disk chunk size
> as you seem to be suggesting, yields only about 15% of the disk's maximum
> streaming performance (leaving aside an additional degradation to a small
> fraction of even that should you use RAID-Z).  And if you match the ZFS
> block size to a 16 KB database block size and use that as the effective
> unit of distribution across the set of disks, you'll obtain a mighty 2%
> of the potential streaming performance (again, we'll be charitable and
> ignore the further degradation if RAID-Z is used).

You do not seem to be considering the track cache, which for modern disks
is 16-32 MBytes.  If those disks are in a RAID array, then there are often
larger read caches as well.  Expecting a seek and a read for each I/O
operation is a bad assumption.

> Now, if your system is doing nothing else but sequentially scanning this
> one database table, this may not be so bad: you get truly awful disk
> utilization (2% of its potential in the last case, ignoring RAID-Z), but
> you can still read ahead through the entire disk set and obtain decent
> sequential scanning performance by reading from all the disks in
> parallel.  But if your database table scan is only one small part of a
> workload which is (perhaps the worst case) performing many other such
> scans in parallel, your overall system throughput will be only around 4%
> of what it could be had you used 1 MB chunks (and the individual scan
> performances will also suck commensurately, of course).
>
> Using 1 MB chunks still spreads out your database admirably for parallel
> random-access throughput: even if the table is only 1 GB in size
> (eminently cachable in RAM, should that be preferable), that'll spread it
> out across 1,000 disks (2,000, if you mirror it and load-balance to
> spread out the accesses), and for much smaller database tables, if
> they're accessed sufficiently heavily for throughput to be an issue,
> they'll be wholly cache-resident.
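[Aside: for anyone who wants to check the percentages quoted above, here is
a minimal back-of-envelope sketch.  The disk parameters are assumptions of
mine (roughly a 7200 rpm SATA drive with an 8 ms average seek and an 80 MB/s
sustained media rate), not figures taken from the post; it lands close to
the 2%/50%/80% points, with the 128 KB result quite sensitive to the exact
seek and rotation assumed.]

    # Back-of-envelope chunk-size vs. streaming-efficiency check.
    # All three constants below are assumed drive characteristics.
    AVG_SEEK_MS = 8.0              # assumed 1/3-stroke average seek time
    ROTATION_MS = 60_000 / 7200    # one revolution at an assumed 7200 rpm
    STREAM_MB_S = 80.0             # assumed sustained media transfer rate

    def streaming_efficiency(chunk_kb):
        """Fraction of the pure streaming rate delivered when every chunk
        costs an average seek plus half a rotation before it transfers."""
        overhead_ms = AVG_SEEK_MS + ROTATION_MS / 2
        transfer_ms = chunk_kb / 1024 / STREAM_MB_S * 1000
        return transfer_ms / (overhead_ms + transfer_ms)

    for kb in (16, 128, 1024, 4096):
        print(f"{kb:>5} KB chunks: {streaming_efficiency(kb):6.1%} of streaming rate")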
> Or another way to look at it is in terms of how many disks you have in
> your system: if it's less than the number of MB in your table size, then
> the table will be spread across all of them regardless of what chunk size
> is used, so you might as well use one that's large enough to give you
> decent sequential scanning performance (and if your table is too small to
> spread across all the disks, then it may well all wind up in cache
> anyway).
>
> ZFS's problem (well, the one specific to this issue, anyway) is that it
> tries to use its 'block size' to cover two different needs: performance
> for moderately fine-grained updates (though its need to propagate those
> updates upward to the root of the applicable tree significantly
> compromises this effort), and decent disk utilization (I'm using that
> term to describe throughput as a fraction of potential streaming
> throughput: just 'keeping the disks saturated' only describes where the
> system hits its throughput wall, not how well its design does in pushing
> that wall back as far as possible).  The two requirements conflict, and
> in ZFS's case the latter one loses - badly.

Real data would be greatly appreciated.  In my tests, I see reasonable
media bandwidth speeds for reads.

> Which is why background defragmentation could help, as I previously
> noted: it could rearrange the table such that multiple
> virtually-sequential ZFS blocks were placed contiguously on each disk (to
> reach 1 MB total, in the current example) without affecting the ZFS block
> size per se.  But every block so rearranged (and every tree ancestor of
> each such block) would then leave an equal-sized residue in the most
> recent snapshot if one existed, which gets expensive fast in terms of
> snapshot space overhead (which then is proportional to the amount of
> reorganization performed as well as to the amount of actual data
> updating).

This comes up often from people who want >128 kByte block sizes for ZFS.
And yet we can demonstrate media bandwidth limits relatively easily.  How
would you reconcile the differences?
 -- richard
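[Aside: one simple way to run the kind of media-bandwidth check mentioned
above is to time sequential reads of a large file at a few record sizes.
This is only a sketch of that idea, not the poster's actual test method;
the path is a hypothetical placeholder, and the file needs to be large
enough (or caches dropped) so the ARC and drive caches don't dominate.]

    # Sketch: report sequential read bandwidth for a few record sizes.
    import time

    TEST_FILE = "/tank/testfile"   # hypothetical file on the pool under test
    SIZES = (16 * 1024, 128 * 1024, 1024 * 1024)

    def sequential_read_mb_s(path, record_size, limit=1 << 30):
        """Read up to `limit` bytes sequentially in `record_size` chunks
        and return the observed throughput in MB/s."""
        read_bytes = 0
        start = time.monotonic()
        with open(path, "rb", buffering=0) as f:
            while read_bytes < limit:
                buf = f.read(record_size)
                if not buf:
                    break
                read_bytes += len(buf)
        elapsed = time.monotonic() - start
        return read_bytes / (1024 * 1024) / elapsed

    for size in SIZES:
        print(f"{size // 1024:>5} KB records: "
              f"{sequential_read_mb_s(TEST_FILE, size):7.1f} MB/s")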