> ... currently what it does is to maintain files subject to small random
> writes contiguous to the level of the zfs recordsize. Now after a
> significant run of random writes the files end up with a scattered
> on-disk layout. This should work well for the transaction parts of the
> workload.
Absolutely (save for the fact that every database block write winds up writing all the block's ancestors as well, but that's a different discussion, and one where ZFS looks only somewhat sub-optimal rather than completely uncompetitive when compared with different approaches).

> But the implications of using a small recordsize are that large
> sequential scans of files will make the disk heads very busy fetching
> or pre-fetching recordsized chunks.

Well, no: that's only an implication if you *choose* not to arrange the individual blocks on the disk to support sequential access better - and that choice can have devastating implications for the kind of workload being discussed here (another horrendous example would be a simple array of fixed-size records in a file accessed - and in particular updated - randomly by ordinal record number converted to a file offset, but also scanned sequentially for bulk processing). Yes, choosing to reorganize files does have the kinds of snapshot implications that I've mentioned, but in most installations (save those entirely dedicated to databases) the situation under discussion here will typically involve only a small subset of the total data stored, and thus reorganization shouldn't have severe consequences.

> Get more spindles and a good prefetch algorithm and you can reach
> whatever throughput you need.

At the cost of using about 25x (or 50x - 200x, if using RAID-Z) as many disks as you'd need for the same throughput if the blocks were laid out in 1 MB chunks rather than the 16 KB chunks in the example database.

> The problem is that your scanning ops will create heavy competition at
> the spindle level, thus impacting the transactional response time (once
> you have 150 IOPS on every spindle just prefetching data for your full
> table scans, the OLTP will suffer).

And this effect on OLTP performance would be dramatically reduced as well if you pulled the sequential scan off the disk in large chunks rather than in randomly-distributed individual blocks.

> Yes we do need data to characterise this, but the physics are fairly
> clear.

Indeed they are - thanks for recognizing this better than some here have managed to.

> > The BP suggesting a small recordsize needs to be updated.
>
> We need to strike a balance between random writes and sequential reads,
> which does imply using larger records than 8K/16K DB blocks.

As Jesus Cea observed in the recent "ZFS + DB + default blocksize" discussion, this then requires that every database block update first read in the larger ZFS record before performing the update, rather than allowing the database block to be written directly. The ZFS block *might* still be present in ZFS's cache, but if most RAM is dedicated to the database cache (which for many workloads makes more sense) the chances of this are reduced (not only is ZFS's cache smaller, but the larger database cache will hold the small database blocks a *lot* longer than ZFS's cache will hold any associated large ZFS blocks, so a database block update can easily occur long after the associated ZFS block has been evicted). Even ignoring the deleterious effect on random database update performance, you still can't get *good* performance for sequential scans this way, because maximum-size 128 KB ZFS blocks laid out randomly are still a factor of about 4 less efficient at this than 1 MB chunks would be (i.e., you'd need about 4x as many disks - or 8x - 32x as many disks if using RAID-Z - to achieve comparable performance).
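For anyone who wants to check the arithmetic, here's a rough back-of-the-envelope sketch of per-spindle scan throughput as a function of chunk size. The 10 ms average access overhead and 60 MB/s sustained transfer rate are illustrative assumptions for a contemporary SATA drive, not figures from the example system:

    # Rough model: a sequential scan serviced as random reads of fixed-size
    # chunks.  Each chunk costs one full random access (seek + rotational
    # latency) plus its media transfer time.  Drive parameters are assumed,
    # not measured.

    ACCESS_OVERHEAD_S = 0.010    # assumed average seek + half rotation, seconds
    TRANSFER_RATE_BPS = 60e6     # assumed sustained media rate, bytes per second

    def scan_throughput_mb_s(chunk_bytes):
        """Effective per-spindle throughput (MB/s) for randomly placed chunks."""
        time_per_chunk = ACCESS_OVERHEAD_S + chunk_bytes / TRANSFER_RATE_BPS
        return chunk_bytes / time_per_chunk / 1e6

    for label, size in (("16 KB", 16 << 10), ("128 KB", 128 << 10), ("1 MB", 1 << 20)):
        print("%6s chunks: %5.1f MB/s per spindle" % (label, scan_throughput_mb_s(size)))

With these assumptions 16 KB chunks deliver roughly 1/24th, and 128 KB chunks roughly 1/4, of the per-spindle scan rate of 1 MB chunks - i.e., about the 25x and 4x spindle multipliers cited above, before any RAID-Z multiplication.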
And it just doesn't have to be that way - as other modern Unix file systems recognized long ago. You don't even need to embrace extent-based allocation as they did, but just rearrange your blocks sensibly - and to at least some degree you could do that while they're still cache-resident if you captured updates for a while in the ZIL (what's the largest update that you're willing to stuff in there now?) before batch-writing them back to less temporary locations. RAID-Z could be fixed as well, which would help a much wider range of workloads.

- bill