...

> - Nathan appears to have suggested a good workaround.
> Could ZFS be updated to have a 'contiguous' setting
> where blocks are kept together. This sacrifices
> write performance for read.
I had originally thought that this would be incompatible with ZFS's snapshot mechanism, but with a minor tweak it may not be.

...

> - Bill seems to understand the issue, and added some
> useful background (although in an entertaining but
> rather condescending way).

There is a bit of nearby history that led to that.

...

> One point that I haven't seen raised yet: I believe
> most databases will have had years of tuning based
> around the assumption that their data is saved
> contiguously on disk. They will be optimising their
> disk access based on that and this is not something
> we should ignore.

Ah - nothing like real, experienced user input.

I tend to agree with ZFS's general philosophy of attempting to minimize the number of knobs that need tuning, but this can lead to forgetting that higher-level software may have knobs of its own. My original assumption was that databases automatically attempt to leverage on-disk contiguity (which the more evolved ones certainly do when they control the on-disk layout themselves, and which one might suspect they try to do even when running on top of files, on the assumption that the file system is trying to preserve on-disk contiguity), but of course admins play a major role as well (e.g., in determining which indexes need not be created because sequential table scans can get the job done efficiently).

...

> I definitely don't think defragmentation is the
> solution (although that is needed in ZFS for other
> scenarios). If your database is under enough read
> strain to need the fix suggested here, your disks
> definitely do not have the time needed to scan and
> defrag the entire system.

Well, it's only this kind of randomly-updated/sequentially-scanned data that needs much defragmentation in the first place. Data that's written once and then only read needs at worst a single defragmentation pass (if the original writes got interrupted by a lot of other update activity); data that's not read sequentially (e.g., indirect blocks) needn't be defragmented at all; and neither does data that's seldom read and/or not very fragmented in the first place.

> It would seem to me that Nathan's suggestion right at
> the start of the thread is the way to go. It
> guarantees read performance for the database, and
> would seem to be relatively easy to implement at the
> zpool level. Yes it adds considerable overhead to
> writes, but that is a decision database
> administrators can make given the expected load.
>
> If I'm understanding Nathan right, saving a block of
> data would mean:
> - Reading the original block (may be cached if we're
>   lucky)
> - Saving that block to a new location
> - Saving the new data to the original location

1. You'd still need an initial defragmentation pass to ensure that the file was reasonably piece-wise contiguous to begin with.

2. You can't move the old version of the block without updating all its ancestors (since the pointer to it changes). When you update this path to the old version, you need to suppress the normal COW behavior if a snapshot exists, because it would otherwise leave the old path pointing to the old data location that you're just about to overwrite below. This presumably requires establishing the entire new path and deallocating the entire old path in a single transaction, but it may just be equivalent to a normal data block 'update' (one that doesn't happen to change any data in the block) when no snapshot exists. I don't *think* that there should be any new issues raised with other updates that may be combined in the same 'transaction', even if they may affect some of the same ancestral blocks.

3. You can't just slide the new version of the block in using the old version's existing set of ancestors, because a) you just deallocated that path above (and introducing additional mechanism to preserve it temporarily almost certainly would not be wise), b) the data block checksum changed, and c) in any event this new path should be *newer* than the path to the old version's new location that you just had to establish (if a snapshot exists, that's the path that should be propagated to it by the COW mechanism). However, this is just the normal situation whenever you update a data block: all the *additional* overhead occurred in the previous steps.

Given that doing the update twice, as described above, only adds to the bandwidth consumed (steps 2 and 3 should be able to be combined in a single transaction), the only additional disk seek would be the one required to re-read the original data if it wasn't cached. So you may well be correct that this approach would likely consume fewer resources than background defragmentation would (though, as noted above, you'd still need an initial defrag pass to establish initial contiguity), and while the additional resources would be consumed at normal rather than reduced priority, the file would be kept contiguous all the time rather than just returned to contiguity whenever there was time to do so.
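To make the sequence above a bit more concrete, here's a toy in-memory sketch (my own illustration with invented names - not ZFS code). Each live block has a fixed 'home' address inside a contiguous region; an update first copies the old version out and rewrites any snapshot's path to point at the copy (the suppressed-COW step in point 2) before the new data lands back in the home address. The real thing would of course have to rewrite the whole chain of indirect blocks and checksums in a single transaction; the sketch collapses all of that into a flat list of block pointers.

# Hypothetical, highly simplified model of the scheme described above -
# not ZFS code, and all names are invented.  The "disk" is a dict of
# address -> data; the live file keeps every block at a fixed home address
# inside one contiguous region, so sequential scans stay sequential no
# matter how randomly the file is updated.

class Pool:
    def __init__(self, nblocks, base=0):
        # Freshly "defragmented" file: blocks live at base .. base+nblocks-1.
        self.disk = {base + i: b"\0" for i in range(nblocks)}
        self.base = base
        self.live = [base + i for i in range(nblocks)]   # live block pointers
        self.snapshots = []                              # each: a list of addresses
        self.next_free = base + nblocks                  # bump allocator for copies

    def snapshot(self):
        # A snapshot just captures the current block pointers (shared, COW-style).
        self.snapshots.append(list(self.live))

    def write(self, blkno, data):
        home = self.base + blkno
        if any(snap[blkno] == home for snap in self.snapshots):
            # Point 2: a snapshot still references the home address, so first
            # copy the old version out and rewrite the snapshot paths to point
            # at the copy (the "suppressed COW" update), all notionally in one
            # transaction.
            relocated = self.next_free
            self.next_free += 1
            self.disk[relocated] = self.disk[home]
            for snap in self.snapshots:
                if snap[blkno] == home:
                    snap[blkno] = relocated
        # Point 3: the new data always lands back in the home address, so the
        # live file stays contiguous; from here on it's a normal block update.
        self.disk[home] = data
        self.live[blkno] = home


pool = Pool(4)
pool.write(2, b"v1")
pool.snapshot()
pool.write(2, b"v2")     # "v1" is copied out of the way; block 2 stays at home
assert pool.disk[pool.live[2]] == b"v2"
assert pool.disk[pool.snapshots[0][2]] == b"v1"

The delicate part in a real pool is rewriting the old version's path without the usual COW; the toy model gets away with simply mutating a list in place.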
...

> Taking it a stage further, I wonder if this would
> work well with the prioritized write feature request
> (caching writes to a solid state disk)?
> http://www.genunix.org/wiki/index.php/OpenSolaris_Storage_Developer_Wish_List
>
> That could potentially mean there's very little
> slowdown:
> - Read the original block
> - Save that to solid state disk
> - Write the new block in the original location
> - Periodically stream writes from the solid state
>   disk to the main storage

I don't think this applies (nor would it confer any benefit) if things in fact need to be handled as I described above.

- bill
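P.S. For completeness, here's the same toy model extended with the solid-state staging step quoted above (again invented names, not a real interface). It mostly illustrates the point: the snapshot-path rewrite from points 2 and 3 still has to happen up front, and the staging device only defers the copy-out of the displaced old version (a real implementation would also have to satisfy reads of that version from the staging device until the flush completes).

from collections import deque

class StagedPool(Pool):                      # reuses the toy Pool sketched above
    def __init__(self, nblocks, base=0):
        super().__init__(nblocks, base)
        self.ssd_log = deque()               # (final address, data) awaiting flush

    def write(self, blkno, data):
        home = self.base + blkno
        if any(snap[blkno] == home for snap in self.snapshots):
            relocated = self.next_free
            self.next_free += 1
            # Stage the displaced old version on the solid-state device rather
            # than seeking on the main pool right now ...
            self.ssd_log.append((relocated, self.disk[home]))
            # ... but the snapshot paths still have to be rewritten immediately,
            # exactly as in point 2 above.
            for snap in self.snapshots:
                if snap[blkno] == home:
                    snap[blkno] = relocated
        self.disk[home] = data
        self.live[blkno] = home

    def flush_log(self):
        # "Periodically stream writes from the solid state disk to the main
        # storage", as the quoted proposal puts it.
        while self.ssd_log:
            addr, data = self.ssd_log.popleft()
            self.disk[addr] = data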