My initial thought was that this whole thread may be irrelevant - anybody wanting to run such a database is likely to use a specialised filesystem optimised for it. But then I realised that for a database admin the integrity checking and other benefits of ZFS would be very tempting - provided ZFS can guarantee performance equivalent to other filesystems.
So, let me see if I understand this right:

- Louwtjie is concerned that ZFS will fragment databases, potentially leading to read performance issues for some databases.
- Nathan appears to have suggested a good workaround: could ZFS be updated to have a 'contiguous' setting where blocks are kept together? This would sacrifice write performance for read performance.
- Richard isn't convinced there's a problem, as he's not seen any data supporting it. I can see his point, but I don't agree that this is a non-starter. For certain situations it could be very useful, and balancing read and write performance is an integral part of choosing a storage configuration.
- Bill seems to understand the issue, and added some useful background (in an entertaining, if rather condescending, way).

Richard then went into a little more detail. I think he's pointing out that while contiguous data is fastest if you consider a single disk, it is not necessarily the fastest approach when your data is spread across multiple disks. Instead he feels a 'diverse stochastic spread' is needed. I take that to mean you want the data spread so that all the disks can be used in parallel.

I think I'm now seeing why Richard is asking for real data: I think he believes that ZFS may already be equal to or faster than a standard contiguous filesystem in this scenario. Richard seems to be taking a statistical approach: if data is saved randomly, you're likely to be using all the disks when reading it back. I do see the point, and yes, data would be useful, but I think I agree with Bill on this. While randomly placed data is likely to be fast in the sense of using multiple disks, it is also likely to be spread out, and so is almost certain to result in more disk seeks. Whereas with contiguous data you can guarantee that it will be striped across the maximum possible number of disks with the minimum number of seeks (a rough illustration of this trade-off follows just after this summary). As a database admin I would take guaranteed performance over probable performance any day of the week, especially if I can be sure that performance will be consistent and will not degrade as the database ages.

One point that I haven't seen raised yet: I believe most databases have had years of tuning based on the assumption that their data is saved contiguously on disk. They optimise their disk access around that, and it is not something we should ignore.

Yes, until we have data to demonstrate the problem it's just theoretical. However, that data may be hard to obtain, and in the meantime I think the theory is sound and the solution easy enough that it is worth tackling. I definitely don't think defragmentation is the solution (although that is needed in ZFS for other scenarios). If your database is under enough read strain to need the fix suggested here, your disks definitely do not have the time needed to scan and defragment the entire system.

It would seem to me that Nathan's suggestion right at the start of the thread is the way to go. It guarantees read performance for the database, and would seem to be relatively easy to implement at the zpool level. Yes, it adds considerable overhead to writes, but that is a decision database administrators can make given the expected load.
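Before going into Nathan's suggestion in detail, here is a rough, purely illustrative simulation of the seek argument above. Nothing in it reflects real ZFS allocation behaviour; the disk count, block counts, and the "one seek per non-adjacent jump" cost model are all assumptions invented just for this sketch.

    # Toy model only: compare a simple contiguous stripe against random
    # block placement, counting disks used in parallel and head seeks
    # needed for one range scan. All numbers and the cost model are made up.
    import random
    from collections import defaultdict

    DISKS = 6                # assumed stripe width
    TOTAL_BLOCKS = 600_000   # assumed number of blocks in the table
    SCAN_BLOCKS = 10_000     # logical blocks touched by one range scan

    def contiguous_layout(n, disks):
        """Block i goes to disk i % disks at a steadily increasing offset,
        i.e. a plain stripe of otherwise contiguous data."""
        return [(i % disks, i // disks) for i in range(n)]

    def random_layout(n, disks, span=10_000_000):
        """Each block lands on a random disk at a random offset, standing
        in for heavily fragmented, copy-on-write-scattered data."""
        return [(random.randrange(disks), random.randrange(span)) for _ in range(n)]

    def scan_cost(layout, start, count):
        """Count distinct disks used and head seeks needed, where a 'seek'
        is any jump to a non-adjacent offset on the same disk."""
        per_disk = defaultdict(list)
        for disk, offset in layout[start:start + count]:
            per_disk[disk].append(offset)
        seeks = 0
        for offsets in per_disk.values():
            offsets.sort()
            seeks += 1  # initial positioning on that disk
            seeks += sum(1 for a, b in zip(offsets, offsets[1:]) if b != a + 1)
        return len(per_disk), seeks

    for name, layout in [("contiguous stripe", contiguous_layout(TOTAL_BLOCKS, DISKS)),
                         ("random placement", random_layout(TOTAL_BLOCKS, DISKS))]:
        disks_used, seeks = scan_cost(layout, start=123_456, count=SCAN_BLOCKS)
        print(f"{name:18s}: {disks_used} disks used in parallel, ~{seeks} seeks")

In this toy model both layouts end up reading from every disk, but the random placement pays for that parallelism with vastly more head movement, which is essentially the point Bill and I are making.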
If I'm understanding Nathan right, saving a block of data would mean:

- Reading the original block (which may be cached if we're lucky)
- Saving that block to a new location
- Saving the new data to the original location

So you've got a 2-3x slowdown in write performance, but you guarantee that read performance will at least match existing filesystems (and with ZFS caching it may exceed them). ZFS then works much better with all the existing optimisations done within the database software, and you still keep all the benefits of ZFS - full data integrity, snapshots, clones, etc. For many database admins, I think that would be an option they would like to have.

Taking it a stage further, I wonder if this would work well with the prioritized write feature request (caching writes to a solid state disk)?
http://www.genunix.org/wiki/index.php/OpenSolaris_Storage_Developer_Wish_List

That could potentially mean there's very little slowdown:

- Read the original block
- Save that to the solid state disk
- Write the new block to the original location
- Periodically stream writes from the solid state disk to the main storage

In theory there's no need for the drive head to move at all between the read and the write, so this should only be fractionally slower than traditional ZFS writes. Yes, the data needs to be flushed from the solid state store from time to time, but those writes can be batched together for improved performance and streamed to contiguous free space on the disk (a rough conceptual sketch of this write path follows below). That would appear to then give you the best of both worlds.
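To make that write path a little more concrete, here is a minimal conceptual sketch of the idea. It is not ZFS code: the class, its method names, and the dicts standing in for devices are all invented for illustration, and the metadata and snapshot bookkeeping a real implementation would need is omitted.

    # Conceptual sketch only of the staged "keep the new data in place"
    # write path described above. Everything here is invented for
    # illustration; no pointer/metadata updates are modelled.

    class ContiguousPreservingPool:
        def __init__(self):
            self.main = {}        # main pool: logical block address -> data
            self.ssd_log = []     # fast staging device: (lba, old_data) pairs
            self.next_free = 1_000_000  # assumed start of contiguous free space

        def write_block(self, lba, new_data):
            """Overwrite a block while preserving the old copy.
            1. Read the original block (may already be cached).
            2. Stage the old copy on the solid state device.
            3. Write the new data back to the original location,
               so the table stays contiguous on the main disks."""
            old_data = self.main.get(lba)
            if old_data is not None:
                self.ssd_log.append((lba, old_data))
            self.main[lba] = new_data

        def flush_log(self):
            """Periodically stream the staged old copies out to contiguous
            free space on the main pool as one batched sequential write."""
            for original_lba, old_data in self.ssd_log:
                self.main[self.next_free] = old_data   # relocated old version
                self.next_free += 1
            self.ssd_log.clear()

    pool = ContiguousPreservingPool()
    pool.write_block(42, b"old row image")  # first write: nothing to preserve
    pool.write_block(42, b"new row image")  # overwrite: old copy staged, new data stays at LBA 42
    pool.flush_log()                        # staged old copies streamed to contiguous free space

The property I care about is that the database's blocks never move from their original addresses; only the superseded copies travel, and they go out in one batched sequential stream.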