My initial thought was that this whole thread may be irrelevant - anybody 
wanting to run such a database is likely to use a specialised filesystem 
optimised for it.  But then I realised that for a database admin the integrity 
checking and other benefits of ZFS would be very tempting, but only if ZFS can 
guarantee equivalent performance to other filesystems.

So, let me see if I understand this right:

- Louwtjie is concerned that ZFS will fragment databases, potentially leading 
to read performance issues for some databases.

- Nathan appears to have suggested a good workaround: could ZFS be updated to 
have a 'contiguous' setting where blocks are kept together?  This would 
sacrifice write performance for read performance.

- Richard isn't convinced there's a problem, as he hasn't seen any data 
supporting this.  I can see his point, but I don't agree that this is a 
non-starter.  For certain situations it could be very useful, and balancing 
read and write performance is an integral part of choosing a storage 
configuration.

- Bill seems to understand the issue, and added some useful background 
(although in an entertaining but rather condescending way).

Richard then went into a little more detail.  I think he's pointing out here 
that while contiguous data is fastest if you consider a single disk, it is not 
necessarily the fastest approach when your data is spread across multiple 
disks.  Instead he feels a 'diverse stochastic spread' is needed.  I guess that 
means you want the data spread so all the disks can be used in parallel.

I think I'm now seeing why Richard is asking for real data.  I think he 
believes that ZFS may already be as fast as or faster than a standard 
contiguous filesystem in this scenario.  Richard seems to be making a 
statistical argument: if data is saved randomly, you're likely to be using all 
the disks when reading it.

I do see the point, and yes, data would be useful, but I think I agree with 
Bill on this.  For reads, random locations are likely to be fast in the sense 
of using multiple disks, but that data is also likely to be scattered, and so 
is almost certain to result in more disk seeks.  Whereas with contiguous data 
you can guarantee that it will be striped across the maximum possible number 
of disks with the minimum number of seeks.  As a database admin I would take 
guaranteed performance over probable performance any day of the week, 
especially if I can be sure that performance will be consistent and will not 
degrade as the database ages.
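
To put some rough numbers on that (this is my own toy model, not measured 
data, and obviously no substitute for the benchmarks Richard is asking for), 
here's a quick sketch estimating disk usage and seek counts for a sequential 
scan under random placement versus contiguous striping:

# Toy model of a sequential scan of `blocks` database blocks spread over
# `disks` disks.  Assumptions (mine, for illustration only): random placement
# costs roughly one seek per block, while contiguous striping costs roughly
# one seek per disk, since each disk reads one long run.

def scan_model(blocks, disks):
    # Expected number of distinct disks touched when each block lands on a
    # uniformly random disk: N * (1 - (1 - 1/N)^B)
    disks_touched_random = disks * (1 - (1 - 1.0 / disks) ** blocks)
    seeks_random = blocks          # scattered blocks: ~one seek each
    seeks_contiguous = disks       # one contiguous run per disk
    return disks_touched_random, seeks_random, seeks_contiguous

touched, rand_seeks, contig_seeks = scan_model(blocks=10000, disks=8)
print(f"random placement:    ~{touched:.1f} of 8 disks busy, ~{rand_seeks} seeks")
print(f"contiguous striping:  8 of 8 disks busy,   ~{contig_seeks} seeks")

On a big scan both layouts end up keeping every spindle busy; the difference 
is in the seek count, and that guaranteed minimum is what I care about.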

One point that I haven't seen raised yet:  I believe most databases will have 
had years of tuning based around the assumption that their data is saved 
contiguously on disk.  They will be optimising their disk access based on that 
assumption, and this is not something we should ignore.

Yes, until we have data to demonstrate the problem it's just theoretical.  
However, that data may be hard to obtain, and in the meantime I think the 
theory is sound and the solution easy enough that it is worth tackling.

I definitely don't think defragmentation is the solution (although that is 
needed in ZFS for other scenarios).  If your database is under enough read 
strain to need the fix suggested here, your disks definitely do not have the 
time needed to scan and defrag the entire system.

It would seem to me that Nathan's suggestion right at the start of the thread 
is the way to go.  It guarantees read performance for the database, and would 
seem to be relatively easy to implement at the zpool level.  Yes it adds 
considerable overhead to writes, but that is a decision database administrators 
can make given the expected load.  

If I'm understanding Nathan right, overwriting an existing block of data would mean:
 - Reading the original block (may be cached if we're lucky)
 - Saving that block to a new location
 - Saving the new data to the original location
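
To make sure I've got the sequence straight, here's a minimal sketch of that 
write path as I understand it (purely illustrative and in-memory; the class 
and helper names are made up and bear no relation to the real ZFS code):

# Illustrative in-memory model of the 'keep the block in place' write path
# described above.  The pool is just a dict of block number -> data plus a
# set of free block numbers.  Invented names; not real ZFS internals.

class ToyPool:
    def __init__(self, size):
        self.blocks = {}                      # block number -> data
        self.free_blocks = set(range(size))   # unused block numbers

    def allocate_free_block(self):
        return self.free_blocks.pop()

    def overwrite_in_place(self, block_no, new_data):
        # 1. Read the original block (it may already be cached in practice).
        old_data = self.blocks.get(block_no)
        if old_data is not None:
            # 2. Save the old contents to a new location, so snapshots and
            #    copy-on-write semantics still have the previous version.
            relocated = self.allocate_free_block()
            self.blocks[relocated] = old_data
        # 3. Save the new data in the original location, keeping the file's
        #    blocks physically contiguous on disk.
        self.free_blocks.discard(block_no)
        self.blocks[block_no] = new_data

pool = ToyPool(size=1024)
pool.overwrite_in_place(7, b"original row")   # first write: block 7 is new
pool.overwrite_in_place(7, b"updated row")    # rewrite: old copy is relocated

Counting the I/O, that's one read and two writes in place of the single write 
a normal copy-on-write update needs.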

So you've got a 2-3x slowdown in write performance, but you guarantee read 
performance will at least match existing filesystems (with ZFS caching, it may 
exceed it).  ZFS then works much better with all the existing optimisations 
done within the database software, and you still keep all the benefits of ZFS - 
full data integrity, snapshots, clones, etc...

For many database admins, I think that would be an option they would like to 
have.

Taking it a stage further, I wonder if this would work well with the 
prioritized write feature request (caching writes to a solid state disk)?  
http://www.genunix.org/wiki/index.php/OpenSolaris_Storage_Developer_Wish_List

That could potentially mean there's very little slowdown:
 - Read the original block
 - Save that to solid state disk
 - Write the new block in the original location
 - Periodically stream writes from the solid state disk to the main storage
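
Sketching that in the same toy style (again, all the names are made up for 
illustration and have nothing to do with the real prioritized write code), it 
might look something like this:

# Continues the toy model from the earlier sketch, but standalone: displaced
# old blocks are staged on a solid state log and flushed to free space in
# batches instead of being rewritten to the main disks immediately.
# Invented names; this models the idea, not ZFS.

class ToyPoolWithSSDLog:
    def __init__(self, size):
        self.blocks = {}                      # block number -> data
        self.free_blocks = set(range(size))   # unused block numbers
        self.ssd_log = []                     # staged (block number, old data)

    def overwrite_in_place(self, block_no, new_data):
        old_data = self.blocks.get(block_no)
        if old_data is not None:
            # 1-2. Read the old block and stage it on the solid state disk,
            #      so the spinning disk head never has to move away.
            self.ssd_log.append((block_no, old_data))
        # 3. Write the new data straight back into the original location.
        self.free_blocks.discard(block_no)
        self.blocks[block_no] = new_data

    def flush_log(self):
        # 4. Periodically stream the staged old copies out to free space on
        #    the main storage as one batched, roughly sequential write.
        destinations = sorted(self.free_blocks)[:len(self.ssd_log)]
        for dest, (_, old_data) in zip(destinations, self.ssd_log):
            self.free_blocks.discard(dest)
            self.blocks[dest] = old_data
        self.ssd_log.clear()

pool = ToyPoolWithSSDLog(size=1024)
pool.overwrite_in_place(7, b"original row")
pool.overwrite_in_place(7, b"updated row")    # old copy goes to the SSD log
pool.flush_log()                              # batched flush to main storage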

In theory there's no need for the drive head to move at all between the read 
and the write, so this should only be fractionally slower than traditional ZFS 
writes.  Yes the data needs to be flushed from the solid state store from time 
to time, but those writes can be batched together for improved performance and 
streamed to contiguous free space on the disk.

That would appear to then give you the best of both worlds.
 
 