On Tue, Nov 24, 2009 at 9:46 AM, Richard Elling
<richard.ell...@gmail.com> wrote:
> Good question!  Additional thoughts below...
>
> On Nov 24, 2009, at 6:37 AM, Mike Gerdts wrote:
>
>> Suppose I have a storage server that runs ZFS, presumably providing
>> file (NFS) and/or block (iSCSI, FC) services to other machines that
>> are running Solaris.  Some of the use will be for LDoms and zones[1],
>> which would create zpools on top of zfs (fs or zvol).  I have concerns
>> about variable block sizes and the implications for performance.
>>
>> 1. http://hub.opensolaris.org/bin/view/Community+Group+zones/zoss
>>
>> Suppose that on the storage server, an NFS shared dataset is created
>> without tuning the block size.  This implies that when the client
>> (ldom or zone v12n server) runs mkfile or similar to create the
>> backing store for a vdisk or a zpool, the file on the storage server
>> will be created with 128K blocks.  Then when Solaris or OpenSolaris is
>> installed into the vdisk or zpool, files of a wide variety of sizes
>> will be created.  At this layer they will be created with variable
>> block sizes (512B to 128K).
>>
>> The implications for a 512 byte write in the upper level zpool (inside
>> a zone or ldom) seems to be:
>>
>> - The 512 byte write turns into a 128 KB write at the storage server
>>  (256x multiplication in write size).
>> - To write that 128 KB block, the rest of the block needs to be read
>>  to recalculate the checksum.  That is, a read/modify/write process
>>  is forced.  (Less impact if block already in ARC.)
>> - Deduplication is likely to be less effective because it is unlikely
>>  that the same combination of small blocks in different zones/ldoms
>>  will be packed into the same 128 KB block.
>>
>> Alternatively, the block size could be forced to something smaller at
>> the storage server.  Setting it to 512 bytes could eliminate the
>> read/modify/write cycle, but would presumably be less efficient (less
>> performant) with moderate to large files.  Setting it somewhere in
>> between may be desirable as well, but it is not clear where.  The key
>> competition in this area seems to have a fixed 4 KB block size.
>>
>> Questions:
>>
>> Are my basic assumptions correct, that a given file consists of only a
>> single block size, except for perhaps the final block?
>
> Yes, for a file system dataset. Volumes are fixed block size with
> the default being 8 KB.  So in the iSCSI over volume case, OOB
> it can be more efficient.  4 KB matches well with NTFS or some of
> the Linux file systems.

OOB is missing from my TLA translator.  Help, please.
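
For anyone else following along: the volume block size has to be chosen at
creation time (it can't be changed later), while recordsize is an ordinary
dataset property.  A rough sketch, with made-up pool/dataset names:

    # 10 GB zvol with a 4 KB block size to back an NTFS or ext3 guest
    zfs create -V 10g -o volblocksize=4k tank/guests/winvol

    # file system dataset tuned to 4 KB records instead of the 128 KB default
    zfs create -o recordsize=4k tank/guests/nfs-backing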

>
>> Has any work been done to identify the performance characteristics in
>> this area?
>
> None to my knowledge.  The performance teams know to set the block
> size to match the application, so they don't waste time re-learning this.

That works great for certain workloads, particularly those with a
fixed record size or large sequential I/O.  If the workload is
"installing then running an operating system" the answer is harder to
define.
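
For what it's worth, recordsize can at least be adjusted per dataset after
the fact, though (if I understand it right) it only applies to files created
after the change.  Something like this, with a hypothetical dataset name:

    # try 32 KB records for the OS-image backing store; existing files
    # keep whatever block size they were written with
    zfs set recordsize=32k tank/ldom-backing
    zfs get recordsize tank/ldom-backing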

>
>> Is there less to be concerned about from a performance standpoint if
>> the workload is primarily read?
>
> Sequential read: yes
> Random read: no

I was thinking that random wouldn't be too much of a concern either
assuming that the things that are commonly read are in cache.  I guess
this does open the door for a small chunk of useful code in the middle
of a largely useless shared library to force a lot of that shared
library into the ARC, among other things.
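
(The quick-and-dirty way I'd check for that is to eyeball the ARC counters
while exercising the workload, something like:

    kstat -n arcstats | egrep 'hits|misses'

but that's a sanity check, not a methodology.)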

>
>> To maximize the efficacy of dedup, would it be best to pick a fixed
>> block size and match it between the layers of zfs?
>
> I don't think we know yet.  Until b128 arrives in binary, and folks get
> some time to experiment, we just don't have much data... and there
> are way too many variables at play to predict.  I can make one
> prediction, though: dedupe for mkfile or dd if=/dev/zero will scream :-)

We already have that optimization with compression.  Dedupe just
messes up my method of repeatedly writing the same smallish (<1MB)
chunk of random or already compressed data to avoid the block-of-zeros
compression optimization.
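
The trick, roughly (paths and sizes are arbitrary):

    # one random 512 KB chunk, written over and over: the block-of-zeros
    # compression shortcut never kicks in, but dedupe collapses it anyway
    dd if=/dev/urandom of=/tmp/chunk bs=512k count=1 2>/dev/null
    i=0
    while [ $i -lt 2048 ]; do
        dd if=/tmp/chunk bs=512k count=1 2>/dev/null
        i=`expr $i + 1`
    done >> /testpool/fs/bigfile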

Pretty soon filebench is going to need statistical methods to control
the level of duplicate data in the files it generates.  Trying to write
simple benchmarks to test increasingly smart systems looks to be
problematic.
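
A strawman version of what I mean, with the duplicate fraction as the knob
(every 4th 128 KB block repeats one chunk, so the expected dedupe ratio is
only about 1.3x instead of collapsing to almost nothing):

    dd if=/dev/urandom of=/tmp/dupblock bs=128k count=1 2>/dev/null
    i=0
    while [ $i -lt 8192 ]; do
        if [ `expr $i % 4` -eq 0 ]; then
            dd if=/tmp/dupblock bs=128k count=1 2>/dev/null
        else
            dd if=/dev/urandom bs=128k count=1 2>/dev/null
        fi
        i=`expr $i + 1`
    done > /testpool/fs/mixedfile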

-- 
Mike Gerdts
http://mgerdts.blogspot.com/