This is not a problem we're trying to solve, but part of a characterization study of the ZFS implementation. We're using the default 8KB blocksize for our zvol deployment, and we're running tests with write block sizes ranging from 4KB up to 1MB as previously described (including an 8KB write aligned to logical zvol block zero, a perfect match to the zvol blocksize). In every case we see at least twice as much I/O to the disks as we generate from our test program, and it's much worse for the smaller write block sizes.
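For reference, here's a rough sketch of the kind of write generator we're talking about; it's an illustration in Python, not our actual test program, and the zvol device path is a made-up placeholder. The idea is simply that the front-end byte count it reports can be compared with the back-end disk traffic seen in iostat or zpool iostat while it runs.

    import os
    import sys

    # Illustration only: overwrite a 2GB region of a zvol with a fixed write
    # block size so the front-end byte count can be compared with back-end
    # disk traffic.  The device path below is a placeholder, not a real pool.
    DEVICE = "/dev/zvol/rdsk/tank/testvol"
    REGION = 2 * 1024 ** 3                    # 2GB test region
    BLOCK = int(sys.argv[1]) if len(sys.argv) > 1 else 8 * 1024

    buf = b"\xa5" * BLOCK
    fd = os.open(DEVICE, os.O_WRONLY)
    try:
        offset = 0
        while offset < REGION:
            os.pwrite(fd, buf, offset)        # writes stay BLOCK-aligned
            offset += BLOCK                   # (short writes ignored for brevity)
    finally:
        os.close(fd)

    print("front-end bytes written:", offset)

Run it with the write size in bytes as the first argument (4096, 8192, 1048576 and so on).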
We're not exactly caught in read-modify-write hell (except when we write the 4KB blocks that are smaller than the zvol blocksize); it's more like modify-write hell, since the original metadata that maps the 2GB region we're writing is probably read just once and kept in cache for the duration of the test. The large amount of back-end I/O is almost entirely write operations, but those writes include the rewriting of metadata that must change to reflect the relocation of the newly written data (remember, no in-place writes ever occur for data or metadata).
With the default zvol block size of 8KB, ZFS requires, in block-pointer metadata alone, about 1.5% of the total 2GB write region. That is a large percentage compared with other file systems such as UFS, because ZFS uses a 128-byte block pointer where UFS uses an 8-byte one.
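The 1.5% figure is easy to reproduce with back-of-the-envelope arithmetic; the numbers below are mine, assuming one 128-byte block pointer per 8KB zvol block, with the UFS line reusing the same block count at 8 bytes per pointer purely for contrast.

    # Leaf block-pointer metadata needed to map the 2GB test region at the
    # default 8KB zvol blocksize (assumption: one 128-byte pointer per block).
    region = 2 * 1024 ** 3              # 2GB write region
    volblocksize = 8 * 1024             # default zvol blocksize
    bp_zfs = 128                        # ZFS block pointer size in bytes
    bp_ufs = 8                          # UFS-style block address, for contrast

    blocks = region // volblocksize     # 262,144 data blocks
    zfs_meta = blocks * bp_zfs          # 32MB of leaf block pointers
    ufs_meta = blocks * bp_ufs          # 2MB at 8 bytes per pointer

    print("data blocks:", blocks)
    print("zfs leaf pointers: %.1f MB (%.2f%% of region)"
          % (zfs_meta / 2.0 ** 20, 100.0 * zfs_meta / region))
    print("ufs-style pointers: %.1f MB (%.2f%% of region)"
          % (ufs_meta / 2.0 ** 20, 100.0 * ufs_meta / region))

That works out to 32MB, or about 1.56% of the region; the indirect levels above the leaf pointers add only about another 1/1024th of that, so they barely move the percentage.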
As new data is written over the old data, the leaves of the metadata tree necessarily change to point to the new on-disk locations of that data. But any changed leaf block pointer requires that a new block of leaf pointers be allocated and written, which in turn requires that the next indirect level up point to this new set of leaf pointers, so it must be rewritten itself, and so on up the tree (and remember, metadata may be written in up to 3 copies, 2 by default, any time any of it goes to disk). The indirect pointer blocks closer to the root of the tree may see only a single pointer change over the course of a 5-second consolidation (depending on the size of the zvol, the size of its block allocation unit, and the amount of data actually written to the zvol in those 5 seconds), but a complete new indirect block must still be created and written to disk, all the way back to the uberblock, on each transaction group write. This means that some of these metadata blocks are written to disk over and over again with only small changes from their previous composition.
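To make that cascade concrete, here is a toy model of how many indirect blocks one transaction group would rewrite when a contiguous run of data blocks is dirtied. It's my own arithmetic, not anything lifted from the ZFS source: it assumes 128KB indirect blocks holding 1024 of the 128-byte pointers each and the default 2 metadata copies, and it ignores metadata compression and the dnode/uberblock chain above the object.

    import math

    FANOUT = (128 * 1024) // 128        # assumed pointers per 128KB indirect block
    COPIES = 2                          # default number of metadata copies

    def indirect_blocks_rewritten(dirty_blocks, total_blocks):
        """Toy estimate of indirect blocks rewritten in one transaction
        group when a contiguous run of dirty_blocks data blocks (out of
        total_blocks mapped by the object) is overwritten."""
        rewritten = 0
        dirty, total = dirty_blocks, total_blocks
        while total > 1:                           # walk up the indirect levels
            dirty = math.ceil(dirty / FANOUT)      # blocks dirtied at this level
            total = math.ceil(total / FANOUT)      # blocks existing at this level
            rewritten += min(dirty, total)         # each one is rewritten whole
        return rewritten * COPIES

    total = (2 * 1024 ** 3) // (8 * 1024)          # 262,144 data blocks in 2GB
    for dirty in (1, 1024, 32768):
        n = indirect_blocks_rewritten(dirty, total)
        print("%6d data blocks dirtied -> ~%d indirect-block writes (~%d KB)"
              % (dirty, n, n * 128))

The point is that a dirtied indirect block goes back to disk whole even if only one of its 1024 pointers changed, and the blocks near the root go back out on every transaction group no matter how little data arrived.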
Consolidating for more than 5 seconds would help mitigate this, but longer consolidation periods put more data at risk of being lost in case of a power failure.
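As a toy illustration of that trade-off, assuming purely for the sake of argument a steady 10MB/s of application writes and the roughly 512KB of near-root metadata per transaction group from the sketch above:

    # Toy trade-off between consolidation period, data at risk, and the fixed
    # per-txg metadata rewrite.  Both the 10MB/s write rate and the 512KB of
    # near-root metadata per txg are assumptions, not measurements.
    WRITE_RATE = 10 * 1024 ** 2        # assumed front-end write rate, bytes/s
    FIXED_META = 512 * 1024            # near-root metadata rewritten every txg

    for interval in (1, 5, 30, 60):    # candidate consolidation periods, seconds
        data_at_risk = WRITE_RATE * interval
        overhead = 100.0 * FIXED_META / data_at_risk
        print("%3ds txg: %4d MB at risk, fixed metadata ~%.2f%% of new data"
              % (interval, data_at_risk // 1024 ** 2, overhead))

Longer periods amortize the fixed rewrite over more data, but every extra second is another second of writes exposed to a power failure.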
This is not particularly a problem, just a manifestation of the need never to write in place, of a rather large block pointer, and of the possible writing of multiple copies of metadata (of course, that block pointer carries checksums and the addresses of up to 3 duplicate blocks, providing the excellent data and metadata protection ZFS is so well known for). The original thread this reply addressed concerned the characteristic 5-second delay in writes, which I tried to explain in the context of copy-on-write consolidation, but it's clear that even this delay cannot prevent the same basic metadata from being modified and rewritten many times with only small changes.
 
 