On 5/7/07, Chris Csanady <[EMAIL PROTECTED]> wrote:
> On 5/7/07, Tony Galway <[EMAIL PROTECTED]> wrote:
> > Greetings learned ZFS geeks & guru's,
> >
> > Yet another question comes from my continued ZFS performance testing. This
> > has to do with zpool iostat, and the strangeness that I do see.
> > I've created an eight (8) disk raidz pool from a Sun 3510 fibre array giving
> > me a 465G volume.
> > # zpool create tp raidz c4t600 ... 8 disks worth of zpool
> > # zfs create tp/pool
> > # zfs set recordsize=8k tp/pool
> > # zfs set mountpoint=/pool tp/pool
> This is a known problem, and is an interaction between the alignment
> requirements imposed by RAID-Z and the small recordsize you have
> chosen. You may effectively avoid it in most situations by choosing a
> RAID-Z stripe width of 2^n+1. For a fixed record size, this will work
> perfectly well.
Well, an alignment issue may be the case for the second iostat output,
but not for the first. I'd suspect in the first case the I/O being
seen is the syncing of the transaction group and associated block
pointers to the RAID (though I could be very wrong on this).

Also, I'm not entirely sure about your formula (how can you choose
a stripe width that's not a power of 2?). For an 8 disk single parity
RAID, data is going to be written to 7 disks and parity to 1. If each
disk block is 512 bytes, then 128 disk blocks will be written for each
64k filesystem block. This will require 18 full rows (and a bit of the
19th) on the 7 data disks. Therefore we have a requirement for 128
blocks of data + 19 blocks of parity = 147 blocks. Now if we take
into account the alignment requirement, which says that the number of
blocks written must be a multiple of (nparity + 1), 148 blocks
will be written. 148 % 8 = 4, which means that on each successive 64k
write the 'extra' roundup block will alternate between one disk and
another 4 disks apart (which happens to be just what we see).
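
For concreteness, here's a back-of-the-envelope sketch of that
calculation in C (this is just my arithmetic above, not the actual
vdev_raidz_asize() code):

    #include <stdio.h>

    int
    main(void)
    {
            int ndisks = 8;             /* disks in the raidz vdev */
            int nparity = 1;            /* single parity */
            int secsize = 512;          /* disk block size in bytes */
            int fsblock = 64 * 1024;    /* 64k filesystem block */

            int ndata = ndisks - nparity;
            int datablocks = fsblock / secsize;          /* 128 */
            int rows = (datablocks + ndata - 1) / ndata; /* 19 rows */
            int total = datablocks + rows * nparity;     /* 147 */

            /* Round up to a multiple of (nparity + 1). */
            int asize = (total + nparity) / (nparity + 1) * (nparity + 1);

            /* Prints: asize = 148 blocks, asize % ndisks = 4 */
            printf("asize = %d blocks, asize %% ndisks = %d\n",
                asize, asize % ndisks);
            return (0);
    }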
> Even so, there will still be cases where small files will cause
> problems for RAID-Z. While it does not affect many people right now,
> I think it will become a more serious issue when disks move to 4k
> sectors.
True. But when disks move to 4k sectors they will be on the order of
terabytes in size. It would probably be more pain than it's worth to
try to efficiently pack these. (And it's very likely that your
filesystem and per-file block size will be at least 4k.)
> I think the reason for the alignment constraint was to ensure that the
> stranded space was accounted for, otherwise it would cause problems as
> the pool fills up. (Consider a 3 device RAID-Z, where only one data
> sector and one parity sector are written; the third sector in that
> stripe is essentially dead space.)
Indeed. As Adam explained here:
http://www.opensolaris.org/jive/thread.jspa?threadID=26115&tstart=0 it
specifically pertains to what happens if you allow an odd number of
disk blocks to be written: you then free that block and try to fill
the space with 512-byte fs blocks -- you get a single 512-byte hole
that you can't fill.
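
In code, that roundup rule amounts to something like this (a sketch
with names of my own choosing, not the actual ZFS function):

    /*
     * Round a RAID-Z allocation up to a multiple of (nparity + 1)
     * sectors. Any freed region is then always big enough to hold the
     * smallest possible write (1 data sector + nparity parity sectors),
     * so no un-fillable single-sector holes can form as the pool fills.
     */
    static int
    raidz_roundup(int sectors, int nparity)
    {
            return ((sectors + nparity) / (nparity + 1) * (nparity + 1));
    }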
> Would it be possible (or worthwhile) to make the allocator aware of
> this dead space, rather than imposing the alignment requirements?
> Something like a concept of tentatively allocated space in the
> allocator, which would be managed based on the requirements of the
> vdev. Using such a mechanism, it could coalesce the space if possible
> for allocations. Of course, it would also have to convert the
> misaligned bits back into tentatively allocated space when blocks are
> freed.
It would add complexity, and this roundup only occurs in the RAID-Z
vdev. As the metaslab/space allocator doesn't have any idea about the
on-disk layout, it wouldn't be able to say whether successive single
free blocks in the space map are on the same or different disks -- and
this would further add to the complexity of data/parity allocation
within the RAID-Z vdev itself.
> While I expect this may require changes which would not easily be
> backward compatible, the alignment on RAID-Z has always felt a bit
> wrong. While the more severe effects can be addressed by also writing
> out the dead space, that will not address uneven placement of data and
> parity across the stripes.
I've also had issues with this (under a slightly different guise).
I've implemented a rather naive raidz variant based on the current
implementation which allows you to use all the disk space on an array
of mismatched disks.

What I've done is use the grid portion of the block pointer to specify
a RAID 'version' number (of which you are currently allowed 255, 0
being reserved for the current layout). I've then organized it such
that metaslab_init is specialised in the raidz vdev (a la
vdev_raidz_asize()) and allocates the metaslab as before, but forces a
new metaslab when a boundary is reached that would alter the number of
disks in a stripe. This increases the number of metaslabs by O(number
of disks). It also means that you need to do the psize_to_asize
conversion slightly later in the metaslab allocation section (rather
than once per vdev); and that things like raidz_asize() and
map_alloc() have an additional lg(number_disks) overhead in
computation.
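
Roughly, the dispatch looks like this (a sketch with hypothetical
names; only the 8-bit width of the blkptr grid field is taken from
the on-disk format):

    #include <stdint.h>

    /*
     * Key RAID-Z layouts off the 8-bit grid field in the block pointer.
     * grid == 0 is reserved for the current layout, which keeps the
     * scheme backwards compatible with the original RAID-Z.
     */
    typedef struct raidz_layout {
            int rl_ndisks;      /* disks participating in this layout */
            int rl_nparity;     /* parity columns in this layout */
    } raidz_layout_t;

    static raidz_layout_t rl_table[256];    /* [0] = original layout */

    static const raidz_layout_t *
    raidz_layout_lookup(uint8_t grid)
    {
            return (&rl_table[grid]);
    }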
The result is allocation that is marginally more complex
computationally, though from benchmarking with dtrace it's hardly
noticeable compared to the overhead of malloc(); and you get a lot of
disk space back (if you're a disk collector (read: poor student) like
me :) ).
While this is by no means complete, it does give you some things for
'free' (as it were). Changing the number of parity disks from
1->2->1, for instance, should simply be a matter of creating a new
grid version; and it retains backwards compatibility with the original
RAID-Z.

Unfortunately it isn't all good news. Adding/replacing disks doesn't
appear to be as easy, as this requires munging the space map. And then
you have the problem (now we get to it...) of single-block unallocated
space when the stripe width changes. This could probably be dealt
with by passivating a metaslab as full if all that's still available
is single blocks of contiguous free space... But yes, something to
work on :).
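
Something like this, perhaps (a sketch; ms_largest_free is a field
I'm inventing for illustration):

    #include <stdint.h>

    typedef struct metaslab_sketch {
            uint64_t ms_largest_free;   /* hypothetical: largest
                                           contiguous free run, bytes */
    } metaslab_sketch_t;

    /*
     * Treat a metaslab as full once only single-sector holes remain:
     * a RAID-Z write always needs at least (1 + nparity) contiguous
     * sectors, so such a metaslab is effectively unusable anyway.
     */
    static int
    metaslab_effectively_full(const metaslab_sketch_t *ms,
        uint64_t secsize, int nparity)
    {
            return (ms->ms_largest_free <
                ((uint64_t)nparity + 1) * secsize);
    }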
Anyway, that turned out to be rather longer than I expected. If
anyone has any wise words of advice, I'm all ears!
James