Pawel Jakub Dawidek wrote:
> On Tue, Aug 07, 2007 at 11:28:31PM +0100, James Blackburn wrote:
>> Well I read this email having just written a mammoth one in the other
>> thread, my thoughts:
>>
>> The main difficulty in this, as far as I see it, is you're
>> intentionally moving data on a checksummed copy-on-write filesystem
>> ;). At the very least this is creating lots of work before we even
>> start to address the problem (and given that the ZFS guys are
>> undoubtedly working on device removal, that effort would be wasted).
>> I think this is probably more difficult than it's worth -- re-writing
>> data should be a separate, non-RAID-Z-specific feature (once you're
>> changing the block pointers, you need to update the checksums, and you
>> need to ensure that you're maintaining consistency, preserving
>> snapshots, etc. etc.). Surely it would be much easier to leave the
>> data as is and version the array's disk layout?
>
> I've had some time to experiment with my idea. What I did was:
>
> 1. Hardcode vdev_raidz_map_alloc() to always use 3 as vdev_children;
>    this lets me use a hacked-up 'zpool attach' with RAIDZ.
> 2. Turn on logging of all writes into the RAIDZ vdev (offset+size).
> 3. zpool create tank raidz disk0 disk1 disk2
> 4. zpool attach tank disk0 disk3
> 5. zpool export tank
> 6. Back out 1.
> 7. Use a special tool that reads back all the blocks written earlier. I
>    use only three disks for reading, plus the logged offset+size pairs.
> 8. Use the same tool to write the data back, but now using four disks.
> 9. Try to: zpool import tank
>
> Yeah, 9 fails. It shows that the pool metadata is corrupted.
>
> I was really surprised. This means that layers above the vdev know
> details about vdev internals, like the number of disks, I think. What I
> basically did was add one disk. ZFS can ask the raidz vdev for a block
> using exactly the same offset+size as before.
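For concreteness, the replay in steps 7-8 can be modeled with a toy single-parity stripe layout. This is an illustrative sketch only -- the function names are made up and the geometry is not ZFS's actual vdev_raidz_map_alloc() layout -- but it shows the raidz-level operation the tool presumably performs: read a block back from a 3-wide layout, then rewrite the same bytes across 4 disks.

```python
SECT = 512  # assumed sector size

def raidz1_write(layout, data, ndisks):
    """Stripe `data` into `layout` (a dict keyed by (disk, row)) with one
    XOR-parity sector per row; the parity column rotates by row.
    Returns sectors consumed, parity included. Toy model, not real ZFS."""
    ndata = ndisks - 1
    sectors = [data[i:i + SECT].ljust(SECT, b"\0")
               for i in range(0, len(data), SECT)]
    nrows = 0
    for r, start in enumerate(range(0, len(sectors), ndata)):
        row = sectors[start:start + ndata]
        parity = bytes(SECT)
        for s in row:
            parity = bytes(a ^ b for a, b in zip(parity, s))
        pcol = r % ndisks                       # rotate the parity column
        layout[(pcol, r)] = parity
        for i, s in enumerate(row):
            layout[((pcol + 1 + i) % ndisks, r)] = s
        nrows += 1
    return len(sectors) + nrows

def raidz1_read(layout, size, ndisks):
    """Read `size` bytes back, skipping the rotating parity column."""
    ndata = ndisks - 1
    out, r = b"", 0
    while len(out) < size:
        pcol = r % ndisks
        for i in range(ndata):
            s = layout.get(((pcol + 1 + i) % ndisks, r))
            if s is not None:
                out += s
        r += 1
    return out[:size]

# "Steps 7-8": read a 128 KiB block from a 3-disk layout and write the
# same bytes into a 4-disk layout -- the data itself round-trips fine.
block = bytes(range(256)) * 512
old, new = {}, {}
raidz1_write(old, block, ndisks=3)
raidz1_write(new, raidz1_read(old, len(block), ndisks=3), ndisks=4)
assert raidz1_read(new, len(block), ndisks=4) == block
```

Note that the data round-trips cleanly at this layer; the catch is that the two widths consume a different number of sectors for the same block (384 vs. 342 here), which is what the reply turns on.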
Really? I don't see how that could be possible using the current raidz
on-disk layout. How did you rearrange the data? I.e., what do steps 7+8
do to the data on disk?

If you change the number of disks in a raidz group, then at a minimum
the number of allocated sectors for each block will change, since this
count includes the raidz parity sectors. This will change the block
pointers, so all block-pointer-containing metadata will have to be
rewritten (i.e., all indirect blocks and dnodes).

--matt
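The point about allocated sectors can be made concrete with a simplified sketch of the single-parity allocation-size calculation (modeled on the logic in ZFS's vdev_raidz_asize(); the exact rounding and multi-parity handling in the real function are omitted here):

```python
def raidz_asize(psize, ndisks, nparity=1, ashift=9):
    """Bytes allocated for a psize-byte block on a raidz vdev: data
    sectors plus `nparity` parity sectors per stripe row, where a row
    spans (ndisks - nparity) data columns. Simplified sketch of the
    logic in ZFS's vdev_raidz_asize(); the real code rounds further."""
    data = ((psize - 1) >> ashift) + 1          # data sectors
    rows = -(-data // (ndisks - nparity))       # ceiling division
    return (data + nparity * rows) << ashift

# The same 128 KiB block allocates different space on 3 vs. 4 disks,
# so the block pointer (which records the allocated size) must change:
raidz_asize(131072, ndisks=3)   # 196608 bytes: 256 data + 128 parity sectors
raidz_asize(131072, ndisks=4)   # 175104 bytes: 256 data + 86 parity sectors
```

Since every block pointer records this allocated size, replaying the raw writes onto a wider vdev leaves every blkptr in the pool describing the old 3-disk allocation, which is consistent with the import failing on corrupted metadata.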