On Wed, Sep 19, 2007 at 03:06:20PM -0700, Matthew Ahrens wrote:
> Pawel Jakub Dawidek wrote:
> > On Tue, Aug 07, 2007 at 11:28:31PM +0100, James Blackburn wrote:
> >> Well, I read this email having just written a mammoth one in the other
> >> thread; my thoughts:
> >>
> >> The main difficulty in this, as far as I see it, is that you're
> >> intentionally moving data on a checksummed copy-on-write filesystem ;).
> >> At the very least this creates lots of work before we even start to
> >> address the problem (and given that the ZFS guys are undoubtedly
> >> working on device removal, that effort would be wasted). I think this
> >> is probably more difficult than it's worth -- rewriting data should be
> >> a separate, non-RAID-Z-specific feature (once you're changing the block
> >> pointers, you need to update the checksums, ensure that you're
> >> maintaining consistency, preserve snapshots, etc.). Surely it would be
> >> much easier to leave the data as is and version the array's disk
> >> layout?
> >
> > I've had some time to experiment with my idea. What I did was:
> >
> > 1. Hardcode vdev_raidz_map_alloc() to always use 3 as vdev_children;
> >    this lets me use a hacked-up 'zpool attach' with RAIDZ.
> > 2. Turn on logging of all writes into the RAIDZ vdev (offset+size).
> > 3. zpool create tank raidz disk0 disk1 disk2
> > 4. zpool attach tank disk0 disk3
> > 5. zpool export tank
> > 6. Back out 1.
> > 7. Use a special tool that reads back all the blocks written earlier.
> >    It uses only three disks for reading and the logged offset+size
> >    pairs.
> > 8. Use the same tool to write the data back, but now using four disks.
> > 9. Try to: zpool import tank
> >
> > Yeah, 9 fails. It reports that the pool metadata is corrupted.
> >
> > I was really surprised. This means that the layers above the vdev know
> > details about vdev internals, like the number of disks, I think. What I
> > basically did was add one disk. ZFS can still ask the raidz vdev for a
> > block using exactly the same offset+size as before.
>
> Really? I don't see how that could be possible using the current raidz
> on-disk layout.
>
> How did you rearrange the data? I.e., what do steps 7+8 do to the data
> on disk? If you change the number of disks in a raidz group, then at a
> minimum the number of allocated sectors for each block will change,
> since this count includes the raidz parity sectors. That will change the
> block pointers, so all block-pointer-containing metadata will have to be
> rewritten (i.e., all indirect blocks and dnodes).
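To make the quoted point about allocated sectors concrete, here is a rough
sketch of the raidz size calculation, loosely modeled on vdev_raidz_asize();
the standalone function name and the omitted final rounding are illustrative
assumptions, not the shipped code:

    #include <stdint.h>

    /*
     * Sketch of how raidz charges parity to a block's allocated size.
     * psize is the logical write size, ashift the sector shift, dcols the
     * number of children and nparity the number of parity columns.  The
     * shipped vdev_raidz_asize() also rounds the result up; that detail
     * is omitted here.
     */
    static uint64_t
    raidz_asize_sketch(uint64_t psize, uint64_t ashift, uint64_t dcols,
        uint64_t nparity)
    {
            uint64_t asize;

            asize = ((psize - 1) >> ashift) + 1;    /* data sectors */
            asize += nparity *                      /* parity sectors */
                ((asize + dcols - nparity - 1) / (dcols - nparity));

            return (asize << ashift);
    }

With this formula, a 128K block on a 3-disk raidz1 with 512-byte sectors is
charged ceil(256/2) = 128 parity sectors (384 sectors allocated in total),
while on a 4-disk raidz1 it is charged ceil(256/3) = 86 (342 sectors), so the
asize recorded in every block pointer no longer matches after adding a disk.
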
I created a userland tool based on vdev_raidz.c. I logged all write
requests to vdev_raidz (offset+size). The tool first reads all the data
(based on the logged offset+size pairs) into memory, passing the old
number of components to vdev_raidz_map_alloc() as the dcols argument.
When all the data is in memory, it writes it back, but now with dcols+1.
This way I don't change the offsets that the upper layers pass to
vdev_raidz.

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                         http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
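As an illustration of the replay described above, here is a sketch of how a
logical (offset, size) pair is scattered across the children, written from
memory and loosely following vdev_raidz_map_alloc(); the function
print_raidz_map() and its output format are hypothetical, and the shipped
code differs in details (e.g., it also computes per-column sizes and parity):

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Hypothetical demo: map a logical (offset, size) write onto dcols
     * children the way raidz does, printing which child each column lands
     * on.  Loosely follows vdev_raidz_map_alloc(); per-column sizes and
     * parity generation are omitted here.
     */
    static void
    print_raidz_map(uint64_t offset, uint64_t size, uint64_t unit_shift,
        uint64_t dcols, uint64_t nparity)
    {
            uint64_t b = offset >> unit_shift;      /* starting sector */
            uint64_t s = size >> unit_shift;        /* data sectors */
            uint64_t f = b % dcols;                 /* first column */
            uint64_t o = (b / dcols) << unit_shift; /* offset on that child */
            uint64_t q = s / (dcols - nparity);     /* full rows */
            uint64_t r = s % (dcols - nparity);     /* leftover sectors */
            uint64_t acols = (q == 0 ? r + nparity : dcols);
            uint64_t c;

            printf("dcols=%llu:\n", (unsigned long long)dcols);
            for (c = 0; c < acols; c++) {
                    uint64_t col = f + c;
                    uint64_t coff = o;
                    if (col >= dcols) {
                            col -= dcols;
                            coff += 1ULL << unit_shift;
                    }
                    printf("  column %llu -> child %llu, offset %llu\n",
                        (unsigned long long)c, (unsigned long long)col,
                        (unsigned long long)coff);
            }
    }

    int
    main(void)
    {
            /* The same logical write, laid out over 3 and then 4 children. */
            print_raidz_map(3 << 9, 4 << 9, 9, 3, 1);
            print_raidz_map(3 << 9, 4 << 9, 9, 4, 1);
            return (0);
    }

Running it shows the same logical offset landing on a different set of child
disks and child offsets once dcols goes from 3 to 4, which is the remapping
the tool performs for every logged write: the block's contents move, but the
offset+size the upper layers use to ask for it stays the same.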
