On Wed, Sep 19, 2007 at 03:06:20PM -0700, Matthew Ahrens wrote:
> Pawel Jakub Dawidek wrote:
> > On Tue, Aug 07, 2007 at 11:28:31PM +0100, James Blackburn wrote:
> >> Well, I read this email having just written a mammoth one in the other
> >> thread; my thoughts:
> >>
> >> The main difficulty in this, as far as I see it, is that you're
> >> intentionally moving data on a checksummed copy-on-write filesystem ;).
> >> At the very least this creates lots of work before we even start to
> >> address the problem (and given that the ZFS guys are undoubtedly
> >> working on device removal, that effort would be wasted). I think this
> >> is probably more difficult than it's worth -- rewriting data should be
> >> a separate, non-RAID-Z-specific feature (once you're changing the block
> >> pointers, you need to update the checksums, ensure that you're
> >> maintaining consistency, preserve snapshots, etc.). Surely it would be
> >> much easier to leave the data as is and version the array's disk
> >> layout?
> >
> > I've had some time to experiment with my idea. What I did was:
> >
> > 1. Hardcode vdev_raidz_map_alloc() to always use 3 as vdev_children;
> >    this lets me use a hacked-up 'zpool attach' with RAIDZ.
> > 2. Turn on logging of all writes into the RAIDZ vdev (offset+size).
> > 3. zpool create tank raidz disk0 disk1 disk2
> > 4. zpool attach tank disk0 disk3
> > 5. zpool export tank
> > 6. Back out 1.
> > 7. Use a special tool that reads back all the blocks written earlier.
> >    It uses only three disks for reading and the logged offset+size
> >    pairs.
> > 8. Use the same tool to write the data back, but now using four disks.
> > 9. Try to: zpool import tank
> >
> > Yeah, 9 fails. It reports that the pool metadata is corrupted.
> >
> > I was really surprised. This means that the layers above the vdev know
> > details about vdev internals, like the number of disks, I think. What I
> > basically did was add one disk. ZFS can still ask the raidz vdev for a
> > block using exactly the same offset+size as before.
>
> Really? I don't see how that could be possible using the current raidz
> on-disk layout.
>
> How did you rearrange the data? I.e., what do steps 7+8 do to the data
> on disk? If you change the number of disks in a raidz group, then at a
> minimum the number of allocated sectors for each block will change,
> since this count includes the raidz parity sectors. That will change the
> block pointers, so all block-pointer-containing metadata will have to be
> rewritten (i.e., all indirect blocks and dnodes).
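To make the quoted point about allocated sectors concrete, here is a rough
sketch of the raidz size calculation, loosely modeled on vdev_raidz_asize();
the standalone function name and the omitted final rounding are illustrative
assumptions, not the shipped code:

    #include <stdint.h>

    /*
     * Sketch of how raidz charges parity to a block's allocated size.
     * psize is the logical write size, ashift the sector shift, dcols the
     * number of children and nparity the number of parity columns.  The
     * shipped vdev_raidz_asize() also rounds the result up; that detail
     * is omitted here.
     */
    static uint64_t
    raidz_asize_sketch(uint64_t psize, uint64_t ashift, uint64_t dcols,
        uint64_t nparity)
    {
            uint64_t asize;

            asize = ((psize - 1) >> ashift) + 1;    /* data sectors */
            asize += nparity *                      /* parity sectors */
                ((asize + dcols - nparity - 1) / (dcols - nparity));

            return (asize << ashift);
    }

With this formula, a 128K block on a 3-disk raidz1 with 512-byte sectors is
charged ceil(256/2) = 128 parity sectors (384 sectors allocated in total),
while on a 4-disk raidz1 it is charged ceil(256/3) = 86 (342 sectors), so the
asize recorded in every block pointer no longer matches after adding a disk.
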
I created a userland tool based on vdev_raidz.c. I logged all write
requests to vdev_raidz (offset+size). The tool first reads all the data
(based on the logged offset+size pairs) into memory, passing the old
number of components to vdev_raidz_map_alloc() as the dcols argument.
When all the data is in memory, it writes it back, but now with dcols+1.
This way I don't change the offsets that the upper layers pass to
vdev_raidz.

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                         http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
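As an illustration of the replay described above, here is a sketch of how a
logical (offset, size) pair is scattered across the children, written from
memory and loosely following vdev_raidz_map_alloc(); the function
print_raidz_map() and its output format are hypothetical, and the shipped
code differs in details (e.g., it also computes per-column sizes and parity):

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Hypothetical demo: map a logical (offset, size) write onto dcols
     * children the way raidz does, printing which child each column lands
     * on.  Loosely follows vdev_raidz_map_alloc(); per-column sizes and
     * parity generation are omitted here.
     */
    static void
    print_raidz_map(uint64_t offset, uint64_t size, uint64_t unit_shift,
        uint64_t dcols, uint64_t nparity)
    {
            uint64_t b = offset >> unit_shift;      /* starting sector */
            uint64_t s = size >> unit_shift;        /* data sectors */
            uint64_t f = b % dcols;                 /* first column */
            uint64_t o = (b / dcols) << unit_shift; /* offset on that child */
            uint64_t q = s / (dcols - nparity);     /* full rows */
            uint64_t r = s % (dcols - nparity);     /* leftover sectors */
            uint64_t acols = (q == 0 ? r + nparity : dcols);
            uint64_t c;

            printf("dcols=%llu:\n", (unsigned long long)dcols);
            for (c = 0; c < acols; c++) {
                    uint64_t col = f + c;
                    uint64_t coff = o;
                    if (col >= dcols) {
                            col -= dcols;
                            coff += 1ULL << unit_shift;
                    }
                    printf("  column %llu -> child %llu, offset %llu\n",
                        (unsigned long long)c, (unsigned long long)col,
                        (unsigned long long)coff);
            }
    }

    int
    main(void)
    {
            /* The same logical write, laid out over 3 and then 4 children. */
            print_raidz_map(3 << 9, 4 << 9, 9, 3, 1);
            print_raidz_map(3 << 9, 4 << 9, 9, 4, 1);
            return (0);
    }

Running it shows the same logical offset landing on a different set of child
disks and child offsets once dcols goes from 3 to 4, which is the remapping
the tool performs for every logged write: the block's contents move, but the
offset+size the upper layers use to ask for it stays the same.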
