Re: [zfs-discuss] Speeding up resilver on x4500
Stuart Anderson writes:

> On Jun 21, 2009, at 10:21 PM, Nicholas Lee wrote:
>
> > On Mon, Jun 22, 2009 at 4:24 PM, Stuart Anderson wrote:
> >
> > > However, it is a bit disconcerting to have to run with reduced data
> > > protection for an entire week. While I am certainly not going back to
> > > UFS, it seems like it should be at least theoretically possible to do
> > > this several orders of magnitude faster, e.g., what if every block on
> > > the replacement disk had its RAIDZ2 data recomputed from the degraded
> >
> > Maybe this is also saying - that for large disk sets a single RAIDZ2
> > provides a false sense of security.
>
> This configuration is with 3 large RAIDZ2 devices, but I have more
> recently been building thumper/thor systems with a larger number of
> smaller RAIDZ2's.
>
> Thanks.

170M small files reconstructed in 1 week over 3 raid-z groups is about
93 files/sec per raid-z group. That is not too far from expectations for
7.2K RPM drives (were they?). I don't see orders-of-magnitude improvements
on this; however, this CR (integrated in snv_109) might give the workload
a boost:

    6801507 ZFS read aggregation should not mind the gap

This will enable more read aggregation to occur during a resilver. We
could also contemplate enabling the vdev prefetch code for data during a
resilver. Otherwise, limiting the number of small objects per raid-z
group, as you're doing now, seems wise to me.

-r

> --
> Stuart Anderson  ander...@ligo.caltech.edu
> http://www.ligo.caltech.edu/~anderson
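To spell out the arithmetic behind that 93 files/sec figure, here is a
rough Python sketch (the 170M-file count and one-week duration are the
figures quoted above; the per-group split assumes the files are spread
evenly over the 3 raid-z groups):

    # Back-of-the-envelope check of the resilver rate quoted above.
    files = 170e6            # ~170M small files, from the report above
    seconds = 7 * 86400      # ~1 week of resilver time
    raidz_groups = 3         # the pool has 3 raid-z2 top-level vdevs

    total_rate = files / seconds            # ~280 files/sec for the pool
    per_group = total_rate / raidz_groups   # ~94 files/sec per raid-z group
    print(f"{total_rate:.0f} files/sec total, {per_group:.0f} per raid-z group")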
Re: [zfs-discuss] Speeding up resilver on x4500
On Jun 23, 2009, at 11:50 AM, Richard Elling wrote:

> > (2) Is there some reasonable way to read in multiples of these blocks
> > in a single IOP? Theoretically, if the blocks are in chronological
> > creation order, they should be (relatively) sequential on the
> > drive(s). Thus, ZFS should be able to read in several of them without
> > forcing a random seek. That is, you should be able to get multiple
> > blocks in a single IOP.
>
> Metadata is prefetched. You can look at the hit rate in kstats.
> Stuart, you might post the output of "kstat -n vdev_cache_stats"
> I regularly see cache hit rates in the 60% range, which isn't bad
> considering what is being cached.

# kstat -n vdev_cache_stats
module: zfs                             instance: 0
name:   vdev_cache_stats                class:    misc
        crtime                          129.03798177
        delegations                     25873382
        hits                            114064783
        misses                          182253696
        snaptime                        960064.85352608

Here are also some zpool iostat numbers during this resilver:

# zpool iostat ldas-cit1 10
                 capacity     operations    bandwidth
pool           used  avail   read  write   read  write
------------  -----  -----  -----  -----  -----  -----
ldas-cit1     16.9T  3.49T    165    134  5.17M  1.58M
ldas-cit1     16.9T  3.49T    225    237  1.28M  1.98M
ldas-cit1     16.9T  3.49T    288    317  1.53M  2.26M
ldas-cit1     16.9T  3.49T    174    269  1014K  1.68M

And here is the pool configuration:

# zpool status ldas-cit1
  pool: ldas-cit1
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 96h49m, 63.69% done, 55h12m to go
config:

        NAME               STATE     READ WRITE CKSUM
        ldas-cit1          DEGRADED     0     0     0
          raidz2           DEGRADED     0     0     0
            c0t1d0         ONLINE       0     0     0
            c1t1d0         ONLINE       0     0     0
            c3t1d0         ONLINE       0     0     0
            c4t1d0         ONLINE       0     0     0
            c5t1d0         ONLINE       0     0     0
            c6t1d0         ONLINE       0     0     0
            c0t2d0         ONLINE       0     0     0
            c1t2d0         ONLINE       0     0     0
            c3t2d0         ONLINE       0     0     0
            c4t2d0         ONLINE       0     0     0
            c5t2d0         ONLINE       0     0     0
            spare          DEGRADED     0     0     0
              replacing    DEGRADED     0     0     0
                c6t2d0s0/o FAULTED      0     0     0  corrupted data
                c6t2d0     ONLINE       0     0     0
              c6t0d0       ONLINE       0     0     0
            c0t3d0         ONLINE       0     0     0
            c1t3d0         ONLINE       0     0     0
            c3t3d0         ONLINE       0     0     0
          raidz2           ONLINE       0     0     0
            c4t3d0         ONLINE       0     0     0
            c5t3d0         ONLINE       0     0     0
            c6t3d0         ONLINE       0     0     0
            c0t4d0         ONLINE       0     0     0
            c1t4d0         ONLINE       0     0     0
            c3t4d0         ONLINE       0     0     0
            c5t0d0         ONLINE       0     0     0
            c5t4d0         ONLINE       0     0     0
            c6t4d0         ONLINE       0     0     0
            c0t5d0         ONLINE       0     0     0
            c1t5d0         ONLINE       0     0     0
            c3t5d0         ONLINE       0     0     0
            c4t5d0         ONLINE       0     0     0
            c5t5d0         ONLINE       0     0     0
            c6t5d0         ONLINE       0     0     0
          raidz2           ONLINE       0     0     0
            c0t6d0         ONLINE       0     0     0
            c1t6d0         ONLINE       0     0     0
            c3t6d0         ONLINE       0     0     0
            c4t6d0         ONLINE       0     0     0
            c5t6d0         ONLINE       0     0     0
            c6t6d0         ONLINE       0     0     0
            c0t7d0         ONLINE       0     0     0
            c1t7d0         ONLINE       0     0     0
            c3t7d0         ONLINE       0     0     0
            c4t7d0         ONLINE       0     0     0
            c5t7d0         ONLINE       0     0     0
            c6t7d0         ONLINE       0     0     0
            c0t0d0         ONLINE       0     0     0
            c1t0d0         ONLINE       0     0     0
            c3t0d0         ONLINE       0     0     0
        spares
          c6t0d0           INUSE     currently in use

errors: No known data errors

--
Stuart Anderson  ander...@ligo.caltech.edu
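For what it's worth, those counters translate into a hit rate like this
(a small Python sketch using the numbers from the kstat output above):

    # vdev cache hit rate from the kstat counters above.
    hits = 114064783
    misses = 182253696
    hit_rate = hits / (hits + misses)
    print(f"vdev_cache hit rate: {hit_rate:.1%}")   # roughly 38% for this pool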
Re: [zfs-discuss] Speeding up resilver on x4500
On 23-Jun-09, at 1:58 PM, Erik Trimble wrote:

> Richard Elling wrote:
> > Erik Trimble wrote:
> > > All this discussion hasn't answered one thing for me: exactly _how_
> > > does ZFS do resilvering? Both in the case of mirrors, and of
> > > RAIDZ[2]? I've seen some mention that it goes in chronological order
> > > of file creation (which to me, means that the metadata must be read
> > > first), and that only used blocks are rebuilt, but exactly what is
> > > the methodology being used?
> >
> > See Jeff Bonwick's blog on the topic
> > http://blogs.sun.com/bonwick/entry/smokin_mirrors
> >  -- richard
>
> That's very informative. Thanks, Richard.
>
> So, ZFS walks the used block tree to see what still needs rebuilding. I
> guess I have two related questions then:
>
> (1) Are these blocks some fixed size (based on the media - usually 512
> bytes), or are they "ZFS blocks" - the fungible size based on the
> requirements of the original file size being written?
>
> (2) Is there some reasonable way to read in multiples of these blocks
> in a single IOP? Theoretically, if the blocks are in chronological
> creation order, they should be (relatively) sequential on the drive(s).
> Thus, ZFS should be able to read in several of them without forcing a
> random seek.

(I think) the disk's internal scheduling could help out here if they are
indeed close to physically sequential.

--Toby

> That is, you should be able to get multiple blocks in a single IOP.
>
> If we can't get multiple ZFS blocks in one sequential read, we're
> screwed - ZFS is going to be IOPS bound on the replacement disk, with no
> real workaround. Which means rebuild times for disks with lots of small
> files are going to be hideous.
>
> --
> Erik Trimble
> Java System Support
> Mailstop: usca22-123
> Phone: x17195
> Santa Clara, CA
Re: [zfs-discuss] Speeding up resilver on x4500
Erik Trimble wrote:

> Richard Elling wrote:
> > Erik Trimble wrote:
> > > All this discussion hasn't answered one thing for me: exactly _how_
> > > does ZFS do resilvering? Both in the case of mirrors, and of
> > > RAIDZ[2]? I've seen some mention that it goes in chronological order
> > > of file creation (which to me, means that the metadata must be read
> > > first), and that only used blocks are rebuilt, but exactly what is
> > > the methodology being used?
> >
> > See Jeff Bonwick's blog on the topic
> > http://blogs.sun.com/bonwick/entry/smokin_mirrors
> >  -- richard
>
> That's very informative. Thanks, Richard.
>
> So, ZFS walks the used block tree to see what still needs rebuilding. I
> guess I have two related questions then:
>
> (1) Are these blocks some fixed size (based on the media - usually 512
> bytes), or are they "ZFS blocks" - the fungible size based on the
> requirements of the original file size being written?

They are metadata, so they are compressed. I would expect many of them to
be small -- though I have no data to place behind that assumption, it
wouldn't be hard to measure.

> (2) Is there some reasonable way to read in multiples of these blocks
> in a single IOP? Theoretically, if the blocks are in chronological
> creation order, they should be (relatively) sequential on the drive(s).
> Thus, ZFS should be able to read in several of them without forcing a
> random seek. That is, you should be able to get multiple blocks in a
> single IOP.

Metadata is prefetched. You can look at the hit rate in kstats.
Stuart, you might post the output of "kstat -n vdev_cache_stats"
I regularly see cache hit rates in the 60% range, which isn't bad
considering what is being cached.
 -- richard

> If we can't get multiple ZFS blocks in one sequential read, we're
> screwed - ZFS is going to be IOPS bound on the replacement disk, with no
> real workaround. Which means rebuild times for disks with lots of small
> files are going to be hideous.
Re: [zfs-discuss] Speeding up resilver on x4500
Richard Elling wrote:

> Erik Trimble wrote:
> > All this discussion hasn't answered one thing for me: exactly _how_
> > does ZFS do resilvering? Both in the case of mirrors, and of RAIDZ[2]?
> > I've seen some mention that it goes in chronological order of file
> > creation (which to me, means that the metadata must be read first),
> > and that only used blocks are rebuilt, but exactly what is the
> > methodology being used?
>
> See Jeff Bonwick's blog on the topic
> http://blogs.sun.com/bonwick/entry/smokin_mirrors
>  -- richard

That's very informative. Thanks, Richard.

So, ZFS walks the used block tree to see what still needs rebuilding. I
guess I have two related questions then:

(1) Are these blocks some fixed size (based on the media - usually 512
bytes), or are they "ZFS blocks" - the fungible size based on the
requirements of the original file size being written?

(2) Is there some reasonable way to read in multiples of these blocks in
a single IOP? Theoretically, if the blocks are in chronological creation
order, they should be (relatively) sequential on the drive(s). Thus, ZFS
should be able to read in several of them without forcing a random seek.
That is, you should be able to get multiple blocks in a single IOP.

If we can't get multiple ZFS blocks in one sequential read, we're screwed
- ZFS is going to be IOPS bound on the replacement disk, with no real
workaround. Which means rebuild times for disks with lots of small files
are going to be hideous.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Re: [zfs-discuss] Speeding up resilver on x4500
Erik Trimble wrote:

> All this discussion hasn't answered one thing for me: exactly _how_ does
> ZFS do resilvering? Both in the case of mirrors, and of RAIDZ[2]? I've
> seen some mention that it goes in chronological order of file creation
> (which to me, means that the metadata must be read first), and that only
> used blocks are rebuilt, but exactly what is the methodology being used?

See Jeff Bonwick's blog on the topic
http://blogs.sun.com/bonwick/entry/smokin_mirrors
 -- richard
Re: [zfs-discuss] Speeding up resilver on x4500
All this discussion hasn't answered one thing for me: exactly _how_ does
ZFS do resilvering? Both in the case of mirrors, and of RAIDZ[2]?

I've seen some mention that it goes in chronological order of file
creation (which, to me, means that the metadata must be read first), and
that only used blocks are rebuilt, but exactly what is the methodology
being used?

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Re: [zfs-discuss] Speeding up resilver on x4500
On Jun 21, 2009, at 10:21 PM, Nicholas Lee wrote:

> On Mon, Jun 22, 2009 at 4:24 PM, Stuart Anderson wrote:
>
> > However, it is a bit disconcerting to have to run with reduced data
> > protection for an entire week. While I am certainly not going back to
> > UFS, it seems like it should be at least theoretically possible to do
> > this several orders of magnitude faster, e.g., what if every block on
> > the replacement disk had its RAIDZ2 data recomputed from the degraded
>
> Maybe this is also saying - that for large disk sets a single RAIDZ2
> provides a false sense of security.

This configuration is with 3 large RAIDZ2 devices, but I have more
recently been building thumper/thor systems with a larger number of
smaller RAIDZ2's.

Thanks.

--
Stuart Anderson  ander...@ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
Re: [zfs-discuss] Speeding up resilver on x4500
On Mon, 2009-06-22 at 06:06 -0700, Richard Elling wrote:

> Nevertheless, in my lab testing, I was not able to create a
> random-enough workload to not be write limited on the reconstructing
> drive. Anecdotal evidence shows that some systems are limited by the
> random reads.

Systems I've run which have random-read-limited reconstruction have a
combination of:

 - regular time-based snapshots
 - daily cron jobs which walk the filesystem, accessing all directories
   and updating all directory atimes in the process.

Because the directory dnodes are randomly distributed through the dnode
file, each block of the dnode file likely contains at least one directory
dnode, and as a result each of the tree walk jobs causes the entire dnode
file to diverge from the previous day's snapshot.

If the underlying filesystems are mostly static and there are dozens of
snapshots, a pool traverse spends most of its time reading the dnode
files and finding block pointers to older blocks which it knows it has
already seen.
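For illustration, a minimal Python sketch of the kind of nightly tree
walk described above (the /pool mountpoint is a hypothetical placeholder,
not a path from this thread):

    #!/usr/bin/env python3
    # Minimal sketch of a nightly filesystem walk: merely listing every
    # directory reads it and, with atime enabled, updates that directory's
    # atime, dirtying the dnode-file block that holds its dnode.
    import os

    MOUNTPOINT = "/pool"   # hypothetical mountpoint; substitute your dataset

    dirs = 0
    for path, subdirs, files in os.walk(MOUNTPOINT):
        dirs += 1          # traversal alone touches every directory
    print(f"walked {dirs} directories under {MOUNTPOINT}")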
Re: [zfs-discuss] Speeding up resilver on x4500
Stuart Anderson wrote:

> On Jun 21, 2009, at 8:57 PM, Richard Elling wrote:
> > Stuart Anderson wrote:
> > > It is currently taking ~1 week to resilver an x4500 running S10U6,
> > > recently patched, with ~170M small files on ~170 datasets after a
> > > disk failure/replacement, i.e.,
> >
> > wow, that is impressive. There is zero chance of doing that with a
> > manageable number of UFS file systems.
>
> However, it is a bit disconcerting to have to run with reduced data
> protection for an entire week. While I am certainly not going back to
> UFS, it seems like it should be at least theoretically possible to do
> this several orders of magnitude faster, e.g., what if every block on
> the replacement disk had its RAIDZ2 data recomputed from the degraded
> array regardless of whether the pool was using it or not. In that case
> I would expect it to be able to sequentially reconstruct in the same
> few hours it would take a HW RAID controller to do the same RAID6 job.

ZFS reconstruction is done in time order, so the workload is random for
data which has been updated over time. Nevertheless, in my lab testing, I
was not able to create a random-enough workload to not be write limited
on the reconstructing drive. Anecdotal evidence shows that some systems
are limited by the random reads.

> Perhaps there needs to be an option to re-order the loops for
> resilvering on pools with lots of small files, to resilver in device
> order rather than filesystem order?

The information is not there. Unlike RAID-1/5/6, ZFS does not require a
1:N mapping of blocks.

> > > scrub: resilver in progress for 53h47m, 30.72% done, 121h19m to go
> > >
> > > Is there anything that can be tuned to improve this performance,
> > > e.g., adding a faster cache device for reading and/or writing?
> >
> > Resilver tends to be bound by one of two limits:
> > 1. sequential write performance of the resilvering device
> > 2. random I/O performance of the non-resilvering devices
>
> A quick look at iostat leads me to conjecture that the vdev rebuilding
> is taking a very low priority compared to ongoing application I/O (NFSD
> in this case). Are there any ZFS knobs that control the relative
> priority of resilvering to other disk I/O tasks?

Yes, it is low priority. This is one argument for the competing RFEs:
    CR 6592835, resilver needs to go faster
    CR 6494473, ZFS needs a way to slow down resilvering
 -- richard
Re: [zfs-discuss] Speeding up resilver on x4500
Nicholas Lee wrote:

> On Mon, Jun 22, 2009 at 4:24 PM, Stuart Anderson
> <ander...@ligo.caltech.edu> wrote:
>
> > However, it is a bit disconcerting to have to run with reduced data
> > protection for an entire week. While I am certainly not going back to
> > UFS, it seems like it should be at least theoretically possible to do
> > this several orders of magnitude faster, e.g., what if every block on
> > the replacement disk had its RAIDZ2 data recomputed from the degraded
>
> Maybe this is also saying - that for large disk sets a single RAIDZ2
> provides a false sense of security.
>
> Nicholas

I'm assuming the problem is that you are IOPS bound. Since you wrote
small files, ZFS uses small stripe sizes, which means that when you need
to do a full-stripe read to reconstruct the RAIDZ2 parity, you're reading
only a very small amount of data. You're IOPS bound on the replacement
disk.

For argument's sake, let's assume you have 4k stripe sizes. Thus, you do:

(1) a 4k read across all disks
(2) a checksum computation
(3) a tiny write to the resilvering disk

Assuming you might max out at 300 IOPS (not unreasonable for small reads
on SATA drives), that results in:

    (300 / 2) x 4kB = 600kB/s

That is, you can do 150 stripe read/write pairs per second, each pair
reconstructing the parity for 4k of data. And that might be optimal. At
that rate, 1TB of data will take (1024 * 1024 * 1024 kB) / (600 kB/s) =~
1.8 million seconds =~ 500 hours.

I don't know how ZFS does the actual reconstruction, but I have two
suggestions:

(1) If ZFS is doing a serial resilver (i.e., resilver stripe 1 before
doing stripe 2, etc.), would it be possible to NOT do a full-stripe write
when doing the reconstruction? That is, only write the reconstructed data
back to the replacement disk. That would allow the "data" disks to use
their full IOPS for reading, and the replacement disk its full IOPS for
writing. It's still going to suck rocks, but only half as much.

(2) Multiple-stripe reconstruction would probably be better; that is, ZFS
should reconstruct several adjacent stripes together, up to some
reasonable total size (say 1MB or so). That way, you could get
reconstruction rates of 100MB/s (that is, reconstruct the parity for
100MB of data per second, NOT write 100MB/s). 1TB of data @ 100MB/s is
only about 3 hours.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
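For reference, a Python sketch that reproduces the two estimates above
(the 4kB stripes, 300 IOPS, and 100MB/s batched rate are the assumptions
stated in the paragraphs above, not measurements from this pool):

    # Rough reproduction of the resilver-time estimates above.
    KB_PER_TB = 1024 * 1024 * 1024        # 1 TB expressed in kB

    # IOPS-bound case: each 4kB stripe costs one read plus one write.
    iops = 300                            # assumed small-I/O IOPS on SATA
    stripe_kb = 4
    rate_kb_s = (iops / 2) * stripe_kb    # ~600 kB/s of reconstructed data
    hours_iops_bound = KB_PER_TB / rate_kb_s / 3600            # ~500 h per TB

    # Batched case: reconstruct ~1MB of adjacent stripes per seek, ~100 MB/s.
    batched_mb_s = 100
    hours_batched = (KB_PER_TB / 1024) / batched_mb_s / 3600   # ~3 h per TB

    print(f"IOPS-bound: {hours_iops_bound:.0f} h/TB, "
          f"batched: {hours_batched:.1f} h/TB")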
Re: [zfs-discuss] Speeding up resilver on x4500
On Mon, Jun 22, 2009 at 4:24 PM, Stuart Anderson wrote:

> However, it is a bit disconcerting to have to run with reduced data
> protection for an entire week. While I am certainly not going back to
> UFS, it seems like it should be at least theoretically possible to do
> this several orders of magnitude faster, e.g., what if every block on
> the replacement disk had its RAIDZ2 data recomputed from the degraded

Maybe this is also saying - that for large disk sets a single RAIDZ2
provides a false sense of security.

Nicholas
Re: [zfs-discuss] Speeding up resilver on x4500
On Jun 21, 2009, at 8:57 PM, Richard Elling wrote:

> Stuart Anderson wrote:
> > It is currently taking ~1 week to resilver an x4500 running S10U6,
> > recently patched, with ~170M small files on ~170 datasets after a
> > disk failure/replacement, i.e.,
>
> wow, that is impressive. There is zero chance of doing that with a
> manageable number of UFS file systems.

However, it is a bit disconcerting to have to run with reduced data
protection for an entire week. While I am certainly not going back to
UFS, it seems like it should be at least theoretically possible to do
this several orders of magnitude faster, e.g., what if every block on the
replacement disk had its RAIDZ2 data recomputed from the degraded array
regardless of whether the pool was using it or not. In that case I would
expect it to be able to sequentially reconstruct in the same few hours it
would take a HW RAID controller to do the same RAID6 job.

Perhaps there needs to be an option to re-order the loops for resilvering
on pools with lots of small files, to resilver in device order rather
than filesystem order?

> > scrub: resilver in progress for 53h47m, 30.72% done, 121h19m to go
> >
> > Is there anything that can be tuned to improve this performance,
> > e.g., adding a faster cache device for reading and/or writing?
>
> Resilver tends to be bound by one of two limits:
> 1. sequential write performance of the resilvering device
> 2. random I/O performance of the non-resilvering devices

A quick look at iostat leads me to conjecture that the vdev rebuilding is
taking a very low priority compared to ongoing application I/O (NFSD in
this case). Are there any ZFS knobs that control the relative priority of
resilvering to other disk I/O tasks?

Thanks.

--
Stuart Anderson  ander...@ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
Re: [zfs-discuss] Speeding up resilver on x4500
Stuart Anderson wrote:

> It is currently taking ~1 week to resilver an x4500 running S10U6,
> recently patched, with ~170M small files on ~170 datasets after a disk
> failure/replacement, i.e.,

wow, that is impressive. There is zero chance of doing that with a
manageable number of UFS file systems.

> scrub: resilver in progress for 53h47m, 30.72% done, 121h19m to go
>
> Is there anything that can be tuned to improve this performance, e.g.,
> adding a faster cache device for reading and/or writing?

Resilver tends to be bound by one of two limits:
1. sequential write performance of the resilvering device
2. random I/O performance of the non-resilvering devices

A while back, I was doing some characterization of this, but the
"funding" disappeared :-( So, it is unclear whether or how caching might
help.
 -- richard
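To make the two limits concrete, a rough Python sketch of the resulting
lower bound on resilver time (the drive throughput and IOPS figures below
are illustrative assumptions, not measurements from this pool):

    # Lower bound on resilver time from the two limits named above: the
    # resilver can go no faster than the slower of (a) sequentially
    # writing the allocated data onto the new disk and (b) randomly
    # reading the needed blocks from the surviving disks.
    def resilver_hours(allocated_gb, blocks,
                       seq_write_mb_s=60.0, random_iops=300.0):
        write_bound = (allocated_gb * 1024) / seq_write_mb_s  # sec, limit (a)
        read_bound = blocks / random_iops                     # sec, limit (b)
        return max(write_bound, read_bound) / 3600

    # Hypothetical example: 500 GB allocated, dominated by small blocks.
    print(f"{resilver_hours(500, 50_000_000):.0f} hours")    # read-limited here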