> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Mark Sandrock
> 
>       I'm working with someone who replaced a failed 1TB drive (50%
> utilized),
> on an X4540 running OS build 134, and I think something must be wrong.
> 
> Last Tuesday afternoon, zpool status reported:
> 
> scrub: resilver in progress for 306h0m, 63.87% done, 173h7m to go
> 
> and a week being 168 hours, that put completion at sometime tomorrow
> night.
> 
> However, he just reported zpool status shows:
> 
> scrub: resilver in progress for 447h26m, 65.07% done, 240h10m to go
> 
> so it's looking more like 2011 now. That can't be right.
> 
> I'm hoping for a suggestion or two on this issue.

On a typical live system that has been in production for a long time, with
files being created, snapshotted, partially overwritten, snapshots destroyed,
and so on, the blocks on disk end up laid out in largely random order.  And,
at least for now, ZFS resilvers blocks in creation-time order, not disk
order.  So resilver time is typically limited by random-IO IOPS and by the
number of records in the affected vdev.
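
To put rough numbers on that, here's a minimal back-of-envelope sketch in
Python.  The 50% utilization comes from the original post; the average record
sizes and the ~100 random IOPS figure are just illustrative assumptions, not
measurements from this X4540.

# Back-of-envelope resilver estimate: time ~= records in the vdev / random IOPS.
# The record sizes and IOPS below are illustrative assumptions, not measurements.

def resilver_hours(data_bytes, avg_record_bytes, random_iops):
    """Hours to touch every record once, at one random IO per record."""
    records = data_bytes / float(avg_record_bytes)
    return records / random_iops / 3600.0

TB = 1024 ** 4
data = 0.5 * TB                            # 1 TB drive, ~50% utilized (from the post)
for recsize in (128 * 1024, 16 * 1024):    # assumed average record sizes
    hrs = resilver_hours(data, recsize, random_iops=100)   # ~100 random IOPS assumed
    print("avg record %6d bytes -> roughly %.0f hours" % (recsize, hrs))

The point is the sensitivity: the same half-terabyte of data goes from about
half a day to several days as the average record size shrinks, with the
IOPS budget held constant.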

To reduce the number of records in any one vdev, it is effective to build the
pool from mirrors instead of raidz, or from several smaller raidz1 vdevs
instead of one large raidz3.  Unfortunately, you're not going to be able to
change that on an existing system.  Roughly speaking, a 23-disk raidz3 vdev
with the capacity of 20 disks would take about 40x longer to resilver than
one of the mirrors in a 40-disk stripe of mirrors with the same 20-disk
capacity.  In rough numbers, that might be 20 days instead of 12 hours (a
sketch of that arithmetic follows).
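
For what it's worth, here's where a ratio in that ballpark can come from
under the same simple model.  The ~8M records per disk's worth of data and
the 2x read penalty for raidz reconstruction (every read gated by the slowest
surviving disk) are my own illustrative assumptions, chosen to show the shape
of the arithmetic rather than measured values.

# Rough ratio of resilver times: one 23-disk raidz3 vdev holding 20 disks'
# worth of data, vs. one mirror vdev in a 40-disk stripe of mirrors holding
# the same total data.  Both inputs below are illustrative assumptions.

pool_records = 20 * 8000000        # assumed ~8M records per disk of data (0.5 TB at 64 KB)

records_per_mirror_vdev = pool_records / 20.0   # 20 mirror vdevs share the pool's records
records_per_raidz_vdev  = float(pool_records)   # the single raidz3 vdev holds all of them

raidz_read_penalty = 2.0   # assumed: each reconstruction read waits on all surviving disks

ratio = (records_per_raidz_vdev * raidz_read_penalty) / records_per_mirror_vdev
print("raidz3 resilver takes roughly %.0fx as long as one mirror" % ratio)   # ~40x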

To reduce the cost of each IO...  For background: under normal circumstances
you should disable the HBA WriteBack cache if you have a dedicated log device
present (on the X4275 that is done via the realtek HBA utility; I don't know
about the X4540).  But during a resilver, you might enable WriteBack for the
drive that's being resilvered.  I don't know for sure that it will help, but
I think it should make some difference, because the reasoning that led to
disabling the WB cache does not apply to resilver writes.

To reduce the number of records to resilver (a dry-run helper for the
snapshot cleanup is sketched after this list)...
* If possible, disable the creation of new snapshots while the resilver is
running.
* If possible, delete files and destroy old snapshots that are no longer
needed.
* If possible, limit new writes to the system.
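
Here's that dry-run helper, as a sketch only.  It just prints the
"zfs destroy" commands for snapshots older than a cutoff; the "tank" pool
name and the 90-day cutoff are placeholders, and you'd want to review the
list carefully before running any of it.

# Dry-run helper: list snapshots older than a cutoff and print the
# "zfs destroy" commands for review.  Nothing is destroyed by this script.
# "tank" and the 90-day cutoff are placeholders.
import subprocess, time

POOL = "tank"
CUTOFF = time.time() - 90 * 86400   # snapshots older than ~90 days

out = subprocess.check_output(
    ["zfs", "list", "-H", "-r", "-t", "snapshot", "-o", "name", POOL])
for snap in out.decode().split():
    creation = int(subprocess.check_output(
        ["zfs", "get", "-H", "-p", "-o", "value", "creation", snap]).decode())
    if creation < CUTOFF:
        print("zfs destroy %s" % snap)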

By the way, I'm sorry to say it, but also don't trust the progress indicator.
You're likely to reach 100% completed and stay there for a long time, or even
see 2T resilvered on a 1T disk.  It looks ugly at face value, but it's
actually correct, because the filesystem stays in use and keeps taking new
writes during the resilver, so the total amount of work keeps growing.
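
As a toy illustration of why the numbers misbehave (this is only a cartoon of
the arithmetic, not the actual zpool status code): if the "total work" figure
is fixed near the start while live writes keep adding work, the examined
counter sails right past the original total, and a percentage capped at 100
just sits there.

# Toy model only: progress measured against a total captured at the start,
# while live writes keep adding work during the resilver.  Not zpool's code.

total_at_start = 1000          # "blocks" to resilver when it began
examined = 0
backlog = total_at_start

hour = 0
while backlog > 0:
    hour += 1
    examined += 60             # resilver touches 60 blocks per "hour"...
    backlog -= 60
    backlog += 20              # ...but live writes add 20 more per "hour"
    pct = 100.0 * examined / total_at_start
    if hour % 5 == 0:
        print("hour %3d: reported %5.1f%% (examined %d of an original %d), backlog %d"
              % (hour, min(pct, 100.0), examined, total_at_start, backlog))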

To reduce contention for those IOPS...
* If possible, limit the "live" IO to the system.  Resilver runs at lower
priority and therefore gets delayed a lot on busy production systems.
* Definitely DON'T scrub the pool while it's resilvering.

You might also be able to offload some of the IO by adding cache devices, a
dedicated log, or more RAM.  It's sound in principle, but YMMV immensely
depending on your workload.

All of the above is likely to be only modestly effective.  There's not much
you can do if you started with a huge raidz3, for example.  The most
important thing you can do to affect resilver time is to choose mirrors
instead of raidz at the time of pool creation.

So, as a last-ditch effort: if you "zfs send" the pool to some other storage
and then recreate the local pool, the new pool is empty, so its resilver is
instantly complete, because zfs only resilvers used blocks.  Then "zfs send"
the data back to restore the pool.  Besides the resilver being forcibly
finished, the received data will be laid out on disk far more sequentially,
which will greatly help if another resilver is needed in the near future, and
you get an opportunity to revisit the pool architecture, possibly in favor of
mirrors instead of raidz.  (The command sequence is sketched below.)
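
For the record, here's that sequence as a sketch that only prints the
commands for review, since two of them are destructive.  The pool name
"tank", the "backup/tank" target, and the mirror layout are placeholders for
whatever your setup actually looks like.

# Sketch of the evacuate / recreate / restore sequence.  Commands are only
# printed for review -- two of them are destructive.  Pool names, the backup
# target, and the new vdev layout are all placeholders.

pool, snap, target = "tank", "migrate", "backup/tank"
new_layout = "mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0"   # placeholder vdevs

steps = [
    "zfs snapshot -r %s@%s" % (pool, snap),                     # freeze a consistent copy
    "zfs send -R %s@%s | zfs receive -F %s" % (pool, snap, target),
    "zpool destroy %s" % pool,                                  # DESTRUCTIVE
    "zpool create %s %s" % (pool, new_layout),                  # empty pool, nothing to resilver
    "zfs send -R %s@%s | zfs receive -F %s" % (target, snap, pool),
]
for cmd in steps:
    print(cmd)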
