> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Paul Kraus
>
> Is resilver time related to the amount of data (TBs) or the number of objects (file + directory counts)? I have seen zpools with lots of data in very few files resilver quickly while smaller pools with lots of tiny files take much longer (no hard data here, just recollection of how long things took).
In some cases resilver time is dependent on the total amount of data (TB) and limited by sequential drive throughput. In that case it will always be fast. In other cases it is dependent on a lot of small blocks scattered randomly about. In that case it will be limited by the random access time of the devices, and it's certain to be painfully slow. But in this conversation we're trying to make a generalization, so let's define "typical," discuss how each of the above cases is possible, and reach a generalization.

Note: there is another common usage scenario, the home video server or large static sequential file store, which has precisely the opposite usage characteristics. For me that's not typical, so since I'm the person writing, here is what I'm defining as "typical":

Typical: You have a nontrivial pool with volatile data. Autosnapshots are on, which means snapshots are frequently created & destroyed. Some files & directories are deleted, created, and/or modified or appended to, in essentially random order.

It is in the nature of COW (and therefore ZFS) to write new copies of only the changed blocks, while leaving the old blocks in place, so files become progressively more fragmented as long as they are modified in the middles and ends (rather than deleted & recreated entirely).

It is also in the nature of ZFS to aggregate small writes into larger sequential writes: a bunch of small random writes get written out as a single larger sequential write. Eventually some of those blocks are changed or deleted and the snapshots holding them are destroyed, leaving a "hole" in the middle of what was formerly an aggregated sequential write. So ZFS becomes progressively more fragmented here too.

All of the above is normal for any snapshot-capable filesystem. (Different implementations reach the same result.)

Here is the part which is both a ZFS strength and a weakness: upon scrub or resilver, ZFS only touches the used blocks; it does not do the unused space. If you have a really small percentage of pool utilization, or highly sequential data, this is a strength: because you get to skip over all the unused portions of the disk, it completes faster than resilvering or scrubbing the whole disk sequentially.

Unfortunately, in my "typical" usage scenario the system has been in volatile production for an extended time, so there is significant usage in the pool, and it is highly fragmented. Also unfortunately, in a ZFS resilver (and I think scrub too) the order of resilvering blocks is NOT based on disk order, which means you don't get to simply perform a bunch of sequential disk reads and skip over the unused sectors. Instead, the heads have to thrash around, seeking small blocks all over the place, in essentially random order.

So the answer to your question, assuming my "typical" usage and assuming hard drives (not SSDs etc.), is: resilver time is dependent on neither the total quantity of data nor the total number of files/directories. It is dependent on the number of used blocks in the vdev, on precisely how fragmented and how randomly those blocks are scattered throughout the vdev, and it is limited by the random access time of the vdev.

YMMV, but here is one of my experiences: in a given pool that I admin, if I needed to resilver a whole disk including unused space, the sequential IO of the disk would be the limiting factor, and the time would be approx 2 hours.
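To put rough numbers on that gap, here is a quick back-of-envelope sketch (Python). Every figure in it (disk size, throughput, IOPS, utilization, average block size) is an assumption picked purely for illustration, not a measurement from any real pool:

    # Illustrative only: every number below is an assumption, not a
    # measurement from the pool discussed in this thread.
    disk_size_tb   = 1.0     # assumed capacity of the replaced disk
    seq_mb_per_s   = 130.0   # assumed sustained sequential throughput
    random_iops    = 150.0   # assumed random reads/sec, 7200 rpm class
    used_fraction  = 0.5     # assumed pool utilization
    avg_block_kb   = 64.0    # assumed average size of a fragmented block

    # Case 1: rebuild the whole disk sequentially, used space or not.
    seq_hours = disk_size_tb * 1e6 / seq_mb_per_s / 3600

    # Case 2: resilver only the used blocks, but in essentially random
    # order, so each block costs roughly one random seek.
    used_blocks = disk_size_tb * 1e9 * used_fraction / avg_block_kb
    rand_hours  = used_blocks / random_iops / 3600

    print("whole-disk sequential rebuild:    ~%.1f hours" % seq_hours)
    print("random-order used-block resilver: ~%.1f hours" % rand_hours)

With those made-up numbers the sequential rebuild comes out around 2 hours and the seek-bound resilver around 14, which is the same shape as the gap I describe here.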
Instead, I am using ZFS, this system is in "typical" production usage, and I am using mirrors. Hence this is the best-case scenario for a "typical" ZFS server with volatile data. My resilver took 12 hours. If I had used raidz2 with 8-2=6 data disks, it would have taken about 3 days.

So the conclusion to draw is: yes, there are situations where ZFS resilver is a strength and is limited by sequential throughput. But for what I call "typical" usage patterns, it's a weakness, and it's dramatically worse than resilvering the whole disk sequentially.
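For what it's worth, that 3-day figure is just the 12-hour mirror resilver scaled by the data width of the raidz2 vdev; that scaling is my own shorthand for the 8-2=6 arithmetic, not a measurement:

    # Assumed reading of the comparison above: scale the seek-bound
    # mirror resilver time by the raidz2 vdev's data-disk count.
    mirror_hours      = 12       # observed mirror resilver time
    raidz2_data_disks = 8 - 2    # 8-disk raidz2, double parity
    raidz2_hours = mirror_hours * raidz2_data_disks
    print(raidz2_hours, "hours =", raidz2_hours / 24.0, "days")  # 72 hours = 3.0 days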