On Tue, 16 Feb 2010, Christo Kutrovsky wrote:
> Just finished reading the following excellent post:
> http://queue.acm.org/detail.cfm?id=1670144
A nice article, even if I don't agree with all of its surmises and
conclusions. :-)
In fact, I would reach a different conclusion.
> I considered something like simply do a 2way mirror. What are the
> chances for a very specific drive to fail in 2 way mirror? What if I
> do not want to take that chance?
Neither the probability of whole-drive failure nor that of individual
sector failure has increased over the years. In fact, the probability
of individual sector failure has diminished substantially. The
probability of losing a whole mirror pair has gone down since the
probability of individual drive failure has gone down.
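As a rough sketch of that last point: losing a two-way mirror requires the second drive to fail while the first is still being replaced and resilvered, so the pair-loss probability falls as the per-drive failure rate falls. The 3% and 5% annualized failure rates and the 24-hour exposure window below are illustrative assumptions, not measured figures, and the model assumes the two drives fail independently.

```python
import math

HOURS_PER_YEAR = 24 * 365

def p_fail_within(hours, afr):
    """Probability that one drive fails within `hours`, assuming a constant
    failure rate derived from an annualized failure rate (AFR)."""
    rate_per_hour = -math.log(1.0 - afr) / HOURS_PER_YEAR
    return 1.0 - math.exp(-rate_per_hour * hours)

def p_mirror_pair_loss_per_year(afr, exposure_hours):
    """Approximate yearly probability of losing both halves of a 2-way
    mirror: either drive fails during the year, then its partner fails
    before the replacement has finished resilvering."""
    p_first = 1.0 - (1.0 - afr) ** 2   # either of the two drives fails
    p_second = p_fail_within(exposure_hours, afr)
    return p_first * p_second

# Hypothetical inputs: compare a 5% AFR with a 3% AFR, 24 hours of exposure.
for afr in (0.05, 0.03):
    print(afr, p_mirror_pair_loss_per_year(afr, 24))
```

The exposure window matters as much as the failure rate here, which is why resilver time keeps coming up.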
> I could always put "copies=2" (or more) to my important datasets and
> take some risk and tolerate such a failure.
I don't believe that "copies=2" buys much at all when using mirror
disks (or raidz). It only helps when media failures hit the redundant
devices simultaneously, which is actually quite rare. The "copies=2"
setting only buys something when there is no other redundancy
available.
> One of the ideas this sparked is to have a "max devices" property for
> each data set, and limit how many mirrored devices a given data set
> can be spread on. I mean if you don't need the performance, you can
> limit (minimize) the devices, should your capacity allow this.
What you seem to be suggesting is a sort of targeted hierarchical vdev
without extra RAID.
> Remember. The goal is damage control. I know 2x raidz2 offers better
> protection for more capacity (although less performance, but that's
> not the point).
It seems that Adam Leventhal's excellent paper reaches the wrong
conclusions because it assumes that history is a predictor for the
future. However, history is a rather poor predictor in this case.
Imagine if 9" floppies had increased their density to support 20GB
each (up from 160KB), but that did not happen, and now we don't use
floppies at all. We already see many cases where history was no
longer a good predictor of the future, and (as an example) increased
integration has brought us multi-core CPUs rather than 20GHz CPUs.
My own conclusions (supported by Adam Leventhal's excellent paper) are
that
- maximum device size should be constrained based on its time to
resilver.
- devices are growing too large and it is about time to transition to
the next smaller physical size.
It is unreasonable to spend more than 24 hours to resilver a single
drive. It is unreasonable to spend more than 6 days resilvering all
of the devices in a RAID group (the 7th day is reserved for the system
administrator). It is unreasonable to spend very much time at all on
resilvering (using current rotating media) since the resilvering
process kills performance.
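A back-of-the-envelope sketch of the resilver-time constraint follows. ZFS resilvers only allocated data, so the input is the amount of data in use rather than the raw device size. The 50 MB/s sustained rate and the 2 TB figure are assumptions chosen for illustration; real resilver throughput depends on fragmentation and on competing pool load.

```python
TB = 10**12
MB = 10**6

def resilver_hours(used_bytes, sustained_bytes_per_sec):
    """Hours needed to rewrite `used_bytes` at a given sustained rate."""
    return used_bytes / sustained_bytes_per_sec / 3600.0

def max_device_bytes(limit_hours, sustained_bytes_per_sec):
    """Largest amount of data that can be resilvered within `limit_hours`."""
    return limit_hours * 3600.0 * sustained_bytes_per_sec

# Hypothetical: 2 TB in use, 50 MB/s sustained while the pool stays in service.
print(resilver_hours(2 * TB, 50 * MB))       # about 11.1 hours
# Data that fits under the 24-hour ceiling at the same assumed rate.
print(max_device_bytes(24, 50 * MB) / TB)    # about 4.32 TB
```

If the sustained rate drops to a fraction of this under production load, the 24-hour ceiling is reached at a correspondingly smaller device size.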
When looking at the possibility of data failure it is wise to consider
physical issues such as
- shared power supply
- shared chassis
- shared physical location
- shared OS kernel or firmware instance
all of which are very bad for data reliability since a problem with
anything shared can lead to destruction of all copies of the data.
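To put a rough number on this, here is a minimal sketch in which a shared component sets a floor on the loss probability that extra copies cannot lower. The probabilities are made-up illustrative values, and the model assumes the copies otherwise fail independently.

```python
def p_lose_all_copies(p_copy, n_copies, p_shared):
    """Probability that every copy is lost: either the shared component
    (power supply, chassis, site, kernel) destroys all of them at once,
    or each copy fails on its own."""
    p_independent = p_copy ** n_copies
    return p_shared + (1.0 - p_shared) * p_independent

# Hypothetical: each copy has a 3% chance of loss and the shared part 0.1%.
for n in (1, 2, 3):
    print(n, p_lose_all_copies(0.03, n, 0.001))
```

Going from two copies to three barely moves the result, because the shared component already dominates.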
In New York City, all of the apartment doors seem to be fitted with
three deadlocks, all of which lock into the same flimsy splintered
door frame. It is important to consider each significant system
weakness in turn in order to achieve the least chance of loss, while
providing the best service.
Bob
P.S. NASA is tracking large asteroids and meteors with the hope that
they will eventually be able to deflect any which will strike our
planet, in an effort to save your precious data.
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/