On Feb 7, 2011, at 1:07 PM, Peter Jeremy wrote:

> On 2011-Feb-07 14:22:51 +0800, Matthew Angelo <bang...@gmail.com> wrote:
>> I'm actually leaning more towards running a simple 7+1 RAIDZ1.
>> Running this with 1TB is not a problem but I just wanted to
>> investigate at what TB size the "scales would tip".
> 
> It's not that simple.  Whilst resilver time is proportional to device
> size, it's far more impacted by the degree of fragmentation of the
> pool.  And there's no 'tipping point' - it's a gradual slope so it's
> really up to you to decide where you want to sit on the probability
> curve.

The "tipping point" won't occur for similar configurations. The tip
occurs for different configurations. In particular, if the size of the 
N+M parity scheme is very large and the resilver times become
very, very large (weeks) then a (M-1)-way mirror scheme can provide
better performance and dependability. But I consider these to be
extreme cases.
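
To make that concrete, here is a minimal back-of-the-envelope sketch using
the standard independent-failure MTTDL approximation,
MTTDL ~ MTBF^(m+1) / (n*(n-1)*...*(n-m) * MTTR^m), for a group of n disks
tolerating m failures. Every number in it (MTBF, raidz resilver times, the
8-hour mirror resilver) is an assumption for illustration, not vendor data:

from math import prod

def mttdl_group(n, m, mtbf_h, mttr_h):
    """Approximate MTTDL (hours) of n disks tolerating m failures."""
    paths = prod(n - i for i in range(m + 1))   # ordered failure sequences
    return mtbf_h ** (m + 1) / (paths * mttr_h ** m)

MTBF = 1_000_000                 # hours per disk -- assumed, not a spec
N, M = 7, 2                      # 7 data + 2 parity

for rz_resilver in (24, 72, 24 * 21):           # 1 day, 3 days, 3 weeks
    raidz2  = mttdl_group(N + M, M, MTBF, rz_resilver)
    # Same usable capacity from N separate mirror vdevs; the pool is lost
    # if any one vdev is lost, so divide the per-vdev MTTDL by N.
    mirror3 = mttdl_group(3, 2, MTBF, 8) / N    # (M+1)-way mirrors
    mirror2 = mttdl_group(2, 1, MTBF, 8) / N    # plain 2-way mirrors
    print(f"raidz2 resilver {rz_resilver:4d} h: raidz2 {raidz2:.2e} h, "
          f"3-way mirrors {mirror3:.2e} h, 2-way mirrors {mirror2:.2e} h")

The absolute numbers mean nothing; the shape does. The raidz2 MTTDL falls
off as the square of its resilver time, the mirror pools' do not, and with
these made-up figures even plain 2-way mirrors overtake the raidz2 once its
resilver stretches toward weeks -- which is the extreme case above.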

>>  I understand
>> RAIDZ2 protects against failures during a rebuild process.
> 
> This would be its current primary purpose.
> 
>> Currently,
>> my RAIDZ1 takes 24 hours to rebuild a failed disk, so with 2TB disks
>> and, worst case, assuming this takes 2 days, this is my 'exposure' time.
> 
> Unless this is a write-once pool, you can probably also assume that
> your pool will get more fragmented over time, so by the time your
> pool gets to twice its current capacity, it might well take 3 days
> to rebuild due to the additional fragmentation.
> 
> One point I haven't seen mentioned elsewhere in this thread is that
> all the calculations so far have assumed that drive failures were
> independent.  In practice, this probably isn't true.  All HDD
> manufacturers have their "off" days - where whole batches or models of
> disks are cr*p and fail unexpectedly early.  The WD EARS is simply a
> demonstration that it's WD's turn to turn out junk.  Your best
> protection against this is to have disks from enough different batches
> that a batch failure won't take out your pool.

The problem with treating the failures as correlated (not independent) is that
you cannot get the correlated failure-rate data from the vendors. You could
guess, or use your own field data, but it would not always help you make a
better design decision.
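
To put a rough number on that, here is a quick sketch (exponential failure
model; the 10x "bad batch" hazard multiplier is pure guesswork, precisely
because vendors don't publish correlated failure rates):

from math import exp

def p_any_failure(n_disks, mtbf_h, window_h, hazard_mult=1.0):
    """P(at least one of n_disks fails within the resilver window)."""
    p_one = 1.0 - exp(-hazard_mult * window_h / mtbf_h)
    return 1.0 - (1.0 - p_one) ** n_disks

MTBF = 1_000_000        # hours, assumed
survivors = 7           # disks left in a 7+1 raidz1 after one failure
window = 72             # hours of resilver "exposure", per the thread

print("independent drives :", p_any_failure(survivors, MTBF, window))
print("same bad batch     :", p_any_failure(survivors, MTBF, window, 10.0))

Whatever multiplier you guess, the second-failure probability scales with it
almost linearly, which is why the independence assumption flatters raidz1.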

> 
> PSU, fan and SATA controller failures are likely to take out multiple
> disks but it's far harder to include enough redundancy to handle this
> and your best approach is probably to have good backups.

The top four items that fail most often, in no particular order, are: fans,
power supplies, memory, and disks. This is why you will see enterprise-class
servers use redundant fans, multiple high-quality power supplies,
ECC memory, and some sort of RAID.
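
As a rough illustration of why the redundancy goes where it does, here is a
toy steady-state availability calculation (A = MTBF/(MTBF+MTTR), series vs.
parallel). The component MTBF/MTTR figures are invented, and ECC is not a
redundant DIMM, so memory is left in series here:

def avail(mtbf_h, mttr_h):
    return mtbf_h / (mtbf_h + mttr_h)

def parallel(*parts):
    """Redundant group: unavailable only if every part is unavailable."""
    u = 1.0
    for a in parts:
        u *= (1.0 - a)
    return 1.0 - u

fan  = avail(100_000, 24)
psu  = avail(150_000, 24)
dimm = avail(400_000, 24)
disk = avail(500_000, 72)       # 72 h to resilver/replace, assumed

single    = fan * psu * dimm * disk
redundant = parallel(fan, fan) * parallel(psu, psu) * dimm * parallel(disk, disk)
print(f"everything single     : {single:.6f}")
print(f"redundant fan/psu/disk: {redundant:.6f}")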

> 
>> I will be running hot (or maybe cold) spare.  So I don't need to
>> factor in "Time it takes for a manufacture to replace the drive".
> 
> In which case, the question is more whether 8-way RAIDZ1 with a
> hot spare (7+1+1) is better than 9-way RAIDZ2 (7+2).  

In this case, raidz2 is much better for dependability because the "spare"
is already "resilvered."  It also performs better, though the dependability
gains tend to be bigger than the performance gains.
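
A minimal sketch of "already resilvered" in MTTDL terms, using the same
independent-failure approximation as earlier (the MTBF, resilver time, and
spare-activation delay are all assumptions):

from math import prod

def mttdl_group(n, m, mtbf_h, mttr_h):
    """Approximate MTTDL (hours) of n disks tolerating m failures."""
    return mtbf_h ** (m + 1) / (prod(n - i for i in range(m + 1)) * mttr_h ** m)

MTBF = 1_000_000        # hours per disk, assumed
resilver = 72           # hours to resilver one disk, assumed
spare_delay = 12        # hours to notice the failure and attach the spare, assumed

raidz1_spare = mttdl_group(8, 1, MTBF, spare_delay + resilver)   # 7+1 plus spare
raidz2       = mttdl_group(9, 2, MTBF, resilver)                 # 7+2
print(f"raidz1 + hot spare: {raidz1_spare:.2e} h")
print(f"raidz2            : {raidz2:.2e} h")

With these assumptions the raidz2 comes out roughly three orders of magnitude
ahead, because its second level of redundancy is in place before anything
fails rather than after a spare resilvers.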

> In the latter
> case, your "hot spare" is already part of the pool so you don't
> lose the time-to-notice plus time-to-resilver before regaining
> redundancy.  The downside is that actively using the "hot spare"
> may increase the probability of it failing.

No. The disk failure rate data does not conclusively show that activity
causes premature failure. Other failure modes dominate.
 -- richard

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
