Terminology warning below…

On Apr 18, 2013, at 3:46 AM, Sebastian Gabler <sequoiamo...@gmx.net> wrote:

> Am 18.04.2013 03:09, schrieb openindiana-discuss-requ...@openindiana.org:
>> Message: 1
>> Date: Wed, 17 Apr 2013 13:21:08 -0600
>> From: Jan Owoc<jso...@gmail.com>
>> To: Discussion list for OpenIndiana
>>      <openindiana-discuss@openindiana.org>
>> Subject: Re: [OpenIndiana-discuss] Recommendations for fast storage
>> Message-ID:
>>      <cadcwueyc14mt5agkez7pda64h014t07ggtojkpq5js4s279...@mail.gmail.com>
>> Content-Type: text/plain; charset=UTF-8
>> 
>> On Wed, Apr 17, 2013 at 12:57 PM, Timothy Coalson<tsc...@mst.edu>  wrote:
>>> >On Wed, Apr 17, 2013 at 7:38 AM, Edward Ned Harvey (openindiana) <
>>> >openindi...@nedharvey.com> wrote:
>>> >
>>>> >>You also said the raidz2 will offer more protection against failure,
>>>> >>because you can survive any two disk failures (but no more.)  I would
>>>> >>argue this is incorrect (I've done the probability analysis before).
>>>> >>Mostly because the resilver time in the mirror configuration is 8x to
>>>> >>16x faster (there's 1/8 as much data to resilver, and IOPS is limited
>>>> >>by a single disk, not the "worst" of several disks, which introduces
>>>> >>another factor up to 2x, increasing the 8x as high as 16x), so the
>>>> >>smaller resilver window means lower probability of "concurrent"
>>>> >>failures on the critical vdev.  We're talking about 12 hours versus
>>>> >>1 week, actual result of my machines in production.
>>>> >>
>>> >
>>> >Did you also compare the probability of bit errors causing data loss
>>> >without a complete pool failure?  2-way mirrors, when one device completely
>>> >dies, have no redundancy on that data, and the copy that remains must be
>>> >perfect or some data will be lost.  On the other hand, raid-z2 will still
>>> >have available redundancy, allowing every single block to have a bad read
>>> >on any single component disk, without losing data.  I haven't done the math
>>> >on this, but I seem to recall some papers claiming that this is the more
>>> >likely route to lost data on modern disks, by comparing bit error rate and
>>> >capacity.  Of course, a second outright failure puts raid-z2 in a much
>>> >worse boat than 2-way mirrors, which is a reason for raid-z3, but this may
>>> >already be a less likely case.
>> Richard Elling wrote a blog post about "mean time to data loss" [1]. A
>> few years later he graphed out a few cases for typical values of
>> resilver times [2].
>> 
>> [1] https://blogs.oracle.com/relling/entry/a_story_of_two_mttdl
>> [2] http://blog.richardelling.com/2010/02/zfs-data-protection-comparison.html
>> 
>> Cheers,
>> Jan
> 
> Notably, the models Richard posted do not include BER. Nevertheless, it's
> an important factor.

BER is the term most often used in networks, where the corruption is
transient. For permanent data faults, the equivalent is the unrecoverable
read error rate (UER), also expressed as a failure rate per bit. My models
clearly consider this. Unfortunately, terminology consistency between
vendors has been slow in coming, but Seagate and WD seem to be converging
on "Non-recoverable read errors per bits read" while Toshiba seems to be
ignoring the problem, or at least can't seem to list it on their
datasheets :-(
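
To put the datasheet numbers in perspective, here's a quick back-of-the-envelope
sketch in Python (illustrative only, not my full model): it converts a
"1 error per N bits read" rating into the probability of hitting at least one
unrecoverable read during a full read of a drive, assuming independent per-bit
errors. The 4 TB drive size is just an example.

    import math

    def p_ure(bytes_read, uer_bits):
        """Probability of at least one unrecoverable read error (URE) while
        reading bytes_read bytes from a drive rated at 1 error per uer_bits
        bits read.  Assumes independent, uniformly likely per-bit errors."""
        bits = bytes_read * 8
        return -math.expm1(bits * math.log1p(-1.0 / uer_bits))

    TB = 1e12  # decimal terabyte, as used on the datasheets
    for uer in (1e14, 1e15, 1e16):
        print("full read of 4 TB, 1 error per %.0e bits: P(URE) = %.2f%%"
              % (uer, 100 * p_ure(4 * TB, uer)))

For a full 4 TB read that works out to roughly 27% at 10^14, about 3% at 10^15,
and about 0.3% at 10^16, which is why the consumer vs. enterprise UER rating
matters so much for large, nearly-full vdevs.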

> Off the top of my head, it will impact reliability in different ways in ZFS:
> 
> - Bit error in metadata (zfs should save us by metadata redundancy)
> - Bit error in full stripe data
> - Bit error in parity data

These aren't interesting from a system design perspective. To enhance the
model to deal with this, we just need to determine what percentage of the
overall space contains copied data. There is no general answer, but for most
systems it will be a small percentage of the total, as compared to the data
itself. In this respect, the models are worst-case, which is what we want to
use for design evaluations.
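
To fold that into the earlier sketch, you would simply scale the bytes actually
at risk by the fraction of space that has no extra copy. The 2% copies fraction
below is purely hypothetical (measure your own pool for a real number); the
point is that a small copied fraction barely moves the result, which is why the
worst case is good enough for design work.

    # Continuing the sketch above (p_ure and TB as defined there).  The 2%
    # copies_fraction is a made-up example, not a measured value.
    copies_fraction = 0.02
    print("worst case : %.2f%%" % (100 * p_ure(4 * TB, 1e15)))
    print("with copies: %.2f%%" % (100 * p_ure(4 * TB * (1 - copies_fraction), 1e15)))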

NB, traditional RAID systems don't know what is data and what is not, so they
can run into uncorrectable errors in blocks that don't actually contain data.
This becomes more important for those systems which use a destructive scrub,
as opposed to ZFS's read-only scrub. Hence, some studies have shown that
scrubbing can propagate errors in non-ZFS RAID arrays.

> 
> AFAIK, a bit error in parity or stripe data can be especially dangerous
> when it is raised during resilvering and there is only one layer of
> redundancy left. OTOH, BER issues scale with vdev size, not with rebuild
> time. So, I think that Tim actually made a valid point about a systematic
> weak point of 2-way mirrors or raidz1 in vdevs that are large in comparison
> to the BER rating of their member drives. Consumer drives have a BER of
> 1:10^14..10^15; enterprise drives start at 1:10^16.
> I do not think that zfs will have better resilience against rot of parity
> data than conventional RAID. At best, block-level checksums can help raise
> an error, so at least you know that something went wrong. But recovery of
> the data will probably not be possible. So, in my opinion BER is an issue
> under ZFS as anywhere else.

Yep, which is why my MTTDL model 2 (MTTDL[2]) explicitly considers this
case ;-)
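
For anyone who wants to poke at the numbers rather than the algebra, here is a
rough sketch of the kind of comparison MTTDL[2] makes. It is not the exact
model from the posts Jan linked (read those for the real formulas); the drive
size, MTBF, vdev width, and the 12-hour vs. 1-week resilver windows are
illustrative, echoing the figures quoted upthread.

    import math

    def p_ure(bytes_read, uer_bits):
        """P(at least one unrecoverable read error) over bytes_read bytes,
        for a drive rated at 1 error per uer_bits bits read."""
        return -math.expm1(bytes_read * 8 * math.log1p(-1.0 / uer_bits))

    def p_disk_fails(mttr_hours, mtbf_hours):
        """P(a given surviving disk fails during the resilver window)."""
        return -math.expm1(-mttr_hours / mtbf_hours)

    size = 4e12   # bytes per drive (illustrative)
    uer  = 1e15   # 1 error per 10^15 bits read
    mtbf = 1e6    # hours (illustrative)

    # Degraded 2-way mirror: data is lost if the single surviving disk dies
    # during the (short) resilver, or returns even one URE while it is read
    # end to end.
    mirror_mttr = 12  # hours
    p_mirror = 1 - (1 - p_disk_fails(mirror_mttr, mtbf)) * (1 - p_ure(size, uer))

    # Degraded raidz2 (one disk already gone): a single URE is still
    # repairable, so loss needs a second whole-disk failure among the
    # survivors *and* at least one URE during the remaining, longer resilver.
    # (The three-whole-disk-failure path is smaller still and ignored here.)
    survivors  = 7    # e.g. an 8-disk raidz2 with one disk out
    raidz_mttr = 168  # hours
    p_second   = 1 - (1 - p_disk_fails(raidz_mttr, mtbf)) ** survivors
    p_raidz2   = p_second * p_ure(survivors * size, uer)

    print("P(data loss during one rebuild)  2-way mirror: %.1e   raidz2: %.1e"
          % (p_mirror, p_raidz2))

With these inputs the mirror case is dominated by the single-copy URE term,
while raidz2 needs two independent bad events inside the window, which is the
effect the MTTDL[2] comparison captures.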

> 
> Best,
> 
> Sebastian
> PS: It occurred to me that WD doesn't publish BER data for some of their
> drives (at least all of those I searched for while writing this). Does
> anybody happen to have full specs for WD drives?

The trend seems to be that BER data is not shown for laptop drives, which are
a large part of the HDD market. Presumably, this is because the load/unload
failure mode dominates in this use case, as the drives are not continuously
spinning. It is a good idea to use components in the environment for which
they are designed, so I'm pretty sure you'd never consider using a laptop
drive for a storage array.
 -- richard

-- 

ZFS storage and performance consulting at http://www.RichardElling.com