Re: [OpenIndiana-discuss] vdev reliability was: Recommendations for fast storage

2013-04-22 Thread Sebastian Gabler

On 21.04.2013 06:35, openindiana-discuss-requ...@openindiana.org wrote:

--

Message: 3
Date: Sat, 20 Apr 2013 21:13:09 -0700
From: Richard Elling <richard.ell...@richardelling.com>
To: Discussion list for OpenIndiana
openindiana-discuss@openindiana.org
Subject: Re: [OpenIndiana-discuss] vdev reliability was:
Recommendations for fast storage
Message-ID: <0b43e9ea-10fd-41af-81ef-31644ff49...@richardelling.com>
Content-Type: text/plain; charset=windows-1252

Terminology warning below…

On Apr 18, 2013, at 3:46 AM, Sebastian Gabler <sequoiamo...@gmx.net> wrote:


On 18.04.2013 03:09, openindiana-discuss-requ...@openindiana.org wrote:

Message: 1
Date: Wed, 17 Apr 2013 13:21:08 -0600
From: Jan Owoc <jso...@gmail.com>
To: Discussion list for OpenIndiana
openindiana-discuss@openindiana.org
Subject: Re: [OpenIndiana-discuss] Recommendations for fast storage
Message-ID:
cadcwueyc14mt5agkez7pda64h014t07ggtojkpq5js4s279...@mail.gmail.com
Content-Type: text/plain; charset=UTF-8

On Wed, Apr 17, 2013 at 12:57 PM, Timothy Coalson <tsc...@mst.edu> wrote:

 On Wed, Apr 17, 2013 at 7:38 AM, Edward Ned Harvey (openindiana) <openindi...@nedharvey.com> wrote:
 

 You also said the raidz2 will offer more protection against failure, because you can survive any two disk failures (but no more.)  I would argue this is incorrect (I've done the probability analysis before).  Mostly because the resilver time in the mirror configuration is 8x to 16x faster (there's 1/8 as much data to resilver, and IOPS is limited by a single disk, not the worst of several disks, which introduces another factor up to 2x, increasing the 8x as high as 16x), so the smaller resilver window means lower probability of concurrent failures on the critical vdev.  We're talking about 12 hours versus 1 week, actual result of my machines in production.
 

 
 Did you also compare the probability of bit errors causing data loss
 without a complete pool failure?  2-way mirrors, when one device completely
 dies, have no redundancy on that data, and the copy that remains must be
 perfect or some data will be lost.  On the other hand, raid-z2 will still
 have available redundancy, allowing every single block to have a bad read
 on any single component disk, without losing data.  I haven't done the math
 on this, but I seem to recall some papers claiming that this is the more
 likely route to lost data on modern disks, by comparing bit error rate and
 capacity.  Of course, a second outright failure puts raid-z2 in a much
 worse boat than 2-way mirrors, which is a reason for raid-z3, but this may
 already be a less likely case.

Richard Elling wrote a blog post about mean time to data loss [1]. A
few years later he graphed out a few cases for typical values of
resilver times [2].

[1] https://blogs.oracle.com/relling/entry/a_story_of_two_mttdl
[2] http://blog.richardelling.com/2010/02/zfs-data-protection-comparison.html

Cheers,
Jan


Notably, the models Richard posted do not include BER. Nevertheless, it is an 
important factor.



[..] /snip


 Off the top of my head, it will impact reliability in different ways in ZFS:

- Bit error in metadata (zfs should save us by metadata redundancy)
- Bit error in full stripe data
- Bit error in parity data

These aren't interesting from a system design perspective. To enhance the model to deal with this, we just need to determine what percentage of the overall space contains copied data. There is no general answer, but for most systems it will be a small percentage of the total, as compared to data. In this respect, the models are worst-case, which is what we want to use for design evaluations.
As others have already pointed out, disk-based read errors come into focus when 
you read from an array/vdev that has no more redundancy. Indeed, there is no 
difference between parity and stripe data. There is, however, a difference for 
metadata when it is stored redundantly, even in a non-redundant vdev layout. 
That is the case in ZFS.


NB, traditional RAID systems don't know what is data and what is not data, so they could run into uncorrectable errors in sectors that don't actually contain data. This becomes more important for those systems which use a destructive scrub, as opposed to ZFS's read-only scrub. Hence, some studies have shown that scrubbing can propagate errors in non-ZFS RAID arrays.
AFAIK, traditional RAID constructs may have two additional issues compared to ZFS.
1. As you mention, usually the whole stripe set needs to be rebuilt, whereas a 
resilver only rebuilds active data.
2. The error may or may not be detected by the controller. In ZFS, even if a 
read error goes silently down the whole food chain, there are still block-based 
checksums separating the wheat from the chaff (see the sketch below).
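
To make point 2 concrete, here is a toy sketch in the spirit of ZFS's fletcher-4 block checksum (an editorial illustration, not ZFS's actual code; endianness, seeds and the real zio pipeline are ignored). The checksum is stored with the block pointer at write time and recomputed on read, so a silently flipped bit is caught no matter which layer corrupted it:

# Toy block-checksum verification, loosely modeled on fletcher-4
# (four 64-bit accumulators over 32-bit words). Illustrative only.
import struct

MASK64 = (1 << 64) - 1

def fletcher4(block: bytes):
    a = b = c = d = 0
    block += b"\x00" * (-len(block) % 4)        # pad to whole 32-bit words
    for (w,) in struct.iter_unpack("<I", block):
        a = (a + w) & MASK64
        b = (b + a) & MASK64
        c = (c + b) & MASK64
        d = (d + c) & MASK64
    return (a, b, c, d)

data = bytes(range(256)) * 16                   # a 4 KiB pretend block
stored = fletcher4(data)                        # "written" alongside the block pointer

corrupted = bytearray(data)
corrupted[1234] ^= 0x10                         # one silently flipped bit

print(fletcher4(data) == stored)                # True  -> clean read
print(fletcher4(bytes(corrupted)) == stored)    # False -> checksum error raised

With a redundant copy or parity available, ZFS would then repair the block; without one, you at least get an error instead of silently wrong data, which is the wheat-from-chaff point above.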




AFAIK, a bit error in parity or stripe data can be especially dangerous when it 
occurs during resilvering and there is only one layer of redundancy left
Re: [OpenIndiana-discuss] vdev reliability was: Recommendations for fast storage

2013-04-21 Thread Jim Klimov

On 2013-04-21 06:13, Richard Elling wrote:

Terminology warning below…




BER is the term most often used in networks, where the corruption is transient. For permanent data faults, the equivalent is unrecoverable read error rate (UER), also expressed as a failure rate per bit. ...


Well, with computers being networks of smaller components, besides the UER that
is confined to the storage device and repeatably returns the error (or rather a
response different from the stored and expected value), there is a place for
the BER concept as you describe it: there are cables and soldered signal lines
which can pick up noise, there are protocols and firmwares which might mishandle
some corner cases, etc., producing intermittent errors which are not there the
second time you look.

Even UERs might not be persistent, if the HDD decides to relocate a
detected-failing sector into spare areas, and returns some consistent
replies to queries afterwards (I did have cases with old HDDs that
did creak and rattle for a while and returned some bytes when querying
bad sectors, and replies were different every time or IO errors were
returned at the protocol layer instead of random garbage as data).


The trend seems to be that BER data is not shown for laptop drives, which is a large part of the HDD market. Presumably, this is because the load/unload failure mode dominates in this use case as the drives are not continuously spinning. It is a good idea to use components in the environment for which they are designed, so I'm pretty sure you'd never consider using a laptop drive for a storage array.


This brings up an interesting question for home-NAS users: it does not
seem unreasonable to use a laptop drive or two as an rpool in an array
like the popular ZFS workhorse HP N40L. I agree that it seems improper
to build an array for *intensive* IO with a horde of such disks, but
do you have statistics to really discourage these two cases (rpool and
intensive IO)? What about home-NASes which just occasionally see some
IO, maybe in intensive bursts, but idle for hours otherwise?

Indeed, many portable-disk boxes contain a laptop drive. Arguably, they
might also be more reliable mechanically, because they are intended for use in
shaky environments.

Thanks,
//Jim




Re: [OpenIndiana-discuss] vdev reliability was: Recommendations for fast storage

2013-04-21 Thread Richard Elling
On Apr 21, 2013, at 3:47 AM, Jim Klimov jimkli...@cos.ru wrote:

 On 2013-04-21 06:13, Richard Elling wrote:
 Terminology warning below…
 
 
 BER is the term most often used in networks, where the corruption is 
 transient. For permanent
 data faults, the equivalent is unrecoverable read error rate (UER), also 
 expressed as a failure rate
 per bit. ...
 
 Well, with computers being networks of smaller components, beside the
 UER contained only in the storage device as repeatably returning the
 error (or rather a response different from stored and expected value),
 there is a place for BER concept as you say it is - there are cables
 and soldered signal lines which can catch noise, there are protocols
 and firmwares which might mistreat some corner cases, etc. - providing
 intermittent errors which are not there the second time you look.

The problem is finding a spec that you can design to. We have seen many
bad cables cause all sorts of latency (due to retries on bad transfers). This
information is not measured or spec'ed by disk vendors.

 Even UERs might not be persistent, if the HDD decides to relocate a
 detected-failing sector into spare areas, and returns some consistent
 replies to queries afterwards (I did have cases with old HDDs that
 did creak and rattle for a while and returned some bytes when querying
 bad sectors, and replies were different every time or IO errors were
 returned at the protocol layer instead of random garbage as data).
 
 The trend seems to be that BER data is not shown for laptop drives, which is 
 a large part of
 the HDD market. Presumably, this is because the load/unload failure mode 
 dominates in
 this use case as the drives are not continuously spinning. It is a good idea 
 to use components
 in the environment for which they are designed, so I'm pretty sure you'd 
 never consider using
 a laptop drive for a storage array.
 
 This brings up an interesting question for home-NAS users: it does not
 seem unreasonable to use a laptop drive or two as an rpool in an array
 like the popular ZFS workhorse HP N40L. I agree that it seems improper
 to build an array for *intensive* IO with an horde of such disks, but
 do you have statistics to really discourage these two cases (rpool and
 intensive IO)? What about home-NASes which just occasionally see some
 IO, maybe in intensive bursts, but idle for hours otherwise?
 
 Indeed, many portable-disk boxes contain a laptop drive. Arguably, they
 might also be more reliable mechanically, because intended for use in
 shaky environments.



You get what you pay for.
 -- richard

--

richard.ell...@richardelling.com
+1-760-896-4422





Re: [OpenIndiana-discuss] vdev reliability was: Recommendations for fast storage

2013-04-20 Thread Richard Elling
Terminology warning below…

On Apr 18, 2013, at 3:46 AM, Sebastian Gabler sequoiamo...@gmx.net wrote:

 On 18.04.2013 03:09, openindiana-discuss-requ...@openindiana.org wrote:
 Message: 1
 Date: Wed, 17 Apr 2013 13:21:08 -0600
 From: Jan Owoc <jso...@gmail.com>
 To: Discussion list for OpenIndiana
  openindiana-discuss@openindiana.org
 Subject: Re: [OpenIndiana-discuss] Recommendations for fast storage
 Message-ID:
  cadcwueyc14mt5agkez7pda64h014t07ggtojkpq5js4s279...@mail.gmail.com
 Content-Type: text/plain; charset=UTF-8
 
 On Wed, Apr 17, 2013 at 12:57 PM, Timothy Coalson <tsc...@mst.edu> wrote:
 On Wed, Apr 17, 2013 at 7:38 AM, Edward Ned Harvey (openindiana) <openindi...@nedharvey.com> wrote:
 
 You also said the raidz2 will offer more protection against failure, because you can survive any two disk failures (but no more.)  I would argue this is incorrect (I've done the probability analysis before).  Mostly because the resilver time in the mirror configuration is 8x to 16x faster (there's 1/8 as much data to resilver, and IOPS is limited by a single disk, not the worst of several disks, which introduces another factor up to 2x, increasing the 8x as high as 16x), so the smaller resilver window means lower probability of concurrent failures on the critical vdev.  We're talking about 12 hours versus 1 week, actual result of my machines in production.
 
 
 Did you also compare the probability of bit errors causing data loss
 without a complete pool failure?  2-way mirrors, when one device completely
 dies, have no redundancy on that data, and the copy that remains must be
 perfect or some data will be lost.  On the other hand, raid-z2 will still
 have available redundancy, allowing every single block to have a bad read
 on any single component disk, without losing data.  I haven't done the math
 on this, but I seem to recall some papers claiming that this is the more
 likely route to lost data on modern disks, by comparing bit error rate and
 capacity.  Of course, a second outright failure puts raid-z2 in a much
 worse boat than 2-way mirrors, which is a reason for raid-z3, but this may
 already be a less likely case.
 Richard Elling wrote a blog post about mean time to data loss [1]. A
 few years later he graphed out a few cases for typical values of
 resilver times [2].
 
 [1] https://blogs.oracle.com/relling/entry/a_story_of_two_mttdl
 [2] http://blog.richardelling.com/2010/02/zfs-data-protection-comparison.html
 
 Cheers,
 Jan
 
 Notably, the models Richard posted do not include BER. Nevertheless, it is an 
 important factor.

BER is the term most often used in networks, where the corruption is transient. For permanent data faults, the equivalent is unrecoverable read error rate (UER), also expressed as a failure rate per bit. My models clearly consider this. Unfortunately, terminology consistency between vendors has been slow in coming, but Seagate and WD seem to be converging on "non-recoverable read errors per bits read" while Toshiba seems to be ignoring the problem, or at least can't seem to list it on their datasheets :-(

 Off the top of my head, it will impact reliability in different ways in ZFS:
 
 - Bit error in metadata (zfs should save us by metadata redundancy)
 - Bit error in full stripe data
 - Bit error in parity data

These aren't interesting from a system design perspective. To enhance the model to deal with this, we just need to determine what percentage of the overall space contains copied data. There is no general answer, but for most systems it will be a small percentage of the total, as compared to data. In this respect, the models are worst-case, which is what we want to use for design evaluations.

NB, traditional RAID systems don't know what is data and what is not data, so they could run into uncorrectable errors in sectors that don't actually contain data. This becomes more important for those systems which use a destructive scrub, as opposed to ZFS's read-only scrub. Hence, some studies have shown that scrubbing can propagate errors in non-ZFS RAID arrays.

 
 AFAIK, a bit error in parity or stripe data can be especially dangerous when it occurs during resilvering and there is only one layer of redundancy left. OTOH, BER issues scale with vdev size, not with rebuild time. So I think that Tim actually made a valid point about a systematic weak point of 2-way mirrors or raidz1 in vdevs that are large in comparison to the BER rating of their member drives. Consumer drives have a BER of 1:10^14..10^15; enterprise drives start at 1:10^16.
 I do not think that ZFS will have better resilience against rot of parity data than conventional RAID. At best, block-level checksums can help raise an error, so you at least know that something went wrong. But recovery of the data will probably not be possible. So, in my opinion, BER is an issue under ZFS as anywhere else.

Yep, which is why my MTTDL model 2 explicitly (MTTDL[2]) 
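
To put rough numbers on the vdev-size point above, here is a back-of-envelope sketch (an editorial illustration using the UER figures quoted in this thread, not Richard's MTTDL[2] model; the 2 TB drive size and the vdev widths are assumptions):

# Probability of at least one unrecoverable read error (URE) while reading
# D bytes at an error rate of p per bit, assuming independent bit errors:
# P = 1 - (1 - p)^(8*D). Real failures are messier; this only shows the shape.
import math

def p_ure(bytes_read, uer_per_bit):
    bits = 8 * bytes_read
    return -math.expm1(bits * math.log1p(-uer_per_bit))   # numerically safe form

TB = 1e12
for label, uer in [("consumer 1e-14", 1e-14),
                   ("consumer 1e-15", 1e-15),
                   ("enterprise 1e-16", 1e-16)]:
    mirror = p_ure(2 * TB, uer)        # resilver reads the one surviving 2 TB disk
    raidz1 = p_ure(7 * 2 * TB, uer)    # 8-disk raidz1 resilver reads 7 surviving disks
    print(f"{label}: 2-way mirror resilver ~{mirror:.1%}, "
          f"8-disk raidz1 resilver ~{raidz1:.1%}")

The exposure grows with how much surviving data must be read back perfectly, which is why a second (raidz2) or third (raidz3) parity level changes the picture: a single URE hit during the rebuild is then still correctable.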

Re: [OpenIndiana-discuss] vdev reliability was: Recommendations for fast storage

2013-04-19 Thread Sebastian Gabler

On 19.04.2013 11:22, openindiana-discuss-requ...@openindiana.org wrote:

Message: 1
Date: Thu, 18 Apr 2013 16:03:32 -0500
From: Timothy Coalson <tsc...@mst.edu>
To: Discussion list for OpenIndiana
openindiana-discuss@openindiana.org
Subject: Re: [OpenIndiana-discuss] vdev reliability was:
Recommendations for fast storage
Message-ID:
CAK_=tazsbkcbbhie8czhqo2q+840s-sowajpykoppxc0gej...@mail.gmail.com
Content-Type: text/plain; charset=ISO-8859-1

On Thu, Apr 18, 2013 at 10:24 AM, Sebastian Gabler <sequoiamo...@gmx.net> wrote:



One of Elling's posts cited by another poster did
take that into account, but the graphs don't load due to url changes (and
they seem to have the wrong MIME type with the fixed URL, I ended up using
wget):

https://blogs.oracle.com/relling/entry/a_story_of_two_mttdl


Thanks for pointing at that. I stand corrected regarding my previous statement 
about Richard's MTTDL model excluding BER/UER. I ask Richard Elling to accept 
my apology.
Timothy, how do you manage to get these PNGs? I get as far as extracting some 
links to them, but loading those links consistently results in resets. Help 
appreciated.


BR

Sebastian




Re: [OpenIndiana-discuss] vdev reliability was: Recommendations for fast storage

2013-04-19 Thread Timothy Coalson
On Fri, Apr 19, 2013 at 8:14 AM, Sebastian Gabler <sequoiamo...@gmx.net> wrote:


 On Thu, Apr 18, 2013 at 10:24 AM, Sebastian Gabler <sequoiamo...@gmx.net> wrote:


  One of Elling's posts cited by another poster did
 take that into account, but the graphs don't load due to url changes (and
 they seem to have the wrong MIME type with the fixed URL, I ended up using
 wget):

 https://blogs.oracle.com/relling/entry/a_story_of_two_mttdl


 Thanks for pointing at that. I stand corrected with my previous statement
 about Richard's MTTDL model excluding BER/UER. Asking Richard Elling to
 accept my apology.
 Timothy, how do you manage to get these pngs? I get as far as extracting
 some links to them, but loading these links results in resets consistently.
 Help appreciated.


One of the wrong URLs is this:

http://blogs.sun.com/relling/resource/X4500-MTTDL-models-raidz2.png

Note that the blog URL is blogs.oracle.com, not blogs.sun.com, so, do this:

$ wget http://blogs.oracle.com/relling/resource/X4500-MTTDL-models-raidz2.png --no-check-certificate

It redirects to https, and oracle's certificate has problems, hence the
--no-check-certificate.  Unfortunately, entering the URL in a browser
treats the PNG data as text, hence using wget (or curl, or your preferred
method of downloading a URL).

Tim


Re: [OpenIndiana-discuss] vdev reliability was: Recommendations for fast storage

2013-04-19 Thread Richard Elling
[catching up...]

On Apr 19, 2013, at 6:14 AM, Sebastian Gabler sequoiamo...@gmx.net wrote:

 On 19.04.2013 11:22, openindiana-discuss-requ...@openindiana.org wrote:
 
 Message: 1
 Date: Thu, 18 Apr 2013 16:03:32 -0500
 From: Timothy Coalson <tsc...@mst.edu>
 To: Discussion list for OpenIndiana
   openindiana-discuss@openindiana.org
 Subject: Re: [OpenIndiana-discuss] vdev reliability was:
   Recommendations for fast storage
 Message-ID:
   CAK_=tazsbkcbbhie8czhqo2q+840s-sowajpykoppxc0gej...@mail.gmail.com
 Content-Type: text/plain; charset=ISO-8859-1
 
 On Thu, Apr 18, 2013 at 10:24 AM, Sebastian Gabler <sequoiamo...@gmx.net> wrote:
 
 
 One of Elling's posts cited by another poster did
 take that into account, but the graphs don't load due to url changes (and
 they seem to have the wrong MIME type with the fixed URL, I ended up using
 wget):
 
 https://blogs.oracle.com/relling/entry/a_story_of_two_mttdl
 
 Thanks for pointing at that. I stand corrected with my previous statement 
 about Richard's MTTDL model excluding BER/UER. Asking Richard Elling to 
 accept my apology.

No worries.
Unfortunately, Oracle totally hosed the older Sun blogs. I do have on my todo list the task of updating these and reposting on a site with better longevity, not controlled by the lawnmower.
 -- richard

 Timothy, how do you manage to get these pngs? I get as far as extracting some 
 links to them, but loading these links results in resets consistently. Help 
 appreciated.
 
 BR
 
 Sebastian
 
 

--

richard.ell...@richardelling.com
+1-760-896-4422





Re: [OpenIndiana-discuss] vdev reliability was: Recommendations for fast storage

2013-04-19 Thread Richard Elling
[catching up... comment below]

On Apr 18, 2013, at 2:03 PM, Timothy Coalson tsc...@mst.edu wrote:

 On Thu, Apr 18, 2013 at 10:24 AM, Sebastian Gabler <sequoiamo...@gmx.net> wrote:
 
 On 18.04.2013 16:28, openindiana-discuss-requ...@openindiana.org wrote:
 
 Message: 1
 Date: Thu, 18 Apr 2013 12:17:47 +
 From: Edward Ned Harvey (openindiana) <openindi...@nedharvey.com>
 To: Discussion list for OpenIndiana <openindiana-discuss@openindiana.org>
 Subject: Re: [OpenIndiana-discuss] Recommendations for fast storage
 Message-ID: <D1B1A95FBDCF7341AC8EB0A97FCCC4773BBF3411@SN2PRD0410MB372.namprd04.prod.outlook.com>
 Content-Type: text/plain; charset=us-ascii
 
 From: Timothy Coalson [mailto:tsc...@mst.edu]
 
 
 Did you also compare the probability of bit errors causing data loss
 without a complete pool failure?  2-way mirrors, when one device
 completely
 dies, have no redundancy on that data, and the copy that remains must be
 perfect or some data will be lost.
 
 I had to think about this comment for a little while to understand what
 you were saying, but I think I got it.  I'm going to rephrase your question:
 
 If one device in a 2-way mirror becomes unavailable, then the remaining
 device has no redundancy.  So if a bit error is encountered on the (now
 non-redundant) device, then it's an uncorrectable error.  Question is, did
 I calculate that probability?
 
 Answer is, I think so.  Modelling the probability of drive failure
 (either complete failure or data loss) is very complex and non-linear.
 Also dependent on the specific model of drive in question, and the graphs
 are typically not available.  So what I did was to start with some MTBDL
 graphs that I assumed to be typical, and then assume every data-loss event
 meant complete drive failure.
 
 The thing is... Bit Errors can lead to corruption of files, or even to the
 loss of a whole pool, without having an additional faulted drive. Because
 Bit Errors do not necessarily lead to a drive error. The risk of a rebuild
 failing is proportional to the BER of the drives involved, and it scales by
 the amount of data moved, given that you don't have further redundancy
 left. I agree with previous suggestions made that scrubbing offers some
 degree of protection against that issue. It doesn't do away with the risk
 when dealing with Bit Errors in a situation that has all redundancy
 stripped for some reason. For this aspect, a second level of redundancy
 offers a clear benefit.
 AFAIU, that was the valid point of the poster raising the controversy
 about resilience of a single vdev with multiple redundancy vs. multiple
 vdevs with single redundancy.
 As much as scrubbing is concerned, it is true that it will reduce the risk
 of a bit error rearing precisely during rebuild. However, in cases where
 you will deliberately pull redundancy, i.e. for swapping drives with larger
 ones, you will want to have a valid backup, and thus you will not have too much
 WORN data.  In either case, it is user-driven; that is, scrub is not proactive
 by itself, but it gives the user a tool to be proactive about WORN data, which
 is indeed the data primarily prone to bit rot.
 
 
 Yes, that was my point, bit errors when there is no remaining redundancy
 are unrecoverable.  Thus, as long as it is likely that only 1 disk per vdev
 fails at a time, raid-z2 will survive these bit errors fine, while 2-way
 mirrors will lose data.  One of Elling's posts cited by another poster did
 take that into account, but the graphs don't load due to url changes (and
 they seem to have the wrong MIME type with the fixed URL, I ended up using
 wget):
 
 https://blogs.oracle.com/relling/entry/a_story_of_two_mttdl
 
 He linearized the bit error rate part, also.  Beware the different scale on
 the graphs - at any rate, his calculation arrived at 3 orders of magnitude
 difference, with raid-z2 as better than 2-way mirrors for MTTDL for
 equivalent usable space (if I'm reading those correctly).

Yes. 2-way mirror is a single-parity protection scheme and raidz2 is a 
double-parity protection scheme.
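
For anyone who wants to see how the parity level enters the arithmetic, here is a rough sketch using textbook MTTDL approximations (an editorial simplification, not a reproduction of Richard's MTTDL[1]/MTTDL[2] models; the MTBF, vdev layouts and resilver times are assumed values, with the 12-hour vs. 1-week resilver figures taken from earlier in the thread):

# Textbook-style MTTDL approximations, with no UER term, just to show how
# single- vs double-parity enters the formula. All numbers are assumptions.
def mttdl_single_parity(n, mtbf, mttr):
    # group survives 1 failure: MTBF^2 / (n*(n-1)*MTTR)
    return mtbf**2 / (n * (n - 1) * mttr)

def mttdl_double_parity(n, mtbf, mttr):
    # group survives 2 failures: MTBF^3 / (n*(n-1)*(n-2)*MTTR^2)
    return mtbf**3 / (n * (n - 1) * (n - 2) * mttr**2)

MTBF = 1.0e6                      # hours per drive (assumed)
HOURS_PER_YEAR = 24 * 365

# ~8 disks of usable space either way (assumed layouts):
# 8 x 2-way mirrors (16 disks) resilvering in 12 h, vs
# 1 x 10-disk raidz2 (8 data + 2 parity) resilvering in 1 week.
mirror_pool = mttdl_single_parity(2, MTBF, 12) / 8    # 8 independent vdevs
raidz2_pool = mttdl_double_parity(10, MTBF, 7 * 24)

print(f"mirror pool MTTDL ~ {mirror_pool / HOURS_PER_YEAR:.2e} years")
print(f"raidz2 pool MTTDL ~ {raidz2_pool / HOURS_PER_YEAR:.2e} years")

Even with the much longer resilver window, the second parity comes out ahead in this simple model; folding in the UER term discussed above (where the mirror's surviving half must read back perfectly) widens the gap further, which is consistent with the difference Tim read off Richard's graphs.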

 
 Scrubbing introduces a question about the meaning of bit error rate - how
 different is the uncorrectable bit error rate on newly written data, versus
 the bit error rate on data that has been read back in successfully
 (additional qualifier: multiple times)?  Regular scrubs can change the
 MTTDL dramatically if these probabilities are significantly different,
 because the first probability only applies to data written since the most
 recent scrub, which can drop a few orders of magnitude from the calculation.

Pragmatically, it doesn't matter because the drive vendors do not publish the
information. So you'll have to measure it yourself, which is not an easy task
even if you 

Re: [OpenIndiana-discuss] vdev reliability was: Recommendations for fast storage

2013-04-18 Thread Edward Ned Harvey (openindiana)
 From: Sebastian Gabler [mailto:sequoiamo...@gmx.net]
 
 AFAIK, a bit error in Parity or stripe data can be specifically
 dangerous when it is raised during resilvering, and there is only one
 layer of redundancy left. 

You're saying error in parity, but that's because you're thinking of raidz, 
which I don't usually use.  You really mean error in redundant copy, and the 
only risk, as you've identified, is the error in the *last* redundant copy.

The answer to this is:  You *do* scrub every week or two, don't you?  You should.


 I do not think that zfs will have better resilience against rot of
 parity data than conventional RAID.

That's incorrect, because conventional raid cannot scrub proactively.

Sure, if you have a pool with only one level of redundancy, and the bit error 
crept in between the most recent scrub and the present failure time, then 
that's a problem, and ZFS cannot protect you against it.  This is, by 
definition, simultaneous failure of all redundant copies of the data.
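
A hedged sketch of why the scrub interval matters for that window (the latent-error arrival rate below is a made-up illustrative assumption, precisely because, as noted elsewhere in the thread, vendors don't publish it):

# Scrubbing caps how long a silent error can sit undetected on the surviving
# copy: only errors that arrived since the last scrub can bite during a
# resilver. Modeled as a Poisson process with an assumed rate; treat the
# output as shape, not truth.
import math

LATENT_ERRORS_PER_DRIVE_PER_YEAR = 0.5     # assumption for illustration only

def p_latent_error_since_last_scrub(scrub_interval_days):
    # on average the last scrub finished interval/2 ago when the other disk dies
    exposure_years = (scrub_interval_days / 2) / 365
    lam = LATENT_ERRORS_PER_DRIVE_PER_YEAR * exposure_years
    return 1 - math.exp(-lam)

for days in (7, 14, 30, 90, 365):
    p = p_latent_error_since_last_scrub(days)
    print(f"scrub every {days:3d} days -> ~{p:.2%} chance the surviving copy "
          f"hides a fresh latent error")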




Re: [OpenIndiana-discuss] vdev reliability was: Recommendations for fast storage

2013-04-18 Thread Jim Klimov

On 2013-04-18 12:46, Sebastian Gabler wrote:

I do not think that zfs will have better resilience against rot of
parity data than conventional RAID. At best, block level checksums can
help raise an error, so you know at least that something went wrong. But
recovery of the data will probably not be possible. So, in my opinion
BER is an issue under ZFS as anywhere else.


Well, thanks to checksums we can know which variant of userdata
is correct, and thanks to parities we can verify which bytes are
wrong in a particular block. If there are relatively few such bytes,
it is theoretically possible to brute-force candidate values for the
wrong bytes and recalculate checksums. So if a broken range
is on the order of 30-40 bytes (which someone said is typical
for a CRC error and an HDD returning uncertain data), you have a
chance of recovering the block in a few days if you're lucky ;)

This is a very compute-intensive task; I proposed this idea half
a year ago on the zfs list (I had unrecoverable errors on raidz2
made of 4 data disks and 2 parity disks, meaning corruptions on
3 or more drives, but not necessarily whole-sector corruptions)
and tried to take known byte values from different components at
known bad byte offsets and put them into the puzzle. Complexity
(size of recursive iteration) grows very quickly even if we only
have about 5 values to match (unlike 256 in full recovery above),
and we estimated that for a 4096-byte block it would take Earth's
compute resources longer than the lifetime of the universe to do
the full search and recovery. So such an approach is really limited
to just a few dozen broken bytes. But it is possible :)

//Jim
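
An editorial back-of-envelope on why that search explodes, using the figures Jim mentions (30-40 broken bytes, roughly 5 plausible candidates per position from the partial reads, a 4096-byte block for the full case; the trial rate is an assumption):

# Brute-force search space for k unknown byte positions with c candidate
# values per position, computed in log space so the huge cases don't overflow.
import math

def log10_search_space(k_bytes, candidates_per_byte):
    return k_bytes * math.log10(candidates_per_byte)

CHECKS_PER_SECOND = 1e9            # assumed: a billion checksum trials per second
SECONDS_PER_YEAR = 3.15e7

for k, c in [(10, 256), (35, 256), (35, 5), (4096, 5)]:
    lg = log10_search_space(k, c)
    lg_years = lg - math.log10(CHECKS_PER_SECOND * SECONDS_PER_YEAR)
    print(f"{k:5d} unknown bytes, {c:3d} candidates each: ~10^{lg:.0f} "
          f"combinations (~10^{lg_years:.0f} years at 1e9 trials/s)")

Only the smallest cases are anywhere near feasible, which matches the conclusion that the trick is limited to a few dozen broken bytes.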





Re: [OpenIndiana-discuss] vdev reliability was: Recommendations for fast storage

2013-04-18 Thread Edward Ned Harvey (openindiana)
 From: Jim Klimov [mailto:jimkli...@cos.ru]
 
 Well, thanks to checksums we can know which variant of userdata
 is correct, and thanks to parities we can verify which bytes are
 wrong in a particular block. If there's relatively few such bytes,
 it is theoretically possible to brute-force match values into the
 wrong bytes and recalculate checksums. So if a broken range
 is on the order of 30-40 bytes (which someone said is typical
 for a CRC error and HDD returning uncertain data) you have a
 chance of recovering the block in a few days if lucky ;)
 
 This is a very compute-intensive task; I proposed this idea half
 a year ago on the zfs list (I had unrecoverable errors on raidz2
 made of 4 data disks and 2 parity disks, meaning corruptions on
 3 or more drives, but not necessarily whole-sector corruptions)
 and tried to take known byte values from different components at
 known bad byte offsets and put them into the puzzle. Complexity
 (size of recursive iteration) grows very quickly even if we only
 have about 5 values to match (unlike 256 in full recovery above),
 and we estimated that for a 4096 byte block it would take Earth's
 compute resources longer than the lifetime of the universe to do
 the full search and recovery. So such approach is really limited
 to just a few dozen broken bytes. But it is possible :)

I think you're misplacing a decimal, confusing bits for bytes, and mixing up 
exponents.  Cuz you're way off.

With merely 70 unknown *bits*, that is, less than 10 bytes, you'll need a 
3-letter government agency devoting all its computational resources to the 
problem for a few years.

Furthermore, when you find a matching cksum, you haven't found the correct data 
yet.  You'll need to exhaustively search the entire space requiring 2^70 
operations, find all the matches (there will be a lot) and from those matches, 
choose the one you think is right.

Even with merely 70 unknown bits, and a 32-bit cksum (the default in zfs 
fletcher-4) you will have 2^38 (that is, 256 billion) results that produce the 
right cksum.  You'll have to rely on your knowledge of the jpg file or txt file 
or whatever, to choose which one of the 256 billion cksum-passing-results is 
*actually* the right result.
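
A small editorial check of that arithmetic (the 32-bit checksum width is the premise stated above; the exhaustive loop uses a deliberately tiny toy case so it actually finishes):

# Out of 2^u candidate bit patterns, roughly 2^(u - c) will pass a c-bit checksum.
def expected_matches(unknown_bits, checksum_bits):
    return 2 ** (unknown_bits - checksum_bits)

print(expected_matches(70, 32))    # 2^38 = 274877906944, the "256 billion"

# Tiny empirical version: 16 unknown bits against a toy 8-bit checksum,
# small enough to search exhaustively.
import zlib

def toy_checksum(data: bytes) -> int:
    return zlib.crc32(data) & 0xFF           # keep only 8 bits

target = toy_checksum(b"\x12\x34")
matches = sum(1 for v in range(2 ** 16)
              if toy_checksum(v.to_bytes(2, "big")) == target)
print(matches)                               # roughly 2^(16-8) = 256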




Re: [OpenIndiana-discuss] vdev reliability was: Recommendations for fast storage

2013-04-18 Thread Jim Klimov

On 2013-04-18 15:57, Edward Ned Harvey (openindiana) wrote:

I think you're misplacing a decimal, confusing bits for bytes, and mixing up 
exponents.  Cuz you're way off.

With merely 70 unknown *bits* that is, less than 10 bytes, you'll need a 
3-letter government agency devoting all its computational resources to the 
problem for a few years.


Uh... I'll raise a white flag and say I'm too preoccupied now to really
go into the math and see which one of us is wrong. Indeed, however, my post
was based on a practical experiment with a few values to test in each
location (about 2^2 variants instead of 2^8 for a full-byte match), so
there's a few orders of magnitude of complexity difference right away ;)

So likely you're more correct, and the solution of the problem is then out
of practical reach even for a small corruption, except for true
single-bit rots (if these are at all possible after CRC corrections).

BTW, in terms of the number of matching checksums - does it matter if
we are talking about just a 70-bit string, or about the same string at
a fixed location in a 4kb or 128kb block?

//Jim



Re: [OpenIndiana-discuss] vdev reliability was: Recommendations for fast storage

2013-04-18 Thread Sebastian Gabler

On 18.04.2013 16:28, openindiana-discuss-requ...@openindiana.org wrote:

Message: 1
Date: Thu, 18 Apr 2013 12:17:47 +
From: Edward Ned Harvey (openindiana) <openindi...@nedharvey.com>
To: Discussion list for OpenIndiana
openindiana-discuss@openindiana.org
Subject: Re: [OpenIndiana-discuss] Recommendations for fast storage
Message-ID:

d1b1a95fbdcf7341ac8eb0a97fccc4773bbf3...@sn2prd0410mb372.namprd04.prod.outlook.com

Content-Type: text/plain; charset=us-ascii


From: Timothy Coalson [mailto:tsc...@mst.edu]

Did you also compare the probability of bit errors causing data loss
without a complete pool failure?  2-way mirrors, when one device
completely
dies, have no redundancy on that data, and the copy that remains must be
perfect or some data will be lost.

I had to think about this comment for a little while to understand what you 
were saying, but I think I got it.  I'm going to rephrase your question:

If one device in a 2-way mirror becomes unavailable, then the remaining device 
has no redundancy.  So if a bit error is encountered on the (now non-redundant) 
device, then it's an uncorrectable error.  Question is, did I calculate that 
probability?

Answer is, I think so.  Modelling the probability of drive failure (either 
complete failure or data loss) is very complex and non-linear.  Also dependent 
on the specific model of drive in question, and the graphs are typically not 
available.  So what I did was to start with some MTBDL graphs that I assumed to 
be typical, and then assume every data-loss event meant complete drive failure.
The thing is... bit errors can lead to corruption of files, or even to 
the loss of a whole pool, without an additional faulted drive, because 
bit errors do not necessarily lead to a drive error. The risk of a rebuild 
failing is proportional to the BER of the drives involved, and it scales 
with the amount of data moved, given that you don't have further redundancy 
left. I agree with the previous suggestions that scrubbing offers some degree 
of protection against that issue. It doesn't do away with the risk of bit 
errors in a situation where all redundancy has been stripped for some reason. 
For this aspect, a second level of redundancy offers a clear benefit.
AFAIU, that was the valid point of the poster raising the controversy 
about resilience of a single vdev with multiple redundancy vs. multiple 
vdevs with single redundancy.
As far as scrubbing is concerned, it is true that it will reduce the risk 
of a bit error surfacing precisely during a rebuild. However, in cases where 
you deliberately pull redundancy, e.g. for swapping drives with larger ones, 
you will want to have a valid backup, and thus you will not have too much 
WORN data.  In either case, it is user-driven; that is, scrub is not proactive 
by itself, but it gives the user a tool to be proactive about WORN data, 
which is indeed the data primarily prone to bit rot.


BR

Sebastian



Re: [OpenIndiana-discuss] vdev reliability was: Recommendations for fast storage

2013-04-18 Thread Timothy Coalson
On Thu, Apr 18, 2013 at 10:24 AM, Sebastian Gabler <sequoiamo...@gmx.net> wrote:

 On 18.04.2013 16:28, openindiana-discuss-requ...@openindiana.org wrote:

 Message: 1
 Date: Thu, 18 Apr 2013 12:17:47 +
 From: Edward Ned Harvey (openindiana) <openindi...@nedharvey.com>
 To: Discussion list for OpenIndiana <openindiana-discuss@openindiana.org>
 Subject: Re: [OpenIndiana-discuss] Recommendations for fast storage
 Message-ID: <D1B1A95FBDCF7341AC8EB0A97FCCC4773BBF3411@SN2PRD0410MB372.namprd04.prod.outlook.com>
 Content-Type: text/plain; charset=us-ascii

  From: Timothy Coalson [mailto:tsc...@mst.edu]

 
 Did you also compare the probability of bit errors causing data loss
 without a complete pool failure?  2-way mirrors, when one device
 completely
 dies, have no redundancy on that data, and the copy that remains must be
 perfect or some data will be lost.

 I had to think about this comment for a little while to understand what
 you were saying, but I think I got it.  I'm going to rephrase your question:

 If one device in a 2-way mirror becomes unavailable, then the remaining
 device has no redundancy.  So if a bit error is encountered on the (now
 non-redundant) device, then it's an uncorrectable error.  Question is, did
 I calculate that probability?

 Answer is, I think so.  Modelling the probability of drive failure
 (either complete failure or data loss) is very complex and non-linear.
  Also dependent on the specific model of drive in question, and the graphs
 are typically not available.  So what I did was to start with some MTBDL
 graphs that I assumed to be typical, and then assume every data-loss event
 meant complete drive failure.

 The thing is... Bit Errors can lead to corruption of files, or even to the
 loss of a whole pool, without having an additional faulted drive. Because
 Bit Errors do not necessarily lead to a drive error. The risk of a rebuild
 failing is proportional to the BER of the drives involved, and it scales by
 the amount of data moved, given that you don't have further redundancy
 left. I agree with previous suggestions made that scrubbing offers some
 degree of protection against that issue. It doesn't do away with the risk
 when dealing with Bit Errors in a situation that has all redundancy
 stripped for some reason. For this aspect, a second level of redundancy
 offers a clear benefit.
 AFAIU, that was the valid point of the poster raising the controversy
 about resilience of a single vdev with multiple redundancy vs. multiple
 vdevs with single redundancy.
 As much as scrubbing is concerned, it is true that it will reduce the risk
 of a bit error rearing precisely during rebuild. However, in cases where
 you will deliberately pull redundancy, i.e. for swapping drives with larger
 ones, you will want to have a valid backup, and thus you will not have too much
 WORN data.  In either case, it is user-driven; that is, scrub is not proactive
 by itself, but it gives the user a tool to be proactive about WORN data, which
 is indeed the data primarily prone to bit rot.


Yes, that was my point, bit errors when there is no remaining redundancy
are unrecoverable.  Thus, as long as it is likely that only 1 disk per vdev
fails at a time, raid-z2 will survive these bit errors fine, while 2-way
mirrors will lose data.  One of Elling's posts cited by another poster did
take that into account, but the graphs don't load due to url changes (and
they seem to have the wrong MIME type with the fixed URL, I ended up using
wget):

https://blogs.oracle.com/relling/entry/a_story_of_two_mttdl

He linearized the bit error rate part, also.  Beware the different scale on
the graphs - at any rate, his calculation arrived at 3 orders of magnitude
difference, with raid-z2 as better than 2-way mirrors for MTTDL for
equivalent usable space (if I'm reading those correctly).

Scrubbing introduces a question about the meaning of bit error rate - how
different is the uncorrectable bit error rate on newly written data, versus
the bit error rate on data that has been read back in successfully
(additional qualifier: multiple times)?  Regular scrubs can change the
MTTDL dramatically if these probabilities are significantly different,
because the first probability only applies to data written since the most
recent scrub, which can drop a few orders of magnitude from the calculation.

As for what I said about resilver speed, I had not accounted for the fact
that data reads on a raid-z2 component device would be significantly
shorter than for the same data on 2-way mirrors.  Depending on whether you
are using enormous block sizes, or whether your data is allocated extremely
linearly in the way scrub/resilver reads it, this could be the limiting
factor on platter drives due to seek 

Re: [OpenIndiana-discuss] vdev reliability was: Recommendations for fast storage

2013-04-18 Thread Edward Ned Harvey (openindiana)
 From: Timothy Coalson [mailto:tsc...@mst.edu]
 
 As for what I said about resilver speed, I had not accounted for the fact
 that data reads on a raid-z2 component device would be significantly
 shorter than for the same data on 2-way mirrors.  Depending on whether
 you
 are using enormous block sizes, or whether your data is allocated extremely
 linearly in the way scrub/resilver reads it, this could be the limiting
 factor on platter drives due to seek times, and make raid-z2 take much
 longer to resilver.  I fear I was thinking of raid-z2 in terms of raid6.

I'm not sure if you misunderstand something, or if I misunderstand what you're 
saying, but ...

Even if you are using enormous block sizes, it's actually just enormous *max* 
block sizes.  If you write a 1 byte file (very slowly such that no write 
accumulation can occur) then ZFS only writes a 1 byte file, into a block.  So 
the enormous block sizes only come into play when you're writing large amounts 
of data ...  And when you're writing large amounts of data, you're likely to 
simply span multiple sequential blocks anyway.  So all-in-all, the blocksize is 
rarely very important.  There are some situations where it matters, but ...  
All this is a tangent.

The real thing I'm addressing here is: you said scrub/resilver progresses 
extremely linearly.  This is, unfortunately, about as wrong as it can be.  In 
actuality, scrub/resilver proceed in approximately temporal order, which, in 
the typical situation of a long-running server with frequent creation and 
destruction of snapshots, results in approximately random disk order.

Here's the evidence I observed:  I had a ZFS server running in production for 
about 2 years, and a disk failed.  I had measured previously, on this server, 
that each disk sustains 1 Gbit/sec sequentially.  With 1T disks, linearly 
resilvering the entire disk including empty space, it should take about 2 hrs 
to resilver.  But ZFS doesn't resilver the whole disk; it only resilvers used 
space.  This would be great if your pool is mostly empty, or if the used data 
were laid out linearly on disk.  But it actually took 12 hours to resilver that 
disk.  I went to zfs-discuss and discussed.  Learned about the temporal 
ordering.  Got my explanation of how resilvering just the used portions can 
take several times longer than resilvering the whole disk.
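
A rough editorial sketch of why temporal (effectively random) ordering turns a ~2-hour streaming job into a half-day one; the seek time, average block size and pool usage are assumptions picked to be plausible for disks of that era, not measurements from this machine:

# Sequential resilver: limited by streaming bandwidth.
# Temporally ordered resilver: limited by one seek per block.
DISK_BYTES       = 1e12          # "1T" disk
SEQ_BANDWIDTH    = 125e6         # ~1 Gbit/s sustained, as measured above
USED_FRACTION    = 0.5           # assumed pool usage
AVG_IO_SIZE      = 128 * 1024    # assumed average read per seek (default recordsize)
SEEK_PLUS_ROTATE = 0.010         # ~10 ms per random IO (assumed)

seq_hours = DISK_BYTES / SEQ_BANDWIDTH / 3600
random_ios = DISK_BYTES * USED_FRACTION / AVG_IO_SIZE
rand_hours = random_ios * SEEK_PLUS_ROTATE / 3600

print(f"whole-disk sequential resilver:        ~{seq_hours:.1f} h")
print(f"used-space resilver in temporal order: ~{rand_hours:.1f} h")

At one seek per block the head time dominates even with the pool only half full, which is the several-fold slowdown described above.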



