Re: [zfs-discuss] please help with raid / failure / rebuild calculations

2008-07-15 Thread Richard Elling
Will Murnane wrote:
> On Tue, Jul 15, 2008 at 01:58, Ross <[EMAIL PROTECTED]> wrote:
>   
>> However, I'm not sure where the 8 is coming from in your calculations.
>> 
> Bits per byte ;)
>
>   
>> In this case approximately 13/100 or around 1 in 8 odds.
>> 
> Taking the factor of 8 into account, it's around 8 in 8.
>
> Another possible factor to consider in calculations of this nature is
> that you probably won't get a single bit flipped here or there.  If
> drives take 512-byte sectors and apply Hamming codes to those 512
> bytes to get, say, 548 bytes of coded data that are actually written
> to disk, you need to corrupt more than (548-512)/2 = 18 bytes = 144 bits
> before the sector can no longer be corrected from the redundancy you
> have.  Thus, rather than getting one incorrect bit in a particular
> 4096-bit sector, you're likely to get all good sectors and one that's
> complete garbage.  Unless the manufacturers' specifications account for
> this, I would say the drive's sector error rate works out to about one
> bad sector per 4*(10**17) bits read.  I have no idea
> whether they account for this or not, but it'd be interesting (and
> fairly doable) to test.  Write a 1TB disk full of known data, then
> read it and verify.  Then repeat until you have seen incorrect sectors
> a few times for a decent sample size, and store elsewhere what the
> sector was supposed to be and what it actually was.
>   

The specification is for unrecoverable reads per bit read.  I think
most people expect this to be counted as delivered to the host, which is
how we count them.  I would expect many, many more recoverable read
events, which the drive corrects internally.

You can also adjust by the amount of space used in ZFS and the number
of copies of the data.
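For example, a tiny Python sketch of that adjustment (the 8% per-drive figure
is the factor-of-8-corrected number from the thread below; the utilization
value is just an assumed example):

    per_drive = 0.08      # ~8% chance of an unrecoverable read per full 1 TB drive read
    utilization = 0.6     # assumed fraction of the pool that actually holds data

    # ZFS only resilvers live data, so the exposure scales with utilization.
    p_per_drive = per_drive * utilization
    print(p_per_drive)    # ~0.048, before copies/parity, which further reduce the
                          # chance that a single bad read actually costs you data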
 -- richard



Re: [zfs-discuss] please help with raid / failure / rebuild calculations

2008-07-14 Thread Will Murnane
On Tue, Jul 15, 2008 at 01:58, Ross <[EMAIL PROTECTED]> wrote:
> However, I'm not sure where the 8 is coming from in your calculations.
Bits per byte ;)

> In this case approximately 13/100 or around 1 in 8 odds.
Taking the factor of 8 into account, it's around 8 in 8.

Another possible factor to consider in calculations of this nature is
that you probably won't get a single bit flipped here or there.  If
drives take 512-byte sectors and apply Hamming codes to those 512
bytes to get, say, 548 bytes of coded data that are actually written
to disk, you need to corrupt more than (548-512)/2 = 18 bytes = 144 bits
before the sector can no longer be corrected from the redundancy you
have.  Thus, rather than getting one incorrect bit in a particular
4096-bit sector, you're likely to get all good sectors and one that's
complete garbage.  Unless the manufacturers' specifications account for
this, I would say the drive's sector error rate works out to about one
bad sector per 4*(10**17) bits read.  I have no idea
whether they account for this or not, but it'd be interesting (and
fairly doable) to test.  Write a 1TB disk full of known data, then
read it and verify.  Then repeat until you have seen incorrect sectors
a few times for a decent sample size, and store elsewhere what the
sector was supposed to be and what it actually was.
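If anyone wants to try it, here is a minimal sketch of that write-then-verify
test in Python.  It assumes a scratch device or large file you can afford to
destroy (the path below is hypothetical), writes a reproducible per-sector
pattern, then reads it back and counts mismatching sectors; a real run would
want larger, aligned I/O for speed:

    DEVICE = "/dev/rdsk/c1t1d0s0"   # hypothetical scratch device; ALL data on it is lost
    SECTOR = 512
    CHUNK_SECTORS = 2048            # 1 MiB per I/O to keep the Python loop manageable
    TOTAL_SECTORS = 2 * 10**9       # roughly 1 TB worth of 512-byte sectors

    def pattern(sector_no):
        # Reproducible pattern: the sector number repeated to fill 512 bytes.
        return sector_no.to_bytes(8, "little") * (SECTOR // 8)

    def chunk(start):
        return b"".join(pattern(start + j) for j in range(CHUNK_SECTORS))

    with open(DEVICE, "wb") as f:                        # write pass
        for s in range(0, TOTAL_SECTORS, CHUNK_SECTORS):
            f.write(chunk(s))

    bad = 0
    with open(DEVICE, "rb") as f:                        # verify pass
        for s in range(0, TOTAL_SECTORS, CHUNK_SECTORS):
            data = f.read(SECTOR * CHUNK_SECTORS)
            for j in range(CHUNK_SECTORS):
                if data[j * SECTOR:(j + 1) * SECTOR] != pattern(s + j):
                    bad += 1                             # log expected vs actual elsewhere
    print("sectors checked:", TOTAL_SECTORS, "bad sectors:", bad)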

Will


Re: [zfs-discuss] please help with raid / failure / rebuild calculations

2008-07-14 Thread Ross
Hey,

I just had a "D'oh!" moment, I'm afraid: I woke up this morning realising my
previous post about the chances of failure was completely wrong.

You do need to multiply the chance of failure by the number of remaining disks,
because you're reading the data off every one of them, and you risk losing data
from any one of them.  However, I'm not sure where the 8 is coming from in your
calculations.  To my mind, the chance of an error on any one drive is:

amount of data read / bits per unrecoverable error
= 1TB / 10^14
~ 10^12 / 10^14, or a 1 in 100 chance of an error

So then, once one of your 14 disks fails, you have 13 left, and for raid-z you
need to read the data off every single one of them to survive without errors,
which means the calculation is now:

no of disks * amount of data read / bits per unrecoverable error

In this case approximately 13/100 or around 1 in 8 odds.

So with raid-z you have around a 1 in 8 chance of *some kind* of data error
during the rebuild of the raid.  So your odds calculations weren't far off, but
the key point is that you're not calculating the odds of an entire drive failing
here, you're calculating the odds of having a single bit of data fail.  Now that
bit could be in a vital file, but it could just as easily be in an unimportant
file, or even blank space.

And I can also give you the correct math for raid-z2.  Keeping in mind that
these figures are for a *single piece of data*, not the entire drive, the
chance of raid-z2 failing during the rebuild is very small.  I agree that the
odds of having at least one piece of data fail during the raid-z2 rebuild are
reasonably high (1 in 8), but for the rebuild itself to fail you need two
failures in the same place, which means the calculation uses the failure rate
for that particular bit, not for every bit on the drive:

no of disks / bits per unrecoverable error

So the chance of your raid-z2 failing during the rebuild is approximately 12 in
10^14, which I think you'll agree are much better odds :D
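For anyone who wants to play with these numbers, a quick Python sketch of the
arithmetic in this thread (including Will's factor-of-8 correction; it ignores
clustering of errors and pool utilisation, and I've used 13 surviving disks
throughout, so the raid-z2 figure differs slightly from the 12 above):

    BER = 1e-14                   # unrecoverable errors per bit read (spec figure)
    BITS_PER_TB = 8e12            # 1 TB is ~8 * 10^12 bits (the factor of 8)
    SURVIVORS = 13                # disks read during a single-disk rebuild

    per_drive = BITS_PER_TB * BER                 # ~0.08, i.e. ~8% per surviving drive
    naive_rebuild = SURVIVORS * per_drive         # ~1.04, the "104%" accumulation

    # The naive sum exceeds 1; treating errors as independent, the proper figure
    # is 1 - (1 - p)^n, which is high but not a literal guarantee:
    proper_rebuild = 1 - (1 - per_drive) ** SURVIVORS   # ~0.66

    # raid-z2 during a rebuild: a single error is still repairable, so the exposure
    # is roughly per-bit rather than per-drive:
    raidz2_rebuild = SURVIVORS * BER              # ~1.3e-13

    print(per_drive, naive_rebuild, proper_rebuild, raidz2_rebuild)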

Ross
 
 


Re: [zfs-discuss] please help with raid / failure / rebuild calculations

2008-07-11 Thread Bob Friesenhahn
On Fri, 11 Jul 2008, Akhilesh Mritunjai wrote:

>> Thanks for your comments.  FWIW, I am building an
>> actual hardware array, so even though I _may_ put ZFS
>> on top of the hardware array's 22TB "drive" that the
>> OS sees (I may not) I am focusing purely on the
>> controller rebuild.
>
> Not letting ZFS handle (at least one level of) redundancy is a bad 
> idea. Don't do that!

Agreed.

A further issue to consider is mean time to recover/restore.  This has
quite a lot to do with actual uptime.  For example, if you decide to
create two huge 22TB LUNs and mirror across them, then resilvering one
of those LUNs will take a *long* time.  A good design will try to keep
any storage area which needs to be resilvered small enough that it can
be restored quickly and the risk of a secondary failure is minimized.
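As a rough illustration of why the size of the resilvered area matters, a small
Python sketch of the arithmetic (the throughput figure is an assumed number,
not a measurement; real resilver rates depend on layout and load):

    lun_tb = 22          # size of one mirrored LUN, in TB
    rate_mb_s = 100      # assumed sustained resilver throughput, MB/s

    hours = lun_tb * 1e6 / rate_mb_s / 3600
    print(round(hours, 1), "hours of exposure to a second failure")   # ~61 hours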

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/



Re: [zfs-discuss] please help with raid / failure / rebuild calculations

2008-07-11 Thread Akhilesh Mritunjai
> Thanks for your comments.  FWIW, I am building an
> actual hardware array, so even though I _may_ put ZFS
> on top of the hardware array's 22TB "drive" that the
> OS sees (I may not) I am focusing purely on the
> controller rebuild.

Not letting ZFS handle (at least one level of) redundancy is a bad idea. Don't 
do that!
 
 


Re: [zfs-discuss] please help with raid / failure / rebuild calculations

2008-07-11 Thread Richard Elling
User Name wrote:
> Hello relling,
>
> Thanks for your comments.  FWIW, I am building an actual hardware array, so 
> even though I _may_ put ZFS on top of the hardware array's 22TB "drive" that 
> the OS sees (I may not), I am focusing purely on the controller rebuild.
>
> So, setting aside ZFS for the moment, am I still correct in my intuition that 
> there is no way a _controller_ needs to touch a disk more times than there 
> are bits on the entire disk, and that this calculation people are doing is 
> faulty?
>   

I think the calculation is correct, at least for the general case.
At FAST this year there was an interesting paper which tried to
measure this exposure in a large field sample by using checksum
verifications.  I like this paper and it validates what we see in the
field -- the most common failure mode is the unrecoverable read.
http://www.usenix.org/event/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf

I should also point out that ZFS is already designed to offer some
diversity which should help guard against spatially clustered
media failures.  hmmm... another blog topic in my queue...
 -- richard




Re: [zfs-discuss] please help with raid / failure / rebuild calculations

2008-07-11 Thread User Name
Hello relling,

Thanks for your comments.  FWIW, I am building an actual hardware array, so even
though I _may_ put ZFS on top of the hardware array's 22TB "drive" that the OS
sees (I may not), I am focusing purely on the controller rebuild.

So, setting aside ZFS for the moment, am I still correct in my intuition that
there is no way a _controller_ needs to touch a disk more times than there are
bits on the entire disk, and that this calculation people are doing is faulty?

I will check out that blog - thanks.
 
 


Re: [zfs-discuss] please help with raid / failure / rebuild calculations

2008-07-11 Thread Ross
Without checking your math, I believe you may be confusing the risk of *any* 
data corruption with the risk of a total drive failure, but I do agree that the 
calculation should just be for the data on the drive, not the whole array.

My feeling on this from the various analyses I've read on the web is that 
you're reasonably likely to find some corruption on a drive during a rebuild, 
but raid-6 protects you from this nicely.  From memory, I think the stats were 
something like a 5% chance of an error on a 500GB drive, which would mean 
something like a 10% chance with your 1TB drives.  That would tie in with your 
figures if you took out the multiplier for the whole raid's data.  Instead of a 
guaranteed failure, you've calculated around 1 in 10 odds.
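A quick sanity check of that scaling, taking the 5%-per-500GB figure as given
(strictly the probabilities compound rather than simply doubling, but at these
values it comes out around 10% either way) -- a small Python sketch:

    p_500gb = 0.05                          # assumed chance of an error reading 500GB
    p_1tb_compound = 1 - (1 - p_500gb)**2   # two 500GB reads -> ~9.75%
    p_1tb_naive = 2 * p_500gb               # simple doubling -> 10%
    print(p_1tb_compound, p_1tb_naive)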

So, during any rebuild you have around a 1 in 10 chance of the rebuild
encountering *some* corruption, but that's very likely going to be just a few
bits of data, which can easily be recovered using raid-6, and the rest of the
rebuild can carry on as normal.

Of course there's always a risk of a second drive failing, which is why we have 
backups, but I believe that risk is minuscule in comparison, and also offset by 
the ability to regularly scrub your data, which helps to ensure that any 
problems with drives are caught early on.  Early replacement of failing drives 
means it's far less likely that you'll ever have two fail together.
 
 


Re: [zfs-discuss] please help with raid / failure / rebuild calculations

2008-07-10 Thread Richard Elling
User Name wrote:
> I am building a 14 disk raid 6 array with 1 TB seagate AS (non-enterprise) 
> drives.
>
> So there will be 14 disks total, 2 of them will be parity, 12 TB space 
> available.
>
> My drives have a BER of 1 in 10^14.
>
> I am quite scared by my calculations - it appears that if one drive fails, 
> and I do a rebuild, I will perform:
>
> 13*8*10^12 = 104 * 10^12
>
> bit reads.  But my BER interval is smaller:
>
> 10^14 = 100 * 10^12
>
> So I am (theoretically) guaranteed to lose another drive on raid rebuild.  
> Then the calculation for _that_ rebuild is:
>
> 12*8*10^12 = 96 * 10^12, or 96% of the BER interval.
>
> So no longer guaranteed, but 96% isn't good.
>
> I have looked all over, and these seem to be the accepted calculations - 
> which means if I ever have to rebuild, I'm toast.
>   

If you were using RAID-5, you might be concerned.  For RAID-6,
or at least raidz2, you could recover from an unrecoverable read during
the rebuild of one disk.

> But here is the question - the part I am having trouble understanding:
>
> The 13*8*10^12 operations required for the first rebuild ... isn't that the
> number for _the entire array_?  Any given 1 TB disk only has 8*10^12 bits on
> it _total_.  So why would I ever do more than 8*10^12 operations on the disk?
>   

Actually, ZFS only rebuilds the data.  So you need to multiply by
the space utilization of the pool, which will usually be less than
100%.

> It seems very odd to me that a raid controller would have to access any given 
> bit more than once to do a rebuild ... and the total number of bits on a
> drive is 8*10^12, which is well below the 10^14 BER number.
>
> So I guess my question is - why are we all doing this calculation, wherein we 
> apply the total operations across an entire array rebuild to a single drive's
> BER number?
>   

You might also be interested in this blog
http://blogs.zdnet.com/storage/?p=162

A couple of things seem to be at work here.  I study field data
failure rates.  We tend to see unrecoverable read failure rates
at least an order of magnitude better than the specifications.
This is a good thing, but simply points out that the specifications
are often sand-bagged -- they are not a guarantee.  However,
you are quite right in your intuition that if you have a lot of
bits of data, then you need to pay attention to the bit-error
rate (BER) of unrecoverable reads on disks. This sort of model
can be used to determine a mean time to data loss (MTTDL) as
I explain here:
http://blogs.sun.com/relling/entry/a_story_of_two_mttdl

Perhaps it would help if we changed the math to show the risk as a function
of the amount of data, given the protection scheme?  Hmmm... something like
the probability of data loss per year for N TBytes with configuration XYZ.
Would that be more useful for evaluating configurations?
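For illustration, a minimal Python sketch of that kind of calculation -- it
assumes independent errors at the quoted 1-in-10^14 spec rate and that the
resilver only reads live data, and it is not the full MTTDL treatment:

    import math

    # Probability of at least one unrecoverable read while resilvering,
    # as a function of how much data actually has to be read.
    def p_error_during_resilver(tb_read, ber=1e-14):
        bits = tb_read * 8e12
        return -math.expm1(bits * math.log1p(-ber))   # 1 - (1 - ber)**bits, computed stably

    # 13 surviving 1 TB disks, pool 100% full vs 50% full:
    for utilization in (1.0, 0.5):
        tb = 13 * 1.0 * utilization
        print(utilization, round(p_error_during_resilver(tb), 2))   # ~0.65 and ~0.41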
 -- richard




[zfs-discuss] please help with raid / failure / rebuild calculations

2008-07-10 Thread User Name
I am building a 14 disk raid 6 array with 1 TB seagate AS (non-enterprise) 
drives.

So there will be 14 disks total, 2 of them will be parity, 12 TB space 
available.

My drives have a BER of 1 in 10^14.

I am quite scared by my calculations - it appears that if one drive fails, and 
I do a rebuild, I will perform:

13*8*10^12 = 104 * 10^12

bit reads.  But my BER interval is smaller:

10^14 = 100 * 10^12

So I am (theoretically) guaranteed to lose another drive on raid rebuild.  Then 
the calculation for _that_ rebuild is:

12*8*10^12 = 96 * 10^12, or 96% of the BER interval.

So no longer guaranteed, but 96% isn't good.
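To spell out the arithmetic, this is all I'm doing (a quick Python sketch):

    bits_per_drive = 8 * 10**12     # 1 TB drive
    ber_interval = 10**14           # one unrecoverable error per this many bits read

    first_rebuild = 13 * bits_per_drive / ber_interval    # 1.04 -> the "104%" above
    second_rebuild = 12 * bits_per_drive / ber_interval   # 0.96 -> the "96%" above
    print(first_rebuild, second_rebuild)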

I have looked all over, and these seem to be the accepted calculations - which 
means if I ever have to rebuild, I'm toast.

But here is the question - the part I am having trouble understanding:

The 13*8*10^12 operations required for the first rebuild ... isn't that the
number for _the entire array_?  Any given 1 TB disk only has 8*10^12 bits on it
_total_.  So why would I ever do more than 8*10^12 operations on the disk?

It seems very odd to me that a raid controller would have to access any given
bit more than once to do a rebuild ... and the total number of bits on a drive
is 8*10^12, which is well below the 10^14 BER number.

So I guess my question is - why are we all doing this calculation, wherein we
apply the total operations across an entire array rebuild to a single drive's
BER number?

Thanks.
 
 