Re: Triple parity and beyond

2013-11-20 Thread Stan Hoeppner
On 11/20/2013 10:16 AM, James Plank wrote:
> Hi all -- no real comments, except as I mentioned to Ric, my tutorial
> in FAST last February presents Reed-Solomon coding with Cauchy
> matrices, and then makes special note of the common pitfall of
> assuming that you can append a Vandermonde matrix to an identity
> matrix.  Please see
> http://web.eecs.utk.edu/~plank/plank/papers/2013-02-11-FAST-Tutorial.pdf,
> slides 48-52.
> 
> Andrea, does the matrix that you included in an earlier mail (the one
> that has Linux RAID-6 in the first two rows) have a general form, or
> did you develop it in an ad hoc manner so that it would include Linux
> RAID-6 in the first two rows?

Hello Jim,

It's always perilous to follow a Ph.D., so I guess I'm feeling suicidal
today. ;)

I'm not attempting to marginalize Andrea's work here, but I can't help
but ponder what the real value of triple parity RAID is, or quad, or
beyond.  Some time ago parity RAID's primary mission ceased to be
surviving single drive failure, or a 2nd failure during rebuild, and
became mitigating UREs during a drive rebuild.  So we're now talking
about dedicating 3 drives of capacity to avoiding disaster due to
platter defects and secondary drive failure.  For small arrays this is
approaching half the array capacity.  So here parity RAID has lost its
capacity advantage over RAID10, yet it still suffers vastly inferior
performance in normal read/write IO, not to mention rebuild times that
are 3-10x longer.

WRT rebuild times, once drives hit 20TB we're looking at 18 hours just
to mirror a drive at full streaming bandwidth, assuming 300MB/s
average--and that is probably being kind to the drive makers.  With 6 or
8 of these drives, I'd guess a typical md/RAID6 rebuild will take at
minimum 72 hours or more, probably over 100, and probably more yet for
3P.  And with larger drive count arrays the rebuild times approach a
week.  Whose users can go a week with degraded performance?  This is
simply unreasonable, at best.  I say it's completely unacceptable.
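
As a rough sanity check of the figures above, here is a small Python
sketch (mine, not anything from md) that computes the minimum
streaming-copy time for a drive and then applies the 3-10x parity
multiplier mentioned earlier; the multiplier is an observation, not a
measured constant.

# Back-of-the-envelope rebuild time estimates (illustration only).
def clone_hours(capacity_tb, mb_per_sec):
    # Minimum time to stream-copy one drive at its sustained average
    # rate.  Drive makers count decimal units: 1 TB = 1,000,000 MB.
    return capacity_tb * 1_000_000 / mb_per_sec / 3600

mirror = clone_hours(20, 300)              # 20TB drive at 300MB/s average
print("mirror/RAID10 rebuild: %.1f hours" % mirror)        # ~18.5 hours
for factor in (3, 10):                     # parity rebuilds run 3-10x longer
    print("parity rebuild x%d:    %.0f hours" % (factor, mirror * factor))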

With these gargantuan drives coming soon, the probability of multiple
UREs during rebuild is pretty high.  Continuing to use ever more
complex parity RAID schemes simply increases rebuild time further.  The
longer the rebuild, the more likely a subsequent drive failure due to
heat buildup, vibration, etc.  Thus, in our maniacal efforts to mitigate
one failure mode we're increasing the probability of another.  TANSTAAFL.
Worse yet, RAID10 isn't going to survive because UREs on a single drive
are increasingly likely with these larger drives, and one URE during
rebuild destroys the array.

I think people are going to have to come to grips with using more and
more drives simply to brace the legs holding up their arrays; come to
grips with these insane rebuild times; or bite the bullet they so
steadfastly avoided with RAID10.  Lots more spindles solves problems,
but at a greater cost--again, no free lunch.

What I envision is an array type, something similar to RAID 51, i.e.
striped parity over mirror pairs.  In the case of Linux, this would need
to be a new distinct md/RAID level, as both the RAID5 and RAID1 code
would need enhancement before being meshed together into this new level[1].

Potential Advantages:

1.  Only +1 disk capacity overhead vs RAID 10, regardless of drive count
2.  Rebuild time is the same as RAID 10, unless a mirror pair is lost
3.  Parity is only used during rebuild if/when a URE occurs, unless a
full mirror pair is lost (see #2)
4.  Single drive failure doesn't degrade the parity array; multiple
failures in different mirrors don't degrade it either
5.  Can sustain a minimum of 3 simultaneous drive failures--both drives
in one mirror and one drive in another mirror
6.  Can lose a maximum of 1/2 of the drives plus 1 drive--one more than
RAID 10.  Can lose half the drives and still not degrade parity,
if no two of them belong to the same mirror
7.  Similar or possibly better read throughput vs triple parity RAID
8.  Superior write performance with drives down
9.  Vastly superior rebuild performance, as rebuilds will rarely, if
ever, involve parity

Potential Disadvantages:

1.  +1 disk overhead vs RAID 10, many more than 2/3P w/large arrays
2.  Read-modify-write penalty vs RAID 10
3.  Slower write throughput vs triple parity RAID due to spindle deficit
4.  Development effort
5.  ??


[1]  The RAID1/5 code would need to be patched to properly handle a URE
encountered by the RAID1 code during rebuild.  There are surely other
modifications and/or optimizations that would be needed.  For large
sequential reads, more deterministic read interleaving between mirror
pairs would be a good candidate I think.  IIUC the RAID1 driver does
read interleaving on a per thread basis or some such, which I don't
believe is going to work for this "RAID 51" scenario, at least not for
single streaming reads.  If this can be done well, we double the read
performance of RAID5, and thus we don't completely "waste" all the extra
disks vs big_parity schemes.

Re: Triple parity and beyond

2013-11-20 Thread Stan Hoeppner
On 11/20/2013 12:44 PM, Andrea Mazzoleni wrote:

> Yes. There are still AMD CPUs sold without SSSE3. Most notably Athlon.
> Instead, Intel is providing SSSE3 from the Core 2 Duo.

I hate branding discontinuity, due to the resulting confusion...

Athlon, Athlon64, Athlon64 X2, Athlon X2 (K10), Athlon II X2, Athlon X2
(Piledriver).  Anyone confused?

The Trinity and Richland core "Athlon X2" and "Athlon X4" branded
processors certainly do support SSSE3, as well as SSE4, AVX, etc.  These
are the dual/quad core APUs whose graphics cores don't pass QC and are
surgically disabled.  AMD decided to brand them as "Athlon" processors.
 Available since ~2011.  For example:

http://www.cpu-world.com/CPUs/Bulldozer/AMD-Athlon%20X2%20370K%20-%20AD370KOKA23HL%20-%20AD370KOKHLBOX.html

The "Athlon II X2/X3/X4" processors have been out of production for a
couple of years now, but a scant few might still be found for sale in
the channel.  The X2 is based on the clean sheet Regor dual core 45nm
design.  The X3 and X4 are Phenom II rejects with various numbers of
defective cores and defective L3 caches.  None support SSSE3.

To say "there are still AMD CPUs sold without SSSE3... Most notably
Athlon" may be technically true if some Athlon II stragglers exist in
the channel.  But it isn't really a fair statement of today's reality.
AMD hasn't manufactured a CPU without SSSE3 for a couple of years now.
And few, if any, Athlon II X2/3/4 chips lacking SSSE3 are for sale.
Though there are certainly many such chips still in deployed desktop
machines.

> A detailed list is available at: http://en.wikipedia.org/wiki/SSSE3

Never trust Wikipedia articles to be complete and up to date.  However,
it does mention Athlon X2 and X4 as planned future product in the
Piledriver lineup.  Obviously this should be updated to past tense.

-- 
Stan


Re: Triple parity and beyond

2013-11-20 Thread Stan Hoeppner
On 11/20/2013 8:46 PM, John Williams wrote:
> For myself or any machines I managed for work that do not need high
> IOPS, I would definitely choose triple- or quad-parity over RAID 51 or
> similar schemes with arrays of 16 - 32 drives.

You must see a week long rebuild as acceptable...

> No need to go into detail here 

I disagree.

> on a subject Adam Leventhal has already
> covered in detail in an article "Triple-Parity RAID and Beyond" which
> seems to match the subject of this thread quite nicely:
> 
> http://queue.acm.org/detail.cfm?id=1670144

Mr. Leventhal did not address the overwhelming problem we face, which is
(multiple) parity array reconstruction time.  He assumes the time to
simply 'populate' one drive at its max throughput is the total
reconstruction time for the array.  While this is typically true for
mirror based arrays, it is clearly not for parity arrays.

The primary focus of my comments was reducing rebuild time, thus
increasing overall reliability.  RAID 51 or something similar would
achieve this.  Thus I think we should discuss alternatives to multiple
parity in detail.

-- 
Stan




Re: Triple parity and beyond

2013-11-21 Thread Stan Hoeppner
On 11/21/2013 1:05 AM, John Williams wrote:
> On Wed, Nov 20, 2013 at 10:52 PM, Stan Hoeppner  
> wrote:
>> On 11/20/2013 8:46 PM, John Williams wrote:
>>> For myself or any machines I managed for work that do not need high
>>> IOPS, I would definitely choose triple- or quad-parity over RAID 51 or
>>> similar schemes with arrays of 16 - 32 drives.
>>
>> You must see a week long rebuild as acceptable...
> 
> It would not be a problem if it did take that long, since I would have
> extra parity units as backup in case of a failure during a rebuild.
> 
> But of course it would not take that long. Take, for example, a 24 x
> 3TB triple-parity array (21+3) that has had two drive failures
> (perhaps the rebuild started with one failure, but there was soon
> another failure). I would expect the rebuild to take about a day.

You're looking at today.  We're discussing tomorrow's needs.  Today's
6TB 3.5" drives have sustained average throughput of ~175MB/s.
Tomorrow's 20TB drives will be lucky to do 300MB/s.  As I said
previously, at that rate a straight disk-disk copy of a 20TB drive takes
18.6 hours.  This is what you get with RAID1/10/51.  In the real world,
rebuilding a failed drive in a 3P array of say 8 of these disks will
likely take at least 3 times as long, 2 days 6 hours minimum, probably
more.  This may be perfectly acceptable to some, but probably not to all.

>>> on a subject Adam Leventhal has already
>>> covered in detail in an article "Triple-Parity RAID and Beyond" which
>>> seems to match the subject of this thread quite nicely:
>>>
>>> http://queue.acm.org/detail.cfm?id=1670144
>>
>> Mr. Leventhal did not address the overwhelming problem we face, which is
>> (multiple) parity array reconstruction time.  He assumes the time to
>> simply 'populate' one drive at its max throughput is the total
>> reconstruction time for the array.
> 
> Since Adam wrote the code for RAID-Z3 for ZFS, I'm sure he is aware of
> the time to restore data to failed drives. I do not see any flaw in
> his analysis related to the time needed to restore data to failed
> drives.

He wrote that article in late 2009.  It seems pretty clear he wasn't
looking 10 years forward to 20TB drives, where the minimum mirror
rebuild time will be ~18 hours, and parity rebuild will be much greater.

-- 
Stan


Re: Triple parity and beyond

2013-11-21 Thread Stan Hoeppner
On 11/21/2013 2:08 AM, joystick wrote:
> On 21/11/2013 02:28, Stan Hoeppner wrote:
...
>> WRT rebuild times, once drives hit 20TB we're looking at 18 hours just
>> to mirror a drive at full streaming bandwidth, assuming 300MB/s
>> average--and that is probably being kind to the drive makers.  With 6 or
>> 8 of these drives, I'd guess a typical md/RAID6 rebuild will take at
>> minimum 72 hours or more, probably over 100, and probably more yet for
>> 3P.  And with larger drive count arrays the rebuild times approach a
>> week.  Whose users can go a week with degraded performance?  This is
>> simply unreasonable, at best.  I say it's completely unacceptable.
>>
>> With these gargantuan drives coming soon, the probability of multiple
>> UREs during rebuild is pretty high.
> 
> No because if you are correct about the very high CPU overhead during

I made no such claim.

> rebuild (which I don't see so dramatic as Andrea claims 500MB/sec for
> triple-parity, probably parallelizable on multiple cores), the speed of
> rebuild decreases proportionally 

The rebuild time of a parity array normally has little to do with CPU
overhead.  The bulk of the elapsed time is due to:

1.  The serial nature of the rebuild algorithm
2.  The random IO pattern of the reads
3.  The rotational latency of the drives

#3 is typically the largest portion of the elapsed time.

> and hence the stress and heating on the
> drives proportionally reduces, approximating that of normal operation.
> And how often have you seen a drive failure in a week during normal
> operation?

This depends greatly on one's normal operation.  In general, for most
users of parity arrays, any full array operation such as a rebuild or
reshape is far more taxing on the drives, in both power draw and heat
dissipation, than 'normal' operation.

> But in reality, consider that a non-naive implementation of
> multiple-parity would probably use just the single parity during
> reconstruction if just one disk fails, using the multiple parities only
> to read the stripes which are unreadable at single parity. So the speed
> and time of reconstruction and performance penalty would be that of
> raid5 except in exceptional situations of multiple failures.

That may very well be, but it doesn't change #2,3 above.

>> What I envision is an array type, something similar to RAID 51, i.e.
>> striped parity over mirror pairs. 
> 
> I don't like your approach of raid 51: it has the write overhead of
> raid5, with the waste of space of raid1.
> So it cannot be used as neither a performance array nor a capacity array.

I don't like it either.  It's a compromise.  But as RAID1/10 will soon
be unusable due to URE probability during rebuild, I think it's a
relatively good compromise for some users, some workloads.

> In the scope of this discussion (we are talking about very large
> arrays), 

Capacity yes, drive count, no.  Drive capacities are increasing at a
much faster rate than our need for storage space.  As we move forward
the trend will be building larger capacity arrays with fewer disks.

> the waste of space of your solution, higher than 50%, will make
> your solution costing double the price.

This is the classic mirror vs parity argument.  Using 1 more disk to add
parity to striped mirrors doesn't change it.  "Waste" is in the eye of
the beholder.  Anyone currently using RAID10 will have no problem
dedicating one more disk for uptime, protection.

> A competitor for the multiple-parity scheme might be raid65 or 66, but
> this is a so much dirtier approach than multiple parity if you think at
> the kind of rmw and overhead that will occur during normal operation.

Neither of those has any advantage over multi-parity.  I suggested this
approach because it retains all of the advantages of RAID10 but one.  We
sacrifice fast random write performance for protection against UREs, the
same reason behind 3P.  That's what the single parity is for, and that
alone.

I suggest that anyone in the future needing fast random write IOPS is
going to move those workloads to SSD, which is steadily increasing in
capacity.  And I suggest anyone building arrays with 10-20TB drives
isn't in need of fast random write IOPS.  Whether this approach is
valuable to anyone depends on whether the remaining attributes of
RAID10, with the added URE protection, are worth the drive count.
Obviously proponents of traditional parity arrays will not think so.
Users of RAID10 may.  Even if md never supports such a scheme, I bet
we'll see something similar to this in enterprise gear, where rebuilds
need to be 'fast' and performance degradation due to a downed drive is
not acceptable.

-- 
Stan


Re: Triple parity and beyond

2013-11-22 Thread Stan Hoeppner
Hi David,

On 11/21/2013 3:07 AM, David Brown wrote:
> On 21/11/13 02:28, Stan Hoeppner wrote:
...
>> WRT rebuild times, once drives hit 20TB we're looking at 18 hours just
>> to mirror a drive at full streaming bandwidth, assuming 300MB/s
>> average--and that is probably being kind to the drive makers.  With 6 or
>> 8 of these drives, I'd guess a typical md/RAID6 rebuild will take at
>> minimum 72 hours or more, probably over 100, and probably more yet for
>> 3P.  And with larger drive count arrays the rebuild times approach a
>> week.  Whose users can go a week with degraded performance?  This is
>> simply unreasonable, at best.  I say it's completely unacceptable.
>>
>> With these gargantuan drives coming soon, the probability of multiple
>> UREs during rebuild is pretty high.  Continuing to use ever more
>> complex parity RAID schemes simply increases rebuild time further.  The
>> longer the rebuild, the more likely a subsequent drive failure due to
>> heat buildup, vibration, etc.  Thus, in our maniacal efforts to mitigate
>> one failure mode we're increasing the probability of another.  TANSTAAFL.
>> Worse yet, RAID10 isn't going to survive because UREs on a single drive
>> are increasingly likely with these larger drives, and one URE during
>> rebuild destroys the array.


> I don't think the chances of hitting an URE during rebuild is dependent
> on the rebuild time - merely on the amount of data read during rebuild.

Please read the above paragraph again, as you misread it the first time.

>  URE rates are "per byte read" rather than "per unit time", are they not?

These are specified by the drive manufacturer, and they are per *bits*
read, not "per byte read".  Current consumer drives are typically rated
at 1 URE in 10^14 bits read, enterprise are 1 in 10^15.
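
To put a number on the rebuild risk, here is a quick sketch (my own
arithmetic, assuming UREs actually occur at the quoted rates and
independently of one another, which real drives only approximate) of
the expected URE count while reading the surviving drives during a
rebuild:

# Expected UREs while reading the surviving members of an array during
# a rebuild, assuming the manufacturer's quoted rate holds uniformly.
def expected_ures(drives_read, capacity_tb, bits_per_ure):
    bits_read = drives_read * capacity_tb * 1e12 * 8   # decimal TB -> bits
    return bits_read / bits_per_ure

# 8-drive array of 20TB disks: a rebuild reads the 7 surviving drives.
print("consumer   1 in 10^14: %.1f" % expected_ures(7, 20, 1e14))  # ~11
print("enterprise 1 in 10^15: %.1f" % expected_ures(7, 20, 1e15))  # ~1.1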

> I think you are overestimating the rebuild times a bit, but there is no

Which part?  A 20TB drive mirror taking 18 hours, or parity arrays
taking many times longer than 18 hours?

> arguing that rebuild on parity raids is a lot more work (for the cpu,
> the IO system, and the disks) than for mirror raids.

It's not so much a matter of work or interface bandwidth, but a matter
of serialization and rotational latency.

...
> Shouldn't we be talking about RAID 15 here, rather than RAID 51 ?  I
> interpret "RAID 15" to be like "RAID 10" - a raid5 set of raid1 mirrors,
> while "RAID 51" would be a raid1 mirror of raid5 sets.  I am certain
> that you mean a raid5 set of raid1 pairs - I just think you've got the
> name wrong.

Now that you mention it, yes, RAID 15 would fit much better with
convention.  Not sure why I thought 51.  So it's RAID 15 from here.

>> Potential Advantages:
>>
>> 1.  Only +1 disk capacity overhead vs RAID 10, regardless of drive count
> 
> +2 disks (the raid5 parity "disk" is a raid1 pair)

One drive of each mirror is already gone.  Make a RAID 5 of the
remaining disks and you lose 1 disk.  So you lose 1 additional disk vs
RAID 10, not 2.  As I stated previously, for RAID 15 you lose half of
your disks plus one to redundancy.
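
Counting it out, under the accounting I describe above (a throwaway
sketch, not md code; David's counting above differs):

# Disk accounting for the proposed RAID 15: RAID 5 striped across RAID 1
# pairs, counted as described above (half the drives are mirror copies,
# plus one pair's worth of capacity goes to parity).
def raid15_overhead(n_drives):
    pairs = n_drives // 2
    data_units = pairs - 1                    # RAID 5 across the mirror pairs
    return data_units, n_drives - data_units  # usable capacity, drives "lost"

for n in (8, 20):
    data, lost = raid15_overhead(n)
    print("%d drives: %d drives of capacity, %d lost to redundancy (n/2+1)"
          % (n, data, lost))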

...
>> [1]  The RAID1/5 code would need to be patched to properly handle a URE
>> encountered by the RAID1 code during rebuild.  There are surely other
>> modifications and/or optimizations that would be needed.  For large
>> sequential reads, more deterministic read interleaving between mirror
>> pairs would be a good candidate I think.  IIUC the RAID1 driver does
>> read interleaving on a per thread basis or some such, which I don't
>> believe is going to work for this "RAID 51" scenario, at least not for
>> single streaming reads.  If this can be done well, we double the read
>> performance of RAID5, and thus we don't completely "waste" all the extra
>> disks vs big_parity schemes.
>>
>> This proposed "RAID level 51" should have drastically lower rebuild
>> times vs traditional striped parity, should not suffer read/write
>> performance degradation with most disk failure scenarios, and with a
>> read interleaving optimization may have significantly greater streaming
>> read throughput as well.
>>
>> This is far from a perfect solution and I am certainly not promoting it
>> as such.  But I think it does have some serious advantages over
>> traditional striped parity schemes, and at minimum is worth discussion
>> as a counterpoint of sorts.
> 
> I don't see that there needs to be any changes to the existing md code
> to make raid15 work - it is merely a raid 5 made from a set of raid1
> pairs.  

The sole purpose of the parity layer of the proposed RAID 15 is to
replace sectors lost due to UREs during rebuild.  AFAIK the current RAID
5 and RAID 1 drivers have no code to support each other in this manner.

Re: Triple parity and beyond

2013-11-22 Thread Stan Hoeppner
On 11/21/2013 3:07 AM, David Brown wrote:

> For example, with 20 disks at 1 TB each, you can have:

All correct, and these are maximum redundancies.

Maximum:

> raid5 = 19TB, 1 disk redundancy
> raid6 = 18TB, 2 disk redundancy
> raid6.3 = 17TB, 3 disk redundancy
> raid6.4 = 16TB, 4 disk redundancy
> raid6.5 = 15TB, 5 disk redundancy


These are not fully correct, because only the minimums are stated.  With
any mirror based array one can lose half the disks as long as no two are
in one mirror.  The probability of a pair failing together is very low,
and this probability decreases even further as the number of drives in
the array increases.  This is one of the many reasons RAID 10 has been
so popular for so many years.

Minimum:

> raid10 = 10TB, 1 disk redundancy
> raid15 = 8TB, 3 disk redundancy
> raid16 = 6TB, 5 disk redundancy

Maximum:

RAID 10 = 10 disk redundancy
RAID 15 = 11 disk redundancy
RAID 16 = 12 disk redundancy

Range:

RAID 10 = 1-10 disk redundancy
RAID 15 = 3-11 disk redundancy
RAID 16 = 5-12 disk redundancy


-- 
Stan


Re: Triple parity and beyond

2013-11-22 Thread Stan Hoeppner
On 11/21/2013 5:38 PM, John Williams wrote:
> On Thu, Nov 21, 2013 at 2:57 PM, Stan Hoeppner  wrote:
>> He wrote that article in late 2009.  It seems pretty clear he wasn't
>> looking 10 years forward to 20TB drives, where the minimum mirror
>> rebuild time will be ~18 hours, and parity rebuild will be much greater.
> 
> Actually, it is completely obvious that he WAS looking ten years
> ahead, seeing as several of his graphs have time scales going to
> 2009+10 = 2019.

Only one graph goes to 2019, the rest are 2010 or less.  That being the
case, his 2019 graph deals with projected reliability of single, double,
and triple parity.

> And he specifically mentions longer rebuild times as one of the
> reasons why higher parity RAIDs are needed.

Yes, he certainly does.  But *only* in the context of the array
surviving for the duration of a rebuild.  He doesn't state that he cares
what the total duration is, he doesn't guess what it might be, nor does
he seem to care about the degraded performance before or during the
rebuild.  He is apparently of the mindset "more parity will save us,
until we need more parity, until we need more parity, until we need
more...".

Following this path, parity will eventually eat more disks of capacity
than RAID10 does today for typical array sizes, the only reason being
survival of ever-increasing rebuild durations.

This is precisely why I proposed "RAID 15".  It gives you the single
disk cloning rebuild speed of RAID 10.  When parity hits 5P then RAID 15
becomes very competitive for smaller arrays.  And since drives at that
point will be 40-50TB each, even small arrays will need lots of
protection against UREs and additional failures during massive rebuild
times.  Here I'd say RAID 15 will beat 5P hands down.

-- 
Stan


Re: Triple parity and beyond

2013-11-22 Thread Stan Hoeppner
On 11/22/2013 2:13 AM, Stan Hoeppner wrote:
> Hi David,
> 
> On 11/21/2013 3:07 AM, David Brown wrote:
...
>> I don't see that there needs to be any changes to the existing md code
>> to make raid15 work - it is merely a raid 5 made from a set of raid1
>> pairs.  
> 
> The sole purpose of the parity layer of the proposed RAID 15 is to
> replace sectors lost due to UREs during rebuild.  AFAIK the current RAID
> 5 and RAID 1 drivers have no code to support each other in this manner.

Minor self correction here-- obviously this isn't the 'sole' purpose of
the parity layer.  It also allows us to recover from losing an entire
mirror, which is a big upshot of the proposed RAID 15.  Thinking this
through a little further, more code modification would be needed for
this scenario.

In the event of a double drive failure in one mirror, the RAID 1 code
will need to be modified in such a way as to allow the RAID 5 code to
rebuild the first replacement disk, because the RAID 1 device is still
in a failed state.  Once this rebuild is complete, the RAID 1 code will
need to switch the state to degraded, and then do its standard rebuild
routine for the 2nd replacement drive.

Or, with some (likely major) hacking it should be possible to rebuild
both drives simultaneously for no loss of throughput or additional
elapsed time on the RAID 5 rebuild.  In the 20TB drive case, this would
shave 18 hours off the total rebuild operation elapsed time.  With
current 4TB drives it would still save 6.5 hours.  Losing both drives in
one mirror set of a striped array is rare, but given the rebuild time
saved it may be worth investigating during any development of this RAID
15 idea.
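
Rough numbers for the two options, using the same sustained-rate
assumptions as before (300MB/s for a 20TB drive, 175MB/s for a 4TB
drive) and the optimistic case where the RAID 5 pass runs at full
write speed:

# Elapsed time to replace a fully failed mirror pair in the proposed
# RAID 15: sequential = RAID 5 rebuilds replacement #1, then RAID 1
# clones it to #2; simultaneous = both replacements written during the
# single RAID 5 pass.
def pair_rebuild_hours(capacity_tb, mb_per_sec):
    one_pass = capacity_tb * 1_000_000 / mb_per_sec / 3600
    return one_pass * 2, one_pass             # (sequential, simultaneous)

for cap, rate in ((20, 300), (4, 175)):
    seq, sim = pair_rebuild_hours(cap, rate)
    print("%dTB @ %dMB/s: sequential %.1fh, simultaneous %.1fh, saves %.1fh"
          % (cap, rate, seq, sim, seq - sim))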

-- 
Stan


Re: Triple parity and beyond

2013-11-22 Thread Stan Hoeppner
On 11/22/2013 9:01 AM, John Williams wrote:


> I see no advantage of RAID 15, and several disadvantages.

Of course not, just as I stated previously.

On 11/22/2013 2:13 AM, Stan Hoeppner wrote:

> Parity users who currently shun RAID 10 for this reason will also
> shun this "RAID 15".

With that I'll thank you for your input from the pure parity
perspective, and end our discussion.  Any further exchange would be
pointless.

-- 
Stan


Re: Triple parity and beyond

2013-11-22 Thread Stan Hoeppner
On 11/22/2013 5:07 PM, NeilBrown wrote:
> On Thu, 21 Nov 2013 16:57:48 -0600 Stan Hoeppner 
> wrote:
> 
>> On 11/21/2013 1:05 AM, John Williams wrote:
>>> On Wed, Nov 20, 2013 at 10:52 PM, Stan Hoeppner  
>>> wrote:
>>>> On 11/20/2013 8:46 PM, John Williams wrote:
>>>>> For myself or any machines I managed for work that do not need high
>>>>> IOPS, I would definitely choose triple- or quad-parity over RAID 51 or
>>>>> similar schemes with arrays of 16 - 32 drives.
>>>>
>>>> You must see a week long rebuild as acceptable...
>>>
>>> It would not be a problem if it did take that long, since I would have
>>> extra parity units as backup in case of a failure during a rebuild.
>>>
>>> But of course it would not take that long. Take, for example, a 24 x
>>> 3TB triple-parity array (21+3) that has had two drive failures
>>> (perhaps the rebuild started with one failure, but there was soon
>>> another failure). I would expect the rebuild to take about a day.
>>
>> You're looking at today.  We're discussing tomorrow's needs.  Today's
>> 6TB 3.5" drives have sustained average throughput of ~175MB/s.
>> Tomorrow's 20TB drives will be lucky to do 300MB/s.  As I said
>> previously, at that rate a straight disk-disk copy of a 20TB drive takes
>> 18.6 hours.  This is what you get with RAID1/10/51.  In the real world,
>> rebuilding a failed drive in a 3P array of say 8 of these disks will
>> likely take at least 3 times as long, 2 days 6 hours minimum, probably
>> more.  This may be perfectly acceptable to some, but probably not to all.
> 
> Could you explain your logic here?  Why do you think rebuilding parity
> will take 3 times as long as rebuilding a copy?  Can you measure that sort of
> difference today?

I've not performed head-to-head timed rebuild tests of mirror vs parity
RAIDs.  I'm basing my elapsed-time estimate for parity RAIDs on posts
here over the past ~3 years, in which many users reported 16-24+ hour
rebuild times for their fairly wide (12-16 1-2TB drive) RAID6 arrays.

This is likely due to their chosen rebuild priority and concurrent user
load during rebuild.  Since running a rebuild at less than 100%
priority seems to be the norm, I thought it prudent to take that into
account rather than assume the theoretical minimum rebuild time.

> Presumably when we have 20TB drives we will also have more cores and quite
> possibly dedicated co-processors which will make the CPU load less
> significant.

But (when) will we have the code to fully take advantage of these?  It's
nearly 2014 and we still don't have a working threaded write model for
levels 5/6/10, though maybe soon.  Multi-core mainstream x86 CPUs have
been around for 8 years now, SMP and ccNUMA systems even longer.  So the
need has been there for a while.

I'm strictly making an observation (possibly not fully accurate) here.
I am not casting stones.  I'm not a programmer and am thus unable to
contribute code, only ideas and troubleshooting assistance for fellow
users.  Ergo I have no right/standing to complain about the rate of
feature progress.  I know that everyone hacking md is making the most of
the time they have available.  So again, not a complaint, just an
observation.

-- 
Stan


Re: Triple parity and beyond

2013-11-23 Thread Stan Hoeppner
On 11/23/2013 1:12 AM, NeilBrown wrote:
> On Fri, 22 Nov 2013 21:34:41 -0800 John Williams 

>> Even a single 8x PCIe 3.0 card has potentially over 7GB/s of bandwidth.
>>
>> Bottom line is that IO bandwidth is not a problem for a system with
>> prudently chosen hardware.

Quite right.

>> More likely is that you would be CPU limited (rather than bus limited)
>> in a high-parity rebuild where more than one drive failed. But even
>> that is not likely to be too bad, since Andrea's single-threaded
>> recovery code can recover two drives at nearly 1GB/s on one of my
>> machines. I think the code could probably be threaded to achieve a
>> multiple of that running on multiple cores.
> 
> Indeed.  It seems likely that with modern hardware, the  linear write speed
> would be the limiting factor for spinning-rust drives.

Parity array rebuilds are read-modify-write operations.  The main
difference from normal operation RMWs is that the write is always to the
same disk.  As long as the stripe reads and chunk reconstruction outrun
the write throughput then the rebuild speed should be as fast as a
mirror rebuild.  But this doesn't appear to be what people are
experiencing.  Parity rebuilds would seem to take much longer.

I have always surmised that the culprit is rotational latency, because
we're not able to get a real sector-by-sector streaming read from each
drive.  If even only one disk in the array has to wait for the platter
to come round again, the entire stripe read is slowed down by an
additional few milliseconds.  For example, in an 8 drive array let's say
each stripe read is slowed 5ms by only one of the 7 drives due to
rotational latency, maybe acoustical management, or some other firmware
hiccup in the drive.  This slows down the entire stripe read because we
can't do parity reconstruction until all chunks are in.  An 8x 2TB array
with 512KB chunk has 4 million stripes of 4MB each.  Reading 4M stripes,
that extra 5ms per stripe read costs us

(4,000,000 * 0.005)/3600 = 5.56 hours
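
The same arithmetic as a throwaway sketch; exact chunk math gives
~3.8M stripes, so slightly under the rounded 5.56 hour figure above:

# Cumulative cost of a per-stripe stall during a parity rebuild: every
# stripe read is gated by the slowest surviving drive, and the stalls
# add up across millions of stripes.
def stall_penalty_hours(drive_tb, chunk_kb, stall_ms):
    stripes = drive_tb * 1e12 / (chunk_kb * 1024)  # one chunk per drive per stripe
    return stripes * (stall_ms / 1000.0) / 3600

print("%.2f hours" % stall_penalty_hours(2, 512, 5))   # ~5.3 hours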

Now consider that arrays typically have a few years on them before the
first drive failure.  During our rebuild it's likely that some drives
will take a few rotations to return a sector that's marginal.  So  this
might slow down a stripe read by dozens of milliseconds, maybe a full
second.  If this happens to multiple drives many times throughout the
rebuild it will add even more elapsed time, possibly additional hours.

Reading stripes asynchronously or in parallel, which I assume we already
do to some degree, can mitigate these latencies to some extent.  But I
think in the overall picture, things of this nature are what is driving
parity rebuilds to dozens of hours for many people.  And as I stated
previously, when drives reach 10-20TB, this becomes far worse because
we're reading 2-10x as many stripes.  And the more drives per array the
greater the odds of incurring latency during a stripe read.

With a mirror reconstruction we can stream the reads.  Though we can't
avoid all of the drive issues above, the total number of hiccups causing
latency will be at most 1/7th those of the parity 8 drive array case.

-- 
Stan


Re: Triple parity and beyond

2013-11-24 Thread Stan Hoeppner
On 11/23/2013 11:14 PM, John Williams wrote:
> On Sat, Nov 23, 2013 at 8:03 PM, Stan Hoeppner  wrote:
> 
>> Parity array rebuilds are read-modify-write operations.  The main
>> difference from normal operation RMWs is that the write is always to the
>> same disk.  As long as the stripe reads and chunk reconstruction outrun
>> the write throughput then the rebuild speed should be as fast as a
>> mirror rebuild.  But this doesn't appear to be what people are
>> experiencing.  Parity rebuilds would seem to take much longer.
> 
> "This" doesn't appear to be what SOME people, who have reported
> issues, are experiencing. Their issues must be examined on a case by
> case basis.

Given what you state below this may very well be the case.

> But I, and a number of other people I have talked to or corresponded
> with, have had mdadm RAID 5 or RAID 6 rebuilds of one drive run at
> approximately the optimal sequential write speed of the replacement
> drive. It is not unusual on a reasonably configured system.

I freely admit I may have drawn an incorrect conclusion about md parity
rebuild performance based on incomplete data.  I simply don't recall
anyone stating here in ~3 years that their parity rebuilds were speedy,
but quite the opposite.  I guess it's possible that each one of those
cases was due to another factor, such as user load, slow CPU, bus
bottleneck, wonky disk firmware, backplane issues, etc.

-- 
Stan


Re: Triple parity and beyond

2013-11-24 Thread Stan Hoeppner
On 11/23/2013 11:19 PM, Russell Coker wrote:
> On Sun, 24 Nov 2013, Stan Hoeppner  wrote:
>> I have always surmised that the culprit is rotational latency, because
>> we're not able to get a real sector-by-sector streaming read from each
>> drive.  If even only one disk in the array has to wait for the platter
>> to come round again, the entire stripe read is slowed down by an
>> additional few milliseconds.  For example, in an 8 drive array let's say
>> each stripe read is slowed 5ms by only one of the 7 drives due to
>> rotational latency, maybe acoustical management, or some other firmware
>> hiccup in the drive.  This slows down the entire stripe read because we
>> can't do parity reconstruction until all chunks are in.  An 8x 2TB array
>> with 512KB chunk has 4 million stripes of 4MB each.  Reading 4M stripes,
>> that extra 5ms per stripe read costs us
>>
>> (4,000,000 * 0.005)/3600 = 5.56 hours
> 
> If that is the problem then the solution would be to just enable read-ahead.  
> Don't we already have that in both the OS and the disk hardware?  The hard-
> drive read-ahead buffer should at least cover the case where a seek completes 
> but the desired sector isn't under the heads.

I'm not sure if read-ahead would solve such a problem, if indeed this is
a possible problem.  AFAIK the RAID5/6 drivers process stripes serially,
not asynchronously, so I'd think the rebuild may still stall for ms at a
time in such a situation.

> RAM size is steadily increasing, it seems that the smallest that you can get 
> nowadays is 1G in a phone and for a server the smallest is probably 4G.
> 
> On the smallest system that might have an 8 disk array you should be able to 
> use 512M for buffers which allows a read-ahead of 128 chunks.
> 
>> Now consider that arrays typically have a few years on them before the
>> first drive failure.  During our rebuild it's likely that some drives
>> will take a few rotations to return a sector that's marginal.
> 
> Are you suggesting that it would be a common case that people just write data 
> to an array and never read it or do an array scrub?  I hope that it will 
> become standard practice to have a cron job scrubbing all filesystems.

Given the frequency of RAID5 double drive failure "save me!" help
requests we see on a very regular basis here, it seems pretty clear this
is exactly what many users do.

>> So  this
>> might slow down a stripe read by dozens of milliseconds, maybe a full
>> second.  If this happens to multiple drives many times throughout the
>> rebuild it will add even more elapsed time, possibly additional hours.
> 
> Have you observed such 1 second reads in practice?

We seem to have regular reports from DIY hardware users intentionally
using mismatched consumer drives, as many believe this gives them
additional protection against a firmware bug in a given drive model.
But then they often see multiple second timeouts causing drives to be
kicked, or performance to be slow, because of the mismatched drives.

In my time on this list, it seems pretty clear that the vast majority of
posters use DIY hardware, not matched, packaged, tested solutions from
the likes of Dell, HP, IBM, etc.  Some of the things I've speculated
about in my last few posts could very well occur, and indeed be caused
by, ad hoc component selection and system assembly.  Obviously not in
all DIY cases, but probably many.

-- 
Stan


> One thing I've considered doing is placing a cheap disk on a speaker cone to 
> test vibration induced performance problems.  Then I can use a PC to control 
> the level of vibration in a reasonably repeatable manner.  I'd like to see 
> what the limits are for retries.
> 
> Some years ago a company I worked for had some vibration problems which 
> dropped the contiguous read speed from about 100MB/s to about 40MB/s on some 
> parts of the disk (other parts gave full performance).  That was a serious 
> and unusual problem and it only about halved the overall speed.


Re: Triple parity and beyond

2013-11-24 Thread Stan Hoeppner
On 11/24/2013 5:53 PM, Alex Elsayed wrote:
> Stan Hoeppner wrote:
> 
>> On 11/23/2013 11:14 PM, John Williams wrote:
>>> On Sat, Nov 23, 2013 at 8:03 PM, Stan Hoeppner 
>>> wrote:
> 
>>
>>> But I, and a number of other people I have talked to or corresponded
>>> with, have had mdadm RAID 5 or RAID 6 rebuilds of one drive run at
>>> approximately the optimal sequential write speed of the replacement
>>> drive. It is not unusual on a reasonably configured system.
>>
>> I freely admit I may have drawn an incorrect conclusion about md parity
>> rebuild performance based on incomplete data.  I simply don't recall
>> anyone stating here in ~3 years that their parity rebuilds were speedy,
>> but quite the opposite.  I guess it's possible that each one of those
>> cases was due to another factor, such as user load, slow CPU, bus
>> bottleneck, wonky disk firmware, backplane issues, etc.
>>
> 
> Well, there's also the issue of selection bias - people come to the list and

"Selection bias" would infer I'm doing some kind of formal analysis,
which is obviously not the case, though I do understand the point you're
making.

> complain when their RAID is taking forever to resync. People generally don't 
> come to the list and complain when their RAID resyncs quickly and without 
> issues.

When folks report problems on linux-raid it is commonplace for others to
reply that the same feature works fine for them, that the problem may be
configuration specific, etc.  When people have reported slow RAID5/6
rebuilds in the past, and these were not always reported in direct help
requests but as "me too" posts, I don't recall others saying their
parity rebuilds are speedy.  I'm not saying nobody ever has, simply that
I don't recall such.  Which is why I've been under the impression that
parity rebuilds are generally slow for everyone.

I wish I had hardware available to perform relevant testing.  It would
be nice to have some real data on this showing apples to apples rebuild
times for the various RAID levels on the same hardware.

-- 
Stan


Re: Triple parity and beyond

2013-11-27 Thread Stan Hoeppner
Late reply.  This one got lost in the flurry of activity...

On 11/22/2013 7:24 AM, David Brown wrote:
> On 22/11/13 09:38, Stan Hoeppner wrote:
>> On 11/21/2013 3:07 AM, David Brown wrote:
>>
>>> For example, with 20 disks at 1 TB each, you can have:
>>
...
>> Maximum:
>>
>> RAID 10 = 10 disk redundancy
>> RAID 15 = 11 disk redundancy
> 
> 12 disks maximum (you have 8 with data, the rest are mirrors, parity, or
> mirrors of parity).
> 
>> RAID 16 = 12 disk redundancy
> 
> 14 disks maximum (you have 6 with data, the rest are mirrors, parity, or
> mirrors of parity).

We must follow different definitions of "redundancy".  I view redundancy
as the number of drives that can fail without taking down the array.  In
the case of the above 20 drive RAID15 that maximum is clearly 11
drives-- one of every mirror and both of one mirror can fail.  The 12th
drive failure kills the array.
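
A brute-force check of that figure, using the definition above
(redundancy = how many drives can fail without taking down the array)
for 10 mirror pairs with RAID 5 striped across them--a quick sketch,
not md code:

from itertools import combinations

# The array survives as long as at most one RAID 1 pair has lost BOTH
# members, since the RAID 5 layer tolerates the loss of one "disk" (pair).
def survives(failed, n_pairs=10):
    dead_pairs = sum(1 for p in range(n_pairs)
                     if 2 * p in failed and 2 * p + 1 in failed)
    return dead_pairs <= 1

def max_survivable(n_pairs=10):
    drives = range(2 * n_pairs)
    best = 0
    for k in range(1, 2 * n_pairs + 1):
        if any(survives(set(c), n_pairs) for c in combinations(drives, k)):
            best = k                  # some k-drive failure is survivable
        else:
            break
    return best

print(max_survivable())   # 11: one drive from every pair plus both of one pair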

>> Range:
>>
>> RAID 10 = 1-10 disk redundancy
>> RAID 15 = 3-11 disk redundancy
>> RAID 16 = 5-12 disk redundancy
>
> Yes, I know these are the minimum redundancies.  But that's a vital
> figure for reliability (even if the range is important for statistical
> averages).  When one disk in a raid10 array fails, your main concern is
> about failures or URE's in the other half of the pair - it doesn't help
> to know that another nine disks can "safely" fail too.

Knowing this is often critical from an architectural standpoint, David.
It is quite common to create the mirrors of a RAID10 across two HBAs and
two JBOD chassis.  Some call this "duplexing".  With RAID10 you know you
can lose one HBA, one cable, one JBOD (PSU, expander, etc) and not skip
a beat.  "RAID15" would work the same in this scenario.

This architecture is impossible with RAID5/6.  Any of the mentioned
failures will kill the array.

-- 
Stan