Re: Triple parity and beyond
On 11/20/2013 10:16 AM, James Plank wrote:
> Hi all -- no real comments, except as I mentioned to Ric, my tutorial
> in FAST last February presents Reed-Solomon coding with Cauchy
> matrices, and then makes special note of the common pitfall of
> assuming that you can append a Vandermonde matrix to an identity
> matrix. Please see
> http://web.eecs.utk.edu/~plank/plank/papers/2013-02-11-FAST-Tutorial.pdf,
> slides 48-52.
>
> Andrea, does the matrix that you included in an earlier mail (the one
> that has Linux RAID-6 in the first two rows) have a general form, or
> did you develop it in an ad hoc manner so that it would include Linux
> RAID-6 in the first two rows?

Hello Jim,

It's always perilous to follow a Ph.D., so I guess I'm feeling suicidal today. ;)

I'm not attempting to marginalize Andrea's work here, but I can't help but ponder what the real value of triple parity RAID is, or quad, or beyond. Some time ago parity RAID's primary mission ceased to be surviving a single drive failure, or a 2nd failure during rebuild, and became mitigating UREs during a drive rebuild. So we're now talking about dedicating 3 drives of capacity to avoiding disaster due to platter defects and secondary drive failure. For small arrays this is approaching half the array capacity. So here parity RAID has lost the battle with RAID10's capacity disadvantage, yet it still suffers vastly inferior performance in normal read/write IO, not to mention rebuild times that are 3-10x longer.

WRT rebuild times, once drives hit 20TB we're looking at 18 hours just to mirror a drive at full streaming bandwidth, assuming 300MB/s average--and that is probably being kind to the drive makers. With 6 or 8 of these drives, I'd guess a typical md/RAID6 rebuild will take at minimum 72 hours or more, probably over 100, and probably more yet for 3P. And with larger drive count arrays the rebuild times approach a week. Whose users can go a week with degraded performance? This is simply unreasonable, at best. I say it's completely unacceptable. (A back-of-envelope sketch of this arithmetic follows at the end of this message.)

With these gargantuan drives coming soon, the probability of multiple UREs during rebuild is pretty high. Continuing to use ever more complex parity RAID schemes simply increases rebuild time further. The longer the rebuild, the more likely a subsequent drive failure due to heat buildup, vibration, etc. Thus, in our maniacal efforts to mitigate one failure mode we're increasing the probability of another. TANSTAAFL. Worse yet, RAID10 isn't going to survive because UREs on a single drive are increasingly likely with these larger drives, and one URE during rebuild destroys the array.

I think people are going to have to come to grips with using more and more drives simply to brace the legs holding up their arrays; come to grips with these insane rebuild times; or bite the bullet they so steadfastly avoided with RAID10. Lots more spindles solves problems, but at a greater cost--again, no free lunch.

What I envision is an array type, something similar to RAID 51, i.e. striped parity over mirror pairs. In the case of Linux, this would need to be a new distinct md/RAID level, as both the RAID5 and RAID1 code would need enhancement before being meshed together into this new level[1].

Potential Advantages:

1. Only +1 disk capacity overhead vs RAID 10, regardless of drive count
2. Rebuild time is the same as RAID 10, unless a mirror pair is lost
3. Parity is only used during rebuild if/when a URE occurs, unless a mirror pair is lost (as in 2)
4. Single drive failure doesn't degrade the parity array; multiple failures in different mirrors don't degrade the parity array
5. Can sustain a minimum of 3 simultaneous drive failures--both drives in one mirror and one drive in another mirror
6. Can lose a maximum of 1/2 of the drives plus 1 drive--one more than RAID 10. Can lose half the drives and still not degrade parity, if no two comprise one mirror
7. Similar or possibly better read throughput vs triple parity RAID
8. Superior write performance with drives down
9. Vastly superior rebuild performance, as rebuilds will rarely, if ever, involve parity

Potential Disadvantages:

1. +1 disk overhead vs RAID 10, many more than 2/3P w/large arrays
2. Read-modify-write penalty vs RAID 10
3. Slower write throughput vs triple parity RAID due to spindle deficit
4. Development effort
5. ??

[1] The RAID1/5 code would need to be patched to properly handle a URE encountered by the RAID1 code during rebuild. There are surely other modifications and/or optimizations that would be needed. For large sequential reads, more deterministic read interleaving between mirror pairs would be a good candidate I think. IIUC the RAID1 driver does read interleaving on a per thread basis or some such, which I don't believe is going to work for this "RAID 51" scenario, at least not for single streaming reads. If this can be done well, we double the read performance of RAID5, and thus we don't completely "waste" all the extra disks vs big_parity schemes.
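A rough sketch of the arithmetic behind the figures above. All inputs are assumptions from this thread (projected sustained rates, future capacities, and a guessed 3-6x slowdown for parity rebuilds under load), not measurements:

    # Back-of-envelope rebuild estimates; every input here is an assumption.
    def mirror_rebuild_hours(capacity_tb, mb_per_s):
        # Time to stream one full drive at the assumed sustained rate.
        return capacity_tb * 1e12 / (mb_per_s * 1e6) / 3600

    for cap_tb, mb_s in [(4, 175), (20, 300)]:
        base = mirror_rebuild_hours(cap_tb, mb_s)
        print(f"{cap_tb:>2}TB @ {mb_s}MB/s: mirror copy ~{base:.1f}h, "
              f"parity rebuild at 3-6x ~{3*base:.0f}-{6*base:.0f}h")

At 300MB/s a 20TB mirror copy works out to roughly 18.5 hours, and the guessed 3-6x multiplier for a loaded parity array lands in the range of the 72-100+ hour figures quoted above.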
Re: Triple parity and beyond
On 11/20/2013 12:44 PM, Andrea Mazzoleni wrote:
> Yes. There are still AMD CPUs sold without SSSE3. Most notably Athlon.
> Instead, Intel is providing SSSE3 from the Core 2 Duo.

I hate branding discontinuity, due to the resulting confusion... Athlon, Athlon64, Athlon64 X2, Athlon X2 (K10), Athlon II X2, Athlon X2 (Piledriver). Anyone confused?

The Trinity and Richland core "Athlon X2" and "Athlon X4" branded processors certainly do support SSSE3, as well as SSE4, AVX, etc. These are the dual/quad core APUs whose graphics cores don't pass QC and are surgically disabled. AMD decided to brand them as "Athlon" processors. Available since ~2011. For example:

http://www.cpu-world.com/CPUs/Bulldozer/AMD-Athlon%20X2%20370K%20-%20AD370KOKA23HL%20-%20AD370KOKHLBOX.html

The "Athlon II X2/X3/X4" processors have been out of production for a couple of years now, but a scant few might still be found for sale in the channel. The X2 is based on the clean sheet Regor dual core 45nm design. The X3 and X4 are Phenom II rejects with various numbers of defective cores and defective L3 caches. None support SSSE3.

To say "there are still AMD CPUs sold without SSSE3... Most notably Athlon" may be technically true if some Athlon II stragglers exist in the channel. But it isn't really a fair statement of today's reality. AMD hasn't manufactured a CPU without SSSE3 for a couple of years now. And few, if any, Athlon II X2/3/4 chips lacking SSSE3 are for sale. Though there are certainly many such chips still in deployed desktop machines.

> A detailed list is available at: http://en.wikipedia.org/wiki/SSSE3

Never trust Wikipedia articles to be complete and up to date. However, it does mention Athlon X2 and X4 as planned future product in the Piledriver lineup. Obviously this should be updated to past tense.

--
Stan
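For anyone wondering whether a given box will get the SSSE3 code path, a quick check on Linux is to look for the flag the kernel reports in /proc/cpuinfo. A minimal sketch, just a convenience check and not part of any parity implementation discussed here:

    # Report whether the kernel advertises SSSE3 for this CPU (Linux only).
    def has_ssse3(cpuinfo="/proc/cpuinfo"):
        with open(cpuinfo) as f:
            for line in f:
                if line.startswith("flags"):
                    return "ssse3" in line.split()
        return False

    print("SSSE3 supported:", has_ssse3())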
Re: Triple parity and beyond
On 11/20/2013 8:46 PM, John Williams wrote:
> For myself or any machines I managed for work that do not need high
> IOPS, I would definitely choose triple- or quad-parity over RAID 51 or
> similar schemes with arrays of 16 - 32 drives.

You must see a week long rebuild as acceptable...

> No need to go into detail here

I disagree.

> on a subject Adam Leventhal has already
> covered in detail in an article "Triple-Parity RAID and Beyond" which
> seems to match the subject of this thread quite nicely:
>
> http://queue.acm.org/detail.cfm?id=1670144

Mr. Leventhal did not address the overwhelming problem we face, which is (multiple) parity array reconstruction time. He assumes the time to simply 'populate' one drive at its max throughput is the total reconstruction time for the array. While this is typically true for mirror based arrays, it is clearly not for parity arrays.

The primary focus of my comments was reducing rebuild time, thus increasing overall reliability. RAID 51 or something similar would achieve this. Thus I think we should discuss alternatives to multiple parity in detail.

--
Stan
Re: Triple parity and beyond
On 11/21/2013 1:05 AM, John Williams wrote:
> On Wed, Nov 20, 2013 at 10:52 PM, Stan Hoeppner wrote:
>> On 11/20/2013 8:46 PM, John Williams wrote:
>>> For myself or any machines I managed for work that do not need high
>>> IOPS, I would definitely choose triple- or quad-parity over RAID 51 or
>>> similar schemes with arrays of 16 - 32 drives.
>>
>> You must see a week long rebuild as acceptable...
>
> It would not be a problem if it did take that long, since I would have
> extra parity units as backup in case of a failure during a rebuild.
>
> But of course it would not take that long. Take, for example, a 24 x
> 3TB triple-parity array (21+3) that has had two drive failures
> (perhaps the rebuild started with one failure, but there was soon
> another failure). I would expect the rebuild to take about a day.

You're looking at today. We're discussing tomorrow's needs. Today's 6TB 3.5" drives have sustained average throughput of ~175MB/s. Tomorrow's 20TB drives will be lucky to do 300MB/s. As I said previously, at that rate a straight disk-disk copy of a 20TB drive takes 18.6 hours. This is what you get with RAID1/10/51. In the real world, rebuilding a failed drive in a 3P array of say 8 of these disks will likely take at least 3 times as long, 2 days 6 hours minimum, probably more. This may be perfectly acceptable to some, but probably not to all.

>>> on a subject Adam Leventhal has already
>>> covered in detail in an article "Triple-Parity RAID and Beyond" which
>>> seems to match the subject of this thread quite nicely:
>>>
>>> http://queue.acm.org/detail.cfm?id=1670144
>>
>> Mr. Leventhal did not address the overwhelming problem we face, which is
>> (multiple) parity array reconstruction time. He assumes the time to
>> simply 'populate' one drive at its max throughput is the total
>> reconstruction time for the array.
>
> Since Adam wrote the code for RAID-Z3 for ZFS, I'm sure he is aware of
> the time to restore data to failed drives. I do not see any flaw in
> his analysis related to the time needed to restore data to failed
> drives.

He wrote that article in late 2009. It seems pretty clear he wasn't looking 10 years forward to 20TB drives, where the minimum mirror rebuild time will be ~18 hours, and parity rebuild will be much greater.

--
Stan
Re: Triple parity and beyond
On 11/21/2013 2:08 AM, joystick wrote:
> On 21/11/2013 02:28, Stan Hoeppner wrote:
> ...
>> WRT rebuild times, once drives hit 20TB we're looking at 18 hours just
>> to mirror a drive at full streaming bandwidth, assuming 300MB/s
>> average--and that is probably being kind to the drive makers. With 6 or
>> 8 of these drives, I'd guess a typical md/RAID6 rebuild will take at
>> minimum 72 hours or more, probably over 100, and probably more yet for
>> 3P. And with larger drive count arrays the rebuild times approach a
>> week. Whose users can go a week with degraded performance? This is
>> simply unreasonable, at best. I say it's completely unacceptable.
>>
>> With these gargantuan drives coming soon, the probability of multiple
>> UREs during rebuild is pretty high.
>
> No because if you are correct about the very high CPU overhead during

I made no such claim.

> rebuild (which I don't see so dramatic as Andrea claims 500MB/sec for
> triple-parity, probably parallelizable on multiple cores), the speed of
> rebuild decreases proportionally

The rebuild time of a parity array normally has little to do with CPU overhead. The bulk of the elapsed time is due to:

1. The serial nature of the rebuild algorithm
2. The random IO pattern of the reads
3. The rotational latency of the drives

#3 is typically the largest portion of the elapsed time.

> and hence the stress and heating on the
> drives proportionally reduces, approximating that of normal operation.
> And how often have you seen a drive failure in a week during normal
> operation?

This depends greatly on one's normal operation. In general, for most users of parity arrays, any full array operation such as a rebuild or reshape is far more taxing on the drives, in both power draw and heat dissipation, than 'normal' operation.

> But in reality, consider that a non-naive implementation of
> multiple-parity would probably use just the single parity during
> reconstruction if just one disk fails, using the multiple parities only
> to read the stripes which are unreadable at single parity. So the speed
> and time of reconstruction and performance penalty would be that of
> raid5 except in exceptional situations of multiple failures.

That may very well be, but it doesn't change #2 and #3 above.

>> What I envision is an array type, something similar to RAID 51, i.e.
>> striped parity over mirror pairs.
>
> I don't like your approach of raid 51: it has the write overhead of
> raid5, with the waste of space of raid1.
> So it cannot be used as neither a performance array nor a capacity array.

I don't like it either. It's a compromise. But as RAID1/10 will soon be unusable due to URE probability during rebuild, I think it's a relatively good compromise for some users, some workloads.

> In the scope of this discussion (we are talking about very large
> arrays),

Capacity yes, drive count, no. Drive capacities are increasing at a much faster rate than our need for storage space. As we move forward the trend will be building larger capacity arrays with fewer disks.

> the waste of space of your solution, higher than 50%, will make
> your solution costing double the price.

This is the classic mirror vs parity argument. Using 1 more disk to add parity to striped mirrors doesn't change it. "Waste" is in the eye of the beholder. Anyone currently using RAID10 will have no problem dedicating one more disk for uptime, protection.
> A competitor for the multiple-parity scheme might be raid65 or 66, but
> this is a so much dirtier approach than multiple parity if you think at
> the kind of rmw and overhead that will occur during normal operation.

Neither of those has any advantage over multi-parity. I suggested this approach because it retains all of the advantages of RAID10 but one. We sacrifice fast random write performance for protection against UREs, the same reason behind 3P. That's what the single parity is for, and that alone. I suggest that anyone in the future needing fast random write IOPS is going to move those workloads to SSD, which is steadily increasing in capacity. And I suggest anyone building arrays with 10-20TB drives isn't in need of fast random write IOPS.

Whether this approach is valuable to anyone depends on whether the remaining attributes of RAID10, with the added URE protection, are worth the drive count. Obviously proponents of traditional parity arrays will not think so. Users of RAID10 may. Even if md never supports such a scheme, I bet we'll see something similar to this in enterprise gear, where rebuilds need to be 'fast' and performance degradation due to a downed drive is not acceptable.

--
Stan
Re: Triple parity and beyond
Hi David,

On 11/21/2013 3:07 AM, David Brown wrote:
> On 21/11/13 02:28, Stan Hoeppner wrote:
> ...
>> WRT rebuild times, once drives hit 20TB we're looking at 18 hours just
>> to mirror a drive at full streaming bandwidth, assuming 300MB/s
>> average--and that is probably being kind to the drive makers. With 6 or
>> 8 of these drives, I'd guess a typical md/RAID6 rebuild will take at
>> minimum 72 hours or more, probably over 100, and probably more yet for
>> 3P. And with larger drive count arrays the rebuild times approach a
>> week. Whose users can go a week with degraded performance? This is
>> simply unreasonable, at best. I say it's completely unacceptable.
>>
>> With these gargantuan drives coming soon, the probability of multiple
>> UREs during rebuild is pretty high. Continuing to use ever more
>> complex parity RAID schemes simply increases rebuild time further. The
>> longer the rebuild, the more likely a subsequent drive failure due to
>> heat buildup, vibration, etc. Thus, in our maniacal efforts to mitigate
>> one failure mode we're increasing the probability of another. TANSTAAFL.
>> Worse yet, RAID10 isn't going to survive because UREs on a single drive
>> are increasingly likely with these larger drives, and one URE during
>> rebuild destroys the array.
>
> I don't think the chances of hitting an URE during rebuild is dependent
> on the rebuild time - merely on the amount of data read during rebuild.

Please read the above paragraph again, as you misread it the first time.

> URE rates are "per byte read" rather than "per unit time", are they not?

These are specified by the drive manufacturer, and they are per *bits* read, not "per byte read". Current consumer drives are typically rated at 1 URE in 10^14 bits read, enterprise at 1 in 10^15. (A quick probability sketch follows at the end of this message.)

> I think you are overestimating the rebuild times a bit, but there is no

Which part? A 20TB drive mirror taking 18 hours, or parity arrays taking many times longer than 18 hours?

> arguing that rebuild on parity raids is a lot more work (for the cpu,
> the IO system, and the disks) than for mirror raids.

It's not so much a matter of work or interface bandwidth, but a matter of serialization and rotational latency.

...

> Shouldn't we be talking about RAID 15 here, rather than RAID 51 ? I
> interpret "RAID 15" to be like "RAID 10" - a raid5 set of raid1 mirrors,
> while "RAID 51" would be a raid1 mirror of raid5 sets. I am certain
> that you mean a raid5 set of raid1 pairs - I just think you've got the
> name wrong.

Now that you mention it, yes, RAID 15 would fit much better with convention. Not sure why I thought 51. So it's RAID 15 from here.

>> Potential Advantages:
>>
>> 1. Only +1 disk capacity overhead vs RAID 10, regardless of drive count
>
> +2 disks (the raid5 parity "disk" is a raid1 pair)

One drive of each mirror is already gone. Make a RAID 5 of the remaining disks and you lose 1 disk. So you lose 1 additional disk vs RAID 10, not 2. As I stated previously, for RAID 15 you lose (n/2)+1 of your disks to redundancy.

...

>> [1] The RAID1/5 code would need to be patched to properly handle a URE
>> encountered by the RAID1 code during rebuild. There are surely other
>> modifications and/or optimizations that would be needed. For large
>> sequential reads, more deterministic read interleaving between mirror
>> pairs would be a good candidate I think.
>> IIUC the RAID1 driver does read interleaving on a per thread basis or
>> some such, which I don't believe is going to work for this "RAID 51"
>> scenario, at least not for single streaming reads. If this can be done
>> well, we double the read performance of RAID5, and thus we don't
>> completely "waste" all the extra disks vs big_parity schemes.
>>
>> This proposed "RAID level 51" should have drastically lower rebuild
>> times vs traditional striped parity, should not suffer read/write
>> performance degradation with most disk failure scenarios, and with a
>> read interleaving optimization may have significantly greater streaming
>> read throughput as well.
>>
>> This is far from a perfect solution and I am certainly not promoting it
>> as such. But I think it does have some serious advantages over
>> traditional striped parity schemes, and at minimum is worth discussion
>> as a counterpoint of sorts.
>
> I don't see that there needs to be any changes to the existing md code
> to make raid15 work - it is merely a raid 5 made from a set of raid1
> pairs.

The sole purpose of the parity layer of the proposed RAID 15 is to replace sectors lost due to UREs during rebuild. AFAIK the current RAID 5 and RAID 1 drivers have no code to support each other in this manner.
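To put rough numbers on the URE exposure discussed above: treating the published rates as independent per-bit probabilities (a simplification -- real drives cluster errors, and the 10^-14 figure is a spec ceiling, not a measurement), the chance of hitting at least one URE while reading a whole drive during a mirror rebuild looks like this:

    import math

    # Chance of at least one URE while reading `tb_read` terabytes,
    # assuming independent per-bit errors at the quoted spec rates.
    def p_at_least_one_ure(tb_read, ure_per_bit):
        bits = tb_read * 1e12 * 8
        return 1 - math.exp(-bits * ure_per_bit)   # Poisson approximation

    for tb in (4, 20):
        for rate, label in ((1e-14, "consumer 10^-14"), (1e-15, "enterprise 10^-15")):
            print(f"read {tb:>2}TB, {label:18s}: {p_at_least_one_ure(tb, rate):4.0%}")

Reading a 20TB consumer drive end to end comes out around an 80% chance of at least one URE under this model, which is the heart of the argument that plain RAID1/10 rebuilds stop being survivable at these capacities.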
Re: Triple parity and beyond
On 11/21/2013 3:07 AM, David Brown wrote:
> For example, with 20 disks at 1 TB each, you can have:

All correct, and these are maximum redundancies.

Maximum:

> raid5 = 19TB, 1 disk redundancy
> raid6 = 18TB, 2 disk redundancy
> raid6.3 = 17TB, 3 disk redundancy
> raid6.4 = 16TB, 4 disk redundancy
> raid6.5 = 15TB, 5 disk redundancy

These are not fully correct, because only the minimums are stated. With any mirror based array one can lose half the disks as long as no two are in one mirror. The probability of a pair failing together is very low, and this probability decreases even further as the number of drives in the array increases. This is one of the many reasons RAID 10 has been so popular for so many years.

Minimum:

> raid10 = 10TB, 1 disk redundancy
> raid15 = 8TB, 3 disk redundancy
> raid16 = 6TB, 5 disk redundancy

Maximum:

RAID 10 = 10 disk redundancy
RAID 15 = 11 disk redundancy
RAID 16 = 12 disk redundancy

Range:

RAID 10 = 1-10 disk redundancy
RAID 15 = 3-11 disk redundancy
RAID 16 = 5-12 disk redundancy

--
Stan
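The minimum/maximum figures above can be checked by brute force. A small sketch, assuming 20 drives arranged as 10 mirror pairs with 0, 1, or 2 "units" of outer parity standing in for RAID 10, 15, and 16 (exhaustive over all failure combinations, so it takes a few seconds to run):

    from itertools import combinations

    def survives(failed, pairs, parity_units):
        # A mirror "unit" is lost only when both of its drives fail; the outer
        # stripe tolerates up to parity_units lost units (0=RAID10, 1=15, 2=16).
        failed = set(failed)
        dead_units = sum(1 for a, b in pairs if a in failed and b in failed)
        return dead_units <= parity_units

    def redundancy_range(n_drives, parity_units):
        pairs = [(i, i + 1) for i in range(0, n_drives, 2)]
        guaranteed = best_case = 0
        for k in range(1, n_drives + 1):
            outcomes = [survives(c, pairs, parity_units)
                        for c in combinations(range(n_drives), k)]
            if all(outcomes):
                guaranteed = k    # every possible k-drive failure is survivable
            if any(outcomes):
                best_case = k     # some k-drive failure pattern is survivable
        return guaranteed, best_case

    for name, p in (("RAID 10", 0), ("RAID 15", 1), ("RAID 16", 2)):
        print(name, redundancy_range(20, p))

This prints (1, 10), (3, 11) and (5, 12), matching the minimum and maximum columns above.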
Re: Triple parity and beyond
On 11/21/2013 5:38 PM, John Williams wrote:
> On Thu, Nov 21, 2013 at 2:57 PM, Stan Hoeppner wrote:
>> He wrote that article in late 2009. It seems pretty clear he wasn't
>> looking 10 years forward to 20TB drives, where the minimum mirror
>> rebuild time will be ~18 hours, and parity rebuild will be much greater.
>
> Actually, it is completely obvious that he WAS looking ten years
> ahead, seeing as several of his graphs have time scales going to
> 2009+10 = 2019.

Only one graph goes to 2019, the rest are 2010 or less. That being the case, his 2019 graph deals with projected reliability of single, double, and triple parity.

> And he specifically mentions longer rebuild times as one of the
> reasons why higher parity RAIDs are needed.

Yes, he certainly does. But *only* in the context of the array surviving for the duration of a rebuild. He doesn't state that he cares what the total duration is, he doesn't guess what it might be, nor does he seem to care about the degraded performance before or during the rebuild. He is apparently of the mindset "more parity will save us, until we need more parity, until we need more parity, until we need more...".

Following this path, parity will eventually eat more disks of capacity than RAID10 does today for average array counts, and the only reason for it being survival of ever increasing rebuild duration. This is precisely why I proposed "RAID 15". It gives you the single disk cloning rebuild speed of RAID 10. When parity hits 5P then RAID 15 becomes very competitive for smaller arrays. And since drives at that point will be 40-50TB each, even small arrays will need lots of protection against UREs and additional failures during massive rebuild times. Here I'd say RAID 15 will beat 5P hands down.

--
Stan
Re: Triple parity and beyond
On 11/22/2013 2:13 AM, Stan Hoeppner wrote:
> Hi David,
>
> On 11/21/2013 3:07 AM, David Brown wrote:
> ...
>> I don't see that there needs to be any changes to the existing md code
>> to make raid15 work - it is merely a raid 5 made from a set of raid1
>> pairs.
>
> The sole purpose of the parity layer of the proposed RAID 15 is to
> replace sectors lost due to UREs during rebuild. AFAIK the current RAID
> 5 and RAID 1 drivers have no code to support each other in this manner.

Minor self correction here--obviously this isn't the 'sole' purpose of the parity layer. It also allows us to recover from losing an entire mirror, which is a big upshot of the proposed RAID 15.

Thinking this through a little further, more code modification would be needed for this scenario. In the event of a double drive failure in one mirror, the RAID 1 code will need to be modified in such a way as to allow the RAID 5 code to rebuild the first replacement disk, because the RAID 1 device is still in a failed state. Once this rebuild is complete, the RAID 1 code will need to switch the state to degraded, and then do its standard rebuild routine for the 2nd replacement drive.

Or, with some (likely major) hacking it should be possible to rebuild both drives simultaneously for no loss of throughput or additional elapsed time on the RAID 5 rebuild. In the 20TB drive case, this would shave 18 hours off the total rebuild operation elapsed time. With current 4TB drives it would still save 6.5 hours. Losing both drives in one mirror set of a striped array is rare, but given the rebuild time saved it may be worth investigating during any development of this RAID 15 idea.

--
Stan
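Rough timings for the two strategies described above, reusing the thread's assumed drive speeds (hypothetical figures, not measurements):

    # Rebuilding a doubly-failed mirror pair in the proposed "RAID 15":
    # sequentially (parity rebuild of one replacement, then a RAID1 resync
    # onto the second) versus writing both replacements in a single pass.
    for cap_tb, mb_s in ((4, 175), (20, 300)):
        one_pass = cap_tb * 1e12 / (mb_s * 1e6) / 3600   # hours per full-drive pass
        print(f"{cap_tb:>2}TB @ {mb_s}MB/s: sequential ~{2 * one_pass:.1f}h, "
              f"simultaneous ~{one_pass:.1f}h, saving ~{one_pass:.1f}h")

This ignores any parity-rebuild slowdown on the first pass, so it is the most optimistic version of the comparison, but it shows where the ~18 hour and ~6.5 hour savings come from.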
Re: Triple parity and beyond
On 11/22/2013 9:01 AM, John Williams wrote:
> I see no advantage of RAID 15, and several disadvantages.

Of course not, just as I stated previously.

On 11/22/2013 2:13 AM, Stan Hoeppner wrote:
> Parity users who currently shun RAID 10 for this reason will also
> shun this "RAID 15".

With that I'll thank you for your input from the pure parity perspective, and end our discussion. Any further exchange would be pointless.

--
Stan
Re: Triple parity and beyond
On 11/22/2013 5:07 PM, NeilBrown wrote:
> On Thu, 21 Nov 2013 16:57:48 -0600 Stan Hoeppner wrote:
>
>> On 11/21/2013 1:05 AM, John Williams wrote:
>>> On Wed, Nov 20, 2013 at 10:52 PM, Stan Hoeppner wrote:
>>>> On 11/20/2013 8:46 PM, John Williams wrote:
>>>>> For myself or any machines I managed for work that do not need high
>>>>> IOPS, I would definitely choose triple- or quad-parity over RAID 51 or
>>>>> similar schemes with arrays of 16 - 32 drives.
>>>>
>>>> You must see a week long rebuild as acceptable...
>>>
>>> It would not be a problem if it did take that long, since I would have
>>> extra parity units as backup in case of a failure during a rebuild.
>>>
>>> But of course it would not take that long. Take, for example, a 24 x
>>> 3TB triple-parity array (21+3) that has had two drive failures
>>> (perhaps the rebuild started with one failure, but there was soon
>>> another failure). I would expect the rebuild to take about a day.
>>
>> You're looking at today. We're discussing tomorrow's needs. Today's
>> 6TB 3.5" drives have sustained average throughput of ~175MB/s.
>> Tomorrow's 20TB drives will be lucky to do 300MB/s. As I said
>> previously, at that rate a straight disk-disk copy of a 20TB drive takes
>> 18.6 hours. This is what you get with RAID1/10/51. In the real world,
>> rebuilding a failed drive in a 3P array of say 8 of these disks will
>> likely take at least 3 times as long, 2 days 6 hours minimum, probably
>> more. This may be perfectly acceptable to some, but probably not to all.
>
> Could you explain your logic here? Why do you think rebuilding parity
> will take 3 times as long as rebuilding a copy? Can you measure that sort of
> difference today?

I've not performed head-to-head timed rebuild tests of mirror vs parity RAIDs. I'm making the elapsed time guess for parity RAIDs based on posts here over the past ~3 years, in which many users reported 16-24+ hour rebuild times for their fairly wide (12-16 drives of 1-2TB) RAID6 arrays. This is likely due to their chosen rebuild priority and concurrent user load during rebuild. Since this seems to be the norm, instead of giving 100% to the rebuild, I thought it prudent to take this into account, instead of the theoretical minimum rebuild time.

> Presumably when we have 20TB drives we will also have more cores and quite
> possibly dedicated co-processors which will make the CPU load less
> significant.

But when will we have the code to fully take advantage of these? It's nearly 2014 and we still don't have a working threaded write model for levels 5/6/10, though maybe soon. Multi-core mainstream x86 CPUs have been around for 8 years now, SMP and ccNUMA systems even longer. So the need has been there for a while.

I'm strictly making an observation (possibly not fully accurate) here. I am not casting stones. I'm not a programmer and am thus unable to contribute code, only ideas and troubleshooting assistance for fellow users. Ergo I have no right/standing to complain about the rate of feature progress. I know that everyone hacking md is making the most of the time they have available. So again, not a complaint, just an observation.

--
Stan
Re: Triple parity and beyond
On 11/23/2013 1:12 AM, NeilBrown wrote:
> On Fri, 22 Nov 2013 21:34:41 -0800 John Williams wrote:
>> Even a single 8x PCIe 3.0 card has potentially over 7GB/s of bandwidth.
>>
>> Bottom line is that IO bandwidth is not a problem for a system with
>> prudently chosen hardware.

Quite right.

>> More likely is that you would be CPU limited (rather than bus limited)
>> in a high-parity rebuild where more than one drive failed. But even
>> that is not likely to be too bad, since Andrea's single-threaded
>> recovery code can recover two drives at nearly 1GB/s on one of my
>> machines. I think the code could probably be threaded to achieve a
>> multiple of that running on multiple cores.
>
> Indeed. It seems likely that with modern hardware, the linear write speed
> would be the limiting factor for spinning-rust drives.

Parity array rebuilds are read-modify-write operations. The main difference from normal operation RMWs is that the write is always to the same disk. As long as the stripe reads and chunk reconstruction outrun the write throughput then the rebuild speed should be as fast as a mirror rebuild. But this doesn't appear to be what people are experiencing. Parity rebuilds would seem to take much longer.

I have always surmised that the culprit is rotational latency, because we're not able to get a real sector-by-sector streaming read from each drive. If even only one disk in the array has to wait for the platter to come round again, the entire stripe read is slowed down by an additional few milliseconds. For example, in an 8 drive array let's say each stripe read is slowed 5ms by only one of the 7 drives due to rotational latency, maybe acoustical management, or some other firmware hiccup in the drive. This slows down the entire stripe read because we can't do parity reconstruction until all chunks are in. An 8x 2TB array with 512KB chunk has 4 million stripes of 4MB each. Reading 4M stripes, that extra 5ms per stripe read costs us

(4,000,000 * 0.005)/3600 = 5.56 hours

Now consider that arrays typically have a few years on them before the first drive failure. During our rebuild it's likely that some drives will take a few rotations to return a sector that's marginal. So this might slow down a stripe read by dozens of milliseconds, maybe a full second. If this happens to multiple drives many times throughout the rebuild it will add even more elapsed time, possibly additional hours.

Reading stripes asynchronously or in parallel, which I assume we already do to some degree, can mitigate these latencies to some extent. But I think in the overall picture, things of this nature are what is driving parity rebuilds to dozens of hours for many people. And as I stated previously, when drives reach 10-20TB, this becomes far worse because we're reading 2-10x as many stripes. And the more drives per array the greater the odds of incurring latency during a stripe read.

With a mirror reconstruction we can stream the reads. Though we can't avoid all of the drive issues above, the total number of hiccups causing latency will be at most 1/7th those of the parity 8 drive array case.

--
Stan
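The arithmetic above generalizes easily; here is the same model with the post's assumed numbers, plus a variant where a small fraction of stripe reads stall much longer on marginal sectors (all of it illustrative, none of it measured):

    # Per-stripe latency penalty model for a parity rebuild.
    stripes = 4_000_000      # ~2TB per member / 512KB chunks, as above
    stall_s = 0.005          # one drive adds a 5ms hiccup to every stripe read
    print(f"uniform 5ms stalls: {stripes * stall_s / 3600:.2f} extra hours")

    # If 1% of stripe reads instead stall a full second (marginal sectors
    # needing extra rotations), the penalty grows substantially.
    slow = 0.01
    mean_stall = (1 - slow) * stall_s + slow * 1.0
    print(f"with 1% one-second stalls: {stripes * mean_stall / 3600:.2f} extra hours")

The first case reproduces the 5.56 hour figure above; the second lands around 16-17 hours, which is the "possibly additional hours" scenario.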
Re: Triple parity and beyond
On 11/23/2013 11:14 PM, John Williams wrote:
> On Sat, Nov 23, 2013 at 8:03 PM, Stan Hoeppner wrote:
>
>> Parity array rebuilds are read-modify-write operations. The main
>> difference from normal operation RMWs is that the write is always to the
>> same disk. As long as the stripe reads and chunk reconstruction outrun
>> the write throughput then the rebuild speed should be as fast as a
>> mirror rebuild. But this doesn't appear to be what people are
>> experiencing. Parity rebuilds would seem to take much longer.
>
> "This" doesn't appear to be what SOME people, who have reported
> issues, are experiencing. Their issues must be examined on a case by
> case basis.

Given what you state below this may very well be the case.

> But I, and a number of other people I have talked to or corresponded
> with, have had mdadm RAID 5 or RAID 6 rebuilds of one drive run at
> approximately the optimal sequential write speed of the replacement
> drive. It is not unusual on a reasonably configured system.

I freely admit I may have drawn an incorrect conclusion about md parity rebuild performance based on incomplete data. I simply don't recall anyone stating here in ~3 years that their parity rebuilds were speedy, but quite the opposite. I guess it's possible that each one of those cases was due to another factor, such as user load, slow CPU, bus bottleneck, wonky disk firmware, backplane issues, etc.

--
Stan
Re: Triple parity and beyond
On 11/23/2013 11:19 PM, Russell Coker wrote:
> On Sun, 24 Nov 2013, Stan Hoeppner wrote:
>> I have always surmised that the culprit is rotational latency, because
>> we're not able to get a real sector-by-sector streaming read from each
>> drive. If even only one disk in the array has to wait for the platter
>> to come round again, the entire stripe read is slowed down by an
>> additional few milliseconds. For example, in an 8 drive array let's say
>> each stripe read is slowed 5ms by only one of the 7 drives due to
>> rotational latency, maybe acoustical management, or some other firmware
>> hiccup in the drive. This slows down the entire stripe read because we
>> can't do parity reconstruction until all chunks are in. An 8x 2TB array
>> with 512KB chunk has 4 million stripes of 4MB each. Reading 4M stripes,
>> that extra 5ms per stripe read costs us
>>
>> (4,000,000 * 0.005)/3600 = 5.56 hours
>
> If that is the problem then the solution would be to just enable read-ahead.
> Don't we already have that in both the OS and the disk hardware? The
> hard-drive read-ahead buffer should at least cover the case where a seek
> completes but the desired sector isn't under the heads.

I'm not sure if read-ahead would solve such a problem, if indeed this is a possible problem. AFAIK the RAID5/6 drivers process stripes serially, not asynchronously, so I'd think the rebuild may still stall for milliseconds at a time in such a situation.

> RAM size is steadily increasing, it seems that the smallest that you can get
> nowadays is 1G in a phone and for a server the smallest is probably 4G.
>
> On the smallest system that might have an 8 disk array you should be able to
> use 512M for buffers which allows a read-ahead of 128 chunks.
>
>> Now consider that arrays typically have a few years on them before the
>> first drive failure. During our rebuild it's likely that some drives
>> will take a few rotations to return a sector that's marginal.
>
> Are you suggesting that it would be a common case that people just write data
> to an array and never read it or do an array scrub? I hope that it will
> become standard practice to have a cron job scrubbing all filesystems.

Given the frequency of RAID5 double drive failure "save me!" help requests we see on a very regular basis here, it seems pretty clear this is exactly what many users do.

>> So this
>> might slow down a stripe read by dozens of milliseconds, maybe a full
>> second. If this happens to multiple drives many times throughout the
>> rebuild it will add even more elapsed time, possibly additional hours.
>
> Have you observed such 1 second reads in practice?

We seem to have regular reports from DIY hardware users intentionally using mismatched consumer drives, as many believe this gives them additional protection against a firmware bug in a given drive model. But then they often see multiple second timeouts causing drives to be kicked, or performance to be slow, because of the mismatched drives.

In my time on this list, it seems pretty clear that the vast majority of posters use DIY hardware, not matched, packaged, tested solutions from the likes of Dell, HP, IBM, etc. Some of the things I've speculated about in my last few posts could very well occur, and indeed be caused by, ad hoc component selection and system assembly. Obviously not in all DIY cases, but probably many.

--
Stan

> One thing I've considered doing is placing a cheap disk on a speaker cone to
> test vibration induced performance problems. Then I can use a PC to control
> the level of vibration in a reasonably repeatable manner. I'd like to see
> what the limits are for retries.
>
> Some years ago a company I worked for had some vibration problems which
> dropped the contiguous read speed from about 100MB/s to about 40MB/s on some
> parts of the disk (other parts gave full performance). That was a serious and
> unusual problem and it only about halved the overall speed.
Re: Triple parity and beyond
On 11/24/2013 5:53 PM, Alex Elsayed wrote:
> Stan Hoeppner wrote:
>> On 11/23/2013 11:14 PM, John Williams wrote:
>>> On Sat, Nov 23, 2013 at 8:03 PM, Stan Hoeppner wrote:
>>>
>>> But I, and a number of other people I have talked to or corresponded
>>> with, have had mdadm RAID 5 or RAID 6 rebuilds of one drive run at
>>> approximately the optimal sequential write speed of the replacement
>>> drive. It is not unusual on a reasonably configured system.
>>
>> I freely admit I may have drawn an incorrect conclusion about md parity
>> rebuild performance based on incomplete data. I simply don't recall
>> anyone stating here in ~3 years that their parity rebuilds were speedy,
>> but quite the opposite. I guess it's possible that each one of those
>> cases was due to another factor, such as user load, slow CPU, bus
>> bottleneck, wonky disk firmware, backplane issues, etc.
>
> Well, there's also the issue of selection bias - people come to the list and

"Selection bias" would imply I'm doing some kind of formal analysis, which is obviously not the case, though I do understand the point you're making.

> complain when their RAID is taking forever to resync. People generally don't
> come to the list and complain when their RAID resyncs quickly and without
> issues.

When folks report problems on linux-raid it is commonplace for others to reply that the same feature works fine for them, that the problem may be configuration specific, etc. When people have reported slow RAID5/6 rebuilds in the past, and these were not always reported in direct help requests but as "me too" posts, I don't recall others saying their parity rebuilds are speedy. I'm not saying nobody ever has, simply that I don't recall such. Which is why I've been under the impression that parity rebuilds are generally slow for everyone.

I wish I had hardware available to perform relevant testing. It would be nice to have some real data on this showing apples to apples rebuild times for the various RAID levels on the same hardware.

--
Stan
Re: Triple parity and beyond
Late reply. This one got lost in the flurry of activity...

On 11/22/2013 7:24 AM, David Brown wrote:
> On 22/11/13 09:38, Stan Hoeppner wrote:
>> On 11/21/2013 3:07 AM, David Brown wrote:
>>
>>> For example, with 20 disks at 1 TB each, you can have:
>> ...
>> Maximum:
>>
>> RAID 10 = 10 disk redundancy
>> RAID 15 = 11 disk redundancy
>
> 12 disks maximum (you have 8 with data, the rest are mirrors, parity, or
> mirrors of parity).
>
>> RAID 16 = 12 disk redundancy
>
> 14 disks maximum (you have 6 with data, the rest are mirrors, parity, or
> mirrors of parity).

We must follow different definitions of "redundancy". I view redundancy as the number of drives that can fail without taking down the array. In the case of the above 20 drive RAID15 that maximum is clearly 11 drives--one of every mirror and both of one mirror can fail. The 12th drive failure kills the array.

>> Range:
>>
>> RAID 10 = 1-10 disk redundancy
>> RAID 15 = 3-11 disk redundancy
>> RAID 16 = 5-12 disk redundancy
>
> Yes, I know these are the minimum redundancies. But that's a vital
> figure for reliability (even if the range is important for statistical
> averages). When one disk in a raid10 array fails, your main concern is
> about failures or URE's in the other half of the pair - it doesn't help
> to know that another nine disks can "safely" fail too.

Knowing this is often critical from an architectural standpoint, David. It is quite common to create the mirrors of a RAID10 across two HBAs and two JBOD chassis. Some call this "duplexing". With RAID10 you know you can lose one HBA, one cable, one JBOD (PSU, expander, etc.) and not skip a beat. "RAID15" would work the same in this scenario. This architecture is impossible with RAID5/6. Any of the mentioned failures will kill the array.

--
Stan