Re: [OpenIndiana-discuss] vdev reliability was: Recommendations for fast storage
On 21.04.2013 06:35, openindiana-discuss-requ...@openindiana.org wrote:

On Sat, 20 Apr 2013 21:13:09 -0700, Richard Elling richard.ell...@richardelling.com wrote:

Terminology warning below…

On Apr 18, 2013, at 3:46 AM, Sebastian Gabler sequoiamo...@gmx.net wrote:

On 18.04.2013 03:09, openindiana-discuss-requ...@openindiana.org wrote:

On Wed, 17 Apr 2013 13:21:08 -0600, Jan Owoc jso...@gmail.com wrote:

On Wed, Apr 17, 2013 at 12:57 PM, Timothy Coalson tsc...@mst.edu wrote:

On Wed, Apr 17, 2013 at 7:38 AM, Edward Ned Harvey (openindiana) openindi...@nedharvey.com wrote:

You also said the raidz2 will offer more protection against failure, because you can survive any two disk failures (but no more.) I would argue this is incorrect (I've done the probability analysis before), mostly because the resilver time in the mirror configuration is 8x to 16x faster (there's 1/8 as much data to resilver, and IOPS is limited by a single disk, not the worst of several disks, which introduces another factor of up to 2x, increasing the 8x to as high as 16x), so the smaller resilver window means a lower probability of concurrent failures on the critical vdev. We're talking about 12 hours versus 1 week, the actual result of my machines in production.

Did you also compare the probability of bit errors causing data loss without a complete pool failure? 2-way mirrors, when one device completely dies, have no redundancy on that data, and the copy that remains must be perfect or some data will be lost. On the other hand, raid-z2 will still have available redundancy, allowing every single block to have a bad read on any single component disk without losing data. I haven't done the math on this, but I seem to recall some papers claiming that this is the more likely route to lost data on modern disks, by comparing bit error rate and capacity. Of course, a second outright failure puts raid-z2 in a much worse boat than 2-way mirrors, which is a reason for raid-z3, but this may already be a less likely case.

Richard Elling wrote a blog post about mean time to data loss [1]. A few years later he graphed out a few cases for typical values of resilver times [2].

[1] https://blogs.oracle.com/relling/entry/a_story_of_two_mttdl
[2] http://blog.richardelling.com/2010/02/zfs-data-protection-comparison.html

Cheers, Jan

Notably, Richard's models posted do not include BER. Nevertheless it's an important factor. [..] /snip

From the back of my mind it will impact reliability in different ways in ZFS:
- Bit error in metadata (ZFS should save us by metadata redundancy)
- Bit error in full stripe data
- Bit error in parity data

These aren't interesting from a system design perspective. To enhance the model to deal with this, we just need to determine what percentage of the overall space contains copied data. There is no general answer, but for most systems it will be a small percentage of the total, as compared to data. In this respect, the models are worst-case, which is what we want to use for design evaluations.

As others already pointed out, disk-based read errors come into focus when you read from an array/vdev that has no more redundancy. Indeed, there is no difference between parity and stripe data. There is, however, a difference for metadata when it is stored redundantly, even in a non-redundant vdev layout. That is the case in ZFS.

NB, traditional RAID systems don't know what is data and what is not data, so they can run into uncorrectable errors in sectors that don't actually contain data. This becomes more important for those systems which use a destructive scrub, as opposed to ZFS's read-only scrub. Hence, some studies have shown that scrubbing can propagate errors in non-ZFS RAID arrays.

AFAIK, traditional RAID constructs may have two additional issues compared to ZFS. 1. As you mention, usually the whole stripe set needs to be rebuilt, whereas resilver only rebuilds active data. 2. The error may or may not be detected by the controller. In ZFS, even if a read error goes silent down the whole food chain, there are still block-based checksums separating chaff from wheat.

AFAIK, a bit error in parity or stripe data can be specifically dangerous when it is raised during resilvering and there is only one layer of redundancy left.
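To make the resilver-window argument concrete, here is a small back-of-the-envelope sketch (my own illustration, not taken from Richard's models): it estimates the probability of hitting at least one unrecoverable read error while reading a given amount of data from a drive with a given UER. The data sizes and error rates are assumed example values.

import math

# Rough illustration: probability of hitting at least one unrecoverable read
# error (URE) while reading `data_tb` terabytes from a drive rated at one
# error per `uer_bits` bits read.
def p_ure(data_tb, uer_bits=1e14):
    bits_read = data_tb * 8e12                  # 1 TB = 8e12 bits
    p_per_bit = 1.0 / uer_bits
    # P = 1 - (1 - p)^n, computed with log1p/expm1 to stay accurate for tiny p
    return -math.expm1(bits_read * math.log1p(-p_per_bit))

# Surviving half of a 2-way mirror: the remaining 1 TB copy must read cleanly.
print("1 TB read, UER 1e14 :", round(p_ure(1, 1e14), 4))   # ~0.077
print("1 TB read, UER 1e16 :", round(p_ure(1, 1e16), 4))   # ~0.0008
# A wider rebuild that has to read back 8 TB of surviving data.
print("8 TB read, UER 1e14 :", round(p_ure(8, 1e14), 4))   # ~0.47

The shorter the rebuild window and the less data that must be read back without any remaining redundancy, the smaller this exposure is, which is the crux of the mirror-versus-raidz2 argument above.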
Re: [OpenIndiana-discuss] vdev reliability was: Recommendations for fast storage
On 2013-04-21 06:13, Richard Elling wrote:

Terminology warning below… BER is the term most often used in networks, where the corruption is transient. For permanent data faults, the equivalent is unrecoverable read error rate (UER), also expressed as a failure rate per bit. ...

Well, with computers being networks of smaller components, besides the UER contained only in the storage device, which repeatably returns the error (or rather a response different from the stored and expected value), there is a place for the BER concept as you describe it - there are cables and soldered signal lines which can catch noise, there are protocols and firmwares which might mistreat some corner cases, etc. - providing intermittent errors which are not there the second time you look. Even UREs might not be persistent, if the HDD decides to relocate a detected-failing sector into spare areas and returns consistent replies to queries afterwards (I did have cases with old HDDs that creaked and rattled for a while and returned some bytes when querying bad sectors, and the replies were different every time, or IO errors were returned at the protocol layer instead of random garbage as data).

The trend seems to be that BER data is not shown for laptop drives, which is a large part of the HDD market. Presumably, this is because the load/unload failure mode dominates in this use case as the drives are not continuously spinning. It is a good idea to use components in the environment for which they are designed, so I'm pretty sure you'd never consider using a laptop drive for a storage array.

This brings up an interesting question for home-NAS users: it does not seem unreasonable to use a laptop drive or two as an rpool in an array like the popular ZFS workhorse HP N40L. I agree that it seems improper to build an array for *intensive* IO with a horde of such disks, but do you have statistics to really discourage these two cases (rpool and intensive IO)? What about home NASes which just occasionally see some IO, maybe in intensive bursts, but idle for hours otherwise? Indeed, many portable-disk boxes contain a laptop drive. Arguably, they might also be more reliable mechanically, because they are intended for use in shaky environments.

Thanks, //Jim
Re: [OpenIndiana-discuss] vdev reliability was: Recommendations for fast storage
On Apr 21, 2013, at 3:47 AM, Jim Klimov jimkli...@cos.ru wrote:

On 2013-04-21 06:13, Richard Elling wrote: Terminology warning below… BER is the term most often used in networks, where the corruption is transient. For permanent data faults, the equivalent is unrecoverable read error rate (UER), also expressed as a failure rate per bit. ...

Well, with computers being networks of smaller components, besides the UER contained only in the storage device, which repeatably returns the error (or rather a response different from the stored and expected value), there is a place for the BER concept as you describe it - there are cables and soldered signal lines which can catch noise, there are protocols and firmwares which might mistreat some corner cases, etc. - providing intermittent errors which are not there the second time you look.

The problem is finding a spec that you can design to. We have seen many bad cables cause all sorts of latency (due to retries on bad transfers). This information is not measured or spec'ed by disk vendors.

Even UREs might not be persistent, if the HDD decides to relocate a detected-failing sector into spare areas and returns consistent replies to queries afterwards (I did have cases with old HDDs that creaked and rattled for a while and returned some bytes when querying bad sectors, and the replies were different every time, or IO errors were returned at the protocol layer instead of random garbage as data).

The trend seems to be that BER data is not shown for laptop drives, which is a large part of the HDD market. Presumably, this is because the load/unload failure mode dominates in this use case as the drives are not continuously spinning. It is a good idea to use components in the environment for which they are designed, so I'm pretty sure you'd never consider using a laptop drive for a storage array.

This brings up an interesting question for home-NAS users: it does not seem unreasonable to use a laptop drive or two as an rpool in an array like the popular ZFS workhorse HP N40L. I agree that it seems improper to build an array for *intensive* IO with a horde of such disks, but do you have statistics to really discourage these two cases (rpool and intensive IO)? What about home NASes which just occasionally see some IO, maybe in intensive bursts, but idle for hours otherwise? Indeed, many portable-disk boxes contain a laptop drive. Arguably, they might also be more reliable mechanically, because they are intended for use in shaky environments.

You get what you pay for.
-- richard

-- richard.ell...@richardelling.com +1-760-896-4422
Re: [OpenIndiana-discuss] vdev reliability was: Recommendations for fast storage
Terminology warning below…

On Apr 18, 2013, at 3:46 AM, Sebastian Gabler sequoiamo...@gmx.net wrote:

On 18.04.2013 03:09, openindiana-discuss-requ...@openindiana.org wrote:

On Wed, 17 Apr 2013 13:21:08 -0600, Jan Owoc jso...@gmail.com wrote:

On Wed, Apr 17, 2013 at 12:57 PM, Timothy Coalson tsc...@mst.edu wrote:

On Wed, Apr 17, 2013 at 7:38 AM, Edward Ned Harvey (openindiana) openindi...@nedharvey.com wrote:

You also said the raidz2 will offer more protection against failure, because you can survive any two disk failures (but no more.) I would argue this is incorrect (I've done the probability analysis before), mostly because the resilver time in the mirror configuration is 8x to 16x faster (there's 1/8 as much data to resilver, and IOPS is limited by a single disk, not the worst of several disks, which introduces another factor of up to 2x, increasing the 8x to as high as 16x), so the smaller resilver window means a lower probability of concurrent failures on the critical vdev. We're talking about 12 hours versus 1 week, the actual result of my machines in production.

Did you also compare the probability of bit errors causing data loss without a complete pool failure? 2-way mirrors, when one device completely dies, have no redundancy on that data, and the copy that remains must be perfect or some data will be lost. On the other hand, raid-z2 will still have available redundancy, allowing every single block to have a bad read on any single component disk without losing data. I haven't done the math on this, but I seem to recall some papers claiming that this is the more likely route to lost data on modern disks, by comparing bit error rate and capacity. Of course, a second outright failure puts raid-z2 in a much worse boat than 2-way mirrors, which is a reason for raid-z3, but this may already be a less likely case.

Richard Elling wrote a blog post about mean time to data loss [1]. A few years later he graphed out a few cases for typical values of resilver times [2].

[1] https://blogs.oracle.com/relling/entry/a_story_of_two_mttdl
[2] http://blog.richardelling.com/2010/02/zfs-data-protection-comparison.html

Cheers, Jan

Notably, Richard's models posted do not include BER. Nevertheless it's an important factor.

BER is the term most often used in networks, where the corruption is transient. For permanent data faults, the equivalent is unrecoverable read error rate (UER), also expressed as a failure rate per bit. My models clearly consider this. Unfortunately, terminology consistency between vendors has been slow in coming, but Seagate and WD seem to be converging on "non-recoverable read errors per bits read", while Toshiba seems to be ignoring the problem, or at least can't seem to list it on their datasheets :-(

From the back of my mind it will impact reliability in different ways in ZFS:
- Bit error in metadata (ZFS should save us by metadata redundancy)
- Bit error in full stripe data
- Bit error in parity data

These aren't interesting from a system design perspective. To enhance the model to deal with this, we just need to determine what percentage of the overall space contains copied data. There is no general answer, but for most systems it will be a small percentage of the total, as compared to data.
In this respect, the models are worst-case, which is what we want to use for design evaluations.

NB, traditional RAID systems don't know what is data and what is not data, so they can run into uncorrectable errors in sectors that don't actually contain data. This becomes more important for those systems which use a destructive scrub, as opposed to ZFS's read-only scrub. Hence, some studies have shown that scrubbing can propagate errors in non-ZFS RAID arrays.

AFAIK, a bit error in parity or stripe data can be specifically dangerous when it is raised during resilvering and there is only one layer of redundancy left. OTOH, BER issues scale with vdev size, not with rebuild time. So I think that Tim actually made a valid point about a systematic weak point of 2-way mirrors or raidz1 in vdevs that are large in comparison to the BER rating of their member drives. Consumer drives have a BER of 1:10^14..10^15; enterprise drives start at 1:10^16. I do not think that ZFS will have better resilience against rot of parity data than conventional RAID. At best, block-level checksums can help raise an error, so you at least know that something went wrong. But recovery of the data will probably not be possible. So, in my opinion, BER is an issue under ZFS as anywhere else.

Yep, which is why my MTTDL model 2 explicitly (MTTDL[2])
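As a quick illustration of how that exposure scales with vdev size rather than rebuild time (a rough sketch only; the 4 TB drive size is an assumed example, the UER ratings are the ones quoted above):

import math

# Expected number of unrecoverable read errors while rebuilding one failed
# drive in a single-parity vdev: all surviving members must be read in full.
DRIVE_TB = 4                     # assumed example drive size
BITS_PER_TB = 8e12

for width in (2, 6, 10):         # total drives in the vdev, including parity
    surviving_bits = (width - 1) * DRIVE_TB * BITS_PER_TB
    for exp in (14, 15, 16):     # consumer 1e14..1e15, enterprise 1e16
        expected = surviving_bits / (10.0 ** exp)
        print(f"{width}-wide vdev, UER 1e{exp}: ~{expected:.2f} expected UREs during rebuild")

The expected error count grows linearly with the amount of surviving data that has to be read back, which is why wide single-redundancy vdevs built from 1e14-class drives look so much worse than the same drives behind a second layer of redundancy.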
Re: [OpenIndiana-discuss] vdev reliability was: Recommendations for fast storage
On 19.04.2013 11:22, openindiana-discuss-requ...@openindiana.org wrote:

On Thu, 18 Apr 2013 16:03:32 -0500, Timothy Coalson tsc...@mst.edu wrote:

On Thu, Apr 18, 2013 at 10:24 AM, Sebastian Gabler sequoiamo...@gmx.net wrote:

One of Elling's posts cited by another poster did take that into account, but the graphs don't load due to URL changes (and they seem to have the wrong MIME type with the fixed URL; I ended up using wget): https://blogs.oracle.com/relling/entry/a_story_of_two_mttdl

Thanks for pointing at that. I stand corrected on my previous statement about Richard's MTTDL model excluding BER/UER. Asking Richard Elling to accept my apology.

Timothy, how do you manage to get these PNGs? I get as far as extracting some links to them, but loading these links results in resets consistently. Help appreciated.

BR Sebastian
Re: [OpenIndiana-discuss] vdev reliability was: Recommendations for fast storage
On Fri, Apr 19, 2013 at 8:14 AM, Sebastian Gabler sequoiamo...@gmx.net wrote:

On Thu, Apr 18, 2013 at 10:24 AM, Sebastian Gabler sequoiamo...@gmx.net wrote:

One of Elling's posts cited by another poster did take that into account, but the graphs don't load due to URL changes (and they seem to have the wrong MIME type with the fixed URL; I ended up using wget): https://blogs.oracle.com/relling/entry/a_story_of_two_mttdl

Thanks for pointing at that. I stand corrected on my previous statement about Richard's MTTDL model excluding BER/UER. Asking Richard Elling to accept my apology. Timothy, how do you manage to get these PNGs? I get as far as extracting some links to them, but loading these links results in resets consistently. Help appreciated.

One of the wrong URLs is this: http://blogs.sun.com/relling/resource/X4500-MTTDL-models-raidz2.png

Note that the blog URL is blogs.oracle.com, not blogs.sun.com, so do this:

$ wget http://blogs.oracle.com/relling/resource/X4500-MTTDL-models-raidz2.png --no-check-certificate

It redirects to https, and Oracle's certificate has problems, hence the --no-check-certificate. Unfortunately, entering the URL in a browser treats the PNG data as text, hence using wget (or curl, or your preferred method of downloading a URL).

Tim
Re: [OpenIndiana-discuss] vdev reliability was: Recommendations for fast storage
[catching up...]

On Apr 19, 2013, at 6:14 AM, Sebastian Gabler sequoiamo...@gmx.net wrote:

On 19.04.2013 11:22, openindiana-discuss-requ...@openindiana.org wrote:

On Thu, 18 Apr 2013 16:03:32 -0500, Timothy Coalson tsc...@mst.edu wrote:

On Thu, Apr 18, 2013 at 10:24 AM, Sebastian Gabler sequoiamo...@gmx.net wrote:

One of Elling's posts cited by another poster did take that into account, but the graphs don't load due to URL changes (and they seem to have the wrong MIME type with the fixed URL; I ended up using wget): https://blogs.oracle.com/relling/entry/a_story_of_two_mttdl

Thanks for pointing at that. I stand corrected on my previous statement about Richard's MTTDL model excluding BER/UER. Asking Richard Elling to accept my apology.

No worries. Unfortunately, Oracle totally hosed the older Sun blogs. I do have on my todo list the task of updating these and reposting on a site with better longevity, not controlled by the lawnmower. -- richard

Timothy, how do you manage to get these PNGs? I get as far as extracting some links to them, but loading these links results in resets consistently. Help appreciated.

BR Sebastian

-- richard.ell...@richardelling.com +1-760-896-4422
Re: [OpenIndiana-discuss] vdev reliability was: Recommendations for fast storage
[catching up... comment below]

On Apr 18, 2013, at 2:03 PM, Timothy Coalson tsc...@mst.edu wrote:

On Thu, Apr 18, 2013 at 10:24 AM, Sebastian Gabler sequoiamo...@gmx.net wrote:

On 18.04.2013 16:28, openindiana-discuss-requ...@openindiana.org wrote:

On Thu, 18 Apr 2013, Edward Ned Harvey (openindiana) openindi...@nedharvey.com wrote:

From: Timothy Coalson [mailto:tsc...@mst.edu]

Did you also compare the probability of bit errors causing data loss without a complete pool failure? 2-way mirrors, when one device completely dies, have no redundancy on that data, and the copy that remains must be perfect or some data will be lost.

I had to think about this comment for a little while to understand what you were saying, but I think I got it. I'm going to rephrase your question: If one device in a 2-way mirror becomes unavailable, then the remaining device has no redundancy. So if a bit error is encountered on the (now non-redundant) device, then it's an uncorrectable error. Question is, did I calculate that probability? Answer is, I think so. Modelling the probability of drive failure (either complete failure or data loss) is very complex and non-linear. It is also dependent on the specific model of drive in question, and the graphs are typically not available. So what I did was to start with some MTBDL graphs that I assumed to be typical, and then assume every data-loss event meant complete drive failure.

The thing is... bit errors can lead to corruption of files, or even to the loss of a whole pool, without having an additional faulted drive, because bit errors do not necessarily lead to a drive error. The risk of a rebuild failing is proportional to the BER of the drives involved, and it scales with the amount of data moved, given that you don't have further redundancy left. I agree with previous suggestions that scrubbing offers some degree of protection against that issue. It doesn't do away with the risk of bit errors in a situation where all redundancy has been stripped for some reason. For this aspect, a second level of redundancy offers a clear benefit. AFAIU, that was the valid point of the poster raising the controversy about resilience of a single vdev with multiple redundancy vs. multiple vdevs with single redundancy. As far as scrubbing is concerned, it is true that it will reduce the risk of a bit error appearing precisely during rebuild. However, in cases where you will deliberately pull redundancy, i.e. for swapping drives with larger ones, you will want to have a valid backup, and thus you will not have too much WORN data. In either case, it is user-driven; that is, scrub is not proactive by itself, but it gives the user a tool to be proactive about WORN data, which is indeed the data primarily prone to bit rot.

Yes, that was my point: bit errors when there is no remaining redundancy are unrecoverable. Thus, as long as it is likely that only 1 disk per vdev fails at a time, raid-z2 will survive these bit errors fine, while 2-way mirrors will lose data.
One of Elling's posts cited by another poster did take that into account, but the graphs don't load due to URL changes (and they seem to have the wrong MIME type with the fixed URL, I ended up using wget): https://blogs.oracle.com/relling/entry/a_story_of_two_mttdl He linearized the bit error rate part, also. Beware the different scale on the graphs - at any rate, his calculation arrived at 3 orders of magnitude difference, with raid-z2 as better than 2-way mirrors for MTTDL for equivalent usable space (if I'm reading those correctly).

Yes. 2-way mirror is a single-parity protection scheme and raidz2 is a double-parity protection scheme.

Scrubbing introduces a question about the meaning of bit error rate - how different is the uncorrectable bit error rate on newly written data, versus the bit error rate on data that has been read back in successfully (additional qualifier: multiple times)? Regular scrubs can change the MTTDL dramatically if these probabilities are significantly different, because the first probability only applies to data written since the most recent scrub, which can drop a few orders of magnitude from the calculation.

Pragmatically, it doesn't matter because the drive vendors do not publish the information. So you'll have to measure it yourself, which is not an easy task even if you
Re: [OpenIndiana-discuss] vdev reliability was: Recommendations for fast storage
From: Sebastian Gabler [mailto:sequoiamo...@gmx.net]

AFAIK, a bit error in parity or stripe data can be specifically dangerous when it is raised during resilvering, and there is only one layer of redundancy left.

You're saying "error in parity," but that's because you're thinking of raidz, which I don't usually use. You really mean "error in redundant copy," and the only risk, as you've identified, is an error in the *last* redundant copy. The answer to this is: You *do* scrub every week or two, don't you? You should.

I do not think that zfs will have better resilience against rot of parity data than conventional RAID.

That's incorrect, because conventional raid cannot scrub proactively. Sure, if you have a pool with only one level of redundancy, and the bit error crept in between the most recent scrub and the present failure time, then that's a problem, and zfs cannot protect you against it. This is, by definition, simultaneous failure of all redundant copies of the data.
Re: [OpenIndiana-discuss] vdev reliability was: Recommendations for fast storage
On 2013-04-18 12:46, Sebastian Gabler wrote:

I do not think that zfs will have better resilience against rot of parity data than conventional RAID. At best, block level checksums can help raise an error, so you know at least that something went wrong. But recovery of the data will probably not be possible. So, in my opinion BER is an issue under ZFS as anywhere else.

Well, thanks to checksums we can know which variant of userdata is correct, and thanks to parities we can verify which bytes are wrong in a particular block. If there are relatively few such bytes, it is theoretically possible to brute-force match values into the wrong bytes and recalculate checksums. So if a broken range is on the order of 30-40 bytes (which someone said is typical for a CRC error and an HDD returning uncertain data), you have a chance of recovering the block in a few days if lucky ;)

This is a very compute-intensive task; I proposed this idea half a year ago on the zfs list (I had unrecoverable errors on a raidz2 made of 4 data disks and 2 parity disks, meaning corruptions on 3 or more drives, but not necessarily whole-sector corruptions) and tried to take known byte values from different components at known bad byte offsets and put them into the puzzle. Complexity (the size of the recursive iteration) grows very quickly even if we only have about 5 values to match (unlike 256 in full recovery above), and we estimated that for a 4096-byte block it would take Earth's compute resources longer than the lifetime of the universe to do the full search and recovery. So such an approach is really limited to just a few dozen broken bytes. But it is possible :)

//Jim
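For a feel of how quickly that search space explodes (a rough sketch with assumed numbers, not Jim's actual tooling): with v candidate values per suspect byte and k suspect bytes there are v^k combinations to checksum, and even a generous assumed checksum rate makes large k hopeless.

import math

# Rough illustration of the brute-force search space described above:
# v candidate values per suspect byte, k suspect bytes -> v**k combinations.
# The rate of 1e9 candidate checksums per second is an assumed, generous figure.
RATE_LOG10 = 9
YEAR_LOG10 = math.log10(3.15e7)          # log10 of seconds per year

for v, k in [(5, 10), (5, 40), (256, 40), (256, 4096)]:
    combos_log10 = k * math.log10(v)     # log10 of v**k
    years_log10 = combos_log10 - RATE_LOG10 - YEAR_LOG10
    print(f"{v} values x {k:4d} bytes: ~10^{combos_log10:.0f} combinations, "
          f"~10^{years_log10:.0f} years at 1e9 checks/s")

A few dozen suspect bytes with only a handful of candidate values each is borderline feasible; a full 4096-byte block with 256 values per byte is astronomically out of reach, matching the "longer than the lifetime of the universe" estimate.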
Re: [OpenIndiana-discuss] vdev reliability was: Recommendations for fast storage
From: Jim Klimov [mailto:jimkli...@cos.ru]

Well, thanks to checksums we can know which variant of userdata is correct, and thanks to parities we can verify which bytes are wrong in a particular block. If there are relatively few such bytes, it is theoretically possible to brute-force match values into the wrong bytes and recalculate checksums. So if a broken range is on the order of 30-40 bytes (which someone said is typical for a CRC error and an HDD returning uncertain data), you have a chance of recovering the block in a few days if lucky ;) This is a very compute-intensive task; I proposed this idea half a year ago on the zfs list (I had unrecoverable errors on a raidz2 made of 4 data disks and 2 parity disks, meaning corruptions on 3 or more drives, but not necessarily whole-sector corruptions) and tried to take known byte values from different components at known bad byte offsets and put them into the puzzle. Complexity (the size of the recursive iteration) grows very quickly even if we only have about 5 values to match (unlike 256 in full recovery above), and we estimated that for a 4096-byte block it would take Earth's compute resources longer than the lifetime of the universe to do the full search and recovery. So such an approach is really limited to just a few dozen broken bytes. But it is possible :)

I think you're misplacing a decimal, confusing bits for bytes, and mixing up exponents. Cuz you're way off. With merely 70 unknown *bits*, that is, less than 10 bytes, you'll need a 3-letter government agency devoting all its computational resources to the problem for a few years. Furthermore, when you find a matching cksum, you haven't found the correct data yet. You'll need to exhaustively search the entire space, requiring 2^70 operations, find all the matches (there will be a lot), and from those matches choose the one you think is right. Even with merely 70 unknown bits and a 32-bit cksum (the default in zfs fletcher-4), you will have 2^38 (that is, 256 billion) results that produce the right cksum. You'll have to rely on your knowledge of the jpg file or txt file or whatever, to choose which one of the 256 billion cksum-passing results is *actually* the right result.
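The collision arithmetic in that argument is just exponent subtraction; a minimal check, using the 70 unknown bits and 32-bit checksum figures assumed in the post above:

# Ned's argument in numbers: with b unknown bits there are 2**b candidates;
# a random c-bit checksum matches about 2**(b - c) of them.
unknown_bits = 70          # figure assumed in the post above
checksum_bits = 32         # checksum width assumed in the post above

candidates = 2 ** unknown_bits
expected_matches = 2 ** (unknown_bits - checksum_bits)
print(f"candidates to try: 2^{unknown_bits} = {candidates:.3e}")
print(f"expected matches : 2^{unknown_bits - checksum_bits} = {expected_matches:.3e}")
# -> roughly 2.7e11 checksum-passing candidates, i.e. hundreds of billions,
#    only one of which is the original data.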
Re: [OpenIndiana-discuss] vdev reliability was: Recommendations for fast storage
On 2013-04-18 15:57, Edward Ned Harvey (openindiana) wrote:

I think you're misplacing a decimal, confusing bits for bytes, and mixing up exponents. Cuz you're way off. With merely 70 unknown *bits* that is, less than 10 bytes, you'll need a 3-letter government agency devoting all its computational resources to the problem for a few years.

Uh... I'll raise a white flag and say I'm too preoccupied now to really go into the math and see which one of us is wrong. Indeed, however, my post was based on a practical experiment with a few values to test in each location (about 2^2 variants instead of 2^8 for a full-byte match), so here's a few orders of magnitude in complexity difference right away ;) So likely you're more correct, and solution of the problem is then out of practical boundaries even for a small corruption, except for true single-bit rots (if these are at all possible after CRC corrections).

BTW, in terms of the amount of matching checksums - does it matter if we are talking about just a 70-bit string or about the same string at a fixed location in the 4kb or 128kb block?

//Jim
Re: [OpenIndiana-discuss] vdev reliability was: Recommendations for fast storage
On 18.04.2013 16:28, openindiana-discuss-requ...@openindiana.org wrote:

On Thu, 18 Apr 2013, Edward Ned Harvey (openindiana) openindi...@nedharvey.com wrote:

From: Timothy Coalson [mailto:tsc...@mst.edu]

Did you also compare the probability of bit errors causing data loss without a complete pool failure? 2-way mirrors, when one device completely dies, have no redundancy on that data, and the copy that remains must be perfect or some data will be lost.

I had to think about this comment for a little while to understand what you were saying, but I think I got it. I'm going to rephrase your question: If one device in a 2-way mirror becomes unavailable, then the remaining device has no redundancy. So if a bit error is encountered on the (now non-redundant) device, then it's an uncorrectable error. Question is, did I calculate that probability? Answer is, I think so. Modelling the probability of drive failure (either complete failure or data loss) is very complex and non-linear. It is also dependent on the specific model of drive in question, and the graphs are typically not available. So what I did was to start with some MTBDL graphs that I assumed to be typical, and then assume every data-loss event meant complete drive failure.

The thing is... bit errors can lead to corruption of files, or even to the loss of a whole pool, without having an additional faulted drive, because bit errors do not necessarily lead to a drive error. The risk of a rebuild failing is proportional to the BER of the drives involved, and it scales with the amount of data moved, given that you don't have further redundancy left. I agree with previous suggestions that scrubbing offers some degree of protection against that issue. It doesn't do away with the risk of bit errors in a situation where all redundancy has been stripped for some reason. For this aspect, a second level of redundancy offers a clear benefit. AFAIU, that was the valid point of the poster raising the controversy about resilience of a single vdev with multiple redundancy vs. multiple vdevs with single redundancy. As far as scrubbing is concerned, it is true that it will reduce the risk of a bit error appearing precisely during rebuild. However, in cases where you will deliberately pull redundancy, i.e. for swapping drives with larger ones, you will want to have a valid backup, and thus you will not have too much WORN data. In either case, it is user-driven; that is, scrub is not proactive by itself, but it gives the user a tool to be proactive about WORN data, which is indeed the data primarily prone to bit rot.

BR Sebastian
Re: [OpenIndiana-discuss] vdev reliability was: Recommendations for fast storage
On Thu, Apr 18, 2013 at 10:24 AM, Sebastian Gabler sequoiamo...@gmx.net wrote:

On 18.04.2013 16:28, openindiana-discuss-requ...@openindiana.org wrote:

On Thu, 18 Apr 2013, Edward Ned Harvey (openindiana) openindi...@nedharvey.com wrote:

From: Timothy Coalson [mailto:tsc...@mst.edu]

Did you also compare the probability of bit errors causing data loss without a complete pool failure? 2-way mirrors, when one device completely dies, have no redundancy on that data, and the copy that remains must be perfect or some data will be lost.

I had to think about this comment for a little while to understand what you were saying, but I think I got it. I'm going to rephrase your question: If one device in a 2-way mirror becomes unavailable, then the remaining device has no redundancy. So if a bit error is encountered on the (now non-redundant) device, then it's an uncorrectable error. Question is, did I calculate that probability? Answer is, I think so. Modelling the probability of drive failure (either complete failure or data loss) is very complex and non-linear. It is also dependent on the specific model of drive in question, and the graphs are typically not available. So what I did was to start with some MTBDL graphs that I assumed to be typical, and then assume every data-loss event meant complete drive failure.

The thing is... bit errors can lead to corruption of files, or even to the loss of a whole pool, without having an additional faulted drive, because bit errors do not necessarily lead to a drive error. The risk of a rebuild failing is proportional to the BER of the drives involved, and it scales with the amount of data moved, given that you don't have further redundancy left. I agree with previous suggestions that scrubbing offers some degree of protection against that issue. It doesn't do away with the risk of bit errors in a situation where all redundancy has been stripped for some reason. For this aspect, a second level of redundancy offers a clear benefit. AFAIU, that was the valid point of the poster raising the controversy about resilience of a single vdev with multiple redundancy vs. multiple vdevs with single redundancy. As far as scrubbing is concerned, it is true that it will reduce the risk of a bit error appearing precisely during rebuild. However, in cases where you will deliberately pull redundancy, i.e. for swapping drives with larger ones, you will want to have a valid backup, and thus you will not have too much WORN data. In either case, it is user-driven; that is, scrub is not proactive by itself, but it gives the user a tool to be proactive about WORN data, which is indeed the data primarily prone to bit rot.

Yes, that was my point: bit errors when there is no remaining redundancy are unrecoverable. Thus, as long as it is likely that only 1 disk per vdev fails at a time, raid-z2 will survive these bit errors fine, while 2-way mirrors will lose data.
One of Elling's posts cited by another poster did take that into account, but the graphs don't load due to URL changes (and they seem to have the wrong MIME type with the fixed URL, I ended up using wget): https://blogs.oracle.com/relling/entry/a_story_of_two_mttdl He linearized the bit error rate part, also. Beware the different scale on the graphs - at any rate, his calculation arrived at 3 orders of magnitude difference, with raid-z2 as better than 2-way mirrors for MTTDL for equivalent usable space (if I'm reading those correctly).

Scrubbing introduces a question about the meaning of bit error rate - how different is the uncorrectable bit error rate on newly written data, versus the bit error rate on data that has been read back in successfully (additional qualifier: multiple times)? Regular scrubs can change the MTTDL dramatically if these probabilities are significantly different, because the first probability only applies to data written since the most recent scrub, which can drop a few orders of magnitude from the calculation.

As for what I said about resilver speed, I had not accounted for the fact that data reads on a raid-z2 component device would be significantly shorter than for the same data on 2-way mirrors. Depending on whether you are using enormous block sizes, or whether your data is allocated extremely linearly in the way scrub/resilver reads it, this could be the limiting factor on platter drives due to seek times, and make raid-z2 take much longer to resilver. I fear I was thinking of raid-z2 in terms of raid6.
Re: [OpenIndiana-discuss] vdev reliability was: Recommendations for fast storage
From: Timothy Coalson [mailto:tsc...@mst.edu]

As for what I said about resilver speed, I had not accounted for the fact that data reads on a raid-z2 component device would be significantly shorter than for the same data on 2-way mirrors. Depending on whether you are using enormous block sizes, or whether your data is allocated extremely linearly in the way scrub/resilver reads it, this could be the limiting factor on platter drives due to seek times, and make raid-z2 take much longer to resilver. I fear I was thinking of raid-z2 in terms of raid6.

I'm not sure if you misunderstand something, or if I misunderstand what you're saying, but ... Even if you are using enormous block sizes, they are actually just enormous *max* block sizes. If you write a 1-byte file (very slowly, such that no write accumulation can occur) then ZFS only writes a 1-byte file into a block. So the enormous block sizes only come into play when you're writing large amounts of data ... and when you're writing large amounts of data, you're likely to simply span multiple sequential blocks anyway. So, all in all, the block size is rarely very important. There are some situations where it matters, but ... All this is a tangent.

The real thing I'm addressing here is: you said scrub/resilver progresses extremely linearly. This is, unfortunately, about as wrong as it can be. In actuality, scrub/resilver proceed in approximately temporal order, which, in the typical situation of a long-lived server with frequent creation and destruction of snapshots, results in approximately random disk order. Here's the evidence I observed: I had a ZFS server running in production for about 2 years, and a disk failed. I had measured previously, on this server, that each disk sustains 1 Gbit/sec sequentially. With 1T disks, linearly resilvering the entire disk including empty space, it should take about 2 hrs to resilver. But ZFS doesn't resilver the whole disk; it only resilvers used space. This would be great if your pool were mostly empty, or if it were laid out in linear disk order. But it actually took 12 hours to resilver that disk. I went to zfs-discuss and discussed. Learned about the temporal ordering. Got my explanation of how resilvering just the used portions could take several times longer than resilvering the whole disk.
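The arithmetic behind that 2-hour figure, and a rough sense of how a seek-bound, temporally ordered resilver blows it up (a sketch only; the IOPS, average read size, and pool occupancy below are assumptions for illustration, not measurements from Ned's server):

# Sequential lower bound: read the whole 1 TB disk at the measured 1 Gbit/s.
disk_bits = 1 * 8e12                     # 1 TB disk in bits
seq_rate = 1e9                           # 1 Gbit/s sustained sequential read
print("sequential whole-disk pass:", disk_bits / seq_rate / 3600, "hours")   # ~2.2 h

# Temporal-order resilver approximated as random IO: assumed 150 IOPS and an
# assumed average of 64 KiB of live data recovered per seek.
iops = 150
avg_read_bits = 64 * 1024 * 8
used_fraction = 0.6                      # assumed pool occupancy
random_rate = iops * avg_read_bits       # effective bits/s when seek-bound
print("seek-bound resilver of used space:",
      disk_bits * used_fraction / random_rate / 3600, "hours")

With those assumed numbers the seek-bound pass lands in the tens of hours, which is the same order as the 12 hours Ned observed: once the resilver is dominated by seeks rather than sequential transfer, reading only the used space can still take far longer than a full linear pass would.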