Re: Encountered kernel bug#72811. Advice on recovery?
Marat Khalili posted on Sun, 16 Apr 2017 11:01:00 +0300 as excerpted:

>> Even making such a warning conditional on kernel version is
>> problematic, because many distros backport major blocks of code,
>> including perhaps btrfs fixes, and the nominally 3.14 or whatever
>> kernel may actually be running btrfs and other fixes from 4.14 or
>> later, by the time they actually drop support for whatever LTS distro
>> version and quit backporting fixes.
>
> This information could be stored in kernel and made available for
> usermode tools via some proc file.  This would be very useful
> _especially_ considering backporting.  Raid56 could be fixed already
> (or not) by the time it is implemented, but no doubt there will still
> be other highly experimental capabilities judging by how things go.
> And this feature itself could easily be backported.

What they /could/ do would be something very similar to what they
already did for the free-space-tree (as opposed to the free-space-cache,
the original and still default implementation).

There was a critical bug in the early implementations of the
free-space-tree.  But btrfs has incompatibility/feature flags for a
reason, and they set it up in such a way that the flaw could be detected
and fixed.

In theory they could grab another bit from it and make that raid56v2, or
something similar, and if the raid56 flag is there but not raid56v2,
warn, etc.

-- 
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
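As a minimal sketch of how a userland tool could probe such a bit: real
kernels do expose supported feature bits as files under
/sys/fs/btrfs/features/ (raid56 among them), but "raid56v2" is purely
hypothetical here, following Duncan's suggestion.

```shell
# check_feature <features-dir> <name>
# Reports whether a feature bit is advertised in the given directory
# (on a live system that would be /sys/fs/btrfs/features).
check_feature() {
    if [ -e "$1/$2" ]; then
        echo supported
    else
        echo missing
    fi
}

# warn_raid56 <features-dir>
# Warn if the raid56 bit exists but a hypothetical fixed raid56v2 does not.
warn_raid56() {
    if [ "$(check_feature "$1" raid56)" = supported ] &&
       [ "$(check_feature "$1" raid56v2)" = missing ]; then
        echo "warning: raid56 present without raid56v2 fix" >&2
    fi
}
```

The directory argument keeps the sketch testable without a btrfs mount;
a real tool would also want to check the per-filesystem directory
/sys/fs/btrfs/<UUID>/features rather than just the kernel-wide one.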
Re: Encountered kernel bug#72811. Advice on recovery?
> Even making such a warning conditional on kernel version is
> problematic, because many distros backport major blocks of code,
> including perhaps btrfs fixes, and the nominally 3.14 or whatever
> kernel may actually be running btrfs and other fixes from 4.14 or
> later, by the time they actually drop support for whatever LTS distro
> version and quit backporting fixes.

This information could be stored in the kernel and made available to
usermode tools via some proc file.  This would be very useful
_especially_ considering backporting.  Raid56 could be fixed already (or
not) by the time it is implemented, but no doubt there will still be
other highly experimental capabilities, judging by how things go.  And
this feature itself could easily be backported.

Some machine-readable readiness level (ok / warning / override flag
needed / known but disabled in kernel) plus a one-line text message
displayed to users in cases 2-4 is all we need.  If the proc file is
missing or doesn't contain information about a specific capability,
tools could default to current behavior (AFAIR there're already warnings
in some cases).  The message should tersely cover any known issues,
including stability, performance, compatibility and general readiness,
and may contain links (to the btrfs wiki?) for more information.  I
expect the whole file to easily fit in 512 bytes.

-- 
With Best Regards,
Marat Khalili
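To make the proposal concrete, here is a sketch of how small a consumer
of such a file could be.  Everything about the format is an assumption
(no such kernel interface exists): one capability per line, a readiness
level 1-4 matching the four cases above, then the terse message.

```shell
# Hypothetical status-file format (an assumption, not a real interface):
#
#   raid1  1 ok
#   raid56 3 known data-loss bugs, see the btrfs wiki
#
# warn_if_unready <status-file> <capability>
# Print a warning for any capability at readiness level 2 or worse;
# print nothing for level 1 or for capabilities absent from the file.
warn_if_unready() {
    awk -v cap="$2" '$1 == cap && $2 >= 2 {
        msg = ""
        for (i = 3; i <= NF; i++) msg = msg (i > 3 ? " " : "") $i
        print "warning: " msg
    }' "$1"
}
```

Silence on a missing file or unknown capability matches the proposed
fallback to current tool behavior.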
Re: Encountered kernel bug#72811. Advice on recovery?
On Sat, Apr 15, 2017 at 11:28:41PM +, Duncan wrote:
> Duncan posted on Sat, 15 Apr 2017 01:41:28 + as excerpted:
>
>> Besides which, if the patch was submitted now, the earliest it could
>> really hit btrfs-progs would be 4.12,
>
> Well, maybe 3.11.x...

Can I borrow your time machine?  Would last Wednesday be OK?

   Hugo.

-- 
Hugo Mills             | We teach people management skills by examining
hugo@... carfax.org.uk | characters in Shakespeare. You could look at
http://carfax.org.uk/  | Claudius's crisis management techniques, for
PGP: E2AB1DE4          | example.   Richard Smith-Jones, Slings and Arrows
Re: Encountered kernel bug#72811. Advice on recovery?
Duncan posted on Sat, 15 Apr 2017 01:41:28 + as excerpted:

> Besides which, if the patch was submitted now, the earliest it could
> really hit btrfs-progs would be 4.12,

Well, maybe 3.11.x...

-- 
Duncan - List replies preferred.  No HTML msgs.
Re: Encountered kernel bug#72811. Advice on recovery?
ronnie sahlberg posted on Fri, 14 Apr 2017 09:56:30 -0700 as excerpted:

> On Thu, Apr 13, 2017 at 8:47 PM, Duncan <1i5t5.dun...@cox.net> wrote:
>> Ank Ular posted on Thu, 13 Apr 2017 14:49:41 -0400 as excerpted:
> ...
>> OK, I'm one of the ones that's going to "go off" on you, but FWIW, I
>> expect pretty much everyone else would pretty much agree.  At least
>> you do have backups. =:^)
>>
>> I don't think you appreciate just how bad raid56 is ATM.  There are
>> just too many REALLY serious bugs like the one you mention with it,
>> and it's actively NEGATIVELY recommended here as a result.  It's bad
>> enough with even current kernels, and the problems are well known
>> enough to the devs, that there's really not a whole lot to test ATM...
>
> Can we please hide the ability to even create any new raid56
> filesystems behind a new flag:
>
> --i-accept-total-data-loss
>
> to make sure that folks are prepared for how risky it currently is.
> That should be an easy patch to the userland utilities.

The biggest problem with such a flag in general is that people often use
a kernel and userland that are /vastly/ out of sync, version-wise.  Were
such a flag to be introduced, people would still be seeing it five years
or more after it no longer applied to the kernel they're using (because
the kernel's what actually does the work in many cases, including
scrub).

Even making such a warning conditional on kernel version is problematic,
because many distros backport major blocks of code, including perhaps
btrfs fixes, and the nominally 3.14 or whatever kernel may actually be
running btrfs and other fixes from 4.14 or later, by the time they
actually drop support for whatever LTS distro version and quit
backporting fixes.
Besides which, if the patch was submitted now, the earliest it could
really hit btrfs-progs would be 4.12, and by the time people actually
get that in their distro they may well be on 4.13 or 4.15 or whatever,
and the patches fixing raid56 mode to actually work may already be in
place.

The only place such a warning really works is on the wiki at
https://btrfs.wiki.kernel.org , because that's really the only place
that can be updated to current status in a realistic timeframe.  And
there's already a feature maturity matrix there, with raid56 mode marked
appropriately, last I checked.

Meanwhile, it can be argued that admins (and anyone making the choice of
filesystem and device layout they're going to run is an admin of those
systems, even if they're just running them at home for their own use)
who don't care enough about the safety of their data to actually
research the stability of the filesystem and filesystem features they
plan to use... really don't value that data very highly in the first
place.  And the status is out there both on this list and on the wiki,
so even a trivial google should find it without issue.  Indeed:

https://www.google.com/search?q=btrfs+raid56+stability

-- 
Duncan - List replies preferred.  No HTML msgs.
Re: Encountered kernel bug#72811. Advice on recovery?
On Fri, Apr 14, 2017 at 10:46 AM, Chris Murphy wrote:
>
> The passive repair works when it's a few bad sectors on the drive.  But
> when it's piles of missing data, this is the wrong mode.  It needs a
> limited scrub or balance to fix things.  Right now you have to manually
> do a full scrub or balance after you've mounted for even one second
> using degraded,rw.  That's why you want to avoid it at all costs.

Small clarification on "right now you have to manually do": I don't mean
YOU personally, with your array.  I mean anyone who happens to have done
even the tiniest amount of writes to a Btrfs volume while mounted
rw,degraded.  Once a new device is added and the bad/missing device
deleted, you still have to manually do a scrub or balance of the entire
array.  That's the only way to fix up the array back to normal.  It's
not automatic.

The way to avoid this is to *immediately*, before any new writes, do a
device add and device delete missing*.  That prevents any degraded
chunks from being written.

* On non-raid56 volumes, you can use 'btrfs replace'.

-- 
Chris Murphy
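Spelled out as a command sequence (a sketch, not a guarantee: the device
names and mountpoint are examples, and whether this actually rescues a
given raid56 array is exactly what's in question in this thread):

```
# Only mount rw,degraded if you are ready to replace the device NOW:
mount -o degraded,rw /dev/sdb /mnt

# Replace the device before anything else writes to the volume:
btrfs device add /dev/sdX /mnt
btrfs device delete missing /mnt

# Then manually resync the whole array; this is not automatic:
btrfs scrub start -Bd /mnt

# On non-raid56 profiles, 'btrfs replace start <devid> /dev/sdX /mnt'
# can stand in for the add/delete pair.
```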
Re: Encountered kernel bug#72811. Advice on recovery?
On Thu, Apr 13, 2017 at 8:47 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> Ank Ular posted on Thu, 13 Apr 2017 14:49:41 -0400 as excerpted:
...
> OK, I'm one of the ones that's going to "go off" on you, but FWIW, I
> expect pretty much everyone else would pretty much agree.  At least you
> do have backups. =:^)
>
> I don't think you appreciate just how bad raid56 is ATM.  There are
> just too many REALLY serious bugs like the one you mention with it, and
> it's actively NEGATIVELY recommended here as a result.  It's bad enough
> with even current kernels, and the problems are well known enough to
> the devs, that there's really not a whole lot to test ATM...

Can we please hide the ability to even create any new raid56 filesystems
behind a new flag:

--i-accept-total-data-loss

to make sure that folks are prepared for how risky it currently is.
That should be an easy patch to the userland utilities.
Re: Encountered kernel bug#72811. Advice on recovery?
Summary: 22x device raid6 (data and metadata).  One device vanished, and
the volume is rw,degraded mounted with writes happening; next time it's
mounted, the formerly missing device is not missing, so it's a normal
mount, and writes are happening.  Then later, the filesystem goes read
only.  Now there are problems; what are the escape routes?

OK, the Autopsy Report:

> In my case, I had rebooted my system and one of the drives on my main
> array did not come up.  I was able to mount in degraded mode.  I needed
> to re-boot the following day.  This time, all the drives in the array
> came up.  Several hours later, the array went into read only mode.
> That's when I discovered the odd device out had been re-added without
> any kind of error message or notice.

The instant Btrfs complains about something, you cannot make
assumptions, and you have to fix it.  You can't turn your back on it.
It's an angry goose with an egg nearby, and if you turn your back on it,
it'll beat your ass down.  But because this is raid6, you thought it's
OK, it's a reliable, predictable mule.  And you made a lot of
assumptions that are totally reasonable because it's called raid6,
except that those assumptions are all wrong, because Btrfs is not like
anything else, and its raid doesn't work like anything else.

1. The first mount attempt fails.  OK, why?  On Btrfs you must find out
why a normal mount failed, because you don't want to use degraded mode
unless absolutely necessary.  But you didn't troubleshoot it.

2. The second mount attempt with degraded works.  This mode exists for
one reason: you are ready right now to add a new device and delete the
missing one.  With other raid56's you can wait and just hope another
drive doesn't die.  Not Btrfs.  You might get one chance with
rw,degraded to do a device replacement, and you have to make 'dev add'
and 'dev del missing' the top priority before writing anything else to
the volume.  So if you're not ready to do this, the default first action
is ro,degraded.
You can get data off the volume but not change it, and you don't lose
your chance to use degraded,rw, which has a decent chance of being a
one-time event.  But you didn't do this; you assumed Btrfs raid56 is OK
to use rw,degraded like any other raid.

3. The third mount, you must have mounted with -o degraded right off the
bat, assuming the formerly missing device was still missing and you'd
still need -o degraded.  If you'd tried a normal mount, it would have
succeeded, which would have informed you the formerly missing device had
been found and was being used.  Now you have normal chunks, degraded
chunks, and more normal chunks.  This array is very confused.

4. Btrfs does not do active heals (auto generation-limited scrub) when a
previously missing device becomes available again.  It only does passive
healing as it encounters wrong or missing data.

5. Btrfs raid6 is obviously broken somehow, because you're not the only
person who has had a file system with all available information and two
copies, and it still breaks.  Most of your data is raid6; that's three
copies (data plus two parity).  Some of it is degraded raid6, which is
effectively raid5, so that's data plus one copy.  And yet at some point
Btrfs gets confused in normal, non-degraded mount, and splats to
read-only.  This is definitely a bug.  This requires complete call
traces prior to and including the read-only splat, in a bug report.  Or
it simply won't get better.  It's unclear where the devs are at,
priority-wise, with raid56; it's also unclear if they're going to fix
it, or rewrite it.

The point is, you made a lot of mistakes by making too many assumptions,
and not realizing that degraded state in Btrfs is basically an
emergency.  Finally, at the very end, it still could have saved you from
your own mistakes, but there's a missing feature (active auto heal to
catch up the missing device), and there's a bug making the fs read-only.
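A cheap way to avoid the wrong assumption in step 3 is to check device
presence before reaching for -o degraded.  A sketch ('btrfs filesystem
show' is a real btrfs-progs subcommand; the marker line matches the
output shape in Ank's report elsewhere in this thread):

```
# Before mounting, check whether all devices are actually present:
btrfs filesystem show /dev/sdb
#   ...
#   *** Some devices missing     <- only then consider -o degraded

# If nothing is reported missing, a plain mount is the right move:
mount /dev/sdb /mnt
```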
And now it's in a sufficiently non-deterministic state that the repair
tools probably can't repair it.

> The practical problem with bug#72811 is that all the csum and transid
> information is treated as being just as valid on the automatically
> re-added drive as the same information on all the other drives.

My guess is that on the first normal mount after degraded writes, the
re-added drive has a new super block with current valid information,
pointing to missing data, and only as it goes looking for the data or
metadata does it start fixing things up.  Passive.  So its own passive
healing is eventually hitting a brick wall the farther backward in time
it has to go to do these fix-ups.

The passive repair works when it's a few bad sectors on the drive.  But
when it's piles of missing data, this is the wrong mode.  It needs a
limited scrub or balance to fix things.  Right now you have to manually
do a full scrub or balance after you've mounted for even one second
using degraded,rw.  That's why you want to avoid it at all costs.

> I don't have issues with the above tools not being
Re: Encountered kernel bug#72811. Advice on recovery?
Ank Ular posted on Thu, 13 Apr 2017 14:49:41 -0400 as excerpted:

> I've encountered kernel bug#72811 "If a raid5 array gets into degraded
> mode, gets modified, and the missing drive re-added, the filesystem
> loses state".
>
> The array normally consists of 22 devices with data and meta in raid6.
> Physically, the devices are split 16 devices in a NORCO DS-24 cage and
> the remaining devices are in the server itself.  All the devices are
> SATA III.
>
> I don't have issues with the above tools not being ready for raid56.
> Despite the mass quantities, none of the data involved is
> irretrievable, irreplaceable or of earth shattering importance on any
> level.  This is a purely personal setup.
>
> As such, I'm not bothered by the 'not ready for prime time status' of
> raid56.  This bug however, is really really nasty bad.  Once a drive
> is out of sync, it should never be automatically re-added back.
>
> I mention all this because I KNOW someone is going to go off on how I
> should have back ups of everything and how I should not run raid56 and
> how I should run mirrored instead etc.  Been there.  Done that.  I
> have the same canned lecture for people running data centers for
> businesses.
>
> I am not a business.  This is my personal hobby.  The risk does not
> bother me.  I don't mind running this setup because I think real life
> runtimes can contribute to the general betterment of btrfs for
> everyone.  I'm not in any particular hurry.  My income is completely
> independent from this.
>
> The potential problem is controlling what happens once I mount the
> degraded array in read/write mode to delete copied data and perform
> device reduction.  I have no clue how to, or even if, this can be done
> safely.
>
> The alternative is to continue to run this array in read only degraded
> mode until I can accumulate sufficient funds for a second chassis and
> approximately 20 more drives.  This probably won't be until Jan 2018.
>
> Is such a recovery strategy even possible?
> While I would expect a strategy involving 'btrfs restore' to be
> possible for raid0, raid1, raid10 configured arrays, I don't know that
> such a strategy will work for raid56.
>
> As I see it, the key here is to be able to safely delete copied files
> and to safely reduce the number of devices in the array.

OK, I'm one of the ones that's going to "go off" on you, but FWIW, I
expect pretty much everyone else would pretty much agree.  At least you
do have backups. =:^)

I don't think you appreciate just how bad raid56 is ATM.  There are just
too many REALLY serious bugs like the one you mention with it, and it's
actively NEGATIVELY recommended here as a result.  It's bad enough with
even current kernels, and the problems are well known enough to the
devs, that there's really not a whole lot to test ATM...

Well, unless you're REALLY into building kernels with a whole slew of
pre-merge patches and reporting back the results to the dev working on
it, as there /are/ a significant number of raid56 patches floating
around in a pre-merge state here on the list.  Some of them may be in
btrfs-next already, but I don't believe all of them are.

The problem with that is, despite how willing you may be, you obviously
aren't running them now.  So you obviously didn't know the current
really /really/ bad state.  If you're /willing/ to run them and have the
skills to do that sort of patching, etc, including possibly ones that
won't fix problems, only help further trace them down, then either
follow up with the dev working on it (which I've not tracked
specifically, so I can't tell you who) if he posts a reply, or go
looking on the list for raid56 patches and get ahold of the dev posting
them.  You'll need to get the opinion of the dev as to whether with the
patches it's worth running yet or not.  I'm not sure if he's through
patching the worst of the known issues, or if there's more to go.
One of the big problems is that in the current state, the repair tools,
scrub, etc, can actively make the problem MUCH worse.  They're simply
broken.

Normal raid56 runtime has been working for quite awhile, so it's no
surprise that it has worked for you.  And under specific circumstances,
pulling a drive and replacing it can work too.  But the problem is,
those circumstances are precisely the type that people test, but not the
type that tends to actually happen in the real world.

So effectively, raid56 mode is little more dependable than raid0 mode.
While you /may/ be able to recover, it's uncertain enough that it's
better to just treat the array as a raid0, and consider that you may
well lose everything on it with pretty much any problem at all.  As
such, it's simply irresponsible to recommend that anyone use it /as/
raid56, which is why it's actively NEGATIVELY recommended ATM.
Meanwhile, people that want raid0s... tend to configure raid0s, not
raid5s or raid6s.

FWIW, I /think/ at least /some/ of the patches have been reviewed and
cleared for, hopefull
Encountered kernel bug#72811. Advice on recovery?
I've encountered kernel bug#72811 "If a raid5 array gets into degraded
mode, gets modified, and the missing drive re-added, the filesystem
loses state".

In my case, I had rebooted my system and one of the drives on my main
array did not come up.  I was able to mount in degraded mode.  I needed
to re-boot the following day.  This time, all the drives in the array
came up.  Several hours later, the array went into read only mode.
That's when I discovered the odd device out had been re-added without
any kind of error message or notice.

SMART does not report any errors on the device itself.  I did have a
failed fan inside the server case, and I suspect a thermally sensitive
issue with the responsible drive controller.  Since replacing the failed
fan plus another fan, all drives report a running temperature in the
range of 34~35 Celsius.  This is normal.  None of the drives report
recording any errors.

The array normally consists of 22 devices with data and meta in raid6.
Physically, the devices are split 16 devices in a NORCO DS-24 cage and
the remaining devices are in the server itself.  All the devices are
SATA III.

I've added "noauto" to the options in my fstab file for this array.
I've also disabled the odd drive out so it's no longer seen as part of
the array.
Current fstab line:

LABEL="PublicB" /PublicB btrfs autodefrag,compress=lzo,space_cache,noatime,noauto 0 0

I manually mount the array:

mount -o recovery,ro,degraded

Current device list for the array:

Label: 'PublicB'  uuid: 76d87b95-5651-4707-b5bf-168210af7c3f
        Total devices 22 FS bytes used 83.63TiB
        devid    1 size 5.46TiB used 5.12TiB path /dev/sdt
        devid    2 size 5.46TiB used 5.12TiB path /dev/sdv
        devid    3 size 5.46TiB used 5.12TiB path /dev/sdaa
        devid    4 size 5.46TiB used 5.12TiB path /dev/sdx
        devid    5 size 5.46TiB used 5.12TiB path /dev/sdo
        devid    6 size 5.46TiB used 5.12TiB path /dev/sdq
        devid    7 size 5.46TiB used 5.12TiB path /dev/sds
        devid    8 size 5.46TiB used 5.12TiB path /dev/sdu
        devid    9 size 5.46TiB used 4.25TiB path /dev/sdr
        devid   10 size 5.46TiB used 4.25TiB path /dev/sdy
        devid   11 size 5.46TiB used 4.25TiB path /dev/sdab
        devid   12 size 3.64TiB used 3.64TiB path /dev/sdb
        devid   13 size 3.64TiB used 3.64TiB path /dev/sdc
        devid   14 size 4.55TiB used 4.25TiB path /dev/sdd
        devid   17 size 4.55TiB used 4.25TiB path /dev/sdg
        devid   18 size 4.55TiB used 4.25TiB path /dev/sdh
        devid   19 size 5.46TiB used 4.25TiB path /dev/sdm
        devid   20 size 5.46TiB used 2.33TiB path /dev/sdp
        devid   21 size 5.46TiB used 2.33TiB path /dev/sdn
        devid   22 size 5.46TiB used 2.33TiB path /dev/sdw
        devid   23 size 5.46TiB used 2.33TiB path /dev/sdz
        *** Some devices missing

The missing device is a {nominal} 5.0TB drive and would usually show up
in this list as:

        devid   15 size 4.55TiB used 4.25TiB path /dev/sde

Other than "mount -o recovery,ro" when all 22 were present {and before I
understood I had encountered #72811}, I have NOT run any of the more
advanced recovery/repair commands/techniques.

As best as I can tell using independent {non btrfs related} tools, all
data {approximately 80TB} prior to the initial event is intact.
Directories and files written/updated after the automatic {and silent}
device re-add are suspect and occasionally exhibit either missing files
or missing chunks of files.
Regardless of the fact the data is intact, I get runs of csum and other
errors - sample:

[114427.223006] BTRFS error (device sdw): parent transid verify failed on 59281854676992 wanted 328408 found 328388
[114427.223011] BTRFS error (device sdw): parent transid verify failed on 59281854676992 wanted 328408 found 328388
[114427.223012] BTRFS error (device sdw): parent transid verify failed on 59281854676992 wanted 328408 found 328388
[114427.223015] BTRFS info (device sdw): no csum found for inode 913818 start 1219862528
[114427.223019] BTRFS error (device sdw): parent transid verify failed on 59281854676992 wanted 328408 found 328388
[114427.223021] BTRFS error (device sdw): parent transid verify failed on 59281854676992 wanted 328408 found 328388
[114427.223022] BTRFS error (device sdw): parent transid verify failed on 59281854676992 wanted 328408 found 328388
[114427.223024] BTRFS info (device sdw): no csum found for inode 913818 start 1219866624
[114427.223027] BTRFS error (device sdw): parent transid verify failed on 59281854676992 wanted 328408 found 328388
[114427.223029] BTRFS error (device sdw): parent transid verify failed on 59281854676992 wanted 328408 found 328388
[114427.223030] BTRFS error (device sdw): parent transid verify failed on 59281854676992 wanted 328408 found 328388
[114427.223032] BTRFS info (device sdw): no csum found for inode 913818 start 1219870720
[114427.223035] BTRFS error (device sdw): parent transid verify failed