Re: Encountered kernel bug#72811. Advice on recovery?

2017-04-16 Thread Duncan
Marat Khalili posted on Sun, 16 Apr 2017 11:01:00 +0300 as excerpted:

>> Even making such a warning conditional on kernel version is
>> problematic, because many distros backport major blocks of code,
>> including perhaps btrfs fixes, and the nominally 3.14 or whatever
>> kernel may actually be running btrfs and other fixes from 4.14 or
>> later, by the time they actually drop support for whatever LTS distro
>> version and quit backporting fixes.
> 
> This information could be stored in the kernel and made available to
> user-mode tools via some proc file. This would be very useful
> _especially_ considering backporting. Raid56 could already be fixed (or
> not) by the time such a file is implemented, but no doubt there will
> still be other highly experimental capabilities, judging by how things
> go. And this feature itself could easily be backported.

What they /could/ do would be something very similar to what they already 
did for the free-space-tree (as opposed to the free-space-cache, the 
original and still default implementation).

There was a critical bug in the early implementations of free-space-
tree.  But btrfs has incompatibility/feature flags for a reason, and they 
set it up in such a way that the flaw could be detected and fixed.

In theory they could grab another bit from it and make that raid56v2, or 
something similar, and if the raid56 flag is there but not raid56v2, 
warn, etc.
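
As a concrete illustration: the incompat bits in question are already 
visible from userspace via btrfs-progs (only the raid56v2 bit itself is 
hypothetical here):

  # dump the superblock and show the incompat feature bits
  # (older btrfs-progs shipped this as the standalone btrfs-show-super)
  btrfs inspect-internal dump-super /dev/sdX | grep -i incompat

A userland tool could then warn whenever RAID56 is set but a 
raid56v2-style bit is not.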

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Encountered kernel bug#72811. Advice on recovery?

2017-04-16 Thread Marat Khalili
Even making such a warning conditional on kernel version is 
problematic, because many distros backport major blocks of code, 
including perhaps btrfs fixes, and the nominally 3.14 or whatever 
kernel may actually be running btrfs and other fixes from 4.14 or 
later, by the time they actually drop support for whatever LTS distro 
version and quit backporting fixes.


This information could be stored in the kernel and made available to 
user-mode tools via some proc file. This would be very useful 
_especially_ considering backporting. Raid56 could already be fixed (or 
not) by the time such a file is implemented, but no doubt there will 
still be other highly experimental capabilities, judging by how things 
go. And this feature itself could easily be backported.


Some machine-readable readiness level (ok / warning / override flag 
needed / known but disabled in kernel) plus a one-line text message 
displayed to users in the latter three cases is all we need. If the 
proc file is missing or doesn't contain information about a specific 
capability, tools could default to the current behaviour (AFAIR there 
are already warnings in some cases). The message should tersely cover 
any known issues, including stability, performance, compatibility and 
general readiness, and may contain links (to the btrfs wiki?) for more 
information. I expect the whole file to easily fit in 512 bytes.
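
Just to make that concrete, here is a rough sketch (the file name and 
every entry in it are invented purely for illustration):

  $ cat /proc/fs/btrfs/feature-status
  raid1    ok        ""
  raid56   warning   "known data-loss bugs; see https://btrfs.wiki.kernel.org"
  somefeat disabled  "compiled out of this kernel"

A mkfs or mount wrapper would only need to look up the profiles it is 
about to use before printing the message or refusing to proceed.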


--

With Best Regards,
Marat Khalili


Re: Encountered kernel bug#72811. Advice on recovery?

2017-04-15 Thread Hugo Mills
On Sat, Apr 15, 2017 at 11:28:41PM +, Duncan wrote:
> Duncan posted on Sat, 15 Apr 2017 01:41:28 + as excerpted:
> 
> > Besides which, if the patch was submitted now, the earliest it could
> > really hit btrfs-progs would be 4.12,
> 
> Well, maybe 3.11.x...

   Can I borrow your time machine? Would last Wednesday be OK?

   Hugo.

-- 
Hugo Mills | We teach people management skills by examining
hugo@... carfax.org.uk | characters in Shakespeare. You could look at
http://carfax.org.uk/  | Claudius's crisis management techniques, for
PGP: E2AB1DE4  | example.   Richard Smith-Jones, Slings and Arrows




Re: Encountered kernel bug#72811. Advice on recovery?

2017-04-15 Thread Duncan
Duncan posted on Sat, 15 Apr 2017 01:41:28 + as excerpted:

> Besides which, if the patch was submitted now, the earliest it could
> really hit btrfs-progs would be 4.12,

Well, maybe 3.11.x...



-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Encountered kernel bug#72811. Advice on recovery?

2017-04-14 Thread Duncan
ronnie sahlberg posted on Fri, 14 Apr 2017 09:56:30 -0700 as excerpted:

> On Thu, Apr 13, 2017 at 8:47 PM, Duncan <1i5t5.dun...@cox.net> wrote:
>> Ank Ular posted on Thu, 13 Apr 2017 14:49:41 -0400 as excerpted:
> ...
>> OK, I'm one of the ones that's going to "go off" on you, but FWIW, I
>> expect pretty much everyone else would pretty much agree.  At least you
>> do have backups. =:^)
>>
>> I don't think you appreciate just how bad raid56 is ATM.  There are
>> just too many REALLY serious bugs like the one you mention with it, and
>> it's actively NEGATIVELY recommended here as a result.  It's bad enough
>> with even current kernels, and the problems are well known enough to
>> the devs,
>> that there's really not a whole lot to test ATM...
> 
> Can we please hide the ability to even create any new raid56 filesystems
> behind a new flag :
> 
> --i-accept-total-data-loss
> 
> to make sure that folks are prepared for how risky it currently is. That
> should be an easy patch to the userland utilities.

The biggest problem with such a flag in general is that people often use 
a kernel and userland that are /vastly/ out of sync, version-wise.  Were 
such a flag to be introduced, people would still be seeing it five years 
or more after it no longer applied to the kernel they're using (because 
the kernel's what actually does the work in many cases, including scrub).

Even making such a warning conditional on kernel version is problematic, 
because many distros backport major blocks of code, including perhaps 
btrfs fixes, and the nominally 3.14 or whatever kernel may actually be 
running btrfs and other fixes from 4.14 or later, by the time they 
actually drop support for whatever LTS distro version and quit backporting 
fixes.

Besides which, if the patch was submitted now, the earliest it could 
really hit btrfs-progs would be 4.12, and by the time people actually get 
that in their distro they may well be on 4.13 or 4.15 or whatever, and 
the patches fixing raid56 mode to actually work may already be in place.

The only place such a warning really works is on the wiki at
https://btrfs.wiki.kernel.org , because that's really the only place that 
can be updated to current status in a realistic timeframe.  And there's 
already a feature maturity matrix there, with raid56 mode marked 
appropriately, last I checked.

Meanwhile, it can be argued that admins (and anyone making the choice of 
filesystem and device layout they're going to run is an admin of those 
systems, even if they're just running them at home for their own use) who 
don't care enough about the safety of their data to actually research the 
stability of the filesystem and filesystem features they plan to use... 
really don't value that data very highly in the first place.  And the 
status is out there both on this list and on the wiki, so even a trivial 
google should find it without issue.

Indeed:  https://www.google.com/search?q=btrfs+raid56+stability



-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Encountered kernel bug#72811. Advice on recovery?

2017-04-14 Thread Chris Murphy
On Fri, Apr 14, 2017 at 10:46 AM, Chris Murphy  wrote:

>
> The passive repair works when it's a few bad sectors on the drive. But
> when it's piles of missing data, this is the wrong mode. It needs a
> limited scrub or balance to fix things. Right now you have to manually
> do a full scrub or balance after you've mounted for even one second
> using degraded,rw. That's why you want to avoid it at all costs.


Small clarification on "right now you have to manually do"

I don't mean YOU personally, with your array. I mean, anyone who
happens to have done even the tiniest amount of writes to a Btrfs
volume while mounted in rw,degraded. Once a new device is added and
the bad/missing device deleted, you still have to manually do a scrub
or balance of the entire array. That's the only way to fix up the
array back to normal. It's not automatic.

The way to avoid this is to *immediately*, before any new writes, do a
device add and device delete missing*. That prevents any degraded
chunks from being written.



* ON non-raid56 volumes, you can use 'btrfs replace'.
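
In concrete terms the sequence is roughly this (device names and mount
point are placeholders):

  mount -o degraded /dev/sdb /mnt
  btrfs device add /dev/sdnew /mnt       # add the replacement first
  btrfs device delete missing /mnt       # then drop the dead one
  # on non-raid56 profiles, replace does it in one pass instead:
  btrfs replace start <missing-devid> /dev/sdnew /mnt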



-- 
Chris Murphy


Re: Encountered kernel bug#72811. Advice on recovery?

2017-04-14 Thread ronnie sahlberg
On Thu, Apr 13, 2017 at 8:47 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> Ank Ular posted on Thu, 13 Apr 2017 14:49:41 -0400 as excerpted:
...
> OK, I'm one of the ones that's going to "go off" on you, but FWIW, I
> expect pretty much everyone else would pretty much agree.  At least you
> do have backups. =:^)
>
> I don't think you appreciate just how bad raid56 is ATM.  There are just
> too many REALLY serious bugs like the one you mention with it, and it's
> actively NEGATIVELY recommended here as a result.  It's bad enough with
> even current kernels, and the problems are well known enough to the devs,
> that there's really not a whole lot to test ATM...

Can we please hide the ability to even create any new raid56
filesystems behind a new flag :

--i-accept-total-data-loss

to make sure that folks are prepared for how risky it currently is.
That should be an easy patch to the userland utilities.
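
For example (the flag is only a proposal at this point, it doesn't
exist in btrfs-progs):

  mkfs.btrfs -m raid6 -d raid6 --i-accept-total-data-loss /dev/sd[b-g]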


Re: Encountered kernel bug#72811. Advice on recovery?

2017-04-14 Thread Chris Murphy
Summary: 22x device raid6 (data and metadata). One device vanished,
and the volume is rw,degraded mounted with writes happening; next time
it's mounted the formerly missing device is not missing so it's a
normal mount, and writes are happening. Then later, the filesystem
goes read only. Now there are problems, what are the escape routes?



OK the Autopsy Report:

> In my case, I had rebooted my system and one of the drives on my main
> array did not come up. I was able to mount in degraded mode. I needed
> to re-boot the following day. This time, all the drives in the array
> came up. Several hours later, the array went into read only mode.
> That's when I discovered the odd device out had been re-added without
> any kind of error message or notice.

The instant Btrfs complains about something, you cannot make
assumptions, and you have to fix it. You can't turn your back on it.
It's an angry goose with an egg nearby. And if you turn your back on
it, it'll beat your ass down. But because this is raid6, you thought it
was OK, a reliable, predictable mule. And you made a lot of assumptions
that are totally reasonable because it's called raid6, except that those
assumptions are all wrong, because Btrfs is not like anything else, and
its raid doesn't work like anything else.



1. The first mount attempt fails. OK why? On Btrfs you must find out
why normal mount failed, because you don't want to use degraded mode
unless absolutely necessary. But you didn't troubleshoot it.

2. The second mount attempt, with degraded, works. This mode exists for
one reason: you are ready right now to add a new device and delete the
missing one. With other raid56 implementations you can wait and just
hope another drive doesn't die. Not Btrfs. You might get one chance with
rw,degraded to do a device replacement, and you have to make 'dev add'
and 'dev del missing' the top priority before writing anything else to
the volume. So if you're not ready to do this, the default first action
is ro,degraded. You can still get data off the volume without changing
it, and without burning what may be a one-time chance to mount
degraded,rw. But you didn't do this; you assumed Btrfs raid56 is OK to
use rw,degraded like any other raid.
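
In other words, if you aren't ready to replace the device right then,
the safer first move is something like (device and mount point are
placeholders):

  mount -o ro,degraded /dev/sdb /mnt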

3. The third mount, you must have mounted with -o degraded right off
the bat, assuming the formerly missing device was still missing and
you'd  still need -o degraded. If you'd tried a normal mount, it would
have succeeded, which would have informed you the formerly missing
device had been found and was being used. Now you have normal chunks,
degraded chunks, and more normal chunks. This array is very confused.

4. Btrfs does not do active heals (auto generation limited scrub) when
a previously missing device becomes available again. It only does
passive healing as it encounters wrong or missing data.

5. Btrfs raid6 is obviously broken somehow, because you're not the
only person who has had a file system with all available information
and two copies, and it still breaks. Most of your data is raid6,
that's three copies (data plus two parity). Some of it is degraded
raid6 which is effectively raid5, so that's data plus one copy. And
yet at some point Btrfs gets confused in normal, non-degraded mount,
and splats to read-only.  This is definitely a bug. It needs a bug
report with the complete call traces prior to and including the
read-only splat, or it simply won't get better. It's unclear where the
devs are at, priority-wise, with raid56; it's also unclear whether
they're going to fix it or rewrite it.


The point is, you made a lot of mistakes by making too many
assumptions, and not realizing that degraded state in Btrfs is
basically an emergency. Finally at the very end, it still could have
saved you from your own mistakes, but there's a missing feature
(active auto heal to catch up the missing device), and there's a bug
making the fs read-only. And now it's in a sufficiently
non-deterministic state that the repair tools probably can't repair
it.


>
> The practical problem with bug#72811 is that all the csum and transid
> information is treated as being just as valid on the automatically
> re-added drive as the same information on all the other drives.

My guess is that on the first normal mount after degraded writes, the
re-added drive gets a new super block with current, valid information
pointing at missing data, and only as it goes looking for that data or
metadata does it start fixing things up. Passive. So its own passive
healing eventually hits a brick wall, the farther backward in time it
has to go to do these fix-ups.

The passive repair works when it's a few bad sectors on the drive. But
when it's piles of missing data, this is the wrong mode. It needs a
limited scrub or balance to fix things. Right now you have to manually
do a full scrub or balance after you've mounted for even one second
using degraded,rw. That's why you want to avoid it at all costs.
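
Concretely, that manual pass is something like (mount point is a
placeholder):

  btrfs scrub start -Bd /mnt    # -B waits for completion, -d shows per-device stats
  # or, the heavier option, rewrite every chunk:
  btrfs balance start /mnt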


>
> I don't have issues with the above tools not being

Re: Encountered kernel bug#72811. Advice on recovery?

2017-04-13 Thread Duncan
Ank Ular posted on Thu, 13 Apr 2017 14:49:41 -0400 as excerpted:

> I've encountered kernel bug#72811 "If a raid5 array gets into degraded
> mode, gets modified, and the missing drive re-added, the filesystem
> loses state".

> The array normally consists of 22 devices with data and meta in raid6.
> Physically, the devices are split with 16 in a NORCO DS-24 cage and
> the remaining devices are in the server itself. All the devices are SATA
> III.

> I don't have issues with the above tools not being ready for raid56.
> Despite the mass quantities, none of the data involved is irretrievable,
> irreplaceable or of earth shattering importance on any level. This is a
> purely personal setup.

> As such, I'm not bothered by the 'not ready for prime time status' of
> raid56. This bug, however, is really, really nasty bad. Once a drive is
> out of sync, it should never be automatically re-added.

> I mention all this because I KNOW someone is going to go off on how I
> should have back ups of everything and how I should not run raid56 and
> how I should run mirrored instead etc. Been there. Done that. I have the
> same canned lecture for people running data centers for businesses.
> 
> I am not a business. This is my personal hobby. The risk does not bother
> me. I don't mind running this setup because I think real life runtimes
> can contribute to the general betterment of btrfs for everyone. I'm not
> in any particular hurry. My income is completely independent from this.

> The potential problem is controlling what happens once I mount the
> degraded array in read/write mode to delete copied data and perform
> device reduction. I have no clue how to or even if this can be done
> safely.
> 
> The alternative is to continue to run this array in read only degraded
> mode until I can accumulate sufficient funds for a second chassis and
> approximately 20 more drives. This probably won't be until Jan 2018.
> 
> Is such a recovery strategy even possible? While I would expect a
> strategy involving 'btrfs restore' to be possible for raid0, raid1,
> raid10 configured arrays, I don't know that such a strategy will work for
> raid56.
> 
> As I see it, the key here is to be able to safely delete copied files
> and to safely reduce the number of devices in the array.

OK, I'm one of the ones that's going to "go off" on you, but FWIW, I 
expect pretty much everyone else would pretty much agree.  At least you 
do have backups. =:^)

I don't think you appreciate just how bad raid56 is ATM.  There are just 
too many REALLY serious bugs like the one you mention with it, and it's 
actively NEGATIVELY recommended here as a result.  It's bad enough with 
even current kernels, and the problems are well known enough to the devs, 
that there's really not a whole lot to test ATM...

Well, unless you're REALLY into building kernels with a whole slew of pre-
merge patches and reporting back the results to the dev working on it, as 
there /are/ a significant number of raid56 patches floating around in a 
pre-merge state here on the list.  Some of them may be in btrfs-next 
already, but I don't believe all of them are.

The problem with that is, despite how willing you may be, you obviously 
aren't running them now.  So you obviously didn't know the current 
really /really/ bad state.  If you're /willing/ to run them and have the 
skills to do that sort of patching, etc, including possibly ones that 
won't fix problems, only help further trace them down, then either 
followup with the dev working on it (which I've not tracked specifically 
so I can't tell you who) if he posts a reply, or go looking on the list 
for raid56 patches and get ahold of the dev posting them.

You'll need to get the opinion of the dev as to whether with the patches 
it's worth running yet or not.  I'm not sure if he's thru patching the 
worst of the known issues, or if there's more to go.

One of the big problems is that in the current state, the repair tools, 
scrub, etc, can actively make the problem MUCH worse.  They're simply 
broken.  Normal raid56 runtime has been working for quite awhile, so it's 
no surprise that has worked for you.  And under specific circumstances, 
pulling a drive and replacing it can work too.  But the problem is, those 
circumstances are precisely the type that people test, but not the type 
that tends to actually happen in the real world.

So effectively, raid56 mode is little more dependable than raid0 mode.  
While you /may/ be able to recover, it's uncertain enough that it's 
better to just treat the array as a raid0, and consider that you may well 
lose everything on it with pretty much any problem at all.  As such, it's 
simply irresponsible to recommend that anyone use it /as/ raid56, which 
is why it's actively NEGATIVELY recommended ATM.  Meanwhile, people that 
want raid0s... tend to configure raid0s, not raid5s or raid6s.

FWIW, I /think/ at least /some/ of the patches have been reviewed and 
cleared for, hopefull

Encountered kernel bug#72811. Advice on recovery?

2017-04-13 Thread Ank Ular
I've encountered kernel bug#72811 "If a raid5 array gets into degraded
mode, gets modified, and the missing drive re-added, the filesystem
loses state".

In my case, I had rebooted my system and one of the drives on my main
array did not come up. I was able to mount in degraded mode. I needed
to re-boot the following day. This time, all the drives in the array
came up. Several hours later, the array went into read only mode.
That's when I discovered the odd device out had been re-added without
any kind of error message or notice.

SMART does not report any errors on the device itself. I did have a
failed fan inside the server case and I suspect a thermally sensitive
issue with the responsible drive controller. Since replacing the
failed fan plus another fan, all of the drives report a running
temperature in the range of 34~35 Celsius. This is normal. None of the
drives report recording any errors.

The array normally consists of 22 devices with data and meta in raid6.
Physically, the devices are split with 16 in a NORCO DS-24 cage and
the remaining devices are in the server itself. All the devices are
SATA III.

I've added "noauto" to the options in my fstab file for this array.
I've also disabled the odd drive out so it's no longer seen as part of
the array.

Current fstab line:
LABEL="PublicB" /PublicBbtrfs
 autodefrag,compress=lzo,space_cache,noatime,noauto  0 0

I manually mount the array:
mount -o recovery,ro,degraded

Current device list for the array:
Label: 'PublicB'  uuid: 76d87b95-5651-4707-b5bf-168210af7c3f
   Total devices 22 FS bytes used 83.63TiB
   devid    1 size 5.46TiB used 5.12TiB path /dev/sdt
   devid    2 size 5.46TiB used 5.12TiB path /dev/sdv
   devid    3 size 5.46TiB used 5.12TiB path /dev/sdaa
   devid    4 size 5.46TiB used 5.12TiB path /dev/sdx
   devid    5 size 5.46TiB used 5.12TiB path /dev/sdo
   devid    6 size 5.46TiB used 5.12TiB path /dev/sdq
   devid    7 size 5.46TiB used 5.12TiB path /dev/sds
   devid    8 size 5.46TiB used 5.12TiB path /dev/sdu
   devid    9 size 5.46TiB used 4.25TiB path /dev/sdr
   devid   10 size 5.46TiB used 4.25TiB path /dev/sdy
   devid   11 size 5.46TiB used 4.25TiB path /dev/sdab
   devid   12 size 3.64TiB used 3.64TiB path /dev/sdb
   devid   13 size 3.64TiB used 3.64TiB path /dev/sdc
   devid   14 size 4.55TiB used 4.25TiB path /dev/sdd
   devid   17 size 4.55TiB used 4.25TiB path /dev/sdg
   devid   18 size 4.55TiB used 4.25TiB path /dev/sdh
   devid   19 size 5.46TiB used 4.25TiB path /dev/sdm
   devid   20 size 5.46TiB used 2.33TiB path /dev/sdp
   devid   21 size 5.46TiB used 2.33TiB path /dev/sdn
   devid   22 size 5.46TiB used 2.33TiB path /dev/sdw
   devid   23 size 5.46TiB used 2.33TiB path /dev/sdz
   *** Some devices missing

The missing device is a {nominal} 5.0TB drive and would usually show
up in this list as:
   devid   15 size 4.55TiB used 4.25TiB path /dev/sde

Other than "mount -o recovery,ro" when all 22 were present {and before
I understood I had encountered #72811}, I have NOT run any of the more
advanced recovery/repair commands/techniques.

As best as I can tell using independent {non btrfs related} checks, all data
{approximately 80TB} prior to the initial event is intact. Directories
and files written/updated after the automatic {and silent} device
re-add are suspect and occasionally exhibit either missing files or
missing chunks of files.

Regardless of the fact the data is intact, I get runs of csum and
other errors - sample:
[114427.223006] BTRFS error (device sdw): parent transid verify failed
on 59281854676992 wanted 328408 found 328388
[114427.223011] BTRFS error (device sdw): parent transid verify failed
on 59281854676992 wanted 328408 found 328388
[114427.223012] BTRFS error (device sdw): parent transid verify failed
on 59281854676992 wanted 328408 found 328388
[114427.223015] BTRFS info (device sdw): no csum found for inode
913818 start 1219862528
[114427.223019] BTRFS error (device sdw): parent transid verify failed
on 59281854676992 wanted 328408 found 328388
[114427.223021] BTRFS error (device sdw): parent transid verify failed
on 59281854676992 wanted 328408 found 328388
[114427.223022] BTRFS error (device sdw): parent transid verify failed
on 59281854676992 wanted 328408 found 328388
[114427.223024] BTRFS info (device sdw): no csum found for inode
913818 start 1219866624
[114427.223027] BTRFS error (device sdw): parent transid verify failed
on 59281854676992 wanted 328408 found 328388
[114427.223029] BTRFS error (device sdw): parent transid verify failed
on 59281854676992 wanted 328408 found 328388
[114427.223030] BTRFS error (device sdw): parent transid verify failed
on 59281854676992 wanted 328408 found 328388
[114427.223032] BTRFS info (device sdw): no csum found for inode
913818 start 1219870720
[114427.223035] BTRFS error (device sdw): parent transid verify failed