RE: Compression causes kernel crashes if there are I/O or checksum errors (was: RE: kernel BUG at fs/btrfs/volumes.c:5519 when hot-removing device in RAID-1)

2016-04-01 Thread James Johnston
> I grabbed this part from the log after the machine crashed again
> following trying to transfer a bunch of files that included ones with
> csum errors, let me know if this looks like the same issue you were
> having:
> 

I'm not sure.  You hit a soft lockup, while mine got a "kernel BUG at..."

Your stack trace diverges from mine after bio_endio.

James 





Re: Compression causes kernel crashes if there are I/O or checksum errors (was: RE: kernel BUG at fs/btrfs/volumes.c:5519 when hot-removing device in RAID-1)

2016-04-01 Thread mitch
I grabbed this part from the log after the machine crashed again
following trying to transfer a bunch of files that included ones with
csum errors, let me know if this looks like the same issue you were
having:


Mar 31 00:49:42 sl-server kernel: NMI watchdog: BUG: soft lockup -
CPU#21 stuck for 22s! [kworker/u67:5:80994]
Mar 31 00:49:42 sl-server kernel: Modules linked in: fuse xt_CHECKSUM
ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ip6t_REJECT
nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat
ebtable_broute ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6
nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security
ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4
nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle
iptable_security iptable_raw iptable_filter dm_mirror dm_region_hash
dm_log dm_mod kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul
ghash_clmulni_intel xfs aesni_intel lrw gf128mul glue_helper libcrc32c
ablk_helper cryptd joydev input_leds edac_mce_amd k10temp edac_core
fam15h_power sp5100_tco sg i2c_piix4 8250_fintek acpi_cpufreq shpchp
nfsd auth_rpcgss nfs_acl
Mar 31 00:49:42 sl-server kernel:  lockd grace sunrpc ip_tables btrfs
xor ata_generic pata_acpi raid6_pq sd_mod mgag200 crc32c_intel
drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci
serio_raw pata_atiixp libahci igb drm ptp pps_core mpt3sas dca
raid_class libata i2c_algo_bit scsi_transport_sas fjes uas usb_storage
Mar 31 00:49:42 sl-server kernel: CPU: 21 PID: 80994 Comm:
kworker/u67:5 Not tainted 4.5.0-1.el7.elrepo.x86_64 #1
Mar 31 00:49:42 sl-server kernel: Hardware name: Supermicro
H8DG6/H8DGi/H8DG6/H8DGi, BIOS 3.5 11/25/2013
Mar 31 00:49:42 sl-server kernel: Workqueue: btrfs-endio
btrfs_endio_helper [btrfs]
Mar 31 00:49:42 sl-server kernel: task: 8817f6fa8000 ti:
8800b731 task.ti: 8800b731
Mar 31 00:49:42 sl-server kernel: RIP:
0010:[]  []
btrfs_decompress_buf2page+0x123/0x200 [btrfs]
Mar 31 00:49:42 sl-server kernel: RSP: 0018:8800b7313be0  EFLAGS:
0246
Mar 31 00:49:42 sl-server kernel: RAX:  RBX:
 RCX: 
Mar 31 00:49:42 sl-server kernel: RDX:  RSI:
c9000e3d8000 RDI: 88144c7cc000
Mar 31 00:49:42 sl-server kernel: RBP: 8800b7313c48 R08:
8810f0295000 R09: 0020
Mar 31 00:49:42 sl-server kernel: R10: 8810d2ba7869 R11:
00010008 R12: 8817f6fa8000
Mar 31 00:49:42 sl-server kernel: R13: 8800b7313ce0 R14:
0008 R15: 1000
Mar 31 00:49:42 sl-server kernel: FS:  7efce58fb740()
GS:881807d4() knlGS:
Mar 31 00:49:42 sl-server kernel: CS:  0010 DS:  ES:  CR0:
8005003b
Mar 31 00:49:42 sl-server kernel: CR2: 7f00caf249e8 CR3:
001062121000 CR4: 000406e0
Mar 31 00:49:42 sl-server kernel: Stack:
Mar 31 00:49:42 sl-server kernel:  0020 f000
8810f0295000 8744
Mar 31 00:49:42 sl-server kernel:  00010008 c9000e3d7000
ea005131f300 0001
Mar 31 00:49:42 sl-server kernel:  0797 2869
0869 8810d2ba7000
Mar 31 00:49:42 sl-server kernel: Call Trace:
Mar 31 00:49:42 sl-server kernel:  []
lzo_decompress_biovec+0x202/0x300 [btrfs]
Mar 31 00:49:42 sl-server kernel:  []
end_compressed_bio_read+0x1f6/0x2f0 [btrfs]
Mar 31 00:49:42 sl-server kernel:  []
bio_endio+0x40/0x60
Mar 31 00:49:42 sl-server kernel:  []
end_workqueue_fn+0x3c/0x40 [btrfs]
Mar 31 00:49:42 sl-server kernel:  []
normal_work_helper+0xc0/0x2c0 [btrfs]
Mar 31 00:49:42 sl-server kernel:  []
btrfs_endio_helper+0x12/0x20 [btrfs]
Mar 31 00:49:42 sl-server kernel:  []
process_one_work+0x14f/0x400
Mar 31 00:49:42 sl-server kernel:  []
worker_thread+0x125/0x4b0
Mar 31 00:49:42 sl-server kernel:  [] ?
rescuer_thread+0x370/0x370
Mar 31 00:49:42 sl-server kernel:  []
kthread+0xd8/0xf0
Mar 31 00:49:42 sl-server kernel:  [] ?
kthread_park+0x60/0x60
Mar 31 00:49:42 sl-server kernel:  []
ret_from_fork+0x3f/0x70
Mar 31 00:49:42 sl-server kernel:  [] ?
kthread_park+0x60/0x60
Mar 31 00:49:42 sl-server kernel: Code: c7 48 8b 45 c0 49 03 7d 00 4a
8d 34 38 e8 06 18 00 e1 41 83 ac 24 28 12 00 00 01 41 8b 84 24 28 12 00
00 85 c0 0f 88 bf 00 00 00 <48> 89 d8 49 03 45 00 49 01 df 49 29 de 48
01 5d d0 48 3d 00 10 
Mar 31 00:49:43 sl-server sh[1297]: abrt-dump-oops: Found oopses: 1
Mar 31 00:49:43 sl-server sh[1297]: abrt-dump-oops: Creating problem
directories
Mar 31 00:49:43 sl-server sh[1297]: abrt-dump-oops: Not going to make
dump directories world readable because PrivateReports is on


Re: Compression causes kernel crashes if there are I/O or checksum errors (was: RE: kernel BUG at fs/btrfs/volumes.c:5519 when hot-removing device in RAID-1)

2016-03-29 Thread Mitch Fossen
Hello,

Your experience looks similar to an issue that I've been running into
recently. I have a btrfs array in RAID0 with compression=lzo set.

The machine runs fine for a while, then crashes at seemingly random times
with an error message in the journal about a stuck CPU and an issue
with a kworker process.

There are also a bunch of files on it that have been corrupted and
throw csum errors when trying to access them.

Combine that with some scheduled jobs that transfer files every night, and
it seems increasingly likely that this is the same issue you encountered.

This happened on Scientific Linux 7.2 with kernel-ml (which I think is
on version 4.5 now) installed from elrepo and the latest btrfs-progs.

I also booted from an Ubuntu 15.10 USB drive, mounted the damaged array,
and ran "find /home -type f -exec cat {} > /dev/null \;" from it, and it
looks like that has failed as well.
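
Next time I'll probably use something along these lines instead, so it keeps
going past bad files and records which ones fail (just a rough sketch; the
log path is arbitrary):

find /home -type f -print0 | while IFS= read -r -d '' f; do
    cat "$f" > /dev/null || echo "read failed: $f" >> /tmp/bad-files.txt
done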

I'll try to get the journal output posted and see if that could help
narrow down the cause of the problem.
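
Something like this should capture the kernel messages from the boot where it
crashed (assuming the journal on this box is kept persistent across reboots):

journalctl -k -b -1 --no-pager > btrfs-crash-journal.txt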

Let me know if there's anything else you want me to take a look at or
test on my machine that could help.

Thanks,

Mitch Fossen

On Mon, Mar 28, 2016 at 9:36 AM James Johnston wrote:
>
> Hi,
>
> Thanks for the corroborating report - it does sound to me like you ran into 
> the
> same problem I've found.  (I don't suppose you ever captured any of the
> crashes?  If they assert on the same thing as me then it's even stronger
> evidence.)
>
> > The failure mode of this particular ssd was premature failure of more and
> > more sectors, about 3 MiB worth over several months based on the raw
> > count of reallocated sectors in smartctl -A, but using scrub to rewrite
> > them from the good device would normally work, forcing the firmware to
> > remap that sector to one of the spares as scrub corrected the problem.
>
> I wonder what the risk of a CRC collision was in your situation?
>
> Certainly my test of "dd if=/dev/zero of=/dev/sdb" was very abusive, and I
> wonder if the result after scrubbing is trustworthy, or if there was some
> collisions.  But I wasn't checking to see if data coming out the other end was
> OK - I was just trying to see if the kernel crashes or not (e.g. a USB stick
> holding a bad btrfs file system should not crash a system).
>
> > But /home (on an entirely separate filesystem, but a filesystem still on
> > a pair of partitions, one on each of the same two ssds) would often have
> > more, and because I have a particular program that I start with my X and
> > KDE session that reads a bunch of files into cache as it starts up, I had
> > a systemd service configured to start at boot and cat all the files in
> > that particular directory to /dev/null, thus caching them so when I later
> > started X and KDE (I don't run a *DM and thus login at the text CLI and
> > startx, with a kde session, from the CLI) and thus this program, all the
> > files it reads would already be in cache.
> >
> >  If that service was allowed to run, it would read in all
> > those files and the resulting errors would often crash the kernel.
>
> This sounds oddly familiar to how I made it crash. :)
>
> > So I quickly learned that if I powered up and the kernel crashed at that
> > point, I could reboot with the emergency kernel parameter, which would
> > tell systemd to give me a maintenance-mode root login prompt after doing
> > its normal mounts but before starting the normal post-mount services, and
> > I could run scrub from there.  That would normally repair things without
> > triggering the crash, and when I had run scrub repeatedly if necessary to
> > correct any unverified errors in the first runs, I could then exit
> > emergency mode and let systemd start the normal services, including the
> > service that read all these files off the now freshly scrubbed
> > filesystem, without further issues.
>
> That is one thing I did not test.  I only ever scrubbed after first doing the
> "cat all files to null" test.  So in the case of compression, I never got that
> far.  Probably someone should test the scrubbing more thoroughly (i.e. with
> that abusive "dd" test I did) just to be sure that it is stable to confirm 
> your
> observations, and that the problem is only limited to ordinary file I/O on the
> file system.
>
> > And apparently the devs don't test the
> > somewhat less common combination of both compression and high numbers of
> > raid1 correctable checksum errors, or they would have probably detected
> > and fixed the problem from that.
>
> Well, I've only tested with RAID-1.  I don't know if:
>
> 1.  The problem occurs with other RAID levels like RAID-10, RAID5/6.
>
> 2.  The kernel crashes in non-duplicated levels.  In these cases, data loss is
> inevitable since the data is missing, but these losses should be handled
> cleanly, and not by crashing the kernel.  For example:
>
> a.  Checksum errors in RAID-0.
> b.  Checksum errors on a single hard drive (not multiple device array).

Re: Compression causes kernel crashes if there are I/O or checksum errors (was: RE: kernel BUG at fs/btrfs/volumes.c:5519 when hot-removing device in RAID-1)

2016-03-28 Thread Duncan
James Johnston posted on Mon, 28 Mar 2016 14:34:14 + as excerpted:

> Thanks for the corroborating report - it does sound to me like you ran
> into the same problem I've found.  (I don't suppose you ever captured
> any of the crashes?  If they assert on the same thing as me then it's
> even stronger evidence.)

No...  In fact, as I have compress=lzo on all my btrfs, until you found 
out that it didn't happen in the uncompressed case, I simply considered 
that part and parcel of btrfs not being fully stabilized and mature yet.  
I didn't even consider it a specific bug on its own, and thus didn't 
report it or trace it in any way, and simply worked around it, even tho I 
certainly found it frustrating.

>> The failure mode of this particular ssd was premature failure of more
>> and more sectors, about 3 MiB worth over several months based on the
>> raw count of reallocated sectors in smartctl -A, but using scrub to
>> rewrite them from the good device would normally work, forcing the
>> firmware to remap that sector to one of the spares as scrub corrected
>> the problem.
> 
> I wonder what the risk of a CRC collision was in your situation?
> 
> Certainly my test of "dd if=/dev/zero of=/dev/sdb" was very abusive, and
> I wonder if the result after scrubbing is trustworthy, or if there was
> some collisions.  But I wasn't checking to see if data coming out the
> other end was OK - I was just trying to see if the kernel crashes or not
> (e.g. a USB stick holding a bad btrfs file system should not crash a
> system).

I had absolutely no trouble with the scrubbed data, or at least none I 
attributed to that, tho I didn't have the data cross-hashed and cross-
check the post-scrub result against earlier hashes or anything, so a few 
CRC collisions could have certainly snuck thru.

But even were some to have done so, or even if they didn't in practice, 
if they could have in theory, just the standard crc checks are so far 
beyond what's built into a normal filesystem like the reiserfs that's 
still my second (and non-btrfs) level backup.  So it's not like I'm 
majorly concerned.  If I was paranoid, as I mentioned I could certainly 
be doing cross-checks against multiple hashes, but I survived without any 
sort of routine data integrity checking for years, and even a practical 
worst-case-scenario crc-collision is already an infinite percentage 
better than that (just as 1 is an infinite percentage of 0), so it's 
nothing I'm going to worry about unless I actually start seeing real 
cases of it.

>> So I quickly learned that if I powered up and the kernel crashed at
>> that point, I could reboot with the emergency kernel parameter, which
>> would tell systemd to give me a maintenance-mode root login prompt
>> after doing its normal mounts but before starting the normal post-mount
>> services, and I could run scrub from there.  That would normally repair
>> things without triggering the crash, and when I had run scrub
>> repeatedly if necessary to correct any unverified errors in the first
>> runs, I could then exit emergency mode and let systemd start the normal
>> services, including the service that read all these files off the now
>> freshly scrubbed filesystem, without further issues.
> 
> That is one thing I did not test.  I only ever scrubbed after first
> doing the "cat all files to null" test.  So in the case of compression,
> I never got that far.  Probably someone should test the scrubbing more
> thoroughly (i.e. with that abusive "dd" test I did) just to be sure that
> it is stable to confirm your observations, and that the problem is only
> limited to ordinary file I/O on the file system.

I suspect that when the devs duplicate the bug and ultimately trace it 
down, we'll know from the code-path whether scrub could have hit it or 
not, without actually testing the scrub case on its own.

And along with the fix, it's a fair bet there will be an fstests patch that
verifies no regressions there once it's fixed, as well.

Once the fstests patch is in, it should be just a small tweak to test 
whether or not scrub is subject to the problem if it uses a different 
code-path, and in fact once they find the problem here and verify the fix, 
even if scrub doesn't use that code-path, I expect they'll be verifying 
scrub's own code-paths as well.

>> And apparently the devs don't test the somewhat less common combination
>> of both compression and high numbers of raid1 correctable checksum
>> errors, or they would have probably detected and fixed the problem from
>> that.
> 
> Well, I've only tested with RAID-1.  I don't know if:
> 
> 1.  The problem occurs with other RAID levels like RAID-10, RAID5/6.
> 
> 2.  The kernel crashes in non-duplicated levels.  In these cases, data
> loss is inevitable since the data is missing, but these losses should be
> handled cleanly, and not by crashing the kernel.

Good points.  Again, I expect the extent of the bug based on its code-
path and what actually uses it, should be 

RE: Compression causes kernel crashes if there are I/O or checksum errors (was: RE: kernel BUG at fs/btrfs/volumes.c:5519 when hot-removing device in RAID-1)

2016-03-28 Thread James Johnston
Hi,

Thanks for the corroborating report - it does sound to me like you ran into the
same problem I've found.  (I don't suppose you ever captured any of the
crashes?  If they assert on the same thing as me then it's even stronger
evidence.)

> The failure mode of this particular ssd was premature failure of more and
> more sectors, about 3 MiB worth over several months based on the raw
> count of reallocated sectors in smartctl -A, but using scrub to rewrite
> them from the good device would normally work, forcing the firmware to
> remap that sector to one of the spares as scrub corrected the problem.

I wonder what the risk of a CRC collision was in your situation?

Certainly my test of "dd if=/dev/zero of=/dev/sdb" was very abusive, and I
wonder if the result after scrubbing is trustworthy, or if there was some
collisions.  But I wasn't checking to see if data coming out the other end was
OK - I was just trying to see if the kernel crashes or not (e.g. a USB stick
holding a bad btrfs file system should not crash a system).

> But /home (on an entirely separate filesystem, but a filesystem still on
> a pair of partitions, one on each of the same two ssds) would often have
> more, and because I have a particular program that I start with my X and
> KDE session that reads a bunch of files into cache as it starts up, I had
> a systemd service configured to start at boot and cat all the files in
> that particular directory to /dev/null, thus caching them so when I later
> started X and KDE (I don't run a *DM and thus login at the text CLI and
> startx, with a kde session, from the CLI) and thus this program, all the
> files it reads would already be in cache.
>
>  If that service was allowed to run, it would read in all
> those files and the resulting errors would often crash the kernel.

This sounds oddly familiar to how I made it crash. :)

> So I quickly learned that if I powered up and the kernel crashed at that
> point, I could reboot with the emergency kernel parameter, which would
> tell systemd to give me a maintenance-mode root login prompt after doing
> its normal mounts but before starting the normal post-mount services, and
> I could run scrub from there.  That would normally repair things without
> triggering the crash, and when I had run scrub repeatedly if necessary to
> correct any unverified errors in the first runs, I could then exit
> emergency mode and let systemd start the normal services, including the
> service that read all these files off the now freshly scrubbed
> filesystem, without further issues.

That is one thing I did not test.  I only ever scrubbed after first doing the
"cat all files to null" test.  So in the case of compression, I never got that
far.  Probably someone should test the scrubbing more thoroughly (i.e. with
that abusive "dd" test I did) just to be sure that it is stable to confirm your
observations, and that the problem is only limited to ordinary file I/O on the
file system.
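
If someone does pick that up, I'd expect the scrub side of it to look roughly
like the following (a sketch only; /mnt stands in for wherever the corrupted
RAID-1 is mounted):

btrfs scrub start -Bd /mnt   # -B stays in the foreground, -d shows per-device stats
btrfs scrub status /mnt      # check the error/correction counts afterwards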

> And apparently the devs don't test the
> somewhat less common combination of both compression and high numbers of
> raid1 correctable checksum errors, or they would have probably detected
> and fixed the problem from that.

Well, I've only tested with RAID-1.  I don't know if:

1.  The problem occurs with other RAID levels like RAID-10, RAID5/6.

2.  The kernel crashes in non-duplicated levels.  In these cases, data loss is
inevitable since the data is missing, but these losses should be handled
cleanly, and not by crashing the kernel.  For example:

a.  Checksum errors in RAID-0.
b.  Checksum errors on a single hard drive (not multiple device array).

I guess more testing is needed, but I don't have time to do this more
exhaustive testing right now, especially for these other RAID levels I'm not
planning to use (as I'm doing this in my limited free time).  (For now, I can
just turn off compression & move on.)
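
For reference, turning it off should just be a matter of dropping the mount
option and remounting - as far as I understand, extents already written stay
compressed until they are rewritten, so this only affects new writes:

# remove "compress=lzo" from the btrfs entries in /etc/fstab, then
reboot   # (or remount each affected filesystem) for it to take effect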

Do any devs do regular regression testing for these sorts of edge cases once
they come up? (i.e. this problem won't come back, will it?)

> So thanks for the additional tests and narrowing it down to the
> compression on raid1 with many checksum errors case.  Now that you've
> found out how the problem can be replicated, I'd guess we'll have a fix
> patch in relatively short order. =:^)

Hopefully!  Like I said, it might not be limited to RAID-1 though.  I only
tested RAID-1.

> That said, based on my own experience, I don't consider the problem dire
> enough to switch off compression on my btrfs raid1s here.  After all, I
> both figured out how to live with the problem on my failing ssd before I
> knew all this detail, and have eliminated the symptoms for the time being
> at least, as the devices I'm using now are currently reliable enough that
> I don't have to deal with this issue.
> 
> And in the event that I do encounter the problem again, in severe enough
> form that I can't even get a successful scrub in to fix it, possibly due
> to catastrophic failure of a 

Re: Compression causes kernel crashes if there are I/O or checksum errors (was: RE: kernel BUG at fs/btrfs/volumes.c:5519 when hot-removing device in RAID-1)

2016-03-28 Thread Duncan
James Johnston posted on Mon, 28 Mar 2016 04:41:24 + as excerpted:

> After puzzling over the btrfs failure I reported here a week ago, I
> think there is a bad incompatibility between compression and RAID-1
> (maybe other RAID levels too?).  I think it is unsafe for users to use
> compression, at least with multiple devices until this is
> fixed/investigated further.  That seems like a drastic claim, but I know
> I will not be using it for now.  Otherwise, checksum errors scattered
> across multiple devices that *should* be recoverable will render the
> file system unusable, even to read data from.  (One alternative
> hypothesis might be that defragmentation causes the issue, since I used
> defragment to compress existing files.)
> 
> I finally was able to simplify this to a hopefully easy to reproduce
> test case, described in lengthier detail below.  In summary, suppose we
> start with an uncompressed btrfs file system on only one disk containing
> the root file system,
> such as created by a clean install of a Linux distribution.  I then:
> (1) enable compress=lzo in fstab, reboot, and then defragment the disk
> to compress all the existing files, (2) add a second drive to the array
> and balance for RAID-1, (3) reboot for good measure, (4) cause a high
> level of I/O errors, such as hot-removal of the second drive, OR simply
> a high level of bit rot (i.e. use dd to corrupt most of the disk, while
> either mounted or unmounted). This is guaranteed to cause the kernel to
> crash.

Described that way, my own experience confirms your tests, except that 
(1) I hadn't tested the no-compression case to know it was any different, 
and (2) in my case I was actually using btrfs raid1 mode and scrub to be 
able to continue to deal with a failing ssd out of a pair, for quite some 
while after I would have ordinarily had to replace it were I not using 
something like btrfs raid1 with checksummed file integrity and scrubbing 
errors with replacements from the good device.

Here's how it worked for me and why I ultimately agree with your 
conclusions, at least regarding compressed raid1 mode crashes due to too 
many checksum failures (since I have no reference to agree or 
disagree with the uncompressed case).

As I said above, I had one ssd failing, but was taking the opportunity 
while I had it to watch its behavior deeper into the failure than I 
normally would, and while I was at it, get familiar enough with btrfs 
scrub to repair errors that it became just another routine command for me 
(to the point that I even scripted up a custom scrub command complete 
with my normally used options, etc).  On the relatively small (largest 
was 24 GiB per device, paired device btrfs raid1) multiple btrfs on 
partitions on the two devices scrub was normally under a minute to run 
even when doing quite a few repairs, so it wasn't as if it was taking me 
the hours to days it can take at TB scale on spinning rust.
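
The wrapper itself is nothing fancy, btw - something along these lines, tho
the names here are illustrative rather than lifted from the actual script:

#!/bin/bash
# scrub the given mountpoint (default /home) in the foreground,
# reporting stats per device
MNT="${1:-/home}"
btrfs scrub start -Bd "$MNT"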

The failure mode of this particular ssd was premature failure of more and 
more sectors, about 3 MiB worth over several months based on the raw 
count of reallocated sectors in smartctl -A, but using scrub to rewrite 
them from the good device would normally work, forcing the firmware to 
remap that sector to one of the spares as scrub corrected the problem.
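
The attribute in question can be pulled straight out of the smartctl output
with something like the below, substituting whichever device node the failing
ssd actually is:

smartctl -A /dev/sda | grep -i reallocated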

One not immediately intuitive thing I found with scrub, BTW, was that if 
it finished with unverified errors, I needed to rerun scrub again to do 
further repairs.  I've since confirmed with someone who can read code (I 
sort of do but more at the admin playing with patches level than the dev 
level) that my guess at the reason behind this behavior was correct.  
When a metadata node fails checksum verification and is repaired, the 
checksums that it in turn contained cannot be verified in that pass and 
show up as unverified errors.  A repeated scrub once those errors are 
fixed can verify and fix if necessary those additional nodes, and 
occasionally up to three or four runs were necessary to fully verify and 
repair all blocks, eliminating all unverified errors, at which point 
further scrubs found no further errors.
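
In practice that amounted to a loop roughly like the following - a sketch
only, and it assumes the raw per-device stats printed with -R include an
unverified_errors counter:

MNT=/home
for run in 1 2 3 4; do
    out=$(btrfs scrub start -BdR "$MNT")
    printf '%s\n' "$out"
    printf '%s\n' "$out" | grep -q 'unverified_errors: [1-9]' || break
done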

It occurred to me as I write this, that the problem I saw and you have 
confirmed with testing and now reported, may actually be related to some 
interaction between these unverified errors and compressed blocks.

Anyway, as it happens, my / filesystem is normally mounted ro except 
during updates and by the end I was scrubbing after updates, and even 
after extended power-downs, so it generally had only a few errors.

But /home (on an entirely separate filesystem, but a filesystem still on 
a pair of partitions, one on each of the same two ssds) would often have 
more, and because I have a particular program that I start with my X and 
KDE session that reads a bunch of files into cache as it starts up, I had 
a systemd service configured to start at boot and cat all the files in 
that particular directory to /dev/null, thus caching 

Compression causes kernel crashes if there are I/O or checksum errors (was: RE: kernel BUG at fs/btrfs/volumes.c:5519 when hot-removing device in RAID-1)

2016-03-27 Thread James Johnston
Hi,

After puzzling over the btrfs failure I reported here a week ago, I think there
is a bad incompatibility between compression and RAID-1 (maybe other RAID
levels too?).  I think it is unsafe for users to use compression, at least with
multiple devices until this is fixed/investigated further.  That seems like a
drastic claim, but I know I will not be using it for now.  Otherwise, checksum
errors scattered across multiple devices that *should* be recoverable will
render the file system unusable, even to read data from.  (One alternative
hypothesis might be that defragmentation causes the issue, since I used
defragment to compress existing files.)

I finally was able to simplify this to a hopefully easy to reproduce test case,
described in lengthier detail below.  In summary, suppose we start with an
uncompressed btrfs file system on only one disk containing the root file system,
such as created by a clean install of a Linux distribution.  I then:
(1) enable compress=lzo in fstab, reboot, and then defragment the disk to
compress all the existing files, (2) add a second drive to the array and balance
for RAID-1, (3) reboot for good measure, (4) cause a high level of I/O errors,
such as hot-removal of the second drive, OR simply a high level of bit rot
(i.e. use dd to corrupt most of the disk, while either mounted or unmounted).
This is guaranteed to cause the kernel to crash.
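
Concretely, the bit-rot step can be as blunt as the following - it destroys
everything on /dev/sdb of course, so only run it against a throwaway test
device - or else hot-remove the second drive from the VM:

dd if=/dev/zero of=/dev/sdb bs=1M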

If the compression step is skipped such that the volume is uncompressed, you
get lots of I/O errors logged - as expected.  For hot-removal, as you point out,
patches to auto-degrade the array aren't merged yet.  For bit rot, the file
system should log lots of checksum errors and corrections, but again should
succeed.  Most importantly, the kernel _does not fall over_ and bring the
system down.  I think that's acceptable behavior until the patches you mention
are merged.

> There are a number of things missing from multiple device support,
> including any concept of a device becoming faulty (i.e. persistent
> failures rather than transient which Btrfs seems to handle OK for the
> most part), and then also getting it to go degraded automatically, and
> finally hot spare support. There are patches that could use testing.

I think in general, if the system can't handle a persistent failure, it can't
reliably handle a transient failure either... you're just less likely to
notice...  The permanent failure just stress-tests the failure code - if you
pay attention to the test case when hot removing, you'll note that oftentimes
dozens of I/O errors are mitigated successfully before one of them finally
brings the system down.

What you've described above in the patch series are nice-to-have "fancy
features" and I do hope they eventually get tested and merged, but I also hope
the above patches let you disable them all so that one can stress-test the code
handling I/O failures without having a drive get auto-dropped from the array
before you tested the failure code enough.  The I/O errors in my dmesg I'm OK
with, but I think if the file system crashes the kernel it's bad news.

> I think when testing, it's simpler to not use any additional device
> mapper layers.

The test case eliminates all device mapper layers, and just uses raw
disks/partitions.  Here it is - skip to step #5 for the meat of it:

1.  Set up a new VirtualBox VM with:
* System: Enable EFI
* System: 8 GB RAM
* System: 1 processor
* Storage: Two SATA hard drives, 8 GB each, backed by dynamic VDI files
* Storage: Default IDE CD-ROM is fine
* Storage: The SATA hard drives must be hot-pluggable
* Network: As you require
* Serial port for debugging

2.  Boot to http://releases.ubuntu.com/15.10/ubuntu-15.10-server-amd64.iso

3.  Install Ubuntu 15.10 with default settings except as noted below:
a.  Network/user settings: make up settings/accounts as needed.
b.  Use Manual partitioning with these partitions on /dev/sda, in the
following order:
* 100 MB EFI System Partition
* 500 MB btrfs, mount point at /boot
* Remaining space: btrfs, mount point at /

4.  Install and boot into 4.6 rc-1 mainline kernel:

wget 
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.6-rc1-wily/linux-image-4.6.0-040600rc1-generic_4.6.0-040600rc1.201603261930_amd64.deb
dpkg -i 
linux-image-4.6.0-040600rc1-generic_4.6.0-040600rc1.201603261930_amd64.deb
reboot

5.  Set up compression and RAID-1 for root partition onto /dev/sdb:

# Add "compress=lzo" to all btrfs mounts:
vim /etc/fstab
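# e.g. the root entry might then look something like (UUID is just a placeholder):
# UUID=<fs-uuid>  /  btrfs  defaults,compress=lzo  0  0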
reboot   # to take effect
# Add second drive
btrfs device add /dev/sdb /
# Defragment to compress files
btrfs filesystem defragment -v -c -r /home
btrfs filesystem defragment -v -c -r /
# Balance to RAID-1
btrfs balance start -dconvert=raid1 -mconvert=raid1 -v /
# btrfs fi usage says there was some single data until I did this, too:
btrfs balance start -dconvert=raid1 -mconvert=raid1 -v /home
# Make