Re: write corruption due to bio cloning on raid5/6

2017-07-29 Thread Duncan
Janos Toth F. posted on Sun, 30 Jul 2017 03:39:10 +0200 as excerpted:

[OT but related topic continues...]

> I still get shivers if I need to resize a filesystem, due to the
> memories of those early tragic experiences when I never won the lottery
> on the "trial and error" runs but lost filesystems with both hands and
> learned what widespread silent corruption is and how you can refresh
> your backups with corrupted copies... Let's not take me back to those
> early days, please. I don't want to live in a cave anymore. Thank you,
> modern filesystems (and their authors). :)
> 
> And on that note... Assuming I had interference problems, it was caused
> by my own mistake/negligence. I can always make similar or bigger
> human mistakes, independent of disk-level segregation. (For example, no
> amount of partitioning will save any data if I accidentally wipe the
> entire drive with dd, or if I have it security-locked by the controller
> and lose the passwords, etc...)

I was glad to say goodbye to MSDOS/MBR style partitions as well, but just 
as happy to enthusiastically endorse GPT/EFI style partitions, with their 
effectively unlimited partition numbers (128 allowed at the default table 
size), no primary/logical partition stuff to worry about, partition (as 
opposed to filesystem in the partition) labels/names, integrity 
checksums, and second copy at the other end of the device. =:^)

And while all admins have their fat-finger or fat-head, aka brown-bag, 
experiences, I've never erased the wrong partition, tho I can certainly 
remember being /very/ careful the first couple times I did partitioning, 
back in the 90s on MSDOS.  Thankfully, these days even ssds are 
"reasonably" priced, and spinning rust is the trivial cost of perhaps a 
couple meals out, so as long as there's backups on other physical 
devices, getting even the device name wrong simply means losing perhaps 
your working copy instead of redoing the layout of one of your backups.

And of course you can see the existing layout on the device before you 
repartition it, and if it's not what you expected or there are any other 
problems, you just back out without doing that final writeout of the new 
partition table.
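For illustration, and assuming a GPT disk managed with the sgdisk tool (the 
device name /dev/sdX is just a placeholder), reviewing the existing layout 
and its integrity before committing anything looks roughly like:

$ sgdisk -p /dev/sdX   # print the current partition table
$ sgdisk -v /dev/sdX   # verify the table checksums and the backup copy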

FWIW my last brown-bag was writing and running a script as root with a 
variable-name typo, so an rm -rf $varname/ ran with $varname expanding to 
nothing.  I caught and stopped it after it had emptied /bin, while it was 
in /etc, I believe.  Luckily I could boot to the (primary) backup.
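As an aside, the usual shell guards against exactly that class of typo look 
roughly like this (varname and the path are just placeholders):

#!/bin/bash
set -u                       # abort when an unset variable is referenced
varname=/some/scratch/dir    # placeholder
rm -rf -- "${varname:?}"/    # ${var:?} aborts if the variable is unset or empty

With either guard in place, a misspelled variable kills the script instead of 
expanding to an rm -rf /.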

But meanwhile, two experiences that set in concrete the practicality of 
separate filesystems on their own partitions, for me:

1) Back on MS, IE4-beta era.  I was running the public beta when the MSIE 
devs decided that for performance reasons they needed to write directly 
to the IE cache index on disk, bypassing the usual filesystem methods.  
What they didn't think about, however, was IE's new integration into the 
Explorer shell, meaning it was running all the time.

So along come people running the beta with their scheduled defrag, which 
decides the index is fragmented and moves it out from under the (of course 
still running) Explorer shell.  The next time IE direct-writes to what WAS 
the cache index, it overwrites whatever file defrag moved into that spot 
after it moved the cache file out.

The eventual fix was to set the system attribute on the cache index, so 
the defragger wouldn't touch it.

I know a number of people running that beta who lost important files that 
way, when those files got moved into the old on-disk location of the 
cache index and were overwritten by IE direct-writing to what it 
/thought/ was still the location of its index file.

But I was fine, never in any danger, because IE's "Temporary Internet 
Files" cache was on a dedicated tempfiles filesystem.  So the only files 
it overwrote for me were temporary in any case.

2) Some years ago, during a Phoenix summer, my AC went out.  I was in a 
trailer at the time, so without the AC it got hot pretty quickly, and I 
was away, with the computer left on, at the time it went out.

The high in the shade that day was about 47C/117F, and the trailer was in 
the sun, so it easily hit 55-60C/131-140F inside.  The computer was 
obviously going to be hotter than that, and the spinning disk in the 
computer hotter still, so it easily hit 70C/158F or higher.

The CPU shut down of course, and was fine when I turned it back on after 
a cooldown.

The disk... not so fine.  I'm sure it physically head-crashed and if I 
had taken it apart I'd have found grooves on the platter.

But... disks were a lot more expensive back then, and I didn't have 
another disk with backups.

What I *DID* have were backup partitions on the same disk, and because 
they weren't mounted at the time, the head didn't try seeking to them, 
and they weren't damaged (at least not beyond what could be repaired). 
When I went to assess things after everything cooled down, the damage was 
(almost) all o

Re: write corruption due to bio cloning on raid5/6

2017-07-29 Thread Janos Toth F.
Reply to the TL;DR part, so TL;DR marker again...

Well, I live on the other extreme now. I want as few filesystems as
possible and viable (it's obviously impossible to have a real backup
within the same fs and/or device, and with the current
size/performance/price differences between HDD and SSD it's evident
you should separate the "small and fast" from the "big and slow" storage,
but other than that...). I always believed (even before I got a real grasp
on these things and could explain my view or argue about this) that
"subvolumes" (in a general sense, but let's use this word here) should
reside below filesystems (and be totally optional), that filesystems
should spread over a whole disk or (md- or hardware) RAID volume
(forget the MSDOS partitions), and that even these ZFS/Btrfs style
subvolumes should be used sparingly (only when you really have a good
enough reason to create a subvolume, although it doesn't matter nearly
as much with subvolumes as it does with partitions).

I remember the days when I thought it was important to create separate
partitions for different kinds of data (10+ years ago, when I was aware
I didn't have the experience to deviate from common general
teachings). I remember all the pain of randomly running out of space
on any and all filesystems and eventually mixing the various kinds of
data on every theoretically-segregated filesystem (wherever I found
free space), causing a nightmare of a broken sorting system (like a
library after a tornado), and then all the horror of my first
Russian-roulette-like experiences of resizing partitions and filesystems
to make the segregation decent again. And I saw much worse on other
people's machines. At one point, I decided to create as few partitions as
possible (and I really like the idea of zero partitions; I don't miss
MSDOS).
I still get shivers if I need to resize a filesystem, due to the
memories of those early tragic experiences when I never won the
lottery on the "trial and error" runs but lost filesystems with both
hands and learned what widespread silent corruption is and how you
can refresh your backups with corrupted copies... Let's not take me
back to those early days, please. I don't want to live in a cave
anymore. Thank you, modern filesystems (and their authors). :)

And on that note... Assuming I had interference problems, it was
caused by my own mistake/negligence. I can always make similar or
bigger human mistakes, independent of disk-level segregation. (For
example, no amount of partitioning will save any data if I accidentally
wipe the entire drive with dd, or if I have it security-locked by the
controller and lose the passwords, etc...)


Re: write corruption due to bio cloning on raid5/6

2017-07-29 Thread Duncan
Janos Toth F. posted on Sat, 29 Jul 2017 05:02:48 +0200 as excerpted:

> The read-only scrub finished without errors/hangs (with kernel
> 4.12.3). So, I guess the hangs were caused by:
> 1: other bug in 4.13-RC1
> 2: crazy-random SATA/disk-controller issue
> 3: interference between various btrfs tools [*]
> 4: something in the background did DIO write with 4.13-RC1 (but all
> affected content was eventually overwritten/deleted between the scrub
> attempts)
> 
> [*] I expected scrub to finish in ~5 rather than ~40 hours (and didn't
> expect interference issues), so I didn't disable the scheduled
> maintenance script which deletes old files, recursively defrags the
> whole fs and runs a balance with usage=33 filters. I guess either of
> those (especially balance) could potentially cause scrub to hang.

That #3, interference between btrfs tools, could be it.  It seems btrfs
in general is getting stable enough now that we're beginning to see
bugs exposed by running two or more tools at once: the devs have
apparently caught and fixed enough of the single-usage race bugs that
individual tools work reasonably well, and it's now the races in the
concurrent multi-usage cases, which no one was thinking about when
they wrote the code, that are being exposed.  At least, there have
lately been a number of such bugs either definitely or probably traced
to concurrent usage, then reported and fixed, more than I remember
seeing in the past.


(TL;DR folks can stop at that.)

Incidentally, that's one more advantage to my own strategy of multiple
independent small btrfs, keeping everything small enough that
maintenance jobs are at least tolerably short, making it actually
practical to run them.

Tho my case is surely an extreme, with everything on ssd and my largest
btrfs, even after recently switching my media filesystems to ssd and
btrfs, being 80 GiB (usable and per device, btrfs raid1 on paired
partitions, each on a different physical ssd).  I use neither quotas,
which don't scale well on btrfs and I don't need them, nor snapshots,
which have a bit of a scaling issue (tho apparently not as bad as
quotas) as well, because weekly or monthly backups are enough here, and
the filesystems are small enough (and on ssd) to do full-copy backups
in minutes each.  In fact, making backups easier was a major reason I
switched even the backups and media devices to all ssd, this time.

So scrubs are trivially short enough I can run them and wait for the
results while composing posts such as this (bscrub is my scrub script,
run here by my admin user with a stub setup so sudo isn't required):

$ bscrub /mm 
scrub device /dev/sda11 (id 1) done
scrub started at Sat Jul 29 14:50:54 2017 and finished after 00:01:08
total bytes scrubbed: 33.98GiB with 0 errors
scrub device /dev/sdb11 (id 2) done
scrub started at Sat Jul 29 14:50:54 2017 and finished after 00:01:08
total bytes scrubbed: 33.98GiB with 0 errors

Just over a minute for a scrub of both devices on my largest 80 gig per
device btrfs. =:^)  Tho projecting to full it might be 2 and a half minutes...
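The bscrub wrapper itself isn't shown here; a minimal hypothetical sketch of 
such a wrapper, assuming it does nothing more than forward to the btrfs scrub 
command in foreground mode with per-device stats, would be little more than:

#!/bin/sh
# hypothetical stand-in for the bscrub wrapper used above
exec btrfs scrub start -Bd "$@"

with -B keeping it in the foreground and -d printing the per-device lines 
shown above.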

Tho of course parity-raid scrubs would be far slower, at a WAG an hour or two,
for similar size on spinning rust...

Balances are similar, but being on ssd and not needing one on any of the still
relatively freshly redone filesystems ATM, I don't feel inclined to needlessly
spend a write cycle just for demonstration.

With filesystem maintenance runtimes of a minute, definitely under five minutes,
per filesystem, and with full backups under 10, I don't /need/ to run more than
one tool at once, and backups can trivially be kept fresh enough that I don't
really feel the need to schedule maintenance and risk running more than one
that way, either, particularly when I know it'll be done in minutes if I run it
manually. =:^)

Like I said, I'm obviously an extreme case, but equally obviously, while I see
the runtime concurrency bug reports on the list, it's not something likely to
affect me personally. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: write corruption due to bio cloning on raid5/6

2017-07-28 Thread Janos Toth F.
The read-only scrub finished without errors/hangs (with kernel
4.12.3). So, I guess the hangs were caused by:
1: other bug in 4.13-RC1
2: crazy-random SATA/disk-controller issue
3: interference between various btrfs tools [*]
4: something in the background did DIO write with 4.13-RC1 (but all
affected content was eventually overwritten/deleted between the scrub
attempts)

[*] I expected scrub to finish in ~5 rather than ~40 hours (and didn't
expect interference issues), so I didn't disable the scheduled
maintenance script which deletes old files, recursively defrags the
whole fs and runs a balance with usage=33 filters. I guess either of
those (especially balance) could potentially cause scrub to hang.
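For reference, the shape of such a maintenance script, sketched here with
placeholder paths and retention since the actual script isn't shown, boils
down to something like:

#!/bin/sh
# periodic btrfs maintenance - hypothetical sketch
MNT=/data                                        # placeholder mount point
find "$MNT"/scratch -type f -mtime +30 -delete   # delete old files
btrfs filesystem defragment -r "$MNT"            # recursive defrag of the whole fs
btrfs balance start -dusage=33 "$MNT"            # rewrite data chunks no more than 33% full

Any of those running concurrently with a scrub is the kind of tool overlap
being discussed here.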

On Thu, Jul 27, 2017 at 10:44 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> Janos Toth F. posted on Thu, 27 Jul 2017 16:14:47 +0200 as excerpted:
>
>> * This is off-topic but raid5 scrub is painful. The disks run at
>> constant ~100% utilization while performing at ~1/5 of their sequential
>> read speeds. And despite explicitly asking idle IO priority when
>> launching scrub, the filesystem becomes unbearably slow (while scrub
>> takes a day or so to finish ... or get to the point where it hung the
>> last time around, close to the end).
>
> That's because basically all the userspace scrub command does is make the
> appropriate kernel calls to have the kernel do the real scrub.  So priority-
> idling the userspace scrub doesn't have the effect it has on normal userspace
> jobs that do much of the work themselves.
>
> The problem is that idle-prioritizing the kernel threads actually doing
> the work could risk a deadlock due to lock inversion, since they're
> kernel threads and aren't designed with the idea of people messing with
> their priority in mind.
>
> Meanwhile, that's yet another reason btrfs raid56 mode isn't recommended
> at this time.  Try btrfs raid1 or raid10 mode instead, or possibly btrfs
> raid1, single or raid0 mode on top of a pair of mdraid5s or similar.  Tho
> parity-raid mode in general (that is, not btrfs-specific) is known for
> being slow in various cases, with raid10 normally being the best
> performing closest alternative.  (Tho in the btrfs-specific case, btrfs
> raid1 on top of a pair of mdraid/dmraid/whatever raid0s, is the normally
> recommended higher performance reasonably low danger alternative.)

If this applies to all RAID flavors then I consider the built-in help
and the manual pages of scrub misleading (if it's RAID56-only, the
manual should still mention how RAID56 is an exception).

Also, a resumed scrub seems to skip a lot of data. It picks up where
it left off but then prematurely reports a job well done. I remember
noticing a similar behavior with balance cancel/resume on RAID5 a few
years ago (it went on for a few more chunks but left the rest alone
and reported completion --- I am not sure if that's fixed now or these
have a common root cause).


Re: write corruption due to bio cloning on raid5/6

2017-07-27 Thread Duncan
Janos Toth F. posted on Thu, 27 Jul 2017 16:14:47 +0200 as excerpted:

> * This is off-topic but raid5 scrub is painful. The disks run at
> constant ~100% utilization while performing at ~1/5 of their sequential
> read speeds. And despite explicitly asking idle IO priority when
> launching scrub, the filesystem becomes unbearably slow (while scrub
> takes a day or so to finish ... or get to the point where it hung the
> last time around, close to the end).

That's because basically all the userspace scrub command does is make the 
appropriate kernel calls to have the kernel do the real scrub.  So priority-
idling the userspace scrub doesn't have the effect it has on normal userspace 
jobs that do much of the work themselves.

The problem is that idle-prioritizing the kernel threads actually doing 
the work could risk a deadlock due to lock inversion, since they're 
kernel threads and aren't designed with the idea of people messing with 
their priority in mind.
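For clarity, "asking for idle IO priority" typically means one of these two 
invocations (the mount point is a placeholder); whether either actually 
throttles the kernel-side workers is exactly the caveat above:

$ ionice -c 3 btrfs scrub start -B /mnt   # idle class on the userspace command
$ btrfs scrub start -c 3 /mnt             # scrub's own ioprio class option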

Meanwhile, that's yet another reason btrfs raid56 mode isn't recommended 
at this time.  Try btrfs raid1 or raid10 mode instead, or possibly btrfs 
raid1, single or raid0 mode on top of a pair of mdraid5s or similar.  Tho 
parity-raid mode in general (that is, not btrfs-specific) is known for 
being slow in various cases, with raid10 normally being the best 
performing closest alternative.  (Tho in the btrfs-specific case, btrfs 
raid1 on top of a pair of mdraid/dmraid/whatever raid0s, is the normally 
recommended higher performance reasonably low danger alternative.)
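As a rough sketch of that last alternative, with placeholder device names and 
no claim about which layout is right for any given setup, two md raid0 arrays 
with btrfs raid1 across them would be created along these lines:

$ mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda /dev/sdb
$ mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sdc /dev/sdd
$ mkfs.btrfs -m raid1 -d raid1 /dev/md0 /dev/md1
$ mount /dev/md0 /mnt

btrfs then sees two devices and keeps one copy of everything on each md array.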

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: write corruption due to bio cloning on raid5/6

2017-07-27 Thread Janos Toth F.
> It should only affect the dio-written files: the mentioned bug makes
> btrfs write garbage into those files, so checksums fail when reading
> them; nothing else comes from this bug.

Thanks for confirming that.  I thought so but I removed the affected
temporary files even before I knew they were corrupt, yet I had
trouble with the follow-up scrub, so I got confused about the scope of
the issue.
However, I am not sure if some other software which regularly runs in
the background might use DIO (I don't think so but can't say for
sure).

> A hang could normally be caught by sysrq-w; could you please try it
> and see if there is anything different in the kernel log?

It's not a total system hang. The filesystem in question effectively
becomes read-only (I forgot to check if it actually turns RO or writes
just silently hang) and scrub hangs (it doesn't seem to do any disk
I/O and can't be cancelled gracefully). A graceful reboot or shutdown
silently fails.
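For reference, and assuming the magic sysrq interface is enabled, capturing
the blocked-task dump Liu Bo asked for would look roughly like this, run as
root while scrub is stuck:

# echo 1 > /proc/sys/kernel/sysrq   # enable sysrq if it isn't already
# echo w > /proc/sysrq-trigger      # dump tasks in uninterruptible sleep
# dmesg | tail -n 200               # the stack traces land in the kernel log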

In the meantime, I switched to Linux 4.12.3, updated the firmware on
the HDDs and ran an extended SMART self-test (which found no errors),
used cp to copy everything (not for backup but as a form of "crude
scrub" [see *], which yielded zero errors) and now finally started a
scrub (in foreground and read-only mode this time).
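Spelled out with placeholder device and mount names, those steps amount to
roughly:

# smartctl -t long /dev/sdb      # start an extended SMART self-test
# smartctl -a /dev/sdb           # check the self-test log once it completes
# btrfs scrub start -B -r /mnt   # foreground (-B), read-only (-r) scrub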

* This is off-topic but raid5 scrub is painful. The disks run at
constant ~100% utilization while performing at ~1/5 of their
sequential read speeds. And despite explicitly asking idle IO priority
when launching scrub, the filesystem becomes unbearably slow (while
scrub takes a day or so to finish ... or get to the point where it
hung the last time around, close to the end).

I find it a little strange that BFQ and the on-board disk caching with
NCQ + FUA (look-ahead read caching and write cache reordering enabled
with 128MB on-board caches) can't mitigate the issue a little better
(whatever scrub is doing "wrong" from a performance perspective).

If scrub hangs again, I will try to extract something useful from the logs.


Re: write corruption due to bio cloning on raid5/6

2017-07-26 Thread Liu Bo
On Mon, Jul 24, 2017 at 10:22:53PM +0200, Janos Toth F. wrote:
> I accidentally ran into this problem (it's pretty silly because I
> almost never run RC kernels or do dio writes but somehow I just
> happened to do both at once, exactly before I read your patch notes).
> I didn't initially catch any issues (I see no related messages in the
> kernel log) but after seeing your patch, I started a scrub (*) and it
> hung.
> 
> Is there a way to fix a filesystem corrupted by this bug or does it
> need to be destroyed and recreated? (It's m=s=raid10, d=raid5 with
> 5x4TB HDDs.) There is a partial backup (of everything really
> important, the rest is not important enough to be kept in multiple
> copies, hence the desire for raid5...) and everything seems to be
> readable anyway (so could be saved if needed) but nuking a big fs is
> never fun...

It should only affect the dio-written files: the mentioned bug makes
btrfs write garbage into those files, so checksums fail when reading
them; nothing else comes from this bug.

As you use m=s=raid10, the filesystem metadata is OK, so I think the
scrub hang could be another problem.
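If it helps to bound the damage, the csum-failure warnings include the inode
number, which can be mapped back to a path; a rough example, using inode 257
from the logs quoted elsewhere in this thread and a placeholder mount point:

# dmesg | grep 'csum failed'
# btrfs inspect-internal inode-resolve 257 /mnt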


> 
> Scrub just hangs and pretty much leaves the whole system hanging (it
> needs a power cycle to reboot), although everything else runs smoothly
> besides this. Btrfs check (read-only normal-mem mode) finds no errors,
> the kernel log is clean, etc.
> 
> I think I deleted all the affected dio-written test-files even before
> I started scrubbing, so that doesn't seem to do the trick. Any other
> ideas?
>

A hang could normally be caught by sysrq-w; could you please try it
and see if there is anything different in the kernel log?

Thanks,

-liubo
> 
> * By the way, I see raid56 scrub is still painfully slow (~30MB/s per
> disk with raw disk speeds of >100 MB/s). I forgot about this issue
> since I last used raid5 a few years ago.


Re: write corruption due to bio cloning on raid5/6

2017-07-24 Thread Janos Toth F.
I accidentally ran into this problem (it's pretty silly because I
almost never run RC kernels or do dio writes but somehow I just
happened to do both at once, exactly before I read your patch notes).
I didn't initially catch any issues (I see no related messages in the
kernel log) but after seeing your patch, I started a scrub (*) and it
hung.

Is there a way to fix a filesystem corrupted by this bug or does it
need to be destroyed and recreated? (It's m=s=raid10, d=raid5 with
5x4TB HDDs.) There is a partial backup (of everything really
important, the rest is not important enough to be kept in multiple
copies, hence the desire for raid5...) and everything seems to be
readable anyway (so could be saved if needed) but nuking a big fs is
never fun...

Scrub just hangs and pretty much leaves the whole system hanging (it
needs a power cycle to reboot), although everything else runs smoothly
besides this. Btrfs check (read-only normal-mem mode) finds no errors,
the kernel log is clean, etc.

I think I deleted all the affected dio-written test-files even before
I started scrubbing, so that doesn't seem to do the trick. Any other
ideas?


* By the way, I see raid56 scrub is still painfully slow (~30MB/s per
disk with raw disk speeds of >100 MB/s). I forgot about this issue
since I last used raid5 a few years ago.


Re: [PATCH] Btrfs: fix write corruption due to bio cloning on raid5/6

2017-07-14 Thread Liu Bo
On Fri, Jul 14, 2017 at 10:04:34AM +0100, Filipe Manana wrote:
> On Thu, Jul 13, 2017 at 11:38 PM, Liu Bo  wrote:
> > On Thu, Jul 13, 2017 at 09:36:19AM -0700, Liu Bo wrote:
> >> On Thu, Jul 13, 2017 at 03:09:54PM +0100, fdman...@kernel.org wrote:
> >> > From: Filipe Manana 
> >> >
> >> > The recent changes to make bio cloning faster (added in the 4.13 merge
> >> > window) by using the bio_clone_fast() API introduced a regression on
> >> > raid5/6 modes, because cloned bios have an invalid bi_vcnt field
> >> > (therefore it can not be used) and the raid5/6 code uses the
> >> > bio_for_each_segment_all() API to iterate the segments of a bio, and this
> >> > API uses a bio's bi_vcnt field.
> >> >
> >> > The issue is very simple to trigger by doing for example a direct IO 
> >> > write
> >> > against a raid5 or raid6 filesystem and then attempting to read what we
> >> > wrote before:
> >> >
> >> >   $ mkfs.btrfs -m raid5 -d raid5 -f /dev/sdc /dev/sdd /dev/sde /dev/sdf
> >> >   $ mount /dev/sdc /mnt
> >> >   $ xfs_io -f -d -c "pwrite -S 0xab 0 1M" /mnt/foobar
> >> >   $ od -t x1 /mnt/foobar
> >> >   od: /mnt/foobar: read error: Input/output error
> >> >
> >> > For that example, the following is also reported in dmesg/syslog:
> >> >
> >> >   [18274.985557] btrfs_print_data_csum_error: 18 callbacks suppressed
> >> >   [18274.995277] BTRFS warning (device sdf): csum failed root 5 ino 257 
> >> > off 0 csum 0x98f94189 expected csum 0x94374193 mirror 1
> >> >   [18274.997205] BTRFS warning (device sdf): csum failed root 5 ino 257 
> >> > off 4096 csum 0x98f94189 expected csum 0x94374193 mirror 1
> >> >   [18275.025221] BTRFS warning (device sdf): csum failed root 5 ino 257 
> >> > off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 1
> >> >   [18275.047422] BTRFS warning (device sdf): csum failed root 5 ino 257 
> >> > off 12288 csum 0x98f94189 expected csum 0x94374193 mirror 1
> >> >   [18275.054818] BTRFS warning (device sdf): csum failed root 5 ino 257 
> >> > off 4096 csum 0x98f94189 expected csum 0x94374193 mirror 1
> >> >   [18275.054834] BTRFS warning (device sdf): csum failed root 5 ino 257 
> >> > off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 1
> >> >   [18275.054943] BTRFS warning (device sdf): csum failed root 5 ino 257 
> >> > off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 2
> >> >   [18275.055207] BTRFS warning (device sdf): csum failed root 5 ino 257 
> >> > off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 3
> >> >   [18275.055571] BTRFS warning (device sdf): csum failed root 5 ino 257 
> >> > off 0 csum 0x98f94189 expected csum 0x94374193 mirror 1
> >> >   [18275.062171] BTRFS warning (device sdf): csum failed root 5 ino 257 
> >> > off 12288 csum 0x98f94189 expected csum 0x94374193 mirror 1
> >> >
> >> > A scrub will also fail correcting bad copies, mentioning the following in
> >> > dmesg/syslog:
> >> >
> >> >   [18276.128696] scrub_handle_errored_block: 498 callbacks suppressed
> >> >   [18276.129617] BTRFS warning (device sdf): checksum error at logical 
> >> > 2186346496 on dev /dev/sde, sector 2116608, root 5, inode 257, offset 
> >> > 65536, length 4096, links $
> >> >   [18276.149235] btrfs_dev_stat_print_on_error: 498 callbacks suppressed
> >> >   [18276.157897] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 
> >> > 0, flush 0, corrupt 1, gen 0
> >> >   [18276.206059] BTRFS warning (device sdf): checksum error at logical 
> >> > 2186477568 on dev /dev/sdd, sector 2116736, root 5, inode 257, offset 
> >> > 196608, length 4096, links$
> >> >   [18276.206059] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 
> >> > 0, flush 0, corrupt 1, gen 0
> >> >   [18276.306552] BTRFS warning (device sdf): checksum error at logical 
> >> > 2186543104 on dev /dev/sdd, sector 2116864, root 5, inode 257, offset 
> >> > 262144, length 4096, links$
> >> >   [18276.319152] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 
> >> > 0, flush 0, corrupt 2, gen 0
> >> >   [18276.394316] BTRFS warning (device sdf): checksum error at logical 
> >> > 2186739712 on dev /dev/sdf, sector 2116992, root 5, inode 257, offset 
> >> > 458752, length 4096, links$
> >> >   [18276.396348] BTRFS error (device sdf): bdev /dev/sdf errs: wr 0, rd 
> >> > 0, flush 0, corrupt 1, gen 0
> >> >   [18276.434127] BTRFS warning (device sdf): checksum error at logical 
> >> > 2186870784 on dev /dev/sde, sector 2117120, root 5, inode 257, offset 
> >> > 589824, length 4096, links$
> >> >   [18276.434127] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 
> >> > 0, flush 0, corrupt 2, gen 0
> >> >   [18276.500504] BTRFS error (device sdf): unable to fixup (regular) 
> >> > error at logical 2186477568 on dev /dev/sdd
> >> >   [18276.538400] BTRFS warning (device sdf): checksum error at logical 
> >> > 2186481664 on dev /dev/sdd, sector 2116744, root 5, inode 257, offset 
> >> > 200704, length 4096, links$
> >> >   [18276.540452] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 
> >> 

Re: [PATCH] Btrfs: fix write corruption due to bio cloning on raid5/6

2017-07-14 Thread David Sterba
On Fri, Jul 14, 2017 at 10:04:34AM +0100, Filipe Manana wrote:
> On Thu, Jul 13, 2017 at 11:38 PM, Liu Bo  wrote:
> > On Thu, Jul 13, 2017 at 09:36:19AM -0700, Liu Bo wrote:
> >> On Thu, Jul 13, 2017 at 03:09:54PM +0100, fdman...@kernel.org wrote:
> >> > @@ -1423,11 +1430,14 @@ static int fail_bio_stripe(struct btrfs_raid_bio 
> >> > *rbio,
> >> >   */
> >> >  static void set_bio_pages_uptodate(struct bio *bio)
> >> >  {
> >> > -   struct bio_vec *bvec;
> >> > -   int i;
> >> > +   struct bio_vec bvec;
> >> > +   struct bvec_iter iter;
> >> > +
> >> > +   if (bio_flagged(bio, BIO_CLONED))
> >> > +   bio->bi_iter = btrfs_io_bio(bio)->iter;
> >> >
> >
> > It is not necessary to update set_bio_pages_uptodate() because the bio
> > here is for reading pages in RMW and there is no way it could be a cloned one.
> 
> From my testing I've never seen this function getting cloned bios as
> well. But I'm afraid we don't have full test coverage and it scares me
> to have code around that can cause silent corruptions or crashes due
> to such subtle details (bi_vcnt can't be used for cloned bios and it's
> not obvious which APIs from the block layer make use of it without
> reading their code). Plus a quick look I had at btrfs_map_bio()
> suggested it can clone bios used for read operations through
> btrfs_submit_bio_hook() for example. So that's why I changed that
> function too to be able to deal both with cloned and non-cloned bios.

I agree with that approach. My first reaction was to switch to
bio_for_each_segment everywhere, but this would increase stack
consumption by 32 bytes, so with the asserts in place I hope we'd catch
potential cloned/bi_vcnt violations. As you point out, the API is not
clear on that.


Re: [PATCH] Btrfs: fix write corruption due to bio cloning on raid5/6

2017-07-14 Thread Filipe Manana
On Thu, Jul 13, 2017 at 11:38 PM, Liu Bo  wrote:
> On Thu, Jul 13, 2017 at 09:36:19AM -0700, Liu Bo wrote:
>> On Thu, Jul 13, 2017 at 03:09:54PM +0100, fdman...@kernel.org wrote:
>> > From: Filipe Manana 
>> >
>> > The recent changes to make bio cloning faster (added in the 4.13 merge
>> > window) by using the bio_clone_fast() API introduced a regression on
>> > raid5/6 modes, because cloned bios have an invalid bi_vcnt field
>> > (therefore it can not be used) and the raid5/6 code uses the
>> > bio_for_each_segment_all() API to iterate the segments of a bio, and this
>> > API uses a bio's bi_vcnt field.
>> >
>> > The issue is very simple to trigger by doing for example a direct IO write
>> > against a raid5 or raid6 filesystem and then attempting to read what we
>> > wrote before:
>> >
>> >   $ mkfs.btrfs -m raid5 -d raid5 -f /dev/sdc /dev/sdd /dev/sde /dev/sdf
>> >   $ mount /dev/sdc /mnt
>> >   $ xfs_io -f -d -c "pwrite -S 0xab 0 1M" /mnt/foobar
>> >   $ od -t x1 /mnt/foobar
>> >   od: /mnt/foobar: read error: Input/output error
>> >
>> > For that example, the following is also reported in dmesg/syslog:
>> >
>> >   [18274.985557] btrfs_print_data_csum_error: 18 callbacks suppressed
>> >   [18274.995277] BTRFS warning (device sdf): csum failed root 5 ino 257 
>> > off 0 csum 0x98f94189 expected csum 0x94374193 mirror 1
>> >   [18274.997205] BTRFS warning (device sdf): csum failed root 5 ino 257 
>> > off 4096 csum 0x98f94189 expected csum 0x94374193 mirror 1
>> >   [18275.025221] BTRFS warning (device sdf): csum failed root 5 ino 257 
>> > off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 1
>> >   [18275.047422] BTRFS warning (device sdf): csum failed root 5 ino 257 
>> > off 12288 csum 0x98f94189 expected csum 0x94374193 mirror 1
>> >   [18275.054818] BTRFS warning (device sdf): csum failed root 5 ino 257 
>> > off 4096 csum 0x98f94189 expected csum 0x94374193 mirror 1
>> >   [18275.054834] BTRFS warning (device sdf): csum failed root 5 ino 257 
>> > off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 1
>> >   [18275.054943] BTRFS warning (device sdf): csum failed root 5 ino 257 
>> > off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 2
>> >   [18275.055207] BTRFS warning (device sdf): csum failed root 5 ino 257 
>> > off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 3
>> >   [18275.055571] BTRFS warning (device sdf): csum failed root 5 ino 257 
>> > off 0 csum 0x98f94189 expected csum 0x94374193 mirror 1
>> >   [18275.062171] BTRFS warning (device sdf): csum failed root 5 ino 257 
>> > off 12288 csum 0x98f94189 expected csum 0x94374193 mirror 1
>> >
>> > A scrub will also fail correcting bad copies, mentioning the following in
>> > dmesg/syslog:
>> >
>> >   [18276.128696] scrub_handle_errored_block: 498 callbacks suppressed
>> >   [18276.129617] BTRFS warning (device sdf): checksum error at logical 
>> > 2186346496 on dev /dev/sde, sector 2116608, root 5, inode 257, offset 
>> > 65536, length 4096, links $
>> >   [18276.149235] btrfs_dev_stat_print_on_error: 498 callbacks suppressed
>> >   [18276.157897] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, 
>> > flush 0, corrupt 1, gen 0
>> >   [18276.206059] BTRFS warning (device sdf): checksum error at logical 
>> > 2186477568 on dev /dev/sdd, sector 2116736, root 5, inode 257, offset 
>> > 196608, length 4096, links$
>> >   [18276.206059] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, 
>> > flush 0, corrupt 1, gen 0
>> >   [18276.306552] BTRFS warning (device sdf): checksum error at logical 
>> > 2186543104 on dev /dev/sdd, sector 2116864, root 5, inode 257, offset 
>> > 262144, length 4096, links$
>> >   [18276.319152] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, 
>> > flush 0, corrupt 2, gen 0
>> >   [18276.394316] BTRFS warning (device sdf): checksum error at logical 
>> > 2186739712 on dev /dev/sdf, sector 2116992, root 5, inode 257, offset 
>> > 458752, length 4096, links$
>> >   [18276.396348] BTRFS error (device sdf): bdev /dev/sdf errs: wr 0, rd 0, 
>> > flush 0, corrupt 1, gen 0
>> >   [18276.434127] BTRFS warning (device sdf): checksum error at logical 
>> > 2186870784 on dev /dev/sde, sector 2117120, root 5, inode 257, offset 
>> > 589824, length 4096, links$
>> >   [18276.434127] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, 
>> > flush 0, corrupt 2, gen 0
>> >   [18276.500504] BTRFS error (device sdf): unable to fixup (regular) error 
>> > at logical 2186477568 on dev /dev/sdd
>> >   [18276.538400] BTRFS warning (device sdf): checksum error at logical 
>> > 2186481664 on dev /dev/sdd, sector 2116744, root 5, inode 257, offset 
>> > 200704, length 4096, links$
>> >   [18276.540452] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, 
>> > flush 0, corrupt 3, gen 0
>> >   [18276.542012] BTRFS error (device sdf): unable to fixup (regular) error 
>> > at logical 2186481664 on dev /dev/sdd
>> >   [18276.585030] BTRFS error (device sdf): unable to fixup (regular) error 
>

Re: [PATCH] Btrfs: fix write corruption due to bio cloning on raid5/6

2017-07-14 Thread Filipe Manana
On Thu, Jul 13, 2017 at 8:09 PM, Liu Bo  wrote:
> On Thu, Jul 13, 2017 at 07:11:24PM +0100, Filipe Manana wrote:
>> On Thu, Jul 13, 2017 at 7:00 PM, Liu Bo  wrote:
>> > On Thu, Jul 13, 2017 at 10:06:32AM -0700, Liu Bo wrote:
>> >> On Thu, Jul 13, 2017 at 05:46:13PM +0100, Filipe Manana wrote:
>> >> > On Thu, Jul 13, 2017 at 5:36 PM, Liu Bo  wrote:
>> >> > > On Thu, Jul 13, 2017 at 03:09:54PM +0100, fdman...@kernel.org wrote:
>> >> > >> From: Filipe Manana 
>> >> > >>
>> >> [...]
>> >> > >
>> >> > >
>> >> > >> Signed-off-by: Filipe Manana 
>> >> > >> ---
>> >> > >>  fs/btrfs/raid56.c | 26 ++
>> >> > >>  1 file changed, 18 insertions(+), 8 deletions(-)
>> >> > >>
>> >> > >> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
>> >> > >> index b9abb0b01021..b89d07003697 100644
>> >> > >> --- a/fs/btrfs/raid56.c
>> >> > >> +++ b/fs/btrfs/raid56.c
>> >> > >> @@ -1136,20 +1136,27 @@ static void validate_rbio_for_rmw(struct 
>> >> > >> btrfs_raid_bio *rbio)
>> >> > >>  static void index_rbio_pages(struct btrfs_raid_bio *rbio)
>> >> > >>  {
>> >> > >>   struct bio *bio;
>> >> > >> - struct bio_vec *bvec;
>> >> > >>   u64 start;
>> >> > >>   unsigned long stripe_offset;
>> >> > >>   unsigned long page_index;
>> >> > >> - int i;
>> >> > >>
>> >> > >>   spin_lock_irq(&rbio->bio_list_lock);
>> >> > >>   bio_list_for_each(bio, &rbio->bio_list) {
>> >> > >> + struct bio_vec bvec;
>> >> > >> + struct bvec_iter iter;
>> >> > >> + int i = 0;
>> >> > >> +
>> >> > >>   start = (u64)bio->bi_iter.bi_sector << 9;
>> >> > >>   stripe_offset = start - rbio->bbio->raid_map[0];
>> >> > >>   page_index = stripe_offset >> PAGE_SHIFT;
>> >> > >>
>> >> > >> - bio_for_each_segment_all(bvec, bio, i)
>> >> > >> - rbio->bio_pages[page_index + i] = 
>> >> > >> bvec->bv_page;
>> >> > >> + if (bio_flagged(bio, BIO_CLONED))
>> >> > >> + bio->bi_iter = btrfs_io_bio(bio)->iter;
>> >> > >> +
>> >> > >
>> >> > > I think we can use bio->bi_iter directly as the bio is not
>> >> > > submitted yet, i.e. bi_iter is not advanced yet.
>> >> >
>> >> > I've seen it cause problems by not setting it. I.e. the bio was
>> >> > iterated before somewhere else.
>> >> > Which lead to the following trace:
>> >> >
>> >>
>> >> This is coming from a free space inode flush during mount time and
>> >> it's full of buffered writes, the bio in use is not a cloned one.
>> >>
>> >> Interesting...let me check locally.
>> >>
>> >
>> > I ran the mentioned script in the commit log without the part of
>> > setting bio->bi_iter to btrfs_io_bio(bio)->iter.
>> >
>> > $ mkfs.btrfs -f -d raid5 /dev/sd[cde]
>> >
>> > $ mount /dev/sde /mnt/btrfs
>> >
>> > $ xfs_io -f -d -c "pwrite 0 128K" /mnt/btrfs/foobar
>> > wrote 131072/131072 bytes at offset 0
>> > 128 KiB, 32 ops; 0. sec (5.842 MiB/sec and 1495.6766 ops/sec)
>> >
>> > $ od -t x1 /mnt/btrfs/foobar
>> > 000 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd
>> > *
>> > 040
>> >
>> > $ uname -r
>> > 4.12.0+
>> >
>> > It works fine without causing the 'general protection' fault during mount.
>>
>> Yes, the trace was not from running the test mentioned in the
>> changelog, but rather btrfs/0[60-74] iirc or some other stress tests.
>> It doesn't always happen, either.
>>
>
> I see, then it might be caused by some other problem; as the trace
> shows, it's a buffered write whose bio is not cloned, so it still
> uses bio->bi_iter for the iteration.
>
> The 'general protection' was in finish_rmw(), could you please check
> if that points to somewhere inside this bio_for_each_segment() helper?

(gdb) list *(finish_rmw+0x1d6)
0x97eb4 is in finish_rmw (fs/btrfs/raid56.c:1252).
1247
1248 raid6_call.gen_syndrome(rbio->real_stripes, PAGE_SIZE,
1249 pointers);
1250 } else {
1251 /* raid5 */
1252 memcpy(pointers[nr_data], pointers[0], PAGE_SIZE);
1253 run_xor(pointers + 1, nr_data - 1, PAGE_SIZE);
1254 }
1255
1256

finish_rmw() does not call bio_for_each_segment nor any equivalent, so
it's the consequence of a previous call to one of those iterators,
such as index_rbio_pages() setting bio_pages[] to some invalid
memory address, for example.

>
> thanks,
>
> -liubo
>
>> >
>> > thanks,
>> >
>> > -liubo
>> >
>> >> -liubo
>> >>
>> >> > [ 3955.729020] general protection fault:  [#1] PREEMPT SMP
>> >> > [ 3955.729845] Modules linked in: btrfs ppdev tpm_tis evdev psmouse
>> >> > parport_pc sg tpm_tis_core i2c_piix4 parport tpm pcspkr serio_raw
>> >> > i2c_core button sunrpc loop auto
>> >> > fs4 ext4 crc16 jbd2 mbcache raid10 raid456 async_raid6_recov
>> >> > async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c
>> >> > crc32c_generic raid1 raid0 multipath line
>> >> > ar md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata
>> >> > virtio_pci virtio_ring scsi_mod virtio e1000 floppy [last unloaded:
>> >> > btrfs]
>> >> > [ 3955.73173

Re: [PATCH] Btrfs: fix write corruption due to bio cloning on raid5/6

2017-07-13 Thread Liu Bo
On Thu, Jul 13, 2017 at 09:36:19AM -0700, Liu Bo wrote:
> On Thu, Jul 13, 2017 at 03:09:54PM +0100, fdman...@kernel.org wrote:
> > From: Filipe Manana 
> > 
> > The recent changes to make bio cloning faster (added in the 4.13 merge
> > window) by using the bio_clone_fast() API introduced a regression on
> > raid5/6 modes, because cloned bios have an invalid bi_vcnt field
> > (therefore it can not be used) and the raid5/6 code uses the
> > bio_for_each_segment_all() API to iterate the segments of a bio, and this
> > API uses a bio's bi_vcnt field.
> > 
> > The issue is very simple to trigger by doing for example a direct IO write
> > against a raid5 or raid6 filesystem and then attempting to read what we
> > wrote before:
> > 
> >   $ mkfs.btrfs -m raid5 -d raid5 -f /dev/sdc /dev/sdd /dev/sde /dev/sdf
> >   $ mount /dev/sdc /mnt
> >   $ xfs_io -f -d -c "pwrite -S 0xab 0 1M" /mnt/foobar
> >   $ od -t x1 /mnt/foobar
> >   od: /mnt/foobar: read error: Input/output error
> > 
> > For that example, the following is also reported in dmesg/syslog:
> > 
> >   [18274.985557] btrfs_print_data_csum_error: 18 callbacks suppressed
> >   [18274.995277] BTRFS warning (device sdf): csum failed root 5 ino 257 off 
> > 0 csum 0x98f94189 expected csum 0x94374193 mirror 1
> >   [18274.997205] BTRFS warning (device sdf): csum failed root 5 ino 257 off 
> > 4096 csum 0x98f94189 expected csum 0x94374193 mirror 1
> >   [18275.025221] BTRFS warning (device sdf): csum failed root 5 ino 257 off 
> > 8192 csum 0x98f94189 expected csum 0x94374193 mirror 1
> >   [18275.047422] BTRFS warning (device sdf): csum failed root 5 ino 257 off 
> > 12288 csum 0x98f94189 expected csum 0x94374193 mirror 1
> >   [18275.054818] BTRFS warning (device sdf): csum failed root 5 ino 257 off 
> > 4096 csum 0x98f94189 expected csum 0x94374193 mirror 1
> >   [18275.054834] BTRFS warning (device sdf): csum failed root 5 ino 257 off 
> > 8192 csum 0x98f94189 expected csum 0x94374193 mirror 1
> >   [18275.054943] BTRFS warning (device sdf): csum failed root 5 ino 257 off 
> > 8192 csum 0x98f94189 expected csum 0x94374193 mirror 2
> >   [18275.055207] BTRFS warning (device sdf): csum failed root 5 ino 257 off 
> > 8192 csum 0x98f94189 expected csum 0x94374193 mirror 3
> >   [18275.055571] BTRFS warning (device sdf): csum failed root 5 ino 257 off 
> > 0 csum 0x98f94189 expected csum 0x94374193 mirror 1
> >   [18275.062171] BTRFS warning (device sdf): csum failed root 5 ino 257 off 
> > 12288 csum 0x98f94189 expected csum 0x94374193 mirror 1
> > 
> > A scrub will also fail correcting bad copies, mentioning the following in
> > dmesg/syslog:
> > 
> >   [18276.128696] scrub_handle_errored_block: 498 callbacks suppressed
> >   [18276.129617] BTRFS warning (device sdf): checksum error at logical 
> > 2186346496 on dev /dev/sde, sector 2116608, root 5, inode 257, offset 
> > 65536, length 4096, links $
> >   [18276.149235] btrfs_dev_stat_print_on_error: 498 callbacks suppressed
> >   [18276.157897] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, 
> > flush 0, corrupt 1, gen 0
> >   [18276.206059] BTRFS warning (device sdf): checksum error at logical 
> > 2186477568 on dev /dev/sdd, sector 2116736, root 5, inode 257, offset 
> > 196608, length 4096, links$
> >   [18276.206059] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, 
> > flush 0, corrupt 1, gen 0
> >   [18276.306552] BTRFS warning (device sdf): checksum error at logical 
> > 2186543104 on dev /dev/sdd, sector 2116864, root 5, inode 257, offset 
> > 262144, length 4096, links$
> >   [18276.319152] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, 
> > flush 0, corrupt 2, gen 0
> >   [18276.394316] BTRFS warning (device sdf): checksum error at logical 
> > 2186739712 on dev /dev/sdf, sector 2116992, root 5, inode 257, offset 
> > 458752, length 4096, links$
> >   [18276.396348] BTRFS error (device sdf): bdev /dev/sdf errs: wr 0, rd 0, 
> > flush 0, corrupt 1, gen 0
> >   [18276.434127] BTRFS warning (device sdf): checksum error at logical 
> > 2186870784 on dev /dev/sde, sector 2117120, root 5, inode 257, offset 
> > 589824, length 4096, links$
> >   [18276.434127] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, 
> > flush 0, corrupt 2, gen 0
> >   [18276.500504] BTRFS error (device sdf): unable to fixup (regular) error 
> > at logical 2186477568 on dev /dev/sdd
> >   [18276.538400] BTRFS warning (device sdf): checksum error at logical 
> > 2186481664 on dev /dev/sdd, sector 2116744, root 5, inode 257, offset 
> > 200704, length 4096, links$
> >   [18276.540452] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, 
> > flush 0, corrupt 3, gen 0
> >   [18276.542012] BTRFS error (device sdf): unable to fixup (regular) error 
> > at logical 2186481664 on dev /dev/sdd
> >   [18276.585030] BTRFS error (device sdf): unable to fixup (regular) error 
> > at logical 2186346496 on dev /dev/sde
> >   [18276.598306] BTRFS warning (device sdf): checksum error at logical 
> > 2186412

Re: [PATCH] Btrfs: fix write corruption due to bio cloning on raid5/6

2017-07-13 Thread Liu Bo
On Thu, Jul 13, 2017 at 07:11:24PM +0100, Filipe Manana wrote:
> On Thu, Jul 13, 2017 at 7:00 PM, Liu Bo  wrote:
> > On Thu, Jul 13, 2017 at 10:06:32AM -0700, Liu Bo wrote:
> >> On Thu, Jul 13, 2017 at 05:46:13PM +0100, Filipe Manana wrote:
> >> > On Thu, Jul 13, 2017 at 5:36 PM, Liu Bo  wrote:
> >> > > On Thu, Jul 13, 2017 at 03:09:54PM +0100, fdman...@kernel.org wrote:
> >> > >> From: Filipe Manana 
> >> > >>
> >> [...]
> >> > >
> >> > >
> >> > >> Signed-off-by: Filipe Manana 
> >> > >> ---
> >> > >>  fs/btrfs/raid56.c | 26 ++
> >> > >>  1 file changed, 18 insertions(+), 8 deletions(-)
> >> > >>
> >> > >> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> >> > >> index b9abb0b01021..b89d07003697 100644
> >> > >> --- a/fs/btrfs/raid56.c
> >> > >> +++ b/fs/btrfs/raid56.c
> >> > >> @@ -1136,20 +1136,27 @@ static void validate_rbio_for_rmw(struct 
> >> > >> btrfs_raid_bio *rbio)
> >> > >>  static void index_rbio_pages(struct btrfs_raid_bio *rbio)
> >> > >>  {
> >> > >>   struct bio *bio;
> >> > >> - struct bio_vec *bvec;
> >> > >>   u64 start;
> >> > >>   unsigned long stripe_offset;
> >> > >>   unsigned long page_index;
> >> > >> - int i;
> >> > >>
> >> > >>   spin_lock_irq(&rbio->bio_list_lock);
> >> > >>   bio_list_for_each(bio, &rbio->bio_list) {
> >> > >> + struct bio_vec bvec;
> >> > >> + struct bvec_iter iter;
> >> > >> + int i = 0;
> >> > >> +
> >> > >>   start = (u64)bio->bi_iter.bi_sector << 9;
> >> > >>   stripe_offset = start - rbio->bbio->raid_map[0];
> >> > >>   page_index = stripe_offset >> PAGE_SHIFT;
> >> > >>
> >> > >> - bio_for_each_segment_all(bvec, bio, i)
> >> > >> - rbio->bio_pages[page_index + i] = bvec->bv_page;
> >> > >> + if (bio_flagged(bio, BIO_CLONED))
> >> > >> + bio->bi_iter = btrfs_io_bio(bio)->iter;
> >> > >> +
> >> > >
> >> > > I think we can use bio->bi_iter directly as the bio is not
> >> > > submitted yet, i.e. bi_iter is not advanced yet.
> >> >
> >> > I've seen it cause problems by not setting it. I.e. the bio was
> >> > iterated before somewhere else.
> >> > Which lead to the following trace:
> >> >
> >>
> >> This is coming from a free space inode flush during mount time and
> >> it's full of buffered writes, the bio in use is not a cloned one.
> >>
> >> Interesting...let me check locally.
> >>
> >
> > I ran the mentioned script in the commit log without the part of
> > setting bio->bi_iter to btrfs_io_bio(bio)->iter.
> >
> > $ mkfs.btrfs -f -d raid5 /dev/sd[cde]
> >
> > $ mount /dev/sde /mnt/btrfs
> >
> > $ xfs_io -f -d -c "pwrite 0 128K" /mnt/btrfs/foobar
> > wrote 131072/131072 bytes at offset 0
> > 128 KiB, 32 ops; 0. sec (5.842 MiB/sec and 1495.6766 ops/sec)
> >
> > $ od -t x1 /mnt/btrfs/foobar
> > 000 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd
> > *
> > 040
> >
> > $ uname -r
> > 4.12.0+
> >
> > It works fine without causing the 'general protection' fault during mount.
> 
> Yes, the trace was not from running the test mentioned in the
> changelog, but rather btrfs/0[60-74] iirc or some other stress tests.
> It doesn't always happen, either.
>

I see, then it might be caused by some other problem; as the trace
shows, it's a buffered write whose bio is not cloned, so it still
uses bio->bi_iter for the iteration.

The 'general protection' was in finish_rmw(), could you please check
if that points to somewhere inside this bio_for_each_segment() helper?

thanks,

-liubo

> >
> > thanks,
> >
> > -liubo
> >
> >> -liubo
> >>
> >> > [ 3955.729020] general protection fault:  [#1] PREEMPT SMP
> >> > [ 3955.729845] Modules linked in: btrfs ppdev tpm_tis evdev psmouse
> >> > parport_pc sg tpm_tis_core i2c_piix4 parport tpm pcspkr serio_raw
> >> > i2c_core button sunrpc loop auto
> >> > fs4 ext4 crc16 jbd2 mbcache raid10 raid456 async_raid6_recov
> >> > async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c
> >> > crc32c_generic raid1 raid0 multipath line
> >> > ar md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata
> >> > virtio_pci virtio_ring scsi_mod virtio e1000 floppy [last unloaded:
> >> > btrfs]
> >> > [ 3955.731730] CPU: 15 PID: 26361 Comm: mount Not tainted
> >> > 4.12.0-rc6-btrfs-next-47+ #1
> >> > [ 3955.731730] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> >> > BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
> >> > [ 3955.731730] task: 880235b5d680 task.stack: c90002498000
> >> > [ 3955.731730] RIP: 0010:finish_rmw+0x1d6/0x3c4 [btrfs]
> >> > [ 3955.731730] RSP: 0018:c9000249b3d8 EFLAGS: 00010246
> >> > [ 3955.731730] RAX: 880235e86000 RBX: 880210e23800 RCX: 
> >> > 0400
> >> > [ 3955.731730] RDX: 8800 RSI: db738800 RDI: 
> >> > 880235e86000
> >> > [ 3955.731730] RBP: c9000249b470 R08: 0001 R09: 
> >> > 0001
> >> > [ 3955.731

Re: [PATCH] Btrfs: fix write corruption due to bio cloning on raid5/6

2017-07-13 Thread Filipe Manana
On Thu, Jul 13, 2017 at 7:00 PM, Liu Bo  wrote:
> On Thu, Jul 13, 2017 at 10:06:32AM -0700, Liu Bo wrote:
>> On Thu, Jul 13, 2017 at 05:46:13PM +0100, Filipe Manana wrote:
>> > On Thu, Jul 13, 2017 at 5:36 PM, Liu Bo  wrote:
>> > > On Thu, Jul 13, 2017 at 03:09:54PM +0100, fdman...@kernel.org wrote:
>> > >> From: Filipe Manana 
>> > >>
>> [...]
>> > >
>> > >
>> > >> Signed-off-by: Filipe Manana 
>> > >> ---
>> > >>  fs/btrfs/raid56.c | 26 ++
>> > >>  1 file changed, 18 insertions(+), 8 deletions(-)
>> > >>
>> > >> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
>> > >> index b9abb0b01021..b89d07003697 100644
>> > >> --- a/fs/btrfs/raid56.c
>> > >> +++ b/fs/btrfs/raid56.c
>> > >> @@ -1136,20 +1136,27 @@ static void validate_rbio_for_rmw(struct 
>> > >> btrfs_raid_bio *rbio)
>> > >>  static void index_rbio_pages(struct btrfs_raid_bio *rbio)
>> > >>  {
>> > >>   struct bio *bio;
>> > >> - struct bio_vec *bvec;
>> > >>   u64 start;
>> > >>   unsigned long stripe_offset;
>> > >>   unsigned long page_index;
>> > >> - int i;
>> > >>
>> > >>   spin_lock_irq(&rbio->bio_list_lock);
>> > >>   bio_list_for_each(bio, &rbio->bio_list) {
>> > >> + struct bio_vec bvec;
>> > >> + struct bvec_iter iter;
>> > >> + int i = 0;
>> > >> +
>> > >>   start = (u64)bio->bi_iter.bi_sector << 9;
>> > >>   stripe_offset = start - rbio->bbio->raid_map[0];
>> > >>   page_index = stripe_offset >> PAGE_SHIFT;
>> > >>
>> > >> - bio_for_each_segment_all(bvec, bio, i)
>> > >> - rbio->bio_pages[page_index + i] = bvec->bv_page;
>> > >> + if (bio_flagged(bio, BIO_CLONED))
>> > >> + bio->bi_iter = btrfs_io_bio(bio)->iter;
>> > >> +
>> > >
>> > > I think we can use bio->bi_iter directly as the bio is not
>> > > submitted yet, i.e. bi_iter is not advanced yet.
>> >
>> > I've seen it cause problems by not setting it. I.e. the bio was
>> > iterated before somewhere else.
>> > Which lead to the following trace:
>> >
>>
>> This is coming from a free space inode flush during mount time and
>> it's full of buffered writes, the bio in use is not a cloned one.
>>
>> Interesting...let me check locally.
>>
>
> I ran the mentioned script in the commit log without the part of
> setting bio->bi_iter to btrfs_io_bio(bio)->iter.
>
> $ mkfs.btrfs -f -d raid5 /dev/sd[cde]
>
> $ mount /dev/sde /mnt/btrfs
>
> $ xfs_io -f -d -c "pwrite 0 128K" /mnt/btrfs/foobar
> wrote 131072/131072 bytes at offset 0
> 128 KiB, 32 ops; 0. sec (5.842 MiB/sec and 1495.6766 ops/sec)
>
> $ od -t x1 /mnt/btrfs/foobar
> 000 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd
> *
> 040
>
> $ uname -r
> 4.12.0+
>
> It works fine without causing the 'general protection' fault during mount.

Yes, the trace was not from running the test mentioned in the
changelog, but rather btrfs/0[60-74] iirc or some other stress tests.
It doesn't always happen, either.

>
> thanks,
>
> -liubo
>
>> -liubo
>>
>> > [ 3955.729020] general protection fault:  [#1] PREEMPT SMP
>> > [ 3955.729845] Modules linked in: btrfs ppdev tpm_tis evdev psmouse
>> > parport_pc sg tpm_tis_core i2c_piix4 parport tpm pcspkr serio_raw
>> > i2c_core button sunrpc loop auto
>> > fs4 ext4 crc16 jbd2 mbcache raid10 raid456 async_raid6_recov
>> > async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c
>> > crc32c_generic raid1 raid0 multipath line
>> > ar md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata
>> > virtio_pci virtio_ring scsi_mod virtio e1000 floppy [last unloaded:
>> > btrfs]
>> > [ 3955.731730] CPU: 15 PID: 26361 Comm: mount Not tainted
>> > 4.12.0-rc6-btrfs-next-47+ #1
>> > [ 3955.731730] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>> > BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
>> > [ 3955.731730] task: 880235b5d680 task.stack: c90002498000
>> > [ 3955.731730] RIP: 0010:finish_rmw+0x1d6/0x3c4 [btrfs]
>> > [ 3955.731730] RSP: 0018:c9000249b3d8 EFLAGS: 00010246
>> > [ 3955.731730] RAX: 880235e86000 RBX: 880210e23800 RCX: 
>> > 0400
>> > [ 3955.731730] RDX: 8800 RSI: db738800 RDI: 
>> > 880235e86000
>> > [ 3955.731730] RBP: c9000249b470 R08: 0001 R09: 
>> > 0001
>> > [ 3955.731730] R10: c9000249b310 R11: 0003 R12: 
>> > c9000249b3d8
>> > [ 3955.731730] R13:  R14: 0003 R15: 
>> > c9000249b3f0
>> > [ 3955.731730] FS:  7f68fda61480() GS:88023f5c()
>> > knlGS:
>> > [ 3955.731730] CS:  0010 DS:  ES:  CR0: 80050033
>> > [ 3955.731730] CR2: 56403c233680 CR3: 00021049e000 CR4: 
>> > 06e0
>> > [ 3955.731730] Call Trace:
>> > [ 3955.731730]  full_stripe_write+0x69/0x93 [btrfs]
>> > [ 3955.731730]  raid56_parity_write+0xa4/0x110 [btrfs]
>> > [ 3955.731730]  btrfs_map_bio+0xca/0x25

Re: [PATCH] Btrfs: fix write corruption due to bio cloning on raid5/6

2017-07-13 Thread Liu Bo
On Thu, Jul 13, 2017 at 10:06:32AM -0700, Liu Bo wrote:
> On Thu, Jul 13, 2017 at 05:46:13PM +0100, Filipe Manana wrote:
> > On Thu, Jul 13, 2017 at 5:36 PM, Liu Bo  wrote:
> > > On Thu, Jul 13, 2017 at 03:09:54PM +0100, fdman...@kernel.org wrote:
> > >> From: Filipe Manana 
> > >>
> [...]
> > >
> > >
> > >> Signed-off-by: Filipe Manana 
> > >> ---
> > >>  fs/btrfs/raid56.c | 26 ++
> > >>  1 file changed, 18 insertions(+), 8 deletions(-)
> > >>
> > >> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> > >> index b9abb0b01021..b89d07003697 100644
> > >> --- a/fs/btrfs/raid56.c
> > >> +++ b/fs/btrfs/raid56.c
> > >> @@ -1136,20 +1136,27 @@ static void validate_rbio_for_rmw(struct 
> > >> btrfs_raid_bio *rbio)
> > >>  static void index_rbio_pages(struct btrfs_raid_bio *rbio)
> > >>  {
> > >>   struct bio *bio;
> > >> - struct bio_vec *bvec;
> > >>   u64 start;
> > >>   unsigned long stripe_offset;
> > >>   unsigned long page_index;
> > >> - int i;
> > >>
> > >>   spin_lock_irq(&rbio->bio_list_lock);
> > >>   bio_list_for_each(bio, &rbio->bio_list) {
> > >> + struct bio_vec bvec;
> > >> + struct bvec_iter iter;
> > >> + int i = 0;
> > >> +
> > >>   start = (u64)bio->bi_iter.bi_sector << 9;
> > >>   stripe_offset = start - rbio->bbio->raid_map[0];
> > >>   page_index = stripe_offset >> PAGE_SHIFT;
> > >>
> > >> - bio_for_each_segment_all(bvec, bio, i)
> > >> - rbio->bio_pages[page_index + i] = bvec->bv_page;
> > >> + if (bio_flagged(bio, BIO_CLONED))
> > >> + bio->bi_iter = btrfs_io_bio(bio)->iter;
> > >> +
> > >
> > > I think we can use bio->bi_iter directly as the bio is not
> > > submitted yet, i.e. bi_iter is not advanced yet.
> > 
> > I've seen it cause problems when not setting it, i.e. the bio was
> > already iterated somewhere else before, which led to the following
> > trace:
> >
> 
> This is coming from a free space inode flush during mount time, and
> it's full of buffered writes; the bio in use is not a cloned one.
> 
> Interesting...let me check locally.
>

I ran the script mentioned in the commit log, without the part that
sets bio->bi_iter to btrfs_io_bio(bio)->iter.

$ mkfs.btrfs -f -d raid5 /dev/sd[cde]

$ mount /dev/sde /mnt/btrfs

$ xfs_io -f -d -c "pwrite 0 128K" /mnt/btrfs/foobar
wrote 131072/131072 bytes at offset 0
128 KiB, 32 ops; 0. sec (5.842 MiB/sec and 1495.6766 ops/sec)

$ od -t x1 /mnt/btrfs/foobar 
0000000 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd
*
0400000

$ uname -r
4.12.0+

It works fine without causing a 'general protection fault' during mount.
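
For reference, a rough sketch of the indexing loop being tested here,
reconstructed from the declarations visible in the quoted hunk (it is not a
verbatim copy of the applied patch); the conditional marked below is the part
left out in the run above:

	/*
	 * Sketch reconstructed from the hunk quoted earlier, not copied
	 * from the applied patch: index the pages of each bio on the
	 * rbio's bio list.
	 */
	bio_list_for_each(bio, &rbio->bio_list) {
		struct bio_vec bvec;
		struct bvec_iter iter;
		int i = 0;

		start = (u64)bio->bi_iter.bi_sector << 9;
		stripe_offset = start - rbio->bbio->raid_map[0];
		page_index = stripe_offset >> PAGE_SHIFT;

		/*
		 * The part omitted in the test run above: restore the
		 * iterator saved for a cloned bio, in case bi_iter was
		 * already advanced elsewhere.
		 */
		if (bio_flagged(bio, BIO_CLONED))
			bio->bi_iter = btrfs_io_bio(bio)->iter;

		/* Iterate by bi_iter, which is valid even for clones. */
		bio_for_each_segment(bvec, bio, iter) {
			rbio->bio_pages[page_index + i] = bvec.bv_page;
			i++;
		}
	}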

thanks,

-liubo

> -liubo
> 
> > [ 3955.729020] general protection fault:  [#1] PREEMPT SMP
> > [ 3955.729845] Modules linked in: btrfs ppdev tpm_tis evdev psmouse
> > parport_pc sg tpm_tis_core i2c_piix4 parport tpm pcspkr serio_raw
> > i2c_core button sunrpc loop auto
> > fs4 ext4 crc16 jbd2 mbcache raid10 raid456 async_raid6_recov
> > async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c
> > crc32c_generic raid1 raid0 multipath line
> > ar md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata
> > virtio_pci virtio_ring scsi_mod virtio e1000 floppy [last unloaded:
> > btrfs]
> > [ 3955.731730] CPU: 15 PID: 26361 Comm: mount Not tainted
> > 4.12.0-rc6-btrfs-next-47+ #1
> > [ 3955.731730] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> > BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
> > [ 3955.731730] task: 880235b5d680 task.stack: c90002498000
> > [ 3955.731730] RIP: 0010:finish_rmw+0x1d6/0x3c4 [btrfs]
> > [ 3955.731730] RSP: 0018:c9000249b3d8 EFLAGS: 00010246
> > [ 3955.731730] RAX: 880235e86000 RBX: 880210e23800 RCX: 
> > 0400
> > [ 3955.731730] RDX: 8800 RSI: db738800 RDI: 
> > 880235e86000
> > [ 3955.731730] RBP: c9000249b470 R08: 0001 R09: 
> > 0001
> > [ 3955.731730] R10: c9000249b310 R11: 0003 R12: 
> > c9000249b3d8
> > [ 3955.731730] R13:  R14: 0003 R15: 
> > c9000249b3f0
> > [ 3955.731730] FS:  7f68fda61480() GS:88023f5c()
> > knlGS:
> > [ 3955.731730] CS:  0010 DS:  ES:  CR0: 80050033
> > [ 3955.731730] CR2: 56403c233680 CR3: 00021049e000 CR4: 
> > 06e0
> > [ 3955.731730] Call Trace:
> > [ 3955.731730]  full_stripe_write+0x69/0x93 [btrfs]
> > [ 3955.731730]  raid56_parity_write+0xa4/0x110 [btrfs]
> > [ 3955.731730]  btrfs_map_bio+0xca/0x259 [btrfs]
> > [ 3955.731730]  btrfs_submit_bio_hook+0xbf/0x147 [btrfs]
> > [ 3955.731730]  submit_one_bio+0x5d/0x7b [btrfs]
> > [ 3955.731730]  submit_extent_page.constprop.30+0x99/0x165 [btrfs]
> > [ 3955.731730]  ? __test_set_page_writeback+0x1aa/0x1bc
> > [ 3955.731730]  __extent_writepage_io+0x328/0x3a2 [btrfs]
> > [ 3955.731730]  ? end_extent_

Re: [PATCH] Btrfs: fix write corruption due to bio cloning on raid5/6

2017-07-13 Thread Liu Bo
On Thu, Jul 13, 2017 at 05:46:13PM +0100, Filipe Manana wrote:
> On Thu, Jul 13, 2017 at 5:36 PM, Liu Bo  wrote:
> > On Thu, Jul 13, 2017 at 03:09:54PM +0100, fdman...@kernel.org wrote:
> >> From: Filipe Manana 
> >>
[...]
> >
> >
> >> Signed-off-by: Filipe Manana 
> >> ---
> >>  fs/btrfs/raid56.c | 26 ++
> >>  1 file changed, 18 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> >> index b9abb0b01021..b89d07003697 100644
> >> --- a/fs/btrfs/raid56.c
> >> +++ b/fs/btrfs/raid56.c
> >> @@ -1136,20 +1136,27 @@ static void validate_rbio_for_rmw(struct 
> >> btrfs_raid_bio *rbio)
> >>  static void index_rbio_pages(struct btrfs_raid_bio *rbio)
> >>  {
> >>   struct bio *bio;
> >> - struct bio_vec *bvec;
> >>   u64 start;
> >>   unsigned long stripe_offset;
> >>   unsigned long page_index;
> >> - int i;
> >>
> >>   spin_lock_irq(&rbio->bio_list_lock);
> >>   bio_list_for_each(bio, &rbio->bio_list) {
> >> + struct bio_vec bvec;
> >> + struct bvec_iter iter;
> >> + int i = 0;
> >> +
> >>   start = (u64)bio->bi_iter.bi_sector << 9;
> >>   stripe_offset = start - rbio->bbio->raid_map[0];
> >>   page_index = stripe_offset >> PAGE_SHIFT;
> >>
> >> - bio_for_each_segment_all(bvec, bio, i)
> >> - rbio->bio_pages[page_index + i] = bvec->bv_page;
> >> + if (bio_flagged(bio, BIO_CLONED))
> >> + bio->bi_iter = btrfs_io_bio(bio)->iter;
> >> +
> >
> > I think we can use bio->bi_iter directly as the bio is not
> > submitted yet, i.e. bi_iter is not advanced yet.
> 
> I've seen it cause problems when not setting it, i.e. the bio was
> already iterated somewhere else before, which led to the following
> trace:
>

This is coming from a free space inode flush during mount time, and
it's full of buffered writes; the bio in use is not a cloned one.

Interesting...let me check locally.

-liubo

> [ 3955.729020] general protection fault:  [#1] PREEMPT SMP
> [ 3955.729845] Modules linked in: btrfs ppdev tpm_tis evdev psmouse
> parport_pc sg tpm_tis_core i2c_piix4 parport tpm pcspkr serio_raw
> i2c_core button sunrpc loop auto
> fs4 ext4 crc16 jbd2 mbcache raid10 raid456 async_raid6_recov
> async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c
> crc32c_generic raid1 raid0 multipath line
> ar md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata
> virtio_pci virtio_ring scsi_mod virtio e1000 floppy [last unloaded:
> btrfs]
> [ 3955.731730] CPU: 15 PID: 26361 Comm: mount Not tainted
> 4.12.0-rc6-btrfs-next-47+ #1
> [ 3955.731730] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
> [ 3955.731730] task: 880235b5d680 task.stack: c90002498000
> [ 3955.731730] RIP: 0010:finish_rmw+0x1d6/0x3c4 [btrfs]
> [ 3955.731730] RSP: 0018:c9000249b3d8 EFLAGS: 00010246
> [ 3955.731730] RAX: 880235e86000 RBX: 880210e23800 RCX: 
> 0400
> [ 3955.731730] RDX: 8800 RSI: db738800 RDI: 
> 880235e86000
> [ 3955.731730] RBP: c9000249b470 R08: 0001 R09: 
> 0001
> [ 3955.731730] R10: c9000249b310 R11: 0003 R12: 
> c9000249b3d8
> [ 3955.731730] R13:  R14: 0003 R15: 
> c9000249b3f0
> [ 3955.731730] FS:  7f68fda61480() GS:88023f5c()
> knlGS:
> [ 3955.731730] CS:  0010 DS:  ES:  CR0: 80050033
> [ 3955.731730] CR2: 56403c233680 CR3: 00021049e000 CR4: 
> 06e0
> [ 3955.731730] Call Trace:
> [ 3955.731730]  full_stripe_write+0x69/0x93 [btrfs]
> [ 3955.731730]  raid56_parity_write+0xa4/0x110 [btrfs]
> [ 3955.731730]  btrfs_map_bio+0xca/0x259 [btrfs]
> [ 3955.731730]  btrfs_submit_bio_hook+0xbf/0x147 [btrfs]
> [ 3955.731730]  submit_one_bio+0x5d/0x7b [btrfs]
> [ 3955.731730]  submit_extent_page.constprop.30+0x99/0x165 [btrfs]
> [ 3955.731730]  ? __test_set_page_writeback+0x1aa/0x1bc
> [ 3955.731730]  __extent_writepage_io+0x328/0x3a2 [btrfs]
> [ 3955.731730]  ? end_extent_writepage+0xc4/0xc4 [btrfs]
> [ 3955.731730]  __extent_writepage+0x236/0x2bd [btrfs]
> [ 3955.731730]  extent_write_cache_pages.constprop.27+0x1ed/0x35e [btrfs]
> [ 3955.731730]  extent_writepages+0x4c/0x5d [btrfs]
> [ 3955.731730]  ? rcu_read_lock_sched_held+0x57/0x6c
> [ 3955.731730]  ? btrfs_set_bit_hook+0x233/0x233 [btrfs]
> [ 3955.731730]  ? __clear_extent_bit+0x2ab/0x2ca [btrfs]
> [ 3955.731730]  btrfs_writepages+0x28/0x2a [btrfs]
> [ 3955.731730]  do_writepages+0x3c/0x72
> [ 3955.731730]  __filemap_fdatawrite_range+0x5a/0x63
> [ 3955.731730]  filemap_fdatawrite_range+0x13/0x15
> [ 3955.731730]  btrfs_fdatawrite_range+0x20/0x46 [btrfs]
> [ 3955.731730]  __btrfs_write_out_cache+0x2db/0x3aa [btrfs]
> [ 3955.731730]  btrfs_write_out_cache+0x8a/0xcd [btrfs]
> [ 3955.731730]  btrfs_star

Re: [PATCH] Btrfs: fix write corruption due to bio cloning on raid5/6

2017-07-13 Thread Filipe Manana
On Thu, Jul 13, 2017 at 5:36 PM, Liu Bo  wrote:
> On Thu, Jul 13, 2017 at 03:09:54PM +0100, fdman...@kernel.org wrote:
>> From: Filipe Manana 
>>
>> The recent changes to make bio cloning faster (added in the 4.13 merge
>> window) by using the bio_clone_fast() API introduced a regression on
>> raid5/6 modes, because cloned bios have an invalid bi_vcnt field
>> (therefore it can not be used) and the raid5/6 code uses the
>> bio_for_each_segment_all() API to iterate the segments of a bio, and this
>> API uses a bio's bi_vcnt field.
>>
>> The issue is very simple to trigger by doing for example a direct IO write
>> against a raid5 or raid6 filesystem and then attempting to read what we
>> wrote before:
>>
>>   $ mkfs.btrfs -m raid5 -d raid5 -f /dev/sdc /dev/sdd /dev/sde /dev/sdf
>>   $ mount /dev/sdc /mnt
>>   $ xfs_io -f -d -c "pwrite -S 0xab 0 1M" /mnt/foobar
>>   $ od -t x1 /mnt/foobar
>>   od: /mnt/foobar: read error: Input/output error
>>
>> For that example, the following is also reported in dmesg/syslog:
>>
>>   [18274.985557] btrfs_print_data_csum_error: 18 callbacks suppressed
>>   [18274.995277] BTRFS warning (device sdf): csum failed root 5 ino 257 off 
>> 0 csum 0x98f94189 expected csum 0x94374193 mirror 1
>>   [18274.997205] BTRFS warning (device sdf): csum failed root 5 ino 257 off 
>> 4096 csum 0x98f94189 expected csum 0x94374193 mirror 1
>>   [18275.025221] BTRFS warning (device sdf): csum failed root 5 ino 257 off 
>> 8192 csum 0x98f94189 expected csum 0x94374193 mirror 1
>>   [18275.047422] BTRFS warning (device sdf): csum failed root 5 ino 257 off 
>> 12288 csum 0x98f94189 expected csum 0x94374193 mirror 1
>>   [18275.054818] BTRFS warning (device sdf): csum failed root 5 ino 257 off 
>> 4096 csum 0x98f94189 expected csum 0x94374193 mirror 1
>>   [18275.054834] BTRFS warning (device sdf): csum failed root 5 ino 257 off 
>> 8192 csum 0x98f94189 expected csum 0x94374193 mirror 1
>>   [18275.054943] BTRFS warning (device sdf): csum failed root 5 ino 257 off 
>> 8192 csum 0x98f94189 expected csum 0x94374193 mirror 2
>>   [18275.055207] BTRFS warning (device sdf): csum failed root 5 ino 257 off 
>> 8192 csum 0x98f94189 expected csum 0x94374193 mirror 3
>>   [18275.055571] BTRFS warning (device sdf): csum failed root 5 ino 257 off 
>> 0 csum 0x98f94189 expected csum 0x94374193 mirror 1
>>   [18275.062171] BTRFS warning (device sdf): csum failed root 5 ino 257 off 
>> 12288 csum 0x98f94189 expected csum 0x94374193 mirror 1
>>
>> A scrub will also fail correcting bad copies, mentioning the following in
>> dmesg/syslog:
>>
>>   [18276.128696] scrub_handle_errored_block: 498 callbacks suppressed
>>   [18276.129617] BTRFS warning (device sdf): checksum error at logical 
>> 2186346496 on dev /dev/sde, sector 2116608, root 5, inode 257, offset 65536, 
>> length 4096, links $
>>   [18276.149235] btrfs_dev_stat_print_on_error: 498 callbacks suppressed
>>   [18276.157897] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, 
>> flush 0, corrupt 1, gen 0
>>   [18276.206059] BTRFS warning (device sdf): checksum error at logical 
>> 2186477568 on dev /dev/sdd, sector 2116736, root 5, inode 257, offset 
>> 196608, length 4096, links$
>>   [18276.206059] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, 
>> flush 0, corrupt 1, gen 0
>>   [18276.306552] BTRFS warning (device sdf): checksum error at logical 
>> 2186543104 on dev /dev/sdd, sector 2116864, root 5, inode 257, offset 
>> 262144, length 4096, links$
>>   [18276.319152] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, 
>> flush 0, corrupt 2, gen 0
>>   [18276.394316] BTRFS warning (device sdf): checksum error at logical 
>> 2186739712 on dev /dev/sdf, sector 2116992, root 5, inode 257, offset 
>> 458752, length 4096, links$
>>   [18276.396348] BTRFS error (device sdf): bdev /dev/sdf errs: wr 0, rd 0, 
>> flush 0, corrupt 1, gen 0
>>   [18276.434127] BTRFS warning (device sdf): checksum error at logical 
>> 2186870784 on dev /dev/sde, sector 2117120, root 5, inode 257, offset 
>> 589824, length 4096, links$
>>   [18276.434127] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, 
>> flush 0, corrupt 2, gen 0
>>   [18276.500504] BTRFS error (device sdf): unable to fixup (regular) error 
>> at logical 2186477568 on dev /dev/sdd
>>   [18276.538400] BTRFS warning (device sdf): checksum error at logical 
>> 2186481664 on dev /dev/sdd, sector 2116744, root 5, inode 257, offset 
>> 200704, length 4096, links$
>>   [18276.540452] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, 
>> flush 0, corrupt 3, gen 0
>>   [18276.542012] BTRFS error (device sdf): unable to fixup (regular) error 
>> at logical 2186481664 on dev /dev/sdd
>>   [18276.585030] BTRFS error (device sdf): unable to fixup (regular) error 
>> at logical 2186346496 on dev /dev/sde
>>   [18276.598306] BTRFS warning (device sdf): checksum error at logical 
>> 2186412032 on dev /dev/sde, sector 2116736, root 5, inode 257, offset 
>> 131072, length 4096, links$
>>   [

Re: [PATCH] Btrfs: fix write corruption due to bio cloning on raid5/6

2017-07-13 Thread Liu Bo
On Thu, Jul 13, 2017 at 03:09:54PM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> The recent changes to make bio cloning faster (added in the 4.13 merge
> window) by using the bio_clone_fast() API introduced a regression on
> raid5/6 modes, because cloned bios have an invalid bi_vcnt field
> (therefore it can not be used) and the raid5/6 code uses the
> bio_for_each_segment_all() API to iterate the segments of a bio, and this
> API uses a bio's bi_vcnt field.
> 
> The issue is very simple to trigger by doing for example a direct IO write
> against a raid5 or raid6 filesystem and then attempting to read what we
> wrote before:
> 
>   $ mkfs.btrfs -m raid5 -d raid5 -f /dev/sdc /dev/sdd /dev/sde /dev/sdf
>   $ mount /dev/sdc /mnt
>   $ xfs_io -f -d -c "pwrite -S 0xab 0 1M" /mnt/foobar
>   $ od -t x1 /mnt/foobar
>   od: /mnt/foobar: read error: Input/output error
> 
> For that example, the following is also reported in dmesg/syslog:
> 
>   [18274.985557] btrfs_print_data_csum_error: 18 callbacks suppressed
>   [18274.995277] BTRFS warning (device sdf): csum failed root 5 ino 257 off 0 
> csum 0x98f94189 expected csum 0x94374193 mirror 1
>   [18274.997205] BTRFS warning (device sdf): csum failed root 5 ino 257 off 
> 4096 csum 0x98f94189 expected csum 0x94374193 mirror 1
>   [18275.025221] BTRFS warning (device sdf): csum failed root 5 ino 257 off 
> 8192 csum 0x98f94189 expected csum 0x94374193 mirror 1
>   [18275.047422] BTRFS warning (device sdf): csum failed root 5 ino 257 off 
> 12288 csum 0x98f94189 expected csum 0x94374193 mirror 1
>   [18275.054818] BTRFS warning (device sdf): csum failed root 5 ino 257 off 
> 4096 csum 0x98f94189 expected csum 0x94374193 mirror 1
>   [18275.054834] BTRFS warning (device sdf): csum failed root 5 ino 257 off 
> 8192 csum 0x98f94189 expected csum 0x94374193 mirror 1
>   [18275.054943] BTRFS warning (device sdf): csum failed root 5 ino 257 off 
> 8192 csum 0x98f94189 expected csum 0x94374193 mirror 2
>   [18275.055207] BTRFS warning (device sdf): csum failed root 5 ino 257 off 
> 8192 csum 0x98f94189 expected csum 0x94374193 mirror 3
>   [18275.055571] BTRFS warning (device sdf): csum failed root 5 ino 257 off 0 
> csum 0x98f94189 expected csum 0x94374193 mirror 1
>   [18275.062171] BTRFS warning (device sdf): csum failed root 5 ino 257 off 
> 12288 csum 0x98f94189 expected csum 0x94374193 mirror 1
> 
> A scrub will also fail correcting bad copies, mentioning the following in
> dmesg/syslog:
> 
>   [18276.128696] scrub_handle_errored_block: 498 callbacks suppressed
>   [18276.129617] BTRFS warning (device sdf): checksum error at logical 
> 2186346496 on dev /dev/sde, sector 2116608, root 5, inode 257, offset 65536, 
> length 4096, links $
>   [18276.149235] btrfs_dev_stat_print_on_error: 498 callbacks suppressed
>   [18276.157897] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, 
> flush 0, corrupt 1, gen 0
>   [18276.206059] BTRFS warning (device sdf): checksum error at logical 
> 2186477568 on dev /dev/sdd, sector 2116736, root 5, inode 257, offset 196608, 
> length 4096, links$
>   [18276.206059] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, 
> flush 0, corrupt 1, gen 0
>   [18276.306552] BTRFS warning (device sdf): checksum error at logical 
> 2186543104 on dev /dev/sdd, sector 2116864, root 5, inode 257, offset 262144, 
> length 4096, links$
>   [18276.319152] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, 
> flush 0, corrupt 2, gen 0
>   [18276.394316] BTRFS warning (device sdf): checksum error at logical 
> 2186739712 on dev /dev/sdf, sector 2116992, root 5, inode 257, offset 458752, 
> length 4096, links$
>   [18276.396348] BTRFS error (device sdf): bdev /dev/sdf errs: wr 0, rd 0, 
> flush 0, corrupt 1, gen 0
>   [18276.434127] BTRFS warning (device sdf): checksum error at logical 
> 2186870784 on dev /dev/sde, sector 2117120, root 5, inode 257, offset 589824, 
> length 4096, links$
>   [18276.434127] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, 
> flush 0, corrupt 2, gen 0
>   [18276.500504] BTRFS error (device sdf): unable to fixup (regular) error at 
> logical 2186477568 on dev /dev/sdd
>   [18276.538400] BTRFS warning (device sdf): checksum error at logical 
> 2186481664 on dev /dev/sdd, sector 2116744, root 5, inode 257, offset 200704, 
> length 4096, links$
>   [18276.540452] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, 
> flush 0, corrupt 3, gen 0
>   [18276.542012] BTRFS error (device sdf): unable to fixup (regular) error at 
> logical 2186481664 on dev /dev/sdd
>   [18276.585030] BTRFS error (device sdf): unable to fixup (regular) error at 
> logical 2186346496 on dev /dev/sde
>   [18276.598306] BTRFS warning (device sdf): checksum error at logical 
> 2186412032 on dev /dev/sde, sector 2116736, root 5, inode 257, offset 131072, 
> length 4096, links$
>   [18276.598310] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, 
> flush 0, corrupt 3, gen 0
>   [18276.598582] BTRFS error

Re: [PATCH] Btrfs: fix write corruption due to bio cloning on raid5/6

2017-07-13 Thread David Sterba
On Thu, Jul 13, 2017 at 03:09:54PM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> The recent changes to make bio cloning faster (added in the 4.13 merge
> window) by using the bio_clone_fast() API introduced a regression on
> raid5/6 modes, because cloned bios have an invalid bi_vcnt field
> (therefore it can not be used) and the raid5/6 code uses the
> bio_for_each_segment_all() API to iterate the segments of a bio, and this
> API uses a bio's bi_vcnt field.

Good catch! I'm going to add assertions to all remaining
bio_for_each_segment_all calls so that cloned bios do not get
accidentally iterated that way.

Other raid levels seem to be fine with the use case you've provided; I'm
now running fstests with the assertions.
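
A minimal sketch of what such an assertion could look like (an illustration
only, assuming btrfs's ASSERT() macro, which is compiled in only with
CONFIG_BTRFS_ASSERT; process_page() is a made-up placeholder):

	struct bio_vec *bvec;
	int i;

	/* Illustration only: catch cloned bios before a bi_vcnt-based walk. */
	ASSERT(!bio_flagged(bio, BIO_CLONED));
	bio_for_each_segment_all(bvec, bio, i)
		process_page(bvec->bv_page);	/* placeholder callback */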


[PATCH] Btrfs: fix write corruption due to bio cloning on raid5/6

2017-07-13 Thread fdmanana
From: Filipe Manana 

The recent changes to make bio cloning faster (added in the 4.13 merge
window) by using the bio_clone_fast() API introduced a regression on
raid5/6 modes, because cloned bios have an invalid bi_vcnt field
(therefore it can not be used) and the raid5/6 code uses the
bio_for_each_segment_all() API to iterate the segments of a bio, and this
API uses a bio's bi_vcnt field.
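
For context, a minimal sketch (not part of the patch) of the difference
between the two iterators, using the 4.12-era signatures:

#include <linux/bio.h>
#include <linux/printk.h>

/*
 * Sketch only, not from the patch: bio_for_each_segment_all() walks
 * bi_io_vec[0..bi_vcnt), but bio_clone_fast() shares the parent's
 * bi_io_vec and does not set up bi_vcnt on the clone, so that walk is
 * meaningless for a cloned bio.  bio_for_each_segment() is driven by
 * bi_iter instead, which a cloned bio does carry.
 */
static void walk_segments_iter(struct bio *bio)
{
	struct bio_vec bv;
	struct bvec_iter iter;

	/* Safe for cloned and non-cloned bios alike: driven by bi_iter. */
	bio_for_each_segment(bv, bio, iter)
		pr_debug("page %p len %u\n", bv.bv_page, bv.bv_len);
}

static void walk_segments_all(struct bio *bio)
{
	struct bio_vec *bvec;
	int i;

	/* Only valid for bios this code built itself: driven by bi_vcnt. */
	bio_for_each_segment_all(bvec, bio, i)
		pr_debug("page %p len %u\n", bvec->bv_page, bvec->bv_len);
}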

The issue is very simple to trigger by doing for example a direct IO write
against a raid5 or raid6 filesystem and then attempting to read what we
wrote before:

  $ mkfs.btrfs -m raid5 -d raid5 -f /dev/sdc /dev/sdd /dev/sde /dev/sdf
  $ mount /dev/sdc /mnt
  $ xfs_io -f -d -c "pwrite -S 0xab 0 1M" /mnt/foobar
  $ od -t x1 /mnt/foobar
  od: /mnt/foobar: read error: Input/output error

For that example, the following is also reported in dmesg/syslog:

  [18274.985557] btrfs_print_data_csum_error: 18 callbacks suppressed
  [18274.995277] BTRFS warning (device sdf): csum failed root 5 ino 257 off 0 
csum 0x98f94189 expected csum 0x94374193 mirror 1
  [18274.997205] BTRFS warning (device sdf): csum failed root 5 ino 257 off 
4096 csum 0x98f94189 expected csum 0x94374193 mirror 1
  [18275.025221] BTRFS warning (device sdf): csum failed root 5 ino 257 off 
8192 csum 0x98f94189 expected csum 0x94374193 mirror 1
  [18275.047422] BTRFS warning (device sdf): csum failed root 5 ino 257 off 
12288 csum 0x98f94189 expected csum 0x94374193 mirror 1
  [18275.054818] BTRFS warning (device sdf): csum failed root 5 ino 257 off 
4096 csum 0x98f94189 expected csum 0x94374193 mirror 1
  [18275.054834] BTRFS warning (device sdf): csum failed root 5 ino 257 off 
8192 csum 0x98f94189 expected csum 0x94374193 mirror 1
  [18275.054943] BTRFS warning (device sdf): csum failed root 5 ino 257 off 
8192 csum 0x98f94189 expected csum 0x94374193 mirror 2
  [18275.055207] BTRFS warning (device sdf): csum failed root 5 ino 257 off 
8192 csum 0x98f94189 expected csum 0x94374193 mirror 3
  [18275.055571] BTRFS warning (device sdf): csum failed root 5 ino 257 off 0 
csum 0x98f94189 expected csum 0x94374193 mirror 1
  [18275.062171] BTRFS warning (device sdf): csum failed root 5 ino 257 off 
12288 csum 0x98f94189 expected csum 0x94374193 mirror 1

A scrub will also fail correcting bad copies, mentioning the following in
dmesg/syslog:

  [18276.128696] scrub_handle_errored_block: 498 callbacks suppressed
  [18276.129617] BTRFS warning (device sdf): checksum error at logical 
2186346496 on dev /dev/sde, sector 2116608, root 5, inode 257, offset 65536, 
length 4096, links $
  [18276.149235] btrfs_dev_stat_print_on_error: 498 callbacks suppressed
  [18276.157897] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, 
flush 0, corrupt 1, gen 0
  [18276.206059] BTRFS warning (device sdf): checksum error at logical 
2186477568 on dev /dev/sdd, sector 2116736, root 5, inode 257, offset 196608, 
length 4096, links$
  [18276.206059] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, 
flush 0, corrupt 1, gen 0
  [18276.306552] BTRFS warning (device sdf): checksum error at logical 
2186543104 on dev /dev/sdd, sector 2116864, root 5, inode 257, offset 262144, 
length 4096, links$
  [18276.319152] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, 
flush 0, corrupt 2, gen 0
  [18276.394316] BTRFS warning (device sdf): checksum error at logical 
2186739712 on dev /dev/sdf, sector 2116992, root 5, inode 257, offset 458752, 
length 4096, links$
  [18276.396348] BTRFS error (device sdf): bdev /dev/sdf errs: wr 0, rd 0, 
flush 0, corrupt 1, gen 0
  [18276.434127] BTRFS warning (device sdf): checksum error at logical 
2186870784 on dev /dev/sde, sector 2117120, root 5, inode 257, offset 589824, 
length 4096, links$
  [18276.434127] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, 
flush 0, corrupt 2, gen 0
  [18276.500504] BTRFS error (device sdf): unable to fixup (regular) error at 
logical 2186477568 on dev /dev/sdd
  [18276.538400] BTRFS warning (device sdf): checksum error at logical 
2186481664 on dev /dev/sdd, sector 2116744, root 5, inode 257, offset 200704, 
length 4096, links$
  [18276.540452] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, 
flush 0, corrupt 3, gen 0
  [18276.542012] BTRFS error (device sdf): unable to fixup (regular) error at 
logical 2186481664 on dev /dev/sdd
  [18276.585030] BTRFS error (device sdf): unable to fixup (regular) error at 
logical 2186346496 on dev /dev/sde
  [18276.598306] BTRFS warning (device sdf): checksum error at logical 
2186412032 on dev /dev/sde, sector 2116736, root 5, inode 257, offset 131072, 
length 4096, links$
  [18276.598310] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, 
flush 0, corrupt 3, gen 0
  [18276.598582] BTRFS error (device sdf): unable to fixup (regular) error at 
logical 2186350592 on dev /dev/sde
  [18276.603455] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, 
flush 0, corrupt 4, gen 0
  [18276.638362] BTRFS warning (device sdf): checksum error a