Re: write corruption due to bio cloning on raid5/6
Janos Toth F. posted on Sun, 30 Jul 2017 03:39:10 +0200 as excerpted:

[OT but related topic continues...]

> I still get shivers if I need to resize a filesystem due to the
> memories of those early tragic experiences when I never won the lottery
> on the "trial and error" runs but lost filesystems with both hands and
> learned what widespread silent corruption is and how you can refresh
> your backups with corrupted copies... Let's not take me back to those
> early days, please. I don't want to live in a cave anymore. Thank you,
> modern filesystems (and their authors). :)
>
> And on that note... Assuming I had interference problems, they were
> caused by my own human mistake/negligence. I can always make similar or
> bigger human mistakes, independent of disk-level segregation. (For
> example, no amount of partitioning will save any data if I accidentally
> wipe the entire drive with dd, or if I have it security-locked by the
> controller and lose the passwords, etc.)

I was glad to say goodbye to MSDOS/MBR style partitions as well, but just as happy to enthusiastically endorse GPT/EFI style partitions, with their effectively unlimited partition numbers (128 allowed at the default table size), no primary/logical partition distinction to worry about, partition (as opposed to filesystem-in-the-partition) labels/names, integrity checksums, and a second copy of the table at the other end of the device. =:^)

And while all admins have their fat-finger or fat-head, aka brown-bag, experiences, I've never erased the wrong partition, tho I can certainly remember being /very/ careful the first couple times I did partitioning, back in the 90s on MSDOS.

Thankfully, these days even ssds are "reasonably" priced, and spinning rust costs the trivial price of perhaps a couple meals out, so as long as there are backups on other physical devices, getting even the device name wrong simply means losing perhaps your working copy instead of redoing the layout of one of your backups.
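Those GPT features (CRC-protected headers and a backup table at the far end of the device) can be checked before touching a disk. A sketch using sgdisk from the gdisk package; /dev/sdX is a placeholder device name:

```shell
# Non-destructive GPT inspection: print the current layout, then verify
# the CRC32s of the primary header/partition table and of the backup
# copy stored at the end of the device. Neither command writes anything.
sgdisk --print /dev/sdX
sgdisk --verify /dev/sdX
```

If --verify reports damage to the primary table, the intact backup copy can usually be used to restore it.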
And of course you can see the existing layout on the device before you repartition it, and if it's not what you expected or there are any other problems, you just back out without doing that final writeout of the new partition table.

FWIW my last brown-bag was writing and running a script as root, with a variable-name typo that left varname empty in an rm -rf $varname/. I caught and stopped it after it had emptied /bin, while it was in /etc, I believe. Luckily I could boot to the (primary) backup.

But meanwhile, two experiences that set in concrete the practicality of separate filesystems on their own partitions, for me:

1) Back on MS, IE4-beta era. I was running the public beta when the MSIE devs decided that for performance reasons they needed to write directly to the IE cache index on disk, bypassing the usual filesystem methods. What they didn't think about, however, was IE's new integration into the Explorer shell, meaning it was running all the time. So along came people running the beta, running their scheduled defrag, which decided the index was fragmented and moved it out from under the of course still-running Explorer shell, so the next time IE direct-wrote to what WAS the cache index, it was overwriting whatever file defrag had moved to that spot after it moved the cache file out. The eventual fix was to set the system attribute on the cache index, so the defragger wouldn't touch it.

I know a number of people running that beta who lost important files to that, when those files got moved into the old on-disk location of the cache index file and overwritten by IE when it direct-wrote to what it /thought/ was still the on-disk location of its index file. But I was fine, never in any danger, because IE's "Temporary Internet Files" cache was on a dedicated tempfiles filesystem. So the only files it overwrote for me were temporary in any case.

2) Some years ago, during a Phoenix summer, my AC went out.
I was in a trailer at the time, so without the AC it got hot pretty quickly, and I was away, with the computer left on, at the time it went out. The high in the shade that day was about 47C/117F, and the trailer was in the sun, so it easily hit 55-60C/131-140F inside. The computer was obviously going to be hotter than that, and the spinning disk in the computer hotter still, so it easily hit 70C/158F or higher.

The CPU shut down of course, and was fine when I turned it back on after a cooldown. The disk... not so fine. I'm sure it physically head-crashed, and if I had taken it apart I'd have found grooves on the platter.

But... disks were a lot more expensive back then, and I didn't have another disk with backups. What I *DID* have were backup partitions on the same disk, and because they weren't mounted at the time, the head didn't try seeking to them, and they weren't damaged (at least not beyond what could be repaired). When I went to assess things after everything cooled down, the damage was (almost) all o
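The empty-variable rm -rf failure mode described earlier in this message can be guarded against in shell. A minimal sketch; varname is deliberately left unset here to simulate the typo:

```shell
#!/bin/sh
# With ${varname:?message}, the shell refuses to expand an unset or empty
# variable and aborts that expansion, so "rm -rf $varname/" can never
# silently become "rm -rf /". The expansion is probed in a subshell so the
# script itself survives to report the refusal.
# (set -u gives similar protection for every expansion in a script.)
if ( : "${varname:?is unset or empty}" ) 2>/dev/null; then
    rm -rf -- "${varname}/"
else
    echo "refusing to run rm -rf on an empty path"
fi
```

Run against an unset varname, the guard takes the else branch and prints the refusal instead of deleting anything.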
Re: write corruption due to bio cloning on raid5/6
Reply to the TL;DR part, so TL;DR marker again...

Well, I live on the other extreme now. I want as few filesystems as possible and viable. (It's obviously impossible to have a real backup within the same fs and/or device, and with the current size/performance/price differences between HDD and SSD it makes sense to separate the "small and fast" from the "big and slow" storage, but other than that...)

I always believed (even before I got a real grasp on these things and could explain my view or argue about this) that "subvolumes" (in a general sense, but let's use this word here) should reside below filesystems (and be totally optional), that filesystems should spread over a whole disk or (md- or hardware) RAID volume (forget the MSDOS partitions), and that even these ZFS/Btrfs-style subvolumes should be used sparingly (only when you really have a good enough reason to create a subvolume, although it doesn't matter nearly as much with subvolumes as it does with partitions).

I remember the days when I thought it was important to create separate partitions for different kinds of data (10+ years ago, when I was aware I didn't have the experience to deviate from common general teachings). I remember all the pain of randomly running out of space on any and all filesystems and eventually mixing the various kinds of data on every theoretically-segregated filesystem (wherever I found free space), causing a nightmare of a broken sorting system (like a library after a tornado), and then all the horror of my first Russian-roulette-like experiences of resizing partitions and filesystems to make the segregation decent again. And I saw much worse on other people's machines. At one point, I decided to create as few partitions as possible (and I really like the idea of zero partitions; I don't miss MSDOS).
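The subvolumes-below-filesystems layout argued for above can be sketched with btrfs-progs. This assumes an existing btrfs mounted at /mnt/pool (a placeholder path) and requires root:

```shell
# Subvolumes give the segregation that partitions used to, without fixed
# sizes: they all draw from the filesystem's one shared pool of free
# space, so no subvolume can "run out" while another sits half empty.
btrfs subvolume create /mnt/pool/media
btrfs subvolume create /mnt/pool/backups
btrfs subvolume list /mnt/pool    # confirm both subvolumes exist
```

Each subvolume can still be mounted, snapshotted, or deleted independently, which is what makes them a practical replacement for most data-segregation partitioning schemes.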
I still get shivers if I need to resize a filesystem, due to the memories of those early tragic experiences when I never won the lottery on the "trial and error" runs but lost filesystems with both hands, and learned what widespread silent corruption is and how you can refresh your backups with corrupted copies... Let's not take me back to those early days, please. I don't want to live in a cave anymore. Thank you, modern filesystems (and their authors). :)

And on that note... Assuming I had interference problems, they were caused by my own human mistake/negligence. I can always make similar or bigger human mistakes, independent of disk-level segregation. (For example, no amount of partitioning will save any data if I accidentally wipe the entire drive with dd, or if I have it security-locked by the controller and lose the passwords, etc.)

-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: write corruption due to bio cloning on raid5/6
Janos Toth F. posted on Sat, 29 Jul 2017 05:02:48 +0200 as excerpted:

> The read-only scrub finished without errors/hangs (with kernel
> 4.12.3). So, I guess the hangs were caused by:
> 1: other bug in 4.13-RC1
> 2: crazy-random SATA/disk-controller issue
> 3: interference between various btrfs tools [*]
> 4: something in the background did a DIO write with 4.13-RC1 (but all
> affected content was eventually overwritten/deleted between the scrub
> attempts)
>
> [*] I expected scrub to finish in ~5 rather than ~40 hours (and didn't
> expect interference issues), so I didn't disable the scheduled
> maintenance script which deletes old files, recursively defrags the
> whole fs and runs a balance with usage=33 filters. I guess either of
> those (especially balance) could potentially cause scrub to hang.

That #3, interference between btrfs tools, could be it.

It seems btrfs in general is getting stable enough now that we're beginning to see bugs exposed by running two or more tools at once: the devs have apparently caught and fixed enough of the single-usage race bugs that individual tools are working reasonably well, and it's now the concurrent multi-usage races, which no one was thinking about when they were writing the code, that are being exposed. At least, there have been a number of such bugs either definitely traced or plausibly attributable to concurrent usage, reported and traced/fixed, lately; more than I remember seeing in the past.

(TL;DR folks can stop at that.)

Incidentally, that's one more advantage to my own strategy of multiple independent small btrfs: keeping everything small enough that maintenance jobs are at least tolerably short makes it actually practical to run them. Tho my case is surely an extreme, with everything on ssd and my largest btrfs, even after recently switching my media filesystems to ssd and btrfs, being 80 GiB (usable and per device, btrfs raid1 on paired partitions, each on a different physical ssd).
I use neither quotas, which don't scale well on btrfs and which I don't need, nor snapshots, which have a bit of a scaling issue as well (tho apparently not as bad as quotas), because weekly or monthly backups are enough here, and the filesystems are small enough (and on ssd) to do full-copy backups in minutes each. In fact, making backups easier was a major reason I switched even the backups and media devices to all ssd, this time.

So scrubs are trivially short enough that I can run them and wait for the results while composing posts such as this (bscrub is my scrub script, run here by my admin user with a stub setup so sudo isn't required):

$$ bscrub /mm
scrub device /dev/sda11 (id 1) done
        scrub started at Sat Jul 29 14:50:54 2017 and finished after 00:01:08
        total bytes scrubbed: 33.98GiB with 0 errors
scrub device /dev/sdb11 (id 2) done
        scrub started at Sat Jul 29 14:50:54 2017 and finished after 00:01:08
        total bytes scrubbed: 33.98GiB with 0 errors

Just over a minute for a scrub of both devices on my largest, 80-gig-per-device, btrfs. =:^) Tho projecting to full it might be two and a half minutes... Tho of course parity-raid scrubs would be far slower; at a WAG, an hour or two for similar size on spinning rust...

Balances are similar, but being on ssd and not needing one on any of the still relatively freshly redone filesystems ATM, I don't feel inclined to needlessly spend a write cycle just for demonstration.

With filesystem maintenance runtimes of a minute, definitely under five minutes, per filesystem, and with full backups under 10, I don't /need/ to run more than one tool at once, and backups can trivially be kept fresh enough that I don't really feel the need to schedule maintenance and risk running more than one that way, either, particularly when I know it'll be done in minutes if I run it manually.
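The poster's actual bscrub script isn't shown; a hypothetical minimal wrapper that would produce a transcript like the one above might look like this (the flags are standard btrfs-progs scrub options):

```shell
#!/bin/sh
# Hypothetical "bscrub"-style wrapper: run a foreground scrub (-B, don't
# background) with per-device statistics (-d) on the given mountpoint,
# so the summary prints inline when the scrub finishes.
btrfs scrub start -B -d "${1:?usage: bscrub <mountpoint>}"
```

The real script presumably also handles the privilege "stub setup" mentioned above, which isn't reconstructed here.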
=:^) Like I said, I'm obviously an extreme case, but equally obviously, while I see the runtime concurrency bug reports on the list, it's not something likely to affect me personally. =:^)

-- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: write corruption due to bio cloning on raid5/6
The read-only scrub finished without errors/hangs (with kernel 4.12.3). So, I guess the hangs were caused by:

1: other bug in 4.13-RC1
2: crazy-random SATA/disk-controller issue
3: interference between various btrfs tools [*]
4: something in the background did a DIO write with 4.13-RC1 (but all affected content was eventually overwritten/deleted between the scrub attempts)

[*] I expected scrub to finish in ~5 rather than ~40 hours (and didn't expect interference issues), so I didn't disable the scheduled maintenance script, which deletes old files, recursively defrags the whole fs, and runs a balance with usage=33 filters. I guess either of those (especially balance) could potentially cause scrub to hang.

On Thu, Jul 27, 2017 at 10:44 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> Janos Toth F. posted on Thu, 27 Jul 2017 16:14:47 +0200 as excerpted:
>
>> * This is off-topic but raid5 scrub is painful. The disks run at
>> constant ~100% utilization while performing at ~1/5 of their sequential
>> read speeds. And despite explicitly asking for idle IO priority when
>> launching scrub, the filesystem becomes unbearably slow (while scrub
>> takes a day or so to finish ... or get to the point where it hung the
>> last time around, close to the end).
>
> That's because basically all the userspace scrub command does is make the
> appropriate kernel calls to have the kernel do the real scrub. So
> priority-idling the userspace scrub doesn't do what it does on normal
> userspace jobs that do much of the work themselves.
>
> The problem is that idle-prioritizing the kernel threads actually doing
> the work could risk a deadlock due to lock inversion, since they're
> kernel threads and aren't designed with the idea of people messing with
> their priority in mind.
>
> Meanwhile, that's yet another reason btrfs raid56 mode isn't recommended
> at this time. Try btrfs raid1 or raid10 mode instead, or possibly btrfs
> raid1, single or raid0 mode on top of a pair of mdraid5s or similar.
> Tho parity-raid mode in general (that is, not btrfs-specific) is known
> for being slow in various cases, with raid10 normally being the
> best-performing closest alternative. (Tho in the btrfs-specific case,
> btrfs raid1 on top of a pair of mdraid/dmraid/whatever raid0s is the
> normally recommended higher-performance, reasonably-low-danger
> alternative.)

If this applies to all RAID flavors then I consider the built-in help and the manual pages of scrub misleading (if it's RAID56-only, the manual should still mention how RAID56 is an exception).

Also, a resumed scrub seems to skip a lot of data. It picks up where it left off but then prematurely reports a job well done. I remember noticing similar behavior with balance cancel/resume on RAID5 a few years ago (it went on for a few more chunks but left the rest alone and reported completion --- I am not sure if that's fixed now or whether these have a common root cause).
Re: write corruption due to bio cloning on raid5/6
Janos Toth F. posted on Thu, 27 Jul 2017 16:14:47 +0200 as excerpted:

> * This is off-topic but raid5 scrub is painful. The disks run at
> constant ~100% utilization while performing at ~1/5 of their sequential
> read speeds. And despite explicitly asking for idle IO priority when
> launching scrub, the filesystem becomes unbearably slow (while scrub
> takes a day or so to finish ... or get to the point where it hung the
> last time around, close to the end).

That's because basically all the userspace scrub command does is make the appropriate kernel calls to have the kernel do the real scrub. So priority-idling the userspace scrub doesn't do what it does on normal userspace jobs that do much of the work themselves.

The problem is that idle-prioritizing the kernel threads actually doing the work could risk a deadlock due to lock inversion, since they're kernel threads and aren't designed with the idea of people messing with their priority in mind.

Meanwhile, that's yet another reason btrfs raid56 mode isn't recommended at this time. Try btrfs raid1 or raid10 mode instead, or possibly btrfs raid1, single or raid0 mode on top of a pair of mdraid5s or similar. Tho parity-raid mode in general (that is, not btrfs-specific) is known for being slow in various cases, with raid10 normally being the best-performing closest alternative. (Tho in the btrfs-specific case, btrfs raid1 on top of a pair of mdraid/dmraid/whatever raid0s is the normally recommended higher-performance, reasonably-low-danger alternative.)

-- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
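For reference, btrfs-progs does expose I/O-priority options on the scrub command itself (rather than via ionice on the userspace process). A sketch; /mnt/pool is a placeholder mountpoint and the command requires root:

```shell
# Request the idle I/O scheduling class for the scrub's own I/O
# (-c 3 = idle class; -n would set the classdata level for the
# realtime/best-effort classes). -B keeps the scrub in the foreground.
# Per the discussion above, the heavy lifting happens in kernel worker
# threads, so this may not throttle the actual disk load very much.
btrfs scrub start -B -c 3 /mnt/pool
```

This is the supported knob for the use case described in the quoted message, with the caveat that its effectiveness is limited for exactly the reason Duncan explains.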
Re: write corruption due to bio cloning on raid5/6
> It should only affect the dio-written files; the mentioned bug makes
> btrfs write garbage into those files, so checksums fail when reading
> the files, nothing else from this bug.

Thanks for confirming that. I thought so, but I removed the affected temporary files even before I knew they were corrupt, yet I had trouble with the follow-up scrub, so I got confused about the scope of the issue. However, I am not sure if some other software which regularly runs in the background might use DIO (I don't think so but can't say for sure).

> A hang could normally be caught by sysrq-w, could you please try it
> and see if there is a difference in kernel log?

It's not a total system hang. The filesystem in question effectively becomes read-only (I forgot to check if it actually turns RO or writes just silently hang) and scrub hangs (it doesn't seem to do any disk I/O and can't be cancelled gracefully). A graceful reboot or shutdown silently fails.

In the meantime, I switched to Linux 4.12.3, updated the firmware on the HDDs and ran an extended SMART self-test (which found no errors), used cp to copy everything (not for backup but as a form of "crude scrub" [see *], which yielded zero errors) and now finally started a scrub (in foreground and read-only mode this time).

* This is off-topic but raid5 scrub is painful. The disks run at constant ~100% utilization while performing at ~1/5 of their sequential read speeds. And despite explicitly asking for idle IO priority when launching scrub, the filesystem becomes unbearably slow (while scrub takes a day or so to finish ... or get to the point where it hung the last time around, close to the end). I find it a little strange that BFQ and the on-board disk caching with NCQ + FUA (look-ahead read caching and write-cache reordering enabled, with 128MB on-board caches) can't mitigate the issue a little better (whatever scrub is doing "wrong" from a performance perspective).
If scrub hangs again, I will try to extract something useful from the logs.
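The "crude scrub" mentioned above works because forcing every byte of every file to be read makes btrfs verify each data extent's checksum along the way; failures surface as read errors and "csum failed" lines in dmesg. One way to sketch it, demonstrated on a throwaway directory (point it at the real mountpoint, e.g. /mnt/pool, for an actual check):

```shell
#!/bin/sh
# Stream every regular file to /dev/null so the filesystem must read
# (and, on btrfs, checksum-verify) all of it. -xdev keeps find from
# crossing into other mounted filesystems.
target=$(mktemp -d)                 # stand-in for the real mountpoint
echo "some data" > "$target/file1"
echo "more data" > "$target/file2"
find "$target" -xdev -type f -exec cat {} + > /dev/null \
    && echo "all files read back without I/O errors"
rm -r "$target"
```

Unlike a real scrub, this only checks data reachable through live files (not all copies/parity, and not metadata), which is presumably why the poster calls it crude.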
Re: write corruption due to bio cloning on raid5/6
On Mon, Jul 24, 2017 at 10:22:53PM +0200, Janos Toth F. wrote:

> I accidentally ran into this problem (it's pretty silly because I
> almost never run RC kernels or do dio writes but somehow I just
> happened to do both at once, exactly before I read your patch notes).
> I didn't initially catch any issues (I see no related messages in the
> kernel log) but after seeing your patch, I started a scrub (*) and it
> hung.
>
> Is there a way to fix a filesystem corrupted by this bug or does it
> need to be destroyed and recreated? (It's m=s=raid10, d=raid5 with
> 5x4TB HDDs.) There is a partial backup (of everything really
> important; the rest is not important enough to be kept in multiple
> copies, hence the desire for raid5...) and everything seems to be
> readable anyway (so could be saved if needed) but nuking a big fs is
> never fun...

It should only affect the dio-written files; the mentioned bug makes btrfs write garbage into those files, so checksums fail when reading the files, nothing else from this bug. As you use m=s=raid10, the filesystem metadata is OK, so I think the scrub hang could be another problem.

> Scrub just hangs and pretty much hangs the whole system with it (it
> needs a power cycle for a reboot), although everything runs smoothly
> besides this. Btrfs check (read-only, normal-mem mode) finds no errors,
> the kernel log is clean, etc.
>
> I think I deleted all the affected dio-written test-files even before
> I started scrubbing, so that doesn't seem to do the trick. Any other
> ideas?

A hang could normally be caught by sysrq-w; could you please try it and see if there is a difference in the kernel log?

Thanks,

-liubo

> * By the way, I see raid56 scrub is still painfully slow (~30MB/s per
> disk, with raw disk speeds of >100MB/s). I forgot about this issue
> since I last used raid5 a few years ago.
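The sysrq-w capture requested above can be triggered from a root shell; 'w' dumps the stack traces of all tasks stuck in uninterruptible (D) state into the kernel log:

```shell
# Requires root. Enable sysrq, dump blocked-task backtraces, then read
# them back from the kernel ring buffer.
echo 1 > /proc/sys/kernel/sysrq     # make sure sysrq functions are enabled
echo w > /proc/sysrq-trigger        # 'w': show blocked (D-state) tasks
dmesg | tail -n 80                  # inspect the backtraces just logged
```

For a hung scrub, the interesting backtraces are the scrub worker threads: they show which lock or I/O the kernel is waiting on.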
Re: write corruption due to bio cloning on raid5/6
I accidentally ran into this problem (it's pretty silly because I almost never run RC kernels or do dio writes, but somehow I just happened to do both at once, exactly before I read your patch notes). I didn't initially catch any issues (I see no related messages in the kernel log) but after seeing your patch, I started a scrub (*) and it hung.

Is there a way to fix a filesystem corrupted by this bug or does it need to be destroyed and recreated? (It's m=s=raid10, d=raid5 with 5x4TB HDDs.) There is a partial backup (of everything really important; the rest is not important enough to be kept in multiple copies, hence the desire for raid5...) and everything seems to be readable anyway (so could be saved if needed), but nuking a big fs is never fun...

Scrub just hangs and pretty much hangs the whole system with it (it needs a power cycle for a reboot), although everything runs smoothly besides this. Btrfs check (read-only, normal-mem mode) finds no errors, the kernel log is clean, etc.

I think I deleted all the affected dio-written test-files even before I started scrubbing, so that doesn't seem to do the trick. Any other ideas?

* By the way, I see raid56 scrub is still painfully slow (~30MB/s per disk, with raw disk speeds of >100MB/s). I forgot about this issue since I last used raid5 a few years ago.
Re: [PATCH] Btrfs: fix write corruption due to bio cloning on raid5/6
On Fri, Jul 14, 2017 at 10:04:34AM +0100, Filipe Manana wrote:
> On Thu, Jul 13, 2017 at 11:38 PM, Liu Bo wrote:
> > On Thu, Jul 13, 2017 at 09:36:19AM -0700, Liu Bo wrote:
> >> On Thu, Jul 13, 2017 at 03:09:54PM +0100, fdman...@kernel.org wrote:
> >> > From: Filipe Manana
> >> >
> >> > The recent changes to make bio cloning faster (added in the 4.13 merge
> >> > window) by using the bio_clone_fast() API introduced a regression on
> >> > raid5/6 modes, because cloned bios have an invalid bi_vcnt field
> >> > (therefore it can not be used) and the raid5/6 code uses the
> >> > bio_for_each_segment_all() API to iterate the segments of a bio, and
> >> > this API uses a bio's bi_vcnt field.
> >> >
> >> > The issue is very simple to trigger by doing for example a direct IO
> >> > write against a raid5 or raid6 filesystem and then attempting to read
> >> > what we wrote before:
> >> >
> >> >   $ mkfs.btrfs -m raid5 -d raid5 -f /dev/sdc /dev/sdd /dev/sde /dev/sdf
> >> >   $ mount /dev/sdc /mnt
> >> >   $ xfs_io -f -d -c "pwrite -S 0xab 0 1M" /mnt/foobar
> >> >   $ od -t x1 /mnt/foobar
> >> >   od: /mnt/foobar: read error: Input/output error
> >> >
> >> > For that example, the following is also reported in dmesg/syslog:
> >> >
> >> > [18274.985557] btrfs_print_data_csum_error: 18 callbacks suppressed
> >> > [18274.995277] BTRFS warning (device sdf): csum failed root 5 ino 257 off 0 csum 0x98f94189 expected csum 0x94374193 mirror 1
> >> > [18274.997205] BTRFS warning (device sdf): csum failed root 5 ino 257 off 4096 csum 0x98f94189 expected csum 0x94374193 mirror 1
> >> > [18275.025221] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 1
> >> > [18275.047422] BTRFS warning (device sdf): csum failed root 5 ino 257 off 12288 csum 0x98f94189 expected csum 0x94374193 mirror 1
> >> > [18275.054818] BTRFS warning (device sdf): csum failed root 5 ino 257 off 4096 csum 0x98f94189 expected csum 0x94374193 mirror 1
> >> > [18275.054834] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 1
> >> > [18275.054943] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 2
> >> > [18275.055207] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 3
> >> > [18275.055571] BTRFS warning (device sdf): csum failed root 5 ino 257 off 0 csum 0x98f94189 expected csum 0x94374193 mirror 1
> >> > [18275.062171] BTRFS warning (device sdf): csum failed root 5 ino 257 off 12288 csum 0x98f94189 expected csum 0x94374193 mirror 1
> >> >
> >> > A scrub will also fail correcting bad copies, mentioning the following
> >> > in dmesg/syslog:
> >> >
> >> > [18276.128696] scrub_handle_errored_block: 498 callbacks suppressed
> >> > [18276.129617] BTRFS warning (device sdf): checksum error at logical 2186346496 on dev /dev/sde, sector 2116608, root 5, inode 257, offset 65536, length 4096, links $
> >> > [18276.149235] btrfs_dev_stat_print_on_error: 498 callbacks suppressed
> >> > [18276.157897] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
> >> > [18276.206059] BTRFS warning (device sdf): checksum error at logical 2186477568 on dev /dev/sdd, sector 2116736, root 5, inode 257, offset 196608, length 4096, links$
> >> > [18276.206059] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
> >> > [18276.306552] BTRFS warning (device sdf): checksum error at logical 2186543104 on dev /dev/sdd, sector 2116864, root 5, inode 257, offset 262144, length 4096, links$
> >> > [18276.319152] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
> >> > [18276.394316] BTRFS warning (device sdf): checksum error at logical 2186739712 on dev /dev/sdf, sector 2116992, root 5, inode 257, offset 458752, length 4096, links$
> >> > [18276.396348] BTRFS error (device sdf): bdev /dev/sdf errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
> >> > [18276.434127] BTRFS warning (device sdf): checksum error at logical 2186870784 on dev /dev/sde, sector 2117120, root 5, inode 257, offset 589824, length 4096, links$
> >> > [18276.434127] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
> >> > [18276.500504] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186477568 on dev /dev/sdd
> >> > [18276.538400] BTRFS warning (device sdf): checksum error at logical 2186481664 on dev /dev/sdd, sector 2116744, root 5, inode 257, offset 200704, length 4096, links$
> >> > [18276.540452] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd
Re: [PATCH] Btrfs: fix write corruption due to bio cloning on raid5/6
On Fri, Jul 14, 2017 at 10:04:34AM +0100, Filipe Manana wrote:
> On Thu, Jul 13, 2017 at 11:38 PM, Liu Bo wrote:
> > On Thu, Jul 13, 2017 at 09:36:19AM -0700, Liu Bo wrote:
> >> On Thu, Jul 13, 2017 at 03:09:54PM +0100, fdman...@kernel.org wrote:
> >> > @@ -1423,11 +1430,14 @@ static int fail_bio_stripe(struct btrfs_raid_bio *rbio,
> >> >   */
> >> >  static void set_bio_pages_uptodate(struct bio *bio)
> >> >  {
> >> > -	struct bio_vec *bvec;
> >> > -	int i;
> >> > +	struct bio_vec bvec;
> >> > +	struct bvec_iter iter;
> >> > +
> >> > +	if (bio_flagged(bio, BIO_CLONED))
> >> > +		bio->bi_iter = btrfs_io_bio(bio)->iter;
> >
> > It is not necessary to update set_bio_pages_uptodate() because the bio
> > here is for reading pages in RMW and no way it could be a cloned one.
>
> From my testing I've never seen this function getting cloned bios as
> well. But I'm afraid we don't have full test coverage and it scares me
> to have code around that can cause silent corruptions or crashes due
> to such subtle details (bi_vcnt can't be used for cloned bios and it's
> not obvious which APIs from the block layer make use of it without
> reading their code). Plus a quick look I had at btrfs_map_bio()
> suggested it can clone bios used for read operations through
> btrfs_submit_bio_hook() for example. So that's why I changed that
> function too to be able to deal both with cloned and non-cloned bios.

I agree with that approach. My first reaction was to switch to bio_for_each_segment everywhere, but this would increase stack consumption by 32 bytes, so with the asserts in place I hope we'd catch potential cloned/bi_vcnt violations. As you point out, the API is not clear on that.
Re: [PATCH] Btrfs: fix write corruption due to bio cloning on raid5/6
On Thu, Jul 13, 2017 at 11:38 PM, Liu Bo wrote:
> On Thu, Jul 13, 2017 at 09:36:19AM -0700, Liu Bo wrote:
>> On Thu, Jul 13, 2017 at 03:09:54PM +0100, fdman...@kernel.org wrote:
>> > From: Filipe Manana
>> >
>> > The recent changes to make bio cloning faster (added in the 4.13 merge
>> > window) by using the bio_clone_fast() API introduced a regression on
>> > raid5/6 modes, because cloned bios have an invalid bi_vcnt field
>> > (therefore it can not be used) and the raid5/6 code uses the
>> > bio_for_each_segment_all() API to iterate the segments of a bio, and
>> > this API uses a bio's bi_vcnt field.
>> >
>> > The issue is very simple to trigger by doing for example a direct IO
>> > write against a raid5 or raid6 filesystem and then attempting to read
>> > what we wrote before:
>> >
>> >   $ mkfs.btrfs -m raid5 -d raid5 -f /dev/sdc /dev/sdd /dev/sde /dev/sdf
>> >   $ mount /dev/sdc /mnt
>> >   $ xfs_io -f -d -c "pwrite -S 0xab 0 1M" /mnt/foobar
>> >   $ od -t x1 /mnt/foobar
>> >   od: /mnt/foobar: read error: Input/output error
>> >
>> > For that example, the following is also reported in dmesg/syslog:
>> >
>> > [18274.985557] btrfs_print_data_csum_error: 18 callbacks suppressed
>> > [18274.995277] BTRFS warning (device sdf): csum failed root 5 ino 257 off 0 csum 0x98f94189 expected csum 0x94374193 mirror 1
>> > [18274.997205] BTRFS warning (device sdf): csum failed root 5 ino 257 off 4096 csum 0x98f94189 expected csum 0x94374193 mirror 1
>> > [18275.025221] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 1
>> > [18275.047422] BTRFS warning (device sdf): csum failed root 5 ino 257 off 12288 csum 0x98f94189 expected csum 0x94374193 mirror 1
>> > [18275.054818] BTRFS warning (device sdf): csum failed root 5 ino 257 off 4096 csum 0x98f94189 expected csum 0x94374193 mirror 1
>> > [18275.054834] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 1
>> > [18275.054943] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 2
>> > [18275.055207] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 3
>> > [18275.055571] BTRFS warning (device sdf): csum failed root 5 ino 257 off 0 csum 0x98f94189 expected csum 0x94374193 mirror 1
>> > [18275.062171] BTRFS warning (device sdf): csum failed root 5 ino 257 off 12288 csum 0x98f94189 expected csum 0x94374193 mirror 1
>> >
>> > A scrub will also fail correcting bad copies, mentioning the following
>> > in dmesg/syslog:
>> >
>> > [18276.128696] scrub_handle_errored_block: 498 callbacks suppressed
>> > [18276.129617] BTRFS warning (device sdf): checksum error at logical 2186346496 on dev /dev/sde, sector 2116608, root 5, inode 257, offset 65536, length 4096, links $
>> > [18276.149235] btrfs_dev_stat_print_on_error: 498 callbacks suppressed
>> > [18276.157897] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
>> > [18276.206059] BTRFS warning (device sdf): checksum error at logical 2186477568 on dev /dev/sdd, sector 2116736, root 5, inode 257, offset 196608, length 4096, links$
>> > [18276.206059] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
>> > [18276.306552] BTRFS warning (device sdf): checksum error at logical 2186543104 on dev /dev/sdd, sector 2116864, root 5, inode 257, offset 262144, length 4096, links$
>> > [18276.319152] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
>> > [18276.394316] BTRFS warning (device sdf): checksum error at logical 2186739712 on dev /dev/sdf, sector 2116992, root 5, inode 257, offset 458752, length 4096, links$
>> > [18276.396348] BTRFS error (device sdf): bdev /dev/sdf errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
>> > [18276.434127] BTRFS warning (device sdf): checksum error at logical >> > 2186870784 on dev /dev/sde, sector 2117120, root 5, inode 257, offset >> > 589824, length 4096, links$ >> > [18276.434127] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, >> > flush 0, corrupt 2, gen 0 >> > [18276.500504] BTRFS error (device sdf): unable to fixup (regular) error >> > at logical 2186477568 on dev /dev/sdd >> > [18276.538400] BTRFS warning (device sdf): checksum error at logical >> > 2186481664 on dev /dev/sdd, sector 2116744, root 5, inode 257, offset >> > 200704, length 4096, links$ >> > [18276.540452] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, >> > flush 0, corrupt 3, gen 0 >> > [18276.542012] BTRFS error (device sdf): unable to fixup (regular) error >> > at logical 2186481664 on dev /dev/sdd >> > [18276.585030] BTRFS error (device sdf): unable to fixup (regular) error >
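The root cause in the commit message above, a cloned bio whose bi_vcnt was never filled in by bio_clone_fast(), can be sketched outside the kernel. The following is a userspace analogue, not the real block-layer API (struct mini_bio, sum_all() and sum_iter() are all hypothetical names): a fast clone shares the parent's segment array and describes its own window only through an iterator, so a count-based walk in the style of bio_for_each_segment_all() reads a stale count, while an iterator-based walk in the style of bio_for_each_segment() remains correct.

```c
#include <assert.h>

/* Hypothetical miniature of a bio with a shared segment array. */
struct mini_bio {
    const int *pages;  /* segment array, shared with the parent on a fast clone */
    int vcnt;          /* analogue of bi_vcnt: NOT set up by a fast clone */
    int iter_idx;      /* analogue of bvec_iter: real start of this bio */
    int iter_cnt;      /* analogue of bvec_iter: real number of segments */
};

/* Walk in the style of bio_for_each_segment_all(): trusts vcnt. */
static int sum_all(const struct mini_bio *b)
{
    int sum = 0;
    for (int i = 0; i < b->vcnt; i++)
        sum += b->pages[i];
    return sum;
}

/* Walk in the style of bio_for_each_segment(): trusts the iterator. */
static int sum_iter(const struct mini_bio *b)
{
    int sum = 0;
    for (int i = 0; i < b->iter_cnt; i++)
        sum += b->pages[b->iter_idx + i];
    return sum;
}
```

A fast clone of a four-segment parent that really covers only the last two segments keeps vcnt at its unset value, so the count-based walk silently visits nothing, much as index_rbio_pages() left bio_pages[] entries unset and later stages dereferenced garbage.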
Re: [PATCH] Btrfs: fix write corruption due to bio cloning on raid5/6
On Thu, Jul 13, 2017 at 8:09 PM, Liu Bo wrote: > On Thu, Jul 13, 2017 at 07:11:24PM +0100, Filipe Manana wrote: >> On Thu, Jul 13, 2017 at 7:00 PM, Liu Bo wrote: >> > On Thu, Jul 13, 2017 at 10:06:32AM -0700, Liu Bo wrote: >> >> On Thu, Jul 13, 2017 at 05:46:13PM +0100, Filipe Manana wrote: >> >> > On Thu, Jul 13, 2017 at 5:36 PM, Liu Bo wrote: >> >> > > On Thu, Jul 13, 2017 at 03:09:54PM +0100, fdman...@kernel.org wrote: >> >> > >> From: Filipe Manana >> >> > >> >> >> [...] >> >> > > >> >> > > >> >> > >> Signed-off-by: Filipe Manana >> >> > >> --- >> >> > >> fs/btrfs/raid56.c | 26 ++ >> >> > >> 1 file changed, 18 insertions(+), 8 deletions(-) >> >> > >> >> >> > >> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c >> >> > >> index b9abb0b01021..b89d07003697 100644 >> >> > >> --- a/fs/btrfs/raid56.c >> >> > >> +++ b/fs/btrfs/raid56.c >> >> > >> @@ -1136,20 +1136,27 @@ static void validate_rbio_for_rmw(struct >> >> > >> btrfs_raid_bio *rbio) >> >> > >> static void index_rbio_pages(struct btrfs_raid_bio *rbio) >> >> > >> { >> >> > >> struct bio *bio; >> >> > >> - struct bio_vec *bvec; >> >> > >> u64 start; >> >> > >> unsigned long stripe_offset; >> >> > >> unsigned long page_index; >> >> > >> - int i; >> >> > >> >> >> > >> spin_lock_irq(&rbio->bio_list_lock); >> >> > >> bio_list_for_each(bio, &rbio->bio_list) { >> >> > >> + struct bio_vec bvec; >> >> > >> + struct bvec_iter iter; >> >> > >> + int i = 0; >> >> > >> + >> >> > >> start = (u64)bio->bi_iter.bi_sector << 9; >> >> > >> stripe_offset = start - rbio->bbio->raid_map[0]; >> >> > >> page_index = stripe_offset >> PAGE_SHIFT; >> >> > >> >> >> > >> - bio_for_each_segment_all(bvec, bio, i) >> >> > >> - rbio->bio_pages[page_index + i] = >> >> > >> bvec->bv_page; >> >> > >> + if (bio_flagged(bio, BIO_CLONED)) >> >> > >> + bio->bi_iter = btrfs_io_bio(bio)->iter; >> >> > >> + >> >> > > >> >> > > I think we can use use bio->bi_iter directly as the bio is not >> >> > > submitted yet, i.e. bi_iter is not advanced yet. 
>> >> > >> >> > I've seen it cause problems by not setting it. I.e. the bio was >> >> > iterated before somewhere else. >> >> > Which lead to the following trace: >> >> > >> >> >> >> This is coming from a free space inode flush during mount time and >> >> it's full of buffered writes, the bio in use is not a cloned one. >> >> >> >> Interesting...let me check locally. >> >> >> > >> > I ran the mentioned script in the commit log without the part of >> > setting bio->bi_iter to btrfs_io_bio(bio)->iter. >> > >> > $ mkfs.btrfs -f -d raid5 /dev/sd[cde] >> > >> > $ mount /dev/sde /mnt/btrfs >> > >> > $ xfs_io -f -d -c "pwrite 0 128K" /mnt/btrfs/foobar >> > wrote 131072/131072 bytes at offset 0 >> > 128 KiB, 32 ops; 0. sec (5.842 MiB/sec and 1495.6766 ops/sec) >> > >> > $ od -t x1 /mnt/btrfs/foobar >> > 000 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd >> > * >> > 040 >> > >> > $ uname -r >> > 4.12.0+ >> > >> > It works fine without causing 'generic protection' during mount. >> >> Yes, the trace was not from running the test mentioned in the >> changelog, but rather btrfs/0[60-74] iirc or some other stress tests. >> It doesn't happen always as well. >> > > I see, then it might be caused by some other problems, as the trace > shows, it's a buffered write whose bio is not cloned, so it still > takes bio->bi_iter for the iteration. > > The 'general protection' was in finish_rmw(), could you please check > if that points to somewhere inside this bio_for_each_segment() helper? (gdb) list *(finish_rmw+0x1d6) 0x97eb4 is in finish_rmw (fs/btrfs/raid56.c:1252). 
1247
1248                    raid6_call.gen_syndrome(rbio->real_stripes, PAGE_SIZE,
1249                                            pointers);
1250            } else {
1251                    /* raid5 */
1252                    memcpy(pointers[nr_data], pointers[0], PAGE_SIZE);
1253                    run_xor(pointers + 1, nr_data - 1, PAGE_SIZE);
1254            }
1255
1256

finish_rmw() does not call bio_for_each_segment() or any equivalent iterator, so the fault must be the consequence of an earlier call to one of those iterators, for example index_rbio_pages() having stored an invalid memory address in bio_pages[]. > > thanks, > > -liubo > >> > >> > thanks, >> > >> > -liubo >> > >> >> -liubo >> >> >> >> > [ 3955.729020] general protection fault: [#1] PREEMPT SMP >> >> > [ 3955.729845] Modules linked in: btrfs ppdev tpm_tis evdev psmouse >> >> > parport_pc sg tpm_tis_core i2c_piix4 parport tpm pcspkr serio_raw >> >> > i2c_core button sunrpc loop autofs4 ext4 crc16 jbd2 mbcache raid10 raid456 async_raid6_recov >> >> > async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c >> >> > crc32c_generic raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata >> >> > virtio_pci virtio_ring scsi_mod virtio e1000 floppy [last unloaded: >> >> > btrfs] >> >> > [ 3955.73173
Re: [PATCH] Btrfs: fix write corruption due to bio cloning on raid5/6
On Thu, Jul 13, 2017 at 09:36:19AM -0700, Liu Bo wrote: > On Thu, Jul 13, 2017 at 03:09:54PM +0100, fdman...@kernel.org wrote: > > From: Filipe Manana > > > > The recent changes to make bio cloning faster (added in the 4.13 merge > > window) by using the bio_clone_fast() API introduced a regression on > > raid5/6 modes, because cloned bios have an invalid bi_vcnt field > > (therefore it can not be used) and the raid5/6 code uses the > > bio_for_each_segment_all() API to iterate the segments of a bio, and this > > API uses a bio's bi_vcnt field. > > > > The issue is very simple to trigger by doing for example a direct IO write > > against a raid5 or raid6 filesystem and then attempting to read what we > > wrote before: > > > > $ mkfs.btrfs -m raid5 -d raid5 -f /dev/sdc /dev/sdd /dev/sde /dev/sdf > > $ mount /dev/sdc /mnt > > $ xfs_io -f -d -c "pwrite -S 0xab 0 1M" /mnt/foobar > > $ od -t x1 /mnt/foobar > > od: /mnt/foobar: read error: Input/output error > > > > For that example, the following is also reported in dmesg/syslog: > > > > [18274.985557] btrfs_print_data_csum_error: 18 callbacks suppressed > > [18274.995277] BTRFS warning (device sdf): csum failed root 5 ino 257 off > > 0 csum 0x98f94189 expected csum 0x94374193 mirror 1 > > [18274.997205] BTRFS warning (device sdf): csum failed root 5 ino 257 off > > 4096 csum 0x98f94189 expected csum 0x94374193 mirror 1 > > [18275.025221] BTRFS warning (device sdf): csum failed root 5 ino 257 off > > 8192 csum 0x98f94189 expected csum 0x94374193 mirror 1 > > [18275.047422] BTRFS warning (device sdf): csum failed root 5 ino 257 off > > 12288 csum 0x98f94189 expected csum 0x94374193 mirror 1 > > [18275.054818] BTRFS warning (device sdf): csum failed root 5 ino 257 off > > 4096 csum 0x98f94189 expected csum 0x94374193 mirror 1 > > [18275.054834] BTRFS warning (device sdf): csum failed root 5 ino 257 off > > 8192 csum 0x98f94189 expected csum 0x94374193 mirror 1 > > [18275.054943] BTRFS warning (device sdf): csum failed 
root 5 ino 257 off > > 8192 csum 0x98f94189 expected csum 0x94374193 mirror 2 > > [18275.055207] BTRFS warning (device sdf): csum failed root 5 ino 257 off > > 8192 csum 0x98f94189 expected csum 0x94374193 mirror 3 > > [18275.055571] BTRFS warning (device sdf): csum failed root 5 ino 257 off > > 0 csum 0x98f94189 expected csum 0x94374193 mirror 1 > > [18275.062171] BTRFS warning (device sdf): csum failed root 5 ino 257 off > > 12288 csum 0x98f94189 expected csum 0x94374193 mirror 1 > > > > A scrub will also fail correcting bad copies, mentioning the following in > > dmesg/syslog: > > > > [18276.128696] scrub_handle_errored_block: 498 callbacks suppressed > > [18276.129617] BTRFS warning (device sdf): checksum error at logical > > 2186346496 on dev /dev/sde, sector 2116608, root 5, inode 257, offset > > 65536, length 4096, links $ > > [18276.149235] btrfs_dev_stat_print_on_error: 498 callbacks suppressed > > [18276.157897] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, > > flush 0, corrupt 1, gen 0 > > [18276.206059] BTRFS warning (device sdf): checksum error at logical > > 2186477568 on dev /dev/sdd, sector 2116736, root 5, inode 257, offset > > 196608, length 4096, links$ > > [18276.206059] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, > > flush 0, corrupt 1, gen 0 > > [18276.306552] BTRFS warning (device sdf): checksum error at logical > > 2186543104 on dev /dev/sdd, sector 2116864, root 5, inode 257, offset > > 262144, length 4096, links$ > > [18276.319152] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, > > flush 0, corrupt 2, gen 0 > > [18276.394316] BTRFS warning (device sdf): checksum error at logical > > 2186739712 on dev /dev/sdf, sector 2116992, root 5, inode 257, offset > > 458752, length 4096, links$ > > [18276.396348] BTRFS error (device sdf): bdev /dev/sdf errs: wr 0, rd 0, > > flush 0, corrupt 1, gen 0 > > [18276.434127] BTRFS warning (device sdf): checksum error at logical > > 2186870784 on dev /dev/sde, sector 
2117120, root 5, inode 257, offset > > 589824, length 4096, links$ > > [18276.434127] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, > > flush 0, corrupt 2, gen 0 > > [18276.500504] BTRFS error (device sdf): unable to fixup (regular) error > > at logical 2186477568 on dev /dev/sdd > > [18276.538400] BTRFS warning (device sdf): checksum error at logical > > 2186481664 on dev /dev/sdd, sector 2116744, root 5, inode 257, offset > > 200704, length 4096, links$ > > [18276.540452] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, > > flush 0, corrupt 3, gen 0 > > [18276.542012] BTRFS error (device sdf): unable to fixup (regular) error > > at logical 2186481664 on dev /dev/sdd > > [18276.585030] BTRFS error (device sdf): unable to fixup (regular) error > > at logical 2186346496 on dev /dev/sde > > [18276.598306] BTRFS warning (device sdf): checksum error at logical > > 2186412
Re: [PATCH] Btrfs: fix write corruption due to bio cloning on raid5/6
On Thu, Jul 13, 2017 at 07:11:24PM +0100, Filipe Manana wrote: > On Thu, Jul 13, 2017 at 7:00 PM, Liu Bo wrote: > > On Thu, Jul 13, 2017 at 10:06:32AM -0700, Liu Bo wrote: > >> On Thu, Jul 13, 2017 at 05:46:13PM +0100, Filipe Manana wrote: > >> > On Thu, Jul 13, 2017 at 5:36 PM, Liu Bo wrote: > >> > > On Thu, Jul 13, 2017 at 03:09:54PM +0100, fdman...@kernel.org wrote: > >> > >> From: Filipe Manana > >> > >> > >> [...] > >> > > > >> > > > >> > >> Signed-off-by: Filipe Manana > >> > >> --- > >> > >> fs/btrfs/raid56.c | 26 ++ > >> > >> 1 file changed, 18 insertions(+), 8 deletions(-) > >> > >> > >> > >> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c > >> > >> index b9abb0b01021..b89d07003697 100644 > >> > >> --- a/fs/btrfs/raid56.c > >> > >> +++ b/fs/btrfs/raid56.c > >> > >> @@ -1136,20 +1136,27 @@ static void validate_rbio_for_rmw(struct > >> > >> btrfs_raid_bio *rbio) > >> > >> static void index_rbio_pages(struct btrfs_raid_bio *rbio) > >> > >> { > >> > >> struct bio *bio; > >> > >> - struct bio_vec *bvec; > >> > >> u64 start; > >> > >> unsigned long stripe_offset; > >> > >> unsigned long page_index; > >> > >> - int i; > >> > >> > >> > >> spin_lock_irq(&rbio->bio_list_lock); > >> > >> bio_list_for_each(bio, &rbio->bio_list) { > >> > >> + struct bio_vec bvec; > >> > >> + struct bvec_iter iter; > >> > >> + int i = 0; > >> > >> + > >> > >> start = (u64)bio->bi_iter.bi_sector << 9; > >> > >> stripe_offset = start - rbio->bbio->raid_map[0]; > >> > >> page_index = stripe_offset >> PAGE_SHIFT; > >> > >> > >> > >> - bio_for_each_segment_all(bvec, bio, i) > >> > >> - rbio->bio_pages[page_index + i] = bvec->bv_page; > >> > >> + if (bio_flagged(bio, BIO_CLONED)) > >> > >> + bio->bi_iter = btrfs_io_bio(bio)->iter; > >> > >> + > >> > > > >> > > I think we can use use bio->bi_iter directly as the bio is not > >> > > submitted yet, i.e. bi_iter is not advanced yet. > >> > > >> > I've seen it cause problems by not setting it. I.e. 
the bio was > >> > iterated before somewhere else. > >> > Which led to the following trace: > >> > > >> > >> This is coming from a free space inode flush during mount time and > >> it's full of buffered writes; the bio in use is not a cloned one. > >> > >> Interesting...let me check locally. > >> > > > > I ran the script mentioned in the commit log without the part that > > sets bio->bi_iter to btrfs_io_bio(bio)->iter. > > > > $ mkfs.btrfs -f -d raid5 /dev/sd[cde] > > > > $ mount /dev/sde /mnt/btrfs > > > > $ xfs_io -f -d -c "pwrite 0 128K" /mnt/btrfs/foobar > > wrote 131072/131072 bytes at offset 0 > > 128 KiB, 32 ops; 0. sec (5.842 MiB/sec and 1495.6766 ops/sec) > > > > $ od -t x1 /mnt/btrfs/foobar > > 000 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd > > * > > 040 > > > > $ uname -r > > 4.12.0+ > > > > It works fine without causing a 'general protection' fault during mount. > > Yes, the trace was not from running the test mentioned in the > changelog, but rather btrfs/0[60-74] iirc or some other stress tests. > It doesn't always happen, either. > I see, then it might be caused by some other problem; as the trace shows, it's a buffered write whose bio is not cloned, so it still takes bio->bi_iter for the iteration. The 'general protection' fault was in finish_rmw(); could you please check whether that points to somewhere inside the bio_for_each_segment() helper? 
thanks, -liubo > > > > thanks, > > > > -liubo > > > >> -liubo > >> > >> > [ 3955.729020] general protection fault: [#1] PREEMPT SMP > >> > [ 3955.729845] Modules linked in: btrfs ppdev tpm_tis evdev psmouse > >> > parport_pc sg tpm_tis_core i2c_piix4 parport tpm pcspkr serio_raw > >> > i2c_core button sunrpc loop auto > >> > fs4 ext4 crc16 jbd2 mbcache raid10 raid456 async_raid6_recov > >> > async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c > >> > crc32c_generic raid1 raid0 multipath line > >> > ar md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata > >> > virtio_pci virtio_ring scsi_mod virtio e1000 floppy [last unloaded: > >> > btrfs] > >> > [ 3955.731730] CPU: 15 PID: 26361 Comm: mount Not tainted > >> > 4.12.0-rc6-btrfs-next-47+ #1 > >> > [ 3955.731730] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > >> > BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 > >> > [ 3955.731730] task: 880235b5d680 task.stack: c90002498000 > >> > [ 3955.731730] RIP: 0010:finish_rmw+0x1d6/0x3c4 [btrfs] > >> > [ 3955.731730] RSP: 0018:c9000249b3d8 EFLAGS: 00010246 > >> > [ 3955.731730] RAX: 880235e86000 RBX: 880210e23800 RCX: > >> > 0400 > >> > [ 3955.731730] RDX: 8800 RSI: db738800 RDI: > >> > 880235e86000 > >> > [ 3955.731730] RBP: c9000249b470 R08: 0001 R09: > >> > 0001 > >> > [ 3955.731
Re: [PATCH] Btrfs: fix write corruption due to bio cloning on raid5/6
On Thu, Jul 13, 2017 at 7:00 PM, Liu Bo wrote: > On Thu, Jul 13, 2017 at 10:06:32AM -0700, Liu Bo wrote: >> On Thu, Jul 13, 2017 at 05:46:13PM +0100, Filipe Manana wrote: >> > On Thu, Jul 13, 2017 at 5:36 PM, Liu Bo wrote: >> > > On Thu, Jul 13, 2017 at 03:09:54PM +0100, fdman...@kernel.org wrote: >> > >> From: Filipe Manana >> > >> >> [...] >> > > >> > > >> > >> Signed-off-by: Filipe Manana >> > >> --- >> > >> fs/btrfs/raid56.c | 26 ++ >> > >> 1 file changed, 18 insertions(+), 8 deletions(-) >> > >> >> > >> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c >> > >> index b9abb0b01021..b89d07003697 100644 >> > >> --- a/fs/btrfs/raid56.c >> > >> +++ b/fs/btrfs/raid56.c >> > >> @@ -1136,20 +1136,27 @@ static void validate_rbio_for_rmw(struct >> > >> btrfs_raid_bio *rbio) >> > >> static void index_rbio_pages(struct btrfs_raid_bio *rbio) >> > >> { >> > >> struct bio *bio; >> > >> - struct bio_vec *bvec; >> > >> u64 start; >> > >> unsigned long stripe_offset; >> > >> unsigned long page_index; >> > >> - int i; >> > >> >> > >> spin_lock_irq(&rbio->bio_list_lock); >> > >> bio_list_for_each(bio, &rbio->bio_list) { >> > >> + struct bio_vec bvec; >> > >> + struct bvec_iter iter; >> > >> + int i = 0; >> > >> + >> > >> start = (u64)bio->bi_iter.bi_sector << 9; >> > >> stripe_offset = start - rbio->bbio->raid_map[0]; >> > >> page_index = stripe_offset >> PAGE_SHIFT; >> > >> >> > >> - bio_for_each_segment_all(bvec, bio, i) >> > >> - rbio->bio_pages[page_index + i] = bvec->bv_page; >> > >> + if (bio_flagged(bio, BIO_CLONED)) >> > >> + bio->bi_iter = btrfs_io_bio(bio)->iter; >> > >> + >> > > >> > > I think we can use use bio->bi_iter directly as the bio is not >> > > submitted yet, i.e. bi_iter is not advanced yet. >> > >> > I've seen it cause problems by not setting it. I.e. the bio was >> > iterated before somewhere else. 
>> > Which led to the following trace: >> > >> >> This is coming from a free space inode flush during mount time and >> it's full of buffered writes; the bio in use is not a cloned one. >> >> Interesting...let me check locally. >> > > I ran the script mentioned in the commit log without the part that > sets bio->bi_iter to btrfs_io_bio(bio)->iter. > > $ mkfs.btrfs -f -d raid5 /dev/sd[cde] > > $ mount /dev/sde /mnt/btrfs > > $ xfs_io -f -d -c "pwrite 0 128K" /mnt/btrfs/foobar > wrote 131072/131072 bytes at offset 0 > 128 KiB, 32 ops; 0. sec (5.842 MiB/sec and 1495.6766 ops/sec) > > $ od -t x1 /mnt/btrfs/foobar > 000 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd > * > 040 > > $ uname -r > 4.12.0+ > > It works fine without causing a 'general protection' fault during mount. Yes, the trace was not from running the test mentioned in the changelog, but rather btrfs/0[60-74] iirc or some other stress tests. It doesn't always happen, either. > > thanks, > > -liubo > >> -liubo >> >> > [ 3955.729020] general protection fault: [#1] PREEMPT SMP >> > [ 3955.729845] Modules linked in: btrfs ppdev tpm_tis evdev psmouse >> > parport_pc sg tpm_tis_core i2c_piix4 parport tpm pcspkr serio_raw >> > i2c_core button sunrpc loop autofs4 ext4 crc16 jbd2 mbcache raid10 raid456 async_raid6_recov >> > async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c >> > crc32c_generic raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata >> > virtio_pci virtio_ring scsi_mod virtio e1000 floppy [last unloaded: >> > btrfs] >> > [ 3955.731730] CPU: 15 PID: 26361 Comm: mount Not tainted >> > 4.12.0-rc6-btrfs-next-47+ #1 >> > [ 3955.731730] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), >> > BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 >> > [ 3955.731730] task: 880235b5d680 task.stack: c90002498000 >> > [ 3955.731730] RIP: 0010:finish_rmw+0x1d6/0x3c4 [btrfs] >> > [ 3955.731730] RSP: 0018:c9000249b3d8 EFLAGS: 00010246 >> > [ 
3955.731730] RAX: 880235e86000 RBX: 880210e23800 RCX: >> > 0400 >> > [ 3955.731730] RDX: 8800 RSI: db738800 RDI: >> > 880235e86000 >> > [ 3955.731730] RBP: c9000249b470 R08: 0001 R09: >> > 0001 >> > [ 3955.731730] R10: c9000249b310 R11: 0003 R12: >> > c9000249b3d8 >> > [ 3955.731730] R13: R14: 0003 R15: >> > c9000249b3f0 >> > [ 3955.731730] FS: 7f68fda61480() GS:88023f5c() >> > knlGS: >> > [ 3955.731730] CS: 0010 DS: ES: CR0: 80050033 >> > [ 3955.731730] CR2: 56403c233680 CR3: 00021049e000 CR4: >> > 06e0 >> > [ 3955.731730] Call Trace: >> > [ 3955.731730] full_stripe_write+0x69/0x93 [btrfs] >> > [ 3955.731730] raid56_parity_write+0xa4/0x110 [btrfs] >> > [ 3955.731730] btrfs_map_bio+0xca/0x25
Re: [PATCH] Btrfs: fix write corruption due to bio cloning on raid5/6
On Thu, Jul 13, 2017 at 10:06:32AM -0700, Liu Bo wrote: > On Thu, Jul 13, 2017 at 05:46:13PM +0100, Filipe Manana wrote: > > On Thu, Jul 13, 2017 at 5:36 PM, Liu Bo wrote: > > > On Thu, Jul 13, 2017 at 03:09:54PM +0100, fdman...@kernel.org wrote: > > >> From: Filipe Manana > > >> > [...] > > > > > > > > >> Signed-off-by: Filipe Manana > > >> --- > > >> fs/btrfs/raid56.c | 26 ++ > > >> 1 file changed, 18 insertions(+), 8 deletions(-) > > >> > > >> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c > > >> index b9abb0b01021..b89d07003697 100644 > > >> --- a/fs/btrfs/raid56.c > > >> +++ b/fs/btrfs/raid56.c > > >> @@ -1136,20 +1136,27 @@ static void validate_rbio_for_rmw(struct > > >> btrfs_raid_bio *rbio) > > >> static void index_rbio_pages(struct btrfs_raid_bio *rbio) > > >> { > > >> struct bio *bio; > > >> - struct bio_vec *bvec; > > >> u64 start; > > >> unsigned long stripe_offset; > > >> unsigned long page_index; > > >> - int i; > > >> > > >> spin_lock_irq(&rbio->bio_list_lock); > > >> bio_list_for_each(bio, &rbio->bio_list) { > > >> + struct bio_vec bvec; > > >> + struct bvec_iter iter; > > >> + int i = 0; > > >> + > > >> start = (u64)bio->bi_iter.bi_sector << 9; > > >> stripe_offset = start - rbio->bbio->raid_map[0]; > > >> page_index = stripe_offset >> PAGE_SHIFT; > > >> > > >> - bio_for_each_segment_all(bvec, bio, i) > > >> - rbio->bio_pages[page_index + i] = bvec->bv_page; > > >> + if (bio_flagged(bio, BIO_CLONED)) > > >> + bio->bi_iter = btrfs_io_bio(bio)->iter; > > >> + > > > > > > I think we can use use bio->bi_iter directly as the bio is not > > > submitted yet, i.e. bi_iter is not advanced yet. > > > > I've seen it cause problems by not setting it. I.e. the bio was > > iterated before somewhere else. > > Which lead to the following trace: > > > > This is coming from a free space inode flush during mount time and > it's full of buffered writes, the bio in use is not a cloned one. > > Interesting...let me check locally. 
> I ran the script mentioned in the commit log without the part that sets bio->bi_iter to btrfs_io_bio(bio)->iter. $ mkfs.btrfs -f -d raid5 /dev/sd[cde] $ mount /dev/sde /mnt/btrfs $ xfs_io -f -d -c "pwrite 0 128K" /mnt/btrfs/foobar wrote 131072/131072 bytes at offset 0 128 KiB, 32 ops; 0. sec (5.842 MiB/sec and 1495.6766 ops/sec) $ od -t x1 /mnt/btrfs/foobar 000 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd * 040 $ uname -r 4.12.0+ It works fine without causing a 'general protection' fault during mount. thanks, -liubo > -liubo > > [ 3955.729020] general protection fault: [#1] PREEMPT SMP > > [ 3955.729845] Modules linked in: btrfs ppdev tpm_tis evdev psmouse > > parport_pc sg tpm_tis_core i2c_piix4 parport tpm pcspkr serio_raw > > i2c_core button sunrpc loop autofs4 ext4 crc16 jbd2 mbcache raid10 raid456 async_raid6_recov > > async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c > > crc32c_generic raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata > > virtio_pci virtio_ring scsi_mod virtio e1000 floppy [last unloaded: > > btrfs] > > [ 3955.731730] CPU: 15 PID: 26361 Comm: mount Not tainted > > 4.12.0-rc6-btrfs-next-47+ #1 > > [ 3955.731730] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > > BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 > > [ 3955.731730] task: 880235b5d680 task.stack: c90002498000 > > [ 3955.731730] RIP: 0010:finish_rmw+0x1d6/0x3c4 [btrfs] > > [ 3955.731730] RSP: 0018:c9000249b3d8 EFLAGS: 00010246 > > [ 3955.731730] RAX: 880235e86000 RBX: 880210e23800 RCX: > > 0400 > > [ 3955.731730] RDX: 8800 RSI: db738800 RDI: > > 880235e86000 > > [ 3955.731730] RBP: c9000249b470 R08: 0001 R09: > > 0001 > > [ 3955.731730] R10: c9000249b310 R11: 0003 R12: > > c9000249b3d8 > > [ 3955.731730] R13: R14: 0003 R15: > > c9000249b3f0 > > [ 3955.731730] FS: 7f68fda61480() GS:88023f5c() > > knlGS: > > [ 3955.731730] CS: 0010 DS: ES: CR0: 80050033 > > [ 3955.731730] CR2: 56403c233680 
CR3: 00021049e000 CR4: > > 06e0 > > [ 3955.731730] Call Trace: > > [ 3955.731730] full_stripe_write+0x69/0x93 [btrfs] > > [ 3955.731730] raid56_parity_write+0xa4/0x110 [btrfs] > > [ 3955.731730] btrfs_map_bio+0xca/0x259 [btrfs] > > [ 3955.731730] btrfs_submit_bio_hook+0xbf/0x147 [btrfs] > > [ 3955.731730] submit_one_bio+0x5d/0x7b [btrfs] > > [ 3955.731730] submit_extent_page.constprop.30+0x99/0x165 [btrfs] > > [ 3955.731730] ? __test_set_page_writeback+0x1aa/0x1bc > > [ 3955.731730] __extent_writepage_io+0x328/0x3a2 [btrfs] > > [ 3955.731730] ? end_extent_
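For reference, the index computation in the index_rbio_pages() hunk quoted above converts the bio's starting sector to a byte offset (bi_sector << 9), subtracts the stripe's first logical byte (raid_map[0]), and shifts right by PAGE_SHIFT to get a page index. A small standalone sketch of that arithmetic (the helper name and the 4 KiB page size are assumptions for illustration):

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9    /* 512-byte sectors */
#define PAGE_SHIFT   12   /* assuming 4 KiB pages */

/* Hypothetical mirror of the index math in index_rbio_pages(). */
static uint64_t page_index_for(uint64_t bi_sector, uint64_t raid_map0)
{
    uint64_t start = bi_sector << SECTOR_SHIFT;   /* sector -> byte offset */
    uint64_t stripe_offset = start - raid_map0;   /* bytes into the stripe */
    return stripe_offset >> PAGE_SHIFT;           /* bytes -> page index */
}
```

A bio starting 8 KiB past a stripe whose first logical byte is at 1 MiB lands on page index 2; if the segment walk that follows is wrong, every page goes into the wrong bio_pages[] slot from that index onward.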
Re: [PATCH] Btrfs: fix write corruption due to bio cloning on raid5/6
On Thu, Jul 13, 2017 at 05:46:13PM +0100, Filipe Manana wrote: > On Thu, Jul 13, 2017 at 5:36 PM, Liu Bo wrote: > > On Thu, Jul 13, 2017 at 03:09:54PM +0100, fdman...@kernel.org wrote: > >> From: Filipe Manana > >> [...] > > > > > >> Signed-off-by: Filipe Manana > >> --- > >> fs/btrfs/raid56.c | 26 ++ > >> 1 file changed, 18 insertions(+), 8 deletions(-) > >> > >> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c > >> index b9abb0b01021..b89d07003697 100644 > >> --- a/fs/btrfs/raid56.c > >> +++ b/fs/btrfs/raid56.c > >> @@ -1136,20 +1136,27 @@ static void validate_rbio_for_rmw(struct > >> btrfs_raid_bio *rbio) > >> static void index_rbio_pages(struct btrfs_raid_bio *rbio) > >> { > >> struct bio *bio; > >> - struct bio_vec *bvec; > >> u64 start; > >> unsigned long stripe_offset; > >> unsigned long page_index; > >> - int i; > >> > >> spin_lock_irq(&rbio->bio_list_lock); > >> bio_list_for_each(bio, &rbio->bio_list) { > >> + struct bio_vec bvec; > >> + struct bvec_iter iter; > >> + int i = 0; > >> + > >> start = (u64)bio->bi_iter.bi_sector << 9; > >> stripe_offset = start - rbio->bbio->raid_map[0]; > >> page_index = stripe_offset >> PAGE_SHIFT; > >> > >> - bio_for_each_segment_all(bvec, bio, i) > >> - rbio->bio_pages[page_index + i] = bvec->bv_page; > >> + if (bio_flagged(bio, BIO_CLONED)) > >> + bio->bi_iter = btrfs_io_bio(bio)->iter; > >> + > > > > I think we can use use bio->bi_iter directly as the bio is not > > submitted yet, i.e. bi_iter is not advanced yet. > > I've seen it cause problems by not setting it. I.e. the bio was > iterated before somewhere else. > Which lead to the following trace: > This is coming from a free space inode flush during mount time and it's full of buffered writes, the bio in use is not a cloned one. Interesting...let me check locally. 
-liubo > [ 3955.729020] general protection fault: [#1] PREEMPT SMP > [ 3955.729845] Modules linked in: btrfs ppdev tpm_tis evdev psmouse > parport_pc sg tpm_tis_core i2c_piix4 parport tpm pcspkr serio_raw > i2c_core button sunrpc loop auto > fs4 ext4 crc16 jbd2 mbcache raid10 raid456 async_raid6_recov > async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c > crc32c_generic raid1 raid0 multipath line > ar md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata > virtio_pci virtio_ring scsi_mod virtio e1000 floppy [last unloaded: > btrfs] > [ 3955.731730] CPU: 15 PID: 26361 Comm: mount Not tainted > 4.12.0-rc6-btrfs-next-47+ #1 > [ 3955.731730] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 > [ 3955.731730] task: 880235b5d680 task.stack: c90002498000 > [ 3955.731730] RIP: 0010:finish_rmw+0x1d6/0x3c4 [btrfs] > [ 3955.731730] RSP: 0018:c9000249b3d8 EFLAGS: 00010246 > [ 3955.731730] RAX: 880235e86000 RBX: 880210e23800 RCX: > 0400 > [ 3955.731730] RDX: 8800 RSI: db738800 RDI: > 880235e86000 > [ 3955.731730] RBP: c9000249b470 R08: 0001 R09: > 0001 > [ 3955.731730] R10: c9000249b310 R11: 0003 R12: > c9000249b3d8 > [ 3955.731730] R13: R14: 0003 R15: > c9000249b3f0 > [ 3955.731730] FS: 7f68fda61480() GS:88023f5c() > knlGS: > [ 3955.731730] CS: 0010 DS: ES: CR0: 80050033 > [ 3955.731730] CR2: 56403c233680 CR3: 00021049e000 CR4: > 06e0 > [ 3955.731730] Call Trace: > [ 3955.731730] full_stripe_write+0x69/0x93 [btrfs] > [ 3955.731730] raid56_parity_write+0xa4/0x110 [btrfs] > [ 3955.731730] btrfs_map_bio+0xca/0x259 [btrfs] > [ 3955.731730] btrfs_submit_bio_hook+0xbf/0x147 [btrfs] > [ 3955.731730] submit_one_bio+0x5d/0x7b [btrfs] > [ 3955.731730] submit_extent_page.constprop.30+0x99/0x165 [btrfs] > [ 3955.731730] ? __test_set_page_writeback+0x1aa/0x1bc > [ 3955.731730] __extent_writepage_io+0x328/0x3a2 [btrfs] > [ 3955.731730] ? 
end_extent_writepage+0xc4/0xc4 [btrfs] > [ 3955.731730] __extent_writepage+0x236/0x2bd [btrfs] > [ 3955.731730] extent_write_cache_pages.constprop.27+0x1ed/0x35e [btrfs] > [ 3955.731730] extent_writepages+0x4c/0x5d [btrfs] > [ 3955.731730] ? rcu_read_lock_sched_held+0x57/0x6c > [ 3955.731730] ? btrfs_set_bit_hook+0x233/0x233 [btrfs] > [ 3955.731730] ? __clear_extent_bit+0x2ab/0x2ca [btrfs] > [ 3955.731730] btrfs_writepages+0x28/0x2a [btrfs] > [ 3955.731730] do_writepages+0x3c/0x72 > [ 3955.731730] __filemap_fdatawrite_range+0x5a/0x63 > [ 3955.731730] filemap_fdatawrite_range+0x13/0x15 > [ 3955.731730] btrfs_fdatawrite_range+0x20/0x46 [btrfs] > [ 3955.731730] __btrfs_write_out_cache+0x2db/0x3aa [btrfs] > [ 3955.731730] btrfs_write_out_cache+0x8a/0xcd [btrfs] > [ 3955.731730] btrfs_star
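The hunk being debated in this subthread, `if (bio_flagged(bio, BIO_CLONED)) bio->bi_iter = btrfs_io_bio(bio)->iter;`, rewinds a cloned bio's live iterator to the snapshot saved when the bio was set up, since bio_for_each_segment() walks whatever bi_iter currently says and bi_iter may already have been advanced elsewhere. A userspace sketch of that pattern (all names hypothetical; a real bvec_iter carries sector, size, and index state, reduced here to an index and a remaining count):

```c
#include <assert.h>

struct mini_iter { int idx; int left; };

struct walk_bio {
    const int *pages;
    struct mini_iter iter;        /* live iterator; may already be advanced */
    struct mini_iter saved_iter;  /* snapshot taken when the bio was set up */
};

/* In the style of bio_for_each_segment(): iterate a *copy* of the live
 * iterator, counting the segments it still covers. */
static int count_segments(const struct walk_bio *b)
{
    struct mini_iter it = b->iter;
    int n = 0;
    while (it.left > 0) { n++; it.idx++; it.left--; }
    return n;
}

/* The pattern from the patch hunk: rewind the live iterator to the
 * snapshot before re-walking the bio. */
static void restore_iter(struct walk_bio *b)
{
    b->iter = b->saved_iter;
}
```

Note that bio_for_each_segment() itself iterates a local copy, so the walk does not consume the live iterator; the restore only matters when earlier code already advanced it.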
Re: [PATCH] Btrfs: fix write corruption due to bio cloning on raid5/6
On Thu, Jul 13, 2017 at 5:36 PM, Liu Bo wrote:
> On Thu, Jul 13, 2017 at 03:09:54PM +0100, fdman...@kernel.org wrote:
>> From: Filipe Manana
>>
>> The recent changes to make bio cloning faster (added in the 4.13 merge
>> window) by using the bio_clone_fast() API introduced a regression on
>> raid5/6 modes, because cloned bios have an invalid bi_vcnt field
>> (therefore it cannot be used) and the raid5/6 code uses the
>> bio_for_each_segment_all() API to iterate the segments of a bio, and this
>> API uses a bio's bi_vcnt field.
>> [...]
Re: [PATCH] Btrfs: fix write corruption due to bio cloning on raid5/6
On Thu, Jul 13, 2017 at 03:09:54PM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana
>
> The recent changes to make bio cloning faster (added in the 4.13 merge
> window) by using the bio_clone_fast() API introduced a regression on
> raid5/6 modes, because cloned bios have an invalid bi_vcnt field
> (therefore it cannot be used) and the raid5/6 code uses the
> bio_for_each_segment_all() API to iterate the segments of a bio, and this
> API uses a bio's bi_vcnt field.

Good catch! I'm going to add assertions to all remaining
bio_for_each_segment_all calls so that cloned bios do not get accidentally
iterated that way.

Other raid levels seem to be fine with the use case you've provided; I'm
now running fstests with the assertions.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: fix write corruption due to bio cloning on raid5/6
From: Filipe Manana

The recent changes to make bio cloning faster (added in the 4.13 merge
window) by using the bio_clone_fast() API introduced a regression on
raid5/6 modes, because cloned bios have an invalid bi_vcnt field
(therefore it cannot be used) and the raid5/6 code uses the
bio_for_each_segment_all() API to iterate the segments of a bio, and this
API uses a bio's bi_vcnt field.

The issue is very simple to trigger by doing, for example, a direct IO
write against a raid5 or raid6 filesystem and then attempting to read back
what we wrote:

$ mkfs.btrfs -m raid5 -d raid5 -f /dev/sdc /dev/sdd /dev/sde /dev/sdf
$ mount /dev/sdc /mnt
$ xfs_io -f -d -c "pwrite -S 0xab 0 1M" /mnt/foobar
$ od -t x1 /mnt/foobar
od: /mnt/foobar: read error: Input/output error

For that example, the following is also reported in dmesg/syslog:

[18274.985557] btrfs_print_data_csum_error: 18 callbacks suppressed
[18274.995277] BTRFS warning (device sdf): csum failed root 5 ino 257 off 0 csum 0x98f94189 expected csum 0x94374193 mirror 1
[18274.997205] BTRFS warning (device sdf): csum failed root 5 ino 257 off 4096 csum 0x98f94189 expected csum 0x94374193 mirror 1
[18275.025221] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 1
[18275.047422] BTRFS warning (device sdf): csum failed root 5 ino 257 off 12288 csum 0x98f94189 expected csum 0x94374193 mirror 1
[18275.054818] BTRFS warning (device sdf): csum failed root 5 ino 257 off 4096 csum 0x98f94189 expected csum 0x94374193 mirror 1
[18275.054834] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 1
[18275.054943] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 2
[18275.055207] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 3
[18275.055571] BTRFS warning (device sdf): csum failed root 5 ino 257 off 0 csum 0x98f94189 expected csum 0x94374193 mirror 1
[18275.062171] BTRFS warning (device sdf): csum failed root 5 ino 257 off 12288 csum 0x98f94189 expected csum 0x94374193 mirror 1

A scrub will also fail correcting bad copies, mentioning the following in
dmesg/syslog:

[18276.128696] scrub_handle_errored_block: 498 callbacks suppressed
[18276.129617] BTRFS warning (device sdf): checksum error at logical 2186346496 on dev /dev/sde, sector 2116608, root 5, inode 257, offset 65536, length 4096, links $
[18276.149235] btrfs_dev_stat_print_on_error: 498 callbacks suppressed
[18276.157897] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
[18276.206059] BTRFS warning (device sdf): checksum error at logical 2186477568 on dev /dev/sdd, sector 2116736, root 5, inode 257, offset 196608, length 4096, links$
[18276.206059] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
[18276.306552] BTRFS warning (device sdf): checksum error at logical 2186543104 on dev /dev/sdd, sector 2116864, root 5, inode 257, offset 262144, length 4096, links$
[18276.319152] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
[18276.394316] BTRFS warning (device sdf): checksum error at logical 2186739712 on dev /dev/sdf, sector 2116992, root 5, inode 257, offset 458752, length 4096, links$
[18276.396348] BTRFS error (device sdf): bdev /dev/sdf errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
[18276.434127] BTRFS warning (device sdf): checksum error at logical 2186870784 on dev /dev/sde, sector 2117120, root 5, inode 257, offset 589824, length 4096, links$
[18276.434127] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
[18276.500504] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186477568 on dev /dev/sdd
[18276.538400] BTRFS warning (device sdf): checksum error at logical 2186481664 on dev /dev/sdd, sector 2116744, root 5, inode 257, offset 200704, length 4096, links$
[18276.540452] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
[18276.542012] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186481664 on dev /dev/sdd
[18276.585030] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186346496 on dev /dev/sde
[18276.598306] BTRFS warning (device sdf): checksum error at logical 2186412032 on dev /dev/sde, sector 2116736, root 5, inode 257, offset 131072, length 4096, links$
[18276.598310] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
[18276.598582] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186350592 on dev /dev/sde
[18276.603455] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
[18276.638362] BTRFS warning (device sdf): checksum error a