Re: unsolvable technical issues?
30.06.2018 06:22, Duncan writes:
> Austin S. Hemmelgarn posted on Mon, 25 Jun 2018 07:26:41 -0400 as
> excerpted:
>> On 2018-06-24 16:22, Goffredo Baroncelli wrote:
>>> On 06/23/2018 07:11 AM, Duncan wrote:
>>>> waxhead posted on Fri, 22 Jun 2018 01:13:31 +0200 as excerpted:
>>>>> According to this:
>>>>> https://stratis-storage.github.io/StratisSoftwareDesign.pdf
>>>>> Page 4, section 1.2
>>>>> It claims that BTRFS still has significant technical issues that
>>>>> may never be resolved.
>>>>
>>>> I can speculate a bit.
>>>>
>>>> 1) When I see btrfs "technical issue that may never be resolved", the
>>>> #1 first thing I think of, that AFAIK there are _definitely_ no plans
>>>> to resolve, because it's very deeply woven into the btrfs core by
>>>> now, is... Filesystem UUID identification. Btrfs takes the UU bit of
>>>> Universally Unique quite literally, assuming they really *are*
>>>> unique, at least on that system. Because btrfs uses this supposedly
>>>> unique ID to identify devices that belong to the filesystem, it can
>>>> get *very* mixed up, with results possibly including data loss, if it
>>>> sees devices that don't actually belong to a filesystem with the same
>>>> UUID as a mounted filesystem.
>>>
>>> As a partial workaround you can disable the udev btrfs rules and then
>>> do a "btrfs dev scan" manually, only for the devices which you need.
>>
>> You don't even need `btrfs dev scan` if you just specify the exact set
>> of devices in the mount options. The `device=` mount option tells the
>> kernel to check that device during the mount process.
>
> Not that lvm does any better in this regard[1], but has btrfs ever
> solved the bug where only one device= in the kernel commandline's
> rootflags= would take effect, effectively forcing initr* on people
> (like me) who would otherwise not need them and prefer to do without
> them, if they're using a multi-device btrfs as root?

This requires in-kernel device scanning; I doubt we will ever see it.
> Not to mention the fact that as kernel people will tell you, device
> enumeration isn't guaranteed to be in the same order every boot, so
> device=/dev/* can't be relied upon and shouldn't be used -- but of
> course device=LABEL= and device=UUID= and similar won't work without
> userspace, basically udev (if they work at all, IDK if they actually
> do).
>
> Tho in practice from what I've seen, device enumeration order tends to
> be dependable /enough/ for at least those without enterprise-level
> numbers of devices to enumerate.

Just boot with a USB stick or eSATA drive plugged in; there is a good
chance it changes the device order.

> True, it /does/ change from time to time with a new kernel, but anybody
> sane keeps a tested-dependable old kernel around to boot to until they
> know the new one works as expected, and that sort of change is seldom
> enough that users can boot to the old kernel and adjust their settings
> for the new one as necessary when it does happen. So as "don't do it
> that way because it's not reliable" as it might indeed be in theory, in
> practice, just using an ordered /dev/* in kernel commandlines does tend
> to "just work"... provided one is ready for the occasion when that
> device parameter might need a bit of adjustment, of course.
> ...
> ---
> [1] LVM is userspace code on top of the kernelspace devicemapper, and
> therefore requires an initr* if root is on lvm, regardless. So btrfs
> actually does a bit better here, only requiring it for multi-device
> btrfs.

-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
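The UUID hazard discussed above comes down to btrfs identifying member devices by the fsid stored in each device's superblock, so two byte-identical clones are indistinguishable to the kernel. Purely as an illustration (this is a sketch, not the kernel's actual scanning code), here is how the fsid can be read from a device or image, assuming the standard on-disk layout: primary superblock at 64 KiB, fsid at byte offset 32 (after the 32-byte csum), and the `_BHRfS_M` magic at byte offset 64:

```python
import uuid

SUPERBLOCK_OFFSET = 64 * 1024  # primary btrfs superblock location on disk
FSID_OFFSET = 32               # fsid follows the 32-byte csum field
MAGIC_OFFSET = 64              # "_BHRfS_M"

def read_fsid(path):
    """Return the filesystem UUID from a btrfs device or image, or None."""
    with open(path, "rb") as f:
        f.seek(SUPERBLOCK_OFFSET)
        sb = f.read(4096)
    if sb[MAGIC_OFFSET:MAGIC_OFFSET + 8] != b"_BHRfS_M":
        return None  # not a btrfs superblock
    return str(uuid.UUID(bytes=sb[FSID_OFFSET:FSID_OFFSET + 16]))
```

Two devices reporting the same fsid here is exactly the situation in which the kernel's device scanning can mis-assemble a filesystem, which is why udev's blanket `btrfs dev scan` is dangerous with cloned devices.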
Re: btrfs suddenly thinks it's raid6
> No log_root* in the super. So no fsyncing at the time? What was
> happening at the time of the power loss?

That I don't know.

> Pretty weird, all three supers are OK and yet two copies of the fs
> tree are corrupt. Why doesn't it fall back automatically to one of the
> other roots?
>
> You could try 'mount -o usebackuproot,ro' and see if that's permissive
> enough to succeed.

That yields the same result:

```
marble@archlinux ~ % sudo mount -o usebackuproot,ro /dev/mapper/black /tmp/black
mount: /tmp/black: can't read superblock on /dev/mapper/black.
```

> I wonder if the drive firmware reordered things contrary to Btrfs
> expectation right at the powerfail - or even if the powerfail caused
> the drive firmware to do the writes in the wrong order?
Re: unsolvable technical issues?
Hugo Mills posted on Mon, 25 Jun 2018 16:54:36 + as excerpted:

> On Mon, Jun 25, 2018 at 06:43:38PM +0200, waxhead wrote:
> [snip]
>> I hope I am not asking for too much (but I know I probably am), but I
>> suggest that having a small snippet of information on the status page,
>> showing a little bit about what is either currently the development
>> focus or what people are known to be working on, would be very
>> valuable for users. It may of course work both ways, such as exciting
>> people or calming them down. ;)
>>
>> For example something simple like a "development focus" list...
>> 2018-Q4: (planned) Renaming the grotesque "RAID" terminology
>> 2018-Q3: (planned) Magical feature X
>> 2018-Q2: N-way mirroring
>> 2018-Q1: Feature work "RAID"5/6
>>
>> I think it would be good for people living their lives outside as it
>> would perhaps spark some attention from developers and perhaps even
>> media as well.
>
> I started doing this a couple of years ago, but it turned out to be
> impossible to keep even vaguely accurate or up to date, without going
> round and bugging the developers individually on a per-release basis.
> I don't think it's going to happen.

In addition, anything like quarter, kernel cycle, etc, has been
repeatedly demonstrated to be entirely broken beyond "current", because
roadmapped tasks have rather consistently taken longer, sometimes /many/
/times/ longer (by a factor of 20+ in the case of raid56), than first
predicted.

But in theory it might be doable, with just a roughly ordered list, no
dates beyond "current focus", and with suitably big disclaimers about
other things (generally bugs in otherwise more stable features, but
occasionally a quick sub-feature that is seen to be easier to introduce
at the current state than it might be later, etc) possibly getting
priority and temporarily displacing roadmapped items.
In fact, this last one is the big reason why raid56 has taken so long to
even somewhat stabilize -- the devs kept finding bugs in already
semi-stable features that took priority... for kernel cycle after kernel
cycle.

The quotas/qgroups feature, already introduced and intended to be at
least semi-stable, was one such culprit, requiring repeated rewrites and
kernel cycles' worth of bug squashing. A few compression bugs, critical
under the right circumstances, where compression was supposed to be an
already reasonably stable feature, were another, tho these took far less
developer bandwidth than quotas.

Getting a reasonably usable fsck took a bunch of little patches. AFAIK
that one wasn't actually an original focus and was intended to be
back-burnered for some time, but once btrfs hit mainline, users started
demanding it, so the priority was bumped. And of course having it has
been good for finding and ultimately fixing other bugs as well, so it
wasn't a bad thing, but the hard fact is that the repairing fsck has
taken, all told, I'd guess about the same number of developer cycles as
quotas, and those developer cycles had to come from stuff that had been
roadmapped for earlier.

As a bit of an optimist I'd be inclined to argue that OK, we've gotten
btrfs in far better shape general-stability-wise now, and going forward
the focus can be back on the stuff that was roadmapped for earlier that
this stuff displaced, so one might hope things will move faster again
now. But really, who knows? That's arguably what the devs thought when
they mainlined btrfs, too, and yet it took all this much longer to
mature and stabilize since then. Still, it /has/ to happen at /some/
point, right?

And I know for a fact that btrfs is far more stable now than it was...
because things like ungraceful shutdowns that used to at minimum trigger
(raid1 mode) scrub fixes on remount and scrub, now...
don't -- btrfs is now stable enough that the atomic COW is doing its job
and things "just work", where before, they required scrub repair at
best, and occasionally a blow-away and restore from backups.

So I can at least /hope/ that the worst of the plague of bugs is behind
us, and people can work on what they intended to do most (say 80%) of
the time now, spending say a day's worth a week (20%) on bugs, instead
of the reverse: 80% (4 days a week) on bugs and, if they're lucky, a day
a week on what they were supposed to be focused on, which is what we
were seeing for a while.

Plus the tools to do the debugging, etc, are far more mature now,
another reason bugs should hopefully take less time now.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the
program, he is your master." Richard Stallman
Re: unsolvable technical issues?
Austin S. Hemmelgarn posted on Mon, 25 Jun 2018 07:26:41 -0400 as
excerpted:

> On 2018-06-24 16:22, Goffredo Baroncelli wrote:
>> On 06/23/2018 07:11 AM, Duncan wrote:
>>> waxhead posted on Fri, 22 Jun 2018 01:13:31 +0200 as excerpted:
>>>> According to this:
>>>> https://stratis-storage.github.io/StratisSoftwareDesign.pdf
>>>> Page 4, section 1.2
>>>> It claims that BTRFS still has significant technical issues that may
>>>> never be resolved.
>>>
>>> I can speculate a bit.
>>>
>>> 1) When I see btrfs "technical issue that may never be resolved", the
>>> #1 first thing I think of, that AFAIK there are _definitely_ no plans
>>> to resolve, because it's very deeply woven into the btrfs core by
>>> now, is...
>>>
>>> Filesystem UUID identification. Btrfs takes the UU bit of Universally
>>> Unique quite literally, assuming they really *are* unique, at least
>>> on that system. Because btrfs uses this supposedly unique ID to
>>> identify devices that belong to the filesystem, it can get *very*
>>> mixed up, with results possibly including data loss, if it sees
>>> devices that don't actually belong to a filesystem with the same UUID
>>> as a mounted filesystem.
>>
>> As a partial workaround you can disable the udev btrfs rules and then
>> do a "btrfs dev scan" manually, only for the devices which you need.
>
> You don't even need `btrfs dev scan` if you just specify the exact set
> of devices in the mount options. The `device=` mount option tells the
> kernel to check that device during the mount process.

Not that lvm does any better in this regard[1], but has btrfs ever
solved the bug where only one device= in the kernel commandline's
rootflags= would take effect, effectively forcing initr* on people (like
me) who would otherwise not need them and prefer to do without them, if
they're using a multi-device btrfs as root?
Not to mention the fact that as kernel people will tell you, device
enumeration isn't guaranteed to be in the same order every boot, so
device=/dev/* can't be relied upon and shouldn't be used -- but of
course device=LABEL= and device=UUID= and similar won't work without
userspace, basically udev (if they work at all, IDK if they actually
do).

Tho in practice from what I've seen, device enumeration order tends to
be dependable /enough/ for at least those without enterprise-level
numbers of devices to enumerate. True, it /does/ change from time to
time with a new kernel, but anybody sane keeps a tested-dependable old
kernel around to boot to until they know the new one works as expected,
and that sort of change is seldom enough that users can boot to the old
kernel and adjust their settings for the new one as necessary when it
does happen. So as "don't do it that way because it's not reliable" as
it might indeed be in theory, in practice, just using an ordered /dev/*
in kernel commandlines does tend to "just work"... provided one is
ready for the occasion when that device parameter might need a bit of
adjustment, of course.

> Also, while LVM does have 'issues' with cloned PV's, it fails safe (by
> refusing to work on VG's that have duplicate PV's), while BTRFS fails
> very unsafely (by randomly corrupting data).

And IMO that "failing unsafe" is both serious and common enough that it
easily justifies adding the point to a list of this sort, thus my
putting it #1.

>>> 2) Subvolume and (more technically) reflink-aware defrag.
>>>
>>> It was there for a couple kernel versions some time ago, but
>>> "impossibly" slow, so it was disabled until such time as btrfs could
>>> be made to scale rather better in this regard.

> I still contend that the biggest issue WRT reflink-aware defrag was
> that it was not optional. The only way to get the old defrag behavior
> was to boot a kernel that didn't have reflink-aware defrag support.
> IOW, _everyone_ had to deal with the performance issues, not just the
> people who wanted to use reflink-aware defrag.

Absolutely. Which of course suggests making it optional, with a suitable
warning as to the speed implications with lots of snapshots/reflinks,
when it does get enabled again (and as David mentions elsewhere, there's
apparently some work going into the idea once again, which potentially
moves it from the 3-5 year range, at best, back to a 1/2-2-year range;
time will tell).

>>> 3) N-way-mirroring.
>>
>> [...]
>> This is not an issue, but a not-yet-implemented feature.

> If you're looking at feature parity with competitors, it's an issue.

Exactly my point. Thanks. =:^)

>>> 4) (Until relatively recently, and still in terms of scaling) Quotas.
>>>
>>> Until relatively recently, quotas could arguably be added to the
>>> list. They were rewritten multiple times, and until recently,
>>> appeared to be effectively eternally broken.
>>
>> Even though what you are reporting is correct, I have to point out
>> that the quota in BTRFS is more complex than the equivalent one of
>> the other FS.

Which, arguably, is exactly
Re: So, does btrfs check lowmem take days? weeks?
Well, there goes that. After about 18H:

ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, owner: 374857, offset: 235175936) wanted: 1, have: 1452
backref.c:466: __add_missing_keys: Assertion `ref->root_id` failed, value 0
btrfs(+0x3a232)[0x56091704f232]
btrfs(+0x3ab46)[0x56091704fb46]
btrfs(+0x3b9f5)[0x5609170509f5]
btrfs(btrfs_find_all_roots+0x9)[0x560917050a45]
btrfs(+0x572ff)[0x56091706c2ff]
btrfs(+0x60b13)[0x560917075b13]
btrfs(cmd_check+0x2634)[0x56091707d431]
btrfs(main+0x88)[0x560917027260]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7f93aa508561]
btrfs(_start+0x2a)[0x560917026dfa]
Aborted

That's https://github.com/Damenly/btrfs-progs.git

Whoops, I didn't use the tmp1 branch; let me try again with that and
report back, although the problem above is still going to be there,
since I think the only difference will be this, correct?
https://github.com/Damenly/btrfs-progs/commit/b5851513a12237b3e19a3e71f3ad00b966d25b3a

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
Re: [PATCH 3/3] btrfs: fix race between mkfs and mount
On 06/29/2018 08:06 PM, David Sterba wrote:
> On Tue, Jun 26, 2018 at 10:42:32PM +0800, Anand Jain wrote:
>>> Last version of the proposed fix is to extend the uuid_mutex over the
>>> whole mount callback and use it around calls to
>>> btrfs_scan_one_device. That way we'll be sure the mount will always
>>> get to the device it scanned and that will not be freed by the
>>> parallel scan.
>>
>> That will break requisite #1 as above.
>
> I don't see how it breaks 'mount and/or scan of two independent
> fsid+dev is possible'. It is possible, but the lock does not need to
> be filesystem-local. Concurrent mount or scan will block on the
> uuid_mutex.

As uuid_mutex is global, if fsid1 is being mounted, the fsid2 mount will
wait, up to a certain extent, with this code.

> Yes it will wait a bit, but the critical section is short. There's
> some IO done and it's reading of 4K in btrfs_read_disk_super. The rest
> is cpu-bound and hardly measurable in practice, in the context of
> concurrent mount and scanning.
>
> I took the approach to fix the bug first and then make it faster or
> cleaner, also to fix it in a way that's still acceptable for the
> current development cycle.

My effort here was to use uuid_mutex only to protect the fs_uuid update
part; in this way there is concurrency in the mount process of fsid1 and
fsid2, and it provides shorter bootup time when the user uses the
mount-at-boot option.

> The locking can still be improved, but the uuid_mutex is not a
> contended lock; mount is an operation that does not happen a thousand
> times a second, same for the scanning.

My concern is about the boot-up time when there is a larger number of
LUNs configured with independent btrfs FSIDs to mount at boot. Since
BTRFS is a kind of infrastructure for servers, we can't rule out that
these kinds of configurations won't exist at all. Anyway, as you said,
we can use uuid_mutex for the current development cycle. However, a
review of [1], which does fix your earlier concern about returning
-EBUSY, is appreciated.
And please let me know if going ahead with this approach would be
accepted in the current development cycle?

> So even if there are several mounts happening during boot, it's just a
> few and the delay is IMO unnoticeable. If the boot time is longer by
> say 100ms at worst, I'm still ok with my patches as a fix. But not as
> a final fix, so the updates to locking you mentioned are still
> possible. We now have a point of reference where syzbot is not able to
> reproduce the bugs.

[1]
--
When %fs_devices::opened > 0 the device is mounted; %fs_devices is never
freed, so it's safe to use %fs_devices and it does not need any locks or
flags. However, when %fs_devices::opened == 0 (idle), the device is not
yet mounted, and it can either transition to opened or be freed. When it
transitions to freed, fs_devices gets freed and any pointer access will
end up with a UAF error.

Here are the places where we access fs_devices and it needs locking;
using uuid_mutex is one method:

1. READY ioctl
2. Mount
3. SCAN ioctl
4. Stale cleanup during SCAN
5. Planned device FORGET ioctl

#4 and #5 may free the fs_devices while #1, #2, and #3 are still
accessing the fs_devices. Using uuid_mutex is one choice; however, it
would slow down the mount at boot when there is a larger number of
independent btrfs fsids being mounted at boot.

Proposed fix
-
This does not return -EBUSY. If there is any better way I am ok to use
it.

struct btrfs_fs_devices {
        ::
        int ref_count;
        ::
}

To acquire a fs_devices:

volume_devices_excl_state_enter(x)
{
        fs_devices = NULL
        lock uuid_mutex
        if (fs_devices = find fs_devices(x))
                fs_devices::ref_count++
        unlock uuid_mutex
        return fs_devices
}

To release a fs_devices:

volume_devices_excl_state_exit(fs_devices)
{
        lock uuid_mutex
        fs_devices::ref_count--
        unlock uuid_mutex
}

To delete a fs_devices:
again:
        lock uuid_mutex
        find fs_devices
        if (fs_devices::ref_count != 0) {
                unlock uuid_mutex
                goto again;
        }
        if (fs_devices->opened > 0) {
                unlock uuid_mutex
                return -EBUSY
        }
        free_fs_devices()
        unlock uuid_mutex
--

Thanks, Anand
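The lifetime rules in the pseudocode above can be modeled in ordinary userspace code. The following is only an illustrative sketch of the proposal's logic (function names borrowed from it, a `threading.Lock` standing in for the global uuid_mutex, and a dict standing in for the kernel's fs_uuids list; none of this is kernel code): acquiring a fs_devices bumps its ref_count under the lock, and deletion refuses to free while it is pinned or mounted. For testability the delete here returns a status string instead of looping on `goto again` or returning -EBUSY:

```python
import threading

class FsDevices:
    """Stand-in for struct btrfs_fs_devices (illustrative only)."""
    def __init__(self, fsid):
        self.fsid = fsid
        self.ref_count = 0   # pins taken by enter()
        self.opened = 0      # > 0 means mounted

uuid_mutex = threading.Lock()  # stand-in for the global uuid_mutex
registry = {}                  # fsid -> FsDevices, like the fs_uuids list

def volume_devices_excl_state_enter(fsid):
    """Look up fs_devices and pin it so cleanup cannot free it."""
    with uuid_mutex:
        fs_devices = registry.get(fsid)
        if fs_devices is not None:
            fs_devices.ref_count += 1
        return fs_devices

def volume_devices_excl_state_exit(fs_devices):
    """Drop the pin taken by enter()."""
    with uuid_mutex:
        fs_devices.ref_count -= 1

def delete_fs_devices(fsid):
    """Free an idle fs_devices; refuse while pinned or mounted."""
    with uuid_mutex:
        fs_devices = registry.get(fsid)
        if fs_devices is None:
            return None
        if fs_devices.ref_count != 0:
            return "busy: pinned"   # the proposal would retry (goto again)
        if fs_devices.opened > 0:
            return "busy: mounted"  # -EBUSY in the proposal
        del registry[fsid]
        return "freed"
```

The point of the scheme is visible in the two refusal branches: a FORGET or stale-cleanup path can never free a fs_devices out from under a concurrent READY/mount/SCAN that holds a pin, which is the use-after-free described above.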
Re: So, does btrfs check lowmem take days? weeks?
I've got about 1/2 the snapshots and less than 1/10th the data... but my
btrfs check times are much shorter than either: 15 minutes (original)
and 65 minutes (lowmem).

[chris@f28s ~]$ sudo btrfs fi us /mnt/first
Overall:
    Device size:                1024.00GiB
    Device allocated:            774.12GiB
    Device unallocated:          249.87GiB
    Device missing:                  0.00B
    Used:                        760.48GiB
    Free (estimated):            256.95GiB  (min: 132.01GiB)
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB  (used: 0.00B)

Data,single: Size:761.00GiB, Used:753.93GiB
   /dev/mapper/first     761.00GiB

Metadata,DUP: Size:6.50GiB, Used:3.28GiB
   /dev/mapper/first      13.00GiB

System,DUP: Size:64.00MiB, Used:112.00KiB
   /dev/mapper/first     128.00MiB

Unallocated:
   /dev/mapper/first     249.87GiB

146 subvolumes
137 snapshots
total csum bytes: 790549924
total tree bytes: 3519250432
total fs tree bytes: 2546073600
total extent tree bytes: 131350528

Original mode check takes ~15 minutes
Lowmem mode takes ~65 minutes

RAM: 4G
CPU: Intel(R) Pentium(R) CPU N3700 @ 1.60GHz

Chris Murphy
Re: btrfs suddenly thinks it's raid6
On Fri, Jun 29, 2018 at 4:09 PM, marble wrote:
> Hey,
> Thanks for the quick reply :-)
>
>> Is there anything like unexpected power loss happens?
>
> Power loss may have happened. It was powered via a RPi.
>
>> And would you provide the following data for debugging?
>>
>> # btrfs ins dump-super -fFa /dev/mapper/black
>
> See attachment.
>
>> And further more, what's the device mapper setup for /dev/mapper/black?
>> Is there anything like RAID here?
>
> This is due to LUKS, I think. "black" is the name I gave to cryptsetup
> open.

No log_root* in the super. So no fsyncing at the time? What was
happening at the time of the power loss?

Pretty weird, all three supers are OK and yet two copies of the fs tree
are corrupt. Why doesn't it fall back automatically to one of the other
roots?

You could try 'mount -o usebackuproot,ro' and see if that's permissive
enough to succeed.

I wonder if the drive firmware reordered things contrary to Btrfs
expectation right at the powerfail - or even if the powerfail caused the
drive firmware to do the writes in the wrong order?

-- 
Chris Murphy
Re: Removed Disk - Super Error - Scrub did not fix - next steps?
On Fri, Jun 29, 2018 at 11:55 AM, wrote: > Hello, > > I recently was mounting some drives in the free drive bays of my server and > accidentally removed one of my drives from my raid1 btrfs array. I > immediately put it back in - but whatever damage that would have caused > seems to be already done. > > I got these log errors [dmesg -T] and after reading up it seemed that all I > *probably* needed to do was a scrub: > > Finally - the original logs I found by running "dmesg -T|grep -i btrfs" - > from where the log started to where the scrub begain: > > [Tue Jun 26 18:30:08 2018] BTRFS error (device sdb): bdev /dev/sdh errs: wr > 40547, rd 42808, flush 2, corrupt 0, gen 0 > > [Tue Jun 26 18:33:19 2018] btrfs_dev_stat_print_on_error: 2056 callbacks > suppressed > > A bunch of these errors... > > Then later: > > [Tue Jun 26 19:19:52 2018] BTRFS warning (device sdb): lost page write due > to IO error on /dev/sdh > > I think this is where I unmounted/remounted /mnt/titan: > > [Tue Jun 26 19:49:18 2018] BTRFS info (device sdb): disk space caching is > enabled > [Tue Jun 26 19:49:18 2018] BTRFS info (device sdb): has skinny extents > [Tue Jun 26 19:49:18 2018] BTRFS info (device sdb): bdev /dev/sdh errs: wr > 55907, rd 59133, flush 2, corrupt 0, gen 0 > [Tue Jun 26 19:49:57 2018] btrfs_dev_stat_print_on_error: 49 callbacks > suppressed > [Tue Jun 26 19:49:57 2018] BTRFS error (device sdb): bdev /dev/sdh errs: wr > 55907, rd 59133, flush 2, corrupt 0, gen 1 > [Tue Jun 26 19:49:57 2018] BTRFS error (device sdb): bdev /dev/sdh errs: wr > 55907, rd 59133, flush 2, corrupt 0, gen 2 > [Tue Jun 26 19:49:57 2018] BTRFS error (device sdb): bdev /dev/sdh errs: wr > 55907, rd 59133, flush 2, corrupt 0, gen 3 > [Tue Jun 26 19:50:09 2018] BTRFS error (device sdb): parent transid verify > failed on 9081264963584 wanted 8663 found 8321 > [Tue Jun 26 19:50:09 2018] BTRFS info (device sdb): read error corrected: > ino 0 off 9081264963584 (dev /dev/sdh sector 17728418112) > [Tue Jun 26 
19:50:09 2018] BTRFS info (device sdb): read error corrected: ino 0 off 9081264967680 (dev /dev/sdh sector 17728418120)
> [Tue Jun 26 19:50:09 2018] BTRFS info (device sdb): read error corrected: ino 0 off 9081264971776 (dev /dev/sdh sector 17728418128)
> [Tue Jun 26 19:50:09 2018] BTRFS info (device sdb): read error corrected: ino 0 off 9081264975872 (dev /dev/sdh sector 17728418136)
> [Tue Jun 26 19:50:09 2018] BTRFS error (device sdb): parent transid verify failed on 9081265356800 wanted 8663 found 8321
> [Tue Jun 26 19:50:09 2018] BTRFS info (device sdb): read error corrected: ino 0 off 9081265356800 (dev /dev/sdh sector 17728418880)
> [Tue Jun 26 19:50:09 2018] BTRFS info (device sdb): read error corrected: ino 0 off 9081265360896 (dev /dev/sdh sector 1772841)
> [Tue Jun 26 19:50:09 2018] BTRFS info (device sdb): read error corrected: ino 0 off 9081265364992 (dev /dev/sdh sector 17728418896)
> [Tue Jun 26 19:50:09 2018] BTRFS info (device sdb): read error corrected: ino 0 off 9081265369088 (dev /dev/sdh sector 17728418904)

These are passive fixups that succeed.

> According to btrfs scrub status this is when the scrub began...
> > [Tue Jun 26 19:50:29 2018] BTRFS error (device sdb): parent transid verify > failed on 9081265061888 wanted 8663 found 8321 > [Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error corrected: > ino 0 off 9081265061888 (dev /dev/sdh sector 17728418304) > [Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error corrected: > ino 0 off 9081265065984 (dev /dev/sdh sector 17728418312) > [Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error corrected: > ino 0 off 9081265070080 (dev /dev/sdh sector 17728418320) > [Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error corrected: > ino 0 off 9081265074176 (dev /dev/sdh sector 17728418328) > [Tue Jun 26 19:50:29 2018] BTRFS error (device sdb): parent transid verify > failed on 9081263046656 wanted 8664 found 8662 > [Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error corrected: > ino 0 off 9081263046656 (dev /dev/sdh sector 17728414368) > [Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error corrected: > ino 0 off 9081263050752 (dev /dev/sdh sector 17728414376) > [Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error corrected: > ino 0 off 9081263054848 (dev /dev/sdh sector 17728414384) > [Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error corrected: > ino 0 off 9081263058944 (dev /dev/sdh sector 17728414392) > [Tue Jun 26 19:50:29 2018] BTRFS error (device sdb): parent transid verify > failed on 9081263194112 wanted 8664 found 8662 > [Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error corrected: > ino 0 off 9081263194112 (dev /dev/sdh sector 17728414656) > [Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error corrected: > ino 0 off 9081263198208 (dev /dev/sdh sector 17728414664) > [Tue Jun 26 19:50:29 2018] BTRFS error (device sdb): parent transid verify >
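When triaging logs like the ones quoted above, it can help to pull out just the transid mismatches and how far behind each stale copy is. A small sketch of that (plain log scraping, hedged as an ad-hoc helper rather than any btrfs tool):

```python
import re

# Matches dmesg lines such as:
#   BTRFS error (device sdb): parent transid verify failed on
#   9081264963584 wanted 8663 found 8321
TRANSID_RE = re.compile(
    r"parent transid verify failed on (\d+) wanted (\d+) found (\d+)"
)

def transid_mismatches(log_text):
    """Return (bytenr, wanted, found, lag) tuples for each transid failure."""
    out = []
    for m in TRANSID_RE.finditer(log_text):
        bytenr, wanted, found = (int(g) for g in m.groups())
        out.append((bytenr, wanted, found, wanted - found))
    return out
```

Applied to the log above, the gap of 342 generations (wanted 8663, found 8321) on some blocks versus only 2 (wanted 8664, found 8662) on others hints that different regions of /dev/sdh went stale at different points while it was out of the array.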
Re: btrfs suddenly thinks it's raid6
Hey,

Thanks for the quick reply :-)

> Is there anything like unexpected power loss happening?

Power loss may have happened. It was powered via an RPi.

> And would you provide the following data for debugging?
>
> # btrfs ins dump-super -fFa /dev/mapper/black

See attachment.

> And furthermore, what's the device mapper setup for /dev/mapper/black?
> Is there anything like RAID here?

This is due to LUKS, I think. "black" is the name I gave to cryptsetup open.

Cheers,
marble

superblock: bytenr=65536, device=/dev/mapper/black
csum_type		0 (crc32c)
csum_size		4
csum			0x9814744c [match]
bytenr			65536
flags			0x1 ( WRITTEN )
magic			_BHRfS_M [match]
fsid			9fea91c7-7b0b-4ef9-a83b-e24ccf2586b5
label			black
generation		352
root			30736384
sys_array_size		129
chunk_root_generation	145
root_level		1
chunk_root		22020096
chunk_root_level	1
log_root		0
log_root_transid	0
log_root_level		0
total_bytes		500105764864
bytes_used		153954693120
sectorsize		4096
nodesize		16384
leafsize (deprecated)	16384
stripesize		4096
root_dir		6
num_devices		1
compat_flags		0x0
compat_ro_flags		0x0
incompat_flags		0x161 ( MIXED_BACKREF | BIG_METADATA | EXTENDED_IREF | SKINNY_METADATA )
cache_generation	187
uuid_tree_generation	187
dev_item.uuid		cbf2a2ce-9487-4c2d-98e9-5cc69ae59b26
dev_item.fsid		9fea91c7-7b0b-4ef9-a83b-e24ccf2586b5 [match]
dev_item.type		0
dev_item.total_bytes	500105764864
dev_item.bytes_used	184708759552
dev_item.io_align	4096
dev_item.io_width	4096
dev_item.sector_size	4096
dev_item.devid		1
dev_item.dev_group	0
dev_item.seek_speed	0
dev_item.bandwidth	0
dev_item.generation	0
sys_chunk_array[2048]:
	item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 22020096)
		length 8388608 owner 2 stripe_len 65536 type SYSTEM|DUP
		io_align 65536 io_width 65536 sector_size 4096
		num_stripes 2 sub_stripes 0
			stripe 0 devid 1 offset 22020096
			dev_uuid cbf2a2ce-9487-4c2d-98e9-5cc69ae59b26
			stripe 1 devid 1 offset 30408704
			dev_uuid cbf2a2ce-9487-4c2d-98e9-5cc69ae59b26
backup_roots[4]:
	backup 0:
		backup_tree_root:	1084751872	gen: 187	level: 1
		backup_chunk_root:	22020096	gen: 145	level: 1
		backup_extent_root:	1084702720	gen: 187	level: 2
		backup_fs_root:		1084653568	gen: 187	level: 2
		backup_dev_root:	1081655296	gen: 178	level: 0
		backup_csum_root:	1084817408	gen: 187	level: 2
		backup_total_bytes:	500105764864
		backup_bytes_used:	153957298176
		backup_num_devices:	1
	backup 1:
		backup_tree_root:	1083801600	gen: 184	level: 1
		backup_chunk_root:	22020096	gen: 145	level: 1
		backup_extent_root:	1083752448	gen: 184	level: 2
		backup_fs_root:		1083604992	gen: 184	level: 2
		backup_dev_root:	1081655296	gen: 178	level: 0
		backup_csum_root:	1083867136	gen: 184	level: 2
		backup_total_bytes:	500105764864
		backup_bytes_used:	153957298176
		backup_num_devices:	1
	backup 2:
		backup_tree_root:	1084145664	gen: 185	level: 1
		backup_chunk_root:	22020096	gen: 145	level: 1
		backup_extent_root:	1084080128	gen: 185	level: 2
		backup_fs_root:		1083981824	gen: 185	level: 2
		backup_dev_root:	1081655296	gen: 178	level: 0
		backup_csum_root:	1084211200	gen: 185	level: 2
		backup_total_bytes:	500105764864
		backup_bytes_used:	153957298176
		backup_num_devices:	1
	backup 3:
		backup_tree_root:	1084473344	gen: 186	level: 1
		backup_chunk_root:	22020096	gen: 145	level: 1
		backup_extent_root:	1084424192	gen: 186	level: 2
		backup_fs_root:
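For the "what's the device mapper setup for /dev/mapper/black" question, the stack under a dm device can be inspected with standard tools; only the mapping name below comes from the thread, and this is just a sketch of the usual commands:

```shell
lsblk /dev/mapper/black        # where the dm device sits in the block stack
dmsetup table black            # dm target line (crypt, linear, raid, ...)
cryptsetup status black        # LUKS cipher, key size, and backing device
```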
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On Fri, Jun 29, 2018 at 9:15 AM, james harvey wrote:
> On Thu, Jun 28, 2018 at 6:27 PM, Chris Murphy wrote:
>> And an open question I have about scrub is whether it only ever
>> checks csums, meaning nodatacow files are never scrubbed, or if the
>> copies are at least compared to each other?
>
> Scrub never looks at nodatacow files. It does not compare the copies
> to each other.
>
> Qu submitted a patch to make check compare the copies:
> https://patchwork.kernel.org/patch/10434509/

Yeah, online scrub needs to report any mismatches, even if it can't do anything about them because it's ambiguous which copy is wrong.

> IMO, I think the offline check should look at nodatacow copies like
> this, but I still think this also needs to be added to scrub. In the
> patch thread, I discuss my reasons why. In brief: online scanning;
> this goes along with the user's expectation of scrub ensuring mirrored
> data integrity; and the recommendation to set up scrub on a periodic
> basis means, to me, that it's the place to put it.

I don't mind this being implemented in offline scrub first for testing purposes. But the online scrub certainly should have this ability eventually.

--
Chris Murphy

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Removed Disk - Super Error - Scrub did not fix - next steps?
Hello,

I recently was mounting some drives in the free drive bays of my server and accidentally removed one of the drives from my raid1 btrfs array. I immediately put it back in - but whatever damage that would have caused seems to be already done. I got these log errors [dmesg -T], and after reading up, it seemed that all I *probably* needed to do was a scrub.

Finally - the original logs I found by running "dmesg -T|grep -i btrfs" - from where the log started to where the scrub began:

[Tue Jun 26 18:30:08 2018] BTRFS error (device sdb): bdev /dev/sdh errs: wr 40547, rd 42808, flush 2, corrupt 0, gen 0
[Tue Jun 26 18:33:19 2018] btrfs_dev_stat_print_on_error: 2056 callbacks suppressed

A bunch of these errors... Then later:

[Tue Jun 26 19:19:52 2018] BTRFS warning (device sdb): lost page write due to IO error on /dev/sdh

I think this is where I unmounted/remounted /mnt/titan:

[Tue Jun 26 19:49:18 2018] BTRFS info (device sdb): disk space caching is enabled
[Tue Jun 26 19:49:18 2018] BTRFS info (device sdb): has skinny extents
[Tue Jun 26 19:49:18 2018] BTRFS info (device sdb): bdev /dev/sdh errs: wr 55907, rd 59133, flush 2, corrupt 0, gen 0
[Tue Jun 26 19:49:57 2018] btrfs_dev_stat_print_on_error: 49 callbacks suppressed
[Tue Jun 26 19:49:57 2018] BTRFS error (device sdb): bdev /dev/sdh errs: wr 55907, rd 59133, flush 2, corrupt 0, gen 1
[Tue Jun 26 19:49:57 2018] BTRFS error (device sdb): bdev /dev/sdh errs: wr 55907, rd 59133, flush 2, corrupt 0, gen 2
[Tue Jun 26 19:49:57 2018] BTRFS error (device sdb): bdev /dev/sdh errs: wr 55907, rd 59133, flush 2, corrupt 0, gen 3
[Tue Jun 26 19:50:09 2018] BTRFS error (device sdb): parent transid verify failed on 9081264963584 wanted 8663 found 8321
[Tue Jun 26 19:50:09 2018] BTRFS info (device sdb): read error corrected: ino 0 off 9081264963584 (dev /dev/sdh sector 17728418112)
[Tue Jun 26 19:50:09 2018] BTRFS info (device sdb): read error corrected: ino 0 off 9081264967680 (dev /dev/sdh sector 17728418120)
[Tue Jun 26 19:50:09 2018] BTRFS info (device sdb): read error corrected: ino 0 off 9081264971776 (dev /dev/sdh sector 17728418128)
[Tue Jun 26 19:50:09 2018] BTRFS info (device sdb): read error corrected: ino 0 off 9081264975872 (dev /dev/sdh sector 17728418136)
[Tue Jun 26 19:50:09 2018] BTRFS error (device sdb): parent transid verify failed on 9081265356800 wanted 8663 found 8321
[Tue Jun 26 19:50:09 2018] BTRFS info (device sdb): read error corrected: ino 0 off 9081265356800 (dev /dev/sdh sector 17728418880)
[Tue Jun 26 19:50:09 2018] BTRFS info (device sdb): read error corrected: ino 0 off 9081265360896 (dev /dev/sdh sector 1772841)
[Tue Jun 26 19:50:09 2018] BTRFS info (device sdb): read error corrected: ino 0 off 9081265364992 (dev /dev/sdh sector 17728418896)
[Tue Jun 26 19:50:09 2018] BTRFS info (device sdb): read error corrected: ino 0 off 9081265369088 (dev /dev/sdh sector 17728418904)

According to btrfs scrub status, this is when the scrub began...

[Tue Jun 26 19:50:29 2018] BTRFS error (device sdb): parent transid verify failed on 9081265061888 wanted 8663 found 8321
[Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error corrected: ino 0 off 9081265061888 (dev /dev/sdh sector 17728418304)
[Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error corrected: ino 0 off 9081265065984 (dev /dev/sdh sector 17728418312)
[Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error corrected: ino 0 off 9081265070080 (dev /dev/sdh sector 17728418320)
[Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error corrected: ino 0 off 9081265074176 (dev /dev/sdh sector 17728418328)
[Tue Jun 26 19:50:29 2018] BTRFS error (device sdb): parent transid verify failed on 9081263046656 wanted 8664 found 8662
[Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error corrected: ino 0 off 9081263046656 (dev /dev/sdh sector 17728414368)
[Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error corrected: ino 0 off 9081263050752 (dev /dev/sdh sector 17728414376)
[Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error corrected: ino 0 off 9081263054848 (dev /dev/sdh sector 17728414384)
[Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error corrected: ino 0 off 9081263058944 (dev /dev/sdh sector 17728414392)
[Tue Jun 26 19:50:29 2018] BTRFS error (device sdb): parent transid verify failed on 9081263194112 wanted 8664 found 8662
[Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error corrected: ino 0 off 9081263194112 (dev /dev/sdh sector 17728414656)
[Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error corrected: ino 0 off 9081263198208 (dev /dev/sdh sector 17728414664)
[Tue Jun 26 19:50:29 2018] BTRFS error (device sdb): parent transid verify failed on 9081264128000 wanted 8664 found 8662
[Tue Jun 26 19:50:29 2018] BTRFS error (device sdb): parent transid verify failed on 9081265520640 wanted 8665 found 8321

So I ran btrfs scrub and this was my
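For reference, after a mirror member has been briefly disconnected and reattached, the usual follow-up with stock btrfs-progs looks something like this. The mount point is taken from the thread; treat the sequence as a sketch rather than a prescription:

```shell
btrfs device stats /mnt/titan        # per-device wr/rd/flush/corrupt/gen error counters
btrfs scrub start -Bd /mnt/titan     # -B: run in foreground, -d: per-device statistics
btrfs scrub status /mnt/titan        # confirm no uncorrectable errors remain
btrfs device stats -z /mnt/titan     # reset counters once clean, so new errors stand out
```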
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On 2018-06-29 13:58, james harvey wrote:
> On Fri, Jun 29, 2018 at 1:09 PM, Austin S. Hemmelgarn wrote:
>> On 2018-06-29 11:15, james harvey wrote:
>>> On Thu, Jun 28, 2018 at 6:27 PM, Chris Murphy wrote:
>>>> And an open question I have about scrub is whether it only ever
>>>> checks csums, meaning nodatacow files are never scrubbed, or if the
>>>> copies are at least compared to each other?
>>>
>>> Scrub never looks at nodatacow files. It does not compare the copies
>>> to each other.
>>>
>>> Qu submitted a patch to make check compare the copies:
>>> https://patchwork.kernel.org/patch/10434509/
>>>
>>> This hasn't been added to btrfs-progs git yet.
>>>
>>> IMO, I think the offline check should look at nodatacow copies like
>>> this, but I still think this also needs to be added to scrub. In the
>>> patch thread, I discuss my reasons why. In brief: online scanning;
>>> this goes along with the user's expectation of scrub ensuring mirrored
>>> data integrity; and the recommendation to set up scrub on a periodic
>>> basis means, to me, that it's the place to put it.
>>
>> That said, it can't sanely fix things if there is a mismatch. At least,
>> not unless BTRFS gets proper generational tracking to handle temporarily
>> missing devices. As of right now, sanely fixing things requires
>> significant manual intervention, as you have to bypass the device read
>> selection algorithm to be able to look at the state of the individual
>> copies so that you can pick one to use and forcibly rewrite the whole
>> file by hand.
>
> Absolutely. The user would need to use manual intervention as you
> describe, or restore the single file(s) from backup.
>
> But it's a good opportunity to tell the user they had partial data
> corruption, even if it can't be auto-fixed. Otherwise they get
> intermittent data corruption, depending on which copies are read.

The thing is though, as things stand right now, you need to manually edit the data on-disk directly or restore the file from a backup to fix the file.
While it's technically true that you can manually repair this type of thing, in both cases, without those patches I mentioned, it's functionally impossible for a regular user to do it without potentially losing some data. Unless that changes, scrub telling you it's corrupt is not going to help much, aside from making sure you don't make things worse by trying to use it.

Given this, it would make sense to have a (disabled by default) option to have scrub repair it by just using the newer or older copy of the data. That would require classic RAID generational tracking though, which BTRFS doesn't have yet.

A while back, Anand Jain posted some patches that would let you select a particular device to direct all reads to via a mount option, but I don't think they ever got merged. That would have made manual recovery in cases like this exponentially easier (mount read-only with one device selected, copy the file out somewhere, remount read-only with the other device, drop caches, copy the file out again, compare and reconcile the two copies, then remount the volume writable and write out the repaired file).
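Were such a device-selection option available, the recovery procedure described above could be sketched roughly as follows. Note that the `read_mirror=` mount option is hypothetical (it stands in for Anand Jain's unmerged patches and does not exist in mainline), and the paths are illustrative:

```shell
# HYPOTHETICAL: 'read_mirror=' stands in for the device-selection mount
# option from Anand Jain's unmerged patches; it does not exist in mainline.

mount -o ro,read_mirror=1 /dev/sdb1 /mnt      # serve all reads from device 1
cp /mnt/vm/disk.img /tmp/disk.img.dev1
umount /mnt

echo 3 > /proc/sys/vm/drop_caches             # don't let the page cache mask the other copy

mount -o ro,read_mirror=2 /dev/sdb1 /mnt      # serve all reads from device 2
cp /mnt/vm/disk.img /tmp/disk.img.dev2
umount /mnt

# Compare and reconcile the two copies by hand...
cmp /tmp/disk.img.dev1 /tmp/disk.img.dev2 || echo "copies differ"

# ...then remount writable and write the repaired file back,
# which rewrites both mirrors.
mount /dev/sdb1 /mnt
cp /tmp/disk.img.good /mnt/vm/disk.img
```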
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On Fri, Jun 29, 2018 at 1:09 PM, Austin S. Hemmelgarn wrote:
> On 2018-06-29 11:15, james harvey wrote:
>>
>> On Thu, Jun 28, 2018 at 6:27 PM, Chris Murphy wrote:
>>>
>>> And an open question I have about scrub is whether it only ever
>>> checks csums, meaning nodatacow files are never scrubbed, or if the
>>> copies are at least compared to each other?
>>
>> Scrub never looks at nodatacow files. It does not compare the copies
>> to each other.
>>
>> Qu submitted a patch to make check compare the copies:
>> https://patchwork.kernel.org/patch/10434509/
>>
>> This hasn't been added to btrfs-progs git yet.
>>
>> IMO, I think the offline check should look at nodatacow copies like
>> this, but I still think this also needs to be added to scrub. In the
>> patch thread, I discuss my reasons why. In brief: online scanning;
>> this goes along with the user's expectation of scrub ensuring mirrored
>> data integrity; and the recommendation to set up scrub on a periodic
>> basis means, to me, that it's the place to put it.
>
> That said, it can't sanely fix things if there is a mismatch. At least,
> not unless BTRFS gets proper generational tracking to handle temporarily
> missing devices. As of right now, sanely fixing things requires
> significant manual intervention, as you have to bypass the device read
> selection algorithm to be able to look at the state of the individual
> copies so that you can pick one to use and forcibly rewrite the whole
> file by hand.

Absolutely. The user would need to use manual intervention as you describe, or restore the single file(s) from backup.

But it's a good opportunity to tell the user they had partial data corruption, even if it can't be auto-fixed. Otherwise they get intermittent data corruption, depending on which copies are read.

> A while back, Anand Jain posted some patches that would let you select a
> particular device to direct all reads to via a mount option, but I don't
> think they ever got merged.
> That would have made manual recovery in cases like this exponentially
> easier (mount read-only with one device selected, copy the file out
> somewhere, remount read-only with the other device, drop caches, copy
> the file out again, compare and reconcile the two copies, then remount
> the volume writable and write out the repaired file).
Re: Incremental send/receive broken after snapshot restore
28.06.2018 23:09, Hannes Schweizer wrote:
> Hi,
>
> Here's my environment:
> Linux diablo 4.17.0-gentoo #5 SMP Mon Jun 25 00:26:55 CEST 2018 x86_64
> Intel(R) Core(TM) i5 CPU 760 @ 2.80GHz GenuineIntel GNU/Linux
> btrfs-progs v4.17
>
> Label: 'online' uuid: e4dc6617-b7ed-4dfb-84a6-26e3952c8390
> Total devices 2 FS bytes used 3.16TiB
> devid 1 size 1.82TiB used 1.58TiB path /dev/mapper/online0
> devid 2 size 1.82TiB used 1.58TiB path /dev/mapper/online1
> Data, RAID0: total=3.16TiB, used=3.15TiB
> System, RAID0: total=16.00MiB, used=240.00KiB
> Metadata, RAID0: total=7.00GiB, used=4.91GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> Label: 'offline' uuid: 5b449116-93e5-473e-aaf5-bf3097b14f29
> Total devices 2 FS bytes used 3.52TiB
> devid 1 size 5.46TiB used 3.53TiB path /dev/mapper/offline0
> devid 2 size 5.46TiB used 3.53TiB path /dev/mapper/offline1
> Data, RAID1: total=3.52TiB, used=3.52TiB
> System, RAID1: total=8.00MiB, used=512.00KiB
> Metadata, RAID1: total=6.00GiB, used=5.11GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> Label: 'external' uuid: 8bf13621-01f0-4f09-95c7-2c157d3087d0
> Total devices 1 FS bytes used 3.65TiB
> devid 1 size 5.46TiB used 3.66TiB path
> /dev/mapper/luks-3c196e96-d46c-4a9c-9583-b79c707678fc
> Data, single: total=3.64TiB, used=3.64TiB
> System, DUP: total=32.00MiB, used=448.00KiB
> Metadata, DUP: total=11.00GiB, used=9.72GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> The following automatic backup scheme is in place:
> hourly:
> btrfs sub snap -r online/root online/root.
>
> daily:
> btrfs sub snap -r online/root online/root.
> btrfs send -c online/root. online/root. | btrfs receive offline
> btrfs sub del -c online/root.
>
> monthly:
> btrfs sub snap -r online/root online/root.
> btrfs send -c online/root. online/root. | btrfs receive external
> btrfs sub del -c online/root.
> Now here are the commands leading up to my problem:
> After the online filesystem suddenly went ro, and btrfs check showed
> massive problems, I decided to start the online array from scratch:
> 1: mkfs.btrfs -f -d raid0 -m raid0 -L "online" /dev/mapper/online0 /dev/mapper/online1
>
> As you can see from the backup commands above, the snapshots of
> offline and external are not related, so in order to at least keep the
> extensive backlog of the external snapshot set (including all
> reflinks), I decided to restore the latest snapshot from external.
> 2: btrfs send external/root. | btrfs receive online
>
> I wanted to ensure I can restart the incremental backup flow from
> online to external, so I did this:
> 3: mv online/root. online/root
> 4: btrfs sub snap -r online/root online/root.
> 5: btrfs property set online/root ro false
>
> Now, I naively expected a simple restart of my automatic backups for
> external should work.
> However, after running:
> 6: btrfs sub snap -r online/root online/root.
> 7: btrfs send -c online/root. online/root. | btrfs receive external

You just recreated your "online" filesystem from scratch. Where does "old_external_reference" come from? You did not show the steps used to create it.

> I see the following error:
> ERROR: unlink root/.ssh/agent-diablo-_dev_pts_3 failed. No such file
> or directory
>
> Which is unfortunate, but the second problem actually encouraged me to
> post this message.
> As planned, I had to start the offline array from scratch as well,
> because I no longer had any reference snapshot for incremental backups
> on other devices:
> 8: mkfs.btrfs -f -d raid1 -m raid1 -L "offline" /dev/mapper/offline0 /dev/mapper/offline1
>
> However, restarting the automatic daily backup flow bails out with a
> similar error, although no potentially problematic previous
> incremental snapshots should be involved here!
> ERROR: unlink o925031-987-0/2139527549 failed. No such file or directory

Again - before you can *re*start an incremental-forever sequence, you need an initial full copy. How exactly did you restart it if no snapshots exist on either the source or the destination?

> I'm a bit lost now. The only thing I could imagine which might be
> confusing for btrfs is the residual "Received UUID" of online/root.
> after command 2.
> What's the recommended way to restore snapshots with send/receive
> without breaking subsequent incremental backups (including reflinks of
> existing backups)?
>
> Any hints appreciated...
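For the archives: the step that breaks the chain above is flipping the received snapshot read-write (`btrfs property set ... ro false`), which invalidates its received UUID. A restore that preserves the incremental relationship would look roughly like this; the dated snapshot names and mount points are illustrative, not from the thread:

```shell
# Full (non-incremental) send of the newest backup snapshot to the rebuilt pool
btrfs send /mnt/external/root.20180628 | btrfs receive /mnt/online

# Do NOT flip the received snapshot read-write; snapshot it instead, keeping
# the read-only copy (with its received UUID intact) as the shared parent
btrfs sub snap /mnt/online/root.20180628 /mnt/online/root

# Subsequent incrementals then use the preserved snapshot as parent
btrfs sub snap -r /mnt/online/root /mnt/online/root.20180629
btrfs send -p /mnt/online/root.20180628 /mnt/online/root.20180629 \
    | btrfs receive /mnt/external
```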
Re: So, does btrfs check lowmem take days? weeks?
On Fri, Jun 29, 2018 at 12:28:31AM -0700, Marc MERLIN wrote:
> So, I rebooted, and will now run Su's btrfs check without repair and
> report back.

As expected, it will likely still take days. Here's the start:

gargamel:~# btrfs check --mode=lowmem -p /dev/mapper/dshelf2
Checking filesystem on /dev/mapper/dshelf2
UUID: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 21872, owner: 374857, offset: 3407872) wanted: 2, have: 4
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 22911, owner: 374857, offset: 3407872) wanted: 2, have: 4
ERROR: extent[125712527360, 12214272] referencer count mismatch (root: 21872, owner: 374857, offset: 114540544) wanted: 180, have: 240
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 21872, owner: 374857, offset: 126754816) wanted: 67, have: 115
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 22911, owner: 374857, offset: 126754816) wanted: 67, have: 115
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 21872, owner: 374857, offset: 131866624) wanted: 114, have: 143
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 22911, owner: 374857, offset: 131866624) wanted: 114, have: 143
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 21872, owner: 374857, offset: 148234240) wanted: 301, have: 431
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 22911, owner: 374857, offset: 148234240) wanted: 355, have: 433
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 21872, owner: 374857, offset: 180371456) wanted: 160, have: 240
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 22911, owner: 374857, offset: 180371456) wanted: 161, have: 240
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 21872, owner: 374857, offset: 192200704) wanted: 169, have: 249
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, owner: 374857, offset: 192200704) wanted: 171, have: 251
ERROR: extent[150850146304, 17522688] referencer count mismatch (root: 21872, owner: 374857, offset: 217653248) wanted: 347, have: 418
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 22911, owner: 374857, offset: 235175936) wanted: 1, have: 1449
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, owner: 374857, offset: 235175936) wanted: 1, have: 1452

Mmmh, these look similar (but not identical) to the last run earlier in this thread:

ERROR: extent[84302495744, 69632] referencer count mismatch (root: 21872, owner: 374857, offset: 3407872) wanted: 3, have: 4
Created new chunk [18457780224000 1073741824]
Delete backref in extent [84302495744 69632]
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 22911, owner: 374857, offset: 3407872) wanted: 3, have: 4
Delete backref in extent [84302495744 69632]
ERROR: extent[125712527360, 12214272] referencer count mismatch (root: 21872, owner: 374857, offset: 114540544) wanted: 181, have: 240
Delete backref in extent [125712527360 12214272]
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 21872, owner: 374857, offset: 126754816) wanted: 68, have: 115
Delete backref in extent [125730848768 5111808]
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 22911, owner: 374857, offset: 126754816) wanted: 68, have: 115
Delete backref in extent [125730848768 5111808]
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 21872, owner: 374857, offset: 131866624) wanted: 115, have: 143
Delete backref in extent [125736914944 6037504]
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 22911, owner: 374857, offset: 131866624) wanted: 115, have: 143
Delete backref in extent [125736914944 6037504]
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 21872, owner: 374857, offset: 148234240) wanted: 302, have: 431
Delete backref in extent [129952120832 20242432]
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 22911, owner: 374857, offset: 148234240) wanted: 356, have: 433
Delete backref in extent [129952120832 20242432]
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 21872, owner: 374857, offset: 180371456) wanted: 161, have: 240
Delete backref in extent [134925357056 11829248]
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 22911, owner: 374857, offset: 180371456) wanted: 162, have: 240
Delete backref in extent [134925357056 11829248]
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 21872, owner: 374857, offset: 192200704) wanted: 170, have: 249
Delete backref in extent [147895111680 12345344]
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, owner: 374857, offset: 192200704) wanted: 172, have: 251
Delete backref in extent
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On 2018-06-29 11:15, james harvey wrote:
> On Thu, Jun 28, 2018 at 6:27 PM, Chris Murphy wrote:
>> And an open question I have about scrub is whether it only ever
>> checks csums, meaning nodatacow files are never scrubbed, or if the
>> copies are at least compared to each other?
>
> Scrub never looks at nodatacow files. It does not compare the copies
> to each other.
>
> Qu submitted a patch to make check compare the copies:
> https://patchwork.kernel.org/patch/10434509/
>
> This hasn't been added to btrfs-progs git yet.
>
> IMO, I think the offline check should look at nodatacow copies like
> this, but I still think this also needs to be added to scrub. In the
> patch thread, I discuss my reasons why. In brief: online scanning;
> this goes along with the user's expectation of scrub ensuring mirrored
> data integrity; and the recommendation to set up scrub on a periodic
> basis means, to me, that it's the place to put it.

That said, it can't sanely fix things if there is a mismatch. At least, not unless BTRFS gets proper generational tracking to handle temporarily missing devices. As of right now, sanely fixing things requires significant manual intervention, as you have to bypass the device read selection algorithm to be able to look at the state of the individual copies so that you can pick one to use and forcibly rewrite the whole file by hand.

A while back, Anand Jain posted some patches that would let you select a particular device to direct all reads to via a mount option, but I don't think they ever got merged. That would have made manual recovery in cases like this exponentially easier (mount read-only with one device selected, copy the file out somewhere, remount read-only with the other device, drop caches, copy the file out again, compare and reconcile the two copies, then remount the volume writable and write out the repaired file).
Re: btrfs send/receive vs rsync
On Fri, Jun 29, 2018 at 10:04:02AM +0200, Lionel Bouton wrote:
> Hi,
>
> On 29/06/2018 09:22, Marc MERLIN wrote:
>> On Fri, Jun 29, 2018 at 12:09:54PM +0500, Roman Mamedov wrote:
>>> On Thu, 28 Jun 2018 23:59:03 -0700
>>> Marc MERLIN wrote:
>>>
>>>> I don't waste a week recreating the many btrfs send/receive relationships.
>>> Consider not using send/receive, and switching to regular rsync instead.
>>> Send/receive is very limiting and cumbersome, including because of what you
>>> described. And it doesn't gain you much over an incremental rsync. As for
>> Err, sorry but I cannot agree with you here, at all :)
>>
>> btrfs send/receive is pretty much the only reason I use btrfs.
>> rsync takes hours on big filesystems, scanning every single inode on both
>> sides to see what changed, and only then sends the differences.
>> It's super inefficient.
>> btrfs send knows in seconds what needs to be sent, and works on it right away.
>
> I've not yet tried send/receive, but I feel the pain of rsyncing millions
> of files (I had to use lsyncd to limit the problem to the time the
> origin servers reboot, which is a relatively rare event), so this thread
> piqued my attention. Looking at the whole thread, I wonder if you could
> get a more manageable solution by splitting the filesystem.

So, let's be clear: I did backups with rsync for 10+ years. It was slow and painful. On my laptop, an hourly rsync between 2 drives slowed my machine to a crawl while everything was being stat'ed; it took forever. Now with btrfs send/receive, it just works; I don't even see it happening in the background.
Here is a page I wrote about it in 2014:
http://marc.merlins.org/perso/btrfs/2014-03.html#Btrfs-Tips_-Doing-Fast-Incremental-Backups-With-Btrfs-Send-and-Receive

Here is a talk I gave in 2014 too; scroll to the bottom of the page and the bottom of the talk outline:
http://marc.merlins.org/perso/btrfs/2014-05.html#My-Btrfs-Talk-at-Linuxcon-JP-2014
and click on 'Btrfs send/receive'.

> If instead of using a single BTRFS filesystem you used LVM volumes
> (maybe with thin provisioning and monitoring of the volume group free
> space) for each of the servers you back up, with one BTRFS filesystem per
> volume, you would have fewer snapshots per filesystem and isolate problems
> in case of corruption. If you eventually decide to start from scratch
> again, this might help a lot in your case.

So, I already have problems due to too many block layers:
- raid 5 + ssd
- bcache
- dmcrypt
- btrfs

I get occasional deadlocks due to upper layers sending more data to the lower layer (bcache) than it can process. I'm a bit wary of adding yet another layer (LVM), but you're otherwise correct that keeping smaller btrfs filesystems would help with performance and with containing possible damage.

Has anyone actually done this? :)

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
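Lionel's split-filesystem suggestion could be sketched as one thin LV plus one btrfs filesystem per backed-up machine. The volume group, names, and sizes below are made up for illustration:

```shell
# One thin pool on the backup array, then one thin volume per server
lvcreate --type thin-pool -L 10T -n backup_pool vg_backup
lvcreate -V 2T --thinpool vg_backup/backup_pool -n srv1
mkfs.btrfs -L srv1 /dev/vg_backup/srv1
mount /dev/vg_backup/srv1 /backup/srv1

# Thin pools are over-provisioned by design, so watch pool usage
lvs -o lv_name,data_percent,metadata_percent vg_backup
```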
Re: [PATCH 13/14] btrfs: raid56: don't lock stripe cache table when freeing
On Fri, Jun 29, 2018 at 10:57:07AM +0200, David Sterba wrote:
> This is called either at the end of the mount or if the mount fails.
> In both cases, there's nothing running that can use the table so the
> lock is pointless.

And then lockdep says no. The umount path frees the table, but there's some unfinished bio that wants to use the table from interrupt context. And this is puzzling: there should be no IO in flight, and all workers should be stopped.

btrfs/011:

[ 1339.169842]
[ 1339.171891] WARNING: inconsistent lock state
[ 1339.173724] 4.17.0-rc7-default+ #168 Not tainted
[ 1339.175661]
[ 1339.177479] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
[ 1339.179758] umount/4029 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 1339.182008] 96eee2cd (&(>lock)->rlock){?.-.}, at: __remove_rbio_from_cache+0x5f/0x140 [btrfs]
[ 1339.183982] {IN-HARDIRQ-W} state was registered at:
[ 1339.184819]   _raw_spin_lock_irqsave+0x43/0x52
[ 1339.185433]   unlock_stripe+0x78/0x3d0 [btrfs]
[ 1339.186041]   rbio_orig_end_io+0x41/0xd0 [btrfs]
[ 1339.186648]   blk_update_request+0xd7/0x330
[ 1339.187205]   blk_mq_end_request+0x18/0x70
[ 1339.187752]   flush_smp_call_function_queue+0x83/0x120
[ 1339.188405]   smp_call_function_single_interrupt+0x43/0x270
[ 1339.189094]   call_function_single_interrupt+0xf/0x20
[ 1339.189733]   native_safe_halt+0x2/0x10
[ 1339.190255]   default_idle+0x1f/0x190
[ 1339.190759]   do_idle+0x217/0x240
[ 1339.191229]   cpu_startup_entry+0x6f/0x80
[ 1339.191768]   start_secondary+0x192/0x1e0
[ 1339.192385]   secondary_startup_64+0xa5/0xb0
[ 1339.193145] irq event stamp: 783817
[ 1339.193701] hardirqs last enabled at (783817): [] _raw_spin_unlock_irqrestore+0x4d/0x60
[ 1339.194896] hardirqs last disabled at (783816): [] _raw_spin_lock_irqsave+0x20/0x52
[ 1339.196010] softirqs last enabled at (777902): [] __do_softirq+0x397/0x502
[ 1339.197435] softirqs last disabled at (777879): [] irq_exit+0xc1/0xd0
[ 1339.198797]
[ 1339.198797] other info that might help us debug this:
[ 1339.200011]  Possible unsafe locking scenario:
[ 1339.200011]
[ 1339.201066]        CPU0
[ 1339.201464]
[ 1339.201845]   lock(&(>lock)->rlock);
[ 1339.202372]
[ 1339.202768]     lock(&(>lock)->rlock);
[ 1339.203313]
[ 1339.203313]  *** DEADLOCK ***
[ 1339.203313]
[ 1339.204215] 1 lock held by umount/4029:
[ 1339.204727]  #0: 7c3dd992 (>s_umount_key#26){}, at: deactivate_super+0x43/0x50
[ 1339.205822]
[ 1339.205822] stack backtrace:
[ 1339.206660] CPU: 2 PID: 4029 Comm: umount Not tainted 4.17.0-rc7-default+ #168
[ 1339.207875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
[ 1339.209357] Call Trace:
[ 1339.209742]  dump_stack+0x85/0xc0
[ 1339.210204]  print_usage_bug.cold.57+0x1aa/0x1e4
[ 1339.210790]  ? print_shortest_lock_dependencies+0x40/0x40
[ 1339.211680]  mark_lock+0x530/0x630
[ 1339.212291]  ? print_shortest_lock_dependencies+0x40/0x40
[ 1339.213180]  __lock_acquire+0x549/0xf60
[ 1339.213855]  ? flush_work+0x24a/0x280
[ 1339.214485]  ? __lock_is_held+0x4f/0x90
[ 1339.215176]  lock_acquire+0x9e/0x1d0
[ 1339.215850]  ? __remove_rbio_from_cache+0x5f/0x140 [btrfs]
[ 1339.216755]  _raw_spin_lock+0x2c/0x40
[ 1339.217303]  ? __remove_rbio_from_cache+0x5f/0x140 [btrfs]
[ 1339.218005]  __remove_rbio_from_cache+0x5f/0x140 [btrfs]
[ 1339.218687]  btrfs_free_stripe_hash_table+0x2a/0x50 [btrfs]
[ 1339.219393]  close_ctree+0x1d7/0x330 [btrfs]
[ 1339.219961]  generic_shutdown_super+0x64/0x100
[ 1339.220659]  kill_anon_super+0xe/0x20
[ 1339.221245]  btrfs_kill_super+0x12/0xa0 [btrfs]
[ 1339.221840]  deactivate_locked_super+0x29/0x60
[ 1339.222416]  cleanup_mnt+0x3b/0x70
[ 1339.222892]  task_work_run+0x9b/0xd0
[ 1339.223388]  exit_to_usermode_loop+0x99/0xa0
[ 1339.223949]  do_syscall_64+0x16c/0x170
[ 1339.224603]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1339.225457] RIP: 0033:0x7ffb9180d017
[ 1339.226063] RSP: 002b:7fff9cb72a28 EFLAGS: 0246 ORIG_RAX: 00a6
[ 1339.227289] RAX: RBX: 55b804721970 RCX: 7ffb9180d017
[ 1339.228373] RDX: 0001 RSI: RDI: 55b804721b50
[ 1339.229451] RBP: R08: 55b804721b70 R09: 7fff9cb71290
[ 1339.230524] R10: R11: 0246 R12: 7ffb91d291c4
[ 1339.231604] R13: 55b804721b50 R14: R15: 7fff9cb72c98
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On Thu, Jun 28, 2018 at 6:27 PM, Chris Murphy wrote: > And an open question I have about scrub is whether it only ever is > checking csums, meaning nodatacow files are never scrubbed, or if the > copies are at least compared to each other? Scrub never looks at nodatacow files. It does not compare the copies to each other. Qu submitted a patch to make check compare the copies: https://patchwork.kernel.org/patch/10434509/ This hasn't been added to btrfs-progs git yet. IMO the offline check should look at nodatacow copies like this, but the comparison also needs to be added to scrub. In the patch thread, I discuss my reasons why. In brief: scrub scans online; it matches the user's expectation that scrub ensures mirrored data integrity; and the standing recommendation to run scrub periodically makes it the natural place for this check.
Re: [PATCH 05/14] btrfs: pass only eb to num_extent_pages
On 29.06.2018 17:29, David Sterba wrote: > On Fri, Jun 29, 2018 at 12:08:10PM +0300, Nikolay Borisov wrote: >>> for (i = 0; i < num_pages; i++) >>> copy_page(page_address(dst->pages[i]), >>> page_address(src->pages[i])); >>> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h >>> index 0bfd4aeb822d..d8382a4a7f46 100644 >>> --- a/fs/btrfs/extent_io.h >>> +++ b/fs/btrfs/extent_io.h >>> @@ -440,10 +440,10 @@ int read_extent_buffer_pages(struct extent_io_tree >>> *tree, >>> int mirror_num); >>> void wait_on_extent_buffer_writeback(struct extent_buffer *eb); >>> >>> -static inline unsigned long num_extent_pages(u64 start, u64 len) >>> +static inline unsigned long num_extent_pages(const struct extent_buffer >>> *eb) >>> { >>> - return ((start + len + PAGE_SIZE - 1) >> PAGE_SHIFT) - >>> - (start >> PAGE_SHIFT); >>> + return ((eb->start + eb->len + PAGE_SIZE - 1) >> PAGE_SHIFT) - >>> + (eb->start >> PAGE_SHIFT); >> >> This is a good opportunity to eliminate the open-coded round_up: >> >> round_up(eb->start + eb->len, PAGE_SIZE) > > Ok, I'll add this patch: > > @@ -442,8 +442,8 @@ void wait_on_extent_buffer_writeback(struct extent_buffer > *eb); > > static inline int num_extent_pages(const struct extent_buffer *eb) > { > - return ((eb->start + eb->len + PAGE_SIZE - 1) >> PAGE_SHIFT) - > - (eb->start >> PAGE_SHIFT); > + return (round_up(eb->start + eb->len, PAGE_SIZE) >> PAGE_SHIFT) - > + (eb->start >> PAGE_SHIFT); > } LGTM > > static inline void extent_buffer_get(struct extent_buffer *eb)
Re: [PATCH 05/14] btrfs: pass only eb to num_extent_pages
On Fri, Jun 29, 2018 at 12:08:10PM +0300, Nikolay Borisov wrote: > > for (i = 0; i < num_pages; i++) > > copy_page(page_address(dst->pages[i]), > > page_address(src->pages[i])); > > diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h > > index 0bfd4aeb822d..d8382a4a7f46 100644 > > --- a/fs/btrfs/extent_io.h > > +++ b/fs/btrfs/extent_io.h > > @@ -440,10 +440,10 @@ int read_extent_buffer_pages(struct extent_io_tree > > *tree, > > int mirror_num); > > void wait_on_extent_buffer_writeback(struct extent_buffer *eb); > > > > -static inline unsigned long num_extent_pages(u64 start, u64 len) > > +static inline unsigned long num_extent_pages(const struct extent_buffer > > *eb) > > { > > - return ((start + len + PAGE_SIZE - 1) >> PAGE_SHIFT) - > > - (start >> PAGE_SHIFT); > > + return ((eb->start + eb->len + PAGE_SIZE - 1) >> PAGE_SHIFT) - > > + (eb->start >> PAGE_SHIFT); > > This is a good opportunity to eliminate the open-coded round_up: > > round_up(eb->start + eb->len, PAGE_SIZE) Ok, I'll add this patch: @@ -442,8 +442,8 @@ void wait_on_extent_buffer_writeback(struct extent_buffer *eb); static inline int num_extent_pages(const struct extent_buffer *eb) { - return ((eb->start + eb->len + PAGE_SIZE - 1) >> PAGE_SHIFT) - - (eb->start >> PAGE_SHIFT); + return (round_up(eb->start + eb->len, PAGE_SIZE) >> PAGE_SHIFT) - + (eb->start >> PAGE_SHIFT); } static inline void extent_buffer_get(struct extent_buffer *eb)
Re: [PATCH] btrfs: remove warnings superseded by refcount_t usage
On Tue, Jun 26, 2018 at 05:00:46PM +0300, Nikolay Borisov wrote: > On 25.06.2018 19:38, David Sterba wrote: > > There are several WARN_ON calls that catch incorrect reference counter > > updates, but this is what the refcount_t does already: > > > > * refcount_inc from 0 will warn > > * refcount_dec_and_test from 0 will warn > > > > But these warnings are only going to be triggered in the case of > CONFIG_REFCOUNT_FULL; otherwise refcount falls back to plain atomic > operations. Right, I'm for REFCOUNT_FULL on by default obviously, but we need to handle the case where it's not. And there isn't a refcount type with the full semantics.
Re: [PATCH 3/4] btrfs: Rename EXTENT_BUFFER_DUMMY to EXTENT_BUFFER_PRIVATE
On Fri, Jun 29, 2018 at 09:07:12PM +0800, Qu Wenruo wrote: > >> @@ -5117,7 +5117,7 @@ void free_extent_buffer(struct extent_buffer *eb) > >> > >>spin_lock(>refs_lock); > >>if (atomic_read(>refs) == 2 && > >> - test_bit(EXTENT_BUFFER_DUMMY, >bflags)) > >> + test_bit(EXTENT_BUFFER_PRIVATE, >bflags)) > >>atomic_dec(>refs); > > Also discussed in off-list mail, this extra atomic_dec for cloned eb > looks confusing. > (That also explains why after cloning the eb, we call > extent_buffer_get() and only free it once, and still no eb leaks) > What about just removing such special handling? The extent_buffer_get can be moved to the cloning function; all callers increase the reference, but the missing dec is indeed confusing. However I don't think we can remove it completely. We need to keep eb::refs at the same level for all eb types (regular, STALE, DUMMY), so we can use e.g. free_extent_buffer. There may be another way to clean it up that I don't see now, so if you think you can improve the reference handling, go for it.
Re: [PATCH 3/4] btrfs: Rename EXTENT_BUFFER_DUMMY to EXTENT_BUFFER_PRIVATE
On 29.06.2018 16:07, Qu Wenruo wrote: > > > On 2018年06月29日 20:46, David Sterba wrote: >> On Wed, Jun 27, 2018 at 04:38:24PM +0300, Nikolay Borisov wrote: >>> EXTENT_BUFFER_DUMMY is an awful name for this flag. Buffers which have >>> this flag set are not in any way dummy. Rather, they are private in >>> the sense that are not linked to the global buffer tree. This flag has >>> subtle implications to the way free_extent_buffer work for example, as >>> well as controls whether page->mapping->private_lock is held during >>> extent_buffer release. Pages for a private buffer cannot be under io, >>> nor can they be written by a 3rd party so taking the lock is >>> unnecessary. >>> >>> Signed-off-by: Nikolay Borisov >>> --- >>> fs/btrfs/disk-io.c | 2 +- >>> fs/btrfs/extent_io.c | 10 +- >>> fs/btrfs/extent_io.h | 2 +- >>> 3 files changed, 7 insertions(+), 7 deletions(-) >>> >>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c >>> index 8a469f70d5ee..1c655be92690 100644 >>> --- a/fs/btrfs/disk-io.c >>> +++ b/fs/btrfs/disk-io.c >>> @@ -4093,7 +4093,7 @@ void btrfs_mark_buffer_dirty(struct extent_buffer >>> *buf) >>> * enabled. Normal people shouldn't be marking dummy buffers as dirty >>> * outside of the sanity tests. >>> */ >>> - if (unlikely(test_bit(EXTENT_BUFFER_DUMMY, >bflags))) >>> + if (unlikely(test_bit(EXTENT_BUFFER_PRIVATE, >bflags))) >> >> This is going to be confusing. There's page Private bit, >> PAGE_SET_PRIVATE2 and EXTENT_PAGE_PRIVATE, that are somehow logically >> connected. >> >> I'd suggest EXTENT_BUFFER_CLONED or _UNMAPPED as it's created by >> btrfs_clone_extent_buffer or used in the disconnected way (ie. without >> the mapping). > > UNMAPPED looks good to me. > (Or ANONYMOUS?) I'm more leaning towards UNMAPPED at this point. 
> >> >>> return; >>> #endif >>> root = BTRFS_I(buf->pages[0]->mapping->host)->root; >>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c >>> index 6ac15804bab1..6611207e8e1f 100644 >>> --- a/fs/btrfs/extent_io.c >>> +++ b/fs/btrfs/extent_io.c >>> @@ -4642,7 +4642,7 @@ int extent_buffer_under_io(struct extent_buffer *eb) >>> static void btrfs_release_extent_buffer_page(struct extent_buffer *eb) >>> { >>> int i; >>> - int mapped = !test_bit(EXTENT_BUFFER_DUMMY, >bflags); >>> + int mapped = !test_bit(EXTENT_BUFFER_PRIVATE, >bflags); >>> >>> BUG_ON(extent_buffer_under_io(eb)); >>> >>> @@ -4755,7 +4755,7 @@ struct extent_buffer >>> *btrfs_clone_extent_buffer(struct extent_buffer *src) >>> } >>> >>> set_bit(EXTENT_BUFFER_UPTODATE, >bflags); >>> - set_bit(EXTENT_BUFFER_DUMMY, >bflags); >>> + set_bit(EXTENT_BUFFER_PRIVATE, >bflags); >>> >>> return new; >>> } >>> @@ -4780,7 +4780,7 @@ struct extent_buffer >>> *__alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info, >> >> Would be good to sync the new name with the helpers: >> __alloc_dummy_extent_buffer and alloc_dummy_extent_buffer then. 
>> >>> } >>> set_extent_buffer_uptodate(eb); >>> btrfs_set_header_nritems(eb, 0); >>> - set_bit(EXTENT_BUFFER_DUMMY, >bflags); >>> + set_bit(EXTENT_BUFFER_PRIVATE, >bflags); >>> >>> return eb; >>> err: >>> @@ -5086,7 +5086,7 @@ static int release_extent_buffer(struct extent_buffer >>> *eb) >>> /* Should be safe to release our pages at this point */ >>> btrfs_release_extent_buffer_page(eb); >>> #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS >>> - if (unlikely(test_bit(EXTENT_BUFFER_DUMMY, >bflags))) { >>> + if (unlikely(test_bit(EXTENT_BUFFER_PRIVATE, >bflags))) { >>> __free_extent_buffer(eb); >>> return 1; >>> } >>> @@ -5117,7 +5117,7 @@ void free_extent_buffer(struct extent_buffer *eb) >>> >>> spin_lock(>refs_lock); >>> if (atomic_read(>refs) == 2 && >>> - test_bit(EXTENT_BUFFER_DUMMY, >bflags)) >>> + test_bit(EXTENT_BUFFER_PRIVATE, >bflags)) >>> atomic_dec(>refs); > > Also discussed in off list mail, this extra atomic_dec for cloned eb > looks confusing. > (That also explains why after cloning the eb, we call > extent_buffer_get() and only frees it once, and still no eb leaking) > What about just removing such special handling? I think this special handling is not needed if we consider the fact that allocating a new eb already has ref1. In this case the code that allocated the buffer really "hands over" the reference when putting it on a btrfs_path struct. Let's take btrfs_search_old_slot as an example. It calls tree_mod_log_rewind which rewinds the passed in extent buffer by cloning it and doing an extra extent_buffer_get and "publishing" it to the given btrfs_path struct. But really this buffer should have only a single reference since it's only in the btrfs_path. Then this buffer should be released when btrfs_release_path is called on the path. Again, in btrfs_search_old_slot we have the same usage scenario in get_old_root. >
Re: [PATCH 3/4] btrfs: Rename EXTENT_BUFFER_DUMMY to EXTENT_BUFFER_PRIVATE
On 2018年06月29日 20:46, David Sterba wrote: > On Wed, Jun 27, 2018 at 04:38:24PM +0300, Nikolay Borisov wrote: >> EXTENT_BUFFER_DUMMY is an awful name for this flag. Buffers which have >> this flag set are not in any way dummy. Rather, they are private in >> the sense that are not linked to the global buffer tree. This flag has >> subtle implications to the way free_extent_buffer work for example, as >> well as controls whether page->mapping->private_lock is held during >> extent_buffer release. Pages for a private buffer cannot be under io, >> nor can they be written by a 3rd party so taking the lock is >> unnecessary. >> >> Signed-off-by: Nikolay Borisov >> --- >> fs/btrfs/disk-io.c | 2 +- >> fs/btrfs/extent_io.c | 10 +- >> fs/btrfs/extent_io.h | 2 +- >> 3 files changed, 7 insertions(+), 7 deletions(-) >> >> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c >> index 8a469f70d5ee..1c655be92690 100644 >> --- a/fs/btrfs/disk-io.c >> +++ b/fs/btrfs/disk-io.c >> @@ -4093,7 +4093,7 @@ void btrfs_mark_buffer_dirty(struct extent_buffer *buf) >> * enabled. Normal people shouldn't be marking dummy buffers as dirty >> * outside of the sanity tests. >> */ >> -if (unlikely(test_bit(EXTENT_BUFFER_DUMMY, >bflags))) >> +if (unlikely(test_bit(EXTENT_BUFFER_PRIVATE, >bflags))) > > This is going to be confusing. There's page Private bit, > PAGE_SET_PRIVATE2 and EXTENT_PAGE_PRIVATE, that are somehow logically > connected. > > I'd suggest EXTENT_BUFFER_CLONED or _UNMAPPED as it's created by > btrfs_clone_extent_buffer or used in the disconnected way (ie. without > the mapping). UNMAPPED looks good to me. (Or ANONYMOUS?) 
> >> return; >> #endif >> root = BTRFS_I(buf->pages[0]->mapping->host)->root; >> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c >> index 6ac15804bab1..6611207e8e1f 100644 >> --- a/fs/btrfs/extent_io.c >> +++ b/fs/btrfs/extent_io.c >> @@ -4642,7 +4642,7 @@ int extent_buffer_under_io(struct extent_buffer *eb) >> static void btrfs_release_extent_buffer_page(struct extent_buffer *eb) >> { >> int i; >> -int mapped = !test_bit(EXTENT_BUFFER_DUMMY, >bflags); >> +int mapped = !test_bit(EXTENT_BUFFER_PRIVATE, >bflags); >> >> BUG_ON(extent_buffer_under_io(eb)); >> >> @@ -4755,7 +4755,7 @@ struct extent_buffer *btrfs_clone_extent_buffer(struct >> extent_buffer *src) >> } >> >> set_bit(EXTENT_BUFFER_UPTODATE, >bflags); >> -set_bit(EXTENT_BUFFER_DUMMY, >bflags); >> +set_bit(EXTENT_BUFFER_PRIVATE, >bflags); >> >> return new; >> } >> @@ -4780,7 +4780,7 @@ struct extent_buffer >> *__alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info, > > Would be good to sync the new name with the helpers: > __alloc_dummy_extent_buffer and alloc_dummy_extent_buffer then. > >> } >> set_extent_buffer_uptodate(eb); >> btrfs_set_header_nritems(eb, 0); >> -set_bit(EXTENT_BUFFER_DUMMY, >bflags); >> +set_bit(EXTENT_BUFFER_PRIVATE, >bflags); >> >> return eb; >> err: >> @@ -5086,7 +5086,7 @@ static int release_extent_buffer(struct extent_buffer >> *eb) >> /* Should be safe to release our pages at this point */ >> btrfs_release_extent_buffer_page(eb); >> #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS >> -if (unlikely(test_bit(EXTENT_BUFFER_DUMMY, >bflags))) { >> +if (unlikely(test_bit(EXTENT_BUFFER_PRIVATE, >bflags))) { >> __free_extent_buffer(eb); >> return 1; >> } >> @@ -5117,7 +5117,7 @@ void free_extent_buffer(struct extent_buffer *eb) >> >> spin_lock(>refs_lock); >> if (atomic_read(>refs) == 2 && >> -test_bit(EXTENT_BUFFER_DUMMY, >bflags)) >> +test_bit(EXTENT_BUFFER_PRIVATE, >bflags)) >> atomic_dec(>refs); Also discussed in off list mail, this extra atomic_dec for cloned eb looks confusing. 
(That also explains why after cloning the eb, we call extent_buffer_get() and only frees it once, and still no eb leaking) What about just removing such special handling? Thanks, Qu >> >> if (atomic_read(>refs) == 2 && >> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h >> index 0bfd4aeb822d..bfccaec2c89a 100644 >> --- a/fs/btrfs/extent_io.h >> +++ b/fs/btrfs/extent_io.h >> @@ -46,7 +46,7 @@ >> #define EXTENT_BUFFER_STALE 6 >> #define EXTENT_BUFFER_WRITEBACK 7 >> #define EXTENT_BUFFER_READ_ERR 8/* read IO error */ >> -#define EXTENT_BUFFER_DUMMY 9 >> +#define EXTENT_BUFFER_PRIVATE 9 >> #define EXTENT_BUFFER_IN_TREE 10 >> #define EXTENT_BUFFER_WRITE_ERR 11/* write IO error */ >> >> -- >> 2.7.4
Re: call trace: WARNING: at /build/linux-uwVqDp/linux-4.16.16/fs/btrfs/ctree.h:1565 btrfs_update_device
On Fri, 2018-06-29 at 09:10 +0800, Qu Wenruo wrote: > Maybe it's the old mkfs causing the problem? > Although mkfs.btrfs added device size alignment much earlier than the > kernel, it's still possible that the old mkfs doesn't handle the > initial > device and extra devices (mkfs.btrfs will always create a temporary fs > on > the first device, then add all the other devices to the system) the > same > way. Well, who knows... at least now everything's fine again :-) Thanks guys! Chris.
lockdep splat between fsync and DIO
Running xfstests on today's misc-next revealed the following lockdep splat (but the machine didn't lock up):

[ 1477.192040] ==
[ 1477.192226] WARNING: possible circular locking dependency detected
[ 1477.192393] 4.18.0-rc1-nbor #755 Not tainted
[ 1477.192522] --
[ 1477.192686] fsstress/4314 is trying to acquire lock:
[ 1477.192827] 3e0774ac (sb_internal#2){.+.+}, at: start_transaction+0x2e8/0x4b0
[ 1477.193027] but task is already holding lock:
[ 1477.193191] ef79de77 (>dio_sem){}, at: btrfs_direct_IO+0x3bb/0x450
[ 1477.193395] which lock already depends on the new lock.
[ 1477.193589] the existing dependency chain (in reverse order) is:
[ 1477.193774] -> #2 (>dio_sem){}:
[ 1477.193928]  btrfs_log_changed_extents.isra.5+0x6e/0x9e0
[ 1477.194143]  btrfs_log_inode+0x96c/0xf10
[ 1477.194344]  btrfs_log_inode_parent+0x295/0xb10
[ 1477.194540]  btrfs_log_dentry_safe+0x4a/0x70
[ 1477.194743]  btrfs_sync_file+0x3eb/0x580
[ 1477.194928]  do_fsync+0x38/0x60
[ 1477.195114]  __x64_sys_fsync+0x10/0x20
[ 1477.195320]  do_syscall_64+0x5a/0x1a0
[ 1477.195496]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1477.195698] -> #1 (>log_mutex){+.+.}:
[ 1477.196001]  btrfs_record_unlink_dir+0x2a/0xa0
[ 1477.196225]  btrfs_unlink+0x5e/0xd0
[ 1477.196401]  vfs_unlink+0xc4/0x190
[ 1477.196602]  do_unlinkat+0x2ab/0x310
[ 1477.196774]  do_syscall_64+0x5a/0x1a0
[ 1477.197016]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1477.197230] -> #0 (sb_internal#2){.+.+}:
[ 1477.197480]  __sb_start_write+0x126/0x1a0
[ 1477.197662]  start_transaction+0x2e8/0x4b0
[ 1477.197845]  find_free_extent+0x10c1/0x1430
[ 1477.198027]  btrfs_reserve_extent+0x9b/0x180
[ 1477.198224]  btrfs_get_blocks_direct+0x38d/0x700
[ 1477.198419]  __blockdev_direct_IO+0xb2f/0x39c7
[ 1477.198611]  btrfs_direct_IO+0x190/0x450
[ 1477.198790]  generic_file_direct_write+0x9d/0x160
[ 1477.198986]  btrfs_file_write_iter+0x217/0x60b
[ 1477.199179]  aio_write+0x136/0x1f0
[ 1477.199359]  io_submit_one+0x584/0x6d0
[ 1477.199543]  __se_sys_io_submit+0x91/0x220
[ 1477.199789]  do_syscall_64+0x5a/0x1a0
[ 1477.199964]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1477.200163] other info that might help us debug this:
[ 1477.200522] Chain exists of: sb_internal#2 --> >log_mutex --> >dio_sem
[ 1477.200891] Possible unsafe locking scenario:
[ 1477.201137]   CPU0                  CPU1
[ 1477.201356]
[ 1477.201533]   lock(>dio_sem);
[ 1477.201687]                         lock(>log_mutex);
[ 1477.201887]                         lock(>dio_sem);
[ 1477.202088]   lock(sb_internal#2);
[ 1477.202290]  *** DEADLOCK ***
[ 1477.202581] 1 lock held by fsstress/4314:
[ 1477.202745]  #0: ef79de77 (>dio_sem){}, at: btrfs_direct_IO+0x3bb/0x450
[ 1477.203039] stack backtrace:
[ 1477.203312] CPU: 0 PID: 4314 Comm: fsstress Not tainted 4.18.0-rc1-nbor #755
[ 1477.203537] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 1477.203850] Call Trace:
[ 1477.203992]  dump_stack+0x85/0xcb
[ 1477.204151]  print_circular_bug.isra.19+0x1c8/0x2b0
[ 1477.204383]  __lock_acquire+0x182e/0x1900
[ 1477.204551]  ? lock_acquire+0xa3/0x210
[ 1477.204715]  lock_acquire+0xa3/0x210
[ 1477.204889]  ? start_transaction+0x2e8/0x4b0
[ 1477.205065]  __sb_start_write+0x126/0x1a0
[ 1477.205271]  ? start_transaction+0x2e8/0x4b0
[ 1477.205455]  start_transaction+0x2e8/0x4b0
[ 1477.205631]  find_free_extent+0x10c1/0x1430
[ 1477.205806]  btrfs_reserve_extent+0x9b/0x180
[ 1477.205980]  btrfs_get_blocks_direct+0x38d/0x700
[ 1477.206193]  __blockdev_direct_IO+0xb2f/0x39c7
[ 1477.206397]  ? __lock_acquire+0x2b6/0x1900
[ 1477.206571]  ? __lock_acquire+0x2b6/0x1900
[ 1477.206744]  ? can_nocow_extent+0x480/0x480
[ 1477.206921]  ? btrfs_run_delalloc_work+0x40/0x40
[ 1477.207107]  ? btrfs_direct_IO+0x190/0x450
[ 1477.207322]  btrfs_direct_IO+0x190/0x450
[ 1477.207492]  ? btrfs_run_delalloc_work+0x40/0x40
[ 1477.207678]  generic_file_direct_write+0x9d/0x160
[ 1477.207862]  btrfs_file_write_iter+0x217/0x60b
[ 1477.208044]  aio_write+0x136/0x1f0
[ 1477.208244]  ? lock_acquire+0xa3/0x210
[ 1477.208415]  ? __might_fault+0x3e/0x90
[ 1477.208580]  ? io_submit_one+0x584/0x6d0
[ 1477.208749]  io_submit_one+0x584/0x6d0
[ 1477.208917]  ? lock_acquire+0xa3/0x210
[ 1477.209086]  __se_sys_io_submit+0x91/0x220
[ 1477.209296]  ? do_syscall_64+0x5a/0x1a0
[ 1477.209468]  do_syscall_64+0x5a/0x1a0
[ 1477.209636]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1477.209826] RIP: 0033:0x7fa509cee697
[
Re: [PATCH 3/4] btrfs: Rename EXTENT_BUFFER_DUMMY to EXTENT_BUFFER_PRIVATE
On Wed, Jun 27, 2018 at 04:38:24PM +0300, Nikolay Borisov wrote: > EXTENT_BUFFER_DUMMY is an awful name for this flag. Buffers which have > this flag set are not in any way dummy. Rather, they are private in > the sense that are not linked to the global buffer tree. This flag has > subtle implications to the way free_extent_buffer work for example, as > well as controls whether page->mapping->private_lock is held during > extent_buffer release. Pages for a private buffer cannot be under io, > nor can they be written by a 3rd party so taking the lock is > unnecessary. > > Signed-off-by: Nikolay Borisov > --- > fs/btrfs/disk-io.c | 2 +- > fs/btrfs/extent_io.c | 10 +- > fs/btrfs/extent_io.h | 2 +- > 3 files changed, 7 insertions(+), 7 deletions(-) > > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c > index 8a469f70d5ee..1c655be92690 100644 > --- a/fs/btrfs/disk-io.c > +++ b/fs/btrfs/disk-io.c > @@ -4093,7 +4093,7 @@ void btrfs_mark_buffer_dirty(struct extent_buffer *buf) >* enabled. Normal people shouldn't be marking dummy buffers as dirty >* outside of the sanity tests. >*/ > - if (unlikely(test_bit(EXTENT_BUFFER_DUMMY, >bflags))) > + if (unlikely(test_bit(EXTENT_BUFFER_PRIVATE, >bflags))) This is going to be confusing. There's page Private bit, PAGE_SET_PRIVATE2 and EXTENT_PAGE_PRIVATE, that are somehow logically connected. I'd suggest EXTENT_BUFFER_CLONED or _UNMAPPED as it's created by btrfs_clone_extent_buffer or used in the disconnected way (ie. without the mapping). 
> return; > #endif > root = BTRFS_I(buf->pages[0]->mapping->host)->root; > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c > index 6ac15804bab1..6611207e8e1f 100644 > --- a/fs/btrfs/extent_io.c > +++ b/fs/btrfs/extent_io.c > @@ -4642,7 +4642,7 @@ int extent_buffer_under_io(struct extent_buffer *eb) > static void btrfs_release_extent_buffer_page(struct extent_buffer *eb) > { > int i; > - int mapped = !test_bit(EXTENT_BUFFER_DUMMY, >bflags); > + int mapped = !test_bit(EXTENT_BUFFER_PRIVATE, >bflags); > > BUG_ON(extent_buffer_under_io(eb)); > > @@ -4755,7 +4755,7 @@ struct extent_buffer *btrfs_clone_extent_buffer(struct > extent_buffer *src) > } > > set_bit(EXTENT_BUFFER_UPTODATE, >bflags); > - set_bit(EXTENT_BUFFER_DUMMY, >bflags); > + set_bit(EXTENT_BUFFER_PRIVATE, >bflags); > > return new; > } > @@ -4780,7 +4780,7 @@ struct extent_buffer > *__alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info, Would be good to sync the new name with the helpers: __alloc_dummy_extent_buffer and alloc_dummy_extent_buffer then. 
> } > set_extent_buffer_uptodate(eb); > btrfs_set_header_nritems(eb, 0); > - set_bit(EXTENT_BUFFER_DUMMY, >bflags); > + set_bit(EXTENT_BUFFER_PRIVATE, >bflags); > > return eb; > err: > @@ -5086,7 +5086,7 @@ static int release_extent_buffer(struct extent_buffer > *eb) > /* Should be safe to release our pages at this point */ > btrfs_release_extent_buffer_page(eb); > #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS > - if (unlikely(test_bit(EXTENT_BUFFER_DUMMY, >bflags))) { > + if (unlikely(test_bit(EXTENT_BUFFER_PRIVATE, >bflags))) { > __free_extent_buffer(eb); > return 1; > } > @@ -5117,7 +5117,7 @@ void free_extent_buffer(struct extent_buffer *eb) > > spin_lock(>refs_lock); > if (atomic_read(>refs) == 2 && > - test_bit(EXTENT_BUFFER_DUMMY, >bflags)) > + test_bit(EXTENT_BUFFER_PRIVATE, >bflags)) > atomic_dec(>refs); > > if (atomic_read(>refs) == 2 && > diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h > index 0bfd4aeb822d..bfccaec2c89a 100644 > --- a/fs/btrfs/extent_io.h > +++ b/fs/btrfs/extent_io.h > @@ -46,7 +46,7 @@ > #define EXTENT_BUFFER_STALE 6 > #define EXTENT_BUFFER_WRITEBACK 7 > #define EXTENT_BUFFER_READ_ERR 8/* read IO error */ > -#define EXTENT_BUFFER_DUMMY 9 > +#define EXTENT_BUFFER_PRIVATE 9 > #define EXTENT_BUFFER_IN_TREE 10 > #define EXTENT_BUFFER_WRITE_ERR 11/* write IO error */ > > -- > 2.7.4
Re: btrfs suddenly think's it's raid6
On 2018年06月29日 19:04, marble wrote: > Hello, > I have an external HDD. The HDD contains no partition. > I use the whole HDD as a LUKS container. Inside that LUKS is a btrfs. > It's used to store some media files. > The HDD was hooked up to a Raspberry Pi running up-to-date Arch Linux > to play music from the drive. > > After disconnecting the drive from the Pi and connecting it to my laptop > again, I couldn't mount it anymore. If I read the dmesg right, it now > thinks that it's part of a raid6. Don't panic, the raid6 module is always loaded by btrfs, no matter whether btrfs uses it or not. > > btrfs check --repair also didn't help. Normally you shouldn't try it anyway. > > ``` > marble@archlinux ~ % uname -a > Linux archlinux 4.17.2-1-ARCH #1 SMP PREEMPT Sat Jun 16 11:08:59 UTC > 2018 x86_64 GNU/Linux > > marble@archlinux ~ % btrfs --version > btrfs-progs v4.16.1 > > marble@archlinux ~ % sudo cryptsetup open /dev/sda black > [sudo] password for marble: > Enter passphrase for /dev/sda: > > marble@archlinux ~ % mkdir /tmp/black > marble@archlinux ~ % sudo mount /dev/mapper/black /tmp/black > mount: /tmp/black: can't read superblock on /dev/mapper/black. > > marble@archlinux ~ % sudo btrfs fi show > Label: 'black' uuid: 9fea91c7-7b0b-4ef9-a83b-e24ccf2586b5 > Total devices 1 FS bytes used 143.38GiB > devid1 size 465.76GiB used 172.02GiB path /dev/mapper/black > > marble@archlinux ~ % sudo btrfs check --repair /dev/mapper/black > enabling repair mode > Checking filesystem on /dev/mapper/black > UUID: 9fea91c7-7b0b-4ef9-a83b-e24ccf2586b5 > Fixed 0 roots. > checking extents > checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3 > checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3 > checksum verify failed on 1082114048 found 1A9EFC07 wanted 204A6979 > checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3 This means both copies for logical bytenr 1082114048 are corrupted.
> bytenr mismatch, want=1082114048, have=9385453728028469028 > owner ref check failed [1082114048 16384] > repair deleting extent record: key [1082114048,168,16384] You shouldn't really use --repair here. Your extent tree is corrupted; --repair could make things worse here. > adding new tree backref on start 1082114048 len 16384 parent 0 root 5 > Repaired extent references for 1082114048 > ref mismatch on [59038556160 4096] extent item 1, found 0 > checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3 > checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3 > checksum verify failed on 1082114048 found 1A9EFC07 wanted 204A6979 > checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3 > bytenr mismatch, want=1082114048, have=9385453728028469028 > incorrect local backref count on 59038556160 root 5 owner 334393 offset > 0 found 0 wanted 1 back 0x56348aee5de0 > backref disk bytenr does not match extent record, bytenr=59038556160, > ref bytenr=0 > backpointer mismatch on [59038556160 4096] > owner ref check failed [59038556160 4096] > checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3 > checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3 > checksum verify failed on 1082114048 found 1A9EFC07 wanted 204A6979 > checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3 > bytenr mismatch, want=1082114048, have=9385453728028469028 > failed to repair damaged filesystem, aborting > > marble@archlinux ~ % dmesg > /tmp/dmesg.log This is the root cause: [ 124.544439] BTRFS error (device dm-1): bad tree block start 9385453728028469028 1082114048 [ 124.555229] BTRFS error (device dm-1): bad tree block start 2365503423870651471 1082114048 [ 124.555275] BTRFS warning (device dm-1): failed to read fs tree: -5 Your fs tree is corrupted. I'm not sure how it got corrupted, but corruption of the fs tree is a little rare. Did anything like an unexpected power loss happen?
And would you provide the following data for debugging? # btrfs ins dump-super -fFa /dev/mapper/black Furthermore, what's the device-mapper setup for /dev/mapper/black? Is there anything like RAID here? Thanks, Qu > ``` > > Any clues? > > Best regards, > marble
Re: [PATCH 1/4] btrfs: Refactor loop in btrfs_release_extent_buffer_page
On Wed, Jun 27, 2018 at 04:38:22PM +0300, Nikolay Borisov wrote: > The purpose of the function is to free all the pages comprising an > extent buffer. This can be achieved with a simple for loop rather than > the slightly more involved 'do {} while' construct. So rewrite the > loop using a 'for' construct. Additionally we can never have an > extent_buffer that is 0 pages so remove the check for index == 0. No > functional changes. The reversed loop makes sense: the first page is special and used for locking the whole extent buffer's pages, as can be seen e.g. in commits 897ca6e9b4fef86d5dfb6b31 and 4f2de97acee6532b36dd6e99. Also see the current alloc_extent_buffer for the locking and management of the private bit. So you can still rewrite it as a for loop, but you would have to preserve the logic or provide the reason why it's correct to iterate from 0 to num_pages. There are some subtle races regarding pages and extents and their presence in various trees, so I'd rather be careful here. > Signed-off-by: Nikolay Borisov > --- > fs/btrfs/extent_io.c | 13 - > 1 file changed, 4 insertions(+), 9 deletions(-) > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c > index cce6087d6880..4180a3b7e725 100644 > --- a/fs/btrfs/extent_io.c > +++ b/fs/btrfs/extent_io.c > @@ -4641,19 +4641,14 @@ int extent_buffer_under_io(struct extent_buffer *eb) > */ > static void btrfs_release_extent_buffer_page(struct extent_buffer *eb) > { > - unsigned long index; > - struct page *page; > + int i; > int mapped = !test_bit(EXTENT_BUFFER_DUMMY, >bflags); > > BUG_ON(extent_buffer_under_io(eb)); > > - index = num_extent_pages(eb->start, eb->len); > - if (index == 0) > - return; This check does seem to be redundant. > + for (i = 0; i < num_extent_pages(eb->start, eb->len); i++) { I think that num_extent_pages(...) would be better put in a temporary variable so it's not evaluated each time the loop termination condition is checked.
> + struct page *page = eb->pages[i]; > > - do { > - index--; > - page = eb->pages[index]; > if (!page) > continue; > if (mapped)
Re: [PATCH 3/3] btrfs: fix race between mkfs and mount
On Tue, Jun 26, 2018 at 10:42:32PM +0800, Anand Jain wrote: > >>> Last version of the proposed fix is to extend the uuid_mutex over the > >>> whole mount callback and use it around calls to btrfs_scan_one_device. > >>> That way we'll be sure the mount will always get to the device it > >>> scanned and that will not be freed by the parallel scan. > >> > >> That will break the requisite #1 as above. > > > > I don't see how it breaks 'mount and or scan of two independent fsid+dev > > is possible'. It is possible, but the lock does not need to be > > filesystem local. > > > > Concurrent mount or scan will block on the uuid_mutex, > > As uuid_mutex is global, if fsid1 is being mounted, the fsid2 mount > will wait up to a certain extent with this code. Yes, it will wait a bit, but the critical section is short. There's some IO done and it's the reading of 4K in btrfs_read_disk_super. The rest is cpu-bound and hardly measurable in practice, in the context of concurrent mount and scanning. I took the approach to fix the bug first and then make it faster or cleaner, also to fix it in a way that's still acceptable for the current development cycle. > My effort here was to > use uuid_mutex only to protect the fs_uuid update part; in this way > there is concurrency in the mount process of fsid1 and fsid2 and > it provides a shorter bootup time when the user uses the mount at boot > option. The locking can still be improved, but the uuid_mutex is not a contended lock; mount is an operation that does not happen a thousand times a second, and the same goes for the scanning. So even if there are several mounts happening during boot, it's just a few and the delay is IMO unnoticeable. If the boot time is longer by say 100ms at worst, I'm still ok with my patches as a fix. But not as a final fix, so the updates to locking you mentioned are still possible. We now have a point of reference where syzbot is not able to reproduce the bugs.
Re: [PATCH v2] Revert "btrfs: fix a possible umount deadlock"
On Fri, Jun 29, 2018 at 10:13:44AM +0300, Nikolay Borisov wrote: > On 29.06.2018 10:11, Anand Jain wrote: > > > > On 06/29/2018 01:26 PM, Nikolay Borisov wrote: > >> Since commit 88c14590cdd6 ("btrfs: use RCU in btrfs_show_devname for > >> device list traversal") btrfs_show_devname no longer takes > >> device_list_mutex. As such the deadlock that 0ccd05285e7f ("btrfs: fix > >> a possible umount deadlock") aimed to fix no longer exists. So remove > >> the extra code that commit added. > >> > >> This reverts commit 0ccd05285e7f5a8e297e1d6dfc41e7c65757d6fa. > > > > It's not exactly a revert. > > > > Because before 88c14590cdd6 was written, 0ccd05285e7f was already there > > for the right purpose OR a new title is even better. > > What I'm saying is that commit 88c14590cdd6 obsoleted 0ccd05285e7f, > hence this patch reverts the latter. I've literally generated it with > git revert 0ccd05285e7f + minor fixups. I do not recommend doing a revert here. Even if a patch reverts functionality because it's not needed anymore, then it's a "forward" change. There are also other changes that may affect the behaviour, and in this case it's 47dba17171a76ea2a2a71 that removes the rcu barrier, so the patch has to be put into the context of the current code. The commit is almost 2 years old; the idea of reverts is IMHO more to provide an easy way to do a small step back during one development cycle when the moving parts are still in sight. Back then the commit fixed a deadlock, so a revert here would read as 'ok, we want the deadlock back'. So, the code is ok. The subject needs to drop the word 'revert' and the changelog should maybe mention a few more references why the logic is not needed anymore.
Re: btrfs suddenly thinks it's raid6
On 2018-06-29 07:04, marble wrote: Hello, I have an external HDD. The HDD contains no partition. I use the whole HDD as a LUKS container. Inside that LUKS is a btrfs. It's used to store some media files. The HDD was hooked up to a Raspberry Pi running up-to-date Arch Linux to play music from the drive. After disconnecting the drive from the Pi and connecting it to my laptop again, I couldn't mount it anymore. If I read the dmesg right, it now thinks that it's part of a raid6. btrfs check --repair also didn't help. ``` marble@archlinux ~ % uname -a Linux archlinux 4.17.2-1-ARCH #1 SMP PREEMPT Sat Jun 16 11:08:59 UTC 2018 x86_64 GNU/Linux marble@archlinux ~ % btrfs --version btrfs-progs v4.16.1 marble@archlinux ~ % sudo cryptsetup open /dev/sda black [sudo] password for marble: Enter passphrase for /dev/sda: marble@archlinux ~ % mkdir /tmp/black marble@archlinux ~ % sudo mount /dev/mapper/black /tmp/black mount: /tmp/black: can't read superblock on /dev/mapper/black. marble@archlinux ~ % sudo btrfs fi show Label: 'black' uuid: 9fea91c7-7b0b-4ef9-a83b-e24ccf2586b5 Total devices 1 FS bytes used 143.38GiB devid1 size 465.76GiB used 172.02GiB path /dev/mapper/black marble@archlinux ~ % sudo btrfs check --repair /dev/mapper/black enabling repair mode Checking filesystem on /dev/mapper/black UUID: 9fea91c7-7b0b-4ef9-a83b-e24ccf2586b5 Fixed 0 roots. 
checking extents checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3 checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3 checksum verify failed on 1082114048 found 1A9EFC07 wanted 204A6979 checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3 bytenr mismatch, want=1082114048, have=9385453728028469028 owner ref check failed [1082114048 16384] repair deleting extent record: key [1082114048,168,16384] adding new tree backref on start 1082114048 len 16384 parent 0 root 5 Repaired extent references for 1082114048 ref mismatch on [59038556160 4096] extent item 1, found 0 checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3 checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3 checksum verify failed on 1082114048 found 1A9EFC07 wanted 204A6979 checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3 bytenr mismatch, want=1082114048, have=9385453728028469028 incorrect local backref count on 59038556160 root 5 owner 334393 offset 0 found 0 wanted 1 back 0x56348aee5de0 backref disk bytenr does not match extent record, bytenr=59038556160, ref bytenr=0 backpointer mismatch on [59038556160 4096] owner ref check failed [59038556160 4096] checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3 checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3 checksum verify failed on 1082114048 found 1A9EFC07 wanted 204A6979 checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3 bytenr mismatch, want=1082114048, have=9385453728028469028 failed to repair damaged filesystem, aborting marble@archlinux ~ % dmesg > /tmp/dmesg.log ``` Any clues? It's not thinking it's a raid6 array. All the messages before this one: Btrfs loaded, crc32c=crc32c-intel Are completely unrelated to BTRFS (because anything before that message happened before any BTRFS code ran). The raid6 messages are actually from the startup code for the kernel's generic parity RAID implementation. 
These: BTRFS error (device dm-1): bad tree block start 9385453728028469028 1082114048 BTRFS error (device dm-1): bad tree block start 2365503423870651471 1082114048 Are the relevant error messages. Unfortunately, I don't really know what's wrong in this case though. Hopefully one of the developers will have some further insight.
btrfs suddenly thinks it's raid6
Hello, I have an external HDD. The HDD contains no partition. I use the whole HDD as a LUKS container. Inside that LUKS is a btrfs. It's used to store some media files. The HDD was hooked up to a Raspberry Pi running up-to-date Arch Linux to play music from the drive. After disconnecting the drive from the Pi and connecting it to my laptop again, I couldn't mount it anymore. If I read the dmesg right, it now thinks that it's part of a raid6. btrfs check --repair also didn't help. ``` marble@archlinux ~ % uname -a Linux archlinux 4.17.2-1-ARCH #1 SMP PREEMPT Sat Jun 16 11:08:59 UTC 2018 x86_64 GNU/Linux marble@archlinux ~ % btrfs --version btrfs-progs v4.16.1 marble@archlinux ~ % sudo cryptsetup open /dev/sda black [sudo] password for marble: Enter passphrase for /dev/sda: marble@archlinux ~ % mkdir /tmp/black marble@archlinux ~ % sudo mount /dev/mapper/black /tmp/black mount: /tmp/black: can't read superblock on /dev/mapper/black. marble@archlinux ~ % sudo btrfs fi show Label: 'black' uuid: 9fea91c7-7b0b-4ef9-a83b-e24ccf2586b5 Total devices 1 FS bytes used 143.38GiB devid1 size 465.76GiB used 172.02GiB path /dev/mapper/black marble@archlinux ~ % sudo btrfs check --repair /dev/mapper/black enabling repair mode Checking filesystem on /dev/mapper/black UUID: 9fea91c7-7b0b-4ef9-a83b-e24ccf2586b5 Fixed 0 roots. 
checking extents checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3 checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3 checksum verify failed on 1082114048 found 1A9EFC07 wanted 204A6979 checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3 bytenr mismatch, want=1082114048, have=9385453728028469028 owner ref check failed [1082114048 16384] repair deleting extent record: key [1082114048,168,16384] adding new tree backref on start 1082114048 len 16384 parent 0 root 5 Repaired extent references for 1082114048 ref mismatch on [59038556160 4096] extent item 1, found 0 checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3 checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3 checksum verify failed on 1082114048 found 1A9EFC07 wanted 204A6979 checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3 bytenr mismatch, want=1082114048, have=9385453728028469028 incorrect local backref count on 59038556160 root 5 owner 334393 offset 0 found 0 wanted 1 back 0x56348aee5de0 backref disk bytenr does not match extent record, bytenr=59038556160, ref bytenr=0 backpointer mismatch on [59038556160 4096] owner ref check failed [59038556160 4096] checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3 checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3 checksum verify failed on 1082114048 found 1A9EFC07 wanted 204A6979 checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3 bytenr mismatch, want=1082114048, have=9385453728028469028 failed to repair damaged filesystem, aborting marble@archlinux ~ % dmesg > /tmp/dmesg.log ``` Any clues? 
Best regards, marble [0.00] Linux version 4.17.2-1-ARCH (builduser@heftig-9574) (gcc version 8.1.1 20180531 (GCC)) #1 SMP PREEMPT Sat Jun 16 11:08:59 UTC 2018 [0.00] Command line: initrd=\initramfs-linux.img cryptdevice=PARTLABEL=root:root_luks root=/dev/mapper/root_luks quiet rw [0.00] KERNEL supported cpus: [0.00] Intel GenuineIntel [0.00] AMD AuthenticAMD [0.00] Centaur CentaurHauls [0.00] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers' [0.00] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers' [0.00] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers' [0.00] x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers' [0.00] x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR' [0.00] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256 [0.00] x86/fpu: xstate_offset[3]: 832, xstate_sizes[3]: 64 [0.00] x86/fpu: xstate_offset[4]: 896, xstate_sizes[4]: 64 [0.00] x86/fpu: Enabled xstate features 0x1f, context size is 960 bytes, using 'compacted' format. [0.00] e820: BIOS-provided physical RAM map: [0.00] BIOS-e820: [mem 0x-0x00057fff] usable [0.00] BIOS-e820: [mem 0x00058000-0x00058fff] reserved [0.00] BIOS-e820: [mem 0x00059000-0x0008bfff] usable [0.00] BIOS-e820: [mem 0x0008c000-0x000f] reserved [0.00] BIOS-e820: [mem 0x0010-0xb50dcfff] usable [0.00] BIOS-e820: [mem 0xb50dd000-0xb50ddfff] ACPI NVS [0.00] BIOS-e820: [mem 0xb50de000-0xb50defff] reserved [0.00] BIOS-e820: [mem 0xb50df000-0xbea16fff] usable [0.00] BIOS-e820: [mem 0xbea17000-0xbff2dfff] reserved [0.00] BIOS-e820: [mem 0xbff2e000-0xbff99fff] ACPI NVS [
Re: [PATCH] btrfs: use customized batch size for total_bytes_pinned
Nikolay Borisov wrote on 2018-06-29 16:43: On 29.06.2018 11:27, Ethan Lien wrote: We use percpu_counter to reduce lock contention of a frequently updated counter. But the default 32 batch size is suitable only for inc/dec updates, not for sector-size updates. The end result is we still acquire the spin lock for every percpu_counter_add(). So use a customized batch size for total_bytes_pinned as we have already done for dirty_metadata_bytes and delalloc_bytes. Also we fix dirty_metadata_bytes to use __percpu_counter_compare with the customized batch so we get a precise value of dirty_metadata_bytes. Do you have any numbers to prove this is problematic? Perhaps a fio workload + lockdep contention numbers before/after the patch? I think a contended path may come from heavy metadata COW and short data extents, where we call add_pinned_bytes() frequently in releasing old blocks. I'll see if I can find a test that shows the difference. Signed-off-by: Ethan Lien --- fs/btrfs/ctree.h | 1 + fs/btrfs/disk-io.c | 10 ++ fs/btrfs/extent-tree.c | 43 +++--- 3 files changed, 35 insertions(+), 19 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 118346aceea9..df682a521635 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -422,6 +422,7 @@ struct btrfs_space_info { * time the transaction commits. */ struct percpu_counter total_bytes_pinned; + s32 total_bytes_pinned_batch; struct list_head list; /* Protected by the spinlock 'lock'.
*/ diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 205092dc9390..dfed08e70ec1 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -961,8 +961,9 @@ static int btree_writepages(struct address_space *mapping, fs_info = BTRFS_I(mapping->host)->root->fs_info; /* this is a bit racy, but that's ok */ - ret = percpu_counter_compare(&fs_info->dirty_metadata_bytes, - BTRFS_DIRTY_METADATA_THRESH); + ret = __percpu_counter_compare(&fs_info->dirty_metadata_bytes, + BTRFS_DIRTY_METADATA_THRESH, + fs_info->dirty_metadata_batch); if (ret < 0) return 0; } @@ -4150,8 +4151,9 @@ static void __btrfs_btree_balance_dirty(struct btrfs_fs_info *fs_info, if (flush_delayed) btrfs_balance_delayed_items(fs_info); - ret = percpu_counter_compare(&fs_info->dirty_metadata_bytes, - BTRFS_DIRTY_METADATA_THRESH); + ret = __percpu_counter_compare(&fs_info->dirty_metadata_bytes, + BTRFS_DIRTY_METADATA_THRESH, + fs_info->dirty_metadata_batch); if (ret > 0) { balance_dirty_pages_ratelimited(fs_info->btree_inode->i_mapping); } This hunk needs to be in a separate patch. And the last sentence which relates to it needs to be expanded further to explain the problem. OK, I'll prepare a separate patch.
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 3d9fe58c0080..a067c047904c 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -758,7 +758,8 @@ static void add_pinned_bytes(struct btrfs_fs_info *fs_info, s64 num_bytes, space_info = __find_space_info(fs_info, flags); ASSERT(space_info); - percpu_counter_add(&space_info->total_bytes_pinned, num_bytes); + percpu_counter_add_batch(&space_info->total_bytes_pinned, num_bytes, + space_info->total_bytes_pinned_batch); } /* @@ -2598,8 +2599,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans, flags = BTRFS_BLOCK_GROUP_METADATA; space_info = __find_space_info(fs_info, flags); ASSERT(space_info); - percpu_counter_add(&space_info->total_bytes_pinned, - -head->num_bytes); + percpu_counter_add_batch(&space_info->total_bytes_pinned, + -head->num_bytes, + space_info->total_bytes_pinned_batch); if (head->is_data) { spin_lock(&delayed_refs->lock); @@ -4035,6 +4037,10 @@ static int create_space_info(struct btrfs_fs_info *info, u64 flags) INIT_LIST_HEAD(&space_info->ro_bgs); INIT_LIST_HEAD(&space_info->tickets); INIT_LIST_HEAD(&space_info->priority_tickets); + if (flags & BTRFS_BLOCK_GROUP_DATA) + space_info->total_bytes_pinned_batch = info->delalloc_batch; + else + space_info->total_bytes_pinned_batch = info->dirty_metadata_batch; ret = kobject_init_and_add(&space_info->kobj, &space_info_ktype, info->space_info_kobj, "%s", @@ -4309,9 +4315,10 @@ int btrfs_alloc_data_chunk_ondemand(struct
Re: [PATCH 05/14] btrfs: pass only eb to num_extent_pages
On 29.06.2018 11:56, David Sterba wrote: > Almost all callers pass the start and len as 2 arguments but this is not > necessary, all the information is provided by the eb. By reordering the > calls to num_extent_pages, we don't need the local variables with > start/len. > > Reviewed-by: Nikolay Borisov > Signed-off-by: David Sterba > --- > fs/btrfs/extent_io.c | 30 +++--- > fs/btrfs/extent_io.h | 6 +++--- > 2 files changed, 18 insertions(+), 18 deletions(-) > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c > index 8e4a7cdbc9f5..577bbb026042 100644 > --- a/fs/btrfs/extent_io.c > +++ b/fs/btrfs/extent_io.c > @@ -2062,7 +2062,7 @@ int repair_eb_io_failure(struct btrfs_fs_info *fs_info, > struct extent_buffer *eb, int mirror_num) > { > u64 start = eb->start; > - unsigned long i, num_pages = num_extent_pages(eb->start, eb->len); > + unsigned long i, num_pages = num_extent_pages(eb); > int ret = 0; > > if (sb_rdonly(fs_info->sb)) > @@ -3591,7 +3591,7 @@ lock_extent_buffer_for_io(struct extent_buffer *eb, > if (!ret) > return ret; > > - num_pages = num_extent_pages(eb->start, eb->len); > + num_pages = num_extent_pages(eb); > for (i = 0; i < num_pages; i++) { > struct page *p = eb->pages[i]; > > @@ -3721,7 +3721,7 @@ static noinline_for_stack int write_one_eb(struct > extent_buffer *eb, > int ret = 0; > > clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags); > - num_pages = num_extent_pages(eb->start, eb->len); > + num_pages = num_extent_pages(eb); > atomic_set(&eb->io_pages, num_pages); > > /* set btree blocks beyond nritems with 0 to avoid stale content.
*/ > @@ -4650,7 +4650,7 @@ static void btrfs_release_extent_buffer_page(struct > extent_buffer *eb) > > BUG_ON(extent_buffer_under_io(eb)); > > - index = num_extent_pages(eb->start, eb->len); > + index = num_extent_pages(eb); > if (index == 0) > return; > > @@ -4743,7 +4743,7 @@ struct extent_buffer *btrfs_clone_extent_buffer(struct > extent_buffer *src) > unsigned long i; > struct page *p; > struct extent_buffer *new; > - unsigned long num_pages = num_extent_pages(src->start, src->len); > + unsigned long num_pages = num_extent_pages(src); > > new = __alloc_extent_buffer(src->fs_info, src->start, src->len); > if (new == NULL) > @@ -4775,12 +4775,11 @@ struct extent_buffer > *__alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info, > unsigned long num_pages; > unsigned long i; > > - num_pages = num_extent_pages(start, len); > - > eb = __alloc_extent_buffer(fs_info, start, len); > if (!eb) > return NULL; > > + num_pages = num_extent_pages(eb); > for (i = 0; i < num_pages; i++) { > eb->pages[i] = alloc_page(GFP_NOFS); > if (!eb->pages[i]) > @@ -4844,7 +4843,7 @@ static void mark_extent_buffer_accessed(struct > extent_buffer *eb, > > check_buffer_tree_ref(eb); > > - num_pages = num_extent_pages(eb->start, eb->len); > + num_pages = num_extent_pages(eb); > for (i = 0; i < num_pages; i++) { > struct page *p = eb->pages[i]; > > @@ -4941,7 +4940,7 @@ struct extent_buffer *alloc_extent_buffer(struct > btrfs_fs_info *fs_info, > u64 start) > { > unsigned long len = fs_info->nodesize; > - unsigned long num_pages = num_extent_pages(start, len); > + unsigned long num_pages; > unsigned long i; > unsigned long index = start >> PAGE_SHIFT; > struct extent_buffer *eb; > @@ -4964,6 +4963,7 @@ struct extent_buffer *alloc_extent_buffer(struct > btrfs_fs_info *fs_info, > if (!eb) > return ERR_PTR(-ENOMEM); > > + num_pages = num_extent_pages(eb); > for (i = 0; i < num_pages; i++, index++) { > p = find_or_create_page(mapping, index, GFP_NOFS|__GFP_NOFAIL); > if (!p) { > @@ -5160,7 
+5160,7 @@ void clear_extent_buffer_dirty(struct extent_buffer *eb) > unsigned long num_pages; > struct page *page; > > - num_pages = num_extent_pages(eb->start, eb->len); > + num_pages = num_extent_pages(eb); > > for (i = 0; i < num_pages; i++) { > page = eb->pages[i]; > @@ -5194,7 +5194,7 @@ int set_extent_buffer_dirty(struct extent_buffer *eb) > > was_dirty = test_and_set_bit(EXTENT_BUFFER_DIRTY, &eb->bflags); > > - num_pages = num_extent_pages(eb->start, eb->len); > + num_pages = num_extent_pages(eb); > WARN_ON(atomic_read(&eb->refs) == 0); > WARN_ON(!test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)); > > @@ -5210,7 +5210,7 @@ void clear_extent_buffer_uptodate(struct extent_buffer > *eb) > unsigned long num_pages; > > clear_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags); > - num_pages = num_extent_pages(eb->start, eb->len); > +
[PATCH 13/14] btrfs: raid56: don't lock stripe cache table when freeing
This is called either at the end of the mount or if the mount fails. In both cases, there's nothing running that can use the table, so the lock is pointless. Signed-off-by: David Sterba --- fs/btrfs/raid56.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 272acd9b1192..0840b054e4b7 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -407,19 +407,16 @@ static void remove_rbio_from_cache(struct btrfs_raid_bio *rbio) static void btrfs_clear_rbio_cache(struct btrfs_fs_info *info) { struct btrfs_stripe_hash_table *table; - unsigned long flags; struct btrfs_raid_bio *rbio; table = info->stripe_hash_table; - spin_lock_irqsave(&table->cache_lock, flags); while (!list_empty(&table->stripe_cache)) { rbio = list_entry(table->stripe_cache.next, struct btrfs_raid_bio, stripe_cache); __remove_rbio_from_cache(rbio); } - spin_unlock_irqrestore(&table->cache_lock, flags); } /* -- 2.17.1
[PATCH 11/14] btrfs: raid56: use new helper for async_scrub_parity
Signed-off-by: David Sterba --- fs/btrfs/raid56.c | 14 +++--- 1 file changed, 3 insertions(+), 11 deletions(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index f9b349171d61..339cce0878d1 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -170,7 +170,7 @@ static int alloc_rbio_pages(struct btrfs_raid_bio *rbio); static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio, int need_check); -static void async_scrub_parity(struct btrfs_raid_bio *rbio); +static void scrub_parity_work(struct btrfs_work *work); static void start_async_work(struct btrfs_raid_bio *rbio, btrfs_func_t work_func) { @@ -812,7 +812,7 @@ static noinline void unlock_stripe(struct btrfs_raid_bio *rbio) start_async_work(next, rmw_work); } else if (next->operation == BTRFS_RBIO_PARITY_SCRUB) { steal_rbio(rbio, next); - async_scrub_parity(next); + start_async_work(next, scrub_parity_work); } goto done_nolock; @@ -2703,18 +2703,10 @@ static void scrub_parity_work(struct btrfs_work *work) raid56_parity_scrub_stripe(rbio); } -static void async_scrub_parity(struct btrfs_raid_bio *rbio) -{ - btrfs_init_work(&rbio->work, btrfs_rmw_helper, - scrub_parity_work, NULL, NULL); - - btrfs_queue_work(rbio->fs_info->rmw_workers, &rbio->work); -} - void raid56_parity_submit_scrub_rbio(struct btrfs_raid_bio *rbio) { if (!lock_stripe_add(rbio)) - async_scrub_parity(rbio); + start_async_work(rbio, scrub_parity_work); } /* The following code is used for dev replace of a missing RAID 5/6 device. */ -- 2.17.1
[PATCH 09/14] btrfs: raid56: use new helper for async_rmw_stripe
Signed-off-by: David Sterba --- fs/btrfs/raid56.c | 11 ++- 1 file changed, 2 insertions(+), 9 deletions(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index f30d847baf07..96a7d3445623 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -162,7 +162,6 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio); static noinline void finish_rmw(struct btrfs_raid_bio *rbio); static void rmw_work(struct btrfs_work *work); static void read_rebuild_work(struct btrfs_work *work); -static void async_rmw_stripe(struct btrfs_raid_bio *rbio); static void async_read_rebuild(struct btrfs_raid_bio *rbio); static int fail_bio_stripe(struct btrfs_raid_bio *rbio, struct bio *bio); static int fail_rbio_index(struct btrfs_raid_bio *rbio, int failed); @@ -811,7 +810,7 @@ static noinline void unlock_stripe(struct btrfs_raid_bio *rbio) async_read_rebuild(next); } else if (next->operation == BTRFS_RBIO_WRITE) { steal_rbio(rbio, next); - async_rmw_stripe(next); + start_async_work(next, rmw_work); } else if (next->operation == BTRFS_RBIO_PARITY_SCRUB) { steal_rbio(rbio, next); async_scrub_parity(next); @@ -1501,12 +1500,6 @@ static void raid_rmw_end_io(struct bio *bio) rbio_orig_end_io(rbio, BLK_STS_IOERR); } -static void async_rmw_stripe(struct btrfs_raid_bio *rbio) -{ - btrfs_init_work(&rbio->work, btrfs_rmw_helper, rmw_work, NULL, NULL); - btrfs_queue_work(rbio->fs_info->rmw_workers, &rbio->work); -} - static void async_read_rebuild(struct btrfs_raid_bio *rbio) { btrfs_init_work(&rbio->work, btrfs_rmw_helper, @@ -1645,7 +1638,7 @@ static int partial_stripe_write(struct btrfs_raid_bio *rbio) ret = lock_stripe_add(rbio); if (ret == 0) - async_rmw_stripe(rbio); + start_async_work(rbio, rmw_work); return 0; } -- 2.17.1
[PATCH 12/14] btrfs: raid56: merge rbio_is_full helpers
There's only one call site of the unlocked helper, so it can be folded into the caller. Signed-off-by: David Sterba --- fs/btrfs/raid56.c | 23 ++- 1 file changed, 6 insertions(+), 17 deletions(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 339cce0878d1..272acd9b1192 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -507,32 +507,21 @@ static void run_xor(void **pages, int src_cnt, ssize_t len) } /* - * returns true if the bio list inside this rbio - * covers an entire stripe (no rmw required). - * Must be called with the bio list lock held, or - * at a time when you know it is impossible to add - * new bios into the list + * Returns true if the bio list inside this rbio covers an entire stripe (no + * rmw required). */ -static int __rbio_is_full(struct btrfs_raid_bio *rbio) +static int rbio_is_full(struct btrfs_raid_bio *rbio) { + unsigned long flags; unsigned long size = rbio->bio_list_bytes; int ret = 1; + spin_lock_irqsave(&rbio->bio_list_lock, flags); if (size != rbio->nr_data * rbio->stripe_len) ret = 0; - BUG_ON(size > rbio->nr_data * rbio->stripe_len); - return ret; -} - -static int rbio_is_full(struct btrfs_raid_bio *rbio) -{ - unsigned long flags; - int ret; - - spin_lock_irqsave(&rbio->bio_list_lock, flags); - ret = __rbio_is_full(rbio); spin_unlock_irqrestore(&rbio->bio_list_lock, flags); + return ret; } -- 2.17.1
[PATCH 14/14] btrfs: raid56: catch errors from full_stripe_write
Add fall-back code to catch failure of full_stripe_write. Proper error handling from inside run_plug would need more code restructuring, as it's called at arbitrary points by the io scheduler. Signed-off-by: David Sterba --- fs/btrfs/raid56.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 0840b054e4b7..84889d10d5b0 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -1683,8 +1683,11 @@ static void run_plug(struct btrfs_plug_cb *plug) list_del_init(&cur->plug_list); if (rbio_is_full(cur)) { + int ret; + /* we have a full stripe, send it down */ - full_stripe_write(cur); + ret = full_stripe_write(cur); + BUG_ON(ret); continue; } if (last) { -- 2.17.1
[PATCH 07/14] btrfs: open-code bio_set_op_attrs
The helper is trivial and marked as deprecated. Signed-off-by: David Sterba --- fs/btrfs/check-integrity.c | 2 +- fs/btrfs/compression.c | 4 ++-- fs/btrfs/extent_io.c | 2 +- fs/btrfs/inode.c | 2 +- fs/btrfs/raid56.c | 10 +- fs/btrfs/scrub.c | 6 +++--- 6 files changed, 13 insertions(+), 13 deletions(-) diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c index dc062b195c46..7addfa41b393 100644 --- a/fs/btrfs/check-integrity.c +++ b/fs/btrfs/check-integrity.c @@ -1624,7 +1624,7 @@ static int btrfsic_read_block(struct btrfsic_state *state, bio = btrfs_io_bio_alloc(num_pages - i); bio_set_dev(bio, block_ctx->dev->bdev); bio->bi_iter.bi_sector = dev_bytenr >> 9; - bio_set_op_attrs(bio, REQ_OP_READ, 0); + bio->bi_opf = REQ_OP_READ; for (j = i; j < num_pages; j++) { ret = bio_add_page(bio, block_ctx->pagev[j], diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index f48794a36068..70dace47258b 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -609,7 +609,7 @@ blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio, cb->len = bio->bi_iter.bi_size; comp_bio = btrfs_bio_alloc(bdev, cur_disk_byte); - bio_set_op_attrs (comp_bio, REQ_OP_READ, 0); + comp_bio->bi_opf = REQ_OP_READ; comp_bio->bi_private = cb; comp_bio->bi_end_io = end_compressed_bio_read; refcount_set(&cb->pending_bios, 1); @@ -656,7 +656,7 @@ blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio, } comp_bio = btrfs_bio_alloc(bdev, cur_disk_byte); - bio_set_op_attrs(comp_bio, REQ_OP_READ, 0); + comp_bio->bi_opf = REQ_OP_READ; comp_bio->bi_private = cb; comp_bio->bi_end_io = end_compressed_bio_read; diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 94ca8d644de9..8d86531150cd 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2401,7 +2401,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset, start - page_offset(page), (int)phy_offset, failed_bio->bi_end_io, NULL); -
bio_set_op_attrs(bio, REQ_OP_READ, read_mode); + bio->bi_opf = REQ_OP_READ | read_mode; btrfs_debug(btrfs_sb(inode->i_sb), "Repair Read Error: submitting new read[%#x] to this_mirror=%d, in_validation=%d", diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 02e12e8bffc4..0d0df4fae4b0 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -7861,7 +7861,7 @@ static blk_status_t dio_read_error(struct inode *inode, struct bio *failed_bio, isector >>= inode->i_sb->s_blocksize_bits; bio = btrfs_create_repair_bio(inode, failed_bio, failrec, page, pgoff, isector, repair_endio, repair_arg); - bio_set_op_attrs(bio, REQ_OP_READ, read_mode); + bio->bi_opf = REQ_OP_READ | read_mode; btrfs_debug(BTRFS_I(inode)->root->fs_info, "repair DIO read error: submitting new dio read[%#x] to this_mirror=%d, in_validation=%d", diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 42631079c492..1a1b7d6c44cb 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -1330,7 +1330,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio) bio->bi_private = rbio; bio->bi_end_io = raid_write_end_io; - bio_set_op_attrs(bio, REQ_OP_WRITE, 0); + bio->bi_opf = REQ_OP_WRITE; submit_bio(bio); } @@ -1586,7 +1586,7 @@ static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio) bio->bi_private = rbio; bio->bi_end_io = raid_rmw_end_io; - bio_set_op_attrs(bio, REQ_OP_READ, 0); + bio->bi_opf = REQ_OP_READ; btrfs_bio_wq_end_io(rbio->fs_info, bio, BTRFS_WQ_ENDIO_RAID56); @@ -2130,7 +2130,7 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio) bio->bi_private = rbio; bio->bi_end_io = raid_recover_end_io; - bio_set_op_attrs(bio, REQ_OP_READ, 0); + bio->bi_opf = REQ_OP_READ; btrfs_bio_wq_end_io(rbio->fs_info, bio, BTRFS_WQ_ENDIO_RAID56); @@ -2502,7 +2502,7 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio, bio->bi_private = rbio; bio->bi_end_io = raid_write_end_io; - bio_set_op_attrs(bio, REQ_OP_WRITE, 0); + bio->bi_opf = REQ_OP_WRITE; submit_bio(bio); } @@ -2684,7 
+2684,7 @@ static void
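For reference, the helper being open-coded amounts to a single bitwise-or assignment to `bi_opf`. A minimal userspace sketch (toy struct and flag values, not the kernel's definitions):

```c
#include <assert.h>

/* Illustrative stand-ins only: the real struct bio and REQ_* values
 * live in the block layer headers. */
struct bio { unsigned int bi_opf; };

#define REQ_OP_READ      0u
#define REQ_OP_WRITE     1u
#define REQ_FAILFAST_DEV (1u << 8)   /* toy value for a read_mode flag */

/* What the deprecated bio_set_op_attrs() boiled down to, and what the
 * patch now writes directly at each call site. */
static void bio_set_op_attrs_sketch(struct bio *bio, unsigned int op,
                                    unsigned int op_flags)
{
	bio->bi_opf = op | op_flags;
}
```

So `bio_set_op_attrs(bio, REQ_OP_READ, read_mode)` and `bio->bi_opf = REQ_OP_READ | read_mode` are equivalent, which is why the conversion is safe.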
[PATCH 06/14] btrfs: switch types to int when counting eb pages
The loops iterating eb pages use unsigned long, that's an overkill as we know that there are at most 16 pages (64k / 4k), and 4 by default (with nodesize 16k). Reviewed-by: Nikolay Borisov Signed-off-by: David Sterba --- fs/btrfs/extent_io.c | 44 ++-- fs/btrfs/extent_io.h | 2 +- 2 files changed, 23 insertions(+), 23 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 577bbb026042..94ca8d644de9 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2062,7 +2062,7 @@ int repair_eb_io_failure(struct btrfs_fs_info *fs_info, struct extent_buffer *eb, int mirror_num) { u64 start = eb->start; - unsigned long i, num_pages = num_extent_pages(eb); + int i, num_pages = num_extent_pages(eb); int ret = 0; if (sb_rdonly(fs_info->sb)) @@ -3541,7 +3541,7 @@ lock_extent_buffer_for_io(struct extent_buffer *eb, struct btrfs_fs_info *fs_info, struct extent_page_data *epd) { - unsigned long i, num_pages; + int i, num_pages; int flush = 0; int ret = 0; @@ -3715,7 +3715,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb, struct extent_io_tree *tree = _I(fs_info->btree_inode)->io_tree; u64 offset = eb->start; u32 nritems; - unsigned long i, num_pages; + int i, num_pages; unsigned long start, end; unsigned int write_flags = wbc_to_write_flags(wbc) | REQ_META; int ret = 0; @@ -4644,7 +4644,7 @@ int extent_buffer_under_io(struct extent_buffer *eb) */ static void btrfs_release_extent_buffer_page(struct extent_buffer *eb) { - unsigned long index; + int index; struct page *page; int mapped = !test_bit(EXTENT_BUFFER_DUMMY, >bflags); @@ -4740,10 +4740,10 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start, struct extent_buffer *btrfs_clone_extent_buffer(struct extent_buffer *src) { - unsigned long i; + int i; struct page *p; struct extent_buffer *new; - unsigned long num_pages = num_extent_pages(src); + int num_pages = num_extent_pages(src); new = __alloc_extent_buffer(src->fs_info, src->start, src->len); if (new == NULL) @@ 
-4772,8 +4772,8 @@ struct extent_buffer *__alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info, u64 start, unsigned long len) { struct extent_buffer *eb; - unsigned long num_pages; - unsigned long i; + int num_pages; + int i; eb = __alloc_extent_buffer(fs_info, start, len); if (!eb) @@ -4839,7 +4839,7 @@ static void check_buffer_tree_ref(struct extent_buffer *eb) static void mark_extent_buffer_accessed(struct extent_buffer *eb, struct page *accessed) { - unsigned long num_pages, i; + int num_pages, i; check_buffer_tree_ref(eb); @@ -4940,8 +4940,8 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start) { unsigned long len = fs_info->nodesize; - unsigned long num_pages; - unsigned long i; + int num_pages; + int i; unsigned long index = start >> PAGE_SHIFT; struct extent_buffer *eb; struct extent_buffer *exists = NULL; @@ -5156,8 +5156,8 @@ void free_extent_buffer_stale(struct extent_buffer *eb) void clear_extent_buffer_dirty(struct extent_buffer *eb) { - unsigned long i; - unsigned long num_pages; + int i; + int num_pages; struct page *page; num_pages = num_extent_pages(eb); @@ -5186,8 +5186,8 @@ void clear_extent_buffer_dirty(struct extent_buffer *eb) int set_extent_buffer_dirty(struct extent_buffer *eb) { - unsigned long i; - unsigned long num_pages; + int i; + int num_pages; int was_dirty = 0; check_buffer_tree_ref(eb); @@ -5205,9 +5205,9 @@ int set_extent_buffer_dirty(struct extent_buffer *eb) void clear_extent_buffer_uptodate(struct extent_buffer *eb) { - unsigned long i; + int i; struct page *page; - unsigned long num_pages; + int num_pages; clear_bit(EXTENT_BUFFER_UPTODATE, >bflags); num_pages = num_extent_pages(eb); @@ -5220,9 +5220,9 @@ void clear_extent_buffer_uptodate(struct extent_buffer *eb) void set_extent_buffer_uptodate(struct extent_buffer *eb) { - unsigned long i; + int i; struct page *page; - unsigned long num_pages; + int num_pages; set_bit(EXTENT_BUFFER_UPTODATE, >bflags); num_pages = num_extent_pages(eb); @@ 
-5235,13 +5235,13 @@ void set_extent_buffer_uptodate(struct extent_buffer *eb) int read_extent_buffer_pages(struct extent_io_tree *tree, struct extent_buffer *eb, int wait,
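The bound the commit message relies on is easy to check: mirroring the `num_extent_pages()` arithmetic with an assumed 4K `PAGE_SIZE`, a 64K buffer (the maximum nodesize) never needs more than 16 pages, so `int` is plenty. Names here are illustrative, not kernel code:

```c
#include <assert.h>

#define PAGE_SHIFT 12                 /* assumed 4K pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Same arithmetic as the kernel's num_extent_pages(): number of pages
 * the byte range [start, start + len) touches. With len <= 64K the
 * result is at most 16, comfortably within int. */
static int eb_num_pages(unsigned long start, unsigned long len)
{
	return (int)(((start + len + PAGE_SIZE - 1) >> PAGE_SHIFT) -
	             (start >> PAGE_SHIFT));
}
```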
[PATCH 05/14] btrfs: pass only eb to num_extent_pages
Almost all callers pass the start and len as 2 arguments but this is not necessary, all the information is provided by the eb. By reordering the calls to num_extent_pages, we don't need the local variables with start/len. Reviewed-by: Nikolay Borisov Signed-off-by: David Sterba --- fs/btrfs/extent_io.c | 30 +++--- fs/btrfs/extent_io.h | 6 +++--- 2 files changed, 18 insertions(+), 18 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 8e4a7cdbc9f5..577bbb026042 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2062,7 +2062,7 @@ int repair_eb_io_failure(struct btrfs_fs_info *fs_info, struct extent_buffer *eb, int mirror_num) { u64 start = eb->start; - unsigned long i, num_pages = num_extent_pages(eb->start, eb->len); + unsigned long i, num_pages = num_extent_pages(eb); int ret = 0; if (sb_rdonly(fs_info->sb)) @@ -3591,7 +3591,7 @@ lock_extent_buffer_for_io(struct extent_buffer *eb, if (!ret) return ret; - num_pages = num_extent_pages(eb->start, eb->len); + num_pages = num_extent_pages(eb); for (i = 0; i < num_pages; i++) { struct page *p = eb->pages[i]; @@ -3721,7 +3721,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb, int ret = 0; clear_bit(EXTENT_BUFFER_WRITE_ERR, >bflags); - num_pages = num_extent_pages(eb->start, eb->len); + num_pages = num_extent_pages(eb); atomic_set(>io_pages, num_pages); /* set btree blocks beyond nritems with 0 to avoid stale content. 
*/ @@ -4650,7 +4650,7 @@ static void btrfs_release_extent_buffer_page(struct extent_buffer *eb) BUG_ON(extent_buffer_under_io(eb)); - index = num_extent_pages(eb->start, eb->len); + index = num_extent_pages(eb); if (index == 0) return; @@ -4743,7 +4743,7 @@ struct extent_buffer *btrfs_clone_extent_buffer(struct extent_buffer *src) unsigned long i; struct page *p; struct extent_buffer *new; - unsigned long num_pages = num_extent_pages(src->start, src->len); + unsigned long num_pages = num_extent_pages(src); new = __alloc_extent_buffer(src->fs_info, src->start, src->len); if (new == NULL) @@ -4775,12 +4775,11 @@ struct extent_buffer *__alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info, unsigned long num_pages; unsigned long i; - num_pages = num_extent_pages(start, len); - eb = __alloc_extent_buffer(fs_info, start, len); if (!eb) return NULL; + num_pages = num_extent_pages(eb); for (i = 0; i < num_pages; i++) { eb->pages[i] = alloc_page(GFP_NOFS); if (!eb->pages[i]) @@ -4844,7 +4843,7 @@ static void mark_extent_buffer_accessed(struct extent_buffer *eb, check_buffer_tree_ref(eb); - num_pages = num_extent_pages(eb->start, eb->len); + num_pages = num_extent_pages(eb); for (i = 0; i < num_pages; i++) { struct page *p = eb->pages[i]; @@ -4941,7 +4940,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start) { unsigned long len = fs_info->nodesize; - unsigned long num_pages = num_extent_pages(start, len); + unsigned long num_pages; unsigned long i; unsigned long index = start >> PAGE_SHIFT; struct extent_buffer *eb; @@ -4964,6 +4963,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info, if (!eb) return ERR_PTR(-ENOMEM); + num_pages = num_extent_pages(eb); for (i = 0; i < num_pages; i++, index++) { p = find_or_create_page(mapping, index, GFP_NOFS|__GFP_NOFAIL); if (!p) { @@ -5160,7 +5160,7 @@ void clear_extent_buffer_dirty(struct extent_buffer *eb) unsigned long num_pages; struct page *page; - num_pages = 
num_extent_pages(eb->start, eb->len); + num_pages = num_extent_pages(eb); for (i = 0; i < num_pages; i++) { page = eb->pages[i]; @@ -5194,7 +5194,7 @@ int set_extent_buffer_dirty(struct extent_buffer *eb) was_dirty = test_and_set_bit(EXTENT_BUFFER_DIRTY, >bflags); - num_pages = num_extent_pages(eb->start, eb->len); + num_pages = num_extent_pages(eb); WARN_ON(atomic_read(>refs) == 0); WARN_ON(!test_bit(EXTENT_BUFFER_TREE_REF, >bflags)); @@ -5210,7 +5210,7 @@ void clear_extent_buffer_uptodate(struct extent_buffer *eb) unsigned long num_pages; clear_bit(EXTENT_BUFFER_UPTODATE, >bflags); - num_pages = num_extent_pages(eb->start, eb->len); + num_pages = num_extent_pages(eb); for (i = 0; i < num_pages; i++) { page = eb->pages[i]; if (page) @@ -5225,7 +5225,7
[PATCH 08/14] btrfs: raid56: add new helper for starting async work
Add helper that schedules a given function to run on the rmw workqueue. This will replace several standalone helpers.

Signed-off-by: David Sterba
---
 fs/btrfs/raid56.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 1a1b7d6c44cb..f30d847baf07 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -174,6 +174,12 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
 					 int need_check);
 static void async_scrub_parity(struct btrfs_raid_bio *rbio);
 
+static void start_async_work(struct btrfs_raid_bio *rbio, btrfs_func_t work_func)
+{
+	btrfs_init_work(&rbio->work, btrfs_rmw_helper, work_func, NULL, NULL);
+	btrfs_queue_work(rbio->fs_info->rmw_workers, &rbio->work);
+}
+
 /*
  * the stripe hash table is used for locking, and to collect
  * bios in hopes of making a full stripe
-- 
2.17.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
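The two-call pattern the helper wraps can be sketched in userspace (illustrative types; a real workqueue defers execution, simulated here by running the function inline). Each of the old `async_*` helpers repeated exactly these two steps:

```c
#include <assert.h>

typedef void (*work_func_t)(void *data);

struct work_item {
	work_func_t func;
	void *data;
};

/* Stand-in for btrfs_queue_work(): a real queue would defer this to a
 * worker thread instead of calling it inline. */
static void queue_work_sketch(struct work_item *w)
{
	w->func(w->data);
}

/* The shape of start_async_work(): bind the function to the item
 * (btrfs_init_work() equivalent), then queue it. */
static void start_async_work(struct work_item *w, work_func_t func, void *data)
{
	w->func = func;
	w->data = data;
	queue_work_sketch(w);
}

/* Sample work function for demonstration. */
static void bump(void *data)
{
	++*(int *)data;
}
```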
[PATCH 10/14] btrfs: raid56: use new helper for async_read_rebuild
Signed-off-by: David Sterba --- fs/btrfs/raid56.c | 15 +++ 1 file changed, 3 insertions(+), 12 deletions(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 96a7d3445623..f9b349171d61 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -162,7 +162,6 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio); static noinline void finish_rmw(struct btrfs_raid_bio *rbio); static void rmw_work(struct btrfs_work *work); static void read_rebuild_work(struct btrfs_work *work); -static void async_read_rebuild(struct btrfs_raid_bio *rbio); static int fail_bio_stripe(struct btrfs_raid_bio *rbio, struct bio *bio); static int fail_rbio_index(struct btrfs_raid_bio *rbio, int failed); static void __free_raid_bio(struct btrfs_raid_bio *rbio); @@ -804,10 +803,10 @@ static noinline void unlock_stripe(struct btrfs_raid_bio *rbio) spin_unlock_irqrestore(>lock, flags); if (next->operation == BTRFS_RBIO_READ_REBUILD) - async_read_rebuild(next); + start_async_work(next, read_rebuild_work); else if (next->operation == BTRFS_RBIO_REBUILD_MISSING) { steal_rbio(rbio, next); - async_read_rebuild(next); + start_async_work(next, read_rebuild_work); } else if (next->operation == BTRFS_RBIO_WRITE) { steal_rbio(rbio, next); start_async_work(next, rmw_work); @@ -1500,14 +1499,6 @@ static void raid_rmw_end_io(struct bio *bio) rbio_orig_end_io(rbio, BLK_STS_IOERR); } -static void async_read_rebuild(struct btrfs_raid_bio *rbio) -{ - btrfs_init_work(>work, btrfs_rmw_helper, - read_rebuild_work, NULL, NULL); - - btrfs_queue_work(rbio->fs_info->rmw_workers, >work); -} - /* * the stripe must be locked by the caller. 
It will
 * unlock after all the writes are done
 */
@@ -2765,5 +2756,5 @@ raid56_alloc_missing_rbio(struct btrfs_fs_info *fs_info, struct bio *bio,
 void raid56_submit_missing_rbio(struct btrfs_raid_bio *rbio)
 {
 	if (!lock_stripe_add(rbio))
-		async_read_rebuild(rbio);
+		start_async_work(rbio, read_rebuild_work);
 }
-- 
2.17.1
[PATCH 04/14] btrfs: prune unused includes
Remove includes if none of the interfaces and exports is used in the given source file. Signed-off-by: David Sterba --- fs/btrfs/compression.c | 4 fs/btrfs/dev-replace.c | 5 - fs/btrfs/disk-io.c | 2 -- fs/btrfs/file.c | 3 --- fs/btrfs/inode-map.c| 1 - fs/btrfs/inode.c| 4 fs/btrfs/ioctl.c| 5 - fs/btrfs/ordered-data.c | 1 - fs/btrfs/raid56.c | 13 - fs/btrfs/reada.c| 1 - fs/btrfs/struct-funcs.c | 1 - fs/btrfs/super.c| 3 --- fs/btrfs/sysfs.c| 2 -- fs/btrfs/volumes.c | 3 --- 14 files changed, 48 deletions(-) diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index d3e447b45bf7..f48794a36068 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -5,7 +5,6 @@ #include #include -#include #include #include #include @@ -14,10 +13,7 @@ #include #include #include -#include -#include #include -#include #include #include #include diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c index e2ba0419297a..819440204fd5 100644 --- a/fs/btrfs/dev-replace.c +++ b/fs/btrfs/dev-replace.c @@ -6,14 +6,9 @@ #include #include #include -#include #include -#include -#include -#include #include #include -#include #include "ctree.h" #include "extent_map.h" #include "disk-io.h" diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 7ee825b96187..efdfbd334b85 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -5,8 +5,6 @@ #include #include -#include -#include #include #include #include diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 11e43e69d12f..9f11ce754f75 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -5,14 +5,11 @@ #include #include -#include #include #include #include #include -#include #include -#include #include #include #include diff --git a/fs/btrfs/inode-map.c b/fs/btrfs/inode-map.c index 12fcd8897c33..c120b8e25a3e 100644 --- a/fs/btrfs/inode-map.c +++ b/fs/btrfs/inode-map.c @@ -3,7 +3,6 @@ * Copyright (C) 2007 Oracle. All rights reserved. 
*/ -#include #include #include diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index f30608dc038b..02e12e8bffc4 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -14,17 +14,13 @@ #include #include #include -#include -#include #include #include -#include #include #include #include #include #include -#include #include #include #include diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 5556e9ea2a4b..a78a2c8cbd09 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -5,23 +5,18 @@ #include #include -#include #include #include #include #include #include #include -#include #include #include #include -#include #include -#include #include #include -#include #include #include #include diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c index 2e1a1694a33d..6055d357ce4c 100644 --- a/fs/btrfs/ordered-data.c +++ b/fs/btrfs/ordered-data.c @@ -6,7 +6,6 @@ #include #include #include -#include #include "ctree.h" #include "transaction.h" #include "btrfs_inode.h" diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 27ed47a23f26..42631079c492 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -5,32 +5,19 @@ */ #include -#include #include #include -#include #include -#include -#include -#include -#include -#include #include #include #include #include #include -#include #include "ctree.h" -#include "extent_map.h" #include "disk-io.h" -#include "transaction.h" -#include "print-tree.h" #include "volumes.h" #include "raid56.h" #include "async-thread.h" -#include "check-integrity.h" -#include "rcu-string.h" /* set when additional merges to this rbio are not allowed */ #define RBIO_RMW_LOCKED_BIT1 diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c index 40f1bcef394d..83d49f2e7bc6 100644 --- a/fs/btrfs/reada.c +++ b/fs/btrfs/reada.c @@ -7,7 +7,6 @@ #include #include #include -#include #include #include #include "ctree.h" diff --git a/fs/btrfs/struct-funcs.c b/fs/btrfs/struct-funcs.c index b7b4acb12833..4c13b737f568 100644 --- a/fs/btrfs/struct-funcs.c +++ 
b/fs/btrfs/struct-funcs.c @@ -3,7 +3,6 @@ * Copyright (C) 2007 Oracle. All rights reserved. */ -#include #include #include "ctree.h" diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 81107ad49f3a..a264bea21c3a 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -5,7 +5,6 @@ #include #include -#include #include #include #include @@ -15,8 +14,6 @@ #include #include #include -#include -#include #include #include #include diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c index 4a4e960c7c66..3717c864ba23 100644 --- a/fs/btrfs/sysfs.c +++ b/fs/btrfs/sysfs.c @@ -7,10 +7,8 @@ #include #include #include -#include #include #include -#include #include #include
[PATCH 01/14] btrfs: simplify some assignments of inode numbers
There are several places when the btrfs inode is converted to the generic inode, back to btrfs and then passed to btrfs_ino. We can remove the extra back and forth conversions. Signed-off-by: David Sterba --- fs/btrfs/inode.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index c12b7a6e534a..a27e68405aa3 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -4243,9 +4243,9 @@ static void btrfs_prune_dentries(struct btrfs_root *root) prev = node; entry = rb_entry(node, struct btrfs_inode, rb_node); - if (objectid < btrfs_ino(BTRFS_I(>vfs_inode))) + if (objectid < btrfs_ino(entry)) node = node->rb_left; - else if (objectid > btrfs_ino(BTRFS_I(>vfs_inode))) + else if (objectid > btrfs_ino(entry)) node = node->rb_right; else break; @@ -4253,7 +4253,7 @@ static void btrfs_prune_dentries(struct btrfs_root *root) if (!node) { while (prev) { entry = rb_entry(prev, struct btrfs_inode, rb_node); - if (objectid <= btrfs_ino(BTRFS_I(>vfs_inode))) { + if (objectid <= btrfs_ino(entry)) { node = prev; break; } @@ -4262,7 +4262,7 @@ static void btrfs_prune_dentries(struct btrfs_root *root) } while (node) { entry = rb_entry(node, struct btrfs_inode, rb_node); - objectid = btrfs_ino(BTRFS_I(>vfs_inode)) + 1; + objectid = btrfs_ino(entry) + 1; inode = igrab(>vfs_inode); if (inode) { spin_unlock(>inode_lock); @@ -5615,9 +5615,9 @@ static void inode_tree_add(struct inode *inode) parent = *p; entry = rb_entry(parent, struct btrfs_inode, rb_node); - if (ino < btrfs_ino(BTRFS_I(>vfs_inode))) + if (ino < btrfs_ino(entry)) p = >rb_left; - else if (ino > btrfs_ino(BTRFS_I(>vfs_inode))) + else if (ino > btrfs_ino(entry)) p = >rb_right; else { WARN_ON(!(entry->vfs_inode.i_state & -- 2.17.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
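The round trip being deleted can be sketched in userspace: when `entry` already has type `struct btrfs_inode`, converting to the embedded VFS inode and back through `BTRFS_I()` (a `container_of`) returns the original pointer. Stand-in types below, not the kernel's:

```c
#include <assert.h>
#include <stddef.h>

struct inode { unsigned long i_ino; };

struct btrfs_inode {
	struct inode vfs_inode;           /* embedded VFS inode */
	unsigned long location_objectid;  /* stands in for the real key */
};

/* container_of(vfs, struct btrfs_inode, vfs_inode), like the kernel's
 * BTRFS_I() macro. */
#define BTRFS_I(vfs) \
	((struct btrfs_inode *)((char *)(vfs) - offsetof(struct btrfs_inode, vfs_inode)))

static unsigned long btrfs_ino(const struct btrfs_inode *inode)
{
	return inode->location_objectid;
}
```

Since `BTRFS_I(&entry->vfs_inode) == entry`, writing `btrfs_ino(entry)` directly is a pure simplification with no behavior change.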
[PATCH 03/14] btrfs: use copy_page for copying pages instead of memcpy
Use the helper that's possibly optimized for full page copies. Signed-off-by: David Sterba --- fs/btrfs/free-space-cache.c | 4 ++-- fs/btrfs/raid56.c | 12 +--- 2 files changed, 7 insertions(+), 9 deletions(-) diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c index 38ad5e9f0bf2..15732be6bd56 100644 --- a/fs/btrfs/free-space-cache.c +++ b/fs/btrfs/free-space-cache.c @@ -547,7 +547,7 @@ static int io_ctl_add_bitmap(struct btrfs_io_ctl *io_ctl, void *bitmap) io_ctl_map_page(io_ctl, 0); } - memcpy(io_ctl->cur, bitmap, PAGE_SIZE); + copy_page(io_ctl->cur, bitmap); io_ctl_set_crc(io_ctl, io_ctl->index - 1); if (io_ctl->index < io_ctl->num_pages) io_ctl_map_page(io_ctl, 0); @@ -607,7 +607,7 @@ static int io_ctl_read_bitmap(struct btrfs_io_ctl *io_ctl, if (ret) return ret; - memcpy(entry->bitmap, io_ctl->cur, PAGE_SIZE); + copy_page(entry->bitmap, io_ctl->cur); io_ctl_unmap_page(io_ctl); return 0; diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 5e4ad134b9ad..27ed47a23f26 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -260,7 +260,7 @@ static void cache_rbio_pages(struct btrfs_raid_bio *rbio) s = kmap(rbio->bio_pages[i]); d = kmap(rbio->stripe_pages[i]); - memcpy(d, s, PAGE_SIZE); + copy_page(d, s); kunmap(rbio->bio_pages[i]); kunmap(rbio->stripe_pages[i]); @@ -1275,7 +1275,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio) pointers); } else { /* raid5 */ - memcpy(pointers[nr_data], pointers[0], PAGE_SIZE); + copy_page(pointers[nr_data], pointers[0]); run_xor(pointers + 1, nr_data - 1, PAGE_SIZE); } @@ -1941,9 +1941,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio) BUG_ON(failb != -1); pstripe: /* Copy parity block into failed block to start with */ - memcpy(pointers[faila], - pointers[rbio->nr_data], - PAGE_SIZE); + copy_page(pointers[faila], pointers[rbio->nr_data]); /* rearrange the pointer array */ p = pointers[faila]; @@ -2448,7 +2446,7 @@ static noinline void finish_parity_scrub(struct 
btrfs_raid_bio *rbio, pointers); } else { /* raid5 */ - memcpy(pointers[nr_data], pointers[0], PAGE_SIZE); + copy_page(pointers[nr_data], pointers[0]); run_xor(pointers + 1, nr_data - 1, PAGE_SIZE); } @@ -2456,7 +2454,7 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio, p = rbio_stripe_page(rbio, rbio->scrubp, pagenr); parity = kmap(p); if (memcmp(parity, pointers[rbio->scrubp], PAGE_SIZE)) - memcpy(parity, pointers[rbio->scrubp], PAGE_SIZE); + copy_page(parity, pointers[rbio->scrubp]); else /* Parity is right, needn't writeback */ bitmap_clear(rbio->dbitmap, pagenr, 1); -- 2.17.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
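Functionally, `copy_page(dst, src)` is a page-sized `memcpy`; the point of the helper is that an architecture may back it with an optimized full-page copy routine. A sketch of the equivalence, assuming a 4K page:

```c
#include <assert.h>
#include <string.h>

#define PAGE_SIZE 4096UL   /* assumed page size */

/* Generic fallback behavior of copy_page(): a fixed PAGE_SIZE memcpy.
 * That is exactly what the patch replaces at each call site, so the
 * conversion is a no-op functionally. */
static void copy_page_sketch(void *dst, const void *src)
{
	memcpy(dst, src, PAGE_SIZE);
}
```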
[PATCH 02/14] btrfs: simplify pointer chasing of local fs_info variables
Functions that get btrfs inode can simply reach the fs_info by dereferencing the root and this looks a bit more straightforward compared to the btrfs_sb(...) indirection. If the transaction handle is available and not NULL it's used instead. Signed-off-by: David Sterba --- fs/btrfs/delayed-inode.c| 4 ++-- fs/btrfs/disk-io.c | 2 +- fs/btrfs/extent-tree.c | 6 +++--- fs/btrfs/file-item.c| 2 +- fs/btrfs/file.c | 14 +++--- fs/btrfs/free-space-cache.c | 7 ++- fs/btrfs/inode.c| 6 +++--- fs/btrfs/tree-log.c | 6 +++--- 8 files changed, 22 insertions(+), 25 deletions(-) diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c index fe6caa7e698b..596d2af0c8aa 100644 --- a/fs/btrfs/delayed-inode.c +++ b/fs/btrfs/delayed-inode.c @@ -1222,7 +1222,7 @@ int btrfs_commit_inode_delayed_items(struct btrfs_trans_handle *trans, int btrfs_commit_inode_delayed_inode(struct btrfs_inode *inode) { - struct btrfs_fs_info *fs_info = btrfs_sb(inode->vfs_inode.i_sb); + struct btrfs_fs_info *fs_info = inode->root->fs_info; struct btrfs_trans_handle *trans; struct btrfs_delayed_node *delayed_node = btrfs_get_delayed_node(inode); struct btrfs_path *path; @@ -1837,7 +1837,7 @@ int btrfs_delayed_update_inode(struct btrfs_trans_handle *trans, int btrfs_delayed_delete_inode_ref(struct btrfs_inode *inode) { - struct btrfs_fs_info *fs_info = btrfs_sb(inode->vfs_inode.i_sb); + struct btrfs_fs_info *fs_info = inode->root->fs_info; struct btrfs_delayed_node *delayed_node; /* diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 205092dc9390..7ee825b96187 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -212,7 +212,7 @@ struct extent_map *btree_get_extent(struct btrfs_inode *inode, struct page *page, size_t pg_offset, u64 start, u64 len, int create) { - struct btrfs_fs_info *fs_info = btrfs_sb(inode->vfs_inode.i_sb); + struct btrfs_fs_info *fs_info = inode->root->fs_info; struct extent_map_tree *em_tree = >extent_tree; struct extent_map *em; int ret; diff --git 
a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 3d9fe58c0080..59e42481a515 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -6019,7 +6019,7 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info, int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes) { - struct btrfs_fs_info *fs_info = btrfs_sb(inode->vfs_inode.i_sb); + struct btrfs_fs_info *fs_info = inode->root->fs_info; unsigned nr_extents; enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_FLUSH_ALL; int ret = 0; @@ -6092,7 +6092,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes) void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 num_bytes, bool qgroup_free) { - struct btrfs_fs_info *fs_info = btrfs_sb(inode->vfs_inode.i_sb); + struct btrfs_fs_info *fs_info = inode->root->fs_info; num_bytes = ALIGN(num_bytes, fs_info->sectorsize); spin_lock(>lock); @@ -6121,7 +6121,7 @@ void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 num_bytes, void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes, bool qgroup_free) { - struct btrfs_fs_info *fs_info = btrfs_sb(inode->vfs_inode.i_sb); + struct btrfs_fs_info *fs_info = inode->root->fs_info; unsigned num_extents; spin_lock(>lock); diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c index f9dd6d1836a3..ec0a1acbe783 100644 --- a/fs/btrfs/file-item.c +++ b/fs/btrfs/file-item.c @@ -922,7 +922,7 @@ void btrfs_extent_item_to_extent_map(struct btrfs_inode *inode, const bool new_inline, struct extent_map *em) { - struct btrfs_fs_info *fs_info = btrfs_sb(inode->vfs_inode.i_sb); + struct btrfs_fs_info *fs_info = inode->root->fs_info; struct btrfs_root *root = inode->root; struct extent_buffer *leaf = path->nodes[0]; const int slot = path->slots[0]; diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index f660ba1e5e58..11e43e69d12f 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -83,7 +83,7 @@ static int 
__compare_inode_defrag(struct inode_defrag *defrag1, static int __btrfs_add_inode_defrag(struct btrfs_inode *inode, struct inode_defrag *defrag) { - struct btrfs_fs_info *fs_info = btrfs_sb(inode->vfs_inode.i_sb); + struct btrfs_fs_info *fs_info = inode->root->fs_info; struct inode_defrag *entry; struct rb_node **p; struct rb_node *parent = NULL; @@ -135,8 +135,8 @@ static inline int __need_auto_defrag(struct btrfs_fs_info
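The two routes to `fs_info` can be modelled with stand-in types: via the VFS superblock (`btrfs_sb(inode->vfs_inode.i_sb)`) or directly through the root the inode belongs to. Both land on the same object; the latter skips a hop:

```c
#include <assert.h>

/* Simplified stand-ins for the kernel structures involved. */
struct btrfs_fs_info { int unused; };
struct super_block  { struct btrfs_fs_info *s_fs_info; };
struct btrfs_root   { struct btrfs_fs_info *fs_info; };
struct inode        { struct super_block *i_sb; };
struct btrfs_inode  { struct inode vfs_inode; struct btrfs_root *root; };

/* The btrfs_sb() route: superblock -> private fs_info pointer. */
static struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
{
	return sb->s_fs_info;
}
```

Given a `struct btrfs_inode *inode`, `inode->root->fs_info` dereferences the pointers the function already holds, which is why the patch calls it more straightforward.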
[PATCH 00/14] Misc cleanups for 4.19
Minor interface/API and other safe cleanups.

David Sterba (14):
  btrfs: simplify some assignments of inode numbers
  btrfs: simplify pointer chasing of local fs_info variables
  btrfs: use copy_page for copying pages instead of memcpy
  btrfs: prune unused includes
  btrfs: pass only eb to num_extent_pages
  btrfs: switch types to int when counting eb pages
  btrfs: open-code bio_set_op_attrs
  btrfs: raid56: add new helper for starting async work
  btrfs: raid56: use new helper for async_rmw_stripe
  btrfs: raid56: use new helper for async_read_rebuild
  btrfs: raid56: use new helper for async_scrub_parity
  btrfs: raid56: merge rbio_is_full helpers
  btrfs: raid56: don't lock stripe cache table when freeing
  btrfs: raid56: catch errors from full_stripe_write

 fs/btrfs/check-integrity.c  |   2 +-
 fs/btrfs/compression.c      |   8 +--
 fs/btrfs/delayed-inode.c    |   4 +-
 fs/btrfs/dev-replace.c      |   5 --
 fs/btrfs/disk-io.c          |   4 +-
 fs/btrfs/extent-tree.c      |   6 +-
 fs/btrfs/extent_io.c        |  70 +++---
 fs/btrfs/extent_io.h        |   6 +-
 fs/btrfs/file-item.c        |   2 +-
 fs/btrfs/file.c             |  17 +++---
 fs/btrfs/free-space-cache.c |  11 ++--
 fs/btrfs/inode-map.c        |   1 -
 fs/btrfs/inode.c            |  24
 fs/btrfs/ioctl.c            |   5 -
 fs/btrfs/ordered-data.c     |   1 -
 fs/btrfs/raid56.c           | 112 +++-
 fs/btrfs/reada.c            |   1 -
 fs/btrfs/scrub.c            |   6 -
 fs/btrfs/struct-funcs.c     |   1 -
 fs/btrfs/super.c            |   3 -
 fs/btrfs/sysfs.c            |   2 -
 fs/btrfs/tree-log.c         |   6 +-
 fs/btrfs/volumes.c          |   3 -
 23 files changed, 109 insertions(+), 191 deletions(-)
-- 
2.17.1
Re: [PATCH] btrfs: use customized batch size for total_bytes_pinned
On 29.06.2018 11:27, Ethan Lien wrote: > We use percpu_counter to reduce lock contention of frequently updated > counter. But the default 32 batch size is suitable only for inc/dec > update, not for sector size update. The end result is we still acquire > spin lock for every percpu_counter_add(). So use customized batch size > for total_bytes_pinned as we already done for dirty_metadata_bytes > and delalloc_bytes. Also we fix dirty_metadata_bytes to use > __percpu_counter_compare of customized batch so we get precise value of > dirty_metadata_bytes. Do you have any numbers to prove this is problematic? Perhaps a fio workload + lockdep contention numbers before/after the patch ? > > Signed-off-by: Ethan Lien > --- > fs/btrfs/ctree.h | 1 + > fs/btrfs/disk-io.c | 10 ++ > fs/btrfs/extent-tree.c | 43 +++--- > 3 files changed, 35 insertions(+), 19 deletions(-) > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > index 118346aceea9..df682a521635 100644 > --- a/fs/btrfs/ctree.h > +++ b/fs/btrfs/ctree.h > @@ -422,6 +422,7 @@ struct btrfs_space_info { >* time the transaction commits. >*/ > struct percpu_counter total_bytes_pinned; > + s32 total_bytes_pinned_batch; > > struct list_head list; > /* Protected by the spinlock 'lock'. 
*/ > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c > index 205092dc9390..dfed08e70ec1 100644 > --- a/fs/btrfs/disk-io.c > +++ b/fs/btrfs/disk-io.c > @@ -961,8 +961,9 @@ static int btree_writepages(struct address_space *mapping, > > fs_info = BTRFS_I(mapping->host)->root->fs_info; > /* this is a bit racy, but that's ok */ > - ret = percpu_counter_compare(_info->dirty_metadata_bytes, > - BTRFS_DIRTY_METADATA_THRESH); > + ret = __percpu_counter_compare(_info->dirty_metadata_bytes, > + BTRFS_DIRTY_METADATA_THRESH, > + fs_info->dirty_metadata_batch); > if (ret < 0) > return 0; > } > @@ -4150,8 +4151,9 @@ static void __btrfs_btree_balance_dirty(struct > btrfs_fs_info *fs_info, > if (flush_delayed) > btrfs_balance_delayed_items(fs_info); > > - ret = percpu_counter_compare(_info->dirty_metadata_bytes, > - BTRFS_DIRTY_METADATA_THRESH); > + ret = __percpu_counter_compare(_info->dirty_metadata_bytes, > + BTRFS_DIRTY_METADATA_THRESH, > + fs_info->dirty_metadata_batch); > if (ret > 0) { > > balance_dirty_pages_ratelimited(fs_info->btree_inode->i_mapping); > } This hunk needs to be in a separate patch. And the last sentence which relates to it needs to be expanded further to explain the problem. 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 3d9fe58c0080..a067c047904c 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -758,7 +758,8 @@ static void add_pinned_bytes(struct btrfs_fs_info > *fs_info, s64 num_bytes, > > space_info = __find_space_info(fs_info, flags); > ASSERT(space_info); > - percpu_counter_add(_info->total_bytes_pinned, num_bytes); > + percpu_counter_add_batch(_info->total_bytes_pinned, num_bytes, > + space_info->total_bytes_pinned_batch); > } > > /* > @@ -2598,8 +2599,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle > *trans, > flags = BTRFS_BLOCK_GROUP_METADATA; > space_info = __find_space_info(fs_info, flags); > ASSERT(space_info); > - percpu_counter_add(_info->total_bytes_pinned, > --head->num_bytes); > + percpu_counter_add_batch(_info->total_bytes_pinned, > +-head->num_bytes, > +space_info->total_bytes_pinned_batch); > > if (head->is_data) { > spin_lock(_refs->lock); > @@ -4035,6 +4037,10 @@ static int create_space_info(struct btrfs_fs_info > *info, u64 flags) > INIT_LIST_HEAD(_info->ro_bgs); > INIT_LIST_HEAD(_info->tickets); > INIT_LIST_HEAD(_info->priority_tickets); > + if (flags & BTRFS_BLOCK_GROUP_DATA) > + space_info->total_bytes_pinned_batch = info->delalloc_batch; > + else > + space_info->total_bytes_pinned_batch = > info->dirty_metadata_batch; > > ret = kobject_init_and_add(_info->kobj, _info_ktype, > info->space_info_kobj, "%s", > @@ -4309,9 +4315,10 @@ int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode > *inode, u64 bytes) >* allocation, and no removed chunk in current transaction, >* don't bother committing the transaction. >*/ > -
[PATCH] btrfs: use customized batch size for total_bytes_pinned
We use percpu_counter to reduce lock contention on frequently updated
counters. But the default batch size of 32 is suitable only for inc/dec
updates, not for sector-sized updates. The end result is that we still
acquire the spin lock for every percpu_counter_add(). So use a customized
batch size for total_bytes_pinned, as we already do for
dirty_metadata_bytes and delalloc_bytes. Also fix dirty_metadata_bytes to
use __percpu_counter_compare with the customized batch, so we get a
precise value of dirty_metadata_bytes.

Signed-off-by: Ethan Lien
---
 fs/btrfs/ctree.h       |  1 +
 fs/btrfs/disk-io.c     | 10 ++
 fs/btrfs/extent-tree.c | 43 +++---
 3 files changed, 35 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 118346aceea9..df682a521635 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -422,6 +422,7 @@ struct btrfs_space_info {
 	 * time the transaction commits.
 	 */
 	struct percpu_counter total_bytes_pinned;
+	s32 total_bytes_pinned_batch;
 
 	struct list_head list;
 	/* Protected by the spinlock 'lock'.
*/ diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 205092dc9390..dfed08e70ec1 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -961,8 +961,9 @@ static int btree_writepages(struct address_space *mapping, fs_info = BTRFS_I(mapping->host)->root->fs_info; /* this is a bit racy, but that's ok */ - ret = percpu_counter_compare(_info->dirty_metadata_bytes, -BTRFS_DIRTY_METADATA_THRESH); + ret = __percpu_counter_compare(_info->dirty_metadata_bytes, +BTRFS_DIRTY_METADATA_THRESH, +fs_info->dirty_metadata_batch); if (ret < 0) return 0; } @@ -4150,8 +4151,9 @@ static void __btrfs_btree_balance_dirty(struct btrfs_fs_info *fs_info, if (flush_delayed) btrfs_balance_delayed_items(fs_info); - ret = percpu_counter_compare(_info->dirty_metadata_bytes, -BTRFS_DIRTY_METADATA_THRESH); + ret = __percpu_counter_compare(_info->dirty_metadata_bytes, +BTRFS_DIRTY_METADATA_THRESH, +fs_info->dirty_metadata_batch); if (ret > 0) { balance_dirty_pages_ratelimited(fs_info->btree_inode->i_mapping); } diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 3d9fe58c0080..a067c047904c 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -758,7 +758,8 @@ static void add_pinned_bytes(struct btrfs_fs_info *fs_info, s64 num_bytes, space_info = __find_space_info(fs_info, flags); ASSERT(space_info); - percpu_counter_add(_info->total_bytes_pinned, num_bytes); + percpu_counter_add_batch(_info->total_bytes_pinned, num_bytes, + space_info->total_bytes_pinned_batch); } /* @@ -2598,8 +2599,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans, flags = BTRFS_BLOCK_GROUP_METADATA; space_info = __find_space_info(fs_info, flags); ASSERT(space_info); - percpu_counter_add(_info->total_bytes_pinned, - -head->num_bytes); + percpu_counter_add_batch(_info->total_bytes_pinned, + -head->num_bytes, + space_info->total_bytes_pinned_batch); if (head->is_data) { spin_lock(_refs->lock); @@ -4035,6 +4037,10 @@ static int create_space_info(struct btrfs_fs_info *info, u64 
flags) INIT_LIST_HEAD(_info->ro_bgs); INIT_LIST_HEAD(_info->tickets); INIT_LIST_HEAD(_info->priority_tickets); + if (flags & BTRFS_BLOCK_GROUP_DATA) + space_info->total_bytes_pinned_batch = info->delalloc_batch; + else + space_info->total_bytes_pinned_batch = info->dirty_metadata_batch; ret = kobject_init_and_add(_info->kobj, _info_ktype, info->space_info_kobj, "%s", @@ -4309,9 +4315,10 @@ int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes) * allocation, and no removed chunk in current transaction, * don't bother committing the transaction. */ - have_pinned_space = percpu_counter_compare( + have_pinned_space = __percpu_counter_compare( _sinfo->total_bytes_pinned, - used + bytes - data_sinfo->total_bytes); + used + bytes - data_sinfo->total_bytes, + data_sinfo->total_bytes_pinned_batch); spin_unlock(_sinfo->lock); /*
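The batching the patch relies on can be illustrated with a toy model (this is an illustration, not the kernel code): each CPU accumulates a local delta and only folds it into the shared counter, under a lock, once the delta exceeds the batch. With the default batch of 32, every sector-sized (4096-byte) add overflows immediately, so the lock is taken on every update; a sector-size-aware batch amortizes it:

```shell
#!/bin/bash
# Toy model of percpu_counter batching (single-CPU, illustration only).
# Count how often the "lock" would be taken when adding 4096 bytes per
# update, for a given batch size.
simulate() {
    local batch=$1 updates=$2 add=4096
    local localdelta=0 locks=0 i
    for ((i = 0; i < updates; i++)); do
        localdelta=$((localdelta + add))
        if ((localdelta > batch)); then
            locks=$((locks + 1))    # this is where the spinlock is taken
            localdelta=0            # delta folded into the shared total
        fi
    done
    echo "$locks"
}

simulate 32 1000                  # default-style batch: a flush per add
simulate $((128 * 4096)) 1000     # sector-size-aware batch: far fewer
```

With the small batch every single add exceeds the threshold, so 1000 updates take the lock 1000 times; with the larger batch only a handful of flushes are needed.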
Re: So, does btrfs check lowmem take days? weeks?
Hi,

On 29/06/2018 09:22, Marc MERLIN wrote:
> On Fri, Jun 29, 2018 at 12:09:54PM +0500, Roman Mamedov wrote:
>> On Thu, 28 Jun 2018 23:59:03 -0700 Marc MERLIN wrote:
>>> I don't waste a week recreating the many btrfs send/receive relationships.
>>
>> Consider not using send/receive, and switching to regular rsync instead.
>> Send/receive is very limiting and cumbersome, including because of what you
>> described. And it doesn't gain you much over an incremental rsync. As for
>
> Err, sorry but I cannot agree with you here, at all :)
>
> btrfs send/receive is pretty much the only reason I use btrfs.
> rsync takes hours on big filesystems scanning every single inode on both
> sides and then seeing what changed, and only then sends the differences.
> It's super inefficient.
> btrfs send knows in seconds what needs to be sent, and works on it right
> away.

I've not yet tried send/receive, but I feel the pain of rsyncing millions of files (I had to use lsyncd to limit the problem to the times the origin servers reboot, which is a relatively rare event), so this thread piqued my attention.

Looking at the whole thread, I wonder if you could get a more manageable setup by splitting the filesystem. If, instead of a single BTRFS filesystem, you used LVM volumes (maybe with thin provisioning and monitoring of the volume group's free space) for each of the servers you back up, with one BTRFS filesystem per volume, you would have fewer snapshots per filesystem and would isolate problems in case of corruption. If you eventually decide to start from scratch again, this might help a lot in your case.

Lionel
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
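A rough sketch of the layout Lionel proposes; the volume-group, pool, host names, and sizes below are all made-up placeholders, not recommendations:

```shell
#!/bin/bash
# One thin LV (and one independent btrfs filesystem) per backed-up host,
# so corruption and snapshot count stay isolated per filesystem.
vgcreate backupvg /dev/sdb /dev/sdc
lvcreate --type thin-pool -L 10T -n backuppool backupvg

for host in server1 server2 server3; do
    lvcreate -n "bak-$host" -V 4T --thinpool backuppool backupvg
    mkfs.btrfs "/dev/backupvg/bak-$host"
done

# With thin provisioning, watch the pool so it never runs out of real
# space (overcommitted -V sizes can exceed the physical pool):
lvs -o lv_name,data_percent,metadata_percent backupvg
```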
Re: So, does btrfs check lowmem take days? weeks?
On Fri, 29 Jun 2018 00:22:10 -0700 Marc MERLIN wrote:
> On Fri, Jun 29, 2018 at 12:09:54PM +0500, Roman Mamedov wrote:
>> On Thu, 28 Jun 2018 23:59:03 -0700 Marc MERLIN wrote:
>>> I don't waste a week recreating the many btrfs send/receive relationships.
>>
>> Consider not using send/receive, and switching to regular rsync instead.
>> Send/receive is very limiting and cumbersome, including because of what you
>> described. And it doesn't gain you much over an incremental rsync. As for
>
> Err, sorry but I cannot agree with you here, at all :)
>
> btrfs send/receive is pretty much the only reason I use btrfs.
> rsync takes hours on big filesystems scanning every single inode on both
> sides and then seeing what changed, and only then sends the differences

I use it for backing up root filesystems of about 20 hosts, and for syncing large multi-terabyte media collections -- it's fast enough in both. Admittedly, neither of those cases has millions of subdirs or files where scanning may take a long time. And in the former case it's also all from and to SSDs. Maybe your use case is different where it doesn't work as well.

But perhaps then general day-to-day performance is not great either, so I'd suggest looking into SSD-based LVM caching; it really works wonders with Btrfs.

--
With respect,
Roman
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
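For reference, an lvmcache setup along the lines Roman suggests might look like the following; the device, volume-group, and LV names are examples only, and the cache size is a placeholder:

```shell
#!/bin/bash
# Attach an SSD as a dm-cache for a slow LV that carries a btrfs
# filesystem. Assumes "backupvg/slowdata" already exists on spinning
# disks; /dev/nvme0n1 is the SSD being added.
pvcreate /dev/nvme0n1
vgextend backupvg /dev/nvme0n1

# Create the cache pool on the SSD, then attach it to the slow LV.
lvcreate --type cache-pool -L 200G -n fastcache backupvg /dev/nvme0n1
lvconvert --type cache --cachepool backupvg/fastcache backupvg/slowdata

# Later, to detach the cache and flush dirty blocks back to the HDDs:
# lvconvert --splitcache backupvg/slowdata
```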
Re: So, does btrfs check lowmem take days? weeks?
On Fri, Jun 29, 2018 at 03:20:42PM +0800, Qu Wenruo wrote:
> If certain btrfs specific operations are involved, it's definitely not OK:
> 1) Balance
> 2) Quota
> 3) Btrfs check

Ok, I understand. I'll try to balance almost never then.
My problems did indeed start because I ran balance and it got stuck for 2 days with 0 progress. That still seems like a bug though. I'm ok with slow, but stuck for 2 days with only 270 snapshots or so means there is a bug, or the algorithm is so expensive that 270 snapshots could cause it to take days or weeks to proceed?

> > It's a backup server, it only contains data from other machines.
> > If the filesystem cannot be recovered to a working state, I will need
> > over a week to restart the many btrfs send commands from many servers.
> > This is why anything other than --repair is useless to me, I don't need
> > the data back, it's still on the original machines, I need the
> > filesystem to work again so that I don't waste a week recreating the
> > many btrfs send/receive relationships.
>
> Now totally understand why you need to repair the fs.

I also understand that my use case is atypical :)
But I guess this also means that using btrfs for a lot of send/receive on a backup server is not going to work well, unfortunately :-/
Now I'm wondering if I'm the only person even doing this.

> > Does the pastebin help and is 270 snapshots ok enough?
>
> The super dump doesn't show anything wrong.
> So the problem may be in the super large extent tree.
>
> In this case, plain check result with Su's patch would help more, other
> than the not so interesting super dump.

First I tried to mount with skip_balance after the partial repair, and it hung a long time:
[445635.716318] BTRFS info (device dm-2): disk space caching is enabled
[445635.736229] BTRFS info (device dm-2): has skinny extents
[445636.101999] BTRFS info (device dm-2): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
[445825.053205] BTRFS info (device dm-2): enabling ssd optimizations
[446511.006588] BTRFS info (device dm-2): disk space caching is enabled
[446511.026737] BTRFS info (device dm-2): has skinny extents
[446511.325470] BTRFS info (device dm-2): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
[446699.593501] BTRFS info (device dm-2): enabling ssd optimizations
[446964.077045] INFO: task btrfs-transacti:9211 blocked for more than 120 seconds.
[446964.099802]       Not tainted 4.17.2-amd64-preempt-sysrq-20180818 #3
[446964.120004] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

So, I rebooted, and will now run Su's btrfs check without repair and report back.
Thanks both for your help.

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: So, does btrfs check lowmem take days? weeks?
On Fri, Jun 29, 2018 at 12:09:54PM +0500, Roman Mamedov wrote:
> On Thu, 28 Jun 2018 23:59:03 -0700 Marc MERLIN wrote:
> > I don't waste a week recreating the many btrfs send/receive relationships.
>
> Consider not using send/receive, and switching to regular rsync instead.
> Send/receive is very limiting and cumbersome, including because of what you
> described. And it doesn't gain you much over an incremental rsync. As for

Err, sorry but I cannot agree with you here, at all :)

btrfs send/receive is pretty much the only reason I use btrfs.
rsync takes hours on big filesystems scanning every single inode on both sides and then seeing what changed, and only then sends the differences. It's super inefficient.
btrfs send knows in seconds what needs to be sent, and works on it right away.

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
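The workflow Marc describes can be sketched roughly like this (paths and the hostname are examples only). An incremental send only walks metadata newer than the parent snapshot's generation, which is why it starts transferring almost immediately instead of scanning every inode:

```shell
#!/bin/bash
# Incremental btrfs send/receive: ship only what changed since the
# parent snapshot.
btrfs subvolume snapshot -r /data /data/snap-new
btrfs send -p /data/snap-old /data/snap-new | \
    ssh backuphost btrfs receive /backup/data

# The rsync equivalent must stat every inode on both sides before it
# can transfer anything:
# rsync -aHAX --delete /data/ backuphost:/backup/data/
```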
Re: So, does btrfs check lowmem take days? weeks?
On 2018年06月29日 14:59, Marc MERLIN wrote:
> On Fri, Jun 29, 2018 at 02:29:10PM +0800, Qu Wenruo wrote:
>>> If --repair doesn't work, check is useless to me sadly.
>>
>> Not exactly.
>> Although it's time consuming, I have manually patched several users' fs,
>> which normally ends pretty well.
>
> Ok I understand now.
>
>>> Agreed, I doubt I have over or much over 100 snapshots though (but I
>>> can't check right now).
>>> Sadly I'm not allowed to mount even read only while check is running:
>>> gargamel:~# mount -o ro /dev/mapper/dshelf2 /mnt/mnt2
>>> mount: /dev/mapper/dshelf2 already mounted or /mnt/mnt2 busy
>
> Ok, so I just checked now, 270 snapshots, but not because I'm crazy,
> because I use btrfs send a lot :)
>
>> This looks like super block corruption?
>>
>> What about "btrfs inspect dump-super -fFa /dev/mapper/dshelf2"?
>
> Sure, there you go: https://pastebin.com/uF1pHTsg
>
>> And what about "skip_balance" mount option?
>
> I have this in my fstab :)
>
>> Another problem is, with so many snapshots, balance is also hugely
>> slowed, thus I'm not 100% sure if it's really a hang.
>
> I sent another thread about this last week, balance got hung after 2
> days of doing nothing and just moving a single chunk.
>
> Ok, I was able to remount the filesystem read only. I was wrong, I have
> 270 snapshots:
> gargamel:/mnt/mnt# btrfs subvolume list . | grep -c 'path backup/'
> 74
> gargamel:/mnt/mnt# btrfs subvolume list . | grep -c 'path backup-btrfssend/'
> 196
>
> It's a backup server, I use btrfs send for many machines, and for each
> btrfs send, I keep history, maybe 10 or so backups. So it adds up in the end.
>
> Is btrfs unable to deal with this well enough?

It depends.
For certain and rare cases, if the only operations on the filesystem are non-btrfs-specific operations (POSIX file operations), then you're fine.
(Maybe you can go thousands of snapshots before any obvious performance degradation.)

If certain btrfs-specific operations are involved, it's definitely not OK:
1) Balance
2) Quota
3) Btrfs check

>
>> If for that usage, btrfs-restore would fit your use case more,
>> Unfortunately it needs extra disk space and isn't good at restoring
>> subvolume/snapshots.
>> (Although it's much faster than repairing the possible corrupted extent
>> tree)
>
> It's a backup server, it only contains data from other machines.
> If the filesystem cannot be recovered to a working state, I will need
> over a week to restart the many btrfs send commands from many servers.
> This is why anything other than --repair is useless to me, I don't need
> the data back, it's still on the original machines, I need the
> filesystem to work again so that I don't waste a week recreating the
> many btrfs send/receive relationships.

Now totally understand why you need to repair the fs.

>
>>> Is that possible at all?
>>
>> At least for file recovery (fs tree repair), we have such behavior.
>>
>> However, the problem you hit (and a lot of users hit) is all about
>> extent tree repair, which doesn't even go to file recovery.
>>
>> All the hassle is in the extent tree, and for the extent tree, it's just
>> good or bad. Any corruption in the extent tree may lead to later bugs.
>> The only way to avoid extent tree problems is to mount the fs RO.
>>
>> So, I'm afraid it is at least impossible for recent years.
>
> Understood, thanks for answering.
>
> Does the pastebin help and is 270 snapshots ok enough?

The super dump doesn't show anything wrong.
So the problem may be in the super large extent tree.

In this case, a plain check result with Su's patch would help more, other than the not so interesting super dump.

Thanks,
Qu

>
> Thanks,
> Marc
Re: So, does btrfs check lowmem take days? weeks?
On Thu, 28 Jun 2018 23:59:03 -0700 Marc MERLIN wrote: > I don't waste a week recreating the many btrfs send/receive relationships. Consider not using send/receive, and switching to regular rsync instead. Send/receive is very limiting and cumbersome, including because of what you described. And it doesn't gain you much over an incremental rsync. As for snapshots on the backup server, you can either automate making one as soon as a backup has finished, or simply make them once/twice a day, during a period when no backups are ongoing. -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
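Roman's first option — snapshotting the backup target right after a backup completes — could be scripted roughly as follows; the host, paths, and naming scheme are hypothetical:

```shell
#!/bin/bash
# Pull a backup with rsync, then freeze the result as a read-only
# snapshot so each run is preserved with minimal extra space.
rsync -aHAX --delete client1:/ /backup/client1/current/ && \
    btrfs subvolume snapshot -r /backup/client1/current \
        "/backup/client1/$(date +%Y%m%d-%H%M)"
```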
Re: [PATCH v2] Revert "btrfs: fix a possible umount deadlock"
On 29.06.2018 10:11, Anand Jain wrote:
>
> On 06/29/2018 01:26 PM, Nikolay Borisov wrote:
>> Since commit 88c14590cdd6 ("btrfs: use RCU in btrfs_show_devname for
>> device list traversal") btrfs_show_devname no longer takes
>> device_list_mutex. As such the deadlock that 0ccd05285e7f ("btrfs: fix
>> a possible umount deadlock") aimed to fix no longer exists. So remove
>> the extra code that commit added.
>>
>> This reverts commit 0ccd05285e7f5a8e297e1d6dfc41e7c65757d6fa.
>
> Its not exactly a revert.
>
> Because before 88c14590cdd6 was written, 0ccd05285e7f was already there
> for the right purpose, OR a new title is even better.

What I'm saying is that commit 88c14590cdd6 obsoleted 0ccd05285e7f, hence this patch reverts the latter. I've literally generated it with git revert 0ccd05285e7f + minor fixups.

>
>> Why do we first replace the closed device with a copy in
>> btrfs_close_one_device, then dispose of the copied
>> devices in btrfs_close_devices IFF we had fs_devices->seed not being
>> NULL?
>
> I doubt if it's for the fs_devices->seed purpose, we could have
> just NULLed a few items in device and it would have worked
> as well. I guess it's for the purpose of RCU satisfaction; I
> remember running into that issue while trying to fix it.
> If you have plans to fix that pls go ahead. Thanks. While
> my time is occupied with volume locks as of now. Otherwise
> it's in the list to fix.

Yeah, I spoke with David about that and he said it was related to RCU, and of course the commit that introduced that had nothing about RCU in the change log. Device close logic seems to be a train wreck atm.

>
>> Signed-off-by: Nikolay Borisov
>
> Reviewed-by: Anand Jain
>
> Thanks, Anand
>
>> ---
>>
>> V2:
>> * Fixed build failure due to using old name of free_device_rcu
>> function.
>> fs/btrfs/volumes.c | 26 ++++++--------
>> 1 file changed, 6 insertions(+), 20 deletions(-)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 5bd6f3a40f9c..011a19b7930f 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -1004,7 +1004,7 @@ static void btrfs_close_bdev(struct btrfs_device *device)
>>  		blkdev_put(device->bdev, device->mode);
>>  }
>>
>> -static void btrfs_prepare_close_one_device(struct btrfs_device *device)
>> +static void btrfs_close_one_device(struct btrfs_device *device)
>>  {
>>  	struct btrfs_fs_devices *fs_devices = device->fs_devices;
>>  	struct btrfs_device *new_device;
>> @@ -1022,6 +1022,8 @@ static void btrfs_prepare_close_one_device(struct btrfs_device *device)
>>  	if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
>>  		fs_devices->missing_devices--;
>>
>> +	btrfs_close_bdev(device);
>> +
>>  	new_device = btrfs_alloc_device(NULL, &device->devid,
>>  					device->uuid);
>>  	BUG_ON(IS_ERR(new_device)); /* -ENOMEM */
>> @@ -1035,39 +1037,23 @@ static void btrfs_prepare_close_one_device(struct btrfs_device *device)
>>  	list_replace_rcu(&device->dev_list, &new_device->dev_list);
>>  	new_device->fs_devices = device->fs_devices;
>> +
>> +	call_rcu(&device->rcu, free_device_rcu);
>>  }
>>
>>  static int close_fs_devices(struct btrfs_fs_devices *fs_devices)
>>  {
>>  	struct btrfs_device *device, *tmp;
>> -	struct list_head pending_put;
>> -
>> -	INIT_LIST_HEAD(&pending_put);
>>
>>  	if (--fs_devices->opened > 0)
>>  		return 0;
>>
>>  	mutex_lock(&fs_devices->device_list_mutex);
>>  	list_for_each_entry_safe(device, tmp, &fs_devices->devices, dev_list) {
>> -		btrfs_prepare_close_one_device(device);
>> -		list_add(&device->dev_list, &pending_put);
>> +		btrfs_close_one_device(device);
>>  	}
>>  	mutex_unlock(&fs_devices->device_list_mutex);
>>
>> -	/*
>> -	 * btrfs_show_devname() is using the device_list_mutex,
>> -	 * sometimes call to blkdev_put() leads vfs calling
>> -	 * into this func. So do put outside of device_list_mutex,
>> -	 * as of now.
>> -	 */
>> -	while (!list_empty(&pending_put)) {
>> -		device = list_first_entry(&pending_put,
>> -				struct btrfs_device, dev_list);
>> -		list_del(&device->dev_list);
>> -		btrfs_close_bdev(device);
>> -		call_rcu(&device->rcu, free_device_rcu);
>> -	}
>> -
>>  	WARN_ON(fs_devices->open_devices);
>>  	WARN_ON(fs_devices->rw_devices);
>>  	fs_devices->opened = 0;
>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] Revert "btrfs: fix a possible umount deadlock"
On 06/29/2018 01:26 PM, Nikolay Borisov wrote:
> Since commit 88c14590cdd6 ("btrfs: use RCU in btrfs_show_devname for
> device list traversal") btrfs_show_devname no longer takes
> device_list_mutex. As such the deadlock that 0ccd05285e7f ("btrfs: fix
> a possible umount deadlock") aimed to fix no longer exists. So remove
> the extra code that commit added.
>
> This reverts commit 0ccd05285e7f5a8e297e1d6dfc41e7c65757d6fa.

Its not exactly a revert.

Because before 88c14590cdd6 was written, 0ccd05285e7f was already there for the right purpose, OR a new title is even better.

> Why do we first replace the closed device with a copy in
> btrfs_close_one_device, then dispose of the copied
> devices in btrfs_close_devices IFF we had fs_devices->seed not being
> NULL?

I doubt if it's for the fs_devices->seed purpose, we could have just NULLed a few items in device and it would have worked as well. I guess it's for the purpose of RCU satisfaction; I remember running into that issue while trying to fix it. If you have plans to fix that pls go ahead. Thanks. While my time is occupied with volume locks as of now. Otherwise it's in the list to fix.

> Signed-off-by: Nikolay Borisov

Reviewed-by: Anand Jain

Thanks, Anand

> ---
>
> V2:
> * Fixed build failure due to using old name of free_device_rcu
> function.
fs/btrfs/volumes.c | 26 ++++++--------
1 file changed, 6 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 5bd6f3a40f9c..011a19b7930f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1004,7 +1004,7 @@ static void btrfs_close_bdev(struct btrfs_device *device)
 		blkdev_put(device->bdev, device->mode);
 }
 
-static void btrfs_prepare_close_one_device(struct btrfs_device *device)
+static void btrfs_close_one_device(struct btrfs_device *device)
 {
 	struct btrfs_fs_devices *fs_devices = device->fs_devices;
 	struct btrfs_device *new_device;
@@ -1022,6 +1022,8 @@ static void btrfs_prepare_close_one_device(struct btrfs_device *device)
 	if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
 		fs_devices->missing_devices--;
 
+	btrfs_close_bdev(device);
+
 	new_device = btrfs_alloc_device(NULL, &device->devid,
 					device->uuid);
 	BUG_ON(IS_ERR(new_device)); /* -ENOMEM */
@@ -1035,39 +1037,23 @@ static void btrfs_prepare_close_one_device(struct btrfs_device *device)
 	list_replace_rcu(&device->dev_list, &new_device->dev_list);
 	new_device->fs_devices = device->fs_devices;
+
+	call_rcu(&device->rcu, free_device_rcu);
 }
 
 static int close_fs_devices(struct btrfs_fs_devices *fs_devices)
 {
 	struct btrfs_device *device, *tmp;
-	struct list_head pending_put;
-
-	INIT_LIST_HEAD(&pending_put);
 
 	if (--fs_devices->opened > 0)
 		return 0;
 
 	mutex_lock(&fs_devices->device_list_mutex);
 	list_for_each_entry_safe(device, tmp, &fs_devices->devices, dev_list) {
-		btrfs_prepare_close_one_device(device);
-		list_add(&device->dev_list, &pending_put);
+		btrfs_close_one_device(device);
 	}
 	mutex_unlock(&fs_devices->device_list_mutex);
 
-	/*
-	 * btrfs_show_devname() is using the device_list_mutex,
-	 * sometimes call to blkdev_put() leads vfs calling
-	 * into this func. So do put outside of device_list_mutex,
-	 * as of now.
-	 */
-	while (!list_empty(&pending_put)) {
-		device = list_first_entry(&pending_put,
-				struct btrfs_device, dev_list);
-		list_del(&device->dev_list);
-		btrfs_close_bdev(device);
-		call_rcu(&device->rcu, free_device_rcu);
-	}
-
 	WARN_ON(fs_devices->open_devices);
 	WARN_ON(fs_devices->rw_devices);
 	fs_devices->opened = 0;

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] fs: add initial bh_result->b_private value to __blockdev_direct_IO()
On Mon, Jun 25, 2018 at 07:16:38PM +0200, David Sterba wrote: > On Mon, May 14, 2018 at 06:35:48PM +0200, David Sterba wrote: > > On Fri, May 11, 2018 at 01:30:01PM -0700, Omar Sandoval wrote: > > > On Fri, May 11, 2018 at 09:05:38PM +0100, Al Viro wrote: > > > > On Thu, May 10, 2018 at 11:30:10PM -0700, Omar Sandoval wrote: > > > > > do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode, > > > > > struct block_device *bdev, struct iov_iter *iter, > > > > > get_block_t get_block, dio_iodone_t end_io, > > > > > - dio_submit_t submit_io, int flags) > > > > > + dio_submit_t submit_io, int flags, void *private) > > > > > > > > Oh, dear... That's what, 9 arguments? I agree that the hack in > > > > question > > > > is obscene, but so is this ;-/ > > > > > > So looking at these one by one, obviously needed: > > > > > > - iocb > > > - inode > > > - iter > > > > > > bdev is almost always inode->i_sb->s_bdev, except for Btrfs :( > > > > > > These could _maybe_ go in struct kiocb: > > > > > > - flags could maybe be folded into ki_flags > > > - private could maybe go in iocb->private, but I haven't yet read > > > through to figure out if we're already using iocb->private for direct > > > I/O > > > > I think the kiocb::private can be used for the purpose. There's only one > > user, ext4, that also passes some DIO data around so it would in line > > with the interface AFAICS. > > Omar, do you have an update to the patchset? Thanks. Al, what do you think of changing all users of map_bh->b_private to use iocb->private? We'd have to pass the iocb to get_block() and submit_io(), but we could get rid of dio->private. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: So, does btrfs check lowmem take days? weeks?
On Fri, Jun 29, 2018 at 02:29:10PM +0800, Qu Wenruo wrote: > > If --repair doesn't work, check is useless to me sadly. > > Not exactly. > Although it's time consuming, I have manually patched several users fs, > which normally ends pretty well. Ok I understand now. > > Agreed, I doubt I have over or much over 100 snapshots though (but I > > can't check right now). > > Sadly I'm not allowed to mount even read only while check is running: > > gargamel:~# mount -o ro /dev/mapper/dshelf2 /mnt/mnt2 > > mount: /dev/mapper/dshelf2 already mounted or /mnt/mnt2 busy Ok, so I just checked now, 270 snapshots, but not because I'm crazy, because I use btrfs send a lot :) > This looks like super block corruption? > > What about "btrfs inspect dump-super -fFa /dev/mapper/dshelf2"? Sure, there you go: https://pastebin.com/uF1pHTsg > And what about "skip_balance" mount option? I have this in my fstab :) > Another problem is, with so many snapshots, balance is also hugely > slowed, thus I'm not 100% sure if it's really a hang. I sent another thread about this last week, balance got hung after 2 days of doing nothing and just moving a single chunk. Ok, I was able to remount the filesystem read only. I was wrong, I have 270 snapshots: gargamel:/mnt/mnt# btrfs subvolume list . | grep -c 'path backup/' 74 gargamel:/mnt/mnt# btrfs subvolume list . | grep -c 'path backup-btrfssend/' 196 It's a backup server, I use btrfs send for many machines and for each btrs send, I keep history, maybe 10 or so backups. So it adds up in the end. Is btrfs unable to deal with this well enough? > If for that usage, btrfs-restore would fit your use case more, > Unfortunately it needs extra disk space and isn't good at restoring > subvolume/snapshots. > (Although it's much faster than repairing the possible corrupted extent > tree) It's a backup server, it only contains data from other machines. 
If the filesystem cannot be recovered to a working state, I will need over a week to restart the many btrfs send commands from many servers. This is why anything other than --repair is useless to me, I don't need the data back, it's still on the original machines, I need the filesystem to work again so that I don't waste a week recreating the many btrfs send/receive relationships.

> > Is that possible at all?
>
> At least for file recovery (fs tree repair), we have such behavior.
>
> However, the problem you hit (and a lot of users hit) is all about
> extent tree repair, which doesn't even go to file recovery.
>
> All the hassle is in the extent tree, and for the extent tree, it's just
> good or bad. Any corruption in the extent tree may lead to later bugs.
> The only way to avoid extent tree problems is to mount the fs RO.
>
> So, I'm afraid it is at least impossible for recent years.

Understood, thanks for answering.

Does the pastebin help and is 270 snapshots ok enough?

Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: So, does btrfs check lowmem take days? weeks?
On Fri, Jun 29, 2018 at 02:32:44PM +0800, Su Yue wrote:
> > > https://github.com/Damenly/btrfs-progs/tree/tmp1
> >
> > Not sure if I understand what you meant, here.
>
> Sorry for my unclear words.
> Simply speaking, I suggest you to stop the currently running check.
> Then, clone the above branch to compile the binary, then run
> 'btrfs check --mode=lowmem $dev'.

I understand, I'll build and try it.

> > This filesystem is trash to me and will require over a week to rebuild
> > manually if I can't repair it.
>
> Understood your anxiety, a log of check without '--repair' will help
> us to make clear what's wrong with your filesystem.

Ok, I'll run your new code without repair and report back. It will likely take over a day though.

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
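For anyone following along, building and running the suggested branch would look roughly like this; the exact configure flags may vary between btrfs-progs versions, and the device path is Marc's from earlier in the thread:

```shell
#!/bin/bash
# Build Su's patched btrfs-progs branch and run a read-only lowmem
# check, capturing the log the developers asked for.
git clone -b tmp1 https://github.com/Damenly/btrfs-progs.git
cd btrfs-progs
./autogen.sh && ./configure --disable-documentation
make -j"$(nproc)"

# Run the locally built binary, not the system one, and keep the log.
./btrfs check --mode=lowmem /dev/mapper/dshelf2 2>&1 | tee check.log
```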
Re: So, does btrfs check lowmem take days? weeks?
Re: So, does btrfs check lowmem take days? weeks?

On 2018/06/29 14:06, Marc MERLIN wrote:
> On Fri, Jun 29, 2018 at 01:48:17PM +0800, Qu Wenruo wrote:
>> Just normal btrfs check, and post the output.
>> If normal check eats up all your memory, btrfs check --mode=lowmem.
>
> Does check without --repair eat less RAM?

Unfortunately, no.

>> --repair should be considered as the last method.
>
> If --repair doesn't work, check is useless to me sadly.

Not exactly. Although it's time-consuming, I have manually patched several
users' filesystems, which normally ends pretty well.
If it's not a widespread problem but some small fatal one, it may be fixed.

> I know that for FS analysis and bug reporting, you want to have the FS
> without changing it to something maybe worse, but for my use, if it can't
> be mounted and can't be fixed, then it gets deleted, which is even worse
> than check doing the wrong thing.
>
>>> The last two ERROR lines took over a day to get generated, so I'm not
>>> sure if it's still working, but just slowly.
>>
>> OK, that explains something.
>>
>> One extent is referred to hundreds of times, no wonder it will take a
>> long time.
>>
>> Just one tip here: there are really too many snapshots/reflinked files.
>> It's highly recommended to keep the number of snapshots to a reasonable
>> number (low two digits).
>> Although btrfs snapshot creation is super fast, it puts a lot of
>> pressure on the extent tree, so there is no free lunch here.
>
> Agreed. I doubt I have over (or much over) 100 snapshots though (but I
> can't check right now).
> Sadly I'm not allowed to mount even read-only while check is running:
> gargamel:~# mount -o ro /dev/mapper/dshelf2 /mnt/mnt2
> mount: /dev/mapper/dshelf2 already mounted or /mnt/mnt2 busy
>
>>> I see. Is there any reasonably easy way to check on this running
>>> process?
>>
>> GDB attach would be good.
>> Interrupt and check the inode number if it's checking an fs tree.
>> Check the extent bytenr if it's checking the extent tree.
>>
>> But considering how many snapshots there are, it's really hard to
>> determine.
>>
>> In this case, the super large extent tree is causing a lot of problems;
>> maybe it's a good idea to allow btrfs check to skip the extent tree
>> check?
>
> I only see --init-extent-tree in the man page, which option did you have
> in mind?

That feature is just in my mind, not even implemented yet.

>>> Then again, maybe it already fixed enough that I can mount my
>>> filesystem again.
>>
>> This needs the initial btrfs check report and the kernel messages on
>> how it fails to mount.
>
> mount command hangs, kernel does not show anything special outside of
> disk access hanging.
>
> Jun 23 17:23:26 gargamel kernel: [  341.802696] BTRFS warning (device dm-2): 'recovery' is deprecated, use 'usebackuproot' instead
> Jun 23 17:23:26 gargamel kernel: [  341.828743] BTRFS info (device dm-2): trying to use backup root at mount time
> Jun 23 17:23:26 gargamel kernel: [  341.850180] BTRFS info (device dm-2): disk space caching is enabled
> Jun 23 17:23:26 gargamel kernel: [  341.869014] BTRFS info (device dm-2): has skinny extents
> Jun 23 17:23:26 gargamel kernel: [  342.206289] BTRFS info (device dm-2): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
> Jun 23 17:26:26 gargamel kernel: [  521.571392] BTRFS info (device dm-2): enabling ssd optimizations
> Jun 23 17:55:58 gargamel kernel: [ 2293.914867] perf: interrupt took too long (2507 > 2500), lowering kernel.perf_event_max_sample_rate to 79750
> Jun 23 17:56:22 gargamel kernel: [ 2317.718406] BTRFS info (device dm-2): disk space caching is enabled
> Jun 23 17:56:22 gargamel kernel: [ 2317.737277] BTRFS info (device dm-2): has skinny extents
> Jun 23 17:56:22 gargamel kernel: [ 2318.069461] BTRFS info (device dm-2): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
> Jun 23 17:59:22 gargamel kernel: [ 2498.256167] BTRFS info (device dm-2): enabling ssd optimizations
> Jun 23 18:05:23 gargamel kernel: [ 2859.107057] BTRFS info (device dm-2): disk space caching is enabled
> Jun 23 18:05:23 gargamel kernel: [ 2859.125883] BTRFS info (device dm-2): has skinny extents
> Jun 23 18:05:24 gargamel kernel: [ 2859.448018] BTRFS info (device dm-2): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0

This looks like super block corruption?
What is the output of "btrfs inspect dump-super -fFa /dev/mapper/dshelf2"?

And have you tried the "skip_balance" mount option?

Another problem is that with so many snapshots, balance is also hugely
slowed down, thus I'm not 100% sure whether it's really a hang.

> Jun 23 18:08:23 gargamel kernel: [ 3039.023305] BTRFS info (device dm-2): enabling ssd optimizations
> Jun 23 18:13:41 gargamel kernel: [ 3356.626037] perf: interrupt took too long (3143 > 3133), lowering kernel.perf_event_max_sample_rate to 63500
> Jun 23 18:17:23 gargamel kernel: [ 3578.937225] Process accounting resumed
> Jun 23 18:33:47 gargamel kernel: [ 4563.356252] JFS: nTxBlock = 8192,
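Qu's two suggestions above can be sketched as a small script. This is a
non-authoritative sketch: the device path `/dev/mapper/dshelf2` comes from
the thread, the mount point `/mnt/recovery` is my placeholder, and the
whole body is guarded so nothing runs unless the device actually exists.

```shell
#!/bin/sh
# Sketch of Qu's debugging suggestions; DEV and MNT are assumptions,
# adjust them for your own system.
DEV=${DEV:-/dev/mapper/dshelf2}
MNT=${MNT:-/mnt/recovery}

if [ -b "$DEV" ]; then
    # Dump every superblock mirror (-a), including all fields (-f),
    # even if the magic looks wrong (-F), to check for corruption.
    btrfs inspect-internal dump-super -fFa "$DEV"
    # Mount read-only and do NOT resume any interrupted balance,
    # in case the "hang" is really a balance resuming at mount time.
    mount -o ro,skip_balance "$DEV" "$MNT"
    STATUS=ran
else
    echo "device $DEV not present; skipping"
    STATUS=skipped
fi
```

The `skip_balance` option only prevents an interrupted balance from
resuming; it does not cancel it permanently, so `btrfs balance cancel`
is still needed once the filesystem is mounted.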
Re: So, does btrfs check lowmem take days? weeks?
On 06/29/2018 02:10 PM, Marc MERLIN wrote:
> On Fri, Jun 29, 2018 at 02:02:19PM +0800, Su Yue wrote:
>> I have figured out the bug: lowmem check can't deal with shared tree
>> blocks in a reloc tree. The fix is simple; you can try the following
>> repo:
>>
>> https://github.com/Damenly/btrfs-progs/tree/tmp1
>
> Not sure if I understand what you meant, here.

Sorry for my unclear words.
Simply speaking, I suggest you stop the currently running check.
Then clone the above branch, compile the binary, and run
'btrfs check --mode=lowmem $dev'.

>> Please run lowmem check without '--repair' first, to be sure whether
>> your filesystem is fine.
>
> The filesystem is not fine: it caused btrfs balance to hang; whether
> balance actually broke it further or caused the breakage, I can't say.
> Then mount hangs, even with recovery, unless I use ro.
> This filesystem is trash to me and will require over a week to rebuild
> manually if I can't repair it.

Understood your anxiety. A log of check without '--repair' will help us
figure out what's wrong with your filesystem.

Thanks,
Su

> Running check without repair for likely several days just to know that
> my filesystem is not clean (I already know this) isn't useful :) Or am
> I missing something?

Though the bug and phenomenon are clear enough, before sending my patch
I have to make a test image. I have spent a week studying btrfs balance,
but it seems a little hard for me.

> Thanks for having a look, either way.
> Marc
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
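Su Yue's suggested steps (stop the running check, build the `tmp1` branch,
run a read-only lowmem check) can be sketched as below. The branch URL is
from the thread; the build commands (`autogen.sh`, `configure`, `make`) are
the usual btrfs-progs build steps and an assumption on my part, as is the
`RUN` guard that keeps this a dry run by default, since it needs network
access, build tools, and a real device.

```shell
#!/bin/sh
# Sketch of Su Yue's instructions; dry run unless RUN=yes is set.
RUN=${RUN:-no}
DEV=${DEV:-/dev/mapper/dshelf2}   # device path from the thread

if [ "$RUN" = "yes" ]; then
    git clone -b tmp1 https://github.com/Damenly/btrfs-progs.git
    cd btrfs-progs || exit 1
    ./autogen.sh && ./configure --disable-documentation && make
    # No --repair: read-only check, log saved so it can be sent to
    # the developers before attempting any repair.
    ./btrfs check --mode=lowmem "$DEV" 2>&1 | tee ../lowmem-check.log
else
    echo "dry run; set RUN=yes to clone, build, and check $DEV"
fi
```

Running the patched binary read-only first matches the thread's advice:
the log shows whether the remaining errors are the known false positive
(shared tree blocks in a reloc tree) or real damage.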
Re: [GIT PULL] Btrfs updates for 4.18
On 06/29/2018 02:26 AM, David Sterba wrote:
> On Thu, Jun 28, 2018 at 07:22:59PM +0800, Anand Jain wrote:
>> The circular locking dependency warning occurs at FSSTRESS_PROG, and in
>> particular at doproc() in xfstests/ltp/fsstress.c, randomly at any of
>> the commands in opdesc_t ops[] = { .. } which involve calling the mmap
>> file operation when there is something to commit.
>> The commit transaction does need device_list_mutex, which is also being
>> used for btrfs_open_devices() in commit 542c5908abfe84f7.
>> But btrfs_open_devices() is only called at mount, and mmap() can only
>> happen after the mount has completed. Given this, it's unclear to me
>> why the circular locking dependency check is warning about this.
>> I feel until we have clarity about this, and also solve other problems
>> related to the streamlining of uuid_mutex, I suggest we revert
>> 542c5908abfe84f7. Sorry for the inconvenience.
>
> Ok, the revert is one option. I'm considering adding both the locks,
> like in https://patchwork.kernel.org/patch/10478443/ . This would have
> no effect, as btrfs_open_devices is called only from the mount path and
> the list_sort is done only the first time, when there are no other users
> of the list that would not also be under the uuid_mutex.
> This passed the syzbot and other tests, so this does not break things
> and goes towards making device_list_mutex the real protection mechanism
> for the fs_devices members.
> Let me know what you think; the revert should be the last option if we
> don't have anything better.

With this patch [1] as well, I still see the circular lock warning [2].

[1] https://patchwork.kernel.org/patch/10478443/

Test case:
  mkfs.btrfs -fq /dev/sdc && mount /dev/sdc /btrfs && \
    /xfstests/ltp/fsstress -d /btrfs -w -p 1 -n 2000

However, when the device_list_mutex is removed, the warning goes away.
Let me investigate a bit more about the circular locking dependency.

About using uuid_mutex in btrfs_open_devices():
To make the plan more conceivable: I am planning to use a bit map for the
volume flags, which shall also include an EXCL-OPS-in-progress flag for
the fs_devices. Which means we hold uuid_mutex and set/reset the EXCL-OPS
flag for the fs_devices, and so other fsids like fsid2 can still take the
uuid_mutex while fsid1 is still mounting/opening (which may sleep).

I hope you would agree to use a bit map for the volume; we also need this
bit map to manage the volume status. Or if there is a better solution, I
am fine with that. However, uuid_mutex as it is now isn't one, as it
blocks fsid2 from mounting.

Thanks, Anand

[2]
---
kernel: ======================================================
kernel: WARNING: possible circular locking dependency detected
kernel: 4.18.0-rc1+ #63 Not tainted
kernel: ------------------------------------------------------
kernel: fsstress/3062 is trying to acquire lock:
kernel: 7d28aeca (&fs_info->reloc_mutex){+.+.}, at: btrfs_record_root_in_trans+0x43/0x70 [btrfs]
kernel:
kernel: but task is already holding lock:
kernel: 2fc78565 (&mm->mmap_sem){++++}, at: vm_mmap_pgoff+0x9f/0x110
kernel:
kernel: which lock already depends on the new lock.
kernel:
kernel: the existing dependency chain (in reverse order) is:
kernel:
kernel: -> #5 (&mm->mmap_sem){++++}:
kernel:        _copy_from_user+0x1e/0x90
kernel:        scsi_cmd_ioctl+0x2ba/0x480
kernel:        cdrom_ioctl+0x3b/0xb2e
kernel:        sr_block_ioctl+0x7e/0xc0
kernel:        blkdev_ioctl+0x4ea/0x980
kernel:        block_ioctl+0x39/0x40
kernel:        do_vfs_ioctl+0xa2/0x6c0
kernel:        ksys_ioctl+0x70/0x80
kernel:        __x64_sys_ioctl+0x16/0x20
kernel:        do_syscall_64+0x4a/0x180
kernel:        entry_SYSCALL_64_after_hwframe+0x49/0xbe
kernel:
kernel: -> #4 (sr_mutex){+.+.}:
kernel:        sr_block_open+0x24/0xd0
kernel:        __blkdev_get+0xcb/0x480
kernel:        blkdev_get+0x144/0x3a0
kernel:        do_dentry_open+0x1b1/0x2d0
kernel:        path_openat+0x57b/0xcc0
kernel:        do_filp_open+0x9b/0x110
kernel:        do_sys_open+0x1bd/0x250
kernel:        do_syscall_64+0x4a/0x180
kernel:        entry_SYSCALL_64_after_hwframe+0x49/0xbe
kernel:
kernel: -> #3 (&bdev->bd_mutex){+.+.}:
kernel:        __blkdev_get+0x5d/0x480
kernel:        blkdev_get+0x243/0x3a0
kernel:        blkdev_get_by_path+0x4a/0x80
kernel:        btrfs_get_bdev_and_sb+0x1b/0xa0 [btrfs]
kernel:        open_fs_devices+0x85/0x270 [btrfs]
kernel:        btrfs_open_devices+0x6b/0x70 [btrfs]
kernel:        btrfs_mount_root+0x41a/0x7e0 [btrfs]
kernel:        mount_fs+0x30/0x150
kernel:        vfs_kern_mount.part.31+0x54/0x140
kernel:        btrfs_mount+0x175/0x920 [btrfs]
kernel:        mount_fs+0x30/0x150
kernel:        vfs_kern_mount.part.31+0x54/0x140
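Anand's reproducer from the message above can be wrapped so it only runs
against an explicitly named scratch device. The mkfs/mount/fsstress line
and the `/xfstests/ltp/fsstress` path are from the thread; the guard and
the `SCRATCH_*` variable names are my additions, since the original
one-liner wipes `/dev/sdc` unconditionally.

```shell
#!/bin/sh
# Guarded version of the thread's lockdep reproducer.
# WARNING: destroys all data on SCRATCH_DEV; it is deliberately unset
# by default so nothing happens until you opt in.
SCRATCH_DEV=${SCRATCH_DEV:-}
SCRATCH_MNT=${SCRATCH_MNT:-/btrfs}
FSSTRESS=${FSSTRESS:-/xfstests/ltp/fsstress}

if [ -n "$SCRATCH_DEV" ] && [ -b "$SCRATCH_DEV" ]; then
    mkfs.btrfs -fq "$SCRATCH_DEV" && mount "$SCRATCH_DEV" "$SCRATCH_MNT"
    # Single worker (-p 1), 2000 ops (-n 2000), write ops only (-w);
    # the mmap-using ops are what trigger the lockdep splat.
    "$FSSTRESS" -d "$SCRATCH_MNT" -w -p 1 -n 2000
    umount "$SCRATCH_MNT"
    STATUS=ran
else
    echo "SCRATCH_DEV not set or not a block device; skipping"
    STATUS=skipped
fi
```

The splat, if it reproduces, appears in `dmesg` on a kernel built with
CONFIG_PROVE_LOCKING, not in the fsstress output itself.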
Re: So, does btrfs check lowmem take days? weeks?
On Fri, Jun 29, 2018 at 02:02:19PM +0800, Su Yue wrote:
> I have figured out the bug: lowmem check can't deal with shared tree
> blocks in a reloc tree. The fix is simple; you can try the following
> repo:
>
> https://github.com/Damenly/btrfs-progs/tree/tmp1

Not sure if I understand what you meant, here.

> Please run lowmem check without '--repair' first, to be sure whether
> your filesystem is fine.

The filesystem is not fine: it caused btrfs balance to hang; whether
balance actually broke it further or caused the breakage, I can't say.
Then mount hangs, even with recovery, unless I use ro.
This filesystem is trash to me and will require over a week to rebuild
manually if I can't repair it.

Running check without repair for likely several days just to know that my
filesystem is not clean (I already know this) isn't useful :) Or am I
missing something?

> Though the bug and phenomenon are clear enough, before sending my patch
> I have to make a test image. I have spent a week studying btrfs balance,
> but it seems a little hard for me.

Thanks for having a look, either way.
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
Re: So, does btrfs check lowmem take days? weeks?
On Fri, Jun 29, 2018 at 01:48:17PM +0800, Qu Wenruo wrote:
> Just normal btrfs check, and post the output.
> If normal check eats up all your memory, btrfs check --mode=lowmem.

Does check without --repair eat less RAM?

> --repair should be considered as the last method.

If --repair doesn't work, check is useless to me sadly. I know that for
FS analysis and bug reporting, you want to have the FS without changing
it to something maybe worse, but for my use, if it can't be mounted and
can't be fixed, then it gets deleted, which is even worse than check
doing the wrong thing.

>> The last two ERROR lines took over a day to get generated, so I'm not
>> sure if it's still working, but just slowly.
>
> OK, that explains something.
>
> One extent is referred to hundreds of times, no wonder it will take a
> long time.
>
> Just one tip here: there are really too many snapshots/reflinked files.
> It's highly recommended to keep the number of snapshots to a reasonable
> number (low two digits).
> Although btrfs snapshot creation is super fast, it puts a lot of
> pressure on the extent tree, so there is no free lunch here.

Agreed. I doubt I have over (or much over) 100 snapshots though (but I
can't check right now).
Sadly I'm not allowed to mount even read-only while check is running:
gargamel:~# mount -o ro /dev/mapper/dshelf2 /mnt/mnt2
mount: /dev/mapper/dshelf2 already mounted or /mnt/mnt2 busy

>> I see. Is there any reasonably easy way to check on this running
>> process?
>
> GDB attach would be good.
> Interrupt and check the inode number if it's checking an fs tree.
> Check the extent bytenr if it's checking the extent tree.
>
> But considering how many snapshots there are, it's really hard to
> determine.
>
> In this case, the super large extent tree is causing a lot of problems;
> maybe it's a good idea to allow btrfs check to skip the extent tree
> check?

I only see --init-extent-tree in the man page, which option did you have
in mind?

>> Then again, maybe it already fixed enough that I can mount my
>> filesystem again.
>
> This needs the initial btrfs check report and the kernel messages on
> how it fails to mount.

mount command hangs, kernel does not show anything special outside of
disk access hanging.

Jun 23 17:23:26 gargamel kernel: [  341.802696] BTRFS warning (device dm-2): 'recovery' is deprecated, use 'usebackuproot' instead
Jun 23 17:23:26 gargamel kernel: [  341.828743] BTRFS info (device dm-2): trying to use backup root at mount time
Jun 23 17:23:26 gargamel kernel: [  341.850180] BTRFS info (device dm-2): disk space caching is enabled
Jun 23 17:23:26 gargamel kernel: [  341.869014] BTRFS info (device dm-2): has skinny extents
Jun 23 17:23:26 gargamel kernel: [  342.206289] BTRFS info (device dm-2): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Jun 23 17:26:26 gargamel kernel: [  521.571392] BTRFS info (device dm-2): enabling ssd optimizations
Jun 23 17:55:58 gargamel kernel: [ 2293.914867] perf: interrupt took too long (2507 > 2500), lowering kernel.perf_event_max_sample_rate to 79750
Jun 23 17:56:22 gargamel kernel: [ 2317.718406] BTRFS info (device dm-2): disk space caching is enabled
Jun 23 17:56:22 gargamel kernel: [ 2317.737277] BTRFS info (device dm-2): has skinny extents
Jun 23 17:56:22 gargamel kernel: [ 2318.069461] BTRFS info (device dm-2): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Jun 23 17:59:22 gargamel kernel: [ 2498.256167] BTRFS info (device dm-2): enabling ssd optimizations
Jun 23 18:05:23 gargamel kernel: [ 2859.107057] BTRFS info (device dm-2): disk space caching is enabled
Jun 23 18:05:23 gargamel kernel: [ 2859.125883] BTRFS info (device dm-2): has skinny extents
Jun 23 18:05:24 gargamel kernel: [ 2859.448018] BTRFS info (device dm-2): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Jun 23 18:08:23 gargamel kernel: [ 3039.023305] BTRFS info (device dm-2): enabling ssd optimizations
Jun 23 18:13:41 gargamel kernel: [ 3356.626037] perf: interrupt took too long (3143 > 3133), lowering kernel.perf_event_max_sample_rate to 63500
Jun 23 18:17:23 gargamel kernel: [ 3578.937225] Process accounting resumed
Jun 23 18:33:47 gargamel kernel: [ 4563.356252] JFS: nTxBlock = 8192, nTxLock = 65536
Jun 23 18:33:48 gargamel kernel: [ 4563.446715] ntfs: driver 2.1.32 [Flags: R/W MODULE].
Jun 23 18:42:20 gargamel kernel: [ 5075.995254] INFO: task sync:20253 blocked for more than 120 seconds.
Jun 23 18:42:20 gargamel kernel: [ 5076.015729]       Not tainted 4.17.2-amd64-preempt-sysrq-20180817 #1
Jun 23 18:42:20 gargamel kernel: [ 5076.036141] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 23 18:42:20 gargamel kernel: [ 5076.060637] sync            D    0 20253  15327 0x20020080
Jun 23 18:42:20 gargamel kernel: [ 5076.078032] Call Trace:
Jun 23 18:42:20 gargamel kernel: [ 5076.086366]  ? __schedule+0x53e/0x59b
Jun 23 18:42:20 gargamel kernel: [ 5076.098311]  schedule+0x7f/0x98
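Qu's "GDB attach" suggestion earlier in the thread (interrupt the running
check and look at which inode or extent bytenr it is processing) can be
sketched non-destructively with gdb's batch mode. The `pgrep`/`gdb`
invocation is my assumption, not from the thread; `-batch` detaches after
printing, so the check keeps running afterwards.

```shell
#!/bin/sh
# Sketch: peek at a running 'btrfs check' without killing it.
# Requires gdb and, ideally, a btrfs-progs build with debug symbols.
PID=$(pgrep -x btrfs 2>/dev/null | head -n1)

if [ -n "$PID" ]; then
    # Print a backtrace of the checker; repeating this a few minutes
    # apart shows whether the inode/bytenr being processed changes,
    # i.e. whether the check is progressing or stuck.
    gdb -batch -p "$PID" -ex 'bt'
    STATUS=inspected
else
    echo "no btrfs process found; nothing to inspect"
    STATUS=skipped
fi
```

Comparing two backtraces taken some minutes apart is the low-tech version
of Qu's advice: if the fs-tree inode number or extent bytenr in the frames
keeps changing, the check is slow but alive.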