Re: unsolvable technical issues?

2018-06-29 Thread Andrei Borzenkov
30.06.2018 06:22, Duncan пишет:
> Austin S. Hemmelgarn posted on Mon, 25 Jun 2018 07:26:41 -0400 as
> excerpted:
> 
>> On 2018-06-24 16:22, Goffredo Baroncelli wrote:
>>> On 06/23/2018 07:11 AM, Duncan wrote:
>>>> waxhead posted on Fri, 22 Jun 2018 01:13:31 +0200 as excerpted:
>>>>
>>>>> According to this:
>>>>>
>>>>> https://stratis-storage.github.io/StratisSoftwareDesign.pdf Page 4 ,
>>>>> section 1.2
>>>>>
>>>>> It claims that BTRFS still has significant technical issues that may
>>>>> never be resolved.
>>>>
>>>> I can speculate a bit.
>>>>
>>>> 1) When I see btrfs "technical issue that may never be resolved", the
>>>> #1 first thing I think of, that AFAIK there are _definitely_ no plans
>>>> to resolve, because it's very deeply woven into the btrfs core by now,
>>>> is...
>>>>
>>>> [1)] Filesystem UUID Identification.  Btrfs takes the UU bit of
>>>> Universally Unique quite literally, assuming they really *are*
>>>> unique, at least on that system[.]  Because
>>>> btrfs uses this supposedly unique ID to ID devices that belong to the
>>>> filesystem, it can get *very* mixed up, with results possibly
>>>> including dataloss, if it sees devices that don't actually belong to a
>>>> filesystem with the same UUID as a mounted filesystem.
>>>
>>> As partial workaround you can disable udev btrfs rules and then do a
>>> "btrfs dev scan" manually only for the device which you need.
> 
>> You don't even need `btrfs dev scan` if you just specify the exact set
>> of devices in the mount options.  The `device=` mount option tells the
>> kernel to check that device during the mount process.
> 
> Not that lvm does any better in this regard[1], but has btrfs ever solved 
> the bug where only one device= in the kernel commandline's rootflags= 
> would take effect, effectively forcing initr* on people (like me) who 
> would otherwise not need them and prefer to do without them, if they're 
> using a multi-device btrfs as root?
> 

This requires in-kernel device scanning; I doubt we will ever see it.

> Not to mention the fact that as kernel people will tell you, device 
> enumeration isn't guaranteed to be in the same order every boot, so 
> device=/dev/* can't be relied upon and shouldn't be used -- but of course 
> device=LABEL= and device=UUID= and similar won't work without userspace, 
> basically udev (if they work at all, IDK if they actually do).
> 
> Tho in practice from what I've seen, device enumeration order tends to be 
> dependable /enough/ for at least those without enterprise-level numbers 
> of devices to enumerate.

Just boot with a USB stick/eSATA drive plugged in; there's a good chance
it changes the device order.

>  True, it /does/ change from time to time with a 
> new kernel, but anybody sane keeps a tested-dependable old kernel around 
> to boot to until they know the new one works as expected, and that sort 
> of change is seldom enough that users can boot to the old kernel and 
> adjust their settings for the new one as necessary when it does happen.  
> So as much as "don't do it that way because it's not reliable" might 
> indeed hold in theory, in practice, just using an ordered /dev/* in kernel 
> commandlines does tend to "just work"... provided one is ready for the 
> occasion when that device parameter might need a bit of adjustment, of 
> course.
> 
...
> 
> ---
> [1] LVM is userspace code on top of the kernelspace devicemapper, and 
> therefore requires an initr* if root is on lvm, regardless.  So btrfs 
> actually does a bit better here, only requiring it for multi-device btrfs.
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs suddenly think's it's raid6

2018-06-29 Thread marble
> No log_root* in the super. So no fsyncing at the time? What was
> happening at the time of the power loss?
That I don't know.

> Pretty weird, all three supers are OK and yet two copies of the fs
> tree are corrupt. Why doesn't it fall back automatically to one of the
> other roots?
> 
> You could try 'mount -o usebackuproot,ro' and see if that's permissive
> enough to succeed.
That yields the same result:
```
marble@archlinux ~ % sudo mount -o usebackuproot,ro /dev/mapper/black
/tmp/black
mount: /tmp/black: can't read superblock on /dev/mapper/black.
```
> I wonder if the drive firmware reordered things contrary to Btrfs
> expectation right at the powerfail - or even if the powerfail caused
> the drive firmware to do the writes in the wrong order?
> 
> 


Re: unsolvable technical issues?

2018-06-29 Thread Duncan
Hugo Mills posted on Mon, 25 Jun 2018 16:54:36 + as excerpted:

> On Mon, Jun 25, 2018 at 06:43:38PM +0200, waxhead wrote:
> [snip]
>> I hope I am not asking for too much (but I know I probably am), but I
>> suggest that having a small snippet of information on the status page
>> showing a little bit about what is either currently the development
>> focus , or what people are known for working at would be very valuable
>> for users and it may of course work both ways, such as exciting people
>> or calming them down. ;)
>> 
>> For example something simple like a "development focus" list...
>> 2018-Q4: (planned) Renaming the grotesque "RAID" terminology
>> 2018-Q3: (planned) Magical feature X
>> 2018-Q2: N-Way mirroring
>> 2018-Q1: Feature work "RAID"5/6
>> 
>> I think it would be good for people living their lives outside as it
>> would perhaps spark some attention from developers and perhaps even
>> media as well.
> 
> I started doing this a couple of years ago, but it turned out to be
> impossible to keep even vaguely accurate or up to date, without going
> round and bugging the developers individually on a per-release basis. I
> don't think it's going to happen.

In addition, anything like quarter, kernel cycle, etc, has been 
repeatedly demonstrated to be entirely broken beyond "current", because 
roadmapped tasks have rather consistently taken longer, sometimes /many/ 
/times/ longer (by a factor of 20+ in the case of raid56), than first 
predicted.

But in theory it might be doable, with just a roughly ordered list, no 
dates beyond "current focus", and with suitably big disclaimers about 
other things (generally bugs in otherwise more stable features, but 
occasionally a quick sub-feature that is seen to be easier to introduce 
at the current state than it might be later, etc) possibly getting 
priority and temporarily displacing roadmapped items.

In fact, this last one is the big reason why raid56 has taken so long to 
even somewhat stabilize -- the devs kept finding bugs in already semi-
stable features that took priority... for kernel cycle after kernel 
cycle.  The quotas/qgroups feature, already introduced and intended to be 
at least semi-stable was one such culprit, requiring repeated rewrite and 
kernel cycles worth of bug squashing.  A few compression bugs, critical 
under the right circumstances, where compression was supposed to be an 
already reasonably stable feature, were another, tho these took far less 
developer bandwidth than quotas.  Getting a reasonably usable fsck was a 
bunch of little patches.  AFAIK that one wasn't actually an original 
focus and was intended to be back-burnered for some time, but once btrfs 
hit mainline, users started demanding it, so the priority was bumped.  
And of course having it has been good for finding and ultimately fixing 
other bugs as well, so it wasn't a bad thing, but the hard fact is the 
repairing fsck has taken, all told, I'd guess about the same number of 
developer cycles as quotas, and those developer cycles had to come from 
stuff that had been roadmapped for earlier.

As a bit of an optimist I'd be inclined to argue that OK, we've gotten 
btrfs in far better shape general stability-wise now, and going forward, 
the focus can be back on the stuff that was roadmapped for earlier that 
this stuff displaced, so one might hope things will move faster again 
now, but really, who knows?  That's arguably what the devs thought when 
they mainlined btrfs, too, and yet it took all this much longer to mature 
and stabilize since then.  Still, it /has/ to happen at /some/ point, 
right?  And I know for a fact that btrfs is far more stable now than it 
was... because things like ungraceful shutdowns that used to at minimum 
trigger (raid1 mode) scrub fixes on remount and scrub, now... don't -- 
btrfs is now stable enough that the atomic COW is doing its job and 
things "just work", where before, they required scrub repair at best, and 
occasional blow away and restore from backups.  So I can at least /hope/ 
that the worst of the plague of bugs is behind us, and people can work on 
what they intended to do most (say 80%) of the time now, spending say a 
day's worth a week (20%) on bugs, instead of the reverse, 80% (4 days a 
week) on bugs and if they're lucky, a day a week on what they were 
supposed to be focused on, which is what we were seeing for awhile.

Plus the tools to do the debugging, etc, are far more mature now, another 
reason bugs should hopefully take less time now.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: unsolvable technical issues?

2018-06-29 Thread Duncan
Austin S. Hemmelgarn posted on Mon, 25 Jun 2018 07:26:41 -0400 as
excerpted:

> On 2018-06-24 16:22, Goffredo Baroncelli wrote:
>> On 06/23/2018 07:11 AM, Duncan wrote:
>>> waxhead posted on Fri, 22 Jun 2018 01:13:31 +0200 as excerpted:
>>>
>>>> According to this:
>>>>
>>>> https://stratis-storage.github.io/StratisSoftwareDesign.pdf Page 4 ,
>>>> section 1.2
>>>>
>>>> It claims that BTRFS still has significant technical issues that may
>>>> never be resolved.
>>>
>>> I can speculate a bit.
>>>
>>> 1) When I see btrfs "technical issue that may never be resolved", the
>>> #1 first thing I think of, that AFAIK there are _definitely_ no plans
>>> to resolve, because it's very deeply woven into the btrfs core by now,
>>> is...
>>>
>>> [1)] Filesystem UUID Identification.  Btrfs takes the UU bit of
>>> Universally Unique quite literally, assuming they really *are*
>>> unique, at least on that system[.]  Because
>>> btrfs uses this supposedly unique ID to ID devices that belong to the
>>> filesystem, it can get *very* mixed up, with results possibly
>>> including dataloss, if it sees devices that don't actually belong to a
>>> filesystem with the same UUID as a mounted filesystem.
>> 
>> As partial workaround you can disable udev btrfs rules and then do a
>> "btrfs dev scan" manually only for the device which you need.
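For what it's worth, a hedged sketch of that workaround (the rule filename and device name vary by distro and machine, so treat both as illustrative -- check /usr/lib/udev/rules.d/ for the actual rule on your system):

```
# Mask the distro udev rule that auto-registers btrfs member devices,
# then register only the one device you actually want the kernel to see.
ln -s /dev/null /etc/udev/rules.d/64-btrfs.rules
udevadm control --reload
btrfs device scan /dev/sdb1
```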

> You don't even need `btrfs dev scan` if you just specify the exact set
> of devices in the mount options.  The `device=` mount option tells the
> kernel to check that device during the mount process.
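For example (device names purely illustrative), a two-device filesystem could then be mounted without any prior scan along these lines:

```
# Name every member explicitly; the kernel checks each listed device
# during the mount instead of relying on a prior "btrfs dev scan".
mount -t btrfs -o device=/dev/sdc1,device=/dev/sdd1 /dev/sdc1 /mnt/data
```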

Not that lvm does any better in this regard[1], but has btrfs ever solved 
the bug where only one device= in the kernel commandline's rootflags= 
would take effect, effectively forcing initr* on people (like me) who 
would otherwise not need them and prefer to do without them, if they're 
using a multi-device btrfs as root?
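Concretely, the setup in question is a kernel command line along these lines (paths illustrative), where per the bug only the first device= would take effect:

```
root=/dev/sda2 rootfstype=btrfs rootflags=device=/dev/sda2,device=/dev/sdb2
```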

Not to mention the fact that as kernel people will tell you, device 
enumeration isn't guaranteed to be in the same order every boot, so 
device=/dev/* can't be relied upon and shouldn't be used -- but of course 
device=LABEL= and device=UUID= and similar won't work without userspace, 
basically udev (if they work at all, IDK if they actually do).

Tho in practice from what I've seen, device enumeration order tends to be 
dependable /enough/ for at least those without enterprise-level numbers 
of devices to enumerate.  True, it /does/ change from time to time with a 
new kernel, but anybody sane keeps a tested-dependable old kernel around 
to boot to until they know the new one works as expected, and that sort 
of change is seldom enough that users can boot to the old kernel and 
adjust their settings for the new one as necessary when it does happen.  
So as much as "don't do it that way because it's not reliable" might 
indeed hold in theory, in practice, just using an ordered /dev/* in kernel 
commandlines does tend to "just work"... provided one is ready for the 
occasion when that device parameter might need a bit of adjustment, of 
course.

> Also, while LVM does have 'issues' with cloned PV's, it fails safe (by
> refusing to work on VG's that have duplicate PV's), while BTRFS fails
> very unsafely (by randomly corrupting data).

And IMO that "failing unsafe" is both serious and common enough that it 
easily justifies adding the point to a list of this sort, thus my putting 
it #1.
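As an aside, one hedged way to defuse an accidental clone before the kernel ever sees both copies is to re-randomize the clone's fsid while it is unmounted (device name illustrative; newer btrfs-progs may ask for confirmation):

```
# btrfstune -u writes a new random filesystem UUID to the (unmounted)
# clone, so it no longer collides with the original's fsid.
btrfstune -u /dev/sdc1
```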

>>> 2) Subvolume and (more technically) reflink-aware defrag.
>>>
>>> It was there for a couple kernel versions some time ago, but
>>> "impossibly" slow, so it was disabled until such time as btrfs could
>>> be made to scale rather better in this regard.

> I still contend that the biggest issue WRT reflink-aware defrag was that
> it was not optional.  The only way to get the old defrag behavior was to
> boot a kernel that didn't have reflink-aware defrag support.  IOW,
> _everyone_ had to deal with the performance issues, not just the people
> who wanted to use reflink-aware defrag.

Absolutely.

Which of course suggests making it optional, with a suitable warning as 
to the speed implications with lots of snapshots/reflinks, when it does 
get enabled again (and as David mentions elsewhere, there's apparently 
some work going into the idea once again, which potentially moves it from 
the 3-5 year range, at best, back to a 1/2-2-year range, time will tell).

>>> 3) N-way-mirroring.
>>>
>> [...]
>> This is not an issue, but a not implemented feature
> If you're looking at feature parity with competitors, it's an issue.

Exactly my point.  Thanks. =:^)

>>> 4) (Until relatively recently, and still in terms of scaling) Quotas.
>>>
>>> Until relatively recently, quotas could arguably be added to the list.
>>> They were rewritten multiple times, and until recently, appeared to be
>>> effectively eternally broken.
>> 
>> Even tough what you are reporting is correct, I have to point out that
>> the quota in BTRFS is more complex than the equivalent one of the other
>> FS.

Which, arguably, is exactly 

Re: So, does btrfs check lowmem take days? weeks?

2018-06-29 Thread Marc MERLIN
Well, there goes that. After about 18H:
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, 
owner: 374857, offset: 235175936) wanted: 1, have: 1452 
backref.c:466: __add_missing_keys: Assertion `ref->root_id` failed, value 0 
btrfs(+0x3a232)[0x56091704f232] 
btrfs(+0x3ab46)[0x56091704fb46] 
btrfs(+0x3b9f5)[0x5609170509f5] 
btrfs(btrfs_find_all_roots+0x9)[0x560917050a45] 
btrfs(+0x572ff)[0x56091706c2ff] 
btrfs(+0x60b13)[0x560917075b13] 
btrfs(cmd_check+0x2634)[0x56091707d431] 
btrfs(main+0x88)[0x560917027260] 
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7f93aa508561] 
btrfs(_start+0x2a)[0x560917026dfa] 
Aborted 

That's https://github.com/Damenly/btrfs-progs.git

Whoops, I didn't use the tmp1 branch, let me try again with that and
report back, although the problem above is still going to be there since
I think the only difference will be this, correct?
https://github.com/Damenly/btrfs-progs/commit/b5851513a12237b3e19a3e71f3ad00b966d25b3a

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/   | PGP 7F55D5F27AAF9D08


Re: [PATCH 3/3] btrfs: fix race between mkfs and mount

2018-06-29 Thread Anand Jain




On 06/29/2018 08:06 PM, David Sterba wrote:

On Tue, Jun 26, 2018 at 10:42:32PM +0800, Anand Jain wrote:

Last version of the proposed fix is to extend the uuid_mutex over the
whole mount callback and use it around calls to btrfs_scan_one_device.
That way we'll be sure the mount will always get to the device it
scanned and that will not be freed by the parallel scan.


That will break the requisite #1 as above.


I don't see how it breaks 'mount and or scan of two independent fsid+dev
is possible'. It is possible, but the lock does not need to be
filesystem local.

Concurrent mount or scan will block on the uuid_mutex,


   As uuid_mutex is global, if fsid1 is being mounted, the fsid2 mount
   will wait upto certain extent with this code.


Yes it will wait a bit, but the critical section is short. There's some
IO done, a read of 4KiB in btrfs_read_disk_super. The rest is
cpu-bound and hardly measurable in practice, in the context of concurrent
mount and scanning.

I took the approach to fix the bug first and then make it faster or
cleaner, also to fix it in a way that's still acceptable for current
development cycle.

   My efforts here was to
   use uuid_mutex only to protect the fs_uuid update part, in this way
   there is concurrency in the mount process of fsid1 and fsid2 and
   provides shorter bootup time when the user uses the mount at boot
   option.


The locking can be still improved but the uuid_mutex is not a contended
lock, mount is an operation that does not happen thousand times a
second, same for the scanning.


 My concern is about the boot-up time when there is a larger number of
 LUNs configured with independent btrfs FSIDs to mount at boot. Since
 BTRFS is a kind of infrastructure for servers, we can't rule out
 that such configurations exist. Anyway, as you said, we can use
 uuid_mutex for the current development cycle.

 However, a review of [1], which does fix your earlier concern about
 returning -EBUSY, is appreciated. And pls let me know if going ahead
 with this approach would be acceptable in the current development cycle.


So even if there are several mounts happening during boot, it's just a
few and the delay is IMO unnoticeable. If the boot time is longer by
say 100ms at worst, I'm still ok with my patches as a fix.

But not as a final fix, so the updates to locking you mentioned are
still possible. We now have a point of reference where syzbot is not
able to reproduce the bugs.



[1]
--
When %fs_devices::opened > 0 the device is mounted; %fs_devices is never
freed in that state, so it's safe to use %fs_devices without any locks or
flags.

However, when %fs_devices::opened == 0 (idle), the device is not
yet mounted, and it can either transition to opened or be freed. When
it transitions to freed, fs_devices gets freed and any pointer access
will end up as a use-after-free (UAF) error.

Here are the places where we access fs_devices and need locking;
using uuid_mutex is one method:

1.
READY ioctl

2.
Mount

3.
SCAN ioctl

4.
Stale cleanup during SCAN

5.
planned device FORGET ioctl

#4 and #5 may free the fs_devices while #1, #2, and #3 are still
accessing the fs_devices.

Using uuid_mutex is one choice, however it would slow down mounting
at boot when there is a larger number of independent btrfs fsids being
mounted.

Proposed Fix
-

This does not return -EBUSY. If there is any better way
I am ok to use it.

struct btrfs_fs_devices
{
::
   int ref_count;
::
}

To acquire a fs_devices..

volume_devices_excl_state_enter(x)
{
 fs_devices = NULL
 lock uuid_mutex
   if (fs_devices = find fs_devices(x))
fs_devices::ref_count++
 unlock uuid_mutex
 return fs_devices
}


To release a fs_devices..

volume_devices_excl_state_exit(fs_devices)
{
  lock uuid_mutex
fs_devices::ref_count--
  unlock uuid_mutex
}


To delete a fs_devices..

again:
 lock uuid_mutex
   find fs_devices
   if (fs_devices::ref_count != 0) {
  unlock uuid_mutex
  goto again;
   }
   if (fs_devices->opened > 0) {
 unlock uuid_mutex
 return -EBUSY
   }

   free_fs_devices()
 unlock uuid_mutex

--
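The enter/exit/delete pattern sketched above can be modeled in userspace to sanity-check the logic (Python purely for illustration; the names and the string results are mine, not kernel API -- the kernel version would use uuid_mutex and return -EBUSY):

```python
import threading

class FsDevices:
    """Userspace stand-in for struct btrfs_fs_devices."""
    def __init__(self, fsid):
        self.fsid = fsid
        self.ref_count = 0   # pins the struct against freeing
        self.opened = 0      # > 0 means mounted

uuid_mutex = threading.Lock()
registry = {}  # fsid -> FsDevices, stand-in for the global fs_uuids list

def excl_state_enter(fsid):
    # Hold the global lock only long enough to find and pin the entry.
    with uuid_mutex:
        fs = registry.get(fsid)
        if fs is not None:
            fs.ref_count += 1
        return fs

def excl_state_exit(fs):
    with uuid_mutex:
        fs.ref_count -= 1

def try_free(fsid):
    # Mirrors the delete path: retry while pinned, busy while mounted.
    with uuid_mutex:
        fs = registry.get(fsid)
        if fs is None:
            return "gone"
        if fs.ref_count != 0:
            return "retry"   # the kernel sketch loops ("goto again")
        if fs.opened > 0:
            return "busy"    # the kernel sketch returns -EBUSY
        del registry[fsid]
        return "freed"
```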


Thanks, Anand





Re: So, does btrfs check lowmem take days? weeks?

2018-06-29 Thread Chris Murphy
I've got about 1/2 the snapshots and less than 1/10th the data...but
my btrfs check times are much shorter than either: 15 minutes and 65
minutes (lowmem).
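For reference, the two modes being compared would be invoked roughly like this (both read-only, on the unmounted filesystem; device path as in the output below):

```
btrfs check --readonly /dev/mapper/first                 # original mode
btrfs check --mode=lowmem --readonly /dev/mapper/first   # lowmem mode
```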


[chris@f28s ~]$ sudo btrfs fi us /mnt/first
Overall:
    Device size:                1024.00GiB
    Device allocated:            774.12GiB
    Device unallocated:          249.87GiB
    Device missing:                  0.00B
    Used:                        760.48GiB
    Free (estimated):            256.95GiB      (min: 132.01GiB)
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,single: Size:761.00GiB, Used:753.93GiB
   /dev/mapper/first 761.00GiB

Metadata,DUP: Size:6.50GiB, Used:3.28GiB
   /dev/mapper/first  13.00GiB

System,DUP: Size:64.00MiB, Used:112.00KiB
   /dev/mapper/first 128.00MiB

Unallocated:
   /dev/mapper/first 249.87GiB


146 subvolumes
137 snapshots

total csum bytes: 790549924
total tree bytes: 3519250432
total fs tree bytes: 2546073600
total extent tree bytes: 131350528


Original mode check takes ~15 minutes
Lowmem mode takes ~65 minutes

RAM: 4G
CPU: Intel(R) Pentium(R) CPU  N3700  @ 1.60GHz



Chris Murphy


Re: btrfs suddenly think's it's raid6

2018-06-29 Thread Chris Murphy
On Fri, Jun 29, 2018 at 4:09 PM, marble  wrote:
> Hey,
> Thanks for the quick reply :-)
>
>> Is there anything like unexpected power loss happens?
> Power loss may have happened. It was powered via an RPi.
>> And would you provide the following data for debugging?
>>
>> # btrfs ins dump-super -fFa /dev/mapper/black
> See attachment.
>> And further more, what's the device mapper setup for /dev/mapper/black?
>> Is there anything like RAID here?
> This is due to LUKS I think. "black" is the name I gave to cryptsetup open.

No log_root* in the super. So no fsyncing at the time? What was
happening at the time of the power loss?

Pretty weird, all three supers are OK and yet two copies of the fs
tree are corrupt. Why doesn't it fall back automatically to one of the
other roots?

You could try 'mount -o usebackuproot,ro' and see if that's permissive
enough to succeed.

I wonder if the drive firmware reordered things contrary to Btrfs
expectation right at the powerfail - or even if the powerfail caused
the drive firmware to do the writes in the wrong order?


-- 
Chris Murphy


Re: Removed Disk - Super Error - Scrub did not fix - next steps?

2018-06-29 Thread Chris Murphy
On Fri, Jun 29, 2018 at 11:55 AM,   wrote:
> Hello,
>
> I recently was mounting some drives in the free drive bays of my server and
> accidentally removed one of my drives from my raid1 btrfs array.  I
> immediately put it back in - but whatever damage that would have caused
> seems to be already done.
>
> I got these log errors [dmesg -T] and after reading up it seemed that all I
> *probably* needed to do was a scrub:
>
> Finally - the original logs I found by running "dmesg -T|grep -i btrfs" -
> from where the log started to where the scrub begain:
>
> [Tue Jun 26 18:30:08 2018] BTRFS error (device sdb): bdev /dev/sdh errs: wr
> 40547, rd 42808, flush 2, corrupt 0, gen 0
>
> [Tue Jun 26 18:33:19 2018] btrfs_dev_stat_print_on_error: 2056 callbacks
> suppressed
>
> A bunch of these errors...
>
> Then later:
>
> [Tue Jun 26 19:19:52 2018] BTRFS warning (device sdb): lost page write due
> to IO error on /dev/sdh
>
> I think this is where I unmounted/remounted /mnt/titan:
>
> [Tue Jun 26 19:49:18 2018] BTRFS info (device sdb): disk space caching is
> enabled
> [Tue Jun 26 19:49:18 2018] BTRFS info (device sdb): has skinny extents
> [Tue Jun 26 19:49:18 2018] BTRFS info (device sdb): bdev /dev/sdh errs: wr
> 55907, rd 59133, flush 2, corrupt 0, gen 0
> [Tue Jun 26 19:49:57 2018] btrfs_dev_stat_print_on_error: 49 callbacks
> suppressed
> [Tue Jun 26 19:49:57 2018] BTRFS error (device sdb): bdev /dev/sdh errs: wr
> 55907, rd 59133, flush 2, corrupt 0, gen 1
> [Tue Jun 26 19:49:57 2018] BTRFS error (device sdb): bdev /dev/sdh errs: wr
> 55907, rd 59133, flush 2, corrupt 0, gen 2
> [Tue Jun 26 19:49:57 2018] BTRFS error (device sdb): bdev /dev/sdh errs: wr
> 55907, rd 59133, flush 2, corrupt 0, gen 3
> [Tue Jun 26 19:50:09 2018] BTRFS error (device sdb): parent transid verify
> failed on 9081264963584 wanted 8663 found 8321
> [Tue Jun 26 19:50:09 2018] BTRFS info (device sdb): read error corrected:
> ino 0 off 9081264963584 (dev /dev/sdh sector 17728418112)
> [Tue Jun 26 19:50:09 2018] BTRFS info (device sdb): read error corrected:
> ino 0 off 9081264967680 (dev /dev/sdh sector 17728418120)
> [Tue Jun 26 19:50:09 2018] BTRFS info (device sdb): read error corrected:
> ino 0 off 9081264971776 (dev /dev/sdh sector 17728418128)
> [Tue Jun 26 19:50:09 2018] BTRFS info (device sdb): read error corrected:
> ino 0 off 9081264975872 (dev /dev/sdh sector 17728418136)
> [Tue Jun 26 19:50:09 2018] BTRFS error (device sdb): parent transid verify
> failed on 9081265356800 wanted 8663 found 8321
> [Tue Jun 26 19:50:09 2018] BTRFS info (device sdb): read error corrected:
> ino 0 off 9081265356800 (dev /dev/sdh sector 17728418880)
> [Tue Jun 26 19:50:09 2018] BTRFS info (device sdb): read error corrected:
> ino 0 off 9081265360896 (dev /dev/sdh sector 1772841)
> [Tue Jun 26 19:50:09 2018] BTRFS info (device sdb): read error corrected:
> ino 0 off 9081265364992 (dev /dev/sdh sector 17728418896)
> [Tue Jun 26 19:50:09 2018] BTRFS info (device sdb): read error corrected:
> ino 0 off 9081265369088 (dev /dev/sdh sector 17728418904)



These are passive fixups that succeed.
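If useful, the error counters and the repair pass themselves can be driven like this (mount point as used above):

```
# Per-device error counters that produced the dmesg lines above.
btrfs device stats /mnt/titan
# Foreground scrub with per-device statistics; on raid1 it rewrites
# bad copies from the good mirror.
btrfs scrub start -Bd /mnt/titan
```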


>
> According to btrfs scrub status this is when the scrub began...
>
> [Tue Jun 26 19:50:29 2018] BTRFS error (device sdb): parent transid verify
> failed on 9081265061888 wanted 8663 found 8321
> [Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error corrected:
> ino 0 off 9081265061888 (dev /dev/sdh sector 17728418304)
> [Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error corrected:
> ino 0 off 9081265065984 (dev /dev/sdh sector 17728418312)
> [Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error corrected:
> ino 0 off 9081265070080 (dev /dev/sdh sector 17728418320)
> [Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error corrected:
> ino 0 off 9081265074176 (dev /dev/sdh sector 17728418328)
> [Tue Jun 26 19:50:29 2018] BTRFS error (device sdb): parent transid verify
> failed on 9081263046656 wanted 8664 found 8662
> [Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error corrected:
> ino 0 off 9081263046656 (dev /dev/sdh sector 17728414368)
> [Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error corrected:
> ino 0 off 9081263050752 (dev /dev/sdh sector 17728414376)
> [Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error corrected:
> ino 0 off 9081263054848 (dev /dev/sdh sector 17728414384)
> [Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error corrected:
> ino 0 off 9081263058944 (dev /dev/sdh sector 17728414392)
> [Tue Jun 26 19:50:29 2018] BTRFS error (device sdb): parent transid verify
> failed on 9081263194112 wanted 8664 found 8662
> [Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error corrected:
> ino 0 off 9081263194112 (dev /dev/sdh sector 17728414656)
> [Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error corrected:
> ino 0 off 9081263198208 (dev /dev/sdh sector 17728414664)
> [Tue Jun 26 19:50:29 2018] BTRFS error (device sdb): parent transid verify
> 

Re: btrfs suddenly think's it's raid6

2018-06-29 Thread marble
Hey,
Thanks for the quick reply :-)

> Is there anything like unexpected power loss happens?
Power loss may have happened.
> And would you provide the following data for debugging?
> 
> # btrfs ins dump-super -fFa /dev/mapper/black
I attached it.
> And further more, what's the device mapper setup for /dev/mapper/black?
> Is there anything like RAID here?
I think this is the result of LUKS. "black" is the name I passed to
cryptsetup open.
> Thanks,
> Qu
Cheers,
marble
superblock: bytenr=65536, device=/dev/mapper/black
-
csum_type   0 (crc32c)
csum_size   4
csum0x9814744c [match]
bytenr  65536
flags   0x1
( WRITTEN )
magic   _BHRfS_M [match]
fsid9fea91c7-7b0b-4ef9-a83b-e24ccf2586b5
label   black
generation  352
root30736384
sys_array_size  129
chunk_root_generation   145
root_level  1
chunk_root  22020096
chunk_root_level1
log_root0
log_root_transid0
log_root_level  0
total_bytes 500105764864
bytes_used  153954693120
sectorsize  4096
nodesize16384
leafsize (deprecated)   16384
stripesize  4096
root_dir6
num_devices 1
compat_flags0x0
compat_ro_flags 0x0
incompat_flags  0x161
( MIXED_BACKREF |
  BIG_METADATA |
  EXTENDED_IREF |
  SKINNY_METADATA )
cache_generation187
uuid_tree_generation187
dev_item.uuid   cbf2a2ce-9487-4c2d-98e9-5cc69ae59b26
dev_item.fsid   9fea91c7-7b0b-4ef9-a83b-e24ccf2586b5 [match]
dev_item.type   0
dev_item.total_bytes500105764864
dev_item.bytes_used 184708759552
dev_item.io_align   4096
dev_item.io_width   4096
dev_item.sector_size4096
dev_item.devid  1
dev_item.dev_group  0
dev_item.seek_speed 0
dev_item.bandwidth  0
dev_item.generation 0
sys_chunk_array[2048]:
item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 22020096)
length 8388608 owner 2 stripe_len 65536 type SYSTEM|DUP
io_align 65536 io_width 65536 sector_size 4096
num_stripes 2 sub_stripes 0
stripe 0 devid 1 offset 22020096
dev_uuid cbf2a2ce-9487-4c2d-98e9-5cc69ae59b26
stripe 1 devid 1 offset 30408704
dev_uuid cbf2a2ce-9487-4c2d-98e9-5cc69ae59b26
backup_roots[4]:
backup 0:
backup_tree_root:   1084751872  gen: 187level: 1
backup_chunk_root:  22020096gen: 145level: 1
backup_extent_root: 1084702720  gen: 187level: 2
backup_fs_root: 1084653568  gen: 187level: 2
backup_dev_root:1081655296  gen: 178level: 0
backup_csum_root:   1084817408  gen: 187level: 2
backup_total_bytes: 500105764864
backup_bytes_used:  153957298176
backup_num_devices: 1

backup 1:
backup_tree_root:   1083801600  gen: 184level: 1
backup_chunk_root:  22020096gen: 145level: 1
backup_extent_root: 1083752448  gen: 184level: 2
backup_fs_root: 1083604992  gen: 184level: 2
backup_dev_root:1081655296  gen: 178level: 0
backup_csum_root:   1083867136  gen: 184level: 2
backup_total_bytes: 500105764864
backup_bytes_used:  153957298176
backup_num_devices: 1

backup 2:
backup_tree_root:   1084145664  gen: 185level: 1
backup_chunk_root:  22020096gen: 145level: 1
backup_extent_root: 1084080128  gen: 185level: 2
backup_fs_root: 1083981824  gen: 185level: 2
backup_dev_root:1081655296  gen: 178level: 0
backup_csum_root:   1084211200  gen: 185level: 2
backup_total_bytes: 500105764864
backup_bytes_used:  153957298176
backup_num_devices: 1

backup 3:
backup_tree_root:   1084473344  gen: 186level: 1
backup_chunk_root:  22020096gen: 145level: 1
backup_extent_root: 1084424192  gen: 186level: 2
backup_fs_root: 

Re: btrfs suddenly think's it's raid6

2018-06-29 Thread marble
Hey,
Thanks for the quick reply :-)

> Did anything like an unexpected power loss happen?
Power loss may have happened. It was powered via an RPi.
> And would you provide the following data for debugging?
> 
> # btrfs ins dump-super -fFa /dev/mapper/black
See attachment.
> And furthermore, what's the device mapper setup for /dev/mapper/black?
> Is there anything like RAID here?
This is due to LUKS I think. "black" is the name I gave to cryptsetup open.

Cheers,
marble
superblock: bytenr=65536, device=/dev/mapper/black
-
csum_type   0 (crc32c)
csum_size   4
csum0x9814744c [match]
bytenr  65536
flags   0x1
( WRITTEN )
magic   _BHRfS_M [match]
fsid9fea91c7-7b0b-4ef9-a83b-e24ccf2586b5
label   black
generation  352
root30736384
sys_array_size  129
chunk_root_generation   145
root_level  1
chunk_root  22020096
chunk_root_level1
log_root0
log_root_transid0
log_root_level  0
total_bytes 500105764864
bytes_used  153954693120
sectorsize  4096
nodesize16384
leafsize (deprecated)   16384
stripesize  4096
root_dir6
num_devices 1
compat_flags0x0
compat_ro_flags 0x0
incompat_flags  0x161
( MIXED_BACKREF |
  BIG_METADATA |
  EXTENDED_IREF |
  SKINNY_METADATA )
cache_generation187
uuid_tree_generation187
dev_item.uuid   cbf2a2ce-9487-4c2d-98e9-5cc69ae59b26
dev_item.fsid   9fea91c7-7b0b-4ef9-a83b-e24ccf2586b5 [match]
dev_item.type   0
dev_item.total_bytes500105764864
dev_item.bytes_used 184708759552
dev_item.io_align   4096
dev_item.io_width   4096
dev_item.sector_size4096
dev_item.devid  1
dev_item.dev_group  0
dev_item.seek_speed 0
dev_item.bandwidth  0
dev_item.generation 0
sys_chunk_array[2048]:
item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 22020096)
length 8388608 owner 2 stripe_len 65536 type SYSTEM|DUP
io_align 65536 io_width 65536 sector_size 4096
num_stripes 2 sub_stripes 0
stripe 0 devid 1 offset 22020096
dev_uuid cbf2a2ce-9487-4c2d-98e9-5cc69ae59b26
stripe 1 devid 1 offset 30408704
dev_uuid cbf2a2ce-9487-4c2d-98e9-5cc69ae59b26
backup_roots[4]:
backup 0:
backup_tree_root:   1084751872  gen: 187level: 1
backup_chunk_root:  22020096gen: 145level: 1
backup_extent_root: 1084702720  gen: 187level: 2
backup_fs_root: 1084653568  gen: 187level: 2
backup_dev_root:1081655296  gen: 178level: 0
backup_csum_root:   1084817408  gen: 187level: 2
backup_total_bytes: 500105764864
backup_bytes_used:  153957298176
backup_num_devices: 1

backup 1:
backup_tree_root:   1083801600  gen: 184level: 1
backup_chunk_root:  22020096gen: 145level: 1
backup_extent_root: 1083752448  gen: 184level: 2
backup_fs_root: 1083604992  gen: 184level: 2
backup_dev_root:1081655296  gen: 178level: 0
backup_csum_root:   1083867136  gen: 184level: 2
backup_total_bytes: 500105764864
backup_bytes_used:  153957298176
backup_num_devices: 1

backup 2:
backup_tree_root:   1084145664  gen: 185level: 1
backup_chunk_root:  22020096gen: 145level: 1
backup_extent_root: 1084080128  gen: 185level: 2
backup_fs_root: 1083981824  gen: 185level: 2
backup_dev_root:1081655296  gen: 178level: 0
backup_csum_root:   1084211200  gen: 185level: 2
backup_total_bytes: 500105764864
backup_bytes_used:  153957298176
backup_num_devices: 1

backup 3:
backup_tree_root:   1084473344  gen: 186level: 1
backup_chunk_root:  22020096gen: 145level: 1
backup_extent_root: 1084424192  gen: 186level: 2
backup_fs_root: 
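One detail worth pulling out of the dump above is incompat_flags 0x161. Decoded against the incompat feature bits (the values below are my reading of the kernel's BTRFS_FEATURE_INCOMPAT_* definitions, not something stated in this thread), the RAID56 bit (0x80) is not set, which is consistent with the SYSTEM|DUP chunk shown in the sys_chunk_array. A minimal Python sketch:

```python
# Incompat feature bits as I read them from the kernel's btrfs headers
# (BTRFS_FEATURE_INCOMPAT_*); verify against your own kernel source.
INCOMPAT_FLAGS = {
    0x001: "MIXED_BACKREF",
    0x002: "DEFAULT_SUBVOL",
    0x004: "MIXED_GROUPS",
    0x008: "COMPRESS_LZO",
    0x010: "COMPRESS_ZSTD",
    0x020: "BIG_METADATA",
    0x040: "EXTENDED_IREF",
    0x080: "RAID56",
    0x100: "SKINNY_METADATA",
    0x200: "NO_HOLES",
}

def decode_incompat(flags: int) -> list:
    """Return the names of the incompat feature bits set in `flags`."""
    return [name for bit, name in sorted(INCOMPAT_FLAGS.items()) if flags & bit]

print(decode_incompat(0x161))
# ['MIXED_BACKREF', 'BIG_METADATA', 'EXTENDED_IREF', 'SKINNY_METADATA']
```

So whatever is making the filesystem report raid6, it is not recorded in this superblock's feature flags.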

Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-29 Thread Chris Murphy
On Fri, Jun 29, 2018 at 9:15 AM, james harvey  wrote:
> On Thu, Jun 28, 2018 at 6:27 PM, Chris Murphy  wrote:
>> And an open question I have about scrub is whether it only ever is
>> checking csums, meaning nodatacow files are never scrubbed, or if the
>> copies are at least compared to each other?
>
> Scrub never looks at nodatacow files.  It does not compare the copies
> to each other.
>
> Qu submitted a patch to make check compare the copies:
> https://patchwork.kernel.org/patch/10434509/

Yeah online scrub needs to report any mismatches, even if it can't do
anything about it because it's ambiguous which copy is wrong.


> IMO, I think the offline check should look at nodatacow copies like
> this, but I still think this also needs to be added to scrub.  In the
> patch thread, I discuss my reasons why.  In brief: online scanning;
> this goes along with user's expectation of scrub ensuring mirrored
> data integrity; and recommendations to set up scrub on a periodic basis
> to me mean it's the place to put it.

I don't mind this being implemented in offline scrub first for testing
purposes. But the online scrub certainly should have this ability
eventually.
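For illustration, the report-only comparison under discussion amounts to a block-by-block check like the following (a sketch, not btrfs code; since nodatacow blocks carry no csum, a mismatch can only be flagged, never resolved automatically):

```python
BLOCK = 4096  # sectorsize, matching the superblock dumps in this archive

def compare_copies(copy_a: bytes, copy_b: bytes):
    """Block-by-block comparison of two mirror copies of a nodatacow
    extent.  With no checksum to consult there is no way to say which
    copy is correct, so all a scrub could do is report where they
    diverge, returning the byte offsets of mismatched blocks."""
    assert len(copy_a) == len(copy_b), "mirror copies must be equal length"
    return [off for off in range(0, len(copy_a), BLOCK)
            if copy_a[off:off + BLOCK] != copy_b[off:off + BLOCK]]
```

For example, two 8 KiB copies differing somewhere in the second block would be reported as a mismatch at offset 4096.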



-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Removed Disk - Super Error - Scrub did not fix - next steps?

2018-06-29 Thread j

Hello,

I was recently mounting some drives in the free drive bays of my server 
and accidentally removed one of the drives in my raid1 btrfs array.  I 
immediately put it back in - but whatever damage that caused seems to 
have already been done.


I got these log errors [dmesg -T] and after reading up it seemed that 
all I *probably* needed to do was a scrub:


Finally - the original logs I found by running "dmesg -T|grep -i btrfs" 
- from where the log started to where the scrub began:


[Tue Jun 26 18:30:08 2018] BTRFS error (device sdb): bdev /dev/sdh errs: 
wr 40547, rd 42808, flush 2, corrupt 0, gen 0


[Tue Jun 26 18:33:19 2018] btrfs_dev_stat_print_on_error: 2056 callbacks 
suppressed


A bunch of these errors...

Then later:

[Tue Jun 26 19:19:52 2018] BTRFS warning (device sdb): lost page write 
due to IO error on /dev/sdh


I think this is where I unmounted/remounted /mnt/titan:

[Tue Jun 26 19:49:18 2018] BTRFS info (device sdb): disk space caching 
is enabled

[Tue Jun 26 19:49:18 2018] BTRFS info (device sdb): has skinny extents
[Tue Jun 26 19:49:18 2018] BTRFS info (device sdb): bdev /dev/sdh errs: 
wr 55907, rd 59133, flush 2, corrupt 0, gen 0
[Tue Jun 26 19:49:57 2018] btrfs_dev_stat_print_on_error: 49 callbacks 
suppressed
[Tue Jun 26 19:49:57 2018] BTRFS error (device sdb): bdev /dev/sdh errs: 
wr 55907, rd 59133, flush 2, corrupt 0, gen 1
[Tue Jun 26 19:49:57 2018] BTRFS error (device sdb): bdev /dev/sdh errs: 
wr 55907, rd 59133, flush 2, corrupt 0, gen 2
[Tue Jun 26 19:49:57 2018] BTRFS error (device sdb): bdev /dev/sdh errs: 
wr 55907, rd 59133, flush 2, corrupt 0, gen 3
[Tue Jun 26 19:50:09 2018] BTRFS error (device sdb): parent transid 
verify failed on 9081264963584 wanted 8663 found 8321
[Tue Jun 26 19:50:09 2018] BTRFS info (device sdb): read error 
corrected: ino 0 off 9081264963584 (dev /dev/sdh sector 17728418112)
[Tue Jun 26 19:50:09 2018] BTRFS info (device sdb): read error 
corrected: ino 0 off 9081264967680 (dev /dev/sdh sector 17728418120)
[Tue Jun 26 19:50:09 2018] BTRFS info (device sdb): read error 
corrected: ino 0 off 9081264971776 (dev /dev/sdh sector 17728418128)
[Tue Jun 26 19:50:09 2018] BTRFS info (device sdb): read error 
corrected: ino 0 off 9081264975872 (dev /dev/sdh sector 17728418136)
[Tue Jun 26 19:50:09 2018] BTRFS error (device sdb): parent transid 
verify failed on 9081265356800 wanted 8663 found 8321
[Tue Jun 26 19:50:09 2018] BTRFS info (device sdb): read error 
corrected: ino 0 off 9081265356800 (dev /dev/sdh sector 17728418880)
[Tue Jun 26 19:50:09 2018] BTRFS info (device sdb): read error 
corrected: ino 0 off 9081265360896 (dev /dev/sdh sector 1772841)
[Tue Jun 26 19:50:09 2018] BTRFS info (device sdb): read error 
corrected: ino 0 off 9081265364992 (dev /dev/sdh sector 17728418896)
[Tue Jun 26 19:50:09 2018] BTRFS info (device sdb): read error 
corrected: ino 0 off 9081265369088 (dev /dev/sdh sector 17728418904)


According to btrfs scrub status this is when the scrub began...

[Tue Jun 26 19:50:29 2018] BTRFS error (device sdb): parent transid 
verify failed on 9081265061888 wanted 8663 found 8321
[Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error 
corrected: ino 0 off 9081265061888 (dev /dev/sdh sector 17728418304)
[Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error 
corrected: ino 0 off 9081265065984 (dev /dev/sdh sector 17728418312)
[Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error 
corrected: ino 0 off 9081265070080 (dev /dev/sdh sector 17728418320)
[Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error 
corrected: ino 0 off 9081265074176 (dev /dev/sdh sector 17728418328)
[Tue Jun 26 19:50:29 2018] BTRFS error (device sdb): parent transid 
verify failed on 9081263046656 wanted 8664 found 8662
[Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error 
corrected: ino 0 off 9081263046656 (dev /dev/sdh sector 17728414368)
[Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error 
corrected: ino 0 off 9081263050752 (dev /dev/sdh sector 17728414376)
[Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error 
corrected: ino 0 off 9081263054848 (dev /dev/sdh sector 17728414384)
[Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error 
corrected: ino 0 off 9081263058944 (dev /dev/sdh sector 17728414392)
[Tue Jun 26 19:50:29 2018] BTRFS error (device sdb): parent transid 
verify failed on 9081263194112 wanted 8664 found 8662
[Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error 
corrected: ino 0 off 9081263194112 (dev /dev/sdh sector 17728414656)
[Tue Jun 26 19:50:29 2018] BTRFS info (device sdb): read error 
corrected: ino 0 off 9081263198208 (dev /dev/sdh sector 17728414664)
[Tue Jun 26 19:50:29 2018] BTRFS error (device sdb): parent transid 
verify failed on 9081264128000 wanted 8664 found 8662
[Tue Jun 26 19:50:29 2018] BTRFS error (device sdb): parent transid 
verify failed on 9081265520640 wanted 8665 found 8321
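As an aside, the per-device counters in those lines (wr/rd/flush/corrupt/gen) are the same ones `btrfs device stats` reports. Pulling them out of a dmesg capture can be sketched in Python; this assumes the 4.x-era message format quoted above, which is not a stable ABI:

```python
import re

# Matches the per-device counter line as printed by the kernel in the
# logs above; treat the exact wording as an assumption, not a contract.
ERRS_RE = re.compile(
    r"bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)")

def parse_dev_errs(log: str):
    """Pull the per-device error counters out of a dmesg capture."""
    return [{k: v if k == "dev" else int(v)
             for k, v in m.groupdict().items()}
            for m in ERRS_RE.finditer(log)]

sample = ("[Tue Jun 26 19:49:57 2018] BTRFS error (device sdb): "
          "bdev /dev/sdh errs: wr 55907, rd 59133, flush 2, corrupt 0, gen 3")
print(parse_dev_errs(sample))
```

Note that the climbing `gen` counter is the generation-mismatch count, which lines up with the "parent transid verify failed" messages that follow it in the log.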


So I ran btrfs scrub and this was my 

Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-29 Thread Austin S. Hemmelgarn

On 2018-06-29 13:58, james harvey wrote:

On Fri, Jun 29, 2018 at 1:09 PM, Austin S. Hemmelgarn
 wrote:

On 2018-06-29 11:15, james harvey wrote:


On Thu, Jun 28, 2018 at 6:27 PM, Chris Murphy 
wrote:


And an open question I have about scrub is whether it only ever is
checking csums, meaning nodatacow files are never scrubbed, or if the
copies are at least compared to each other?



Scrub never looks at nodatacow files.  It does not compare the copies
to each other.

Qu submitted a patch to make check compare the copies:
https://patchwork.kernel.org/patch/10434509/

This hasn't been added to btrfs-progs git yet.

IMO, I think the offline check should look at nodatacow copies like
this, but I still think this also needs to be added to scrub.  In the
patch thread, I discuss my reasons why.  In brief: online scanning;
this goes along with user's expectation of scrub ensuring mirrored
data integrity; and recommendations to set up scrub on a periodic basis
to me mean it's the place to put it.


That said, it can't sanely fix things if there is a mismatch. At least, not
unless BTRFS gets proper generational tracking to handle temporarily missing
devices.  As of right now, sanely fixing things requires significant manual
intervention, as you have to bypass the device read selection algorithm to
be able to look at the state of the individual copies so that you can pick
one to use and forcibly rewrite the whole file by hand.


Absolutely.  User would need to use manual intervention as you
describe, or restore the single file(s) from backup.  But, it's a good
opportunity to tell the user they had partial data corruption, even if
it can't be auto-fixed.  Otherwise they get intermittent data
corruption, depending on which copies are read.
The thing is though, as things stand right now, you need to manually 
edit the data on-disk directly or restore the file from a backup to fix 
the file.  While it's technically true that you can manually repair this 
type of thing, in both cases, without those patches I mentioned it's 
functionally impossible for a regular user to do it without potentially 
losing some data.


Unless that changes, scrub telling you it's corrupt is not going to help 
much aside from making sure you don't make things worse by trying to use 
it.  Given this, it would make sense to have a (disabled by default) 
option to have scrub repair it by just using the newer or older copy of 
the data.  That would require classic RAID generational tracking though, 
which BTRFS doesn't have yet.



A while back, Anand Jain posted some patches that would let you select a
particular device to direct all reads to via a mount option, but I don't
think they ever got merged.  That would have made manual recovery in cases
like this exponentially easier (mount read-only with one device selected,
copy the file out somewhere, remount read-only with the other device, drop
caches, copy the file out again, compare and reconcile the two copies, then
remount the volume writable and write out the repaired file).




Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-29 Thread james harvey
On Fri, Jun 29, 2018 at 1:09 PM, Austin S. Hemmelgarn
 wrote:
> On 2018-06-29 11:15, james harvey wrote:
>>
>> On Thu, Jun 28, 2018 at 6:27 PM, Chris Murphy 
>> wrote:
>>>
>>> And an open question I have about scrub is whether it only ever is
>>> checking csums, meaning nodatacow files are never scrubbed, or if the
>>> copies are at least compared to each other?
>>
>>
>> Scrub never looks at nodatacow files.  It does not compare the copies
>> to each other.
>>
>> Qu submitted a patch to make check compare the copies:
>> https://patchwork.kernel.org/patch/10434509/
>>
>> This hasn't been added to btrfs-progs git yet.
>>
>> IMO, I think the offline check should look at nodatacow copies like
>> this, but I still think this also needs to be added to scrub.  In the
>> patch thread, I discuss my reasons why.  In brief: online scanning;
>> this goes along with user's expectation of scrub ensuring mirrored
>> data integrity; and recommendations to set up scrub on a periodic basis
>> to me mean it's the place to put it.
>
> That said, it can't sanely fix things if there is a mismatch. At least, not
> unless BTRFS gets proper generational tracking to handle temporarily missing
> devices.  As of right now, sanely fixing things requires significant manual
> intervention, as you have to bypass the device read selection algorithm to
> be able to look at the state of the individual copies so that you can pick
> one to use and forcibly rewrite the whole file by hand.

Absolutely.  User would need to use manual intervention as you
describe, or restore the single file(s) from backup.  But, it's a good
opportunity to tell the user they had partial data corruption, even if
it can't be auto-fixed.  Otherwise they get intermittent data
corruption, depending on which copies are read.

> A while back, Anand Jain posted some patches that would let you select a
> particular device to direct all reads to via a mount option, but I don't
> think they ever got merged.  That would have made manual recovery in cases
> like this exponentially easier (mount read-only with one device selected,
> copy the file out somewhere, remount read-only with the other device, drop
> caches, copy the file out again, compare and reconcile the two copies, then
> remount the volume writable and write out the repaired file).


Re: Incremental send/receive broken after snapshot restore

2018-06-29 Thread Andrei Borzenkov
On 28.06.2018 23:09, Hannes Schweizer wrote:
> Hi,
> 
> Here's my environment:
> Linux diablo 4.17.0-gentoo #5 SMP Mon Jun 25 00:26:55 CEST 2018 x86_64
> Intel(R) Core(TM) i5 CPU 760 @ 2.80GHz GenuineIntel GNU/Linux
> btrfs-progs v4.17
> 
> Label: 'online'  uuid: e4dc6617-b7ed-4dfb-84a6-26e3952c8390
> Total devices 2 FS bytes used 3.16TiB
> devid1 size 1.82TiB used 1.58TiB path /dev/mapper/online0
> devid2 size 1.82TiB used 1.58TiB path /dev/mapper/online1
> Data, RAID0: total=3.16TiB, used=3.15TiB
> System, RAID0: total=16.00MiB, used=240.00KiB
> Metadata, RAID0: total=7.00GiB, used=4.91GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> Label: 'offline'  uuid: 5b449116-93e5-473e-aaf5-bf3097b14f29
> Total devices 2 FS bytes used 3.52TiB
> devid1 size 5.46TiB used 3.53TiB path /dev/mapper/offline0
> devid2 size 5.46TiB used 3.53TiB path /dev/mapper/offline1
> Data, RAID1: total=3.52TiB, used=3.52TiB
> System, RAID1: total=8.00MiB, used=512.00KiB
> Metadata, RAID1: total=6.00GiB, used=5.11GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> Label: 'external'  uuid: 8bf13621-01f0-4f09-95c7-2c157d3087d0
> Total devices 1 FS bytes used 3.65TiB
> devid1 size 5.46TiB used 3.66TiB path
> /dev/mapper/luks-3c196e96-d46c-4a9c-9583-b79c707678fc
> Data, single: total=3.64TiB, used=3.64TiB
> System, DUP: total=32.00MiB, used=448.00KiB
> Metadata, DUP: total=11.00GiB, used=9.72GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> 
> The following automatic backup scheme is in place:
> hourly:
> btrfs sub snap -r online/root online/root.
> 
> daily:
> btrfs sub snap -r online/root online/root.
> btrfs send -c online/root.
> online/root. | btrfs receive offline
> btrfs sub del -c online/root.
> 
> monthly:
> btrfs sub snap -r online/root online/root.
> btrfs send -c online/root.
> online/root. | btrfs receive external
> btrfs sub del -c online/root.
> 
> Now here are the commands leading up to my problem:
> After the online filesystem suddenly went ro, and btrfs check showed
> massive problems, I decided to start the online array from scratch:
> 1: mkfs.btrfs -f -d raid0 -m raid0 -L "online" /dev/mapper/online0
> /dev/mapper/online1
> 
> As you can see from the backup commands above, the snapshots of
> offline and external are not related, so in order to at least keep the
> extensive backlog of the external snapshot set (including all
> reflinks), I decided to restore the latest snapshot from external.
> 2: btrfs send external/root. | btrfs receive online
> 
> I wanted to ensure I can restart the incremental backup flow from
> online to external, so I did this
> 3: mv online/root. online/root
> 4: btrfs sub snap -r online/root online/root.
> 5: btrfs property set online/root ro false
> 
> Now, I naively expected a simple restart of my automatic backups for
> external should work.
> However after running
> 6: btrfs sub snap -r online/root online/root.
> 7: btrfs send -c online/root.
> online/root. | btrfs receive external

You just recreated your "online" filesystem from scratch. Where does
"old_external_reference" come from? You did not show the steps used to
create it.

> I see the following error:
> ERROR: unlink root/.ssh/agent-diablo-_dev_pts_3 failed. No such file
> or directory
> 
> Which is unfortunate, but the second problem actually encouraged me to
> post this message.
> As planned, I had to start the offline array from scratch as well,
> because I no longer had any reference snapshot for incremental backups
> on other devices:
> 8: mkfs.btrfs -f -d raid1 -m raid1 -L "offline" /dev/mapper/offline0
> /dev/mapper/offline1
> 
> However restarting the automatic daily backup flow bails out with a
> similar error, although no potentially problematic previous
> incremental snapshots should be involved here!
> ERROR: unlink o925031-987-0/2139527549 failed. No such file or directory
> 

Again - before you can *re*start an incremental-forever sequence you need
an initial full copy. How exactly did you restart it if no snapshots exist
either on the source or the destination?
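The constraint here can be modeled with a toy sketch (names are made up; this models only the bookkeeping, not real send/receive): an incremental `btrfs send -p parent` can only be received if the parent subvolume already exists on both sides, which is why a freshly recreated destination needs re-seeding with a full send first.

```python
class Volume:
    """Toy stand-in for a btrfs filesystem tracking its snapshots."""
    def __init__(self):
        self.snapshots = set()

def full_send(src, dst, snap):
    """Initial full `btrfs send | btrfs receive`: no parent required."""
    assert snap in src.snapshots
    dst.snapshots.add(snap)

def incremental_send(src, dst, parent, snap):
    """Model of `btrfs send -p parent snap | btrfs receive`: the parent
    must exist on the source *and* already be present on the
    destination, or receive has no base subvolume to clone from."""
    if parent not in src.snapshots or parent not in dst.snapshots:
        raise RuntimeError("no shared parent - do an initial full send first")
    dst.snapshots.add(snap)

src, dst = Volume(), Volume()                    # dst is freshly mkfs'd
src.snapshots = {"root.1", "root.2"}
full_send(src, dst, "root.1")                    # re-seed the new filesystem
incremental_send(src, dst, "root.1", "root.2")   # now incrementals work
```

Skipping the `full_send` step raises the error, which is the situation described above after recreating the offline array.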

> I'm a bit lost now. The only thing I could imagine that might be
> confusing for btrfs
> is the residual "Received UUID" of online/root.
> after command 2.
> What's the recommended way to restore snapshots with send/receive
> without breaking subsequent incremental backups (including reflinks of
> existing backups)?
> 
> Any hints appreciated...



Re: So, does btrfs check lowmem take days? weeks?

2018-06-29 Thread Marc MERLIN
On Fri, Jun 29, 2018 at 12:28:31AM -0700, Marc MERLIN wrote:
> So, I rebooted, and will now run Su's btrfs check without repair and
> report back.

As expected, it will likely still take days, here's the start:

gargamel:~# btrfs check --mode=lowmem  -p /dev/mapper/dshelf2  
Checking filesystem on /dev/mapper/dshelf2 
UUID: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d 
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 21872, 
owner: 374857, offset: 3407872) wanted: 2, have: 4
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 22911, 
owner: 374857, offset: 3407872) wanted: 2, have: 4
ERROR: extent[125712527360, 12214272] referencer count mismatch (root: 21872, 
owner: 374857, offset: 114540544) wanted: 180, have: 240
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 21872, 
owner: 374857, offset: 126754816) wanted: 67, have: 115
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 22911, 
owner: 374857, offset: 126754816) wanted: 67, have: 115
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 21872, 
owner: 374857, offset: 131866624) wanted: 114, have: 143
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 22911, 
owner: 374857, offset: 131866624) wanted: 114, have: 143
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 21872, 
owner: 374857, offset: 148234240) wanted: 301, have: 431
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 22911, 
owner: 374857, offset: 148234240) wanted: 355, have: 433
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 21872, 
owner: 374857, offset: 180371456) wanted: 160, have: 240
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 22911, 
owner: 374857, offset: 180371456) wanted: 161, have: 240
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 21872, 
owner: 374857, offset: 192200704) wanted: 169, have: 249
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, 
owner: 374857, offset: 192200704) wanted: 171, have: 251
ERROR: extent[150850146304, 17522688] referencer count mismatch (root: 21872, 
owner: 374857, offset: 217653248) wanted: 347, have: 418
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 22911, 
owner: 374857, offset: 235175936) wanted: 1, have: 1449
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, 
owner: 374857, offset: 235175936) wanted: 1, have: 1452

Mmmh, these look similar (but not identical) to the last run earlier in this 
thread:
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 21872, 
owner: 374857, offset: 3407872) wanted: 3, have: 4
Created new chunk [18457780224000 1073741824]
Delete backref in extent [84302495744 69632]
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 22911, 
owner: 374857, offset: 3407872) wanted: 3, have: 4
Delete backref in extent [84302495744 69632]
ERROR: extent[125712527360, 12214272] referencer count mismatch (root: 21872, 
owner: 374857, offset: 114540544) wanted: 181, have: 240
Delete backref in extent [125712527360 12214272]
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 21872, 
owner: 374857, offset: 126754816) wanted: 68, have: 115
Delete backref in extent [125730848768 5111808]
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 22911, 
owner: 374857, offset: 126754816) wanted: 68, have: 115
Delete backref in extent [125730848768 5111808]
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 21872, 
owner: 374857, offset: 131866624) wanted: 115, have: 143
Delete backref in extent [125736914944 6037504]
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 22911, 
owner: 374857, offset: 131866624) wanted: 115, have: 143
Delete backref in extent [125736914944 6037504]
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 21872, 
owner: 374857, offset: 148234240) wanted: 302, have: 431
Delete backref in extent [129952120832 20242432]
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 22911, 
owner: 374857, offset: 148234240) wanted: 356, have: 433
Delete backref in extent [129952120832 20242432]
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 21872, 
owner: 374857, offset: 180371456) wanted: 161, have: 240
Delete backref in extent [134925357056 11829248]
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 22911, 
owner: 374857, offset: 180371456) wanted: 162, have: 240
Delete backref in extent [134925357056 11829248]
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 21872, 
owner: 374857, offset: 192200704) wanted: 170, have: 249
Delete backref in extent [147895111680 12345344]
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, 
owner: 374857, offset: 192200704) wanted: 172, have: 251
Delete backref in extent 
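Eyeballing two lowmem runs for drift in the wanted/have counts gets tedious at this volume. A small parser/differ sketch (assuming the error format shown above, with each ERROR line unwrapped onto one line):

```python
import re

# Matches the lowmem-mode error lines quoted in this thread.
ERR_RE = re.compile(
    r"extent\[(\d+), (\d+)\] referencer count mismatch "
    r"\(root: (\d+), owner: (\d+), offset: (\d+)\) "
    r"wanted: (\d+), have: (\d+)")

def parse_run(text):
    """Map (extent bytenr, root) -> (wanted, have) for one check run."""
    return {(int(m[1]), int(m[3])): (int(m[6]), int(m[7]))
            for m in ERR_RE.finditer(text)}

def diff_runs(old, new):
    """Entries present in both runs whose counts moved between them."""
    return {key: (old[key], new[key])
            for key in old.keys() & new.keys() if old[key] != new[key]}

earlier = parse_run("ERROR: extent[84302495744, 69632] referencer count "
                    "mismatch (root: 21872, owner: 374857, offset: 3407872) "
                    "wanted: 3, have: 4")
current = parse_run("ERROR: extent[84302495744, 69632] referencer count "
                    "mismatch (root: 21872, owner: 374857, offset: 3407872) "
                    "wanted: 2, have: 4")
print(diff_runs(earlier, current))
```

On the two runs quoted above this reports the same extents with `wanted` decremented by one, matching the "similar but not identical" observation.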

Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-29 Thread Austin S. Hemmelgarn

On 2018-06-29 11:15, james harvey wrote:

On Thu, Jun 28, 2018 at 6:27 PM, Chris Murphy  wrote:

And an open question I have about scrub is whether it only ever is
checking csums, meaning nodatacow files are never scrubbed, or if the
copies are at least compared to each other?


Scrub never looks at nodatacow files.  It does not compare the copies
to each other.

Qu submitted a patch to make check compare the copies:
https://patchwork.kernel.org/patch/10434509/

This hasn't been added to btrfs-progs git yet.

IMO, I think the offline check should look at nodatacow copies like
this, but I still think this also needs to be added to scrub.  In the
patch thread, I discuss my reasons why.  In brief: online scanning;
this goes along with user's expectation of scrub ensuring mirrored
data integrity; and recommendations to set up scrub on a periodic basis
to me mean it's the place to put it.
That said, it can't sanely fix things if there is a mismatch.  At least, 
not unless BTRFS gets proper generational tracking to handle temporarily 
missing devices.  As of right now, sanely fixing things requires 
significant manual intervention, as you have to bypass the device read 
selection algorithm to be able to look at the state of the individual 
copies so that you can pick one to use and forcibly rewrite the whole 
file by hand.


A while back, Anand Jain posted some patches that would let you select a 
particular device to direct all reads to via a mount option, but I don't 
think they ever got merged.  That would have made manual recovery in 
cases like this exponentially easier (mount read-only with one device 
selected, copy the file out somewhere, remount read-only with the other 
device, drop caches, copy the file out again, compare and reconcile the 
two copies, then remount the volume writable and write out the repaired 
file).
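Once the two per-device copies of the file have been extracted as described, the compare-and-reconcile step could be sketched like this (a hypothetical helper, not an existing tool; the `pick` callback stands in for the human deciding which copy to trust, since nodatacow data has no checksum to consult):

```python
def reconcile(copy_a: bytes, copy_b: bytes, pick, block=4096) -> bytes:
    """Merge two per-device copies of a file into one repaired image.
    `pick(offset, a_blk, b_blk)` returns "a" or "b" for each mismatched
    block -- the choice has to come from a person (or from knowledge of
    the file format), because nothing on disk says which copy is right.
    """
    assert len(copy_a) == len(copy_b)
    out = bytearray(copy_a)
    for off in range(0, len(copy_a), block):
        a_blk = copy_a[off:off + block]
        b_blk = copy_b[off:off + block]
        if a_blk != b_blk and pick(off, a_blk, b_blk) == "b":
            out[off:off + block] = b_blk
    return bytes(out)
```

The result would then be written back over the file once the volume is remounted writable, as in the procedure above.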



Re: btrfs send/receive vs rsync

2018-06-29 Thread Marc MERLIN
On Fri, Jun 29, 2018 at 10:04:02AM +0200, Lionel Bouton wrote:
> Hi,
> 
> On 29/06/2018 09:22, Marc MERLIN wrote:
> > On Fri, Jun 29, 2018 at 12:09:54PM +0500, Roman Mamedov wrote:
> >> On Thu, 28 Jun 2018 23:59:03 -0700
> >> Marc MERLIN  wrote:
> >>
> >>> I don't waste a week recreating the many btrfs send/receive relationships.
> >> Consider not using send/receive, and switching to regular rsync instead.
> >> Send/receive is very limiting and cumbersome, including because of what you
> >> described. And it doesn't gain you much over an incremental rsync. As for
> > Err, sorry but I cannot agree with you here, at all :)
> >
> > btrfs send/receive is pretty much the only reason I use btrfs. 
> > rsync takes hours on big filesystems scanning every single inode on both
> > sides and then seeing what changed, and only then sends the differences
> > It's super inefficient.
> > btrfs send knows in seconds what needs to be sent, and works on it right
> > away.
> 
> I've not yet tried send/receive but I feel the pain of rsyncing millions
> of files (I had to use lsyncd to limit the problem to the time the
> origin servers reboot which is a relatively rare event) so this thread
> piqued my attention. Looking at the whole thread I wonder if you could
> get a more manageable solution by splitting the filesystem.

So, let's be clear. I did backups with rsync for 10+ years. It was slow
and painful. On my laptop an hourly rsync between 2 drives slowed down
my machine to a crawl while everything was being stat'ed, it took
forever.
Now with btrfs send/receive, it just works, I don't even see it
happening in the background.
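The asymmetry comes from btrfs send pruning by tree generation: a COW update bubbles a new generation up to the root, so any subtree whose generation is not newer than the last snapshot's transid can be skipped wholesale, while rsync must stat every inode on both sides. A toy model of that pruning (not btrfs code):

```python
class Node:
    def __init__(self, name, gen, children=()):
        self.name, self.gen, self.children = name, gen, list(children)

def changed_paths(node, last_gen, visited):
    """Collect leaves changed since `last_gen`, skipping any subtree
    whose generation is not newer.  In btrfs a parent's generation is
    >= its children's (COW propagates upward), so an old subtree can
    be pruned without ever visiting its contents."""
    visited.append(node.name)
    if node.gen <= last_gen:
        return []                      # whole subtree unchanged: pruned
    if not node.children:
        return [node.name]             # changed leaf
    out = []
    for child in node.children:
        out += changed_paths(child, last_gen, visited)
    return out

tree = Node("root", 10, [
    Node("home", 10, [Node("a.txt", 10), Node("b.txt", 3)]),
    Node("usr", 3, [Node("bin", 3), Node("lib", 2)]),
])
visited = []
print(changed_paths(tree, 5, visited))   # ['a.txt']
print(visited)                           # 'bin' and 'lib' never stat'ed
```

With millions of unchanged files under the equivalent of "usr", that pruning is the difference between seconds and hours.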

Here is a page I wrote about it in 2014:
http://marc.merlins.org/perso/btrfs/2014-03.html#Btrfs-Tips_-Doing-Fast-Incremental-Backups-With-Btrfs-Send-and-Receive

Here is a talk I gave in 2014 too, scroll to the bottom of the page, and
the bottom of the talk outline:
http://marc.merlins.org/perso/btrfs/2014-05.html#My-Btrfs-Talk-at-Linuxcon-JP-2014
and click on 'Btrfs send/receive'

> If instead of using a single BTRFS filesystem you used LVM volumes
> (maybe with Thin provisioning and monitoring of the volume group free
> space) for each of your servers to backup with one BTRFS filesystem per
> volume you would have less snapshots per filesystem and isolate problems
> in case of corruption. If you eventually decide to start from scratch
> again this might help a lot in your case.

So, I already have problems due to too many block layers:
- raid 5 + ssd
- bcache
- dmcrypt
- btrfs

I get occasional deadlocks due to upper layers sending more data to the
lower layer (bcache) than it can process. I'm a bit wary of adding yet
another layer (LVM), but you're otherwise correct that keeping smaller
btrfs filesystems would help with performance and containing possible
damage.

Has anyone actually done this? :)

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/   | PGP 7F55D5F27AAF9D08


Re: [PATCH 13/14] btrfs: raid56: don't lock stripe cache table when freeing

2018-06-29 Thread David Sterba
On Fri, Jun 29, 2018 at 10:57:07AM +0200, David Sterba wrote:
> This is called either at the end of the mount or if the mount fails.
> In both cases, there's nothing running that can use the table so the
> lock is pointless.

And then lockdep says no. The umount path frees the table but there's
some unfinished bio that wants to use the table from the interrupt
context. And this is puzzling, there should be no IO in flight, all
workers should be stopped.

btrfs/011:

[ 1339.169842] 
[ 1339.171891] WARNING: inconsistent lock state
[ 1339.173724] 4.17.0-rc7-default+ #168 Not tainted
[ 1339.175661] 
[ 1339.177479] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
[ 1339.179758] umount/4029 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 1339.182008] 96eee2cd (&(>lock)->rlock){?.-.}, at: 
__remove_rbio_from_cache+0x5f/0x140 [btrfs]
[ 1339.183982] {IN-HARDIRQ-W} state was registered at:
[ 1339.184819]   _raw_spin_lock_irqsave+0x43/0x52
[ 1339.185433]   unlock_stripe+0x78/0x3d0 [btrfs]
[ 1339.186041]   rbio_orig_end_io+0x41/0xd0 [btrfs]
[ 1339.186648]   blk_update_request+0xd7/0x330
[ 1339.187205]   blk_mq_end_request+0x18/0x70
[ 1339.187752]   flush_smp_call_function_queue+0x83/0x120
[ 1339.188405]   smp_call_function_single_interrupt+0x43/0x270
[ 1339.189094]   call_function_single_interrupt+0xf/0x20
[ 1339.189733]   native_safe_halt+0x2/0x10
[ 1339.190255]   default_idle+0x1f/0x190
[ 1339.190759]   do_idle+0x217/0x240
[ 1339.191229]   cpu_startup_entry+0x6f/0x80
[ 1339.191768]   start_secondary+0x192/0x1e0
[ 1339.192385]   secondary_startup_64+0xa5/0xb0
[ 1339.193145] irq event stamp: 783817
[ 1339.193701] hardirqs last  enabled at (783817): [] 
_raw_spin_unlock_irqrestore+0x4d/0x60
[ 1339.194896] hardirqs last disabled at (783816): [] 
_raw_spin_lock_irqsave+0x20/0x52
[ 1339.196010] softirqs last  enabled at (777902): [] 
__do_softirq+0x397/0x502
[ 1339.197435] softirqs last disabled at (777879): [] 
irq_exit+0xc1/0xd0
[ 1339.198797]
[ 1339.198797] other info that might help us debug this:
[ 1339.200011]  Possible unsafe locking scenario:
[ 1339.200011]
[ 1339.201066]        CPU0
[ 1339.201464]        ----
[ 1339.201845]   lock(&(>lock)->rlock);
[ 1339.202372]   <Interrupt>
[ 1339.202768]     lock(&(>lock)->rlock);
[ 1339.203313]
[ 1339.203313]  *** DEADLOCK ***
[ 1339.203313]
[ 1339.204215] 1 lock held by umount/4029:
[ 1339.204727]  #0: 7c3dd992 (&type->s_umount_key#26){}, at: 
deactivate_super+0x43/0x50
[ 1339.205822]
[ 1339.205822] stack backtrace:
[ 1339.206660] CPU: 2 PID: 4029 Comm: umount Not tainted 4.17.0-rc7-default+ 
#168
[ 1339.207875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.0.0-prebuilt.qemu-project.org 04/01/2014
[ 1339.209357] Call Trace:
[ 1339.209742]  dump_stack+0x85/0xc0
[ 1339.210204]  print_usage_bug.cold.57+0x1aa/0x1e4
[ 1339.210790]  ? print_shortest_lock_dependencies+0x40/0x40
[ 1339.211680]  mark_lock+0x530/0x630
[ 1339.212291]  ? print_shortest_lock_dependencies+0x40/0x40
[ 1339.213180]  __lock_acquire+0x549/0xf60
[ 1339.213855]  ? flush_work+0x24a/0x280
[ 1339.214485]  ? __lock_is_held+0x4f/0x90
[ 1339.215176]  lock_acquire+0x9e/0x1d0
[ 1339.215850]  ? __remove_rbio_from_cache+0x5f/0x140 [btrfs]
[ 1339.216755]  _raw_spin_lock+0x2c/0x40
[ 1339.217303]  ? __remove_rbio_from_cache+0x5f/0x140 [btrfs]
[ 1339.218005]  __remove_rbio_from_cache+0x5f/0x140 [btrfs]
[ 1339.218687]  btrfs_free_stripe_hash_table+0x2a/0x50 [btrfs]
[ 1339.219393]  close_ctree+0x1d7/0x330 [btrfs]
[ 1339.219961]  generic_shutdown_super+0x64/0x100
[ 1339.220659]  kill_anon_super+0xe/0x20
[ 1339.221245]  btrfs_kill_super+0x12/0xa0 [btrfs]
[ 1339.221840]  deactivate_locked_super+0x29/0x60
[ 1339.222416]  cleanup_mnt+0x3b/0x70
[ 1339.222892]  task_work_run+0x9b/0xd0
[ 1339.223388]  exit_to_usermode_loop+0x99/0xa0
[ 1339.223949]  do_syscall_64+0x16c/0x170
[ 1339.224603]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1339.225457] RIP: 0033:0x7ffb9180d017
[ 1339.226063] RSP: 002b:7fff9cb72a28 EFLAGS: 0246 ORIG_RAX: 
00a6
[ 1339.227289] RAX:  RBX: 55b804721970 RCX: 7ffb9180d017
[ 1339.228373] RDX: 0001 RSI:  RDI: 55b804721b50
[ 1339.229451] RBP:  R08: 55b804721b70 R09: 7fff9cb71290
[ 1339.230524] R10:  R11: 0246 R12: 7ffb91d291c4
[ 1339.231604] R13: 55b804721b50 R14:  R15: 7fff9cb72c98


Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-29 Thread james harvey
On Thu, Jun 28, 2018 at 6:27 PM, Chris Murphy  wrote:
> And an open question I have about scrub is whether it only ever is
> checking csums, meaning nodatacow files are never scrubbed, or if the
> copies are at least compared to each other?

Scrub never looks at nodatacow files.  It does not compare the copies
to each other.

Qu submitted a patch to make check compare the copies:
https://patchwork.kernel.org/patch/10434509/

This hasn't been added to btrfs-progs git yet.

IMO the offline check should compare nodatacow copies like this, but I
still think the same check needs to be added to scrub.  In the patch
thread, I discuss my reasons why.  In brief: scrub scans online; it
matches users' expectation that scrub ensures mirrored data integrity;
and since scrub is what's recommended to run on a periodic basis, it's
the natural place to put the check.


Re: [PATCH 05/14] btrfs: pass only eb to num_extent_pages

2018-06-29 Thread Nikolay Borisov



On 29.06.2018 17:29, David Sterba wrote:
> On Fri, Jun 29, 2018 at 12:08:10PM +0300, Nikolay Borisov wrote:
>>> for (i = 0; i < num_pages; i++)
>>> copy_page(page_address(dst->pages[i]),
>>> page_address(src->pages[i]));
>>> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
>>> index 0bfd4aeb822d..d8382a4a7f46 100644
>>> --- a/fs/btrfs/extent_io.h
>>> +++ b/fs/btrfs/extent_io.h
>>> @@ -440,10 +440,10 @@ int read_extent_buffer_pages(struct extent_io_tree 
>>> *tree,
>>>  int mirror_num);
>>>  void wait_on_extent_buffer_writeback(struct extent_buffer *eb);
>>>  
>>> -static inline unsigned long num_extent_pages(u64 start, u64 len)
>>> +static inline unsigned long num_extent_pages(const struct extent_buffer 
>>> *eb)
>>>  {
>>> -   return ((start + len + PAGE_SIZE - 1) >> PAGE_SHIFT) -
>>> -   (start >> PAGE_SHIFT);
>>> +   return ((eb->start + eb->len + PAGE_SIZE - 1) >> PAGE_SHIFT) -
>>> +   (eb->start >> PAGE_SHIFT);
>>
>> This is a good opportunity to eliminate the open-coded round_up:
>>
>> round_up(eb->start + eb->len, PAGE_SIZE)
> 
> Ok, I'll add this patch:
> 
> @@ -442,8 +442,8 @@ void wait_on_extent_buffer_writeback(struct extent_buffer 
> *eb);
>  
>  static inline int num_extent_pages(const struct extent_buffer *eb)
>  {
> -   return ((eb->start + eb->len + PAGE_SIZE - 1) >> PAGE_SHIFT) -
> -   (eb->start >> PAGE_SHIFT);
> +   return (round_up(eb->start + eb->len, PAGE_SIZE) >> PAGE_SHIFT) -
> +  (eb->start >> PAGE_SHIFT);
>  }

LGTM
>  
>  static inline void extent_buffer_get(struct extent_buffer *eb)


Re: [PATCH 05/14] btrfs: pass only eb to num_extent_pages

2018-06-29 Thread David Sterba
On Fri, Jun 29, 2018 at 12:08:10PM +0300, Nikolay Borisov wrote:
> > for (i = 0; i < num_pages; i++)
> > copy_page(page_address(dst->pages[i]),
> > page_address(src->pages[i]));
> > diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> > index 0bfd4aeb822d..d8382a4a7f46 100644
> > --- a/fs/btrfs/extent_io.h
> > +++ b/fs/btrfs/extent_io.h
> > @@ -440,10 +440,10 @@ int read_extent_buffer_pages(struct extent_io_tree 
> > *tree,
> >  int mirror_num);
> >  void wait_on_extent_buffer_writeback(struct extent_buffer *eb);
> >  
> > -static inline unsigned long num_extent_pages(u64 start, u64 len)
> > +static inline unsigned long num_extent_pages(const struct extent_buffer 
> > *eb)
> >  {
> > -   return ((start + len + PAGE_SIZE - 1) >> PAGE_SHIFT) -
> > -   (start >> PAGE_SHIFT);
> > +   return ((eb->start + eb->len + PAGE_SIZE - 1) >> PAGE_SHIFT) -
> > +   (eb->start >> PAGE_SHIFT);
> 
> This is a good opportunity to eliminate the open-coded round_up:
> 
> round_up(eb->start + eb->len, PAGE_SIZE)

Ok, I'll add this patch:

@@ -442,8 +442,8 @@ void wait_on_extent_buffer_writeback(struct extent_buffer 
*eb);
 
 static inline int num_extent_pages(const struct extent_buffer *eb)
 {
-   return ((eb->start + eb->len + PAGE_SIZE - 1) >> PAGE_SHIFT) -
-   (eb->start >> PAGE_SHIFT);
+   return (round_up(eb->start + eb->len, PAGE_SIZE) >> PAGE_SHIFT) -
+  (eb->start >> PAGE_SHIFT);
 }
 
 static inline void extent_buffer_get(struct extent_buffer *eb)


Re: [PATCH] btrfs: remove warnings superseded by refcount_t usage

2018-06-29 Thread David Sterba
On Tue, Jun 26, 2018 at 05:00:46PM +0300, Nikolay Borisov wrote:
> On 25.06.2018 19:38, David Sterba wrote:
> > > There are several WARN_ON calls that catch incorrect reference counter
> > updates, but this is what the refcount_t does already:
> > 
> > * refcount_inc from 0 will warn
> > * refcount_dec_and_test from 0 will warn
> > 
> 
> But these warnings are only going to be triggered in the case of
> CONFIG_REFCOUNT_FULL otherwise refcount falls back to plain atomic
> operations.

Right, I'm for REFCOUNT_FULL on by default obviously, but we need to
handle the case where it's not. And there isn't a refcount type that
guarantees the full semantics.


Re: [PATCH 3/4] btrfs: Rename EXTENT_BUFFER_DUMMY to EXTENT_BUFFER_PRIVATE

2018-06-29 Thread David Sterba
On Fri, Jun 29, 2018 at 09:07:12PM +0800, Qu Wenruo wrote:
> >> @@ -5117,7 +5117,7 @@ void free_extent_buffer(struct extent_buffer *eb)
> >>  
> >>    spin_lock(&eb->refs_lock);
> >>    if (atomic_read(&eb->refs) == 2 &&
> >> -      test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags))
> >> +      test_bit(EXTENT_BUFFER_PRIVATE, &eb->bflags))
> >>    atomic_dec(&eb->refs);
> 
> Also discussed in off list mail, this extra atomic_dec for cloned eb
> looks confusing.
> (That also explains why, after cloning the eb, we call
> extent_buffer_get() and only free it once, and still no eb is leaked.)
> What about just removing such special handling?

The extent_buffer_get can be moved to the cloning function; all callers
increase the reference, but the missing dec is indeed confusing. However,
I don't think we can remove it completely. We need to keep eb::refs
at the same level for all eb types (regular, STALE, DUMMY) so we can
use e.g. free_extent_buffer.

There may be other way how to clean it that I don't see now, so if you
think you can improve the reference handling, then go for it.


Re: [PATCH 3/4] btrfs: Rename EXTENT_BUFFER_DUMMY to EXTENT_BUFFER_PRIVATE

2018-06-29 Thread Nikolay Borisov



On 29.06.2018 16:07, Qu Wenruo wrote:
> 
> 
> On 2018年06月29日 20:46, David Sterba wrote:
>> On Wed, Jun 27, 2018 at 04:38:24PM +0300, Nikolay Borisov wrote:
>>> EXTENT_BUFFER_DUMMY is an awful name for this flag. Buffers which have
>>> this flag set are not in any way dummy. Rather, they are private in
>>> the sense that they are not linked to the global buffer tree. This
>>> flag has subtle implications for the way free_extent_buffer works,
>>> for example, and also controls whether page->mapping->private_lock is
>>> held during extent_buffer release. Pages for a private buffer cannot
>>> be under io, nor can they be written by a 3rd party, so taking the
>>> lock is unnecessary.
>>>
>>> Signed-off-by: Nikolay Borisov 
>>> ---
>>>  fs/btrfs/disk-io.c   |  2 +-
>>>  fs/btrfs/extent_io.c | 10 +-
>>>  fs/btrfs/extent_io.h |  2 +-
>>>  3 files changed, 7 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>> index 8a469f70d5ee..1c655be92690 100644
>>> --- a/fs/btrfs/disk-io.c
>>> +++ b/fs/btrfs/disk-io.c
>>> @@ -4093,7 +4093,7 @@ void btrfs_mark_buffer_dirty(struct extent_buffer 
>>> *buf)
>>>  * enabled.  Normal people shouldn't be marking dummy buffers as dirty
>>>  * outside of the sanity tests.
>>>  */
>>> -   if (unlikely(test_bit(EXTENT_BUFFER_DUMMY, &buf->bflags)))
>>> +   if (unlikely(test_bit(EXTENT_BUFFER_PRIVATE, &buf->bflags)))
>>
>> This is going to be confusing. There's page Private bit,
>> PAGE_SET_PRIVATE2 and EXTENT_PAGE_PRIVATE, that are somehow logically
>> connected.
>>
>> I'd suggest EXTENT_BUFFER_CLONED or _UNMAPPED as it's created by
>> btrfs_clone_extent_buffer or used in the disconnected way (ie. without
>> the mapping).
> 
> UNMAPPED looks good to me.
> (Or ANONYMOUS?)

I'm more leaning towards UNMAPPED at this point.

> 
>>
>>> return;
>>>  #endif
>>> root = BTRFS_I(buf->pages[0]->mapping->host)->root;
>>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>>> index 6ac15804bab1..6611207e8e1f 100644
>>> --- a/fs/btrfs/extent_io.c
>>> +++ b/fs/btrfs/extent_io.c
>>> @@ -4642,7 +4642,7 @@ int extent_buffer_under_io(struct extent_buffer *eb)
>>>  static void btrfs_release_extent_buffer_page(struct extent_buffer *eb)
>>>  {
>>> int i;
>>> -   int mapped = !test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags);
>>> +   int mapped = !test_bit(EXTENT_BUFFER_PRIVATE, &eb->bflags);
>>>  
>>> BUG_ON(extent_buffer_under_io(eb));
>>>  
>>> @@ -4755,7 +4755,7 @@ struct extent_buffer 
>>> *btrfs_clone_extent_buffer(struct extent_buffer *src)
>>> }
>>>  
>>> set_bit(EXTENT_BUFFER_UPTODATE, &new->bflags);
>>> -   set_bit(EXTENT_BUFFER_DUMMY, &new->bflags);
>>> +   set_bit(EXTENT_BUFFER_PRIVATE, &new->bflags);
>>>  
>>> return new;
>>>  }
>>> @@ -4780,7 +4780,7 @@ struct extent_buffer 
>>> *__alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
>>
>> Would be good to sync the new name with the helpers:
>> __alloc_dummy_extent_buffer and alloc_dummy_extent_buffer then.
>>
>>> }
>>> set_extent_buffer_uptodate(eb);
>>> btrfs_set_header_nritems(eb, 0);
>>> -   set_bit(EXTENT_BUFFER_DUMMY, &eb->bflags);
>>> +   set_bit(EXTENT_BUFFER_PRIVATE, &eb->bflags);
>>>  
>>> return eb;
>>>  err:
>>> @@ -5086,7 +5086,7 @@ static int release_extent_buffer(struct extent_buffer 
>>> *eb)
>>> /* Should be safe to release our pages at this point */
>>> btrfs_release_extent_buffer_page(eb);
>>>  #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
>>> -   if (unlikely(test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags))) {
>>> +   if (unlikely(test_bit(EXTENT_BUFFER_PRIVATE, &eb->bflags))) {
>>> __free_extent_buffer(eb);
>>> return 1;
>>> }
>>> @@ -5117,7 +5117,7 @@ void free_extent_buffer(struct extent_buffer *eb)
>>>  
>>> spin_lock(&eb->refs_lock);
>>> if (atomic_read(&eb->refs) == 2 &&
>>> -   test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags))
>>> +   test_bit(EXTENT_BUFFER_PRIVATE, &eb->bflags))
>>> atomic_dec(&eb->refs);
> 
> Also discussed in off list mail, this extra atomic_dec for cloned eb
> looks confusing.
> (That also explains why after cloning the eb, we call
> extent_buffer_get() and only frees it once, and still no eb leaking)
> What about just removing such special handling?

I think this special handling is not needed if we consider that a
newly allocated eb already holds one reference. In this case the code
that allocated the buffer really "hands over" that reference when
putting it on a btrfs_path struct.

Let's take btrfs_search_old_slot as an example. It calls
tree_mod_log_rewind which rewinds the passed in extent buffer by cloning
it and doing an extra extent_buffer_get and "publishing" it to the given
btrfs_path struct. But really this buffer should have only a single
reference since it's only in the btrfs_path. Then this buffer should be
released when btrfs_release_path is called on the path.

Again, in btrfs_search_old_slot we have the same usage scenario in
get_old_root.

> 

Re: [PATCH 3/4] btrfs: Rename EXTENT_BUFFER_DUMMY to EXTENT_BUFFER_PRIVATE

2018-06-29 Thread Qu Wenruo



On 2018年06月29日 20:46, David Sterba wrote:
> On Wed, Jun 27, 2018 at 04:38:24PM +0300, Nikolay Borisov wrote:
>> EXTENT_BUFFER_DUMMY is an awful name for this flag. Buffers which have
>> this flag set are not in any way dummy. Rather, they are private in
>> the sense that they are not linked to the global buffer tree. This
>> flag has subtle implications for the way free_extent_buffer works,
>> for example, and also controls whether page->mapping->private_lock is
>> held during extent_buffer release. Pages for a private buffer cannot
>> be under io, nor can they be written by a 3rd party, so taking the
>> lock is unnecessary.
>>
>> Signed-off-by: Nikolay Borisov 
>> ---
>>  fs/btrfs/disk-io.c   |  2 +-
>>  fs/btrfs/extent_io.c | 10 +-
>>  fs/btrfs/extent_io.h |  2 +-
>>  3 files changed, 7 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 8a469f70d5ee..1c655be92690 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -4093,7 +4093,7 @@ void btrfs_mark_buffer_dirty(struct extent_buffer *buf)
>>   * enabled.  Normal people shouldn't be marking dummy buffers as dirty
>>   * outside of the sanity tests.
>>   */
>> -if (unlikely(test_bit(EXTENT_BUFFER_DUMMY, &buf->bflags)))
>> +if (unlikely(test_bit(EXTENT_BUFFER_PRIVATE, &buf->bflags)))
> 
> This is going to be confusing. There's page Private bit,
> PAGE_SET_PRIVATE2 and EXTENT_PAGE_PRIVATE, that are somehow logically
> connected.
> 
> I'd suggest EXTENT_BUFFER_CLONED or _UNMAPPED as it's created by
> btrfs_clone_extent_buffer or used in the disconnected way (ie. without
> the mapping).

UNMAPPED looks good to me.
(Or ANONYMOUS?)

> 
>>  return;
>>  #endif
>>  root = BTRFS_I(buf->pages[0]->mapping->host)->root;
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 6ac15804bab1..6611207e8e1f 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -4642,7 +4642,7 @@ int extent_buffer_under_io(struct extent_buffer *eb)
>>  static void btrfs_release_extent_buffer_page(struct extent_buffer *eb)
>>  {
>>  int i;
>> -int mapped = !test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags);
>> +int mapped = !test_bit(EXTENT_BUFFER_PRIVATE, &eb->bflags);
>>  
>>  BUG_ON(extent_buffer_under_io(eb));
>>  
>> @@ -4755,7 +4755,7 @@ struct extent_buffer *btrfs_clone_extent_buffer(struct 
>> extent_buffer *src)
>>  }
>>  
>>  set_bit(EXTENT_BUFFER_UPTODATE, &new->bflags);
>> -set_bit(EXTENT_BUFFER_DUMMY, &new->bflags);
>> +set_bit(EXTENT_BUFFER_PRIVATE, &new->bflags);
>>  
>>  return new;
>>  }
>> @@ -4780,7 +4780,7 @@ struct extent_buffer 
>> *__alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
> 
> Would be good to sync the new name with the helpers:
> __alloc_dummy_extent_buffer and alloc_dummy_extent_buffer then.
> 
>>  }
>>  set_extent_buffer_uptodate(eb);
>>  btrfs_set_header_nritems(eb, 0);
>> -set_bit(EXTENT_BUFFER_DUMMY, &eb->bflags);
>> +set_bit(EXTENT_BUFFER_PRIVATE, &eb->bflags);
>>  
>>  return eb;
>>  err:
>> @@ -5086,7 +5086,7 @@ static int release_extent_buffer(struct extent_buffer 
>> *eb)
>>  /* Should be safe to release our pages at this point */
>>  btrfs_release_extent_buffer_page(eb);
>>  #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
>> -if (unlikely(test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags))) {
>> +if (unlikely(test_bit(EXTENT_BUFFER_PRIVATE, &eb->bflags))) {
>>  __free_extent_buffer(eb);
>>  return 1;
>>  }
>> @@ -5117,7 +5117,7 @@ void free_extent_buffer(struct extent_buffer *eb)
>>  
>>  spin_lock(&eb->refs_lock);
>>  if (atomic_read(&eb->refs) == 2 &&
>> -test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags))
>> +test_bit(EXTENT_BUFFER_PRIVATE, &eb->bflags))
>>  atomic_dec(&eb->refs);

Also discussed in off list mail, this extra atomic_dec for cloned eb
looks confusing.
(That also explains why, after cloning the eb, we call
extent_buffer_get() and only free it once, and still no eb is leaked.)
What about just removing such special handling?

Thanks,
Qu

>>  
>>  if (atomic_read(&eb->refs) == 2 &&
>> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
>> index 0bfd4aeb822d..bfccaec2c89a 100644
>> --- a/fs/btrfs/extent_io.h
>> +++ b/fs/btrfs/extent_io.h
>> @@ -46,7 +46,7 @@
>>  #define EXTENT_BUFFER_STALE 6
>>  #define EXTENT_BUFFER_WRITEBACK 7
>>  #define EXTENT_BUFFER_READ_ERR 8/* read IO error */
>> -#define EXTENT_BUFFER_DUMMY 9
>> +#define EXTENT_BUFFER_PRIVATE 9
>>  #define EXTENT_BUFFER_IN_TREE 10
>>  #define EXTENT_BUFFER_WRITE_ERR 11/* write IO error */
>>  
>> -- 
>> 2.7.4
>>

Re: call trace: WARNING: at /build/linux-uwVqDp/linux-4.16.16/fs/btrfs/ctree.h:1565 btrfs_update_device

2018-06-29 Thread Christoph Anton Mitterer
On Fri, 2018-06-29 at 09:10 +0800, Qu Wenruo wrote:
> Maybe it's the old mkfs causing the problem?
> Although mkfs.btrfs added device size alignment much earlier than the
> kernel, it's still possible that the old mkfs doesn't handle the
> initial device and the extra devices (mkfs.btrfs will always create a
> temporary fs on the first device, then add all the other devices to
> the system) the same way.

Well who knows,.. at least now everything's fine again :-)

Thanks guys!

Chris.


lockdep splat between fsync and DIO

2018-06-29 Thread Nikolay Borisov
Running xfstests on today's misc-next revealed the following lockdep splat
(but the machine didn't lock up):

[ 1477.192040] ==
[ 1477.192226] WARNING: possible circular locking dependency detected
[ 1477.192393] 4.18.0-rc1-nbor #755 Not tainted
[ 1477.192522] --
[ 1477.192686] fsstress/4314 is trying to acquire lock:
[ 1477.192827] 3e0774ac (sb_internal#2){.+.+}, at: 
start_transaction+0x2e8/0x4b0
[ 1477.193027] 
   but task is already holding lock:
[ 1477.193191] ef79de77 (&ei->dio_sem){}, at: 
btrfs_direct_IO+0x3bb/0x450
[ 1477.193395] 
   which lock already depends on the new lock.

[ 1477.193589] 
   the existing dependency chain (in reverse order) is:
[ 1477.193774] 
   -> #2 (&ei->dio_sem){}:
[ 1477.193928]btrfs_log_changed_extents.isra.5+0x6e/0x9e0
[ 1477.194143]btrfs_log_inode+0x96c/0xf10
[ 1477.194344]btrfs_log_inode_parent+0x295/0xb10
[ 1477.194540]btrfs_log_dentry_safe+0x4a/0x70
[ 1477.194743]btrfs_sync_file+0x3eb/0x580
[ 1477.194928]do_fsync+0x38/0x60
[ 1477.195114]__x64_sys_fsync+0x10/0x20
[ 1477.195320]do_syscall_64+0x5a/0x1a0
[ 1477.195496]entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1477.195698] 
   -> #1 (&ei->log_mutex){+.+.}:
[ 1477.196001]btrfs_record_unlink_dir+0x2a/0xa0
[ 1477.196225]btrfs_unlink+0x5e/0xd0
[ 1477.196401]vfs_unlink+0xc4/0x190
[ 1477.196602]do_unlinkat+0x2ab/0x310
[ 1477.196774]do_syscall_64+0x5a/0x1a0
[ 1477.197016]entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1477.197230] 
   -> #0 (sb_internal#2){.+.+}:
[ 1477.197480]__sb_start_write+0x126/0x1a0
[ 1477.197662]start_transaction+0x2e8/0x4b0
[ 1477.197845]find_free_extent+0x10c1/0x1430
[ 1477.198027]btrfs_reserve_extent+0x9b/0x180
[ 1477.198224]btrfs_get_blocks_direct+0x38d/0x700
[ 1477.198419]__blockdev_direct_IO+0xb2f/0x39c7
[ 1477.198611]btrfs_direct_IO+0x190/0x450
[ 1477.198790]generic_file_direct_write+0x9d/0x160
[ 1477.198986]btrfs_file_write_iter+0x217/0x60b
[ 1477.199179]aio_write+0x136/0x1f0
[ 1477.199359]io_submit_one+0x584/0x6d0
[ 1477.199543]__se_sys_io_submit+0x91/0x220
[ 1477.199789]do_syscall_64+0x5a/0x1a0
[ 1477.199964]entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1477.200163] 
   other info that might help us debug this:

[ 1477.200522] Chain exists of:
 sb_internal#2 --> &ei->log_mutex --> &ei->dio_sem

[ 1477.200891]  Possible unsafe locking scenario:

[ 1477.201137]        CPU0                    CPU1
[ 1477.201356]        ----                    ----
[ 1477.201533]   lock(&ei->dio_sem);
[ 1477.201687]                                lock(&ei->log_mutex);
[ 1477.201887]                                lock(&ei->dio_sem);
[ 1477.202088]   lock(sb_internal#2);
[ 1477.202290] 
*** DEADLOCK ***

[ 1477.202581] 1 lock held by fsstress/4314:
[ 1477.202745]  #0: ef79de77 (&ei->dio_sem){}, at: 
btrfs_direct_IO+0x3bb/0x450
[ 1477.203039] 
   stack backtrace:
[ 1477.203312] CPU: 0 PID: 4314 Comm: fsstress Not tainted 4.18.0-rc1-nbor #755
[ 1477.203537] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 1477.203850] Call Trace:
[ 1477.203992]  dump_stack+0x85/0xcb
[ 1477.204151]  print_circular_bug.isra.19+0x1c8/0x2b0
[ 1477.204383]  __lock_acquire+0x182e/0x1900
[ 1477.204551]  ? lock_acquire+0xa3/0x210
[ 1477.204715]  lock_acquire+0xa3/0x210
[ 1477.204889]  ? start_transaction+0x2e8/0x4b0
[ 1477.205065]  __sb_start_write+0x126/0x1a0
[ 1477.205271]  ? start_transaction+0x2e8/0x4b0
[ 1477.205455]  start_transaction+0x2e8/0x4b0
[ 1477.205631]  find_free_extent+0x10c1/0x1430
[ 1477.205806]  btrfs_reserve_extent+0x9b/0x180
[ 1477.205980]  btrfs_get_blocks_direct+0x38d/0x700
[ 1477.206193]  __blockdev_direct_IO+0xb2f/0x39c7
[ 1477.206397]  ? __lock_acquire+0x2b6/0x1900
[ 1477.206571]  ? __lock_acquire+0x2b6/0x1900
[ 1477.206744]  ? can_nocow_extent+0x480/0x480
[ 1477.206921]  ? btrfs_run_delalloc_work+0x40/0x40
[ 1477.207107]  ? btrfs_direct_IO+0x190/0x450
[ 1477.207322]  btrfs_direct_IO+0x190/0x450
[ 1477.207492]  ? btrfs_run_delalloc_work+0x40/0x40
[ 1477.207678]  generic_file_direct_write+0x9d/0x160
[ 1477.207862]  btrfs_file_write_iter+0x217/0x60b
[ 1477.208044]  aio_write+0x136/0x1f0
[ 1477.208244]  ? lock_acquire+0xa3/0x210
[ 1477.208415]  ? __might_fault+0x3e/0x90
[ 1477.208580]  ? io_submit_one+0x584/0x6d0
[ 1477.208749]  io_submit_one+0x584/0x6d0
[ 1477.208917]  ? lock_acquire+0xa3/0x210
[ 1477.209086]  __se_sys_io_submit+0x91/0x220
[ 1477.209296]  ? do_syscall_64+0x5a/0x1a0
[ 1477.209468]  do_syscall_64+0x5a/0x1a0
[ 1477.209636]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1477.209826] RIP: 0033:0x7fa509cee697

Re: [PATCH 3/4] btrfs: Rename EXTENT_BUFFER_DUMMY to EXTENT_BUFFER_PRIVATE

2018-06-29 Thread David Sterba
On Wed, Jun 27, 2018 at 04:38:24PM +0300, Nikolay Borisov wrote:
> EXTENT_BUFFER_DUMMY is an awful name for this flag. Buffers which have
> this flag set are not in any way dummy. Rather, they are private in
> the sense that they are not linked to the global buffer tree. This
> flag has subtle implications for the way free_extent_buffer works,
> for example, and also controls whether page->mapping->private_lock is
> held during extent_buffer release. Pages for a private buffer cannot
> be under io, nor can they be written by a 3rd party, so taking the
> lock is unnecessary.
> 
> Signed-off-by: Nikolay Borisov 
> ---
>  fs/btrfs/disk-io.c   |  2 +-
>  fs/btrfs/extent_io.c | 10 +-
>  fs/btrfs/extent_io.h |  2 +-
>  3 files changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 8a469f70d5ee..1c655be92690 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -4093,7 +4093,7 @@ void btrfs_mark_buffer_dirty(struct extent_buffer *buf)
>* enabled.  Normal people shouldn't be marking dummy buffers as dirty
>* outside of the sanity tests.
>*/
> - if (unlikely(test_bit(EXTENT_BUFFER_DUMMY, &buf->bflags)))
> + if (unlikely(test_bit(EXTENT_BUFFER_PRIVATE, &buf->bflags)))

This is going to be confusing. There's page Private bit,
PAGE_SET_PRIVATE2 and EXTENT_PAGE_PRIVATE, that are somehow logically
connected.

I'd suggest EXTENT_BUFFER_CLONED or _UNMAPPED as it's created by
btrfs_clone_extent_buffer or used in the disconnected way (ie. without
the mapping).

>   return;
>  #endif
>   root = BTRFS_I(buf->pages[0]->mapping->host)->root;
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 6ac15804bab1..6611207e8e1f 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -4642,7 +4642,7 @@ int extent_buffer_under_io(struct extent_buffer *eb)
>  static void btrfs_release_extent_buffer_page(struct extent_buffer *eb)
>  {
>   int i;
> - int mapped = !test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags);
> + int mapped = !test_bit(EXTENT_BUFFER_PRIVATE, &eb->bflags);
>  
>   BUG_ON(extent_buffer_under_io(eb));
>  
> @@ -4755,7 +4755,7 @@ struct extent_buffer *btrfs_clone_extent_buffer(struct 
> extent_buffer *src)
>   }
>  
> set_bit(EXTENT_BUFFER_UPTODATE, &new->bflags);
> - set_bit(EXTENT_BUFFER_DUMMY, &new->bflags);
> + set_bit(EXTENT_BUFFER_PRIVATE, &new->bflags);
>  
>   return new;
>  }
> @@ -4780,7 +4780,7 @@ struct extent_buffer 
> *__alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,

Would be good to sync the new name with the helpers:
__alloc_dummy_extent_buffer and alloc_dummy_extent_buffer then.

>   }
>   set_extent_buffer_uptodate(eb);
>   btrfs_set_header_nritems(eb, 0);
> - set_bit(EXTENT_BUFFER_DUMMY, &eb->bflags);
> + set_bit(EXTENT_BUFFER_PRIVATE, &eb->bflags);
>  
>   return eb;
>  err:
> @@ -5086,7 +5086,7 @@ static int release_extent_buffer(struct extent_buffer 
> *eb)
>   /* Should be safe to release our pages at this point */
>   btrfs_release_extent_buffer_page(eb);
>  #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
> - if (unlikely(test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags))) {
> + if (unlikely(test_bit(EXTENT_BUFFER_PRIVATE, &eb->bflags))) {
>   __free_extent_buffer(eb);
>   return 1;
>   }
> @@ -5117,7 +5117,7 @@ void free_extent_buffer(struct extent_buffer *eb)
>  
>   spin_lock(&eb->refs_lock);
>   if (atomic_read(&eb->refs) == 2 &&
> - test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags))
> + test_bit(EXTENT_BUFFER_PRIVATE, &eb->bflags))
>   atomic_dec(&eb->refs);
>  
>   if (atomic_read(&eb->refs) == 2 &&
> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> index 0bfd4aeb822d..bfccaec2c89a 100644
> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -46,7 +46,7 @@
>  #define EXTENT_BUFFER_STALE 6
>  #define EXTENT_BUFFER_WRITEBACK 7
>  #define EXTENT_BUFFER_READ_ERR 8/* read IO error */
> -#define EXTENT_BUFFER_DUMMY 9
> +#define EXTENT_BUFFER_PRIVATE 9
>  #define EXTENT_BUFFER_IN_TREE 10
>  #define EXTENT_BUFFER_WRITE_ERR 11/* write IO error */
>  
> -- 
> 2.7.4
> 


Re: btrfs suddenly think's it's raid6

2018-06-29 Thread Qu Wenruo


On 2018年06月29日 19:04, marble wrote:
> Hello,
> I have an external HDD. The HDD contains no partition.
> I use the whole HDD as a LUKS container. Inside that LUKS is a btrfs.
> It's used to store some media files.
> The HDD was hooked up to a Raspberry Pi running up-to-date Arch Linux
> to play music from the drive.
> 
> After disconnecting the drive from the Pi and connecting it to my laptop
> again, I couldn't mount it anymore. If I read the dmesg right, it now
> thinks that it's part of a raid6.

Don't panic, the raid6 module is always loaded by btrfs, no matter
whether btrfs uses it or not.

> 
> btrfs check --repair also didn't help.

Normally you shouldn't try it anyway.

> 
> ```
> marble@archlinux ~ % uname -a
> Linux archlinux 4.17.2-1-ARCH #1 SMP PREEMPT Sat Jun 16 11:08:59 UTC
> 2018 x86_64 GNU/Linux
> 
> marble@archlinux ~ % btrfs --version
> btrfs-progs v4.16.1
> 
> marble@archlinux ~ % sudo cryptsetup open /dev/sda black
> [sudo] password for marble:
> Enter passphrase for /dev/sda:
> 
> marble@archlinux ~ % mkdir /tmp/black
> marble@archlinux ~ % sudo mount /dev/mapper/black /tmp/black
> mount: /tmp/black: can't read superblock on /dev/mapper/black.
> 
> marble@archlinux ~ % sudo btrfs fi show
> Label: 'black'  uuid: 9fea91c7-7b0b-4ef9-a83b-e24ccf2586b5
> Total devices 1 FS bytes used 143.38GiB
> devid1 size 465.76GiB used 172.02GiB path /dev/mapper/black
> 
> marble@archlinux ~ % sudo btrfs check --repair /dev/mapper/black
> enabling repair mode
> Checking filesystem on /dev/mapper/black
> UUID: 9fea91c7-7b0b-4ef9-a83b-e24ccf2586b5
> Fixed 0 roots.
> checking extents
> checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3
> checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3
> checksum verify failed on 1082114048 found 1A9EFC07 wanted 204A6979
> checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3

This means both copies of logical bytenr 1082114048 are corrupted.
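For reference, the "found"/"wanted" values in these messages are crc32c checksums of the tree block: one is stored on disk, the other recomputed from the block contents, and a mismatch means the block was not written or read back intact. A minimal bit-wise sketch of CRC-32C (a simplified model; the kernel and btrfs-progs use table-driven or hardware-accelerated implementations):

```c
#include <stdint.h>
#include <stddef.h>

/* Bit-wise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
 * Only meant to show what is being compared in the messages above. */
static uint32_t crc32c(uint32_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;

	crc = ~crc;
	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ (0x82F63B78 & -(crc & 1));
	}
	return ~crc;
}

/* Hypothetical verify step: recompute the checksum over the block
 * payload and compare it against the expected value. */
static int csum_matches(const void *data, size_t len, uint32_t expected)
{
	return crc32c(0, data, len) == expected;
}
```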

> bytenr mismatch, want=1082114048, have=9385453728028469028
> owner ref check failed [1082114048 16384]
> repair deleting extent record: key [1082114048,168,16384]

You really shouldn't use --repair here.
Your extent tree is corrupted, and --repair could make things worse.

> adding new tree backref on start 1082114048 len 16384 parent 0 root 5
> Repaired extent references for 1082114048
> ref mismatch on [59038556160 4096] extent item 1, found 0
> checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3
> checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3
> checksum verify failed on 1082114048 found 1A9EFC07 wanted 204A6979
> checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3
> bytenr mismatch, want=1082114048, have=9385453728028469028
> incorrect local backref count on 59038556160 root 5 owner 334393 offset
> 0 found 0 wanted 1 back 0x56348aee5de0
> backref disk bytenr does not match extent record, bytenr=59038556160,
> ref bytenr=0
> backpointer mismatch on [59038556160 4096]
> owner ref check failed [59038556160 4096]
> checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3
> checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3
> checksum verify failed on 1082114048 found 1A9EFC07 wanted 204A6979
> checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3
> bytenr mismatch, want=1082114048, have=9385453728028469028
> failed to repair damaged filesystem, aborting
> 
> marble@archlinux ~ % dmesg > /tmp/dmesg.log

This is the root cause:
[  124.544439] BTRFS error (device dm-1): bad tree block start
9385453728028469028 1082114048
[  124.555229] BTRFS error (device dm-1): bad tree block start
2365503423870651471 1082114048
[  124.555275] BTRFS warning (device dm-1): failed to read fs tree: -5

Your fs tree is corrupted.
I'm not sure how it got corrupted, but corruption of the fs tree is fairly rare.

Did anything like an unexpected power loss happen?
And could you provide the following data for debugging?

# btrfs ins dump-super -fFa /dev/mapper/black

Furthermore, what's the device-mapper setup for /dev/mapper/black?
Is there any RAID involved here?

Thanks,
Qu


> ```
> 
> Any clues?
> 
> Best regards,
> marble
> 





Re: [PATCH 1/4] btrfs: Refactor loop in btrfs_release_extent_buffer_page

2018-06-29 Thread David Sterba
On Wed, Jun 27, 2018 at 04:38:22PM +0300, Nikolay Borisov wrote:
> The purpose of the function is to free all the pages comprising an
> extent buffer. This can be achieved with a simple 'for' loop rather than
> the slightly more involved 'do {} while' construct, so rewrite the
> loop using a 'for' construct. Additionally, we can never have an
> extent_buffer with 0 pages, so remove the check for index == 0. No
> functional changes.

The reversed loop makes sense: the first page is special and is used for
locking the whole extent buffer's pages, as can be seen in e.g. commits
897ca6e9b4fef86d5dfb6b31 or 4f2de97acee6532b36dd6e99. Also see the current
alloc_extent_buffer for the locking and the managing of the private bit.

So you can still rewrite it as a 'for' loop, but you would have to preserve
the logic or provide the reason why it's correct to iterate from 0 to
num_pages. There are some subtle races regarding pages and extents and
their presence in various trees, so I'd rather be careful here.

> Signed-off-by: Nikolay Borisov 
> ---
>  fs/btrfs/extent_io.c | 13 -
>  1 file changed, 4 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index cce6087d6880..4180a3b7e725 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -4641,19 +4641,14 @@ int extent_buffer_under_io(struct extent_buffer *eb)
>   */
>  static void btrfs_release_extent_buffer_page(struct extent_buffer *eb)
>  {
> - unsigned long index;
> - struct page *page;
> + int i;
>   int mapped = !test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags);
>  
>   BUG_ON(extent_buffer_under_io(eb));
>  
> - index = num_extent_pages(eb->start, eb->len);
> - if (index == 0)
> - return;

This check does seem to be redundant.

> + for (i = 0; i < num_extent_pages(eb->start, eb->len); i++) {

I think num_extent_pages(...) would be better stored in a temporary
variable so that it's not evaluated each time the loop termination
condition is checked.

> + struct page *page = eb->pages[i];
>  
> - do {
> - index--;
> - page = eb->pages[index];
>   if (!page)
>   continue;
>   if (mapped)
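For illustration, the loop shape suggested by the review could look like this userspace sketch (hypothetical miniature stand-ins for the kernel structures; the real code must also account for the page-0 locking discussed above):

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types, just to show the loop. */
struct page { int released; };

#define MAX_PAGES 16

struct extent_buffer {
	unsigned long len;
	struct page *pages[MAX_PAGES];
};

/* Simplified page count derived from the buffer length (4KiB pages
 * assumed; the kernel helper also considers eb->start alignment). */
static unsigned long num_extent_pages(const struct extent_buffer *eb)
{
	return (eb->len + 4095) / 4096;
}

/* The reviewed loop: evaluate num_extent_pages() once into a local so
 * it is not recomputed on every termination check, then walk the pages
 * forward, skipping holes. */
static void release_extent_buffer_pages(struct extent_buffer *eb)
{
	const unsigned long num_pages = num_extent_pages(eb);

	for (unsigned long i = 0; i < num_pages; i++) {
		struct page *page = eb->pages[i];

		if (!page)
			continue;
		page->released = 1;	/* stands in for the real cleanup */
	}
}
```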


Re: [PATCH 3/3] btrfs: fix race between mkfs and mount

2018-06-29 Thread David Sterba
On Tue, Jun 26, 2018 at 10:42:32PM +0800, Anand Jain wrote:
> >>> Last version of the proposed fix is to extend the uuid_mutex over the
> >>> whole mount callback and use it around calls to btrfs_scan_one_device.
> >>> That way we'll be sure the mount will always get to the device it
> >>> scanned and that will not be freed by the parallel scan.
> >>
> >>That will break the requisite #1 as above.
> > 
> > I don't see how it breaks 'mount and or scan of two independent fsid+dev
> > is possible'. It is possible, but the lock does not need to be
> > filesystem local.
> > 
> > Concurrent mount or scan will block on the uuid_mutex,
> 
>   As uuid_mutex is global, if fsid1 is being mounted, the fsid2 mount
>   will wait to some extent with this code.

Yes, it will wait a bit, but the critical section is short. There is some
IO done, namely the reading of 4KiB in btrfs_read_disk_super. The rest is
CPU-bound and hardly measurable in practice in the context of concurrent
mounting and scanning.

I took the approach of fixing the bug first and making it faster or
cleaner later, and of fixing it in a way that's still acceptable for the
current development cycle.

>   My effort here was to
>   use uuid_mutex only to protect the fs_uuid update part; this way
>   there is concurrency between the mounts of fsid1 and fsid2, which
>   provides a shorter boot time when the user uses the mount-at-boot
>   option.

The locking can still be improved, but uuid_mutex is not a contended
lock; mount is an operation that does not happen a thousand times a
second, and the same goes for scanning.

So even if there are several mounts happening during boot, it's just a
few and the delay is IMO unnoticeable. If the boot time is longer by
say 100ms at worst, I'm still ok with my patches as a fix.

But not as a final fix, so the updates to locking you mentioned are
still possible. We now have a point of reference where syzbot is not
able to reproduce the bugs.


Re: [PATCH v2] Revert "btrfs: fix a possible umount deadlock"

2018-06-29 Thread David Sterba
On Fri, Jun 29, 2018 at 10:13:44AM +0300, Nikolay Borisov wrote:
> On 29.06.2018 10:11, Anand Jain wrote:
> > 
> > On 06/29/2018 01:26 PM, Nikolay Borisov wrote:
> >> Since commit 88c14590cdd6 ("btrfs: use RCU in btrfs_show_devname for
> >> device list traversal") btrfs_show_devname no longer takes
> >> device_list_mutex. As such the deadlock that 0ccd05285e7f ("btrfs: fix
> >> a possible umount deadlock") aimed to fix no longer exists. So remove
> >> the extra code that commit added.
> >>
> >> This reverts commit 0ccd05285e7f5a8e297e1d6dfc41e7c65757d6fa.
> > 
> >  It's not exactly a revert.
> > 
> >  Because before 88c14590cdd6 was written, 0ccd05285e7f was already there
> >  for the right purpose, or a new title is even better.
> 
> What I'm saying is that commit 88c14590cdd6 obsoleted 0ccd05285e7f,
> hence this patch reverts the latter. I've literally generated it with
> git revert 0ccd05285e7f + minor fixups.

I do not recommend doing a revert here.  Even if a patch reverts
functionality because it's not needed anymore, it's still a "forward"
change.  There are also other changes that may affect the behaviour; in
this case it's 47dba17171a76ea2a2a71, which removes the RCU barrier, so the
patch has to be put into the context of the current code.

The commit is almost 2 years old; the idea of reverts is IMHO more to
provide an easy way to take a small step back during one development cycle
while the moving parts are still in sight.

Back then the commit fixed a deadlock, a revert here would read as 'ok,
we want the deadlock back'.

So, the code is ok. The subject needs to drop the word 'revert', and the
changelog should maybe mention a few more references for why the logic is
not needed anymore.


Re: btrfs suddenly think's it's raid6

2018-06-29 Thread Austin S. Hemmelgarn

On 2018-06-29 07:04, marble wrote:

Hello,
I have an external HDD. The HDD contains no partition.
I use the whole HDD as a LUKS container. Inside that LUKS is a btrfs.
It's used to store some media files.
The HDD was hooked up to a Raspberry Pi running up-to-date Arch Linux
to play music from the drive.

After disconnecting the drive from the Pi and connecting it to my laptop
again, I couldn't mount it anymore. If I read the dmesg right, it now
thinks that it's part of a raid6.

btrfs check --repair also didn't help.

```
marble@archlinux ~ % uname -a
Linux archlinux 4.17.2-1-ARCH #1 SMP PREEMPT Sat Jun 16 11:08:59 UTC
2018 x86_64 GNU/Linux

marble@archlinux ~ % btrfs --version
btrfs-progs v4.16.1

marble@archlinux ~ % sudo cryptsetup open /dev/sda black
[sudo] password for marble:
Enter passphrase for /dev/sda:

marble@archlinux ~ % mkdir /tmp/black
marble@archlinux ~ % sudo mount /dev/mapper/black /tmp/black
mount: /tmp/black: can't read superblock on /dev/mapper/black.

marble@archlinux ~ % sudo btrfs fi show
Label: 'black'  uuid: 9fea91c7-7b0b-4ef9-a83b-e24ccf2586b5
 Total devices 1 FS bytes used 143.38GiB
 devid1 size 465.76GiB used 172.02GiB path /dev/mapper/black

marble@archlinux ~ % sudo btrfs check --repair /dev/mapper/black
enabling repair mode
Checking filesystem on /dev/mapper/black
UUID: 9fea91c7-7b0b-4ef9-a83b-e24ccf2586b5
Fixed 0 roots.
checking extents
checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3
checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3
checksum verify failed on 1082114048 found 1A9EFC07 wanted 204A6979
checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3
bytenr mismatch, want=1082114048, have=9385453728028469028
owner ref check failed [1082114048 16384]
repair deleting extent record: key [1082114048,168,16384]
adding new tree backref on start 1082114048 len 16384 parent 0 root 5
Repaired extent references for 1082114048
ref mismatch on [59038556160 4096] extent item 1, found 0
checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3
checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3
checksum verify failed on 1082114048 found 1A9EFC07 wanted 204A6979
checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3
bytenr mismatch, want=1082114048, have=9385453728028469028
incorrect local backref count on 59038556160 root 5 owner 334393 offset
0 found 0 wanted 1 back 0x56348aee5de0
backref disk bytenr does not match extent record, bytenr=59038556160,
ref bytenr=0
backpointer mismatch on [59038556160 4096]
owner ref check failed [59038556160 4096]
checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3
checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3
checksum verify failed on 1082114048 found 1A9EFC07 wanted 204A6979
checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3
bytenr mismatch, want=1082114048, have=9385453728028469028
failed to repair damaged filesystem, aborting

marble@archlinux ~ % dmesg > /tmp/dmesg.log
```

Any clues?


It's not thinking it's a raid6 array.  All the messages before this one:

Btrfs loaded, crc32c=crc32c-intel

are completely unrelated to BTRFS (because anything before that message 
happened before any BTRFS code ran).  The raid6 messages are actually 
from the startup code for the kernel's generic parity RAID implementation.


These:

BTRFS error (device dm-1): bad tree block start 9385453728028469028 
1082114048
BTRFS error (device dm-1): bad tree block start 2365503423870651471 
1082114048


are the relevant error messages.  Unfortunately, I don't really know 
what's wrong in this case.  Hopefully one of the developers will 
have some further insight.



btrfs suddenly think's it's raid6

2018-06-29 Thread marble
Hello,
I have an external HDD. The HDD contains no partition.
I use the whole HDD as a LUKS container. Inside that LUKS is a btrfs.
It's used to store some media files.
The HDD was hooked up to a Raspberry Pi running up-to-date Arch Linux
to play music from the drive.

After disconnecting the drive from the Pi and connecting it to my laptop
again, I couldn't mount it anymore. If I read the dmesg right, it now
thinks that it's part of a raid6.

btrfs check --repair also didn't help.

```
marble@archlinux ~ % uname -a
Linux archlinux 4.17.2-1-ARCH #1 SMP PREEMPT Sat Jun 16 11:08:59 UTC
2018 x86_64 GNU/Linux

marble@archlinux ~ % btrfs --version
btrfs-progs v4.16.1

marble@archlinux ~ % sudo cryptsetup open /dev/sda black
[sudo] password for marble:
Enter passphrase for /dev/sda:

marble@archlinux ~ % mkdir /tmp/black
marble@archlinux ~ % sudo mount /dev/mapper/black /tmp/black
mount: /tmp/black: can't read superblock on /dev/mapper/black.

marble@archlinux ~ % sudo btrfs fi show
Label: 'black'  uuid: 9fea91c7-7b0b-4ef9-a83b-e24ccf2586b5
Total devices 1 FS bytes used 143.38GiB
devid1 size 465.76GiB used 172.02GiB path /dev/mapper/black

marble@archlinux ~ % sudo btrfs check --repair /dev/mapper/black
enabling repair mode
Checking filesystem on /dev/mapper/black
UUID: 9fea91c7-7b0b-4ef9-a83b-e24ccf2586b5
Fixed 0 roots.
checking extents
checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3
checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3
checksum verify failed on 1082114048 found 1A9EFC07 wanted 204A6979
checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3
bytenr mismatch, want=1082114048, have=9385453728028469028
owner ref check failed [1082114048 16384]
repair deleting extent record: key [1082114048,168,16384]
adding new tree backref on start 1082114048 len 16384 parent 0 root 5
Repaired extent references for 1082114048
ref mismatch on [59038556160 4096] extent item 1, found 0
checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3
checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3
checksum verify failed on 1082114048 found 1A9EFC07 wanted 204A6979
checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3
bytenr mismatch, want=1082114048, have=9385453728028469028
incorrect local backref count on 59038556160 root 5 owner 334393 offset
0 found 0 wanted 1 back 0x56348aee5de0
backref disk bytenr does not match extent record, bytenr=59038556160,
ref bytenr=0
backpointer mismatch on [59038556160 4096]
owner ref check failed [59038556160 4096]
checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3
checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3
checksum verify failed on 1082114048 found 1A9EFC07 wanted 204A6979
checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3
bytenr mismatch, want=1082114048, have=9385453728028469028
failed to repair damaged filesystem, aborting

marble@archlinux ~ % dmesg > /tmp/dmesg.log
```

Any clues?

Best regards,
marble
[0.00] Linux version 4.17.2-1-ARCH (builduser@heftig-9574) (gcc version 8.1.1 20180531 (GCC)) #1 SMP PREEMPT Sat Jun 16 11:08:59 UTC 2018
[0.00] Command line: initrd=\initramfs-linux.img cryptdevice=PARTLABEL=root:root_luks root=/dev/mapper/root_luks quiet rw
[remainder of dmesg.log attachment (CPU feature flags, e820 memory map) truncated by the archive]

Re: [PATCH] btrfs: use customized batch size for total_bytes_pinned

2018-06-29 Thread ethanlien

Nikolay Borisov wrote on 2018-06-29 16:43:

On 29.06.2018 11:27, Ethan Lien wrote:

We use a percpu_counter to reduce lock contention on frequently updated
counters. But the default batch size of 32 is suitable only for inc/dec
updates, not for sector-sized updates. The end result is that we still
acquire the spin lock for every percpu_counter_add(). So use a customized
batch size for total_bytes_pinned, as we already do for
dirty_metadata_bytes and delalloc_bytes. Also fix dirty_metadata_bytes
to use __percpu_counter_compare with the customized batch so we get a
precise value of dirty_metadata_bytes.


Do you have any numbers to prove this is problematic? Perhaps a fio
workload + lockdep contention numbers before/after the patch?



I think a contended path may come from heavy metadata CoW and short data
extents, where we call add_pinned_bytes() frequently when releasing old
blocks.

I'll see if I can come up with a test to show the difference.

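The batching trade-off itself can be modelled with a small userspace sketch (a hypothetical single-threaded model of the idea behind percpu_counter_add_batch(), not the kernel API): each per-CPU slot accumulates a local delta and only takes the shared lock once the delta reaches the batch size.

```c
#include <stdint.h>

/* Hypothetical model of one per-CPU slot of a batched counter.  The
 * "lock_acquisitions" field counts how often the slow path (shared
 * lock + fold into the global sum) would run. */
struct batched_counter {
	int64_t global;            /* protected by a lock in the kernel */
	int64_t local;             /* per-CPU delta, lock-free */
	int64_t batch;             /* spill threshold */
	int64_t lock_acquisitions; /* slow-path counter for the demo */
};

static void counter_add(struct batched_counter *c, int64_t amount)
{
	int64_t n = c->local + amount;

	if (n >= c->batch || n <= -c->batch) {
		/* Slow path: fold the accumulated delta into the global
		 * count (the kernel takes fbc->lock here). */
		c->global += n;
		c->local = 0;
		c->lock_acquisitions++;
	} else {
		/* Fast path: stay in the per-CPU slot. */
		c->local = n;
	}
}
```

With a default-like batch of 32, every 4096-byte update spills to the slow path; with a sector-scaled batch such as 64KiB, ten such updates stay entirely in the per-CPU slot.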


Signed-off-by: Ethan Lien 
---
 fs/btrfs/ctree.h   |  1 +
 fs/btrfs/disk-io.c | 10 ++
 fs/btrfs/extent-tree.c | 43 
+++---

 3 files changed, 35 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 118346aceea9..df682a521635 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -422,6 +422,7 @@ struct btrfs_space_info {
 * time the transaction commits.
 */
struct percpu_counter total_bytes_pinned;
+   s32 total_bytes_pinned_batch;

struct list_head list;
/* Protected by the spinlock 'lock'. */
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 205092dc9390..dfed08e70ec1 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -961,8 +961,9 @@ static int btree_writepages(struct address_space 
*mapping,


fs_info = BTRFS_I(mapping->host)->root->fs_info;
/* this is a bit racy, but that's ok */
-   ret = percpu_counter_compare(&fs_info->dirty_metadata_bytes,
-BTRFS_DIRTY_METADATA_THRESH);
+   ret = __percpu_counter_compare(&fs_info->dirty_metadata_bytes,
+BTRFS_DIRTY_METADATA_THRESH,
+fs_info->dirty_metadata_batch);
if (ret < 0)
return 0;
}
@@ -4150,8 +4151,9 @@ static void __btrfs_btree_balance_dirty(struct 
btrfs_fs_info *fs_info,

if (flush_delayed)
btrfs_balance_delayed_items(fs_info);

-   ret = percpu_counter_compare(&fs_info->dirty_metadata_bytes,
-BTRFS_DIRTY_METADATA_THRESH);
+   ret = __percpu_counter_compare(&fs_info->dirty_metadata_bytes,
+BTRFS_DIRTY_METADATA_THRESH,
+fs_info->dirty_metadata_batch);
if (ret > 0) {

balance_dirty_pages_ratelimited(fs_info->btree_inode->i_mapping);
}


This hunk needs to be in a separate patch. And the last sentence which
relates to it needs to be expanded further to explain the problem.



OK, I'll prepare a separate patch.


diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3d9fe58c0080..a067c047904c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -758,7 +758,8 @@ static void add_pinned_bytes(struct btrfs_fs_info 
*fs_info, s64 num_bytes,


space_info = __find_space_info(fs_info, flags);
ASSERT(space_info);
-   percpu_counter_add(&space_info->total_bytes_pinned, num_bytes);
+   percpu_counter_add_batch(&space_info->total_bytes_pinned, num_bytes,
+   space_info->total_bytes_pinned_batch);
 }

 /*
@@ -2598,8 +2599,9 @@ static int cleanup_ref_head(struct 
btrfs_trans_handle *trans,

flags = BTRFS_BLOCK_GROUP_METADATA;
space_info = __find_space_info(fs_info, flags);
ASSERT(space_info);
-   percpu_counter_add(&space_info->total_bytes_pinned,
-  -head->num_bytes);
+   percpu_counter_add_batch(&space_info->total_bytes_pinned,
+  -head->num_bytes,
+  space_info->total_bytes_pinned_batch);

if (head->is_data) {
spin_lock(&delayed_refs->lock);
@@ -4035,6 +4037,10 @@ static int create_space_info(struct 
btrfs_fs_info *info, u64 flags)

INIT_LIST_HEAD(&space_info->ro_bgs);
INIT_LIST_HEAD(&space_info->tickets);
INIT_LIST_HEAD(&space_info->priority_tickets);
+   if (flags & BTRFS_BLOCK_GROUP_DATA)
+   space_info->total_bytes_pinned_batch = info->delalloc_batch;
+   else
+   space_info->total_bytes_pinned_batch = 
info->dirty_metadata_batch;

ret = kobject_init_and_add(&space_info->kobj, &space_info_ktype,
info->space_info_kobj, "%s",
@@ -4309,9 +4315,10 @@ int btrfs_alloc_data_chunk_ondemand(struct 

Re: [PATCH 05/14] btrfs: pass only eb to num_extent_pages

2018-06-29 Thread Nikolay Borisov



On 29.06.2018 11:56, David Sterba wrote:
> Almost all callers pass the start and len as 2 arguments but this is not
> necessary, all the information is provided by the eb. By reordering the
> calls to num_extent_pages, we don't need the local variables with
> start/len.
> 
> Reviewed-by: Nikolay Borisov 
> Signed-off-by: David Sterba 
> ---
>  fs/btrfs/extent_io.c | 30 +++---
>  fs/btrfs/extent_io.h |  6 +++---
>  2 files changed, 18 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 8e4a7cdbc9f5..577bbb026042 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2062,7 +2062,7 @@ int repair_eb_io_failure(struct btrfs_fs_info *fs_info,
>struct extent_buffer *eb, int mirror_num)
>  {
>   u64 start = eb->start;
> - unsigned long i, num_pages = num_extent_pages(eb->start, eb->len);
> + unsigned long i, num_pages = num_extent_pages(eb);
>   int ret = 0;
>  
>   if (sb_rdonly(fs_info->sb))
> @@ -3591,7 +3591,7 @@ lock_extent_buffer_for_io(struct extent_buffer *eb,
>   if (!ret)
>   return ret;
>  
> - num_pages = num_extent_pages(eb->start, eb->len);
> + num_pages = num_extent_pages(eb);
>   for (i = 0; i < num_pages; i++) {
>   struct page *p = eb->pages[i];
>  
> @@ -3721,7 +3721,7 @@ static noinline_for_stack int write_one_eb(struct 
> extent_buffer *eb,
>   int ret = 0;
>  
>   clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
> - num_pages = num_extent_pages(eb->start, eb->len);
> + num_pages = num_extent_pages(eb);
>   atomic_set(&eb->io_pages, num_pages);
>  
>   /* set btree blocks beyond nritems with 0 to avoid stale content. */
> @@ -4650,7 +4650,7 @@ static void btrfs_release_extent_buffer_page(struct 
> extent_buffer *eb)
>  
>   BUG_ON(extent_buffer_under_io(eb));
>  
> - index = num_extent_pages(eb->start, eb->len);
> + index = num_extent_pages(eb);
>   if (index == 0)
>   return;
>  
> @@ -4743,7 +4743,7 @@ struct extent_buffer *btrfs_clone_extent_buffer(struct 
> extent_buffer *src)
>   unsigned long i;
>   struct page *p;
>   struct extent_buffer *new;
> - unsigned long num_pages = num_extent_pages(src->start, src->len);
> + unsigned long num_pages = num_extent_pages(src);
>  
>   new = __alloc_extent_buffer(src->fs_info, src->start, src->len);
>   if (new == NULL)
> @@ -4775,12 +4775,11 @@ struct extent_buffer 
> *__alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
>   unsigned long num_pages;
>   unsigned long i;
>  
> - num_pages = num_extent_pages(start, len);
> -
>   eb = __alloc_extent_buffer(fs_info, start, len);
>   if (!eb)
>   return NULL;
>  
> + num_pages = num_extent_pages(eb);
>   for (i = 0; i < num_pages; i++) {
>   eb->pages[i] = alloc_page(GFP_NOFS);
>   if (!eb->pages[i])
> @@ -4844,7 +4843,7 @@ static void mark_extent_buffer_accessed(struct 
> extent_buffer *eb,
>  
>   check_buffer_tree_ref(eb);
>  
> - num_pages = num_extent_pages(eb->start, eb->len);
> + num_pages = num_extent_pages(eb);
>   for (i = 0; i < num_pages; i++) {
>   struct page *p = eb->pages[i];
>  
> @@ -4941,7 +4940,7 @@ struct extent_buffer *alloc_extent_buffer(struct 
> btrfs_fs_info *fs_info,
> u64 start)
>  {
>   unsigned long len = fs_info->nodesize;
> - unsigned long num_pages = num_extent_pages(start, len);
> + unsigned long num_pages;
>   unsigned long i;
>   unsigned long index = start >> PAGE_SHIFT;
>   struct extent_buffer *eb;
> @@ -4964,6 +4963,7 @@ struct extent_buffer *alloc_extent_buffer(struct 
> btrfs_fs_info *fs_info,
>   if (!eb)
>   return ERR_PTR(-ENOMEM);
>  
> + num_pages = num_extent_pages(eb);
>   for (i = 0; i < num_pages; i++, index++) {
>   p = find_or_create_page(mapping, index, GFP_NOFS|__GFP_NOFAIL);
>   if (!p) {
> @@ -5160,7 +5160,7 @@ void clear_extent_buffer_dirty(struct extent_buffer *eb)
>   unsigned long num_pages;
>   struct page *page;
>  
> - num_pages = num_extent_pages(eb->start, eb->len);
> + num_pages = num_extent_pages(eb);
>  
>   for (i = 0; i < num_pages; i++) {
>   page = eb->pages[i];
> @@ -5194,7 +5194,7 @@ int set_extent_buffer_dirty(struct extent_buffer *eb)
>  
>   was_dirty = test_and_set_bit(EXTENT_BUFFER_DIRTY, &eb->bflags);
>  
> - num_pages = num_extent_pages(eb->start, eb->len);
> + num_pages = num_extent_pages(eb);
>   WARN_ON(atomic_read(&eb->refs) == 0);
>   WARN_ON(!test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags));
>  
> @@ -5210,7 +5210,7 @@ void clear_extent_buffer_uptodate(struct extent_buffer 
> *eb)
>   unsigned long num_pages;
>  
>   clear_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
> - num_pages = num_extent_pages(eb->start, eb->len);
> + 

[PATCH 13/14] btrfs: raid56: don't lock stripe cache table when freeing

2018-06-29 Thread David Sterba
This is called either at the end of the mount or if the mount fails.
In both cases, there's nothing running that can use the table so the
lock is pointless.

Signed-off-by: David Sterba 
---
 fs/btrfs/raid56.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 272acd9b1192..0840b054e4b7 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -407,19 +407,16 @@ static void remove_rbio_from_cache(struct btrfs_raid_bio 
*rbio)
 static void btrfs_clear_rbio_cache(struct btrfs_fs_info *info)
 {
struct btrfs_stripe_hash_table *table;
-   unsigned long flags;
struct btrfs_raid_bio *rbio;
 
table = info->stripe_hash_table;
 
-   spin_lock_irqsave(&table->cache_lock, flags);
while (!list_empty(>stripe_cache)) {
rbio = list_entry(table->stripe_cache.next,
  struct btrfs_raid_bio,
  stripe_cache);
__remove_rbio_from_cache(rbio);
}
-   spin_unlock_irqrestore(&table->cache_lock, flags);
 }
 
 /*
-- 
2.17.1



[PATCH 11/14] btrfs: raid56: use new helper for async_scrub_parity

2018-06-29 Thread David Sterba
Signed-off-by: David Sterba 
---
 fs/btrfs/raid56.c | 14 +++---
 1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index f9b349171d61..339cce0878d1 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -170,7 +170,7 @@ static int alloc_rbio_pages(struct btrfs_raid_bio *rbio);
 
 static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
 int need_check);
-static void async_scrub_parity(struct btrfs_raid_bio *rbio);
+static void scrub_parity_work(struct btrfs_work *work);
 
 static void start_async_work(struct btrfs_raid_bio *rbio, btrfs_func_t 
work_func)
 {
@@ -812,7 +812,7 @@ static noinline void unlock_stripe(struct btrfs_raid_bio 
*rbio)
start_async_work(next, rmw_work);
} else if (next->operation == BTRFS_RBIO_PARITY_SCRUB) {
steal_rbio(rbio, next);
-   async_scrub_parity(next);
+   start_async_work(next, scrub_parity_work);
}
 
goto done_nolock;
@@ -2703,18 +2703,10 @@ static void scrub_parity_work(struct btrfs_work *work)
raid56_parity_scrub_stripe(rbio);
 }
 
-static void async_scrub_parity(struct btrfs_raid_bio *rbio)
-{
-   btrfs_init_work(&rbio->work, btrfs_rmw_helper,
-   scrub_parity_work, NULL, NULL);
-
-   btrfs_queue_work(rbio->fs_info->rmw_workers, &rbio->work);
-}
-
 void raid56_parity_submit_scrub_rbio(struct btrfs_raid_bio *rbio)
 {
if (!lock_stripe_add(rbio))
-   async_scrub_parity(rbio);
+   start_async_work(rbio, scrub_parity_work);
 }
 
 /* The following code is used for dev replace of a missing RAID 5/6 device. */
-- 
2.17.1



[PATCH 09/14] btrfs: raid56: use new helper for async_rmw_stripe

2018-06-29 Thread David Sterba
Signed-off-by: David Sterba 
---
 fs/btrfs/raid56.c | 11 ++-
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index f30d847baf07..96a7d3445623 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -162,7 +162,6 @@ static int __raid56_parity_recover(struct btrfs_raid_bio 
*rbio);
 static noinline void finish_rmw(struct btrfs_raid_bio *rbio);
 static void rmw_work(struct btrfs_work *work);
 static void read_rebuild_work(struct btrfs_work *work);
-static void async_rmw_stripe(struct btrfs_raid_bio *rbio);
 static void async_read_rebuild(struct btrfs_raid_bio *rbio);
 static int fail_bio_stripe(struct btrfs_raid_bio *rbio, struct bio *bio);
 static int fail_rbio_index(struct btrfs_raid_bio *rbio, int failed);
@@ -811,7 +810,7 @@ static noinline void unlock_stripe(struct btrfs_raid_bio 
*rbio)
async_read_rebuild(next);
} else if (next->operation == BTRFS_RBIO_WRITE) {
steal_rbio(rbio, next);
-   async_rmw_stripe(next);
+   start_async_work(next, rmw_work);
} else if (next->operation == BTRFS_RBIO_PARITY_SCRUB) {
steal_rbio(rbio, next);
async_scrub_parity(next);
@@ -1501,12 +1500,6 @@ static void raid_rmw_end_io(struct bio *bio)
rbio_orig_end_io(rbio, BLK_STS_IOERR);
 }
 
-static void async_rmw_stripe(struct btrfs_raid_bio *rbio)
-{
-   btrfs_init_work(&rbio->work, btrfs_rmw_helper, rmw_work, NULL, NULL);
-   btrfs_queue_work(rbio->fs_info->rmw_workers, &rbio->work);
-}
-
 static void async_read_rebuild(struct btrfs_raid_bio *rbio)
 {
btrfs_init_work(&rbio->work, btrfs_rmw_helper,
@@ -1645,7 +1638,7 @@ static int partial_stripe_write(struct btrfs_raid_bio 
*rbio)
 
ret = lock_stripe_add(rbio);
if (ret == 0)
-   async_rmw_stripe(rbio);
+   start_async_work(rbio, rmw_work);
return 0;
 }
 
-- 
2.17.1



[PATCH 12/14] btrfs: raid56: merge rbio_is_full helpers

2018-06-29 Thread David Sterba
There's only one call site of the unlocked helper so it can be folded
into the caller.

Signed-off-by: David Sterba 
---
 fs/btrfs/raid56.c | 23 ++++++-----------------
 1 file changed, 6 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 339cce0878d1..272acd9b1192 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -507,32 +507,21 @@ static void run_xor(void **pages, int src_cnt, ssize_t len)
 }
 
 /*
- * returns true if the bio list inside this rbio
- * covers an entire stripe (no rmw required).
- * Must be called with the bio list lock held, or
- * at a time when you know it is impossible to add
- * new bios into the list
+ * Returns true if the bio list inside this rbio covers an entire stripe (no
+ * rmw required).
  */
-static int __rbio_is_full(struct btrfs_raid_bio *rbio)
+static int rbio_is_full(struct btrfs_raid_bio *rbio)
 {
+   unsigned long flags;
unsigned long size = rbio->bio_list_bytes;
int ret = 1;
 
+   spin_lock_irqsave(&rbio->bio_list_lock, flags);
if (size != rbio->nr_data * rbio->stripe_len)
ret = 0;
-
BUG_ON(size > rbio->nr_data * rbio->stripe_len);
-   return ret;
-}
-
-static int rbio_is_full(struct btrfs_raid_bio *rbio)
-{
-   unsigned long flags;
-   int ret;
-
-   spin_lock_irqsave(&rbio->bio_list_lock, flags);
-   ret = __rbio_is_full(rbio);
spin_unlock_irqrestore(&rbio->bio_list_lock, flags);
+
return ret;
 }
 
-- 
2.17.1



[PATCH 14/14] btrfs: raid56: catch errors from full_stripe_write

2018-06-29 Thread David Sterba
Add fall-back code to catch failure of full_stripe_write. Proper error
handling from inside run_plug would need more code restructuring as it's
called at arbitrary points by io scheduler.

Signed-off-by: David Sterba 
---
 fs/btrfs/raid56.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 0840b054e4b7..84889d10d5b0 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1683,8 +1683,11 @@ static void run_plug(struct btrfs_plug_cb *plug)
list_del_init(&cur->plug_list);
 
if (rbio_is_full(cur)) {
+   int ret;
+
/* we have a full stripe, send it down */
-   full_stripe_write(cur);
+   ret = full_stripe_write(cur);
+   BUG_ON(ret);
continue;
}
if (last) {
-- 
2.17.1



[PATCH 07/14] btrfs: open-code bio_set_op_attrs

2018-06-29 Thread David Sterba
The helper is trivial and marked as deprecated.

Signed-off-by: David Sterba 
---
 fs/btrfs/check-integrity.c |  2 +-
 fs/btrfs/compression.c |  4 ++--
 fs/btrfs/extent_io.c   |  2 +-
 fs/btrfs/inode.c   |  2 +-
 fs/btrfs/raid56.c  | 10 +-
 fs/btrfs/scrub.c   |  6 +++---
 6 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index dc062b195c46..7addfa41b393 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -1624,7 +1624,7 @@ static int btrfsic_read_block(struct btrfsic_state *state,
bio = btrfs_io_bio_alloc(num_pages - i);
bio_set_dev(bio, block_ctx->dev->bdev);
bio->bi_iter.bi_sector = dev_bytenr >> 9;
-   bio_set_op_attrs(bio, REQ_OP_READ, 0);
+   bio->bi_opf = REQ_OP_READ;
 
for (j = i; j < num_pages; j++) {
ret = bio_add_page(bio, block_ctx->pagev[j],
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index f48794a36068..70dace47258b 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -609,7 +609,7 @@ blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
cb->len = bio->bi_iter.bi_size;
 
comp_bio = btrfs_bio_alloc(bdev, cur_disk_byte);
-   bio_set_op_attrs (comp_bio, REQ_OP_READ, 0);
+   comp_bio->bi_opf = REQ_OP_READ;
comp_bio->bi_private = cb;
comp_bio->bi_end_io = end_compressed_bio_read;
refcount_set(&cb->pending_bios, 1);
@@ -656,7 +656,7 @@ blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
}
 
comp_bio = btrfs_bio_alloc(bdev, cur_disk_byte);
-   bio_set_op_attrs(comp_bio, REQ_OP_READ, 0);
+   comp_bio->bi_opf = REQ_OP_READ;
comp_bio->bi_private = cb;
comp_bio->bi_end_io = end_compressed_bio_read;
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 94ca8d644de9..8d86531150cd 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2401,7 +2401,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
  start - page_offset(page),
  (int)phy_offset, failed_bio->bi_end_io,
  NULL);
-   bio_set_op_attrs(bio, REQ_OP_READ, read_mode);
+   bio->bi_opf = REQ_OP_READ | read_mode;
 
btrfs_debug(btrfs_sb(inode->i_sb),
"Repair Read Error: submitting new read[%#x] to this_mirror=%d, in_validation=%d",
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 02e12e8bffc4..0d0df4fae4b0 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7861,7 +7861,7 @@ static blk_status_t dio_read_error(struct inode *inode, struct bio *failed_bio,
isector >>= inode->i_sb->s_blocksize_bits;
bio = btrfs_create_repair_bio(inode, failed_bio, failrec, page,
pgoff, isector, repair_endio, repair_arg);
-   bio_set_op_attrs(bio, REQ_OP_READ, read_mode);
+   bio->bi_opf = REQ_OP_READ | read_mode;
 
btrfs_debug(BTRFS_I(inode)->root->fs_info,
"repair DIO read error: submitting new dio read[%#x] to this_mirror=%d, in_validation=%d",
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 42631079c492..1a1b7d6c44cb 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1330,7 +1330,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 
bio->bi_private = rbio;
bio->bi_end_io = raid_write_end_io;
-   bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
+   bio->bi_opf = REQ_OP_WRITE;
 
submit_bio(bio);
}
@@ -1586,7 +1586,7 @@ static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 
bio->bi_private = rbio;
bio->bi_end_io = raid_rmw_end_io;
-   bio_set_op_attrs(bio, REQ_OP_READ, 0);
+   bio->bi_opf = REQ_OP_READ;
 
btrfs_bio_wq_end_io(rbio->fs_info, bio, BTRFS_WQ_ENDIO_RAID56);
 
@@ -2130,7 +2130,7 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
 
bio->bi_private = rbio;
bio->bi_end_io = raid_recover_end_io;
-   bio_set_op_attrs(bio, REQ_OP_READ, 0);
+   bio->bi_opf = REQ_OP_READ;
 
btrfs_bio_wq_end_io(rbio->fs_info, bio, BTRFS_WQ_ENDIO_RAID56);
 
@@ -2502,7 +2502,7 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
 
bio->bi_private = rbio;
bio->bi_end_io = raid_write_end_io;
-   bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
+   bio->bi_opf = REQ_OP_WRITE;
 
submit_bio(bio);
}
@@ -2684,7 +2684,7 @@ static void 

[PATCH 06/14] btrfs: switch types to int when counting eb pages

2018-06-29 Thread David Sterba
The loops iterating eb pages use unsigned long, that's an overkill as
we know that there are at most 16 pages (64k / 4k), and 4 by default
(with nodesize 16k).

Reviewed-by: Nikolay Borisov 
Signed-off-by: David Sterba 
---
 fs/btrfs/extent_io.c | 44 ++++++++++++++++++++++----------------------
 fs/btrfs/extent_io.h |  2 +-
 2 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 577bbb026042..94ca8d644de9 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2062,7 +2062,7 @@ int repair_eb_io_failure(struct btrfs_fs_info *fs_info,
 struct extent_buffer *eb, int mirror_num)
 {
u64 start = eb->start;
-   unsigned long i, num_pages = num_extent_pages(eb);
+   int i, num_pages = num_extent_pages(eb);
int ret = 0;
 
if (sb_rdonly(fs_info->sb))
@@ -3541,7 +3541,7 @@ lock_extent_buffer_for_io(struct extent_buffer *eb,
  struct btrfs_fs_info *fs_info,
  struct extent_page_data *epd)
 {
-   unsigned long i, num_pages;
+   int i, num_pages;
int flush = 0;
int ret = 0;
 
@@ -3715,7 +3715,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
struct extent_io_tree *tree = &BTRFS_I(fs_info->btree_inode)->io_tree;
u64 offset = eb->start;
u32 nritems;
-   unsigned long i, num_pages;
+   int i, num_pages;
unsigned long start, end;
unsigned int write_flags = wbc_to_write_flags(wbc) | REQ_META;
int ret = 0;
@@ -4644,7 +4644,7 @@ int extent_buffer_under_io(struct extent_buffer *eb)
  */
 static void btrfs_release_extent_buffer_page(struct extent_buffer *eb)
 {
-   unsigned long index;
+   int index;
struct page *page;
int mapped = !test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags);
 
@@ -4740,10 +4740,10 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
 
 struct extent_buffer *btrfs_clone_extent_buffer(struct extent_buffer *src)
 {
-   unsigned long i;
+   int i;
struct page *p;
struct extent_buffer *new;
-   unsigned long num_pages = num_extent_pages(src);
+   int num_pages = num_extent_pages(src);
 
new = __alloc_extent_buffer(src->fs_info, src->start, src->len);
if (new == NULL)
@@ -4772,8 +4772,8 @@ struct extent_buffer *__alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
  u64 start, unsigned long len)
 {
struct extent_buffer *eb;
-   unsigned long num_pages;
-   unsigned long i;
+   int num_pages;
+   int i;
 
eb = __alloc_extent_buffer(fs_info, start, len);
if (!eb)
@@ -4839,7 +4839,7 @@ static void check_buffer_tree_ref(struct extent_buffer *eb)
 static void mark_extent_buffer_accessed(struct extent_buffer *eb,
struct page *accessed)
 {
-   unsigned long num_pages, i;
+   int num_pages, i;
 
check_buffer_tree_ref(eb);
 
@@ -4940,8 +4940,8 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
  u64 start)
 {
unsigned long len = fs_info->nodesize;
-   unsigned long num_pages;
-   unsigned long i;
+   int num_pages;
+   int i;
unsigned long index = start >> PAGE_SHIFT;
struct extent_buffer *eb;
struct extent_buffer *exists = NULL;
@@ -5156,8 +5156,8 @@ void free_extent_buffer_stale(struct extent_buffer *eb)
 
 void clear_extent_buffer_dirty(struct extent_buffer *eb)
 {
-   unsigned long i;
-   unsigned long num_pages;
+   int i;
+   int num_pages;
struct page *page;
 
num_pages = num_extent_pages(eb);
@@ -5186,8 +5186,8 @@ void clear_extent_buffer_dirty(struct extent_buffer *eb)
 
 int set_extent_buffer_dirty(struct extent_buffer *eb)
 {
-   unsigned long i;
-   unsigned long num_pages;
+   int i;
+   int num_pages;
int was_dirty = 0;
 
check_buffer_tree_ref(eb);
@@ -5205,9 +5205,9 @@ int set_extent_buffer_dirty(struct extent_buffer *eb)
 
 void clear_extent_buffer_uptodate(struct extent_buffer *eb)
 {
-   unsigned long i;
+   int i;
struct page *page;
-   unsigned long num_pages;
+   int num_pages;
 
clear_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
num_pages = num_extent_pages(eb);
@@ -5220,9 +5220,9 @@ void clear_extent_buffer_uptodate(struct extent_buffer *eb)
 
 void set_extent_buffer_uptodate(struct extent_buffer *eb)
 {
-   unsigned long i;
+   int i;
struct page *page;
-   unsigned long num_pages;
+   int num_pages;
 
set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
num_pages = num_extent_pages(eb);
@@ -5235,13 +5235,13 @@ void set_extent_buffer_uptodate(struct extent_buffer *eb)
 int read_extent_buffer_pages(struct extent_io_tree *tree,
 struct extent_buffer *eb, int wait, 

[PATCH 05/14] btrfs: pass only eb to num_extent_pages

2018-06-29 Thread David Sterba
Almost all callers pass the start and len as 2 arguments but this is not
necessary, all the information is provided by the eb. By reordering the
calls to num_extent_pages, we don't need the local variables with
start/len.

Reviewed-by: Nikolay Borisov 
Signed-off-by: David Sterba 
---
 fs/btrfs/extent_io.c | 30 +++++++++++++++---------------
 fs/btrfs/extent_io.h |  6 +++---
 2 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 8e4a7cdbc9f5..577bbb026042 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2062,7 +2062,7 @@ int repair_eb_io_failure(struct btrfs_fs_info *fs_info,
 struct extent_buffer *eb, int mirror_num)
 {
u64 start = eb->start;
-   unsigned long i, num_pages = num_extent_pages(eb->start, eb->len);
+   unsigned long i, num_pages = num_extent_pages(eb);
int ret = 0;
 
if (sb_rdonly(fs_info->sb))
@@ -3591,7 +3591,7 @@ lock_extent_buffer_for_io(struct extent_buffer *eb,
if (!ret)
return ret;
 
-   num_pages = num_extent_pages(eb->start, eb->len);
+   num_pages = num_extent_pages(eb);
for (i = 0; i < num_pages; i++) {
struct page *p = eb->pages[i];
 
@@ -3721,7 +3721,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
int ret = 0;
 
clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
-   num_pages = num_extent_pages(eb->start, eb->len);
+   num_pages = num_extent_pages(eb);
atomic_set(&eb->io_pages, num_pages);
 
/* set btree blocks beyond nritems with 0 to avoid stale content. */
@@ -4650,7 +4650,7 @@ static void btrfs_release_extent_buffer_page(struct extent_buffer *eb)
 
BUG_ON(extent_buffer_under_io(eb));
 
-   index = num_extent_pages(eb->start, eb->len);
+   index = num_extent_pages(eb);
if (index == 0)
return;
 
@@ -4743,7 +4743,7 @@ struct extent_buffer *btrfs_clone_extent_buffer(struct extent_buffer *src)
unsigned long i;
struct page *p;
struct extent_buffer *new;
-   unsigned long num_pages = num_extent_pages(src->start, src->len);
+   unsigned long num_pages = num_extent_pages(src);
 
new = __alloc_extent_buffer(src->fs_info, src->start, src->len);
if (new == NULL)
@@ -4775,12 +4775,11 @@ struct extent_buffer *__alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
unsigned long num_pages;
unsigned long i;
 
-   num_pages = num_extent_pages(start, len);
-
eb = __alloc_extent_buffer(fs_info, start, len);
if (!eb)
return NULL;
 
+   num_pages = num_extent_pages(eb);
for (i = 0; i < num_pages; i++) {
eb->pages[i] = alloc_page(GFP_NOFS);
if (!eb->pages[i])
@@ -4844,7 +4843,7 @@ static void mark_extent_buffer_accessed(struct extent_buffer *eb,
 
check_buffer_tree_ref(eb);
 
-   num_pages = num_extent_pages(eb->start, eb->len);
+   num_pages = num_extent_pages(eb);
for (i = 0; i < num_pages; i++) {
struct page *p = eb->pages[i];
 
@@ -4941,7 +4940,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
  u64 start)
 {
unsigned long len = fs_info->nodesize;
-   unsigned long num_pages = num_extent_pages(start, len);
+   unsigned long num_pages;
unsigned long i;
unsigned long index = start >> PAGE_SHIFT;
struct extent_buffer *eb;
@@ -4964,6 +4963,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
if (!eb)
return ERR_PTR(-ENOMEM);
 
+   num_pages = num_extent_pages(eb);
for (i = 0; i < num_pages; i++, index++) {
p = find_or_create_page(mapping, index, GFP_NOFS|__GFP_NOFAIL);
if (!p) {
@@ -5160,7 +5160,7 @@ void clear_extent_buffer_dirty(struct extent_buffer *eb)
unsigned long num_pages;
struct page *page;
 
-   num_pages = num_extent_pages(eb->start, eb->len);
+   num_pages = num_extent_pages(eb);
 
for (i = 0; i < num_pages; i++) {
page = eb->pages[i];
@@ -5194,7 +5194,7 @@ int set_extent_buffer_dirty(struct extent_buffer *eb)
 
was_dirty = test_and_set_bit(EXTENT_BUFFER_DIRTY, &eb->bflags);
 
-   num_pages = num_extent_pages(eb->start, eb->len);
+   num_pages = num_extent_pages(eb);
WARN_ON(atomic_read(&eb->refs) == 0);
WARN_ON(!test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags));
 
@@ -5210,7 +5210,7 @@ void clear_extent_buffer_uptodate(struct extent_buffer *eb)
unsigned long num_pages;
 
clear_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
-   num_pages = num_extent_pages(eb->start, eb->len);
+   num_pages = num_extent_pages(eb);
for (i = 0; i < num_pages; i++) {
page = eb->pages[i];
if (page)
@@ -5225,7 +5225,7 

[PATCH 08/14] btrfs: raid56: add new helper for starting async work

2018-06-29 Thread David Sterba
Add helper that schedules a given function to run on the rmw workqueue.
This will replace several standalone helpers.

Signed-off-by: David Sterba 
---
 fs/btrfs/raid56.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 1a1b7d6c44cb..f30d847baf07 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -174,6 +174,12 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
 int need_check);
 static void async_scrub_parity(struct btrfs_raid_bio *rbio);
 
+static void start_async_work(struct btrfs_raid_bio *rbio, btrfs_func_t work_func)
+{
+   btrfs_init_work(&rbio->work, btrfs_rmw_helper, work_func, NULL, NULL);
+   btrfs_queue_work(rbio->fs_info->rmw_workers, &rbio->work);
+}
+}
+
 /*
  * the stripe hash table is used for locking, and to collect
  * bios in hopes of making a full stripe
-- 
2.17.1



[PATCH 10/14] btrfs: raid56: use new helper for async_read_rebuild

2018-06-29 Thread David Sterba
Signed-off-by: David Sterba 
---
 fs/btrfs/raid56.c | 15 +++------------
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 96a7d3445623..f9b349171d61 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -162,7 +162,6 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio);
 static noinline void finish_rmw(struct btrfs_raid_bio *rbio);
 static void rmw_work(struct btrfs_work *work);
 static void read_rebuild_work(struct btrfs_work *work);
-static void async_read_rebuild(struct btrfs_raid_bio *rbio);
 static int fail_bio_stripe(struct btrfs_raid_bio *rbio, struct bio *bio);
 static int fail_rbio_index(struct btrfs_raid_bio *rbio, int failed);
 static void __free_raid_bio(struct btrfs_raid_bio *rbio);
@@ -804,10 +803,10 @@ static noinline void unlock_stripe(struct btrfs_raid_bio *rbio)
spin_unlock_irqrestore(&h->lock, flags);
 
if (next->operation == BTRFS_RBIO_READ_REBUILD)
-   async_read_rebuild(next);
+   start_async_work(next, read_rebuild_work);
else if (next->operation == BTRFS_RBIO_REBUILD_MISSING) {
steal_rbio(rbio, next);
-   async_read_rebuild(next);
+   start_async_work(next, read_rebuild_work);
} else if (next->operation == BTRFS_RBIO_WRITE) {
steal_rbio(rbio, next);
start_async_work(next, rmw_work);
@@ -1500,14 +1499,6 @@ static void raid_rmw_end_io(struct bio *bio)
rbio_orig_end_io(rbio, BLK_STS_IOERR);
 }
 
-static void async_read_rebuild(struct btrfs_raid_bio *rbio)
-{
-   btrfs_init_work(&rbio->work, btrfs_rmw_helper,
-   read_rebuild_work, NULL, NULL);
-
-   btrfs_queue_work(rbio->fs_info->rmw_workers, &rbio->work);
-}
-
 /*
  * the stripe must be locked by the caller.  It will
  * unlock after all the writes are done
@@ -2765,5 +2756,5 @@ raid56_alloc_missing_rbio(struct btrfs_fs_info *fs_info, struct bio *bio,
 void raid56_submit_missing_rbio(struct btrfs_raid_bio *rbio)
 {
if (!lock_stripe_add(rbio))
-   async_read_rebuild(rbio);
+   start_async_work(rbio, read_rebuild_work);
 }
-- 
2.17.1



[PATCH 04/14] btrfs: prune unused includes

2018-06-29 Thread David Sterba
Remove includes if none of the interfaces and exports is used in the
given source file.

Signed-off-by: David Sterba 
---
 fs/btrfs/compression.c  |  4 ----
 fs/btrfs/dev-replace.c  |  5 -----
 fs/btrfs/disk-io.c      |  2 --
 fs/btrfs/file.c         |  3 ---
 fs/btrfs/inode-map.c    |  1 -
 fs/btrfs/inode.c        |  4 ----
 fs/btrfs/ioctl.c        |  5 -----
 fs/btrfs/ordered-data.c |  1 -
 fs/btrfs/raid56.c       | 13 -------------
 fs/btrfs/reada.c        |  1 -
 fs/btrfs/struct-funcs.c |  1 -
 fs/btrfs/super.c        |  3 ---
 fs/btrfs/sysfs.c        |  2 --
 fs/btrfs/volumes.c      |  3 ---
 14 files changed, 48 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index d3e447b45bf7..f48794a36068 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -5,7 +5,6 @@
 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -14,10 +13,7 @@
 #include 
 #include 
 #include 
-#include 
-#include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index e2ba0419297a..819440204fd5 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -6,14 +6,9 @@
 #include 
 #include 
 #include 
-#include 
 #include 
-#include 
-#include 
-#include 
 #include 
 #include 
-#include 
 #include "ctree.h"
 #include "extent_map.h"
 #include "disk-io.h"
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7ee825b96187..efdfbd334b85 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -5,8 +5,6 @@
 
 #include 
 #include 
-#include 
-#include 
 #include 
 #include 
 #include 
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 11e43e69d12f..9f11ce754f75 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -5,14 +5,11 @@
 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
 #include 
-#include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/fs/btrfs/inode-map.c b/fs/btrfs/inode-map.c
index 12fcd8897c33..c120b8e25a3e 100644
--- a/fs/btrfs/inode-map.c
+++ b/fs/btrfs/inode-map.c
@@ -3,7 +3,6 @@
  * Copyright (C) 2007 Oracle.  All rights reserved.
  */
 
-#include 
 #include 
 #include 
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index f30608dc038b..02e12e8bffc4 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -14,17 +14,13 @@
 #include 
 #include 
 #include 
-#include 
-#include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 5556e9ea2a4b..a78a2c8cbd09 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -5,23 +5,18 @@
 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
-#include 
 #include 
-#include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 2e1a1694a33d..6055d357ce4c 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -6,7 +6,6 @@
 #include 
 #include 
 #include 
-#include 
 #include "ctree.h"
 #include "transaction.h"
 #include "btrfs_inode.h"
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 27ed47a23f26..42631079c492 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -5,32 +5,19 @@
  */
 
 #include 
-#include 
 #include 
 #include 
-#include 
 #include 
-#include 
-#include 
-#include 
-#include 
-#include 
 #include 
 #include 
 #include 
 #include 
 #include 
-#include 
 #include "ctree.h"
-#include "extent_map.h"
 #include "disk-io.h"
-#include "transaction.h"
-#include "print-tree.h"
 #include "volumes.h"
 #include "raid56.h"
 #include "async-thread.h"
-#include "check-integrity.h"
-#include "rcu-string.h"
 
 /* set when additional merges to this rbio are not allowed */
 #define RBIO_RMW_LOCKED_BIT1
diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index 40f1bcef394d..83d49f2e7bc6 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -7,7 +7,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include "ctree.h"
diff --git a/fs/btrfs/struct-funcs.c b/fs/btrfs/struct-funcs.c
index b7b4acb12833..4c13b737f568 100644
--- a/fs/btrfs/struct-funcs.c
+++ b/fs/btrfs/struct-funcs.c
@@ -3,7 +3,6 @@
  * Copyright (C) 2007 Oracle.  All rights reserved.
  */
 
-#include 
 #include 
 
 #include "ctree.h"
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 81107ad49f3a..a264bea21c3a 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -5,7 +5,6 @@
 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -15,8 +14,6 @@
 #include 
 #include 
 #include 
-#include 
-#include 
 #include 
 #include 
 #include 
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 4a4e960c7c66..3717c864ba23 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -7,10 +7,8 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
-#include 
 #include 
 
 #include 

[PATCH 01/14] btrfs: simplify some assignments of inode numbers

2018-06-29 Thread David Sterba
There are several places when the btrfs inode is converted to the
generic inode, back to btrfs and then passed to btrfs_ino. We can remove
the extra back and forth conversions.

Signed-off-by: David Sterba 
---
 fs/btrfs/inode.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c12b7a6e534a..a27e68405aa3 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4243,9 +4243,9 @@ static void btrfs_prune_dentries(struct btrfs_root *root)
prev = node;
entry = rb_entry(node, struct btrfs_inode, rb_node);
 
-   if (objectid < btrfs_ino(BTRFS_I(&entry->vfs_inode)))
+   if (objectid < btrfs_ino(entry))
node = node->rb_left;
-   else if (objectid > btrfs_ino(BTRFS_I(&entry->vfs_inode)))
+   else if (objectid > btrfs_ino(entry))
node = node->rb_right;
else
break;
@@ -4253,7 +4253,7 @@ static void btrfs_prune_dentries(struct btrfs_root *root)
if (!node) {
while (prev) {
entry = rb_entry(prev, struct btrfs_inode, rb_node);
-   if (objectid <= btrfs_ino(BTRFS_I(&entry->vfs_inode))) {
+   if (objectid <= btrfs_ino(entry)) {
node = prev;
break;
}
@@ -4262,7 +4262,7 @@ static void btrfs_prune_dentries(struct btrfs_root *root)
}
while (node) {
entry = rb_entry(node, struct btrfs_inode, rb_node);
-   objectid = btrfs_ino(BTRFS_I(&entry->vfs_inode)) + 1;
+   objectid = btrfs_ino(entry) + 1;
inode = igrab(&entry->vfs_inode);
if (inode) {
spin_unlock(&root->inode_lock);
@@ -5615,9 +5615,9 @@ static void inode_tree_add(struct inode *inode)
parent = *p;
entry = rb_entry(parent, struct btrfs_inode, rb_node);
 
-   if (ino < btrfs_ino(BTRFS_I(&entry->vfs_inode)))
+   if (ino < btrfs_ino(entry))
p = &parent->rb_left;
-   else if (ino > btrfs_ino(BTRFS_I(&entry->vfs_inode)))
+   else if (ino > btrfs_ino(entry))
p = &parent->rb_right;
else {
WARN_ON(!(entry->vfs_inode.i_state &
-- 
2.17.1



[PATCH 03/14] btrfs: use copy_page for copying pages instead of memcpy

2018-06-29 Thread David Sterba
Use the helper that's possibly optimized for full page copies.

Signed-off-by: David Sterba 
---
 fs/btrfs/free-space-cache.c |  4 ++--
 fs/btrfs/raid56.c   | 12 +++++-------
 2 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 38ad5e9f0bf2..15732be6bd56 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -547,7 +547,7 @@ static int io_ctl_add_bitmap(struct btrfs_io_ctl *io_ctl, void *bitmap)
io_ctl_map_page(io_ctl, 0);
}
 
-   memcpy(io_ctl->cur, bitmap, PAGE_SIZE);
+   copy_page(io_ctl->cur, bitmap);
io_ctl_set_crc(io_ctl, io_ctl->index - 1);
if (io_ctl->index < io_ctl->num_pages)
io_ctl_map_page(io_ctl, 0);
@@ -607,7 +607,7 @@ static int io_ctl_read_bitmap(struct btrfs_io_ctl *io_ctl,
if (ret)
return ret;
 
-   memcpy(entry->bitmap, io_ctl->cur, PAGE_SIZE);
+   copy_page(entry->bitmap, io_ctl->cur);
io_ctl_unmap_page(io_ctl);
 
return 0;
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 5e4ad134b9ad..27ed47a23f26 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -260,7 +260,7 @@ static void cache_rbio_pages(struct btrfs_raid_bio *rbio)
s = kmap(rbio->bio_pages[i]);
d = kmap(rbio->stripe_pages[i]);
 
-   memcpy(d, s, PAGE_SIZE);
+   copy_page(d, s);
 
kunmap(rbio->bio_pages[i]);
kunmap(rbio->stripe_pages[i]);
@@ -1275,7 +1275,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
pointers);
} else {
/* raid5 */
-   memcpy(pointers[nr_data], pointers[0], PAGE_SIZE);
+   copy_page(pointers[nr_data], pointers[0]);
run_xor(pointers + 1, nr_data - 1, PAGE_SIZE);
}
 
@@ -1941,9 +1941,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
BUG_ON(failb != -1);
 pstripe:
/* Copy parity block into failed block to start with */
-   memcpy(pointers[faila],
-  pointers[rbio->nr_data],
-  PAGE_SIZE);
+   copy_page(pointers[faila], pointers[rbio->nr_data]);
 
/* rearrange the pointer array */
p = pointers[faila];
@@ -2448,7 +2446,7 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
pointers);
} else {
/* raid5 */
-   memcpy(pointers[nr_data], pointers[0], PAGE_SIZE);
+   copy_page(pointers[nr_data], pointers[0]);
run_xor(pointers + 1, nr_data - 1, PAGE_SIZE);
}
 
@@ -2456,7 +2454,7 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
p = rbio_stripe_page(rbio, rbio->scrubp, pagenr);
parity = kmap(p);
if (memcmp(parity, pointers[rbio->scrubp], PAGE_SIZE))
-   memcpy(parity, pointers[rbio->scrubp], PAGE_SIZE);
+   copy_page(parity, pointers[rbio->scrubp]);
else
/* Parity is right, needn't writeback */
bitmap_clear(rbio->dbitmap, pagenr, 1);
-- 
2.17.1



[PATCH 02/14] btrfs: simplify pointer chasing of local fs_info variables

2018-06-29 Thread David Sterba
Functions that get btrfs inode can simply reach the fs_info by
dereferencing the root and this looks a bit more straightforward
compared to the btrfs_sb(...) indirection.

If the transaction handle is available and not NULL it's used instead.

Signed-off-by: David Sterba 
---
 fs/btrfs/delayed-inode.c|  4 ++--
 fs/btrfs/disk-io.c  |  2 +-
 fs/btrfs/extent-tree.c  |  6 +++---
 fs/btrfs/file-item.c|  2 +-
 fs/btrfs/file.c | 14 +++---
 fs/btrfs/free-space-cache.c |  7 ++-
 fs/btrfs/inode.c|  6 +++---
 fs/btrfs/tree-log.c |  6 +++---
 8 files changed, 22 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index fe6caa7e698b..596d2af0c8aa 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1222,7 +1222,7 @@ int btrfs_commit_inode_delayed_items(struct btrfs_trans_handle *trans,
 
 int btrfs_commit_inode_delayed_inode(struct btrfs_inode *inode)
 {
-   struct btrfs_fs_info *fs_info = btrfs_sb(inode->vfs_inode.i_sb);
+   struct btrfs_fs_info *fs_info = inode->root->fs_info;
struct btrfs_trans_handle *trans;
struct btrfs_delayed_node *delayed_node = btrfs_get_delayed_node(inode);
struct btrfs_path *path;
@@ -1837,7 +1837,7 @@ int btrfs_delayed_update_inode(struct btrfs_trans_handle *trans,
 
 int btrfs_delayed_delete_inode_ref(struct btrfs_inode *inode)
 {
-   struct btrfs_fs_info *fs_info = btrfs_sb(inode->vfs_inode.i_sb);
+   struct btrfs_fs_info *fs_info = inode->root->fs_info;
struct btrfs_delayed_node *delayed_node;
 
/*
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 205092dc9390..7ee825b96187 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -212,7 +212,7 @@ struct extent_map *btree_get_extent(struct btrfs_inode *inode,
struct page *page, size_t pg_offset, u64 start, u64 len,
int create)
 {
-   struct btrfs_fs_info *fs_info = btrfs_sb(inode->vfs_inode.i_sb);
+   struct btrfs_fs_info *fs_info = inode->root->fs_info;
struct extent_map_tree *em_tree = &inode->extent_tree;
struct extent_map *em;
int ret;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3d9fe58c0080..59e42481a515 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6019,7 +6019,7 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
 
 int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
 {
-   struct btrfs_fs_info *fs_info = btrfs_sb(inode->vfs_inode.i_sb);
+   struct btrfs_fs_info *fs_info = inode->root->fs_info;
unsigned nr_extents;
enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_FLUSH_ALL;
int ret = 0;
@@ -6092,7 +6092,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
 void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 num_bytes,
 bool qgroup_free)
 {
-   struct btrfs_fs_info *fs_info = btrfs_sb(inode->vfs_inode.i_sb);
+   struct btrfs_fs_info *fs_info = inode->root->fs_info;
 
num_bytes = ALIGN(num_bytes, fs_info->sectorsize);
	spin_lock(&inode->lock);
@@ -6121,7 +6121,7 @@ void btrfs_delalloc_release_metadata(struct btrfs_inode 
*inode, u64 num_bytes,
 void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes,
bool qgroup_free)
 {
-   struct btrfs_fs_info *fs_info = btrfs_sb(inode->vfs_inode.i_sb);
+   struct btrfs_fs_info *fs_info = inode->root->fs_info;
unsigned num_extents;
 
	spin_lock(&inode->lock);
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index f9dd6d1836a3..ec0a1acbe783 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -922,7 +922,7 @@ void btrfs_extent_item_to_extent_map(struct btrfs_inode 
*inode,
 const bool new_inline,
 struct extent_map *em)
 {
-   struct btrfs_fs_info *fs_info = btrfs_sb(inode->vfs_inode.i_sb);
+   struct btrfs_fs_info *fs_info = inode->root->fs_info;
struct btrfs_root *root = inode->root;
struct extent_buffer *leaf = path->nodes[0];
const int slot = path->slots[0];
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index f660ba1e5e58..11e43e69d12f 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -83,7 +83,7 @@ static int __compare_inode_defrag(struct inode_defrag 
*defrag1,
 static int __btrfs_add_inode_defrag(struct btrfs_inode *inode,
struct inode_defrag *defrag)
 {
-   struct btrfs_fs_info *fs_info = btrfs_sb(inode->vfs_inode.i_sb);
+   struct btrfs_fs_info *fs_info = inode->root->fs_info;
struct inode_defrag *entry;
struct rb_node **p;
struct rb_node *parent = NULL;
@@ -135,8 +135,8 @@ static inline int __need_auto_defrag(struct btrfs_fs_info 

[PATCH 00/14] Misc cleanups for 4.19

2018-06-29 Thread David Sterba
Minor interface/API and other safe cleanups.

David Sterba (14):
  btrfs: simplify some assignments of inode numbers
  btrfs: simplify pointer chasing of local fs_info variables
  btrfs: use copy_page for copying pages instead of memcpy
  btrfs: prune unused includes
  btrfs: pass only eb to num_extent_pages
  btrfs: switch types to int when counting eb pages
  btrfs: open-code bio_set_op_attrs
  btrfs: raid56: add new helper for starting async work
  btrfs: raid56: use new helper for async_rmw_stripe
  btrfs: raid56: use new helper for async_read_rebuild
  btrfs: raid56: use new helper for async_scrub_parity
  btrfs: raid56: merge rbio_is_full helpers
  btrfs: raid56: don't lock stripe cache table when freeing
  btrfs: raid56: catch errors from full_stripe_write

 fs/btrfs/check-integrity.c  |   2 +-
 fs/btrfs/compression.c  |   8 +--
 fs/btrfs/delayed-inode.c|   4 +-
 fs/btrfs/dev-replace.c  |   5 --
 fs/btrfs/disk-io.c  |   4 +-
 fs/btrfs/extent-tree.c  |   6 +-
 fs/btrfs/extent_io.c|  70 +++---
 fs/btrfs/extent_io.h|   6 +-
 fs/btrfs/file-item.c|   2 +-
 fs/btrfs/file.c |  17 +++---
 fs/btrfs/free-space-cache.c |  11 ++--
 fs/btrfs/inode-map.c|   1 -
 fs/btrfs/inode.c|  24 
 fs/btrfs/ioctl.c|   5 --
 fs/btrfs/ordered-data.c |   1 -
 fs/btrfs/raid56.c   | 112 +++-
 fs/btrfs/reada.c|   1 -
 fs/btrfs/scrub.c|   6 +-
 fs/btrfs/struct-funcs.c |   1 -
 fs/btrfs/super.c|   3 -
 fs/btrfs/sysfs.c|   2 -
 fs/btrfs/tree-log.c |   6 +-
 fs/btrfs/volumes.c  |   3 -
 23 files changed, 109 insertions(+), 191 deletions(-)

-- 
2.17.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: use customized batch size for total_bytes_pinned

2018-06-29 Thread Nikolay Borisov



On 29.06.2018 11:27, Ethan Lien wrote:
> We use percpu_counter to reduce lock contention on frequently updated
> counters. But the default batch size of 32 is suitable only for inc/dec
> updates, not for sector-sized updates. The end result is that we still
> acquire the spin lock for every percpu_counter_add(). So use a customized
> batch size for total_bytes_pinned, as we have already done for
> dirty_metadata_bytes and delalloc_bytes. Also fix dirty_metadata_bytes to
> use __percpu_counter_compare with the customized batch so we get a precise
> value of dirty_metadata_bytes.

Do you have any numbers to prove this is problematic? Perhaps a fio
workload + lockdep contention numbers before/after the patch?

> 
> Signed-off-by: Ethan Lien 
> ---
>  fs/btrfs/ctree.h   |  1 +
>  fs/btrfs/disk-io.c | 10 ++
>  fs/btrfs/extent-tree.c | 43 +++---
>  3 files changed, 35 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 118346aceea9..df682a521635 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -422,6 +422,7 @@ struct btrfs_space_info {
>* time the transaction commits.
>*/
>   struct percpu_counter total_bytes_pinned;
> + s32 total_bytes_pinned_batch;
>  
>   struct list_head list;
>   /* Protected by the spinlock 'lock'. */
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 205092dc9390..dfed08e70ec1 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -961,8 +961,9 @@ static int btree_writepages(struct address_space *mapping,
>  
>   fs_info = BTRFS_I(mapping->host)->root->fs_info;
>   /* this is a bit racy, but that's ok */
> - ret = percpu_counter_compare(&fs_info->dirty_metadata_bytes,
> -  BTRFS_DIRTY_METADATA_THRESH);
> + ret = __percpu_counter_compare(&fs_info->dirty_metadata_bytes,
> +  BTRFS_DIRTY_METADATA_THRESH,
> +  fs_info->dirty_metadata_batch);
>   if (ret < 0)
>   return 0;
>   }
> @@ -4150,8 +4151,9 @@ static void __btrfs_btree_balance_dirty(struct 
> btrfs_fs_info *fs_info,
>   if (flush_delayed)
>   btrfs_balance_delayed_items(fs_info);
>  
> - ret = percpu_counter_compare(&fs_info->dirty_metadata_bytes,
> -  BTRFS_DIRTY_METADATA_THRESH);
> + ret = __percpu_counter_compare(&fs_info->dirty_metadata_bytes,
> +  BTRFS_DIRTY_METADATA_THRESH,
> +  fs_info->dirty_metadata_batch);
>   if (ret > 0) {
>   
> balance_dirty_pages_ratelimited(fs_info->btree_inode->i_mapping);
>   }

This hunk needs to be in a separate patch. And the last sentence which
relates to it needs to be expanded further to explain the problem.

> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 3d9fe58c0080..a067c047904c 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -758,7 +758,8 @@ static void add_pinned_bytes(struct btrfs_fs_info 
> *fs_info, s64 num_bytes,
>  
>   space_info = __find_space_info(fs_info, flags);
>   ASSERT(space_info);
> - percpu_counter_add(&space_info->total_bytes_pinned, num_bytes);
> + percpu_counter_add_batch(&space_info->total_bytes_pinned, num_bytes,
> + space_info->total_bytes_pinned_batch);
>  }
>  
>  /*
> @@ -2598,8 +2599,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
> *trans,
>   flags = BTRFS_BLOCK_GROUP_METADATA;
>   space_info = __find_space_info(fs_info, flags);
>   ASSERT(space_info);
> - percpu_counter_add(&space_info->total_bytes_pinned,
> --head->num_bytes);
> + percpu_counter_add_batch(&space_info->total_bytes_pinned,
> +-head->num_bytes,
> +space_info->total_bytes_pinned_batch);
>  
>   if (head->is_data) {
> spin_lock(&delayed_refs->lock);
> @@ -4035,6 +4037,10 @@ static int create_space_info(struct btrfs_fs_info 
> *info, u64 flags)
> INIT_LIST_HEAD(&space_info->ro_bgs);
> INIT_LIST_HEAD(&space_info->tickets);
> INIT_LIST_HEAD(&space_info->priority_tickets);
> + if (flags & BTRFS_BLOCK_GROUP_DATA)
> + space_info->total_bytes_pinned_batch = info->delalloc_batch;
> + else
> + space_info->total_bytes_pinned_batch = 
> info->dirty_metadata_batch;
>  
> ret = kobject_init_and_add(&space_info->kobj, &space_info_ktype,
>   info->space_info_kobj, "%s",
> @@ -4309,9 +4315,10 @@ int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode 
> *inode, u64 bytes)
>* allocation, and no removed chunk in current transaction,
>* don't bother committing the transaction.
>*/
> - 

[PATCH] btrfs: use customized batch size for total_bytes_pinned

2018-06-29 Thread Ethan Lien
We use percpu_counter to reduce lock contention on frequently updated
counters. But the default batch size of 32 is suitable only for inc/dec
updates, not for sector-sized updates. The end result is that we still
acquire the spin lock for every percpu_counter_add(). So use a customized
batch size for total_bytes_pinned, as we have already done for
dirty_metadata_bytes and delalloc_bytes. Also fix dirty_metadata_bytes to
use __percpu_counter_compare with the customized batch so we get a precise
value of dirty_metadata_bytes.

Signed-off-by: Ethan Lien 
---
 fs/btrfs/ctree.h   |  1 +
 fs/btrfs/disk-io.c | 10 ++
 fs/btrfs/extent-tree.c | 43 +++---
 3 files changed, 35 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 118346aceea9..df682a521635 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -422,6 +422,7 @@ struct btrfs_space_info {
 * time the transaction commits.
 */
struct percpu_counter total_bytes_pinned;
+   s32 total_bytes_pinned_batch;
 
struct list_head list;
/* Protected by the spinlock 'lock'. */
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 205092dc9390..dfed08e70ec1 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -961,8 +961,9 @@ static int btree_writepages(struct address_space *mapping,
 
fs_info = BTRFS_I(mapping->host)->root->fs_info;
/* this is a bit racy, but that's ok */
-   ret = percpu_counter_compare(&fs_info->dirty_metadata_bytes,
-BTRFS_DIRTY_METADATA_THRESH);
+   ret = __percpu_counter_compare(&fs_info->dirty_metadata_bytes,
+BTRFS_DIRTY_METADATA_THRESH,
+fs_info->dirty_metadata_batch);
if (ret < 0)
return 0;
}
@@ -4150,8 +4151,9 @@ static void __btrfs_btree_balance_dirty(struct 
btrfs_fs_info *fs_info,
if (flush_delayed)
btrfs_balance_delayed_items(fs_info);
 
-   ret = percpu_counter_compare(&fs_info->dirty_metadata_bytes,
-BTRFS_DIRTY_METADATA_THRESH);
+   ret = __percpu_counter_compare(&fs_info->dirty_metadata_bytes,
+BTRFS_DIRTY_METADATA_THRESH,
+fs_info->dirty_metadata_batch);
if (ret > 0) {

balance_dirty_pages_ratelimited(fs_info->btree_inode->i_mapping);
}
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3d9fe58c0080..a067c047904c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -758,7 +758,8 @@ static void add_pinned_bytes(struct btrfs_fs_info *fs_info, 
s64 num_bytes,
 
space_info = __find_space_info(fs_info, flags);
ASSERT(space_info);
-   percpu_counter_add(&space_info->total_bytes_pinned, num_bytes);
+   percpu_counter_add_batch(&space_info->total_bytes_pinned, num_bytes,
+   space_info->total_bytes_pinned_batch);
 }
 
 /*
@@ -2598,8 +2599,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
*trans,
flags = BTRFS_BLOCK_GROUP_METADATA;
space_info = __find_space_info(fs_info, flags);
ASSERT(space_info);
-   percpu_counter_add(&space_info->total_bytes_pinned,
-  -head->num_bytes);
+   percpu_counter_add_batch(&space_info->total_bytes_pinned,
+  -head->num_bytes,
+  space_info->total_bytes_pinned_batch);
 
if (head->is_data) {
spin_lock(_refs->lock);
@@ -4035,6 +4037,10 @@ static int create_space_info(struct btrfs_fs_info *info, 
u64 flags)
	INIT_LIST_HEAD(&space_info->ro_bgs);
	INIT_LIST_HEAD(&space_info->tickets);
	INIT_LIST_HEAD(&space_info->priority_tickets);
+   if (flags & BTRFS_BLOCK_GROUP_DATA)
+   space_info->total_bytes_pinned_batch = info->delalloc_batch;
+   else
+   space_info->total_bytes_pinned_batch = 
info->dirty_metadata_batch;
 
	ret = kobject_init_and_add(&space_info->kobj, &space_info_ktype,
info->space_info_kobj, "%s",
@@ -4309,9 +4315,10 @@ int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode 
*inode, u64 bytes)
 * allocation, and no removed chunk in current transaction,
 * don't bother committing the transaction.
 */
-   have_pinned_space = percpu_counter_compare(
+   have_pinned_space = __percpu_counter_compare(
					&data_sinfo->total_bytes_pinned,
-   used + bytes - data_sinfo->total_bytes);
+   used + bytes - data_sinfo->total_bytes,
+   data_sinfo->total_bytes_pinned_batch);
	spin_unlock(&data_sinfo->lock);
 
/* 

Re: So, does btrfs check lowmem take days? weeks?

2018-06-29 Thread Lionel Bouton
Hi,

On 29/06/2018 09:22, Marc MERLIN wrote:
> On Fri, Jun 29, 2018 at 12:09:54PM +0500, Roman Mamedov wrote:
>> On Thu, 28 Jun 2018 23:59:03 -0700
>> Marc MERLIN  wrote:
>>
>>> I don't waste a week recreating the many btrfs send/receive relationships.
>> Consider not using send/receive, and switching to regular rsync instead.
>> Send/receive is very limiting and cumbersome, including because of what you
>> described. And it doesn't gain you much over an incremental rsync. As for
> Err, sorry but I cannot agree with you here, at all :)
>
> btrfs send/receive is pretty much the only reason I use btrfs. 
> rsync takes hours on big filesystems scanning every single inode on both
> sides and then seeing what changed, and only then sends the differences.
> It's super inefficient.
> btrfs send knows in seconds what needs to be sent, and works on it right
> away.

I've not yet tried send/receive but I feel the pain of rsyncing millions
of files (I had to use lsyncd to limit the problem to the time the
origin servers reboot which is a relatively rare event) so this thread
picked my attention. Looking at the whole thread I wonder if you could
get a more manageable solution by splitting the filesystem.

If instead of using a single BTRFS filesystem you used LVM volumes
(maybe with Thin provisioning and monitoring of the volume group free
space) for each of your servers to backup with one BTRFS filesystem per
volume you would have less snapshots per filesystem and isolate problems
in case of corruption. If you eventually decide to start from scratch
again this might help a lot in your case.

Lionel


Re: So, does btrfs check lowmem take days? weeks?

2018-06-29 Thread Roman Mamedov
On Fri, 29 Jun 2018 00:22:10 -0700
Marc MERLIN  wrote:

> On Fri, Jun 29, 2018 at 12:09:54PM +0500, Roman Mamedov wrote:
> > On Thu, 28 Jun 2018 23:59:03 -0700
> > Marc MERLIN  wrote:
> > 
> > > I don't waste a week recreating the many btrfs send/receive relationships.
> > 
> > Consider not using send/receive, and switching to regular rsync instead.
> > Send/receive is very limiting and cumbersome, including because of what you
> > described. And it doesn't gain you much over an incremental rsync. As for
> 
> Err, sorry but I cannot agree with you here, at all :)
> 
> btrfs send/receive is pretty much the only reason I use btrfs. 
> rsync takes hours on big filesystems scanning every single inode on both
> sides and then seeing what changed, and only then sends the differences

I use it for backing up root filesystems of about 20 hosts, and for syncing
large multi-terabyte media collections -- it's fast enough in both.
Admittedly, neither of those cases has millions of subdirs or files where
scanning may take a long time. And in the former case it's also all from and
to SSDs. Maybe your use case is different where it doesn't work as well. But
perhaps then general day-to-day performance is not great either, so I'd suggest
looking into SSD-based LVM caching, it really works wonders with Btrfs.

-- 
With respect,
Roman


Re: So, does btrfs check lowmem take days? weeks?

2018-06-29 Thread Marc MERLIN
On Fri, Jun 29, 2018 at 03:20:42PM +0800, Qu Wenruo wrote:
> If certain btrfs specific operations are involved, it's definitely not OK:
> 1) Balance
> 2) Quota
> 3) Btrfs check

Ok, I understand. I'll try to balance almost never then. My problems did
indeed start because I ran balance and it got stuck 2 days with 0
progress.
That still seems like a bug though. I'm OK with slow, but being stuck for 2
days with only 270 snapshots or so means there is a bug, or the
algorithm is so expensive that 270 snapshots can cause it to take days
or weeks to proceed.

> > It's a backup server, it only contains data from other machines.
> > If the filesystem cannot be recovered to a working state, I will need
> > over a week to restart the many btrfs send commands from many servers.
This is why anything other than --repair is useless to me, I don't need
> > the data back, it's still on the original machines, I need the
> > filesystem to work again so that I don't waste a week recreating the
> > many btrfs send/receive relationships.
> 
> Now totally understand why you need to repair the fs.

I also understand that my use case is atypical :)
But I guess this also means that using btrfs for a lot of send/receive
on a backup server is not going to work well unfortunately :-/

Now I'm wondering if I'm the only person even doing this.

> > Does the pastebin help and is 270 snapshots ok enough?
> 
> The super dump doesn't show anything wrong.
> 
> So the problem may be in the super large extent tree.
> 
> In this case, plain check result with Su's patch would help more, other
> than the not so interesting super dump.

First I tried to mount with skip balance after the partial repair, and
it hung for a long time:
[445635.716318] BTRFS info (device dm-2): disk space caching is enabled
[445635.736229] BTRFS info (device dm-2): has skinny extents
[445636.101999] BTRFS info (device dm-2): bdev /dev/mapper/dshelf2 errs: wr 0, 
rd 0, flush 0, corrupt 2, gen 0
[445825.053205] BTRFS info (device dm-2): enabling ssd optimizations
[446511.006588] BTRFS info (device dm-2): disk space caching is enabled
[446511.026737] BTRFS info (device dm-2): has skinny extents
[446511.325470] BTRFS info (device dm-2): bdev /dev/mapper/dshelf2 errs: wr 0, 
rd 0, flush 0, corrupt 2, gen 0
[446699.593501] BTRFS info (device dm-2): enabling ssd optimizations
[446964.077045] INFO: task btrfs-transacti:9211 blocked for more than 120 
seconds.
[446964.099802]   Not tainted 4.17.2-amd64-preempt-sysrq-20180818 #3
[446964.120004] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
this message.

So, I rebooted, and will now run Su's btrfs check without repair and
report back.

Thanks both for your help.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/   | PGP 7F55D5F27AAF9D08


Re: So, does btrfs check lowmem take days? weeks?

2018-06-29 Thread Marc MERLIN
On Fri, Jun 29, 2018 at 12:09:54PM +0500, Roman Mamedov wrote:
> On Thu, 28 Jun 2018 23:59:03 -0700
> Marc MERLIN  wrote:
> 
> > I don't waste a week recreating the many btrfs send/receive relationships.
> 
> Consider not using send/receive, and switching to regular rsync instead.
> Send/receive is very limiting and cumbersome, including because of what you
> described. And it doesn't gain you much over an incremental rsync. As for

Err, sorry but I cannot agree with you here, at all :)

btrfs send/receive is pretty much the only reason I use btrfs. 
rsync takes hours on big filesystems scanning every single inode on both
sides and then seeing what changed, and only then sends the differences.
It's super inefficient.
btrfs send knows in seconds what needs to be sent, and works on it right
away.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/   | PGP 7F55D5F27AAF9D08


Re: So, does btrfs check lowmem take days? weeks?

2018-06-29 Thread Qu Wenruo


On 2018-06-29 14:59, Marc MERLIN wrote:
> On Fri, Jun 29, 2018 at 02:29:10PM +0800, Qu Wenruo wrote:
>>> If --repair doesn't work, check is useless to me sadly.
>>
>> Not exactly.
>> Although it's time consuming, I have manually patched several users fs,
>> which normally ends pretty well.
>  
> Ok I understand now.
> 
>>> Agreed, I doubt I have over or much over 100 snapshots though (but I
>>> can't check right now).
>>> Sadly I'm not allowed to mount even read only while check is running:
>>> gargamel:~# mount -o ro /dev/mapper/dshelf2 /mnt/mnt2
>>> mount: /dev/mapper/dshelf2 already mounted or /mnt/mnt2 busy
> 
> Ok, so I just checked now, 270 snapshots, but not because I'm crazy,
> because I use btrfs send a lot :)
> 
>> This looks like super block corruption?
>>
>> What about "btrfs inspect dump-super -fFa /dev/mapper/dshelf2"?
> 
> Sure, there you go: https://pastebin.com/uF1pHTsg
> 
>> And what about "skip_balance" mount option?
>  
> I have this in my fstab :)
> 
>> Another problem is, with so many snapshots, balance is also hugely
>> slowed, thus I'm not 100% sure if it's really a hang.
> 
> I sent another thread about this last week, balance got hung after 2
> days of doing nothing and just moving a single chunk.
> 
> Ok, I was able to remount the filesystem read only. I was wrong, I have
> 270 snapshots:
> gargamel:/mnt/mnt# btrfs subvolume list . | grep -c 'path backup/'
> 74
> gargamel:/mnt/mnt# btrfs subvolume list . | grep -c 'path backup-btrfssend/'
> 196
> 
> It's a backup server, I use btrfs send for many machines and for each btrfs
> send, I keep history, maybe 10 or so backups. So it adds up in the end.
> 
> Is btrfs unable to deal with this well enough?

It depends.
For certain rare cases, if the only operations on the filesystem are
non-btrfs-specific operations (POSIX file operations), then you're fine.
(Maybe you can go to thousands of snapshots before any obvious performance
degradation.)

If certain btrfs specific operations are involved, it's definitely not OK:
1) Balance
2) Quota
3) Btrfs check

> 
>> If for that usage, btrfs-restore would fit your use case more,
>> Unfortunately it needs extra disk space and isn't good at restoring
>> subvolume/snapshots.
>> (Although it's much faster than repairing the possible corrupted extent
>> tree)
> 
> It's a backup server, it only contains data from other machines.
> If the filesystem cannot be recovered to a working state, I will need
> over a week to restart the many btrfs send commands from many servers.
> This is why anything other than --repair is useless to me, I don't need
> the data back, it's still on the original machines, I need the
> filesystem to work again so that I don't waste a week recreating the
> many btrfs send/receive relationships.

Now totally understand why you need to repair the fs.

> 
>>> Is that possible at all?
>>
>> At least for file recovery (fs tree repair), we have such behavior.
>>
>> However, the problem you hit (and a lot of users hit) is all about
>> extent tree repair, which doesn't even goes to file recovery.
>>
>> All the hassle are in extent tree, and for extent tree, it's just good
>> or bad. Any corruption in extent tree may lead to later bugs.
>> The only way to avoid extent tree problems is to mount the fs RO.
>>
>> So, I'm afraid it is at least impossible for recent years.
> 
> Understood, thanks for answering.
> 
> Does the pastebin help and is 270 snapshots ok enough?

The super dump doesn't show anything wrong.

So the problem may be in the super large extent tree.

In this case, plain check result with Su's patch would help more, other
than the not so interesting super dump.

Thanks,
Qu

> 
> Thanks,
> Marc
> 





Re: So, does btrfs check lowmem take days? weeks?

2018-06-29 Thread Roman Mamedov
On Thu, 28 Jun 2018 23:59:03 -0700
Marc MERLIN  wrote:

> I don't waste a week recreating the many btrfs send/receive relationships.

Consider not using send/receive, and switching to regular rsync instead.
Send/receive is very limiting and cumbersome, including because of what you
described. And it doesn't gain you much over an incremental rsync. As for
snapshots on the backup server, you can either automate making one as soon as a
backup has finished, or simply make them once/twice a day, during a period
when no backups are ongoing.

-- 
With respect,
Roman


Re: [PATCH v2] Revert "btrfs: fix a possible umount deadlock"

2018-06-29 Thread Nikolay Borisov



On 29.06.2018 10:11, Anand Jain wrote:
> 
> On 06/29/2018 01:26 PM, Nikolay Borisov wrote:
>> Since commit 88c14590cdd6 ("btrfs: use RCU in btrfs_show_devname for
>> device list traversal") btrfs_show_devname no longer takes
>> device_list_mutex. As such the deadlock that 0ccd05285e7f ("btrfs: fix
>> a possible umount deadlock") aimed to fix no longer exists. So remove
>> the extra code that commit added.
>>
>> This reverts commit 0ccd05285e7f5a8e297e1d6dfc41e7c65757d6fa.
> 
>  Its not exactly a revert.
> 
>  Because before 88c14590cdd6 was written, 0ccd05285e7f was already there
>  for the right purpose, or a new title would be even better.

What I'm saying is that commit 88c14590cdd6 obsoleted 0ccd05285e7f,
hence this patch reverts the latter. I've literally generated it with
git revert 0ccd05285e7f + minor fixups.

> 
> 
>> Why do we first replace the closed device with a copy in
>> btrfs_close_one_device, then dispose of the copied
>> devices in btrfs_close_devices IFF we had fs_devices->seed not being
>> NULL?
> 
>  I doubt it's for the fs_devices->seed purpose; we could have
>  just NULLed a few items in device and it would have worked
>  as well. I guess it's for the purpose of satisfying RCU;
>  I remember running into that issue while trying to fix it.
>  If you have plans to fix that, please go ahead. Thanks. My
>  time is occupied with the volume locks as of now; otherwise
>  it's on the list to fix.

Yeah , I spoke with David about that and he said it was related to RCU
and of course the commit that introduced that had nothing about RCU in
the change log. Device close logic seems to be a train wreck atm.
> 
>> Signed-off-by: Nikolay Borisov 
> 
> Reviewed-by: Anand Jain 
> 
> Thanks, Anand
> 
>> ---
>>
>> V2:
>>   * Fixed build failure due to using old name of free_device_rcu
>> function.
>>   fs/btrfs/volumes.c | 26 ++
>>   1 file changed, 6 insertions(+), 20 deletions(-)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 5bd6f3a40f9c..011a19b7930f 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -1004,7 +1004,7 @@ static void btrfs_close_bdev(struct btrfs_device
>> *device)
>>   blkdev_put(device->bdev, device->mode);
>>   }
>>   -static void btrfs_prepare_close_one_device(struct btrfs_device
>> *device)
>> +static void btrfs_close_one_device(struct btrfs_device *device)
>>   {
>>   struct btrfs_fs_devices *fs_devices = device->fs_devices;
>>   struct btrfs_device *new_device;
>> @@ -1022,6 +1022,8 @@ static void
>> btrfs_prepare_close_one_device(struct btrfs_device *device)
>>   if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
>>   fs_devices->missing_devices--;
>>   +    btrfs_close_bdev(device);
>> +
>>   new_device = btrfs_alloc_device(NULL, &device->devid,
>>   device->uuid);
>>   BUG_ON(IS_ERR(new_device)); /* -ENOMEM */
>> @@ -1035,39 +1037,23 @@ static void
>> btrfs_prepare_close_one_device(struct btrfs_device *device)
>>     list_replace_rcu(&device->dev_list, &new_device->dev_list);
>>   new_device->fs_devices = device->fs_devices;
>> +
>> +    call_rcu(&device->rcu, free_device_rcu);
>>   }
>>     static int close_fs_devices(struct btrfs_fs_devices *fs_devices)
>>   {
>>   struct btrfs_device *device, *tmp;
>> -    struct list_head pending_put;
>> -
>> -    INIT_LIST_HEAD(&pending_put);
>>     if (--fs_devices->opened > 0)
>>   return 0;
>>   mutex_lock(&fs_devices->device_list_mutex);
>>   list_for_each_entry_safe(device, tmp, &fs_devices->devices,
>> dev_list) {
>> -    btrfs_prepare_close_one_device(device);
>> -    list_add(&device->dev_list, &pending_put);
>> +    btrfs_close_one_device(device);
>>   }
>>   mutex_unlock(&fs_devices->device_list_mutex);
>>   -    /*
>> - * btrfs_show_devname() is using the device_list_mutex,
>> - * sometimes call to blkdev_put() leads vfs calling
>> - * into this func. So do put outside of device_list_mutex,
>> - * as of now.
>> - */
>> -    while (!list_empty(_put)) {
>> -    device = list_first_entry(&pending_put,
>> -    struct btrfs_device, dev_list);
>> -    list_del(&device->dev_list);
>> -    btrfs_close_bdev(device);
>> -    call_rcu(&device->rcu, free_device_rcu);
>> -    }
>> -
>>   WARN_ON(fs_devices->open_devices);
>>   WARN_ON(fs_devices->rw_devices);
>>   fs_devices->opened = 0;
>>
> 


Re: [PATCH v2] Revert "btrfs: fix a possible umount deadlock"

2018-06-29 Thread Anand Jain



On 06/29/2018 01:26 PM, Nikolay Borisov wrote:

Since commit 88c14590cdd6 ("btrfs: use RCU in btrfs_show_devname for
device list traversal") btrfs_show_devname no longer takes
device_list_mutex. As such the deadlock that 0ccd05285e7f ("btrfs: fix
a possible umount deadlock") aimed to fix no longer exists. So remove
the extra code that commit added.

This reverts commit 0ccd05285e7f5a8e297e1d6dfc41e7c65757d6fa.


 Its not exactly a revert.

 Because before 88c14590cdd6 was written, 0ccd05285e7f was already there
 for the right purpose, or a new title would be even better.


> Why do we first replace the closed device with a copy in
> btrfs_close_one_device, then dispose of the copied
> devices in btrfs_close_devices IFF we had fs_devices->seed not being
> NULL?

 I doubt it's for the fs_devices->seed purpose; we could have
 just NULLed a few items in device and it would have worked
 as well. I guess it's for the purpose of satisfying RCU;
 I remember running into that issue while trying to fix it.
 If you have plans to fix that, please go ahead. Thanks. My
 time is occupied with the volume locks as of now; otherwise
 it's on the list to fix.


Signed-off-by: Nikolay Borisov 


Reviewed-by: Anand Jain 

Thanks, Anand


---

V2:
  * Fixed build failure due to using old name of free_device_rcu function.
  fs/btrfs/volumes.c | 26 ++
  1 file changed, 6 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 5bd6f3a40f9c..011a19b7930f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1004,7 +1004,7 @@ static void btrfs_close_bdev(struct btrfs_device *device)
blkdev_put(device->bdev, device->mode);
  }
  
-static void btrfs_prepare_close_one_device(struct btrfs_device *device)

+static void btrfs_close_one_device(struct btrfs_device *device)
  {
struct btrfs_fs_devices *fs_devices = device->fs_devices;
struct btrfs_device *new_device;
@@ -1022,6 +1022,8 @@ static void btrfs_prepare_close_one_device(struct 
btrfs_device *device)
	if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
fs_devices->missing_devices--;
  
+	btrfs_close_bdev(device);

+
	new_device = btrfs_alloc_device(NULL, &device->devid,
device->uuid);
BUG_ON(IS_ERR(new_device)); /* -ENOMEM */
@@ -1035,39 +1037,23 @@ static void btrfs_prepare_close_one_device(struct 
btrfs_device *device)
  
  	list_replace_rcu(&device->dev_list, &new_device->dev_list);

new_device->fs_devices = device->fs_devices;
+
+   call_rcu(&device->rcu, free_device_rcu);
  }
  
  static int close_fs_devices(struct btrfs_fs_devices *fs_devices)

  {
struct btrfs_device *device, *tmp;
-   struct list_head pending_put;
-
-   INIT_LIST_HEAD(&pending_put);
  
  	if (--fs_devices->opened > 0)

return 0;
  
  	mutex_lock(&fs_devices->device_list_mutex);

	list_for_each_entry_safe(device, tmp, &fs_devices->devices, dev_list) {
-   btrfs_prepare_close_one_device(device);
-   list_add(&device->dev_list, &pending_put);
+   btrfs_close_one_device(device);
}
mutex_unlock(_devices->device_list_mutex);
  
-	/*

-* btrfs_show_devname() is using the device_list_mutex,
-* sometimes call to blkdev_put() leads vfs calling
-* into this func. So do put outside of device_list_mutex,
-* as of now.
-*/
-   while (!list_empty(_put)) {
-   device = list_first_entry(_put,
-   struct btrfs_device, dev_list);
-   list_del(>dev_list);
-   btrfs_close_bdev(device);
-   call_rcu(>rcu, free_device_rcu);
-   }
-
WARN_ON(fs_devices->open_devices);
WARN_ON(fs_devices->rw_devices);
fs_devices->opened = 0;


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] fs: add initial bh_result->b_private value to __blockdev_direct_IO()

2018-06-29 Thread Omar Sandoval
On Mon, Jun 25, 2018 at 07:16:38PM +0200, David Sterba wrote:
> On Mon, May 14, 2018 at 06:35:48PM +0200, David Sterba wrote:
> > On Fri, May 11, 2018 at 01:30:01PM -0700, Omar Sandoval wrote:
> > > On Fri, May 11, 2018 at 09:05:38PM +0100, Al Viro wrote:
> > > > On Thu, May 10, 2018 at 11:30:10PM -0700, Omar Sandoval wrote:
> > > > >  do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
> > > > > struct block_device *bdev, struct iov_iter *iter,
> > > > > get_block_t get_block, dio_iodone_t end_io,
> > > > > -   dio_submit_t submit_io, int flags)
> > > > > +   dio_submit_t submit_io, int flags, void *private)
> > > > 
> > > > Oh, dear...  That's what, 9 arguments?  I agree that the hack in 
> > > > question
> > > > is obscene, but so is this ;-/
> > > 
> > > So looking at these one by one, obviously needed:
> > > 
> > > - iocb
> > > - inode
> > > - iter
> > > 
> > > bdev is almost always inode->i_sb->s_bdev, except for Btrfs :(
> > > 
> > > These could _maybe_ go in struct kiocb:
> > > 
> > > - flags could maybe be folded into ki_flags
> > > - private could maybe go in iocb->private, but I haven't yet read
> > >   through to figure out if we're already using iocb->private for direct
> > >   I/O
> > 
> > I think the kiocb::private can be used for the purpose. There's only one
> > user, ext4, that also passes some DIO data around so it would in line
> > with the interface AFAICS.
> 
> Omar, do you have an update to the patchset? Thanks.

Al, what do you think of changing all users of map_bh->b_private to use
iocb->private? We'd have to pass the iocb to get_block() and
submit_io(), but we could get rid of dio->private.


Re: So, does btrfs check lowmem take days? weeks?

2018-06-29 Thread Marc MERLIN
On Fri, Jun 29, 2018 at 02:29:10PM +0800, Qu Wenruo wrote:
> > If --repair doesn't work, check is useless to me sadly.
> 
> Not exactly.
> Although it's time consuming, I have manually patched several users fs,
> which normally ends pretty well.
 
Ok I understand now.

> > Agreed, I doubt I have over or much over 100 snapshots though (but I
> > can't check right now).
> > Sadly I'm not allowed to mount even read only while check is running:
> > gargamel:~# mount -o ro /dev/mapper/dshelf2 /mnt/mnt2
> > mount: /dev/mapper/dshelf2 already mounted or /mnt/mnt2 busy

Ok, so I just checked now, 270 snapshots, but not because I'm crazy,
because I use btrfs send a lot :)

> This looks like super block corruption?
> 
> What about "btrfs inspect dump-super -fFa /dev/mapper/dshelf2"?

Sure, there you go: https://pastebin.com/uF1pHTsg

> And what about "skip_balance" mount option?
 
I have this in my fstab :)

> Another problem is, with so many snapshots, balance is also hugely
> slowed, thus I'm not 100% sure if it's really a hang.

I sent another thread about this last week, balance got hung after 2
days of doing nothing and just moving a single chunk.

Ok, I was able to remount the filesystem read only. I was wrong, I have
270 snapshots:
gargamel:/mnt/mnt# btrfs subvolume list . | grep -c 'path backup/'
74
gargamel:/mnt/mnt# btrfs subvolume list . | grep -c 'path backup-btrfssend/'
196

It's a backup server, I use btrfs send for many machines, and for each btrfs
send I keep history, maybe 10 or so backups. So it adds up in the end.

Is btrfs unable to deal with this well enough?

> If for that usage, btrfs-restore would fit your use case more,
> Unfortunately it needs extra disk space and isn't good at restoring
> subvolume/snapshots.
> (Although it's much faster than repairing the possible corrupted extent
> tree)

It's a backup server, it only contains data from other machines.
If the filesystem cannot be recovered to a working state, I will need
over a week to restart the many btrfs send commands from many servers.
This is why anything other than --repair is useless to me; I don't need
the data back, it's still on the original machines, I need the
filesystem to work again so that I don't waste a week recreating the
many btrfs send/receive relationships.

> > Is that possible at all?
> 
> At least for file recovery (fs tree repair), we have such behavior.
> 
> However, the problem you hit (and a lot of users hit) is all about
> extent tree repair, which doesn't even get to file recovery.
> 
> All the hassle is in the extent tree, and for the extent tree it's just good
> or bad. Any corruption in the extent tree may lead to later bugs.
> The only way to avoid extent tree problems is to mount the fs RO.
> 
> So, I'm afraid it will remain impossible, at least for the coming years.

Understood, thanks for answering.

Does the pastebin help and is 270 snapshots ok enough?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/   | PGP 7F55D5F27AAF9D08


Re: So, does btrfs check lowmem take days? weeks?

2018-06-29 Thread Marc MERLIN
On Fri, Jun 29, 2018 at 02:32:44PM +0800, Su Yue wrote:
> > > https://github.com/Damenly/btrfs-progs/tree/tmp1
> > 
> > Not sure if I understand what you meant here.
> > 
> Sorry for my unclear words.
> Simply speaking, I suggest you to stop current running check.
> Then, clone above branch to compile binary then run
> 'btrfs check --mode=lowmem $dev'.
 
I understand, I'll build and try it.

> > This filesystem is trash to me and will require over a week to rebuild
> > manually if I can't repair it.
> 
> Understood your anxiety, a log of check without '--repair' will help
> us to make clear what's wrong with your filesystem.

Ok, I'll run your new code without repair and report back. It will
likely take over a day though.

Marc


Re: So, does btrfs check lowmem take days? weeks?

2018-06-29 Thread Qu Wenruo


On 2018-06-29 14:06, Marc MERLIN wrote:
> On Fri, Jun 29, 2018 at 01:48:17PM +0800, Qu Wenruo wrote:
>> Just normal btrfs check, and post the output.
>> If normal check eats up all your memory, btrfs check --mode=lowmem.
>  
> Does check without --repair eat less RAM?

Unfortunately, no.

> 
>> --repair should be considered as the last method.
> 
> If --repair doesn't work, check is useless to me sadly.

Not exactly.
Although it's time consuming, I have manually patched several users fs,
which normally ends pretty well.

If it's not a wide-spread problem but some small fatal one, it may be fixed.

> I know that for
> FS analysis and bug reporting, you want to have the FS without changing
> it to something maybe worse, but for my use, if it can't be mounted and
> can't be fixed, then it gets deleted which is even worse than check
> doing the wrong thing.
> 
>>> The last two ERROR lines took over a day to get generated, so I'm not sure 
>>> if it's still working, but just slowly.
>>
>> OK, that explains something.
>>
>> One extent is referenced hundreds of times; no wonder it will take a long time.
>>
>> Just one tip here, there are really too many snapshots/reflinked files.
>> It's highly recommended to keep the number of snapshots to a reasonable
>> number (lower two digits).
>> Although btrfs snapshot is super fast, it puts a lot of pressure on its
>> extent tree, so there is no free lunch here.
>  
> Agreed, I doubt I have over or much over 100 snapshots though (but I
> can't check right now).
> Sadly I'm not allowed to mount even read only while check is running:
> gargamel:~# mount -o ro /dev/mapper/dshelf2 /mnt/mnt2
> mount: /dev/mapper/dshelf2 already mounted or /mnt/mnt2 busy
> 
>>> I see. Is there any reasonably easy way to check on this running process?
>>
>> GDB attach would be good.
>> Interrupt and check the inode number if it's checking fs tree.
>> Check the extent bytenr number if it's checking extent tree.
>>
>> But considering how many snapshots there are, it's really hard to determine.
>>
>> In this case, the super large extent tree is causing a lot of problem,
>> maybe it's a good idea to allow btrfs check to skip extent tree check?
> 
> I only see --init-extent-tree in the man page, which option did you have
> in mind?

That feature is just in my mind, not even implemented yet.

> 
>>> Then again, maybe it already fixed enough that I can mount my filesystem 
>>> again.
>>
>> This needs the initial btrfs check report and the kernel messages how it
>> fails to mount.
> 
> mount command hangs, kernel does not show anything special outside of disk 
> access hanging.
> 
> Jun 23 17:23:26 gargamel kernel: [  341.802696] BTRFS warning (device dm-2): 
> 'recovery' is deprecated, use 'usebackuproot' instead
> Jun 23 17:23:26 gargamel kernel: [  341.828743] BTRFS info (device dm-2): 
> trying to use backup root at mount time
> Jun 23 17:23:26 gargamel kernel: [  341.850180] BTRFS info (device dm-2): 
> disk space caching is enabled
> Jun 23 17:23:26 gargamel kernel: [  341.869014] BTRFS info (device dm-2): has 
> skinny extents
> Jun 23 17:23:26 gargamel kernel: [  342.206289] BTRFS info (device dm-2): 
> bdev /dev/mapper/dshelf2 errs: wr 0, rd 0 , flush 0, corrupt 2, gen 0
> Jun 23 17:26:26 gargamel kernel: [  521.571392] BTRFS info (device dm-2): 
> enabling ssd optimizations
> Jun 23 17:55:58 gargamel kernel: [ 2293.914867] perf: interrupt took too long 
> (2507 > 2500), lowering kernel.perf_event_max_sample_rate to 79750
> Jun 23 17:56:22 gargamel kernel: [ 2317.718406] BTRFS info (device dm-2): 
> disk space caching is enabled
> Jun 23 17:56:22 gargamel kernel: [ 2317.737277] BTRFS info (device dm-2): has 
> skinny extents
> Jun 23 17:56:22 gargamel kernel: [ 2318.069461] BTRFS info (device dm-2): 
> bdev /dev/mapper/dshelf2 errs: wr 0, rd 0 , flush 0, corrupt 2, gen 0
> Jun 23 17:59:22 gargamel kernel: [ 2498.256167] BTRFS info (device dm-2): 
> enabling ssd optimizations
> Jun 23 18:05:23 gargamel kernel: [ 2859.107057] BTRFS info (device dm-2): 
> disk space caching is enabled
> Jun 23 18:05:23 gargamel kernel: [ 2859.125883] BTRFS info (device dm-2): has 
> skinny extents
> Jun 23 18:05:24 gargamel kernel: [ 2859.448018] BTRFS info (device dm-2): 
> bdev /dev/mapper/dshelf2 errs: wr 0, rd 0 , flush 0, corrupt 2, gen 0

This looks like super block corruption?

What about "btrfs inspect dump-super -fFa /dev/mapper/dshelf2"?

And what about "skip_balance" mount option?

Another problem is, with so many snapshots, balance is also hugely
slowed, thus I'm not 100% sure if it's really a hang.

> Jun 23 18:08:23 gargamel kernel: [ 3039.023305] BTRFS info (device dm-2): 
> enabling ssd optimizations
> Jun 23 18:13:41 gargamel kernel: [ 3356.626037] perf: interrupt took too long 
> (3143 > 3133), lowering kernel.perf_event_max_sample_rate to 63500
> Jun 23 18:17:23 gargamel kernel: [ 3578.937225] Process accounting resumed
> Jun 23 18:33:47 gargamel kernel: [ 4563.356252] JFS: nTxBlock = 8192, 

Re: So, does btrfs check lowmem take days? weeks?

2018-06-29 Thread Su Yue




On 06/29/2018 02:10 PM, Marc MERLIN wrote:

On Fri, Jun 29, 2018 at 02:02:19PM +0800, Su Yue wrote:

I have figured out the bug is lowmem check can't deal with shared tree block
in reloc tree. The fix is simple, you can try the follow repo:

https://github.com/Damenly/btrfs-progs/tree/tmp1


Not sure if I understand what you meant here.


Sorry for my unclear words.
Simply speaking, I suggest you to stop current running check.
Then, clone above branch to compile binary then run
'btrfs check --mode=lowmem $dev'.


Please run lowmem check without '--repair' first to be sure whether
your filesystem is fine.
  
The filesystem is not fine, it caused btrfs balance to hang, whether

balance actually broke it further or caused the breakage, I can't say.

Then mount hangs, even with recovery, unless I use ro.

This filesystem is trash to me and will require over a week to rebuild
manually if I can't repair it.


Understood your anxiety, a log of check without '--repair' will help
us to make clear what's wrong with your filesystem.

Thanks,
Su

Running check without repair for likely several days just to know that
my filesystem is not clear (I already know this) isn't useful :)
Or am I missing something?


Though the bug and phenomenon are clear enough, before sending my patch,
I have to make a test image. I have spent a week to study btrfs balance
but it seems a little hard for me.


thanks for having a look, either way.

Marc






Re: [GIT PULL] Btrfs updates for 4.18

2018-06-29 Thread Anand Jain




On 06/29/2018 02:26 AM, David Sterba wrote:

On Thu, Jun 28, 2018 at 07:22:59PM +0800, Anand Jain wrote:

   The circular locking dependency warning occurs at FSSTRESS_PROG,
   and in particular at doproc() in xfstests/ltp/fsstress.c, randomly
   at any of the commands in
   opdesc_tops[] = { ..}
   that involve calling the mmap file operation when there is something
   to commit.

   The commit transaction does need device_list_mutex, which is also being
   used for btrfs_open_devices() in the commit 542c5908abfe84f7.

   But btrfs_open_devices() is only called at mount, and mmap() can
   only be established after the mount has completed. Given this, it's
   unclear to me why the circular locking dependency check is warning
   about this.

   I feel that until we have clarity about this and also solve the other
   problems related to the streamlining of uuid_mutex, we should revert
   542c5908abfe84f7. Sorry for the inconvenience.


Ok, the revert is one option. I'm considering adding both the locks, like
is in https://patchwork.kernel.org/patch/10478443/ . This would have no
effect, as btrfs_open_devices is called only from mount path and the
list_sort is done only for the first time, when there are no other
users of the list that would not also be under the uuid_mutex.



This passed the syzbot and other tests, so this does not break things
and goes towards pushing the device_list_mutex as the real protection
mechanism for the fs_devices members.



Let me know what you think, the revert should be the last option if we
don't have anything better.


 With this patch [1] as well, I still see the circular lock warning [2].
 [1]
  https://patchwork.kernel.org/patch/10478443/
 Test case:

  mkfs.btrfs -fq /dev/sdc && mount /dev/sdc /btrfs && 
/xfstests/ltp/fsstress -d /btrfs -w -p 1 -n 2000


 However when the device_list_mutex is removed, the warning goes away.
 Let me investigate bit more about circular locking dependency.

About using uuid_mutex in btrfs_open_devices():
 I am planning something more workable, using a bit map for the
 volume flags, which shall also include an EXCL-OPS-in-progress flag
 for the fs_devices. That means we hold uuid_mutex only to set/reset
 the EXCL OPS flag for the fs_devices, and so other fsids like fsid2
 can still take the uuid_mutex while fsid1 is still mounting/opening
 (which may sleep).
 I hope you would agree to use a bit map for the volume; we also need
 this bit map to manage the volume status. If there is a better
 solution I am fine with that, but uuid_mutex as-is isn't one, since
 it blocks fsid2 from mounting.

Thanks, Anand

[2]
---
 kernel:
 kernel: ==
 kernel: WARNING: possible circular locking dependency detected
 kernel: 4.18.0-rc1+ #63 Not tainted
 kernel: --
 kernel: fsstress/3062 is trying to acquire lock:
 kernel: 7d28aeca (&fs_info->reloc_mutex){+.+.}, at: 
btrfs_record_root_in_trans+0x43/0x70 [btrfs]

 kernel:
  but task is already holding lock:
 kernel: 2fc78565 (&mm->mmap_sem){++++}, at: 
vm_mmap_pgoff+0x9f/0x110

 kernel:
  which lock already depends on the new lock.
 kernel:
  the existing dependency chain (in reverse 
order) is:

 kernel:
   -> #5 (&mm->mmap_sem){++++}:
 kernel:_copy_from_user+0x1e/0x90
 kernel:scsi_cmd_ioctl+0x2ba/0x480
 kernel:cdrom_ioctl+0x3b/0xb2e
 kernel:sr_block_ioctl+0x7e/0xc0
 kernel:blkdev_ioctl+0x4ea/0x980
 kernel:block_ioctl+0x39/0x40
 kernel:do_vfs_ioctl+0xa2/0x6c0
 kernel:ksys_ioctl+0x70/0x80
 kernel:__x64_sys_ioctl+0x16/0x20
 kernel:do_syscall_64+0x4a/0x180
 kernel:entry_SYSCALL_64_after_hwframe+0x49/0xbe
 kernel:
  -> #4 (sr_mutex){+.+.}:
 kernel:sr_block_open+0x24/0xd0
 kernel:__blkdev_get+0xcb/0x480
 kernel:blkdev_get+0x144/0x3a0
 kernel:do_dentry_open+0x1b1/0x2d0
 kernel:path_openat+0x57b/0xcc0
 kernel:do_filp_open+0x9b/0x110
 kernel:do_sys_open+0x1bd/0x250
 kernel:do_syscall_64+0x4a/0x180
 kernel:entry_SYSCALL_64_after_hwframe+0x49/0xbe
 kernel:
   -> #3 (&bdev->bd_mutex){+.+.}:
 kernel:__blkdev_get+0x5d/0x480
 kernel:blkdev_get+0x243/0x3a0
 kernel:blkdev_get_by_path+0x4a/0x80
 kernel:btrfs_get_bdev_and_sb+0x1b/0xa0 [btrfs]
 kernel:open_fs_devices+0x85/0x270 [btrfs]
 kernel:btrfs_open_devices+0x6b/0x70 [btrfs]
 kernel:btrfs_mount_root+0x41a/0x7e0 [btrfs]
 kernel:mount_fs+0x30/0x150
 kernel:vfs_kern_mount.part.31+0x54/0x140
 kernel:btrfs_mount+0x175/0x920 [btrfs]
 kernel:mount_fs+0x30/0x150
 kernel:vfs_kern_mount.part.31+0x54/0x140
 

Re: So, does btrfs check lowmem take days? weeks?

2018-06-29 Thread Marc MERLIN
On Fri, Jun 29, 2018 at 02:02:19PM +0800, Su Yue wrote:
> I have figured out the bug is lowmem check can't deal with shared tree block
> in reloc tree. The fix is simple, you can try the follow repo:
> 
> https://github.com/Damenly/btrfs-progs/tree/tmp1

Not sure if I understand what you meant here.

> Please run lowmem check without '--repair' first to be sure whether
> your filesystem is fine.
 
The filesystem is not fine, it caused btrfs balance to hang, whether
balance actually broke it further or caused the breakage, I can't say.

Then mount hangs, even with recovery, unless I use ro.

This filesystem is trash to me and will require over a week to rebuild
manually if I can't repair it.
Running check without repair for likely several days just to know that
my filesystem is not clear (I already know this) isn't useful :)
Or am I missing something?

> Though the bug and phenomenon are clear enough, before sending my patch,
> I have to make a test image. I have spent a week to study btrfs balance
> but it seems a little hard for me.

thanks for having a look, either way.

Marc


Re: So, does btrfs check lowmem take days? weeks?

2018-06-29 Thread Marc MERLIN
On Fri, Jun 29, 2018 at 01:48:17PM +0800, Qu Wenruo wrote:
> Just normal btrfs check, and post the output.
> If normal check eats up all your memory, btrfs check --mode=lowmem.
 
Does check without --repair eat less RAM?

> --repair should be considered as the last method.

If --repair doesn't work, check is useless to me sadly. I know that for
FS analysis and bug reporting, you want to have the FS without changing
it to something maybe worse, but for my use, if it can't be mounted and
can't be fixed, then it gets deleted which is even worse than check
doing the wrong thing.

> > The last two ERROR lines took over a day to get generated, so I'm not sure 
> > if it's still working, but just slowly.
> 
> OK, that explains something.
> 
> One extent is referenced hundreds of times; no wonder it will take a long time.
> 
> Just one tip here, there are really too many snapshots/reflinked files.
> It's highly recommended to keep the number of snapshots to a reasonable
> number (lower two digits).
> Although btrfs snapshot is super fast, it puts a lot of pressure on its
> extent tree, so there is no free lunch here.
 
Agreed, I doubt I have over or much over 100 snapshots though (but I
can't check right now).
Sadly I'm not allowed to mount even read only while check is running:
gargamel:~# mount -o ro /dev/mapper/dshelf2 /mnt/mnt2
mount: /dev/mapper/dshelf2 already mounted or /mnt/mnt2 busy

> > I see. Is there any reasonably easy way to check on this running process?
> 
> GDB attach would be good.
> Interrupt and check the inode number if it's checking fs tree.
> Check the extent bytenr number if it's checking extent tree.
> 
> But considering how many snapshots there are, it's really hard to determine.
> 
> In this case, the super large extent tree is causing a lot of problem,
> maybe it's a good idea to allow btrfs check to skip extent tree check?

I only see --init-extent-tree in the man page, which option did you have
in mind?

> > Then again, maybe it already fixed enough that I can mount my filesystem 
> > again.
> 
> This needs the initial btrfs check report and the kernel messages how it
> fails to mount.

mount command hangs, kernel does not show anything special outside of disk 
access hanging.

Jun 23 17:23:26 gargamel kernel: [  341.802696] BTRFS warning (device dm-2): 
'recovery' is deprecated, use 'usebackuproot' instead
Jun 23 17:23:26 gargamel kernel: [  341.828743] BTRFS info (device dm-2): 
trying to use backup root at mount time
Jun 23 17:23:26 gargamel kernel: [  341.850180] BTRFS info (device dm-2): disk 
space caching is enabled
Jun 23 17:23:26 gargamel kernel: [  341.869014] BTRFS info (device dm-2): has 
skinny extents
Jun 23 17:23:26 gargamel kernel: [  342.206289] BTRFS info (device dm-2): bdev 
/dev/mapper/dshelf2 errs: wr 0, rd 0 , flush 0, corrupt 2, gen 0
Jun 23 17:26:26 gargamel kernel: [  521.571392] BTRFS info (device dm-2): 
enabling ssd optimizations
Jun 23 17:55:58 gargamel kernel: [ 2293.914867] perf: interrupt took too long 
(2507 > 2500), lowering kernel.perf_event_max_sample_rate to 79750
Jun 23 17:56:22 gargamel kernel: [ 2317.718406] BTRFS info (device dm-2): disk 
space caching is enabled
Jun 23 17:56:22 gargamel kernel: [ 2317.737277] BTRFS info (device dm-2): has 
skinny extents
Jun 23 17:56:22 gargamel kernel: [ 2318.069461] BTRFS info (device dm-2): bdev 
/dev/mapper/dshelf2 errs: wr 0, rd 0 , flush 0, corrupt 2, gen 0
Jun 23 17:59:22 gargamel kernel: [ 2498.256167] BTRFS info (device dm-2): 
enabling ssd optimizations
Jun 23 18:05:23 gargamel kernel: [ 2859.107057] BTRFS info (device dm-2): disk 
space caching is enabled
Jun 23 18:05:23 gargamel kernel: [ 2859.125883] BTRFS info (device dm-2): has 
skinny extents
Jun 23 18:05:24 gargamel kernel: [ 2859.448018] BTRFS info (device dm-2): bdev 
/dev/mapper/dshelf2 errs: wr 0, rd 0 , flush 0, corrupt 2, gen 0
Jun 23 18:08:23 gargamel kernel: [ 3039.023305] BTRFS info (device dm-2): 
enabling ssd optimizations
Jun 23 18:13:41 gargamel kernel: [ 3356.626037] perf: interrupt took too long 
(3143 > 3133), lowering kernel.perf_event_max_sample_rate to 63500
Jun 23 18:17:23 gargamel kernel: [ 3578.937225] Process accounting resumed
Jun 23 18:33:47 gargamel kernel: [ 4563.356252] JFS: nTxBlock = 8192, nTxLock = 
65536
Jun 23 18:33:48 gargamel kernel: [ 4563.446715] ntfs: driver 2.1.32 [Flags: R/W 
MODULE].
Jun 23 18:42:20 gargamel kernel: [ 5075.995254] INFO: task sync:20253 blocked 
for more than 120 seconds.
Jun 23 18:42:20 gargamel kernel: [ 5076.015729]   Not tainted 
4.17.2-amd64-preempt-sysrq-20180817 #1
Jun 23 18:42:20 gargamel kernel: [ 5076.036141] "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 23 18:42:20 gargamel kernel: [ 5076.060637] sync            D    0 20253  
15327 0x20020080
Jun 23 18:42:20 gargamel kernel: [ 5076.078032] Call Trace:
Jun 23 18:42:20 gargamel kernel: [ 5076.086366]  ? __schedule+0x53e/0x59b
Jun 23 18:42:20 gargamel kernel: [ 5076.098311]  schedule+0x7f/0x98