force btrfs to release underlying block device(s)

2017-03-31 Thread Glenn Washburn
I've run into a frustrating problem with a btrfs volume just now.  I
have a USB drive which has many partitions, two of which are luks
encrypted, which can be unlocked as a single, multi-device btrfs
volume.  For some reason the drive logically disconnected at the USB
protocol level, but not physically.  Then it reconnected.  This caused
the mount point to be removed at the vfs layer; however, I could not
close the luks devices.

When looking in /sys/fs/btrfs, I see a directory with the UUID of the
offending volume, which shows the luks devices under the devices
directory.  So I presume the btrfs module is still holding references
to the block devices, not allowing them to be closed.  I know I can do
a "dmsetup remove --force" to force closing the luks devices, but I
doubt that will cause the btrfs module to release the offending block
devices.  So if I do that and then open the luks devices again and try
to remount the btrfs volume, I'm guessing insanity will ensue.
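
For concreteness, the sequence I'm describing looks roughly like this (the
UUID and device names are placeholders for my actual ones, and whether
btrfs really drops its references after the forced removal is exactly what
I don't know):

-# ls /sys/fs/btrfs/<UUID>/devices/        # block devices btrfs still holds
-# dmsetup remove --force /dev/mapper/luks-part1
-# dmsetup remove --force /dev/mapper/luks-part2
-# cryptsetup luksOpen /dev/sdX1 luks-part1
-# cryptsetup luksOpen /dev/sdX2 luks-part2
-# btrfs device scan
-# mount /dev/mapper/luks-part1 /mnt       # presumably where the insanity starts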

I can't unload/reload the btrfs module because the root fs, among others,
is using it.  Obviously, I can reboot, but that's a Windows solution.
Anyone have a solution to this issue?  Is anyone looking into ways to
prevent this from happening?  I think this situation should be trivial
to reproduce.

Any help would be welcome,
Glenn

PS. I'm on a 4.10 kernel


Re: Shrinking a device - performance?

2017-03-31 Thread GWB
Indeed, that does make sense.  It's the output of the size command in
the Berkeley format of "text", not decimal, octal or hex.  Out of
curiosity about kernel module sizes, I dug up some old MacBooks and
looked around in:

/System/Library/Extensions/[modulename].kext/Contents/MacOS:

udf is 637K on Mac OS 10.6
exfat is 75K on Mac OS 10.9
msdosfs is 79K on Mac OS 10.9
ntfs is 394K (That must be Paragon's ntfs for Mac)

And here's the kernel extension sizes for zfs (From OpenZFS):

/Library/Extensions/[modulename].kext/Contents/MacOS:

zfs is 1.7M (10.9)
spl is 247K (10.9)

Different kernel from linux, of course (evidently a "mish mash" of
NextStep, BSD, Mach and Apple's own code), but that is one large
kernel extension for zfs.  If they are somehow comparable even with
the differences, 833K is not bad for btrfs compared to zfs.  I did not
look at the format of the file; it must be binary, but compression may
be optional for third party kexts.

So the kernel module sizes are large for both btrfs and zfs.  Given
the feature sets of both, is that surprising?

My favourite kernel extension in Mac OS X is:

/System/Library/Extensions/Dont Steal Mac OS X.kext/

Subtle, very subtle.

Gordon

On Fri, Mar 31, 2017 at 9:42 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> GWB posted on Fri, 31 Mar 2017 19:02:40 -0500 as excerpted:
>
>> It is confusing, and now that I look at it, more than a little funny.
>> Your use of xargs returns the size of the kernel module for each of the
>> filesystem types.  I think I get it now: you are pointing to how large
>> the kernel module for btrfs is compared to other file system kernel
>> modules, 833 megs (piping find through xargs to sed).  That does not
>> mean the btrfs kernel module can accommodate an upper limit of a command
>> line length that is 833 megs.  It is just a very big loadable kernel
>> module.
>
> Umm... 833 K, not M, I believe.  (The unit is bytes not KiB.)
>
> Because if just one kernel module is nearing a gigabyte, then the kernel
> must be many gigabytes either monolithic or once assembled in memory, and
> it just ain't so.
>
> But FWIW megs was my first-glance impression too, until my brain said "No
> way!  Doesn't work!" and I took a second look.
>
> The kernel may indeed no longer fit on a 1.44 MB floppy, but it's still
> got a ways to go before it's multiple GiB! =:^)  While they're XZ-
> compressed, I'm still fitting several monolithic-build kernels including
> their appended initramfs, along with grub, its config and modules, and a
> few other misc things, in a quarter-GB dup-mode btrfs, meaning 128 MiB
> capacity, including the 16 MiB system chunk so 112 MiB for data and
> metadata.  That simply wouldn't be possible if the kernel itself were
> multi-GB, even uncompressed.  Even XZ isn't /that/ good!
>
> --
> Duncan - List replies preferred.   No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master."  Richard Stallman
>


Re: Shrinking a device - performance?

2017-03-31 Thread Duncan
GWB posted on Fri, 31 Mar 2017 19:02:40 -0500 as excerpted:

> It is confusing, and now that I look at it, more than a little funny.
> Your use of xargs returns the size of the kernel module for each of the
> filesystem types.  I think I get it now: you are pointing to how large
> the kernel module for btrfs is compared to other file system kernel
> modules, 833 megs (piping find through xargs to sed).  That does not
> mean the btrfs kernel module can accommodate an upper limit of a command
> line length that is 833 megs.  It is just a very big loadable kernel
> module.

Umm... 833 K, not M, I believe.  (The unit is bytes not KiB.)

Because if just one kernel module is nearing a gigabyte, then the kernel 
must be many gigabytes either monolithic or once assembled in memory, and 
it just ain't so.

But FWIW megs was my first-glance impression too, until my brain said "No 
way!  Doesn't work!" and I took a second look.

The kernel may indeed no longer fit on a 1.44 MB floppy, but it's still 
got a ways to go before it's multiple GiB! =:^)  While they're XZ-
compressed, I'm still fitting several monolithic-build kernels including 
their appended initramfs, along with grub, its config and modules, and a 
few other misc things, in a quarter-GB dup-mode btrfs, meaning 128 MiB 
capacity, including the 16 MiB system chunk so 112 MiB for data and 
metadata.  That simply wouldn't be possible if the kernel itself were 
multi-GB, even uncompressed.  Even XZ isn't /that/ good!

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Do different btrfs volumes compete for CPU?

2017-03-31 Thread Duncan
Marat Khalili posted on Fri, 31 Mar 2017 15:28:20 +0300 as excerpted:

>> and that if you try the same thing with one of the filesystems being
>> for instance ext4, you'll see the same problem there as well

> Not sure if it's possible to reproduce the problem with ext4, since it's
> not possible to perform such extensive metadata operations there, and
> simply moving large amount of data never created any problems for me
> regardless of filesystem.

Try ext4 as the one hosting the innocent process...

And you said moving large amounts of data never triggered problems, but 
were you doing that over USB?

As for knobs I mentioned...

I'm not particularly sure about the knobs on USB, but...

For instance on my old PCI-X (pre-PCIE) server board, the BIOS had a 
setting for size of PCI transfer.  Given that each transfer has an 
effectively fixed overhead and the bus itself has a maximum bandwidth, 
the tradeoff (a reasonably common one elsewhere as well) was between 
higher thruput at larger transfer sizes, due to lower per-transfer 
overhead, at the expense of interactivity and of other processes having 
to wait for the transfer to complete, and better interactivity and 
shorter waits on a full bus at smaller transfer sizes, at the expense of 
thruput due to higher per-transfer overhead.

I was having trouble with music cutouts and tried various Linux and ALSA 
settings to no avail, but once I set the BIOS to a much lower PCI 
transfer size, everything functioned much more smoothly, not just the 
music, but the mouse, less waiting on disk reads (because the writes were 
shorter), etc.

I /think/ the USB knobs are all in the kernel, but believe there's 
similar transfer size knobs there, if you know where to look.

Beyond that, there are more generic IO knobs, listed below.  If it was 
CPU rather than IO blocking, they might not help in this context, but 
they're worth knowing about anyway, particularly the dirty_* settings 
mentioned last.  (USB is much more CPU intensive than most transfer buses, one 
reason Intel pushed it so hard as opposed to say firewire, which offloads 
far more to the bus hardware and thus isn't as CPU intensive.  So the USB 
knobs may well be worth investigating even if it was CPU.  I just wish I 
knew more about them.)

There's also the IO-scheduler.  CFQ has long been the default, but you 
might try deadline, and there's now multiqueue-deadline (aka MQ deadline) 
as well.  NoOp is occasionally recommended for certain SSD use-cases, but 
it's not appropriate for spinning rust.  Of course most of the schedulers 
have detail knobs you can twist too, but I'm not sufficiently 
knowledgeable about those to say much about them.
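
For example, something like this shows which scheduler a device is using 
and switches it (sdX being whatever device you care about; the list of 
available schedulers depends on the kernel config):

-# cat /sys/block/sdX/queue/scheduler
noop deadline [cfq]
-# echo deadline > /sys/block/sdX/queue/scheduler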

And 4.10 introduced the block-device writeback throttling global option 
(BLK_WBT) along with separate options underneath it for single-queue and 
multi-queue writeback throttling.  I turned those on here, but as most of 
my system's on fast ssd, I didn't notice, nor did I expect to notice, 
much difference.  However, in theory it could make quite some difference 
with USB-based storage, particularly slow thumb-drives and spinning rust.
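
If your kernel has those options enabled there should be a per-device 
knob in sysfs as well; something like the following (sdX a placeholder, 
and the numbers are only illustrative):

-# cat /sys/block/sdX/queue/wbt_lat_usec
75000
-# echo 2000 > /sys/block/sdX/queue/wbt_lat_usec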

Last but certainly not least as it can make quite a difference, and 
indeed did make a difference here back when I was on spinning rust, 
there's the dirty-data write-caching, typically configured via the 
distro's sysctl mechanism, but which can also be set manually via 
the /proc/sys/vm/dirty_* files.  The writeback-throttling features 
mentioned above may eventually reduce the need to tweak these, but until 
they're in commonly deployed kernels, tweaking these settings can make 
QUITE a big difference, because the percentage-of-RAM defaults were 
configured back in the day when 64 MB of RAM was big, and they simply 
aren't appropriate to modern systems with often double-digit GiB RAM.  
I'll skip the details here as there's plenty of writeups on the web about 
tweaking these, as well as kernel text-file documentation, but you may 
want to look into this if you haven't, because as I said it can make a 
HUGE difference in effective system interactivity.
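
That said, just to show which knobs I mean (the exact numbers below are 
only an illustration, NOT a recommendation; tune them to your RAM and 
device speed), the byte-based variants can be set like:

-# sysctl vm.dirty_background_bytes=67108864    # start background writeback at 64 MiB
-# sysctl vm.dirty_bytes=268435456              # block writers at 256 MiB of dirty data

or made persistent via a drop-in such as /etc/sysctl.d/dirty.conf:

vm.dirty_background_bytes = 67108864
vm.dirty_bytes = 268435456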


That's what I know of.  I'd be a lot more comfortable with things if 
someone else had confirmed my original post as I'm not a dev, just a 
btrfs user and list regular, but I do know we've not had a lot of reports 
of this sort of problem posted, and when we have in the past and it was 
actually separate btrfss, it turned out it was /not/ btrfs, so I'm 
/reasonably/ sure about it.  I also run multiple btrfs here and haven't 
seen the issue, but they're all on the same pair of partitioned quite 
fast ssds on SATA, so the comparison is admittedly of highly limited 
value.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


Re: Shrinking a device - performance?

2017-03-31 Thread GWB
It is confusing, and now that I look at it, more than a little funny.
Your use of xargs returns the size of the kernel module for each of
the filesystem types.  I think I get it now: you are pointing to how
large the kernel module for btrfs is compared to other file system
kernel modules, 833 megs (piping find through xargs to sed).  That
does not mean the btrfs kernel module can accommodate an upper limit
of a command line length that is 833 megs.  It is just a very big
loadable kernel module.

So, same question, but different expression: what is the significance
of the large size of the btrfs kernel module?  Is it that the larger
the module, the more complex, the more prone to breakage, and more
difficult to debug?  Is the hfsplus kernel module less complex, and
more robust?  What did the file system designers of hfsplus (or udf)
know better (or worse?) than the file system designers of btrfs?

VAX/VMS clusters just aren't happy outside of a deeply hidden bunker
running 9 machines in a cluster from one storage device connected by
Myrinet over 500 miles to the next cluster.  I applaud the move to
x86, but like I wrote earlier, time has moved on.  I suppose weird is
in the eye of the beholder, but yes, when dial up was king and disco
pants roamed the earth, they were nice.  I don't think x86 is a viable
use case even for OpenVMS.  If you really need a VAX/VMS cluster,
chances are you have already had one running with a continuous
uptime of more than a decade and you have already upgraded and changed
out every component several times by cycling down one machine in the
cluster at a time.

Gordon

On Fri, Mar 31, 2017 at 3:27 PM, Peter Grandi  
wrote:
>> [ ... ] what the significance of the xargs size limits of
>> btrfs might be. [ ... ] So what does it mean that btrfs has a
>> higher xargs size limit than other file systems? [ ... ] Or
>> does the lower capacity for argument length for hfsplus
>> demonstrate it is the superior file system for avoiding
>> breakage? [ ... ]
>
> That confuses me, as my understanding of the command argument size
> limit is that it is a system, not filesystem, property, and for
> example can be obtained with 'getconf _POSIX_ARG_MAX'.
>
>> Personally, I would go back to fossil and venti on Plan 9 for
>> an archival data server (using WORM drives),
>
> In an ideal world we would be using Plan 9. Not necessarily with
> Fossil and Venti. As to storage/backup/archival, Linux-based
> options are not bad, even if the platform is far messier than
> Plan 9 (or some other alternatives). BTW I just noticed with a
> search that AWS might be offering Plan 9 hosts :-).
>
>> and VAX/VMS cluster for an HA server. [ ... ]
>
> Uhmmm, however nice it was, it was fairly weird. An IA32 or
> AMD64 port has been promised however :-).
>
> https://www.theregister.co.uk/2016/10/13/openvms_moves_slowly_towards_x86/


[GIT PULL] Btrfs

2017-03-31 Thread Chris Mason
Hi Linus,

We have 3 small fixes queued up in my for-linus-4.11 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.11

Goldwyn Rodrigues (1) commits (+7/-7):
btrfs: Change qgroup_meta_rsv to 64bit

Dan Carpenter (1) commits (+6/-1):
Btrfs: fix an integer overflow check

Liu Bo (1) commits (+31/-21):
Btrfs: bring back repair during read

Total: (3) commits (+44/-29)

 fs/btrfs/ctree.h |  2 +-
 fs/btrfs/disk-io.c   |  2 +-
 fs/btrfs/extent_io.c | 46 --
 fs/btrfs/inode.c |  6 +++---
 fs/btrfs/qgroup.c| 10 +-
 fs/btrfs/send.c  |  7 ++-
 6 files changed, 44 insertions(+), 29 deletions(-)


Re: [PATCH v2] btrfs: drop the nossd flag when remounting with -o ssd

2017-03-31 Thread Hans van Kranenburg
On 03/31/2017 10:43 PM, Adam Borowski wrote:
> On Fri, Mar 31, 2017 at 10:24:57PM +0200, Hans van Kranenburg wrote:
>>
>> Yes, but we're not doing the same thing here.
>>
>> You have a file via a loop mount. If I do that, I get the same output as
>> you show, the right messages when I remount ssd and nossd.
>>
>> My test was lvm based on an ssd. When I mount that, I get the "detected
>> SSD devices, enabling SSD mode", and everytime I remount, being it ssd
>> or nossd, it *always* says "use ssd allocation scheme".
>>
>> So, this needs some more research I guess. It doesn't feel right.
> 
> I can't reproduce:
> 
> [~]# cat /proc/swaps
> Filename                Type            Size    Used    Priority
> /dev/sda2   partition 8822780 0   -1
> [~]# swapoff /dev/sda2
> [~]# mkfs.btrfs -f /dev/sda2
> ...
> [ 2459.856819] BTRFS info (device sda2): detected SSD devices, enabling SSD 
> mode
> [ 2459.857699] BTRFS info (device sda2): creating UUID tree
> [ 2477.234868] BTRFS info (device sda2): not using ssd allocation scheme
> [ 2477.234873] BTRFS info (device sda2): disk space caching is enabled
> [ 2482.306649] BTRFS info (device sda2): use ssd allocation scheme
> [ 2482.306654] BTRFS info (device sda2): disk space caching is enabled
> [ 2483.618578] BTRFS info (device sda2): not using ssd allocation scheme
> [ 2483.618583] BTRFS info (device sda2): disk space caching is enabled
> 
> Same partition on lvm:
> [ 2813.259749] BTRFS info (device dm-0): detected SSD devices, enabling SSD 
> mode
> [ 2813.260586] BTRFS info (device dm-0): creating UUID tree
> [ 2827.131076] BTRFS info (device dm-0): not using ssd allocation scheme
> [ 2827.131081] BTRFS info (device dm-0): disk space caching is enabled
> [ 2828.618841] BTRFS info (device dm-0): use ssd allocation scheme
> [ 2828.618845] BTRFS info (device dm-0): disk space caching is enabled
> [ 2829.546796] BTRFS info (device dm-0): not using ssd allocation scheme
> [ 2829.546801] BTRFS info (device dm-0): disk space caching is enabled
> [ 2833.770787] BTRFS info (device dm-0): use ssd allocation scheme
> [ 2833.770792] BTRFS info (device dm-0): disk space caching is enabled
> 
> Seems to flip back and forth correctly for me.
> 
> Are you sure you have this patch applied?

Oh ok, that's with the patch. The output I show is without the patch.

If it produces my output without the patch and the right output with the
patch applied, then the puzzle pieces are in the right place again.

-- 
Hans van Kranenburg


Re: [PATCH v2] btrfs: drop the nossd flag when remounting with -o ssd

2017-03-31 Thread Adam Borowski
On Fri, Mar 31, 2017 at 10:24:57PM +0200, Hans van Kranenburg wrote:
> >>> How did you test this?
> >>>
> >>> This was also my first thought, but here's a weird thing:
> >>>
> >>> -# mount -o nossd /dev/sdx /mnt/btrfs/
> >>>
> >>> BTRFS info (device sdx): not using ssd allocation scheme
> >>>
> >>> -# mount -o remount,ssd /mnt/btrfs/
> >>>
> >>> BTRFS info (device sdx): use ssd allocation scheme
> >>>
> >>> -# mount -o remount,nossd /mnt/btrfs/
> >>>
> >>> BTRFS info (device sdx): use ssd allocation scheme
> >>>
> >>> That means that the case Opt_nossd: is never reached when doing this?
> > 
> > Seems to work for me:
> > 
> > [/tmp]# mount -onoatime foo /mnt/vol1 
> > [  619.436745] BTRFS: device fsid 954fd6c3-b3ce-4355-b79a-60ece7a6a4e0 
> > devid 1 transid 5 /dev/loop0
> > [  619.438625] BTRFS info (device loop0): disk space caching is enabled
> > [  619.438627] BTRFS info (device loop0): has skinny extents
> > [  619.438629] BTRFS info (device loop0): flagging fs with big metadata 
> > feature
> > [  619.441989] BTRFS info (device loop0): creating UUID tree
> > [/tmp]# mount -oremount,ssd /mnt/vol1
> > [  629.755584] BTRFS info (device loop0): use ssd allocation scheme
> > [  629.755589] BTRFS info (device loop0): disk space caching is enabled
> > [/tmp]# mount -oremount,nossd /mnt/vol1
> > [  633.675867] BTRFS info (device loop0): not using ssd allocation scheme
> > [  633.675872] BTRFS info (device loop0): disk space caching is enabled
> 
> Yes, but we're not doing the same thing here.
> 
> You have a file via a loop mount. If I do that, I get the same output as
> you show, the right messages when I remount ssd and nossd.
> 
> My test was lvm based on an ssd. When I mount that, I get the "detected
> SSD devices, enabling SSD mode", and everytime I remount, being it ssd
> or nossd, it *always* says "use ssd allocation scheme".
> 
> So, this needs some more research I guess. It doesn't feel right.

I can't reproduce:

[~]# cat /proc/swaps
Filename                Type            Size    Used    Priority
/dev/sda2   partition   8822780 0   -1
[~]# swapoff /dev/sda2
[~]# mkfs.btrfs -f /dev/sda2
...
[ 2459.856819] BTRFS info (device sda2): detected SSD devices, enabling SSD mode
[ 2459.857699] BTRFS info (device sda2): creating UUID tree
[ 2477.234868] BTRFS info (device sda2): not using ssd allocation scheme
[ 2477.234873] BTRFS info (device sda2): disk space caching is enabled
[ 2482.306649] BTRFS info (device sda2): use ssd allocation scheme
[ 2482.306654] BTRFS info (device sda2): disk space caching is enabled
[ 2483.618578] BTRFS info (device sda2): not using ssd allocation scheme
[ 2483.618583] BTRFS info (device sda2): disk space caching is enabled

Same partition on lvm:
[ 2813.259749] BTRFS info (device dm-0): detected SSD devices, enabling SSD mode
[ 2813.260586] BTRFS info (device dm-0): creating UUID tree
[ 2827.131076] BTRFS info (device dm-0): not using ssd allocation scheme
[ 2827.131081] BTRFS info (device dm-0): disk space caching is enabled
[ 2828.618841] BTRFS info (device dm-0): use ssd allocation scheme
[ 2828.618845] BTRFS info (device dm-0): disk space caching is enabled
[ 2829.546796] BTRFS info (device dm-0): not using ssd allocation scheme
[ 2829.546801] BTRFS info (device dm-0): disk space caching is enabled
[ 2833.770787] BTRFS info (device dm-0): use ssd allocation scheme
[ 2833.770792] BTRFS info (device dm-0): disk space caching is enabled

Seems to flip back and forth correctly for me.

Are you sure you have this patch applied?

> >> Adding the 'nossd_spread' would be good to have, even if it might be
> >> just a marginal usecase.
> 
> Please no, don't make it more complex if not needed.
> 
> > Not sure if there's much point.  In any case, that's a separate patch.
> > Should I add one while we're here?
> 
> Since the whole ssd thing is a bit of a joke actually, I'd rather see it
> replaced with an option to choose an extent allocator algorithm.
> 
> The number of if statements using these SSD things in btrfs in the kernel
> can be counted on one hand, and what they actually do is quite
> questionable (food for another mail thread).

Ok, let's fix only existing options for now then.

-- 
⢀⣴⠾⠻⢶⣦⠀ Meow!
⣾⠁⢠⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋⠀ Collisions shmolisions, let's see them find a collision or second
⠈⠳⣄ preimage for double rot13!


Re: Shrinking a device - performance?

2017-03-31 Thread Peter Grandi
> [ ... ] what the significance of the xargs size limits of
> btrfs might be. [ ... ] So what does it mean that btrfs has a
> higher xargs size limit than other file systems? [ ... ] Or
> does the lower capacity for argument length for hfsplus
> demonstrate it is the superior file system for avoiding
> breakage? [ ... ]

That confuses me, as my understanding of the command argument size
limit is that it is a system, not filesystem, property, and for
example can be obtained with 'getconf _POSIX_ARG_MAX'.
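
For example (the values below are from one random system and will differ;
ARG_MAX is the system-wide limit, _POSIX_ARG_MAX the guaranteed POSIX
minimum):

  $ getconf ARG_MAX
  2097152
  $ getconf _POSIX_ARG_MAX
  4096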

> Personally, I would go back to fossil and venti on Plan 9 for
> an archival data server (using WORM drives),

In an ideal world we would be using Plan 9. Not necessarily with
Fossil and Venti. As to storage/backup/archival, Linux-based
options are not bad, even if the platform is far messier than
Plan 9 (or some other alternatives). BTW I just noticed with a
search that AWS might be offering Plan 9 hosts :-).

> and VAX/VMS cluster for an HA server. [ ... ]

Uhmmm, however nice it was, it was fairly weird. An IA32 or
AMD64 port has been promised however :-).

https://www.theregister.co.uk/2016/10/13/openvms_moves_slowly_towards_x86/


Re: [PATCH v2] btrfs: drop the nossd flag when remounting with -o ssd

2017-03-31 Thread Hans van Kranenburg
On 03/31/2017 10:08 PM, Adam Borowski wrote:
> And when turning on nossd, drop ssd_spread.
> 
> Reported-by: Hans van Kranenburg 
> Signed-off-by: Adam Borowski 
> ---
> On Fri, Mar 31, 2017 at 07:10:16PM +0200, David Sterba wrote:
>> On Fri, Mar 31, 2017 at 06:00:08PM +0200, Hans van Kranenburg wrote:
>>> On 03/31/2017 05:19 PM, Adam Borowski wrote:
 Not sure if setting NOSSD should also disable SSD_SPREAD, there's currently
 no way to disable that option once set.
>>
>> Missing inverse of ssd_spread is probably unintentional, as we once
>> added all complementary no* options, this one was forgotten.
>>
>> And yes, nossd should turn off ssd and ssd_spread, as ssd_spread without
>> ssd does nothing anyway.
> 
> Added that.
> 
>>> How did you test this?
>>>
>>> This was also my first thought, but here's a weird thing:
>>>
>>> -# mount -o nossd /dev/sdx /mnt/btrfs/
>>>
>>> BTRFS info (device sdx): not using ssd allocation scheme
>>>
>>> -# mount -o remount,ssd /mnt/btrfs/
>>>
>>> BTRFS info (device sdx): use ssd allocation scheme
>>>
>>> -# mount -o remount,nossd /mnt/btrfs/
>>>
>>> BTRFS info (device sdx): use ssd allocation scheme
>>>
>>> That means that the case Opt_nossd: is never reached when doing this?
> 
> Seems to work for me:
> 
> [/tmp]# mount -onoatime foo /mnt/vol1 
> [  619.436745] BTRFS: device fsid 954fd6c3-b3ce-4355-b79a-60ece7a6a4e0 devid 
> 1 transid 5 /dev/loop0
> [  619.438625] BTRFS info (device loop0): disk space caching is enabled
> [  619.438627] BTRFS info (device loop0): has skinny extents
> [  619.438629] BTRFS info (device loop0): flagging fs with big metadata 
> feature
> [  619.441989] BTRFS info (device loop0): creating UUID tree
> [/tmp]# mount -oremount,ssd /mnt/vol1
> [  629.755584] BTRFS info (device loop0): use ssd allocation scheme
> [  629.755589] BTRFS info (device loop0): disk space caching is enabled
> [/tmp]# mount -oremount,nossd /mnt/vol1
> [  633.675867] BTRFS info (device loop0): not using ssd allocation scheme
> [  633.675872] BTRFS info (device loop0): disk space caching is enabled

Yes, but we're not doing the same thing here.

You have a file via a loop mount. If I do that, I get the same output as
you show, the right messages when I remount ssd and nossd.

My test was lvm based on an ssd. When I mount that, I get the "detected
SSD devices, enabling SSD mode", and everytime I remount, being it ssd
or nossd, it *always* says "use ssd allocation scheme".

So, this needs some more research I guess. It doesn't feel right.

>>> The fact that nossd,ssd,ssd_spread are different options complicates the
>>> whole thing, compared to e.g. autodefrag, noautodefrag.
>>
>> I think the ssd flags reflect the autodetection of ssd, unlike
>> autodefrag and others.
> 
> The autodetection works for /dev/sd* and /dev/mmcblk*, but not for most
> other devices.
> 
> Two examples:
> nbd to a piece of rotating rust says:
> [45697.575192] BTRFS info (device nbd0): detected SSD devices, enabling SSD 
> mode
> loop on tmpfs (and in case it spills, all swap is on ssd):
> claims it's rotational
> 
>> The ssd option says "enable the ssd mode", but it could also be
>> auto-detected if a non-rotational device is detected.
>>
>> nossd says, "do not do the autodetection, even if it's a non-rot
>> device, also disable all ssd modes".
> 
> These two options are nice whenever the autodetection goes wrong.
> 
>> So Adam's patch needs to be updated so NOSSD also disables SSD_SPREAD.

Ack.

> M'kay, updated this patch.
> 
>> Adding the 'nossd_spread' would be good to have, even if it might be
>> just a marginal usecase.

Please no, don't make it more complex if not needed.

> Not sure if there's much point.  In any case, that's a separate patch.
> Should I add one while we're here?

Since the whole ssd thing is a bit of a joke actually, I'd rather see it
replaced with an option to choose an extent allocator algorithm.

The number of if statements using these SSD things in btrfs in the kernel
can be counted on one hand, and what they actually do is quite
questionable (food for another mail thread).

> 
> Meow!
> 
>  fs/btrfs/super.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 06bd9b332e18..ac1ca22d0c34 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -549,16 +549,19 @@ int btrfs_parse_options(struct btrfs_fs_info *info, 
> char *options,
>   case Opt_ssd:
>   btrfs_set_and_info(info, SSD,
>  "use ssd allocation scheme");
> + btrfs_clear_opt(info->mount_opt, NOSSD);
>   break;
>   case Opt_ssd_spread:
>   btrfs_set_and_info(info, SSD_SPREAD,
>  "use spread ssd allocation scheme");
>   btrfs_set_opt(info->mount_opt, SSD);
> + 

[PATCH v2] btrfs: drop the nossd flag when remounting with -o ssd

2017-03-31 Thread Adam Borowski
And when turning on nossd, drop ssd_spread.

Reported-by: Hans van Kranenburg 
Signed-off-by: Adam Borowski 
---
On Fri, Mar 31, 2017 at 07:10:16PM +0200, David Sterba wrote:
> On Fri, Mar 31, 2017 at 06:00:08PM +0200, Hans van Kranenburg wrote:
> > On 03/31/2017 05:19 PM, Adam Borowski wrote:
> > > Not sure if setting NOSSD should also disable SSD_SPREAD, there's 
> > > currently
> > > no way to disable that option once set.
> 
> Missing inverse of ssd_spread is probably unintentional, as we once
> added all complementary no* options, this one was forgotten.
> 
> And yes, nossd should turn off ssd and ssd_spread, as ssd_spread without
> ssd does nothing anyway.

Added that.

> > How did you test this?
> > 
> > This was also my first thought, but here's a weird thing:
> > 
> > -# mount -o nossd /dev/sdx /mnt/btrfs/
> > 
> > BTRFS info (device sdx): not using ssd allocation scheme
> > 
> > -# mount -o remount,ssd /mnt/btrfs/
> > 
> > BTRFS info (device sdx): use ssd allocation scheme
> > 
> > -# mount -o remount,nossd /mnt/btrfs/
> > 
> > BTRFS info (device sdx): use ssd allocation scheme
> > 
> > That means that the case Opt_nossd: is never reached when doing this?

Seems to work for me:

[/tmp]# mount -onoatime foo /mnt/vol1 
[  619.436745] BTRFS: device fsid 954fd6c3-b3ce-4355-b79a-60ece7a6a4e0 devid 1 
transid 5 /dev/loop0
[  619.438625] BTRFS info (device loop0): disk space caching is enabled
[  619.438627] BTRFS info (device loop0): has skinny extents
[  619.438629] BTRFS info (device loop0): flagging fs with big metadata feature
[  619.441989] BTRFS info (device loop0): creating UUID tree
[/tmp]# mount -oremount,ssd /mnt/vol1
[  629.755584] BTRFS info (device loop0): use ssd allocation scheme
[  629.755589] BTRFS info (device loop0): disk space caching is enabled
[/tmp]# mount -oremount,nossd /mnt/vol1
[  633.675867] BTRFS info (device loop0): not using ssd allocation scheme
[  633.675872] BTRFS info (device loop0): disk space caching is enabled

> > The fact that nossd,ssd,ssd_spread are different options complicates the
> > whole thing, compared to e.g. autodefrag, noautodefrag.
> 
> I think the ssd flags reflect the autodetection of ssd, unlike
> autodefrag and others.

The autodetection works for /dev/sd* and /dev/mmcblk*, but not for most
other devices.

Two examples:
nbd to a piece of rotating rust says:
[45697.575192] BTRFS info (device nbd0): detected SSD devices, enabling SSD mode
loop on tmpfs (and in case it spills, all swap is on ssd):
claims it's rotational

> The ssd option says "enable the ssd mode", but it could also be
> auto-detected if a non-rotational device is detected.
> 
> nossd says, "do not do the autodetection, even if it's a non-rot
> device, also disable all ssd modes".

These two options are nice whenever the autodetection goes wrong.

> So Adam's patch needs to be updated so NOSSD also disables SSD_SPREAD.

M'kay, updated this patch.

> Adding the 'nossd_spread' would be good to have, even if it might be
> just a marginal usecase.

Not sure if there's much point.  In any case, that's a separate patch.
Should I add one while we're here?


Meow!

 fs/btrfs/super.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 06bd9b332e18..ac1ca22d0c34 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -549,16 +549,19 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
case Opt_ssd:
btrfs_set_and_info(info, SSD,
   "use ssd allocation scheme");
+   btrfs_clear_opt(info->mount_opt, NOSSD);
break;
case Opt_ssd_spread:
btrfs_set_and_info(info, SSD_SPREAD,
   "use spread ssd allocation scheme");
btrfs_set_opt(info->mount_opt, SSD);
+   btrfs_clear_opt(info->mount_opt, NOSSD);
break;
case Opt_nossd:
btrfs_set_and_info(info, NOSSD,
 "not using ssd allocation scheme");
btrfs_clear_opt(info->mount_opt, SSD);
+   btrfs_clear_opt(info->mount_opt, SSD_SPREAD);
break;
case Opt_barrier:
btrfs_clear_and_info(info, NOBARRIER,
-- 
2.11.0



Re: Shrinking a device - performance?

2017-03-31 Thread GWB
Well, now I am curious.  Until we hear back from Christiane on the
progress of the never-ending file system shrinkage, I suppose it can't
hurt to ask what the significance of the xargs size limits of btrfs
might be.  Or, again, if Christiane is already happily on his way to
an xfs server running over lvm, skip, ignore, delete.

Here is the output of xargs --show-limits on my laptop:

<<
$ xargs --show-limits
Your environment variables take up 4830 bytes
POSIX upper limit on argument length (this system): 2090274
POSIX smallest allowable upper limit on argument length (all systems): 4096
Maximum length of command we could actually use: 2085444
Size of command buffer we are actually using: 131072

Execution of xargs will continue now...
>>

That is for a laptop system.  So what does it mean that btrfs has a
higher xargs size limit than other file systems?  Could I
theoretically use 40% of the total allowed argument length of the
system for btrfs arguments alone?  Would that make balance, shrinkage,
etc., faster?  Does the higher capacity for argument length mean btrfs
is overly complex and therefore more prone to breakage?  Or does the
lower capacity for argument length for hfsplus demonstrate it is the
superior file system for avoiding breakage?

Or does it mean that hfsplus is very old (and reflects older xargs
limits), and that btrfs is newer code?  I am relatively new to btrfs,
and would like to find out.  I am also attracted to the idea that it
is better to leave some operations to the system itself, and not code
them into the file system.  For example, I think deduplication "off
line" or "out of band" is an advantage for btrfs over zfs.  But that's
only for what I do.  For other uses deduplication "in line", while
writing the file, is preferred, and that is what zfs does (preferably
with lots of memory, at least one ssd to run zil, caches, etc.).

I use btrfs now because Ubuntu has it as a default in the kernel, and
I assume that when (not "if") I have to use a system rescue disk (USB
or CD) it will have some capacity to repair btrfs.  Along the way,
btrfs has been quite good as a general purpose file system on root; it
makes and sends snapshots, and so far only needs an occasional scrub
and balance.  My earlier experience with btrfs on a 2TB drive was more
complicated, but I expected that for a file system with a lot of
potential but less maturity.

Personally, I would go back to fossil and venti on Plan 9 for an
archival data server (using WORM drives), and VAX/VMS cluster for an
HA server.  But of course that no longer makes sense except for a very
few usage cases.  Time has moved on, prices have dropped drastically,
and hardware can do a lot more per penny than it used to.

Gordon

On Fri, Mar 31, 2017 at 12:25 PM, Peter Grandi  
wrote:
 My guess is that very complex risky slow operations like
 that are provided by "clever" filesystem developers for
 "marketing" purposes, to win box-ticking competitions.
>
 That applies to those system developers who do know better;
 I suspect that even some filesystem developers are
 "optimistic" as to what they can actually achieve.
>
 There are cases where there really is no other sane
 option. Not everyone has the kind of budget needed for
 proper HA setups,
>
>>> Thanks for letting me know, that must have never occurred to
>>> me, just as it must have never occurred to me that some
>>> people expect extremely advanced features that imply
>>> big-budget high-IOPS high-reliability storage to be fast and
>>> reliable on small-budget storage too :-)
>
>> You're missing my point (or intentionally ignoring it).
>
> In "Thanks for letting me know" I am not missing your point, I
> am simply pointing out that I do know that people try to run
> high-budget workloads on low-budget storage.
>
> The argument as to whether "very complex risky slow operations"
> should be provided in the filesystem itself is a very different
> one, and I did not develop it fully. But it is quite "optimistic"
> to simply state "there really is no other sane option", even
> when for people that don't have "proper HA setups".
>
> Let's start by assuming, for the time being, that "very complex
> risky slow operations" are indeed feasible on very reliable high
> speed storage layers. Then the questions become:
>
> * Is it really true that "there is no other sane option" to
>   running "very complex risky slow operations" even on storage
>   that is not "big-budget high-IOPS high-reliability"?
>
> * Is it really true that it is a good idea to run "very complex
>   risky slow operations" even on "big-budget high-IOPS
>   high-reliability storage"?
>
>> Those types of operations are implemented because there are
>> use cases that actually need them, not because some developer
>> thought it would be cool. [ ... ]
>
> And this is the really crucial bit, I'll disregard without
> agreeing too much (but in part I do) with the rest of the
> 

Re: Confusion about snapshots containers

2017-03-31 Thread Kai Krakow
On Wed, 29 Mar 2017 16:27:30 -0500,
Tim Cuthbertson wrote:

> I have recently switched from multiple partitions with multiple
> btrfs's to a flat layout. I will try to keep my question concise.
> 
> I am confused as to whether a snapshots container should be a normal
> directory or a mountable subvolume. I do not understand how it can be
> a normal directory while being at the same level as, for example, a
> rootfs subvolume. This is with the understanding that the rootfs is
> NOT at the btrfs top level.
> 
> Which should it be, a normal directory or a mountable subvolume
> directly under btrfs top level? If either way can work, what are the
> pros and cons of each?

I think there is no exact standard you could follow. Many distributions
seem to follow the convention of prefixing subvolume names with "@" if
they are meant to be mounted. However, I'm not doing so.

Generally speaking, subvolumes organize your volume into logical
containers which make sense to snapshot on their own. Snapshots won't
propagate to nested subvolumes; it is important to keep that in mind
while designing your structure.

I'm using it like this:

In subvol=0 I have the following subvolumes:

/* - contains distribution specific file systems
/home - contains home directories
/snapshots - contains snapshots I want to keep
/other
  - misc stuff, i.e. a dump of the subvol structure in a txt
  - a copy of my restore script
  - some other supporting docs for restore
  - this subvolume is kept in sync with my backup volume

This means: If I mount one of the rootfs, my home will not be part of
this mount automatically because that subvolume is out of scope of the
rootfs.

Now I have the following subvolumes below these:

/gentoo/rootfs - rootfs of my main distribution
  Note 1: Everything below (except subvolumes) should be maintained
  by the package manager.
  Note 2: currently I installed no other distributions
  Note 3: I could have called it main-system-rootfs

/gentoo/usr
  - actually not a subvolume but a directory for volumes shareable with
other distribution instances

/gentoo/usr/portage - portage, shareable by other gentoo instances
/gentoo/usr/src - the gentoo linux kernel sources, shareable

The following are put below /gentoo/rootfs so they do not need to be
mounted separately:

/gentoo/rootfs/var/log
  - log volume because I don't want it to be snapshotted
/gentoo/rootfs/var/tmp
  - tmp volume because it makes no sense to be snapshotted
/gentoo/rootfs/var/lib/machines
  - subvolume for keeping nspawn containers
/gentoo/rootfs/var/lib/machines/*
  - different machines cloned from each other
/gentoo/rootfs/usr/local
  - non-package manager stuff

/home/myuser - my user home
/home/myuser/.VirtualBox
  - VirtualBox machines because I want them snapshotted separately

/etc/fstab now only mounts subvolumes outside of the scope
of /gentoo/rootfs:

LABEL=system /home btrfs compress=lzo,subvol=home,noatime
LABEL=system /usr/portage btrfs noauto,compress=lzo,subvol=gentoo/usr/portage,noatime,x-systemd.automount
LABEL=system /usr/src btrfs noauto,compress=lzo,subvol=gentoo/usr/src,noatime,x-systemd.automount

Additionally, I mount the subvol=0 for two special purposes:

LABEL=system /mnt/btrfs-pool btrfs noauto,compress=lzo,subvolid=0,x-systemd.automount,noatime

First: for managing all the subvolumes and having an untampered view
(without tmpfs or special-purpose mounts) of the volumes.

Second: To take a clean backup of the whole system.

Now, I can give the bootloader subvol=gentoo/rootfs to select which
system to boot (or make it the default subvolume).
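
As a rough sketch only (paths match the structure above; whether /gentoo
itself is a plain directory or a subvolume is your choice, and the rootfs
contents of course come from the distribution install), such a layout
could be created with something like:

-# mount -o subvolid=0 LABEL=system /mnt/btrfs-pool
-# cd /mnt/btrfs-pool
-# btrfs subvolume create home
-# btrfs subvolume create snapshots
-# btrfs subvolume create other
-# mkdir -p gentoo/usr
-# btrfs subvolume create gentoo/rootfs
-# btrfs subvolume create gentoo/usr/portage
-# btrfs subvolume create gentoo/usr/src
-# btrfs subvolume list .                  # note the id of gentoo/rootfs
-# btrfs subvolume set-default <id> .      # if using the default-subvolume route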

Maybe you get the idea and find that idea helpful.

PS: It can make sense to have var/lib/machines outside of the rootfs
scope if you want to share it with other distributions.

-- 
Regards,
Kai

Replies to list-only preferred.



Re: [PATCH v4 1/5] btrfs: scrub: Introduce full stripe lock for RAID56

2017-03-31 Thread Liu Bo
On Fri, Mar 31, 2017 at 09:29:20AM +0800, Qu Wenruo wrote:
> 
> 
> At 03/31/2017 12:49 AM, Liu Bo wrote:
> > On Thu, Mar 30, 2017 at 02:32:47PM +0800, Qu Wenruo wrote:
> > > Unlike mirror based profiles, RAID5/6 recovery needs to read out the
> > > whole full stripe.
> > > 
> > > And if we don't do proper protect, it can easily cause race condition.
> > > 
> > > Introduce 2 new functions: lock_full_stripe() and unlock_full_stripe()
> > > for RAID5/6.
> > > Which stores a rb_tree of mutex for full stripes, so scrub callers can
> > > use them to lock a full stripe to avoid race.
> > > 
> > > Signed-off-by: Qu Wenruo 
> > > Reviewed-by: Liu Bo 
> > > ---
> > >  fs/btrfs/ctree.h   |  17 
> > >  fs/btrfs/extent-tree.c |  11 +++
> > >  fs/btrfs/scrub.c   | 217 
> > > +
> > >  3 files changed, 245 insertions(+)
> > > 
> > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > > index 29b7fc28c607..9fe56da21fed 100644
> > > --- a/fs/btrfs/ctree.h
> > > +++ b/fs/btrfs/ctree.h
[...]
> > > +/*
> > > + * Helper to get full stripe logical from a normal bytenr.
> > > + *
> > > + * Caller must ensure @cache is a RAID56 block group.
> > > + */
> > > +static u64 get_full_stripe_logical(struct btrfs_block_group_cache *cache,
> > > +u64 bytenr)
> > > +{
> > > + u64 ret;
> > > +
> > > + /*
> > > +  * round_down() can only handle power of 2, while RAID56 full
> > > +  * stripe len can be 64KiB * n, so need manual round down.
> > > +  */
> > > + ret = (bytenr - cache->key.objectid) / cache->full_stripe_len *
> > > +   cache->full_stripe_len + cache->key.objectid;
> > 
> > Can you please use div_u64 instead?  '/' would cause building errors.
> 
> No problem, but I'm still curious about under which arch/compiler it would
> cause build error?
>

Sorry, it should be div64_u64 since cache->full_stripe_len is (unsigned long).

The build error might not be accurate, that's from my memory.
But at runtime, it could end up with a 'divide error'.

Thanks,

-liubo

> Thanks,
> Qu
> > 
> > Reviewed-by: Liu Bo 
> > 
> > Thanks,
> > 
> > -liubo
> > 
> > 
> 
> 


Re: [PATCH 2/7] btrfs: use simpler readahead zone lookups

2017-03-31 Thread David Sterba
On Wed, Mar 15, 2017 at 05:02:26PM +0100, David Sterba wrote:
> No point using radix_tree_gang_lookup if we're looking up just one slot.
> 
> Signed-off-by: David Sterba 

I've bisected to this patch; it causes a hang in btrfs/011. I'll revert it
until I find out the cause.


Re: [PATCH v3 1/5] btrfs: scrub: Introduce full stripe lock for RAID56

2017-03-31 Thread David Sterba
On Fri, Mar 31, 2017 at 10:03:28AM +0800, Qu Wenruo wrote:
> 
> 
> At 03/30/2017 06:31 PM, David Sterba wrote:
> > On Thu, Mar 30, 2017 at 09:03:21AM +0800, Qu Wenruo wrote:
>  +static int lock_full_stripe(struct btrfs_fs_info *fs_info, u64 bytenr)
>  +{
>  +struct btrfs_block_group_cache *bg_cache;
>  +struct btrfs_full_stripe_locks_tree *locks_root;
>  +struct full_stripe_lock *existing;
>  +u64 fstripe_start;
>  +int ret = 0;
>  +
>  +bg_cache = btrfs_lookup_block_group(fs_info, bytenr);
>  +if (!bg_cache)
>  +return -ENOENT;
>  +
> >>>
> >>> When starting to scrub a chunk, we've already increased a ref for block 
> >>> group,
> >>> could you please put a ASSERT to catch it?
> >>
> >> Personally I prefer WARN_ON() over ASSERT().
> >>
> >> ASSERT() always panics the module and forces us to reset the system,
> >> wiping out any possibility of examining the system.
> >
> > I think the semantics of WARN_ON and ASSERT are different, so it should
> > be decided case by case which one to use. Assert is good for 'never
> > happens' or catch errors at development time (wrong API use, invariant
> > condition that must always match).
> >
> > Also the asserts are gone if the config option is unset, while WARN_ON
> > will stay in some form (verbose or not). Both are suitable for catching
> > problems, but the warning is for less critical errors so we want to know
> > when it happens but still can continue.
> >
> > The above case looks like a candidate for ASSERT as the refcounts must
> > be correct, continuing with the warning could lead to other unspecified
> > problems.
> 
> I'm OK with using ASSERT() here, but the current ASSERT() in btrfs can hide
> real problems if CONFIG_BTRFS_ASSERT is not set.
> 
> When CONFIG_BTRFS_ASSERT is not set, ASSERT() just does nothing, and
> execution simply *continues*.
> 
> This forces us to build a fallback method.
> 
> For the above case, if we simply do "ASSERT(bg_cache);" then, in the case
> where CONFIG_BTRFS_ASSERT is not set (which is quite common for most
> distributions), we will cause a NULL pointer dereference.
> 
> So here, we still need to do bg_cache return value check, but just 
> change "WARN_ON(1);" to "ASSERT(0);" like:
> --
>   bg_cache = btrfs_lookup_block_group(fs_info, bytenr);
>   if (!bg_cache) {
>   ASSERT(0); /* WARN_ON(1); */
>   return -ENOENT;
>   }
> --
> 
> Can we make ASSERT() really catch problems no matter the kernel config?
> The current ASSERT() behavior is in fact forcing us to consider both
> situations, which makes it less handy.

All agreed, I'm not very happy about how the current ASSERT is
implemented. We want to add more and potentially expensive checks during
debugging builds, but also want to make sure that code does not proceed
past some points if the invariants and expected values do not hold.

BUG_ON does that, but we have tons of them already; some of them are just
temporary error handling, while in other places it serves as the sanity
checker. We'd probably need a 3rd option that would behave like BUG_ON but
be named differently, so we can clearly see that it's intentional, or we
can annotate the BUG_ONs with comments.


Re: Shrinking a device - performance?

2017-03-31 Thread Peter Grandi
>>> My guess is that very complex risky slow operations like
>>> that are provided by "clever" filesystem developers for
>>> "marketing" purposes, to win box-ticking competitions.

>>> That applies to those system developers who do know better;
>>> I suspect that even some filesystem developers are
>>> "optimistic" as to what they can actually achieve.

>>> There are cases where there really is no other sane
>>> option. Not everyone has the kind of budget needed for
>>> proper HA setups,

>> Thanks for letting me know, that must have never occurred to
>> me, just as it must have never occurred to me that some
>> people expect extremely advanced features that imply
>> big-budget high-IOPS high-reliability storage to be fast and
>> reliable on small-budget storage too :-)

> You're missing my point (or intentionally ignoring it).

In "Thanks for letting me know" I am not missing your point, I
am simply pointing out that I do know that people try to run
high-budget workloads on low-budget storage.

The argument as to whether "very complex risky slow operations"
should be provided in the filesystem itself is a very different
one, and I did not develop it fully. But it is quite "optimistic"
to simply state "there really is no other sane option", even
when for people that don't have "proper HA setups".

Let's start by assuming, for the time being, that "very complex
risky slow operations" are indeed feasible on very reliable high
speed storage layers. Then the questions become:

* Is it really true that "there is no other sane option" to
  running "very complex risky slow operations" even on storage
  that is not "big-budget high-IOPS high-reliability"?

* Is it really true that it is a good idea to run "very complex
  risky slow operations" even on "big-budget high-IOPS
  high-reliability storage"?

> Those types of operations are implemented because there are
> use cases that actually need them, not because some developer
> thought it would be cool. [ ... ]

And this is the really crucial bit, I'll disregard without
agreeing too much (but in part I do) with the rest of the
response, as those are less important matters, and this is going
to be longer than a twitter message.

First, I agree that "there are use cases that actually need
them", and I need to explain what I am agreeing to: I believe
that computer systems, "system" in a wide sense, have what I
call "inewvitable functionality", that is functionality that is
not optional, but must be provided *somewhere*: for example
print spooling is "inevitable functionality" as long as there
are multiple users, and spell checking is another example.

The only choice as to "inevitable functionality" is *where* to
provide it. For example spooling can be done among two users by
queuing jobs manually with one saying "I am going to print now",
and the other user waits until the print is finished, or by
using a spool program that queues jobs on the source system, or
by using a spool program that queues jobs on the target
printer. Spell checking can be done on the fly in the document
processor, batch with a tool, or manually by the document
author. All these are valid implementations of "inevitable
functionality", just with very different performance envelope,
where the "system" includes the users as "peripherals" or
"plugins" :-) in the manual implementations.

There is no dispute from me that multiple devices,
adding/removing block devices, data compression, structural
repair, balancing, growing/shrinking, defragmentation, quota
groups, integrity checking, deduplication, ... are all in the
general case "inevitable functionality", and every non-trivial
storage system *must* implement them.

The big question is *where*: for example when I started using
UNIX the 'fsck' tool was several years away, and when the system
crashed I did, like everybody, the filetree integrity checking and
structure recovery myself (with the help of 'ncheck' and
'icheck' and 'adb'), that is, 'fsck' was implemented in my head.

In the general case there are four places where such
"inevitable functionality" can be implemented:

* In the filesystem module in the kernel, for example Btrfs
  scrubbing.
* In a tool that uses hook provided by the filesystem module in
  the kernel, for example Btrfs deduplication, 'send'/'receive'.
* In a tool, for example 'btrfsck'.
* In the system administrator.

Consider the "very complex risky slow" operation of
defragmentation; the system administrator can implement it by
dumping and reloading the volume, or a tool can implement it by
running on the unmounted filesystem, or a tool and the kernel
can implement it by using kernel module hooks, or it can be
provided entirely in the kernel module.

My argument is that providing "very complex risky slow"
maintenance operations as filesystem primitives looks awesomely
convenient, a good way to "win box-ticking competitions" for
"marketing" purposes, but is rather bad idea for several
reasons, of varying strengths:

* Most system 

Re: [PATCH] btrfs: drop the nossd flag when remounting with -o ssd

2017-03-31 Thread David Sterba
On Fri, Mar 31, 2017 at 06:00:08PM +0200, Hans van Kranenburg wrote:
> On 03/31/2017 05:19 PM, Adam Borowski wrote:
> > The opposite case was already handled right in the very next switch entry.
> > 
> > Reported-by: Hans van Kranenburg 
> > Signed-off-by: Adam Borowski 
> > ---
> > Not sure if setting NOSSD should also disable SSD_SPREAD, there's currently
> > no way to disable that option once set.

Missing inverse of ssd_spread is probably unintentional, as we once
added all complementary no* options, this one was forgotten.

And yes, nossd should turn off ssd and ssd_spread, as ssd_spread without
ssd does nothing anyway.

> > 
> >  fs/btrfs/super.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> > index 06bd9b332e18..7342399951ad 100644
> > --- a/fs/btrfs/super.c
> > +++ b/fs/btrfs/super.c
> > @@ -549,11 +549,13 @@ int btrfs_parse_options(struct btrfs_fs_info *info, 
> > char *options,
> > case Opt_ssd:
> > btrfs_set_and_info(info, SSD,
> >"use ssd allocation scheme");
> > +   btrfs_clear_opt(info->mount_opt, NOSSD);
> > break;
> > case Opt_ssd_spread:
> > btrfs_set_and_info(info, SSD_SPREAD,
> >"use spread ssd allocation scheme");
> > btrfs_set_opt(info->mount_opt, SSD);
> > +   btrfs_clear_opt(info->mount_opt, NOSSD);
> > break;
> > case Opt_nossd:
> > btrfs_set_and_info(info, NOSSD,
> 
> How did you test this?
> 
> This was also my first thought, but here's a weird thing:
> 
> -# mount -o nossd /dev/sdx /mnt/btrfs/
> 
> BTRFS info (device sdx): not using ssd allocation scheme
> 
> -# mount -o remount,ssd /mnt/btrfs/
> 
> BTRFS info (device sdx): use ssd allocation scheme
> 
> -# mount -o remount,nossd /mnt/btrfs/
> 
> BTRFS info (device sdx): use ssd allocation scheme
> 
> That means that the case Opt_nossd: is never reached when doing this?
> 
> And... what should be the result of doing:
> -# mount -o remount,nossd,ssd /mnt/btrfs/
> 
> I guess it should be that the last one in the sequence wins?

The last one wins.

> The fact that nossd,ssd,ssd_spread are different options complicates the
> whole thing, compared to e.g. autodefrag, noautodefrag.

I think the ssd flags reflect the autodetection of ssd, unlike
autodefrag and others.

The ssd options says "enable the ssd mode", but it could be also
auto-detected if the non-rotational device is detected.

nossd says, "do not do the autodetection, even if it's a non-rot
device, also disable all ssd modes".

The manual page is not entirely clear about that, I'll update it.

So Adam's patch needs to be updated so NOSSD also disables SSD_SPREAD.
Adding the 'nossd_spread' would be good to have, even if it might be
just a marginal usecase.
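
For illustration, an untested sketch of the Opt_nossd hunk with that
folded in (my reading of the above, not a final patch):

 case Opt_nossd:
 btrfs_set_and_info(info, NOSSD,
    "not using ssd allocation scheme");
+btrfs_clear_opt(info->mount_opt, SSD);
+btrfs_clear_opt(info->mount_opt, SSD_SPREAD);
 break;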


Re: [PATCH] btrfs: drop the nossd flag when remounting with -o ssd

2017-03-31 Thread Hans van Kranenburg
On 03/31/2017 05:19 PM, Adam Borowski wrote:
> The opposite case was already handled right in the very next switch entry.
> 
> Reported-by: Hans van Kranenburg 
> Signed-off-by: Adam Borowski 
> ---
> Not sure if setting NOSSD should also disable SSD_SPREAD, there's currently
> no way to disable that option once set.
> 
>  fs/btrfs/super.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 06bd9b332e18..7342399951ad 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -549,11 +549,13 @@ int btrfs_parse_options(struct btrfs_fs_info *info, 
> char *options,
>   case Opt_ssd:
>   btrfs_set_and_info(info, SSD,
>  "use ssd allocation scheme");
> + btrfs_clear_opt(info->mount_opt, NOSSD);
>   break;
>   case Opt_ssd_spread:
>   btrfs_set_and_info(info, SSD_SPREAD,
>  "use spread ssd allocation scheme");
>   btrfs_set_opt(info->mount_opt, SSD);
> + btrfs_clear_opt(info->mount_opt, NOSSD);
>   break;
>   case Opt_nossd:
>   btrfs_set_and_info(info, NOSSD,

How did you test this?

This was also my first thought, but here's a weird thing:

-# mount -o nossd /dev/sdx /mnt/btrfs/

BTRFS info (device sdx): not using ssd allocation scheme

-# mount -o remount,ssd /mnt/btrfs/

BTRFS info (device sdx): use ssd allocation scheme

-# mount -o remount,nossd /mnt/btrfs/

BTRFS info (device sdx): use ssd allocation scheme

That means that the case Opt_nossd: is never reached when doing this?

And... what should be the result of doing:
-# mount -o remount,nossd,ssd /mnt/btrfs/

I guess it should be that the last one in the sequence wins?

The fact that nossd,ssd,ssd_spread are different options complicates the
whole thing, compared to e.g. autodefrag, noautodefrag.

-- 
Hans van Kranenburg


Btrfs progs release 4.10.2

2017-03-31 Thread David Sterba
Hi,

btrfs-progs version 4.10.2 has been released.  More build breakages fixed and
some minor updates.

Changes:

 * check: lowmem mode fix for false alert about lost backrefs
 * convert: minor bugfix
 * library: fix build, missing symbols, added tests

Tarballs: https://www.kernel.org/pub/linux/kernel/people/kdave/btrfs-progs/
Git: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git

Shortlog:

David Sterba (4):
  btrfs-progs: library-test: add all exported headers
  btrfs-progs: add prefix to message helpers
  btrfs-progs: update CHANGES for v4.10.2
  Btrfs progs v4.10.2

Qu Wenruo (4):
  btrfs-progs: Cleanup kernel-shared dir when execute make clean
  btrfs-progs: convert: Add missing return for HOLE mode when checking 
convert image
  btrfs-progs: check: lowmem, fix false alert about backref lost for 
SHARED_DATA_REF
  btrfs-progs: tests: Add SHARED_DATA_REF test image for check lowmem mode

Sergei Trofimovich (1):
  btrfs-progs: fix missing __error symbol in libbtrfs.so.0



[PATCH] btrfs: drop the nossd flag when remounting with -o ssd

2017-03-31 Thread Adam Borowski
The opposite case was already handled right in the very next switch entry.

Reported-by: Hans van Kranenburg 
Signed-off-by: Adam Borowski 
---
Not sure if setting NOSSD should also disable SSD_SPREAD, there's currently
no way to disable that option once set.

 fs/btrfs/super.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 06bd9b332e18..7342399951ad 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -549,11 +549,13 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char 
*options,
case Opt_ssd:
btrfs_set_and_info(info, SSD,
   "use ssd allocation scheme");
+   btrfs_clear_opt(info->mount_opt, NOSSD);
break;
case Opt_ssd_spread:
btrfs_set_and_info(info, SSD_SPREAD,
   "use spread ssd allocation scheme");
btrfs_set_opt(info->mount_opt, SSD);
+   btrfs_clear_opt(info->mount_opt, NOSSD);
break;
case Opt_nossd:
btrfs_set_and_info(info, NOSSD,
-- 
2.11.0



WARN splat fs/btrfs/qgroup.c

2017-03-31 Thread Davidlohr Bueso

Hi,

While doing a regular kernel build I triggered the following splat
on a vanilla v4.11-rc4 kernel.

[73253.814880] WARNING: CPU: 20 PID: 631 at fs/btrfs/qgroup.c:2472 
btrfs_qgroup_free_refroot+0x154/0x180 [btrfs]
[73253.814880] Modules linked in: st(E) sr_mod(E) cdrom(E) nfsv3(E) nfs_acl(E) 
rpcsec_gss_krb5(E) auth_rpcgss(E) nfsv4(E) dns_resolver(E) nfs(E) lockd(E) 
grace(E) fscache(E) ebtable_filter(E) ebtables(E) ip6table_filter(E) 
ip6_tables(E) iptable_filter(E) ip_tables(E) x_tables(E) af_packet(E) 
iscsi_ibft(E) iscsi_boot_sysfs(E) msr(E) ext4(E) crc16(E) jbd2(E) mbcache(E) 
intel_rapl(E) sb_edac(E) edac_core(E) x86_pkg_temp_thermal(E) 
intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E) 
crct10dif_pclmul(E) crc32_pclmul(E) igb(E) ghash_clmulni_intel(E) iTCO_wdt(E) 
joydev(E) pcbc(E) aesni_intel(E) ipmi_ssif(E) aes_x86_64(E) ptp(E) 
iTCO_vendor_support(E) crypto_simd(E) glue_helper(E) pps_core(E) lpc_ich(E) 
ioatdma(E) pcspkr(E) dca(E) mfd_core(E) cryptd(E) i2c_i801(E) ipmi_si(E) 
ipmi_devintf(E) ipmi_msghandler(E)
[73253.814893]  wmi(E) shpchp(E) button(E) sunrpc(E) btrfs(E) hid_generic(E) 
xor(E) usbhid(E) raid6_pq(E) sd_mod(E) mgag200(E) i2c_algo_bit(E) 
drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) 
ttm(E) isci(E) ehci_pci(E) ahci(E) ehci_hcd(E) libsas(E) crc32c_intel(E) 
scsi_transport_sas(E) libahci(E) drm(E) usbcore(E) libata(E) sg(E) 
dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) 
scsi_mod(E) autofs4(E)
[73253.814901] CPU: 20 PID: 631 Comm: btrfs-transacti Tainted: GW   E   
4.11.0-rc4-92.11-default+ #2
[73253.814901] Hardware name: Intel Corporation SandyBridge Platform/To be 
filled by O.E.M., BIOS RMLCRB.86I.R1.25.D670.1303141058 03/14/2013
[73253.814902] Call Trace:
[73253.814903]  dump_stack+0x63/0x87
[73253.814905]  __warn+0xd1/0xf0
[73253.814906]  warn_slowpath_null+0x1d/0x20
[73253.814915]  btrfs_qgroup_free_refroot+0x154/0x180 [btrfs]
[73253.814923]  __btrfs_run_delayed_refs.constprop.73+0x309/0x1300 [btrfs]
[73253.814932]  btrfs_run_delayed_refs+0x7e/0x2e0 [btrfs]
[73253.814941]  btrfs_commit_transaction+0x39/0x950 [btrfs]
[73253.814948]  ? start_transaction+0xaa/0x490 [btrfs]
[73253.814956]  transaction_kthread+0x18a/0x1c0 [btrfs]
[73253.814958]  kthread+0x101/0x140
[73253.814965]  ? btrfs_cleanup_transaction+0x4f0/0x4f0 [btrfs]
[73253.814966]  ? kthread_park+0x90/0x90
[73253.814967]  ret_from_fork+0x2c/0x40

Any ideas?

Thanks,
Davidlohr


Re: Shrinking a device - performance?

2017-03-31 Thread Austin S. Hemmelgarn

On 2017-03-30 11:55, Peter Grandi wrote:

My guess is that very complex risky slow operations like that are
provided by "clever" filesystem developers for "marketing" purposes,
to win box-ticking competitions. That applies to those system
developers who do know better; I suspect that even some filesystem
developers are "optimistic" as to what they can actually achieve.



There are cases where there really is no other sane option. Not
everyone has the kind of budget needed for proper HA setups,


Thanks for letting me know, that must have never occurred to me, just as
it must have never occurred to me that some people expect extremely
advanced features that imply big-budget high-IOPS high-reliability
storage to be fast and reliable on small-budget storage too :-)
You're missing my point (or intentionally ignoring it).  Those types of 
operations are implemented because there are use cases that actually 
need them, not because some developer thought it would be cool.  The one 
possible counter-example of this is XFS, which doesn't support shrinking 
the filesystem at all, but that was a conscious decision because their 
target use case (very large scale data storage) does not need that 
feature and not implementing it allows them to make certain other parts 
of the filesystem faster.



and if you need maximal uptime and as a result have to reprovision the
system online, then you pretty much need a filesystem that supports
online shrinking.


That's a bigger topic than we can address here. The topic used to be
known in one related domain as "Very Large Databases", which were
defined as databases so large and critical that the time needed for
maintenance and backup was too long to take them offline etc.;
that is a topic that has largely vanished from discussion, I guess
because most management just don't want to hear it :-).
No, it's mostly vanished because of changes in best current practice. 
That was a topic in an era where the only platform that could handle 
high-availability was VMS, and software wasn't routinely written to 
handle things like load balancing.  As a result, people ran a single 
system which hosted the database, and if that went down, everything went 
down.  By contrast, it's rare these days outside of small companies to 
see singly hosted databases that aren't specific to the local system, 
and once you start parallelizing on the system level, backup and 
maintenance times generally go down.



Also, it's not really all that slow on most filesystems; BTRFS is just
hurt by its comparatively poor performance, and the COW metadata
updates that are needed.


Btrfs in realistic situations has pretty good speed *and* performance,
and COW actually helps, as it often results in less head repositioning
than update-in-place. What makes it a bit slower with metadata is having
'dup' by default to recover from especially damaging bitflips in
metadata, but then that does not impact performance, only speed.
I and numerous other people have done benchmarks running single metadata 
and single data profiles on BTRFS, and it consistently performs worse 
than XFS and ext4 even under those circumstances.  It's not horrible 
performance (it's better for example than trying the same workload on 
NTFS on Windows), but it's still not what most people would call 'high' 
performance or speed.



That feature set is arguably not appropriate for VM images, but
lots of people know better :-).



That depends on a lot of factors.  I have no issues personally running
small VM images on BTRFS, but I'm also running on decent SSD's
(>500MB/s read and write speeds), using sparse files, and keeping on
top of managing them. [ ... ]


Having (relatively) big-budget high-IOPS storage for high-IOPS workloads
helps, that must have never occurred to me either :-).
It's not big budget, the SSD's in question are at best mid-range 
consumer SSD's that cost only marginally more than a decent hard drive, 
and they really don't get all that great performance in terms of IOPS 
because they're all on the same cheap SATA controller.  The point I was 
trying to make (which I should have been clearer about) is that they 
have good bulk throughput, which means that the OS can do much more 
aggressive writeback caching, which in turn means that COW and 
fragmentation have less impact.



XFS and 'ext4' are essentially equivalent, except for the fixed-size
inode table limitation of 'ext4' (and XFS reportedly has finer
grained locking). Btrfs is nearly as good as either on most workloads
is single-device mode [ ... ]



No, if you look at actual data, [ ... ]


Well, I have looked at actual data in many published but often poorly
made "benchmarks", and to me they seem quite equivalent
indeed, within somewhat differently shaped performance envelopes, so the
results depend on the testing point within that envelope. I have
done my own simplistic actual data gathering, most recently here:

  

Re: Shrinking a device - performance?

2017-03-31 Thread Peter Grandi
>> [ ... ] CentOS, Redhat, and Oracle seem to take the position
>> that very large data subvolumes using btrfs should work
>> fine. But I would be curious what the rest of the list thinks
>> about 20 TiB in one volume/subvolume.

> To be sure I'm a biased voice here, as I have multiple
> independent btrfs on multiple partitions here, with no btrfs
> over 100 GiB in size, and that's on ssd so maintenance
> commands normally return in minutes or even seconds,

That's a bit extreme I think, as there are downsides to having
too many too-small volumes as well.

> not the hours to days or even weeks it takes on multi-TB btrfs
> on spinning rust.

Or months :-).

> But FWIW... 1) Don't put all your data eggs in one basket,
> especially when that basket isn't yet entirely stable and
> mature.

Really good point here.

> A mantra commonly repeated on this list is that btrfs is still
> stabilizing,

My impression is that most 4.x and later versions are very
reliable for "base" functionality, that is excluding
multi-device, compression, qgroups, ... Put another way, what
scratches the Facebook itches works well :-).

> [ ... ] the time/cost/hassle-factor of the backup, and being
> practically prepared to use them, is even *MORE* important
> than it is on fully mature and stable filesystems.

Indeed, or at least *different* filesystems. I backup JFS
filesystems to XFS ones, and Btrfs filesystems to NILFS2 ones,
for example.

> 2) Don't make your filesystems so large that any maintenance
> on them, including both filesystem maintenance like btrfs
> balance/scrub/check/ whatever, and normal backup and restore
> operations, takes impractically long,

As per my preceding post, that's the big deal, but so many
people "know better" :-).

> where "impractically" can be reasonably defined as so long it
> discourages you from doing them in the first place and/or so
> long that it's going to cause unwarranted downtime.

That's the "Very Large DataBase" level of trouble.

> Some years ago, before I started using btrfs and while I was
> using mdraid, I learned this one the hard way. I had a bunch
> of rather large mdraids setup, [ ... ]

I have recently seen another much "funnier" example: people who
"know better" and follow every cool trend decide to consolidate
their server farm on VMs, backed by a storage server with a
largish single pool of storage holding the virtual disk images
of all the server VMs. They look like geniuses until the storage
pool system crashes, and a minimal integrity check on restart
takes two days during which the whole organization is without
access to any email, files, databases, ...

> [ ... ] And there was a good chance it was /not/ active and
> mounted at the time of the crash and thus didn't need
> repaired, saving that time entirely! =:^)

As to that I have switched to using 'autofs' to mount volumes
only on access, using a simple script that turns '/etc/fstab'
into an automounter dynamic map, which means that most of the
time most volumes on my (home) systems are not mounted:

  http://www.sabi.co.uk/blog/anno06-3rd.html?060928#060928
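
(The gist of such a program map, as a rough sketch rather than the
exact script from that post, with illustrative names and assuming the
volumes are top-level fstab entries:)

  #!/bin/sh
  # /etc/auto.fstab -- executable autofs "program map": autofs passes the
  # requested key as $1; print a map entry "-fstype=TYPE,OPTS :DEVICE"
  # taken from /etc/fstab.  Enabled with a master-map line such as:
  #   /vol  /etc/auto.fstab
  key="$1"
  awk -v mp="/$key" '$1 !~ /^#/ && $2 == mp {
          printf "-fstype=%s,%s :%s\n", $3, $4, $1
  }' /etc/fstab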

> Eventually I arranged things so I could keep root mounted
> read-only unless I was updating it, and that's still the way I
> run it today.

The ancient way, instead of having '/' RO and '/var' RW, was to have
'/' RW and '/usr' RO (so for example it could be shared across
many systems via NFS etc.), and while both are good ideas, I
prefer the ancient way. But then some people who know better are
moving to merge '/' with '/usr' without understanding the
history and the advantages.

> [ ... ] If it's multiple TBs, chances are it's going to be
> faster to simply blow away and recreate from backup, than it
> is to try to repair... [ ... ]

Or to shrink or defragment or dedup etc., except on very high
IOPS-per-TB storage.

> [ ... ] how much simpler it would have been had they had an
> independent btrfs of say a TB or two for each system they were
> backing up.

That is the general alternative to a single large pool/volume:
sharding/chunking of filetrees, sometimes, as with Lustre or
Ceph etc., with a "metafilesystem" layer on top.

Done manually, my suggestion is to do the sharding per week (or
other suitable period) rather than per system, in a circular
"crop rotation" scheme, so that once a volume has been filled,
it becomes read-only and can even be unmounted until it needs
to be reused:

  http://www.sabi.co.uk/blog/12-fou.html?121218b#121218b

Then there is the problem that "a TB or two" is less easy with
increasing disk capacities, but then I think that disks with a
capacity larger than 1TB are not suitable for ordinary
workloads, and more for tape-cartridge like usage.

> What would they have done had the btrfs gone bad and needed
> repaired? [ ... ]

In most cases I have seen of designs aimed at achieving the
lowest-cost, highest-flexibility "low IOPS single pool" at
the expense of scalability and maintainability, the "clever"
designer had been promoted or had 

Re: Do different btrfs volumes compete for CPU?

2017-03-31 Thread Marat Khalili
Thank you very much for the reply and suggestions, more comments below. 
Still, is there a definite answer to the root question: are different btrfs 
volumes independent in terms of CPU, or are there some shared workers 
that can be a point of contention?



What would have been interesting would have been if you had any reports
from for instance htop during that time, showing wait percentage on the
various cores and status (probably D, disk-wait) of the innocent
process.  iotop output would of course have been even better, but also
rather more special-case so less commonly installed.
Curiously, I have had iotop but not htop running. [btrfs-transacti] had 
some low-level activity in iotop (I still assume it was CPU-limited), 
the innocent process did not have any activity anywhere. Next time I'll 
also take notice of process state in ps (sadly, my omission).



I believe you will find that the problem isn't btrfs, but rather, I/O
contention
This possibility did not come to my mind. Can USB drivers still be that 
bad in 4.4? Is there any way to distinguish these two situations (btrfs 
vs USB load)?


BTW, USB adapter used is this one (though storage array only supports 
USB 3.0): https://www.asus.com/Motherboard-Accessory/USB_31_TYPEA_CARD/



and that if you try the same thing with one of the
filesystems being for instance ext4, you'll see the same problem there as
well
Not sure if it's possible to reproduce the problem with ext4, since it's 
not possible to perform such extensive metadata operations there, and 
simply moving a large amount of data has never created any problems for me 
regardless of the filesystem.


--

With Best Regards,
Marat Khalili



Re: Do different btrfs volumes compete for CPU?

2017-03-31 Thread Duncan
Marat Khalili posted on Fri, 31 Mar 2017 10:05:20 +0300 as excerpted:

> Approximately 16 hours ago I've run a script that deleted >~100
> snapshots and started quota rescan on a large USB-connected btrfs volume
> (5.4 of 22 TB occupied now). Quota rescan only completed just now, with
> 100% load from [btrfs-transacti] throughout this period, which is
> probably ~ok depending on your view on things.
> 
> What worries me is innocent process using _another_, SATA-connected
> btrfs volume that hung right after I started my script and took >30
> minutes to be sigkilled. There's nothing interesting in the kernel log,
> and attempts to attach strace to the process output nothing, but I of
> course suspect that it freezed on disk operation.
> 
> I wonder:
> 1) Can there be a contention for CPU or some mutexes between kernel
> btrfs threads belonging to different volumes?
> 2) If yes, can anything be done about it other than mounting volumes
> from (different) VMs?
> 
> 
>> $ uname -a; btrfs --version
>> Linux host 4.4.0-66-generic #87-Ubuntu SMP
>> Fri Mar 3 15:29:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
>> btrfs-progs v4.4

What would have been interesting would have been if you had any reports 
from for instance htop during that time, showing wait percentage on the 
various cores and status (probably D, disk-wait) of the innocent 
process.  iotop output would of course have been even better, but also 
rather more special-case so less commonly installed.

I believe you will find that the problem isn't btrfs, but rather, I/O 
contention, and that if you try the same thing with one of the 
filesystems being for instance ext4, you'll see the same problem there as 
well, which because the two filesystems are then not the same type should 
well demonstrate that it's not a problem at the filesystem level, but 
rather elsewhere.

USB is infamous for being an I/O bottleneck, slowing things down both for 
it, and on less than perfectly configured systems, often for data access 
on other devices as well.  SATA can and does do similar things too, but 
because it tends to be more efficient in general, it doesn't tend to make 
things as drastically bad for as long as USB can.

There's some knobs you can twist for better interactivity, but I need to 
be up to go to work in a couple hours so will leave it to other posters 
to make suggestions in that regard at this point.
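
(The knobs in question are mostly the VM writeback thresholds; purely
as an illustration, not a tuned recommendation:)

  # Shrink the dirty-page ceilings so a slow USB target cannot build up a
  # huge writeback backlog that stalls other I/O (values only illustrative):
  sysctl -w vm.dirty_background_bytes=$((64*1024*1024))
  sysctl -w vm.dirty_bytes=$((256*1024*1024))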

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Shrinking a device - performance?

2017-03-31 Thread Peter Grandi
> Can you try to first dedup the btrfs volume?  This is probably
> out of date, but you could try one of these: [ ... ] Yep,
> that's probably a lot of work. [ ... ] My recollection is that
> btrfs handles deduplication differently than zfs, but both of
> them can be very, very slow

But the big deal there is that dedup is indeed a very expensive
operation, even worse than 'balance'. A balanced, deduped volume
will shrink faster in most cases, but the time taken is simply
moved from shrinking to preparing.
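
As a sketch of that "preparing" step (tool choice, thresholds and mount
point are only an example; 'duperemove' is one of the out-of-band
dedupers for Btrfs):

  # Compact under-used chunks, dedup, then shrink (illustrative only):
  btrfs balance start -dusage=50 -musage=50 /mnt/vol
  duperemove -rdh /mnt/vol          # out-of-band dedup using the extent-same ioctl
  btrfs filesystem resize -100g /mnt/vol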

> Again, I'm not an expert in btrfs, but in most cases a full
> balance and scrub takes care of any problems on the root
> partition, but that is a relatively small partition.  A full
> balance (without the options) and scrub on 20 TiB must take a
> very long time even with robust hardware, would it not?

There have been reports of several months for volumes of that
size subject to ordinary workload.

> CentOS, Redhat, and Oracle seem to take the position that very
> large data subvolumes using btrfs should work fine.

This is a long-standing controversy, and for example there have
been "interesting" debates on the XFS mailing list. Btrfs in
this is not really different from others, with one major
difference in context: many Btrfs developers work for a
company that relies on large numbers of small servers, to the
point that fixing multidevice issues has not been a priority.

The controversy of large volumes is that while no doubt the
logical structures of recent filesystem types can support single
volumes of many petabytes (or even much larger), and such
volumes have indeed been created and "work"-ish, so they are
unquestionably "syntactically valid", the tradeoffs involved
especially as to maintainability may mean that they don't "work"
well and sustainably so.

The fundamental issue is metadata: while the logical structures,
using 48-64 bit pointers, unquestionably scale "syntactically",
they don't scale pragmatically when considering whole-volume
maintenance like checking, repair, balancing, scrubbing,
indexing (which includes making incremental backups etc.).

Note: large volumes don't have just a speed problem for
whole-volume operations, they also have a memory problem, as
most tools hold an in-memory copy of the metadata. There have been
cases where indexing or repair of a volume requires a lot more
RAM (many hundreds of GiB or some TiB) than is present in the
system on which the volume was being used.

The problem is of course smaller if the large volume contains
mostly large files, and bigger if the volume is stored on low
IOPS-per-TB devices and used on small-memory systems. But even
with large files, even if filetree object metadata (inodes etc.)
are relatively few, eventually space metadata must at least
potentially resolve down to single sectors, and that can be a
lot of metadata unless both used and free space are very
unfragmented.

The fundamental technological issue is: *data* IO rates, in both
random IOPS and sequential ones, can be scaled "almost" linearly
by parallelizing them using RAID or equivalent, allowing large
volumes to serve scalably large and parallel *data* workloads,
but *metadata* IO rates cannot be easily parallelized, because
metadata structures are graphs, not arrays of bytes like files.

So a large volume on 100 storage devices can serve in parallel a
significant percentage of 100 times the data workload of a small
volume on 1 storage device, but not so much for the metadata
workload.

For example, I have never seen a parallel 'fsck' tool that can
take advantage of 100 storage devices to complete a scan of a
single volume on 100 storage devices in not much longer time
than the scan of a volume on 1 of the storage devices.

> But I would be curious what the rest of the list thinks about
> 20 TiB in one volume/subvolume.

Personally I think that while volumes of many petabytes "work"
syntactically, there are serious maintainability problems (which
I have seen happen at a number of sites) with volumes larger
than 4TB-8TB with any current local filesystem design.

That depends also on number/size of storage devices, and their
nature, that is IOPS, as after all metadata workloads do scale a
bit with number of available IOPS, even if far more slowly than
data workloads.

For example I think that an 8TB volume is not desirable on a
single 8TB disk for ordinary workloads (but then I think that
disks above 1-2TB are just not suitable for ordinary filesystem
workloads), but with lots of smaller/faster disks a 12TB volume
would probably be acceptable, and maybe a number of flash SSDs
might make acceptable even a 20TB volume.

Of course there are lots of people who know better. :-)


Re: Shrinking a device - performance?

2017-03-31 Thread Peter Grandi
>>> The way btrfs is designed I'd actually expect shrinking to
>>> be fast in most cases. [ ... ]

>> The proposed "move whole chunks" implementation helps only if
>> there are enough unallocated chunks "below the line". If regular
>> 'balance' is done on the filesystem there will be some, but that
>> just spreads the cost of the 'balance' across time, it does not
>> by itself make a «risky, difficult, slow operation» any less so,
>> just spreads the risk, difficulty, slowness across time.

> Isn't that too pessimistic?

Maybe, it depends on the workload impacting the volume and how
much it churns the free/unallocated situation.

> Most of my filesystems have 90+% of free space unallocated,
> even those I never run balance on.

That seems quite lucky to me, as that is definitely not my experience
or even my expectation in the general case: in my laptop and
desktop with relatively few updates I have to run 'balance'
fairly frequently, and "Knorrie" has produced a nice tool that
draws a graphical map of free vs. unallocated space, and most
examples and users find quite a bit of balancing needs to be
done.
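
For anyone wanting to check which case they are in, the allocated
vs. unallocated split is visible directly (the mount point and the
usage threshold are just examples):

  btrfs filesystem usage /mnt/vol         # see the "Device unallocated" line
  # compact mostly-empty chunks back into unallocated space:
  btrfs balance start -dusage=30 /mnt/vol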

> For me it wouldn't just spread the cost, it would reduce it
> considerably.

In your case the cost of the implicit or explicit 'balance'
simply does not arise because 'balance' is not necessary, and
then moving whole chunks is indeed cheap. The argument here is
in part whether used space (extents) or allocated space (chunks)
is more fragmented as well as the amount of metadata to update
in either case.


Re: Fwd: Confusion about snapshots containers

2017-03-31 Thread Austin S. Hemmelgarn

On 2017-03-30 09:07, Tim Cuthbertson wrote:

On Wed, Mar 29, 2017 at 10:46 PM, Duncan <1i5t5.dun...@cox.net> wrote:

Tim Cuthbertson posted on Wed, 29 Mar 2017 18:20:52 -0500 as excerpted:


So, another question...

Do I then leave the top level mounted all the time for snapshots, or
should I create them, send them to external storage, and umount until
next time?


Keep in mind that because snapshots contain older versions of whatever
they're snapshotting, they're a potential security issue, at least if
some of those older versions are libs or binaries.  Consider the fact
that you may have had security-updates since the snapshot, thus leaving
your working copies unaffected by whatever security vulns the updates
fixed.  If the old versions remain around where normal users have access
to them, particularly if they're setuid or similar, they have access to
those old and now known vulns in setuid executables!  (Of course users
can grab vulnerable versions elsewhere or build them themselves, but they
can't set them setuid root unless they /are/ root, so finding an existing
setuid-root executable with known vulns is finding the keys to the
kingdom.)

So keeping snapshots unmounted and out of the normally accessible
filesystem tree by default is recommended, at least if you're at all
concerned about someone untrusted getting access to a normal user account
and being able to use snapshots with known vulns of setuid executables as
root-escalation methods.

Another possibility is setting the snapshot subdir 700 perms, so non-
super-users can't normally access anything in it anyway.  Of course
that's a problem if you want them to have access to snapshots of their
own stuff for recovery purposes, but it's useful if you can do it.

Good admins will do both of these at once if possible as they know and
observe the defense-in-depth mantra, knowing all too well how easily a
single layer of defense yields to fat-fingering or previously unknown
vulns.

--
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Thank you, Duncan. I will try to take all that into consideration.
These are really just fairly simple personal home systems, but
security is still important.
On the note of the old binaries and libraries bit, nodev, noexec, and 
nosuid are all per-mountpoint, not per-volume, so you can mitigate some 
of the risk by always mounting with those flags.  Despite that, it's 
still a good idea to not have anything more than you need mounted at any 
given time (it's a lot harder to screw up a filesystem which isn't mounted).
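
A minimal fstab sketch of that kind of mount (the UUID, subvolume name 
and mount point are placeholders):

  # Snapshot area kept unmounted by default and, when mounted, stripped
  # of dev/exec/suid semantics:
  UUID=<fs-uuid>  /srv/snapshots  btrfs  noauto,subvol=snapshots,nodev,noexec,nosuid  0  0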




[PATCH 4/4 V2] btrfs: cleanup barrier_all_devices() to check dev stat flush error

2017-03-31 Thread Anand Jain
The objective of this patch is to clean up barrier_all_devices()
so that the error checking is in a separate loop, independent of
the loop which submits and waits on the device flush requests.

This helps further patches which would tune
the error actions as needed.

Signed-off-by: Anand Jain 
---
 V2: Now the flush error return is saved and checked instead of the
 checkpoint of the dev_stat method earlier.

 fs/btrfs/disk-io.c | 32 ++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index f8f534a32c2f..b6d047250ce2 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3535,6 +3535,23 @@ static int write_dev_flush(struct btrfs_device *device, 
int wait)
return 0;
 }
 
+static int check_barrier_error(struct btrfs_fs_devices *fsdevs)
+{
+   int dropouts = 0;
+   struct btrfs_device *dev;
+
+   list_for_each_entry_rcu(dev, >devices, dev_list) {
+   if (!dev->bdev || dev->last_flush_error)
+   dropouts++;
+   }
+
+   if (dropouts >
+   fsdevs->fs_info->num_tolerated_disk_barrier_failures)
+   return -EIO;
+
+   return 0;
+}
+
 /*
  * send an empty flush down to each device in parallel,
  * then wait for them
@@ -3572,8 +3589,19 @@ static int barrier_all_devices(struct btrfs_fs_info 
*info)
if (write_dev_flush(dev, 1))
dropouts++;
}
-   if (dropouts > info->num_tolerated_disk_barrier_failures)
-   return -EIO;
+
+   /*
+* A slight optimization, we check for dropouts here which avoids
+* a dev list loop when disks are healthy.
+*/
+   if (dropouts) {
+   /*
+* As we need holistic view of the failed disks, so
+* error checking is pushed to a separate loop.
+*/
+   return check_barrier_error(info->fs_devices);
+   }
+
return 0;
 }
 
-- 
2.10.0



[PATCH 3/4 V2] btrfs: cleanup barrier_all_devices() unify dev error count

2017-03-31 Thread Anand Jain
When counting the number of error devices we don't need to count
them separately during send and wait, because the device errors
counted during send amount to a static check.

Also note that as of now there is no code which would set
dev->bdev = NULL unless the device is missing. However I still kept
bdev == NULL counted towards the error devices in view of future
enhancements. And as the device_list_mutex is held when
barrier_all_devices() is called, I don't expect a bdev to become NULL
between send and wait.

In the process I also rename errors_wait to dropouts.

Signed-off-by: Anand Jain 
---
 V2: As the write_dev_flush with wait=0 is always successful,
 from the previous patch, ret is now removed.

 fs/btrfs/disk-io.c | 18 ++
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 42bcf98794ec..f8f534a32c2f 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3543,19 +3543,15 @@ static int barrier_all_devices(struct btrfs_fs_info 
*info)
 {
struct list_head *head;
struct btrfs_device *dev;
-   int errors_send = 0;
-   int errors_wait = 0;
-   int ret;
+   int dropouts = 0;
 
/* send down all the barriers */
head = >fs_devices->devices;
list_for_each_entry_rcu(dev, head, dev_list) {
if (dev->missing)
continue;
-   if (!dev->bdev) {
-   errors_send++;
+   if (!dev->bdev)
continue;
-   }
if (!dev->in_fs_metadata || !dev->writeable)
continue;
 
@@ -3567,18 +3563,16 @@ static int barrier_all_devices(struct btrfs_fs_info 
*info)
if (dev->missing)
continue;
if (!dev->bdev) {
-   errors_wait++;
+   dropouts++;
continue;
}
if (!dev->in_fs_metadata || !dev->writeable)
continue;
 
-   ret = write_dev_flush(dev, 1);
-   if (ret)
-   errors_wait++;
+   if (write_dev_flush(dev, 1))
+   dropouts++;
}
-   if (errors_send > info->num_tolerated_disk_barrier_failures ||
-   errors_wait > info->num_tolerated_disk_barrier_failures)
+   if (dropouts > info->num_tolerated_disk_barrier_failures)
return -EIO;
return 0;
 }
-- 
2.10.0



[PATCH 2/4 V2] btrfs: use blkdev_issue_flush to flush the device cache

2017-03-31 Thread Anand Jain
As of now we allocate an empty bio and then use the REQ_PREFLUSH flag
to flush the device cache; instead we can use blkdev_issue_flush()
for this purpose.

Also there is now no need to check the return value when
write_dev_flush() is called with wait == 0.

Signed-off-by: Anand Jain 
---
V2
  Title of this patch is changed from
   btrfs: communicate back ENOMEM when it occurs
  And its entirely a new patch, which now use blkdev_issue_flush()

 fs/btrfs/disk-io.c | 64 +++---
 fs/btrfs/volumes.h |  3 ++-
 2 files changed, 19 insertions(+), 48 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 9de35bca1f67..42bcf98794ec 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3498,67 +3498,39 @@ static int write_dev_supers(struct btrfs_device *device,
return errors < i ? 0 : -1;
 }
 
-/*
- * endio for the write_dev_flush, this will wake anyone waiting
- * for the barrier when it is done
- */
-static void btrfs_end_empty_barrier(struct bio *bio)
+static void btrfs_dev_issue_flush(struct work_struct *work)
 {
-   if (bio->bi_private)
-   complete(bio->bi_private);
-   bio_put(bio);
+   int ret;
+   struct btrfs_device *device;
+
+   device = container_of(work, struct btrfs_device, flush_work);
+
+   /* we are in the commit thread */
+   ret = blkdev_issue_flush(device->bdev, GFP_NOFS, NULL);
+   device->last_flush_error = ret;
+   complete(>flush_wait);
 }
 
 /*
  * trigger flushes for one the devices.  If you pass wait == 0, the flushes are
  * sent down.  With wait == 1, it waits for the previous flush.
- *
- * any device where the flush fails with eopnotsupp are flagged as not-barrier
- * capable
  */
 static int write_dev_flush(struct btrfs_device *device, int wait)
 {
-   struct bio *bio;
-   int ret = 0;
-
if (wait) {
-   bio = device->flush_bio;
-   if (!bio)
-   return 0;
+   int ret;
 
wait_for_completion(>flush_wait);
-
-   if (bio->bi_error) {
-   ret = bio->bi_error;
+   ret = device->last_flush_error;
+   if (ret)
btrfs_dev_stat_inc_and_print(device,
-   BTRFS_DEV_STAT_FLUSH_ERRS);
-   }
-
-   /* drop the reference from the wait == 0 run */
-   bio_put(bio);
-   device->flush_bio = NULL;
-
+   BTRFS_DEV_STAT_FLUSH_ERRS);
return ret;
}
 
-   /*
-* one reference for us, and we leave it for the
-* caller
-*/
-   device->flush_bio = NULL;
-   bio = btrfs_io_bio_alloc(GFP_NOFS, 0);
-   if (!bio)
-   return -ENOMEM;
-
-   bio->bi_end_io = btrfs_end_empty_barrier;
-   bio->bi_bdev = device->bdev;
-   bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
init_completion(>flush_wait);
-   bio->bi_private = >flush_wait;
-   device->flush_bio = bio;
-
-   bio_get(bio);
-   btrfsic_submit_bio(bio);
+   INIT_WORK(>flush_work, btrfs_dev_issue_flush);
+   schedule_work(>flush_work);
 
return 0;
 }
@@ -3587,9 +3559,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
if (!dev->in_fs_metadata || !dev->writeable)
continue;
 
-   ret = write_dev_flush(dev, 0);
-   if (ret)
-   errors_send++;
+   write_dev_flush(dev, 0);
}
 
/* wait for all the barriers */
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index fa0b79422695..1168b78c5f1d 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -123,8 +123,9 @@ struct btrfs_device {
struct list_head resized_list;
 
/* for sending down flush barriers */
-   struct bio *flush_bio;
struct completion flush_wait;
+   struct work_struct flush_work;
+   int last_flush_error;
 
/* per-device scrub information */
struct scrub_ctx *scrub_device;
-- 
2.10.0



[PATCH] btrfs: delete unused member nobarriers

2017-03-31 Thread Anand Jain
Signed-off-by: Anand Jain 
---
 fs/btrfs/disk-io.c | 3 ---
 fs/btrfs/volumes.h | 1 -
 2 files changed, 4 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 08b74daf35d0..9de35bca1f67 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3521,9 +3521,6 @@ static int write_dev_flush(struct btrfs_device *device, 
int wait)
struct bio *bio;
int ret = 0;
 
-   if (device->nobarriers)
-   return 0;
-
if (wait) {
bio = device->flush_bio;
if (!bio)
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 59be81206dd7..fa0b79422695 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -123,7 +123,6 @@ struct btrfs_device {
struct list_head resized_list;
 
/* for sending down flush barriers */
-   int nobarriers;
struct bio *flush_bio;
struct completion flush_wait;
 
-- 
2.10.0



Do different btrfs volumes compete for CPU?

2017-03-31 Thread Marat Khalili
Approximately 16 hours ago I've run a script that deleted >~100 
snapshots and started quota rescan on a large USB-connected btrfs volume 
(5.4 of 22 TB occupied now). Quota rescan only completed just now, with 
100% load from [btrfs-transacti] throughout this period, which is 
probably ~ok depending on your view on things.
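
(The script was essentially of this shape; the paths below are 
stand-ins, not the real ones:)

  # delete a batch of snapshots, then rescan qgroup accounting:
  for snap in /mnt/usb/snapshots/daily-2016*; do
          btrfs subvolume delete "$snap"
  done
  btrfs quota rescan /mnt/usb       # runs in the background; -s shows its status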


What worries me is innocent process using _another_, SATA-connected 
btrfs volume that hung right after I started my script and took >30 
minutes to be sigkilled. There's nothing interesting in the kernel log, 
and attempts to attach strace to the process output nothing, but I of 
course suspect that it freezed on disk operation.


I wonder:
1) Can there be a contention for CPU or some mutexes between kernel 
btrfs threads belonging to different volumes?
2) If yes, can anything be done about it other than mounting volumes 
from (different) VMs?




$ uname -a; btrfs --version
Linux host 4.4.0-66-generic #87-Ubuntu SMP Fri Mar 3 15:29:05 UTC 2017 
x86_64 x86_64 x86_64 GNU/Linux

btrfs-progs v4.4


--

With Best Regards,
Marat Khalili
