force btrfs to release underlying block device(s)
I've run into a frustrating problem with a btrfs volume just now.

I have a USB drive with many partitions, two of which are luks encrypted and can be unlocked as a single, multi-device btrfs volume. For some reason the drive logically disconnected at the USB protocol level, but not physically, and then it reconnected. This caused the mount point to be removed at the vfs layer; however, I could not close the luks devices.

Looking in /sys/fs/btrfs, I see a directory with the UUID of the offending volume, which shows the luks devices under the devices directory. So I presume the btrfs module is still holding references to the block devices, not allowing them to be closed.

I know I can do a "dmsetup remove --force" to force closing the luks devices, but I doubt that will cause the btrfs module to release the offending block devices. So if I do that and then open the luks devices again and try to remount the btrfs volume, I'm guessing insanity will ensue. I can't unload/reload the btrfs module because the root fs, among others, is using it. Obviously I can reboot, but that's a Windows solution.

Anyone have a solution to this issue? Is anyone looking into ways to prevent this from happening? I think this situation should be trivial to reproduce.

Any help would be welcome,

Glenn

PS. I'm on a 4.10 kernel

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
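As an aside, the sysfs state Glenn describes can be inspected without touching the devices at all. A minimal read-only sketch, assuming the /sys/fs/btrfs/&lt;UUID&gt;/devices layout he mentions (the glob pattern relies on UUID-named directories and is an assumption, not part of the original report):

```shell
# List every btrfs filesystem the kernel still has registered, and the
# block devices it is holding references to (read-only inspection only).
for fs in /sys/fs/btrfs/*-*-*-*-*; do
    [ -d "$fs" ] || continue            # no btrfs volumes registered
    echo "btrfs UUID: ${fs##*/}"
    for dev in "$fs"/devices/*; do
        [ -e "$dev" ] && echo "  holds: ${dev##*/}"
    done
done
```

If a UUID still shows up here after the filesystem was torn down at the vfs layer, that is consistent with the module holding block-device references, as Glenn suspects.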
Re: Shrinking a device - performance?
Indeed, that does make sense. It's the output of the size command in the Berkeley format of "text", not decimal, octal or hex.

Out of curiosity about kernel module sizes, I dug up some old MacBooks and looked around in /System/Library/Extensions/[modulename].kext/Contents/MacOS:

udf is 637K on Mac OS 10.6
exfat is 75K on Mac OS 10.9
msdosfs is 79K on Mac OS 10.9
ntfs is 394K (that must be Paragon's ntfs for Mac)

And here are the kernel extension sizes for zfs (from OpenZFS), under /Library/Extensions/[modulename].kext/Contents/MacOS:

zfs is 1.7M (10.9)
spl is 247K (10.9)

Different kernel from Linux, of course (evidently a "mish mash" of NextStep, BSD, Mach and Apple's own code), but that is one large kernel extension for zfs. If they are somehow comparable even with the differences, 833K is not bad for btrfs compared to zfs. I did not look at the format of the file; it must be binary, but compression may be optional for third party kexts.

So the kernel module sizes are large for both btrfs and zfs. Given the feature sets of both, is that surprising?

My favourite kernel extension in Mac OS X is:

/System/Library/Extensions/Dont Steal Mac OS X.kext/

Subtle, very subtle.

Gordon

On Fri, Mar 31, 2017 at 9:42 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> GWB posted on Fri, 31 Mar 2017 19:02:40 -0500 as excerpted:
>
>> It is confusing, and now that I look at it, more than a little funny.
>> Your use of xargs returns the size of the kernel module for each of the
>> filesystem types. I think I get it now: you are pointing to how large
>> the kernel module for btrfs is compared to other file system kernel
>> modules, 833 megs (piping find through xargs to sed). That does not
>> mean the btrfs kernel module can accommodate an upper limit of a command
>> line length that is 833 megs. It is just a very big loadable kernel
>> module.
>
> Umm... 833 K, not M, I believe. (The unit is bytes not KiB.)
>
> Because if just one kernel module is nearing a gigabyte, then the kernel
> must be many gigabytes either monolithic or once assembled in memory, and
> it just ain't so.
>
> But FWIW megs was my first-glance impression too, until my brain said "No
> way! Doesn't work!" and I took a second look.
>
> The kernel may indeed no longer fit on a 1.44 MB floppy, but it's still
> got a ways to go before it's multiple GiB! =:^) While they're
> XZ-compressed, I'm still fitting several monolithic-build kernels
> including their appended initramfs, along with grub, its config and
> modules, and a few other misc things, in a quarter-GB dup-mode btrfs,
> meaning 128 MiB capacity, including the 16 MiB system chunk so 112 MiB
> for data and metadata. That simply wouldn't be possible if the kernel
> itself were multi-GB, even uncompressed. Even XZ isn't /that/ good!
>
> --
> Duncan - List replies preferred. No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master." Richard Stallman
Re: Shrinking a device - performance?
GWB posted on Fri, 31 Mar 2017 19:02:40 -0500 as excerpted:

> It is confusing, and now that I look at it, more than a little funny.
> Your use of xargs returns the size of the kernel module for each of the
> filesystem types. I think I get it now: you are pointing to how large
> the kernel module for btrfs is compared to other file system kernel
> modules, 833 megs (piping find through xargs to sed). That does not
> mean the btrfs kernel module can accommodate an upper limit of a command
> line length that is 833 megs. It is just a very big loadable kernel
> module.

Umm... 833 K, not M, I believe. (The unit is bytes not KiB.)

Because if just one kernel module is nearing a gigabyte, then the kernel must be many gigabytes either monolithic or once assembled in memory, and it just ain't so.

But FWIW megs was my first-glance impression too, until my brain said "No way! Doesn't work!" and I took a second look.

The kernel may indeed no longer fit on a 1.44 MB floppy, but it's still got a ways to go before it's multiple GiB! =:^) While they're XZ-compressed, I'm still fitting several monolithic-build kernels including their appended initramfs, along with grub, its config and modules, and a few other misc things, in a quarter-GB dup-mode btrfs, meaning 128 MiB capacity, including the 16 MiB system chunk so 112 MiB for data and metadata. That simply wouldn't be possible if the kernel itself were multi-GB, even uncompressed. Even XZ isn't /that/ good!

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
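The on-disk module sizes being compared in this exchange can be eyeballed with a pipeline along these lines (a sketch only; module paths and compression suffixes vary by distro and kernel build, and the thread's 833K figure came from the `size` command's text segment, which this file-size approximation will not match exactly):

```shell
# Show the on-disk size (KiB) of each filesystem module shipped with
# the running kernel, largest last; btrfs.ko is typically near the top.
find "/lib/modules/$(uname -r)/kernel/fs" -name '*.ko*' -print0 2>/dev/null |
    xargs -0 -r du -k 2>/dev/null |
    sort -n |
    tail -n 10
```

The `-r` flag keeps xargs from running du at all if find matches nothing (e.g. on a monolithic kernel with no modules installed).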
Re: Do different btrfs volumes compete for CPU?
Marat Khalili posted on Fri, 31 Mar 2017 15:28:20 +0300 as excerpted:

>> and that if you try the same thing with one of the filesystems being
>> for instance ext4, you'll see the same problem there as well
>
> Not sure if it's possible to reproduce the problem with ext4, since it's
> not possible to perform such extensive metadata operations there, and
> simply moving large amount of data never created any problems for me
> regardless of filesystem.

Try ext4 as the one hosting the innocent process... And you said moving large amounts of data never triggered problems, but were you doing that over USB?

As for the knobs I mentioned... I'm not particularly sure about the knobs on USB, but...

For instance, on my old PCI-X (pre-PCIE) server board, the BIOS had a setting for the size of PCI transfers. Given that each transfer has an effectively fixed overhead and the bus itself has a maximum bandwidth, the tradeoff (actually reasonably common elsewhere as well) was between high thruput at larger transfer sizes (due to lower transfer overhead), at the expense of interactivity and other processes having to wait for the transfer to complete, and better interactivity and shorter waits on a full bus at lower transfer sizes, at the expense of thruput due to higher transfer overhead.

I was having trouble with music cutouts and tried various Linux and ALSA settings to no avail, but once I set the BIOS to a much lower PCI transfer size, everything functioned much more smoothly: not just the music, but the mouse, less waiting on disk reads (because the writes were shorter), etc.

I /think/ the USB knobs are all in the kernel, but believe there are similar transfer size knobs there, if you know where to look.

Beyond that, there are more generic IO knobs as listed below, but if it was CPU, not IO, blocking, then they might not help in this context; still, it's worth knowing about them, particularly the dirty_* stuff mentioned last, anyway.

(USB is much more CPU intensive than most transfer buses, one reason Intel pushed it so hard as opposed to, say, firewire, which offloads far more to the bus hardware and thus isn't as CPU intensive. So the USB knobs may well be worth investigating even if it was CPU. I just wish I knew more about them.)

There's also the IO scheduler. CFQ has long been the default, but you might try deadline, and there's now multiqueue-deadline (aka MQ deadline) as well. NoOp is occasionally recommended for certain SSD use-cases, but it's not appropriate for spinning rust. Of course most of the schedulers have detail knobs you can twist too, but I'm not sufficiently knowledgeable about those to say much about them.

And 4.10 introduced the block-device writeback throttling global option (BLK_WBT), along with separate options underneath it for single-queue and multi-queue writeback throttling. I turned those on here, but as most of my system's on fast ssd, I didn't notice, nor did I expect to notice, much difference. However, in theory it could make quite some difference with USB-based storage, particularly slow thumb-drives and spinning rust.

Last but certainly not least, as it can make quite a difference, and indeed did make a difference here back when I was on spinning rust, there's the dirty-data write-caching, typically configured via the distro's sysctl mechanism, but which can be manually configured via the /proc/sys/vm/dirty_* files. The writeback-throttling features mentioned above may eventually reduce the need to tweak these, but until they're in commonly deployed kernels, tweaking these settings can make QUITE a big difference, because the percentage-of-RAM defaults were configured back in the day when 64 MB of RAM was big, and they simply aren't appropriate to modern systems with often double-digit GiB RAM.

I'll skip the details here as there are plenty of writeups on the web about tweaking these, as well as kernel text-file documentation, but you may want to look into this if you haven't, because as I said it can make a HUGE difference in effective system interactivity.

That's what I know of. I'd be a lot more comfortable with things if someone else had confirmed my original post, as I'm not a dev, just a btrfs user and list regular, but I do know we've not had a lot of reports of this sort of problem posted, and when we have in the past and it was actually separate btrfss, it turned out it was /not/ btrfs, so I'm /reasonably/ sure about it.

I also run multiple btrfs here and haven't seen the issue, but they're all on the same pair of partitioned, quite fast ssds on SATA, so the comparison is admittedly of highly limited value.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
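The dirty_* knobs Duncan refers to live under vm.* sysctls. A sysctl.conf fragment like the following caps the write cache at absolute byte values instead of the old percentage-of-RAM defaults; the file name and the specific numbers here are illustrative assumptions for a machine with lots of RAM and slow USB storage, not recommendations from the thread:

```
# /etc/sysctl.d/99-writeback.conf (illustrative values only)
# Cap dirty page cache in bytes rather than percent of RAM; writing the
# *_bytes form automatically zeroes the corresponding *_ratio knob.
vm.dirty_background_bytes = 67108864    # start background writeback at 64 MiB
vm.dirty_bytes = 268435456              # block writers at 256 MiB of dirty data
```

Smaller values mean writeback starts sooner and stalls are shorter but thruput drops, exactly the same tradeoff Duncan describes for the PCI transfer-size BIOS setting.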
Re: Shrinking a device - performance?
It is confusing, and now that I look at it, more than a little funny. Your use of xargs returns the size of the kernel module for each of the filesystem types. I think I get it now: you are pointing to how large the kernel module for btrfs is compared to other file system kernel modules, 833 megs (piping find through xargs to sed). That does not mean the btrfs kernel module can accommodate an upper limit of a command line length that is 833 megs. It is just a very big loadable kernel module.

So same question, but different expression: what is the significance of the large size of the btrfs kernel module? Is it that the larger the module, the more complex, the more prone to breakage, and the more difficult to debug? Is the hfsplus kernel module less complex, and more robust? What did the file system designers of hfsplus (or udf) know better (or worse?) than the file system designers of btrfs?

VAX/VMS clusters just aren't happy outside of a deeply hidden bunker running 9 machines in a cluster from one storage device connected by myrinet over 500 miles to the next cluster. I applaud the move to x86, but like I wrote earlier, time has moved on. I suppose weird is in the eye of the beholder, but yes, when dial up was king and disco pants roamed the earth, they were nice. I don't think x86 is a viable use case even for OpenVMS. If you really need a VAX/VMS cluster, chances are you already have one running with a continuous uptime of more than a decade, and you have already upgraded and changed out every component several times by cycling down one machine in the cluster at a time.

Gordon

On Fri, Mar 31, 2017 at 3:27 PM, Peter Grandi wrote:
>> [ ... ] what the significance of the xargs size limits of
>> btrfs might be. [ ... ] So what does it mean that btrfs has a
>> higher xargs size limit than other file systems? [ ... ] Or
>> does the lower capacity for argument length for hfsplus
>> demonstrate it is the superior file system for avoiding
>> breakage? [ ... ]
>
> That confuses, as my understanding of command argument size
> limit is that it is a system, not filesystem, property, and for
> example can be obtained with 'getconf _POSIX_ARG_MAX'.
>
>> Personally, I would go back to fossil and venti on Plan 9 for
>> an archival data server (using WORM drives),
>
> In an ideal world we would be using Plan 9. Not necessarily with
> Fossil and Venti. As to storage/backup/archival, Linux based
> options are not bad, even if the platform is far messier than
> Plan 9 (or some other alternatives). BTW I just noticed with a
> search that AWS might be offering Plan 9 hosts :-).
>
>> and VAX/VMS cluster for an HA server. [ ... ]
>
> Uhmmm, however nice it was, it was fairly weird. An IA32 or
> AMD64 port has been promised however :-).
>
> https://www.theregister.co.uk/2016/10/13/openvms_moves_slowly_towards_x86/
[GIT PULL] Btrfs
Hi Linus,

We have 3 small fixes queued up in my for-linus-4.11 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.11

Goldwyn Rodrigues (1) commits (+7/-7):
    btrfs: Change qgroup_meta_rsv to 64bit

Dan Carpenter (1) commits (+6/-1):
    Btrfs: fix an integer overflow check

Liu Bo (1) commits (+31/-21):
    Btrfs: bring back repair during read

Total: (3) commits (+44/-29)

 fs/btrfs/ctree.h     |  2 +-
 fs/btrfs/disk-io.c   |  2 +-
 fs/btrfs/extent_io.c | 46 --
 fs/btrfs/inode.c     |  6 +++---
 fs/btrfs/qgroup.c    | 10 +-
 fs/btrfs/send.c      |  7 ++-
 6 files changed, 44 insertions(+), 29 deletions(-)
Re: [PATCH v2] btrfs: drop the nossd flag when remounting with -o ssd
On 03/31/2017 10:43 PM, Adam Borowski wrote:
> On Fri, Mar 31, 2017 at 10:24:57PM +0200, Hans van Kranenburg wrote:
>>
>> Yes, but we're not doing the same thing here.
>>
>> You have a file via a loop mount. If I do that, I get the same output as
>> you show, the right messages when I remount ssd and nossd.
>>
>> My test was lvm based on an ssd. When I mount that, I get the "detected
>> SSD devices, enabling SSD mode", and everytime I remount, being it ssd
>> or nossd, it *always* says "use ssd allocation scheme".
>>
>> So, this needs some more research I guess. It doesn't feel right.
>
> I can't reproduce:
>
> [~]# cat /proc/swaps
> Filename     Type        Size     Used  Priority
> /dev/sda2    partition   8822780  0     -1
> [~]# swapoff /dev/sda2
> [~]# mkfs.btrfs -f /dev/sda2
> ...
> [ 2459.856819] BTRFS info (device sda2): detected SSD devices, enabling SSD mode
> [ 2459.857699] BTRFS info (device sda2): creating UUID tree
> [ 2477.234868] BTRFS info (device sda2): not using ssd allocation scheme
> [ 2477.234873] BTRFS info (device sda2): disk space caching is enabled
> [ 2482.306649] BTRFS info (device sda2): use ssd allocation scheme
> [ 2482.306654] BTRFS info (device sda2): disk space caching is enabled
> [ 2483.618578] BTRFS info (device sda2): not using ssd allocation scheme
> [ 2483.618583] BTRFS info (device sda2): disk space caching is enabled
>
> Same partition on lvm:
> [ 2813.259749] BTRFS info (device dm-0): detected SSD devices, enabling SSD mode
> [ 2813.260586] BTRFS info (device dm-0): creating UUID tree
> [ 2827.131076] BTRFS info (device dm-0): not using ssd allocation scheme
> [ 2827.131081] BTRFS info (device dm-0): disk space caching is enabled
> [ 2828.618841] BTRFS info (device dm-0): use ssd allocation scheme
> [ 2828.618845] BTRFS info (device dm-0): disk space caching is enabled
> [ 2829.546796] BTRFS info (device dm-0): not using ssd allocation scheme
> [ 2829.546801] BTRFS info (device dm-0): disk space caching is enabled
> [ 2833.770787] BTRFS info (device dm-0): use ssd allocation scheme
> [ 2833.770792] BTRFS info (device dm-0): disk space caching is enabled
>
> Seems to flip back and forth correctly for me.
>
> Are you sure you have this patch applied?

Oh ok, that's with the patch. The output I show is without the patch. If it does my output without the patch instead, and the right output with it applied, then the puzzle pieces are in the right place again.

--
Hans van Kranenburg
Re: [PATCH v2] btrfs: drop the nossd flag when remounting with -o ssd
On Fri, Mar 31, 2017 at 10:24:57PM +0200, Hans van Kranenburg wrote:
> >>> How did you test this?
> >>>
> >>> This was also my first thought, but here's a weird thing:
> >>>
> >>> -# mount -o nossd /dev/sdx /mnt/btrfs/
> >>>
> >>> BTRFS info (device sdx): not using ssd allocation scheme
> >>>
> >>> -# mount -o remount,ssd /mnt/btrfs/
> >>>
> >>> BTRFS info (device sdx): use ssd allocation scheme
> >>>
> >>> -# mount -o remount,nossd /mnt/btrfs/
> >>>
> >>> BTRFS info (device sdx): use ssd allocation scheme
> >>>
> >>> That means that the case Opt_nossd: is never reached when doing this?
> >
> > Seems to work for me:
> >
> > [/tmp]# mount -onoatime foo /mnt/vol1
> > [  619.436745] BTRFS: device fsid 954fd6c3-b3ce-4355-b79a-60ece7a6a4e0 devid 1 transid 5 /dev/loop0
> > [  619.438625] BTRFS info (device loop0): disk space caching is enabled
> > [  619.438627] BTRFS info (device loop0): has skinny extents
> > [  619.438629] BTRFS info (device loop0): flagging fs with big metadata feature
> > [  619.441989] BTRFS info (device loop0): creating UUID tree
> > [/tmp]# mount -oremount,ssd /mnt/vol1
> > [  629.755584] BTRFS info (device loop0): use ssd allocation scheme
> > [  629.755589] BTRFS info (device loop0): disk space caching is enabled
> > [/tmp]# mount -oremount,nossd /mnt/vol1
> > [  633.675867] BTRFS info (device loop0): not using ssd allocation scheme
> > [  633.675872] BTRFS info (device loop0): disk space caching is enabled
>
> Yes, but we're not doing the same thing here.
>
> You have a file via a loop mount. If I do that, I get the same output as
> you show, the right messages when I remount ssd and nossd.
>
> My test was lvm based on an ssd. When I mount that, I get the "detected
> SSD devices, enabling SSD mode", and everytime I remount, being it ssd
> or nossd, it *always* says "use ssd allocation scheme".
>
> So, this needs some more research I guess. It doesn't feel right.

I can't reproduce:

[~]# cat /proc/swaps
Filename     Type        Size     Used  Priority
/dev/sda2    partition   8822780  0     -1
[~]# swapoff /dev/sda2
[~]# mkfs.btrfs -f /dev/sda2
...
[ 2459.856819] BTRFS info (device sda2): detected SSD devices, enabling SSD mode
[ 2459.857699] BTRFS info (device sda2): creating UUID tree
[ 2477.234868] BTRFS info (device sda2): not using ssd allocation scheme
[ 2477.234873] BTRFS info (device sda2): disk space caching is enabled
[ 2482.306649] BTRFS info (device sda2): use ssd allocation scheme
[ 2482.306654] BTRFS info (device sda2): disk space caching is enabled
[ 2483.618578] BTRFS info (device sda2): not using ssd allocation scheme
[ 2483.618583] BTRFS info (device sda2): disk space caching is enabled

Same partition on lvm:

[ 2813.259749] BTRFS info (device dm-0): detected SSD devices, enabling SSD mode
[ 2813.260586] BTRFS info (device dm-0): creating UUID tree
[ 2827.131076] BTRFS info (device dm-0): not using ssd allocation scheme
[ 2827.131081] BTRFS info (device dm-0): disk space caching is enabled
[ 2828.618841] BTRFS info (device dm-0): use ssd allocation scheme
[ 2828.618845] BTRFS info (device dm-0): disk space caching is enabled
[ 2829.546796] BTRFS info (device dm-0): not using ssd allocation scheme
[ 2829.546801] BTRFS info (device dm-0): disk space caching is enabled
[ 2833.770787] BTRFS info (device dm-0): use ssd allocation scheme
[ 2833.770792] BTRFS info (device dm-0): disk space caching is enabled

Seems to flip back and forth correctly for me.

Are you sure you have this patch applied?

> >> Adding the 'nossd_spread' would be good to have, even if it might be
> >> just a marginal usecase.
>
> Please no, don't make it more complex if not needed.
>
> > Not sure if there's much point. In any case, that's a separate patch.
> > Should I add one while we're here?
>
> Since the whole ssd thing is a bit of a joke actually, I'd rather see it
> replaced with an option to choose an extent allocator algorithm.
>
> The amount of if statements using these SSD flags in btrfs in the kernel
> can be counted on one hand, and what they actually do is quite
> questionable (food for another mail thread).

Ok, let's fix only existing options for now then.

--
⢀⣴⠾⠻⢶⣦⠀ Meow!
⣾⠁⢠⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋⠀ Collisions shmolisions, let's see them find a collision or second
⠈⠳⣄ preimage for double rot13!
Re: Shrinking a device - performance?
> [ ... ] what the significance of the xargs size limits of
> btrfs might be. [ ... ] So what does it mean that btrfs has a
> higher xargs size limit than other file systems? [ ... ] Or
> does the lower capacity for argument length for hfsplus
> demonstrate it is the superior file system for avoiding
> breakage? [ ... ]

That confuses, as my understanding of command argument size limit is that it is a system, not filesystem, property, and for example can be obtained with 'getconf _POSIX_ARG_MAX'.

> Personally, I would go back to fossil and venti on Plan 9 for
> an archival data server (using WORM drives),

In an ideal world we would be using Plan 9. Not necessarily with Fossil and Venti. As to storage/backup/archival, Linux based options are not bad, even if the platform is far messier than Plan 9 (or some other alternatives). BTW I just noticed with a search that AWS might be offering Plan 9 hosts :-).

> and VAX/VMS cluster for an HA server. [ ... ]

Uhmmm, however nice it was, it was fairly weird. An IA32 or AMD64 port has been promised however :-).

https://www.theregister.co.uk/2016/10/13/openvms_moves_slowly_towards_x86/
Re: [PATCH v2] btrfs: drop the nossd flag when remounting with -o ssd
On 03/31/2017 10:08 PM, Adam Borowski wrote:
> And when turning on nossd, drop ssd_spread.
>
> Reported-by: Hans van Kranenburg
> Signed-off-by: Adam Borowski
> ---
> On Fri, Mar 31, 2017 at 07:10:16PM +0200, David Sterba wrote:
>> On Fri, Mar 31, 2017 at 06:00:08PM +0200, Hans van Kranenburg wrote:
>>> On 03/31/2017 05:19 PM, Adam Borowski wrote:
>>>> Not sure if setting NOSSD should also disable SSD_SPREAD, there's
>>>> currently no way to disable that option once set.
>>
>> Missing inverse of ssd_spread is probably unintentional, as we once
>> added all complementary no* options, this one was forgotten.
>>
>> And yes, nossd should turn off ssd and ssd_spread, as ssd_spread
>> without ssd does nothing anyway.
>
> Added that.
>
>>> How did you test this?
>>>
>>> This was also my first thought, but here's a weird thing:
>>>
>>> -# mount -o nossd /dev/sdx /mnt/btrfs/
>>>
>>> BTRFS info (device sdx): not using ssd allocation scheme
>>>
>>> -# mount -o remount,ssd /mnt/btrfs/
>>>
>>> BTRFS info (device sdx): use ssd allocation scheme
>>>
>>> -# mount -o remount,nossd /mnt/btrfs/
>>>
>>> BTRFS info (device sdx): use ssd allocation scheme
>>>
>>> That means that the case Opt_nossd: is never reached when doing this?
>
> Seems to work for me:
>
> [/tmp]# mount -onoatime foo /mnt/vol1
> [  619.436745] BTRFS: device fsid 954fd6c3-b3ce-4355-b79a-60ece7a6a4e0 devid 1 transid 5 /dev/loop0
> [  619.438625] BTRFS info (device loop0): disk space caching is enabled
> [  619.438627] BTRFS info (device loop0): has skinny extents
> [  619.438629] BTRFS info (device loop0): flagging fs with big metadata feature
> [  619.441989] BTRFS info (device loop0): creating UUID tree
> [/tmp]# mount -oremount,ssd /mnt/vol1
> [  629.755584] BTRFS info (device loop0): use ssd allocation scheme
> [  629.755589] BTRFS info (device loop0): disk space caching is enabled
> [/tmp]# mount -oremount,nossd /mnt/vol1
> [  633.675867] BTRFS info (device loop0): not using ssd allocation scheme
> [  633.675872] BTRFS info (device loop0): disk space caching is enabled

Yes, but we're not doing the same thing here.

You have a file via a loop mount. If I do that, I get the same output as you show, the right messages when I remount ssd and nossd.

My test was lvm based on an ssd. When I mount that, I get the "detected SSD devices, enabling SSD mode", and everytime I remount, being it ssd or nossd, it *always* says "use ssd allocation scheme".

So, this needs some more research I guess. It doesn't feel right.

>>> The fact that nossd,ssd,ssd_spread are different options complicates
>>> the whole thing, compared to e.g. autodefrag, noautodefrag.
>>
>> I think the ssd flags reflect the autodetection of ssd, unlike
>> autodefrag and others.
>
> The autodetection works for /dev/sd* and /dev/mmcblk*, but not for most
> other devices.
>
> Two examples:
> nbd to a piece of rotating rust says:
> [45697.575192] BTRFS info (device nbd0): detected SSD devices, enabling SSD mode
> loop on tmpfs (and in case it spills, all swap is on ssd):
> claims it's rotational
>
>> The ssd option says "enable the ssd mode", but it could be also
>> auto-detected if the non-rotational device is detected.
>>
>> nossd says, "do not do the autodetection, even if it's a non-rot
>> device, also disable all ssd modes".
>
> These two options are nice whenever the autodetection goes wrong.
>
>> So Adam's patch needs to be updated so NOSSD also disables SSD_SPREAD.

Ack.

> M'kay, updated this patch.
>
>> Adding the 'nossd_spread' would be good to have, even if it might be
>> just a marginal usecase.

Please no, don't make it more complex if not needed.

> Not sure if there's much point. In any case, that's a separate patch.
> Should I add one while we're here?

Since the whole ssd thing is a bit of a joke actually, I'd rather see it replaced with an option to choose an extent allocator algorithm.

The amount of if statements using these SSD flags in btrfs in the kernel can be counted on one hand, and what they actually do is quite questionable (food for another mail thread).

> Meow!
>
>  fs/btrfs/super.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 06bd9b332e18..ac1ca22d0c34 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -549,16 +549,19 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
>  		case Opt_ssd:
>  			btrfs_set_and_info(info, SSD,
>  					   "use ssd allocation scheme");
> +			btrfs_clear_opt(info->mount_opt, NOSSD);
>  			break;
>  		case Opt_ssd_spread:
>  			btrfs_set_and_info(info, SSD_SPREAD,
>  					   "use spread ssd allocation scheme");
>  			btrfs_set_opt(info->mount_opt, SSD);
> +
[PATCH v2] btrfs: drop the nossd flag when remounting with -o ssd
And when turning on nossd, drop ssd_spread. Reported-by: Hans van KranenburgSigned-off-by: Adam Borowski --- On Fri, Mar 31, 2017 at 07:10:16PM +0200, David Sterba wrote: > On Fri, Mar 31, 2017 at 06:00:08PM +0200, Hans van Kranenburg wrote: > > On 03/31/2017 05:19 PM, Adam Borowski wrote: > > > Not sure if setting NOSSD should also disable SSD_SPREAD, there's > > > currently > > > no way to disable that option once set. > > Missing inverse of ssd_spread is probably unintentional, as we once > added all complementary no* options, this one was forgotten. > > And yes, nossd should turn off ssd and ssd_spread, as ssd_spread without > ssd does not nothing anyway. Added that. > > How did you test this? > > > > This was also my first thought, but here's a weird thing: > > > > -# mount -o nossd /dev/sdx /mnt/btrfs/ > > > > BTRFS info (device sdx): not using ssd allocation scheme > > > > -# mount -o remount,ssd /mnt/btrfs/ > > > > BTRFS info (device sdx): use ssd allocation scheme > > > > -# mount -o remount,nossd /mnt/btrfs/ > > > > BTRFS info (device sdx): use ssd allocation scheme > > > > That means that the case Opt_nossd: is never reached when doing this? 
Seems to work for me: [/tmp]# mount -onoatime foo /mnt/vol1 [ 619.436745] BTRFS: device fsid 954fd6c3-b3ce-4355-b79a-60ece7a6a4e0 devid 1 transid 5 /dev/loop0 [ 619.438625] BTRFS info (device loop0): disk space caching is enabled [ 619.438627] BTRFS info (device loop0): has skinny extents [ 619.438629] BTRFS info (device loop0): flagging fs with big metadata feature [ 619.441989] BTRFS info (device loop0): creating UUID tree [/tmp]# mount -oremount,ssd /mnt/vol1 [ 629.755584] BTRFS info (device loop0): use ssd allocation scheme [ 629.755589] BTRFS info (device loop0): disk space caching is enabled [/tmp]# mount -oremount,nossd /mnt/vol1 [ 633.675867] BTRFS info (device loop0): not using ssd allocation scheme [ 633.675872] BTRFS info (device loop0): disk space caching is enabled > > The fact that nossd,ssd,ssd_spread are different options complicates the > > whole thing, compared to e.g. autodefrag, noautodefrag. > > I think the the ssd flags reflect the autodetection of ssd, unlike > autodefrag and others. The autodetection works for /dev/sd* and /dev/mmcblk*, but not for most other devices. Two examples: nbd to a piece of rotating rust says: [45697.575192] BTRFS info (device nbd0): detected SSD devices, enabling SSD mode loop on tmpfs (and in case it spills, all swap is on ssd): claims it's rotational > The ssd options says "enable the ssd mode", but it could be also > auto-detected if the non-rotational device is detected. > > nossd says, "do not do the autodetection, even if it's a non-rot > device, also disable all ssd modes". These two options are nice whenever the autodetection goes wrong. > So Adam's patch needs to be updated so NOSSD also disables SSD_SPREAD. M'kay, updated this patch. > Adding the 'nossd_spread' would be good to have, even if it might be > just a marginal usecase. Not sure if there's much point. In any case, that's a separate patch. Should I add one while we're here? Meow! 
 fs/btrfs/super.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 06bd9b332e18..ac1ca22d0c34 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -549,16 +549,19 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 		case Opt_ssd:
 			btrfs_set_and_info(info, SSD,
 					   "use ssd allocation scheme");
+			btrfs_clear_opt(info->mount_opt, NOSSD);
 			break;
 		case Opt_ssd_spread:
 			btrfs_set_and_info(info, SSD_SPREAD,
 					   "use spread ssd allocation scheme");
 			btrfs_set_opt(info->mount_opt, SSD);
+			btrfs_clear_opt(info->mount_opt, NOSSD);
 			break;
 		case Opt_nossd:
 			btrfs_set_and_info(info, NOSSD,
 					   "not using ssd allocation scheme");
 			btrfs_clear_opt(info->mount_opt, SSD);
+			btrfs_clear_opt(info->mount_opt, SSD_SPREAD);
 			break;
 		case Opt_barrier:
 			btrfs_clear_and_info(info, NOBARRIER,
-- 
2.11.0
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
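The combined effect of the three clear rules in this patch can be modeled in plain userspace C; the flag names and `apply_opt` helper below are illustrative, not btrfs's actual parser:

```c
#include <assert.h>
#include <string.h>

enum { SSD = 1, SSD_SPREAD = 2, NOSSD = 4 };

/* Userspace model of the patch's rules: ssd clears nossd, ssd_spread
 * implies ssd and clears nossd, and nossd clears both ssd flags.
 * Options are applied left to right, so the last one wins. */
static unsigned apply_opt(unsigned opts, const char *tok)
{
	if (strcmp(tok, "ssd") == 0) {
		opts |= SSD;
		opts &= ~(unsigned)NOSSD;
	} else if (strcmp(tok, "ssd_spread") == 0) {
		opts |= SSD | SSD_SPREAD;
		opts &= ~(unsigned)NOSSD;
	} else if (strcmp(tok, "nossd") == 0) {
		opts |= NOSSD;
		opts &= ~(unsigned)(SSD | SSD_SPREAD);
	}
	return opts;
}
```

With this model, `-o remount,nossd,ssd` ends with only SSD set, matching the "last one in the sequence wins" behavior discussed in the thread.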
Re: Shrinking a device - performance?
Well, now I am curious. Until we hear back from Christiane on the progress of the never ending file system shrinkage, I suppose it can't hurt to ask what the significance of the xargs size limits of btrfs might be. Or, again, if Christiane is already happily on his way to an xfs server running over lvm, skip, ignore, delete.

Here is the output of xargs --show-limits on my laptop:

<<
$ xargs --show-limits
Your environment variables take up 4830 bytes
POSIX upper limit on argument length (this system): 2090274
POSIX smallest allowable upper limit on argument length (all systems): 4096
Maximum length of command we could actually use: 2085444
Size of command buffer we are actually using: 131072
Execution of xargs will continue now...
>>

That is for a laptop system. So what does it mean that btrfs has a higher xargs size limit than other file systems? Could I theoretically use 40% of the total allowed argument length of the system for btrfs arguments alone? Would that make balance, shrinkage, etc., faster? Does the higher capacity for argument length mean btrfs is overly complex and therefore more prone to breakage? Or does the lower capacity for argument length for hfsplus demonstrate it is the superior file system for avoiding breakage? Or does it mean that hfsplus is very old (and reflects older xargs limits), and that btrfs is newer code?

I am relatively new to btrfs, and would like to find out. I am also attracted to the idea that it is better to leave some operations to the system itself, and not code them into the file system. For example, I think deduplication "off line" or "out of band" is an advantage for btrfs over zfs. But that's only for what I do. For other uses deduplication "in line", while writing the file, is preferred, and that is what zfs does (preferably with lots of memory, at least one ssd to run zil, caches, etc.).
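For what it's worth, the limits xargs prints are properties of the kernel's exec(2) argument budget, not of any filesystem; a module's on-disk size and the argument-length limit are unrelated numbers. A quick check:

```shell
# ARG_MAX is system-wide; every filesystem sees the same limit,
# so it cannot differ between btrfs and hfsplus.
arg_max=$(getconf ARG_MAX)
echo "ARG_MAX on this system: $arg_max bytes"
# POSIX requires at least _POSIX_ARG_MAX = 4096 bytes.
[ "$arg_max" -ge 4096 ] && echo "meets the POSIX minimum"
```

This matches Duncan's point later in the thread: the per-filesystem numbers being compared were kernel module sizes (in bytes), not argument limits.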
I use btrfs now because Ubuntu has it as a default in the kernel, and I assume that when (not "if") I have to use a system rescue disk (USB or CD) it will have some capacity to repair btrfs. Along the way, btrfs has been quite good as a general purpose file system on root; it makes and sends snapshots, and so far only needs an occasional scrub and balance. My earlier experience with btrfs on a 2TB drive was more complicated, but I expected that for a file system with a lot of potential but less maturity.

Personally, I would go back to fossil and venti on Plan 9 for an archival data server (using WORM drives), and VAX/VMS cluster for an HA server. But of course that no longer makes sense except for a very few usage cases. Time has moved on, prices have dropped drastically, and hardware can do a lot more per penny than it used to.

Gordon

On Fri, Mar 31, 2017 at 12:25 PM, Peter Grandi wrote:

My guess is that very complex risky slow operations like that are
provided by "clever" filesystem developers for "marketing" purposes, to
win box-ticking competitions.

> That applies to those system developers who do know better; I suspect
> that even some filesystem developers are "optimistic" as to what they
> can actually achieve.

> There are cases where there really is no other sane option. Not
> everyone has the kind of budget needed for proper HA setups,

> >>> Thanks for letting me know, that must have never occurred to
> >>> me, just as it must have never occurred to me that some
> >>> people expect extremely advanced features that imply
> >>> big-budget high-IOPS high-reliability storage to be fast and
> >>> reliable on small-budget storage too :-)

> >> You're missing my point (or intentionally ignoring it).

> > In "Thanks for letting me know" I am not missing your point, I
> > am simply pointing out that I do know that people try to run
> > high-budget workloads on low-budget storage.
> > The argument as to whether "very complex risky slow operations"
> should be provided in the filesystem itself is a very different
> one, and I did not develop it fully. But it is quite "optimistic"
> to simply state "there really is no other sane option", even
> for people that don't have "proper HA setups".
>
> Let's start by assuming, for the time being, that "very complex
> risky slow operations" are indeed feasible on very reliable high
> speed storage layers. Then the questions become:
>
> * Is it really true that "there is no other sane option" to
>   running "very complex risky slow operations" even on storage
>   that is not "big-budget high-IOPS high-reliability"?
>
> * Is it really true that it is a good idea to run "very complex
>   risky slow operations" even on "big-budget high-IOPS
>   high-reliability storage"?
>
>> Those types of operations are implemented because there are
>> use cases that actually need them, not because some developer
>> thought it would be cool. [ ... ]
>
> And this is the really crucial bit, I'll disregard without
> agreeing too much (but in part I do) with the rest of the
>
Re: Confusion about snapshots containers
On Wed, 29 Mar 2017 16:27:30 -0500, Tim Cuthbertson wrote:

> I have recently switched from multiple partitions with multiple
> btrfs's to a flat layout. I will try to keep my question concise.
>
> I am confused as to whether a snapshots container should be a normal
> directory or a mountable subvolume. I do not understand how it can be
> a normal directory while being at the same level as, for example, a
> rootfs subvolume. This is with the understanding that the rootfs is
> NOT at the btrfs top level.
>
> Which should it be, a normal directory or a mountable subvolume
> directly under btrfs top level? If either way can work, what are the
> pros and cons of each?

I think there is no exact standard you could follow. Many distributions
seem to go for the convention of prepending subvolumes with "@" if they
are meant to be mounted. However, I'm not doing so.

Generally speaking, subvolumes organize your volume into logical
containers which make sense to be snapshotted on their own. Snapshots
won't propagate to sub-subvolumes; it is important to keep that in mind
while designing your idea of a structure.

I'm using it like this: In subvol=0 I have the following subvolumes:

/*         - contains distribution specific file systems
/home      - contains home directories
/snapshots - contains snapshots I want to keep
/other     - misc stuff, i.e. a dump of the subvol structure in a txt
           - a copy of my restore script
           - some other supporting docs for restore
           - this subvolume is kept in sync with my backup volume

This means: If I mount one of the rootfs, my home will not be part of
this mount automatically because that subvolume is out of scope of the
rootfs.

Now I have the following subvolumes below these:

/gentoo/rootfs - rootfs of my main distribution

Note 1: Everything below (except subvolumes) should be maintained by the
package manager.
Note 2: currently I installed no other distributions
Note 3: I could have called it main-system-rootfs

/gentoo/usr - actually not a subvolume but a directory for volumes
shareable with other distribution instances
/gentoo/usr/portage - portage, shareable by other gentoo instances
/gentoo/usr/src - the gentoo linux kernel sources, shareable

The following are put below /gentoo/rootfs so they do not need to be
mounted separately:

/gentoo/rootfs/var/log - log volume because I don't want it snapshotted
/gentoo/rootfs/var/tmp - tmp volume because it makes no sense to snapshot it
/gentoo/rootfs/var/lib/machines - subvolume for keeping nspawn containers
/gentoo/rootfs/var/lib/machines/* - different machines cloned from each other
/gentoo/rootfs/usr/local - non-package manager stuff
/home/myuser - my user home
/home/myuser/.VirtualBox - VirtualBox machines because I want them
snapshotted separately

/etc/fstab now only mounts subvolumes outside of the scope of /gentoo/rootfs:

LABEL=system /home btrfs compress=lzo,subvol=home,noatime
LABEL=system /usr/portage btrfs noauto,compress=lzo,subvol=gentoo/usr/portage,noatime,x-systemd.automount
LABEL=system /usr/src btrfs noauto,compress=lzo,subvol=gentoo/usr/src,noatime,x-systemd.automount

Additionally, I mount subvol=0 for two special purposes:

LABEL=system /mnt/btrfs-pool btrfs noauto,compress=lzo,subvolid=0,x-systemd.automount,noatime

First: For managing all the subvolumes and having an untampered view
(without tmpfs or special purpose mounts) of the volumes.
Second: To take a clean backup of the whole system.

Now, I can give the bootloader subvol=gentoo/rootfs to select which
system to boot (or make it the default subvolume).

Maybe you get the idea and find it helpful.

PS: It can make sense to have var/lib/machines outside of the rootfs
scope if you want to share it with other distributions.

--
Regards,
Kai

Replies to list-only preferred.
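A side-effect-free sketch of standing up a flat layout like the one described above — it only prints the btrfs-progs commands rather than running them, since subvolume creation needs a mounted volume and root; paths are from the post, and `print_layout_cmds` is a made-up helper:

```shell
# print_layout_cmds: emit the commands for a flat layout with subvol=0
# mounted at /mnt/btrfs-pool, without executing anything.
print_layout_cmds() {
    pool=/mnt/btrfs-pool
    # /gentoo is created as a plain directory first, then subvolumes below it.
    printf 'mkdir -p %s\n' "$pool/gentoo"
    printf 'btrfs subvolume create %s\n' \
        "$pool/home" \
        "$pool/snapshots" \
        "$pool/other" \
        "$pool/gentoo/rootfs"
    # /gentoo/usr is deliberately a plain directory, not a subvolume:
    printf 'mkdir -p %s\n' "$pool/gentoo/usr"
}
print_layout_cmds
```

Piping the output through `sh` (as root, with the pool mounted) would perform the actual setup.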
Re: [PATCH v4 1/5] btrfs: scrub: Introduce full stripe lock for RAID56
On Fri, Mar 31, 2017 at 09:29:20AM +0800, Qu Wenruo wrote:
>
> At 03/31/2017 12:49 AM, Liu Bo wrote:
> > On Thu, Mar 30, 2017 at 02:32:47PM +0800, Qu Wenruo wrote:
> > > Unlike mirror based profiles, RAID5/6 recovery needs to read out the
> > > whole full stripe.
> > >
> > > And if we don't do proper protection, it can easily cause race
> > > conditions.
> > >
> > > Introduce 2 new functions: lock_full_stripe() and unlock_full_stripe()
> > > for RAID5/6, which store an rb_tree of mutexes for full stripes, so
> > > scrub callers can use them to lock a full stripe to avoid races.
> > >
> > > Signed-off-by: Qu Wenruo
> > > Reviewed-by: Liu Bo
> > > ---
> > >  fs/btrfs/ctree.h       |  17
> > >  fs/btrfs/extent-tree.c |  11 +++
> > >  fs/btrfs/scrub.c       | 217 +
> > >  3 files changed, 245 insertions(+)
> > >
> > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > > index 29b7fc28c607..9fe56da21fed 100644
> > > --- a/fs/btrfs/ctree.h
> > > +++ b/fs/btrfs/ctree.h
[...]
> > > +/*
> > > + * Helper to get full stripe logical from a normal bytenr.
> > > + *
> > > + * Caller must ensure @cache is a RAID56 block group.
> > > + */
> > > +static u64 get_full_stripe_logical(struct btrfs_block_group_cache *cache,
> > > +				   u64 bytenr)
> > > +{
> > > +	u64 ret;
> > > +
> > > +	/*
> > > +	 * round_down() can only handle power of 2, while RAID56 full
> > > +	 * stripe len can be 64KiB * n, so need manual round down.
> > > +	 */
> > > +	ret = (bytenr - cache->key.objectid) / cache->full_stripe_len *
> > > +	      cache->full_stripe_len + cache->key.objectid;
> >
> > Can you please use div_u64 instead? '/' would cause building errors.
>
> No problem, but I'm still curious about under which arch/compiler it
> would cause build errors?

Sorry, it should be div64_u64 since cache->full_stripe_len is (unsigned
long). The building errors might not be real, that's from my memory. But
at runtime it could end up with a 'divide error'.
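For context, the manual round-down under review can be modeled in userspace; names are illustrative. In kernel code the division would go through a helper like div64_u64(), since a plain 64-bit `/` on 32-bit architectures emits a libgcc call the kernel does not link:

```c
#include <assert.h>
#include <stdint.h>

/* Userspace model of get_full_stripe_logical(): round bytenr down to
 * the start of its full stripe. full_stripe_len can be 64KiB * n,
 * i.e. not a power of two, so a bit-mask round_down() cannot be used
 * and an explicit divide/multiply is needed. */
static uint64_t full_stripe_logical(uint64_t bg_start,
				    uint64_t full_stripe_len,
				    uint64_t bytenr)
{
	return (bytenr - bg_start) / full_stripe_len * full_stripe_len +
	       bg_start;
}
```

For example, with a block group starting at 0 and a 3-device RAID5 full stripe of 3 * 64KiB = 196608 bytes, any bytenr inside the first full stripe maps back to logical 0.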
Thanks,

-liubo

> Thanks,
> Qu
>
> > Reviewed-by: Liu Bo
> >
> > Thanks,
> >
> > -liubo
Re: [PATCH 2/7] btrfs: use simpler readahead zone lookups
On Wed, Mar 15, 2017 at 05:02:26PM +0100, David Sterba wrote:
> No point using radix_tree_gang_lookup if we're looking up just one slot.
>
> Signed-off-by: David Sterba

I've bisected to this patch, it causes a hang in btrfs/011. I'll revert
it until I find out the cause.
Re: [PATCH v3 1/5] btrfs: scrub: Introduce full stripe lock for RAID56
On Fri, Mar 31, 2017 at 10:03:28AM +0800, Qu Wenruo wrote:
>
> At 03/30/2017 06:31 PM, David Sterba wrote:
> > On Thu, Mar 30, 2017 at 09:03:21AM +0800, Qu Wenruo wrote:
> > > > > +static int lock_full_stripe(struct btrfs_fs_info *fs_info, u64 bytenr)
> > > > > +{
> > > > > +	struct btrfs_block_group_cache *bg_cache;
> > > > > +	struct btrfs_full_stripe_locks_tree *locks_root;
> > > > > +	struct full_stripe_lock *existing;
> > > > > +	u64 fstripe_start;
> > > > > +	int ret = 0;
> > > > > +
> > > > > +	bg_cache = btrfs_lookup_block_group(fs_info, bytenr);
> > > > > +	if (!bg_cache)
> > > > > +		return -ENOENT;
> > > > > +
> > >
> > > When starting to scrub a chunk, we've already increased a ref for the
> > > block group, could you please put an ASSERT to catch it?
> >
> > Personally I prefer WARN_ON() to ASSERT().
> >
> > ASSERT() always panics the module and forces us to reset the system,
> > wiping out any possibility to check the system.
>
> I think the semantics of WARN_ON and ASSERT are different, so it should
> be decided case by case which one to use. Assert is good for 'never
> happens' or catching errors at development time (wrong API use, an
> invariant condition that must always match).
>
> Also the asserts are gone if the config option is unset, while WARN_ON
> will stay in some form (verbose or not). Both are suitable for catching
> problems, but the warning is for less critical errors, so we want to
> know when it happens but still can continue.
>
> The above case looks like a candidate for ASSERT as the refcounts must
> be correct; continuing with the warning could lead to other unspecified
> problems.

I'm OK to use ASSERT() here, but the current ASSERT() in btrfs can hide
real problems if CONFIG_BTRFS_ASSERT is not set.

When CONFIG_BTRFS_ASSERT is not set, ASSERT() just does nothing, and
execution *continues*.

This forces us to build a fallback method.
> For the above case, if we simply do "ASSERT(bg_cache);" then for the
> CONFIG_BTRFS_ASSERT-not-set case (which is quite common for most
> distributions) we will cause a NULL pointer dereference.
>
> So here, we still need to do the bg_cache return value check, but just
> change "WARN_ON(1);" to "ASSERT(0);" like:
> --
> bg_cache = btrfs_lookup_block_group(fs_info, bytenr);
> if (!bg_cache) {
> 	ASSERT(0); /* WARN_ON(1); */
> 	return -ENOENT;
> }
> --
>
> Can we make ASSERT() really catch problems no matter the kernel config?
> The current ASSERT() behavior is in fact forcing us to consider both
> situations, which makes it less handy.

All agreed, I'm not very happy about how the current ASSERT is
implemented. We want to add more and potentially expensive checks during
debugging builds, but also want to make sure that code does not proceed
past some points if the invariants and expected values do not hold.
BUG_ON does that, but then we have tons of them already and some of them
are just temporary error handling, while at some other places it serves
as the sanity checker. We'd probably need a 3rd option that would behave
like BUG_ON but is named differently, so we can clearly see that it's
intentional, or we can annotate the BUG_ONs with comments.
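A userspace sketch of the behavior being debated: when the config is off, the assert compiles away entirely, so the error path must still check and return on its own. The macro shape and `DEBUG_ASSERTS` switch are illustrative stand-ins, not the kernel's exact definitions:

```c
#include <stdio.h>
#include <stdlib.h>

/* Illustrative stand-in for a CONFIG_BTRFS_ASSERT-style build switch:
 * define DEBUG_ASSERTS to get an aborting ASSERT(), otherwise no-op. */
#ifdef DEBUG_ASSERTS
#define ASSERT(expr) \
	do { \
		if (!(expr)) { \
			fprintf(stderr, "assertion failed: %s\n", #expr); \
			abort(); \
		} \
	} while (0)
#else
#define ASSERT(expr) ((void)0)
#endif

/* The fallback pattern from the thread: flag the broken invariant in
 * debug builds, but still bail out cleanly so a production build
 * never dereferences the NULL pointer. */
static int lookup_or_fail(const void *bg_cache)
{
	if (!bg_cache) {
		ASSERT(0); /* WARN_ON(1) in an earlier revision */
		return -2; /* stand-in for -ENOENT */
	}
	return 0;
}
```

Without `DEBUG_ASSERTS`, `ASSERT(0)` vanishes and only the `-ENOENT`-style return protects the caller, which is exactly why the check cannot be replaced by the assert alone.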
Re: Shrinking a device - performance?
>>> My guess is that very complex risky slow operations like
>>> that are provided by "clever" filesystem developers for
>>> "marketing" purposes, to win box-ticking competitions.

>>> That applies to those system developers who do know better;
>>> I suspect that even some filesystem developers are
>>> "optimistic" as to what they can actually achieve.

>>> There are cases where there really is no other sane
>>> option. Not everyone has the kind of budget needed for
>>> proper HA setups,

>> Thanks for letting me know, that must have never occurred to
>> me, just as it must have never occurred to me that some
>> people expect extremely advanced features that imply
>> big-budget high-IOPS high-reliability storage to be fast and
>> reliable on small-budget storage too :-)

> You're missing my point (or intentionally ignoring it).

In "Thanks for letting me know" I am not missing your point, I
am simply pointing out that I do know that people try to run
high-budget workloads on low-budget storage.

The argument as to whether "very complex risky slow operations"
should be provided in the filesystem itself is a very different
one, and I did not develop it fully. But it is quite "optimistic"
to simply state "there really is no other sane option", even
for people that don't have "proper HA setups".

Let's start by assuming, for the time being, that "very complex
risky slow operations" are indeed feasible on very reliable high
speed storage layers. Then the questions become:

* Is it really true that "there is no other sane option" to
  running "very complex risky slow operations" even on storage
  that is not "big-budget high-IOPS high-reliability"?

* Is it really true that it is a good idea to run "very complex
  risky slow operations" even on "big-budget high-IOPS
  high-reliability storage"?

> Those types of operations are implemented because there are
> use cases that actually need them, not because some developer
> thought it would be cool. [ ...
]

And this is the really crucial bit, I'll disregard without
agreeing too much (but in part I do) with the rest of the
response, as those are less important matters, and this is going
to be longer than a twitter message.

First, I agree that "there are use cases that actually need
them", and I need to explain what I am agreeing to: I believe
that computer systems, "system" in a wide sense, have what I call
"inevitable functionality", that is functionality that is not
optional, but must be provided *somewhere*: for example print
spooling is "inevitable functionality" as long as there are
multiple users, and spell checking is another example.

The only choice as to "inevitable functionality" is *where* to
provide it. For example spooling can be done among two users by
queuing jobs manually, with one saying "I am going to print now"
and the other user waiting until the print is finished, or by
using a spool program that queues jobs on the source system, or
by using a spool program that queues jobs on the target printer.
Spell checking can be done on the fly in the document processor,
batch with a tool, or manually by the document author. All these
are valid implementations of "inevitable functionality", just
with very different performance envelopes, where the "system"
includes the users as "peripherals" or "plugins" :-) in the
manual implementations.

There is no dispute from me that multiple devices,
adding/removing block devices, data compression, structural
repair, balancing, growing/shrinking, defragmentation, quota
groups, integrity checking, deduplication, ... are all in the
general case "inevitable functionality", and every non-trivial
storage system *must* implement them.
The big question is *where*: for example when I started using
UNIX the 'fsck' tool was several years away, and when the system
crashed I, like everybody, did filetree integrity checking and
structure recovery myself (with the help of 'ncheck', 'icheck'
and 'adb'); that is, 'fsck' was implemented in my head.

In the general case there are four places where such "inevitable
functionality" can be implemented:

* In the filesystem module in the kernel, for example Btrfs
  scrubbing.
* In a tool that uses hooks provided by the filesystem module in
  the kernel, for example Btrfs deduplication, 'send'/'receive'.
* In a tool, for example 'btrfsck'.
* In the system administrator.

Consider the "very complex risky slow" operation of
defragmentation; the system administrator can implement it by
dumping and reloading the volume, or a tool can implement it by
running on the unmounted filesystem, or a tool and the kernel can
implement it by using kernel module hooks, or it can be provided
entirely in the kernel module.

My argument is that providing "very complex risky slow"
maintenance operations as filesystem primitives looks awesomely
convenient, a good way to "win box-ticking competitions" for
"marketing" purposes, but is rather a bad idea for several
reasons, of varying strengths:

* Most system
Re: [PATCH] btrfs: drop the nossd flag when remounting with -o ssd
On Fri, Mar 31, 2017 at 06:00:08PM +0200, Hans van Kranenburg wrote:
> On 03/31/2017 05:19 PM, Adam Borowski wrote:
> > The opposite case was already handled right in the very next switch entry.
> >
> > Reported-by: Hans van Kranenburg
> > Signed-off-by: Adam Borowski
> > ---
> > Not sure if setting NOSSD should also disable SSD_SPREAD, there's currently
> > no way to disable that option once set.

Missing inverse of ssd_spread is probably unintentional, as we once
added all complementary no* options, this one was forgotten.

And yes, nossd should turn off ssd and ssd_spread, as ssd_spread without
ssd does nothing anyway.

> >
> >  fs/btrfs/super.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> > index 06bd9b332e18..7342399951ad 100644
> > --- a/fs/btrfs/super.c
> > +++ b/fs/btrfs/super.c
> > @@ -549,11 +549,13 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
> >  		case Opt_ssd:
> >  			btrfs_set_and_info(info, SSD,
> >  					   "use ssd allocation scheme");
> > +			btrfs_clear_opt(info->mount_opt, NOSSD);
> >  			break;
> >  		case Opt_ssd_spread:
> >  			btrfs_set_and_info(info, SSD_SPREAD,
> >  					   "use spread ssd allocation scheme");
> >  			btrfs_set_opt(info->mount_opt, SSD);
> > +			btrfs_clear_opt(info->mount_opt, NOSSD);
> >  			break;
> >  		case Opt_nossd:
> >  			btrfs_set_and_info(info, NOSSD,
>
> How did you test this?
>
> This was also my first thought, but here's a weird thing:
>
> -# mount -o nossd /dev/sdx /mnt/btrfs/
> BTRFS info (device sdx): not using ssd allocation scheme
>
> -# mount -o remount,ssd /mnt/btrfs/
> BTRFS info (device sdx): use ssd allocation scheme
>
> -# mount -o remount,nossd /mnt/btrfs/
> BTRFS info (device sdx): use ssd allocation scheme
>
> That means that the case Opt_nossd: is never reached when doing this?
>
> And... what should be the result of doing:
> -# mount -o remount,nossd,ssd /mnt/btrfs/
>
> I guess it should be that the last one in the sequence wins?

The last one wins.
> The fact that nossd,ssd,ssd_spread are different options complicates the
> whole thing, compared to e.g. autodefrag, noautodefrag.

I think the ssd flags reflect the autodetection of ssd, unlike
autodefrag and others.

The ssd option says "enable the ssd mode", but it could be also
auto-detected if a non-rotational device is detected.

nossd says, "do not do the autodetection, even if it's a non-rot
device, also disable all ssd modes".

The manual page is not entirely clear about that, I'll update it.

So Adam's patch needs to be updated so NOSSD also disables SSD_SPREAD.

Adding the 'nossd_spread' would be good to have, even if it might be
just a marginal usecase.
Re: [PATCH] btrfs: drop the nossd flag when remounting with -o ssd
On 03/31/2017 05:19 PM, Adam Borowski wrote:
> The opposite case was already handled right in the very next switch entry.
>
> Reported-by: Hans van Kranenburg
> Signed-off-by: Adam Borowski
> ---
> Not sure if setting NOSSD should also disable SSD_SPREAD, there's currently
> no way to disable that option once set.
>
>  fs/btrfs/super.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 06bd9b332e18..7342399951ad 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -549,11 +549,13 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
>  		case Opt_ssd:
>  			btrfs_set_and_info(info, SSD,
>  					   "use ssd allocation scheme");
> +			btrfs_clear_opt(info->mount_opt, NOSSD);
>  			break;
>  		case Opt_ssd_spread:
>  			btrfs_set_and_info(info, SSD_SPREAD,
>  					   "use spread ssd allocation scheme");
>  			btrfs_set_opt(info->mount_opt, SSD);
> +			btrfs_clear_opt(info->mount_opt, NOSSD);
>  			break;
>  		case Opt_nossd:
>  			btrfs_set_and_info(info, NOSSD,

How did you test this?

This was also my first thought, but here's a weird thing:

-# mount -o nossd /dev/sdx /mnt/btrfs/
BTRFS info (device sdx): not using ssd allocation scheme

-# mount -o remount,ssd /mnt/btrfs/
BTRFS info (device sdx): use ssd allocation scheme

-# mount -o remount,nossd /mnt/btrfs/
BTRFS info (device sdx): use ssd allocation scheme

That means that the case Opt_nossd: is never reached when doing this?

And... what should be the result of doing:
-# mount -o remount,nossd,ssd /mnt/btrfs/

I guess it should be that the last one in the sequence wins?

The fact that nossd,ssd,ssd_spread are different options complicates the
whole thing, compared to e.g. autodefrag, noautodefrag.

--
Hans van Kranenburg
Btrfs progs release 4.10.2
Hi,

btrfs-progs version 4.10.2 has been released. More build breakages fixed
and some minor updates.

Changes:
* check: lowmem mode fix for false alert about lost backrefs
* convert: minor bugfix
* library: fix build, missing symbols, added tests

Tarballs: https://www.kernel.org/pub/linux/kernel/people/kdave/btrfs-progs/
Git: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git

Shortlog:

David Sterba (4):
      btrfs-progs: library-test: add all exported headers
      btrfs-progs: add prefix to message helpers
      btrfs-progs: update CHANGES for v4.10.2
      Btrfs progs v4.10.2

Qu Wenruo (4):
      btrfs-progs: Cleanup kernel-shared dir when execute make clean
      btrfs-progs: convert: Add missing return for HOLE mode when checking convert image
      btrfs-progs: check: lowmem, fix false alert about backref lost for SHARED_DATA_REF
      btrfs-progs: tests: Add SHARED_DATA_REF test image for check lowmem mode

Sergei Trofimovich (1):
      btrfs-progs: fix missing __error symbol in libbtrfs.so.0
[PATCH] btrfs: drop the nossd flag when remounting with -o ssd
The opposite case was already handled right in the very next switch entry.

Reported-by: Hans van Kranenburg
Signed-off-by: Adam Borowski
---
Not sure if setting NOSSD should also disable SSD_SPREAD, there's currently
no way to disable that option once set.

 fs/btrfs/super.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 06bd9b332e18..7342399951ad 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -549,11 +549,13 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 		case Opt_ssd:
 			btrfs_set_and_info(info, SSD,
 					   "use ssd allocation scheme");
+			btrfs_clear_opt(info->mount_opt, NOSSD);
 			break;
 		case Opt_ssd_spread:
 			btrfs_set_and_info(info, SSD_SPREAD,
 					   "use spread ssd allocation scheme");
 			btrfs_set_opt(info->mount_opt, SSD);
+			btrfs_clear_opt(info->mount_opt, NOSSD);
 			break;
 		case Opt_nossd:
 			btrfs_set_and_info(info, NOSSD,
-- 
2.11.0
WARN splat fs/btrfs/qgroup.c
Hi, While doing a regular kernel build I triggered the following splat on a vanilla v4.11-rc4 kernel. [73253.814880] WARNING: CPU: 20 PID: 631 at fs/btrfs/qgroup.c:2472 btrfs_qgroup_free_refroot+0x154/0x180 [btrfs] [73253.814880] Modules linked in: st(E) sr_mod(E) cdrom(E) nfsv3(E) nfs_acl(E) rpcsec_gss_krb5(E) auth_rpcgss(E) nfsv4(E) dns_resolver(E) nfs(E) lockd(E) grace(E) fscache(E) ebtable_filter(E) ebtables(E) ip6table_filter(E) ip6_tables(E) iptable_filter(E) ip_tables(E) x_tables(E) af_packet(E) iscsi_ibft(E) iscsi_boot_sysfs(E) msr(E) ext4(E) crc16(E) jbd2(E) mbcache(E) intel_rapl(E) sb_edac(E) edac_core(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) igb(E) ghash_clmulni_intel(E) iTCO_wdt(E) joydev(E) pcbc(E) aesni_intel(E) ipmi_ssif(E) aes_x86_64(E) ptp(E) iTCO_vendor_support(E) crypto_simd(E) glue_helper(E) pps_core(E) lpc_ich(E) ioatdma(E) pcspkr(E) dca(E) mfd_core(E) cryptd(E) i2c_i801(E) ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E) [73253.814893] wmi(E) shpchp(E) button(E) sunrpc(E) btrfs(E) hid_generic(E) xor(E) usbhid(E) raid6_pq(E) sd_mod(E) mgag200(E) i2c_algo_bit(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) ttm(E) isci(E) ehci_pci(E) ahci(E) ehci_hcd(E) libsas(E) crc32c_intel(E) scsi_transport_sas(E) libahci(E) drm(E) usbcore(E) libata(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) autofs4(E) [73253.814901] CPU: 20 PID: 631 Comm: btrfs-transacti Tainted: GW E 4.11.0-rc4-92.11-default+ #2 [73253.814901] Hardware name: Intel Corporation SandyBridge Platform/To be filled by O.E.M., BIOS RMLCRB.86I.R1.25.D670.1303141058 03/14/2013 [73253.814902] Call Trace: [73253.814903] dump_stack+0x63/0x87 [73253.814905] __warn+0xd1/0xf0 [73253.814906] warn_slowpath_null+0x1d/0x20 [73253.814915] btrfs_qgroup_free_refroot+0x154/0x180 [btrfs] [73253.814923] __btrfs_run_delayed_refs.constprop.73+0x309/0x1300 
[btrfs] [73253.814932] btrfs_run_delayed_refs+0x7e/0x2e0 [btrfs] [73253.814941] btrfs_commit_transaction+0x39/0x950 [btrfs] [73253.814948] ? start_transaction+0xaa/0x490 [btrfs] [73253.814956] transaction_kthread+0x18a/0x1c0 [btrfs] [73253.814958] kthread+0x101/0x140 [73253.814965] ? btrfs_cleanup_transaction+0x4f0/0x4f0 [btrfs] [73253.814966] ? kthread_park+0x90/0x90 [73253.814967] ret_from_fork+0x2c/0x40

Any ideas?

Thanks,
Davidlohr
Re: Shrinking a device - performance?
On 2017-03-30 11:55, Peter Grandi wrote:

My guess is that very complex risky slow operations like that are provided by "clever" filesystem developers for "marketing" purposes, to win box-ticking competitions. That applies to those system developers who do know better; I suspect that even some filesystem developers are "optimistic" as to what they can actually achieve.

There are cases where there really is no other sane option. Not everyone has the kind of budget needed for proper HA setups,

Thanks for letting me know, that must have never occurred to me, just as it must have never occurred to me that some people expect extremely advanced features that imply big-budget high-IOPS high-reliability storage to be fast and reliable on small-budget storage too :-)

You're missing my point (or intentionally ignoring it). Those types of operations are implemented because there are use cases that actually need them, not because some developer thought it would be cool. The one possible counter-example of this is XFS, which doesn't support shrinking the filesystem at all, but that was a conscious decision because their target use case (very large scale data storage) does not need that feature, and not implementing it allows them to make certain other parts of the filesystem faster.

and if you need maximal uptime and as a result have to reprovision the system online, then you pretty much need a filesystem that supports online shrinking.

That's a bigger topic than we can address here. The topic used to be known in one related domain as "Very Large Databases", which were defined as databases so large and critical that the time needed for maintenance and backup was too long for taking them offline etc.; that is a topic that has largely vanished from discussion, I guess because most management just don't want to hear it :-).

No, it's mostly vanished because of changes in best current practice.
That was a topic in an era where the only platform that could handle high availability was VMS, and software wasn't routinely written to handle things like load balancing. As a result, people ran a single system which hosted the database, and if that went down, everything went down. By contrast, it's rare these days outside of small companies to see singly hosted databases that aren't specific to the local system, and once you start parallelizing on the system level, backup and maintenance times generally go down.

Also, it's not really all that slow on most filesystems; BTRFS is just hurt by its comparatively poor performance, and the COW metadata updates that are needed.

Btrfs in realistic situations has pretty good speed *and* performance, and COW actually helps, as it often results in less head repositioning than update-in-place. What makes it a bit slower with metadata is having 'dup' by default, to recover from especially damaging bitflips in metadata, but then that does not impact performance, only speed.

I and numerous other people have done benchmarks running single metadata and single data profiles on BTRFS, and it consistently performs worse than XFS and ext4 even under those circumstances. It's not horrible performance (it's better, for example, than trying the same workload on NTFS on Windows), but it's still not what most people would call 'high' performance or speed.

That feature set is arguably not appropriate for VM images, but lots of people know better :-).

That depends on a lot of factors. I have no issues personally running small VM images on BTRFS, but I'm also running on decent SSD's (>500MB/s read and write speeds), using sparse files, and keeping on top of managing them. [ ... ]

Having (relatively) big-budget high-IOPS storage for high-IOPS workloads helps, that must have never occurred to me either :-).
It's not big budget: the SSD's in question are at best mid-range consumer SSD's that cost only marginally more than a decent hard drive, and they really don't get all that great performance in terms of IOPS because they're all on the same cheap SATA controller. The point I was trying to make (which I should have been clearer about) is that they have good bulk throughput, which means that the OS can do much more aggressive writeback caching, which in turn means that COW and fragmentation have less impact.

XFS and 'ext4' are essentially equivalent, except for the fixed-size inode table limitation of 'ext4' (and XFS reportedly has finer grained locking). Btrfs is nearly as good as either on most workloads in single-device mode [ ... ]

No, if you look at actual data, [ ... ]

Well, I have looked at actual data in many published but often poorly made "benchmarks", and to me they seem quite equivalent indeed, within somewhat differently shaped performance envelopes, so the results depend on the testing point within that envelope. I have done my own simplistic data gathering, most recently here:
Re: Shrinking a device - performance?
>> [ ... ] CentOS, Redhat, and Oracle seem to take the position
>> that very large data subvolumes using btrfs should work
>> fine. But I would be curious what the rest of the list thinks
>> about 20 TiB in one volume/subvolume.

> To be sure I'm a biased voice here, as I have multiple
> independent btrfs on multiple partitions here, with no btrfs
> over 100 GiB in size, and that's on ssd so maintenance
> commands normally return in minutes or even seconds,

That's a bit extreme I think, as there are downsides to having too many small volumes, too.

> not the hours to days or even weeks it takes on multi-TB btrfs
> on spinning rust.

Or months :-).

> But FWIW... 1) Don't put all your data eggs in one basket,
> especially when that basket isn't yet entirely stable and
> mature.

Really good point here.

> A mantra commonly repeated on this list is that btrfs is still
> stabilizing,

My impression is that most 4.x and later versions are very reliable for "base" functionality, that is excluding multi-device, compression, qgroups, ... Put another way, what scratches the Facebook itches works well :-).

> [ ... ] the time/cost/hassle-factor of the backup, and being
> practically prepared to use them, is even *MORE* important
> than it is on fully mature and stable filesystems.

Indeed, or at least *different* filesystems. I backup JFS filesystems to XFS ones, and Btrfs filesystems to NILFS2 ones, for example.

> 2) Don't make your filesystems so large that any maintenance
> on them, including both filesystem maintenance like btrfs
> balance/scrub/check/ whatever, and normal backup and restore
> operations, takes impractically long,

As per my preceding post, that's the big deal, but so many people "know better" :-).

> where "impractically" can be reasonably defined as so long it
> discourages you from doing them in the first place and/or so
> long that it's going to cause unwarranted downtime.

That's the "Very Large DataBase" level of trouble.
> Some years ago, before I started using btrfs and while I was
> using mdraid, I learned this one the hard way. I had a bunch
> of rather large mdraids setup, [ ... ]

I have recently seen another much "funnier" example: people who "know better" and follow every cool trend decide to consolidate their server farm on VMs, backed by a storage server with a largish single pool of storage holding the virtual disk images of all the server VMs. They look like geniuses until the storage pool system crashes, and a minimal integrity check on restart takes two days, during which the whole organization is without access to any email, files, databases, ...

> [ ... ] And there was a good chance it was /not/ active and
> mounted at the time of the crash and thus didn't need
> repaired, saving that time entirely! =:^)

As to that, I have switched to using 'autofs' to mount volumes only on access, using a simple script that turns '/etc/fstab' into an automounter dynamic map, which means that most of the time most volumes on my (home) systems are not mounted: http://www.sabi.co.uk/blog/anno06-3rd.html?060928#060928

> Eventually I arranged things so I could keep root mounted
> read-only unless I was updating it, and that's still the way I
> run it today.

The ancient way was, instead of having '/' RO and '/var' RW, to have '/' RW and '/usr' RO (so for example it could be shared across many systems via NFS etc.), and while both are good ideas, I prefer the ancient way. But then some people who know better are moving to merge '/' with '/usr' without understanding the history and the advantages.

> [ ... ] If it's multiple TBs, chances are it's going to be
> faster to simply blow away and recreate from backup, than it
> is to try to repair... [ ... ]

Or to shrink or defragment or dedup etc., except on very high IOPS-per-TB storage.

> [ ... ] how much simpler it would have been had they had an
> independent btrfs of say a TB or two for each system they were
> backing up.
That is the general alternative to a single large pool/volume: sharding/chunking of filetrees, sometimes (as with Lustre or Ceph etc.) with a "metafilesystem" layer on top. Done manually, my suggestion is to do the sharding per-week (or other suitable period) rather than per-system, in a circular "crop rotation" scheme, so that once a volume has been filled it becomes read-only and can even be unmounted until it needs to be reused: http://www.sabi.co.uk/blog/12-fou.html?121218b#121218b

Then there is the problem that "a TB or two" is less easy with increasing disk capacities, but then I think that disks with a capacity larger than 1TB are not suitable for ordinary workloads, and are more for tape-cartridge-like usage.

> What would they have done had the btrfs gone bad and needed
> repaired? [ ... ]

In most cases I have seen of designs aimed at achieving the lowest cost and highest flexibility "low IOPS single pool" at the expense of scalability and maintainability, the "clever" designer had been promoted or had
Re: Do different btrfs volumes compete for CPU?
Thank you very much for the reply and suggestions, more comments below. Still, is there a definite answer to the root question: are different btrfs volumes independent in terms of CPU, or are there some shared workers that can be a point of contention?

> What would have been interesting would have been if you had any
> reports from for instance htop during that time, showing wait
> percentage on the various cores and status (probably D, disk-wait)
> of the innocent process. iotop output would of course have been even
> better, but also rather more special-case so less commonly installed.

Curiously, I have had iotop but not htop running. [btrfs-transacti] had some low-level activity in iotop (I still assume it was CPU-limited); the innocent process did not have any activity anywhere. Next time I'll also take notice of the process state in ps (sadly, my omission).

> I believe you will find that the problem isn't btrfs, but rather,
> I/O contention

This possibility did not come to my mind. Can USB drivers still be that bad in 4.4? Is there any way to discriminate between these two situations (btrfs vs. USB load)? BTW, the USB adapter used is this one (though the storage array only supports USB 3.0): https://www.asus.com/Motherboard-Accessory/USB_31_TYPEA_CARD/

> and that if you try the same thing with one of the filesystems
> being for instance ext4, you'll see the same problem there as well

Not sure if it's possible to reproduce the problem with ext4, since it's not possible to perform such extensive metadata operations there, and simply moving large amounts of data never created any problems for me regardless of filesystem.

--
With Best Regards,
Marat Khalili
Re: Do different btrfs volumes compete for CPU?
Marat Khalili posted on Fri, 31 Mar 2017 10:05:20 +0300 as excerpted:

> Approximately 16 hours ago I've run a script that deleted >~100
> snapshots and started quota rescan on a large USB-connected btrfs volume
> (5.4 of 22 TB occupied now). Quota rescan only completed just now, with
> 100% load from [btrfs-transacti] throughout this period, which is
> probably ~ok depending on your view on things.
>
> What worries me is innocent process using _another_, SATA-connected
> btrfs volume that hung right after I started my script and took >30
> minutes to be sigkilled. There's nothing interesting in the kernel log,
> and attempts to attach strace to the process output nothing, but I of
> course suspect that it freezed on disk operation.
>
> I wonder:
> 1) Can there be a contention for CPU or some mutexes between kernel
> btrfs threads belonging to different volumes?
> 2) If yes, can anything be done about it other than mounting volumes
> from (different) VMs?
>
>> $ uname -a; btrfs --version
>> Linux host 4.4.0-66-generic #87-Ubuntu SMP
>> Fri Mar 3 15:29:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
>> btrfs-progs v4.4

What would have been interesting would have been if you had any reports from for instance htop during that time, showing wait percentage on the various cores and status (probably D, disk-wait) of the innocent process. iotop output would of course have been even better, but also rather more special-case so less commonly installed.

I believe you will find that the problem isn't btrfs, but rather, I/O contention, and that if you try the same thing with one of the filesystems being for instance ext4, you'll see the same problem there as well, which because the two filesystems are then not the same type should well demonstrate that it's not a problem at the filesystem level, but rather elsewhere.
USB is infamous for being an I/O bottleneck, slowing things down both for it, and on less than perfectly configured systems, often for data access on other devices as well. SATA can and does do similar things too, but because it tends to be more efficient in general, it doesn't tend to make things as drastically bad for as long as USB can.

There are some knobs you can twist for better interactivity, but I need to be up to go to work in a couple hours so will leave it to other posters to make suggestions in that regard at this point.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
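For reference, the knobs alluded to here are mostly the kernel writeback sysctls, which cap how much dirty page cache a slow device can accumulate before writers are throttled. A sketch of a starting point (the values are illustrative guesses only and need tuning per machine, not a recommendation):

```
# /etc/sysctl.d/99-writeback.conf -- illustrative values only
# Start background writeback early so a slow USB device cannot absorb
# gigabytes of dirty pages that later stall unrelated I/O:
vm.dirty_background_bytes = 67108864    # background writeback from 64 MiB
# Hard-throttle writers well before the default percentage-based limit:
vm.dirty_bytes = 268435456              # block writers at 256 MiB dirty
```

There is also a per-device limit, /sys/class/bdi/<major:minor>/max_ratio, which can be lowered for the USB device specifically so it cannot monopolize the global dirty-page budget.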
Re: Shrinking a device - performance?
> Can you try to first dedup the btrfs volume? This is probably
> out of date, but you could try one of these: [ ... ] Yep,
> that's probably a lot of work. [ ... ] My recollection is that
> btrfs handles deduplication differently than zfs, but both of
> them can be very, very slow

But the big deal there is that dedup is indeed a very expensive operation, even worse than 'balance'. A balanced, deduped volume will shrink faster in most cases, but the time taken is simply moved from shrinking to preparing.

> Again, I'm not an expert in btrfs, but in most cases a full
> balance and scrub takes care of any problems on the root
> partition, but that is a relatively small partition. A full
> balance (without the options) and scrub on 20 TiB must take a
> very long time even with robust hardware, would it not?

There have been reports of several months for volumes of that size subject to ordinary workload.

> CentOS, Redhat, and Oracle seem to take the position that very
> large data subvolumes using btrfs should work fine.

This is a long-standing controversy, and for example there have been "interesting" debates on the XFS mailing list. Btrfs in this is not really different from the others, with one major difference in context: many Btrfs developers work for a company that relies on large numbers of small servers, to the point that fixing multidevice issues has not been a priority.

The controversy over large volumes is that while no doubt the logical structures of recent filesystem types can support single volumes of many petabytes (or even much larger), and such volumes have indeed been created and "work"-ish, so they are unquestionably "syntactically valid", the tradeoffs involved, especially as to maintainability, may mean that they don't "work" well and sustainably so.
The fundamental issue is metadata: while the logical structures, using 48-64 bit pointers, unquestionably scale "syntactically", they don't scale pragmatically when considering whole-volume maintenance like checking, repair, balancing, scrubbing, indexing (which includes making incremental backups etc.).

Note: large volumes don't have just a speed problem for whole-volume operations, they also have a memory problem, as most tools hold an in-memory copy of the metadata. There have been cases where indexing or repair of a volume required a lot more RAM (many hundreds of GiB or some TiB) than was available in the system on which the volume was being used.

The problem is of course smaller if the large volume contains mostly large files, and bigger if the volume is stored on low IOPS-per-TB devices and used on small-memory systems. But even with large files, where filetree object metadata (inodes etc.) are relatively few, space metadata must eventually, at least potentially, resolve down to single sectors, and that can be a lot of metadata unless both used and free space are very unfragmented.

The fundamental technological issue is: *data* IO rates, in both random and sequential IOPS, can be scaled "almost" linearly by parallelizing them using RAID or equivalent, allowing large volumes to serve scalably large and parallel *data* workloads; but *metadata* IO rates cannot be easily parallelized, because metadata structures are graphs, not arrays of bytes like files. So a large volume on 100 storage devices can serve in parallel a significant percentage of 100 times the data workload of a small volume on 1 storage device, but not so much for the metadata workload. For example, I have never seen a parallel 'fsck' tool that can take advantage of 100 storage devices to complete a scan of a single volume spanning them in not much longer time than the scan of a volume on 1 of those storage devices.
> But I would be curious what the rest of the list thinks about
> 20 TiB in one volume/subvolume.

Personally I think that while volumes of many petabytes "work" syntactically, there are serious maintainability problems (which I have seen happen at a number of sites) with volumes larger than 4TB-8TB with any current local filesystem design.

That depends also on the number/size of storage devices, and their nature, that is IOPS, as after all metadata workloads do scale a bit with the number of available IOPS, even if far more slowly than data workloads. For example, I think that an 8TB volume is not desirable on a single 8TB disk for ordinary workloads (but then I think that disks above 1-2TB are just not suitable for ordinary filesystem workloads), while with lots of smaller/faster disks a 12TB volume would probably be acceptable, and a number of flash SSDs might make even a 20TB volume acceptable. Of course there are lots of people who know better. :-)
Re: Shrinking a device - performance?
>>> The way btrfs is designed I'd actually expect shrinking to
>>> be fast in most cases. [ ... ]

>> The proposed "move whole chunks" implementation helps only if
>> there are enough unallocated chunks "below the line". If regular
>> 'balance' is done on the filesystem there will be some, but that
>> just spreads the cost of the 'balance' across time, it does not
>> by itself make a «risky, difficult, slow operation» any less so,
>> just spreads the risk, difficulty, slowness across time.

> Isn't that too pessimistic?

Maybe, it depends on the workload impacting the volume and how much it churns the free/unallocated situation.

> Most of my filesystems have 90+% of free space unallocated,
> even those I never run balance on.

That seems quite lucky to me, as it is definitely not my experience or even my expectation in the general case: on my laptop and desktop, with relatively few updates, I have to run 'balance' fairly frequently, and "Knorrie" has produced a nice tool that draws a graphical map of free vs. unallocated space, and most examples and users find quite a bit of balancing needs to be done.

> For me it wouldn't just spread the cost, it would reduce it
> considerably.

In your case the cost of the implicit or explicit 'balance' simply does not arise because 'balance' is not necessary, and then moving whole chunks is indeed cheap. The argument here is in part whether used space (extents) or allocated space (chunks) is more fragmented, as well as the amount of metadata to update in either case.
Re: Fwd: Confusion about snapshots containers
On 2017-03-30 09:07, Tim Cuthbertson wrote: On Wed, Mar 29, 2017 at 10:46 PM, Duncan <1i5t5.dun...@cox.net> wrote: Tim Cuthbertson posted on Wed, 29 Mar 2017 18:20:52 -0500 as excerpted:

So, another question... Do I then leave the top level mounted all the time for snapshots, or should I create them, send them to external storage, and umount until next time?

Keep in mind that because snapshots contain older versions of whatever they're snapshotting, they're a potential security issue, at least if some of those older versions are libs or binaries. Consider the fact that you may have had security updates since the snapshot, thus leaving your working copies unaffected by whatever security vulns the updates fixed. If the old versions remain around where normal users have access to them, particularly if they're setuid or similar, they have access to those old and now known vulns in setuid executables! (Of course users can grab vulnerable versions elsewhere or build them themselves, but they can't set them setuid root unless they /are/ root, so finding an existing setuid-root executable with known vulns is finding the keys to the kingdom.)

So keeping snapshots unmounted and out of the normally accessible filesystem tree by default is recommended, at least if you're at all concerned about someone untrusted getting access to a normal user account and being able to use snapshots with known vulns of setuid executables as root-escalation methods.

Another possibility is setting the snapshot subdir 700 perms, so non-super-users can't normally access anything in it anyway. Of course that's a problem if you want them to have access to snapshots of their own stuff for recovery purposes, but it's useful if you can do it.

Good admins will do both of these at once if possible, as they know and observe the defense-in-depth mantra, knowing all too well how easily a single layer of defense yields to fat-fingering or previously unknown vulns.

-- Duncan - List replies preferred.
Thank you, Duncan. I will try to take all that into consideration. These are really just fairly simple personal home systems, but security is still important.

On the note of the old binaries and libraries bit: nodev, noexec, and nosuid are all per-mountpoint, not per-volume, so you can mitigate some of the risk by always mounting with those flags. Despite that, it's still a good idea to not have anything more than you need mounted at any given time (it's a lot harder to screw up a filesystem which isn't mounted).
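To make that concrete, a sketch of an /etc/fstab entry applying those flags to a snapshot area (the UUID placeholder, mount point, and subvolume name are hypothetical, not from this thread):

```
# Hypothetical example: keep the snapshot subvolume unmounted by default
# (noauto), and mount it read-only with nodev,noexec,nosuid so stale
# setuid binaries and device nodes inside old snapshots are neutralized.
UUID=<fs-uuid>  /mnt/snapshots  btrfs  subvol=snapshots,ro,nodev,noexec,nosuid,noauto  0  0
```

Combined with 700 permissions on /mnt/snapshots itself, this covers both of the mitigations discussed above.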
[PATCH 4/4 V2] btrfs: cleanup barrier_all_devices() to check dev stat flush error
The objective of this patch is to clean up barrier_all_devices() so that the error checking is in a separate loop, independent of the loop which submits and waits on the device flush requests. By doing this it helps to further develop patches which would tune the error-actions as needed.

Signed-off-by: Anand Jain
---
V2: Now the flush error return is saved and checked instead of the checkpoint of the dev_stat method earlier.

 fs/btrfs/disk-io.c | 32 ++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index f8f534a32c2f..b6d047250ce2 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3535,6 +3535,23 @@ static int write_dev_flush(struct btrfs_device *device, int wait)
 	return 0;
 }
 
+static int check_barrier_error(struct btrfs_fs_devices *fsdevs)
+{
+	int dropouts = 0;
+	struct btrfs_device *dev;
+
+	list_for_each_entry_rcu(dev, &fsdevs->devices, dev_list) {
+		if (!dev->bdev || dev->last_flush_error)
+			dropouts++;
+	}
+
+	if (dropouts >
+	    fsdevs->fs_info->num_tolerated_disk_barrier_failures)
+		return -EIO;
+
+	return 0;
+}
+
 /*
  * send an empty flush down to each device in parallel,
  * then wait for them
@@ -3572,8 +3589,19 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 		if (write_dev_flush(dev, 1))
 			dropouts++;
 	}
-	if (dropouts > info->num_tolerated_disk_barrier_failures)
-		return -EIO;
+
+	/*
+	 * A slight optimization, we check for dropouts here which avoids
+	 * a dev list loop when disks are healthy.
+	 */
+	if (dropouts) {
+		/*
+		 * As we need holistic view of the failed disks, so
+		 * error checking is pushed to a separate loop.
+		 */
+		return check_barrier_error(info->fs_devices);
+	}
+
 	return 0;
}
-- 
2.10.0
[PATCH 3/4 V2] btrfs: cleanup barrier_all_devices() unify dev error count
Now when counting the number of error devices we don't need to count them separately during send and wait, because the device errors counted during send are more of a static check. Also kindly note that as of now there is no code which would set dev->bdev = NULL unless the device is missing. However I still kept bdev == NULL counted towards error devices in view of future enhancements. And as the device_list_mutex is held when barrier_all_devices() is called, I don't expect a bdev to become null between send and wait. In this process I also rename error_wait to dropouts.

Signed-off-by: Anand Jain
---
V2: As write_dev_flush() with wait=0 is always successful after the previous patch, ret is now removed.

 fs/btrfs/disk-io.c | 18 ++
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 42bcf98794ec..f8f534a32c2f 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3543,19 +3543,15 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 {
 	struct list_head *head;
 	struct btrfs_device *dev;
-	int errors_send = 0;
-	int errors_wait = 0;
-	int ret;
+	int dropouts = 0;
 
 	/* send down all the barriers */
 	head = &info->fs_devices->devices;
 	list_for_each_entry_rcu(dev, head, dev_list) {
 		if (dev->missing)
 			continue;
-		if (!dev->bdev) {
-			errors_send++;
+		if (!dev->bdev)
 			continue;
-		}
 		if (!dev->in_fs_metadata || !dev->writeable)
 			continue;
@@ -3567,18 +3563,16 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 		if (dev->missing)
 			continue;
 		if (!dev->bdev) {
-			errors_wait++;
+			dropouts++;
 			continue;
 		}
 		if (!dev->in_fs_metadata || !dev->writeable)
 			continue;
 
-		ret = write_dev_flush(dev, 1);
-		if (ret)
-			errors_wait++;
+		if (write_dev_flush(dev, 1))
+			dropouts++;
 	}
-	if (errors_send > info->num_tolerated_disk_barrier_failures ||
-	    errors_wait > info->num_tolerated_disk_barrier_failures)
+	if (dropouts > info->num_tolerated_disk_barrier_failures)
 		return -EIO;
 	return 0;
 }
-- 
2.10.0
[PATCH 2/4 V2] btrfs: use blkdev_issue_flush to flush the device cache
As of now we allocate an empty bio and then use the flag REQ_PREFLUSH to flush the device cache; instead we can use blkdev_issue_flush() for this purpose. Also there is now no need to check the return when write_dev_flush() is called with wait = 0.

Signed-off-by: Anand Jain
---
V2: The title of this patch has changed from "btrfs: communicate back ENOMEM when it occurs", and it is entirely a new patch, which now uses blkdev_issue_flush().

 fs/btrfs/disk-io.c | 64 +++---
 fs/btrfs/volumes.h |  3 ++-
 2 files changed, 19 insertions(+), 48 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 9de35bca1f67..42bcf98794ec 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3498,67 +3498,39 @@ static int write_dev_supers(struct btrfs_device *device,
 	return errors < i ? 0 : -1;
 }
 
-/*
- * endio for the write_dev_flush, this will wake anyone waiting
- * for the barrier when it is done
- */
-static void btrfs_end_empty_barrier(struct bio *bio)
+static void btrfs_dev_issue_flush(struct work_struct *work)
 {
-	if (bio->bi_private)
-		complete(bio->bi_private);
-	bio_put(bio);
+	int ret;
+	struct btrfs_device *device;
+
+	device = container_of(work, struct btrfs_device, flush_work);
+
+	/* we are in the commit thread */
+	ret = blkdev_issue_flush(device->bdev, GFP_NOFS, NULL);
+	device->last_flush_error = ret;
+	complete(&device->flush_wait);
 }
 
 /*
  * trigger flushes for one the devices. If you pass wait == 0, the flushes are
  * sent down. With wait == 1, it waits for the previous flush.
- *
- * any device where the flush fails with eopnotsupp are flagged as not-barrier
- * capable
  */
 static int write_dev_flush(struct btrfs_device *device, int wait)
 {
-	struct bio *bio;
-	int ret = 0;
-
 	if (wait) {
-		bio = device->flush_bio;
-		if (!bio)
-			return 0;
+		int ret;
 
-		wait_for_completion(&device->flush_wait);
-
-		if (bio->bi_error) {
-			ret = bio->bi_error;
+		wait_for_completion(&device->flush_wait);
+		ret = device->last_flush_error;
+		if (ret)
 			btrfs_dev_stat_inc_and_print(device,
-					BTRFS_DEV_STAT_FLUSH_ERRS);
-		}
-
-		/* drop the reference from the wait == 0 run */
-		bio_put(bio);
-		device->flush_bio = NULL;
-
+				BTRFS_DEV_STAT_FLUSH_ERRS);
 		return ret;
 	}
 
-	/*
-	 * one reference for us, and we leave it for the
-	 * caller
-	 */
-	device->flush_bio = NULL;
-	bio = btrfs_io_bio_alloc(GFP_NOFS, 0);
-	if (!bio)
-		return -ENOMEM;
-
-	bio->bi_end_io = btrfs_end_empty_barrier;
-	bio->bi_bdev = device->bdev;
-	bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
 	init_completion(&device->flush_wait);
-	bio->bi_private = &device->flush_wait;
-	device->flush_bio = bio;
-
-	bio_get(bio);
-	btrfsic_submit_bio(bio);
+	INIT_WORK(&device->flush_work, btrfs_dev_issue_flush);
+	schedule_work(&device->flush_work);
 
 	return 0;
 }
@@ -3587,9 +3559,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 		if (!dev->in_fs_metadata || !dev->writeable)
 			continue;
 
-		ret = write_dev_flush(dev, 0);
-		if (ret)
-			errors_send++;
+		write_dev_flush(dev, 0);
 	}
 
 	/* wait for all the barriers */
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index fa0b79422695..1168b78c5f1d 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -123,8 +123,9 @@ struct btrfs_device {
 	struct list_head resized_list;
 
 	/* for sending down flush barriers */
-	struct bio *flush_bio;
 	struct completion flush_wait;
+	struct work_struct flush_work;
+	int last_flush_error;
 
 	/* per-device scrub information */
 	struct scrub_ctx *scrub_device;
-- 
2.10.0
[PATCH] btrfs: delete unused member nobarriers
Signed-off-by: Anand Jain
---
 fs/btrfs/disk-io.c | 3 ---
 fs/btrfs/volumes.h | 1 -
 2 files changed, 4 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 08b74daf35d0..9de35bca1f67 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3521,9 +3521,6 @@ static int write_dev_flush(struct btrfs_device *device, int wait)
 	struct bio *bio;
 	int ret = 0;
 
-	if (device->nobarriers)
-		return 0;
-
 	if (wait) {
 		bio = device->flush_bio;
 		if (!bio)
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 59be81206dd7..fa0b79422695 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -123,7 +123,6 @@ struct btrfs_device {
 	struct list_head resized_list;
 
 	/* for sending down flush barriers */
-	int nobarriers;
 	struct bio *flush_bio;
 	struct completion flush_wait;
-- 
2.10.0
Do different btrfs volumes compete for CPU?
Approximately 16 hours ago I ran a script that deleted >~100 snapshots and started a quota rescan on a large USB-connected btrfs volume (5.4 of 22 TB occupied now). The quota rescan only completed just now, with 100% load from [btrfs-transacti] throughout this period, which is probably ~ok depending on your view of things.

What worries me is an innocent process using _another_, SATA-connected btrfs volume that hung right after I started my script and took >30 minutes to be sigkilled. There's nothing interesting in the kernel log, and attempts to attach strace to the process output nothing, but I of course suspect that it froze on a disk operation.

I wonder:
1) Can there be contention for CPU or some mutexes between kernel btrfs threads belonging to different volumes?
2) If yes, can anything be done about it other than mounting the volumes from (different) VMs?

$ uname -a; btrfs --version
Linux host 4.4.0-66-generic #87-Ubuntu SMP Fri Mar 3 15:29:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
btrfs-progs v4.4

--
With Best Regards,
Marat Khalili