Re: System unable to mount partition after a power loss
On 2018-12-07 01:43, Doni Crosby wrote: This is qemu-kvm? What's the cache mode being used? It's possible the usual write guarantees are thwarted by VM caching. Yes it is a proxmox host running the system so it is a qemu vm, I'm unsure on the caching situation. On the note of QEMU and the cache mode, the only cache mode I've seen to actually cause issues for BTRFS volumes _inside_ a VM is 'cache=unsafe', but that causes problems for most filesystems, so it's probably not the issue here. OTOH, I've seen issues with most of the cache modes other than 'cache=writeback' and 'cache=writethrough' when dealing with BTRFS as the back-end storage on the host system, and most of the time such issues will manifest as both problems with the volume inside the VM _and_ the volume the disk images are being stored on.
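For reference, this is roughly how the cache mode is pinned on a plain qemu-kvm command line (the image path and VM parameters are placeholders); Proxmox exposes the same per-disk 'cache' setting through its VM configuration, so the equivalent can be set there as well:
  qemu-system-x86_64 -enable-kvm -m 4096 \
      -drive file=/var/lib/vz/images/100/vm-100-disk-0.qcow2,format=qcow2,if=virtio,cache=writeback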
Re: What if TRIM issued a wipe on devices that don't TRIM?
On 2018-12-06 23:09, Andrei Borzenkov wrote: On 06.12.2018 16:04, Austin S. Hemmelgarn wrote: * On SCSI devices, a discard operation translates to a SCSI UNMAP command. As pointed out by Ronnie Sahlberg in his reply, this command is purely advisory, may not result in any actual state change on the target device, and is not guaranteed to wipe the data. To actually wipe things, you have to explicitly write bogus data to the given regions (using either regular writes, or a WRITESAME command with the desired pattern), and _then_ call UNMAP on them. The WRITE SAME command has an UNMAP bit, and depending on the device and kernel version, the kernel may actually issue either UNMAP or WRITE SAME with the UNMAP bit set when doing a discard. Good to know. I've not looked at the SCSI code much, and actually didn't know about the UNMAP bit for the WRITE SAME command, so I just assumed that the kernel only used the UNMAP command.
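For anyone who wants to check which mechanism their own kernel will use for a given SCSI disk, it is exposed through sysfs; a quick sketch (the device name is a placeholder, and the exact sysfs layout can vary a little between kernel versions):
  # Show how the kernel will implement discard for this SCSI disk
  # (typical values: "unmap", "writesame_16", "writesame_10", "disabled", "full")
  cat /sys/block/sdX/device/scsi_disk/*/provisioning_mode
  # Show the discard granularity and limits the block layer advertises
  lsblk -D /dev/sdX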
Re: What if TRIM issued a wipe on devices that don't TRIM?
On 2018-12-06 01:11, Robert White wrote: (1) Automatic and selective wiping of unused and previously used disk blocks is a good security measure, particularly when there is an encryption layer beneath the file system. (2) USB attached devices _never_ support TRIM and they are the most likely to fall into strangers' hands. Not true on the first count. Some really nice UAS devices do support SCSI UNMAP and WRITESAME commands. (3) I vaguely recall that for some flash chips, bulk writes of full sectors of 0x00 or 0xFF (I don't remember which) were second-best to TRIM for letting the flash controllers defragment their internals. So it would be dog-slow, but it would be neat if BTRFS had a mount option to convert any TRIM command from above into the write of a zero, 0xFF, or trash block to the device below if that device doesn't support TRIM. Real TRIM support would override the block write. Obviously doing an fstrim would involve a lot of slow device writes but only for people likely to do that sort of thing. For testing purposes the destruction of unused pages in this manner might catch file system failures or coding errors. (The other layer where this might be most appropriate is in cryptsetup et al, where it could lie about TRIM support, but that sort of stealth lag might be bad for filesystem-level operations. Doing it there would also lose the simpler USB use cases.) ...Just a thought... First off, TRIM is an ATA command, not the kernel term. `fstrim` inherited the ATA name, but in the kernel it's called a discard operation, and it's kind of important to understand here that a discard operation can result in a number of different behaviors. In particular, you have at least the following implementations: * On SCSI devices, a discard operation translates to a SCSI UNMAP command. As pointed out by Ronnie Sahlberg in his reply, this command is purely advisory, may not result in any actual state change on the target device, and is not guaranteed to wipe the data. To actually wipe things, you have to explicitly write bogus data to the given regions (using either regular writes, or a WRITESAME command with the desired pattern), and _then_ call UNMAP on them. * On dm-thinp devices, a discard operation results in simply unmapping the blocks in the region it covers. The underlying blocks themselves are not wiped until they get reallocated (which may not happen when you write to that region of the dm-thinp device again), and may not even be wiped then (depending on how the dm-thinp device is configured). Thus, the same behavior as for SCSI is required here. * On SD/MMC devices, a discard operation results in an SD ERASE command being issued. This one is non-advisory (that is, it's guaranteed to happen), and is supposed to guarantee an overwrite of the region with zeroes or ones. * eMMC devices additionally define a discard operation independent of the SD ERASE command which unmaps the region in the translation layer, but does not wipe the blocks either on issuing the command or on re-allocating the low-level blocks. Essentially, it's just a hint for the wear-leveling algorithm. * NVMe provides two different discard operations, and I'm not sure which the kernel uses for NVMe block emulation. They correspond almost exactly to the SCSI UNMAP and SD ERASE commands in terms of behavior. * For ATA devices, a discard operation translates to an ATA TRIM command. 
This command doesn't even require that the data read back from a region the command has been issued against be consistent between reads, let alone that it actually returns zeroes, and it is completely silent on how the device should actually implement the operation. In practice, most drives that implement it actually behave like dm-thinp devices, unmapping the low-level blocks in the region and only clearing them when they get reallocated, while returning any data they want on subsequent reads to that logical region until a write happens. * The MTD subsystem has support for discard operations in the various FTL's, and they appear from a cursory look at the code to behave like a non-advisory version of the SCSI UNMAP command (FWIW, MTD's are what the concept of a discard operation was originally implemented in Linux for). Notice that the only implementations that are actually guaranteed to clear out the low-level physical blocks are the SD ERASE and one of the two NVMe options, and all others require you to manually wipe the data before issuing the discard operation to guarantee that no data is retained. Given this, I don't think this should be done as a mechanism of intercepting or translating discard operations, but as something else entirely. Perhaps as a block-layer that wipes the region then issues a discard for it to the lower level device if the device supports it?
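If someone wants that "wipe, then discard" behavior today without a new block layer, it can be approximated by hand with util-linux; a rough sketch (the device path and range are placeholders, and this destroys data in the region it touches):
  # Confirm the device advertises discard support at all
  lsblk -D /dev/sdX
  # Explicitly zero a region first (BLKZEROOUT, i.e. real writes)...
  blkdiscard -z -o 0 -l 1G /dev/sdX
  # ...and only then issue the discard for the same region
  blkdiscard -o 0 -l 1G /dev/sdX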
Re: btrfs progs always assume devid 1?
On 2018-12-05 14:50, Roman Mamedov wrote: Hello, To migrate my FS to a different physical disk, I have added a new empty device to the FS, then ran the remove operation on the original one. Now my FS has only devid 2: Label: 'p1' uuid: d886c190-b383-45ba-9272-9f00c6a10c50 Total devices 1 FS bytes used 36.63GiB devid2 size 50.00GiB used 45.06GiB path /dev/mapper/vg-p1 And all the operations of btrfs-progs now fail to work in their default invocation, such as: # btrfs fi resize max . Resize '.' of 'max' ERROR: unable to resize '.': No such device [768813.414821] BTRFS info (device dm-5): resizer unable to find device 1 Of course this works: # btrfs fi resize 2:max . Resize '.' of '2:max' But this is inconvenient and seems to be a rather simple oversight. If what I got is normal (the device staying as ID 2 after such operation), then count that as a suggestion that btrfs-progs should use the first existing devid, rather than always looking for hard-coded devid 1. I've been meaning to try and write up a patch to special-case this for a while now, but have not gotten around to it yet. FWIW, this is one of multiple reasons that it's highly recommended to use `btrfs replace` instead of adding a new device and deleting the old one when replacing a device. Other benefits include: * It doesn't have to run in the foreground (and doesn't by default). * It usually takes less time. * Replace operations can be queried while running to get a nice indication of the completion percentage. The only disadvantage is that the new device has to be at least as large as the old one (though you can get around this to a limited degree by shrinking the old device), and it needs the old and new device to be plugged in at the same time (add/delete doesn't, if you flip the order of the add and delete commands).
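For reference, a minimal sketch of the replace-based workflow recommended above (device paths and mountpoint are placeholders):
  # Replace the old device with the new one in a single operation;
  # by default this runs in the background.
  btrfs replace start /dev/old /dev/new /mnt
  # Query completion percentage while it runs
  btrfs replace status /mnt
  # If the new device is larger, grow onto it afterwards, naming the devid
  # explicitly in case the filesystem no longer has a devid 1.
  btrfs filesystem resize 2:max /mnt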
Re: experiences running btrfs on external USB disks?
On 2018-12-04 08:37, Graham Cobb wrote: On 04/12/2018 12:38, Austin S. Hemmelgarn wrote: In short, USB is _crap_ for fixed storage, don't use it like that, even if you are using filesystems which don't appear to complain. That's useful advice, thanks. Do you (or anyone else) have any experience of using btrfs over iSCSI? I was thinking about this for three different use cases: 1) Giving my workstation a data disk that is actually a partition on a server -- keeping all the data on the big disks on the server and reducing power consumption (just a small boot SSD in the workstation). 2) Splitting a btrfs RAID1 between a local disk and a remote iSCSI mirror to provide redundancy without putting more disks in the local system. Of course, this would mean that one of the RAID1 copies would have higher latency than the other. 3) Like case 1 but actually exposing an LVM logical volume from the server using iSCSI, rather than a simple disk partition. I would then put both encryption and RAID running on the server below that logical volume. NBD could also be an alternative to iSCSI in these cases as well. Any thoughts? I've not run it over iSCSI (I tend to avoid that overly-complicated mess), but I have done it over NBD and ATAoE, as well as some more exotic arrangements, and it's really not too bad. The important part is making sure your block layer and all the stuff under it are reliable, and USB is not.
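For anyone curious what the NBD arrangement looks like in practice, a rough sketch of use case 2 (a local disk mirrored against a remote device); the export name, host name and device paths are made up, and the server is assumed to already export the device via /etc/nbd-server/config:
  # On the workstation: attach the remote export as a local block device
  modprobe nbd
  nbd-client server.example.com -N mirror /dev/nbd0
  # Build a two-device btrfs raid1 across the local partition and the NBD device
  mkfs.btrfs -m raid1 -d raid1 /dev/sdb1 /dev/nbd0
  mount /dev/sdb1 /mnt/data
Note that if the network link drops, I/O to the NBD device will stall or error, so the filesystem may need a degraded mount until the remote side is reachable again.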
Re: experiences running btrfs on external USB disks?
On 2018-12-04 00:37, Tomasz Chmielewski wrote: I'm trying to use btrfs on an external USB drive, without much success. When the drive is connected for 2-3+ days, the filesystem gets remounted readonly, with BTRFS saying "IO failure": [77760.444607] BTRFS error (device sdb1): bad tree block start, want 378372096 have 0 [77760.550933] BTRFS error (device sdb1): bad tree block start, want 378372096 have 0 [77760.550972] BTRFS: error (device sdb1) in __btrfs_free_extent:6804: errno=-5 IO failure [77760.550979] BTRFS info (device sdb1): forced readonly [77760.551003] BTRFS: error (device sdb1) in btrfs_run_delayed_refs:2935: errno=-5 IO failure [77760.553223] BTRFS error (device sdb1): pending csums is 4096 Note that there are no other kernel messages (i.e. that would indicate a problem with disk, cable disconnection etc.). The load on the drive itself can be quite heavy at times (i.e. 100% IO for 1-2 h and more) - can it contribute to the problem (i.e. btrfs thinks there is some timeout somewhere)? Running 4.19.6 right now, but was experiencing the issue also with 4.18 kernels. # btrfs device stats /data [/dev/sda1].write_io_errs 0 [/dev/sda1].read_io_errs 0 [/dev/sda1].flush_io_errs 0 [/dev/sda1].corruption_errs 0 [/dev/sda1].generation_errs 0 It looks to me like the typical USB issues that are present with almost all filesystems but only seem to be noticed by BTRFS because it does more rigorous checking of data. In short, USB is _crap_ for fixed storage, don't use it like that, even if you are using filesystems which don't appear to complain.
Re: BTRFS on production: NVR 16+ IP Cameras
On 2018-11-15 13:39, Juan Alberto Cirez wrote: Is BTRFS mature enough to be deployed on a production system to underpin the storage layer of a 16+ ipcameras-based NVR (or VMS if you prefer)? For NVR, I'd say no. BTRFS does pretty horribly with append-only workloads, even if they are WORM style. It also does a really bad job with most relational database systems that you would likely use for indexing. If you can suggest your reasoning for wanting to use BTRFS though, I can probably point you at alternatives that would work more reliably for your use case.
Re: [PATCH RFC] btrfs: harden agaist duplicate fsid
On 11/13/2018 10:31 AM, David Sterba wrote: On Mon, Oct 01, 2018 at 09:31:04PM +0800, Anand Jain wrote: + /* + * we are going to replace the device path, make sure its the + * same device if the device mounted + */ + if (device->bdev) { + struct block_device *path_bdev; + + path_bdev = lookup_bdev(path); + if (IS_ERR(path_bdev)) { + mutex_unlock(&fs_devices->device_list_mutex); + return ERR_CAST(path_bdev); + } + + if (device->bdev != path_bdev) { + bdput(path_bdev); + mutex_unlock(&fs_devices->device_list_mutex); + return ERR_PTR(-EEXIST); It would be _really_ nice to have an informative error message printed here. Aside from the possibility of an admin accidentally making a block-level copy of the volume, this code triggering could represent an attempted attack against the system, so it's arguably something that should be reported as happening. Personally, I think a WARN_ON_ONCE for this would make sense, ideally per-volume if possible. Ah. Will add a warning. Thanks, Anand The requested error message is not in the patch you posted or I have missed that (https://patchwork.kernel.org/patch/10641041/). Austin, is the following ok for you? "BTRFS: duplicate device fsid:devid for %pU:%llu old:%s new:%s\n" BTRFS: duplicate device fsid:devid 7c667b96-59eb-43ad-9ae9-c878f6ad51d8:2 old:/dev/sda6 new:/dev/sdb6 As the UUID and paths are long I tried to squeeze the rest so it's still comprehensible but this would be better confirmed. Thanks. Looks perfectly fine to me.
Re: BTRFS did it's job nicely (thanks!)
On 11/4/2018 11:44 AM, waxhead wrote: Sterling Windmill wrote: Out of curiosity, what led to you choosing RAID1 for data but RAID10 for metadata? I've flip-flopped between these two modes myself after finding out that BTRFS RAID10 doesn't work how I would've expected. Wondering what made you choose your configuration. Thanks! Sure, The "RAID"1 profile for data was chosen to maximize disk space utilization since I got a lot of mixed size devices. The "RAID"10 profile for metadata was chosen simply because it *feels* a bit faster for some of my (previous) workload which was reading a lot of small files (which I guess was embedded in the metadata). While I never remembered that I got any measurable performance increase the system simply felt smoother (which is strange since "RAID"10 should hog more disks at once). I would love to try "RAID"10 for both data and metadata, but I have to delete some files first (or add yet another drive). Would you like to elaborate a bit more yourself about how BTRFS "RAID"10 does not work as you expected? As far as I know BTRFS' version of "RAID"10 means it ensures 2 copies (1 replica) are striped over as many disks as it can (as long as there is free space). So if I am not terribly mistaken a "RAID"10 with 20 devices will stripe over (20/2) x 2 and if you run out of space on 10 of the devices it will continue to stripe over (5/2) x 2. So your stripe width varies with the available space essentially... I may be terribly wrong about this (until someone corrects me that is...) He's probably referring to the fact that instead of there being a roughly 50% chance of it surviving the failure of at least 2 devices like classical RAID10 is technically able to do, it's currently functionally 100% certain it won't survive more than one device failing.
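For anyone wanting to try a different layout, switching profiles on an existing filesystem is done with a convert balance; a minimal example, with the mountpoint and target profiles purely illustrative:
  # Convert both data and metadata to raid10 in one pass
  btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt/pool
  # Check progress from another shell
  btrfs balance status /mnt/pool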
Re: Understanding "btrfs filesystem usage"
On 10/30/2018 12:10 PM, Ulli Horlacher wrote: On Mon 2018-10-29 (17:57), Remi Gauvin wrote: On 2018-10-29 02:11 PM, Ulli Horlacher wrote: I want to know how much free space is left and have problems in interpreting the output of: btrfs filesystem usage btrfs filesystem df btrfs filesystem show In my not so humble opinion, the filesystem usage command has the easiest to understand output. It lays out all the pertinent information. You can clearly see 825GiB is allocated, with 494GiB used, therefore, filesystem show is actually using the "Allocated" value as "Used". Allocated can be thought of as "Reserved For". And what is "Device unallocated"? Not reserved? As the output of the Usage command and df command clearly show, you have almost 400GiB space available. This is the good part :-) The disparity between 498GiB used and 823GiB is pretty high. This is probably the result of using an SSD with an older kernel. If your kernel is not very recent (sorry, I forget where this was fixed, somewhere around 4.14 or 4.15), then consider mounting with the nossd option. I am running kernel 4.4 (it is an Ubuntu 16.04 system) But /local is on an SSD. Should I really use the nossd mount option?! Probably, and you may even want to use it on newer (patched) kernels. This requires some explanation though. SSDs are write-limited media (write to them too much, and they stop working). This is generally a pretty well known fact, and while it is true, it's not anywhere near as much of an issue on modern SSDs as people make it out to be (pretty much, if you've got an SSD made in the last 5 years, you almost certainly don't have to worry about this). The `ssd` code in BTRFS behaves as if this is still an issue (and does so in a way that doesn't even solve it well). Put simply, when BTRFS goes to look for space, it treats requests for space that ask for less than a certain size as if they are that minimum size, and only tries to look for smaller spots if it can't find one at least that minimum size. This has a couple of advantages in terms of write performance, especially in the common case of a mostly empty filesystem. For the default (`nossd`) case, that minimum size is 64kB. So, in most cases, the potentially wasted space actually doesn't matter much (most writes are bigger than 64k) unless you're doing certain things. For the old (`ssd`) case, that minimum size is 2MB. Even with the common cases that would normally not have an issue with the 64k default, this ends up wasting a _huge_ amount of space. For the new `ssd` behavior, the minimum is different for data and metadata (IIRC, metadata uses the 64k default, while data still uses the 2M size). This solves the biggest issues (which were seen with metadata), but doesn't completely remove the problem. Expanding on this further, some unusual workloads actually benefit from the old `ssd` behavior, so on newer kernels `ssd_spread` gives that behavior. However, many workloads actually do better with the `nossd` behavior (especially the pathological worst case stuff like databases and VM disk images), so if you have a recent SSD, you probably want to just use that. You can improve this by running a balance. Something like: btrfs balance start -dusage=55 I run balance via cron weekly (adapted from https://software.opensuse.org/package/btrfsmaintenance)
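A minimal sketch of actually applying that advice (the mountpoint comes from the thread; the fstab line and balance threshold are just examples):
  # Switch the allocator behaviour on the running system
  mount -o remount,nossd /local
  # Or make it persistent; example /etc/fstab entry:
  # UUID=<fs-uuid>  /local  btrfs  defaults,noatime,nossd  0 0
  # Reclaim space already wasted by the old ssd allocation behaviour
  btrfs balance start -dusage=55 /local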
Re: CRC mismatch
On 18/10/2018 08.02, Anton Shepelev wrote: I wrote: What may be the reason for a CRC mismatch on a BTRFS file in a virtual machine: csum failed ino 175524 off 1876295680 csum 451760558 expected csum 1446289185 Shall I seek the culprit in the host machine or in the guest one? Supposing the host machine healthy, what operations on the guest might have caused a CRC mismatch? Thank you, Austin and Chris, for your replies. While describing the problem for the client, I tried again to copy the corrupt file and this time it was copied without error, which is of course scary because errors that miraculously disappear may suddenly reappear in the same manner. If the filesystem was running some profile that supports repairs (pretty much, anything except single or raid0 profiles), then BTRFS will have fixed that particular block for you automatically. Of course, the other possibility is that it was a transient error in the block layer that caused it to return bogus data when the data that was on-disk was in fact correct.
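If you want to check the whole filesystem rather than waiting for another read to trip over a bad block, a scrub will re-verify every checksum and repair from the good copy where the profile allows it (the mountpoint is a placeholder):
  # Run a scrub in the foreground with per-device statistics
  btrfs scrub start -Bd /mnt
  # Review what was found and whether anything was corrected
  btrfs scrub status /mnt
  btrfs device stats /mnt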
Re: CRC mismatch
On 2018-10-16 16:27, Chris Murphy wrote: On Tue, Oct 16, 2018 at 9:42 AM, Austin S. Hemmelgarn wrote: On 2018-10-16 11:30, Anton Shepelev wrote: Hello, all What may be the reason for a CRC mismatch on a BTRFS file in a virtual machine: csum failed ino 175524 off 1876295680 csum 451760558 expected csum 1446289185 Shall I seek the culprit in the host machine or in the guest one? Supposing the host machine healthy, what operations on the guest might have caused a CRC mismatch? Possible causes include: * On the guest side: - Unclean shutdown of the guest system (not likely even if this did happen). - A kernel bug in the guest. - Something directly modifying the block device (also not very likely). * On the host side: - Unclean shutdown of the host system without properly flushing data from the guest. Not likely unless you're using an actively unsafe caching mode for the guest's storage back-end. - At-rest data corruption in the storage back-end. - A bug in the host-side storage stack. - A transient error in the host-side storage stack. - A bug in the hypervisor. - Something directly modifying the back-end storage. Of these, the statistically most likely location for the issue is probably the storage stack on the host. Is there still that O_DIRECT related "bug" (or more of a limitation) if the guest is using cache=none on the block device? I had actually forgotten about this, and I'm not quite sure if it's fixed or not. Anton, what virtual machine tech are you using? qemu/kvm managed with virt-manager? The configuration affects host behavior; but the negative effect manifests inside the guest as corruption. If I remember correctly.
Re: CRC mismatch
On 2018-10-16 11:30, Anton Shepelev wrote: Hello, all What may be the reason for a CRC mismatch on a BTRFS file in a virtual machine: csum failed ino 175524 off 1876295680 csum 451760558 expected csum 1446289185 Shall I seek the culprit in the host machine or in the guest one? Supposing the host machine healthy, what operations on the guest might have caused a CRC mismatch? Possible causes include: * On the guest side: - Unclean shutdown of the guest system (not likely even if this did happen). - A kernel bug in the guest. - Something directly modifying the block device (also not very likely). * On the host side: - Unclean shutdown of the host system without properly flushing data from the guest. Not likely unless you're using an actively unsafe caching mode for the guest's storage back-end. - At-rest data corruption in the storage back-end. - A bug in the host-side storage stack. - A transient error in the host-side storage stack. - A bug in the hypervisor. - Something directly modifying the back-end storage. Of these, the statistically most likely location for the issue is probably the storage stack on the host.
Re: Interpreting `btrfs filesystem show'
On 2018-10-15 10:42, Anton Shepelev wrote: Hugo Mills to Anton Shepelev: While trying to resolve free space problems, I found that I cannot interpret the output of: btrfs filesystem show Label: none uuid: 8971ce5b-71d9-4e46-ab25-ca37485784c8 Total devices 1 FS bytes used 34.06GiB devid 1 size 40.00GiB used 37.82GiB path /dev/sda2 How come the total used value is less than the value listed for the only device? "Used" on the device is the mount of space allocated. "Used" on the FS is the total amount of actual data and metadata in that allocation. You will also need to look at the output of "btrfs fi df" to see the breakdown of the 37.82 GiB into data, metadata and currently unused. See https://btrfs.wiki.kernel.org/index.php/FAQ#Understanding_free_space.2C_using_the_original_tools for the details Thank you, Hugo, understood. mount/amount is a very fitting typo :-) Does the standard `du' tool work correctly for btrfs? For the default 'physical usage' mode, it functionally does not work correctly, because it does not know about reflinks. The easiest way to see this is to create a couple of snapshots of a subvolume alongside the subvolume, and then run `du -s --total` on those snapshots and the subvolume. It will report the total space usage to be equal to the sum of the values reported for each snapshot and the subvolume, when it should instead only count the space usage for shared data once. For the 'apparent usage' mode provided by the GNU implementation, it does work correctly.
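A quick way to see the reflink effect for yourself (paths are placeholders; `btrfs filesystem du` needs a reasonably recent btrfs-progs):
  # Create two snapshots of an existing subvolume
  btrfs subvolume snapshot /data/web /data/web-snap1
  btrfs subvolume snapshot /data/web /data/web-snap2
  # Plain du counts the shared extents once per copy...
  du -sh /data/web /data/web-snap1 /data/web-snap2
  # ...while btrfs filesystem du splits usage into shared and exclusive
  btrfs filesystem du -s /data/web /data/web-snap1 /data/web-snap2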
Re: reproducible builds with btrfs seed feature
On 2018-10-13 18:28, Chris Murphy wrote: Is it practical and desirable to make Btrfs based OS installation images reproducible? Or is Btrfs simply too complex and non-deterministic? [1] The main three problems with Btrfs right now for reproducibility are: a. many objects have uuids other than the volume uuid; and mkfs only lets us set the volume uuid b. atime, ctime, mtime, otime; and no way to make them all the same c. non-deterministic allocation of file extents, compression, inode assignment, logical and physical address allocation I'm imagining reproducible image creation would be a mkfs feature that builds on Btrfs seed and --rootdir concepts to constrain Btrfs features to maybe make reproducible Btrfs volumes possible: - No raid - Either all objects needing uuids can have those uuids specified by switch, or possibly a defined set of uuids expressly for this use case, or possibly all of them can just be zeros (eek? not sure) - A flag to set all times the same - Possibly require that target block device is zero filled before creation of the Btrfs - Possibly disallow subvolumes and snapshots - Require the resulting image is seed/ro and maybe also a new compat_ro flag to enforce that such Btrfs file systems cannot be modified after the fact. - Enforce a consistent means of allocation and compression The end result is creating two Btrfs volumes would yield image files with matching hashes. So in other words, you care about matching the block layout _exactly_. This is a great idea for paranoid people, but it's usually overkill. Realistically, almost nothing in userspace cares about the block layout, worrying about it just makes verifying the reproduced image a bit easier (there's no reason you can't verify all the relevant data without doing a checksum or HMAC of the image as a whole). If I had to guess, the biggest challenge would be allocation. But it's also possible that such an image may have problems with "sprouts". A non-removable sprout seems fairly straightforward and safe; but if a "reproducible build" type of seed is removed, it seems like removal needs to be smart enough to refresh *all* uuids found in the sprout: a hard break from the seed. Competing file systems, ext4 with make_ext4 fork, and squashfs. At the moment I'm thinking it might be easier to teach squashfs integrity checking than to make Btrfs reproducible. But then I also think restricting Btrfs features, and applying some requirements to constrain Btrfs to make it reproducible, really enhances the Btrfs seed-sprout feature. Any thoughts? Useful? Difficult to implement? Squashfs might be a better fit for this use case *if* it can be taught about integrity checking. It does per file checksums for the purpose of deduplication but those checksums aren't retained for later integrity checking. I've seen projects with SquashFS that store integrity data separately but leverage other infrastructure. Methods I've seen so far include: * GPG-signed SquashFS images, usually with detached signatures * SquashFS with PAR2 integrity checking data * SquashFS on top of dm-verity * SquashFS on top of dm-integrity The first two need to be externally checked prior to mount, but doing so is not hard. The fourth is tricky to set up right, but provides better integration with encrypted images. The third does exactly what's needed though. You just use the embedded data variant of dm-verity, bind the resultant image to a loop device, activate dm-verity on the loop device, and mount the resultant mapped device like any other SquashFS image. 
I've also seen some talk of using SquashFS with IMA and IMA appraisal, but I've not seen anybody actually _do_ that, and it wouldn't be on quite the level you seem to want (it verifies the files in the image, but not the image as a whole).
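For the embedded-data dm-verity arrangement described above, here is a rough sketch of how such an image might be built and mounted; the file names, sizes and block size are assumptions, and this is only meant to show the shape of the workflow, not a hardened recipe:
  # Build the squashfs image, then pad it to a 4KiB boundary so the hash
  # tree can be appended at a clean offset
  mksquashfs rootfs/ image.squashfs
  truncate -s %4096 image.squashfs
  SIZE=$(stat -c %s image.squashfs)
  # Append the verity hash tree to the same file and capture the root hash
  veritysetup format image.squashfs image.squashfs \
      --hash-offset "$SIZE" --data-blocks $((SIZE / 4096)) | tee verity.out
  ROOT_HASH=$(awk '/^Root hash:/ {print $3}' verity.out)
  # At use time: loop-attach the image, open the verity mapping, mount read-only
  LOOP=$(losetup --find --show image.squashfs)
  veritysetup open "$LOOP" verified-root "$LOOP" "$ROOT_HASH" --hash-offset "$SIZE"
  mount -o ro /dev/mapper/verified-root /mnt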
Re: BTRFS bad block management. Does it exist?
On 2018-10-14 07:08, waxhead wrote: In case BTRFS fails to WRITE to a disk, what happens? Does the bad area get mapped out somehow? Does it try again until it succeeds or until it "times out" or reaches a threshold counter? Does it eventually try to write to a different disk (in case of using the raid1/10 profile)? Building on Qu's answer (which is absolutely correct), BTRFS makes the perfectly reasonable assumption that you're not trying to use known bad hardware. It's not alone in this respect either, pretty much every Linux filesystem makes the exact same assumption (and almost all non-Linux ones too), because it really is a perfectly reasonable assumption. The only exception is ext[234], but they only support it statically (you can set the bad block list at mkfs time, but not afterwards, and they don't update it at runtime), and it's a holdover from earlier filesystems which originated at a time when storage was sufficiently expensive _and_ unreliable that you kept using disks until they were essentially completely dead. The reality is that with modern storage hardware, if you have persistently bad sectors the device is either defective (and should be returned under warranty), or it's beyond expected EOL (and should just be replaced). Most people know about SSDs doing block remapping to avoid bad blocks, but hard drives do it too, and they're actually rather good at it. In both cases, enough spare blocks are provided that the device can handle average rates of media errors through the entirety of its average life expectancy without running out of spare blocks. On top of all of that though, it's fully possible to work around bad blocks in the block layer if you take the time to actually do it. With a bit of reasonably simple math, you can easily set up an LVM volume that actively avoids all the bad blocks on a disk while still fully utilizing the rest of the volume. Similarly, with a bit of work (and a partition table that supports _lots_ of partitions) you can work around bad blocks with an MD concatenated device.
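A sketch of the LVM approach mentioned in the last paragraph; the device, volume group name and extent numbers are made up purely for illustration, and in practice the bad extent range is derived from the bad sector addresses and the PE size:
  pvcreate /dev/sdb1
  vgcreate vg_data /dev/sdb1
  # Suppose the bad sectors fall inside physical extents 1000-1004:
  # pin them into a throwaway LV that will never be used...
  lvcreate -l 5 -n bad_blocks vg_data /dev/sdb1:1000-1004
  # ...then build the real LV from all the remaining extents
  lvcreate -l 100%FREE -n data vg_data
  mkfs.btrfs /dev/vg_data/data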
Re: Monitoring btrfs with Prometheus (and soon OpenMonitoring)
On 2018-10-07 09:37, Holger Hoffstätte wrote: The Prometheus statistics collection/aggregation/monitoring/alerting system [1] is quite popular, easy to use and will probably be the basis for the upcoming OpenMetrics "standard" [2]. Prometheus collects metrics by polling host-local "exporters" that respond to http requests; many such exporters exist, from the generic node_exporter for OS metrics to all sorts of application-/service-specific varieties. Since btrfs already exposes quite a lot of monitorable and - more importantly - actionable runtime information in sysfs it only makes sense to expose these metrics for visualization & alerting. I noodled over the idea some time ago but got sidetracked, besides not being thrilled at all by the idea of doing this in golang (which I *really* dislike). However, exporters can be written in any language as long as they speak the standard response protocol, so an alternative would be to use one of the other official exporter clients. These provide language-native "mini-frameworks" where one only has to fill in the blanks (see [3] for examples). Since the issue just came up in the node_exporter bugtracker [3] I figured I'd ask if anyone here is interested in helping build a proper standalone btrfs_exporter in C++? :D ..just kidding, I'd probably use python (which I kind of don't really know either :) and build on Hans' python-btrfs library for anything not covered by sysfs. Anybody interested in helping? Apparently there are also golang libs for btrfs [5] but I don't know anything about them (if you do, please comment on the bug), and the idea of adding even more stuff into the monolithic, already creaky and somewhat bloated node_exporter is not appealing to me. Potential problems wrt. btrfs are access to root-only information, like e.g. the btrfs device stats/errors in the aforementioned bug, since exporters are really supposed to run unprivileged due to network exposure. The S.M.A.R.T. exporter [6] solves this with dual-process contortions; obviously it would be better if all relevant metrics were accessible directly in sysfs and not require privileged access, but forking a tiny privileged process every polling interval is probably not that bad. All ideas welcome! You might be interested in what Netdata [1] is doing. We've already got tracking of space allocations via the sysfs interface (fun fact, you actually don't have to be root on most systems to read that data), and also ship some pre-defined alarms that will trigger when the device gets close to full at a low level (more specifically, if total chunk allocations exceed 90% of the total space of all the devices in the volume). Actual data collection is being done in C (Netdata already has a lot of infrastructure for parsing things out of /proc or /sys), and there has been some discussion in the past of adding collection of device error counters (I've been working on and off on it myself, but I still don't have a good enough understanding of the C code to get anything actually working yet). [1] https://my-netdata.io/
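For anyone who wants to poke at the same data an exporter would read, the per-filesystem allocation counters live under sysfs and are readable without root on most systems; a tiny sketch (the UUID is a placeholder, and exact attribute names may differ slightly between kernel versions):
  FSID=/sys/fs/btrfs/d886c190-b383-45ba-9272-9f00c6a10c50
  for type in data metadata system; do
      printf '%-9s total=%s used=%s\n' "$type" \
          "$(cat "$FSID/allocation/$type/total_bytes")" \
          "$(cat "$FSID/allocation/$type/bytes_used")"
  done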
Re: Understanding BTRFS RAID0 Performance
On 2018-10-05 20:34, Duncan wrote: Wilson, Ellis posted on Fri, 05 Oct 2018 15:29:52 + as excerpted: Is there any tuning in BTRFS that limits the number of outstanding reads at a time to a small single-digit number, or something else that could be behind small queue depths? I can't otherwise imagine what the difference would be on the read path between ext4 vs btrfs when both are on mdraid. It seems I forgot to directly answer that question in my first reply. Thanks for restating it. Btrfs doesn't really expose much performance tuning (yet?), at least outside the code itself. There are a few very limited knobs, but they're just that, few and limited or broad-stroke. There are mount options like ssd/nossd, ssd_spread/nossd_spread, the space_cache set of options (see below), flushoncommit/noflushoncommit, commit=, etc (see the btrfs (5) manpage), but nothing really to influence stride length, etc, or to optimize chunk placement between ssd and non-ssd devices, for instance. And there's a few filesystem features, normally set at mkfs.btrfs time (and thus covered in the mkfs.btrfs manpage) but some of which can be tuned later, but generally, the defaults have changed over time to reflect the best case, and the older variants are there primarily to retain backward compatibility with old kernels and tools that didn't handle the newer variants. That said, as I think about it there are some tunables that may be worth experimenting with. Most or all of these are covered in the btrfs (5) manpage. * Given the large device numbers you mention and raid0, you're likely dealing with multi-TB-scale filesystems. At this level, the space_cache=v2 mount option may be useful. It's not the default yet as btrfs check, etc, don't yet handle it, but given your raid0 choice you may not be concerned about that. Need only be given once after which v2 is "on" for the filesystem until turned off. * Consider experimenting with the thread_pool=n mount option. I've seen very little discussion of this one, but given your interest in parallelization, it could make a difference. Probably not as much as you might think. I'll explain a bit more further down where this is being mentioned again. * Possibly the commit= (default 30) mount option. In theory, upping this may allow better write merging, tho your interest seems to be more on the read side, and the commit time has consequences at crash time. Based on my own experience, having a higher commit time doesn't impact read or write performance much or really help all that much with write merging. All it really helps with is minimizing overhead, but it's not even all that great at doing that. * The autodefrag mount option may be considered if you do a lot of existing file updates, as is common with database or VM image files. Due to COW this triggers high fragmentation on btrfs, and autodefrag should help control that. Note that autodefrag effectively increases the minimum extent size from 4 KiB to, IIRC, 16 MB, tho it may be less, and doesn't operate at whole-file size, so larger repeatedly-modified files will still have some fragmentation, just not as much. Obviously, you wouldn't see the read-time effects of this until the filesystem has aged somewhat, so it may not show up on your benchmarks. (Another option for such files is setting them nocow or using the nodatacow mount option, but this turns off checksumming and if it's on, compression for those files, and has a few other non-obvious caveats as well, so isn't something I recommend. 
Instead of using nocow, I'd suggest putting such files on a dedicated traditional non-cow filesystem such as ext4, and I consider nocow at best a workaround option for those who prefer to use btrfs as a single big storage pool and thus don't want to do the dedicated non-cow filesystem for some subset of their files.) * Not really for reads but for btrfs and any cow-based filesystem, you almost certainly want the (not btrfs specific) noatime mount option. Actually... This can help a bit for some workloads. Just like the commit time, it comes down to a matter of overhead. Essentially, if you read a file regularly, than with the default of relatime, you've got a guaranteed write requiring a commit of the metadata tree once every 24 hours. It's not much to worry about for just one file, but if you're reading a very large number of files all the time, it can really add up. * While it has serious filesystem integrity implications and thus can't be responsibly recommended, there is the nobarrier mount option. But if you're already running raid0 on a large number of devices you're already gambling with device stability, and this /might/ be an additional risk you're willing to take, as it should increase performance. But for normal users it's simply not worth the risk, and if you do choose to use it, it's at your own risk. Agreed, if you're running RAID0 with this many drives, nobarrier may be worth it for a
Re: [PATCH RFC] btrfs: harden agaist duplicate fsid
On 2018-10-01 04:56, Anand Jain wrote: Its not that impossible to imagine that a device OR a btrfs image is been copied just by using the dd or the cp command. Which in case both the copies of the btrfs will have the same fsid. If on the system with automount enabled, the copied FS gets scanned. We have a known bug in btrfs, that we let the device path be changed after the device has been mounted. So using this loop hole the new copied device would appears as if its mounted immediately after its been copied. For example: Initially.. /dev/mmcblk0p4 is mounted as / lsblk NAMEMAJ:MIN RM SIZE RO TYPE MOUNTPOINT mmcblk0 179:00 29.2G 0 disk |-mmcblk0p4 179:404G 0 part / |-mmcblk0p2 179:20 500M 0 part /boot |-mmcblk0p3 179:30 256M 0 part [SWAP] `-mmcblk0p1 179:10 256M 0 part /boot/efi btrfs fi show Label: none uuid: 07892354-ddaa-4443-90ea-f76a06accaba Total devices 1 FS bytes used 1.40GiB devid1 size 4.00GiB used 3.00GiB path /dev/mmcblk0p4 Copy mmcblk0 to sda dd if=/dev/mmcblk0 of=/dev/sda And immediately after the copy completes the change in the device superblock is notified which the automount scans using btrfs device scan and the new device sda becomes the mounted root device. lsblk NAMEMAJ:MIN RM SIZE RO TYPE MOUNTPOINT sda 8:01 14.9G 0 disk |-sda48:414G 0 part / |-sda28:21 500M 0 part |-sda38:31 256M 0 part `-sda18:11 256M 0 part mmcblk0 179:00 29.2G 0 disk |-mmcblk0p4 179:404G 0 part |-mmcblk0p2 179:20 500M 0 part /boot |-mmcblk0p3 179:30 256M 0 part [SWAP] `-mmcblk0p1 179:10 256M 0 part /boot/efi btrfs fi show / Label: none uuid: 07892354-ddaa-4443-90ea-f76a06accaba Total devices 1 FS bytes used 1.40GiB devid1 size 4.00GiB used 3.00GiB path /dev/sda4 The bug is quite nasty that you can't either unmount /dev/sda4 or /dev/mmcblk0p4. And the problem does not get solved until you take sda out of the system on to another system to change its fsid using the 'btrfstune -u' command. Signed-off-by: Anand Jain --- Hi, There was previous attempt to fix this bug ref: www.spinics.net/lists/linux-btrfs/msg37466.html which broke the Ubuntu subvol mount at boot. The reason for that is, Ubuntu changes the device path in the boot process, and the earlier fix checked for the device-path instead of block_device as in here and so we failed the subvol mount request and thus the bootup process. I have tested this with Oracle Linux with btrfs as boot device with a subvol to be mounted at boot. And also have verified with new test case btrfs/173. It will be good if someone run this through Ubuntu boot test case. fs/btrfs/volumes.c | 23 +++ 1 file changed, 23 insertions(+) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index f4405e430da6..62173a3abcc4 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -850,6 +850,29 @@ static noinline struct btrfs_device *device_list_add(const char *path, return ERR_PTR(-EEXIST); } + /* +* we are going to replace the device path, make sure its the +* same device if the device mounted +*/ + if (device->bdev) { + struct block_device *path_bdev; + + path_bdev = lookup_bdev(path); + if (IS_ERR(path_bdev)) { + mutex_unlock(_devices->device_list_mutex); + return ERR_CAST(path_bdev); + } + + if (device->bdev != path_bdev) { + bdput(path_bdev); + mutex_unlock(_devices->device_list_mutex); + return ERR_PTR(-EEXIST); It would be _really_ nice to have an informative error message printed here. 
Aside from the possibility of an admin accidentally making a block-level copy of the volume, this code triggering could represent an attempted attack against the system, so it's arguably something that should be reported as happening. Personally, I think a WARN_ON_ONCE for this would make sense, ideally per-volume if possible. + } + bdput(path_bdev); + pr_info("BTRFS: device fsid:devid %pU:%llu old path:%s new path:%s\n", + disk_super->fsid, devid, rcu_str_deref(device->name), path); + } + name = rcu_string_strdup(path, GFP_NOFS); if (!name) { mutex_unlock(_devices->device_list_mutex);
Re: GRUB writing to grubenv outside of kernel fs code
On 2018-09-19 15:08, Goffredo Baroncelli wrote: On 18/09/2018 19.15, Goffredo Baroncelli wrote: b. The bootloader code would have to have sophisticated enough Btrfs knowledge to know if the grubenv has been reflinked or snapshot, because even if +C, it may not be valid to overwrite, and COW must still happen, and there's no way the code in GRUB can do full blown COW and update a bunch of metadata. And what if GRUB ignores the possibility of COWing and overwrites the data? Is it such a big problem that the data is changed in all the snapshots? It would be interesting to know if the same problem happens for a swap file. I had a look at Sandoval's patches implementing swap on BTRFS. This patch set prevents the subvolume containing the swapfile from being snapshotted (and the file from being balanced, and so on...); what if we added the same constraint to the grubenv file? We would need to have a generalized mechanism of doing this then, because there's no way in hell a patch special-casing a single filename is going to make it into mainline. Whatever mechanism is used, it should also: * Force the file to not be inlined in metadata. * Enforce that the file has the NOCOW attribute set. (A rough sketch of approximating these by hand today follows below.)
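A minimal sketch of the NOCOW half done by hand today (paths are examples); the attribute has to be set while the file is still empty to take effect. There is currently no per-file control over inlining, so the closest existing equivalent for that half is the filesystem-wide max_inline mount option:
  rm -f /boot/grub/grubenv
  touch /boot/grub/grubenv
  chattr +C /boot/grub/grubenv
  grub-editenv /boot/grub/grubenv create   # writes the fixed 1KiB environment block
  lsattr /boot/grub/grubenv
  # Filesystem-wide workaround for the inlining half:
  # mount -o remount,max_inline=0 /boot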
Re: GRUB writing to grubenv outside of kernel fs code
On 2018-09-18 15:00, Chris Murphy wrote: On Tue, Sep 18, 2018 at 12:25 PM, Austin S. Hemmelgarn wrote: It actually is independent of /boot already. I've got it running just fine on my laptop off of the EFI system partition (which is independent of my /boot partition), and thus have no issues with handling of the grubenv file. The problem is that all the big distros assume you want it in /boot, so they have no option for putting it anywhere else. Actually installing it elsewhere is not hard though, you just pass `--boot-directory=/wherever` to the `grub-install` script and turn off your distributions automatic reinstall mechanism so it doesn't get screwed up by the package manager when the GRUB package gets updated. You can also make `/boot/grub` a symbolic link pointing to the real GRUB directory, so that you don't have to pass any extra options to tools like grub-reboot or grub-set-default. This is how Fedora builds their signed grubx64.efi to behave. But you cannot ever run grub-install on a Secure Boot enabled computer, or you now have to learn all about signing your own binaries. I don't even like doing that, let alone saner users. So for those distros that support Secure Boot, in practice you're stuck with the behavior of their prebuilt GRUB binary that goes on the ESP. Agreed, but that avoids the issues we're talking about here completely because the grubenv file ends up on the ESP too.
Re: GRUB writing to grubenv outside of kernel fs code
On 2018-09-18 14:57, Chris Murphy wrote: On Tue, Sep 18, 2018 at 12:16 PM, Andrei Borzenkov wrote: 18.09.2018 08:37, Chris Murphy wrote: The patches aren't upstream yet? Will they be? I do not know. Personally I think it is much easier to make the grub location independent of /boot, allowing grub to be installed in a separate partition. This automatically covers all other cases (like MD, LVM etc). The only case where I'm aware of this happening is Fedora on UEFI where they write grubenv and grub.cfg on the FAT ESP. I'm pretty sure upstream expects grubenv and grub.cfg at /boot/grub and I haven't ever seen it elsewhere (except Fedora on UEFI). I'm not sure this is much easier. Yet another volume that would be persistently mounted? Where? A nested mount at /boot/grub? I'm not liking that at all. Even Windows and macOS have saner and simpler to understand booting methods than this. On this front maybe, but Windows' boot sequence is insane in its own way (fun fact, if you have the Windows 8/8.1/10 boot-loader set up to multi-boot and want it to boot to something other than the default, it has to essentially _reboot the machine_ to actually boot that alternative entry).
Re: GRUB writing to grubenv outside of kernel fs code
On 2018-09-18 14:38, Andrei Borzenkov wrote: 18.09.2018 21:25, Austin S. Hemmelgarn пишет: On 2018-09-18 14:16, Andrei Borzenkov wrote: 18.09.2018 08:37, Chris Murphy пишет: On Mon, Sep 17, 2018 at 11:24 PM, Andrei Borzenkov wrote: 18.09.2018 07:21, Chris Murphy пишет: On Mon, Sep 17, 2018 at 9:44 PM, Chris Murphy wrote: ... There are a couple of reserve locations in Btrfs at the start and I think after the first superblock, for bootloader embedding. Possibly one or both of those areas could be used for this so it's outside the file system. But other implementations are going to run into this problem too. That's what SUSE grub2 version does - it includes patches to redirect writes on btrfs to reserved area. I am not sure how it behaves in case of multi-device btrfs though. The patches aren't upstream yet? Will they be? I do not know. Personally I think much easier is to make grub location independent of /boot, allowing grub be installed in separate partition. This automatically covers all other cases (like MD, LVM etc). It actually is independent of /boot already. I've got it running just fine on my laptop off of the EFI system partition (which is independent of my /boot partition), and thus have no issues with handling of the grubenv file. The problem is that all the big distros assume you want it in /boot, so they have no option for putting it anywhere else. This requires more than just explicit --boot-directory. With current monolithic configuration file listing all available kernels this file cannot be in the same location, it must be together with kernels (think about rollback to snapshot with completely different content). Or some different, more flexible configuration is needed. Uh, no, it doesn't need to be with the kernels. Fedora stores it on the ESP separate from the kernels (which are still on the boot partition) if you use Secure Boot, and I'm doing the same (without secure boot) without issue. You do have to explicitly set the `root` variable correctly in the config though to get it to work though, and the default upstream 'easy configuration' arrangement does not do this consistently. It's not too hard to hack in though, and it's positively trivial if you just write your own configuration files by hand like I do (no, I'm not crazy, the default configuration generator just produces a brobdingnagian monstrosity of a config that has tons of stuff I don't need and makes invalid assumptions about how I want things invoked, and the config syntax is actually not that hard). As is now grub silently assumes everything is under /boot. This turned out to be oversimplified. No, it assumes everything is under whatever you told GRUB to set the default value of the `prefix` variable to when you built the GRUB image, which is automatically set to the path you pass to `--boot-directory` when you use grub-install. This persists until you explicitly set that variable to a different location, or change the `root` variable (but GRUB still uses `prefix` for module look-ups if you just change the `root` variable).
Re: GRUB writing to grubenv outside of kernel fs code
On 2018-09-18 14:16, Andrei Borzenkov wrote: 18.09.2018 08:37, Chris Murphy пишет: On Mon, Sep 17, 2018 at 11:24 PM, Andrei Borzenkov wrote: 18.09.2018 07:21, Chris Murphy пишет: On Mon, Sep 17, 2018 at 9:44 PM, Chris Murphy wrote: https://btrfs.wiki.kernel.org/index.php/FAQ#Does_grub_support_btrfs.3F Does anyone know if this is still a problem on Btrfs if grubenv has xattr +C set? In which case it should be possible to overwrite and there's no csums that are invalidated. I kinda wonder if in 2018 it's specious for, effectively out of tree code, to be making modifications to the file system, outside of the file system. a. The bootloader code (pre-boot, not user space setup stuff) would have to know how to read xattr and refuse to overwrite a grubenv lacking xattr +C. b. The bootloader code, would have to have sophisticated enough Btrfs knowledge to know if the grubenv has been reflinked or snapshot, because even if +C, it may not be valid to overwrite, and COW must still happen, and there's no way the code in GRUB can do full blow COW and update a bunch of metadata. So answering my own question, this isn't workable. And it seems the same problem for dm-thin. There are a couple of reserve locations in Btrfs at the start and I think after the first superblock, for bootloader embedding. Possibly one or both of those areas could be used for this so it's outside the file system. But other implementations are going to run into this problem too. That's what SUSE grub2 version does - it includes patches to redirect writes on btrfs to reserved area. I am not sure how it behaves in case of multi-device btrfs though. The patches aren't upstream yet? Will they be? I do not know. Personally I think much easier is to make grub location independent of /boot, allowing grub be installed in separate partition. This automatically covers all other cases (like MD, LVM etc). It actually is independent of /boot already. I've got it running just fine on my laptop off of the EFI system partition (which is independent of my /boot partition), and thus have no issues with handling of the grubenv file. The problem is that all the big distros assume you want it in /boot, so they have no option for putting it anywhere else. Actually installing it elsewhere is not hard though, you just pass `--boot-directory=/wherever` to the `grub-install` script and turn off your distributions automatic reinstall mechanism so it doesn't get screwed up by the package manager when the GRUB package gets updated. You can also make `/boot/grub` a symbolic link pointing to the real GRUB directory, so that you don't have to pass any extra options to tools like grub-reboot or grub-set-default.
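A rough sketch of the manual installation being described (the target, paths and device are illustrative, and the exact invocation differs between BIOS and UEFI setups):
  # Put GRUB's files somewhere other than /boot
  grub-install --target=i386-pc --boot-directory=/esp/grub-files /dev/sda
  # Let grub-reboot, grub-set-default, grub-editenv etc. keep working with
  # no extra options by pointing /boot/grub at the real location
  ln -sfn /esp/grub-files/grub /boot/grub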
Re: Transactional btrfs
On 2018-09-06 03:23, Nathan Dehnel wrote: https://lwn.net/Articles/287289/ In 2008, HP released the source code for a filesystem called advfs so that its features could be incorporated into linux filesystems. Advfs had a feature where a group of file writes were an atomic transaction. https://www.usenix.org/system/files/conference/fast15/fast15-paper-verma.pdf These guys used advfs to add a "syncv" system call that makes writes across multiple files atomic. https://lwn.net/Articles/715918/ A patch was later submitted based on the previous paper in some way. So I guess my question is, does btrfs support atomic writes across multiple files? Or is anyone interested in such a feature? I'm fairly certain that it does not currently, but in theory it would not be hard to add. Realistically, the only cases I can think of where cross-file atomic _writes_ would be of any benefit are database systems. However, if this were extended to include rename, unlink, touch, and a handful of other VFS operations, then I can easily think of a few dozen use cases. Package managers in particular would likely be very interested in being able to atomically rename a group of files as a single transaction, as it would make their job _much_ easier.
Re: [RFC PATCH 0/6] btrfs-progs: build distinct binaries for specific btrfs subcommands
On 2018-08-30 13:13, Axel Burri wrote: On 29/08/2018 21.02, Austin S. Hemmelgarn wrote: On 2018-08-29 13:24, Axel Burri wrote: This patch allows to build distinct binaries for specific btrfs subcommands, e.g. "btrfs-subvolume-show" which would be identical to "btrfs subvolume show". Motivation: While btrfs-progs offer the all-inclusive "btrfs" command, it gets pretty cumbersome to restrict privileges to the subcommands [1]. Common approaches are to either setuid root for "/sbin/btrfs" (which is not recommended at all), or to write sudo rules for each subcommand. Separating the subcommands into distinct binaries makes it easy to set elevated privileges using capabilities(7) or setuid. A typical use case where this is needed is when it comes to automated scripts, e.g. btrbk [2] [3] creating snapshots and send/receive them via ssh. Let me start by saying I think this is a great idea to have as an option, and that the motivation is a particularly good one. I've posted my opinions on your two open questions below, but there's two other comments I'd like to make: * Is there some particular reason that this only includes the commands it does, and _hard codes_ which ones it works with? if we just do everything instead of only the stuff we think needs certain capabilities, then we can auto-generate the list of commands to be processed based on function names in the C files, and it will automatically pick up any newly added commands. At the very least, it could still parse through the C files and look for tags in the comments for the functions to indicate which ones need to be processed this way. Either case will make it significantly easier to add new commands, and would also better justify the overhead of shipping all the files pre-generated (because there would be much more involved in pre-generating them). It includes the commands that are required by btrbk. It was quite painful to figure out the required capabilities (reading kernel code and some trial and error involved), and I did not get around to include other commands yet. Yeah, I can imagine that it was not an easy task. I've actually been thinking of writing a script to scan the kernel sources and assemble a summary of the permissions checks performed by each system call and ioctl so that stuff like this is a bit easier, but that's unfortunately way beyond my abilities right now (parsing C and building call graphs is not easy no matter what language you're doing it with). I like your idea of adding some tags in the C files, I'll try to implement this, and we'll see what it gets to. Something embedded in the comments is likely to be the easiest option in terms of making sure it doesn't break the regular build. Just the tagging in general would be useful as documentation though. It would be kind of neat to have the list of capabilities needed for each one auto-generated from what it calls, but that's getting into some particularly complex territory that would likely require call graphs to properly implement. * While not essential, it would be really neat to have the `btrfs` command detect if an associated binary exists for whatever command was just invoked, and automatically exec that (possibly with some verification) instead of calling the command directly so that desired permissions are enforced. This would mitigate the need for users to remember different command names depending on execution context. Hmm this sounds a bit too magic for me, and would probably be more confusing than useful. 
It would mean than running "btrfs" as user would work when splitted commands are available, and would not work if not. It would also mean scripts would not have to add special handling for the case of running as a non-root user and seeing if the split commands actually exist or not (and, for that matter, would not have to directly depend on having the split commands at all), and that users would not need to worry about how to call BTRFS based on who they were running as. Realistically, I'd expect the same error to show if the binary isn't available as if it's not executable, so that it just becomes a case of 'if you see this error, re-run the same thing as root and it should work'. Description: Patch 1 adds a template as well as a generator shell script for the splitted subcommands. Patch 2 adds the generated subcommand source files. Patch 3-5 adds a "install-splitcmd-setcap" make target, with different approaches (either hardcoded in Makefile, or more generically by including "Makefile.install_setcap" generated by "splitcmd-gen.sh"). Open Questions: 1. "make install-splitcmd-setcap" installs the binaries with hardcoded group "btrfs". This needs to be configurable (how?). Another approach would be to not set the group at all, and leave this to the user or distro packaging script. Leave it to the user or dis
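For illustration only, the dispatch idea being discussed might look roughly like the following bash sketch inside the `btrfs` wrapper; the install path and the btrfs-<group>-<command> naming scheme are assumptions for the example, not the actual btrfs-progs code:

  # hypothetical dispatch, e.g. "btrfs subvolume show ..." -> btrfs-subvolume-show
  split="/usr/bin/btrfs-$1-$2"
  if [ -x "$split" ]; then
      exec "$split" "${@:3}"   # hand off so file capabilities/setuid on the split binary apply
  fi
  # otherwise fall through to the built-in subcommand handling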
Re: [RFC PATCH 0/6] btrfs-progs: build distinct binaries for specific btrfs subcommands
On 2018-08-29 13:24, Axel Burri wrote: This patch allows to build distinct binaries for specific btrfs subcommands, e.g. "btrfs-subvolume-show" which would be identical to "btrfs subvolume show". Motivation: While btrfs-progs offer the all-inclusive "btrfs" command, it gets pretty cumbersome to restrict privileges to the subcommands [1]. Common approaches are to either setuid root for "/sbin/btrfs" (which is not recommended at all), or to write sudo rules for each subcommand. Separating the subcommands into distinct binaries makes it easy to set elevated privileges using capabilities(7) or setuid. A typical use case where this is needed is when it comes to automated scripts, e.g. btrbk [2] [3] creating snapshots and send/receive them via ssh. Let me start by saying I think this is a great idea to have as an option, and that the motivation is a particularly good one. I've posted my opinions on your two open questions below, but there's two other comments I'd like to make: * Is there some particular reason that this only includes the commands it does, and _hard codes_ which ones it works with? if we just do everything instead of only the stuff we think needs certain capabilities, then we can auto-generate the list of commands to be processed based on function names in the C files, and it will automatically pick up any newly added commands. At the very least, it could still parse through the C files and look for tags in the comments for the functions to indicate which ones need to be processed this way. Either case will make it significantly easier to add new commands, and would also better justify the overhead of shipping all the files pre-generated (because there would be much more involved in pre-generating them). * While not essential, it would be really neat to have the `btrfs` command detect if an associated binary exists for whatever command was just invoked, and automatically exec that (possibly with some verification) instead of calling the command directly so that desired permissions are enforced. This would mitigate the need for users to remember different command names depending on execution context. Description: Patch 1 adds a template as well as a generator shell script for the splitted subcommands. Patch 2 adds the generated subcommand source files. Patch 3-5 adds a "install-splitcmd-setcap" make target, with different approaches (either hardcoded in Makefile, or more generically by including "Makefile.install_setcap" generated by "splitcmd-gen.sh"). Open Questions: 1. "make install-splitcmd-setcap" installs the binaries with hardcoded group "btrfs". This needs to be configurable (how?). Another approach would be to not set the group at all, and leave this to the user or distro packaging script. Leave it to the user or distro. It's likely to end up standardized on the name 'btrfs', but it should be agnostic of that. 2. Instead of the "install-splitcmd-setcap" make target, we could introduce a "configure --enable-splitted-subcommands" option, which would simply add all splitcmd binaries to the "all" and "install" targets without special treatment, and leave the setcap stuff to the user or distro packaging script (at least in gentoo, this needs to be specified using the "fcaps" eclass anyways [5]). A bit of a nitpick, but 'split' is the proper past tense of the word 'split', it's one of those exceptions that English has all over the place. 
Even aside from that though, I think `separate` sounds more natural for the configure option, or better yet, just make it `--enable-fscaps` like most other packages do. That aside, I think having a configure option is the best way to do this: it makes it very easy for distro build systems to handle, because this is what they're used to doing anyway. It also makes it a bit easier on the user, because it just becomes `make` to build whichever version you want installed.
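For illustration, installing one of the split binaries with file capabilities might look roughly like this; the install path, the group name, and the choice of cap_dac_read_search are assumptions for the example rather than what the patches hard-code:

  groupadd btrfs                          # or whatever group name the distro settles on
  install -m 0710 -g btrfs btrfs-subvolume-show /usr/bin/btrfs-subvolume-show
  setcap cap_dac_read_search=ep /usr/bin/btrfs-subvolume-show
  getcap /usr/bin/btrfs-subvolume-show    # verify the capability actually got set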
Re: [PATCH 0/4] Userspace support for FSID change
On 2018-08-29 08:33, Nikolay Borisov wrote: On 29.08.2018 15:09, Qu Wenruo wrote: On 2018/8/29 下午4:35, Nikolay Borisov wrote: Here is the userspace tooling support for utilising the new metadata_uuid field, enabling the change of fsid without having to rewrite every metadata block. This patchset consists of adding support for the new field to various tools and files (Patch 1). The actual implementation of the new -m|-M options (which are described in more detail in Patch 2). A new misc-tests testcasei (Patch 3) which exercises the new options and verifies certain invariants hold (these are also described in Patch2). Patch 4 is more or less copy of the kernel conuterpart just reducing some duplication between btrfs_fs_info and btrfs_fs_devices structures. So to my understand, now we have another layer of UUID. Before we have one fsid, both used in superblock and tree blocks. Now we have 2 fsid, the one used in tree blocks are kept the same, but changed its name to metadata_uuid in superblock. And superblock::fsid will become a new field, and although they are the same at mkfs time, they could change several times during its operation. This indeed makes uuid change super fast, only needs to update all superblocks of the fs, instead of all tree blocks. However I have one nitpick of the design. Unlike XFS, btrfs supports multiple devices. If we have a raid10 fs with 4 devices, and it has already gone through several UUID change (so its metadata uuid is already different from fsid). And during another UUID change procedure, we lost power while only updated 2 super blocks, what will happen for kernel device assembly? (Although considering how fast the UUID change would happen, such case should be super niche) Then I guess you will be fucked. I'm all ears for suggestion how to rectify this without skyrocketing the complexity. The current UUID rewrite method sets a flag int he superblock that FSID change is in progress and clears it once every metadatablock has been rewritten. I can piggyback on this mechanism but I'm not sure it provides 100% guarantee. Because by the some token you can set this flag, start writing the super blocks then lose power and then only some of the superblocks could have this flag set so we back at square 1. The intended usecase of this feature is to give the sysadmin the ability to create copies of filesystesm, change their uuid quickly and mount them alongside the original filesystem for, say, forensic purposes. One thing which still hasn't been set in stone is whether the new options will remain as -m|-M or whether they should subsume the current -u|-U - from the point of view of users nothing should change. Well, user would be surprised by how fast the new -m is, thus there is still something changed :) I prefer to subsume current -u/-U, and use the new one if the incompat feature is already set. Or fall back to original behavior. But I'm not a fan of using INCOMPAT flags as an indicator of changed fsid/metadata uuid. INCOMPAT feature should not change so easily nor acts as an indicator. That's to say, the flag should only be set at mkfs time, and then never change unlike the 2nd patch (I don't even like btrfstune to change incompat flags). E.g. mkfs.btrfs -O metadata_uuid , then we could use the new way to change fsid without touching metadata uuid. Or we could only use the old method. I disagree, I don't see any benefit in this but only added complexity. Can you elaborate more ? 
Same here, I see essentially zero benefit to this, and one _big_ drawback, namely that you can't convert an existing volume to use this approach if it's a feature that can only be set at mkfs time. That one drawback means that this is effectively useless for all existing BTRFS volumes, which is a pretty big limitation. I also do think an INCOMPAT feature bit is appropriate here. Volumes with this feature will potentially be enumerated with the wrong UUID on older kernels, which is a pretty big behavioral issue (on the level of completely breaking boot on some systems, keep in mind that almost all major distros use volume UUID's to identify volumes in /etc/fstab). Thanks, Qu So this is something which I'd like to hear from the community. Of course the alternative of rewriting the metadata blocks will be assigne new options - perhaps -m|M ? I've tested this with multiple xfstest runs with the new tools installed as well as running btrfs-progs test and have observed no regressions. Nikolay Borisov (4): btrfs-progs: Add support for metadata_uuid field. btrfstune: Add support for changing the user uuid btrfs-progs: tests: Add tests for changing fsid feature btrfs-progs: Remove fsid/metdata_uuid fields from fs_info btrfstune.c| 174 - check/main.c | 2 +- chunk-recover.c| 17 ++- cmds-filesystem.c |
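For reference, assuming the options land roughly as proposed in this series (the names may still change), usage on an unmounted copy of a filesystem would look something like this; the device path and the example UUID are placeholders:

  btrfstune -m /dev/sdc1                                        # pick a random new fsid, superblocks only
  btrfstune -M 12345678-1234-1234-1234-123456789abc /dev/sdc1   # or set a specific one
  # the old slow path, rewriting every metadata block, would stay behind -u/-U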
Re: 14Gb of space lost after distro upgrade on BTFS root partition (long thread with logs)
On 2018-08-28 15:14, Menion wrote: You are correct, indeed in order to cleanup you need 1) someone realize that snapshots have been created 2) apt-brtfs-snapshot is manually installed on the system Your second requirement is only needed if you want the nice automated cleanup. There's absolutely nothing preventing you from manually removing the snapshots. Assuming also that the snapshots created during do-release-upgrade are managed for auto cleanup Il martedì 28 agosto 2018, Noah Massey <mailto:noah.mas...@gmail.com>> ha scritto: On Tue, Aug 28, 2018 at 1:25 PM Menion mailto:men...@gmail.com>> wrote: > > Ok, I have removed the snapshot and the free expected space is here, thank you! > As a side note: apt-btrfs-snapshot was not installed, but it is > present in Ubuntu repository and I have used it (and I like the idea > of automatic snapshot during upgrade) > This means that the do-release-upgrade does it's own job on BTRFS, > silently which I believe is not good from the usability perspective, You are correct. DistUpgradeController.py from python3-distupgrade imports 'apt_btrfs_snapshot', which I read as coming from /usr/lib/python3/dist-packages/apt_btrfs_snapshot.py, supplied by apt-btrfs-snapshot, but I missed the fact that python3-distupgrade ships its own /usr/lib/python3/dist-packages/DistUpgrade/apt_btrfs_snapshot.py So now it looks like that cannot be easily disabled, and without the apt-btrfs-snapshot package scheduling cleanups it's not ever automatically removed? > just google it, there is no mention of this behaviour > Il giorno mar 28 ago 2018 alle ore 19:07 Austin S. Hemmelgarn > mailto:ahferro...@gmail.com>> ha scritto: > > > > On 2018-08-28 12:05, Noah Massey wrote: > > > On Tue, Aug 28, 2018 at 11:47 AM Austin S. Hemmelgarn > > > mailto:ahferro...@gmail.com>> wrote: > > >> > > >> On 2018-08-28 11:27, Noah Massey wrote: > > >>> On Tue, Aug 28, 2018 at 10:59 AM Menion mailto:men...@gmail.com>> wrote: > > >>>> > > >>>> [sudo] password for menion: > > >>>> ID gen top level path > > >>>> -- --- - > > >>>> 257 600627 5 /@ > > >>>> 258 600626 5 /@home > > >>>> 296 599489 5 > > >>>> /@apt-snapshot-release-upgrade-bionic-2018-08-27_15:29:55 > > >>>> 297 599489 5 > > >>>> /@apt-snapshot-release-upgrade-bionic-2018-08-27_15:30:08 > > >>>> 298 599489 5 > > >>>> /@apt-snapshot-release-upgrade-bionic-2018-08-27_15:33:30 > > >>>> > > >>>> So, there are snapshots, right? The time stamp is when I have launched > > >>>> do-release-upgrade, but it didn't ask anything about snapshot, neither > > >>>> I asked for it. > > >>> > > >>> This is an Ubuntu thing > > >>> `apt show apt-btrfs-snapshot` > > >>> which "will create a btrfs snapshot of the root filesystem each time > > >>> that apt installs/removes/upgrades a software package." > > >> Not Ubuntu, Debian. It's just that Ubuntu installs and configures the > > >> package by default, while Debian does not. > > > > > > Ubuntu also maintains the package, and I did not find it in Debian repositories. > > > I think it's also worth mentioning that these snapshots were created > > > by the do-release-upgrade script using the package directly, not as a > > > result of the apt configuration. Meaning if you do not want a snapshot > > > taken prior to upgrade, you have to remove the apt-btrfs-snapshot > > > package prior to running the upgrade script. You cannot just update > > > /etc/apt/apt.conf.d/80-btrfs-snapshot > > Hmm... I could have sworn that it was in the Debian repositories. 
> > > > That said, it's kind of stupid that the snapshot is not trivially > > optional for a release upgrade. Yes, that's where it's arguably the > > most important, but it's still kind of stupid to have to remove a > > package to get rid of that behavior and then reinstall it again afterwards.
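For anyone in the same situation, removing the snapshots by hand (without apt-btrfs-snapshot installed) is just ordinary subvolume deletion; the device name and the /mnt/top mount point below are examples, the snapshot paths are the ones from the `subvolume list` output earlier in the thread:

  mount -o subvolid=5 /dev/sdX2 /mnt/top      # mount the top-level subvolume
  btrfs subvolume delete '/mnt/top/@apt-snapshot-release-upgrade-bionic-2018-08-27_15:29:55'
  btrfs subvolume delete '/mnt/top/@apt-snapshot-release-upgrade-bionic-2018-08-27_15:30:08'
  btrfs subvolume delete '/mnt/top/@apt-snapshot-release-upgrade-bionic-2018-08-27_15:33:30'
  umount /mnt/top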
Re: 14Gb of space lost after distro upgrade on BTFS root partition (long thread with logs)
On 2018-08-28 12:05, Noah Massey wrote: On Tue, Aug 28, 2018 at 11:47 AM Austin S. Hemmelgarn wrote: On 2018-08-28 11:27, Noah Massey wrote: On Tue, Aug 28, 2018 at 10:59 AM Menion wrote: [sudo] password for menion: ID gen top level path -- --- - 257 600627 5 /@ 258 600626 5 /@home 296 599489 5 /@apt-snapshot-release-upgrade-bionic-2018-08-27_15:29:55 297 599489 5 /@apt-snapshot-release-upgrade-bionic-2018-08-27_15:30:08 298 599489 5 /@apt-snapshot-release-upgrade-bionic-2018-08-27_15:33:30 So, there are snapshots, right? The time stamp is when I have launched do-release-upgrade, but it didn't ask anything about snapshot, neither I asked for it. This is an Ubuntu thing `apt show apt-btrfs-snapshot` which "will create a btrfs snapshot of the root filesystem each time that apt installs/removes/upgrades a software package." Not Ubuntu, Debian. It's just that Ubuntu installs and configures the package by default, while Debian does not. Ubuntu also maintains the package, and I did not find it in Debian repositories. I think it's also worth mentioning that these snapshots were created by the do-release-upgrade script using the package directly, not as a result of the apt configuration. Meaning if you do not want a snapshot taken prior to upgrade, you have to remove the apt-btrfs-snapshot package prior to running the upgrade script. You cannot just update /etc/apt/apt.conf.d/80-btrfs-snapshot Hmm... I could have sworn that it was in the Debian repositories. That said, it's kind of stupid that the snapshot is not trivially optional for a release upgrade. Yes, that's where it's arguably the most important, but it's still kind of stupid to have to remove a package to get rid of that behavior and then reinstall it again afterwards.
Re: 14Gb of space lost after distro upgrade on BTFS root partition (long thread with logs)
On 2018-08-28 11:27, Noah Massey wrote: On Tue, Aug 28, 2018 at 10:59 AM Menion wrote: [sudo] password for menion: ID gen top level path -- --- - 257 600627 5 /@ 258 600626 5 /@home 296 599489 5 /@apt-snapshot-release-upgrade-bionic-2018-08-27_15:29:55 297 599489 5 /@apt-snapshot-release-upgrade-bionic-2018-08-27_15:30:08 298 599489 5 /@apt-snapshot-release-upgrade-bionic-2018-08-27_15:33:30 So, there are snapshots, right? The time stamp is when I have launched do-release-upgrade, but it didn't ask anything about snapshot, neither I asked for it. This is an Ubuntu thing `apt show apt-btrfs-snapshot` which "will create a btrfs snapshot of the root filesystem each time that apt installs/removes/upgrades a software package." Not Ubuntu, Debian. It's just that Ubuntu installs and configures the package by default, while Debian does not. This behavior in general is not specific to Debian either, a lot of distributions are either working on or already have this type of functionality, because it's the only sane and correct way to handle updates short of rebuilding the entire system from scratch. During the do-release-upgrade I got some issues due to the (very) bad behaviour of the script in remote terminal, then I have fixed everything manually and now the filesystem is operational in bionic version If it is confirmed, how can I remove the unwanted snapshot, keeping the current "visible" filesystem contents By default, the package runs a weekly cron job to cleanup old snapshots. (Defaults to 90d, but you can configure that in APT::Snapshots::MaxAge) Alternatively, you can cleanup with the command yourself. Run `sudo apt-btrfs-snapshot list`, and then `sudo apt-btrfs-snapshot delete `
Re: corruption_errs
On 2018-08-27 18:53, John Petrini wrote: Hi List, I'm seeing corruption errors when running btrfs device stats but I'm not sure what that means exactly. I've just completed a full scrub and it reported no errors. I'm hoping someone here can enlighten me. Thanks! The first thing to understand here is that the error counters reported by `btrfs device stats` are cumulative. In other words, they count errors since the last time they were reset (which means that if you've never run `btrfs device stats -z` on this filesystem, then they will count errors since the filesystem was created). As a result, seeing a non-zero value there just means that errors of that type happened at some point in time since they were reset. Building on this a bit further, corruption errors are checksum mismatches. Each time a block is read and its checksum does not match the stored checksum for it, a corruption error is recorded. The thing is though, if you are using a profile which can rebuild that block (dup, raid1, raid10, or one of the parity profiles), the error gets corrected automatically by the filesystem (it will attempt to rebuild that block, then write out the correct block). If that fix succeeds, there will be no errors there anymore, but the record of the error stays around (because there _was_ an error). Given this, my guess is that you _had_ checksum mismatches somewhere, but they were fixed before you ran scrub.
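For completeness, the counters in question can be inspected and, once you've noted them, cleared like this (the mount point is just an example):

  btrfs device stats /mnt/data        # per-device write/read/flush/corruption/generation error counters
  btrfs device stats -z /mnt/data     # print the counters and reset them to zero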
Re: BTRFS support per-subvolume compression, isn't it?
On 2018-08-27 17:05, Eugene Bright wrote: Greetings! BTRFS wiki says there is no per-subvolume compression option [1]. At the same time next command allow me to set properties per-subvolume: btrfs property set /volume compression zstd Corresponding get command shows distinct properties for every subvolume. Should wiki be updated? The wiki should be updated, but it's not technically wrong. What the wiki is talking about is per-subvolume mount options to control compression (so, mounting individual subvolumes from the same volume with different `compress=` or `compress-force=` mount options), which is not currently supported. You are correct though that properties can be used to achieve a similar result (compressing differently for different subvolumes).
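A quick illustration of the property-based approach, with two subvolumes getting different settings (the subvolume paths are examples):

  btrfs property set /volume/projects compression zstd
  btrfs property set /volume/scratch compression lzo
  btrfs property get /volume/projects compression     # prints something like compression=zstd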
Re: Device Delete Stalls
On 2018-08-23 10:04, Stefan Malte Schumacher wrote: Hallo, I originally had RAID with six 4TB drives, which was more than 80 percent full. So now I bought a 10TB drive, added it to the Array and gave the command to remove the oldest drive in the array. btrfs device delete /dev/sda /mnt/btrfs-raid I kept a terminal with "watch btrfs fi show" open and It showed that the size of /dev/sda had been set to zero and that data was being redistributed to the other drives. All seemed well, but now the process stalls at 8GB being left on /dev/sda/. It also seems that the size of the drive has been reset the original value of 3,64TiB. Label: none uuid: 1609e4e1-4037-4d31-bf12-f84a691db5d8 Total devices 7 FS bytes used 8.07TiB devid1 size 3.64TiB used 8.00GiB path /dev/sda devid2 size 3.64TiB used 2.73TiB path /dev/sdc devid3 size 3.64TiB used 2.73TiB path /dev/sdd devid4 size 3.64TiB used 2.73TiB path /dev/sde devid5 size 3.64TiB used 2.73TiB path /dev/sdf devid6 size 3.64TiB used 2.73TiB path /dev/sdg devid7 size 9.10TiB used 2.50TiB path /dev/sdb I see no more btrfs worker processes and no more activity in iotop. How do I proceed? I am using a current debian stretch which uses Kernel 4.9.0-8 and btrfs-progs 4.7.3-1. How should I proceed? I have a Backup but would prefer an easier and less time-comsuming way out of this mess. Not exactly what you asked for, but I do have some advice on how to avoid this situation in the future: If at all possible, use `btrfs device replace` instead of an add/delete cycle. The replace operation requires two things. First, you have to be able to connect the new device to the system while all the old ones except the device you are removing are present. Second, the new device has to be at least as big as the old one. Assuming both conditions are met and you can use replace, it's generally much faster and is a lot more reliable than an add/delete cycle (especially when the array is near full). This is because replace just copies the data that's on the old device directly (or rebuilds it directly if it's not present anymore or corrupted), whereas the add/delete method implicitly re-balances the entire array (which takes a long time and may fail if the array is mostly full). Now, as far as what's actually going on here, I'm unfortunately not quite sure, and therefore I'm really not the best person to be giving advice on how to fix it. I will comment that having info on the allocations for all the devices (not just /dev/sda) would be useful in debugging, but even with that I don't know that I personally can help.
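For reference, had the new drive gone in via replace instead of add+delete, it would have looked roughly like this (device names are illustrative, taken from the report above):

  btrfs replace start /dev/sda /dev/sdb /mnt/btrfs-raid   # old device, new device, mount point
  btrfs replace status /mnt/btrfs-raid                    # progress can be checked while it runs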
Re: lazytime mount option—no support in Btrfs
On 2018-08-22 11:01, David Sterba wrote: On Wed, Aug 22, 2018 at 09:56:59AM -0400, Austin S. Hemmelgarn wrote: On 2018-08-22 09:48, David Sterba wrote: On Tue, Aug 21, 2018 at 01:01:00PM -0400, Austin S. Hemmelgarn wrote: On 2018-08-21 12:05, David Sterba wrote: On Tue, Aug 21, 2018 at 10:10:04AM -0400, Austin S. Hemmelgarn wrote: On 2018-08-21 09:32, Janos Toth F. wrote: so pretty much everyone who wants to avoid the overhead from them can just use the `noatime` mount option. It would be great if someone finally fixed this old bug then: https://bugzilla.kernel.org/show_bug.cgi?id=61601 Until then, it seems practically impossible to use both noatime (this can't be added as rootflag in the command line and won't apply if the kernel already mounted the root as RW) and space-cache-v2 (has to be added as a rootflag along with RW to take effect) for the root filesystem (at least without an init*fs, which I never use, so can't tell). Last I knew, it was fixed. Of course, it's been quite a while since I last tried this, as I run locally patched kernels that have `noatime` as the default instead of `relatime`. I'm using VMs without initrd, tested the rootflags=noatime and it still fails, the same way as in the bugreport. As the 'noatime' mount option is part of the mount(2) API (passed as a bit via mountflags), the remaining option in the filesystem is to whitelist the generic options and ignore them. But this brings some layering violation question. On the other hand, this would be come confusing as the user expectation is to see the effects of 'noatime'. Ideally there would be a way to get this to actually work properly. I think ext4 at least doesn't panic, though I'm not sure if it actually works correctly. No, ext4 also refuses to mount, the panic happens in VFS that tries either the rootfstype= or all available filesystems. [3.763602] EXT4-fs (sda): Unrecognized mount option "noatime" or missing value [3.761315] BTRFS info (device sda): unrecognized mount option 'noatime' Otherwise, the only option for people who want it set is to patch the kernel to get noatime as the default (instead of relatime). I would look at pushing such a patch upstream myself actually, if it weren't for the fact that I'm fairly certain that it would be immediately NACK'ed by at least Linus, and probably a couple of other people too. An acceptable solution could be to parse the rootflags and translate them to the MNT_* values, ie. what the commandline tool mount does before it calls the mount syscall. That would be helpful, but at that point you might as well update the CLI mount tool to just pass all the named options to the kernel and have it do the parsing (I mean, keep the old interface too obviously, but provide a new one and use that preferentially). The initial mount is not done by the mount tool but internally by kernel init sequence (files in init/): mount_block_root do_mount_root ksys_mount The mount options (as a string) is passed unchanged via variable root_mount_data (== rootflags). So before this step, the options would have to be filtered and all known generic options turned into bit flags. What I'm saying is that if there's going to be parsing for it in the kernel anyway, why not expose that interface to userspace too so that the regular `mount` tool can take advantage of it as well.
Re: lazytime mount option—no support in Btrfs
On 2018-08-22 09:48, David Sterba wrote: On Tue, Aug 21, 2018 at 01:01:00PM -0400, Austin S. Hemmelgarn wrote: On 2018-08-21 12:05, David Sterba wrote: On Tue, Aug 21, 2018 at 10:10:04AM -0400, Austin S. Hemmelgarn wrote: On 2018-08-21 09:32, Janos Toth F. wrote: so pretty much everyone who wants to avoid the overhead from them can just use the `noatime` mount option. It would be great if someone finally fixed this old bug then: https://bugzilla.kernel.org/show_bug.cgi?id=61601 Until then, it seems practically impossible to use both noatime (this can't be added as rootflag in the command line and won't apply if the kernel already mounted the root as RW) and space-cache-v2 (has to be added as a rootflag along with RW to take effect) for the root filesystem (at least without an init*fs, which I never use, so can't tell). Last I knew, it was fixed. Of course, it's been quite a while since I last tried this, as I run locally patched kernels that have `noatime` as the default instead of `relatime`. I'm using VMs without initrd, tested the rootflags=noatime and it still fails, the same way as in the bugreport. As the 'noatime' mount option is part of the mount(2) API (passed as a bit via mountflags), the remaining option in the filesystem is to whitelist the generic options and ignore them. But this brings some layering violation question. On the other hand, this would be come confusing as the user expectation is to see the effects of 'noatime'. Ideally there would be a way to get this to actually work properly. I think ext4 at least doesn't panic, though I'm not sure if it actually works correctly. No, ext4 also refuses to mount, the panic happens in VFS that tries either the rootfstype= or all available filesystems. [3.763602] EXT4-fs (sda): Unrecognized mount option "noatime" or missing value [3.761315] BTRFS info (device sda): unrecognized mount option 'noatime' Otherwise, the only option for people who want it set is to patch the kernel to get noatime as the default (instead of relatime). I would look at pushing such a patch upstream myself actually, if it weren't for the fact that I'm fairly certain that it would be immediately NACK'ed by at least Linus, and probably a couple of other people too. An acceptable solution could be to parse the rootflags and translate them to the MNT_* values, ie. what the commandline tool mount does before it calls the mount syscall. That would be helpful, but at that point you might as well update the CLI mount tool to just pass all the named options to the kernel and have it do the parsing (I mean, keep the old interface too obviously, but provide a new one and use that preferentially). I also like Duncan's suggestion to expose the default value for the atime options as a kconfig option (Chris Murphy emailed me directly about essentially the same thing).
Re: lazytime mount option—no support in Btrfs
On 2018-08-21 23:57, Duncan wrote: Austin S. Hemmelgarn posted on Tue, 21 Aug 2018 13:01:00 -0400 as excerpted: Otherwise, the only option for people who want it set is to patch the kernel to get noatime as the default (instead of relatime). I would look at pushing such a patch upstream myself actually, if it weren't for the fact that I'm fairly certain that it would be immediately NACK'ed by at least Linus, and probably a couple of other people too. What about making default-noatime a kconfig option, presumably set to default-relatime by default? That seems to be the way many legacy-incompatible changes work. Then for most it's up to the distro, which in fact it is already, only if the distro set noatime-default they'd at least be using an upstream option instead of patching it themselves, making it upstream code that could be accounted for instead of downstream code that... who knows? That's probably a lot more likely to make it upstream, but it's a bit beyond my skills when it comes to stuff like this. Meanwhile, I'd be interested in seeing your local patch. I'm local-patching noatime-default here too, but not being a dev, I'm not entirely sure I'm doing it "correctly", tho AFAICT it does seem to work. FWIW, here's what I'm doing (posting inline so may be white-space damaged, and IIRC I just recently manually updated the line numbers so they don't reflect the code at the 2014 date any more, but as I'm not sure of the "correctness" it's not intended to be applied in any case):
--- fs/namespace.c.orig 2014-04-18 23:54:42.167666098 -0700
+++ fs/namespace.c      2014-04-19 00:19:08.622741946 -0700
@@ -2823,8 +2823,9 @@ long do_mount(const char *dev_name, cons
                goto dput_out;

        /* Default to relatime unless overriden */
-       if (!(flags & MS_NOATIME))
-               mnt_flags |= MNT_RELATIME;
+       /* JED: Make that noatime */
+       if (!(flags & MS_RELATIME))
+               mnt_flags |= MNT_NOATIME;

        /* Separate the per-mountpoint flags */
        if (flags & MS_NOSUID)
@@ -2837,6 +2837,8 @@ long do_mount(const char *dev_name, cons
                mnt_flags |= MNT_NOATIME;
        if (flags & MS_NODIRATIME)
                mnt_flags |= MNT_NODIRATIME;
+       if (flags & MS_RELATIME)
+               mnt_flags |= MNT_RELATIME;
        if (flags & MS_STRICTATIME)
                mnt_flags &= ~(MNT_RELATIME | MNT_NOATIME);
        if (flags & MS_RDONLY)
Sane, or am I "doing it wrong!"(TM), or perhaps doing it correctly, but missing a chunk that should be applied elsewhere? Mine only has the first part, not the second, which seems to cover making sure it's noatime by default. I never use relatime though, so that may be broken with my patch because of me not having the second part. Meanwhile, since broken rootflags requiring an initr* came up let me take the opportunity to ask once again, does btrfs-raid1 root still require an initr*? It'd be /so/ nice to be able to supply the appropriate rootflags=device=...,device=... and actually have it work so I didn't need the initr* any longer! Last I knew, specifying appropriate `device=` options in rootflags works correctly without an initrd.
Re: lazytime mount option—no support in Btrfs
On 2018-08-21 12:05, David Sterba wrote: On Tue, Aug 21, 2018 at 10:10:04AM -0400, Austin S. Hemmelgarn wrote: On 2018-08-21 09:32, Janos Toth F. wrote: so pretty much everyone who wants to avoid the overhead from them can just use the `noatime` mount option. It would be great if someone finally fixed this old bug then: https://bugzilla.kernel.org/show_bug.cgi?id=61601 Until then, it seems practically impossible to use both noatime (this can't be added as rootflag in the command line and won't apply if the kernel already mounted the root as RW) and space-cache-v2 (has to be added as a rootflag along with RW to take effect) for the root filesystem (at least without an init*fs, which I never use, so can't tell). Last I knew, it was fixed. Of course, it's been quite a while since I last tried this, as I run locally patched kernels that have `noatime` as the default instead of `relatime`. I'm using VMs without initrd, tested the rootflags=noatime and it still fails, the same way as in the bugreport. As the 'noatime' mount option is part of the mount(2) API (passed as a bit via mountflags), the remaining option in the filesystem is to whitelist the generic options and ignore them. But this brings some layering violation question. On the other hand, this would be come confusing as the user expectation is to see the effects of 'noatime'. Ideally there would be a way to get this to actually work properly. I think ext4 at least doesn't panic, though I'm not sure if it actually works correctly. Otherwise, the only option for people who want it set is to patch the kernel to get noatime as the default (instead of relatime). I would look at pushing such a patch upstream myself actually, if it weren't for the fact that I'm fairly certain that it would be immediately NACK'ed by at least Linus, and probably a couple of other people too.
Re: Are the btrfs mount options inconsistent?
On 2018-08-21 09:43, David Howells wrote: Qu Wenruo wrote: But to be more clear, NOSSD shouldn't be a special case. In fact currently NOSSD only affects whether we will output the message "enabling ssd optimization", no real effect if I didn't miss anything. That's not quite true. In: if (!btrfs_test_opt(fs_info, NOSSD) && !fs_info->fs_devices->rotating) { btrfs_set_and_info(fs_info, SSD, "enabling ssd optimizations"); } the call to btrfs_set_and_info() will turn on SSD. What this seems to me is that, normally, SSD will be turned on automatically unless at least one of the devices is a rotating medium - but this appears to be explicitly suppressed by the NOSSD option. That's my understanding too (though I may be wrong, I'm not an expert on C). If this _isn't_ what's happening, then it needs to be changed so it is; that's what the documentation has pretty much always said, and is therefore how people expect it to work (also, it needs to work because there needs to be an option other than poking around at sysfs attributes to disable this on non-rotational media where it's not wanted).
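For anyone who needs the workaround mentioned above today, the sysfs attribute being referred to is presumably the block layer's rotational flag, which the SSD autodetection keys off of at mount time (the device name is an example):

  cat /sys/block/sda/queue/rotational        # 0 = detected as non-rotational (SSD), 1 = rotational
  echo 1 > /sys/block/sda/queue/rotational   # pretend it rotates; has to be done before mounting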
Re: lazytime mount option—no support in Btrfs
On 2018-08-21 09:32, Janos Toth F. wrote: so pretty much everyone who wants to avoid the overhead from them can just use the `noatime` mount option. It would be great if someone finally fixed this old bug then: https://bugzilla.kernel.org/show_bug.cgi?id=61601 Until then, it seems practically impossible to use both noatime (this can't be added as rootflag in the command line and won't apply if the kernel already mounted the root as RW) and space-cache-v2 (has to be added as a rootflag along with RW to take effect) for the root filesystem (at least without an init*fs, which I never use, so can't tell). Last I knew, it was fixed. Of course, it's been quite a while since I last tried this, as I run locally patched kernels that have `noatime` as the default instead of `relatime`. Also, once you've got the space cache set up by mounting once writable with the appropriate flag and then waiting for it to initialize, you should not ever need to specify the `space_cache` option again.
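For context, the setup being described is a kernel command line along these lines, with no initramfs (the root device is an example); this is exactly the combination that trips over the bug above, because noatime is handled as a generic VFS mount flag rather than a btrfs-specific option string:

  root=/dev/sda2 rootfstype=btrfs rw rootflags=noatime,space_cache=v2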
Re: lazytime mount option—no support in Btrfs
On 2018-08-21 08:06, Adam Borowski wrote: On Mon, Aug 20, 2018 at 08:16:16AM -0400, Austin S. Hemmelgarn wrote: Also, slightly OT, but atimes are not where the real benefit is here for most people. No sane software other than mutt uses atimes (and mutt's use of them is not sane, but that's a different argument) Right. There are two competing forks of mutt: neomutt and vanilla: https://github.com/neomutt/neomutt/commit/816095bfdb72caafd8845e8fb28cbc8c6afc114f https://gitlab.com/dops/mutt/commit/489a1c394c29e4b12b705b62da413f322406326f So this has already been taken care of. so pretty much everyone who wants to avoid the overhead from them can just use the `noatime` mount option. atime updates (including relatime) are bad not only for performance, they also explode disk size used by snapshots (btrfs, LVM, ...) -- to the tune of ~5% per snapshot for some non-crafted loads. And, are bad for media with low write endurance (SD cards, as used by most SoCs). Thus, atime needs to die. The real benefit for most people is with mtimes, for which there is no other way to limit the impact they have on performance. With btrfs, any write already triggers metadata update (except nocow), thus there's little benefit of lazytime for mtimes. But does that actually propagate all the way up to the point of updating the inode itself? If so, then yes, there is not really any point. if not though, then there is still a benefit.
Re: lazytime mount option—no support in Btrfs
On 2018-08-19 06:25, Andrei Borzenkov wrote: Отправлено с iPhone 19 авг. 2018 г., в 11:37, Martin Steigerwald написал(а): waxhead - 18.08.18, 22:45: Adam Hunt wrote: Back in 2014 Ted Tso introduced the lazytime mount option for ext4 and shortly thereafter a more generic VFS implementation which was then merged into mainline. His early patches included support for Btrfs but those changes were removed prior to the feature being merged. His> changelog includes the following note about the removal: - Per Christoph's suggestion, drop support for btrfs and xfs for now, issues with how btrfs and xfs handle dirty inode tracking. We can add btrfs and xfs support back later or at the end of this series if we want to revisit this decision. My reading of the current mainline shows that Btrfs still lacks any support for lazytime. Has any thought been given to adding support for lazytime to Btrfs? […] Is there any new regarding this? I´d like to know whether there is any news about this as well. If I understand it correctly this could even help BTRFS performance a lot cause it is COW´ing metadata. I do not see how btrfs can support it exactly due to cow. Modified atime means checksum no more matches so you must update all related metadata. At which point you have kind of shadow in-memory metadata trees. And if this metadata is not written out, then some other metadata that refers to them becomes invalid. I think you might be misunderstanding something here, either how lazytime actually works, or how BTRFS checksumming works. Lazytime prevents timestamp updates from triggering writeback of a cached inode. Other changes will trigger writeback, as will anything that evicts the inode from the cache, and an automatic writeback will be triggered if the timestamp changed more than 24 hours ago, but until any of those situations happens, no writeback will be triggered. BTRFS checksumming only verifies checksums of blocks which are being read. If the inode is in the cache (which it has to be for lazytime to have _any_ effect on it), the block containing it on disk does not need to be read, so no checksum verification happens. Even if there was verification, we would not be verifying blocks that are in memory using the on-disk checksums (because that would break writeback caching, which we already do and already works correctly). So, given all this, the only inconsistency on-disk for BTRFS with this would be identical to the inconsistency it causes for other filesystems, namely that mtimes and atimes may not be accurate. Also, slightly OT, but atimes are not where the real benefit is here for most people. No sane software other than mutt uses atimes (and mutt's use of them is not sane, but that's a different argument), so pretty much everyone who wants to avoid the overhead from them can just use the `noatime` mount option. The real benefit for most people is with mtimes, for which there is no other way to limit the impact they have on performance. I suspect any file system that keeps checksums of metadata will run into the same issue. Nope, only if they verify checksums on stuff that's already cached _and_ they pull the checksums for verification from the block device and not the cache.
Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD
On 2018-08-17 08:50, Roman Mamedov wrote: On Fri, 17 Aug 2018 14:28:25 +0200 Martin Steigerwald wrote: First off, keep in mind that the SSD firmware doing compression only really helps with wear-leveling. Doing it in the filesystem will help not only with that, but will also give you more space to work with. While also reducing the ability of the SSD to wear-level. The more data I fit on the SSD, the less it can wear-level. And the better I compress that data, the less it can wear-level. Do not consider SSD "compression" as a factor in any of your calculations or planning. Modern controllers do not do it anymore, the last ones that did are SandForce, and that's 2010 era stuff. You can check for yourself by comparing write speeds of compressible vs incompressible data, it should be the same. At most, the modern ones know to recognize a stream of binary zeroes and have a special case for that. All that testing write speeds for compressible versus incompressible data tells you is whether the SSD is doing real-time compression of data, not whether it is doing any compression at all. Also, this test only works if you turn the write-cache on the device off. Besides, you can't prove 100% for certain that any manufacturer who does not sell their controller chips isn't doing this, which means there are a few manufacturers that may still be doing it.
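As a rough sketch of the comparison being talked about (device and paths are examples; note that reading from /dev/urandom on the fly can itself be the bottleneck, so pre-generating the incompressible input in RAM gives a fairer comparison):

  hdparm -W0 /dev/sdX                              # disable the drive's volatile write cache first
  head -c 1G /dev/zero    > /dev/shm/zero.bin      # highly compressible input, staged in RAM
  head -c 1G /dev/urandom > /dev/shm/random.bin    # incompressible input, staged in RAM
  dd if=/dev/shm/zero.bin   of=/mnt/test/zero.out   bs=1M oflag=direct conv=fsync
  dd if=/dev/shm/random.bin of=/mnt/test/random.out bs=1M oflag=direct conv=fsync
  # similar speeds suggest no real-time compression; it says nothing about anything
  # else the controller may do internally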
Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD
On 2018-08-17 08:28, Martin Steigerwald wrote: Thanks for your detailed answer. Austin S. Hemmelgarn - 17.08.18, 13:58: On 2018-08-17 05:08, Martin Steigerwald wrote: […] I have seen a discussion about the limitation in point 2. That allowing to add a device and make it into RAID 1 again might be dangerous, cause of system chunk and probably other reasons. I did not completely read and understand it tough. So I still don´t get it, cause: Either it is a RAID 1, then, one disk may fail and I still have *all* data. Also for the system chunk, which according to btrfs fi df / btrfs fi sh was indeed RAID 1. If so, then period. Then I don´t see why it would need to disallow me to make it into an RAID 1 again after one device has been lost. Or it is no RAID 1 and then what is the point to begin with? As I was able to copy of all date of the degraded mount, I´d say it was a RAID 1. (I know that BTRFS RAID 1 is not a regular RAID 1 anyway, but just does two copies regardless of how many drives you use.) So, what's happening here is a bit complicated. The issue is entirely with older kernels that are missing a couple of specific patches, but it appears that not all distributions have their kernels updated to include those patches yet. In short, when you have a volume consisting of _exactly_ two devices using raid1 profiles that is missing one device, and you mount it writable and degraded on such a kernel, newly created chunks will be single-profile chunks instead of raid1 chunks with one half missing. Any write has the potential to trigger allocation of a new chunk, and more importantly any _read_ has the potential to trigger allocation of a new chunk if you don't use the `noatime` mount option (because a read will trigger an atime update, which results in a write). When older kernels then go and try to mount that volume a second time, they see that there are single-profile chunks (which can't tolerate _any_ device failures), and refuse to mount at all (because they can't guarantee that metadata is intact). Newer kernels fix this part by checking per-chunk if a chunk is degraded/complete/missing, which avoids this because all the single chunks are on the remaining device. How new the kernel needs to be for that to happen? Do I get this right that it would be the kernel used for recovery, i.e. the one on the live distro that needs to be new enough? To one on this laptop meanwhile is already 4.18.1. Yes, the kernel used for recovery is the important one here. I don't remember for certain when the patches went in, but I'm pretty sure it's been no eariler than 4.14. FWIW, I'm pretty sure SystemRescueCD has a new enough kernel, but they still (sadly) lack zstd support. I used latest GRML stable release 2017.05 which has an 4.9 kernel. While I don't know exactly when the patches went in, I'm fairly certain that 4.9 never got them. As far as avoiding this in the future: I hope that with the new Samsung Pro 860 together with the existing Crucial m500 I am spared from this for years to come. That Crucial SSD according to SMART status about lifetime used has still quite some time to go. Yes, hopefully. And the SMART status on that Crucial is probably right, they tend to do a very good job in my experience with accurately measuring life expectancy (that or they're just _really_ good at predicting failures, I've never had a Crucial SSD that did not indicate correctly in the SMART status that it would fail in the near future). 
* If you're just pulling data off the device, mark the device read-only in the _block layer_, not the filesystem, before you mount it. If you're using LVM, just mark the LV read-only using LVM commands This will make 100% certain that nothing gets written to the device, and thus makes sure that you won't accidentally cause issues like this. * If you're going to convert to a single device, just do it and don't stop it part way through. In particular, make sure that your system will not lose power. * Otherwise, don't mount the volume unless you know you're going to repair it. Thanks for those. Good to keep in mind. The last one is actually good advice in general, not just for BTRFS. I can't count how many stories I've heard of people who tried to run half an array simply to avoid downtime, and ended up making things far worse than they were as a result. For this laptop it was not all that important but I wonder about BTRFS RAID 1 in enterprise environment, cause restoring from backup adds a significantly higher downtime. Anyway, creating a new filesystem may have been better here anyway, cause it replaced an BTRFS that aged over several years with a new one. Due to the increased capacity and due to me thinking that Samsung 860 Pro compresses itself, I removed LZO compression. This would also give larger extents on files that are not fragmented or only slightly fragmented. I think that Intel SSD 320 did not compress, but Crucial m500 mSATA SSD does
Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD
On 2018-08-17 05:08, Martin Steigerwald wrote: Hi! This happened about two weeks ago. I already dealt with it and all is well. Linux hung on suspend so I switched off this ThinkPad T520 forcefully. After that it did not boot the operating system anymore. Intel SSD 320, latest firmware, which should patch this bug, but apparently does not, is only 8 MiB big. Those 8 MiB just contain zeros. Access via GRML and "mount -fo degraded" worked. I initially was even able to write onto this degraded filesystem. First I copied all data to a backup drive. I even started a balance to "single" so that it would work with one SSD. But later I learned that secure erase may recover the Intel SSD 320 and since I had no other SSD at hand, did that. And yes, it did. So I canceled the balance. I partitioned the Intel SSD 320 and put LVM on it, just as I had it. But at that time I was not able to mount the degraded BTRFS on the other SSD as writable anymore, not even with "-f" "I know what I am doing". Thus I was not able to add a device to it and btrfs balance it to RAID 1. Even "btrfs replace" was not working. I thus formatted a new BTRFS RAID 1 and restored. A week later I migrated the Intel SSD 320 to a Samsung 860 Pro. Again via one full backup and restore cycle. However, this time I was able to copy most of the data of the Intel SSD 320 with "mount -fo degraded" via eSATA and thus the copy operation was way faster. So conclusion: 1. Pro: BTRFS RAID 1 really protected my data against a complete SSD outage. Glad to hear I'm not the only one! 2. Con: It does not allow me to add a device and balance to RAID 1 or replace one device that is already missing at this time. See below where you comment about this more, I've replied regarding it there. 3. I keep using BTRFS RAID 1 on two SSDs for often changed, critical data. 4. And yes, I know it does not replace a backup. As it was holidays and I was lazy backup was two weeks old already, so I was happy to have all my data still on the other SSD. 5. The error messages in kernel when mounting without "-o degraded" are less than helpful. They indicate a corrupted filesystem instead of just telling that one device is missing and "-o degraded" would help here. Agreed, the kernel error messages need significant improvement, not just for this case, but in general (I would _love_ to make sure that there are exactly zero exit paths for open_ctree that don't involve a proper error message being printed beyond the ubiquitous `open_ctree failed` message you get when it fails). I have seen a discussion about the limitation in point 2. That allowing to add a device and make it into RAID 1 again might be dangerous, cause of system chunk and probably other reasons. I did not completely read and understand it tough. So I still don´t get it, cause: Either it is a RAID 1, then, one disk may fail and I still have *all* data. Also for the system chunk, which according to btrfs fi df / btrfs fi sh was indeed RAID 1. If so, then period. Then I don´t see why it would need to disallow me to make it into an RAID 1 again after one device has been lost. Or it is no RAID 1 and then what is the point to begin with? As I was able to copy of all date of the degraded mount, I´d say it was a RAID 1. (I know that BTRFS RAID 1 is not a regular RAID 1 anyway, but just does two copies regardless of how many drives you use.) So, what's happening here is a bit complicated. 
The issue is entirely with older kernels that are missing a couple of specific patches, but it appears that not all distributions have their kernels updated to include those patches yet. In short, when you have a volume consisting of _exactly_ two devices using raid1 profiles that is missing one device, and you mount it writable and degraded on such a kernel, newly created chunks will be single-profile chunks instead of raid1 chunks with one half missing. Any write has the potential to trigger allocation of a new chunk, and more importantly any _read_ has the potential to trigger allocation of a new chunk if you don't use the `noatime` mount option (because a read will trigger an atime update, which results in a write). When older kernels then go and try to mount that volume a second time, they see that there are single-profile chunks (which can't tolerate _any_ device failures), and refuse to mount at all (because they can't guarantee that metadata is intact). Newer kernels fix this part by checking per-chunk if a chunk is degraded/complete/missing, which avoids this because all the single chunks are on the remaining device. As far as avoiding this in the future: * If you're just pulling data off the device, mark the device read-only in the _block layer_, not the filesystem, before you mount it. If you're using LVM, just mark the LV read-only using LVM commands This will make 100% certain that nothing gets written to the device, and thus makes sure that you won't accidentally cause issues
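To make the first point concrete, marking the remaining device read-only at the block layer before mounting degraded just to copy data off can be done like this (device and LV names are examples):

  blockdev --setro /dev/sdb3                  # plain partition or whole device
  lvchange -p r vg0/home                      # or, for an LVM logical volume
  mount -o ro,degraded /dev/sdb3 /mnt/rescue  # nothing can now be written to the device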
Re: How to ensure that a snapshot is not corrupted?
On 2018-08-10 06:07, Cerem Cem ASLAN wrote: Original question is here: https://superuser.com/questions/1347843 How can we sure that a readonly snapshot is not corrupted due to a disk failure? Is the only way calculating the checksums one on another and store it for further examination, or does BTRFS handle that on its own? I've posted an answer for the linked question on SuperUser, under the assumption that it will be more visible to people simply searching for it there than it would be on the ML. Here's the text of the answer though so people here can see it too: There are two possible answers depending on what you mean by 'corrupted by a disk failure'. ### If you mean simple at-rest data corruption BTRFS handles this itself, transparently to the user. It checksums everything, including data in snapshots, internally and then verifies the checksums as it reads each block. There are a couple of exceptions to this though: * If the volume is mounted with the `nodatasum` or `nodatacow` options, you will have no checksumming of data blocks. In most cases, you should not be mounting with these options, so this should not be an issue. * Any files for which the `NOCOW` attribute is set (`C` in the output of the `lsattr` command) are also not checked. You're not likely to have any truly important files with this attribute set (systemd journal files have it set, but that's about it unless you set it manually). ### If you mean non-trivial destruction of data on the volume because of loss of too many devices You can't protect against this except by having another copy of the data somewhere. Pretty much, if you've lost more devices than however many the storage profiles for the volume can tolerate, your data is gone, and nothing is going to get it back for you short of restoring from a backup.
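Two quick checks that follow from the above (the file path and mount point are examples):

  lsattr /path/to/file                 # a 'C' in the flags means NOCOW, i.e. no data checksums for that file
  btrfs scrub start -B /mountpoint     # re-verify every checksum on the volume, snapshots included
  btrfs scrub status /mountpoint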
Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota
On 2018-08-12 03:04, Andrei Borzenkov wrote: 12.08.2018 06:16, Chris Murphy пишет: On Fri, Aug 10, 2018 at 9:29 PM, Duncan <1i5t5.dun...@cox.net> wrote: Chris Murphy posted on Fri, 10 Aug 2018 12:07:34 -0600 as excerpted: But whether data is shared or exclusive seems potentially ephemeral, and not something a sysadmin should even be able to anticipate let alone individual users. Define "user(s)". The person who is saving their document on a network share, and they've never heard of Btrfs. Arguably, in the context of btrfs tool usage, "user" /is/ the admin, I'm not talking about btrfs tools. I'm talking about rational, predictable behavior of a shared folder. If I try to drop a 1GiB file into my share and I'm denied, not enough free space, and behind the scenes it's because of a quota limit, I expect I can delete *any* file(s) amounting to create 1GiB free space and then I'll be able to drop that file successfully without error. But if I'm unwittingly deleting shared files, my quota usage won't go down, and I still can't save my file. So now I somehow need a secret incantation to discover only my exclusive files and delete enough of them in order to save this 1GiB file. It's weird, it's unexpected, I think it's a use case failure. Maybe Btrfs quotas isn't meant to work with samba or NFS shares. *shrug* That's how both NetApp and ZFS work as well. I doubt anyone can seriously call NetApp "not meant to work with NFS or CIFS shares". On NetApp space available to NFS/CIFS user is volume size minus space frozen in snapshots. If file, captured in snapshot, is deleted in active file system, it does not make a single byte available to external user. That's what surprised most every first time NetApp users. On ZFS snapshots are contained in dataset and you limit total dataset space consumption including all snapshots. Thus end effect is the same - deleting data that is itself captured in snapshot does not make a single byte available. ZFS allows you to additionally restrict active file system size ("referenced" quota in ZFS) - this more closely matches your expectation - deleting file in active file system decreases its "referenced" size thus allowing user to write more data (as long as user does not exceed total dataset quota). This is different from btrfs "exculsive" and "shared". This should not be hard to implement in btrfs, as "referenced" simply means all data in current subvolume, be it exclusive or shared. IOW ZFS allows to place restriction on both how much data user can use and how much data user is allowed additionally to protect (snapshot). Except user created snapshots are kind of irrelevant here. If we're talking about NFS/CIFS/SMB, there is no way for the user to create a snapshot (at least, not in-band), so provided the admin is sensible and only uses the referenced quota for limiting space usage by users, things behave no differently on ZFS than they do on ext4 or XFS using user quotas. Note also that a lot of storage appliances that use ZFS as the underlying storage don't expose any way for the admin to use anything other than the referenced quota (and usually space reservations). They do this because it makes the system behave as pretty much everyone intuitively expects, and it ensures that users don't have to go to an admin to remedy their free space issues. 
"Regular users" as you use the term, that is the non-admins who just need to know how close they are to running out of their allotted storage resources, shouldn't really need to care about btrfs tool usage in the first place, and btrfs commands in general, including btrfs quota related commands, really aren't targeted at them, and aren't designed to report the type of information they are likely to find useful. Other tools will be more appropriate. I'm not talking about any btrfs commands or even the term quota for regular users. I'm talking about saving a file, being denied, and how does the user figure out how to free up space? Users need to be educated. Same as with NetApp and ZFS. There is no magic, redirect-on-write filesystems work differently than traditional and users need to adapt. Of course devil is in details, and usability of btrfs quota is far lower than NetApp/ZFS. In those space consumption information is first class citizen integrated into the very basic tools, not something bolted on later and mostly incomprehensible to end user. Except that this _CAN_ be made to work and behave just like classic quotas. Your example of ZFS above proves it (referenced quotas behave just like classic VFS quotas). Yes, we need to educate users regarding qgroups, but we need a _WORKING_ alternative so they can do things like they always have, and like most stuff that uses ZFS as part of a pre-built system (FreeNAS for example) does. Anyway, it's a hypothetical scenario. While I have Samba running on a Btrfs volume with various shares as subvolumes, I don't have quotas enabled. Given
Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota
On 2018-08-10 14:07, Chris Murphy wrote: On Thu, Aug 9, 2018 at 5:35 PM, Qu Wenruo wrote: On 8/10/18 1:48 AM, Tomasz Pala wrote: On Tue, Jul 31, 2018 at 22:32:07 +0800, Qu Wenruo wrote: 2) Different limitations on exclusive/shared bytes Btrfs can set different limit on exclusive/shared bytes, further complicating the problem. 3) Btrfs quota only accounts data/metadata used by the subvolume It lacks all the shared trees (mentioned below), and in fact such shared tree can be pretty large (especially for extent tree and csum tree). I'm not sure about the implications, but just to clarify some things: when limiting somebody's data space we usually don't care about the underlying "savings" coming from any deduplicating technique - these are purely bonuses for system owner, so he could do larger resource overbooking. In reality that's definitely not the case. From what I see, most users would care more about exclusively used space (excl), other than the total space one subvolume is referring to (rfer). I'm confused. So what happens in the following case with quotas enabled on Btrfs: 1. Provision a user with a directory, pre-populated with files, using snapshot. Let's say it's 1GiB of files. 2. Set a quota for this user's directory, 1GiB. The way I'm reading the description of Btrfs quotas, the 1GiB quota applies to exclusive used space. So for starters, they have 1GiB of shared data that does not affect their 1GiB quota at all. 3. User creates 500MiB worth of new files, this is exclusive usage. They are still within their quota limit. 4. The shared data becomes obsolete for all but this one user, and is deleted. Suddenly, 1GiB of shared data for this user is no longer shared data, it instantly becomes exclusive data and their quota is busted. Now consider scaling this to 12TiB of storage, with hundreds of users, and dozens of abruptly busted quotas following this same scenario on a weekly basis. I *might* buy off on the idea that an overlay2 based initial provisioning would not affect quotas. But whether data is shared or exclusive seems potentially ephemeral, and not something a sysadmin should even be able to anticipate let alone individual users. Going back to the example, I'd expect to give the user a 2GiB quota, with 1GiB of initially provisioned data via snapshot, so right off the bat they are at 50% usage of their quota. If they were to modify every single provisioned file, they'd in effect go from 100% shared data to 100% exclusive data, but their quota usage would still be 50%. That's completely sane and easily understandable by a regular user. The idea that they'd start modifying shared files, and their quota usage climbs is weird to me. The state of files being shared or exclusive is not user domain terminology anyway. And it's important to note that this is the _only_ way this can sanely work for actually partitioning resources, which is the primary classical use case for quotas. Being able to see how much data is shared and exclusive in a subvolume is nice, but quota groups are the wrong name for it because the current implementation does not work at all like quotas and can trivially result in both users escaping quotas (multiple ways), and in quotas being overreached by very large amounts for potentially indefinite periods of time because of actions of individuals who _don't_ own the data the quota is for. The most common case is, you do a snapshot, user would only care how much new space can be written into the subvolume, other than the total subvolume size. 
I think that's expecting a lot of users. I also wonder if it expects a lot from services like samba and NFS who have to communicate all of this in some sane way to remote clients? My expectation is that a remote client shows Free Space on a quota'd system to be based on the unused amount of the quota. I also expect if I delete a 1GiB file, that my quota consumption goes down. But you're saying it would be unchanged if I delete a 1GiB shared file, and would only go down if I delete a 1GiB exclusive file. Do samba and NFS know about shared and exclusive files? If samba and NFS don't understand this, then how is a user supposed to understand it? It might be worth looking at how Samba and NFS work on top of ZFS on a platform like FreeNAS and trying to emulate that. Behavior there is as-follows: * The total size of the 'disk' reported over SMB (shown on Windows only if you map the share as a drive) is equal to the quota for the underlying dataset. * The reported space used on the 'disk' reported over SMB is based on physical space usage after compression, with a few caveats relating to deduplication: - Data which is shared across multiple datasets is accounted against _all_ datasets that reference it. - Data which is shared only within a given dataset is accounted only once. * Free space is reported simply as the total size minus the used space. * Usage reported by
Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota
On 2018-08-10 14:21, Tomasz Pala wrote: On Fri, Aug 10, 2018 at 07:39:30 -0400, Austin S. Hemmelgarn wrote: I.e.: every shared segment should be accounted within quota (at least once). I think what you mean to say here is that every shared extent should be accounted to quotas for every location it is reflinked from. IOW, that if an extent is shared between two subvolumes each with its own quota, they should both have it accounted against their quota. Yes. Moreover - if there were per-subvolume RAID levels someday, the data should be accounted in relation to the "default" (filesystem) RAID level, i.e. having a RAID0 subvolume on a RAID1 fs should account half of the data, and twice the data in the opposite scenario (like a "dup" profile on a single-drive filesystem). This is irrelevant to your point here. In fact, it goes against it, you're arguing for quotas to report data like `du`, but all of the chunk-profile stuff is invisible to `du` (and everything else in userspace that doesn't look through BTRFS ioctls). My point is the user's point of view, not some system tool like du. Consider this: 1. user wants higher (than default) protection of some data, 2. user wants more storage space with less protection. Ad. 1 - requesting better redundancy is similar to cp --reflink=never - there are functional differences, but the cost is similar: trading space for security, Ad. 2 - many would like to have .cache, .ccache, tmp or some build system directory with faster writes and no redundancy at all. This requires per-file/directory data profile attrs though. Since we agreed that transparent data compression is the user's storage bonus, gains from reduced redundancy should also profit the user. Do you actually know of any services that do this though? I mean, Amazon S3 and similar services have the option of reduced redundancy (and other alternate storage tiers), but they charge per-unit-data-per-unit-time with no hard limit on how much space they use, and charge different rates for different storage tiers. In comparison, what you appear to be talking about is something more similar to Dropbox or Google Drive, where you pay up front for a fixed amount of storage for a fixed amount of time and can't use more than that, and all the services I know of like that offer exactly one option for storage redundancy. That aside, you seem to be overthinking this. No sane provider is going to give their users the ability to create subvolumes themselves (there's too much opportunity for a tiny bug in your software to cost you a _lot_ of lost revenue, because creating subvolumes can let you escape qgroups). That means in turn that what you're trying to argue for is no different from the provider just selling units of storage for different redundancy levels separately, and charging different rates for each of them. In fact, that approach is better, because it works independent of the underlying storage technology (it will work with hardware RAID, LVM2, MD, ZFS, and even distributed storage platforms like Ceph and Gluster), _and_ it lets them charge differently than the trivial case of N copies costing N times as much as one copy (which is not quite accurate in terms of actual management costs). Now, if BTRFS were to have the ability to set profiles per-file, then this might be useful, albeit with the option to tune how it gets accounted. Disclaimer: all the above statements are in relation to the conception and understanding of quotas, not to be confused with qgroups.
Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota
On 2018-08-09 13:48, Tomasz Pala wrote: On Tue, Jul 31, 2018 at 22:32:07 +0800, Qu Wenruo wrote: 2) Different limitations on exclusive/shared bytes Btrfs can set different limit on exclusive/shared bytes, further complicating the problem. 3) Btrfs quota only accounts data/metadata used by the subvolume It lacks all the shared trees (mentioned below), and in fact such shared tree can be pretty large (especially for extent tree and csum tree). I'm not sure about the implications, but just to clarify some things: when limiting somebody's data space we usually don't care about the underlying "savings" coming from any deduplicating technique - these are purely bonuses for system owner, so he could do larger resource overbooking. So - the limit set on any user should enforce maximum and absolute space he has allocated, including the shared stuff. I could even imagine that creating a snapshot might immediately "eat" the available quota. In a way, that quota returned matches (give or take) `du` reported usage, unless "do not account reflinks within a single qgroup" was easy to implement. I.e.: every shared segment should be accounted within quota (at least once). I think what you mean to say here is that every shared extent should be accounted to quotas for every location it is reflinked from. IOW, that if an extent is shared between two subvolumes each with its own quota, they should both have it accounted against their quota. And the numbers accounted should reflect the uncompressed sizes. This is actually inconsistent with pretty much every other VFS level quota system in existence. Even ZFS does its accounting _after_ compression. At this point, it's actually expected by most sysadmins that things behave that way. Moreover - if there were per-subvolume RAID levels someday, the data should be accounted in relation to the "default" (filesystem) RAID level, i.e. having a RAID0 subvolume on a RAID1 fs should account half of the data, and twice the data in the opposite scenario (like a "dup" profile on a single-drive filesystem). This is irrelevant to your point here. In fact, it goes against it, you're arguing for quotas to report data like `du`, but all of the chunk-profile stuff is invisible to `du` (and everything else in userspace that doesn't look through BTRFS ioctls). In short: values representing quotas are user-oriented ("the numbers one bought"), not storage-oriented ("the numbers they actually occupy").
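As a side note, the `du`-style user view and the extent-level view can already be compared with reasonably recent btrfs-progs; roughly (path invented):

  du -sh /mnt/subvol                   # logical, uncompressed sizes as the user sees them
  btrfs filesystem du -s /mnt/subvol   # total / exclusive / set-shared, as the extents are actually referenced

The gap between the two numbers is exactly the compression and reflink "savings" being argued about here.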
Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota
On 2018-08-09 19:35, Qu Wenruo wrote: On 8/10/18 1:48 AM, Tomasz Pala wrote: On Tue, Jul 31, 2018 at 22:32:07 +0800, Qu Wenruo wrote: 2) Different limitations on exclusive/shared bytes Btrfs can set different limit on exclusive/shared bytes, further complicating the problem. 3) Btrfs quota only accounts data/metadata used by the subvolume It lacks all the shared trees (mentioned below), and in fact such shared tree can be pretty large (especially for extent tree and csum tree). I'm not sure about the implications, but just to clarify some things: when limiting somebody's data space we usually don't care about the underlying "savings" coming from any deduplicating technique - these are purely bonuses for system owner, so he could do larger resource overbooking. In reality that's definitely not the case. From what I see, most users would care more about exclusively used space (excl), other than the total space one subvolume is referring to (rfer). The most common case is, you do a snapshot, user would only care how much new space can be written into the subvolume, other than the total subvolume size. I would really love to know exactly who these users are, because it sounds to me like you've heard from exactly zero people who are currently using conventional quotas to impose actual resource limits on other filesystems (instead of just using them for accounting, which is a valid use case but not what they were originally designed for). So - the limit set on any user should enforce maximum and absolute space he has allocated, including the shared stuff. I could even imagine that creating a snapshot might immediately "eat" the available quota. In a way, that quota returned matches (give or take) `du` reported usage, unless "do not account reflinks within a single qgroup" was easy to implement. In fact, that's the case. In the current implementation, accounting on extents is the easiest (if not the only) way to implement it. I.e.: every shared segment should be accounted within quota (at least once). Already accounted, at least for rfer. And the numbers accounted should reflect the uncompressed sizes. No way to do that with the current extent-based solution. While this may be true, this would be a killer feature to have. Moreover - if there were per-subvolume RAID levels someday, the data should be accounted in relation to the "default" (filesystem) RAID level, i.e. having a RAID0 subvolume on a RAID1 fs should account half of the data, and twice the data in the opposite scenario (like a "dup" profile on a single-drive filesystem). Not possible either with the current extent-based solution. In short: values representing quotas are user-oriented ("the numbers one bought"), not storage-oriented ("the numbers they actually occupy"). Well, if something is not possible or brings such a big performance impact, there will be no argument on how it should work in the first place. Thanks, Qu
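For reference, the rfer/excl numbers and the limits being argued about are visible with the existing qgroup commands; a minimal sketch (paths and sizes are invented, and /srv/template is assumed to already be a subvolume):

  btrfs quota enable /srv
  btrfs subvolume snapshot /srv/template /srv/alice
  btrfs qgroup limit 1G /srv/alice   # by default this limits referenced bytes; -e would limit exclusive instead
  btrfs qgroup show -re /srv         # rfer/excl per qgroup, plus the configured limits

Watching those columns while deleting the template subvolume is an easy way to see the "shared suddenly becomes exclusive" effect Chris describes.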
Re: BTRFS and databases
On 2018-08-02 06:56, Qu Wenruo wrote: On 2018-08-02 18:45, Andrei Borzenkov wrote: Sent from my iPhone On 2 Aug 2018, at 10:02, Qu Wenruo wrote: On 2018-08-01 11:45, MegaBrutal wrote: Hi all, I know it's a decade-old question, but I'd like to hear your thoughts of today. By now, I have become a heavy BTRFS user. Almost everywhere I use BTRFS, except in situations where it is obvious there is no benefit (e.g. /var/log, /boot). At home, all my desktop, laptop and server computers are mainly running on BTRFS with only a few file systems on ext4. I have even installed BTRFS in corporate production systems (in those cases, the systems were mainly on ext4; but there were some specific file systems that exploited BTRFS features). But there is still one question that I can't get over: if you store a database (e.g. MySQL), would you prefer having a BTRFS volume mounted with nodatacow, or would you just simply use ext4? I know that with nodatacow, I take away most of the benefits of BTRFS (those are actually hurting database performance – the exact CoW nature that is elsewhere a blessing is, with databases, a drawback). But are there any advantages of still sticking to BTRFS for a database albeit with CoW disabled, or should I just return to the old and reliable ext4 for those applications? Since I'm not an expert in databases, I can totally be wrong, but what about completely disabling the database write-ahead log (WAL), and letting btrfs' data CoW handle data consistency completely? This would make the content of the database after a crash completely unpredictable, thus making it impossible to reliably roll back a transaction. Btrfs itself (with datacow) can ensure the fs is updated completely. That's to say, even if a crash happens, the content of the fs will be in the same state as the previous btrfs transaction (btrfs sync). Thus there is no need to roll back the database transaction though. (Unless the database transaction is not synced to the btrfs transaction.) Two issues with this statement: 1. Not all database software properly groups logically related operations that need to be atomic as a unit into transactions. 2. Even aside from point 1 and the possibility of database corruption, there are other legitimate reasons that you might need to roll back a transaction (for example, the rather obvious case of a transaction that should not have happened in the first place).
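If anyone does want to keep a database on BTRFS with nodatacow, the usual approach is to set the attribute on the (still empty) database directory rather than mounting the whole volume with nodatacow; a sketch, assuming the default MySQL data directory:

  mkdir -p /var/lib/mysql
  chattr +C /var/lib/mysql   # new files created here inherit NOCOW; it has no effect on files that already contain data
  lsattr -d /var/lib/mysql   # should now show the 'C' flag

Keep in mind this also disables checksumming and compression for those files, which is exactly the trade-off being asked about above.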
Re: [PATCH 0/4] 3- and 4- copy RAID1
On 2018-07-20 14:41, Hugo Mills wrote: On Fri, Jul 20, 2018 at 09:38:14PM +0300, Andrei Borzenkov wrote: 20.07.2018 20:16, Goffredo Baroncelli пишет: [snip] Limiting the number of disk per raid, in BTRFS would be quite simple to implement in the "chunk allocator" You mean that currently RAID5 stripe size is equal to number of disks? Well, I suppose nobody is using btrfs with disk pools of two or three digits size. But they are (even if not very many of them) -- we've seen at least one person with something like 40 or 50 devices in the array. They'd definitely got into /dev/sdac territory. I don't recall what RAID level they were using. I think it was either RAID-1 or -10. That's the largest I can recall seeing mention of, though. I've talked to at least two people using it on 100+ disks in a SAN situation. In both cases however, BTRFS itself was only seeing about 20 devices and running in raid0 mode on them, with each of those being a RAID6 volume configured on the SAN node holding the disks for it. From what I understood when talking to them, they actually got rather good performance in this setup, though maintenance was a bit of a pain. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
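For the curious, at the BTRFS level such a setup is nothing exotic; it is something along these lines, with each device being a RAID6 LUN exported by the SAN (device names are invented, and the raid1 metadata choice is just my guess at a sensible default, not necessarily what those users ran):

  mkfs.btrfs -d raid0 -m raid1 /dev/mapper/san-lun0 /dev/mapper/san-lun1 /dev/mapper/san-lun2 /dev/mapper/san-lun3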
Re: [PATCH 0/4] 3- and 4- copy RAID1
On 2018-07-20 13:13, Goffredo Baroncelli wrote: On 07/19/2018 09:10 PM, Austin S. Hemmelgarn wrote: On 2018-07-19 13:29, Goffredo Baroncelli wrote: [...] So until now you are repeating what I told: the only useful raid profile are - striping - mirroring - striping+paring (even limiting the number of disk involved) - striping+mirroring No, not quite. At least, not in the combinations you're saying make sense if you are using standard terminology. RAID05 and RAID06 are not the same thing as 'striping+parity' as BTRFS implements that case, and can be significantly more optimized than the trivial implementation of just limiting the number of disks involved in each chunk (by, you know, actually striping just like what we currently call raid10 mode in BTRFS does). Could you provide more information ? Just parity by itself is functionally equivalent to a really stupid implementation of 2 or more copies of the data. Setups with only one disk more than the number of parities in RAID5 and RAID6 are called degenerate for this very reason. All sane RAID5/6 implementations do striping across multiple devices internally, and that's almost always what people mean when talking about striping plus parity. What I'm referring to is different though. Just like RAID10 used to be implemented as RAID1 on top of RAID0, RAID05 is RAID0 on top of RAID5. That is, you're striping your data across multiple RAID5 arrays instead of using one big RAID5 array to store it all. As I mentioned, this mitigates the scaling issues inherent in RAID5 when it comes to rebuilds (namely, the fact that device failure rates go up faster for larger arrays than rebuild times do). Functionally, such a setup can be implemented in BTRFS by limiting RAID5/6 stripe width, but that will have all kinds of performance limitations compared to actually striping across all of the underlying RAID5 chunks. In fact, it will have the exact same performance limitations you're calling out BTRFS single mode for below. RAID15 and RAID16 are a similar case to RAID51 and RAID61, except they might actually make sense in BTRFS to provide a backup means of rebuilding blocks that fail checksum validation if both copies fail. If you need further redundancy, it is easy to implement a parity3 and parity4 raid profile instead of stacking a raid6+raid1 I think you're misunderstanding what I mean here. RAID15/16 consist of two layers: * The top layer is regular RAID1, usually limited to two copies. * The lower layer is RAID5 or RAID6. This means that the lower layer can validate which of the two copies in the upper layer is correct when they don't agree. This happens only because there is a redundancy greater than 1. Anyway BTRFS has the checksum, which helps a lot in this area The checksum helps, but what do you do when all copies fail the checksum? Or, worse yet, what do you do with both copies have the 'right' checksum, but different data? Yes, you could have one more copy, but that just reduces the chances of those cases happening, it doesn't eliminate them. Note that I'm not necessarily saying it makes sense to have support for this in BTRFS, just that it's a real-world counter-example to your statement that only those combinations make sense. In the case of BTRFS, these would make more sense than RAID51 and RAID61, but they still aren't particularly practical. 
For classic RAID though, they're really important, because you don't have checksumming (unless you have T10 DIF capable hardware and a RAID implementation that understands how to work with it, but that's rare and expensive) and it makes it easier to resize an array than having three copies (you only need 2 new disks for RAID15 or RAID16 to increase the size of the array, but you need 3 for 3-copy RAID1 or RAID10). It doesn't really provide significantly better redundancy (they can technically sustain more disk failures without failing completely than simple two-copy RAID1 can, but just like BTRFS raid10, they can't reliably survive more than one (or two if you're using RAID6 as the lower layer) disk failure), so it does not do the same thing that higher-order parity does. The fact that you can combine striping and mirroring (or pairing) makes sense because you could have a speed gain (see below). [] As someone else pointed out, md/lvm-raid10 already work like this. What btrfs calls raid10 is somewhat different, but btrfs raid1 pretty much works this way except with huge (gig size) chunks. As implemented in BTRFS, raid1 doesn't have striping. The argument is that because there's only two copies, on multi-device btrfs raid1 with 4+ devices of equal size so chunk allocations tend to alternate device pairs, it's effectively striped at the macro level, with the 1 GiB device-level chunks effectively being huge individual device strips of 1 GiB. The striping concept is based to the fact that if the "stripe size"
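BTRFS can't express RAID15/16 natively, but the layering described above is easy to approximate today by putting BTRFS raid1 on top of md RAID6 arrays; a rough sketch (device names invented):

  mdadm --create /dev/md0 --level=6 --raid-devices=6 /dev/sd[abcdef]
  mdadm --create /dev/md1 --level=6 --raid-devices=6 /dev/sd[ghijkl]
  mkfs.btrfs -d raid1 -m raid1 /dev/md0 /dev/md1   # two copies, each copy protected by RAID6 underneath

In that arrangement BTRFS picks whichever copy passes its checksum, and each md array can rebuild its own failed members, which is roughly the division of labour described here.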
Re: Healthy amount of free space?
On 2018-07-20 01:01, Andrei Borzenkov wrote: 18.07.2018 16:30, Austin S. Hemmelgarn пишет: On 2018-07-18 09:07, Chris Murphy wrote: On Wed, Jul 18, 2018 at 6:35 AM, Austin S. Hemmelgarn wrote: If you're doing a training presentation, it may be worth mentioning that preallocation with fallocate() does not behave the same on BTRFS as it does on other filesystems. For example, the following sequence of commands: fallocate -l X ./tmp dd if=/dev/zero of=./tmp bs=1 count=X Will always work on ext4, XFS, and most other filesystems, for any value of X between zero and just below the total amount of free space on the filesystem. On BTRFS though, it will reliably fail with ENOSPC for values of X that are greater than _half_ of the total amount of free space on the filesystem (actually, greater than just short of half). In essence, preallocating space does not prevent COW semantics for the first write unless the file is marked NOCOW. Is this a bug, or is it suboptimal behavior, or is it intentional? It's been discussed before, though I can't find the email thread right now. Pretty much, this is _technically_ not incorrect behavior, as the documentation for fallocate doesn't say that subsequent writes can't fail due to lack of space. I personally consider it a bug though because it breaks from existing behavior in a way that is avoidable and defies user expectations. There are two issues here: 1. Regions preallocated with fallocate still do COW on the first write to any given block in that region. This can be handled by either treating the first write to each block as NOCOW, or by allocating a bit How is it possible? As long as fallocate actually allocates space, this should be checksummed which means it is no more possible to overwrite it. May be fallocate on btrfs could simply reserve space. Not sure whether it complies with fallocate specification, but as long as intention is to ensure write will not fail for the lack of space it should be adequate (to the extent it can be ensured on btrfs of course). Also hole in file returns zeros by definition which also matches fallocate behavior. Except it doesn't _have_ to be checksummed if there's no data there, and that will always be the case for a new allocation. When I say it could be NOCOW, I'm talking specifically about the first write to each newly allocated block (that is, one either beyond the previous end of the file, or one in a region that used to be a hole). This obviously won't work for places where there are already data. of extra space and doing a rotating approach like this for writes: - Write goes into the extra space. - Once the write is done, convert the region covered by the write into a new block of extra space. - When the final block of the preallocated region is written, deallocate the extra space. 2. Preallocation does not completely account for necessary metadata space that will be needed to store the data there. This may not be necessary if the first issue is addressed properly. And then I wonder what happens with XFS COW: fallocate -l X ./tmp cp --reflink ./tmp ./tmp2 dd if=/dev/zero of=./tmp bs=1 count=X I'm not sure. In this particular case, this will fail on BTRFS for any X larger than just short of one third of the total free space. I would expect it to fail for any X larger than just short of half instead. 
ZFS gets around this by not supporting fallocate (well, kind of, if you're using glibc and call posix_fallocate, that _will_ work, but it will take forever because it works by writing out each block of space that's being allocated, which, ironically, means that that still suffers from the same issue potentially that we have). What happens on btrfs then? fallocate specifies that new space should be initialized to zero, so something should still write those zeros? For new regions (places that were holes previously, or were beyond the end of the file), we create an unwritten extent, which is a region that's 'allocated', but everything reads back as zero. The problem is that we don't write into the blocks allocated for the unwritten extent at all, and only deallocate them once a write to another block finishes. In essence, we're (either explicitly or implicitly) applying COW semantics to a region that should not be COW until after the first write to each block. For the case of calling fallocate on existing data, we don't really do anything (unless the flag telling fallocate to unshare the region is passed). This is actually consistent with pretty much every other filesystem in existence, but that's because pretty much every other filesystem in existence implicitly provides the same guarantee that fallocate does for regions that already have data. This case can in theory be handled by the same looping algorithm I described above without needing the base amount of space allocated, but I wouldn't consider it important
Re: [PATCH 0/4] 3- and 4- copy RAID1
On 2018-07-19 13:29, Goffredo Baroncelli wrote: On 07/19/2018 01:43 PM, Austin S. Hemmelgarn wrote: On 2018-07-18 15:42, Goffredo Baroncelli wrote: On 07/18/2018 09:20 AM, Duncan wrote: Goffredo Baroncelli posted on Wed, 18 Jul 2018 07:59:52 +0200 as excerpted: On 07/17/2018 11:12 PM, Duncan wrote: Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as excerpted: [...] When I say orthogonal, It means that these can be combined: i.e. you can have - striping (RAID0) - parity (?) - striping + parity (e.g. RAID5/6) - mirroring (RAID1) - mirroring + striping (RAID10) However you can't have mirroring+parity; this means that a notation where both 'C' ( = number of copy) and 'P' ( = number of parities) is too verbose. Yes, you can have mirroring+parity, conceptually it's simply raid5/6 on top of mirroring or mirroring on top of raid5/6, much as raid10 is conceptually just raid0 on top of raid1, and raid01 is conceptually raid1 on top of raid0. And what about raid 615156156 (raid 6 on top of raid 1 on top of raid 5 on top of) ??? Seriously, of course you can combine a lot of different profile; however the only ones that make sense are the ones above. No, there are cases where other configurations make sense. RAID05 and RAID06 are very widely used, especially on NAS systems where you have lots of disks. The RAID5/6 lower layer mitigates the data loss risk of RAID0, and the RAID0 upper-layer mitigates the rebuild scalability issues of RAID5/6. In fact, this is pretty much the standard recommended configuration for large ZFS arrays that want to use parity RAID. This could be reasonably easily supported to a rudimentary degree in BTRFS by providing the ability to limit the stripe width for the parity profiles. Some people use RAID50 or RAID60, although they are strictly speaking inferior in almost all respects to RAID05 and RAID06. RAID01 is also used on occasion, it ends up having the same storage capacity as RAID10, but for some RAID implementations it has a different performance envelope and different rebuild characteristics. Usually, when it is used though, it's software RAID0 on top of hardware RAID1. RAID51 and RAID61 used to be used, but aren't much now. They provided an easy way to have proper data verification without always having the rebuild overhead of RAID5/6 and without needing to do checksumming. They are pretty much useless for BTRFS, as it can already tell which copy is correct. So until now you are repeating what I told: the only useful raid profile are - striping - mirroring - striping+paring (even limiting the number of disk involved) - striping+mirroring No, not quite. At least, not in the combinations you're saying make sense if you are using standard terminology. RAID05 and RAID06 are not the same thing as 'striping+parity' as BTRFS implements that case, and can be significantly more optimized than the trivial implementation of just limiting the number of disks involved in each chunk (by, you know, actually striping just like what we currently call raid10 mode in BTRFS does). RAID15 and RAID16 are a similar case to RAID51 and RAID61, except they might actually make sense in BTRFS to provide a backup means of rebuilding blocks that fail checksum validation if both copies fail. If you need further redundancy, it is easy to implement a parity3 and parity4 raid profile instead of stacking a raid6+raid1 I think you're misunderstanding what I mean here. RAID15/16 consist of two layers: * The top layer is regular RAID1, usually limited to two copies. 
* The lower layer is RAID5 or RAID6. This means that the lower layer can validate which of the two copies in the upper layer is correct when they don't agree. It doesn't really provide significantly better redundancy (they can technically sustain more disk failures without failing completely than simple two-copy RAID1 can, but just like BTRFS raid10, they can't reliably survive more than one (or two if you're using RAID6 as the lower layer) disk failure), so it does not do the same thing that higher-order parity does. The fact that you can combine striping and mirroring (or pairing) makes sense because you could have a speed gain (see below). [] As someone else pointed out, md/lvm-raid10 already work like this. What btrfs calls raid10 is somewhat different, but btrfs raid1 pretty much works this way except with huge (gig size) chunks. As implemented in BTRFS, raid1 doesn't have striping. The argument is that because there's only two copies, on multi-device btrfs raid1 with 4+ devices of equal size so chunk allocations tend to alternate device pairs, it's effectively striped at the macro level, with the 1 GiB device-level chunks effectively being huge individual device strips of 1 GiB. The striping concept is based to the fact that if the "stripe size" is small enough you have a speed benefit because the reads may be performed in par
Re: [PATCH 0/4] 3- and 4- copy RAID1
On 2018-07-19 03:27, Qu Wenruo wrote: On 2018年07月14日 02:46, David Sterba wrote: Hi, I have some goodies that go into the RAID56 problem, although not implementing all the remaining features, it can be useful independently. This time my hackweek project https://hackweek.suse.com/17/projects/do-something-about-btrfs-and-raid56 aimed to implement the fix for the write hole problem but I spent more time with analysis and design of the solution and don't have a working prototype for that yet. This patchset brings a feature that will be used by the raid56 log, the log has to be on the same redundancy level and thus we need a 3-copy replication for raid6. As it was easy to extend to higher replication, I've added a 4-copy replication, that would allow triple copy raid (that does not have a standardized name). So this special level will be used for RAID56 for now? Or it will also be possible for metadata usage just like current RAID1? If the latter, the metadata scrub problem will need to be considered more. For more copies RAID1, it's will have higher possibility one or two devices missing, and then being scrubbed. For metadata scrub, inlined csum can't ensure it's the latest one. So for such RAID1 scrub, we need to read out all copies and compare their generation to find out the correct copy. At least from the changeset, it doesn't look like it's addressed yet. And this also reminds me that current scrub is not as flex as balance, I really like we could filter block groups to scrub just like balance, and do scrub in a block group basis, other than devid basis. That's to say, for a block group scrub, we don't really care which device we're scrubbing, we just need to ensure all device in this block is storing correct data. This would actually be rather useful for non-parity cases too. Being able to scrub only metadata when the data chunks are using a profile that provides no rebuild support would be great for performance. On the same note, it would be _really_ nice to be able to scrub a subset of the volume's directory tree, even if it were only per-subvolume. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
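For context, scrub today can only be pointed at a whole mounted filesystem or at a single member device, while balance already has the kind of block-group filtering being wished for above (mount point and threshold are just examples):

  btrfs scrub start -B /mnt/data             # scrub everything in the filesystem
  btrfs scrub start -B /dev/sdc              # scrub only the chunks stored on this one member device
  btrfs balance start -dusage=20 /mnt/data   # balance, by contrast, can already select block groups by filter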
Re: [PATCH 0/4] 3- and 4- copy RAID1
On 2018-07-18 15:42, Goffredo Baroncelli wrote: On 07/18/2018 09:20 AM, Duncan wrote: Goffredo Baroncelli posted on Wed, 18 Jul 2018 07:59:52 +0200 as excerpted: On 07/17/2018 11:12 PM, Duncan wrote: Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as excerpted: On 07/15/2018 04:37 PM, waxhead wrote: Striping and mirroring/pairing are orthogonal properties; mirror and parity are mutually exclusive. I can't agree. I don't know whether you meant that in the global sense, or purely in the btrfs context (which I suspect), but either way I can't agree. In the pure btrfs context, while striping and mirroring/pairing are orthogonal today, Hugo's whole point was that btrfs is theoretically flexible enough to allow both together and the feature may at some point be added, so it makes sense to have a layout notation format flexible enough to allow it as well. When I say orthogonal, It means that these can be combined: i.e. you can have - striping (RAID0) - parity (?) - striping + parity (e.g. RAID5/6) - mirroring (RAID1) - mirroring + striping (RAID10) However you can't have mirroring+parity; this means that a notation where both 'C' ( = number of copy) and 'P' ( = number of parities) is too verbose. Yes, you can have mirroring+parity, conceptually it's simply raid5/6 on top of mirroring or mirroring on top of raid5/6, much as raid10 is conceptually just raid0 on top of raid1, and raid01 is conceptually raid1 on top of raid0. And what about raid 615156156 (raid 6 on top of raid 1 on top of raid 5 on top of) ??? Seriously, of course you can combine a lot of different profile; however the only ones that make sense are the ones above. No, there are cases where other configurations make sense. RAID05 and RAID06 are very widely used, especially on NAS systems where you have lots of disks. The RAID5/6 lower layer mitigates the data loss risk of RAID0, and the RAID0 upper-layer mitigates the rebuild scalability issues of RAID5/6. In fact, this is pretty much the standard recommended configuration for large ZFS arrays that want to use parity RAID. This could be reasonably easily supported to a rudimentary degree in BTRFS by providing the ability to limit the stripe width for the parity profiles. Some people use RAID50 or RAID60, although they are strictly speaking inferior in almost all respects to RAID05 and RAID06. RAID01 is also used on occasion, it ends up having the same storage capacity as RAID10, but for some RAID implementations it has a different performance envelope and different rebuild characteristics. Usually, when it is used though, it's software RAID0 on top of hardware RAID1. RAID51 and RAID61 used to be used, but aren't much now. They provided an easy way to have proper data verification without always having the rebuild overhead of RAID5/6 and without needing to do checksumming. They are pretty much useless for BTRFS, as it can already tell which copy is correct. RAID15 and RAID16 are a similar case to RAID51 and RAID61, except they might actually make sense in BTRFS to provide a backup means of rebuilding blocks that fail checksum validation if both copies fail. The fact that you can combine striping and mirroring (or pairing) makes sense because you could have a speed gain (see below). [] As someone else pointed out, md/lvm-raid10 already work like this. What btrfs calls raid10 is somewhat different, but btrfs raid1 pretty much works this way except with huge (gig size) chunks. As implemented in BTRFS, raid1 doesn't have striping. 
The argument is that because there's only two copies, on multi-device btrfs raid1 with 4+ devices of equal size so chunk allocations tend to alternate device pairs, it's effectively striped at the macro level, with the 1 GiB device-level chunks effectively being huge individual device strips of 1 GiB. The striping concept is based to the fact that if the "stripe size" is small enough you have a speed benefit because the reads may be performed in parallel from different disks. That's not the only benefit of striping though. The other big one is that you now have one volume that's the combined size of both of the original devices. Striping is arguably better for this even if you're using a large stripe size because it better balances the wear across the devices than simple concatenation. With a "stripe size" of 1GB, it is very unlikely that this would happens. That's a pretty big assumption. There are all kinds of access patterns that will still distribute the load reasonably evenly across the constituent devices, even if they don't parallelize things. If, for example, all your files are 64k or less, and you only read whole files, there's no functional difference between RAID0 with 1GB blocks and RAID0 with 64k blocks. Such a workload is not unusual on a very busy mail-server. At 1 GiB strip size it doesn't have the typical performance advantage of striping, but
Re: Healthy amount of free space?
On 2018-07-18 17:32, Chris Murphy wrote: On Wed, Jul 18, 2018 at 12:01 PM, Austin S. Hemmelgarn wrote: On 2018-07-18 13:40, Chris Murphy wrote: On Wed, Jul 18, 2018 at 11:14 AM, Chris Murphy wrote: I don't know for sure, but based on the addresses reported before and after dd for the fallocated tmp file, it looks like Btrfs is not using the originally fallocated addresses for dd. So maybe it is COWing into new blocks, but is just as quickly deallocating the fallocated blocks as it goes, and hence doesn't end up in enospc? Previous thread is "Problem with file system" from August 2017. And there's these reproduce steps from Austin which have fallocate coming after the dd. truncate --size=4G ./test-fs mkfs.btrfs ./test-fs mkdir ./test mount -t auto ./test-fs ./test dd if=/dev/zero of=./test/test bs=65536 count=32768 fallocate -l 2147483650 ./test/test && echo "Success!" My test Btrfs is 2G not 4G, so I'm cutting the values of dd and fallocate in half. [chris@f28s btrfs]$ sudo dd if=/dev/zero of=tmp bs=1M count=1000 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB, 1000 MiB) copied, 7.13391 s, 147 MB/s [chris@f28s btrfs]$ sync [chris@f28s btrfs]$ df -h FilesystemSize Used Avail Use% Mounted on /dev/mapper/vg-btrfstest 2.0G 1018M 1.1G 50% /mnt/btrfs [chris@f28s btrfs]$ sudo fallocate -l 1000m tmp Succeeds. If I do it with a 1200M file for dd and fallocate 1200M over it, this fails, but I kinda expect that because there's only 1.1G free space. But maybe that's what you're saying is the bug, it shouldn't fail? Yes, you're right, I had things backwards (well, kind of, this does work on ext4 and regular XFS, so it arguably should work here). I guess I'm confused what it even means to fallocate over a file with in-use blocks unless either -d or -p options are used. And from the man page, I don't grok the distinction between -d and -p either. But based on their descriptions I'd expect they both should work without enospc. Without any specific options, it forces allocation of any sparse regions in the file (that is, it gets rid of holes in the file). On BTRFS, I believe the command also forcibly unshares all the extents in the file (for the system call, there's a special flag for doing this). Additionally, you can extend a file with fallocate this way by specifying a length longer than the current size of the file, which guarantees that writes into that region will succeed, unlike truncating the file to a larger size, which just creates a hole at the end of the file to bring it up to size. As far as `-d` versus `-p`: `-p` directly translates to the option for the system call that punches a hole. It requires a length and possibly an offset, and will punch a hole at that exact location of that exact size. `-d` is a special option that's only available for the command. It tells the `fallocate` command to search the file for zero-filled regions, and punch holes there. Neither option should ever trigger an ENOSPC, except possibly if it has to split an extent for some reason and you are completely out of metadata space. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
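To make the three variants concrete (file name and sizes are arbitrary):

  fallocate -l 2G ./tmp            # plain preallocation: extend/allocate so later writes in that range should not hit ENOSPC
  fallocate -p -o 1M -l 4M ./tmp   # punch a hole of exactly 4 MiB starting at offset 1 MiB
  fallocate -d ./tmp               # dig holes: scan the file for zero-filled blocks and deallocate just those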
Re: Healthy amount of free space?
On 2018-07-18 13:40, Chris Murphy wrote: On Wed, Jul 18, 2018 at 11:14 AM, Chris Murphy wrote: I don't know for sure, but based on the addresses reported before and after dd for the fallocated tmp file, it looks like Btrfs is not using the originally fallocated addresses for dd. So maybe it is COWing into new blocks, but is just as quickly deallocating the fallocated blocks as it goes, and hence doesn't end up in enospc? Previous thread is "Problem with file system" from August 2017. And there's these reproduce steps from Austin which have fallocate coming after the dd. truncate --size=4G ./test-fs mkfs.btrfs ./test-fs mkdir ./test mount -t auto ./test-fs ./test dd if=/dev/zero of=./test/test bs=65536 count=32768 fallocate -l 2147483650 ./test/test && echo "Success!" My test Btrfs is 2G not 4G, so I'm cutting the values of dd and fallocate in half. [chris@f28s btrfs]$ sudo dd if=/dev/zero of=tmp bs=1M count=1000 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB, 1000 MiB) copied, 7.13391 s, 147 MB/s [chris@f28s btrfs]$ sync [chris@f28s btrfs]$ df -h FilesystemSize Used Avail Use% Mounted on /dev/mapper/vg-btrfstest 2.0G 1018M 1.1G 50% /mnt/btrfs [chris@f28s btrfs]$ sudo fallocate -l 1000m tmp Succeeds. If I do it with a 1200M file for dd and fallocate 1200M over it, this fails, but I kinda expect that because there's only 1.1G free space. But maybe that's what you're saying is the bug, it shouldn't fail? Yes, you're right, I had things backwards (well, kind of, this does work on ext4 and regular XFS, so it arguably should work here). -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Healthy amount of free space?
On 2018-07-18 09:07, Chris Murphy wrote: On Wed, Jul 18, 2018 at 6:35 AM, Austin S. Hemmelgarn wrote: If you're doing a training presentation, it may be worth mentioning that preallocation with fallocate() does not behave the same on BTRFS as it does on other filesystems. For example, the following sequence of commands: fallocate -l X ./tmp dd if=/dev/zero of=./tmp bs=1 count=X Will always work on ext4, XFS, and most other filesystems, for any value of X between zero and just below the total amount of free space on the filesystem. On BTRFS though, it will reliably fail with ENOSPC for values of X that are greater than _half_ of the total amount of free space on the filesystem (actually, greater than just short of half). In essence, preallocating space does not prevent COW semantics for the first write unless the file is marked NOCOW. Is this a bug, or is it suboptimal behavior, or is it intentional? It's been discussed before, though I can't find the email thread right now. Pretty much, this is _technically_ not incorrect behavior, as the documentation for fallocate doesn't say that subsequent writes can't fail due to lack of space. I personally consider it a bug though because it breaks from existing behavior in a way that is avoidable and defies user expectations. There are two issues here: 1. Regions preallocated with fallocate still do COW on the first write to any given block in that region. This can be handled by either treating the first write to each block as NOCOW, or by allocating a bit of extra space and doing a rotating approach like this for writes: - Write goes into the extra space. - Once the write is done, convert the region covered by the write into a new block of extra space. - When the final block of the preallocated region is written, deallocate the extra space. 2. Preallocation does not completely account for necessary metadata space that will be needed to store the data there. This may not be necessary if the first issue is addressed properly. And then I wonder what happens with XFS COW: fallocate -l X ./tmp cp --reflink ./tmp ./tmp2 dd if=/dev/zero of=./tmp bs=1 count=X I'm not sure. In this particular case, this will fail on BTRFS for any X larger than just short of one third of the total free space. I would expect it to fail for any X larger than just short of half instead. ZFS gets around this by not supporting fallocate (well, kind of, if you're using glibc and call posix_fallocate, that _will_ work, but it will take forever because it works by writing out each block of space that's being allocated, which, ironically, means that that still suffers from the same issue potentially that we have). -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/4] 3- and 4- copy RAID1
On 2018-07-18 03:20, Duncan wrote: Goffredo Baroncelli posted on Wed, 18 Jul 2018 07:59:52 +0200 as excerpted: On 07/17/2018 11:12 PM, Duncan wrote: Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as excerpted: On 07/15/2018 04:37 PM, waxhead wrote: Striping and mirroring/pairing are orthogonal properties; mirror and parity are mutually exclusive. I can't agree. I don't know whether you meant that in the global sense, or purely in the btrfs context (which I suspect), but either way I can't agree. In the pure btrfs context, while striping and mirroring/pairing are orthogonal today, Hugo's whole point was that btrfs is theoretically flexible enough to allow both together and the feature may at some point be added, so it makes sense to have a layout notation format flexible enough to allow it as well. When I say orthogonal, It means that these can be combined: i.e. you can have - striping (RAID0) - parity (?) - striping + parity (e.g. RAID5/6) - mirroring (RAID1) - mirroring + striping (RAID10) However you can't have mirroring+parity; this means that a notation where both 'C' ( = number of copy) and 'P' ( = number of parities) is too verbose. Yes, you can have mirroring+parity, conceptually it's simply raid5/6 on top of mirroring or mirroring on top of raid5/6, much as raid10 is conceptually just raid0 on top of raid1, and raid01 is conceptually raid1 on top of raid0. While it's not possible today on (pure) btrfs (it's possible today with md/dm-raid or hardware-raid handling one layer), it's theoretically possible both for btrfs and in general, and it could be added to btrfs in the future, so a notation with the flexibility to allow parity and mirroring together does make sense, and having just that sort of flexibility is exactly why Hugo made the notation proposal he did. Tho a sensible use-case for mirroring+parity is a different question. I can see a case being made for it if one layer is hardware/firmware raid, but I'm not entirely sure what the use-case for pure-btrfs raid16 or 61 (or 15 or 51) might be, where pure mirroring or pure parity wouldn't arguably be a at least as good a match to the use-case. Perhaps one of the other experts in such things here might help with that. Question #2: historically RAID10 is requires 4 disks. However I am guessing if the stripe could be done on a different number of disks: What about RAID1+Striping on 3 (or 5 disks) ? The key of striping is that every 64k, the data are stored on a different disk As someone else pointed out, md/lvm-raid10 already work like this. What btrfs calls raid10 is somewhat different, but btrfs raid1 pretty much works this way except with huge (gig size) chunks. As implemented in BTRFS, raid1 doesn't have striping. The argument is that because there's only two copies, on multi-device btrfs raid1 with 4+ devices of equal size so chunk allocations tend to alternate device pairs, it's effectively striped at the macro level, with the 1 GiB device-level chunks effectively being huge individual device strips of 1 GiB. Actually, it also behaves like LVM and MD RAID10 for any number of devices greater than 2, though the exact placement may diverge because of BTRFS's concept of different chunk types. In LVM and MD RAID10, each block is stored as two copies, and what disks it ends up on is dependent on the block number modulo the number of disks (so, for 3 disks A, B, and C, block 0 is on A and B, block 1 is on C and A, and block 2 is on B and C, with subsequent blocks following the same pattern). 
In an idealized model of BTRFS with only one chunk type, you get exactly the same behavior (because BTRFS allocates chunks based on disk utilization, and prefers lower numbered disks to higher ones in the event of a tie). At 1 GiB strip size it doesn't have the typical performance advantage of striping, but conceptually, it's equivalent to raid10 with huge 1 GiB strips/chunks. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
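The md behaviour being described corresponds to something like the following, which is perfectly legal with an odd number of devices (device names invented):

  mdadm --create /dev/md0 --level=10 --layout=n2 --raid-devices=3 /dev/sda /dev/sdb /dev/sdc   # two 'near' copies spread across three disks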
Re: [PATCH 0/4] 3- and 4- copy RAID1
On 2018-07-18 04:39, Duncan wrote: Duncan posted on Wed, 18 Jul 2018 07:20:09 + as excerpted: As implemented in BTRFS, raid1 doesn't have striping. The argument is that because there's only two copies, on multi-device btrfs raid1 with 4+ devices of equal size so chunk allocations tend to alternate device pairs, it's effectively striped at the macro level, with the 1 GiB device-level chunks effectively being huge individual device strips of 1 GiB. At 1 GiB strip size it doesn't have the typical performance advantage of striping, but conceptually, it's equivalent to raid10 with huge 1 GiB strips/chunks. I forgot this bit... Similarly, multi-device single is regarded by some to be conceptually equivalent to raid0 with really huge GiB strips/chunks. (As you may note, "the argument is" and "regarded by some" are distancing phrases. I've seen the argument made on-list, but while I understand the argument and agree with it to some extent, I'm still a bit uncomfortable with it and don't normally make it myself, this thread being a noted exception tho originally I simply repeated what someone else already said in-thread, because I too agree it's stretching things a bit. But it does appear to be a useful conceptual equivalency for some, and I do see the similarity. If the file is larger than the data chunk size, it _is_ striped, because it spans multiple chunks which are on separate devices. Otherwise, it's more similar to what in GlusterFS is called a 'distributed volume'. In such a Gluster volume, each file is entirely stored on one node (or you have a complete copy on N nodes where N is the number of replicas), with the selection of what node is used for the next file created being based on which node has the most free space. That said, the main reason I explain single and raid1 the way I do is that I've found it's a much simpler way to explain generically how they work to people who already have storage background but may not care about the specifics. Perhaps it's a case of coder's view (no code doing it that way, it's just a coincidental oddity conditional on equal sizes), vs. sysadmin's view (code or not, accidental or not, it's a reasonably accurate high-level description of how it ends up working most of the time with equivalent sized devices).) -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Healthy amount of free space?
On 2018-07-17 13:54, Martin Steigerwald wrote: Nikolay Borisov - 17.07.18, 10:16: On 17.07.2018 11:02, Martin Steigerwald wrote: Nikolay Borisov - 17.07.18, 09:20: On 16.07.2018 23:58, Wolf wrote: Greetings, I would like to ask what what is healthy amount of free space to keep on each device for btrfs to be happy? This is how my disk array currently looks like [root@dennas ~]# btrfs fi usage /raid Overall: Device size: 29.11TiB Device allocated: 21.26TiB Device unallocated:7.85TiB Device missing: 0.00B Used: 21.18TiB Free (estimated): 3.96TiB (min: 3.96TiB) Data ratio: 2.00 Metadata ratio: 2.00 Global reserve: 512.00MiB (used: 0.00B) […] Btrfs does quite good job of evenly using space on all devices. No, how low can I let that go? In other words, with how much space free/unallocated remaining space should I consider adding new disk? Btrfs will start running into problems when you run out of unallocated space. So the best advice will be monitor your device unallocated, once it gets really low - like 2-3 gb I will suggest you run balance which will try to free up unallocated space by rewriting data more compactly into sparsely populated block groups. If after running balance you haven't really freed any space then you should consider adding a new drive and running balance to even out the spread of data/metadata. What are these issues exactly? For example if you have plenty of data space but your metadata is full then you will be getting ENOSPC. Of that one I am aware. This just did not happen so far. I did not yet add it explicitly to the training slides, but I just make myself a note to do that. Anything else? If you're doing a training presentation, it may be worth mentioning that preallocation with fallocate() does not behave the same on BTRFS as it does on other filesystems. For example, the following sequence of commands: fallocate -l X ./tmp dd if=/dev/zero of=./tmp bs=1 count=X Will always work on ext4, XFS, and most other filesystems, for any value of X between zero and just below the total amount of free space on the filesystem. On BTRFS though, it will reliably fail with ENOSPC for values of X that are greater than _half_ of the total amount of free space on the filesystem (actually, greater than just short of half). In essence, preallocating space does not prevent COW semantics for the first write unless the file is marked NOCOW. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
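In terms of commands for the slides, the monitoring and recovery step Nikolay describes amounts to roughly this (the mount point and the usage threshold are just examples):

  btrfs filesystem usage /raid                        # keep an eye on the 'Device unallocated' line
  btrfs balance start -dusage=25 -musage=25 /raid     # rewrite only mostly-empty chunks to return space to 'unallocated'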
Re: Healthy amount of free space?
On 2018-07-16 16:58, Wolf wrote: Greetings, I would like to ask what what is healthy amount of free space to keep on each device for btrfs to be happy? This is how my disk array currently looks like [root@dennas ~]# btrfs fi usage /raid Overall: Device size: 29.11TiB Device allocated: 21.26TiB Device unallocated:7.85TiB Device missing: 0.00B Used: 21.18TiB Free (estimated): 3.96TiB (min: 3.96TiB) Data ratio: 2.00 Metadata ratio: 2.00 Global reserve: 512.00MiB (used: 0.00B) Data,RAID1: Size:10.61TiB, Used:10.58TiB /dev/mapper/data1 1.75TiB /dev/mapper/data2 1.75TiB /dev/mapper/data3 856.00GiB /dev/mapper/data4 856.00GiB /dev/mapper/data5 1.75TiB /dev/mapper/data6 1.75TiB /dev/mapper/data7 6.29TiB /dev/mapper/data8 6.29TiB Metadata,RAID1: Size:15.00GiB, Used:13.00GiB /dev/mapper/data1 2.00GiB /dev/mapper/data2 3.00GiB /dev/mapper/data3 1.00GiB /dev/mapper/data4 1.00GiB /dev/mapper/data5 3.00GiB /dev/mapper/data6 1.00GiB /dev/mapper/data7 9.00GiB /dev/mapper/data8 10.00GiB Slightly OT, but the distribution of metadata chunks across devices looks a bit sub-optimal here. If you can tolerate the volume being somewhat slower for a while, I'd suggest balancing these (it should get you better performance long-term). System,RAID1: Size:64.00MiB, Used:1.50MiB /dev/mapper/data2 32.00MiB /dev/mapper/data6 32.00MiB /dev/mapper/data7 32.00MiB /dev/mapper/data8 32.00MiB Unallocated: /dev/mapper/data11004.52GiB /dev/mapper/data21004.49GiB /dev/mapper/data31006.01GiB /dev/mapper/data41006.01GiB /dev/mapper/data51004.52GiB /dev/mapper/data61004.49GiB /dev/mapper/data71005.00GiB /dev/mapper/data81005.00GiB Btrfs does quite good job of evenly using space on all devices. No, how low can I let that go? In other words, with how much space free/unallocated remaining space should I consider adding new disk? Disclaimer: What I'm about to say is based on personal experience. YMMV. It depends on how you use the filesystem. Realistically, there are a couple of things I consider when trying to decide on this myself: * How quickly does the total usage increase on average, and how much can it be expected to increase in one day in the worst case scenario? This isn't really BTRFS specific, but it's worth mentioning. I usually don't let an array get close enough to full that it wouldn't be able to safely handle at least one day of the worst case increase and another 2 of average increases. In BTRFS terms, the 'safely handle' part means you should be adding about 5GB for a multi-TB array like you have, or about 1GB for a sub-TB array. * What are the typical write patterns? Do files get rewritten in-place, or are they only ever rewritten with a replace-by-rename? Are writes mostly random, or mostly sequential? Are writes mostly small or mostly large? The more towards the first possibility listed in each of those question (in-place rewrites, random access, and small writes), the more free space you should keep on the volume. * Does this volume see heavy usage of fallocate() either to preallocate space (note that this _DOES NOT WORK SANELY_ on BTRFS), or to punch holes or remove ranges from files. If whatever software you're using does this a lot on this volume, you want even more free space. * Do old files tend to get removed in large batches? That is, possibly hundreds or thousands of files at a time. If so, and you're running a reasonably recent (4.x series) kernel or regularly balance the volume to clean up empty chunks, you don't need quite as much free space. 
* How quickly can you get a new device added, and is it critical that this volume always be writable? Sounds stupid, but a lot of people don't consider this. If you can trivially get a new device added immediately, you can generally let things go a bit further than you would normally, same for if the volume being read-only can be tolerated for a while without significant issues. It's worth noting that I explicitly do not care about snapshot usage. It rarely has much impact on this other than changing how the total usage increases in a day. Evaluating all of this is of course something I can't really do for you. If I had to guess, with no other information than the allocations shown, I'd say that you're probably generically fine until you get down to about 5GB more than twice the average
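If you want to automate the "watch your unallocated space" advice from earlier in the thread, a minimal sketch might look like the following. The mount point and the 5GiB threshold are placeholders, and the awk field positions assume current btrfs-progs output, so check against your own version:

  #!/bin/sh
  # warn when a BTRFS volume's unallocated space drops below a threshold
  mnt=/raid
  threshold=$((5 * 1024 * 1024 * 1024))   # 5GiB, adjust to taste
  unalloc=$(btrfs filesystem usage -b "$mnt" | awk '/Device unallocated:/ {print $3; exit}')
  if [ "$unalloc" -lt "$threshold" ]; then
      echo "WARNING: only $unalloc bytes unallocated on $mnt; consider balancing or adding a device"
  fi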
Re: [PATCH 0/4] 3- and 4- copy RAID1
On 2018-07-16 14:29, Goffredo Baroncelli wrote: On 07/15/2018 04:37 PM, waxhead wrote: David Sterba wrote: An interesting question is the naming of the extended profiles. I picked something that can be easily understood but it's not a final proposal. Years ago, Hugo proposed a naming scheme that described the non-standard raid varieties of the btrfs flavor: https://marc.info/?l=linux-btrfs=136286324417767 Switching to this naming would be a good addition to the extended raid. As just a humble BTRFS user I agree and really think it is about time to move far away from the RAID terminology. However adding some more descriptive profile names (or at least some aliases) would be much better for the commoners (such as myself). For example: Old format / New Format / My suggested alias SINGLE / 1C / SINGLE DUP / 2CD / DUP (or even MIRRORLOCAL1) RAID0 / 1CmS / STRIPE RAID1 / 2C / MIRROR1 RAID1c3 / 3C / MIRROR2 RAID1c4 / 4C / MIRROR3 RAID10 / 2CmS / STRIPE.MIRROR1 Striping and mirroring/pairing are orthogonal properties; mirror and parity are mutually exclusive. What about RAID1 -> MIRROR1 RAID10 -> MIRROR1S RAID1c3 -> MIRROR2 RAID1c3+striping -> MIRROR2S and so on... RAID5 / 1CmS1P / STRIPE.PARITY1 RAID6 / 1CmS2P / STRIPE.PARITY2 To me these should be called something like RAID5 -> PARITY1S RAID6 -> PARITY2S The S final is due to the fact that usually RAID5/6 spread the data on all available disks Question #1: for "parity" profiles, does make sense to limit the maximum disks number where the data may be spread ? If the answer is not, we could omit the last S. IMHO it should. Currently, there is no ability to cap the number of disks that striping can happen across. Ideally, that will change in the future, in which case not only the S will be needed, but also a number indicating how wide the stripe is. Question #2: historically RAID10 is requires 4 disks. However I am guessing if the stripe could be done on a different number of disks: What about RAID1+Striping on 3 (or 5 disks) ? The key of striping is that every 64k, the data are stored on a different disk This is what MD and LVM RAID10 do. They work somewhat differently from what BTRFS calls raid10 (actually, what we currently call raid1 works almost identically to MD and LVM RAID10 when more than 3 disks are involved, except that the chunk size is 1G or larger). Short of drastic internal changes to how that profile works, this isn't likely to happen. In spite of both of these, there is practical need for indicating the stripe width. Depending on the configuration of the underlying storage, it's fully possible (and sometimes even certain) that you will see chunks with differing stripe widths, so properly reporting the stripe width (in devices, not bytes) is useful for monitoring purposes). Consider for example a 6-device array using what's currently called a raid10 profile where 2 of the disks are smaller than the other four. On such an array, chunks will span all six disks (resulting in 2 copies striped across 3 disks each) until those two smaller disks are full, at which point new chunks will span only the remaining four disks (resulting in 2 copies striped across 2 disks each). -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: unsolvable technical issues?
On 2018-07-03 03:35, Duncan wrote: Austin S. Hemmelgarn posted on Mon, 02 Jul 2018 07:49:05 -0400 as excerpted: Notably, most Intel systems I've seen have the SATA controllers in the chipset enumerate after the USB controllers, and the whole chipset enumerates after add-in cards (so they almost always have this issue), while most AMD systems I've seen demonstrate the exact opposite behavior, they enumerate the SATA controller from the chipset before the USB controllers, and then enumerate the chipset before all the add-in cards (so they almost never have this issue). Thanks. That's a difference I wasn't aware of, and would (because I tend to favor amd) explain why I've never seen a change in enumeration order unless I've done something like unplug my sata cables for maintenance and forget which ones I had plugged in where -- random USB stuff left plugged in doesn't seem to matter, even choosing different boot media from the bios doesn't seem to matter by the time the kernel runs (I'm less sure about grub). Additionally though, if you in some way make sure SATA drivers are loaded before USB ones, you will also never see this issue because of USB devices (same goes for GRUB). A lot of laptops that use connections other than USB for the keyboard and mouse behave like this if you use a properly stripped down initramfs because you won't have USB drivers in the initramfs (and therefore the SATA drivers always load first). -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
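For what it's worth, if you do want the SATA drivers loaded ahead of the USB ones from an initramfs, the exact mechanism is distro specific; on Debian-style initramfs-tools setups, something like the following is one way to pin the SATA driver into the early-loaded module list (illustrative only, and it only helps if your controller actually uses ahci):

  echo ahci >> /etc/initramfs-tools/modules
  update-initramfs -u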
Re: how to best segment a big block device in resizeable btrfs filesystems?
On 2018-07-02 13:34, Marc MERLIN wrote: On Mon, Jul 02, 2018 at 12:59:02PM -0400, Austin S. Hemmelgarn wrote: Am I supposed to put LVM thin volumes underneath so that I can share the same single 10TB raid5? Actually, because of the online resize ability in BTRFS, you don't technically _need_ to use thin provisioning here. It makes the maintenance a bit easier, but it also adds a much more complicated layer of indirection than just doing regular volumes. You're right that I can use btrfs resize, but then I still need an LVM device underneath, correct? So, if I have 10 backup targets, I need 10 LVM LVs, I give them 10% each of the full size available (as a guess), and then I'd have to - btrfs resize down one that's bigger than I need - LVM shrink the LV - LVM grow the other LV - LVM resize up the other btrfs and I think LVM resize and btrfs resize are not linked so I have to do them separately and hope to type the right numbers each time, correct? (or is that easier now?) I kind of linked the thin provisioning idea because it's hands off, which is appealing. Any reason against it? No, not currently, except that it adds a whole lot more stuff between BTRFS and whatever layer is below it. That increase in what's being done adds some overhead (it's noticeable on 7200 RPM consumer SATA drives, but not on decent consumer SATA SSD's). There used to be issues running BTRFS on top of LVM thin targets which had zero mode turned off, but AFAIK, all of those problems were fixed long ago (before 4.0). You could (in theory) merge the LVM and software RAID5 layers, though that may make handling of the RAID5 layer a bit complicated if you choose to use thin provisioning (for some reason, LVM is unable to do on-line checks and rebuilds of RAID arrays that are acting as thin pool data or metadata). Does LVM do built in raid5 now? Is it as good/trustworthy as mdadm radi5? Actually, it uses MD's RAID5 implementation as a back-end. Same for RAID6, and optionally for RAID0, RAID1, and RAID10. But yeah, if it's incompatible with thin provisioning, it's not that useful. It's technically not incompatible, just a bit of a pain. Last time I tried to use it, you had to jump through hoops to repair a damaged RAID volume that was serving as an underlying volume in a thin pool, and it required keeping the thin pool offline for the entire duration of the rebuild. Alternatively, you could increase your array size, remove the software RAID layer, and switch to using BTRFS in raid10 mode so that you could eliminate one of the layers, though that would probably reduce the effectiveness of bcache (you might want to get a bigger cache device if you do this). Sadly that won't work. I have more data than will fit on raid10 Thanks for your suggestions though. Still need to read up on whether I should do thin provisioning, or not. If you do go with thin provisioning, I would encourage you to make certain to call fstrim on the BTRFS volumes on a semi regular basis so that the thin pool doesn't get filled up with old unused blocks, preferably when you are 100% certain that there are no ongoing writes on them (trimming blocks on BTRFS gets rid of old root trees, so it's a bit dangerous to do it while writes are happening). -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
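As a concrete (and purely illustrative) version of that fstrim advice, something along these lines run during a known-quiet window would do; the mount points are placeholders for the per-target BTRFS volumes sitting on the thin pool:

  # run from cron (or a systemd timer) during a quiet window, e.g. weekly
  fstrim -v /mnt/backups1
  fstrim -v /mnt/backups2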
Re: So, does btrfs check lowmem take days? weeks?
On 2018-07-02 11:19, Marc MERLIN wrote: Hi Qu, thanks for the detailled and honest answer. A few comments inline. On Mon, Jul 02, 2018 at 10:42:40PM +0800, Qu Wenruo wrote: For full, it depends. (but for most real world case, it's still flawed) We have small and crafted images as test cases, which btrfs check can repair without problem at all. But such images are *SMALL*, and only have *ONE* type of corruption, which can represent real world case at all. right, they're just unittest images, I understand. 1) Too large fs (especially too many snapshots) The use case (too many snapshots and shared extents, a lot of extents get shared over 1000 times) is in fact a super large challenge for lowmem mode check/repair. It needs O(n^2) or even O(n^3) to check each backref, which hugely slow the progress and make us hard to locate the real bug. So, the non lowmem version would work better, but it's a problem if it doesn't fit in RAM. I've always considered it a grave bug that btrfs check repair can use so much kernel memory that it will crash the entire system. This should not be possible. While it won't help me here, can btrfs check be improved not to suck all the kernel memory, and ideally even allow using swap space if the RAM is not enough? Is btrfs check regular mode still being maintained? I think it's still better than lowmem, correct? 2) Corruption in extent tree and our objective is to mount RW Extent tree is almost useless if we just want to read data. But when we do any write, we needs it and if it goes wrong even a tiny bit, your fs could be damaged really badly. For other corruption, like some fs tree corruption, we could do something to discard some corrupted files, but if it's extent tree, we either mount RO and grab anything we have, or hopes the almost-never-working --init-extent-tree can work (that's mostly miracle). I understand that it's the weak point of btrfs, thanks for explaining. 1) Don't keep too many snapshots. Really, this is the core. For send/receive backup, IIRC it only needs the parent subvolume exists, there is no need to keep the whole history of all those snapshots. You are correct on history. The reason I keep history is because I may want to recover a file from last week or 2 weeks ago after I finally notice that it's gone. I have terabytes of space on the backup server, so it's easier to keep history there than on the client which may not have enough space to keep a month's worth of history. As you know, back when we did tape backups, we also kept history of at least several weeks (usually several months, but that's too much for btrfs snapshots). Bit of a case-study here, but it may be of interest. We do something kind of similar where I work for our internal file servers. We've got daily snapshots of the whole server kept on the server itself for 7 days (we usually see less than 5% of the total amount of data in changes on weekdays, and essentially 0 on weekends, so the snapshots rarely take up more than ab out 25% of the size of the live data), and then we additionally do daily backups which we retain for 6 months. I've written up a short (albeit rather system specific script) for recovering old versions of a file that first scans the snapshots, and then pulls it out of the backups if it's not there. I've found this works remarkably well for our use case (almost all the data on the file server follows a WORM access pattern with most of the files being between 100kB and 100MB in size). 
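That recovery helper isn't reproduced here, but the general shape of such a script is easy to sketch; all the paths and the snapshot naming scheme below are assumptions, not what we actually run:

  #!/bin/sh
  # look for a file in daily snapshots, newest first, before falling back to backups
  rel="$1"                              # path relative to the share root
  for snap in $(ls -1d /srv/share/.snapshots/daily-* 2>/dev/null | sort -r); do
      if [ -e "$snap/$rel" ]; then
          echo "found: $snap/$rel"
          exit 0
      fi
  done
  echo "not in any snapshot; restore from the long-term backups instead" >&2
  exit 1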
We actually did try moving it all over to BTRFS for a while before we finally ended up with the setup we currently have, but aside from the whole issue with massive numbers of snapshots, we found that for us at least, Amanda actually outperforms BTRFS send/receive for everything except full backups and uses less storage space (though that last bit is largely because we use really aggressive compression). -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: how to best segment a big block device in resizeable btrfs filesystems?
On 2018-07-02 11:18, Marc MERLIN wrote: Hi Qu, I'll split this part into a new thread: 2) Don't keep unrelated snapshots in one btrfs. I totally understand that maintain different btrfs would hugely add maintenance pressure, but as explains, all snapshots share one fragile extent tree. Yes, I understand that this is what I should do given what you explained. My main problem is knowing how to segment things so I don't end up with filesystems that are full while others are almost empty :) Am I supposed to put LVM thin volumes underneath so that I can share the same single 10TB raid5? Actually, because of the online resize ability in BTRFS, you don't technically _need_ to use thin provisioning here. It makes the maintenance a bit easier, but it also adds a much more complicated layer of indirection than just doing regular volumes. If I do this, I would have software raid 5 < dmcrypt < bcache < lvm < btrfs That's a lot of layers, and that's also starting to make me nervous :) Is there any other way that does not involve me creating smaller block devices for multiple btrfs filesystems and hope that they are the right size because I won't be able to change it later? You could (in theory) merge the LVM and software RAID5 layers, though that may make handling of the RAID5 layer a bit complicated if you choose to use thin provisioning (for some reason, LVM is unable to do on-line checks and rebuilds of RAID arrays that are acting as thin pool data or metadata). Alternatively, you could increase your array size, remove the software RAID layer, and switch to using BTRFS in raid10 mode so that you could eliminate one of the layers, though that would probably reduce the effectiveness of bcache (you might want to get a bigger cache device if you do this). -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
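Spelled out with hypothetical LV names and sizes, the shrink-one-grow-the-other dance would look roughly like this. The important points are that the filesystem is shrunk before its LV and grown after it, and that you shrink the filesystem by a bit more than the LV so a sizing mistake can't truncate live data:

  # shrink the over-sized backup target: filesystem first, then the LV
  btrfs filesystem resize -110G /mnt/backup1
  lvreduce -L -100G vg0/backup1
  btrfs filesystem resize max /mnt/backup1   # grow back to exactly fill the smaller LV

  # grow the other one: LV first, then let BTRFS fill it
  lvextend -L +100G vg0/backup2
  btrfs filesystem resize max /mnt/backup2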
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On 2018-06-30 02:33, Duncan wrote: Austin S. Hemmelgarn posted on Fri, 29 Jun 2018 14:31:04 -0400 as excerpted: On 2018-06-29 13:58, james harvey wrote: On Fri, Jun 29, 2018 at 1:09 PM, Austin S. Hemmelgarn wrote: On 2018-06-29 11:15, james harvey wrote: On Thu, Jun 28, 2018 at 6:27 PM, Chris Murphy wrote: And an open question I have about scrub is weather it only ever is checking csums, meaning nodatacow files are never scrubbed, or if the copies are at least compared to each other? Scrub never looks at nodatacow files. It does not compare the copies to each other. Qu submitted a patch to make check compare the copies: https://patchwork.kernel.org/patch/10434509/ This hasn't been added to btrfs-progs git yet. IMO, I think the offline check should look at nodatacow copies like this, but I still think this also needs to be added to scrub. In the patch thread, I discuss my reasons why. In brief: online scanning; this goes along with user's expectation of scrub ensuring mirrored data integrity; and recommendations to setup scrub on periodic basis to me means it's the place to put it. That said, it can't sanely fix things if there is a mismatch. At least, not unless BTRFS gets proper generational tracking to handle temporarily missing devices. As of right now, sanely fixing things requires significant manual intervention, as you have to bypass the device read selection algorithm to be able to look at the state of the individual copies so that you can pick one to use and forcibly rewrite the whole file by hand. Absolutely. User would need to use manual intervention as you describe, or restore the single file(s) from backup. But, it's a good opportunity to tell the user they had partial data corruption, even if it can't be auto-fixed. Otherwise they get intermittent data corruption, depending on which copies are read. The thing is though, as things stand right now, you need to manually edit the data on-disk directly or restore the file from a backup to fix the file. While it's technically true that you can manually repair this type of thing, both of the cases for doing it without those patches I mentioned, it's functionally impossible for a regular user to do it without potentially losing some data. [Usual backups rant, user vs. admin variant, nowcow/tmpfs edition. Regulars can skip as the rest is already predicted from past posts, for them. =;^] "Regular user"? "Regular users" don't need to bother with this level of detail. They simply get their "admin" to do it, even if that "admin" is their kid, or the kid from next door that's good with computers, or the geek squad (aka nsa-agent-squad) guy/gal, doing it... or telling them to install "a real OS", meaning whatever MS/Apple/Google something that they know how to deal with. If the "user" is dealing with setting nocow, choosing btrfs in the first place, etc, then they're _not_ a "regular user" by definition, they're already an admin.I'd argue that that's not always true. 'Regular users' also bli9ndly follow advice they find online about how to make their system run better, and quite often don't keep backups. And as any admin learns rather quickly, the value of data is defined by the number of backups it's worth having of that data. Which means it's not a problem. Either the data had a backup and it's (reasonably) trivial to restore the data from that backup, or the data was defined by lack of having that backup as of only trivial value, so low as to not be worth the time/trouble/resources necessary to make that backup in the first place. 
Which of course means what was defined as of most value, either the data of there was a backup, or the time/trouble/resources that would have gone into creating it if not, is *always* saved. (And of course the same goes for "I had a backup, but it's old", except in this case it's the value of the data delta between the backup and current. As soon as it's worth more than the time/trouble/hassle of updating the backup, it will by definition be updated. Not having a newer backup available thus simply means the value of the data that changed between the last backup and current was simply not enough to justify updating the backup, and again, what was of most value is *always* saved, either the data, or the time that would have otherwise gone into making the newer backup.) Because while a "regular user" may not know it because it's not his /job/ to know it, if there's anything an admin knows *well* it's that the working copy of data **WILL** be damaged. It's not a matter of if, but of when, and of whether it'll be a fat-finger mistake, or a hardware or software failure, or wetware (theft, ransomware, etc), or wetware (flood, fire and the water that put it out damage, etc), tho none of that actually matters after all, because in the end, the only thing that matters was how the value of t
Re: unsolvable technical issues?
On 2018-06-30 01:32, Andrei Borzenkov wrote: 30.06.2018 06:22, Duncan пишет: Austin S. Hemmelgarn posted on Mon, 25 Jun 2018 07:26:41 -0400 as excerpted: On 2018-06-24 16:22, Goffredo Baroncelli wrote: On 06/23/2018 07:11 AM, Duncan wrote: waxhead posted on Fri, 22 Jun 2018 01:13:31 +0200 as excerpted: According to this: https://stratis-storage.github.io/StratisSoftwareDesign.pdf Page 4 , section 1.2 It claims that BTRFS still have significant technical issues that may never be resolved. I can speculate a bit. 1) When I see btrfs "technical issue that may never be resolved", the #1 first thing I think of, that AFAIK there are _definitely_ no plans to resolve, because it's very deeply woven into the btrfs core by now, is... [1)] Filesystem UUID Identification. Btrfs takes the UU bit of Universally Unique quite literally, assuming they really *are* unique, at least on that system[.] Because btrfs uses this supposedly unique ID to ID devices that belong to the filesystem, it can get *very* mixed up, with results possibly including dataloss, if it sees devices that don't actually belong to a filesystem with the same UUID as a mounted filesystem. As partial workaround you can disable udev btrfs rules and then do a "btrfs dev scan" manually only for the device which you need. You don't even need `btrfs dev scan` if you just specify the exact set of devices in the mount options. The `device=` mount option tells the kernel to check that device during the mount process. Not that lvm does any better in this regard[1], but has btrfs ever solved the bug where only one device= in the kernel commandline's rootflags= would take effect, effectively forcing initr* on people (like me) who would otherwise not need them and prefer to do without them, if they're using a multi-device btrfs as root? This requires in-kernel device scanning; I doubt we will ever see it. Not to mention the fact that as kernel people will tell you, device enumeration isn't guaranteed to be in the same order every boot, so device=/dev/* can't be relied upon and shouldn't be used -- but of course device=LABEL= and device=UUID= and similar won't work without userspace, basically udev (if they work at all, IDK if they actually do). Tho in practice from what I've seen, device enumeration order tends to be dependable /enough/ for at least those without enterprise-level numbers of devices to enumerate. Just boot with USB stick/eSATA drive plugged in, there are good chances it changes device order. It really depends on your particular hardware. If your USB controllers are at lower PCI addresses than your primary disk controllers, then yes, this will cause an issue. Same for whatever SATA controller your eSATA port is on (or stupid systems where the eSATA port is port 0 on the main SATA controller). Notably, most Intel systems I've seen have the SATA controllers in the chipset enumerate after the USB controllers, and the whole chipset enumerates after add-in cards (so they almost always have this issue), while most AMD systems I've seen demonstrate the exact opposite behavior, they enumerate the SATA controller from the chipset before the USB controllers, and then enumerate the chipset before all the add-in cards (so they almost never have this issue). That said, one of the constraints for them remaining consistent is that you don't change hardware. 
True, it /does/ change from time to time with a new kernel, but anybody sane keeps a tested-dependable old kernel around to boot to until they know the new one works as expected, and that sort of change is seldom enough that users can boot to the old kernel and adjust their settings for the new one as necessary when it does happen. So as "don't do it that way because it's not reliable" as it might indeed be in theory, in practice, just using an ordered /dev/* in kernel commandlines does tend to "just work"... provided one is ready for the occasion when that device parameter might need a bit of adjustment, of course. ... --- [1] LVM is userspace code on top of the kernelspace devicemapper, and therefore requires an initr* if root is on lvm, regardless. So btrfs actually does a bit better here, only requiring it for multi-device btrfs. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
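For completeness, the `device=` approach mentioned above looks something like this in /etc/fstab; the UUID, device paths, and options are only an example, and as discussed it still depends on those /dev names enumerating consistently:

  # /etc/fstab
  UUID=<filesystem-uuid>  /data  btrfs  device=/dev/sdb1,device=/dev/sdc1,noatime  0 0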
Re: unsolvable technical issues?
On 2018-06-29 23:22, Duncan wrote: Austin S. Hemmelgarn posted on Mon, 25 Jun 2018 07:26:41 -0400 as excerpted: On 2018-06-24 16:22, Goffredo Baroncelli wrote: On 06/23/2018 07:11 AM, Duncan wrote: waxhead posted on Fri, 22 Jun 2018 01:13:31 +0200 as excerpted: According to this: https://stratis-storage.github.io/StratisSoftwareDesign.pdf Page 4 , section 1.2 It claims that BTRFS still have significant technical issues that may never be resolved. I can speculate a bit. 1) When I see btrfs "technical issue that may never be resolved", the #1 first thing I think of, that AFAIK there are _definitely_ no plans to resolve, because it's very deeply woven into the btrfs core by now, is... [1)] Filesystem UUID Identification. Btrfs takes the UU bit of Universally Unique quite literally, assuming they really *are* unique, at least on that system[.] Because btrfs uses this supposedly unique ID to ID devices that belong to the filesystem, it can get *very* mixed up, with results possibly including dataloss, if it sees devices that don't actually belong to a filesystem with the same UUID as a mounted filesystem. As partial workaround you can disable udev btrfs rules and then do a "btrfs dev scan" manually only for the device which you need. You don't even need `btrfs dev scan` if you just specify the exact set of devices in the mount options. The `device=` mount option tells the kernel to check that device during the mount process. Not that lvm does any better in this regard[1], but has btrfs ever solved the bug where only one device= in the kernel commandline's rootflags= would take effect, effectively forcing initr* on people (like me) who would otherwise not need them and prefer to do without them, if they're using a multi-device btrfs as root? I haven't tested this recently myself, so I don't know. Not to mention the fact that as kernel people will tell you, device enumeration isn't guaranteed to be in the same order every boot, so device=/dev/* can't be relied upon and shouldn't be used -- but of course device=LABEL= and device=UUID= and similar won't work without userspace, basically udev (if they work at all, IDK if they actually do). They aren't guaranteed to be stable, but they functionally are provided you don't modify hardware in any way and your disks can't be enumerated asynchronously without some form of ordered identification (IOW, you're using just one SATA or SCSI controller for all your disks). That said, the required component for the LABEL= and UUID= syntax is not udev, it's blkid. blkid can use udev to avoid having to read everything, but it's not mandatory. Tho in practice from what I've seen, device enumeration order tends to be dependable /enough/ for at least those without enterprise-level numbers of devices to enumerate. True, it /does/ change from time to time with a new kernel, but anybody sane keeps a tested-dependable old kernel around to boot to until they know the new one works as expected, and that sort of change is seldom enough that users can boot to the old kernel and adjust their settings for the new one as necessary when it does happen. So as "don't do it that way because it's not reliable" as it might indeed be in theory, in practice, just using an ordered /dev/* in kernel commandlines does tend to "just work"... provided one is ready for the occasion when that device parameter might need a bit of adjustment, of course. 
Also, while LVM does have 'issues' with cloned PV's, it fails safe (by refusing to work on VG's that have duplicate PV's), while BTRFS fails very unsafely (by randomly corrupting data). And IMO that "failing unsafe" is both serious and common enough that it easily justifies adding the point to a list of this sort, thus my putting it #1. Agreed. My point wasn't that BTRFS is doing things correctly, just that LVM is not a saint in this respect either (it's just more saintly than we are). 2) Subvolume and (more technically) reflink-aware defrag. It was there for a couple kernel versions some time ago, but "impossibly" slow, so it was disabled until such time as btrfs could be made to scale rather better in this regard. I still contend that the biggest issue WRT reflink-aware defrag was that it was not optional. The only way to get the old defrag behavior was to boot a kernel that didn't have reflink-aware defrag support. IOW, _everyone_ had to deal with the performance issues, not just the people who wanted to use reflink-aware defrag. Absolutely. Which of course suggests making it optional, with a suitable warning as to the speed implications with lots of snapshots/reflinks, when it does get enabled again (and as David mentions elsewhere, there's apparently some work going into the idea once again, which potentially moves it from the 3-5 year range, at best, back to a 1/2-2-year range, time will tell). 3) N-way-mirroring. [...] This is no
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On 2018-06-29 13:58, james harvey wrote: On Fri, Jun 29, 2018 at 1:09 PM, Austin S. Hemmelgarn wrote: On 2018-06-29 11:15, james harvey wrote: On Thu, Jun 28, 2018 at 6:27 PM, Chris Murphy wrote: And an open question I have about scrub is weather it only ever is checking csums, meaning nodatacow files are never scrubbed, or if the copies are at least compared to each other? Scrub never looks at nodatacow files. It does not compare the copies to each other. Qu submitted a patch to make check compare the copies: https://patchwork.kernel.org/patch/10434509/ This hasn't been added to btrfs-progs git yet. IMO, I think the offline check should look at nodatacow copies like this, but I still think this also needs to be added to scrub. In the patch thread, I discuss my reasons why. In brief: online scanning; this goes along with user's expectation of scrub ensuring mirrored data integrity; and recommendations to setup scrub on periodic basis to me means it's the place to put it. That said, it can't sanely fix things if there is a mismatch. At least, not unless BTRFS gets proper generational tracking to handle temporarily missing devices. As of right now, sanely fixing things requires significant manual intervention, as you have to bypass the device read selection algorithm to be able to look at the state of the individual copies so that you can pick one to use and forcibly rewrite the whole file by hand. Absolutely. User would need to use manual intervention as you describe, or restore the single file(s) from backup. But, it's a good opportunity to tell the user they had partial data corruption, even if it can't be auto-fixed. Otherwise they get intermittent data corruption, depending on which copies are read. The thing is though, as things stand right now, you need to manually edit the data on-disk directly or restore the file from a backup to fix the file. While it's technically true that you can manually repair this type of thing, both of the cases for doing it without those patches I mentioned, it's functionally impossible for a regular user to do it without potentially losing some data. Unless that changes, scrub telling you it's corrupt is not going to help much aside from making sure you don't make things worse by trying to use it. Given this, it would make sense to have a (disabled by default) option to have scrub repair it by just using the newer or older copy of the data. That would require classic RAID generational tracking though, which BTRFS doesn't have yet. A while back, Anand Jain posted some patches that would let you select a particular device to direct all reads to via a mount option, but I don't think they ever got merged. That would have made manual recovery in cases like this exponentially easier (mount read-only with one device selected, copy the file out somewhere, remount read-only with the other device, drop caches, copy the file out again, compare and reconcile the two copies, then remount the volume writable and write out the repaired file). -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On 2018-06-29 11:15, james harvey wrote: On Thu, Jun 28, 2018 at 6:27 PM, Chris Murphy wrote: And an open question I have about scrub is weather it only ever is checking csums, meaning nodatacow files are never scrubbed, or if the copies are at least compared to each other? Scrub never looks at nodatacow files. It does not compare the copies to each other. Qu submitted a patch to make check compare the copies: https://patchwork.kernel.org/patch/10434509/ This hasn't been added to btrfs-progs git yet. IMO, I think the offline check should look at nodatacow copies like this, but I still think this also needs to be added to scrub. In the patch thread, I discuss my reasons why. In brief: online scanning; this goes along with user's expectation of scrub ensuring mirrored data integrity; and recommendations to setup scrub on periodic basis to me means it's the place to put it. That said, it can't sanely fix things if there is a mismatch. At least, not unless BTRFS gets proper generational tracking to handle temporarily missing devices. As of right now, sanely fixing things requires significant manual intervention, as you have to bypass the device read selection algorithm to be able to look at the state of the individual copies so that you can pick one to use and forcibly rewrite the whole file by hand. A while back, Anand Jain posted some patches that would let you select a particular device to direct all reads to via a mount option, but I don't think they ever got merged. That would have made manual recovery in cases like this exponentially easier (mount read-only with one device selected, copy the file out somewhere, remount read-only with the other device, drop caches, copy the file out again, compare and reconcile the two copies, then remount the volume writable and write out the repaired file). -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs suddenly think's it's raid6
On 2018-06-29 07:04, marble wrote: Hello, I have an external HDD. The HDD contains no partition. I use the whole HDD as a LUKS container. Inside that LUKS is a btrfs. It's used to store some media files. The HDD was hooked up to a Raspberry Pi running up-to-date Arch Linux to play music from the drive. After disconnecting the drive from the Pi and connecting it to my laptop again, I couldn't mount it anymore. If I read the dmesg right, it now thinks that it's part of a raid6. btrfs check --repair also didn't help. ``` marble@archlinux ~ % uname -a Linux archlinux 4.17.2-1-ARCH #1 SMP PREEMPT Sat Jun 16 11:08:59 UTC 2018 x86_64 GNU/Linux marble@archlinux ~ % btrfs --version btrfs-progs v4.16.1 marble@archlinux ~ % sudo cryptsetup open /dev/sda black [sudo] password for marble: Enter passphrase for /dev/sda: marble@archlinux ~ % mkdir /tmp/black marble@archlinux ~ % sudo mount /dev/mapper/black /tmp/black mount: /tmp/black: can't read superblock on /dev/mapper/black. marble@archlinux ~ % sudo btrfs fi show Label: 'black' uuid: 9fea91c7-7b0b-4ef9-a83b-e24ccf2586b5 Total devices 1 FS bytes used 143.38GiB devid1 size 465.76GiB used 172.02GiB path /dev/mapper/black marble@archlinux ~ % sudo btrfs check --repair /dev/mapper/black enabling repair mode Checking filesystem on /dev/mapper/black UUID: 9fea91c7-7b0b-4ef9-a83b-e24ccf2586b5 Fixed 0 roots. checking extents checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3 checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3 checksum verify failed on 1082114048 found 1A9EFC07 wanted 204A6979 checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3 bytenr mismatch, want=1082114048, have=9385453728028469028 owner ref check failed [1082114048 16384] repair deleting extent record: key [1082114048,168,16384] adding new tree backref on start 1082114048 len 16384 parent 0 root 5 Repaired extent references for 1082114048 ref mismatch on [59038556160 4096] extent item 1, found 0 checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3 checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3 checksum verify failed on 1082114048 found 1A9EFC07 wanted 204A6979 checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3 bytenr mismatch, want=1082114048, have=9385453728028469028 incorrect local backref count on 59038556160 root 5 owner 334393 offset 0 found 0 wanted 1 back 0x56348aee5de0 backref disk bytenr does not match extent record, bytenr=59038556160, ref bytenr=0 backpointer mismatch on [59038556160 4096] owner ref check failed [59038556160 4096] checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3 checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3 checksum verify failed on 1082114048 found 1A9EFC07 wanted 204A6979 checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3 bytenr mismatch, want=1082114048, have=9385453728028469028 failed to repair damaged filesystem, aborting marble@archlinux ~ % dmesg > /tmp/dmesg.log ``` Any clues? It's not thinking it's a raid6 array. All the messages before this one: Btrfs loaded, crc32c=crc32c-intel Are completely unrelated to BTRFS (because anything before that message happened before any BTRFS code ran). The raid6 messages are actually from the startup code for the kernel's generic parity RAID implementation. These: BTRFS error (device dm-1): bad tree block start 9385453728028469028 1082114048 BTRFS error (device dm-1): bad tree block start 2365503423870651471 1082114048 Are the relevant error messages. 
Unfortunately, I don't really know what's wrong in this case though. Hopefully one of the developers will have some further insight. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On 2018-06-28 07:46, Qu Wenruo wrote: On 2018年06月28日 19:12, Austin S. Hemmelgarn wrote: On 2018-06-28 05:15, Qu Wenruo wrote: On 2018年06月28日 16:16, Andrei Borzenkov wrote: On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo wrote: On 2018年06月28日 11:14, r...@georgianit.com wrote: On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote: Please get yourself clear of what other raid1 is doing. A drive failure, where the drive is still there when the computer reboots, is a situation that *any* raid 1, (or for that matter, raid 5, raid 6, anything but raid 0) will recover from perfectly without raising a sweat. Some will rebuild the array automatically, WOW, that's black magic, at least for RAID1. The whole RAID1 has no idea of which copy is correct unlike btrfs who has datasum. Don't bother other things, just tell me how to determine which one is correct? When one drive fails, it is recorded in meta-data on remaining drives; probably configuration generation number is increased. Next time drive with older generation is not incorporated. Hardware controllers also keep this information in NVRAM and so do not even depend on scanning of other disks. Yep, the only possible way to determine such case is from external info. For device generation, it's possible to enhance btrfs, but at least we could start from detect and refuse to RW mount to avoid possible further corruption. But anyway, if one really cares about such case, hardware RAID controller seems to be the only solution as other software may have the same problem. LVM doesn't. It detects that one of the devices was gone for some period of time and marks the volume as degraded (and _might_, depending on how you have things configured, automatically re-sync). Not sure about MD, but I am willing to bet it properly detects this type of situation too. And the hardware solution looks pretty interesting, is the write to NVRAM 100% atomic? Even at power loss? On a proper RAID controller, it's battery backed, and that battery backing provides enough power to also make sure that the state change is properly recorded in the event of power loss. Well, that explains a lot of thing. So hardware RAID controller can be considered having a special battery (always atomic) journal device. If we can't provide UPS for the whole system, a battery powered journal device indeed makes sense. The only possibility is that, the misbehaved device missed several super block update so we have a chance to detect it's out-of-date. But that's not always working. Why it should not work as long as any write to array is suspended until superblock on remaining devices is updated? What happens if there is no generation gap in device superblock? If one device got some of its (nodatacow) data written to disk, while the other device doesn't get data written, and neither of them reached super block update, there is no difference in device superblock, thus no way to detect which is correct. Yes, but that should be a very small window (at least, once we finally quit serializing writes across devices), and it's a problem on existing RAID1 implementations too (and therefore isn't something we should be using as an excuse for not doing this). If you're talking about missing generation check for btrfs, that's valid, but it's far from a "major design flaw", as there are a lot of cases where other RAID1 (mdraid or LVM mirrored) can also be affected (the brain-split case). That's different. Yes, with software-based raid there is usually no way to detect outdated copy if no other copies are present. 
Having older valid data is still very different from corrupting newer data. While for VDI case (or any VM image file format other than raw), older valid data normally means corruption. Unless they have their own write-ahead log. Some file format may detect such problem by themselves if they have internal checksum, but anyway, older data normally means corruption, especially when partial new and partial old. On the other hand, with data COW and csum, btrfs can ensure the whole filesystem update is atomic (at least for single device). So the title, especially the "major design flaw" can't be wrong any more. The title is excessive, but I'd agree it's a design flaw that BTRFS doesn't at least notice that the generation ID's are different and preferentially trust the device with the newer generation ID. Well, a design flaw should be something that can't be easily fixed without *huge* on-disk format or behavior change. Flaw in btrfs' one-subvolume-per-tree metadata design or current extent booking behavior could be called design flaw. That would be a structural design flaw. it's a result of how the software is structured. There are other types of design flaws though. While for things like this, just as the submitted RFC patch, less than 100 lines could change the behavior. I would still consider this case a design flaw (a purely behavioral
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On 2018-06-28 05:15, Qu Wenruo wrote: On 2018年06月28日 16:16, Andrei Borzenkov wrote: On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo wrote: On 2018年06月28日 11:14, r...@georgianit.com wrote: On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote: Please get yourself clear of what other raid1 is doing. A drive failure, where the drive is still there when the computer reboots, is a situation that *any* raid 1, (or for that matter, raid 5, raid 6, anything but raid 0) will recover from perfectly without raising a sweat. Some will rebuild the array automatically, WOW, that's black magic, at least for RAID1. The whole RAID1 has no idea of which copy is correct unlike btrfs who has datasum. Don't bother other things, just tell me how to determine which one is correct? When one drive fails, it is recorded in meta-data on remaining drives; probably configuration generation number is increased. Next time drive with older generation is not incorporated. Hardware controllers also keep this information in NVRAM and so do not even depend on scanning of other disks. Yep, the only possible way to determine such case is from external info. For device generation, it's possible to enhance btrfs, but at least we could start from detect and refuse to RW mount to avoid possible further corruption. But anyway, if one really cares about such case, hardware RAID controller seems to be the only solution as other software may have the same problem. LVM doesn't. It detects that one of the devices was gone for some period of time and marks the volume as degraded (and _might_, depending on how you have things configured, automatically re-sync). Not sure about MD, but I am willing to bet it properly detects this type of situation too. And the hardware solution looks pretty interesting, is the write to NVRAM 100% atomic? Even at power loss? On a proper RAID controller, it's battery backed, and that battery backing provides enough power to also make sure that the state change is properly recorded in the event of power loss. The only possibility is that, the misbehaved device missed several super block update so we have a chance to detect it's out-of-date. But that's not always working. Why it should not work as long as any write to array is suspended until superblock on remaining devices is updated? What happens if there is no generation gap in device superblock? If one device got some of its (nodatacow) data written to disk, while the other device doesn't get data written, and neither of them reached super block update, there is no difference in device superblock, thus no way to detect which is correct. Yes, but that should be a very small window (at least, once we finally quit serializing writes across devices), and it's a problem on existing RAID1 implementations too (and therefore isn't something we should be using as an excuse for not doing this). If you're talking about missing generation check for btrfs, that's valid, but it's far from a "major design flaw", as there are a lot of cases where other RAID1 (mdraid or LVM mirrored) can also be affected (the brain-split case). That's different. Yes, with software-based raid there is usually no way to detect outdated copy if no other copies are present. Having older valid data is still very different from corrupting newer data. While for VDI case (or any VM image file format other than raw), older valid data normally means corruption. Unless they have their own write-ahead log. 
Some file format may detect such problem by themselves if they have internal checksum, but anyway, older data normally means corruption, especially when partial new and partial old. On the other hand, with data COW and csum, btrfs can ensure the whole filesystem update is atomic (at least for single device). So the title, especially the "major design flaw" can't be wrong any more. The title is excessive, but I'd agree it's a design flaw that BTRFS doesn't at least notice that the generation ID's are different and preferentially trust the device with the newer generation ID. The only special handling I can see that would be needed is around volumes mounted with the `nodatacow` option, which may not see generation changes for a very long time otherwise. others will automatically kick out the misbehaving drive. *none* of them will take back the the drive with old data and start commingling that data with good copy.)\ This behaviour from BTRFS is completely abnormal.. and defeats even the most basic expectations of RAID. RAID1 can only tolerate 1 missing device, it has nothing to do with error detection. And it's impossible to detect such case without extra help. Your expectation is completely wrong. Well ... somehow it is my experience as well ... :) Acceptable, but not really apply to software based RAID1. Thanks, Qu I'm not the one who has to clear his expectations here. -- To unsubscribe from this list: send the line "unsubscribe
Re: btrfs raid10 performance
On 2018-06-25 21:05, Sterling Windmill wrote: I am running a single btrfs RAID10 volume of eight LUKS devices, each using a 2TB SATA hard drive as a backing store. The SATA drives are a mixture of Seagate and Western Digital drives, some with RPMs ranging from 5400 to 7200. Each seems to individually performance test where I would expect for drives of this caliber. They are all attached to an LSI PCIe SAS controller and configured in JBOD. I have a relatively beefy quad core Xeon CPU that supports AES-NI and don't think LUKS is my bottleneck. Here's some info from the resulting filesystem: btrfs fi df /storage Data, RAID10: total=6.30TiB, used=6.29TiB System, RAID10: total=8.00MiB, used=560.00KiB Metadata, RAID10: total=9.00GiB, used=7.64GiB GlobalReserve, single: total=512.00MiB, used=0.00B In general I see good performance, especially read performance which is enough to regularly saturate my gigabit network when copying files from this host via samba. Reads are definitely taking advantage of the multiple copies of data available and spreading the load among all drives. Writes aren't quite as rosy, however. When writing files using dd like in this example: dd if=/dev/zero of=tempfile bs=1M count=10240 conv=fdatasync,notrun c status=progress And running a command like: iostat -m 1 to monitor disk I/O, writes seem to only focus on one of the eight disks at a time, moving from one drive to the next. This results in a sustained 55-90 MB/sec throughput depending on which disk is being written to (remember, some have faster spindle speed than others). Am I wrong to expect btrfs' RAID10 mode to write to multiple disks simultaneously and to break larger writes into smaller stripes across my four pairs of disks? I had trouble identifying whether btrfs RAID10 is writing (64K?) stripes or (1GB?) blocks to disk in this mode. The latter might make more sense based upon what I'm seeing? Anything else I should be trying to narrow down the bottleneck? First, you're probably incorrect that the disk access is being parallelized. Given that BTRFS still doesn't parallelize writes in raid1 mode, I very much doubt it does so in raid10 mode. Parallelizing writes is a performance optimization that still hasn't really been tackled by anyone. Realistically, BTRFS writes to exactly one disk at a time. So, in a four disk raid10 array, it first writes to disk 1, waits for that to finish, then writes to disk 2, waits for that to finish, then 3, waits, and then four. Overall, this makes writes rather slow. As far as striping across multiple disks, yes, that does happen. The specifics of this are a bit complicated though, and require explaining a bit about how BTRFS works in general. BTRFS uses a two-stage allocator, first allocating 'large' regions of disk space to be used for a specific type of data called chunks, and then allocating blocks out of those regions to actually store the data. There are three chunk types, data (used for storing actual file contents), metadata (used for storing things like filenames, access times, directory structure, etc), and system (used to store the allocation information for all the other chunks in the filesystem). Data chunks are typically 1 GB in size, metadata are typically 256 MB in size, and system chunks are highly variable but don't really matter for this explanation. The chunk level is where the actual replication and striping happen, and the chunk size represents what is exposed to the block allocator (so every 1 GB data chunk exposes 1 GB of space to the block allocator). 
Now, replicated (raid1 or dup profiles) chunks work just like you would expect, each of the two allocations for the chunk is 1 GB, and each byte is stored as-is in both. Striped (raid0 or raid10 profiles) are somewhat more complicated, and I actually don't know exactly how they end up allocated at the lower level. However, I do know how the striping works. In short, you can treat each striped set (either a full raid0 chunk, or half a raid10 chunk) as being functionally identical in operation to a conventional RAID0 array, striping occurs at a small block granularity (I think it's equal to the block size, which would be 4k in most cases), which unfortunately compounds the performance issues caused by BTRFS only writing to one disk at a time. As far as improving the performance, I've got two suggestions for alternative storage arrangements: * If you want to just stick with only BTRFS for storage, try just using raid1 mode. It will give you the same theoretical total capacity as raid10 does and will slow down reads somewhat, but should speed up writes significantly (because you're only writing to two devices, not striping across two sets of four). * If you're willing to try something a bit different, convert your storage array to two LVM or MD RAID0 volumes composed of four devices each, and then run BTRFS in raid1 mode on top of
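That second suggestion, with made-up device names and the LUKS layer left out for brevity, would look roughly like this (a sketch, not a tested recipe for this particular array):

  mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/sd[abcd]
  mdadm --create /dev/md1 --level=0 --raid-devices=4 /dev/sd[efgh]
  mkfs.btrfs -d raid1 -m raid1 /dev/md0 /dev/md1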
Re: btrfs balance did not progress after 12H, hang on reboot, btrfs check --repair kills the system still
On 2018-06-25 12:07, Marc MERLIN wrote: On Tue, Jun 19, 2018 at 12:58:44PM -0400, Austin S. Hemmelgarn wrote: In your situation, I would run "btrfs pause ", wait to hear from a btrfs developer, and not use the volume whatsoever in the meantime. I would say this is probably good advice. I don't really know what's going on here myself actually, though it looks like the balance got stuck (the output hasn't changed for over 36 hours; unless you've got an insanely slow storage array, that's extremely unusual (it should only be moving at most 3GB of data per chunk)). I didn't hear from any developer, so I had to continue.
- btrfs scrub cancel did not work (hang)
- at reboot mounting the filesystem hung, even with 4.17, which is disappointing (it should not hang)
- mount -o recovery still hung
- mount -o ro did not hang though
One tip here specifically: if you had to reboot during a balance and the FS hangs when it mounts, try mounting with `-o skip_balance`. That should pause the balance instead of resuming it on mount, at which point you should also be able to cancel it without it hanging. Sigh, why is my FS corrupted again? Anyway, back to btrfs check --repair, and it took all my 32GB of RAM on a system I can't add more RAM to, so I'm hosed. I'll note in passing (and it's not ok at all) that check --repair, after a 20 to 30 min pause, takes all the kernel RAM more quickly than the system can OOM or log anything, and just deadlocks it. This is repeatable and totally not ok :( I'm now left with btrfs-progs git master, and lowmem mode, which finally does a bit of repair. So far:
gargamel:~# btrfs check --mode=lowmem --repair -p /dev/mapper/dshelf2
enabling repair mode
WARNING: low-memory mode repair support is only partial
Checking filesystem on /dev/mapper/dshelf2
UUID: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d
Fixed 0 roots.
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 21872, owner: 374857, offset: 3407872) wanted: 3, have: 4
Created new chunk [18457780224000 1073741824]
Delete backref in extent [84302495744 69632]
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 22911, owner: 374857, offset: 3407872) wanted: 3, have: 4
Delete backref in extent [84302495744 69632]
ERROR: extent[125712527360, 12214272] referencer count mismatch (root: 21872, owner: 374857, offset: 114540544) wanted: 181, have: 240
Delete backref in extent [125712527360 12214272]
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 21872, owner: 374857, offset: 126754816) wanted: 68, have: 115
Delete backref in extent [125730848768 5111808]
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 22911, owner: 374857, offset: 126754816) wanted: 68, have: 115
Delete backref in extent [125730848768 5111808]
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 21872, owner: 374857, offset: 131866624) wanted: 115, have: 143
Delete backref in extent [125736914944 6037504]
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 22911, owner: 374857, offset: 131866624) wanted: 115, have: 143
Delete backref in extent [125736914944 6037504]
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 21872, owner: 374857, offset: 148234240) wanted: 302, have: 431
Delete backref in extent [129952120832 20242432]
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 22911, owner: 374857, offset: 148234240) wanted: 356, have: 433
Delete backref in extent [129952120832 20242432]
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 21872, owner: 374857, offset: 180371456) wanted: 161, have: 240
Delete backref in extent [134925357056 11829248]
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 22911, owner: 374857, offset: 180371456) wanted: 162, have: 240
Delete backref in extent [134925357056 11829248]
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 21872, owner: 374857, offset: 192200704) wanted: 170, have: 249
Delete backref in extent [147895111680 12345344]
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, owner: 374857, offset: 192200704) wanted: 172, have: 251
Delete backref in extent [147895111680 12345344]
ERROR: extent[150850146304, 17522688] referencer count mismatch (root: 21872, owner: 374857, offset: 217653248) wanted: 348, have: 418
Delete backref in extent [150850146304 17522688]
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 22911, owner: 374857, offset: 235175936) wanted: 555, have: 1449
Deleted root 2 item[156909494272, 178, 5476627808561673095]
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, owner: 374857, offset: 235175936) wanted: 556, have: 1452
Deleted root 2 item[156909494272, 178, 7338474132555182983]
At the rate it's going, it'll probably take days though, it's already been 36H. Marc
Re: unsolvable technical issues?
On 2018-06-24 16:22, Goffredo Baroncelli wrote: On 06/23/2018 07:11 AM, Duncan wrote: waxhead posted on Fri, 22 Jun 2018 01:13:31 +0200 as excerpted: According to this: https://stratis-storage.github.io/StratisSoftwareDesign.pdf Page 4 , section 1.2 It claims that BTRFS still have significant technical issues that may never be resolved. Could someone shed some light on exactly what these technical issues might be?! What are BTRFS biggest technical problems? If you forget about the "RAID"5/6 like features then the only annoyances that I have with BTRFS so far is... 1. Lack of per subvolume "RAID" levels 2. Lack of not using the deviceid to re-discover and re-add dropped devices And that's about it really... ... And those both have solutions on the roadmap, with RFC patches already posted for #2 (tho I'm not sure they use devid) altho realistically they're likely to take years to appear and be tested to stability. Meanwhile... While as the others have said you really need to go to the author to get what was referred to, and I agree, I can speculate a bit. While this *is* speculation, admittedly somewhat uninformed as I don't claim to be a dev, and I'd actually be interested in what others think so don't be afraid to tell me I haven't a clue, as long as you say why... based on several years reading the list now... 1) When I see btrfs "technical issue that may never be resolved", the #1 first thing I think of, that AFAIK there are _definitely_ no plans to resolve, because it's very deeply woven into the btrfs core by now, is... Filesystem UUID Identification. Btrfs takes the UU bit of Universally Unique quite literally, assuming they really *are* unique, at least on that system, and uses them to identify the possibly multiple devices that may be components of the filesystem, a problem most filesystems don't have to deal with since they're single-device-only. Because btrfs uses this supposedly unique ID to ID devices that belong to the filesystem, it can get *very* mixed up, with results possibly including dataloss, if it sees devices that don't actually belong to a filesystem with the same UUID as a mounted filesystem. As partial workaround you can disable udev btrfs rules and then do a "btrfs dev scan" manually only for the device which you need. The you can mount the filesystem. Unfortunately you cannot mount two filesystem with the same UUID. However I have to point out that also LVM/dm might have problems if you clone a PV You don't even need `btrfs dev scan` if you just specify the exact set of devices in the mount options. The `device=` mount option tells the kernel to check that device during the mount process. Also, while LVM does have 'issues' with cloned PV's, it fails safe (by refusing to work on VG's that have duplicate PV's), while BTRFS fails very unsafely (by randomly corrupting data). [...] der say 3-5 (or 5-7, or whatever) years. These could arguably include: 2) Subvolume and (more technically) reflink-aware defrag. It was there for a couple kernel versions some time ago, but "impossibly" slow, so it was disabled until such time as btrfs could be made to scale rather better in this regard. Did you try something like that with XFS+DM snapshot ? No you can't, because defrag in XFS cannot traverse snapshot (and I have to suppose that defrag cannot be effective on a dm-snapshot at all).. 
What I am trying to point out is that even though btrfs is not the fastest filesystem (and for some workloads it is VERY slow), when you compare it with LVM/dm when a few snapshots are present, LVM/dm is a lot slower. IMHO most of the complaints which affect BTRFS are due to the fact that with BTRFS a user can quite easily exploit a lot of features and their combinations. When the slowness issue appears because some advanced feature combinations are used (i.e. multiple-disk profiles and (a lot of) snapshots), this is reported as a BTRFS failure. But in fact even LVM/dm is very slow when snapshots are used. I still contend that the biggest issue WRT reflink-aware defrag was that it was not optional. The only way to get the old defrag behavior was to boot a kernel that didn't have reflink-aware defrag support. IOW, _everyone_ had to deal with the performance issues, not just the people who wanted to use reflink-aware defrag. There's no hint yet as to when that might actually be, if it will _ever_ be, so this can arguably be validly added to the "may never be resolved" list. 3) N-way-mirroring. [...] This is not an issue, but a feature that is not implemented. If you're looking at feature parity with competitors, it's an issue. 4) (Until relatively recently, and still in terms of scaling) Quotas. Until relatively recently, quotas could arguably be added to the list. They were rewritten multiple times, and until recently, appeared to be effectively eternally broken. Even though what you are reporting is correct, I have to point out that the quota in BTRFS is more
Re: btrfs balance did not progress after 12H
On 2018-06-19 12:30, james harvey wrote: On Tue, Jun 19, 2018 at 11:47 AM, Marc MERLIN wrote: On Mon, Jun 18, 2018 at 06:00:55AM -0700, Marc MERLIN wrote: So, I ran this: gargamel:/mnt/btrfs_pool2# btrfs balance start -dusage=60 -v . & [1] 24450 Dumping filters: flags 0x1, state 0x0, force is off DATA (flags 0x2): balancing, usage=60 gargamel:/mnt/btrfs_pool2# while :; do btrfs balance status .; sleep 60; done 0 out of about 0 chunks balanced (0 considered), -nan% left This (0/0/0, -nan%) seems alarming. I had this output once when the system spontaneously rebooted during a balance. I didn't have any bad effects afterward. Balance on '.' is running 0 out of about 73 chunks balanced (2 considered), 100% left Balance on '.' is running After about 20mn, it changed to this: 1 out of about 73 chunks balanced (6724 considered), 99% left This seems alarming. I wouldn't think # considered should ever exceed # chunks. Although, it does say "about", so maybe it can a little bit, but I wouldn't expect it to exceed it by this much. Actually, output like this is not unusual. In the above line, the 1 is how many chunks have been actually processed, the 73 is how many the command expects to process (that is, the count of chunks that fit the filtering requirements, in this case, ones which are 60% or less full), and the 6724 is how many it has checked against the filtering requirements. So, if you've got a very large number of chunks, and are selecting a small number with filters, then the considered value is likely to be significantly higher than the first two. Balance on '.' is running Now, 12H later, it's still there, only 1 out of 73. gargamel:/mnt/btrfs_pool2# btrfs fi show . Label: 'dshelf2' uuid: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d Total devices 1 FS bytes used 12.72TiB devid1 size 14.55TiB used 13.81TiB path /dev/mapper/dshelf2 gargamel:/mnt/btrfs_pool2# btrfs fi df . Data, single: total=13.57TiB, used=12.60TiB System, DUP: total=32.00MiB, used=1.55MiB Metadata, DUP: total=121.50GiB, used=116.53GiB GlobalReserve, single: total=512.00MiB, used=848.00KiB kernel: 4.16.8 Is that expected? Should I be ready to wait days possibly for this balance to finish? It's now beeen 2 days, and it's still stuck at 1% 1 out of about 73 chunks balanced (6724 considered), 99% left First, my disclaimer. I'm not a btrfs developer, and although I've ran balance many times, I haven't really studied its output beyond the % left. I don't know why it says "about", and I don't know if it should ever be that far off. In your situation, I would run "btrfs pause ", wait to hear from a btrfs developer, and not use the volume whatsoever in the meantime. I would say this is probably good advice. I don't really know what's going on here myself actually, though it looks like the balance got stuck (the output hasn't changed for over 36 hours, unless you've got an insanely slow storage array, that's extremely unusual (it should only be moving at most 3GB of data per chunk)). That said, I would question the value of repacking chunks that are already more than half full. Anything above a 50% usage filter generally takes a long time, and has limited value in most cases (higher values are less likely to reduce the total number of allocated chunks). 
With `-dusage=50` or less, you're guaranteed to reduce the number of chunks if at least two match, and it isn't very time consuming for the allocator, all because you can pack at least two matching chunks into one 'new' chunk (new in quotes because it may re-pack them into existing slack space on the FS). Additionally, `-dusage=50` is usually sufficient to mitigate the typical ENOSPC issues that regular balancing is supposed to help with.
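For reference, a routine maintenance balance along the lines being discussed here would look something like this (the mount point is just an example):
btrfs balance start -dusage=50 -musage=50 /mnt/btrfs_pool2
The usage filters mean only chunks that are at most 50% full get repacked, which keeps the run short while still being able to free up whole chunks.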
Re: ntfs -> qemu -> raw-image -> btrfs -> iscsi
On 2018-06-15 13:40, Chris Murphy wrote: On Fri, Jun 15, 2018 at 5:33 AM, ein wrote: Hello group, has anyone had any luck with hosting qemu kvm images residing on a BTRFS filesystem while serving the volume via iSCSI? I encountered some unidentified problem and I am able to replicate it. Basically the NTFS filesystem inside the RAW image gets corrupted every time the Windows guest boots. What is weird is that changing the filesystem to ext4 or xfs solves the issue. The problem replication looks as follows:
1) run chkdsk on the guest to make sure the filesystem structure is in good shape,
2) shut down the VM via libvirtd,
3) rsync changes between source and backup image,
4) generate SHA1 for backup and original and compare it,
5) try to run the guest on the backup image.
I was able to boot Windows once out of ten times; every time after a reboot NTFS' chkdsk finds problems with the filesystem and the VM is unable to boot again. What am I missing? VM disk config: cache=none uses O_DIRECT, and that's the source of the issue with VM images on Btrfs. Details are in the list archive. I'm not really sure what you want to use with Windows in this particular case, probably not cache=unsafe though. I'd say give writethrough a shot and see how it affects performance and fixes this problem. cache=writethrough is probably going to be the best option, unless you want to switch to cache=writeback and disable write caching in Windows (which from what I hear can actually give better performance than using cache=none).
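If the disk is defined through libvirt, the cache mode change being suggested is a one-line edit in the guest's domain XML, roughly like this (the source path and target are just examples):
<disk type='file' device='disk'>
  <driver name='qemu' type='raw' cache='writethrough'/>
  <source file='/var/lib/libvirt/images/windows.img'/>
  <target dev='vda' bus='virtio'/>
</disk>
The equivalent on a plain qemu command line would be cache=writethrough in the -drive option.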
Re: csum failed root raveled during balance
On 2018-05-29 10:02, ein wrote: On 05/29/2018 02:12 PM, Austin S. Hemmelgarn wrote: On 2018-05-28 13:10, ein wrote: On 05/23/2018 01:03 PM, Austin S. Hemmelgarn wrote: On 2018-05-23 06:09, ein wrote: On 05/23/2018 11:09 AM, Duncan wrote: ein posted on Wed, 23 May 2018 10:03:52 +0200 as excerpted: IMHO the best course of action would be to disable checksumming for you vm files. Do you mean '-o nodatasum' mount flag? Is it possible to disable checksumming for singe file by setting some magical chattr? Google thinks it's not possible to disable csums for a single file. You can use nocow (-C), but of course that has other restrictions (like setting it on the files when they're zero-length, easiest done for existing data by setting it on the containing dir and copying files (no reflink) in) as well as the nocow effects, and nocow becomes cow1 after a snapshot (which locks the existing copy in place so changes written to a block /must/ be written elsewhere, thus the cow1, aka cow the first time written after the snapshot but retain the nocow for repeated writes between snapshots). But if you're disabling checksumming anyway, nocow's likely the way to go. Disabling checksumming only may be a way to go - we live without it every day. But nocow @ VM files defeats whole purpose of using BTRFS for me, even with huge performance penalty - backup reasons - I mean few snapshots (20-30), send & receive. Setting NOCOW on a file doesn't prevent it from being snapshotted, it just prevents COW operations from happening under most normal circumstances. In essence, it prevents COW from happening except for writing right after the snapshot. More specifically, the first write to a given block in a file set for NOCOW after taking a snapshot will trigger a _single_ COW operation for _only_ that block (unless you have autodefrag enabled too), after which that block will revert to not doing COW operations on write. This way, you still get consistent and working snapshots, but you also don't take the performance hits from COW except right after taking a snapshot. Yeah, just after I've post it, I've found some Duncan post from 2015, explaining it, thank you anyway. Is there anything we can do better in random/write VM workload to speed the BTRFS up and why? My settings: [...] /dev/mapper/raid10-images on /var/lib/libvirt type btrfs (rw,noatime,nodiratime,compress=lzo:3,ssd,space_cache,autodefrag,subvolid=5,subvol=/) md1 : active raid10 sdc1[2] sdb2[1] sdd1[3] sda2[0] 468596736 blocks super 1.2 512K chunks 2 near-copies [4/4] [] bitmap: 0/4 pages [0KB], 65536KB chunk CPU: E3-1246 with: VT-x, VT-d, HT, EPT, TSX-NI, AES-NI on debian's kernel 4.15.0-0.bpo.2-amd64 As far as I understand compress and autodefrag are impacting negatively for performance (latency), especially autodefrag. I think also that nodatacow shall also speed things up and it's a must when using qemu and BTRFS. Is it better to use virtio or virt-scsi with TRIM support? FWIW, I've been doing just fine without nodatacow, but I also use raw images contained in sparse files, and keep autodefrag off for the dedicated filesystem I put the images on. So do I, RAW images created by qemu-img, but I am not sure if preallocation works as expected. The size of disks in filesystem looks fine though. Unless I'm mistaken, qemu-img will fully pre-allocate the images. You can easily check though with `ls -ls`, which will show the amount of space taken up by the file on-disk (before compression or deduplication) on the left. 
If that first column on the left doesn't match up with the apparent file size, then the file is sparse and not fully pre-allocated. From a practical perspective, if you really want maximal performance, it's worth pre-allocating space, as that both avoids the non-determinism of allocating blocks on first-write, and avoids some degree of fragmentation. If you would rather save the space and not pre-allocate, you can use truncate with the `--size` argument to quickly create an appropriately sized (sparse) virtual disk image file. May I ask in what workloads? From my testing while having VMs on BTRFS storage:
- file/web servers work perfectly on BTRFS.
- Windows (2012/2016) file servers with AD are perfect too, besides the time required for Windows Update, but this service is... let's say not fine enough.
- database (firebird) impact is huuuge, the guest filesystem is Ext4, and the database performs slower in these conditions (4 SSDs in RAID10) than when it was on raid1 with 2 10krpm SASes. I am still thinking how to benchmark it properly. A lot of iowait in the host's kernel.
In my case, I've got a couple of different types of VM's, each with its own type of workload:
- A total of 8 static VM's that are always running, each running a different distribution/version of Linux. These see very little activity most of the time (I keep them around as reference systems so i
Re: csum failed root raveled during balance
On 2018-05-28 13:10, ein wrote: On 05/23/2018 01:03 PM, Austin S. Hemmelgarn wrote: On 2018-05-23 06:09, ein wrote: On 05/23/2018 11:09 AM, Duncan wrote: ein posted on Wed, 23 May 2018 10:03:52 +0200 as excerpted: IMHO the best course of action would be to disable checksumming for you vm files. Do you mean '-o nodatasum' mount flag? Is it possible to disable checksumming for singe file by setting some magical chattr? Google thinks it's not possible to disable csums for a single file. You can use nocow (-C), but of course that has other restrictions (like setting it on the files when they're zero-length, easiest done for existing data by setting it on the containing dir and copying files (no reflink) in) as well as the nocow effects, and nocow becomes cow1 after a snapshot (which locks the existing copy in place so changes written to a block /must/ be written elsewhere, thus the cow1, aka cow the first time written after the snapshot but retain the nocow for repeated writes between snapshots). But if you're disabling checksumming anyway, nocow's likely the way to go. Disabling checksumming only may be a way to go - we live without it every day. But nocow @ VM files defeats whole purpose of using BTRFS for me, even with huge performance penalty - backup reasons - I mean few snapshots (20-30), send & receive. Setting NOCOW on a file doesn't prevent it from being snapshotted, it just prevents COW operations from happening under most normal circumstances. In essence, it prevents COW from happening except for writing right after the snapshot. More specifically, the first write to a given block in a file set for NOCOW after taking a snapshot will trigger a _single_ COW operation for _only_ that block (unless you have autodefrag enabled too), after which that block will revert to not doing COW operations on write. This way, you still get consistent and working snapshots, but you also don't take the performance hits from COW except right after taking a snapshot. Yeah, just after I've post it, I've found some Duncan post from 2015, explaining it, thank you anyway. Is there anything we can do better in random/write VM workload to speed the BTRFS up and why? My settings: [...] /dev/mapper/raid10-images on /var/lib/libvirt type btrfs (rw,noatime,nodiratime,compress=lzo:3,ssd,space_cache,autodefrag,subvolid=5,subvol=/) md1 : active raid10 sdc1[2] sdb2[1] sdd1[3] sda2[0] 468596736 blocks super 1.2 512K chunks 2 near-copies [4/4] [] bitmap: 0/4 pages [0KB], 65536KB chunk CPU: E3-1246 with: VT-x, VT-d, HT, EPT, TSX-NI, AES-NI on debian's kernel 4.15.0-0.bpo.2-amd64 As far as I understand compress and autodefrag are impacting negatively for performance (latency), especially autodefrag. I think also that nodatacow shall also speed things up and it's a must when using qemu and BTRFS. Is it better to use virtio or virt-scsi with TRIM support? FWIW, I've been doing just fine without nodatacow, but I also use raw images contained in sparse files, and keep autodefrag off for the dedicated filesystem I put the images on. Compression shouldn't have much in the way of negative impact unless you're also using transparent compression (or disk for file encryption) inside the VM (in fact, it may speed things up significantly depending on what filesystem is being used by the guest OS, the ext4 inode table in particular seems to compress very well). If you are using `nodatacow` though, you can just turn compression off, as it's not going to be used anyway. 
If you want to keep using compression, then I'd suggest using `compress-force` instead of `compress`, which makes BTRFS a bit more aggressive about trying to compress things, but makes the performance much more deterministic. You may also want to look into using `zstd` instead of `lzo` for the compression; it gets better ratios most of the time, and usually performs better than `lzo` does. Autodefrag should probably be off. If you have nodatacow set (or just have all the files marked with the NOCOW attribute), then there's not really any point to having autodefrag on. If like me you aren't turning off COW for data, it's still a good idea to have it off and just do batch defragmentation at a regularly scheduled time. For the VM settings, everything looks fine to me (though if you have somewhat slow storage and aren't giving the VM's lots of memory to work with, doing write-through caching might be helpful). I would probably be using virtio-scsi for the TRIM support, as with raw images you will get holes in the file where the TRIM command was issued, which can actually improve performance (and does improve storage utilization), though doing batch trims instead of using the `discard` mount option is better for performance if you have Linux guests. You're using an MD RAID10 array. This is generally the fastest option in terms of performance, but it also means you c
Re: csum failed root raveled during balance
On 2018-05-23 06:09, ein wrote: On 05/23/2018 11:09 AM, Duncan wrote: ein posted on Wed, 23 May 2018 10:03:52 +0200 as excerpted: IMHO the best course of action would be to disable checksumming for you vm files. Do you mean '-o nodatasum' mount flag? Is it possible to disable checksumming for singe file by setting some magical chattr? Google thinks it's not possible to disable csums for a single file. You can use nocow (-C), but of course that has other restrictions (like setting it on the files when they're zero-length, easiest done for existing data by setting it on the containing dir and copying files (no reflink) in) as well as the nocow effects, and nocow becomes cow1 after a snapshot (which locks the existing copy in place so changes written to a block /must/ be written elsewhere, thus the cow1, aka cow the first time written after the snapshot but retain the nocow for repeated writes between snapshots). But if you're disabling checksumming anyway, nocow's likely the way to go. Disabling checksumming only may be a way to go - we live without it every day. But nocow @ VM files defeats whole purpose of using BTRFS for me, even with huge performance penalty - backup reasons - I mean few snapshots (20-30), send & receive. Setting NOCOW on a file doesn't prevent it from being snapshotted, it just prevents COW operations from happening under most normal circumstances. In essence, it prevents COW from happening except for writing right after the snapshot. More specifically, the first write to a given block in a file set for NOCOW after taking a snapshot will trigger a _single_ COW operation for _only_ that block (unless you have autodefrag enabled too), after which that block will revert to not doing COW operations on write. This way, you still get consistent and working snapshots, but you also don't take the performance hits from COW except right after taking a snapshot. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
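As a quick illustration of the nocow approach described above, assuming a dedicated directory for the VM images (the paths here are hypothetical):
mkdir /var/lib/libvirt/images-nocow
chattr +C /var/lib/libvirt/images-nocow
cp --reflink=never /srv/old-images/guest.img /var/lib/libvirt/images-nocow/
New files created in that directory inherit the NOCOW attribute, and copying without reflinks is what lets existing data pick it up.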
Re: Any chance to get snapshot-aware defragmentation?
On 2018-05-21 13:43, David Sterba wrote: On Fri, May 18, 2018 at 01:10:02PM -0400, Austin S. Hemmelgarn wrote: On 2018-05-18 12:36, Niccolò Belli wrote: On venerdì 18 maggio 2018 18:20:51 CEST, David Sterba wrote: Josef started working on that in 2014 and did not finish it. The patches can be still found in his tree. The problem is in excessive memory consumption when there are many snapshots that need to be tracked during the defragmentation, so there are measures to avoid OOM. There's infrastructure ready for use (shrinkers), there are maybe some problems but fundamentally is should work. I'd like to get the snapshot-aware working again too, we'd need to find a volunteer to resume the work on the patchset. Yeah I know of Josef's work, but 4 years had passed since then without any news on this front. What I would really like to know is why nobody resumed his work: is it because it's impossible to implement snapshot-aware degram without excessive ram usage or is it simply because nobody is interested? I think it's because nobody who is interested has both the time and the coding skills to tackle it. Personally though, I think the biggest issue with what was done was not the memory consumption, but the fact that there was no switch to turn it on or off. Making defrag unconditionally snapshot aware removes one of the easiest ways to forcibly unshare data without otherwise altering the files (which, as stupid as it sounds, is actually really useful for some storage setups), and also forces the people who have ridiculous numbers of snapshots to deal with the memory usage or never defrag. Good points. The logic of the sharing-aware is a technical detail, what's being discussed is the usecase and I think this would be good to clarify. 1) always -- the old (and now disabled) way, unconditionally (ie. no option for the user), problems with memory consumption 2) more fine grained: 2.1) defragment only the non-shared extents, ie. no sharing awareness needed, shared extents will be silently skipped 2.2) defragment only within the given subvolume -- like 1) but by user's choice The naive dedup, that Tomasz (CCed) mentions in another mail, would be probably beyond the defrag purpose and would make things more complicated. I'd vote for keeping complexity of the ioctl interface and defrag implementation low, so if it's simply saying "do forcible defrag" or "skip shared", then it sounds ok. If there's eg. "keep sharing only on this subvolunes", then it would need to read the snapshot ids from ioctl structure, then enumerate all extent owners and do some magic to unshare/defrag/share. That's a quick idea, lots of details would need to be clarified. From my perspective, I see two things to consider that are somewhat orthogonal to each other: 1. Whether to recurse into subvolumes or not (IIRC, we currently do not do so, because we see them like a mount point). 2. Whether to use the simple (not reflink-aware) defrag, the reflink aware one, or to base it on the extent/file type (use old simpler one for unshared extents, and new reflink aware one for shared extents). This second set of options is what I'd like to see the most (possibly without the option to base it on file or extent sharing automatically), though the first one would be nice to have. Better yet, having that second set of options and making the new reflink-aware defrag opt-in would allow people who really want it to use it, and those of us who don't need it for our storage setups to not need to worry about it. 
Re: Any chance to get snapshot-aware defragmentation?
On 2018-05-21 09:42, Timofey Titovets wrote: пн, 21 мая 2018 г. в 16:16, Austin S. Hemmelgarn <ahferro...@gmail.com>: On 2018-05-19 04:54, Niccolò Belli wrote: On venerdì 18 maggio 2018 20:33:53 CEST, Austin S. Hemmelgarn wrote: With a bit of work, it's possible to handle things sanely. You can deduplicate data from snapshots, even if they are read-only (you need to pass the `-A` option to duperemove and run it as root), so it's perfectly reasonable to only defrag the main subvolume, and then deduplicate the snapshots against that (so that they end up all being reflinks to the main subvolume). Of course, this won't work if you're short on space, but if you're dealing with snapshots, you should have enough space that this will work (because even without defrag, it's fully possible for something to cause the snapshots to suddenly take up a lot more space). Been there, tried that. Unfortunately even if I skip the defreg a simple duperemove -drhA --dedupe-options=noblock --hashfile=rootfs.hash rootfs is going to eat more space than it was previously available (probably due to autodefrag?). It's not autodefrag (that doesn't trigger on use of the EXTENT_SAME ioctl). There's two things involved here: * BTRFS has somewhat odd and inefficient handling of partial extents. When part of an extent becomes unused (because of a CLONE ioctl, or an EXTENT_SAME ioctl, or something similar), that part stays allocated until the whole extent would be unused. * You're using the default deduplication block size (128k), which is larger than your filesystem block size (which is at most 64k, most likely 16k, but might be 4k if it's an old filesystem), so deduplicating can split extents. That's a metadata node leaf != fs block size. btrfs fs block size == machine page size currently. You're right, I keep forgetting about that (probably because BTRFS is pretty much the only modern filesystem that doesn't let you change the block size). Because of this, if a duplicate region happens to overlap the front of an already shared extent, and the end of said shared extent isn't aligned with the deduplication block size, the EXTENT_SAME call will deduplicate the first part, creating a new shared extent, but not the tail end of the existing shared region, and all of that original shared region will stick around, taking up extra space that it wasn't before. Additionally, if only part of an extent is duplicated, then that area of the extent will stay allocated, because the rest of the extent is still referenced (so you won't necessarily see any actual space savings). You can mitigate this by telling duperemove to use the same block size as your filesystem using the `-b` option. Note that using a smaller block size will also slow down the deduplication process and greatly increase the size of the hash file. duperemove -b control "how hash data", not more or less and only support 4KiB..1MiB And you can only deduplicate the data at the granularity you hashed it at. In particular: * The total size of a region being deduplicated has to be an exact multiple of the hash block size (what you pass to `-b`). So for the default 128k size, you can only deduplicate regions that are multiples of 128k long (128k, 256k, 384k, 512k, etc). This is a simple limit derived from how blocks are matched for deduplication. * Because duperemove uses fixed hash blocks (as opposed to using a rolling hash window like many file synchronization tools do), the regions being deduplicated also have to be exactly aligned to the hash block size. 
So, with the default 128k size, you can only deduplicate regions starting at 0k, 128k, 256k, 384k, 512k, etc, but not ones starting at, for example, 64k into the file. And size of block for dedup will change efficiency of deduplication, when count of hash-block pairs, will change hash file size and time complexity. Let's assume that: 'A' - 1KiB of data '' - 4KiB with repeated pattern. So, example, you have 2 of 2x4KiB blocks: 1: '' 2: '' With -b 8KiB hash of first block not same as second. But with -b 4KiB duperemove will see both '' and '' And then that blocks will be deduped. This supports what I'm saying though. Your deduplication granularity is bounded by your hash granularity. If in addition to the above you have a file that looks like: AABBBAA It would not get deduplicated against the first two at either `-b 4k` or `-b 8k` despite the middle 4k of the file being an exact duplicate of the final 4k of the first file and first 4k of the second one. If instead you have: AABB And the final 6k is a single on-disk extent, that extent will get split when you go to deduplicate against the first two files with a 4k block size because only the final 4k can be deduplicated, and the entire 6k original extent will stay completely allocated. Even, duperemove have 2 modes of deduping: 1. By extents 2. By blocks Ye
Re: Any chance to get snapshot-aware defragmentation?
On 2018-05-19 04:54, Niccolò Belli wrote: On venerdì 18 maggio 2018 20:33:53 CEST, Austin S. Hemmelgarn wrote: With a bit of work, it's possible to handle things sanely. You can deduplicate data from snapshots, even if they are read-only (you need to pass the `-A` option to duperemove and run it as root), so it's perfectly reasonable to only defrag the main subvolume, and then deduplicate the snapshots against that (so that they end up all being reflinks to the main subvolume). Of course, this won't work if you're short on space, but if you're dealing with snapshots, you should have enough space that this will work (because even without defrag, it's fully possible for something to cause the snapshots to suddenly take up a lot more space). Been there, tried that. Unfortunately even if I skip the defreg a simple duperemove -drhA --dedupe-options=noblock --hashfile=rootfs.hash rootfs is going to eat more space than it was previously available (probably due to autodefrag?). It's not autodefrag (that doesn't trigger on use of the EXTENT_SAME ioctl). There's two things involved here: * BTRFS has somewhat odd and inefficient handling of partial extents. When part of an extent becomes unused (because of a CLONE ioctl, or an EXTENT_SAME ioctl, or something similar), that part stays allocated until the whole extent would be unused. * You're using the default deduplication block size (128k), which is larger than your filesystem block size (which is at most 64k, most likely 16k, but might be 4k if it's an old filesystem), so deduplicating can split extents. Because of this, if a duplicate region happens to overlap the front of an already shared extent, and the end of said shared extent isn't aligned with the deduplication block size, the EXTENT_SAME call will deduplicate the first part, creating a new shared extent, but not the tail end of the existing shared region, and all of that original shared region will stick around, taking up extra space that it wasn't before. Additionally, if only part of an extent is duplicated, then that area of the extent will stay allocated, because the rest of the extent is still referenced (so you won't necessarily see any actual space savings). You can mitigate this by telling duperemove to use the same block size as your filesystem using the `-b` option. Note that using a smaller block size will also slow down the deduplication process and greatly increase the size of the hash file. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
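A minimal sketch of that suggestion, reusing the command from earlier in the thread but with the hash block size pinned to a 4k filesystem block size (paths and hashfile name are just examples):
duperemove -drhA -b 4k --dedupe-options=noblock --hashfile=rootfs.hash rootfs
Expect the run to take noticeably longer and the hash file to grow accordingly, as noted above.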
Re: Any chance to get snapshot-aware defragmentation?
On 2018-05-18 13:18, Niccolò Belli wrote: On venerdì 18 maggio 2018 19:10:02 CEST, Austin S. Hemmelgarn wrote: and also forces the people who have ridiculous numbers of snapshots to deal with the memory usage or never defrag Whoever has at least one snapshot is never going to defrag anyway, unless he is willing to double the used space. With a bit of work, it's possible to handle things sanely. You can deduplicate data from snapshots, even if they are read-only (you need to pass the `-A` option to duperemove and run it as root), so it's perfectly reasonable to only defrag the main subvolume, and then deduplicate the snapshots against that (so that they end up all being reflinks to the main subvolume). Of course, this won't work if you're short on space, but if you're dealing with snapshots, you should have enough space that this will work (because even without defrag, it's fully possible for something to cause the snapshots to suddenly take up a lot more space). -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
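For what it's worth, a rough sketch of that defrag-then-dedupe sequence might look like this (the subvolume layout is hypothetical, and duperemove needs to run as root for the read-only snapshots):
btrfs filesystem defragment -r /mnt/pool/@main
duperemove -drA --hashfile=/var/tmp/dedupe.hash /mnt/pool/@main /mnt/pool/snapshots
Whether this actually saves space depends on how much slack is available while the deduplication runs, as noted above.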
Re: Any chance to get snapshot-aware defragmentation?
On 2018-05-18 12:36, Niccolò Belli wrote: On venerdì 18 maggio 2018 18:20:51 CEST, David Sterba wrote: Josef started working on that in 2014 and did not finish it. The patches can be still found in his tree. The problem is in excessive memory consumption when there are many snapshots that need to be tracked during the defragmentation, so there are measures to avoid OOM. There's infrastructure ready for use (shrinkers), there are maybe some problems but fundamentally is should work. I'd like to get the snapshot-aware working again too, we'd need to find a volunteer to resume the work on the patchset. Yeah I know of Josef's work, but 4 years had passed since then without any news on this front. What I would really like to know is why nobody resumed his work: is it because it's impossible to implement snapshot-aware degram without excessive ram usage or is it simply because nobody is interested? I think it's because nobody who is interested has both the time and the coding skills to tackle it. Personally though, I think the biggest issue with what was done was not the memory consumption, but the fact that there was no switch to turn it on or off. Making defrag unconditionally snapshot aware removes one of the easiest ways to forcibly unshare data without otherwise altering the files (which, as stupid as it sounds, is actually really useful for some storage setups), and also forces the people who have ridiculous numbers of snapshots to deal with the memory usage or never defrag. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 0/3] btrfs: add read mirror policy
On 2018-05-18 04:06, Anand Jain wrote: Thanks Austin and Jeff for the suggestion. I am not particularly a fan of mount option either mainly because those options aren't persistent and host independent luns will have tough time to have them synchronize manually. Properties are better as it is persistent. And we can apply this read_mirror_policy property on the fsid object. But if we are talking about the properties then it can be stored as extended attributes or ondisk key value pair, and I am doubt if ondisk key value pair will get a nod. I can explore the extended attribute approach but appreciate more comments. Hmm, thinking a bit further, might it be easier to just keep this as a mount option, and add something that lets you embed default mount options in the volume in a free-form manner? Then, you could set this persistently there, and could specify any others you want too. Doing that would also give very well defined behavior for exactly when changes would apply (the next time you mount or remount the volume), though handling of whether or not an option came from there or was specified on the command-line might be a bit complicated. On 05/17/2018 10:46 PM, Jeff Mahoney wrote: On 5/17/18 8:25 AM, Austin S. Hemmelgarn wrote: On 2018-05-16 22:32, Anand Jain wrote: On 05/17/2018 06:35 AM, David Sterba wrote: On Wed, May 16, 2018 at 06:03:56PM +0800, Anand Jain wrote: Not yet ready for the integration. As I need to introduce -o no_read_mirror_policy instead of -o read_mirror_policy=- Mount option is mostly likely not the right interface for setting such options, as usual. I am ok to make it ioctl for the final. What do you think? But to reproduce the bug posted in Btrfs: fix the corruption by reading stale btree blocks It needs to be a mount option, as randomly the pid can still pick the disk specified in the mount option. Personally, I'd vote for filesystem property (thus handled through the standard `btrfs property` command) that can be overridden by a mount option. With that approach, no new tool (or change to an existing tool) would be needed, existing volumes could be converted to use it in a backwards compatible manner (old kernels would just ignore the property), and you could still have the behavior you want in tests (and in theory it could easily be adapted to be a per-subvolume setting if we ever get per-subvolume chunk profile support). Properties are a combination of interfaces presented through a single command. Although the kernel API would allow a direct-to-property interface via the btrfs.* extended attributes, those are currently limited to a single inode. The label property is set via ioctl and stored in the superblock. The read-only subvolume property is also set by ioctl but stored in the root flags. As it stands, every property is explicitly defined in the tools, so any addition would require tools changes. This is a bigger discussion, though. We *could* use the xattr interface to access per-root or fs-global properties, but we'd need to define that interface. btrfs_listxattr could get interesting, though I suppose we could simplify it by only allowing the per-subvolume and fs-global operations on root inodes. -Jeff -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
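To illustrate the existing per-inode interface mentioned above, the btrfs.* extended attributes can already be read and written with the usual xattr tools (the file path is just an example):
setfattr -n btrfs.compression -v lzo /mnt/data/somefile
getfattr -n btrfs.compression /mnt/data/somefile
Extending that same namespace to per-subvolume or fs-global properties is the part that would need a newly defined interface.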
Re: [PATCH v2 0/3] btrfs: add read mirror policy
On 2018-05-17 10:46, Jeff Mahoney wrote: On 5/16/18 6:35 PM, David Sterba wrote: On Wed, May 16, 2018 at 06:03:56PM +0800, Anand Jain wrote: Not yet ready for the integration. As I need to introduce -o no_read_mirror_policy instead of -o read_mirror_policy=- Mount option is mostly likely not the right interface for setting such options, as usual. I've seen a few alternate suggestions in the thread. I suppose the real question is: what and where is the intended persistence for this choice? A mount option gets it via fstab. How would a user be expected to set it consistently via ioctl on each mount? Properties could work, but there's more discussion needed there. Personally, I like the property idea since it could conceivably be used on a per-file basis. For the specific proposed use case (the tests), it probably doesn't need to be persistent beyond mount options. However, this also allows for a trivial configuration using a slow storage device to provide redundancy for a fast storage device of the same size, which is potentially very useful for some people. In that case, I can see most people who would be using it wanting it to follow the filesystem regardless of what context it's being mounted in (for example, it shouldn't need an extra option if mounted from a recovery environment or if it's moved to another system). Most of my reason for recommending properties is that filesystem level properties appear to be the best thing BTRFS has to store per-volume configuration that's supposed to stay with the volume, despite not really being used for that even though there are quite a few mount options that are logical candidates for this type of thing (for example, the `ssd` options, `metadata_ratio`, and `max_inline` all make more logical sense as a property of the volume, not the mount). -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 0/3] btrfs: add read mirror policy
On 2018-05-16 22:32, Anand Jain wrote: On 05/17/2018 06:35 AM, David Sterba wrote: On Wed, May 16, 2018 at 06:03:56PM +0800, Anand Jain wrote: Not yet ready for the integration. As I need to introduce -o no_read_mirror_policy instead of -o read_mirror_policy=- Mount option is mostly likely not the right interface for setting such options, as usual. I am ok to make it ioctl for the final. What do you think? But to reproduce the bug posted in Btrfs: fix the corruption by reading stale btree blocks It needs to be a mount option, as randomly the pid can still pick the disk specified in the mount option. Personally, I'd vote for filesystem property (thus handled through the standard `btrfs property` command) that can be overridden by a mount option. With that approach, no new tool (or change to an existing tool) would be needed, existing volumes could be converted to use it in a backwards compatible manner (old kernels would just ignore the property), and you could still have the behavior you want in tests (and in theory it could easily be adapted to be a per-subvolume setting if we ever get per-subvolume chunk profile support). Of course, I'd actually like to see most of the mount options available as filesystem level properties with the option to override through mount options, but that's a lot more ambitious of an undertaking. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
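For context, the `btrfs property` interface referred to here already works like this for the handful of existing properties (the mount point and subvolume path are examples):
btrfs property get /mnt ro
btrfs property set /mnt/subvol ro true
The suggestion is essentially to grow that list so something like a read mirror policy could be stored the same way and overridden at mount time.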
Re: [PATCH v2 3/3] btrfs: balance: add kernel log for end or paused
On 2018-05-16 09:23, Anand Jain wrote: On 05/16/2018 07:25 PM, Austin S. Hemmelgarn wrote: On 2018-05-15 22:51, Anand Jain wrote: Add a kernel log when the balance ends, either for cancel or completed or if it is paused.
---
v1->v2: Moved from 2/3 to 3/3
 fs/btrfs/volumes.c | 7 +++++++
 1 file changed, 7 insertions(+)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index ce68c4f42f94..a4e243a29f5c 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4053,6 +4053,13 @@ int btrfs_balance(struct btrfs_fs_info *fs_info,
 	ret = __btrfs_balance(fs_info);
 	mutex_lock(&fs_info->balance_mutex);
+	if (ret == -ECANCELED && atomic_read(&fs_info->balance_pause_req))
+		btrfs_info(fs_info, "balance: paused");
+	else if (ret == -ECANCELED && atomic_read(&fs_info->balance_cancel_req))
+		btrfs_info(fs_info, "balance: canceled");
+	else
+		btrfs_info(fs_info, "balance: ended with status: %d", ret);
+
 	clear_bit(BTRFS_FS_BALANCE_RUNNING, &fs_info->flags);
 	if (bargs) {
Is there some way that these messages could be extended to include info about which volume the balance in question was on? Ideally, something that matches up with what's listed in the message from the previous patch. There's nothing that prevents you from running balances on separate BTRFS volumes in parallel, so this message won't necessarily be for the most recent balance start message. Hm. That's not true, btrfs_info(fs_info,,) adds the fsid. So it's already drilled down to the lowest granular possible. Ah, you're right, it does. Sorry about the noise.
Re: [PATCH v2 3/3] btrfs: balance: add kernel log for end or paused
On 2018-05-15 22:51, Anand Jain wrote: Add a kernel log when the balance ends, either for cancel or completed or if it is paused.
---
v1->v2: Moved from 2/3 to 3/3
 fs/btrfs/volumes.c | 7 +++++++
 1 file changed, 7 insertions(+)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index ce68c4f42f94..a4e243a29f5c 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4053,6 +4053,13 @@ int btrfs_balance(struct btrfs_fs_info *fs_info,
 	ret = __btrfs_balance(fs_info);
 	mutex_lock(&fs_info->balance_mutex);
+	if (ret == -ECANCELED && atomic_read(&fs_info->balance_pause_req))
+		btrfs_info(fs_info, "balance: paused");
+	else if (ret == -ECANCELED && atomic_read(&fs_info->balance_cancel_req))
+		btrfs_info(fs_info, "balance: canceled");
+	else
+		btrfs_info(fs_info, "balance: ended with status: %d", ret);
+
 	clear_bit(BTRFS_FS_BALANCE_RUNNING, &fs_info->flags);
 	if (bargs) {
Is there some way that these messages could be extended to include info about which volume the balance in question was on? Ideally, something that matches up with what's listed in the message from the previous patch. There's nothing that prevents you from running balances on separate BTRFS volumes in parallel, so this message won't necessarily be for the most recent balance start message.
Re: Btrfs installation advices
On 2018-05-12 21:58, faurepi...@gmail.com wrote: Thanks you two very much for your answers. So if I sum up correctly, I could: 1- use Self-Encrypting Drive (SED), since my drive is a Samsung NVMe 960 EVO, which is supposed to support SED according to http://www.samsung.com/semiconductor/minisite/ssd/support/faqs-nvmessd: "*Do Samsung NVMe M.2 SSDs have hardware encryption?* Samsung NVMe SSDs provide internal hardware encryption of all data stored on the SSD, including the operating system. Data is decrypted through a pre-boot authentication process. Because all user data is encrypted, private information is protected against loss or theft. Encryption is done by hardware, which provides a safer environment without sacrificing performance. The encryption methods provided by each Samsung NVMe SSD are: AES (Advanced Encryption Standard, Class0 SED) TCG/OPAL, and eDrive Please note that you cannot use more than one encryption method simultaneously. *Do Samsung NVMe M.2 SSDs support TCG Opal?* TCG Opal is supported by Samsung NVMe SSDs (960EVO / PRO and newer). It is an authentication method that employs the protocol specified by the Trusted Computing Group (TCG) meaning that you will need to install TCG software supplied by a TCG OPAL software development company. User authentication is done by pre-boot authentication provided by the software. For more detailed information and instructions, please contact a TCG software company. In addition, TCG/opal can only be enabled / disabled by using special security software. " For the moment, I don't know how to use that self-encryption from linux. Could you please give me some tips or links about how you did? 2- now that the full drive is self-encrypted, I can build manually the three partitions from a live system: boot with ext(2,3,4), swap with swap, and root with btrfs 3- and finally install debian sid in the dedicaced partitions. Am I right? :) Yes, that approach will work, assuming you trust Samsung (since they're the ones who wrote the code responsible for the encryption, and you can't inspect that code yourself). Le 08/05/2018 à 13:32, Austin S. Hemmelgarn a écrit : On 2018-05-08 03:50, Rolf Wald wrote: Hello, some hints inside Am 08.05.2018 um 02:22 schrieb faurepi...@gmail.com: Hi, I'm curious about btrfs, and maybe considering it for my new laptop installation (a Lenovo T470). I was going to install my usual lvm+ext4+full disk encryption setup, but thought I should maybe give a try to btrfs. Is it possible to meet all these criteria? - operating system: debian sid - file system: btrfs - disk encryption (or at least of sensitives partitions) - hibernation feature (which implies a swap partition or file, and I've read btrfs is not a big fan of the latter) A swap partition is not possible inside or with btrfs alone. You can choose btrfs filesystem out of the box in debian install, but that would mean full-disk-encryption with lvm and btrfs. The extra layer lvm doesn't hurt, but you have two layers with many functions double, e.g. snapshotting, resize. Um, this isn't really as much of an issue as you might think. LVM has near zero overhead unless you're actually doing any of that stuff (as long as the LV is just a simple linear mapping, it has less than 1% more overhead than just using partitions). The only real caveat here is to make _ABSOLUTELY CERTAIN_ that you _DO NOT_ make LVM snapshots of _ANY_ BTRFS volumes. Doing so is a recipe for disaster, and will likely eat at least your data, and possibly your children. 
The bigger issue is that dm-crypt generally slows down device access, which BTRFS is very sensitive to. Using BTRFS with FDE works, but it's slow, so I would only suggest doing it with an SSD (and if you're using an SSD, you may be better off getting a TCG Opal compliant self-encrypting drive and just using the self-encryption functionality instead of FDE). If yes, how would you suggest me to achieve it? Yes, there is a solution, and it works for me now several years. You need to build three partitions, e.g. named boot, swap, root. The sizes choose to your need. the boot partition remains unencrypted, but the other two partitions are encrypted with cryptsetup (luks) separately. Normally there are two passphrases to type in (and to remember), but there is an option in the cryptsetup scripts (/lib/cryptsetup/scripts) decrypt_derived, which could take the key from the root partition to decrypt the swap partition also. The filesystems then on the partitions are boot with ext(2,3,4), swap with swap and root with btrfs. This configuration is not reachable with a standard debian installation. Debian always choose lvm if you want full encryption. You have to do the first steps manually: make partitions, cryptsetup(luks) for the partitions swap and root, and open the encrypted partitions manually. After that you can install your OS. The manual steps you ha
Re: Btrfs installation advices
On 2018-05-08 03:50, Rolf Wald wrote: Hello, some hints inside. On 08.05.2018 at 02:22, faurepi...@gmail.com wrote: Hi, I'm curious about btrfs, and maybe considering it for my new laptop installation (a Lenovo T470). I was going to install my usual lvm+ext4+full disk encryption setup, but thought I should maybe give a try to btrfs. Is it possible to meet all these criteria? - operating system: debian sid - file system: btrfs - disk encryption (or at least of sensitive partitions) - hibernation feature (which implies a swap partition or file, and I've read btrfs is not a big fan of the latter) A swap partition is not possible inside btrfs alone. You can choose the btrfs filesystem out of the box in the debian installer, but that would mean full-disk encryption with lvm and btrfs. The extra lvm layer doesn't hurt, but you have two layers with many functions duplicated, e.g. snapshotting and resizing. Um, this isn't really as much of an issue as you might think. LVM has near zero overhead unless you're actually doing any of that stuff (as long as the LV is just a simple linear mapping, it has less than 1% more overhead than just using partitions). The only real caveat here is to make _ABSOLUTELY CERTAIN_ that you _DO NOT_ make LVM snapshots of _ANY_ BTRFS volumes. Doing so is a recipe for disaster, and will likely eat at least your data, and possibly your children. The bigger issue is that dm-crypt generally slows down device access, which BTRFS is very sensitive to. Using BTRFS with FDE works, but it's slow, so I would only suggest doing it with an SSD (and if you're using an SSD, you may be better off getting a TCG Opal compliant self-encrypting drive and just using the self-encryption functionality instead of FDE). If yes, how would you suggest I achieve it? Yes, there is a solution, and it has worked for me for several years now. You need to build three partitions, e.g. named boot, swap, and root; choose the sizes to fit your needs. The boot partition remains unencrypted, but the other two partitions are encrypted with cryptsetup (LUKS) separately. Normally there are two passphrases to type in (and to remember), but there is a script among the cryptsetup scripts (/lib/cryptsetup/scripts), decrypt_derived, which can take the key from the root partition to decrypt the swap partition as well. The filesystems on the partitions are then boot with ext(2,3,4), swap with swap, and root with btrfs. This configuration is not reachable with a standard debian installation; debian always chooses lvm if you want full encryption. You have to do the first steps manually: make the partitions, run cryptsetup (LUKS) for the swap and root partitions, and open the encrypted partitions manually. After that you can install your OS. The manual steps have to be done from a working distro, e.g. a live system (disk or stick) with a recent kernel and recent btrfs-progs (debian sid is ok for this). After the install of the OS you have to make the changes for a successful (re)boot manually. Please read the advice you can find on the net; there are some nice articles. Thanks for your kind help.
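The three-partition LUKS layout Rolf describes above can be sketched roughly as the following Python dry-run script. This is a minimal sketch, not taken from the original mails: the device names, mapper names, and the crypttab line are assumptions, and the exact crypttab/keyscript syntax should be checked against the Debian cryptsetup documentation before use. By default it only prints the commands.

#!/usr/bin/env python3
# Sketch of the manual pre-install steps, intended to be run from a live system.
# Assumed layout: p1 = boot (unencrypted), p2 = swap (LUKS), p3 = root (LUKS).
import subprocess

BOOT, SWAP, ROOT = "/dev/nvme0n1p1", "/dev/nvme0n1p2", "/dev/nvme0n1p3"

steps = [
    ["mkfs.ext4", BOOT],                          # boot stays unencrypted
    ["cryptsetup", "luksFormat", ROOT],           # encrypt root and swap
    ["cryptsetup", "luksFormat", SWAP],           #   separately with LUKS
    ["cryptsetup", "open", ROOT, "cryptroot"],    # open the LUKS containers
    ["cryptsetup", "open", SWAP, "cryptswap"],
    ["mkfs.btrfs", "/dev/mapper/cryptroot"],      # filesystems on the mappings
    ["mkswap", "/dev/mapper/cryptswap"],
]

for cmd in steps:
    print("would run:", " ".join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually execute

# In the installed system, an /etc/crypttab entry along these lines lets
# decrypt_derived reuse cryptroot's key for swap, so only one passphrase is
# needed (syntax is an assumption; verify against the Debian docs):
#   cryptswap  /dev/nvme0n1p2  cryptroot  luks,swap,keyscript=decrypt_derived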
Re: RAID56 - 6 parity raid
On 2018-05-03 04:11, Andrei Borzenkov wrote: On Wed, May 2, 2018 at 10:29 PM, Austin S. Hemmelgarn <ahferro...@gmail.com> wrote: ... Assume you have a BTRFS raid5 volume consisting of 6 8TB disks (which gives you 40TB of usable space). You're storing roughly 20TB of data on it, using a 16kB block size, and it sees about 1GB of writes a day, with no partial stripe writes. You, for reasons of argument, want to scrub it every week, because the data in question matters a lot to you. With a decent CPU, let's say you can compute 1.5GB/s worth of checksums, and can compute the parity at a rate of 1.25GB/s (the ratio here is about the average across the almost 50 systems I have quick access to check, including a number of server and workstation systems less than a year old, though the numbers themselves are artificially low to accentuate the point here). At this rate, scrubbing by computing parity requires processing: * Checksums for 20TB of data, at a rate of 1.5GB/s, which would take roughly 13333 seconds, or 222 minutes, or about 3.7 hours. * Parity for 20TB of data, at a rate of 1.25GB/s, which would take 16000 seconds, or 267 minutes, or roughly 4.4 hours. So, over a week, you would be spending 8.1 hours processing data solely for data integrity, or roughly 4.8214% of your time. Now assume instead that you're doing checksummed parity: * Scrubbing data is the same, 3.7 hours. * Scrubbing parity turns into computing checksums for 4TB of data, which would take roughly 2667 seconds, or 44 minutes, or roughly 0.74 hours. Scrubbing must compute parity and compare with stored value to detect the write hole. Otherwise you end up with parity having a good checksum but not matching the rest of the data. Yes, but that assumes we don't do anything to deal with the write hole, and it's been pretty much decided by the devs that they're going to try to close the write hole.
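The scrub-time arithmetic in Austin's example can be checked with a small Python sketch. It just reproduces the numbers above under the same assumptions (SI units, 20TB of data on a 6x8TB raid5, 1.5GB/s checksumming, 1.25GB/s parity computation); results differ from the prose only by rounding.

# Reproduces the scrub-time arithmetic above (SI units throughout).
DATA = 20e12          # bytes of data stored on the 6x8TB raid5 volume
PARITY = DATA / 5     # 5 data + 1 parity per stripe -> 4TB of parity
CSUM_RATE = 1.5e9     # bytes/s of checksum computation
PARITY_RATE = 1.25e9  # bytes/s of parity computation
WEEK = 7 * 24 * 3600

def hours(seconds):
    return seconds / 3600

# Scrub by recomputing parity: checksum all data, recompute all parity.
data_csum = DATA / CSUM_RATE        # ~13333 s, ~3.7 h
parity_recalc = DATA / PARITY_RATE  # 16000 s, ~4.4 h
recompute = data_csum + parity_recalc
print(f"recompute-parity scrub: {hours(recompute):.1f} h/week "
      f"({recompute / WEEK:.2%} of the week)")

# Scrub with checksummed parity: checksum all data, checksum only the parity.
parity_csum = PARITY / CSUM_RATE    # ~2667 s, ~0.74 h
checksummed = data_csum + parity_csum
print(f"checksummed-parity scrub: {hours(checksummed):.1f} h/week "
      f"({checksummed / WEEK:.2%} of the week)")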
Re: RAID56 - 6 parity raid
On 2018-05-02 16:40, Goffredo Baroncelli wrote: On 05/02/2018 09:29 PM, Austin S. Hemmelgarn wrote: On 2018-05-02 13:25, Goffredo Baroncelli wrote: On 05/02/2018 06:55 PM, waxhead wrote: So again, which problem would having the parity checksummed solve? To the best of my knowledge, nothing. In any case the data is checksummed, so it is impossible to return corrupted data (modulo bugs :-) ). I am not a BTRFS dev, but this should be quite easy to answer. Unless you checksum the parity there is no way to verify that the data (parity) you use to reconstruct other data is correct. In any case you could catch that the computed data is wrong, because the data is always checksummed. And in any case you must check the data against its checksum. My point is that storing the checksum is a cost that you pay *every time*. Every time you update a part of a stripe you need to update the parity, and then in turn the parity checksum. It is not a problem of space occupied nor a computational problem. It is a problem of write amplification... The only gain is avoiding trying to use the parity when a) you need it (i.e. when the data is missing and/or corrupted) and b) it is corrupted. But the likelihood of this case is very low. And you can catch it during the data checksum check (which has to be performed in any case!). So on one side you have a *cost every time* (the write amplification); on the other side you have a gain (cpu time) *only in the case* that the parity is corrupted and you need it (e.g. scrub or corrupted data). IMHO the cost is much higher than the gain, and the likelihood of the gain is much lower than the likelihood (=100%, i.e. always) of the cost. You do realize that a write is already rewriting checksums elsewhere? It would be pretty trivial to make sure that the checksums for every part of a stripe end up in the same metadata block, at which point the only cost is computing the checksum (because when a checksum gets updated, the whole block it's in gets rewritten, period, because that's how CoW works). Looking at this another way (all the math below uses SI units): [...] Good point: precomputing the checksum of the parity saves a lot of time for the scrub process. You can see this more simply by saying that the parity calculation (which is dominated by the memory bandwidth) is O(n) (where n is the number of disks), while checking the parity against a checksum (again dominated by the memory bandwidth) is O(1). And when the data written is two orders of magnitude less than the data stored, the effort required to precompute the checksum is negligible. Excellent point about the computational efficiency, I had not thought of framing things that way. Anyway, my "rant" started when Duncan put the missing parity checksums and the write hole side by side. The first might be a performance problem; the write hole, instead, could lead to losing data. My intention was to highlight that the parity checksum is not related to the reliability and safety of raid5/6. It may not be related to the safety, but it is arguably indirectly related to the reliability, depending on your definition of reliability. Spending less time verifying the parity means you're spending less time in an indeterminate state of usability, which arguably does improve the reliability of the system. However, that does still have nothing to do with the write hole. So, let's look at data usage: 1GB of data translates to 62500 16kB blocks of data, which equates to an additional 15625 blocks for parity.
Adding parity checksums adds a 25% overhead to checksums being written, but that actually doesn't translate to a huge increase in the number of _blocks_ of checksums written. One 16k block can hold roughly 500 checksums, so it would take 125 blocks worth of checksums without parity, and 157 (technically 156.25, but you can't write a quarter block) with parity checksums. Thus, without parity checksums, writing 1GB of data involves writing 78250 blocks, while doing the same with parity checksums involves writing 78282 blocks, a net change of only 32 blocks, or **0.0409%**. How would you store the checksum? I ask that because I am not sure that we could use the "standard" btrfs metadata to store the checksum of the parity. Doing so, you could face a pathological effect like: - update a block(1) in a stripe(1) - update the parity of stripe(1) containing block(1) - update the checksum of parity stripe(1), which is contained in another stripe(2) [**] - update the parity of stripe(2) containing the checksum of parity stripe(1) - update the checksum of parity stripe(2), which is contained in another stripe(3) and so on... [**] Note that the checksum and the parity *have* to be in different stripes, otherwise you have the chicken/egg problem: compute the parity, then update the checksum, then update the parity again because the stripe changed, and so on.
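Austin's write-amplification figures earlier in this exchange can be reproduced with a short Python sketch. The data-to-parity block ratio (4:1) and the ~500 checksums per 16k metadata block are simply taken from his example, not derived here.

# Reproduces the block-count arithmetic above for 1GB of data written.
import math

DATA_BLOCKS = 1_000_000_000 // 16_000   # 62500 16kB blocks per 1GB written
PARITY_BLOCKS = DATA_BLOCKS // 4        # 15625, the ratio used in the example
CSUMS_PER_BLOCK = 500                   # checksums per 16k metadata block

csum_blocks_without = math.ceil(DATA_BLOCKS / CSUMS_PER_BLOCK)                 # 125
csum_blocks_with = math.ceil((DATA_BLOCKS + PARITY_BLOCKS) / CSUMS_PER_BLOCK)  # 157

total_without = DATA_BLOCKS + PARITY_BLOCKS + csum_blocks_without  # 78250
total_with = DATA_BLOCKS + PARITY_BLOCKS + csum_blocks_with        # 78282

extra = total_with - total_without
print(f"extra blocks per 1GB written: {extra} ({extra / total_without:.4%})")
# -> extra blocks per 1GB written: 32 (0.0409%)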