Re: What if TRIM issued a wipe on devices that don't TRIM?
On Thu, 6 Dec 2018 06:11:46 + Robert White wrote:

> So it would be dog-slow, but it would be neat if BTRFS had a mount
> option to convert any TRIM command from above into the write of a zero,
> 0xFF, or trash block to the device below if that device doesn't support
> TRIM. Real TRIM support would override the block write.

There is such a project: "dm-linear like target which provides discard, but
replaces it with write of random data to a discarded region. Thus, discarded
data is securely deleted."

https://github.com/vt-alt/dm-secdel

-- 
With respect,
Roman
btrfs progs always assume devid 1?
Hello,

To migrate my FS to a different physical disk, I have added a new empty
device to the FS, then ran the remove operation on the original one. Now my
FS has only devid 2:

Label: 'p1'  uuid: d886c190-b383-45ba-9272-9f00c6a10c50
	Total devices 1 FS bytes used 36.63GiB
	devid    2 size 50.00GiB used 45.06GiB path /dev/mapper/vg-p1

And all the operations of btrfs-progs now fail to work in their default
invocation, such as:

# btrfs fi resize max .
Resize '.' of 'max'
ERROR: unable to resize '.': No such device

[768813.414821] BTRFS info (device dm-5): resizer unable to find device 1

Of course this works:

# btrfs fi resize 2:max .
Resize '.' of '2:max'

But this is inconvenient and seems to be a rather simple oversight. If what
I got is normal (the device staying as ID 2 after such an operation), then
count that as a suggestion that btrfs-progs should use the first existing
devid, rather than always looking for the hard-coded devid 1.

-- 
With respect,
Roman
Re: btrfs-cleaner 100% busy on an idle filesystem with 4.19.3
On Thu, 22 Nov 2018 22:07:25 +0900 Tomasz Chmielewski wrote:

> Spot on!
> 
> Removed "discard" from fstab and added "ssd", rebooted - no more
> btrfs-cleaner running.

Recently there has been a bugfix for TRIM in Btrfs:

  btrfs: Ensure btrfs_trim_fs can trim the whole fs
  https://patchwork.kernel.org/patch/10579539/

Perhaps your upgraded kernel is the first one to contain it, and for the
first time you're seeing TRIM actually *work*, with the actual performance
impact of it on a large fragmented FS, instead of on just a few contiguous
unallocated areas.

-- 
With respect,
Roman
Re: BTRFS on production: NVR 16+ IP Cameras
On Thu, 15 Nov 2018 11:39:58 -0700 Juan Alberto Cirez wrote:

> Is BTRFS mature enough to be deployed on a production system to underpin
> the storage layer of a 16+ ipcameras-based NVR (or VMS if you prefer)?

What are you looking to gain from using Btrfs on an NVR system? It doesn't
sound like any of its prime-time features -- such as snapshots, checksumming,
compression or reflink copying -- are a must for a bulk video recording
scenario.

And even if you don't need or use those features, you still pay the price
for having them: Btrfs can be 2x to 10x slower than its simpler competitors:
https://www.phoronix.com/scan.php?page=article&item=linux418-nvme-raid&num=2

If you just meant to use its multi-device support instead of a separate RAID
layer, IMO that can't be considered prime-time or depended upon in production
(yes, not even RAID 1 or 10).

-- 
With respect,
Roman
Re: unable to mount btrfs after upgrading from 4.16.1 to 4.19.1
On Sat, 10 Nov 2018 03:08:01 +0900 Tomasz Chmielewski wrote:

> After upgrading from kernel 4.16.1 to 4.19.1 and a clean restart, the fs
> no longer mounts:

Did you try rebooting back to 4.16.1 to see if it still mounts there?

-- 
With respect,
Roman
Re: CoW behavior when writing same content
On Tue, 9 Oct 2018 09:52:00 -0600 Chris Murphy wrote:

> You'll be left with three files. /big_file and root/big_file will
> share extents, and snapshot/big_file will have its own extents. You'd
> need to copy with --reflink for snapshot/big_file to have shared
> extents with /big_file - or deduplicate.

Or use rsync for copying, in the mode where it reads and checksums blocks of
both files, to copy only the non-matching portions.

       rsync --inplace
              This option is useful for transferring large files with
              block-based changes or appended data, and also on systems
              that are disk bound, not network bound. It can also help keep
              a copy-on-write filesystem snapshot from diverging the entire
              contents of a file that only has minor changes.

-- 
With respect,
Roman
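A minimal example of that rsync mode (paths are placeholders; --no-whole-file
forces the delta-transfer algorithm even for local copies, so matching blocks
are left untouched and keep sharing extents with the snapshot):

$ rsync -a --inplace --no-whole-file /mnt/fs/big_file /mnt/fs/snapshot/big_file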
Re: Problem with BTRFS
On Fri, 14 Sep 2018 19:27:04 +0200 Rafael Jesús Alcántara Pérez wrote:

> BTRFS info (device sdc1): use lzo compression, level 0
> BTRFS warning (device sdc1): 'recovery' is deprecated, use
> 'usebackuproot' instead
> BTRFS info (device sdc1): trying to use backup root at mount time
> BTRFS info (device sdc1): disk space caching is enabled
> BTRFS info (device sdc1): has skinny extents
> BTRFS error (device sdc1): super_total_bytes 601020864 mismatch with
> fs_devices total_rw_bytes 601023424

There is a recent feature added to "btrfs rescue" to fix this kind of
condition: https://patchwork.kernel.org/patch/10011399/

You need a recent version of the Btrfs tools for it; I'm not sure which
exactly, but I see that it's not in version 4.13 and is present in 4.17.

-- 
With respect,
Roman
Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD
On Fri, 17 Aug 2018 23:17:33 +0200 Martin Steigerwald wrote:

> > Do not consider SSD "compression" as a factor in any of your
> > calculations or planning. Modern controllers do not do it anymore,
> > the last ones that did are SandForce, and that's 2010 era stuff. You
> > can check for yourself by comparing write speeds of compressible vs
> > incompressible data, it should be the same. At most, the modern ones
> > know to recognize a stream of binary zeroes and have a special case
> > for that.
> 
> Interesting. Do you have any backup for your claim?

Just "something I read". I follow quite a bit of SSD-related articles and
reviews, which often also include a section on the controller utilized, its
background and technological improvements/changes -- and compression going
out of fashion after SandForce seems to be considered a well-known fact.

Incidentally, your old Intel 320 SSDs actually seem to be based on that old
SandForce controller (or at least license some of that IP to extend on it),
and hence those indeed might perform compression.

> As the data still needs to be transferred to the SSD at least when the
> SATA connection is maxed out I bet you won´t see any difference in write
> speed whether the SSD compresses in real time or not.

Most controllers expose two readings in SMART:

  - Lifetime writes from host (SMART attribute 241)
  - Lifetime writes to flash (attribute 233, or 177, or 173...)

It might be difficult to get the second one, as often it needs to be decoded
from others such as "Average block erase count" or "Wear leveling count".
(And it seems to be impossible on Samsung NVMe ones, for example.)

But if you have numbers for both, you know the write amplification of the
drive (and its past workload). If there is compression at work, you'd see
the 2nd number being somewhat, or significantly, lower -- and barely
increasing at all if you write highly compressible data.

This is not typically observed on modern SSDs, except maybe when writing
zeroes. Writes to flash will be the same as writes from host, or most often
somewhat higher, as the hardware can typically erase flash only in chunks of
2MB or so, hence there's quite a bit of under-the-hood reorganizing going on.
Also as a result, depending on workload, the "to flash" number can be much
higher than "from host".

Point is, even when the SATA link is maxed out in both cases, you can still
check whether there's compression at work by using those SMART attributes.

> In any case: It was a experience report, no request for help, so I don´t
> see why exact error messages are absolutely needed. If I had a support
> inquiry that would be different, I agree.

Well, when reading such stories (involving software that I also use) I
imagine what if I had been in that situation myself: what would I do, would
I have anything else to try, do I know about any workaround for this. And
without any technical details to go from, those are all questions left
unanswered.

-- 
With respect,
Roman
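As an illustration of reading those two counters (a sketch only: the device
name is a placeholder, and the attribute IDs and names differ between
vendors, so check the drive's documentation):

# smartctl -A /dev/sda | grep -E '^(173|177|233|241) '
# If attribute 241 (host writes) shows e.g. 12000 GiB written and the
# flash-writes counter shows 15000 GiB, write amplification is
# 15000 / 12000 = 1.25. A flash-writes figure noticeably *below* host
# writes, which barely grows while writing compressible data, would be
# the sign of controller-side compression discussed above.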
Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD
On Fri, 17 Aug 2018 14:28:25 +0200 Martin Steigerwald wrote:

> > First off, keep in mind that the SSD firmware doing compression only
> > really helps with wear-leveling. Doing it in the filesystem will help
> > not only with that, but will also give you more space to work with.
> 
> While also reducing the ability of the SSD to wear-level. The more data
> I fit on the SSD, the less it can wear-level. And the better I compress
> that data, the less it can wear-level.

Do not consider SSD "compression" as a factor in any of your calculations or
planning. Modern controllers do not do it anymore; the last ones that did
are SandForce, and that's 2010 era stuff. You can check for yourself by
comparing write speeds of compressible vs incompressible data, it should be
the same. At most, the modern ones know to recognize a stream of binary
zeroes and have a special case for that.

As for a general comment on this thread, always try to save the exact
messages you get when troubleshooting or getting failures from your system.
Saying just "was not able to add" or "btrfs replace not working" without any
exact details isn't really helpful as a bug report, or even as a general
"experiences" story, as we don't know what the exact cause of those was,
whether it could have been avoided or worked around, not to mention what
your FS state was at the time (as in "btrfs fi show" and "fi df").

-- 
With respect,
Roman
Re: trouble mounting btrfs filesystem....
On Tue, 14 Aug 2018 16:41:11 +0300 Dmitrii Tcvetkov wrote:

> If usebackuproot doesn't help then filesystem is beyond repair and you
> should try to refresh your backups with "btrfs restore" and restore from
> them[1].
> 
> [1]
> https://btrfs.wiki.kernel.org/index.php/FAQ#How_do_I_recover_from_a_.22parent_transid_verify_failed.22_error.3F

This is really the worst unfixed Btrfs issue today. It happens a lot on
unclean shutdowns or reboots, and the only advice is usually to "start over"
-- even if your FS is 40 TB and the discrepancy is just half a dozen
transids. There needs to be a way in fsck to accept (likely minor!) FS
damage and forcefully fix up the transids to what they should be -- or even
nuke the affected portions entirely.

-- 
With respect,
Roman
"error inheriting props for ino": Btrfs "compression" property
Hello,

On two machines I have subvolumes where I back up other hosts' root
filesystems via rsync. These subvolumes have the +c attribute on them.
During the backup, sometimes I get tons of messages like these in dmesg:

[Wed Jul 25 20:58:22 2018] BTRFS error (device dm-8): error inheriting props for ino 1213720 (root 1301): -28
[Wed Jul 25 20:58:22 2018] BTRFS error (device dm-8): error inheriting props for ino 1213723 (root 1301): -28
[Wed Jul 25 20:58:22 2018] BTRFS error (device dm-8): error inheriting props for ino 1213724 (root 1301): -28
[Wed Jul 25 20:58:22 2018] BTRFS error (device dm-8): error inheriting props for ino 1213725 (root 1301): -28

# btrfs inspect inode-resolve 1213720 .
./gemini/lib/modules/4.14.58-rm2+/kernel/virt

This seems to be related to the "compression" property in Btrfs:

# btrfs property get ./gemini/lib/modules/4.14.58-rm2+/
compression=zlib

# btrfs property get ./gemini/lib/modules/4.14.58-rm2+/kernel/virt
(no output)

Why would it fail like that? This does seem harmless, but the messages are
annoying and it's puzzling why this happens in the first place.

-- 
With respect,
Roman
Re: So, does btrfs check lowmem take days? weeks?
On Mon, 2 Jul 2018 08:19:03 -0700 Marc MERLIN wrote:

> I actually have fewer snapshots than this per filesystem, but I backup
> more than 10 filesystems.
> If I used as many snapshots as you recommend, that would already be 230
> snapshots for 10 filesystems :)

(...once again me with my rsync :) If you didn't use send/receive, you
wouldn't be required to keep a separate snapshot trail per filesystem backed
up; one trail of snapshots for the entire backup server would be enough.
Rsync everything to subdirs within one subvolume, then do timed or
event-based snapshots of it. You only need more than one trail if you want
different retention policies for different datasets (e.g. in my case I have
91 and 31 days).

-- 
With respect,
Roman
Re: So, does btrfs check lowmem take days? weeks?
On Fri, 29 Jun 2018 00:22:10 -0700 Marc MERLIN wrote:

> On Fri, Jun 29, 2018 at 12:09:54PM +0500, Roman Mamedov wrote:
> > On Thu, 28 Jun 2018 23:59:03 -0700
> > Marc MERLIN wrote:
> > 
> > > I don't waste a week recreating the many btrfs send/receive relationships.
> > 
> > Consider not using send/receive, and switching to regular rsync instead.
> > Send/receive is very limiting and cumbersome, including because of what you
> > described. And it doesn't gain you much over an incremental rsync. As for
> 
> Err, sorry but I cannot agree with you here, at all :)
> 
> btrfs send/receive is pretty much the only reason I use btrfs.
> rsync takes hours on big filesystems scanning every single inode on both
> sides and then seeing what changed, and only then sends the differences

I use it for backing up root filesystems of about 20 hosts, and for syncing
large multi-terabyte media collections -- it's fast enough in both cases.
Admittedly, neither of those cases has millions of subdirs or files where
scanning may take a long time. And in the former case it's also all from and
to SSDs.

Maybe your use case is different where it doesn't work as well. But perhaps
then general day-to-day performance is not great either, so I'd suggest
looking into SSD-based LVM caching, it really works wonders with Btrfs.

-- 
With respect,
Roman
Re: So, does btrfs check lowmem take days? weeks?
On Thu, 28 Jun 2018 23:59:03 -0700 Marc MERLIN wrote:

> I don't waste a week recreating the many btrfs send/receive relationships.

Consider not using send/receive, and switching to regular rsync instead.
Send/receive is very limiting and cumbersome, including because of what you
described. And it doesn't gain you much over an incremental rsync. As for
snapshots on the backup server, you can either automate making one as soon
as a backup has finished, or simply make them once/twice a day, during a
period when no backups are ongoing.

-- 
With respect,
Roman
Re: [PATCH] btrfs: inode: Don't compress if NODATASUM or NODATACOW set
On Mon, 14 May 2018 11:36:26 +0300 Nikolay Borisov wrote:

> So what made you have these expectation, is it codified somewhere
> (docs/man pages etc)? I'm fine with that semantics IF this is what
> people expect.

"Compression ... does not work for NOCOW files":
https://btrfs.wiki.kernel.org/index.php/Compression

The mount options man page does not say that the NOCOW attribute of files
will be disregarded with compress-force. It only mentions interaction with
the nodatacow and nodatasum mount options. So I'd expect the attribute to
still work and prevent compression of NOCOW files.

> Now the question is why people grew up to have this expectation and not the
> other way round? IMO force_compress should really disregard everything else

Both are knobs that the user needs to explicitly set; the difference is that
the +C attribute is fine-grained and the mount option is global. If they are
set by the user to conflicting values, it seems more useful to have the
fine-grained control override the global one, not the other way round.

-- 
With respect,
Roman
Re: [PATCH] btrfs: inode: Don't compress if NODATASUM or NODATACOW set
On Mon, 14 May 2018 11:10:34 +0300 Nikolay Borisov wrote:

> But if we have mounted the fs with FORCE_COMPRESS shouldn't we disregard
> the inode flags, presumably the admin knows what he is doing?

Please don't. Personally I always assumed chattr +C would prevent both CoW
and compression, and used that as a way to override volume-wide
compress-force for a particular folder. Now that it turns out this wasn't
working, the patch would fix it to behave in line with prior expectations.

-- 
With respect,
Roman
Re: zerofree btrfs support?
On Sat, 10 Mar 2018 16:50:22 +0100 Adam Borowski wrote:

> Since we're on a btrfs mailing list, if you use qemu, you really want
> sparse format:raw instead of qcow2 or preallocated raw. This also works
> great with TRIM.

Agreed, that's why I use RAW. QCOW2 would add a second layer of COW on top
of Btrfs, which sounds like a nightmare. Even if you run those files as
NOCOW in Btrfs, somehow I feel FS-native COW is more efficient than
emulating it in userspace with special format files.

> > It works, just not with some of the QEMU virtualized disk device drivers.
> > You don't need to use qemu-img to manually dig holes either, it's all
> > automatic.
> 
> It works only with scsi and virtio-scsi drivers. Most qemu setups use
> either ide (ouch!) or virtio-blk.

It works with IDE as well.

-- 
With respect,
Roman
Re: zerofree btrfs support?
On Sat, 10 Mar 2018 15:19:05 +0100 Christoph Anton Mitterer wrote:

> TRIM/discard... not sure how far this is really a solution.

It is the solution in a great many usage scenarios; I don't know enough
about your particular one, though. Note you can use it on HDDs too, even
without QEMU and the like: by using LVM "thin" volumes. I use that on a
number of machines; the benefit is that since TRIMed areas are "stored
nowhere", those partitions allow for incredibly fast block-level backups,
as it doesn't have to physically read in all the free space, let alone any
stale data in there. LVM snapshots are also way more efficient with thin
volumes, which helps during backup.

> dm-crypt per default blocks discard.

Out of misguided paranoia. If your crypto is any good (and last I checked
AES was good enough), there's really not a lot to gain for the "attacker"
from knowing which areas of the disk are used and which are not.

> Some longer time ago I had a look at whether qemu would support that on
> it's own,... i.e. the guest and it's btrfs would normally use discard,
> but the image file below would mark the block as discarded and later on
> e can use some qemu-img command to dig holes into exactly those
> locations.
> Back then it didn't seem to work.

It works, just not with some of the QEMU virtualized disk device drivers.
You don't need to use qemu-img to manually dig holes either, it's all
automatic.

> But even if it would in the meantime, a proper zerofree implementation
> would be beneficial for all non-qemu/qcow2 users (e.g. if one uses raw
> images in qemu, the whole thing couldn't work but with really zeroing
> the blocks inside the guest.

QEMU deallocates parts of its raw images for those areas which have been
TRIM'ed by the guest. In fact I never use qcow2, always raw images only.
Yet, boot a guest, issue fstrim, and see the raw file, while still having
the same apparent size, show much lower actual disk usage in "du".

-- 
With respect,
Roman
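For reference, a minimal sketch of the LVM thin setup mentioned above (the
volume group name and the sizes are made up for the example):

# lvcreate --type thin-pool -L 500G -n pool0 vg0
# lvcreate --type thin -V 2T -n data --thinpool pool0 vg0
# mkfs.btrfs /dev/vg0/data
# mount -o noatime /dev/vg0/data /mnt/data
# fstrim /mnt/data    # discards propagate down to the pool, keeping it thin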
Re: btrfs subvolume mount with different options
On Fri, 12 Jan 2018 17:49:38 + (GMT) "Konstantin V. Gavrilenko" wrote:

> Hi list,
> 
> just wondering whether it is possible to mount two subvolumes with different
> mount options, i.e.
> 
> |
> |- /a defaults,compress-force=lza

You can use different compression algorithms across the filesystem
(including none), via "btrfs property" on directories or subvolumes. They
are inherited down the tree.

$ mkdir test
$ sudo btrfs prop set test compression zstd
$ echo abc > test/def
$ sudo btrfs prop get test/def compression
compression=zstd

But it appears this doesn't provide a way to apply compress-force.

> |- /b defaults,nodatacow

Nodatacow can be applied to any dir/subvolume recursively, or to any file
(as long as it's created but not yet written to), via chattr +C.

-- 
With respect,
Roman
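A short illustration of the chattr route (the directory name is arbitrary);
files created inside a +C directory inherit the attribute, so they are NOCOW
from the moment they are created:

$ mkdir /mnt/fs/nocow-dir
$ chattr +C /mnt/fs/nocow-dir
$ touch /mnt/fs/nocow-dir/disk.img     # inherits +C while still empty
$ lsattr -d /mnt/fs/nocow-dir
---------------C--- /mnt/fs/nocow-dir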
Re: [4.14.3] btrfs out of space error
On Fri, 15 Dec 2017 01:39:03 +0100 Ian Kumlien wrote:

> Hi,
> 
> Running a 4.14.3 kernel, this just happened, but there should have
> been another 20 gigs or so available.
> 
> The filesystem seems fine after a reboot though

What are your mount options, and can you show the output of "btrfs fi df"
and "btrfs fi us" for the filesystem? And what does
"cat /sys/block/sdb/queue/rotational" return?

I wonder if it's the same old "ssd allocation scheme" problem, with no
balancing done in a long time or at all.

-- 
With respect,
Roman
Re: 4.14 balance: kernel BUG at /home/kernel/COD/linux/fs/btrfs/ctree.c:1856!
On Sat, 18 Nov 2017 02:08:46 +0100 Hans van Kranenburg wrote:

> It's using send + balance at the same time. There's something that makes
> btrfs explode when you do that.
> 
> It's not new in 4.14, I have seen it in 4.7 and 4.9 also, various
> different explosions in kernel log. Since that happened, I made sure I
> never did those two things at the same time.

Shouldn't it prevent send during balance, or balance during send, then, if
that's the case? You talk about it "exploding" like it's a normal thing, to
have invalid opcode BUGs in the kernel log, and the user has to take care
not to use two of the regular FS features at the same time. This seems to be
a bug which should be fixed, rather than warning everyone "not to send
during balance".

-- 
With respect,
Roman
Re: 4.13.12: kernel BUG at fs/btrfs/ctree.h:1802!
On Thu, 16 Nov 2017 16:12:56 -0800 Marc MERLIN wrote:

> On Thu, Nov 16, 2017 at 11:32:33PM +0100, Holger Hoffstätte wrote:
> > Don't pop the champagne just yet, I just read that apprently 4.14 broke
> > bcache for some people [1]. Not sure how much that affects you, but it might
> > well make things worse. Yeah, I know, wonderful.
> 
> Oh my, that's actually pretty terrible.
> I've just reverted both my machines to 3.13, the last thing I need is more
> btrfs corruption.

Why so far back though? The latest 4.4 and 4.9 are both good series and have
run without issues for me for a long time. Or perhaps you meant 4.13 :)

> I'm also starting to question if I should just drop bcache. It does help
> access to a big and slow-ish array, but corruption and periodic btrfs
> full rebuilds is not something I can afford to do timewise :-/

I suggest that you try lvmcache instead. It's much more flexible than
bcache, does pretty much the same job, and has much less of a "hacky" feel
to it.

-- 
With respect,
Roman
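For reference, a rough sketch of what converting to lvmcache could look like
(the VG, LV and device names here are placeholders, and it assumes the slow
array and the SSD are physical volumes in the same volume group):

# vgextend vg0 /dev/nvme0n1p1
# lvcreate --type cache-pool -L 64G -n cpool vg0 /dev/nvme0n1p1
# lvconvert --type cache --cachepool vg0/cpool vg0/bigdata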
Re: A partially failing disk in raid0 needs replacement
On Tue, 14 Nov 2017 15:09:52 +0100 Klaus Agnoletti wrote:

> Hi Roman
> 
> I almost understand :-) - however, I need a bit more information:
> 
> How do I copy the image file to the 6TB without screwing the existing
> btrfs up when the fs is not mounted? Should I remove it from the raid
> again?

Oh, you already added it to your FS, that's so unfortunate. For my scenario
I assumed you have a spare 6TB (or any 2TB+) disk you can use as temporary
space.

You could try removing it, but with one of the existing member drives
malfunctioning, I wonder if trying any operation on that FS will cause
further damage. For example, if you remove the 6TB one, how do you prevent
Btrfs from using the bad 2TB drive as the destination to relocate data from
the 6TB drive? Or from using it for one of the metadata mirrors, which will
fail to write properly, leading to transid failures later, etc.

-- 
With respect,
Roman
Re: A partially failing disk in raid0 needs replacement
On Tue, 14 Nov 2017 10:36:22 +0200 Klaus Agnoletti wrote:

> Obviously, I want /dev/sdd emptied and deleted from the raid.

* Unmount the RAID0 FS
* Copy the bad drive using `dd_rescue`[1] into a file on the 6TB drive
  (noting how much of it is actually unreadable -- chances are it's mostly
  intact)
* Physically remove the bad drive (have a powerdown or reboot for this, to
  be sure Btrfs didn't remember it somewhere)
* Set up a loop device from the dd_rescue'd 2TB file
* Run `btrfs device scan`
* Mount the RAID0 filesystem
* Run the delete command on the loop device; it will not encounter I/O
  errors anymore. (A rough sketch of these steps is below.)

[1] Note that "ddrescue" and "dd_rescue" are two different programs for the
same purpose; one may work better than the other. I don't remember which. :)

-- 
With respect,
Roman
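A rough sketch of those steps (the device names, paths and the GNU ddrescue
syntax shown here are assumptions, to be adapted to the actual setup):

# umount /mnt/raid0
# ddrescue -d /dev/sdd /mnt/6tb/sdd.img /mnt/6tb/sdd.map
  (power down, physically remove the bad drive, boot again)
# losetup -f --show /mnt/6tb/sdd.img     # prints e.g. /dev/loop0
# btrfs device scan
# mount /mnt/raid0
# btrfs device delete /dev/loop0 /mnt/raid0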
Re: Need help with incremental backup strategy (snapshots, defragmentingt & performance)
On Mon, 13 Nov 2017 22:39:44 -0500 Dave wrote:

> I have my live system on one block device and a backup snapshot of it
> on another block device. I am keeping them in sync with hourly rsync
> transfers.
> 
> Here's how this system works in a little more detail:
> 
> 1. I establish the baseline by sending a full snapshot to the backup
> block device using btrfs send-receive.
> 2. Next, on the backup device I immediately create a rw copy of that
> baseline snapshot.
> 3. I delete the source snapshot to keep the live filesystem free of
> all snapshots (so it can be optimally defragmented, etc.)
> 4. hourly, I take a snapshot of the live system, rsync all changes to
> the backup block device, and then delete the source snapshot. This
> hourly process takes less than a minute currently. (My test system has
> only moderate usage.)
> 5. hourly, following the above step, I use snapper to take a snapshot
> of the backup subvolume to create/preserve a history of changes. For
> example, I can find the version of a file 30 hours prior.

Sounds a bit complex; I still don't get why you need all these snapshot
creations and deletions, and why you are even still using btrfs
send-receive. Here is my scheme (a sketch of the commands involved is
below):

/mnt/dst                 <- mounted backup storage volume
/mnt/dst/backup          <- a subvolume
/mnt/dst/backup/host1/   <- rsync destination for host1, regular directory
/mnt/dst/backup/host2/   <- rsync destination for host2, regular directory
/mnt/dst/backup/host3/   <- rsync destination for host3, regular directory
etc.

/mnt/dst/backup/host1/bin/
/mnt/dst/backup/host1/etc/
/mnt/dst/backup/host1/home/
...

Self explanatory. All regular directories, not subvolumes.

Snapshots:

/mnt/dst/snaps/backup                   <- a regular directory
/mnt/dst/snaps/backup/2017-11-14T12:00/ <- snapshot 1 of /mnt/dst/backup
/mnt/dst/snaps/backup/2017-11-14T13:00/ <- snapshot 2 of /mnt/dst/backup
/mnt/dst/snaps/backup/2017-11-14T14:00/ <- snapshot 3 of /mnt/dst/backup

Accessing historic data:

/mnt/dst/snaps/backup/2017-11-14T12:00/host1/bin/bash
... /bin/bash for host1 as of 2017-11-14 12:00 (time on the backup system).

No need for btrfs send-receive, only plain rsync is used, directly from
hostX:/ to /mnt/dst/backup/host1/;

No need to create or delete snapshots during the actual backup process;

A single common timeline is kept for all hosts to be backed up, so the
snapshot count is not multiplied by the number of hosts (in my case the
backup location is multi-purpose, so I somewhat care about the total number
of snapshots there as well);

Also, all of this works even with source hosts which do not use Btrfs.

-- 
With respect,
Roman
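A minimal sketch of the commands behind this scheme (host names, paths and
the timestamp format are illustrative only):

# rsync -aHAX --delete --inplace host1:/ /mnt/dst/backup/host1/
# rsync -aHAX --delete --inplace host2:/ /mnt/dst/backup/host2/
# btrfs subvolume snapshot -r /mnt/dst/backup \
        /mnt/dst/snaps/backup/$(date +%Y-%m-%dT%H:%M)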
Re: Need help with incremental backup strategy (snapshots, defragmentingt & performance)
On Tue, 14 Nov 2017 10:14:55 +0300 Marat Khalili wrote:

> Don't keep snapshots under rsync target, place them under ../snapshots
> (if snapper supports this):
> Or, specify them in --exclude and avoid using --delete-excluded.

Both are good suggestions. In my case each system does have its own
snapshots as well, but they are retained for much shorter. So I both use
--exclude to avoid fetching the entire /snaps tree from the source system,
and store snapshots of the destination system outside of the rsync target
dirs.

> Or keep using -x if it works, why not?

-x will exclude the content of all subvolumes down the tree on the source
side -- not only the time-based ones. If you take care to never casually
create any subvolumes whose content you'd still want backed up, then I guess
it can work.

-- 
With respect,
Roman
Re: [PATCH] btrfs: Move loop termination condition in while()
On Wed, 1 Nov 2017 11:32:18 +0200 Nikolay Borisov wrote:

> Fallocating a file in btrfs goes through several stages. The one before
> actually inserting the fallocated extents is to create a qgroup reservation,
> covering the desired range. To this end there is a loop in btrfs_fallocate
> which checks to see if there are holes in the fallocated range or !PREALLOC
> extents past EOF and if so create qgroup reservations for them.
> Unfortunately, the main condition of the loop is burried right at the end of
> its body rather than in the actual while statement which makes it
> non-obvious. Fix this by moving the condition in the while statement where
> it belongs. No functional changes.

If it turns out that "cur_offset >= alloc_end" from the get-go, previously
the loop body would be entered and executed once. With this change, it will
not be anymore. I did not examine the context to see if such a case is
possible, likely, beneficial or harmful. But if you wanted 100% no
functional changes no matter what, maybe better to use a "do ... while"
loop?

> Signed-off-by: Nikolay Borisov 
> ---
>  fs/btrfs/file.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index e0d15c0d1641..ecbe186cb5da 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -3168,7 +3168,7 @@ static long btrfs_fallocate(struct file *file, int mode,
>  
>  	/* First, check if we exceed the qgroup limit */
>  	INIT_LIST_HEAD(&reserve_list);
> -	while (1) {
> +	while (cur_offset < alloc_end) {
>  		em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, cur_offset,
>  				      alloc_end - cur_offset, 0);
>  		if (IS_ERR(em)) {
> @@ -3204,8 +3204,6 @@ static long btrfs_fallocate(struct file *file, int mode,
>  		}
>  		free_extent_map(em);
>  		cur_offset = last_byte;
> -		if (cur_offset >= alloc_end)
> -			break;
>  	}
>  
>  	/*

-- 
With respect,
Roman
Re: Need help with incremental backup strategy (snapshots, defragmentingt & performance)
On Wed, 1 Nov 2017 01:00:08 -0400 Dave wrote:

> To reconcile those conflicting goals, the only idea I have come up
> with so far is to use btrfs send-receive to perform incremental
> backups as described here:
> https://btrfs.wiki.kernel.org/index.php/Incremental_Backup .

Another option is to just use regular rsync to a designated destination
subvolume on the backup host, AND snapshot that subvolume on that host from
time to time (or on backup completion, if you can synchronize that).

rsync --inplace will keep space usage low, as it will not reupload entire
files in case of changes/additions to them. Yes, rsync has to traverse both
directory trees to find changes, but that's pretty fast (a couple of minutes
at most for a typical root filesystem), especially if you use SSDs or SSD
caching.

-- 
With respect,
Roman
Re: Need some assistance/direction in determining a system hang during heavy IO
On Thu, 26 Oct 2017 09:40:19 -0600 Cheyenne Wills wrote:

> Briefly when I upgraded a system from 4.0.5 kernel to 4.9.5 (and
> later) I'm seeing a blocked task timeout with heavy IO against a
> multi-lun btrfs filesystem. I've tried a 4.12.12 kernel and am still
> getting the hang.

There is now 4.9.58 (fifty three versions later!) and the 4.12 series is
long abandoned and gone from the charts altogether. So just in case, did you
check with the latest kernels?

Also, keep in mind the 120 second warnings are just that, warnings, and not
an error condition by themselves. You can disable them or increase the
timeout via sysctl settings. And it is not clear from your reports whether
you only get warnings and after the load subsides everything is back to
normal, or the FS locks up "for good", i.e. with all access attempts hanging
indefinitely and no way to unmount the FS or otherwise recover.

-- 
With respect,
Roman
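For reference, the sysctl knob in question (the values below are only
examples):

# sysctl kernel.hung_task_timeout_secs          # current threshold, 120 by default
# sysctl -w kernel.hung_task_timeout_secs=300   # only warn after 5 minutes
# sysctl -w kernel.hung_task_timeout_secs=0     # disable the warnings entirely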
Re: Mount failing - unable to find logical
On Wed, 18 Oct 2017 09:24:01 +0800 Qu Wenruo wrote:

> On 2017年10月18日 04:43, Cameron Kelley wrote:
> > Hey btrfs gurus,
> > 
> > I have a 4 disk btrfs filesystem that has suddenly stopped mounting
> > after a recent reboot. The data is in an odd configuration due to
> > originally being in a 3 disk RAID1 before adding a 4th disk and running
> > a balance to convert to RAID10. There wasn't enough free space to
> > completely convert, so about half the data is still in RAID1 while the
> > other half is in RAID10. Both metadata and system are RAID10. It has
> > been in this configuration for 6 months or so now since adding the 4th
> > disk. It just holds archived media and hasn't had any data added or
> > modified in quite some time. I feel pretty stupid now for not correcting
> > that sooner though.
> > 
> > I have tried mounting with different mount options for recovery, ro,
> > degraded, etc. Log shows errors about "unable to find logical
> > 3746892939264 length 4096"
> > 
> > When I do a btrfs check, it doesn't find any issues. Running
> > btrfs-find-root comes up with a message about a block that the
> > generation doesn't match. If I specify that block on the btrfs check, I
> > get transid verify failures.
> > 
> > I ran a dry run of a recovery of the entire filesystem which runs
> > through every file with no errors. I would just restore the data and
> > start fresh, but unfortunately I don't have the free space at the moment
> > for the ~4.5TB of data.
> > 
> > I also ran full smart self tests on all 4 disks with no errors.
> > 
> > root@nas2:~# uname -a
> > Linux nas2 4.13.7-041307-generic #201710141430 SMP Sat Oct 14 14:39:06
> > UTC 2017 i686 i686 i686 GNU/Linux
> 
> I don't think i686 kernel will cause any difference, but considering
> most of us are using x86_64 to develop/test, maybe it will be a good
> idea to upgrade to x86_64 kernel?

Indeed, a problem with mounting on 32-bit in 4.13 has been reported
recently, with the same error message:
https://www.spinics.net/lists/linux-btrfs/msg69734.html

I believe it's this patchset that is supposed to fix that:
https://www.spinics.net/lists/linux-btrfs/msg70001.html

@Cameron, maybe you didn't just reboot, but also upgraded your kernel at the
same time? In any case, try a 4.9 series kernel, or a 64-bit machine if you
want to stay with 4.13.

-- 
With respect,
Roman
Re: Lost about 3TB
On Tue, 3 Oct 2017 10:54:05 + Hugo Mills wrote:

> There are other possibilities for missing space, but let's cover
> the obvious ones first.

One more obvious thing would be files that are deleted, but still kept open
by some app (possibly even from the network, via NFS or SMB!).

@Frederic, did you try rebooting the system?

-- 
With respect,
Roman
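A quick way to check for such files, assuming lsof is installed:

# lsof +L1                     # open files with a link count of 0, i.e. deleted
# lsof -nP | grep '(deleted)'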
Re: Give up on bcache?
On Tue, 26 Sep 2017 16:50:00 + (UTC) Ferry Toth wrote:

> https://www.phoronix.com/scan.php?page=article&item=linux414-bcache-raid&num=2
> 
> I think it might be idle hopes to think bcache can be used as a ssd cache
> for btrfs to significantly improve performance..

My personal real-world experience shows that SSD caching -- with lvmcache --
does indeed significantly improve performance of a large Btrfs filesystem
with slowish base storage.

And that article, sadly, only demonstrates once again the generally mediocre
quality of Phoronix content: it is an astonishing oversight to not check out
lvmcache in the same setup, to at least try to draw some useful conclusion:
is it Bcache that is strangely deficient, or does SSD caching as a general
concept not work well in the hardware setup utilized?

-- 
With respect,
Roman
Re: qemu-kvm VM died during partial raid1 problems of btrfs
On Tue, 12 Sep 2017 12:32:14 +0200 Adam Borowski wrote:

> discard in the guest (not supported over ide and virtio, supported over scsi
> and virtio-scsi)

IDE does support discard in QEMU, I use that all the time. It got broken
briefly in QEMU 2.1 [1], but then fixed again.

[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=757927

-- 
With respect,
Roman
Re: mount time for big filesystems
On Thu, 31 Aug 2017 07:45:55 -0400 "Austin S. Hemmelgarn" wrote:

> If you use dm-cache (what LVM uses), you need to be _VERY_ careful and
> can't use it safely at all with multi-device volumes because it leaves
> the underlying block device exposed.

It locks the underlying device so it can't be seen by Btrfs and cause
problems.

# btrfs dev scan
Scanning for Btrfs filesystems

# btrfs fi show
Label: none  uuid: 62ff7619-8202-47f6-8c7e-cef6f082530e
	Total devices 1 FS bytes used 112.00KiB
	devid    1 size 16.00GiB used 2.02GiB path /dev/mapper/vg-OriginLV

# ls -la /dev/mapper/
total 0
drwxr-xr-x  2 root root     140 Aug 31 12:01 .
drwxr-xr-x 16 root root    2980 Aug 31 12:01 ..
crw------- 1 root root 10, 236 Aug 31 11:59 control
lrwxrwxrwx  1 root root       7 Aug 31 12:01 vg-CacheDataLV_cdata -> ../dm-1
lrwxrwxrwx  1 root root       7 Aug 31 12:01 vg-CacheDataLV_cmeta -> ../dm-2
lrwxrwxrwx  1 root root       7 Aug 31 12:06 vg-OriginLV -> ../dm-0
lrwxrwxrwx  1 root root       7 Aug 31 12:01 vg-OriginLV_corig -> ../dm-3

# btrfs dev scan /dev/dm-0
Scanning for Btrfs filesystems in '/dev/mapper/vg-OriginLV'

# btrfs dev scan /dev/dm-3
Scanning for Btrfs filesystems in '/dev/mapper/vg-OriginLV_corig'
ERROR: device scan failed on '/dev/mapper/vg-OriginLV_corig': Device or resource busy

-- 
With respect,
Roman
Re: mount time for big filesystems
On Thu, 31 Aug 2017 12:43:19 +0200 Marco Lorenzo Crociani wrote:

> Hi,
> this 37T filesystem took some times to mount. It has 47
> subvolumes/snapshots and is mounted with
> noatime,compress=zlib,space_cache. Is it normal, due to its size?

If you could implement SSD caching in front of your FS (such as lvmcache or
bcache), that would work wonders for performance in general, and especially
for mount times. I have seen amazing results with an lvmcache of just 32 GB
for a 14 TB FS.

As for the mount options, with your FS size perhaps you should be using
"space_cache=v2" for better performance, but I'm not sure if that will have
any effect on mount time (aside from slowing down the first mount with it
enabled).

-- 
With respect,
Roman
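For reference, a sketch of switching to the v2 space cache (device and
mountpoint are placeholders; the free-space tree is built on the first mount
with this option, which is the one-time slowdown mentioned above):

# umount /mnt/big
# mount -o noatime,compress=zlib,space_cache=v2 /dev/sdX /mnt/big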
Re: deleted subvols don't go away?
On Mon, 28 Aug 2017 15:03:47 +0300 Nikolay Borisov wrote:

> when the cleaner thread runs again the snapshot's root item is going to
> be deleted for good and you no longer will see it.

Oh, that's pretty sweet -- it means there's actually a way to reliably wait
for the cleaner's work to be done on all deleted snapshots before unmounting
the FS. I was wondering about that recently for some transient filesystems
(which get mounted, synced to, snapshot-created/removed, then unmounted).
Now I can just loop with a few-second sleeps until
`btrfs sub list -d $PATH` comes up empty.

-- 
With respect,
Roman
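A sketch of such a wait loop (the mountpoint is a placeholder):

while [ -n "$(btrfs subvolume list -d /mnt/transient)" ]; do
        sleep 5
done
umount /mnt/transient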
Re: netapp-alike snapshots?
On Tue, 22 Aug 2017 18:57:25 +0200 Ulli Horlacher <frams...@rus.uni-stuttgart.de> wrote:

> On Tue 2017-08-22 (21:45), Roman Mamedov wrote:
> 
> > It is beneficial to not have snapshots in-place. With a local directory of
> > snapshots, issuing things like "find", "grep -r" or even "du" will take an
> > inordinate amount of time and will produce a result you do not expect.
> 
> Netapp snapshots are invisible for tools doing opendir()/readdir()
> One could simulate this with symlinks for the snapshot directory:
> store the snapshot elsewhere (not inplace) and create a symlink to it, in
> every directory.
> 
> > Personally I prefer to have a /snapshots directory on every FS
> 
> My users want the snapshots locally in a .snapshot subdirectory.
> Because Netapp do it this way - for at least 20 years and we have a
> multi-PB Netapp storage environment.
> No chance to change this.

Just a side note: you do know that only subvolumes can be snapshotted on
Btrfs, not any regular directory? And that snapshots are not recursive, i.e.
if a subvolume "contains" other subvolumes (hint: it really doesn't),
snapshots of the parent one will not include the content of subvolumes below
it in the tree.

I don't know how Netapp does this, but from the way you describe that setup
it feels like with Btrfs you're still in for some bad surprises, and a part
of your expectations will not be met. Do you plan to make each and every
directory and subdirectory a subvolume (so that it could have a trail of its
own snapshots)? There will be performance implications to that.

Also, deleting subvolumes can only be done via the "btrfs" tool; they won't
delete like normal dirs, e.g. when trying to do that remotely via an NFS or
Samba share.

-- 
With respect,
Roman
Re: finding root filesystem of a subvolume?
On Tue, 22 Aug 2017 17:45:37 +0200 Ulli Horlacher wrote:

> In perl I have now:
> 
> $root = $volume;
> while (`btrfs subvolume show "$root" 2>/dev/null` !~ /toplevel subvolume/) {
>   $root = dirname($root);
>   last if $root eq '/';
> }

If you are okay with rolling your own solutions like this, take a look at
"btrfs filesystem usage <path>". It will print the block device used for
mounting the base FS. From that you can find the mountpoint via
/proc/mounts. Performance-wise it seems to work instantly on an almost full
2TB FS.

-- 
With respect,
Roman
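A rough shell equivalent of that idea (an untested sketch; it assumes the
device path can be picked out of the "btrfs filesystem usage" output and
matched against /proc/mounts):

dev=$(btrfs filesystem usage "$volume" 2>/dev/null | grep -o '/dev/[^ ]*' | head -n1)
grep "^$dev " /proc/mounts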
Re: netapp-alike snapshots?
On Tue, 22 Aug 2017 16:24:51 +0200 Ulli Horlacher wrote:

> On Tue 2017-08-22 (15:44), Peter Becker wrote:
> > Is use: https://github.com/jf647/btrfs-snap
> > 
> > 2017-08-22 15:22 GMT+02:00 Ulli Horlacher :
> > > With Netapp/waffle you have automatic hourly/daily/weekly snapshots.
> > > You can find these snapshots in every local directory (readonly).
> > > Example:
> > > 
> > > framstag@fex:/sw/share: ll .snapshot/
> > > drwxr-xr-x framstag root - 2017-08-14 10:21:47 .snapshot/daily.2017-08-15_0010
> > > drwxr-xr-x framstag root - 2017-08-14 10:21:47 .snapshot/daily.2017-08-16_0010
> > > drwxr-xr-x framstag root - 2017-08-14 10:21:47 .snapshot/daily.2017-08-17_0010
> > > drwxr-xr-x framstag root - 2017-08-14 10:21:47 .snapshot/daily.2017-08-18_0010
> > > drwxr-xr-x framstag root - 2017-08-18 23:59:29 .snapshot/daily.2017-08-19_0010
> > > drwxr-xr-x framstag root - 2017-08-19 21:01:25 .snapshot/daily.2017-08-20_0010
> > > drwxr-xr-x framstag root - 2017-08-20 19:48:40 .snapshot/daily.2017-08-21_0010
> > > drwxr-xr-x framstag root - 2017-08-20 02:50:18 .snapshot/hourly.2017-08-20_1210
> > > drwxr-xr-x framstag root - 2017-08-20 02:50:18 .snapshot/hourly.2017-08-20_1610
> > > drwxr-xr-x framstag root - 2017-08-20 19:48:40 .snapshot/hourly.2017-08-20_2010
> > > drwxr-xr-x framstag root - 2017-08-21 00:42:28 .snapshot/hourly.2017-08-21_0810
> > > drwxr-xr-x framstag root - 2017-08-21 00:42:28 .snapshot/hourly.2017-08-21_1210
> > > drwxr-xr-x framstag root - 2017-08-21 13:05:28 .snapshot/hourly.2017-08-21_1610
> 
> btrfs-snap does not create local .snapshot/ sub-directories, but saves the
> snapshots in the toplevel root volume directory.

It is beneficial to not have snapshots in-place. With a local directory of
snapshots, issuing things like "find", "grep -r" or even "du" will take an
inordinate amount of time and will produce a result you do not expect. For
some of those tools the problem can be avoided (by always keeping in mind to
use "-x" with du, or "--one-file-system" with tar), but not for all of them.

Personally I prefer to have a /snapshots directory on every FS, and e.g.
timed snapshots of /home/username/src will live in
/snapshots/home-username-src/. No point in hiding it there with a dot
either, as it's convenient to be able to browse older snapshots with GUI
filemanagers (which hide dot-files by default).

-- 
With respect,
Roman
Re: slow btrfs with a single kworker process using 100% CPU
On Wed, 16 Aug 2017 12:48:42 +0100 (BST) "Konstantin V. Gavrilenko" wrote:

> I believe the chunk size of 512kb is even worth for performance then the
> default settings on my HW RAID of 256kb.

It might be, but that does not explain the original problem reported at all.
If mdraid performance were the bottleneck, you would see high iowait, and
possibly some CPU load from the mdX_raidY threads -- but not a single Btrfs
thread pegged at 100% CPU.

> So now I am moving the data from the array and will be rebuilding it with 64
> or 32 chunk size and checking the performance.

64K is the sweet spot for RAID5/6:
http://louwrentius.com/linux-raid-level-and-chunk-size-the-benchmarks.html

-- 
With respect,
Roman
Re: csum errors on top of dm-crypt
On Fri, 4 Aug 2017 12:44:44 +0500 Roman Mamedov <r...@romanrm.net> wrote:

> > What is 0x98f94189, is it not a csum of a block of zeroes by any chance?
> 
> It does seem to be something of that sort

Actually, I think I know what happened.

I used "dd bs=1M conv=sparse" to copy the source FS onto a LUKS device,
which skipped copying 1M-sized areas of zeroes from the source device by
seeking over those areas on the destination device. This only works OK if
the destination device is entirely zeroed beforehand.

But I also use --allow-discards for the LUKS device; so it may be that after
a discard passthrough to the underlying SSD, which will then return zeroes
for discarded areas, LUKS will not take care to pass zeroes back "upwards"
when reading from those areas; instead it may attempt to decrypt them with
its crypto process, making them read back to userspace as random data. So
after an initial TRIM the destination crypto device was not actually zeroed,
far from it. :)

As a result, every large non-sparse file with at least a 1MB-long run of
zeroes in it (those sqlite ones appear to fit the bill) was not written out
entirely onto the destination device by dd, and the intended zero areas were
left full of crypto-randomness instead.

Sorry for the noise, I hope at least this catch was somewhat entertaining.
And Btrfs saves the day once again. :)

-- 
With respect,
Roman
SQLite Re: csum errors on top of dm-crypt
On Fri, 4 Aug 2017 12:18:58 +0500 Roman Mamedov <r...@romanrm.net> wrote:

> What I find weird is why the expected csum is the same on all of these.
> Any idea what this might point to as the cause?
> 
> What is 0x98f94189, is it not a csum of a block of zeroes by any chance?

It does seem to be something of that sort, as it appears in
https://www.spinics.net/lists/linux-btrfs/msg67281.html (though as the
factual csum, not the expected one).

> a few files turned out to be unreadable

Actually, it turns out ALL of those are sqlite files(!):

.mozilla/firefox/.../places.sqlite  <- 4 instances (for 4 users)
.moonchild productions/pale moon/.../urlclassifier3.sqlite
.config/chromium/Default/Application Cache/Cache/data_3  <- twice (for 2 users)
.config/chromium/Default/History
.config/chromium/Default/Top Sites

Nothing else is affected.

Forgot to mention that the kernel version is 4.9.40.

-- 
With respect,
Roman
csum errors on top of dm-crypt
Hello,

I've migrated my home dir to a LUKS dm-crypt device some time ago, and today
during a scheduled backup a few files turned out to be unreadable, with csum
errors from Btrfs in dmesg.

What I find weird is why the expected csum is the same on all of these.
Any idea what this might point to as the cause?

What is 0x98f94189, is it not a csum of a block of zeroes by any chance?

(I use a patch from Qu Wenruo to improve the error reporting).

[483575.992252] BTRFS warning (device dm-1): csum failed root 5 ino 481 off 32 csum 0xe2a2e6eb expected csum 0x98f94189 mirror 1
[483575.994518] BTRFS warning (device dm-1): csum failed root 5 ino 481 off 32 csum 0xe2a2e6eb expected csum 0x98f94189 mirror 1
[483575.995640] BTRFS warning (device dm-1): csum failed root 5 ino 481 off 2785280 csum 0x7f97f4a6 expected csum 0x98f94189 mirror 1
[483575.996599] BTRFS warning (device dm-1): csum failed root 5 ino 481 off 1736704 csum 0x7476ddf8 expected csum 0x98f94189 mirror 1
[483585.020047] BTRFS warning (device dm-1): csum failed root 5 ino 6172 off 1011712 csum 0xbadf2d3e expected csum 0x98f94189 mirror 1
[483585.023036] BTRFS warning (device dm-1): csum failed root 5 ino 6172 off 1011712 csum 0xbadf2d3e expected csum 0x98f94189 mirror 1
[483585.023702] BTRFS warning (device dm-1): csum failed root 5 ino 6172 off 1900544 csum 0x26c571dc expected csum 0x98f94189 mirror 1
[483585.023761] BTRFS warning (device dm-1): csum failed root 5 ino 6172 off 2949120 csum 0x27726fbe expected csum 0x98f94189 mirror 1
[483599.026289] BTRFS warning (device dm-1): csum failed root 5 ino 14988 off 17645568 csum 0xdd5bf4de expected csum 0x98f94189 mirror 1
[483599.027425] BTRFS warning (device dm-1): csum failed root 5 ino 14988 off 17465344 csum 0x42bf4f44 expected csum 0x98f94189 mirror 1
[483599.032396] BTRFS warning (device dm-1): csum failed root 5 ino 14988 off 17465344 csum 0x42bf4f44 expected csum 0x98f94189 mirror 1
[483599.092709] BTRFS warning (device dm-1): csum failed root 5 ino 15002 off 1110016 csum 0xbca8fc65 expected csum 0x98f94189 mirror 1
[483599.093080] BTRFS warning (device dm-1): csum failed root 5 ino 15002 off 1110016 csum 0xbca8fc65 expected csum 0x98f94189 mirror 1
[483599.093242] BTRFS warning (device dm-1): csum failed root 5 ino 15002 off 1736704 csum 0x1d4087fc expected csum 0x98f94189 mirror 1
[483627.708625] BTRFS warning (device dm-1): csum failed root 5 ino 29039 off 2613248 csum 0xe1952338 expected csum 0x98f94189 mirror 1
[483627.709459] BTRFS warning (device dm-1): csum failed root 5 ino 29039 off 2613248 csum 0xe1952338 expected csum 0x98f94189 mirror 1
[483627.709799] BTRFS warning (device dm-1): csum failed root 5 ino 29039 off 2965504 csum 0xfaff212d expected csum 0x98f94189 mirror 1
[483634.462684] BTRFS warning (device dm-1): csum failed root 5 ino 30602 off 5062656 csum 0x8c7df392 expected csum 0x98f94189 mirror 1
[483634.462703] BTRFS warning (device dm-1): csum failed root 5 ino 30602 off 4108288 csum 0x6005cecd expected csum 0x98f94189 mirror 1
[483634.466602] BTRFS warning (device dm-1): csum failed root 5 ino 30602 off 7159808 csum 0xfc06d954 expected csum 0x98f94189 mirror 1
[483634.466604] BTRFS warning (device dm-1): csum failed root 5 ino 30602 off 6111232 csum 0xc802b3b4 expected csum 0x98f94189 mirror 1
[483634.470118] BTRFS warning (device dm-1): csum failed root 5 ino 30602 off 4108288 csum 0x6005cecd expected csum 0x98f94189 mirror 1
[483634.470257] BTRFS warning (device dm-1): csum failed root 5 ino 30602 off 10305536 csum 0x3d8c1843 expected csum 0x98f94189 mirror 1
[483634.471085] BTRFS warning (device dm-1): csum failed root 5 ino 30602 off 9256960 csum 0xba3fede3 expected csum 0x98f94189 mirror 1
[483634.471128] BTRFS warning (device dm-1): csum failed root 5 ino 30602 off 8302592 csum 0x7de15198 expected csum 0x98f94189 mirror 1
[484152.178497] BTRFS warning (device dm-1): csum failed root 5 ino 142886 off 1163264 csum 0x341f3c2a expected csum 0x98f94189 mirror 1
[484152.180422] BTRFS warning (device dm-1): csum failed root 5 ino 142886 off 1736704 csum 0xf01ac658 expected csum 0x98f94189 mirror 1
[484152.181598] BTRFS warning (device dm-1): csum failed root 5 ino 142886 off 1163264 csum 0x341f3c2a expected csum 0x98f94189 mirror 1
[484152.182242] BTRFS warning (device dm-1): csum failed root 5 ino 142886 off 2785280 csum 0xc78988ec expected csum 0x98f94189 mirror 1
[484158.569489] BTRFS warning (device dm-1): csum failed root 5 ino 143605 off 2138112 csum 0xab34e90e expected csum 0x98f94189 mirror 1
[484158.571885] BTRFS warning (device dm-1): csum failed root 5 ino 143605 off 2785280 csum 0xd611911e expected csum 0x98f94189 mirror 1
[484158.575191] BTRFS warning (device dm-1): csum failed root 5 ino 143605 off 3833856 csum 0x6277c8a6 expected csum 0x98f94189 mirror 1
[484158.575620] BTRFS warning (device dm-1): csum failed root 5 ino 143605 off 4882432 csum 0x3293c3e7 expected csum 0x98f94189 mirror 1
[484158.578637] BTRFS warning
Re: Crashed filesystem, nothing helps
On Wed, 02 Aug 2017 11:17:04 +0200 Thomas Wurfbaumwrote: > A restore does also not help: > mainframe:~ # btrfs restore /dev/sdb1 /mnt > parent transid verify failed on 29392896 wanted 1486833 found 1486836 > parent transid verify failed on 29392896 wanted 1486833 found 1486836 > parent transid verify failed on 29392896 wanted 1486833 found 1486836 > parent transid verify failed on 29392896 wanted 1486833 found 1486836 > Ignoring transid failure > parent transid verify failed on 29409280 wanted 1486829 found 1486833 > parent transid verify failed on 29409280 wanted 1486829 found 1486833 > parent transid verify failed on 29409280 wanted 1486829 found 1486833 > parent transid verify failed on 29409280 wanted 1486829 found 1486833 > Ignoring transid failure > parent transid verify failed on 29376512 wanted 1327723 found 1486833 > parent transid verify failed on 29376512 wanted 1327723 found 1486833 > parent transid verify failed on 29376512 wanted 1327723 found 1486833 > parent transid verify failed on 29376512 wanted 1327723 found 1486833 > Ignoring transid failure Did it just abruptly exit there? Or you terminated it? IIRC these messages (about ignoring) are not a problem for restore, it should be able to continue. Or if not, it would print a more definitive error message, e.g. "Couldn't read tree root" or such. -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
On Tue, 1 Aug 2017 10:14:23 -0600 Liu Bowrote: > This aims to fix write hole issue on btrfs raid5/6 setup by adding a > separate disk as a journal (aka raid5/6 log), so that after unclean > shutdown we can make sure data and parity are consistent on the raid > array by replaying the journal. Could it be possible to designate areas on the in-array devices to be used as journal? While md doesn't have much spare room in its metadata for extraneous things like this, Btrfs could use almost as much as it wants to, adding to size of the FS metadata areas. Reliability-wise, the log could be stored as RAID1 chunks. It doesn't seem convenient to need having an additional storage device around just for the log, and also needing to maintain its fault tolerance yourself (so the log device would better be on a mirror, such as mdadm RAID1? more expense and maintenance complexity). -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BTRFS error: bad tree block start 0 623771648
On Sun, 30 Jul 2017 18:14:35 +0200 "marcel.cochem"wrote: > I am pretty sure that not all data is lost as i can grep thorugh the > 100 GB SSD partition. But my question is, if there is a tool to rescue > all (intact) data and maybe have only a few corrupt files which can't > be recovered. There is such a tool, see https://btrfs.wiki.kernel.org/index.php/Restore -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
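For reference, a typical invocation of that tool looks roughly like this (device and destination are examples; the destination must be a separate, healthy filesystem with enough free space):

  # btrfs restore -v /dev/sdXN /mnt/rescue
  # btrfs restore -v -i /dev/sdXN /mnt/rescue    # -i: ignore errors and keep going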
Re: BTRFS error: bad tree block start 0 623771648
On Mon, 31 Jul 2017 11:12:01 -0700 Liu Bo wrote: > Superblock and chunk tree root is OK, looks like the header part of > the tree root is now all-zero, but I'm unable to think of a btrfs bug > which can lead to that (if there is, it is a serious enough one) I see that the FS is being mounted with "discard". So maybe it was a TRIM gone bad (wrong location or in a wrong sequence). By now it is generally not recommended to use "discard" (because of its performance impact, and possibly issues like this one); instead schedule a call to "fstrim <mountpoint>" once a day or so, and/or on boot-up. > on ssd like disks, by default there is only one copy for metadata. Time and time again, the default of "single" metadata for SSD turns out to be a terrible idea. Most likely DUP metadata would have saved the FS in this case. -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
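A minimal sketch of the scheduled approach, assuming a cron.daily-style setup (script name and mount points are examples; newer util-linux also offers "fstrim --all"):

  #!/bin/sh
  # /etc/cron.daily/fstrim -- trim mounted filesystems once a day
  fstrim /
  fstrim /home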
Re: Btrfs + compression = slow performance and high cpu usage
On Fri, 28 Jul 2017 17:40:50 +0100 (BST) "Konstantin V. Gavrilenko"wrote: > Hello list, > > I am stuck with a problem of btrfs slow performance when using compression. > > when the compress-force=lzo mount flag is enabled, the performance drops to > 30-40 mb/s and one of the btrfs processes utilises 100% cpu time. > mount options: btrfs > relatime,discard,autodefrag,compress=lzo,compress-force,space_cache=v2,commit=10 It does not work like that, you need to set compress-force=lzo (and remove compress=). With your setup I believe you currently use compress-force[=zlib](default), overriding compress=lzo, since it's later in the options order. Secondly, > autodefrag This sure sounded like a good thing to enable? on paper? right?... The moment you see anything remotely weird about btrfs, this is the first thing you have to disable and retest without. Oh wait, the first would be qgroups, this one is second. Finally, what is the reasoning behind "commit=10", and did you check with the default value of 30? -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
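If I read the intent right, the corrected mount options would look something like the line below (keeping discard and space_cache=v2 from the original set is just an assumption on my part; autodefrag and commit=10 dropped per the above):

  relatime,discard,compress-force=lzo,space_cache=v2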
Re: Best Practice: Add new device to RAID1 pool
On Mon, 24 Jul 2017 09:46:34 -0400 "Austin S. Hemmelgarn"wrote: > > I am a little bit confused because the balance command is running since > > 12 hours and only 3GB of data are touched. This would mean the whole > > balance process (new disc has 8TB) would run a long, long time... and > > is using one cpu by 100%. > > Based on what you're saying, it sounds like you've either run into a > bug, or have a huge number of snapshots ...and possibly quotas (qgroups) enabled. (perhaps automatically by some tool, and not by you). Try: btrfs quota disable With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
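Something along these lines should confirm or rule out both suspects (mount point is an example):

  # btrfs subvolume list /mnt | wc -l     # rough count of subvolumes/snapshots
  # btrfs qgroup show /mnt                # errors out if quotas were never enabled
  # btrfs quota disable /mnt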
Re: [PATCH v3] Btrfs: add skeleton code for compression heuristic
On Fri, 21 Jul 2017 13:00:56 +0800 Anand Jainwrote: > > > On 07/18/2017 02:30 AM, David Sterba wrote: > > So it basically looks good, I could not resist and rewrote the changelog > > and comments. There's one code fix: > > > > On Mon, Jul 17, 2017 at 04:52:58PM +0300, Timofey Titovets wrote: > >> -static inline int inode_need_compress(struct inode *inode) > >> +static inline int inode_need_compress(struct inode *inode, u64 start, u64 > >> end) > >> { > >>struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); > >> > >>/* force compress */ > >>if (btrfs_test_opt(fs_info, FORCE_COMPRESS)) > >> - return 1; > >> + return btrfs_compress_heuristic(inode, start, end); > > > > > This must stay 'return 1', if force-compress is on, so the change is > > reverted. > > Initially I thought 'return 1' is correct, but looking in depth, > it is not correct as below.. > > The biggest beneficiary of the estimating the compression ratio > in advance (heuristic) is when customers are using the > -o compress-force. But 'return 1' here is making them not to > use heuristic. So definitely something is wrong. man mount says for btrfs: If compress-force is specified, all files will be compressed, whether or not they compress well. So compress-force by definition should always compress all files no matter what, and not use any heuristic. In fact it has no right to, as user forced compression to always on. Returning 1 up there does seem right to me. > -o compress is about the whether each of the compression-granular bytes > (BTRFS_MAX_UNCOMPRESSED) of the inode should be tried to compress OR > just give up for the whole inode by looking at the compression ratio > of the current compression-granular. > This approach can be overridden by -o compress-force. So in > -o compress-force there will be a lot more efforts in _trying_ > to compression than in -o compress. We must use heuristic for > -o compress-force. Semantic and the user expectation of compress-force dictates to always compress without giving up, even if it turns out to be slower and not providing much benefit. -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: "detect-zeroes=unmap" support in Btrfs?
On Tue, 18 Jul 2017 16:57:10 +0500 Roman Mamedov <r...@romanrm.net> wrote: > if a block written consists of zeroes entirely, instead of writing zeroes to > the backing storage, converts that into an "unmap" operation > (FALLOC_FL_PUNCH_HOLE[1]). BTW I found that it is very easy to "offline" process preexisting files for this, using "fallocate -d". -d, --dig-holes Detect and dig holes. Makes the file sparse in-place, without using extra disk space. The minimal size of the hole depends on filesystem I/O block size (usually 4096 bytes). Also, when using this option, --keep-size is implied. If no range is specified by --offset and --length, then all file is analyzed for holes. You can think of this as doing a "cp --sparse" and renaming the dest file as the original, without the need for extra disk space. So my suggestion is to implement an "online" counterpart to such forced-sparsifying, i.e. the same thing done on FS I/O in-band. (the analogy is with offline vs in-band dedup). -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
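As a quick illustration (file name is made up), the allocated size reported by du shrinks after digging holes while the apparent size stays the same:

  $ du -h --apparent-size vm-disk.img; du -h vm-disk.img
  $ fallocate -d vm-disk.img
  $ du -h --apparent-size vm-disk.img; du -h vm-disk.img    # allocated size is now smaller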
"detect-zeroes=unmap" support in Btrfs?
Hello, Qemu/KVM has this nice feature in its storage layer, "detect-zeroes=unmap". Basically the VM host detects if a block written by the guest consists of zeroes entirely, and instead of writing zeroes to the backing storage, converts that into an "unmap" operation (FALLOC_FL_PUNCH_HOLE[1]). I wonder if the same could be added into Btrfs directly? With a CoW filesystem there is really no reason to store long runs of zeroes, or even spend compression cycles on them (even if they compress really well), it would be more efficient to turn all zero-filled blocks into file holes. (In effect forcing all files with zero blocks into always being "sparse" files.) You could say this increases fragmentation, but given the CoW nature of Btrfs, any write to a file increases fragmentation already (except with "nocow"), and converting zeroes into holes would be beneficial due to not requiring any actual IO when those need to be read (reading back zeroes which are not stored anywhere, as opposed to reading actual zeroes from disk). [1] http://man7.org/linux/man-pages/man2/fallocate.2.html -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
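For comparison, on the Qemu side this is enabled per drive, roughly like so (image path is an example; detect-zeroes=unmap also needs discard=unmap):

  -drive file=/var/lib/libvirt/images/guest.raw,format=raw,cache=none,discard=unmap,detect-zeroes=unmap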
Re: Chunk root problem
On Wed, 5 Jul 2017 22:10:35 -0600 Daniel Brady wrote: > parent transid verify failed Typically in Btrfs terms this means "you're screwed", fsck will not fix it, and nobody will know how to fix it or what caused it either. Time to restore from backups! Or look into "btrfs restore" if you don't have any. In your case it's especially puzzling as the difference in transid numbers is really significant (about 100K), almost like the FS was operating for months without updating some parts of itself -- and no checksum errors either, so all looks correct, except that everything is horribly wrong. This kind of error seems to occur more often in RAID setups, either Btrfs native RAID, or Btrfs on top of other RAID layers -- i.e. where ensuring that writes to multiple devices all complete in the right order across an unclean shutdown becomes a complex problem (it is much simpler on a single-device FS). Also, one of your disks or cables is failing (it was /dev/sde on that boot, but may get a different index on the next boot), check its SMART data and replace it. > [ 21.230919] BTRFS info (device sdf): bdev /dev/sde errs: wr 402545, rd > 234683174, flush 194501, corrupt 0, gen 0 -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
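To check the suspect drive (device name is whatever it got on the current boot), something like:

  # smartctl -a /dev/sde | grep -iE 'reallocat|pending|uncorrect|crc'

A growing UDMA_CRC_Error_Count usually points at the cable or connector rather than at the disk itself.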
Re: About free space fragmentation, metadata write amplification and (no)ssd
On Thu, 8 Jun 2017 19:57:10 +0200 Hans van Kranenburgwrote: > There is an improvement with subvolume delete + nossd that is visible > between 4.7 and 4.9. I don't remember if I asked before, but did you test on 4.4? The two latest longterm series are 4.9 and 4.4. 4.7 should be abandoned and forgotten by now really, certainly not used daily in production, it's not even listed on kernel.org anymore. Also it's possible the 4.7 branch that you test did not receive all the bugfix backports from mainline like the longterm series do. > I have no idea what change between 4.7 and 4.9 is responsible for this, but > it's good. FWIW, this appears to be the big Btrfs change between 4.7 and 4.9 (in 4.8): Btrfs: introduce ticketed enospc infrastructure https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=957780eb2788d8c218d539e19a85653f51a96dc1 -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: getting rid of "csum failed" on a hw raid
On Wed, 7 Jun 2017 15:09:02 +0200 Adam Borowski wrote: > On Wed, Jun 07, 2017 at 01:10:26PM +0300, Timofey Titovets wrote: > > 2017-06-07 13:05 GMT+03:00 Stefan G. Weichinger : > > > Am 2017-06-07 um 11:37 schrieb Timofey Titovets: > > > > > >> btrfs scrub start /mnt_path do this trick > > >> > > >> After, you can find info with paths in dmesg > > > > > > thank you, I think I have the file, it's a qemu-img-file. > > > I try cp-ing it to another fs first, but assume this will fail, right? > > > > Yes, because btrfs will return -EIO > > So try dd_rescue > > Or even plain dd conv=noerror. Both will do a faithful analogue of a > physical disk with a silent data corruption on the affected sectors. Yeah, except plain "dd conv=noerror" will produce a useless corrupted image: blocks that fail to read are skipped rather than padded, so everything after the first error gets shifted by the number of unreadable bytes and ends up at the wrong offsets. You also need the "sync" flag in there. https://superuser.com/questions/622541/what-does-dd-conv-sync-noerror-do http://www.debianadmin.com/recover-data-from-a-dead-hard-drive-using-dd.html https://wiki.archlinux.org/index.php/disk_cloning Or just stick with dd_rescue and not try to correct people's perfectly good suggestions with completely wrong and harmful ones. -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
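For the record, an imaging run that keeps offsets intact could look like this (paths are examples; the second line is GNU ddrescue, which is resumable via its map file):

  # dd if=/dev/sdb of=/mnt/backup/sdb.img bs=4096 conv=noerror,sync
  # ddrescue -d /dev/sdb /mnt/backup/sdb.img /mnt/backup/sdb.map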
Re: [PATCH v2 2/2] Btrfs: compression must free at least PAGE_SIZE
On Sun, 21 May 2017 19:54:05 +0300 Timofey Titovetswrote: > Sorry, but i know about subpagesize-blocksize patch set, but i don't > understand where you see conflict? > > Can you explain what you mean? > > By PAGE_SIZE i mean fs cluster size in my patch set. This appears to be exactly the conflict. Subpagesize blocksize patchset would make it possible to use e.g. Btrfs with 4K block (cluster) size on a MIPS machine with 64K-sized pages. Would your checking for PAGE_SIZE still be correct then? > So if and when subpage patch set would merged, PAGE_SIZE should be > replaced with sector size, and all continue work correctly. I guess Duncan's question was why not compare against block size from the get go, rather than create more places for Chandan to scour through to eliminate all "blocksize = pagesize" assumptions... -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RAID 6 corrupted
On Fri, 19 May 2017 11:55:27 +0300 Pasi Kärkkäinenwrote: > > > Try saving your data with "btrfs restore" first > > > > First post, he tried that. No luck. Tho that was with 4.4 userspace. > > It might be worth trying with the 4.11-rc or soon to be released 4.11 > > userspace, tho... > > > > Try with 4.12-rc, I assume :) No, actually I missed that this was already tried, and a newer kernel will not help at "btrfs restore", AFAIU it works entirely in userspace, not kernel. Newer btrfs-progs could be something to try though, as the version used seems pretty old -- btrfs-progs v4.4.1, while the current one is v4.11. -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RAID 6 corrupted
On Thu, 18 May 2017 04:09:38 +0200 Łukasz Wróblewskiwrote: > I will try when stable 4.12 comes out. > Unfortunately I do not have a backup. > Fortunately, these data are not so critical. > Some private photos and videos of youth. > However, I would be very happy if I could get it back. Try saving your data with "btrfs restore" first (i.e. you can do it right now, as it doesn't depend on kernel versions), after you have your data recovered and reliably backed up, then you can proceed with experiments on new kernel, patches and whatnot. -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs/SSD
On Fri, 12 May 2017 20:36:44 +0200 Kai Krakowwrote: > My concern is with fail scenarios of some SSDs which die unexpected and > horribly. I found some reports of older Samsung SSDs which failed > suddenly and unexpected, and in a way that the drive completely died: > No more data access, everything gone. HDDs start with bad sectors and > there's a good chance I can recover most of the data except a few > sectors. Just have your backups up-to-date, doesn't matter if it's SSD, HDD or any sort of RAID. In a way it's even better, that SSDs [are said to] fail abruptly and entirely. You can then just restore from backups and go on. Whereas a failing HDD can leave you puzzled on e.g. whether it's a cable or controller problem instead, and possibly can even cause some data corruption which you won't notice until too late. -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Backing up BTRFS metadata
On Thu, 11 May 2017 09:19:28 -0600 Chris Murphywrote: > On Thu, May 11, 2017 at 8:56 AM, Marat Khalili wrote: > > Sorry if question sounds unorthodox, Is there some simple way to read (and > > backup) all BTRFS metadata from volume? > > btrfs-image Hm, I thought that's for debugging only, and that you can't actually restore metadata onto a data-containing FS and have anything mountable/readable as a result. Seems not to be the case, and in fact, could this be one of the "missing links" in the Fsck story, -w Walk all the trees manually and copy any blocks that are referenced. Use this option if your extent tree is corrupted to make sure that all of the metadata is captured. This certainly does sound like something to try for some of those broken filesystems where Btrfsck refuses to do anything. Save image with this manual walking/reconstruction of the trees, then restore. Too bad I already nuked mine, so can't experiment with that. -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
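For anyone wanting to try, the round trip is roughly as follows (paths are examples; restoring overwrites the target's metadata, so experiment only on a copy or snapshot of the device):

  # btrfs-image -w /dev/sdXN /root/meta.img    # -w: walk all trees manually
  # btrfs-image -r /root/meta.img /dev/sdYN    # write the metadata image back onto a device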
Re: runtime btrfsck
On Wed, 10 May 2017 09:48:07 +0200 Martin Steigerwaldwrote: > Yet, when it comes to btrfs check? Its still quite rudimentary if you ask me. > Indeed it is. It may or may not be possible to build a perfect Fsck, but IMO for the time being, what's most sorely missing, is some sort of a knowingly destructive repair mode, as in "I don't care about partial user data loss, just whack the FS metadata into full logical consistency at any means necessary". Also feels like it doesn't currently deal with the majority of actual in-real-world corruptions, notably the "parent transid failure" (even by a few dozens increments) which it can only helpfully "Ignore" during repair. So even with a minor corruption (something wonky in just ONE block of a multi-terabyte FS) the answer is way too often "nuke the entire thing and restore from backups". -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: runtime btrfsck
On Wed, 10 May 2017 09:02:46 +0200 Stefan Priebe - Profihost AG wrote: > how to fix bad key ordering? You should clarify whether the FS in question mounts (read-write? read-only?) and what the kernel messages are if it does not. -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: "corrupt leaf, invalid item offset size pair"
On Mon, 8 May 2017 20:05:44 +0200 "Janos Toth F."wrote: > May be someone more talented will be able to assist you but in my > experience this kind of damage is fatal in practice (even if you could > theoretically fix it, it's probably easier to recreate the fs and > restore the content from backup, or use the rescue tool to save some > of the old content which you never had copies from and restore that). > I think the problem is that the disturbed disk gets out of sync > (obviously, it misses some queued/buffered writes) from the rest of > the fs/disk(s) but later gets accepted back like it's in a perfectly > fine state (and/or Btrfs is ready to deal with problems like this, > though it looks like it is not), and then some fatal corruption starts > developing (due to the problematic disk being treated like it has > correct data, even though it has some errors). If you have it mounted > RW long enough, it will probably get worse and gets unmountable at > some point (and thus harder, if not impossible to rescure any data). > This is how I usually lost my RAID-5 mode Btrfs filesystems before I > stopped experimenting with that. I never had this problem since I > disabled SATA HotPlug (in the firmware setup of the motherboard) and > switched to RAID-10 mode (and eventually replaced both faulty SATA > cables in the system, one at a time after an incident...). Yeah I scrapped the FS and now restoring from backups. For some of the stuff that wasn't backed up, "btrfs restore" worked remarkably well. This was my primary 9x2TB mdadm RAID6 with Btrfs on top. But after all, it appears to be too risky to run all storage as a huge SPOF like that. And since I had almost everything backed up elsewhere, there's seems to be little justification for the protections of RAID6 (the machine does not need 100.00% uptime and does not even have hot-swap drive bays). So I will now switch to using individual drives with single-device Btrfs on each, joined for convenience with mhddfs/unionfs/aufs on the directory tree level. -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
"corrupt leaf, invalid item offset size pair"
Hello, It appears like during some trouble with HDD cables and controllers, I got some disk corruption. As a result, after a short period of time my Btrfs went read-only, and now does not mount anymore. [Sun May 7 23:08:02 2017] BTRFS error (device dm-8): parent transid verify failed on 13799442505728 wanted 625048 found 624487 [Sun May 7 23:08:02 2017] BTRFS info (device dm-8): read error corrected: ino 1 off 13799442505728 (dev /dev/mapper/vg-r6p1 sector 6736670512) [Sun May 7 23:08:33 2017] BTRFS error (device dm-8): parent transid verify failed on 13799589576704 wanted 625088 found 624488 [Sun May 7 23:08:33 2017] BTRFS error (device dm-8): parent transid verify failed on 13799589576704 wanted 625088 found 624402 [Sun May 7 23:08:33 2017] [ cut here ] [Sun May 7 23:08:33 2017] WARNING: CPU: 3 PID: 2022 at fs/btrfs/extent-tree.c:6555 __btrfs_free_extent.isra.67+0x2c2/0xd40 [btrfs]() [Sun May 7 23:08:33 2017] BTRFS: Transaction aborted (error -5) [Sun May 7 23:08:33 2017] Modules linked in: dm_mirror dm_region_hash dm_log ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_nat xt_limit xt_length nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip6t_rpfilter ipt_rpfilter xt_multiport iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip6table_raw iptable_raw ip6table_mangle iptable_mangle ip6table_filter ip6_tables iptable_filter ip_tables x_tables cpufreq_userspace cpufreq_conservative cpufreq_stats cpufreq_powersave nbd nfsd nfs_acl rpcsec_gss_krb5 auth_rpcgss oid_registry nfsv4 dns_resolver nfs lockd grace sunrpc fscache 8021q garp mrp bridge stp llc bonding tcp_illinois aoe crc32 loop it87 hwmon_vid fuse kvm_amd kvm irqbypass crct10dif_pclmul eeepc_wmi crc32_pclmul ghash_clmulni_intel asus_wmi sparse_keymap rfkill [Sun May 7 23:08:33 2017] video sha256_ssse3 sha256_generic hmac mxm_wmi drbg ansi_cprng snd_hda_codec_realtek aesni_intel snd_hda_codec_generic aes_x86_64 snd_hda_intel lrw gf128mul snd_hda_codec glue_helper snd_pcsp snd_hda_core ablk_helper snd_hwdep cryptd snd_pcm snd_timer cp210x joydev snd serio_raw k10temp evdev usbserial edac_mce_amd edac_core soundcore fam15h_power sp5100_tco acpi_cpufreq tpm_infineon wmi i2c_piix4 tpm_tis tpm 8250_fintek shpchp processor button ext4 crc16 mbcache jbd2 btrfs dm_cache_smq raid10 raid1 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq crc32c_generic md_mod dm_cache_mq dm_cache dm_persistent_data dm_bio_prison dm_bufio libcrc32c dm_mod sg sd_mod ata_generic hid_generic usbhid hid ohci_pci sata_mv ahci pata_jmicron libahci crc32c_intel sata_sil24 [Sun May 7 23:08:33 2017] ehci_pci ohci_hcd xhci_pci psmouse xhci_hcd ehci_hcd libata usbcore scsi_mod e1000e usb_common ptp pps_core [Sun May 7 23:08:33 2017] CPU: 3 PID: 2022 Comm: btrfs-transacti Not tainted 4.4.66-rm1+ #181 [Sun May 7 23:08:33 2017] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97 LE R2.0, BIOS 2601 03/24/2015 [Sun May 7 23:08:33 2017] 0286 2595262f 8800d675baf0 812ff351 [Sun May 7 23:08:33 2017] 8800d675bb38 a03929d2 8800d675bb28 8107eb95 [Sun May 7 23:08:33 2017] 0c42b8ffb000 fffb 8805b6c60800 [Sun May 7 23:08:33 2017] Call Trace: [Sun May 7 23:08:33 2017] [] dump_stack+0x63/0x82 [Sun May 7 23:08:33 2017] [] warn_slowpath_common+0x95/0xe0 [Sun May 7 23:08:33 2017] [] warn_slowpath_fmt+0x5c/0x80 [Sun May 7 23:08:33 2017] [] __btrfs_free_extent.isra.67+0x2c2/0xd40 [btrfs] [Sun May 7 23:08:33 2017] [] ? 
btrfs_merge_delayed_refs+0x67/0x610 [btrfs] [Sun May 7 23:08:33 2017] [] __btrfs_run_delayed_refs+0x99c/0x1260 [btrfs] [Sun May 7 23:08:33 2017] [] ? dequeue_task_fair+0x597/0x870 [Sun May 7 23:08:33 2017] [] ? put_prev_entity+0x42/0x760 [Sun May 7 23:08:33 2017] [] btrfs_run_delayed_refs+0x7e/0x2b0 [btrfs] [Sun May 7 23:08:33 2017] [] ? del_timer_sync+0x48/0x50 [Sun May 7 23:08:33 2017] [] btrfs_commit_transaction+0x5d/0xa60 [btrfs] [Sun May 7 23:08:33 2017] [] ? start_transaction+0x99/0x4d0 [btrfs] [Sun May 7 23:08:33 2017] [] transaction_kthread+0x1dc/0x250 [btrfs] [Sun May 7 23:08:33 2017] [] ? btrfs_cleanup_transaction+0x560/0x560 [btrfs] [Sun May 7 23:08:33 2017] [] kthread+0xfa/0x110 [Sun May 7 23:08:33 2017] [] ? kthread_park+0x60/0x60 [Sun May 7 23:08:33 2017] [] ret_from_fork+0x3f/0x70 [Sun May 7 23:08:33 2017] [] ? kthread_park+0x60/0x60 [Sun May 7 23:08:33 2017] ---[ end trace 13439d259c35afcf ]--- [Sun May 7 23:08:33 2017] BTRFS: error (device dm-8) in __btrfs_free_extent:6555: errno=-5 IO failure [Sun May 7 23:08:33 2017] BTRFS info (device dm-8): forced readonly [Sun May 7 23:08:33 2017] BTRFS: error (device dm-8) in btrfs_run_delayed_refs:2930: errno=-5 IO failure [Sun May 7 23:14:26 2017] BTRFS error (device dm-8): cleaner transaction attach returned -30 Unmounted and
Re: btrfs check --repair: failed to repair damaged filesystem, aborting
On Tue, 2 May 2017 23:17:11 -0700 Marc MERLINwrote: > On Tue, May 02, 2017 at 11:00:08PM -0700, Marc MERLIN wrote: > > David, > > > > I think you maintain btrfs-progs, but I'm not sure if you're in charge > > of check --repair. > > Could you comment on the bottom of the mail, namely: > > > failed to repair damaged filesystem, aborting > > > So, I'm out of luck now, full wipe and 3-5 day rebuild? > > Actually, another thought: > Is there or should there be a way to repair around the bit that cannot > be repaired? > Separately, or not, can I locate which bits are causing the repair to > fail and maybe get a pointer to the path/inode so that I can hopefully > just delete those bad data structures (assuming deleting them is even > possible and that the FS won't just go read only as I try to do that) There is the "btrfs-corrupt-block" tool which helped me to kick Btrfsck further along its course in a similar "unrepairable" situation. https://www.spinics.net/lists/linux-btrfs/msg53061.html In your case it appears like the block 2899180224512 is giving it the most trouble, so you could start with killing that one. From what I can tell this tool zeroes out the entire block, so Btrfsck can simply delete the reference and forget it, rather than repeatedly trying to figure out solutions and bailing out with "failed to repair damaged filesystem, aborting". Depending on what was stored in it, you may have either no visible effect, or a complete filesystem failure, or anything in between. Hence if you want to experiment with this, find a way to work on writable overlay snapshots (also described in the linked message). > Here is the full run if that helps: > https://pastebin.com/STMFHty4 > > > Thanks, > > Marc > > > > Rest: > > On Tue, May 02, 2017 at 11:47:22AM -0700, Marc MERLIN wrote: > > > (cc trimmed) > > > > > > The one in debian/unstable crashed: > > > gargamel:~# btrfs --version > > > btrfs-progs v4.7.3 > > > gargamel:~# btrfs check --repair /dev/mapper/dshelf2 > > > bytenr mismatch, want=2899180224512, have=3981076597540270796 > > > extent-tree.c:2721: alloc_reserved_tree_block: Assertion `ret` failed. 
> > > btrfs[0x43e418] > > > btrfs[0x43e43f] > > > btrfs[0x43f276] > > > btrfs[0x43f46f] > > > btrfs[0x4407ef] > > > btrfs[0x440963] > > > btrfs(btrfs_inc_extent_ref+0x513)[0x44107a] > > > btrfs[0x420053] > > > btrfs[0x4265eb] > > > btrfs(cmd_check+0x)[0x427d6d] > > > btrfs(main+0x12f)[0x40a341] > > > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7f6b632e82b1] > > > btrfs(_start+0x2a)[0x40a37a] > > > > > > Ok, it's old, let's take git from today: > > > gargamel:~# btrfs --version > > > btrfs-progs v4.10.2 > > > As a note, > > > gargamel:~# btrfs check --mode=lowmem --repair /dev/mapper/dshelf2 > > > enabling repair mode > > > ERROR: low memory mode doesn't support repair yet > > > > > > As a note, a 32bit binary on a 64bit kernel: > > > gargamel:~# btrfs check --repair /dev/mapper/dshelf2 > > > enabling repair mode > > > Checking filesystem on /dev/mapper/dshelf2 > > > UUID: 03e9a50c-1ae6-4782-ab9c-5f310a98e653 > > > checking extents > > > checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5 > > > checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5 > > > checksum verify failed on 2899180224512 found ABBE39B0 wanted E0735D0E > > > checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5 > > > bytenr mismatch, want=2899180224512, have=3981076597540270796 > > > checksum verify failed on 1449488023552 found CECC36AF wanted 199FE6C5 > > > checksum verify failed on 1449488023552 found CECC36AF wanted 199FE6C5 > > > checksum verify failed on 1449544613888 found 895D691B wanted A0C64D2B > > > checksum verify failed on 1449544613888 found 895D691B wanted A0C64D2B > > > parent transid verify failed on 1671538819072 wanted 293964 found 293902 > > > parent transid verify failed on 1671538819072 wanted 293964 found 293902 > > > checksum verify failed on 1671603781632 found 18BC28D6 wanted 372655A0 > > > checksum verify failed on 1671603781632 found 18BC28D6 wanted 372655A0 > > > cmds-check.c:6291: add_data_backref: BUG_ON `!back` triggered, value 1 > > > Aborted > > > > > > let's try again with a 64bit binary built from git: > > > (...) > > > Repaired extent references for 4227617038336 > > > ref mismatch on [4227872751616 4096] extent item 1, found 0 > > > Incorrect local backref count on 4227872751616 parent 3493071667200 owner > > > 0 > > > offset 0 found 0 wanted 1 back 0x56470b18e7f0 > > > Backref disk bytenr does not match extent record, bytenr=4227872751616, > > > ref > > > bytenr=0 > > > backpointer mismatch on [4227872751616 4096] > > > owner ref check failed [4227872751616 4096] > > > repair deleting extent record: key 4227872751616 168 4096 > > > checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5 > > > checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5 > > > checksum
Re: [PATCH 3/3] Make max_size consistent with nr
On Fri, 28 Apr 2017 11:13:36 +0200 Christophe de Dinechinwrote: > Since we memset tmpl, max_size==0. This does not seem consistent with nr = 1. > In check_extent_refs, we will call: > > set_extent_dirty(root->fs_info->excluded_extents, >rec->start, >rec->start + rec->max_size - 1); > > This ends up with BUG_ON(end < start) in insert_state. > > Signed-off-by: Christophe de Dinechin > --- > cmds-check.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/cmds-check.c b/cmds-check.c > index 58e65d6..774e9b6 100644 > --- a/cmds-check.c > +++ b/cmds-check.c > @@ -6193,6 +6193,7 @@ static int add_tree_backref(struct cache_tree > *extent_cache, u64 bytenr, > tmpl.start = bytenr; > tmpl.nr = 1; > tmpl.metadata = 1; > +tmpl.max_size = 1; > > ret = add_extent_rec_nolookup(extent_cache, ); > if (ret) The original code uses Tab characters for indent, but your addition uses spaces. Also same problem in patch 2/3. -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: No space left on device when doing "mkdir"
On Thu, 27 Apr 2017 08:52:30 -0500 Gerard Saraberwrote: > I could just reboot the system and be fine for a week or so, but is > there any way to diagnose this? `btrfs fi df` for a start. Also obligatory questions: do you have a lot of snapshots, and do you use qgroups? -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
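I.e. something like this (mount point is an example):

  # btrfs fi df /srv/data
  # btrfs fi usage /srv/data    # newer progs: allocated vs. used space, per profile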
Re: Btrfs/SSD
On Tue, 18 Apr 2017 03:23:13 + (UTC) Duncan <1i5t5.dun...@cox.net> wrote: > Without reading the links... > > Are you /sure/ it's /all/ ssds currently on the market? Or are you > thinking narrowly, those actually sold as ssds? > > Because all I've read (and I admit I may not actually be current, but...) > on for instance sd cards, certainly ssds by definition, says they're > still very write-cycle sensitive -- very simple FTL with little FTL wear- > leveling. > > And AFAIK, USB thumb drives tend to be in the middle, moderately complex > FTL with some, somewhat simplistic, wear-leveling. > If I have to clarify, yes, it's all about SATA and NVMe SSDs. SD cards may be SSDs "by definition", but nobody will think of an SD card when you say "I bought an SSD for my computer". And yes, SD card and USB flash sticks are commonly understood to be much simpler and more brittle devices than full blown desktop (not to mention server) SSDs. > While the stuff actually marketed as SSDs, generally SATA or direct PCIE/ > NVME connected, may indeed match your argument, no real end-user concern > necessary any more as the FTLs are advanced enough that user or > filesystem level write-cycle concerns simply aren't necessary these days. > > > So does that claim that write-cycle concerns simply don't apply to modern > ssds, also apply to common thumb drives and sd cards? Because these are > certainly ssds both technically and by btrfs standards. > -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs/SSD
On Mon, 17 Apr 2017 07:53:04 -0400 "Austin S. Hemmelgarn"wrote: > General info (not BTRFS specific): > * Based on SMART attributes and other factors, current life expectancy > for light usage (normal desktop usage) appears to be somewhere around > 8-12 years depending on specifics of usage (assuming the same workload, > F2FS is at the very top of the range, BTRFS and NILFS2 are on the upper > end, XFS is roughly in the middle, ext4 and NTFS are on the low end > (tested using Windows 7's NTFS driver), and FAT32 is an outlier at the > bottom of the barrel). Life expectancy for an SSD is defined not in years, but in TBW (terabytes written), and AFAICT that's not "from host", but "to flash" (some SSDs will show you both values in two separate SMART attributes out of the box, on some it can be unlocked). Filesystem may come into play only by the amount of write amplification they cause (how much "to flash" is greater than "from host"). Do you have any test data to show that FSes are ranked in that order by WA they cause, or is it all about "general feel" and how they are branded (F2FS says so, so it must be the best) > * Queued DISCARD support is still missing in most consumer SATA SSD's, > which in turn makes the trade-off on those between performance and > lifetime much sharper. My choice was to make a script to run from crontab, using "fstrim" on all mounted SSDs nightly, and aside from that all FSes are mounted with "nodiscard". Best of the both worlds, and no interference with actual IO operation. > * Modern (2015 and newer) SSD's seem to have better handling in the FTL > for the journaling behavior of filesystems like ext4 and XFS. I'm not > sure if this is actually a result of the FTL being better, or some > change in the hardware. Again, what makes you think this, did you observe the write amplification readings and now those are demonstrably lower than on "2014 and older" SSDs? So, by how much, and which models did you compare? > * In my personal experience, Intel, Samsung, and Crucial appear to be > the best name brands (in relative order of quality). I have personally > had bad experiences with SanDisk and Kingston SSD's, but I don't have > anything beyond circumstantial evidence indicating that it was anything > but bad luck on both counts. Why not think in terms not of "name brands" but platforms, i.e. a controller model + flash combination. For instance Intel have been using some other companies' controllers in their SSDs. Kingston uses tons of various controllers (Sandforce/Phison/Marvell/more?) depending on the model and range. > * Files with NOCOW and filesystems with 'nodatacow' set will both hurt > performance for BTRFS on SSD's, and appear to reduce the lifetime of the > SSD. "Appear to"? Just... what. So how many SSDs did you have fail under nocow? Or maybe can we get serious in a technical discussion? Did you by any chance mean cause more writes to the SSD and more "to flash" writes (resulting in a higher WA). If so, then by how much, and what was your test scenario comparing the same usage with and without nocow? > * Compression should help performance and device lifetime most of the > time, unless your CPU is fully utilized on a regular basis (in which > case it will hurt performance, but still improve device lifetimes). Days are long gone since the end user had to ever think about device lifetimes with SSDs. 
Refer to endurance studies such as http://techreport.com/review/27909/the-ssd-endurance-experiment-theyre-all-dead http://ssdendurancetest.com/ https://3dnews.ru/938764/ It has been demonstrated that all SSDs on the market tend to overshoot even their rated TBW by several times, as a result it will take any user literally dozens of years to wear out the flash no matter which filesystem or what settings used. And most certainly it's not worth it changing anything significant in your workflow (such as enabling compression if it's otherwise inconvenient or not needed) just to save the SSD lifetime. On Mon, 17 Apr 2017 13:13:39 -0400 "Austin S. Hemmelgarn" wrote: > > What is a high end SSD these days? Built-in NVMe? > One with a good FTL in the firmware. At minimum, the good Samsung EVO > drives, the high quality Intel ones As opposed to bad Samsung EVO drives and low-quality Intel ones? > and the Crucial MX series, but > probably some others. My choice of words here probably wasn't the best > though. Again, which controller? Crucial does not manufacture SSD controllers on their own, they just pack and brand stuff manufactured by someone else. So if you meant Marvell based SSDs, then that's many brands, not just Crucial. > For a normal filesystem or BTRFS with nodatacow or NOCOW, the block gets > rewritten in-place. This means that cheap FTL's will rewrite that erase > block in-place (which won't hurt performance but will impact device > lifetime), and good ones will rewrite into a free
Re: About free space fragmentation, metadata write amplification and (no)ssd
On Sun, 9 Apr 2017 06:38:54 + Paul Joneswrote: > -Original Message- > From: linux-btrfs-ow...@vger.kernel.org > [mailto:linux-btrfs-ow...@vger.kernel.org] On Behalf Of Hans van Kranenburg > Sent: Sunday, 9 April 2017 6:19 AM > To: linux-btrfs > Subject: About free space fragmentation, metadata write amplification and > (no)ssd > > > So... today a real life story / btrfs use case example from the trenches at > > work... > > Snip!! > > Great read. I do the same thing for backups on a much smaller scale and it > works brilliantly. Two 4T drives in btrfs raid1. > I will mention that I recently setup caching using LLVM (1 x 300G ssd for > each 4T drive), and it's extraordinary how much of a difference it makes. > Especially when running deduplication. If it's feasible perhaps you could try > it with a nvme drive. You mean LVM, not LLVM :) I was actually going to suggest that as well, in my case I use a 32GB SSD cache for my entire 14TB filesystem with 15 GB metadata (*2 in DUP). In fact you should check the metadata size on yours, most likely you can get by with an order of magnitude smaller cache for exactly the same benefit (and have the rest of 2x300GB for other interesting uses). And yeah it's amazing, especially when deleting old snapshots or doing backups. In my case I backup the entire root FS from about 30 hosts, and keep that in periodic snapshots for a month. Previously I would also stagger rsync runs so that no more than 4 or 5 hosts get backed up at the same time (and still there would be tons of trashing in seeks and iowait), now it's no problem whatsoever. The only issue that I have with this setup is you need to "cleanly close" the cached LVM device on shutdown/reboot, and apparently there is no init script in Debian that would do that (experimenting with adding some hacks, but no success yet). So on every boot the entire cache is marked dirty and data is being copied from cache to the actual storage, which takes some time, since this appears to be done in a random IO pattern. -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: mix ssd and hdd in single volume
On Mon, 3 Apr 2017 11:30:44 +0300 Marat Khaliliwrote: > You may want to look here: https://www.synology.com/en-global/dsm/Btrfs > . Somebody forgot to tell Synology, which already supports btrfs in all > hardware-capable devices. I think Rubicon has been crossed in > 'mass-market NAS[es]', for good or not. AFAIR Synology did not come to this list asking for (any kind of) advice prior to implementing that (else they would have gotten the same kind of post from Duncan and others), and it's not Btrfs developers job to have an outreach program to contact vendors and educate them to not use Btrfs. I don't remember seeing them actively contribute improvements or fixes especially for the RAID5 or RAID6 features (which they ADVERTISE on that page as a fully working part of their product). That doesn't seem honest to end users or playing nicely with the upstream developers. What the upstream gets instead is just those end-users coming here one by one some years later, asking how to fix a broken Btrfs RAID5 on an embedded box running some 3.10 or 3.14 kernel. -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Is btrfs-convert able to deal with sparse files in a ext4 filesystem?
On Sun, 2 Apr 2017 09:30:46 +0300 Andrei Borzenkovwrote: > 02.04.2017 03:59, Duncan пишет: > > > > 4) In fact, since an in-place convert is almost certainly going to take > > more time than a blow-away and restore from backup, > > This caught my eyes. Why? In-place convert just needs to recreate > metadata. If you have multi-terabyte worth of data copying them twice > hardly can be faster. In-place convert is most certainly faster than copy-away and restore, in fact it can be very fast if you use the option to not calculate checksums for the entire filesystem's data (btrfs-convert -d). -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
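A rough outline of such a conversion (device name is an example; the ext4 filesystem must be unmounted and clean, and -d skips generating data checksums):

  # umount /dev/sdXN
  # e2fsck -f /dev/sdXN
  # btrfs-convert -d /dev/sdXN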
Re: Qgroups are not applied when snapshotting a subvol?
On Mon, 27 Mar 2017 13:32:47 -0600 Chris Murphywrote: > How about if qgroups are enabled, then non-root user is prevented from > creating new subvolumes? That sounds like, if you turn your headlights on in a car, then in-vehicle air conditioner randomly stops working. :) Two things only vaguely related from the end user's point of view. > Or is there a way for a new nested subvolume to be included in its > parent's quota, rather than the new subvolume having a whole new quota > limit? Either that, or a separate "allow non-root user subvolumes/snapshots creation" mount option. There is already one for deletion, after all. user_subvol_rm_allowed Allow subvolumes to be deleted by a non-root user. Use with caution. -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Shrinking a device - performance?
On Mon, 27 Mar 2017 16:49:47 +0200 Christian Theunewrote: > Also: the idea of migrating on btrfs also has its downside - the performance > of “mkdir” and “fsync” is abysmal at the moment. I’m waiting for the current > shrinking job to finish but this is likely limited to the “find free space” > algorithm. We’re talking about a few megabytes converted per second. Sigh. Btw since this is all on LVM already, you could set up lvmcache with a small SSD-based cache volume. Even some old 60GB SSD would work wonders for performance, and with the cache policy of "writethrough" you don't have to worry about its reliability (much). -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
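A minimal lvmcache sketch, assuming a VG named "vg", an origin LV "data", and the SSD already added to the VG as a PV (names and sizes are made up):

  # lvcreate -L 57G -n cache vg /dev/ssd
  # lvcreate -L 1G -n cachemeta vg /dev/ssd
  # lvconvert --type cache-pool --poolmetadata vg/cachemeta vg/cache
  # lvconvert --type cache --cachepool vg/cache --cachemode writethrough vg/data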
Re: Shrinking a device - performance?
On Mon, 27 Mar 2017 15:20:37 +0200 Christian Theune wrote: > (Background info: we’re migrating large volumes from btrfs to xfs and can > only do this step by step: copying some data, shrinking the btrfs volume, > extending the xfs volume, rinse repeat. If someone should have any > suggestions to speed this up and not having to think in terms of _months_ > then I’m all ears.) I would only suggest that you reconsider XFS. You can't shrink XFS, therefore you won't have the flexibility to migrate in the same way to anything better that comes along in the future (ZFS perhaps? or even Bcachefs?). XFS does not perform that much better than Ext4, and very importantly, Ext4 can be shrunk. From the looks of it, Ext4 has also overcome its 16TB limitation: http://askubuntu.com/questions/779754/how-do-i-resize-an-ext4-partition-beyond-the-16tb-limit -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
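For completeness, shrinking Ext4 goes roughly like this (names and sizes are examples; resize2fs can only shrink an unmounted filesystem):

  # umount /dev/vg/vol
  # e2fsck -f /dev/vg/vol
  # resize2fs /dev/vg/vol 10T
  # lvreduce -L 10T vg/vol    # only after resize2fs succeeded; lvreduce -r can do both steps at once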
Re: backing up a file server with many subvolumes
On Sat, 25 Mar 2017 23:00:20 -0400 "J. Hart" wrote: > I have a Btrfs filesystem on a backup server. This filesystem has a > directory to hold backups for filesystems from remote machines. In this > directory is a subdirectory for each machine. Under each machine > subdirectory is one directory for each filesystem (ex /boot, /home, etc) > on that machine. In each filesystem subdirectory are incremental > snapshot subvolumes for that filesystem. The scheme is something like > this: > > /backup/<machine>/<filesystem>/<snapshot> > > I'd like to try to back up (duplicate) the file server filesystem > containing these snapshot subvolumes for each remote machine. The > problem is that I don't think I can use send/receive to do this. "Btrfs > send" requires "read-only" snapshots, and snapshots are not recursive as > yet. I think there are too many subvolumes which change too often to > make doing this without recursion practical. You could have done time-based snapshots on the top level (for /backup/), say, every 6 hours, keeping those for e.g. a month. Then don't bother with any other kind of subvolumes/snapshots on the backup machine, and do backups from remote machines into their respective subdirectories using simple 'rsync' (see the sketch below). That's what a sensible scheme looks like IMO, as opposed to the Btrfs-induced exercise in futility that you have (there are subvolumes? must use them for everything, even the frigging /boot/; there is send/receive? absolutely must use it for backing up; etc.) -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
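The sketch mentioned above, as a periodic cron job (host and path names are examples; /backup is assumed to be a subvolume, with /snapshots a directory on the same filesystem):

  rsync -aHAX --delete host1:/home/ /backup/host1/home/
  rsync -aHAX --delete host2:/boot/ /backup/host2/boot/
  btrfs subvolume snapshot -r /backup /snapshots/backup-$(date +%Y%m%d-%H%M)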
Re: help : "bad tree block start" -> btrfs forced readonly
On Fri, 17 Mar 2017 10:27:11 +0100 Lionel Boutonwrote: > Hi, > > Le 17/03/2017 à 09:43, Hans van Kranenburg a écrit : > > btrfs-debug-tree -b 3415463870464 > > Here is what it gives me back : > > btrfs-debug-tree -b 3415463870464 /dev/sdb > btrfs-progs v4.6.1 > checksum verify failed on 3415463870464 found A85405B7 wanted 01010101 > checksum verify failed on 3415463870464 found A85405B7 wanted 01010101 > bytenr mismatch, want=3415463870464, have=72340172838076673 > ERROR: failed to read 3415463870464 > > Is there a way to remove part of the tree and keep the rest ? It could > help minimize the time needed to restore data. If you are able to experiment with writable snapshots, you could try using "btrfs-corrupt-block" to kill the bad block, and see what btrfsck makes out of the rest. In a similar case I got little to no damage to the overall FS. http://www.spinics.net/lists/linux-btrfs/msg53061.html -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
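If that filesystem happens to sit on LVM, a throwaway writable snapshot makes such experiments reasonably safe (names and sizes are examples):

  # lvcreate -s -L 50G -n scratch /dev/vg/data
  # ... run btrfs-corrupt-block / btrfs check --repair against /dev/vg/scratch ...
  # lvremove vg/scratch    # discard the experiment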
Re: 4.10/4.11 Experiences
On Thu, 16 Feb 2017 13:37:53 +0200 Imran Geriskovan wrote: > What are your experiences for btrfs regarding 4.10 and 4.11 kernels? > I'm still on 4.8.x. I'd be happy to hear from anyone using 4.1x for > a very typical single disk setup. Are they reasonably stable/good > enough for this case? You should always check https://www.kernel.org/ for what the current versions are and what their status is. As you can see, 4.8 is basically dead in the water: it is no longer listed on the website and does not get any updates from the kernel devs anymore. If yours is a distro kernel, you now have to rely on whatever fixes (and of whatever quality) the distro maintainers are able to backport. Personally I took a liking to always running the latest longterm series, i.e. right now staying on 4.4, and after a few initial hiccups it appears rock-solid Btrfs-wise (as you said, for a single device, no multi-device, no qgroups etc). I'd suggest that you either upgrade to 4.9 (from the news it appears that one will be granted the next longterm series status), or switch to 4.4, which may or may not be the better choice, given there are some scary-sounding reports about 4.9 (if you have this list's archive, search for "4.9" in thread titles) with little to no conclusive resolution. -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Unexpected behavior involving file attributes and snapshots.
On Tue, 14 Feb 2017 10:30:43 -0500 "Austin S. Hemmelgarn" wrote: > I was just experimenting with snapshots on 4.9.0, and came across some > unexpected behavior. > > The simple explanation is that if you snapshot a subvolume, any files in > the subvolume that have the NOCOW attribute will not have that attribute > in the snapshot. Some further testing indicates that this is the only > file attribute that isn't preserved (I checked all the chattr flags that > BTRFS supports). > > I'm kind of curious whether: > 1. This is actually documented somewhere, as it's somewhat unexpected > given that everything else is preserved when snapshotting. > 2. This is intended behavior, or just happens to be a side effect of the > implementation. I don't seem to get this on 4.4.45 and 4.4.47.

$ btrfs sub create test
Create subvolume './test'
$ touch test/abc
$ chattr +C test/abc
$ echo def > test/abc
$ ls -la test/abc
-rw-r--r-- 1 rm rm 4 Feb 14 20:52 test/abc
$ lsattr test/abc
---C test/abc
$ btrfs sub snap test test2
Create a snapshot of 'test' in './test2'
$ ls -la test2/abc
-rw-r--r-- 1 rm rm 4 Feb 14 20:52 test2/abc
$ lsattr test2/abc
---C test2/abc

-- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BTRFS for OLTP Databases
On Tue, 7 Feb 2017 09:13:25 -0500 Peter Zaitsev wrote: > Hi Hugo, > > For the use case I'm looking for I'm interested in having snapshot(s) > open at all time. Imagine for example snapshot being created every > hour and several of these snapshots kept at all time providing quick > recovery points to the state of 1,2,3 hours ago. In such case (as I > think you also describe) nodatacow does not provide any advantage. It still does provide some advantage: each write into a new area since the last hourly snapshot gets CoW'ed only once, as opposed to every write getting CoW'ed every time no matter what. I'm not sold on autodefrag; what I'd suggest instead is to schedule a regular defrag ("btrfs fi defrag") of the database files, e.g. daily. This may increase space usage temporarily, as it will partially unmerge extents previously shared across snapshots, but you won't get runaway fragmentation anymore, as you would without nodatacow or with periodic snapshotting. -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
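E.g. a daily cron entry along these lines (path and target extent size are examples):

  btrfs filesystem defragment -r -t 32M /var/lib/mysql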
Re: Is it possible to have metadata-only device with no data?
On Sun, 5 Feb 2017 22:55:42 +0100 Hans van Kranenburg wrote: > On 02/05/2017 10:42 PM, Alexander Tomokhov wrote: > > Is it possible, having two drives to do raid1 for metadata but keep data on > > a single drive only? > > Nope. > > Would be a really nice feature though... Putting metadata on SSD and > bulk data on HDD... > You can play around with this hack just to see how that would perform, but it comes with no warranty and is untested even by me. I was going to try it, but put it on hold since you'd also need to make sure the SSD (and not the HDD) is being preferred for metadata reads, and so far I have not figured out a simple way of ensuring that.
--- linux-amd64-4.4/fs/btrfs/volumes.c.orig     2016-11-01 22:41:41.970978721 +0500
+++ linux-amd64-4.4/fs/btrfs/volumes.c  2016-11-01 22:58:45.958977731 +0500
@@ -4597,6 +4597,14 @@
 		if (total_avail == 0)
 			continue;
 
+		/* If we have two devices and one is less than 25% of the total FS size, then
+		 * presumably it's a small device just for metadata RAID1, don't use it
+		 * for new data chunks. */
+		if ((fs_devices->num_devices == 2) &&
+		    (device->total_bytes * 4 < fs_devices->total_rw_bytes) &&
+		    (type & BTRFS_BLOCK_GROUP_DATA))
+			continue;
+
 		ret = find_free_dev_extent(trans, device,
 					   max_stripe_size * dev_stripes,
 					   &dev_offset, &max_avail);
-- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
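If someone does try the hack above, one quick way to check whether data chunks really keep landing only on the big device is the per-device chunk breakdown (illustrative; assumes the filesystem is mounted at /mnt):

# with the patch working as intended you would expect Metadata,RAID1 listed
# under both devices, but Data,single entries only under the large HDD
btrfs device usage /mnt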
Re: RAID5: btrfs rescue chunk-recover segfaults.
On Mon, 23 Jan 2017 14:15:55 +0100 Simon Waidwrote: > I have a btrfs raid5 array that has become unmountable. That's the third time you send this today. Will you keep resending every few hours until you get a reply? That's not how mailing lists work. -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: dup vs raid1 in single disk
On Thu, 19 Jan 2017 17:39:37 +0100 "Alejandro R. Mosteo"wrote: > I was wondering, from a point of view of data safety, if there is any > difference between using dup or making a raid1 from two partitions in > the same disk. This is thinking on having some protection against the > typical aging HDD that starts to have bad sectors. RAID1 will write slower compared to DUP, as any optimization to make RAID1 devices work in parallel will cause a total performance disaster for you as you will start trying to write to both partitions at the same time, turning all linear writes into random ones, which are about two orders of magnitude slower than linear on spinning hard drives. DUP shouldn't have this issue, but still it will be twice slower than single, since you are writing everything twice. You could consider DUP data for when a disk is already known to be getting bad sectors from time to time -- but then it's a fringe exercise to try and keep using such disk in the first place. Yeah with DUP data DUP metadata you can likely have some more life out of such disk as a throwaway storage space for non-essential data, at half capacity, but is it worth the effort, as it's likely to start failing progressively worse over time. In all other cases the performance and storage space penalty of DUP within a single device are way too great (and gained redundancy is too low) compared to a proper system of single profile data + backups, or a RAID5/6 system (not Btrfs-based) + backups. > On a related note, I see this caveat about dup in the manpage: > > "For example, a SSD drive can remap the blocks internally to a single > copy thus deduplicating them. This negates the purpose of increased > redunancy (sic) and just wastes space" That ability is vastly overestimated in the man page. There is no miracle content-addressable storage system working at 500 MB/sec speeds all within a little cheap controller on SSDs. Likely most of what it can do, is just compress simple stuff, such as runs of zeroes or other repeating byte sequences. And the DUP mode is still useful on SSDs, for cases when one copy of the DUP gets corrupted in-flight due to a bad controller or RAM or cable, you could then restore that block from its good-CRC DUP copy. -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
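For reference, setting up the DUP layout discussed above is straightforward with a reasonably recent btrfs-progs (illustrative commands; /dev/sdX and /mnt are placeholders):

# create a new filesystem with DUP for both data and metadata
mkfs.btrfs -d dup -m dup /dev/sdX

# or convert an existing, mounted filesystem in place
btrfs balance start -dconvert=dup -mconvert=dup /mnt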
Re: Can't add/replace a device on degraded filesystem
On Thu, 29 Dec 2016 19:27:30 -0500 Rich Gannonwrote: > I can mount my filesystem with -o degraded, but I can not do btrfs > replace or btrfs device add as the filesystem is in read-only mode, and > I can not mount read-write. You can try my patch which removes that limitation https://patchwork.kernel.org/patch/9419189/ Also as Duncan said there's a "more proper" patch to fix this in the works somewhere, which does a per-chunk check for degraded, and would also allow to mount the FS read-write in your case. -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: problems with btrfs filesystem loading
On Thu, 29 Dec 2016 16:42:09 +0100 Michał Zeganwrote: > I have odroid c2, processor architecture aarch64, linux kernel from > master as of today from http://github.com/torwalds/linux.git. > It seems that the btrfs module cannot be loaded. The only thing that > happens is that after modprobe i see: > modprobe: can't load module btrfs (kernel/fs/btrfs/btrfs.ko.gz): unknown > symbol in module, or unknown parameter > No errors in dmesg, like I have ignore_loglevel in kernel cmdline and no > logs in console appear except logs for loading dependencies like xor > module, but that is probably not important. > The kernel has been recompiled few minutes ago from scratch, the only > thing left was .config file. What is that? other modules load correctly > from what I can see. In the past there's been some trouble with crc32 dependencies: https://www.spinics.net/lists/linux-btrfs/msg32104.html Not sure if that's relevant anymore, but in any case, check if you have crc32-related stuff either built-in or compiled as modules, if latter, try loading those before btrfs (/lib/modules/*/kernel/crypto/crc32*) -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
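Something along these lines should show how the crc32c dependency was built, and let you load it by hand first (a sketch; exact module names can vary with the kernel config):

# is crc32c built into the kernel, or modular?
grep -i crc32 /lib/modules/$(uname -r)/modules.builtin
ls /lib/modules/$(uname -r)/kernel/crypto/crc32* 2>/dev/null

# if modular, load it explicitly before trying btrfs again
modprobe crc32c && modprobe btrfs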
Re: Convert from RAID 5 to 10
On Wed, 30 Nov 2016 07:50:17 -0500 "Austin S. Hemmelgarn" wrote: > > *) Read performance is not optimized: all metadata is always read from the > > first device unless it has failed, data reads are supposedly balanced > > between > > devices per PID of the process reading. Better implementations dispatch > > reads > > per request to devices that are currently idle. > Based on what I've seen, the metadata reads get balanced too. https://github.com/torvalds/linux/blob/v4.8/fs/btrfs/disk-io.c#L451 This starts from mirror number 0 and tries the others in incrementing order until one succeeds. It appears that as long as the mirror with copy #0 is up and not corrupted, all reads will simply get satisfied from it. > > *) Write performance is not optimized, during long full bandwidth sequential > > writes it is common to see devices writing not in parallel, but with long > > periods of just one device writing, then another. (Admittedly it has been some > > time since I tested that). > I've never seen this be an issue in practice, especially if you're using > transparent compression (which caps extent size, and therefore I/O size > to a given device, at 128k). I'm also sane enough that I'm not doing > bulk streaming writes to traditional HDD's or fully saturating the > bandwidth on my SSD's (you should be over-provisioning whenever > possible). For a desktop user, unless you're doing real-time video > recording at higher than HD resolution with high quality surround sound, > this probably isn't going to hit you (and even then you should be > recording to a temporary location with much faster write speeds (tmpfs > or ext4 without a journal for example) because you'll likely get hit > with fragmentation). I did not use compression while observing this; also I don't know what is particularly insane about copying a 4-8 GB file onto a storage array. I'd expect both disks to write at the same time (like they do in pretty much any other RAID1 system), not one-after-another, effectively slowing down the entire operation by as much as 2x in extreme cases. > As far as not mounting degraded by default, that's a conscious design > choice that isn't going to change. There's a switch (adding 'degraded' > to the mount options) to enable this behavior per-mount, so we're still > on-par in that respect with LVM and MD, we just picked a different > default. In this case, I actually feel it's a better default for most > cases, because most regular users aren't doing exhaustive monitoring, > and thus are not likely to notice the filesystem being mounted degraded > until it's far too late. If the filesystem is degraded, then > _something_ has happened that the user needs to know about, and until > some sane monitoring solution is implemented, the easiest way to ensure > this is to refuse to mount. The easiest is to write to dmesg and syslog; if a user doesn't monitor those either, it's their own fault. The more user-friendly option would be to still auto-mount degraded, but read-only. Compare with Ext4: it appears to have the "errors=continue" behavior by default, the user has to explicitly request "errors=remount-ro", and I have never seen anyone use or recommend the third option of "errors=panic", which is basically the equivalent of the current Btrfs practice. > > *) It does not properly handle a device disappearing during operation. > > (There > > is a patchset to add that). > > > > *) It does not properly handle said device returning (under a > > different /dev/sdX name, for bonus points).
> These are not an easy problem to fix completely, especially considering > that the device is currently guaranteed to reappear under a different > name because BTRFS will still have an open reference on the original > device name. > > On top of that, if you've got hardware that's doing this without manual > intervention, you've got much bigger issues than how BTRFS reacts to it. > No correctly working hardware should be doing this. Unplugging and replugging a SATA cable of a RAID1 member should never put your system under the risk of a massive filesystem corruption; you cannot say it absolutely doesn't with the current implementation. -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Convert from RAID 5 to 10
On Wed, 30 Nov 2016 00:16:48 +0100 Wilson Meier wrote: > That said, btrfs shouldn't be used for other then raid1 as every other > raid level has serious problems or at least doesn't work as the expected > raid level (in terms of failure recovery). RAID1 shouldn't be used either: *) Read performance is not optimized: all metadata is always read from the first device unless it has failed, data reads are supposedly balanced between devices per PID of the process reading. Better implementations dispatch reads per request to devices that are currently idle. *) Write performance is not optimized, during long full bandwidth sequential writes it is common to see devices writing not in parallel, but with long periods of just one device writing, then another. (Admittedly it has been some time since I tested that). *) A degraded RAID1 won't mount by default. If this is the root filesystem, the machine won't boot. To mount it, you need to add the "degraded" mount option. However you have exactly a single chance at that: you MUST restore the RAID to a non-degraded state while it's mounted during that session, since it won't ever mount again in the r/w+degraded mode, and in r/o mode you can't perform any operations on the filesystem, including adding/removing devices. *) It does not properly handle a device disappearing during operation. (There is a patchset to add that). *) It does not properly handle said device returning (under a different /dev/sdX name, for bonus points). Most of these also apply to all other RAID levels. -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: fix hole read corruption for compressed inline extents
On Mon, 28 Nov 2016 00:03:12 -0500 Zygo Blaxell wrote:
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 8e3a5a2..b1314d6 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -6803,6 +6803,12 @@ static noinline int uncompress_inline(struct btrfs_path *path,
>  	max_size = min_t(unsigned long, PAGE_SIZE, max_size);
>  	ret = btrfs_decompress(compress_type, tmp, page,
>  			       extent_offset, inline_size, max_size);
> +	WARN_ON(max_size > PAGE_SIZE);
> +	if (max_size < PAGE_SIZE) {
> +		char *map = kmap(page);
> +		memset(map + max_size, 0, PAGE_SIZE - max_size);
> +		kunmap(page);
> +	}
>  	kfree(tmp);
>  	return ret;
>  }
Wasn't this already posted as: btrfs: fix silent data corruption while reading compressed inline extents https://patchwork.kernel.org/patch/9371971/ but you don't indicate that's a V2 or something, and in fact the patch seems exactly the same, just the subject and commit message are entirely different. Quite confusing. -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: mount option nodatacow for VMs on SSD?
On Fri, 25 Nov 2016 12:01:37 +0000 (UTC) Duncan <1i5t5.dun...@cox.net> wrote: > Obviously this can be a HUGE problem on spinning rust due to its seek times, > a problem zero-seek-time ssds don't have They are not strictly zero seek time either. Sure, you don't have the issue of moving the physical head around, but still, sequential reads are way faster even on SSDs compared to random reads. A somewhat typical result for a consumer SSD:
Sequential Read :          382.301 MB/s
Sequential Write :         315.124 MB/s
Random Read 512KB :        261.751 MB/s
Random Write 512KB :       334.615 MB/s
Random Read 4KB (QD=1) :    19.859 MB/s [  4848.5 IOPS]
Random Write 4KB (QD=1) :   61.794 MB/s [ 15086.3 IOPS]
Random Read 4KB (QD=32) :  132.415 MB/s [ 32327.9 IOPS]
Random Write 4KB (QD=32) : 203.051 MB/s [ 49573.0 IOPS]
If you have tons of 4K fragments, reading them in can go as low as 20 MB/sec, compared to 382 MB/sec if they were all in one piece. -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
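Those figures look like a CrystalDiskMark run; on Linux roughly the same comparison can be reproduced with fio, for example (illustrative jobs only, point --filename at a scratch file or a disposable device):

# sequential reads vs. 4K random reads at queue depth 1
fio --name=seqread  --filename=/path/to/testfile --size=4G --direct=1 \
    --rw=read --bs=1M --runtime=30 --time_based
fio --name=randread --filename=/path/to/testfile --size=4G --direct=1 \
    --rw=randread --bs=4k --iodepth=1 --runtime=30 --time_based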
Re: My system mounts the wrong btrfs partition, from the wrong disk!
On Fri, 25 Nov 2016 12:05:57 +0100 Niccolò Belliwrote: > This is something pretty unbelievable, so I had to repeat it several times > before finding the courage to actually post it to the mailing list :) > > After dozens of data loss I don't trust my btrfs partition that much, so I > make a backup copy with dd weekly. https://btrfs.wiki.kernel.org/index.php/Gotchas#Block-level_copies_of_devices "don't make copies with dd." -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
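The underlying gotcha is that a block-level copy keeps the same filesystem UUID, so the kernel's device scan may pick up whichever copy it sees first. If a dd clone really must exist, it is safer to give the copy a new UUID before both are ever attached at the same time, roughly like this (device name and paths are placeholders; the filesystem must be unmounted for btrfstune):

# assign a new random filesystem UUID to the cloned copy
btrfstune -f -u /dev/sdY

# for routine backups, snapshot-based send/receive avoids the problem entirely
btrfs subvolume snapshot -r / /snap-for-backup
btrfs send /snap-for-backup | btrfs receive /mnt/backupdisk/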
Re: degraded BTRFS RAID 1 not mountable: open_ctree failed, unable to find block group for 0
On Wed, 16 Nov 2016 11:55:32 +0100 Martin Steigerwaldwrote: > I do think that above kernel messages invite such a kind of interpretation > tough. I took the "BTRFS: open_ctree failed" message as indicative to some > structural issue with the filesystem. For the reason as to why the writable mount didn't work, check "btrfs fi df" for the filesystem to see if you have any "single" profile chunks on it: quite likely you did already mount it "degraded,rw" in the past *once*, after which those "single" chunks get created, and consequently it won't mount r/w anymore (without lifting the restriction on the number of missing devices as proposed). -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
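Roughly, checking for and (once the array is complete again) cleaning up such stray chunks looks like this (illustrative; /mnt stands for the mounted filesystem):

# look for "Data, single" / "Metadata, single" lines alongside the RAID1 ones
btrfs fi df /mnt

# after the missing device has been replaced or re-added, convert the strays back;
# the "soft" filter only touches chunks that are not already RAID1
btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt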
Re: degraded BTRFS RAID 1 not mountable: open_ctree failed, unable to find block group for 0
On Wed, 16 Nov 2016 11:25:00 +0100 Martin Steigerwald wrote: > merkaba:~> mount -o degraded,clear_cache /dev/satafp1/backup /mnt/zeit > mount: wrong filesystem type, bad options, the > superblock on /dev/mapper/satafp1-backup is damaged, missing > codepage, or another error > > Sometimes the system log provides useful information - > try dmesg | tail or similar > merkaba:~#32> dmesg | tail -6 > [ 3080.120687] BTRFS info (device dm-13): allowing degraded mounts > [ 3080.120699] BTRFS info (device dm-13): force clearing of disk cache > [ 3080.120703] BTRFS info (device dm-13): disk space caching is enabled > [ 3080.120706] BTRFS info (device dm-13): has skinny extents > [ 3080.150957] BTRFS warning (device dm-13): missing devices (1) exceeds > the limit (0), writeable mount is not allowed > [ 3080.195941] BTRFS: open_ctree failed I have to wonder, did you read the above message? What you need at this point is simply "-o degraded,ro". But I don't see that tried anywhere down the line. See also (or try): https://patchwork.kernel.org/patch/9419189/ -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: when btrfs scrub reports errors and btrfs check --repair does not
On Sun, 13 Nov 2016 07:06:30 -0800 Marc MERLINwrote: > So first: > a) find -inum returns some inodes that don't match > b) but argh, multiple files (very different) have the same inode number, so > finding > files by inode number after scrub flagged an inode bad, isn't going to work :( I wonder why do you even need scrub to verify file readability. Just try reading all files by using e.g. "cfv -Crr", the read errors produced will point you directly to files which are unreadable, without the need to lookup them in a backward way via inum. Then just restore those from backups. -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
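If cfv is not at hand, a plain read pass over the tree gives the same information, since the kernel reports a read error for every file whose checksums fail (a simple sketch; /mnt is a placeholder):

# read every file once and throw the data away; I/O errors are printed per file
find /mnt -type f -exec cat {} + > /dev/null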
[RFC] [PATCH] Mounting "degraded,rw" should allow for any number of devices missing
Hello, Mounting "degraded,rw" should allow for any number of devices missing, as in many cases the current check seems overly strict and not helpful during what is already a manual recovery scenario. Let's assume the user applying the "degraded" option knows best what condition their FS is in and what are the next steps required to recover from the degraded state. Specifically this would allow salvaging "JBOD-style" arrays of data=single metadata=RAID1, if the user is ready to accept loss of data portions which were on the removed drive. Currently if one of the disks got removed it is not possible for such array to be mounted rw at all -- hence not possible to "dev delete missing" and the only solution is to recreate the FS. Besides, I am currently testing a concept of SSD+HDD array with data=single and metadata=RAID1, where the SSD is used for RAID1 metadata chunks only. E.g. my 13 TB FS only has about 14 GB of metadata at the moment, so I could comfortably use a spare 60GB SSD as a metadata-only device for it. (Making all metadata reads prefer SSD could be the next step.) It would be nice to be able to just lose/fail/forget that SSD, without having to redo the entire FS. But again, since the remaining device has data=single, currently it won't be write-mountable in the degraded state, even though the missing device had only ever contained RAID1 chunks. Maybe someone has other ideas how to solve the above scenarios? Thanks
--- linux-amd64-4.4/fs/btrfs/disk-io.c.orig     2016-11-09 16:19:50.431117913 +0500
+++ linux-amd64-4.4/fs/btrfs/disk-io.c  2016-11-09 16:20:31.567117874 +0500
@@ -2992,7 +2992,8 @@
 		btrfs_calc_num_tolerated_disk_barrier_failures(fs_info);
 	if (fs_info->fs_devices->missing_devices >
 	     fs_info->num_tolerated_disk_barrier_failures &&
-	    !(sb->s_flags & MS_RDONLY)) {
+	    !(sb->s_flags & MS_RDONLY) &&
+	    !btrfs_raw_test_opt(fs_info->mount_opt, DEGRADED)) {
 		pr_warn("BTRFS: missing devices(%llu) exceeds the limit(%d), writeable mount is not allowed\n",
 			fs_info->fs_devices->missing_devices,
 			fs_info->num_tolerated_disk_barrier_failures);
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs check --repair: ERROR: cannot read chunk root
On Fri, 4 Nov 2016 01:01:13 -0700 Marc MERLINwrote: > Basically I have this: > sde8:64 0 3.7T 0 > └─sde1 8:65 0 3.7T 0 > └─md59:50 14.6T 0 > └─bcache0252:00 14.6T 0 > └─crypt_bcache0 (dm-0) 253:00 14.6T 0 > > I'll try dd'ing the md5 directly now, but that's going to take another 2 days > :( > > That said, given that almost half the device is not readable from user space > for some reason, that would explain why btrfs check is failing. Obviously it > can't do its job if it can't read blocks. I don't see anything to support the notion that "half is unreadable", maybe just a 512-byte sector is unreadable -- but that would be enough to make regular dd bail out -- which is why you should be using dd_rescue for this, not regular dd. Assuming you just want to copy over as much data as possible, and not simply test if dd fails or not (but in any case dd_rescue at least would not fail instantly and would tell you precise count of how much is unreadable). There is "GNU ddrescue" and "dd_rescue", I liked the first one better, but they both work on a similar principle. Also didn't you recently have issues with bad block lists on mdadm. This mysterious "unreadable and nothing in dmesg" does appear to be a continuation of that. -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
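For the actual copy, GNU ddrescue is typically run in two passes, something like the following (the destination device and the map file path are placeholders):

# first pass: grab everything readable, recording bad areas in the map file
ddrescue -f /dev/md5 /dev/sdX rescue.map

# second pass: retry only the bad areas a few more times
ddrescue -f -r3 /dev/md5 /dev/sdX rescue.map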
Re: Is it possible to speed up unlink()?
On Thu, 20 Oct 2016 08:09:14 -0400 "Austin S. Hemmelgarn"wrote: > > So, it's possible to return unlink() early? or this a bad idea(and why)? > I may be completely off about this, but I could have sworn that unlink() > returns when enough info is on the disk that both: > 1. The file isn't actually visible in the directory. > 2. If the system crashes, the filesystem will know to finish the cleanup. As I understand it there is no fundamental reason why rm of a heavily fragmented file couldn't be exactly as fast as deleting a subvolume with only that single file in it. Remove the directory reference and instantly return success to userspace, continuing to clean up extents in the background. However for many uses that could be counter-productive, as scripts might expect the disk space to be freed up completely after the rm command returns (as they might need to start filling up the partition with new data). In snapshot deletion there are various commit modes built in for that purpose, but I'm not sure if you can easily extend POSIX file deletion to implement synchronous and non-synchronous deletion modes. * Try the 'unlink' program instead of 'rm'; if "just remove the dir entry for now" was implemented anywhere, I'd expect it to be via that. * Try doing 'eatmydata rm', but that's more of a crazy idea than anything else, as eatmydata only affects fsyncs, and I don't think rm is necessarily invoking those. -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] btrfs-progs: Add a command to show bg info
On Tue, 18 Oct 2016 09:39:32 +0800 Qu Wenruowrote: > > static const char * const cmd_inspect_inode_resolve_usage[] = { > > "btrfs inspect-internal inode-resolve [-v] ", > > "Get file system paths for the given inode", > > @@ -702,6 +814,8 @@ const struct cmd_group inspect_cmd_group = { > > 0 }, > > { "min-dev-size", cmd_inspect_min_dev_size, > > cmd_inspect_min_dev_size_usage, NULL, 0 }, > > + { "bg_analysis", cmd_inspect_bg_analysis, > > + cmd_inspect_bg_analysis_usage, NULL, 0 }, > > Just naming preference, IMHO show-block-groups or dump-block-groups > seems better for me. And in any case please don't mix separation by "-" and "_" in the same command string. In btrfs tool the convention is to separate words in subcommand names using "-". -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RAID system with adaption to changed number of disks
On Wed, 12 Oct 2016 15:19:16 -0400 Zygo Blaxell wrote: > I'm not even sure btrfs does this--I haven't checked precisely what > it does in dup mode. It could send both copies of metadata to the > disks with a single barrier to separate both metadata updates from > the superblock updates. That would be bad in this particular case. It would be bad in any case, including a single physical disk and no RAID, and I don't think there's any basis to speculate that mdadm doesn't implement write barriers properly. > In degraded RAID5/6 mode, all writes temporarily corrupt data, so if there > is an interruption (system crash, a disk times out, etc) in degraded mode, Moreover, in any non-COW system writes temporarily corrupt data. So again, writing to a (degraded or not) mdadm RAID5 is not much different than writing to a single physical disk. However, I believe in the Btrfs case metadata is always COW, so this particular problem may not be as relevant here in the first place. -- With respect, Roman
Re: RAID system with adaption to changed number of disks
On Tue, 11 Oct 2016 17:58:22 -0600 Chris Murphy wrote: > But consider the identical scenario with md or LVM raid5, or any > conventional hardware raid5. A scrub check simply reports a mismatch. > It's unknown whether data or parity is bad, so the bad data strip is > propagated upward to user space without error. On a scrub repair, the > data strip is assumed to be good, and good parity is overwritten with > bad. That's why I love to use Btrfs on top of mdadm RAID5/6 -- combining a mature and stable RAID implementation with the Btrfs anti-corruption checksumming "watchdog". In the case that you described, no silent corruption will occur, as Btrfs will report an uncorrectable read error -- and I can just restore the file in question from backups. On Wed, 12 Oct 2016 00:37:19 -0400 Zygo Blaxell wrote: > A btrfs -dsingle -mdup array on a mdadm raid[56] device might have a > snowball's chance in hell of surviving a disk failure on a live array > with only data losses. This would work if mdadm and btrfs successfully > arrange to have each dup copy of metadata updated separately, and one > of the copies survives the raid5 write hole. I've never tested this > configuration, and I'd test the heck out of it before considering > using it. Not sure what you mean here: a non-fatal disk failure (i.e. one still within what the redundancy can compensate for) is invisible to the upper layers on mdadm arrays. They do not need to "arrange" anything; on such a failure, from the point of view of Btrfs nothing whatsoever has happened to the /dev/mdX block device, it's still perfectly and correctly readable and writable. -- With respect, Roman
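For completeness, the layering described above is created in the usual way, with nothing Btrfs-specific about it (an illustrative 4-disk example; device names are placeholders):

# mdadm provides the parity RAID, Btrfs sits on top as a single-device filesystem
mdadm --create /dev/md0 --level=6 --raid-devices=4 /dev/sd[abcd]1
mkfs.btrfs -d single -m dup /dev/md0
mount /dev/md0 /mnt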
Re: csum failed during copy/compare
On Mon, 10 Oct 2016 10:44:39 +0100 Martin Dev wrote: > I work for system verification of SSDs and we've recently come up > against an issue with BTRFS on Ubuntu 16.04 > This seems to be a recent change ...well, a change in what? If you really didn't change anything on your machines or in the process used, there is no reason for anything to start breaking, other than obvious hardware issues from age/etc (likely not what's happening here). So you most likely did change something yourself, and perhaps that change was upgrading the OS version, the kernel version(!!!), or versions of software in general. As such, the first suggestion would be to go through the recent software update history, maybe even restore an OS image you used three months ago (if available) and confirm that the problem doesn't occur there. After that it's a process called bisecting; there are tools for that, but likely you don't even need those yet: just carefully note when you got which upgrades, paying the most attention to the kernel version, and note at which point the corruptions start to occur. > as the same process has been used for the last 2 years -- With respect, Roman
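If it does come down to a kernel regression, the standard tool for narrowing it down is git bisect between the last known-good and the first known-bad release, roughly as follows (the version tags here are only examples):

git bisect start
git bisect bad v4.8       # first version where the corruption shows up
git bisect good v4.4      # last version known to be fine
# build and boot the kernel git checks out, rerun the workload, then tell git:
git bisect good           # or: git bisect bad
# repeat until git names the first bad commit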