btrfs-progs top-level options man pages and usage
I have a cron job which frequently deletes a subvolume and I decided I
wanted to silence the output. I remembered there was a -q option and
thought I would just quickly glance at the documentation for it to check
there wasn't some reason I had not put that in the script when I first
wrote it some time ago.

That opened up an Alice-in-Wonderland rabbit hole... the documentation
for the common command options in btrfs-progs is not just awful, I ended
up very confused about what I was seeing...

There are several issues:

1) The man pages do not describe the top-level btrfs command options, or
their equivalents at subcommand level.

1a) btrfs.asciidoc makes no mention of -q, --quiet, -v, --verbose or
even of the concept of top-level btrfs command options. It just explains
how the subcommand structure works.

1b) btrfs-subvolume.asciidoc makes no mention of -q or --quiet. However,
it *does* mention -v and --verbose but with the completely cryptic (if
you are not a btrfs-progs developer) description "(deprecated) alias for
global '-v' option". What global '-v' option is that then as it is not
documented in btrfs.asciidoc? And what about '-q'?

I haven't looked at the other subcommand pages.

2) The built-in help in btrfs and btrfs-subvolume do not describe the
top level command options.

2a) 'btrfs' shows a usage message that shows the -q, --quiet, -v and
--verbose options but with no information on them. 'btrfs --help'
provides no further information. 'btrfs --help --full' does, but it is
almost 800 lines long.

2b) 'btrfs subvolume' doesn't even mention these options in its usage
message. Nor does it mention the --help option or the --help --full
option as global options or as subcommand options. 'btrfs subvolume
--help' provides exactly the same output. 'btrfs subvolume --help
--full' does at least mention the options - if anyone can ever guess
that it exists.

Again, I haven't looked at the other subcommands.

3) The difference between global options and subcommand options is very
unfortunate, and very confusing. I *hate* the concept of global options
(as implemented) -- in my mental model the 'btrfs' command is really
just a prefix to group together various btrfs-related commands and I
don't even *think* about ever inserting an option between btrfs and the
subcommand. In my mental model, the command is 'btrfs subvolume'. In my
mind, if the command was 'btrfs' then the syntax would more naturally be
'btrfs create subvolume' (like 'Siri, call David') instead of 'btrfs
subvolume create'.

3a) One particularly unfortunate effect is that 'btrfs subv del -v XXX'
works but 'btrfs subv del -q XXX' does not. I consider myself very
experienced with btrfs but even after drafting the first version of this
email I changed my script to do this and was surprised when it didn't
work.

3b) Another confusing effect is that both 'btrfs -v subv del XXX' and
'btrfs subv del -v XXX' work but 'btrfs subv -v del XXX' does not.

I think fixing the man page to document the global options is important
and I am happy to try to create a suitable patch. Does anyone else feel
the other issues should be fixed?

Graham
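For the cron job case described above, a minimal sketch of the placement that the usage message implies for -q: as a global option, between "btrfs" and the subcommand group. The function name quiet_subvol_delete and the BTRFS override variable are hypothetical, introduced here only so the command line can be previewed without touching a filesystem.

```shell
#!/bin/sh
# Sketch only: delete a subvolume quietly from a script.
# BTRFS is a hypothetical override (not part of btrfs-progs); set it to
# "echo btrfs" to print the command line instead of running it.
BTRFS="${BTRFS:-btrfs}"

quiet_subvol_delete() {
    # $1: path of the subvolume to delete.
    # -q is placed as a global option, i.e. before "subvolume";
    # per the report above, "subvolume delete -q" is not accepted.
    $BTRFS -q subvolume delete "$1"
}
```

Calling quiet_subvol_delete with BTRFS="echo btrfs" set shows the exact command that would be run, which may be a useful check before putting it in a crontab.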
Re: no memory is freed after snapshots are deleted
On 10/03/2021 12:07, telsch wrote:
> Dear devs,
>
> after my root partiton was full, i deleted the last monthly snapshots.
> however, no memory was freed.
> so far rebalancing helped:
>
> btrfs balance start -v -musage=0 /
> btrfs balance start -v -dusage=0 /
>
> i have deleted all snapshots, but no memory is being freed this time.

Don't forget that, in general, deleting a snapshot does nothing - if the
original files are still there (or any other snapshots of the same files
are still there). In my experience, if you *really* need space urgently
you are best off starting with deleting some big files *and* all the
snapshots containing them, rather than starting by deleting snapshots.

If you are doing balances with low space, I find it useful to watch
dmesg to see if the balance is hitting problems finding space to even
free things up.

However, one big advantage of btrfs is that you can easily temporarily
add a small amount of space while you sort things out. Just plug in a
USB memory stick, and add it to the filesystem using 'btrfs device add'.

I don't recommend leaving it as part of the filesystem for long - it is
too easy for the memory stick to fail, or for you to remove it
forgetting how important it is, but it can be useful when you are trying
to do things like remove snapshots and files or run balance. Don't
forget to use btrfs device remove to remove it - not just unplugging it!
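The "temporarily add a device, clean up, then remove it properly" sequence above can be sketched as a small wrapper. This is only an illustration under stated assumptions: with_temp_device and the BTRFS override variable are hypothetical names, and the device and mount point are whatever you pass in.

```shell
#!/bin/sh
# Sketch only: run a cleanup command while a temporary device is part of
# the filesystem, then remove the device again with "btrfs device remove".
# BTRFS is a hypothetical override; "echo btrfs" gives a dry run.
BTRFS="${BTRFS:-btrfs}"

with_temp_device() {
    # $1: temporary block device, $2: mount point, rest: cleanup command
    dev=$1
    mnt=$2
    shift 2
    $BTRFS device add "$dev" "$mnt" || return 1
    # Run the cleanup (snapshot deletion, balance, ...) and remember
    # whether it succeeded.
    "$@"
    status=$?
    # Always try to remove the device, even if the cleanup failed, so
    # the stick is never simply unplugged while still in the filesystem.
    $BTRFS device remove "$dev" "$mnt" || return 1
    return "$status"
}
```

For example, with_temp_device /dev/sdX / btrfs balance start -v -dusage=0 / would wrap one of the balances quoted above (/dev/sdX is a placeholder). If the remove step fails, for instance because there is still not enough room on the original device, the stick has to stay plugged in until it succeeds.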
Re: Re: nfs subvolume access?
On 10/03/2021 08:09, Ulli Horlacher wrote:
> On Wed 2021-03-10 (07:59), Hugo Mills wrote:
>
>>> On tsmsrvj I have in /etc/exports:
>>>
>>> /data/fex tsmsrvi(rw,async,no_subtree_check,no_root_squash)
>>>
>>> This is a btrfs subvolume with snapshots:
>>>
>>> root@tsmsrvj:~# btrfs subvolume list /data
>>> ID 257 gen 35 top level 5 path fex
>>> ID 270 gen 36 top level 257 path fex/spool
>>> ID 271 gen 21 top level 270 path fex/spool/.snapshot/2021-03-07_1453.test
>>> ID 272 gen 23 top level 270 path fex/spool/.snapshot/2021-03-07_1531.test
>>> ID 273 gen 25 top level 270 path fex/spool/.snapshot/2021-03-07_1532.test
>>> ID 274 gen 27 top level 270 path fex/spool/.snapshot/2021-03-07_1718.test
>>>
>>> root@tsmsrvj:~# find /data/fex | wc -l
>>> 489887
>
>>I can't remember if this is why, but I've had to put a distinct
>> fsid field in each separate subvolume being exported:
>>
>> /srv/nfs/home -rw,async,fsid=0x1730,no_subtree_check,no_root_squash
>
> I must export EACH subvolume?!

I have had similar problems. I *think* the current case is that modern
NFS, using NFS V4, can cope with the whole disk being accessible without
giving each subvolume its own FSID (which I have stopped doing).

HOWEVER, I think that find (and anything else which uses fsids and inode
numbers) will see subvolumes as having duplicated inodes.

> The snapshots are generated automatically (via cron)!
> I cannot add them to /etc/exports

Well, you could write some scripts... but I don't think it is necessary.
I *think* it is only necessary if you want `find` to be able to cross
between subvolumes on the NFS mounted disks.

However, I am NOT an NFS expert, nor have I done a lot of work on this.
I might be wrong. But I do NFS mount my snapshots disk remotely and use
it. And I do see occasional complaints from find, but I live with it.
Re: Large multi-device BTRFS array (usually) fails to mount on boot.
On 19/02/2021 17:42, Joshua wrote:
> February 3, 2021 3:16 PM, "Graham Cobb" wrote:
>
>> On 03/02/2021 21:54, jos...@mailmag.net wrote:
>>
>>> Good Evening.
>>>
>>> I have a large BTRFS array, (14 Drives, ~100 TB RAW) which has been
>>> having problems mounting on boot without timing out. This causes the
>>> system to drop to emergency mode. I am then able to mount the array
>>> in emergency mode and all data appears fine, but upon reboot it
>>> fails again.
>>>
>>> I actually first had this problem around a year ago, and initially
>>> put considerable effort into extending the timeout in systemd, as I
>>> believed that to be the problem. However, all the methods I
>>> attempted did not work properly or caused the system to continue
>>> booting before the array was mounted, causing all sorts of issues.
>>> Eventually, I was able to almost completely resolve it by
>>> defragmenting the extent tree and subvolume tree for each subvolume.
>>> (btrfs fi defrag /mountpoint/subvolume/) This seemed to reduce the
>>> time required to mount, and made it mount on boot the majority of
>>> the time.
>>
>> Not what you asked, but adding "x-systemd.mount-timeout=180s" to the
>> mount options in /etc/fstab works reliably for me to extend the timeout.
>> Of course, my largest filesystem is only 20TB, across only two devices
>> (two lvm-over-LUKS, each on separate physical drives) but it has very
>> heavy use of snapshot creation and deletion. I also run with commit=15
>> as power is not too reliable here and losing power is the most frequent
>> cause of a reboot.
>
> Thanks for the suggestion, but I have not been able to get this method to
> work either.
>
> Here's what my fstab looks like, let me know if this is not what you meant!
>
> UUID={snip} / ext4 errors=remount-ro 0 0
> UUID={snip} /mnt/data btrfs
> defaults,noatime,compress-force=zstd:2,x-systemd.mount-timeout=300s 0 0

Hmmm. The line from my fstab is:

LABEL=lvmdata /mnt/data btrfs defaults,subvolid=0,noatime,nodiratime,compress=lzo,skip_balance,commit=15,space_cache=v2,x-systemd.mount-timeout=180s,nofail 0 3

I note that I do have "nofail" in there, although it doesn't fail for me
so I assume it shouldn't make a difference.

I can't swear that the disk is currently taking longer to mount than the
systemd default (and I will not be in a position to reboot this system
any time soon to check). But I am quite sure this made a difference when
I added it.

Not sure why it isn't working for you, unless it is some systemd
problem. It isn't systemd giving up and dropping to emergency because of
some other startup problem that occurs before the mount is finished, is
it? I could believe systemd cancels any mounts in progress when that
happens.

Graham
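One cheap thing to rule out when the timeout option seems to have no effect is whether it really sits in the options field (the fourth field) of the fstab entry. The sketch below is only an illustration; the function names fstab_options and has_mount_timeout are made up here, and it says nothing about what systemd does with the option afterwards.

```shell
#!/bin/sh
# Sketch only: check that an fstab entry carries x-systemd.mount-timeout.

# Print the options field (4th field) of the entry in file $1 whose
# mount point (2nd field) is $2. Comment lines are skipped.
fstab_options() {
    awk -v mp="$2" '$1 !~ /^#/ && $2 == mp { print $4 }' "$1"
}

# Succeed if that options field contains an x-systemd.mount-timeout=...
# setting; fail if the entry is missing or the option is not on it.
has_mount_timeout() {
    fstab_options "$1" "$2" | tr ',' '\n' |
        grep -q '^x-systemd\.mount-timeout='
}
```

Usage would be along the lines of has_mount_timeout /etc/fstab /mnt/data. If an entry were genuinely split over two lines in the file (the quoted fstab above looks that way, though that may just be mail wrapping), the options field would come out empty and the check would fail.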
Re: Large multi-device BTRFS array (usually) fails to mount on boot.
On 03/02/2021 21:54, jos...@mailmag.net wrote:
> Good Evening.
>
> I have a large BTRFS array, (14 Drives, ~100 TB RAW) which has been having
> problems mounting on boot without timing out. This causes the system to drop
> to emergency mode. I am then able to mount the array in emergency mode and
> all data appears fine, but upon reboot it fails again.
>
> I actually first had this problem around a year ago, and initially put
> considerable effort into extending the timeout in systemd, as I believed that
> to be the problem. However, all the methods I attempted did not work properly
> or caused the system to continue booting before the array was mounted,
> causing all sorts of issues. Eventually, I was able to almost completely
> resolve it by defragmenting the extent tree and subvolume tree for each
> subvolume. (btrfs fi defrag /mountpoint/subvolume/) This seemed to reduce the
> time required to mount, and made it mount on boot the majority of the time.
>

Not what you asked, but adding "x-systemd.mount-timeout=180s" to the
mount options in /etc/fstab works reliably for me to extend the timeout.
Of course, my largest filesystem is only 20TB, across only two devices
(two lvm-over-LUKS, each on separate physical drives) but it has very
heavy use of snapshot creation and deletion. I also run with commit=15
as power is not too reliable here and losing power is the most frequent
cause of a reboot.
Re: [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata
On 23/01/2021 17:21, Zygo Blaxell wrote:
> On Sat, Jan 23, 2021 at 02:55:52PM +0000, Graham Cobb wrote:
>> On 22/01/2021 22:42, Zygo Blaxell wrote:
>> ...
>>>> So the point is: what happens if the root subvolume is not mounted ?
>>>
>>> It's not an onerous requirement to mount the root subvol. You can do (*)
>>>
>>> tmp="$(mktemp -d)"
>>> mount -osubvolid=5 /dev/btrfs "$tmp"
>>> setfattr -n 'btrfs...' -v... "$tmp"
>>> umount "$tmp"
>>> rmdir "$tmp"
>>
>> No! I may have other data on that disk which I do NOT want to become
>> accessible to users on this system (even for a short time). As a simple
>> example, imagine, a disk I carry around to take emergency backups of
>> other systems, but I need to change this attribute to make the emergency
>> backup of this system run as quickly as possible before the system dies.
>> Or a disk used for audit trails, where users should not be able to
>> modify their earlier data. Or where I suspect a security breach has
>> occurred. I need to be able to be confident that the only data
>> accessible on this system is data in the specific subvolume I have mounted.
>
> Those are worthy goals, but to enforce them you'll have to block or filter
> the mount syscall with one of the usual sandboxing/container methods.
>
> If you have that already set up, you can change root subvol xattrs from
> the supervisor side. No users will have access if you don't give them
> access to the root subvol or the mount syscall on the restricted side
> (they might also need a block device FD belonging to the filesystem).
>
> If you don't have the sandbox already set up, then root subvol access
> is a thing your users can already do, and it may be time to revisit the
> assumptions behind your security architecture.

I'm not talking about root. I mean unpriv'd users (who can't use mount)!
If I (as root) mount the whole disk, those users may be able to read or
modify files from parts of that disk to which they should not have
access. That may be why I am not mounting the whole disk in the first
place.

I gave a few very simple examples, but I can think of many more cases
where a disk may contain files which users might be able to access if
the disk was mounted (maybe the disk has subvols used by many different
systems but UIDs are not coordinated, or ...). And, of course, if they
can open a FD during the brief time it is mounted, they can stop it
being unmounted again.

No. If I have chosen to mount just a subvol, it is because I don't want
to mount the whole disk.
Re: [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata
On 22/01/2021 22:42, Zygo Blaxell wrote:
...
>> So the point is: what happens if the root subvolume is not mounted ?
>
> It's not an onerous requirement to mount the root subvol. You can do (*)
>
> tmp="$(mktemp -d)"
> mount -osubvolid=5 /dev/btrfs "$tmp"
> setfattr -n 'btrfs...' -v... "$tmp"
> umount "$tmp"
> rmdir "$tmp"

No! I may have other data on that disk which I do NOT want to become
accessible to users on this system (even for a short time). As a simple
example, imagine a disk I carry around to take emergency backups of
other systems, but I need to change this attribute to make the emergency
backup of this system run as quickly as possible before the system dies.
Or a disk used for audit trails, where users should not be able to
modify their earlier data. Or where I suspect a security breach has
occurred. I need to be able to be confident that the only data
accessible on this system is data in the specific subvolume I have
mounted.

Also, the backup problem is a very real problem - abusing xattrs for
filesystem controls really screws up writing backup procedures to
correctly back up xattrs used to describe or manage data (or for any
other purpose).

I suppose btrfs can internally store it in an xattr if it wants, as long
as any values set are just ignored and changes happen through some other
operation (e.g. sysfs). It still might confuse programs like rsync which
would try to reset the values each time a sync is done.
NVME experience?
I am about to deploy my first btrfs filesystems on NVME. Does anyone
have any hints or advice? Initially they will be root disks, but I am
thinking about also moving home disks and other frequently used data to
NVME, but probably not backups and other cold data.

I am mostly wondering about non-functionals like typical failure modes,
life and wear, etc., which might affect decisions like how to split
different filesystems among the drives, whether to mix NVME with other
drives (SSD or HDD), etc.

Are NVME drives just SSDs with a different interface? With similar
lifetimes (by bytes written, or another measure)? And similar typical
failure modes?

Are they better or worse in terms of firmware bugs? Error detection for
corrupted data? SMART or other indicators that they are starting to fail
and should be replaced?

I assume that they do not (in practice) suffer from "faulty cable"
problems.

Anyway, I am hoping someone has experiences to share which might be
useful.

Graham
Re: [PATCH 2/2] btrfs: send: fix invalid commands for inodes with changed type but same gen
On 12/01/2021 11:27, Filipe Manana wrote:
> ...
> In other words, what I think we should have is a check that forbids
> using two roots for an incremental send that are not snapshots of the
> same subvolume (have different parent uuids).

Are you suggesting that rule should also apply for clone sources (-c
option)? Or are clone sources handled differently from the parent in the
code?
Re: Re: Re: cloning a btrfs drive with send and receive: clone is bigger than the original?
On 10/01/2021 07:41, cedric.dew...@eclipso.eu wrote:
> I've tested some more.
>
> Repeatedly sending the difference between two consecutive snapshots creates a
> structure on the target drive where all the snapshots share data. So 10
> snapshots of 10 files of 100MB takes up 1GB, as expected.
>
> Repeatedly sending the difference between the first snapshot and each next
> snapshot creates a structure on the target drive where the snapshots are
> independent, so they don't share any data. How can that be avoided?

If you send a snapshot B with a parent A, any files not present in A
will be created in the copy of B. The fact that you already happen to
have a copy of the files somewhere else on the target is not known to
either the sender or the receiver - how would it be?

If you want the send process to take into account *other* snapshots that
have previously been sent, you need to tell send to also use those
snapshots as clone sources. That is what the -c option is for.

Alternatively, use a deduper on the destination after the receive has
finished and let it work out what can be shared.
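As a rough illustration of the -c suggestion above: keep the first snapshot as the parent (-p) and name every snapshot that has already been sent as a clone source. The helper name send_with_clones and the BTRFS override variable are hypothetical, and the sketch assumes the listed snapshots are read-only and were all received on the target earlier.

```shell
#!/bin/sh
# Sketch only: incremental send against a parent, with extra clone sources.
# BTRFS is a hypothetical override; "echo btrfs" prints the command line.
BTRFS="${BTRFS:-btrfs}"

send_with_clones() {
    # $1: parent snapshot, $2: snapshot to send,
    # remaining arguments: previously sent snapshots to use as clone sources
    parent=$1
    snap=$2
    shift 2
    # Rewrite each remaining argument X into the pair "-c X".
    n=$#
    while [ "$n" -gt 0 ]; do
        set -- "$@" -c "$1"
        shift
        n=$((n - 1))
    done
    $BTRFS send -p "$parent" "$@" "$snap"
}
```

In real use the output would be piped into btrfs receive on the target, for example send_with_clones /snaps/day1 /snaps/day4 /snaps/day2 /snaps/day3 | btrfs receive /backup (all paths here are placeholders).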
Re: synchronize btrfs snapshots over a unreliable, slow connection
On 07/01/2021 03:09, Zygo Blaxell wrote:
...
> I would only attempt to put the archives into long-term storage after
> verifying that they produce correct output when fed to btrfs receive;
> otherwise, you could find out too late that a months-old archive was
> damaged, incomplete, or incorrect, and restores after that point are no
> longer possible.
>
> Once that verification has been done and the subvol is no longer needed
> for incremental sends, you can delete the subvol and keep the archive(s)
> that produced it.

Personally, I wouldn't do that. Particularly if this was my only or main
backup. I don't think btrfs has many tests that new versions of
"receive" can correctly process old archives - let alone an incremental
sequence of them generated by versions of "send" with bugs fixed years
before.

If it was me, I would always keep the "latest" subvol online, or at
least as a newly created full (not incremental) send archive.
Re: synchronize btrfs snapshots over a unreliable, slow connection
On 05/01/2021 08:34, Forza wrote:
>
>
> On 2021-01-04 21:51, cedric.dew...@eclipso.eu wrote:
>> I have a master NAS that makes one read only snapshot of my data per
>> day. I want to transfer these snapshots to a slave NAS over a slow,
>> unreliable internet connection. (it's a cheap provider). This rules
>> out a "btrfs send -> ssh -> btrfs receive" construction, as that can't
>> be resumed.
>>
>> Therefore I want to use rsync to synchronize the snapshots on the
>> master NAS to the slave NAS.
>>
>> My thirst thought is something like this:
>> 1) create a read-only snapshot on the master NAS:
>> btrfs subvolume snapshot -r /mnt/nas/storage
>> /mnt/nas/storage_snapshots/storage-$(date +%Y_%m_%d-%H%m)
>> 2) send that data to the slave NAS like this:
>> rsync --partial -var --compress --bwlimit=500KB -e "ssh -i
>> ~/slave-nas.key" /mnt/nas/storage_snapshots/storage-$(date
>> +%Y_%m_%d-%H%m) cedric@123.123.123.123/nas/storage
>> 3) Restart rsync until all data is copied (by checking the error code
>> of rsync, is it's 0 then all data has been transferred)
>> 4) Create the read-only snapshot on the slave NAS with the same name
>> as in step 1.

Seems like a reasonable approach to me, but see comment below.

>> Does somebody already has a script that does this?

I don't.

>> Is there a problem with this approach that I have not yet considered?

Not a problem as such, but you could also consider using something like
rsnapshot (or reimplementing your own version by using rsync
--link-dest) instead of relying on btrfs snapshots on the slave NAS.
That way you don't need btrfs on that NAS at all if you don't want. I
used that approach as the (old) NAS I was using had a very old linux
version and didn't even run btrfs.

> One option is to store the send stream as a compressed file and rsync
> that file over and do a shasum or similar on it.

I have looked into that in the past and eventually decided against it.
My main concern was being too reliant on very complex and less used
features of btrfs, including one which has had several bugs in the past
(send/receive). I decided my backups needed to be reliable and robust
more than they need to be optimally efficient.

I had even considered just saving the original send stream, and the
subsequent incremental sends (all compressed) - until I realised that
any tiny corruption or bug in even one of those streams could make the
later streams completely unrestorable.

In the end, I decided to use a very boring (but powerful and
well-maintained), widely used, conventional backup tool (specifically,
dar, under the control of the dar_automatic_backup script) and I copy
the dar archives (compressed and encrypted) onto my offsite backup
server (actually, now, I store them in S3, using rclone). They are also
convenient to occasionally put on a disk which I can give to a friend to
put at the back of their cupboard somewhere in case I need it (faster
and cheaper to access than S3)!

In my case, I had some spare disks and plenty of bandwidth so I also use
rsnapshot from my onsite NAS to an offsite NAS. But that is for
convenience (not having to have dar read through all the archives) - I
consider the S3 dar archives my "main" disaster-recovery backup.

> btrbk[2] is a Btrfs backup tool that also can store snapshots as
> archives on remote location. You may want to have a look at that too.

I use btrbk for local snapshots (storing snapshots of all my systems on
my main server system). But I consider those convenient copies for
restoring files deleted by mistake, or restoring earlier configurations,
not backups (for example, a serious electrical problem or fire in the
server machine could destroy both the original disk and the snapshot
disk).

Your situation is different, of course - so just some things to
consider.
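Step 3 of the quoted plan (restart rsync until it exits with 0) can be sketched as a generic retry wrapper. The name retry_until_ok and its argument order are made up for this illustration; the attempt limit and the pause are choices for the sketch, not anything rsync requires.

```shell
#!/bin/sh
# Sketch only: re-run a command until it exits with status 0, giving up
# after a maximum number of attempts.
retry_until_ok() {
    # $1: maximum attempts, $2: seconds to wait between attempts,
    # rest: the command to run (for example an rsync --partial invocation)
    max=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        # Success: stop retrying.
        "$@" && return 0
        # Out of attempts: report failure to the caller.
        [ "$attempt" -ge "$max" ] && return 1
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}
```

Something like retry_until_ok 50 60 rsync --partial -a SRC DEST (with SRC and DEST as placeholders) would then keep resuming the transfer, and the snapshot on the receiving side would only be created once the wrapper returns 0.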
Re: synchronize a btrfs snapshot via network with resume support?
On 01/01/2021 14:42, cedric.dew...@eclipso.eu wrote:
...
> I'm looking for a program that can synchronize a btrfs snapshot via a
> network, and supports resuming of interrupted transfers.

Not an answer to your question... the way I would solve your problem is
to do the "btrfs send" to a local file on master, reliably transfer the
file, then do the "btrfs receive" on slave from the file. Reliable file
transfer programs exist (you can even just use rsync --inplace if I
remember correctly).

Unfortunately, that requires that you have enough space on both ends to
store the btrfs-send file.

>
> Buttersink [1] claims it can do Resumable, checksummed multi-part uploads to
> S3 as a back-end, but not between two PC's via ssh.

I don't use Buttersink. But if it can do that without storing the file
locally on either end (which I would be slightly surprised about) then
it sounds like that might be the way to do it: essentially you would be
paying AWS for the temporary filespace needed.

Graham
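The "send to a file, transfer the file, receive from the file" idea above might look roughly like this. The function names and the BTRFS override variable are hypothetical; the -f option of btrfs send and btrfs receive is used to write and read the stream file.

```shell
#!/bin/sh
# Sketch only: split send/receive into two steps around a stream file so
# that the network transfer in between can be resumed.
# BTRFS is a hypothetical override; "echo btrfs" prints the command line.
BTRFS="${BTRFS:-btrfs}"

# On the sending machine: write the stream for read-only snapshot $1
# into the file $2.
send_to_file() {
    $BTRFS send -f "$2" "$1"
}

# On the receiving machine: replay the stream file $1 into directory $2.
receive_from_file() {
    $BTRFS receive -f "$1" "$2"
}
```

Between the two steps the file can be copied with any resumable tool (for example rsync --partial), and comparing a sha256sum of the file on both ends before running the receive is a cheap sanity check. As noted above, both machines need enough free space to hold the stream file.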
Re: AW: WG: How to properly setup for snapshots
On 21/12/2020 20:45, Claudius Ellsel wrote:
> I had a closer look at snapper now and have installed and set it up. This
> seems to be really the easiest way for me, I guess. My main confusion was
> probably that I was unsure whether I had to create a subvolume prior to this
> or not, which got sorted out now. The situation is apparently still not
> ideal, as to my current understanding it would have been preferable to set up
> a subvolume first at root level directly after creating the files system.
> However, as this is only the data drive in the machine (OS runs on an ext4
> SSD) it is at least not planned to simply swap the entire volume to a
> snapshot but rather to restore snapshots at file / folder level, where
> snapper can also be used for.
>
> One can possibly also safely achieve the same with only using btrfs
> commandline tools (I made the mistake of not thinking about read only
> snapshots when I wrote about my fear to delete / modify mounted snapshots).
> Still I have a better feeling when using snapper, as I hope it is less easy
> to screw things up with it :)

I have never used snapper but I know it is a popular tool. But there are
many other tools - with their own advantages, disadvantages and ways of
working. You may want to experiment with several of them. Personally I
use btrbk for automation of snapshots.

Just remember that btrfs snapshots are not a backup tool. At best, they
are a convenience tool, for quickly restoring deleted or modified files
(or full subvolumes) to an earlier (snapshotted) state without having to
access your backups. But they are completely useless if a hardware or
software problem (kernel bug, disk problem, system memory error, faulty
cable, etc) corrupts the filesystem.

You don't have a backup solution unless you are copying the files to
another physical disk, preferably on a different system. btrbk can help
with that as well (but it is just automating btrfs send and btrfs
receive underneath).
Re: MD RAID 5/6 vs BTRFS RAID 5/6
On 17/10/2019 16:57, Chris Murphy wrote:
> On Wed, Oct 16, 2019 at 10:07 PM Jon Ander MB
> wrote:
>>
>> It would be interesting to know the pros and cons of this setup that
>> you are suggesting vs zfs.
>> +zfs detects and corrects bitrot (
>> http://www.zfsnas.com/2015/05/24/testing-bit-rot/ )
>> +zfs has working raid56
>> -modules out of kernel for license incompatibilities (a big minus)
>>
>> BTRFS can detect bitrot but... are we sure it can fix it? (can't seem
>> to find any conclusive doc about it right now)
>
> Yes. Active fixups with scrub since 3.19. Passive fixups since 4.12.

Presumably this is dependent on checksums? So neither detection nor
fixup happen for NOCOW files? Even a scrub won't notice because scrub
doesn't attempt to compare both copies unless the first copy has a bad
checksum -- is that correct?

>
>> I'm one of those that is waiting for the write hole bug to be fixed in
>> order to use raid5 on my home setup. It's a shame it's taking so long.
>
> For what it's worth, the write hole is considered to be rare.
> https://lwn.net/Articles/665299/
>
> Further, the write hole means a) parity is corrupt or stale compared
> to data stripe elements which is caused by a crash or powerloss during
> writes, and b) subsequently there is a missing device or bad sector in
> the same stripe as the corrupt/stale parity stripe element. The effect
> of b) is that reconstruction from parity is necessary, and the effect
> of a) is that it's reconstructed incorrectly, thus corruption. But
> Btrfs detects this corruption, whether it's metadata or data. The
> corruption isn't propagated in any case. But it makes the filesystem
> fragile if this happens with metadata. Any parity stripe element
> staleness likely results in significantly bad reconstruction in this
> case, and just can't be worked around, even btrfs check probably can't
> fix it. If the write hole problem happens with data block group, then
> EIO. But the good news is that this isn't going to result in silent
> data or file system metadata corruption. For sure you'll know about
> it.

If I understand correctly, metadata always has checksums so that is true
for filesystem structure. But for no-checksum files (such as nocow
files) the corruption will be silent, won't it?

Graham
Re: [Not TLS] Re: [PATCH 3/4] btrfs: include non-missing as a qualifier for the latest_bdev
On 04/10/2019 09:11, Nikolay Borisov wrote:
>
>
> On 4.10.19 г. 10:50 ч., Anand Jain wrote:
>> btrfs_free_extra_devids() reorgs fs_devices::latest_bdev
>> to point to the bdev with greatest device::generation number.
>> For a typical-missing device the generation number is zero so
>> fs_devices::latest_bdev will never point to it.
>>
>> But if the missing device is due to alienating [1], then
>> device::generation is not-zero and if it is >= to rest of
>> device::generation in the list, then fs_devices::latest_bdev
>> ends up pointing to the missing device and reports the error
>> like this [2]
>>
>> [1]
>> mkfs.btrfs -fq /dev/sdd && mount /dev/sdd /btrfs
>> mkfs.btrfs -fq -draid1 -mraid1 /dev/sdb /dev/sdc
>> sleep 3 # avoid racing with udev's useless scans if needed
>> btrfs dev add -f /dev/sdb /btrfs
>
> Hm, here I think the correct way is to refuse adding /dev/sdb to an
> existing fs if it's detected to be part of a different one. I.e it
> should require wipefs to be done.

I disagree. -f means "force overwrite of existing filesystem on the
given disk(s)". It shouldn't be any different whether the existing fs is
btrfs or something else.

Graham
Intro to fstests environment?
Hi,

I seem to have another case where scrub gets confused when it is
cancelled and restarted many times (or, maybe, it is my error or
something). I will look into it further but, instead of just hacking
away at my script to work out what is going on, I thought I might try to
create a regression test for it this time.

I have read the kdave/fstests READMEs and the wiki. Is there any other
documentation or advice I should read? Of course, I will look at
existing test scripts as well.

I don't suppose anyone has a convenient VM image or similar which I can
use as a starting point?

Graham
Re: [PATCH v2 RESEND] btrfs-progs: add verbose option to btrfs device scan
On 01/10/2019 08:52, Anand Jain wrote:
> Ping?
>
>
> On 9/25/19 4:07 PM, Anand Jain wrote:
>> To help debug device scan issues, add verbose option to btrfs device
>> scan.
>>
>> Signed-off-by: Anand Jain
>> ---
>> v2: Use bool instead of int as a btrfs_scan_device() argument.
>>
>> cmds/device.c | 8 ++--
>> cmds/filesystem.c | 2 +-
>> common/device-scan.c | 4 +++-
>> common/device-scan.h | 3 ++-
>> common/utils.c | 2 +-
>> disk-io.c | 2 +-
>> 6 files changed, 14 insertions(+), 7 deletions(-)
>>
>> diff --git a/cmds/device.c b/cmds/device.c
>> index 24158308a41b..9b715ffc42a3 100644
>> --- a/cmds/device.c
>> +++ b/cmds/device.c
>> @@ -313,6 +313,7 @@ static int cmd_device_scan(const struct cmd_struct
>> *cmd, int argc, char **argv)
>> int all = 0;
>> int ret = 0;
>> int forget = 0;
>> + bool verbose = false;
>> optind = 0;
>> while (1) {
>> @@ -323,7 +324,7 @@ static int cmd_device_scan(const struct cmd_struct
>> *cmd, int argc, char **argv)
>> { NULL, 0, NULL, 0}
>> };
>> - c = getopt_long(argc, argv, "du", long_options, NULL);
>> + c = getopt_long(argc, argv, "duv", long_options, NULL);
>> if (c < 0)
>> break;
>> switch (c) {
>> @@ -333,6 +334,9 @@ static int cmd_device_scan(const struct cmd_struct
>> *cmd, int argc, char **argv)
>> case 'u':
>> forget = 1;
>> break;
>> + case 'v':
>> + verbose = true;
>> + break;
>> default:
>> usage_unknown_option(cmd, argv);
>> }
>> @@ -354,7 +358,7 @@ static int cmd_device_scan(const struct cmd_struct
>> *cmd, int argc, char **argv)
>> }
>> } else {
>> printf("Scanning for Btrfs filesystems\n");
>> - ret = btrfs_scan_devices();
>> + ret = btrfs_scan_devices(verbose);
>> error_on(ret, "error %d while scanning", ret);
>> ret = btrfs_register_all_devices();
>> error_on(ret,

Shouldn't "--verbose" be accepted as a long version of the option? That
would mean adding it to long_options.

The usage message cmd_device_scan_usage needs to be updated to include
the new option(s).

I have tested this on my systems (4.19 kernel) and it not only works
well, it is useful to get the list of devices it finds. If you wish,
feel free to add:

Tested-by: Graham Cobb

Graham

>> diff --git a/cmds/filesystem.c b/cmds/filesystem.c
>> index 4f22089abeaa..02d47a43a792 100644
>> --- a/cmds/filesystem.c
>> +++ b/cmds/filesystem.c
>> @@ -746,7 +746,7 @@ devs_only:
>> else
>> ret = 1;
>> } else {
>> - ret = btrfs_scan_devices();
>> + ret = btrfs_scan_devices(false);
>> }
>> if (ret) {
>> diff --git a/common/device-scan.c b/common/device-scan.c
>> index 48dbd9e19715..a500edf0f7d7 100644
>> --- a/common/device-scan.c
>> +++ b/common/device-scan.c
>> @@ -360,7 +360,7 @@ void free_seen_fsid(struct seen_fsid
>> *seen_fsid_hash[])
>> }
>> }
>> -int btrfs_scan_devices(void)
>> +int btrfs_scan_devices(bool verbose)
>> {
>> int fd = -1;
>> int ret;
>> @@ -389,6 +389,8 @@ int btrfs_scan_devices(void)
>> continue;
>> /* if we are here its definitely a btrfs disk*/
>> strncpy_null(path, blkid_dev_devname(dev));
>> + if (verbose)
>> + printf("blkid: btrfs device: %s\n", path);
>> fd = open(path, O_RDONLY);
>> if (fd < 0) {
>> diff --git a/common/device-scan.h b/common/device-scan.h
>> index eda2bae5c6c4..3e473c48d1af 100644
>> --- a/common/device-scan.h
>> +++ b/common/device-scan.h
>> @@ -1,6 +1,7 @@
>> #ifndef __DEVICE_SCAN_H__
>> #define __DEVICE_SCAN_H__
>> +#include
>> #include "kerncompat.h"
>> #include "ioctl.h"
>> @@ -29,7 +30,7 @@ struct seen_fsid {
>> int fd;
>> };
>> -int btrfs_scan_devices(void);
>> +int btrfs_scan_devices(bool verbose);
>> int btrfs_register_one_device(const char *fname);
>> int btrfs_register_all_devices(void);
>> int btrfs_add_to_fsid(struct btrfs_trans_handle *trans,
Re: BTRFS Raid5 error during Scrub.
On 29/09/2019 22:38, Robert Krig wrote: > I'm running Debian Buster with Kernel 5.2. > Btrfs-progs v4.20.1 I am running Debian testing (bullseye) and have chosen not to install the 5.2 kernel yet because the version of it in bullseye (linux-image-5.2.0-2-amd64) is based on 5.2.9 and (as far as I can tell) does not contain the BTRFS corruption fix. I do not know which version of the 5.2 kernel is in buster but you may want to check that it contains a backport of the BTRFS fix or downgrade to the 4.19 kernel until you can be sure. I note that linux-image-5.2.0-3-amd64 is in unstable and is based on 5.2.17 so should have the fix. I presume it will make its way to testing soon. If anyone can confirm which versions of the Debian kernel package the 5.2 corruption fixes are in, it would be helpful.
Re: Feature requests: online backup - defrag - change RAID level
On 09/09/2019 13:18, Qu Wenruo wrote: > > > On 2019/9/9 下午7:25, zedlr...@server53.web-hosting.com wrote: >> What I am complaining about is that at one point in time, after issuing >> the command: >> btrfs balance start -dconvert=single -mconvert=single >> and before issuing the 'btrfs delete', the system could be in a too >> fragile state, with extents unnecesarily spread out over two drives, >> which is both a completely unnecessary operation, and it also seems to >> me that it could be dangerous in some situations involving potentially >> malfunctioning drives. > > In that case, you just need to replace that malfunctioning device other > than fall back to SINGLE. Actually, this case is the (only) one of the three that I think would be very useful (backup is better handled by having a choice of userspace tools to choose from - I use btrbk - and does anyone really care about defrag any more?). I did, recently, have a case where I had started to move my main data disk to a raid1 setup but my new disk started reporting errors. I didn't have a spare disk (and didn't have a spare SCSI slot for another disk anyway). So, I wanted to stop using the new disk and revert to my former (m=dup, d=single) setup as quickly as possible. I spent time trying to find a way to do that balance without risking the single copy of some of the data being stored on the failing disk between starting the balance and completing the remove. That has two problems: obviously having the single copy on the failing disk is bad news but, also, it increases the time taken for the subsequent remove which has to copy that data back to the remaining disk (where there used to be a perfectly good copy which was subsequently thrown away during the balance). In the end, I took the risk and the time of the two steps. 
In my case, I had good backups, and actually most of my data was still in a single profile on the old disk (because the errors started happening before I had done the balance to change the profile of all the old data from single to raid1). But a balance -dconvert=single-but-force-it-to-go-on-disk-1 would have been useful. (Actually a "btrfs device mark-for-removal" command would be better: allow a failing device to be retained for a while, and used to provide data, but ignore it when looking to store data.) Graham
Re: Massive filesystem corruption since kernel 5.2 (ARCH)
On 30/07/2019 23:44, Swâmi Petaramesh wrote: > Still, losing a given FS with subvols, snapshots etc, may be very > annoying and very time consuming rebuilding. I believe that in one of the earlier mails, Qu said that you can probably mount the corrupted fs readonly and read everything. If that is the case then, if I were in your position, I would probably buy another disk, create a new fs, and then use one of the subvol-preserving btrfs clone utilities to clone the readonly disk onto the new disk. Not cheap, and it would still take some time, but at least it could be automated.
Re: "btrfs: harden agaist duplicate fsid" spams syslog
On 12/07/2019 14:35, Patrik Lundquist wrote: > On Fri, 12 Jul 2019 at 14:48, Anand Jain wrote: >> I am unable to reproduce, I have tried with/without dm-crypt on both >> oraclelinux and opensuse (I am yet to try debian). > > I'm using Debian testing 4.19.0-5-amd64 without problem. Raid1 with 5 > LUKS disks. Mounting with the UUID but not(!) automounted. > > Running "btrfs device scan --all-devices" doesn't trigger the fsid move. Thanks Patrik. So is it something to do with LVM, not dm-crypt, I wonder? Note that in my case I am using LVM-over-dm-crypt, and it is the LV that I mount, not the encrypted partition.
Re: "btrfs: harden agaist duplicate fsid" spams syslog
On 12/07/2019 13:46, Anand Jain wrote: > I am unable to reproduce, I have tried with/without dm-crypt on both > oraclelinux and opensuse (I am yet to try debian). I understand. I am going to be away for a week but I am happy to look into trying to create a smaller reproducer (for example in a vm) once I get back. > The patch in $subject is fair, but changing device path indicates > there is some problem in the system. However, I didn't expect > same device pointed by both /dev/dm-x and /dev/mapper/abc would > contended. It is weird, because there are other symlinks also pointing to the device. In my case, lvm sets up both /dev/mapper/cryptdata4tb--vg-backup and /dev/cryptdata4tb-vg/backup as symlinks to ../dm-13 but only the first one fights with /dev/dm-13 for the devid. > One fix for this is to make it a ratelimit print. But then the same > thing happens without notice. If you monitor /proc/self/mounts > probably you will notice that mount device changes every ~2mins. I haven't managed to catch it. Presumably because, according to the logs, it seems to switch the devices back again within less than a second. > I will be more keen to find the module which is causing this issue, > that is calling 'btrfs dev scan' every ~2mins or trying to mount > an unmounted device without understanding that its mapper is already > mounted. Any ideas how we can determine that? Can I try something like stopping udev for 5 minutes to see if it stops? Or will that break my system (I can't schedule any downtime until after I am back)? Note (in case it is relevant) this is a systemd system so udev is actually systemd-udevd.service. Thanks Graham
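To make the "several names, one device" point concrete, here is a minimal Python sketch. It is only an illustration under assumptions: the helper names are mine, and the demo builds a throwaway directory that imitates the /dev/dm-13 and /dev/mapper/cryptdata4tb--vg-backup pair rather than touching real device nodes.

```python
import os
import tempfile


def same_node(path_a, path_b):
    """True if both paths end up at the same underlying file or device node."""
    return os.path.samefile(path_a, path_b)


def canonical_name(path):
    """Resolve every symlink in the path, as 'readlink -f' would."""
    return os.path.realpath(path)


def demo():
    # Hypothetical stand-in for /dev: a plain file plays the role of dm-13
    # and a relative symlink plays the role of the /dev/mapper alias.
    with tempfile.TemporaryDirectory() as tmp:
        node = os.path.join(tmp, "dm-13")
        open(node, "w").close()
        os.mkdir(os.path.join(tmp, "mapper"))
        alias = os.path.join(tmp, "mapper", "cryptdata4tb--vg-backup")
        os.symlink("../dm-13", alias)
        return (same_node(node, alias),
                canonical_name(alias) == os.path.realpath(node))


if __name__ == "__main__":
    print(demo())
```

On a real system the same two helpers could be pointed at the /dev paths from the log messages to confirm that they are aliases of one block device.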
Re: "btrfs: harden agaist duplicate fsid" spams syslog
On 11/07/2019 03:46, Anand Jain wrote:
> Now the question I am trying to understand, why same device is being
> scanned every 2 mins, even though its already mount-ed. I am guessing
> its toggling the same device paths trying to mount the device-path
> which is not mounted. So autofs's check for the device mount seems to
> be path based.
>
> Would you please provide your LVM configs and I believe you are using
> dm-mapping too. What are the device paths used in the fstab and in grub.
> And do you see these messages for all devices of
> 4d1ba5af-8b89-4cb5-96c6-55d1f028a202 or just devid 4? Would you please
> provide more logs at least a complete cycle of the repeating logs.

My setup is quite complex, with three btrfs-over-LVM-over-LUKS filesystems, so I will explain it fully in a separate message in case it is important. Let me first answer your questions regarding 4d1ba5af-8b89-4cb5-96c6-55d1f028a202, which was the example I used in my initial log extract.

4d1b...a202 is a filesystem with a main mount point of /mnt/backup2/:

black:~# btrfs fi show /mnt/backup2/
Label: 'snap12tb'  uuid: 4d1ba5af-8b89-4cb5-96c6-55d1f028a202
	Total devices 2 FS bytes used 10.97TiB
	devid    1 size 10.82TiB used 10.82TiB path /dev/sdc3
	devid    4 size 3.62TiB used 199.00GiB path /dev/mapper/cryptdata4tb--vg-backup

This filesystem has two devices: one is a real disk partition (/dev/sdc3), the other is an LVM logical volume. It has also had other LVM devices added and removed at various times, but this is the current setup. Note: when I added this LV, I used path /dev/mapper/cryptdata4tb--vg-backup.
black:~# lvdisplay /dev/cryptdata4tb-vg/backup
  --- Logical volume ---
  LV Path                /dev/cryptdata4tb-vg/backup
  LV Name                backup
  VG Name                cryptdata4tb-vg
  LV UUID                TZaWfo-goG1-GsNV-GCZL-rpbz-IW0H-gNmXBf
  LV Write Access        read/write
  LV Creation host, time black, 2019-07-10 10:40:28 +0100
  LV Status              available
  # open                 1
  LV Size                3.62 TiB
  Current LE             949089
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           254:13

The LVM logical volume is exposed as /dev/mapper/cryptdata4tb--vg-backup which is a symlink (set up by LVM, I believe) to /dev/dm-13.

For the 4d1b...a202 filesystem I currently only see the messages for devid 4. But I presume that is because devid 1 is a real device, which only appears in /dev once. I did, for a while, have two LV devices in this filesystem and, looking at the old logs, I can see that every 2 minutes the swapping between /dev/mapper/whatever and /dev/dm-N was happening for both LV devids (but not for the physical device devid).

This particular device is not a root device and I do not believe it is referenced in grub or initramfs.
It is mounted in /etc/fstab:

LABEL=snap12tb /mnt/backup2 btrfs defaults,subvolid=0,noatime,nodiratime,compress=lzo,skip_balance,space_cache=v2 0 3

Note that /dev/disk/by-label/snap12tb is a symlink to the dm-N alias of the LV device (set up by LVM or udev or something - not by me):

black:~# ls -l /dev/disk/by-label/snap12tb
lrwxrwxrwx 1 root root 11 Jul 11 18:18 /dev/disk/by-label/snap12tb -> ../../dm-13

Here is a log extract of the cycling messages for the 4d1b...a202 filesystem:

Jul 11 18:46:28 black kernel: [116657.825658] BTRFS info (device sdc3): device fsid 4d1ba5af-8b89-4cb5-96c6-55d1f028a202 devid 4 moved old:/dev/mapper/cryptdata4tb--vg-backup new:/dev/dm-13
Jul 11 18:46:28 black kernel: [116658.048042] BTRFS info (device sdc3): device fsid 4d1ba5af-8b89-4cb5-96c6-55d1f028a202 devid 4 moved old:/dev/dm-13 new:/dev/mapper/cryptdata4tb--vg-backup
Jul 11 18:46:29 black kernel: [116659.157392] BTRFS info (device sdc3): device fsid 4d1ba5af-8b89-4cb5-96c6-55d1f028a202 devid 4 moved old:/dev/mapper/cryptdata4tb--vg-backup new:/dev/dm-13
Jul 11 18:46:29 black kernel: [116659.337504] BTRFS info (device sdc3): device fsid 4d1ba5af-8b89-4cb5-96c6-55d1f028a202 devid 4 moved old:/dev/dm-13 new:/dev/mapper/cryptdata4tb--vg-backup
Jul 11 18:48:28 black kernel: [116777.727262] BTRFS info (device sdc3): device fsid 4d1ba5af-8b89-4cb5-96c6-55d1f028a202 devid 4 moved old:/dev/mapper/cryptdata4tb--vg-backup new:/dev/dm-13
Jul 11 18:48:28 black kernel: [116778.019874] BTRFS info (device sdc3): device fsid 4d1ba5af-8b89-4cb5-96c6-55d1f028a202 devid 4 moved old:/dev/dm-13 new:/dev/mapper/cryptdata4tb--vg-backup
Jul 11 18:48:29 black kernel: [116779.157038] BTRFS info (device sdc3): device fsid 4d1ba5af-8b89-4cb5-96c6-55d1f028a202 devid 4 moved old:/dev/mapper/cryptdata4tb--vg-backup new:/dev/dm-13
Jul 11 18:48:30 black kernel: [116779.364959] BTRFS info (device sdc3): device fsid 4d1ba5af-8b89-4cb5-96c6-55d1f028a202 devid 4 moved old:/dev/dm-13
new:/dev/mapper/cryptdata4tb--vg-backup Jul 11 18:50:28 black kerne
"btrfs: harden agaist duplicate fsid" spams syslog
Anand's Nov 2018 patch "btrfs: harden agaist duplicate fsid" has recently percolated through to my Debian buster server system. And it is spamming my log files. Each of my btrfs filesystem devices logs 4 messages every 2 minutes. Here is an example of the 4 messages related to one device:

Jul 10 19:32:27 black kernel: [33017.407252] BTRFS info (device sdc3): device fsid 4d1ba5af-8b89-4cb5-96c6-55d1f028a202 devid 4 moved old:/dev/mapper/cryptdata4tb--vg-backup new:/dev/dm-13
Jul 10 19:32:27 black kernel: [33017.522242] BTRFS info (device sdc3): device fsid 4d1ba5af-8b89-4cb5-96c6-55d1f028a202 devid 4 moved old:/dev/dm-13 new:/dev/mapper/cryptdata4tb--vg-backup
Jul 10 19:32:29 black kernel: [33018.797161] BTRFS info (device sdc3): device fsid 4d1ba5af-8b89-4cb5-96c6-55d1f028a202 devid 4 moved old:/dev/mapper/cryptdata4tb--vg-backup new:/dev/dm-13
Jul 10 19:32:29 black kernel: [33019.061631] BTRFS info (device sdc3): device fsid 4d1ba5af-8b89-4cb5-96c6-55d1f028a202 devid 4 moved old:/dev/dm-13 new:/dev/mapper/cryptdata4tb--vg-backup

What is happening here is that each device is actually an LVM logical volume, and it is known by a /dev/mapper name and a /dev/dm name. Every 2 minutes something causes btrfs to notice that there are two names for the same device and it swaps them around, logging a message to say it has done so. And it does this 4 times.

I presume that the swapping doesn't cause any problem. I wonder slightly whether ordering guarantees and barriers all work correctly but I haven't noticed any problems. I also assume it has been doing this for a while -- just silently before this patch came along.

Is btrfs noticing this itself or is something else (udev or systemd, for example) triggering it? Should I worry about it? Is there any way to not have my log files full of this?

Graham

[This started with a Debian testing kernel a couple of months ago. Current uname -a gives: Linux black 4.19.0-5-amd64 #1 SMP Debian 4.19.37-5 (2019-06-19) x86_64 GNU/Linux]
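For anyone who wants to measure how often this happens on their own system, a small Python sketch along the following lines could count the "moved" messages per device. The regular expression is written against the message format shown above; the function name and the two-line sample are mine and purely illustrative.

```python
import re
from collections import Counter

# Matches the part of the kernel message that names the device and both paths.
MOVED = re.compile(
    r"device fsid (?P<fsid>\S+) devid (?P<devid>\d+) "
    r"moved old:(?P<old>\S+) new:(?P<new>\S+)"
)


def count_moves(lines):
    """Count 'moved' messages per (fsid, devid) pair."""
    counts = Counter()
    for line in lines:
        match = MOVED.search(line)
        if match:
            counts[(match["fsid"], int(match["devid"]))] += 1
    return counts


# Two of the messages quoted above, used as sample input.
SAMPLE = [
    "Jul 10 19:32:27 black kernel: [33017.407252] BTRFS info (device sdc3): "
    "device fsid 4d1ba5af-8b89-4cb5-96c6-55d1f028a202 devid 4 moved "
    "old:/dev/mapper/cryptdata4tb--vg-backup new:/dev/dm-13",
    "Jul 10 19:32:27 black kernel: [33017.522242] BTRFS info (device sdc3): "
    "device fsid 4d1ba5af-8b89-4cb5-96c6-55d1f028a202 devid 4 moved "
    "old:/dev/dm-13 new:/dev/mapper/cryptdata4tb--vg-backup",
]

if __name__ == "__main__":
    for (fsid, devid), n in count_moves(SAMPLE).items():
        print(fsid, devid, n)
```

Feeding it the lines of a real syslog file instead of SAMPLE would show whether every LV-backed devid is affected or only some.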
Re: snapshot rollback
On 05/07/2019 12:47, Remi Gauvin wrote: > On 2019-07-05 7:06 a.m., Ulli Horlacher wrote: > >> >> Ok, it seems my idea (replacing the original root subvolume with a >> snapshot) is not possible. >> > ... > It is common practice with installers now to mount your root and home on > a subvolume for exactly this reason. (And you can convert your current > system as well. Boot your system with a removable boot media of your > choice, create a subvolume named @. Move all existing folders into this > new subvolume. Edit the @/boot/grub/grub.cfg file so your Linux boot > menu has the @ added to the paths of Linux root and initrd. Personally, I use a slightly different strategy. My basic principle is that no btrfs filesystem should have any files or directories in its root -- only subvolumes. This makes it easier to do stuff with snapshots if I want to. For system disks I use a variant of the "@" approach. I create two subvolumes when I create a system disk: rootfs and varfs (I separate the two because I use different btrbk and other backup configurations for / and /var). I then use btrfs subvolume set-default to set rootfs as the default mount so I don't have to tell grub, etc. about the subvolume (I should mention that I put /boot in a separate partition, not using btrfs). In /etc/fstab I mount /var with subvol=varfs. I also mount the whole filesystem (using subvolid=5) into a separate mount point (/mnt/rootdisk) so I can easily get at and manipulate all the top-level subvolumes. Data disks are similar. I create a "home" subvolume at the top level in the data disk which gets mounted into /home. Below /home, most directories are also subvolumes (for example, one for my main account so I can back that up separately from other parts of /home). I mount the data disk itself (using subvolid=5) into a separate mount point: /mnt/datadisk -- which I only use when messing around with the subvolume structure.
It sounds more complicated than it is, although it is not supported by any distro installers that I am aware of. You should expect to get a few config things wrong while setting it up, and you will need an alternative boot (a USB disk or an older system disk) to get things working, particularly if you want to use btrfs-over-LVM-over-LUKS. And don't forget to fully update grub when you think it is working and then test it again without your old/temporary boot disk in place! Basically, I make many different subvolumes and use mount to put them into the places they should be in the filesystem (except for / for which I use set-default). The real btrfs root directory for each filesystem is mounted (using subvolid=5) into a separate place for doing filesystem operations. I then have a cron script which checks that every directory within the top level of each btrfs filesystem (and within /home) is a subvolume and warns me if it isn't (I use special dotfiles within the few top-level directories which I don't want to be their own subvolumes). Contact me directly if you would find my personal "how to set up my system and root disks, for debian, using btrfs-over-lvm-over-luks" notes useful. Graham
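The cron check described above could look roughly like the following Python sketch. This is not the actual script: it assumes the usual btrfs convention that the root directory of a subvolume has inode number 256 (so it is only meaningful on btrfs), and the marker file name ".not-a-subvolume" is a hypothetical stand-in for the special dotfiles.

```python
import os
import tempfile

BTRFS_SUBVOL_ROOT_INO = 256  # inode number of a subvolume's root directory


def looks_like_subvolume(path):
    """Heuristic: on btrfs, a subvolume root directory has inode 256."""
    return os.lstat(path).st_ino == BTRFS_SUBVOL_ROOT_INO


def find_plain_directories(top, marker=".not-a-subvolume",
                           is_subvolume=looks_like_subvolume):
    """List directories directly under 'top' that are neither subvolumes
    nor explicitly exempted by containing the (hypothetical) marker file."""
    plain = []
    for entry in sorted(os.scandir(top), key=lambda e: e.name):
        if not entry.is_dir(follow_symlinks=False):
            continue
        if is_subvolume(entry.path):
            continue
        if os.path.exists(os.path.join(entry.path, marker)):
            continue
        plain.append(entry.name)
    return plain


def demo():
    # Builds a throwaway tree and pretends that "home" is a subvolume, so the
    # logic can be exercised on any filesystem.
    with tempfile.TemporaryDirectory() as top:
        for name in ("home", "scratch", "stray"):
            os.mkdir(os.path.join(top, name))
        open(os.path.join(top, "scratch", ".not-a-subvolume"), "w").close()
        fake = lambda path: os.path.basename(path) == "home"
        return find_plain_directories(top, is_subvolume=fake)


if __name__ == "__main__":
    print(demo())
```

Run from cron against the subvolid=5 mount points (/mnt/rootdisk, /mnt/datadisk) and /home, any non-empty result would be the warning.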
Re: Btrfs progs pre-release 5.2-rc1
On 28/06/2019 18:40, David Sterba wrote:
> Hi,
>
> this is a pre-release of btrfs-progs, 5.2-rc1.
>
> The proper release is scheduled to next Friday, +7 days (2019-07-05), but can
> be postponed if needed.
>
> Scrub status has been reworked:
>
> UUID:             bf8720e0-606b-4065-8320-b48df2e8e669
> Scrub started:    Fri Jun 14 12:00:00 2019
> Status:           running
> Duration:         0:14:11
> Time left:        0:04:04
> ETA:              Fri Jun 14 12:18:15 2019
> Total to scrub:   182.55GiB
> Bytes scrubbed:   141.80GiB
> Rate:             170.63MiB/s
> Error summary:    csum=7
>   Corrected:      0
>   Uncorrectable:  7
>   Unverified:     0

Is it possible to include my recently submitted patch to scrub to correct handling of last_physical and fix skipping much of the disk on scrub cancel/resume?

Graham
Re: [PATCH] btrfs-progs: scrub: Fix scrub cancel/resume not to skip most of the disk
On 18/06/2019 09:08, Graham R. Cobb wrote:
> When a scrub completes or is cancelled, statistics are updated for reporting
> in a later btrfs scrub status command and for resuming the scrub. Most
> statistics (such as bytes scrubbed) are additive so scrub adds the statistics
> from the current run to the saved statistics.
>
> However, the last_physical statistic is not additive. The value from the
> current run should replace the saved value. The current code incorrectly
> adds the last_physical from the current run to the previous saved value.
>
> This bug causes the resume point to be incorrectly recorded, so large areas
> of the disk are skipped when the scrub resumes. As an example, assume a disk
> had 100 bytes and scrub was cancelled and resumed each time 10% (10
> bytes) had been scrubbed.
>
> Run | Start byte | Bytes scrubbed | Kernel last_physical  | Saved last_physical
>   1 |          0 |             10 |                    10 |                  10
>   2 |         10 |             10 |                    20 |                  30
>   3 |         30 |             10 |                    40 |                  70
>   4 |         70 |             10 |                    80 |                 150
>   5 |        150 |              0 | immediately completes |           completed
>
> In this example, only 40% of the disk is actually scrubbed.
>
> This patch changes the saved/displayed last_physical to track the last
> reported value from the kernel.
>
> Signed-off-by: Graham R. Cobb

Ping? This fix is important for anyone who interrupts and resumes scrubs -- which will happen more and more as filesystems get bigger. It is a small fix and would be good to get out to distros.
Graham > --- > cmds-scrub.c | 10 -- > 1 file changed, 8 insertions(+), 2 deletions(-) > > diff --git a/cmds-scrub.c b/cmds-scrub.c > index f21d2d89..2800e796 100644 > --- a/cmds-scrub.c > +++ b/cmds-scrub.c > @@ -171,6 +171,10 @@ static void print_scrub_summary(struct > btrfs_scrub_progress *p) > fs_stat->p.name += p->name; \ > } while (0) > > +#define _SCRUB_FS_STAT_COPY(p, name, fs_stat) do { \ > + fs_stat->p.name = p->name; \ > +} while (0) > + > #define _SCRUB_FS_STAT_MIN(ss, name, fs_stat)\ > do { \ > if (fs_stat->s.name > ss->name) { \ > @@ -209,7 +213,7 @@ static void add_to_fs_stat(struct btrfs_scrub_progress *p, > _SCRUB_FS_STAT(p, malloc_errors, fs_stat); > _SCRUB_FS_STAT(p, uncorrectable_errors, fs_stat); > _SCRUB_FS_STAT(p, corrected_errors, fs_stat); > - _SCRUB_FS_STAT(p, last_physical, fs_stat); > + _SCRUB_FS_STAT_COPY(p, last_physical, fs_stat); > _SCRUB_FS_STAT_ZMIN(ss, t_start, fs_stat); > _SCRUB_FS_STAT_ZMIN(ss, t_resumed, fs_stat); > _SCRUB_FS_STAT_ZMAX(ss, duration, fs_stat); > @@ -683,6 +687,8 @@ static int scrub_writev(int fd, char *buf, int max, const > char *fmt, ...) > > #define _SCRUB_SUM(dest, data, name) dest->scrub_args.progress.name = > \ > data->resumed->p.name + data->scrub_args.progress.name > +#define _SCRUB_COPY(dest, data, name) dest->scrub_args.progress.name = > \ > + data->scrub_args.progress.name > > static struct scrub_progress *scrub_resumed_stats(struct scrub_progress > *data, > struct scrub_progress *dest) > @@ -703,7 +709,7 @@ static struct scrub_progress *scrub_resumed_stats(struct > scrub_progress *data, > _SCRUB_SUM(dest, data, malloc_errors); > _SCRUB_SUM(dest, data, uncorrectable_errors); > _SCRUB_SUM(dest, data, corrected_errors); > - _SCRUB_SUM(dest, data, last_physical); > + _SCRUB_COPY(dest, data, last_physical); > dest->stats.canceled = data->stats.canceled; > dest->stats.finished = data->stats.finished; > dest->stats.t_resumed = data->stats.t_start; >
Re: [PATCH RFC] btrfs-progs: scrub: Correct tracking of last_physical across scrub cancel/resume
On 08/06/2019 00:55, Graham R. Cobb wrote: > When a scrub completes or is cancelled, statistics are updated for reporting > in a later btrfs scrub status command. Most statistics (such as bytes > scrubbed) > are additive so scrub adds the statistics from the current run to the > saved statistics. > > However, the last_physical statistic is not additive. The value from the > current run should replace the saved value. The current code incorrectly > adds the last_physical from the current run to the saved value. > > This bug not only affects user status reporting but also has the effect that > subsequent resumes start from the wrong place and large amounts of the > filesystem are not scrubbed. > > This patch changes the saved last_physical to track the last reported value > from the kernel. > > Signed-off-by: Graham R. Cobb No comments received on this RFC PATCH. I will resubmit it shortly as a non-RFC PATCH, with a slightly improved summary and changelog. Graham
Re: Scrub resume failure
On 06/06/2019 15:26, Graham Cobb wrote: > However, after a few cancel/resume cycles, the scrub terminates. No > errors are reported but one of the resumes will just immediately > terminate claiming the scrub is done. It isn't. Nowhere near. I believe I have found the problem. It is a bug in the scrub command. When a scrub completes or is cancelled, the utility updates the saved statistics for reporting using btrfs scrub status. These statistics include the last_physical value returned from the ioctl, which is then used by the resume code to specify the start for the next run. Most statistics (such as bytes scrubbed, error counts, etc) are maintained by adding the values from the current run to the saved values. However, the last_physical value should not be added: it should replace the saved value. The current code incorrectly adds it to the saved value, meaning that large amounts of the filesystem are missed out on the next run. I have created a patch, which I will send in a separate message. As I have not submitted patches to this list before, I will send it as a PATCH RFC and would welcome comments. Graham
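To illustrate the effect, here is a toy Python model of repeated cancel/resume cycles. It is only a sketch under assumptions, not the btrfs-progs code: the "disk" is 100 units, each run scrubs 10 units before being cancelled, and the function name is mine. The only thing it models is how the saved last_physical is updated.

```python
def simulate_resumes(disk_size, chunk, additive):
    """Model cancel/resume cycles of a scrub.

    additive=True  models the bug: the kernel's last_physical is ADDED
                   to the saved value.
    additive=False models the fix: the kernel's value REPLACES it.
    Returns (number of runs that did work, total units actually scrubbed).
    """
    saved_last_physical = 0
    total_scrubbed = 0
    runs = 0
    while saved_last_physical < disk_size:
        start = saved_last_physical          # resume point for this run
        end = min(start + chunk, disk_size)  # cancelled after 'chunk' units
        total_scrubbed += end - start
        kernel_last_physical = end           # what the kernel reports back
        if additive:
            saved_last_physical += kernel_last_physical
        else:
            saved_last_physical = kernel_last_physical
        runs += 1
    return runs, total_scrubbed


if __name__ == "__main__":
    print("buggy:", simulate_resumes(100, 10, additive=True))
    print("fixed:", simulate_resumes(100, 10, additive=False))
```

With the additive update the saved resume point runs past the end of the disk after four runs, so the next resume would finish immediately with only 40 of the 100 units scrubbed; with the replacing update all 100 units are covered in ten runs.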
Scrub resume failure
I have a btrfs filesystem which I want to scrub. This is a multi-TB filesystem and will take well over 24 hours to scrub. Unfortunately, the scrub turns out to be quite intrusive into the system (even when making sure it is very low priority for ionice and nice). Operations on other disks run excessively slowly, causing timeouts on important actions like mail delivery (causing bounces). So, I break it up. I run it for some interval (hours), with the time-critical services stopped. Then I cancel the scrub and let mail delivery run for a while. Then I stop mail again and resume the scrub for another interval, etc. This works and solves the mail bounce problem. However, after a few cancel/resume cycles, the scrub terminates. No errors are reported but one of the resumes will just immediately terminate claiming the scrub is done. It isn't. Nowhere near. The disk being scrubbed is in use during all this. It doesn't get a heavy load but it is my main backup disk and various backups happen, some of them involving snapshots being created and deleted. Glancing at the use of the ioctl in the btrfs-progs code, I assume the resume is using the last_physical from the last run as the start for the next. Does that break if the filesystem has changed and that is no longer a used block or something? If so, I think that makes resume useless. If this is not expected behaviour I will do more work to analyse and reproduce. Graham
Re: Used disk size of a received subvolume?
On 17/05/2019 17:39, Steven Davies wrote: > On 17/05/2019 16:28, Graham Cobb wrote: > >> That is why I created my "extents-list" stuff. This is a horrible hack >> (one day I will rewrite it using the python library) which lets me >> answer questions like: "how much space am I wasting by keeping >> historical snapshots", "how much data is being shared between two >> subvolumes", "how much of the data in my latest snapshot is unique to >> that snapshot" and "how much space would I actually free up if I removed >> (just) these particular directories". None of which can be answered from >> the existing btrfs command line tools (unless I have missed something). > > I have my own horrible hack to do something like this; if you ever get > around to implementing it in Python could you share the code? > Sure. The current hack (using shell and command line tools) is at https://github.com/GrahamCobb/extents-lists. If the python version ever materialises I expect it will end up there as well.
Re: Used disk size of a received subvolume?
On 17/05/2019 14:57, Axel Burri wrote: > btrfs fi du shows me the information wanted, but only for the last > received subvolume (as you said it changes over time, and any later > child will share data with it). For all others, it merely shows "this > is what gets freed if you delete this subvolume". It doesn't even show you that: it is possible to have shared (not exclusive) data which is only shared between files within the subvolume, and which will be freed if the subvolume is deleted. And, of course, there is the obvious problem that if you only count exclusive then no one is being charged for all the shared segments ("Oh, my backup is getting a bit expensive. Hmm. I know! I will back up all my files to two different destinations, and make sure btrfs is sharing the data between both locations! Then no one pays for it! Whoopee!"). In my opinion, the shared/exclusive information in btrfs fi du is worse than useless: it confuses people who think it means something different from what it does. And, in btrfs, it isn't really useful to know whether something is "exclusive" or not -- what people care about is always something else (which is dependent on **where** it is shared, and by whom). The biggest problem is that you haven't defined what **you** (in your particular use case) mean by the "size" of a subvolume. For btrfs that doesn't have any single obvious definition. Most commonly, I think, people mean "how much space on disk would be freed up if I deleted this subvolume and all subvolumes contained within it", although quite often they mean the similar (but not identical) "how much space on disk would be freed up if I deleted just this subvolume". And sometimes they actually mean "how much space on disk would be freed up if I deleted this subvolume, the subvolumes contained within it, and all the snapshots I have taken but are lying around forgotten about in some other directory tree somewhere".
But often they mean something else completely, such as "how much space is taken up by the data which was originally created in this subvolume but which has been cloned into all sorts of places now and may not even be referred to from this subvolume any more" (typically this is the case if you want to charge the subvolume owner for the data usage). And, of course, another reading of your question would be "how much data was transferred during this send/receive operation" (relevant if you are running a backup service and want to charge people by how much they are sending to the service rather than the amount of data stored). That is why I created my "extents-list" stuff. This is a horrible hack (one day I will rewrite it using the python library) which lets me answer questions like: "how much space am I wasting by keeping historical snapshots", "how much data is being shared between two subvolumes", "how much of the data in my latest snapshot is unique to that snapshot" and "how much space would I actually free up if I removed (just) these particular directories". None of which can be answered from the existing btrfs command line tools (unless I have missed something). > And it is pretty slow: on my backup disk (spinning rust, ~2000 > subvolumes, ~100 sharing data), btrfs fi du takes around 5min for a > subvolume of 20GB, while btrfs find-new takes only seconds. Yes. Answering the real questions involves taking the FIEMAP data for every file involved (which, for some questions, is actually every file on the disk!) so it takes a very long time. Days for my multi-terabyte backup disk. > Summing up, what I'm looking for would be something like: > > btrfs fi du -s --exclusive-relative-to= You can do that with FIEMAP data. Feel free to look at extents-lists. Also feel free to shout "this is a gross hack" and scream at me! If you really just need it for two subvols like that extents-expr -s - will tell you how much space is in extents used in but not used in . Graham
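The distinction between these questions is easier to see with a toy model. The Python sketch below is not how extents-lists works internally; it assumes we already have, for each file, the set of extents it references (on a real system that would come from FIEMAP), and all names and numbers are made up for illustration.

```python
# Toy data (all hypothetical): extent id -> length in bytes
EXTENTS = {"e1": 100, "e2": 50, "e3": 200}

# (subvolume, file, extent id) reference triples
REFS = [
    ("A", "file1", "e1"),   # only one file anywhere uses e1
    ("A", "file2", "e2"),   # e2 is reflinked between two files in A
    ("A", "file3", "e2"),
    ("A", "file4", "e3"),   # e3 is shared with a snapshot B
    ("B", "file4", "e3"),
]


def extents_used_by(subvol):
    """Set of extent ids referenced from any file in the subvolume."""
    return {ext for sv, _file, ext in REFS if sv == subvol}


def size_of(extent_ids):
    return sum(EXTENTS[ext] for ext in extent_ids)


def freed_if_deleted(subvol):
    """Space freed by deleting just this subvolume: extents that no
    other subvolume references."""
    others = {ext for sv, _file, ext in REFS if sv != subvol}
    return size_of(extents_used_by(subvol) - others)


def single_file_exclusive(subvol):
    """Space in extents referenced by exactly one file overall."""
    refcount = {}
    for _sv, _file, ext in REFS:
        refcount[ext] = refcount.get(ext, 0) + 1
    return size_of({e for e in extents_used_by(subvol) if refcount[e] == 1})


def used_in_a_not_b(a, b):
    """Space in extents used in subvolume a but not used in subvolume b."""
    return size_of(extents_used_by(a) - extents_used_by(b))


if __name__ == "__main__":
    print(freed_if_deleted("A"), single_file_exclusive("A"),
          used_in_a_not_b("A", "B"), used_in_a_not_b("B", "A"))
```

In this toy data, deleting A would free 150 bytes even though only 100 bytes are referenced by a single file, because e2 is shared only between files inside A.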
Re: Btrfs send with parent different size depending on source of files.
On 18/02/2019 19:58, André Malm wrote: > What causes the extent to be incomplete? And can I avoid it? Does it matter? I presume the send is working OK, it is just that it sends a little more data than it needs to. Or have you seen any data loss? Graham
Re: experiences running btrfs on external USB disks?
On 04/12/2018 12:38, Austin S. Hemmelgarn wrote: > In short, USB is _crap_ for fixed storage, don't use it like that, even > if you are using filesystems which don't appear to complain. That's useful advice, thanks. Do you (or anyone else) have any experience of using btrfs over iSCSI? I was thinking about this for three different use cases: 1) Giving my workstation a data disk that is actually a partition on a server -- keeping all the data on the big disks on the server and reducing power consumption (just a small boot SSD in the workstation). 2) Splitting a btrfs RAID1 between a local disk and a remote iSCSI mirror to provide redundancy without putting more disks in the local system. Of course, this would mean that one of the RAID1 copies would have higher latency than the other. 3) Like case 1 but actually exposing an LVM logical volume from the server using iSCSI, rather than a simple disk partition. I would then put both encryption and RAID running on the server below that logical volume. NBD could also be an alternative to iSCSI in these cases as well. Any thoughts? Graham
Re: btrfs fi du unreliable?
On 29/08/18 14:31, Jorge Bastos wrote: > Thanks, that makes sense, so it's only possible to see how much space > a snapshot is using with quotas enable, I remember reading that > somewhere before, though there was a new way after reading this latest > post . My extents lists scripts (https://github.com/GrahamCobb/extents-lists) can tell you the answers to questions like this. In particular, see the extents-to-remove script. However, be aware of the warning in the documentation: --- Be warned: the last three examples take a very LONG TIME (and require a lot of space in $TMPDIR) as they effectively have to get the file extents for almost every file on the disk (and sort them multiple times). They take over 12 hours on my system! --- I don't know if there are any better tools which work faster.
Re: Recommendations for balancing as part of regular maintenance?
On 08/01/18 16:34, Austin S. Hemmelgarn wrote:
> Ideally, I think it should be as generic as reasonably possible,
> possibly something along the lines of:
>
> A: While not strictly necessary, running regular filtered balances (for
> example `btrfs balance start -dusage=50 -dlimit=2 -musage=50 -mlimit=4`,
> see `man btrfs-balance` for more info on what the options mean) can help
> keep a volume healthy by mitigating the things that typically cause
> ENOSPC errors. Full balances by contrast are long and expensive
> operations, and should be done only as a last resort.

That recommendation is similar to what I do and it works well for my use case. I would recommend it to anyone with my usage, but cannot say how well it would work for other uses. In my case, I run balances like that once a week: some weeks nothing happens, other weeks 5 or 10 blocks may get moved.

For reference, my use case is for two separate btrfs filesystems each on a single large disk (so no RAID) -- the disks are 6TB and 12TB, both around 80% used -- one is my main personal data disk, the other is my main online backup disk.

The data disk receives all email delivery (so lots of small files, coming and going), stores TV programs as PVR storage (many GB sized files, each one written once, which typically stick around for a while and eventually get deleted) and is where I do my software development (sources and build objects). No (significant) database usage. I am guessing this is pretty typical personal user usage (although it doesn't store any operating system files). The only unusual thing is that I have it set up as about 20 subvolumes, and each one has frequent snapshots (maybe 200 or so subvolumes in total at any time).

The online backup disk receives backups from all my systems in three main forms: btrfs snapshots (send/receive), rsnapshot copies (rsync), and DAR archives. Most get updated daily. It contains several hundred snapshots (most received from the data disk).

It would be interesting to hear if similar balancing is seen as useful for other very different cases (RAID use, databases or VM disks, etc).

Graham
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
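For anyone wanting to try the same routine, the weekly run described above might look roughly like the sketch below. It uses exactly the filter values from the quoted FAQ text; the function name, the mount points in the usage note and the BTRFS override variable (there only so the function can be dry-run) are assumptions for illustration, not anything btrfs-progs provides.

```shell
# Run the filtered balance from the quoted FAQ text on each mount point given.
# BTRFS is a hypothetical override: set BTRFS=echo to print the arguments
# instead of running btrfs.
weekly_balance() {
    for mountpoint in "$@"; do
        # Relocate only data and metadata chunks with usage under 50%,
        # and at most 2 data chunks and 4 metadata chunks per run.
        ${BTRFS:-btrfs} balance start -dusage=50 -dlimit=2 -musage=50 -mlimit=4 "$mountpoint"
    done
}
```

Called from a weekly cron job as, say, `weekly_balance /mnt/data /mnt/backup` (hypothetical mount points), this does nothing visible in weeks where no chunk matches the filters.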
Re: [RFC] Improve subvolume usability for a normal user
On 05/12/17 18:01, Goffredo Baroncelli wrote:
> On 12/05/2017 04:42 PM, Graham Cobb wrote:
>> On 05/12/17 12:41, Austin S. Hemmelgarn wrote:
>>> On 2017-12-05 03:43, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2017-12-05 16:25, Misono, Tomohiro wrote:
>>>>> Hello all,
>>>>>
>>>>> I want to address some issues of subvolume usability for a normal user.
>>>>> i.e. a user can create subvolumes, but
>>>>> - Cannot delete their own subvolume (by default)
>>>>> - Cannot tell subvolumes from directories (in a straightforward way)
>>>>> - Cannot check the quota limit when qgroup is enabled
>>>>>
>>>>> Here I show the initial thoughts and approaches to this problem.
>>>>> I want to check if this is a right approach or not before I start
>>>>> writing code.
>>>>>
>>>>> Comments are welcome.
>>>>> Tomohiro Misono
>>>>>
>>>>> ==
>>>>> - Goal and current problem
>>>>> The goal of this RFC is to give a normal user more control to their
>>>>> own subvolumes.
>>>>> Currently the control to subvolumes for a normal user is restricted
>>>>> as below:
>>>>>
>>>>> +-------------+------+------+
>>>>> | command     | root | user |
>>>>> +-------------+------+------+
>>>>> | sub create  | Y    | Y    |
>>>>> | sub snap    | Y    | Y    |
>>>>> | sub del     | Y    | N    |
>>>>> | sub list    | Y    | N    |
>>>>> | sub show    | Y    | N    |
>>>>> | qgroup show | Y    | N    |
>>>>> +-------------+------+------+
>>>>>
>>>>> In short, I want to change this as below in order to improve user's
>>>>> usability:
>>>>>
>>>>> +-------------+------+--------+
>>>>> | command     | root | user   |
>>>>> +-------------+------+--------+
>>>>> | sub create  | Y    | Y      |
>>>>> | sub snap    | Y    | Y      |
>>>>> | sub del     | Y    | N -> Y |
>>>>> | sub list    | Y    | N -> Y |
>>>>> | sub show    | Y    | N -> Y |
>>>>> | qgroup show | Y    | N -> Y |
>>>>> +-------------+------+--------+
>>>>>
>>>>> In words,
>>>>> (1) allow deletion of subvolume if a user owns it, and
>>>>> (2) allow getting subvolume/quota info if a user has read access to it
>>>>> (sub list/qgroup show just lists the subvolumes which are readable
>>>>> by the user)
>>>>>
>>>>> I think other commands not listed above (qgroup limit, send/receive
>>>>> etc.) should
>>>>> be done by root and not be allowed for a normal user.
>>>>>
>>>>>
>>>>> - Outside the scope of this RFC
>>>>> There is a qualitative problem to use qgroup for limiting user disk
>>>>> amount;
>>>>> quota limit can easily be averted by creating a subvolume. I think
>>>>> that forcing
>>>>> inheriting quota of parent subvolume is a solution, but I won't
>>>>> address nor
>>>>> discuss this here.
>>>>>
>>>>>
>>>>> - Proposal
>>>>> (1) deletion of subvolume
>>>>>
>>>>> I want to change the default behavior to allow a user to delete
>>>>> their own
>>>>> subvolumes.
>>>>> This is not the same behavior as when user_subvol_rm_allowed
>>>>> mount option is
>>>>> specified; in that case a user can delete the subvolume to which
>>>>> they have
>>>>> write+exec right.
>>>>> Since snapshot creation is already restricted to the subvolume
>>>>> owner, it is
>>>>> consistent that only the owner of the subvolume (or root) can
>>>>> delete it.
>>>>> The implementation should be straightforward.
>>>>
>>>> Personally speaking, I prefer to do the complex owner check in user
>>>> daemon.
>>>>
>>>> And do the privilege in user daemon (call it btrfsd for example).
>>>>
>>>> So btrfs-progs will works in 2 modes, if root calls it, do as it used
>>>> to do.
>>>> I
Re: [RFC] Improve subvolume usability for a normal user
On 05/12/17 12:41, Austin S. Hemmelgarn wrote:
> On 2017-12-05 03:43, Qu Wenruo wrote:
>>
>>
>> On 2017-12-05 16:25, Misono, Tomohiro wrote:
>>> Hello all,
>>>
>>> I want to address some issues of subvolume usability for a normal user.
>>> i.e. a user can create subvolumes, but
>>> - Cannot delete their own subvolume (by default)
>>> - Cannot tell subvolumes from directories (in a straightforward way)
>>> - Cannot check the quota limit when qgroup is enabled
>>>
>>> Here I show the initial thoughts and approaches to this problem.
>>> I want to check if this is a right approach or not before I start
>>> writing code.
>>>
>>> Comments are welcome.
>>> Tomohiro Misono
>>>
>>> ==
>>> - Goal and current problem
>>> The goal of this RFC is to give a normal user more control to their
>>> own subvolumes.
>>> Currently the control to subvolumes for a normal user is restricted
>>> as below:
>>>
>>> +-------------+------+------+
>>> | command     | root | user |
>>> +-------------+------+------+
>>> | sub create  | Y    | Y    |
>>> | sub snap    | Y    | Y    |
>>> | sub del     | Y    | N    |
>>> | sub list    | Y    | N    |
>>> | sub show    | Y    | N    |
>>> | qgroup show | Y    | N    |
>>> +-------------+------+------+
>>>
>>> In short, I want to change this as below in order to improve user's
>>> usability:
>>>
>>> +-------------+------+--------+
>>> | command     | root | user   |
>>> +-------------+------+--------+
>>> | sub create  | Y    | Y      |
>>> | sub snap    | Y    | Y      |
>>> | sub del     | Y    | N -> Y |
>>> | sub list    | Y    | N -> Y |
>>> | sub show    | Y    | N -> Y |
>>> | qgroup show | Y    | N -> Y |
>>> +-------------+------+--------+
>>>
>>> In words,
>>> (1) allow deletion of subvolume if a user owns it, and
>>> (2) allow getting subvolume/quota info if a user has read access to it
>>> (sub list/qgroup show just lists the subvolumes which are readable
>>> by the user)
>>>
>>> I think other commands not listed above (qgroup limit, send/receive
>>> etc.) should
>>> be done by root and not be allowed for a normal user.
>>>
>>>
>>> - Outside the scope of this RFC
>>> There is a qualitative problem to use qgroup for limiting user disk
>>> amount;
>>> quota limit can easily be averted by creating a subvolume. I think
>>> that forcing
>>> inheriting quota of parent subvolume is a solution, but I won't
>>> address nor
>>> discuss this here.
>>>
>>>
>>> - Proposal
>>> (1) deletion of subvolume
>>>
>>> I want to change the default behavior to allow a user to delete
>>> their own
>>> subvolumes.
>>> This is not the same behavior as when user_subvol_rm_allowed
>>> mount option is
>>> specified; in that case a user can delete the subvolume to which
>>> they have
>>> write+exec right.
>>> Since snapshot creation is already restricted to the subvolume
>>> owner, it is
>>> consistent that only the owner of the subvolume (or root) can
>>> delete it.
>>> The implementation should be straightforward.
>>
>> Personally speaking, I prefer to do the complex owner check in user
>> daemon.
>>
>> And do the privilege in user daemon (call it btrfsd for example).
>>
>> So btrfs-progs will works in 2 modes, if root calls it, do as it used
>> to do.
>> If normal user calls it, proxy the request to btrfsd, and btrfsd does
>> the privilege checking and call ioctl (with root privilege).
>>
>> Then no impact to kernel, all complex work is done in user space.
> Exactly how hard is it to just check ownership of the root inode of a
> subvolume from the ioctl context? You could just as easily push all the
> checking out to the VFS layer by taking an open fd for the subvolume
> root (and probably implicitly closing it) instead of taking a path, and
> that would give you all the benefits of ACL's and whatever security
> modules the local system is using.

+1 - stop inventing new access control rules for each different action!
Re: Why aren't NOCOW attributes propagated on snapshot transfers?
On 16/10/17 14:28, David Sterba wrote:
> On Sun, Oct 15, 2017 at 04:19:23AM +0300, Cerem Cem ASLAN wrote:
>> `btrfs send | btrfs receive` removes NOCOW attributes. Is it a bug or
>> a feature? If it's a feature, how can we keep these attributes if we
>> need to?
>
> This is a known deficiency of send protocol v1. And there are more,
> listed on
> https://btrfs.wiki.kernel.org/index.php/Design_notes_on_Send/Receive#Send_stream_v2_draft

It is not mentioned on the list (and I haven't tested to find out)... but if xattrs are supported in the v1 protocol, is there any chance of `send` converting these file flags to xattrs, and `receive` converting them back? That would need no protocol bump and would even preserve the information (although not the effect) in the case of an old receiver.
Re: [PATCH] btrfs-progs: add option to only list parent subvolumes
On 30/09/17 19:17, Holger Hoffstätte wrote:
> On 09/30/17 19:56, Holger Hoffstätte wrote:
>> shell hackery as alternative. Anyway, I was sure that at the time the
>> other letters sounded even worse/were taken, but that may just have been
>> in my head. ;-)
>>
>> I just rechecked and -S is still available, so that's good.
>
> Except that it isn't really, since there is already an 'S'
> case in cmds-subvolume.c as shortcut to --sort:

That's a shame (and it is also a shame to waste a single letter option without documenting it!).

I would still encourage you to avoid -P. I think users are confused by "parent" having more than one meaning even within btrfs. And I feel it also tends to perpetuate the mistaken belief that snapshots are somehow "special", and different from other subvolumes (rather than just a piece of information about how two subvolumes are related). Avoiding it also leaves -P free to be used one day for the "search by parent UUID" feature.

Given the constraints, I would suggest -n. It is mostly arbitrary but it is the second letter of snapshot and also the first of "not a snapshot".

Thanks for considering.

Graham
Re: [PATCH] btrfs-progs: add option to only list parent subvolumes
On 30/09/17 14:08, Holger Hoffstätte wrote:
> A "root" subvolume is identified by a null parent UUID, so adding a new
> subvolume filter and flag -P ("Parent") does the trick.

I don't like the naming. The flag you are proposing really has nothing to do with whether a subvolume is a parent or not: it is about whether it is a snapshot or not (many subvolumes are both snapshots and also parents of other snapshots, and many non-snapshots are not the parent of any subvolumes).

I have two suggestions:

1) Use -S (meaning "not a snapshot", the inverse of -s). Along with this change, I would change the usage text to say something like:

    -s list subvolumes originally created as snapshots
    -S list subvolumes originally created not as snapshots

Presumably specifying both -s and -S should be an error.

2) Add a -P (parent) option but make it take an argument: the UUID of the parent to match. This would display only subvolumes originally created as snapshots of the specified subvolume (which may or may not still exist, of course). A null value ('' -- or a special text like 'NULL' or 'NONE' if you prefer) would create the search you were looking for: subvolumes with a null Parent UUID.

The second option is more code, of course, but I see being able to list all the snapshots of a particular subvolume as quite useful.

If you do choose the second option you need to decide what to do if -P is specified more than once. Probably treat it as an error (unless you want to allow a list of UUIDs, any of which can match). You might also want to reject an attempt to specify both -s and -P.

Graham
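Until something like the second suggestion exists, the same selection can be approximated in a shell pipeline. The sketch below is only an illustration: it assumes the listing is produced with `btrfs subvolume list -q -u`, that each line contains the word `parent_uuid` followed by either a UUID or `-` for a null parent, and the function name is made up.

```shell
# Filter "btrfs subvolume list -q -u" style lines on standard input,
# keeping only those whose parent UUID equals the argument.
# An empty argument (or "-") selects subvolumes with a null parent UUID.
filter_by_parent() {
    want=${1:--}
    awk -v want="$want" '
        {
            # Find the field that follows the "parent_uuid" keyword.
            for (i = 1; i < NF; i++)
                if ($i == "parent_uuid" && $(i + 1) == want) { print; break }
        }'
}
```

For example, `btrfs subvolume list -q -u /mnt/data | filter_by_parent ''` (hypothetical mount point) would list the subvolumes that were not created as snapshots, and passing a UUID instead would list the snapshots of that subvolume.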
Re: difference between -c and -p for send-receive?
On 19/09/17 01:41, Dave wrote:
> Would it be correct to say the following?

Like Duncan, I am just a user, and I haven't checked the code. I recommend Duncan's explanation, but in case you are looking for something simpler, how about thinking with the following analogy...

Think of -p as like doing an incremental backup: it tells send to just send the instructions for the changes to get from the "parent" subvolume to the current subvolume. Without -p it is like a full backup: everything in the current subvolume is sent.

-c is different: it says "and by the way, these files also already exist on the destination so they might be useful to skip actually sending some of the file contents". Imagine that whenever a file content is about to be sent (whether incremental or full), btrfs-send checks to see if the data is in one of the -c subvolumes and, if it is, it sends "get the data by reflinking to this file over here" instead of sending the data itself. -c is really just an optimisation to save sending data if you know the data is already available somewhere else on the destination.

Be aware that this is really just an analogy (like "hard linking" is an analogy for reflinking using the clone range ioctl). Duncan's email provides more real details.

In particular, this analogy doesn't explain the original questioner's problem. In the analogy, -c might work without the files actually being present on the source (as long as they are on the destination). But, in reality, because the underlying mechanism is extent range cloning, the files have to be present on **both** the source and the destination in order for btrfs-send to work out what commands to send.

By the way, like Duncan, I was surprised that the man page suggests that -c without -p causes one of the clones to be treated as a parent. I have not checked the code to see if that is actually how it works.

Graham
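To make the three cases in the analogy concrete, here is a small sketch that only prints the `btrfs send` command line for each case rather than running anything. The helper name and the snapshot paths in the usage note are made up for illustration; only the -p and -c options themselves come from btrfs-send.

```shell
# Print the btrfs send command line for a snapshot.
# usage: send_cmd SNAPSHOT [PARENT [CLONE_SOURCE...]]
# Pass an empty PARENT ('') to use clone sources without -p.
# Paths containing spaces are not handled: this only prints a command.
send_cmd() {
    snapshot=$1
    parent=${2:-}
    shift
    if [ $# -gt 0 ]; then shift; fi
    cmd="btrfs send"
    # -p: incremental, send only the changes relative to the parent.
    if [ -n "$parent" ]; then cmd="$cmd -p $parent"; fi
    # -c: extra snapshots, also present on the destination, that data
    # may be cloned from instead of being sent.
    for clone in "$@"; do
        cmd="$cmd -c $clone"
    done
    echo "$cmd $snapshot"
}
```

So `send_cmd /snap/home.2` is the "full backup" case, `send_cmd /snap/home.2 /snap/home.1` is the incremental case, and `send_cmd /snap/home.2 /snap/home.1 /snap/src.1` adds a clone source on top of the incremental send.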
Re: ERROR: parent determination failed (btrfs send-receive)
On 18/09/17 07:10, Dave wrote:
> For my understanding, what are the restrictions on deleting snapshots?
>
> What scenarios can lead to "ERROR: parent determination failed"?

The man page for btrfs-send is reasonably clear on the requirements btrfs imposes. If you want to use incremental sends (i.e. the -c or -p options) then the specified snapshots must exist on both the source and destination. If you don't have a suitable existing snapshot then don't use -c or -p and just do a full send.

> I use snap-sync to create and send snapshots.
>
> GitHub - wesbarnett/snap-sync: Use snapper snapshots to backup to external
> drive
> https://github.com/wesbarnett/snap-sync

I am not familiar with this tool. Your question should be sent to the author of the tool, if that is what is deciding what -p and -c options are being used.

Personally I use and recommend btrbk. I have never had this issue and the configuration options let me limit the snapshots it saves on both the source and destination disks separately (so I keep fewer on the source than on the backup disk).

Graham
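The "fall back to a full send" rule can be written down as a very small check. This is a simplified sketch under a stated assumption: it only tests that a directory exists at the expected path on each side, which does not prove that the destination copy really is the received version of the source snapshot. The function name and paths are hypothetical.

```shell
# Print "incremental" if the parent snapshot is present on both the source
# and the destination, otherwise print "full".
# Assumption: presence is approximated by a directory existing at each path.
send_mode() {
    src_parent=$1
    dst_parent=$2
    if [ -d "$src_parent" ] && [ -d "$dst_parent" ]; then
        echo incremental
    else
        echo full
    fi
}
```

A backup script could call `send_mode /data/.snap/home.1 /backup/home.1` and only add -p to the send command when the answer is `incremental`.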
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On 14/08/17 16:53, Austin S. Hemmelgarn wrote:
> Quite a few applications actually _do_ have some degree of secondary
> verification or protection from a crash.

I am glad your applications do and you have no need of this feature. You are welcome not to use it. I, on the other hand, definitely want this feature and would have it enabled by default on all my systems despite the need for manual actions after some unclean shutdowns.

> Go look at almost any database
> software. It usually will not have checksumming, but it will almost
> always have support for a journal, which is enough to cover the
> particular data loss scenario we're talking about (unexpected unclean
> shutdown).

No, the problem we are talking about is the data-at-rest corruption that checksumming is designed to deal with. That is why I want it. The unclean shutdown is a side issue that means there is a trade-off to using it.

No one is suggesting that checksums are any significant help with the unclean shutdown case, just that the existence of that atomicity issue does not **prevent** them being very useful for the function for which they were designed. The degree to which any particular sysadmin will choose to enable or disable checksums on nodatacow files will depend on how much they value the checksum protection vs. the impact of manually fixing problems after some unclean shutdowns.

In my particular case, many of these nodatacow files are large, very long-lived and only in use intermittently. I would like my monthly "btrfs scrub" to know they haven't gone bad but they are extremely unlikely to be in the middle of a write during an unclean shutdown so I am likely to have very few false errors. They are all backed up, but without checksumming I don't know that the backup needs to be restored (or even that I am not backing up now-bad data).

Graham
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On 14/08/17 15:23, Austin S. Hemmelgarn wrote:
> Assume you have higher level verification.

But almost no applications do. In real life, the decision making/correction process will be manual and labour-intensive (for example, running fsck on a virtual disk or restoring a file from backup).

> Would you rather not be able
> to read the data regardless of if it's correct or not, or be able to
> read it and determine yourself if it's correct or not?

It must be controllable on a per-file basis, of course. For the tiny number of files where the app can both spot the problem and correct it (for example if it has a journal) the current behaviour could be used.

But, on MY system, I absolutely would **always** select the first option (-EIO). I need to know that a potential problem may have occurred and will take manual action to decide what to do. Of course, this also needs a special utility (as Christoph proposed) to be able to force the read (to allow me to examine the data) and to be able to reset the checksum (although that is presumably as simple as rewriting the data).

This is what happens normally with any filesystem when a disk block goes bad, but with the additional benefit of being able to examine a "possibly valid" version of the data block before overwriting it.

> Looking at this from a different angle: Without background, what would
> you assume the behavior to be for this? For most people, the assumption
> would be that this provides the same degree of data safety that the
> checksums do when the data is CoW.

Exactly. The naive expectation is that turning off datacow does not prevent the bitrot checking from working. Also, the naive expectation (for any filesystem operation) is that if there is any doubt about the reliability of the data, the error is reported for the user to deal with.
Re: kernel btrfs file system wedged -- is it toast?
On 21/07/17 07:06, Paul Jackson wrote:
> What in God's green earth can kernel file system code be
> doing that takes fifteen minutes (so far, in this case) or
> fifty minutes (in the case I first reported on this thread)?

I find that just doing a balance on a disk with lots of snapshots can cause this sort of effect. If I understand correctly, this is because btrfs does not have an efficient structure to help find all the references in different subvolumes to an extent which is being manipulated, so many trees have to be searched. My understanding may be wrong but, in any case, the effect is that many operations can take massive amounts of processing time if there are lots of shared extents.

This caused my early experiments with using btrfs snapshots on my main data disk (full of mail files, etc) to make the system lock up for many **hours** at a time (preventing mail processing, etc).

I took the advice on this list to significantly decrease the number of snapshots I kept on the disk (I keep more on a separate backup disk, which has no day-to-day transactions happening, and can also better tolerate any issues). I also created a very hacky script to try to limit the impact of the balances which I do weekly. See https://github.com/GrahamCobb/btrfs-balance-slowly if you are interested.

Between these things, the serious disruption is now gone.

Note: despite all the hangs, I never saw any disk corruption.

Graham
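One simple way to limit the impact of a balance, sketched here purely as an illustration, is to relocate a single chunk at a time and pause between steps so other work can get at the disk. This is not the script linked above, and the usage threshold, step count, pause length and the BTRFS dry-run override are all assumptions.

```shell
# Balance a filesystem in small steps: one data chunk per step, with a
# pause between steps.
# usage: balance_slowly MOUNTPOINT STEPS PAUSE_SECONDS
# BTRFS is a hypothetical override (e.g. BTRFS=echo) for dry runs.
balance_slowly() {
    mountpoint=$1
    steps=$2
    pause=$3
    i=0
    while [ "$i" -lt "$steps" ]; do
        # limit=1 restricts each run to a single matching chunk.
        ${BTRFS:-btrfs} balance start -dusage=50 -dlimit=1 "$mountpoint" || return 1
        sleep "$pause"
        i=$((i + 1))
    done
}
```

For example, `balance_slowly /mnt/data 10 600` (hypothetical mount point) would move at most ten chunks with ten minutes of quiet time after each one.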
Re: backing up a collection of snapshot subvolumes
On 25/04/17 05:02, J. Hart wrote:
> I have a remote machine with a filesystem for which I periodically take
> incremental snapshots for historical reasons. These snapshots are
> stored in an archival filesystem tree on a file server. Older snapshots
> are removed and newer ones added on a rotational basis. I need to be
> able to backup this archive by syncing it with a set of backup drives.
> Due to the size, I need to back it up incrementally rather than sending
> the entire content each time. Due to the snapshot rotation, I need to
> be able to update the state of the archive backup filesystem as a whole,
> in much the same manner that rsync handles file trees.

If I have understood your requirement correctly, this seems to be exactly matched to the capabilities of btrbk. I use btrbk to maintain a similar backup disk which contains a full copy of my main data disk along with various snapshots.

> It seems that I cannot use "btrfs send", as the archive directory
> contains the snapshots as subvolumes.

I'm not sure what you mean. If your problem is that btrfs send does not cross subvolume boundaries then that is true: you would need to configure btrbk to back up each subvolume. I have a cron job that checks that all subvolumes (except the snapshots btrbk creates) are listed in my btrbk configuration file.

Graham
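A check of that kind could be sketched as below. This is not the actual cron job, just an illustration under some assumptions: the subvolume names arrive one per line on standard input (already filtered to exclude btrbk's own snapshots), names contain no spaces, and the configuration lists each one on a `subvolume <name>` line.

```shell
# Print every subvolume name read from standard input that has no
# matching "subvolume <name>" line in the given btrbk configuration file.
# usage: missing_from_btrbk /path/to/btrbk.conf < list-of-subvolume-names
missing_from_btrbk() {
    conf=$1
    while IFS= read -r path; do
        # awk exits 0 only if some line is exactly "subvolume <path>".
        awk -v p="$path" '$1 == "subvolume" && $2 == p { found = 1 }
                          END { exit !found }' "$conf" || echo "$path"
    done
}
```

Any output from the function means a subvolume is not being backed up, so a cron job only needs to mail whatever it prints.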
Re: backing up a file server with many subvolumes
On 27/03/17 13:00, J. Hart wrote:
> That is a very interesting idea. I'll try some experiments with this.

You might want to look into two tools which I have found useful for similar backups:

1) rsnapshot -- this uses rsync for backing up multiple systems and has been stable for quite a long time. If the target disk is btrfs it is fairly easy to configure so that it uses btrfs snapshots to create and remove the snapshot directories, speeding up the process. This doesn't really use any complex btrfs features and has been stable for me even on my Debian stable (kernel 3.16.39) system.

2) btrbk -- this allows you to create and manage btrfs snapshots on the source disk as well as backup snapshots on a separate btrfs disk. You can separately control how many snapshots you keep online on both the source and the backup disk. This is particularly useful for cases where you want to take very frequent snapshots (say hourly) for which rsync may be too slow (and rsync does not take a consistent snapshot, of course).

There are many other tools, of course (I also take daily backups with dar to an ext4 system, without using any btrfs features at all, just in case a new version of btrfs suddenly decided to correct all copies of IHATEBTRFS on the disk to ILOVEBTRFS, for example :-) ).

Graham

Note to self: re-read this message periodically to check that feature hasn't appeared yet.
Re: BTRFS and cyrus mail server
On 08/02/17 18:38, Libor Klepáč wrote:
> I'm interested in using:
...
> - send/receive for offsite backup

I don't particularly recommend that. I do use send/receive for onsite backups (I actually use btrbk). But for offsite I use a traditional backup tool (I use dar). For three main reasons:

1) Paranoia: I want a backup that does not use btrfs just in case there turned out to be some problem with btrfs which could corrupt the backup. I can't think of anything but I did say it was paranoia!

2) send/receive in incremental mode (the obvious way to use it for offsite backups) relies on the target being up to date and properly synchronised with the source. If, for any reason, it gets out of sync, you have to start again with sending a full backup - a lot of data. Traditional backup formats are more forgiving and having a corrupted incremental does not normally prevent you getting access to data stored in the other incrementals. This would particularly be a risk if you thought about storing the actual send streams instead of doing the receive: a single bit error in one could make all the subsequent streams useless.

3) send/receive doesn't work particularly well with encryption. I store my offsite backups in a cloud service and I want them encrypted both in transit and when stored. To get the same with send/receive requires putting together your own encrypted communication channel (e.g. using ssh) and requires that you have a remote server, with an encrypted filesystem receiving the data (and it has to be accessible in the clear on that server). Traditional backups can just be stored offsite as encrypted files without ever having to be in the clear anywhere except onsite.

Just my reasons.
Re: btrfs receive leaves new subvolume modifiable during operation
On 05/02/17 12:08, Kai Krakow wrote:
> Wrong. If you tend to not be in control of the permissions below a
> mountpoint, you prevent access to it by restricting permissions on a
> parent directory of the mountpoint. It's that easy and it always has
> been. That is standard practice. While your backup is running, you have
> no control of it - thus use this standard practice!

Sorry, you are missing the point. This isn't about backups, it is about snapshots.

To the sysadmin who is not a developer and does not know how receive is actually implemented, send/receive appears to work exactly like taking a readonly snapshot, but between two different disks. That is the mental model they have of the process.

Taking a snapshot does not require hiding the target: it either works or it doesn't, and it cannot be interfered with. The sysadmin's natural expectation is that send/receive works the same way.

You may say, from your position of knowledge about how it is implemented, that is an unrealistic expectation but it is a natural and common expectation. I very firmly believe that 80% of ordinary btrfs sysadmins would be surprised by this behaviour.

But, in any case, we can all agree that this unexpected behaviour needs to be documented.
Re: btrfs receive leaves new subvolume modifiable during operation
On 03/02/17 16:01, Austin S. Hemmelgarn wrote:
> Ironically, I ended up having time sooner than I thought. The message
> doesn't appear to be in any of the archives yet, but the message ID is:
> <20170203134858.75210-1-ahferro...@gmail.com>

Ah. I didn't notice it until after I had sent my message.

> I actually like how you explained things a bit better though, so if you
> are OK with it I'll update the patch I sent using your description (and
> credit you in the commit message too of course).

You are welcome to use any of my phrasing or approach, of course!
Re: btrfs receive leaves new subvolume modifiable during operation
On 03/02/17 12:44, Austin S. Hemmelgarn wrote:
> I can look at making a patch for this, but it may be next week before I
> have time (I'm not great at multi-tasking when it comes to software
> development, and I'm in the middle of helping to fix a bug in Ansible
> right now).

That would be great, Austin! It is about 15 years since I last submitted a patch under kernel development patch rules and things have changed a fair bit in that time. So if you are set up to do it that sounds good. As a starting point, I have created a suggested text (patch attached).

diff --git a/Documentation/btrfs-receive.asciidoc b/Documentation/btrfs-receive.asciidoc
index 6be4aa6..db525d9 100644
--- a/Documentation/btrfs-receive.asciidoc
+++ b/Documentation/btrfs-receive.asciidoc
@@ -31,7 +31,7 @@ the stream, and print the stream metadata, one operation per line.
 3. default subvolume has changed or you didn't mount the filesystem at the toplevel subvolume
-A subvolume is made read-only after the receiving process finishes succesfully.
+A subvolume is made read-only after the receiving process finishes succesfully (see BUGS below).
 `Options`
@@ -73,6 +73,16 @@ EXIT STATUS
 *btrfs receive* returns a zero exit status if it succeeds. Non zero is returned in case of failure.
+BUGS
+
+*btrfs receive* sets the subvolume read-only after it completes successfully.
+However, while the receive is in progress, users who have write access to files
+or directories in the receiving 'path' can add, remove or modify files, in which
+case the resulting read-only subvolume will not be a copy of the sending subvolume.
+
+If the intention is to create an exact copy, the receiving 'path' should be protected
+from access by users until the receive has completed and the subvolume set to read-only.
+
 AVAILABILITY
 *btrfs* is part of btrfs-progs.
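The protection suggested in that text could be done along the following lines. This is only a sketch of one possible workaround, assuming the receive is run as root and that the receiving directory can be closed to other users for the duration; the function name, the final mode of 755 and the BTRFS dry-run override are illustrative assumptions.

```shell
# Receive a send stream (on standard input) into a directory that other
# users cannot enter until the receive has completed.
# BTRFS is a hypothetical override (e.g. BTRFS=true) for dry runs.
protected_receive() {
    dest=$1
    mkdir -p "$dest" || return 1
    # Owner-only access while the new subvolume is still writable.
    chmod 700 "$dest" || return 1
    ${BTRFS:-btrfs} receive "$dest" || return 1
    # Reached only after a successful receive, when the subvolume has
    # been set read-only.
    chmod 755 "$dest"
}
```

Note that if the directory already holds earlier snapshots, users lose access to those as well while the receive runs, and the directory stays closed if the receive fails.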
Re: btrfs receive leaves new subvolume modifiable during operation
On 02/02/17 00:02, Duncan wrote:
> If it's a workaround, then many of the Linux procedures we as admins and
> users use every day are equally workarounds. Setting 007 perms on a dir
> that doesn't have anything immediately security vulnerable in it, simply
> to keep other users from even potentially seeing or being able to write
> to something N layers down the subdir tree, is standard practice.

No. There is no need to normally place a read-only snapshot below a no-execute directory just to prevent write access to it. That is not part of the admin's expectation.

> Which is my point. This is no different than standard security practice,
> that an admin should be familiar with and using without even having to
> think about it. Btrfs is simply making the same assumptions that
> everyone else does, that an admin knows what they are doing and sets the
> upstream permissions with that in mind. If they don't, how is that
> btrfs' fault?

Because btrfs intends the receive snapshot to be read-only. That is the expectation of the sysadmin. It is an important and useful feature which makes send/receive very useful for creating user-readable-but-not-modifiable backups (without it, send/receive are useful for many things but less useful for creating backups). That feature has a bug. Just because you don't personally use the feature, doesn't mean it isn't a bug! Many of us do rely on that feature.

Even though it is security-related, I agree it isn't the highest priority btrfs bug. It can probably wait until receive is being worked on for other reasons. But if it isn't going to be fixed any time soon, it should be documented in the Wiki and the man page, with the suggested workround for anyone who needs to make sure the receive won't be tampered with.
Re: btrfs receive leaves new subvolume modifiable during operation
On 01/02/17 22:27, Duncan wrote:
> Graham Cobb posted on Wed, 01 Feb 2017 17:43:32 + as excerpted:
>
>> This first bug is more serious because it appears to allow a
>> non-privileged user to disrupt the correct operation of receive,
>> creating a form of denial-of-service of a send/receive based backup
>> process. If I decided that I didn't want my pron collection (or my
>> incriminating emails) appearing in the backups I could just make sure
>> that I removed them from the receive snapshots while they were still
>> writeable.
>
> I'll prefix this question by noting that my own use-case doesn't use send/
> receive, so while I know about it in general from following the list,
> I've no personal experience with it...
>
> With that said, couldn't the entire problem be eliminated by properly
> setting the permissions on a directory/subvol upstream of the received
> snapshot?

I (honestly) don't know. But even if that does work, it is clearly only a workaround for the bug. Where in the documentation does it warn the system manager about the problem? Where does it tell them that they had better make sure they only receive into a directory tree which does not allow users read or execute access (not just not write access!)? What if part of the point of the backup strategy is that users have read access to these snapshots so they can restore their own files?

The possibility of a knowledgeable system manager being able to work around the problem by limiting how they use it doesn't stop it being a bug.
Re: btrfs receive leaves new subvolume modifiable during operation
On 01/02/17 12:28, Austin S. Hemmelgarn wrote: > On 2017-02-01 00:09, Duncan wrote: >> Christian Lupien posted on Tue, 31 Jan 2017 18:32:58 -0500 as excerpted: >> >>> I have been testing btrfs send/receive. I like it. >>> >>> During those tests I discovered that it is possible to access and modify >>> (add files, delete files ...) of the new receive snapshot during the >>> transfer. After the transfer it becomes readonly but it could already >>> have been modified. >>> >>> So you can end up with a source and a destination which are not the >>> same. Therefore during a subsequent incremental transfers I can get >>> receive to crash (trying to unlink a file that is not in the parent but >>> should). >>> >>> Is this behavior by design or will it be prevented in the future? >>> >>> I can of course just not modify the subvolume during receive but is >>> there a way to make sure no user/program modifies it? >> >> I'm just a btrfs-using list regular not a dev, but AFAIK, the behavior is >> likely to be by design and difficult to change, because the send stream >> is simply a stream of userspace-context commands for receive to act upon, >> and any other suitably privileged userspace program could run the same >> commands. (If your btrfs-progs is new enough receive even has a dump >> option, that prints the metadata operations in human readable form, one >> operation per line.) >> >> So making the receive snapshot read-only during the transfer would >> prevent receive itself working. > That's correct. Fixing this completely would require implementing > receive on the kernel side, which is not a practical option for multiple > reasons. I am with Christian on this. Both the effects he discovered go against my expectation of how send/receive would or should work. This first bug is more serious because it appears to allow a non-privileged user to disrupt the correct operation of receive, creating a form of denial-of-service of a send/receive based backup process. 
If I decided that I didn't want my pron collection (or my incriminating emails) appearing in the backups I could just make sure that I removed them from the receive snapshots while they were still writeable. You may be right that fixing this would require receive in the kernel, and that is undesirable, although it seems to me that it should be possible to do something like allow receive to create the snapshot with a special flag that would cause the kernel to treat it as read-only to any requests not delivered through the same file descriptor, or something like that (or, if that can't be done, at least require root access to make any changes). In any case, I believe it should be treated as a bug, even if low priority, with an explicit warning about the possible corruption of receive-based backups in the btrfs-receive man page. >>> I can also get in the same kind of trouble by modifying a parent (after >>> changing its property temporarily to ro=false). send/receive is checking >>> that the same parent uuid is available on both sides but not that >>> generation has not changed. Of course in this case it requires direct >>> user intervention. Never changing the ro property of subvolumes would >>> prevent the problem. >>> >>> Again is this by design? >> >> Again, yes. The ability to toggle snapshots between ro/rw is a useful >> feature and was added deliberately. This one would seem to me to be much >> like the (no doubt apocryphal) guy who went to the doctor complaining >> that when he beat his head against the wall, it hurt. The doctor said, >> "Stop doing that then." > Agreed, especially considering that some of the most interesting > use-cases for send/receive (which requires the sent subvolume to be > read-only) require the subvolume to be made writable again on the other > end. I agree that there are good reasons why subvolumes should be switchable between ro and rw. 
However, receive should detect and issue warnings when this problem has happened (for example by checking the generation). Again, this may be low priority, and may need to wait for a send stream format change, but it can't be claimed that this is correct behaviour. Graham
Re: btrfs recovery
On 30/01/17 22:37, Michael Born wrote:
> Also, I'm not interested in restoring the old Suse 13.2 system. I just
> want some configuration files from it.

If all you really want is to get some important information from some specific config files, and it is important enough to be worth an hour or so of your time, you could consider a brute-force method: grep the whole image file for a string you know should appear in the relevant config file, then dump the blocks around those locations to see whether the data you need is there.

Unfortunately this won't work if you had file compression on. Or if there is no reasonably unique text to search for, of course.

Just a thought.
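The brute-force search described above could be sketched roughly as follows. This is only an illustration of the idea, not a tested recovery tool: the function name `dump_matches`, the default context size and the example image name are all made up here, and the sketch assumes the image is uncompressed and non-empty.

```python
import mmap


def dump_matches(image_path, needle, context=512):
    """Yield (offset, surrounding_bytes) for each occurrence of `needle`.

    `needle` is a bytes string expected to appear in the wanted config
    file; `context` is how many bytes to keep on each side of a match.
    """
    with open(image_path, "rb") as f:
        # Map the image read-only so a large file is not loaded into RAM.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            pos = m.find(needle)
            while pos != -1:
                start = max(0, pos - context)
                end = min(len(m), pos + len(needle) + context)
                yield pos, m[start:end]
                pos = m.find(needle, pos + 1)
```

For example, `for off, blob in dump_matches("old-disk.img", b"PermitRootLogin"): print(off, blob)` would print each byte offset and the raw data around it, which can then be inspected by eye.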
Re: [Not TLS] Re: mount option nodatacow for VMs on SSD?
On 28/11/16 02:56, Duncan wrote:
> It should still be worth turning on autodefrag on an existing somewhat
> fragmented filesystem. It just might take some time to defrag files you
> do modify, and won't touch those you don't, which in some cases might
> make it worth defragging those manually. Or simply create new
> filesystems, mount them with autodefrag, and copy everything over so
> you're starting fresh, as I do.

Could that "copy" be (a series of) send/receive, so that snapshots and reflinks are preserved? Does autodefrag work in that case or does the send/receive somehow override that and end up preserving the original (fragmented) extent structure?

Graham
Re: [PATCH 3/3] btrfs-progs: Add command to check if balance op is req
On 28/10/16 16:20, David Sterba wrote:
> I tend to agree with this approach. The usecase, with some random sample
> balance options:
>
> $ btrfs balance start --analyze -dusage=10 -musage=5 /path

Wouldn't a "balance analyze" command be better than "balance start --analyze"? I would have guessed the latter started the balance but printed some analysis as well (before or, probably more usefully, afterwards).

There might, of course, be some point in a (future)

$ btrfs balance start --if-needed -dusage=10 -musage=5 /path

command.
Re: Incremental send robustness question
On 13/10/16 00:47, Sean Greenslade wrote: > I may just end up doing that. Hugo's responce gave me some crazy ideas > involving a custom build of split that waits for a command after each > output file fills, which would of course require an equally weird build > of cat that would stall the pipe indefinitely until all the files showed > up. Driving the HDD over would probably be a little simpler. =P I am sure it is, if that is an option. I had considered doing something similar: doing an initial send to a big file on a spare disk, then sending the disk to Amazon to import using their Import/Export Disk service, then creating a server and a filesystem in AWS and doing a btrfs receive from the imported file. The plan would then be to do incremental sends over my home broadband line for subsequent backups. >>> And while we're at it, what are the failure modes for incremental sends? >>> Will it throw an error if the parents don't match, or will there just be >>> silent failures? >> >> Create a list of possibilities, create some test filesystems, try it. > > I may just do that, presuming I can find the spare time. Given that I'm > building a backup solution around this tech, it would definitely bolster > my confidence in it if I knew what its failure modes looked like. That's a good idea. In the end I decided that relying on btrfs send for my offsite cold storage backups was probably not a good idea. Btrfs is great, and I heavily use snapshotting and send/receive locally. But there is always a small nagging fear that a btrfs bug could introduce some problem which can survive send/receive and mean the backup was corrupted as well (or that an incremental won't load for some reason). For that reason I decided to deliberately use a different technology for backups. I now use dar to create backups and then upload the files to a cloud cold storage service for safekeeping. 
There are other reasons as well: encryption is easier to handle, cold storage for files is cheaper than having disk images which need to be online to load the incrementals, no need for a virtual server, handling security for backups from different servers with different levels of risk is easier, etc. There are also downsides: verifying the backups are readable/restorable is harder, bandwidth usage is less efficient (dar sends more data than btrfs send would as it is working at the file level, not the extent level).

By the way, to test out your various failure modes I recommend creating some small btrfs filesystems on loop devices -- just be careful to make sure you create each one from scratch and do not copy disk images (so that they all have unique filesystem UUIDs).
Re: multi-device btrfs with single data mode and disk failure
On 20/09/16 19:53, Alexandre Poux wrote:
> As for moving data to an another volume, since it's only data and
> nothing fancy (no subvolume or anything), a simple rsync would do the trick.
> My problem in this case is that I don't have enough available space
> elsewhere to move my data.
> That's why I'm trying this hard to recover the partition...

I am sure you have already thought about this, but... it might be easier, and maybe even faster, to back up the data to a cloud server, then recreate the filesystem and download the data again. Backblaze B2 is very cheap for upload and storage (I don't know about download charges, though). And rclone works well to handle rsync-style copies (although you might want to use tar or dar if you need to preserve file attributes).

And if that works, rclone + B2 might make a reasonable offsite backup solution for the future!

Graham
Re: Security implications of btrfs receive?
On 07/09/16 16:06, Austin S. Hemmelgarn wrote:
> It hasn't, because there's not any way it can be completely fixed. This
> particular case is an excellent example of why it's so hard to fix. To
> close this particular hole, BTRFS itself would have to become aware of
> whether whoever is running an ioctl is running in a chroot or not, which
> is non-trivial to determine to begin with, and even harder when you
> factor in the fact that chroot() is a VFS level thing, not a underlying
> filesystem thing, while ioctls are much lower level.

Actually, I think the btrfs-receive case might be a little easier to fix and, in my quick reading of the code before doing a test, I thought it even did work this way...

I think the fix would be for the block-cloning operation to require user mode to provide an FD which has read access to the source blocks. There is little problem with allowing the ioctl to identify the matching subvolume for a UUID, but it should require user mode to open the source file for read access using a path before allowing any blocks to be cloned. That way, any VFS-level checks would be done. What is more, btrfs-receive could do a file path check to make sure the file being cloned from is within the path that was provided on the command line.
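The file path check suggested in the last sentence above might look something like the sketch below. It only illustrates the idea of confining a clone source to the receive path; it is not how btrfs-receive is implemented, the name `is_within` is hypothetical, and a real implementation would also have to avoid the race between checking a path and opening it.

```python
import os


def is_within(root, candidate):
    """Return True if `candidate` resolves to a location inside `root`.

    Both paths are resolved with realpath() first, so symlinks and ".."
    components cannot be used to step outside `root`.
    """
    root_real = os.path.realpath(root)
    cand_real = os.path.realpath(candidate)
    # commonpath() compares whole path components, so "/recv2" is not
    # mistaken for being inside "/recv".
    return os.path.commonpath([root_real, cand_real]) == root_real
```

A receiver built along these lines would refuse to clone from any source file for which `is_within(receive_path, source_path)` is false.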
Re: Security implications of btrfs receive?
On 07/09/16 16:20, Austin S. Hemmelgarn wrote:
> I should probably add to this that you shouldn't be accepting
> send/receive data streams from untrusted sources anyway. While it
> probably won't crash your system, it's not intended for use as something
> like a network service. If you're sending a subvolume over an untrusted
> network, you should be tunneling it through SSH or something similar,
> and then using that to provide source verification and data integrity
> guarantees, and if you can't trust the system's your running backups
> for, then you have bigger issues to deal with.

In my personal case I'm not talking about accepting streams from untrusted sources (although that is also a perfectly reasonable question to discuss). My concern is this: if one of my (well managed and trusted but never perfect) systems is hacked, can the intruder use that as an entry point to attack my other systems?

In particular, I never trust my systems which live on the internet with automated access to my personal systems (without a human providing additional passwords/keys), although I do allow some automated access the other way around. I am trying to determine whether sharing btrfs-send-based backups would open a vulnerability.

There are articles on the web suggesting that centralised btrfs-send-based backups are a good idea (using ssh access with separate keys for each system which automatically invoke btrfs-receive into a system-specific path). My tests so far suggest that this may not be as secure as the articles imply.

In any case, I think this is a topic worth investigating further, if any graduate student is looking for a PhD topic!
Re: Security implications of btrfs receive?
Thanks to Austin and Duncan for their replies.

On 06/09/16 13:15, Austin S. Hemmelgarn wrote:
> On 2016-09-05 05:59, Graham Cobb wrote:
>> Does the "path" argument of btrfs-receive mean that *all* operations are
>> confined to that path? For example, if a UUID or transid is sent which
>> refers to an entity outside the path will that other entity be affected
>> or used?
> As far as I know, no, it won't be affected.
>> Is it possible for a file to be created containing shared
>> extents from outside the path?
> As far as I know, the only way for this to happen is if you're
> referencing a parent subvolume for a relative send that is itself
> sharing extents outside of the path. From a practical perspective,
> unless you're doing deduplication on the receiving end, the this
> shouldn't be possible.

Unfortunately that is not the case. I decided to do some tests to see what happens. It is possible for a receive into one path to reference and access a subvolume from a different path on the same btrfs disk. I have created a bash script to demonstrate this at:

https://gist.github.com/GrahamCobb/c7964138057e4e092a75319c9fb240a3

This does require the attacker to know the (source) subvolume UUID they want to copy. I am not sure how hard UUIDs are to guess.

By the way, this is exactly the same whether or not the --chroot option is specified on the "btrfs receive" command.

The next question this raises for me is whether this means that processes in a chroot or in a container (or in a mandatory access controls environment) can access files outside the chroot/container if they know the UUID of the subvolume? After all, btrfs-receive uses IOCTLs that any program running as root can use.

Graham
Security implications of btrfs receive?
Does anyone know of a security analysis of btrfs receive?

I assume that just using btrfs receive requires root (is that so?). But I was thinking of setting up a backup server which would receive snapshots from various client systems, each in their own path, and I wondered how much the security of the backup server (and other clients' backups) was dependent on the security of the client.

Does the "path" argument of btrfs-receive mean that *all* operations are confined to that path? For example, if a UUID or transid is sent which refers to an entity outside the path will that other entity be affected or used? Is it possible for a file to be created containing shared extents from outside the path? Is it possible to confuse/affect filesystem metadata which would affect the integrity of subvolumes or files outside the path or prevent other clients from doing something legitimate?

Do the answers change if the --chroot option is given? I am confused about the -m option -- does that mean that the root mount point has to be visible in the chroot?

Lastly, even if receive is designed to be very secure, it is possible that it could trigger/use code paths in the btrfs kernel code which are not normally used during normal file operations and so could trigger bugs not normally seen. Has any work been done on testing for that (for example tests using malicious streams, including ones which btrfs-send cannot generate)?

I am just wondering whether any work has been done/published on this area.

Regards
Graham
Re: Extents for a particular subvolume
On 03/08/16 22:55, Graham Cobb wrote:
> On 03/08/16 21:37, Adam Borowski wrote:
>> On Wed, Aug 03, 2016 at 08:56:01PM +0100, Graham Cobb wrote:
>>> Are there any btrfs commands (or APIs) to allow a script to create a
>>> list of all the extents referred to within a particular (mounted)
>>> subvolume? And is it a reasonably efficient process (i.e. doesn't
>>> involve backrefs and, preferably, doesn't involve following directory
>>> trees)?

In case anyone else is interested in this, I ended up creating some simple scripts to allow me to do this. They are slow because they walk the directory tree and they use filefrag to get the extent data, but they do let me answer questions like:

* How much space am I wasting by keeping historical snapshots?
* How much data is being shared between two subvolumes?
* How much of the data in my latest snapshot is unique to that snapshot?
* How much data would I actually free up if I removed (just) these particular subvolumes?

If they are useful to anyone else you can find them at:

https://github.com/GrahamCobb/extents-lists

If anyone knows of more efficient ways to get this information please let me know. And, of course, feel free to suggest improvements/bugfixes!
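Once an extent list exists for each subvolume, questions like those above reduce to interval arithmetic. The sketch below is not taken from the extents-lists scripts; it assumes each list has already been collected as (physical_offset, length) pairs in bytes (for example by parsing filefrag output), and the function names are invented for illustration.

```python
def merge(extents):
    """Collapse (start, length) pairs into sorted, disjoint [start, end) ranges."""
    merged = []
    for start, length in sorted(extents):
        end = start + length
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def total_bytes(extents):
    """Bytes referenced by a list, counting each byte on disk once."""
    return sum(end - start for start, end in merge(extents))


def shared_bytes(a, b):
    """Bytes referenced by both extent lists."""
    ra, rb = merge(a), merge(b)
    i = j = total = 0
    while i < len(ra) and j < len(rb):
        lo = max(ra[i][0], rb[j][0])
        hi = min(ra[i][1], rb[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever range finishes first.
        if ra[i][1] < rb[j][1]:
            i += 1
        else:
            j += 1
    return total
```

With these, the data unique to snapshot `a` relative to `b` is `total_bytes(a) - shared_bytes(a, b)`.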
Re: Extents for a particular subvolume
On 03/08/16 21:37, Adam Borowski wrote: > On Wed, Aug 03, 2016 at 08:56:01PM +0100, Graham Cobb wrote: >> Are there any btrfs commands (or APIs) to allow a script to create a >> list of all the extents referred to within a particular (mounted) >> subvolume? And is it a reasonably efficient process (i.e. doesn't >> involve backrefs and, preferably, doesn't involve following directory >> trees)? > > Since the size of your output is linear to the number of extents which is > between the number of files and sum of their sizes, I see no gain in > trying to avoid following the directory tree. Thanks for the help, Adam. There are a lot of files and a lot of directories - find, "ls -R" and similar operations take a very long time. I was hoping that I could query some sort of extent tree for the subvolume and get the answer back in seconds instead of multiple minutes. But I can follow the directory tree if I need to. >> I am not looking to relate the extents to files/inodes/paths. My >> particular need, at the moment, is to work out how much of two snapshots >> is shared data, but I can think of other uses for the information. > > Thus, unlike the question you asked above, you're not interested in _all_ > extents, merely those which changed. > > You may want to look at "btrfs subv find-new" and "btrfs send --no-data". Unfortunately, the subvolumes do not have an ancestor-descendent relationship (although they do have some common ancestors), so I don't think find-new is much help (as far as I can see). But just looking at the size of the output from "send -c" would work well enough for the particular problem I am trying to solve tonight! Although I will need to take read-only snapshots of the subvolumes to allow send to work. Thanks for the suggestion. I would still be interested in the extent list, though. The main problem with find-new and send is that they don't tell me how much has been deleted, only added. 
I am thinking about using the extents to get a much better handle on what is using up space and what I could recover if I removed (or moved to another volume) various groups of related subvolumes. Thanks again for the help.
Extents for a particular subvolume
Are there any btrfs commands (or APIs) to allow a script to create a list of all the extents referred to within a particular (mounted) subvolume? And is it a reasonably efficient process (i.e. doesn't involve backrefs and, preferably, doesn't involve following directory trees)?

I am not looking to relate the extents to files/inodes/paths. My particular need, at the moment, is to work out how much of two snapshots is shared data, but I can think of other uses for the information.

Graham
Re: [PATCH] btrfs-progs: fi defrag: change default extent target size to 32 MiB
On 28/07/16 12:17, David Sterba wrote:
> diff --git a/cmds-filesystem.c b/cmds-filesystem.c
> index ef1f550b51c0..6b381c582ea7 100644
> --- a/cmds-filesystem.c
> +++ b/cmds-filesystem.c
> @@ -968,7 +968,7 @@ static const char * const cmd_filesystem_defrag_usage[] = {
>  	"-f          flush data to disk immediately after defragmenting",
>  	"-s start    defragment only from byte onward",
>  	"-l len      defragment only up to len bytes",
> -	"-t size     target extent size hint",
> +	"-t size     target extent size hint (default: 32 MiB)",

As a user... might it be better to say the default is 32M as that is the format the option requires?
Re: mount btrfs takes 30 minutes, btrfs check runs out of memory
On 21/07/16 09:19, Qu Wenruo wrote:
> We don't usually get such large extent tree dump from a real world use
> case.

Let us know if you want some more :-)

I have a heavily used single disk BTRFS filesystem with about 3.7TB in use and about 9 million extents. I am happy to provide an extent dump if it is useful to you. Particularly if you don't need me to actually unmount it (i.e. you can live with some inconsistencies).
Re: Is "btrfs balance start" truly asynchronous?
On 21/06/16 12:51, Austin S. Hemmelgarn wrote:
> The scrub design works, but the whole state file thing has some rather
> irritating side effects and other implications, and developed out of
> requirements that aren't present for balance (it might be nice to check
> how many chunks actually got balanced after the fact, but it's not
> absolutely necessary).

Actually, that would be **really** useful. I have been experimenting with cancelling balances after a certain time (as part of my "balance-slowly" script). I have got it working, just using bash scripting, but it means my script does not know whether any work has actually been done by the balance run which was cancelled (if no work was done, but it timed out anyway, there is probably no point trying again with the same timeout later!).

Graham
Re: Reducing impact of periodic btrfs balance
On 19/05/16 02:33, Qu Wenruo wrote: > > > Graham Cobb wrote on 2016/05/18 14:29 +0100: >> A while ago I had a "no space" problem (despite fi df, fi show and fi >> usage all agreeing I had over 1TB free). But this email isn't about >> that. >> >> As part of fixing that problem, I tried to do a "balance -dusage=20" on >> the disk. I was expecting it to have system impact, but it was a major >> disaster. The balance didn't just run for a long time, it locked out >> all activity on the disk for hours. A simple "touch" command to create >> one file took over an hour. > > It seems that balance blocked a transaction for a long time, which makes > your touch operation to wait for that transaction to end. I have been reading volumes.c. But I don't have a feel for which transactions are likely to be the things blocking for a really long time (hours). If this can occur, I think the warnings to users about balance need to be extended to include this issue. Currently the user mode code warns users that unfiltered balances may take a long time, but it doesn't warn that the disk may be unusable during that time. >> 3) My btrfs-balance-slowly script would work better if there was a >> time-based limit filter for balance, not just the current count-based >> filter. I would like to be able to say, for example, run balance for no >> more than 10 minutes (completing the operation in progress, of course) >> then return. > > As btrfs balance is done in block group unit, I'm afraid such thing > would be a little tricky to implement. It would be really easy to add a jiffies-based limit into the checks in should_balance_chunk. Of course, this would only test the limit in between block groups but that is what I was looking for -- a time-based version of the current limit filter. On the other hand, the time limit could just be added into the user mode code: after the timer expires it could issue a "balance pause". Would the effect be identical in terms of timing, resources required, etc? 
Would it be better to do a "balance pause" or a "balance cancel"? The goal would be to suspend balance processing and allow the system to do something else for a while (say 20 minutes) and then go back to doing more balance later. What is the difference between resuming a paused balance compared to starting a new balance? Bearing in mind that this is a heavily used disk so we can expect lots of transactions to have happened in the meantime (otherwise we wouldn't need this capability)? Graham
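The user-mode variant discussed above (start a balance, then issue a "balance pause" when a timer expires) could be sketched as below. This is only an illustration under assumptions: the function name is hypothetical, it does not settle the pause-versus-cancel question, and how "btrfs balance start" exits after a pause request has not been checked here. As with the kernel-side idea, the pause can only take effect between block groups.

```python
import subprocess
import threading


def balance_with_time_limit(mountpoint, filters, seconds, run=subprocess.run):
    """Start a balance and request a pause once `seconds` have elapsed.

    `filters` is a list of filter options such as ["-dusage=20"].
    `run` is injectable so the logic can be exercised without btrfs.
    """
    pause_cmd = ["btrfs", "balance", "pause", mountpoint]
    timer = threading.Timer(seconds, run, args=(pause_cmd,))
    timer.start()
    try:
        # Blocks until the balance finishes or stops after the pause request.
        return run(["btrfs", "balance", "start", *filters, mountpoint])
    finally:
        # Harmless if the timer has already fired.
        timer.cancel()
```

A later "btrfs balance resume" on the same mount point would then pick the work up again after the rest period.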
Re: [Not TLS] Re: Reducing impact of periodic btrfs balance
On 19/05/16 05:09, Duncan wrote: > So to Graham, are these 1.5K snapshots all of the same subvolume, or > split into snapshots of several subvolumes? If it's all of the same > subvolume or of only 2-3 subvolumes, you still have some work to do in > terms of getting down to recommended snapshot levels. Also, if you have > quotas on and don't specifically need them, try turning them off and see > if that alone makes it workable. I have just under 20 subvolumes but the snapshots are only taken if something has changed (actually I use btrbk: I am not sure if it takes the snapshot and then removes it if nothing changed or whether it knows not to even take it). The most frequently changing subvolumes have just under 400 snapshots each. I have played with snapshot retention and think it unlikely I would want to reduce it further. I have quotas turned off. At least, I am not using quotas -- how can I double check it is really turned off? I know that very large numbers of snapshots are not recommended, and I expected the balance to be slow. I was quite prepared for it to take many days. My full backups take several days and even incrementals take several hours. What I did not expect, and think is a MUCH more serious problem, is that the balance prevented use of the disk, holding up all writes to the disk for (quite literally) hours each. I have not seen that effect mentioned anywhere! That means that for a large, busy data disk, it is impossible to do a balance unless the server is taken down to single-user mode for the time the balance takes (presumably still days). I assume this would also apply to doing a RAID rebuild (I am not using multiple disks at the moment). At the moment I am still using my previous backup strategy, alongside the snapshots (that is: rsync-based rsnapshots to another disk daily and with fairly long retentions, and separate daily full/incremental backups using dar to a nas in another building). 
I was hoping the btrfs snapshots might replace the daily rsync snapshots but it doesn't look like that will work out. Thanks to all for the replies. Graham
Reducing impact of periodic btrfs balance
Hi,

I have a 6TB btrfs filesystem I created last year (about 60% used). It is my main data disk for my home server so it gets a lot of usage (particularly mail). I take frequent snapshots (using btrbk), so I have a lot of them (about 1500 now, although it was about double that until I cut back the retention times recently).

A while ago I had a "no space" problem (despite fi df, fi show and fi usage all agreeing I had over 1TB free). But this email isn't about that.

As part of fixing that problem, I tried to do a "balance -dusage=20" on the disk. I was expecting it to have system impact, but it was a major disaster. The balance didn't just run for a long time, it locked out all activity on the disk for hours. A simple "touch" command to create one file took over an hour.

More seriously, because of that, mail was being lost: all mail delivery timed out and the timeout error was interpreted as a fatal delivery error, causing mail to be discarded, mailing lists to cancel subscriptions, etc. The balance never completed, of course. I eventually got it cancelled.

I have since managed to complete the "balance -dusage=20" by running it repeatedly with "limit=N" (for small N). I wrote a script to automate that process, and I rerun it every week. If anyone is interested, the script is on GitHub: https://github.com/GrahamCobb/btrfs-balance-slowly

Out of that experience, I have a couple of thoughts about how to possibly make balance more friendly.

1) The balance process seems to (effectively) lock all file (extent?) creation for long periods of time. Would it be possible for balance to make more effort to yield locks, so that other processes/threads can get in and continue to create/write files while it is running?

2) btrfs scrub has options to set its ionice class and priority. Could balance have something similar? Or would reducing the IO priority make things worse because locks would be held for longer?
3) My btrfs-balance-slowly script would work better if there was a time-based limit filter for balance, not just the current count-based filter. I would like to be able to say, for example, run balance for no more than 10 minutes (completing the operation in progress, of course) then return.

4) My btrfs-balance-slowly script would be more reliable if there was a way to get an indication of whether there was more work to be done, instead of parsing the output for the number of relocations.

Any thoughts about these? Or other things I could be doing to reduce the impact on my services?

Graham
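For anyone who wants to try the same approach, the repeated "limit=N" runs, together with a rough stand-in for the time limit wished for in point 3, might look something like the sketch below. This is not the btrfs-balance-slowly script itself. The function names, the BTRFS override variable and the return codes are made up for illustration, and the parsing assumes balance prints a summary line of the form "Done, had to relocate X out of Y chunks". The time budget is only checked between runs, so a run in progress is always allowed to finish.

```shell
#!/bin/sh
# Sketch only: names, return codes and the parsed message format are
# assumptions made for illustration, not part of btrfs-progs.

# Print the number of chunks relocated, parsed from balance's summary line.
# Assumes a line like "Done, had to relocate 3 out of 120 chunks".
relocated_chunks() {
    printf '%s\n' "$1" |
        sed -n 's/^Done, had to relocate \([0-9][0-9]*\) out of [0-9][0-9]* chunks.*$/\1/p'
}

# balance_slowly MOUNTPOINT USAGE LIMIT BUDGET_SECONDS PAUSE_SECONDS
# Returns 0 when a run relocates nothing (no more work to do),
# 2 when the time budget expires first,
# 1 on a balance error or unparseable output.
balance_slowly() {
    bs_mnt=$1 bs_usage=$2 bs_limit=$3 bs_budget=$4 bs_pause=$5
    bs_deadline=$(( $(date +%s) + bs_budget ))
    while [ "$(date +%s)" -lt "$bs_deadline" ]; do
        # BTRFS can be pointed at a stub for testing; defaults to "btrfs".
        if ! bs_out=$(${BTRFS:-btrfs} balance start \
                "-dusage=${bs_usage},limit=${bs_limit}" "$bs_mnt" 2>&1); then
            printf '%s\n' "$bs_out" >&2
            return 1
        fi
        bs_n=$(relocated_chunks "$bs_out")
        [ -n "$bs_n" ] || return 1
        [ "$bs_n" -eq 0 ] && return 0
        # Give other writers a chance before the next small run.
        sleep "$bs_pause"
    done
    return 2
}
```

With these definitions loaded, something like "balance_slowly /mnt/data 20 2 600 60" (the mount point is hypothetical) would relocate at most 2 chunks per run, pause a minute between runs, and stop starting new runs after about 10 minutes. Parsing the summary line is exactly the fragile part that point 4 is about.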