btrfs-progs top-level options man pages and usage

2021-04-14 Thread Graham Cobb


I have a cron job which frequently deletes a subvolume, and I decided I
wanted to silence its output. I remembered there was a -q option and
thought I would just quickly glance at the documentation to check there
wasn't some reason I hadn't used it when I first wrote the script some
time ago.

That opened up an Alice-in-Wonderland rabbit hole... the documentation
for the common command options in btrfs-progs is not just awful - it
left me very confused about what I was seeing...

There are several issues:

1) The man pages do not describe the top-level btrfs command options, or
their equivalents at subcommand level.

1a) btrfs.asciidoc makes no mention of -q, --quiet, -v, --verbose or
even of the concept of top-level btrfs command options. It just explains
how the subcommand structure works.

1b) btrfs-subvolume.asciidoc makes no mention of -q or --quiet. However,
it *does* mention -v and --verbose, but with the completely cryptic (if
you are not a btrfs-progs developer) description "(deprecated) alias for
global '-v' option". What global '-v' option would that be, then, given
that it is not documented in btrfs.asciidoc? And what about '-q'?

I haven't looked at the other subcommand pages.

2) The built-in help in btrfs and btrfs-subvolume does not describe the
top-level command options.

2a) 'btrfs' prints a usage message that lists the -q, --quiet, -v and
--verbose options but gives no information about them. 'btrfs --help'
provides no further information. 'btrfs --help --full' does, but it is
almost 800 lines long.

2b) 'btrfs subvolume' doesn't even mention these options in its usage
message. Nor does it mention the --help option or the --help --full
option as global options or as subcommand options. 'btrfs subvolume
--help' provides exactly the same output. 'btrfs subvolume --help
--full' does at least mention the options - if anyone can ever guess
that it exists.

Again, I haven't looked at the other subcommands.

3) The difference between global options and subcommand options is very
unfortunate, and very confusing. I *hate* the concept of global options
(as implemented) -- in my mental model the 'btrfs' command is really
just a prefix to group together various btrfs-related commands and I
don't even *think* about ever inserting an option between btrfs and the
subcommand. In my mental model, the command is 'btrfs subvolume'. In my
mind, if the command was 'btrfs' then the syntax would more naturally be
'btrfs create subvolume' (like 'Siri, call David') instead of 'btrfs
subvolume create'.

3a) One particularly unfortunate effect is that 'btrfs subv del -v XXX'
works but 'btrfs subv del -q XXX' does not. I consider myself very
experienced with btrfs, but even after drafting the first version of
this email I changed my script to use that form and was surprised when
it didn't work.

3b) Another confusing effect is that both 'btrfs -v subv del XXX' and
'btrfs subv del -v XXX' work but 'btrfs subv -v del XXX' does not.
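
(For the record, since -q is a top-level option, I assume the form that
does work - and what my cron script needs - puts it before the
subcommand, e.g. (the path is just an example):

    btrfs -q subvolume delete /path/to/old-snapshot

but treat that as a guess based on the behaviour described above.)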

I think fixing the man page to document the global options is important
and I am happy to try to create a suitable patch. Does anyone else feel
the other issues should be fixed?

Graham


Re: no memory is freed after snapshots are deleted

2021-03-10 Thread Graham Cobb
On 10/03/2021 12:07, telsch wrote:
> Dear devs,
> 
> after my root partiton was full, i deleted the last monthly snapshots. 
> however, no memory was freed.
> so far rebalancing helped:
> 
>   btrfs balance start -v -musage=0 /
>   btrfs balance start -v -dusage=0 /
> 
> i have deleted all snapshots, but no memory is being freed this time.

Don't forget that, in general, deleting a snapshot frees nothing if the
original files are still there (or any other snapshots of the same files
are still there). In my experience, if you *really* need space urgently
you are best off starting by deleting some big files *and* all the
snapshots containing them, rather than starting by deleting snapshots.

If you are doing balances with low space, I find it useful to watch
dmesg to see if the balance is hitting problems finding space to even
free things up.

However, one big advantage of btrfs is that you can easily temporarily
add a small amount of space while you sort things out. Just plug in a
USB memory stick, and add it to the filesystem using 'btrfs device add'.

I don't recommend leaving it as part of the filesystem for long - it is
too easy for the memory stick to fail, or for you to remove it
forgetting how important it is - but it can be useful while you are
trying to do things like remove snapshots and files or run balance.
Don't forget to use btrfs device remove to remove it - don't just unplug
it!
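
A minimal sketch of that sequence (device name and mount point are
examples, not a recipe for any particular system):

    btrfs device add /dev/sdX /
    # ... delete snapshots and big files, run balance, etc ...
    btrfs device remove /dev/sdX /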



Re: Re: nfs subvolume access?

2021-03-10 Thread Graham Cobb
On 10/03/2021 08:09, Ulli Horlacher wrote:
> On Wed 2021-03-10 (07:59), Hugo Mills wrote:
> 
>>> On tsmsrvj I have in /etc/exports:
>>>
>>> /data/fex   tsmsrvi(rw,async,no_subtree_check,no_root_squash)
>>>
>>> This is a btrfs subvolume with snapshots:
>>>
>>> root@tsmsrvj:~# btrfs subvolume list /data
>>> ID 257 gen 35 top level 5 path fex
>>> ID 270 gen 36 top level 257 path fex/spool
>>> ID 271 gen 21 top level 270 path fex/spool/.snapshot/2021-03-07_1453.test
>>> ID 272 gen 23 top level 270 path fex/spool/.snapshot/2021-03-07_1531.test
>>> ID 273 gen 25 top level 270 path fex/spool/.snapshot/2021-03-07_1532.test
>>> ID 274 gen 27 top level 270 path fex/spool/.snapshot/2021-03-07_1718.test
>>>
>>> root@tsmsrvj:~# find /data/fex | wc -l
>>> 489887
> 
>>I can't remember if this is why, but I've had to put a distinct
>> fsid field in each separate subvolume being exported:
>>
>> /srv/nfs/home -rw,async,fsid=0x1730,no_subtree_check,no_root_squash
> 
> I must export EACH subvolume?!

I have had similar problems. I *think* the current situation is that
modern NFS, using NFSv4, can cope with the whole disk being accessible
without giving each subvolume its own FSID (which I have stopped doing).

HOWEVER, I think that find (and anything else which uses fsids and inode
numbers) will see subvolumes as having duplicated inodes.

> The snapshots are generated automatically (via cron)!
> I cannot add them to /etc/exports

Well, you could write some scripts... but I don't think it is necessary.
I *think* exporting each subvolume separately is only necessary if you
want `find` to be able to cross between subvolumes on the NFS-mounted
disks.

However, I am NOT an NFS expert, nor have I done a lot of work on this.
I might be wrong. But I do NFS mount my snapshots disk remotely and use
it. And I do see occasional complaints from find, but I live with it.
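
For what it's worth, the per-subvolume-fsid approach Hugo describes
would, I imagine, look something like this in /etc/exports (host, paths
and fsid values are just examples) - but, as I said, I have stopped
doing this myself:

    /data/fex        tsmsrvi(rw,async,fsid=0x1001,no_subtree_check,no_root_squash)
    /data/fex/spool  tsmsrvi(rw,async,fsid=0x1002,no_subtree_check,no_root_squash)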


Re: Large multi-device BTRFS array (usually) fails to mount on boot.

2021-02-19 Thread Graham Cobb


On 19/02/2021 17:42, Joshua wrote:
> February 3, 2021 3:16 PM, "Graham Cobb"  wrote:
> 
>> On 03/02/2021 21:54, jos...@mailmag.net wrote:
>>
>>> Good Evening.
>>>
>>> I have a large BTRFS array, (14 Drives, ~100 TB RAW) which has been having 
>>> problems mounting on
>>> boot without timing out. This causes the system to drop to emergency mode. 
>>> I am then able to mount
>>> the array in emergency mode and all data appears fine, but upon reboot it 
>>> fails again.
>>>
>>> I actually first had this problem around a year ago, and initially put 
>>> considerable effort into
>>> extending the timeout in systemd, as I believed that to be the problem. 
>>> However, all the methods I
>>> attempted did not work properly or caused the system to continue booting 
>>> before the array was
>>> mounted, causing all sorts of issues. Eventually, I was able to almost 
>>> completely resolve it by
>>> defragmenting the extent tree and subvolume tree for each subvolume. (btrfs 
>>> fi defrag
>>> /mountpoint/subvolume/) This seemed to reduce the time required to mount, 
>>> and made it mount on boot
>>> the majority of the time.
>>
>> Not what you asked, but adding "x-systemd.mount-timeout=180s" to the
>> mount options in /etc/fstab works reliably for me to extend the timeout.
>> Of course, my largest filesystem is only 20TB, across only two devices
>> (two lvm-over-LUKS, each on separate physical drives) but it has very
>> heavy use of snapshot creation and deletion. I also run with commit=15
>> as power is not too reliable here and losing power is the most frequent
>> cause of a reboot.
> 
> Thanks for the suggestion, but I have not been able to get this method to 
> work either.
> 
> Here's what my fstab looks like, let me know if this is not what you meant!
> 
> UUID={snip} / ext4  errors=remount-ro 0 0
> UUID={snip} /mnt/data btrfs 
> defaults,noatime,compress-force=zstd:2,x-systemd.mount-timeout=300s 0 0

Hmmm. The line from my fstab is:

LABEL=lvmdata   /mnt/data   btrfs
defaults,subvolid=0,noatime,nodiratime,compress=lzo,skip_balance,commit=15,space_cache=v2,x-systemd.mount-timeout=180s,nofail
  0   3

I note that I do have "nofail" in there, although it doesn't fail for me
so I assume it shouldn't make a difference.

I can't swear that the disk is currently taking longer to mount than the
systemd default (and I will not be in a position to reboot this system
any time soon to check). But I am quite sure this made a difference when
I added it.

Not sure why it isn't working for you, unless it is some systemd
problem. It isn't systemd giving up and dropping to emergency because of
some other startup problem that occurs before the mount is finished, is
it? I could believe systemd cancels any mounts in progress when that
happens.

Graham


Re: Large multi-device BTRFS array (usually) fails to mount on boot.

2021-02-03 Thread Graham Cobb
On 03/02/2021 21:54, jos...@mailmag.net wrote:
> Good Evening.
> 
> I have a large BTRFS array, (14 Drives, ~100 TB RAW) which has been having 
> problems mounting on boot without timing out. This causes the system to drop 
> to emergency mode. I am then able to mount the array in emergency mode and 
> all data appears fine, but upon reboot it fails again.
> 
> I actually first had this problem around a year ago, and initially put 
> considerable effort into extending the timeout in systemd, as I believed that 
> to be the problem. However, all the methods I attempted did not work properly 
> or caused the system to continue booting before the array was mounted, 
> causing all sorts of issues. Eventually, I was able to almost completely 
> resolve it by defragmenting the extent tree and subvolume tree for each 
> subvolume. (btrfs fi defrag /mountpoint/subvolume/) This seemed to reduce the 
> time required to mount, and made it mount on boot the majority of the time.
> 

Not what you asked, but adding "x-systemd.mount-timeout=180s" to the
mount options in /etc/fstab works reliably for me to extend the timeout.
Of course, my largest filesystem is only 20TB, across only two devices
(two lvm-over-LUKS, each on separate physical drives) but it has very
heavy use of snapshot creation and deletion. I also run with commit=15
as power is not too reliable here and losing power is the most frequent
cause of a reboot.



Re: [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata

2021-01-23 Thread Graham Cobb
On 23/01/2021 17:21, Zygo Blaxell wrote:
> On Sat, Jan 23, 2021 at 02:55:52PM +0000, Graham Cobb wrote:
>> On 22/01/2021 22:42, Zygo Blaxell wrote:
>> ...
>>>> So the point is: what happens if the root subvolume is not mounted ?
>>>
>>> It's not an onerous requirement to mount the root subvol.  You can do (*)
>>>
>>> tmp="$(mktemp -d)"
>>> mount -osubvolid=5 /dev/btrfs "$tmp"
>>> setfattr -n 'btrfs...' -v... "$tmp"
>>> umount "$tmp"
>>> rmdir "$tmp"
>>
>> No! I may have other data on that disk which I do NOT want to become
>> accessible to users on this system (even for a short time). As a simple
>> example, imagine, a disk I carry around to take emergency backups of
>> other systems, but I need to change this attribute to make the emergency
>> backup of this system run as quickly as possible before the system dies.
>> Or a disk used for audit trails, where users should not be able to
>> modify their earlier data. Or where I suspect a security breach has
>> occurred. I need to be able to be confident that the only data
>> accessible on this system is data in the specific subvolume I have mounted.
> 
> Those are worthy goals, but to enforce them you'll have to block or filter
> the mount syscall with one of the usual sandboxing/container methods.
> 
> If you have that already set up, you can change root subvol xattrs from
> the supervisor side.  No users will have access if you don't give them
> access to the root subvol or the mount syscall on the restricted side
> (they might also need a block device FD belonging to the filesystem).
> 
> If you don't have the sandbox already set up, then root subvol access
> is a thing your users can already do, and it may be time to revisit the
> assumptions behind your security architecture.

I'm not talking about root. I mean unpriv'd users (who can't use mount)!
If I (as root) mount the whole disk, those users may be able to read or
modify files from parts of that disk to which they should not have
access. That may be  why I am not mounting the whole disk in the first
place.

I gave a few very simple examples, but I can think of many more cases
where a disk may contain files which users might be able to access if
the disk was mounted (maybe the disk has subvols used by many different
systems but UIDs are not coordinated, or ...).  And, of course, if they
can open a FD during the brief time it is mounted, they can stop it
being unmounted again.

No. If I have chosen to mount just a subvol, it is because I don't want
to mount the whole disk.


Re: [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata

2021-01-23 Thread Graham Cobb
On 22/01/2021 22:42, Zygo Blaxell wrote:
...
>> So the point is: what happens if the root subvolume is not mounted ?
> 
> It's not an onerous requirement to mount the root subvol.  You can do (*)
> 
>   tmp="$(mktemp -d)"
>   mount -osubvolid=5 /dev/btrfs "$tmp"
>   setfattr -n 'btrfs...' -v... "$tmp"
>   umount "$tmp"
>   rmdir "$tmp"

No! I may have other data on that disk which I do NOT want to become
accessible to users on this system (even for a short time). As a simple
example, imagine, a disk I carry around to take emergency backups of
other systems, but I need to change this attribute to make the emergency
backup of this system run as quickly as possible before the system dies.
Or a disk used for audit trails, where users should not be able to
modify their earlier data. Or where I suspect a security breach has
occurred. I need to be able to be confident that the only data
accessible on this system is data in the specific subvolume I have mounted.

Also, the backup problem is a very real one - abusing xattrs for
filesystem controls really screws up writing backup procedures to
correctly back up xattrs used to describe or manage data (or for any
other purpose).

I suppose btrfs can internally store it in an xattr if it wants, as long
as any values set are just ignored and changes happen through some other
operation (e.g. sysfs). It still might confuse programs like rsync which
would try to reset the values each time a sync is done.



NVME experience?

2021-01-16 Thread Graham Cobb
I am about to deploy my first btrfs filesystems on NVME. Does anyone
have any hints or advice? Initially they will be root disks, but I am
thinking about also moving home disks and other frequently used data to
NVME, but probably not backups and other cold data.

I am mostly wondering about non-functionals like typical failure modes,
life and wear, etc., which might affect decisions such as how to split
different filesystems among the drives, whether to mix NVME with other
drives (SSD or HDD), and so on.

Are NVME drives just SSDs with a different interface? With similar
lifetimes (by bytes written, or another measure)? And similar typical
failure modes?

Are they better or worse in terms of firmware bugs? Error detection for
corrupted data? SMART or other indicators that they are starting to fail
and should be replaced?

I assume that they do not (in practice) suffer from "faulty cable" problems.

Anyway, I am hoping someone has experiences to share which might be useful.

Graham


Re: [PATCH 2/2] btrfs: send: fix invalid commands for inodes with changed type but same gen

2021-01-12 Thread Graham Cobb
On 12/01/2021 11:27, Filipe Manana wrote:
> ...
> In other words, what I think we should have is a check that forbids
> using two roots for an incremental send that are not snapshots of the
> same subvolume (have different parent uuids).

Are you suggesting that rule should also apply for clone sources (-c
option)? Or are clone sources handled differently from the parent in the
code?


Re: Re: Re: cloning a btrfs drive with send and receive: clone is bigger than the original?

2021-01-10 Thread Graham Cobb
On 10/01/2021 07:41, cedric.dew...@eclipso.eu wrote:
> I've tested some more.
> 
> Repeatedly sending the difference between two consecutive snapshots creates a 
> structure on the target drive where all the snapshots share data. So 10 
> snapshots of 10 files of 100MB takes up 1GB, as expected.
> 
> Repeatedly sending the difference between the first snapshot and each next 
> snapshot creates a structure on the target drive where the snapshots are 
> independent, so they don't share any data. How can that be avoided?

If you send a snapshot B with a parent A, any files not present in A
will be created in the copy of B. The fact that you already happen to
have a copy of the files somewhere else on the target is not known to
either the sender or the receiver - how would it be?

If you want the send process to take into account *other* snapshots that
have previously been sent, you need to tell send to also use those
snapshots as clone sources. That is what the -c option is for.
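
As a rough sketch (snapshot names are just examples, and the clone
sources must already exist on both the sending and receiving sides):

    btrfs send -p /snaps/day1 -c /snaps/day2 -c /snaps/day3 /snaps/day4 \
        | btrfs receive /mnt/target/snaps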

Alternatively, use a deduper on the destination after the receive has
finished and let it work out what can be shared.



Re: synchronize btrfs snapshots over a unreliable, slow connection

2021-01-07 Thread Graham Cobb
On 07/01/2021 03:09, Zygo Blaxell wrote:
...
> I would only attempt to put the archives into long-term storage after
> verifying that they produce correct output when fed to btrfs receive;
> otherwise, you could find out too late that a months-old archive was
> damaged, incomplete, or incorrect, and restores after that point are no
> longer possible.
>
> Once that verification has been done and the subvol is no longer needed
> for incremental sends, you can delete the subvol and keep the archive(s)
> that produced it.

Personally, I wouldn't do that. Particularly if this was my only or main
backup. I don't think btrfs has many tests that new versions of
"receive" can correctly process old archives - let alone an incremental
sequence of them generated by versions of "send" with bugs fixed years
before.

If it was me, I would always keep the "latest" subvol online, or at
least as a newly created full (not incremental) send archive.


Re: synchronize btrfs snapshots over a unreliable, slow connection

2021-01-05 Thread Graham Cobb
On 05/01/2021 08:34, Forza wrote:
> 
> 
> On 2021-01-04 21:51, cedric.dew...@eclipso.eu wrote:
>> ­I have a master NAS that makes one read only snapshot of my data per
>> day. I want to transfer these snapshots to a slave NAS over a slow,
>> unreliable internet connection. (it's a cheap provider). This rules
>> out a "btrfs send -> ssh -> btrfs receive" construction, as that can't
>> be resumed.
>>
>> Therefore I want to use rsync to synchronize the snapshots on the
>> master NAS to the slave NAS.
>>
>> My thirst thought is something like this:
>> 1) create a read-only snapshot on the master NAS:
>> btrfs subvolume snapshot -r /mnt/nas/storage
>> /mnt/nas/storage_snapshots/storage-$(date +%Y_%m_%d-%H%m)
>> 2) send that data to the slave NAS like this:
>> rsync --partial -var --compress --bwlimit=500KB -e "ssh -i
>> ~/slave-nas.key" /mnt/nas/storage_snapshots/storage-$(date
>> +%Y_%m_%d-%H%m) cedric@123.123.123.123/nas/storage
>> 3) Restart rsync until all data is copied (by checking the error code
>> of rsync, is it's 0 then all data has been transferred)
>> 4) Create the read-only snapshot on the slave NAS with the same name
>> as in step 1.

Seems like a reasonable approach to me, but see comment below.

>> Does somebody already has a script that does this?

I don't.

>> Is there a problem with this approach that I have not yet considered?­

Not a problem as such, but you could also consider using something like
rsnapshot (or reimplementing your own version by using rsync
--link-dest) instead of relying on btrfs snapshots on the slave NAS.
That way you don't need btrfs on that NAS at all if you don't want. I
used that approach as the (old) NAS I was using had a very old linux
version and didn't even run btrfs.
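
A rough sketch of the --link-dest idea (paths are examples; unchanged
files in the new backup are hard-linked against the previous one):

    rsync -a --delete --link-dest=/backups/2021-01-04/ \
        /mnt/nas/storage/ /backups/2021-01-05/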

> One option is to store the send stream as a compressed file and rsync
> that file over and do a shasum or similar on it.

I have looked into that in the past and eventually decided against it.

My main concern was being too reliant on very complex and less used
features of btrfs, including one which has had several bugs in the past
(send/receive). I decided my backups needed to be reliable and robust
more than they need to be optimally efficient.

I had even considered just saving the original send stream, and the
subsequent incremental sends (all compressed) - until I realised that
any tiny corruption or bug in even one of those streams could make the
later streams completely unrestorable.

In the end, I decided to use a very boring (but powerful and
well-maintained), widely used, conventional backup tool (specifically,
dar, under the control of the dar_automatic_backup script) and I copy
the dar archives (compressed and encrypted) onto my offsite backup
server (actually, now, I store them in S3, using rclone). They are also
convenient to occasionally put on a disk which I can give to a friend to
put at the back of their cupboard somewhere in case I need it (faster
and cheaper to access than S3)!

In my case, I had some spare disks and plenty of bandwidth so I also use
rsnapshot from my onsite NAS to an offsite NAS. But that is for
convenience (not having to have dar read through all the archives) - I
consider the S3 dar archives my "main" disaster-recovery backup.

> btrbk[2] is a Btrfs backup tool that also can store snapshots as
> archives on remote location. You may want to have a look at that too.

I use btrbk for local snapshots (storing snapshots of all my systems on
my main server system). But I consider those convenient copies for
restoring files deleted by mistake, or restoring earlier configurations,
not backups (for example, a serious electrical problem or fire in the
server machine could destroy both the original disk and the snapshot disk).

Your situation is different, of course - so just some things to consider.




Re: synchronize a btrfs snapshot via network with resume support?

2021-01-01 Thread Graham Cobb
On 01/01/2021 14:42, cedric.dew...@eclipso.eu wrote:
...
> I'm looking for a program that can synchronize a btrfs snapshot via a 
> network, and supports resuming of interrupted transfers.

Not an answer to your question... the way I would solve your problem is
to do the "btrfs send" to a local file on master, reliably transfer the
file, then do the "btrfs receive" on slave from the file. Reliable file
transfer programs exist (you can even just use rsync --inplace if I
remember correctly).

Unfortunately, that requires that you have enough space on both ends to
store the btrfs-send file.
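
Roughly something like this, assuming you have that scratch space
(paths and file names are just examples):

    # on the master
    btrfs send -f /var/tmp/storage.send /mnt/nas/storage_snapshots/storage-2021_01_01-1200
    rsync --partial --inplace /var/tmp/storage.send cedric@123.123.123.123:/var/tmp/
    # on the slave
    btrfs receive -f /var/tmp/storage.send /nas/storage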

> 
> Buttersink [1] claims it can do Resumable, checksummed multi-part uploads to 
> S3 as a back-end, but not between two PC's via ssh.

I don't use Buttersink. But if it can do that without storing the file
locally on either end (which I would be slightly surprised about) then
it sounds like that might be the way to do it: essentially you would be
paying AWS for the temporary filespace needed.

Graham


Re: AW: WG: How to properly setup for snapshots

2020-12-21 Thread Graham Cobb
On 21/12/2020 20:45, Claudius Ellsel wrote:
> I had a closer look at snapper now and have installed and set it up. This 
> seems to be really the easiest way for me, I guess. My main confusion was 
> probably that I was unsure whether I had to create a subvolume prior to this 
> or not, which got sorted out now. The situation is apparently still not 
> ideal, as to my current understanding it would have been preferable to set up 
> a subvolume first at root level directly after creating the files system. 
> However, as this is only the data drive in the machine (OS runs on an ext4 
> SSD) it is at least not planned to simply swap the entire volume to a 
> snapshot but rather to restore snapshots at file / folder level, where 
> snapper can also be used for.
> 
> One can possibly also safely achieve the same with only using btrfs 
> commandline tools (I made the mistake of not thinking about read only 
> snapshots when I wrote about my fear to delete / modify mounted snapshots). 
> Still I have a better feeling when using snapper, as I hope it is less easy 
> to screw things up with it :)

I have never used snapper but I know it is a popular tool. But there are
many other tools - with their own advantages, disadvantages and ways of
working. You may want to experiment with several of them. Personally I
use btrbk for automation of snapshots.

Just remember that btrfs snapshots are not a backup tool. At best, they
are a convenience tool, for quickly restoring deleted or modified files
(or full subvolumes) to an earlier (snapshotted) state without having to
access your backups. But they are completely useless if a hardware or
software problem (kernel bug, disk problem, system memory error, faulty
cable, etc) corrupts the filesystem. You don't have a backup solution
unless you are copying the files to another physical disk, preferably on
a different system. btrbk can help with that as well (but it is just
automating btrfs send and btrfs receive underneath).


Re: MD RAID 5/6 vs BTRFS RAID 5/6

2019-10-17 Thread Graham Cobb
On 17/10/2019 16:57, Chris Murphy wrote:
> On Wed, Oct 16, 2019 at 10:07 PM Jon Ander MB  
> wrote:
>>
>> It would be interesting to know the pros and cons of this setup that
>> you are suggesting vs zfs.
>> +zfs detects and corrects bitrot (
>> http://www.zfsnas.com/2015/05/24/testing-bit-rot/ )
>> +zfs has working raid56
>> -modules out of kernel for license incompatibilities (a big minus)
>>
>> BTRFS can detect bitrot but... are we sure it can fix it? (can't seem
>> to find any conclusive doc about it right now)
> 
> Yes. Active fixups with scrub since 3.19. Passive fixups since 4.12.

Presumably this is dependent on checksums? So neither detection nor
fixup happens for NOCOW files? Even a scrub won't notice, because scrub
doesn't attempt to compare both copies unless the first copy has a bad
checksum -- is that correct?

> 
>> I'm one of those that is waiting for the write hole bug to be fixed in
>> order to use raid5 on my home setup. It's a shame it's taking so long.
> 
> For what it's worth, the write hole is considered to be rare.
> https://lwn.net/Articles/665299/
> 
> Further, the write hole means a) parity is corrupt or stale compared
> to data stripe elements which is caused by a crash or powerloss during
> writes, and b) subsequently there is a missing device or bad sector in
> the same stripe as the corrupt/stale parity stripe element. The effect
> of b) is that reconstruction from parity is necessary, and the effect
> of a) is that it's reconstructed incorrectly, thus corruption. But
> Btrfs detects this corruption, whether it's metadata or data. The
> corruption isn't propagated in any case. But it makes the filesystem
> fragile if this happens with metadata. Any parity stripe element
> staleness likely results in significantly bad reconstruction in this
> case, and just can't be worked around, even btrfs check probably can't
> fix it. If the write hole problem happens with data block group, then
> EIO. But the good news is that this isn't going to result in silent
> data or file system metadata corruption. For sure you'll know about
> it.

If I understand correctly, metadata always has checksums so that is true
for filesystem structure. But for no-checksum files (such as nocow
files) the corruption will be silent, won't it?

Graham


Re: [Not TLS] Re: [PATCH 3/4] btrfs: include non-missing as a qualifier for the latest_bdev

2019-10-04 Thread Graham Cobb
On 04/10/2019 09:11, Nikolay Borisov wrote:
> 
> 
> On 4.10.19 г. 10:50 ч., Anand Jain wrote:
>> btrfs_free_extra_devids() reorgs fs_devices::latest_bdev
>> to point to the bdev with greatest device::generation number.
>> For a typical-missing device the generation number is zero so
>> fs_devices::latest_bdev will never point to it.
>>
>> But if the missing device is due to alienating [1], then
>> device::generation is not-zero and if it is >= to rest of
>> device::generation in the list, then fs_devices::latest_bdev
>> ends up pointing to the missing device and reports the error
>> like this [2]
>>
>> [1]
>> mkfs.btrfs -fq /dev/sdd && mount /dev/sdd /btrfs
>> mkfs.btrfs -fq -draid1 -mraid1 /dev/sdb /dev/sdc
>> sleep 3 # avoid racing with udev's useless scans if needed
>> btrfs dev add -f /dev/sdb /btrfs
> 
> Hm, here I think the correct way is to refuse adding /dev/sdb to an
> existing fs if it's detected to be part of a different one. I.e it
> should require wipefs to be done.

I disagree. -f means "force overwrite of existing filesystem on the
given disk(s)". It shouldn't be any different whether the existing fs is
btrfs or something else.

Graham


Intro to fstests environment?

2019-10-03 Thread Graham Cobb
Hi,

I seem to have another case where scrub gets confused when it is
cancelled and restarted many times (or, maybe, it is my error or
something). I will look into it further but, instead of just hacking
away at my script to work out what is going on, I thought I might try to
create a regression test for it this time.

I have read the kdave/fstests READMEs and the wiki. Is there any other
documentation or advice I should read? Of course, I will look at
existing test scripts as well.

I don't suppose anyone has a convenient VM image or similar which I can
use as a starting point?

Graham


Re: [PATCH v2 RESEND] btrfs-progs: add verbose option to btrfs device scan

2019-10-01 Thread Graham Cobb
On 01/10/2019 08:52, Anand Jain wrote:
> Ping?
> 
> 
> On 9/25/19 4:07 PM, Anand Jain wrote:
>> To help debug device scan issues, add verbose option to btrfs device
>> scan.
>>
>> Signed-off-by: Anand Jain 
>> ---
>> v2: Use bool instead of int as a btrfs_scan_device() argument.
>>
>>   cmds/device.c    | 8 ++--
>>   cmds/filesystem.c    | 2 +-
>>   common/device-scan.c | 4 +++-
>>   common/device-scan.h | 3 ++-
>>   common/utils.c   | 2 +-
>>   disk-io.c    | 2 +-
>>   6 files changed, 14 insertions(+), 7 deletions(-)
>>
>> diff --git a/cmds/device.c b/cmds/device.c
>> index 24158308a41b..9b715ffc42a3 100644
>> --- a/cmds/device.c
>> +++ b/cmds/device.c
>> @@ -313,6 +313,7 @@ static int cmd_device_scan(const struct cmd_struct
>> *cmd, int argc, char **argv)
>>   int all = 0;
>>   int ret = 0;
>>   int forget = 0;
>> +    bool verbose = false;
>>     optind = 0;
>>   while (1) {
>> @@ -323,7 +324,7 @@ static int cmd_device_scan(const struct cmd_struct
>> *cmd, int argc, char **argv)
>>   { NULL, 0, NULL, 0}
>>   };
>>   -    c = getopt_long(argc, argv, "du", long_options, NULL);
>> +    c = getopt_long(argc, argv, "duv", long_options, NULL);
>>   if (c < 0)
>>   break;
>>   switch (c) {
>> @@ -333,6 +334,9 @@ static int cmd_device_scan(const struct cmd_struct
>> *cmd, int argc, char **argv)
>>   case 'u':
>>   forget = 1;
>>   break;
>> +    case 'v':
>> +    verbose = true;
>> +    break;
>>   default:
>>   usage_unknown_option(cmd, argv);
>>   }
>> @@ -354,7 +358,7 @@ static int cmd_device_scan(const struct cmd_struct
>> *cmd, int argc, char **argv)
>>   }
>>   } else {
>>   printf("Scanning for Btrfs filesystems\n");
>> -    ret = btrfs_scan_devices();
>> +    ret = btrfs_scan_devices(verbose);
>>   error_on(ret, "error %d while scanning", ret);
>>   ret = btrfs_register_all_devices();
>>   error_on(ret,

Shouldn't "--verbose" be accepted as a long version of the option? That
would mean adding it to long_options.

The usage message cmd_device_scan_usage needs to be updated to include
the new option(s).

I have tested this on my systems (4.19 kernel) and it not only works
well, it is useful to get the list of devices it finds. If you wish,
feel free to add:

Tested-by: Graham Cobb 

Graham

>> diff --git a/cmds/filesystem.c b/cmds/filesystem.c
>> index 4f22089abeaa..02d47a43a792 100644
>> --- a/cmds/filesystem.c
>> +++ b/cmds/filesystem.c
>> @@ -746,7 +746,7 @@ devs_only:
>>   else
>>   ret = 1;
>>   } else {
>> -    ret = btrfs_scan_devices();
>> +    ret = btrfs_scan_devices(false);
>>   }
>>     if (ret) {
>> diff --git a/common/device-scan.c b/common/device-scan.c
>> index 48dbd9e19715..a500edf0f7d7 100644
>> --- a/common/device-scan.c
>> +++ b/common/device-scan.c
>> @@ -360,7 +360,7 @@ void free_seen_fsid(struct seen_fsid
>> *seen_fsid_hash[])
>>   }
>>   }
>>   -int btrfs_scan_devices(void)
>> +int btrfs_scan_devices(bool verbose)
>>   {
>>   int fd = -1;
>>   int ret;
>> @@ -389,6 +389,8 @@ int btrfs_scan_devices(void)
>>   continue;
>>   /* if we are here its definitely a btrfs disk*/
>>   strncpy_null(path, blkid_dev_devname(dev));
>> +    if (verbose)
>> +    printf("blkid: btrfs device: %s\n", path);
>>     fd = open(path, O_RDONLY);
>>   if (fd < 0) {
>> diff --git a/common/device-scan.h b/common/device-scan.h
>> index eda2bae5c6c4..3e473c48d1af 100644
>> --- a/common/device-scan.h
>> +++ b/common/device-scan.h
>> @@ -1,6 +1,7 @@
>>   #ifndef __DEVICE_SCAN_H__
>>   #define __DEVICE_SCAN_H__
>>   +#include 
>>   #include "kerncompat.h"
>>   #include "ioctl.h"
>>   @@ -29,7 +30,7 @@ struct seen_fsid {
>>   int fd;
>>   };
>>   -int btrfs_scan_devices(void);
>> +int btrfs_scan_devices(bool verbose);
>>   int btrfs_register_one_device(const char *fname);
>>   int btrfs_register_all_devices(void);
>>   int btrfs_add_to_fsid(struct btrfs_trans_handle *trans,

Re: [BTRFS Raid5 error during Scrub.

2019-09-30 Thread Graham Cobb
On 29/09/2019 22:38, Robert Krig wrote:
> I'm running Debian Buster with Kernel 5.2.
> Btrfs-progs v4.20.1

I am running Debian testing (bullseye) and have chosen not to install
the 5.2 kernel yet because the version of it in bullseye
(linux-image-5.2.0-2-amd64) is based on 5.2.9 and (as far as I can tell)
does not contain the BTRFS corruption fix.

I do not know which version of the 5.2 kernel is in buster but you may
want to check that it contains a backport of the BTRFS fix or downgrade
to the 4.19 kernel until you can be sure.

I note that linux-image-5.2.0-3-amd64 is in unstable and is based on
5.2.17 so should have the fix. I presume it will make its way to testing
soon.

If anyone can confirm which versions of the Debian kernel package the
5.2 corruption fixes are in, it would be helpful.


Re: Feature requests: online backup - defrag - change RAID level

2019-09-09 Thread Graham Cobb
On 09/09/2019 13:18, Qu Wenruo wrote:
> 
> 
> On 2019/9/9 下午7:25, zedlr...@server53.web-hosting.com wrote:
>> What I am complaining about is that at one point in time, after issuing
>> the command:
>> btrfs balance start -dconvert=single -mconvert=single
>> and before issuing the 'btrfs delete', the system could be in a too
>> fragile state, with extents unnecesarily spread out over two drives,
>> which is both a completely unnecessary operation, and it also seems to
>> me that it could be dangerous in some situations involving potentially
>> malfunctioning drives.
> 
> In that case, you just need to replace that malfunctioning device other
> than fall back to SINGLE.

Actually, this case is the (only) one of the three that I think would be
very useful (backup is better handled by a choice of userspace tools -
I use btrbk - and does anyone really care about defrag any more?).

I did, recently, have a case where I had started to move my main data
disk to a raid1 setup but my new disk started reporting errors. I didn't
have a spare disk (and didn't have a spare SCSI slot for another disk
anyway). So, I wanted to stop using the new disk and revert to my former
(m=dup, d=single) setup as quickly as possible.

I spent time trying to find a way to do that balance without risking the
single copy of some of the data being stored on the failing disk between
starting the balance and completing the remove. That has two problems:
obviously having the single copy on the failing disk is bad news but,
also, it increases the time taken for the subsequent remove which has to
copy that data back to the remaining disk (where there used to be a
perfectly good copy which was subsequently thrown away during the balance).

In the end, I took the risk and the time of the two steps. In my case, I
had good backups, and actually most of my data was still in a single
profile on the old disk (because the errors started happening before I
had done the balance to change the profile of all the old data from
single to raid1).
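
For what it's worth, the two steps were roughly (device and mount point
are examples):

    btrfs balance start -dconvert=single -mconvert=dup /mnt/data
    btrfs device remove /dev/sdX /mnt/data   # the failing new disk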

But a balance -dconvert=single-but-force-it-to-go-on-disk-1 would have
been useful. (Actually a "btrfs device mark-for-removal" command would
be better - allow a failing device to be retained for a while, and used
to provide data, but ignore it when looking to store data).

Graham


Re: Massive filesystem corruption since kernel 5.2 (ARCH)

2019-07-30 Thread Graham Cobb
On 30/07/2019 23:44, Swâmi Petaramesh wrote:
> Still, losing a given FS with subvols, snapshots etc, may be very
> annoying and very time consuming rebuilding.

I believe that in one of the earlier mails, Qu said that you can
probably mount the corrupted fs readonly and read everything.

If that is the case then, if I were in your position, I would probably
buy another disk, create a new fs, and then use one of the
subvol-preserving btrfs clone utilities to clone the readonly disk onto
the new disk.

Not cheap, and would still take some time, but at least it could be
automated.


Re: "btrfs: harden agaist duplicate fsid" spams syslog

2019-07-12 Thread Graham Cobb
On 12/07/2019 14:35, Patrik Lundquist wrote:
> On Fri, 12 Jul 2019 at 14:48, Anand Jain  wrote:
>> I am unable to reproduce, I have tried with/without dm-crypt on both
>> oraclelinux and opensuse (I am yet to try debian).
> 
> I'm using Debian testing 4.19.0-5-amd64 without problem. Raid1 with 5
> LUKS disks. Mounting with the UUID but not(!) automounted.
> 
> Running "btrfs device scan --all-devices" doesn't trigger the fsid move.

Thanks Patrik. So is it something to do with LVM, not dm-crypt, I
wonder? Note that in my case I am using LVM-over-dm-crypt, and it is the
LV that I mount, not the encrypted partition.



Re: "btrfs: harden agaist duplicate fsid" spams syslog

2019-07-12 Thread Graham Cobb
On 12/07/2019 13:46, Anand Jain wrote:
> I am unable to reproduce, I have tried with/without dm-crypt on both
> oraclelinux and opensuse (I am yet to try debian).

I understand. I am going to be away for a week but I am happy to look
into trying to create a smaller reproducer (for example in a vm) once I
get back.

> The patch in $subject is fair, but changing device path indicates
> there is some problem in the system. However, I didn't expect
> same device pointed by both /dev/dm-x and /dev/mapper/abc would
> contended.

It is weird, because there are other symlinks also pointing to the
device. In my case, lvm sets up both /dev/mapper/cryptdata4tb--vg-backup
and /dev/cryptdata4tb-vg/backup as symlinks to ../dm-13 but only the
first one fights with /dev/dm-13 for the devid.

> One fix for this is to make it a ratelimit print. But then the same
> thing happens without notice. If you monitor /proc/self/mounts
> probably you will notice that mount device changes every ~2mins.

I haven't managed to catch it. Presumably because, according to the
logs, it seems to switch the devices back again within less than a second.

> I will be more keen to find the module which is causing this issue,
> that is calling 'btrfs dev scan' every ~2mins or trying to mount
> an unmounted device without understanding that its mapper is already
> mounted.

Any ideas how we can determine that?

Can I try something like stopping udev for 5 minutes to see if it stops?
Or will that break my system (I can't schedule any downtime until after
I am back)? Note (in case it is relevant) this is a systemd system so
udev is actually systemd-udevd.service.

Thanks
Graham


Re: "btrfs: harden agaist duplicate fsid" spams syslog

2019-07-11 Thread Graham Cobb
On 11/07/2019 03:46, Anand Jain wrote:
> Now the question I am trying to understand, why same device is being
> scanned every 2 mins, even though its already mount-ed. I am guessing
> its toggling the same device paths trying to mount the device-path
> which is not mounted. So autofs's check for the device mount seems to
> be path based.
> 
> Would you please provide your LVM configs and I believe you are using
> dm-mapping too. What are the device paths used in the fstab and in grub.
> And do you see these messages for all devices of
> 4d1ba5af-8b89-4cb5-96c6-55d1f028a202 or just devid 4? Would you please
> provide more logs at least a complete cycle of the repeating logs.

My setup is quite complex, with three btrfs-over-LVM-over-LUKS
filesystems, so I will explain it fully in a separate message in case it
is important. Let me first answer your questions regarding
4d1ba5af-8b89-4cb5-96c6-55d1f028a202, which was the example I used in my
initial log extract.

4d1b...a202 is a filesystem with a main mount point of /mnt/backup2/:

black:~# btrfs fi show /mnt/backup2/
Label: 'snap12tb'  uuid: 4d1ba5af-8b89-4cb5-96c6-55d1f028a202
Total devices 2 FS bytes used 10.97TiB
devid1 size 10.82TiB used 10.82TiB path /dev/sdc3
devid4 size 3.62TiB used 199.00GiB path
/dev/mapper/cryptdata4tb--vg-backup

In this particular filesystem, it has two devices: one is a real disk
partition (/dev/sdc3), the other is an LVM logical volume. It has also
had other LVM devices added and removed at various times, but this is
the current setup.

Note: when I added this LV, I used path /dev/mapper/cryptdata4tb--vg-backup.

black:~# lvdisplay /dev/cryptdata4tb-vg/backup
  --- Logical volume ---
  LV Path/dev/cryptdata4tb-vg/backup
  LV Namebackup
  VG Namecryptdata4tb-vg
  LV UUIDTZaWfo-goG1-GsNV-GCZL-rpbz-IW0H-gNmXBf
  LV Write Accessread/write
  LV Creation host, time black, 2019-07-10 10:40:28 +0100
  LV Status  available
  # open 1
  LV Size3.62 TiB
  Current LE 949089
  Segments   1
  Allocation inherit
  Read ahead sectors auto
  - currently set to 256
  Block device   254:13

The LVM logical volume is exposed as /dev/mapper/cryptdata4tb--vg-backup
which is a symlink (set up by LVM, I believe) to /dev/dm-13.

For the 4d1b...a202 filesystem I currently only see the messages for
devid 4. But I presume that is because devid 1 is a real device, which
only appears in /dev once. I did, for a while, have two LV devices in
this filesystem and, looking at the old logs, I can see that every 2
minutes the swapping between /dev/mapper/whatever and /dev/dm-N was
happening for both LV devids (but not for the physical device devid)

This particular device is not a root device and I do not believe it is
referenced in grub or initramfs. It is mounted in /etc/fstab/:

LABEL=snap12tb  /mnt/backup2btrfs
defaults,subvolid=0,noatime,nodiratime,compress=lzo,skip_balance,space_cache=v2
   0   3

Note that /dev/disk/by-label/snap12tb is a symlink to the dm-N alias of
the LV device (set up by LVM or udev or something - not by me):

black:~# ls -l /dev/disk/by-label/snap12tb
lrwxrwxrwx 1 root root 11 Jul 11 18:18 /dev/disk/by-label/snap12tb ->
../../dm-13

Here is a log extract of the cycling messages for the 4d1b...a202
filesystem:

Jul 11 18:46:28 black kernel: [116657.825658] BTRFS info (device sdc3):
device fsid 4d1ba5af-8b89-4cb5-96c6-55d1f028a202 devid 4 moved
old:/dev/mapper/cryptdata4tb--vg-backup new:/dev/dm-13
Jul 11 18:46:28 black kernel: [116658.048042] BTRFS info (device sdc3):
device fsid 4d1ba5af-8b89-4cb5-96c6-55d1f028a202 devid 4 moved
old:/dev/dm-13 new:/dev/mapper/cryptdata4tb--vg-backup
Jul 11 18:46:29 black kernel: [116659.157392] BTRFS info (device sdc3):
device fsid 4d1ba5af-8b89-4cb5-96c6-55d1f028a202 devid 4 moved
old:/dev/mapper/cryptdata4tb--vg-backup new:/dev/dm-13
Jul 11 18:46:29 black kernel: [116659.337504] BTRFS info (device sdc3):
device fsid 4d1ba5af-8b89-4cb5-96c6-55d1f028a202 devid 4 moved
old:/dev/dm-13 new:/dev/mapper/cryptdata4tb--vg-backup
Jul 11 18:48:28 black kernel: [116777.727262] BTRFS info (device sdc3):
device fsid 4d1ba5af-8b89-4cb5-96c6-55d1f028a202 devid 4 moved
old:/dev/mapper/cryptdata4tb--vg-backup new:/dev/dm-13
Jul 11 18:48:28 black kernel: [116778.019874] BTRFS info (device sdc3):
device fsid 4d1ba5af-8b89-4cb5-96c6-55d1f028a202 devid 4 moved
old:/dev/dm-13 new:/dev/mapper/cryptdata4tb--vg-backup
Jul 11 18:48:29 black kernel: [116779.157038] BTRFS info (device sdc3):
device fsid 4d1ba5af-8b89-4cb5-96c6-55d1f028a202 devid 4 moved
old:/dev/mapper/cryptdata4tb--vg-backup new:/dev/dm-13
Jul 11 18:48:30 black kernel: [116779.364959] BTRFS info (device sdc3):
device fsid 4d1ba5af-8b89-4cb5-96c6-55d1f028a202 devid 4 moved
old:/dev/dm-13 new:/dev/mapper/cryptdata4tb--vg-backup
Jul 11 18:50:28 black kerne

"btrfs: harden agaist duplicate fsid" spams syslog

2019-07-10 Thread Graham Cobb
Anand's Nov 2018 patch "btrfs: harden agaist duplicate fsid" has
recently percolated through to my Debian buster server system.

And it is spamming my log files.

Each of my btrfs filesystem devices logs 4 messages every 2 minutes.
Here is an example of the 4 messages related to one device:

Jul 10 19:32:27 black kernel: [33017.407252] BTRFS info (device sdc3):
device fsid 4d1ba5af-8b89-4cb5-96c6-55d1f028a202 devid 4 moved
old:/dev/mapper/cryptdata4tb--vg-backup new:/dev/dm-13
Jul 10 19:32:27 black kernel: [33017.522242] BTRFS info (device sdc3):
device fsid 4d1ba5af-8b89-4cb5-96c6-55d1f028a202 devid 4 moved
old:/dev/dm-13 new:/dev/mapper/cryptdata4tb--vg-backup
Jul 10 19:32:29 black kernel: [33018.797161] BTRFS info (device sdc3):
device fsid 4d1ba5af-8b89-4cb5-96c6-55d1f028a202 devid 4 moved
old:/dev/mapper/cryptdata4tb--vg-backup new:/dev/dm-13
Jul 10 19:32:29 black kernel: [33019.061631] BTRFS info (device sdc3):
device fsid 4d1ba5af-8b89-4cb5-96c6-55d1f028a202 devid 4 moved
old:/dev/dm-13 new:/dev/mapper/cryptdata4tb--vg-backup

What is happening here is that each device is actually an LVM logical
volume, and it is known by both a /dev/mapper name and a /dev/dm name.
Every 2 minutes something causes btrfs to notice that there are two
names for the same device and it swaps them around, logging a message to
say it has done so. And it does this 4 times.

I presume that the swapping doesn't cause any problem. I wonder slightly
whether ordering guarantees and barriers all work correctly but I
haven't noticed any problems.

I also assume it has been doing this for a while -- just silently before
this patch came along.

Is btrfs noticing this itself or is something else (udev or systemd, for
example) triggering it?

Should I worry about it?

Is there any way to not have my log files full of this?

Graham

[This started with a Debian testing kernel a couple of months ago.
Current uname -a gives: Linux black 4.19.0-5-amd64 #1 SMP Debian
4.19.37-5 (2019-06-19) x86_64 GNU/Linux]


Re: snapshot rollback

2019-07-05 Thread Graham Cobb
On 05/07/2019 12:47, Remi Gauvin wrote:
> On 2019-07-05 7:06 a.m., Ulli Horlacher wrote:
> 
>>
>> Ok, it seems my idea (replacing the original root subvolume with a
>> snapshot) is not possible. 
>>
> ...
> It is common practice with installers now to mount your root and home on
> a subvolume for exactly this reason.  (And you can convert your current
> system as well.  Boot your system with a removable boot media of your
> choice, create a subvolume named @.  Move all existing folders into this
> new subvolume.  Edit the @/boot/grub/grub.cfg file so your Linux boot
> menu has the @ added to the paths of Linux root and initrd.

Personally, I use a slightly different strategy. My basic principle is
that no btrfs filesystem should have any files or directories in its
root -- only subvolumes. This makes it easier to do stuff with snapshots
if I want to.

For system disks I use a variant of the "@" approach. I create two
subvolumes when I create a system disk: rootfs and varfs (I separate the
two because I use different btrbk and other backup configurations for /
and /var). I then use btrfs subvolume set-default to set rootfs as the
default mount so I don't have to tell grub, etc about the subvolume (I
should mention that I put /boot in a separate partition, not using btrfs).

In /etc/fstab I mount /var with subvol=varfs. I also mount the whole
filesystem (using subvolid=5) into a separate mount point
(/mnt/rootdisk) so I can easily get at and manipulate all the top-level
subvolumes.

Data disks are similar. I create a "home" subvolume at the top level of
the data disk which gets mounted on /home. Below /home, most directories
are also subvolumes (for example, one for my main account so I can back
it up separately from other parts of /home). I mount the data disk
itself (using subvolid=5) into a separate mount point: /mnt/datadisk --
which I only use for messing around with the subvolume structure.

It sounds more complicated than it is, although it is not supported by
any distro installers that I am aware of. You should expect to get a few
config things wrong while setting it up, and you will need an
alternative boot to fall back on while you get things working (a USB
disk or an older system disk) - particularly if you want to use
btrfs-over-LVM-over-LUKS. And don't forget to fully update grub when you
think it is working, and then test it again without your old/temporary
boot disk in place!

Basically, I make many different subvolumes and use mounts to put them
into the places they should be in the filesystem (except for /, for
which I use set-default). The real btrfs root directory for each
filesystem is mounted (using subvolid=5) into a separate place for doing
filesystem operations.
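
As an illustrative sketch only (device names, IDs and paths are
examples, not exactly my setup):

    # create the top-level subvolumes on a new system disk
    mount /dev/vg0/root /mnt/rootdisk
    btrfs subvolume create /mnt/rootdisk/rootfs
    btrfs subvolume create /mnt/rootdisk/varfs
    btrfs subvolume list /mnt/rootdisk            # note the ID of rootfs
    btrfs subvolume set-default <id-of-rootfs> /mnt/rootdisk

    # /etc/fstab (the / line relies on the default subvolume)
    /dev/vg0/root  /              btrfs  defaults               0  0
    /dev/vg0/root  /var           btrfs  defaults,subvol=varfs  0  0
    /dev/vg0/root  /mnt/rootdisk  btrfs  defaults,subvolid=5    0  0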

I then have a cron script which checks that every directory within the
top level of each btrfs filesystem (and within /home) is a subvolume and
warns me if it isn't (I use special dotfiles within the few top-level
directories which I don't want to be their own subvolumes).

Contact me directly if you would find my personal "how to set up my
system and root disks, for debian, using btrfs-over-lvm-over-luks" notes
useful.

Graham


Re: Btrfs progs pre-release 5.2-rc1

2019-06-28 Thread Graham Cobb
On 28/06/2019 18:40, David Sterba wrote:
> Hi,
> 
> this is a pre-release of btrfs-progs, 5.2-rc1.
> 
> The proper release is scheduled to next Friday, +7 days (2019-07-05), but can
> be postponed if needed.
> 
> Scrub status has been reworked:
> 
>   UUID: bf8720e0-606b-4065-8320-b48df2e8e669
>   Scrub started:Fri Jun 14 12:00:00 2019
>   Status:   running
>   Duration: 0:14:11
>   Time left:0:04:04
>   ETA:  Fri Jun 14 12:18:15 2019
>   Total to scrub:   182.55GiB
>   Bytes scrubbed:   141.80GiB
>   Rate: 170.63MiB/s
>   Error summary:csum=7
> Corrected:  0
> Uncorrectable:  7
> Unverified: 0

Is it possible to include my recently submitted patch to scrub to
correct handling of last_physical and fix skipping much of the disk on
scrub cancel/resume?

Graham


Re: [PATCH] btrfs-progs: scrub: Fix scrub cancel/resume not to skip most of the disk

2019-06-25 Thread Graham Cobb
On 18/06/2019 09:08, Graham R. Cobb wrote:
> When a scrub completes or is cancelled, statistics are updated for reporting
> in a later btrfs scrub status command and for resuming the scrub. Most
> statistics (such as bytes scrubbed) are additive so scrub adds the statistics
> from the current run to the saved statistics.
> 
> However, the last_physical statistic is not additive. The value from the
> current run should replace the saved value. The current code incorrectly
> adds the last_physical from the current run to the previous saved value.
> 
> This bug causes the resume point to be incorrectly recorded, so large areas
> of the disk are skipped when the scrub resumes. As an example, assume a disk
> had 100 bytes and scrub was cancelled and resumed each time 10% (10
> bytes) had been scrubbed.
> 
> Run | Start byte | bytes scrubbed | kernel last_physical | saved last_physical
>   1 |  0 | 10 |   10 |  10
>   2 | 10 | 10 |   20 |  30
>   3 | 30 | 10 |   40 |  70
>   4 | 70 | 10 |   80 | 150
>   5 |150 |  0 | immediately completes| completed
> 
> In this example, only 40% of the disk is actually scrubbed.
> 
> This patch changes the saved/displayed last_physical to track the last
> reported value from the kernel.
> 
> Signed-off-by: Graham R. Cobb 

Ping? This fix is important for anyone who interrupts and resumes scrubs
-- which will happen more and more as filesystems get bigger. It is a
small fix and would be good to get out to distros.

Graham

> ---
>  cmds-scrub.c | 10 --
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/cmds-scrub.c b/cmds-scrub.c
> index f21d2d89..2800e796 100644
> --- a/cmds-scrub.c
> +++ b/cmds-scrub.c
> @@ -171,6 +171,10 @@ static void print_scrub_summary(struct 
> btrfs_scrub_progress *p)
>   fs_stat->p.name += p->name; \
>  } while (0)
>  
> +#define _SCRUB_FS_STAT_COPY(p, name, fs_stat) do {   \
> + fs_stat->p.name = p->name;  \
> +} while (0)
> +
>  #define _SCRUB_FS_STAT_MIN(ss, name, fs_stat)\
>  do { \
>   if (fs_stat->s.name > ss->name) {   \
> @@ -209,7 +213,7 @@ static void add_to_fs_stat(struct btrfs_scrub_progress *p,
>   _SCRUB_FS_STAT(p, malloc_errors, fs_stat);
>   _SCRUB_FS_STAT(p, uncorrectable_errors, fs_stat);
>   _SCRUB_FS_STAT(p, corrected_errors, fs_stat);
> - _SCRUB_FS_STAT(p, last_physical, fs_stat);
> + _SCRUB_FS_STAT_COPY(p, last_physical, fs_stat);
>   _SCRUB_FS_STAT_ZMIN(ss, t_start, fs_stat);
>   _SCRUB_FS_STAT_ZMIN(ss, t_resumed, fs_stat);
>   _SCRUB_FS_STAT_ZMAX(ss, duration, fs_stat);
> @@ -683,6 +687,8 @@ static int scrub_writev(int fd, char *buf, int max, const 
> char *fmt, ...)
>  
>  #define _SCRUB_SUM(dest, data, name) dest->scrub_args.progress.name =
> \
>   data->resumed->p.name + data->scrub_args.progress.name
> +#define _SCRUB_COPY(dest, data, name) dest->scrub_args.progress.name =   
> \
> + data->scrub_args.progress.name
>  
>  static struct scrub_progress *scrub_resumed_stats(struct scrub_progress 
> *data,
> struct scrub_progress *dest)
> @@ -703,7 +709,7 @@ static struct scrub_progress *scrub_resumed_stats(struct 
> scrub_progress *data,
>   _SCRUB_SUM(dest, data, malloc_errors);
>   _SCRUB_SUM(dest, data, uncorrectable_errors);
>   _SCRUB_SUM(dest, data, corrected_errors);
> - _SCRUB_SUM(dest, data, last_physical);
> + _SCRUB_COPY(dest, data, last_physical);
>   dest->stats.canceled = data->stats.canceled;
>   dest->stats.finished = data->stats.finished;
>   dest->stats.t_resumed = data->stats.t_start;
> 



Re: [PATCH RFC] btrfs-progs: scrub: Correct tracking of last_physical across scrub cancel/resume

2019-06-17 Thread Graham Cobb
On 08/06/2019 00:55, Graham R. Cobb wrote:
> When a scrub completes or is cancelled, statistics are updated for reporting
> in a later btrfs scrub status command. Most statistics (such as bytes 
> scrubbed)
> are additive so scrub adds the statistics from the current run to the
> saved statistics.
> 
> However, the last_physical statistic is not additive. The value from the
> current run should replace the saved value. The current code incorrectly
> adds the last_physical from the current run to the saved value.
> 
> This bug not only affects user status reporting but also has the effect that
> subsequent resumes start from the wrong place and large amounts of the
> filesystem are not scrubbed.
> 
> This patch changes the saved last_physical to track the last reported value
> from the kernel.
> 
> Signed-off-by: Graham R. Cobb 

No comments received on this RFC PATCH. I will resubmit it shortly as a
non-RFC PATCH, with a slightly improved summary and changelog.

Graham


Re: Scrub resume failure

2019-06-07 Thread Graham Cobb
On 06/06/2019 15:26, Graham Cobb wrote:
> However, after a few cancel/resume cycles, the scrub terminates. No
> errors are reported but one of the resumes will just immediately
> terminate claiming the scrub is done. It isn't. Nowhere near.

I believe I have found the problem. It is a bug in the scrub command.

When a scrub completes or is cancelled, the utility updates the saved
statistics for reporting using btrfs scrub status. These statistics
include the last_physical value returned from the ioctl, which is then
used by the resume code to specify the start for the next run.

Most statistics (such as bytes scrubbed, error counts, etc) are
maintained by adding the values from the current run to the saved
values. However, the last_physical value should not be added: it should
replace the saved value. The current code incorrectly adds it to the
saved value, meaning that large amounts of the filesystem are missed out
on the next run.
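
As a made-up illustration (numbers invented, not from a real run): if the
first run is cancelled at last_physical = 100GB, the resume correctly
starts at 100GB. If that run is then cancelled at 150GB, the saved value
becomes 100GB + 150GB = 250GB, so the next resume starts at 250GB and
everything between 150GB and 250GB is silently skipped.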

I have created a patch, which I will send in a separate message. As I
have not submitted patches to this list before, I will send it as a
PATCH RFC and would welcome comments.

Graham


Scrub resume failure

2019-06-06 Thread Graham Cobb
I have a btrfs filesystem which I want to scrub. This is a multi-TB
filesystem and will take well over 24 hours to scrub.

Unfortunately, the scrub turns out to be quite intrusive into the system
(even when making sure it is very low priority for ionice and nice).
Operations on other disks run excessively slowly, causing timeouts on
important actions like mail delivery (causing bounces).

So, I break it up. I run it for some interval (hours), with the
time-critical services stopped. Then I cancel the scrub and let mail
delivery run for a while. Then I stop mail again and resume the scrub
for another interval, etc.

This works and solves the mail bounce problem.
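
For anyone wanting to do something similar, the cycle is roughly the
following (mount point invented; -c 3 requests the idle I/O class, which
as noted above only helps so much):

  # start the scrub at idle I/O priority
  btrfs scrub start -c 3 /mnt/backup
  # ...some hours later, pause it and let the time-critical services run...
  btrfs scrub cancel /mnt/backup
  # ...later still, carry on from (what should be) where it stopped...
  btrfs scrub resume -c 3 /mnt/backup
  btrfs scrub status /mnt/backup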

However, after a few cancel/resume cycles, the scrub terminates. No
errors are reported but one of the resumes will just immediately
terminate claiming the scrub is done. It isn't. Nowhere near.

The disk being scrubbed is in use during all this. It doesn't get a
heavy load but it is my main backup disk and various backups happen,
some of them involving snapshots being created and deleted.

Glancing at the use of the ioctl in the btrfs-progs code, I assume the
resume is using the last_physical from the last run as the start for the
next. Does that break if the filesystem has changed and that is no
longer a used block or something? If so, I think that makes resume useless.

If this is not expected behaviour I will do more work to analyse and
reproduce.

Graham


Re: Used disk size of a received subvolume?

2019-05-17 Thread Graham Cobb
On 17/05/2019 17:39, Steven Davies wrote:
> On 17/05/2019 16:28, Graham Cobb wrote:
> 
>> That is why I created my "extents-list" stuff. This is a horrible hack
>> (one day I will rewrite it using the python library) which lets me
>> answer questions like: "how much space am I wasting by keeping
>> historical snapshots", "how much data is being shared between two
>> subvolumes", "how much of the data in my latest snapshot is unique to
>> that snapshot" and "how much space would I actually free up if I removed
>> (just) these particular directories". None of which can be answered from
>> the existing btrfs command line tools (unless I have missed something).
> 
> I have my own horrible hack to do something like this; if you ever get
> around to implementing it in Python could you share the code?
> 

Sure. The current hack (using shell and command line tools) is at
https://github.com/GrahamCobb/extents-lists. If the python version ever
materialises I expect it will end up there as well.


Re: Used disk size of a received subvolume?

2019-05-17 Thread Graham Cobb
On 17/05/2019 14:57, Axel Burri wrote:
> btrfs fi du shows me the information wanted, but only for the last
> received subvolume (as you said it changes over time, and any later
> child will share data with it). For all others, it merely shows "this
> is what gets freed if you delete this subvolume".

It doesn't even show you that: it is possible to have shared (not
exclusive) data which is only shared between files within the subvolume,
and which will be freed if the subvolume is deleted. And, of course, the
obvious problem that if you only count exclusive then no one is being
charged for all the shared segments ("Oh, my backup is getting a bit
expensive. Hmm. I know! I will back up all my files to two different
destinations, and make sure btrfs is sharing the data between both
locations! Then no one pays for it! Whoopee!")

In my opinion, the shared/exclusive information in btrfs fi du is worse
than useless: it confuses people who think it means something different
from what it does. And, in btrfs, it isn't really useful to know whether
something is "exclusive" or not -- what people care about is always
something else (which is dependent on **where** it is shared, and by whom).

The biggest problem is that you haven't defined what **you** (in your
particular use case) mean by the "size" of a subvolume. For btrfs that
doesn't have any single obvious definition.

Most commonly, I think, people mean "how much space on disk would be
freed up if I deleted this subvolume and all subvolumes contained within
it", although quite often they mean the similar (but not identical) "how
much space on disk would be freed up if I deleted just this subvolume".
And sometimes they actually mean "how much space on disk would be freed
up if I deleted this subvolume, the subvolumes contained within, and
all the snapshots I have taken but are lying around forgotten about in
some other directory tree somewhere".

But often they mean something else completely, such as "how much space
is taken up by the data which was originally created in this subvolume
but which has been cloned into all sorts of places now and may not even
be referred to from this subvolume any more" (typically this is the case
if you want to charge the subvolume owner for the data usage).

And, of course, another reading of your question would be "how much data
was transferred during this send/receive operation" (relevant if you are
running a backup service and want to charge people by how much they are
sending to the service rather than the amount of data stored).

That is why I created my "extents-list" stuff. This is a horrible hack
(one day I will rewrite it using the python library) which lets me
answer questions like: "how much space am I wasting by keeping
historical snapshots", "how much data is being shared between two
subvolumes", "how much of the data in my latest snapshot is unique to
that snapshot" and "how much space would I actually free up if I removed
(just) these particular directories". None of which can be answered from
the existing btrfs command line tools (unless I have missed something).

> And it is pretty slow: on my backup disk (spinning rust, ~2000
> subvolumes, ~100 sharing data), btrfs fi du takes around 5min for a
> subvolume of 20GB, while btrfs find-new takes only seconds.

Yes. Answering the real questions involves taking the FIEMAP data for
every file involved (which, for some questions, is actually every file
on the disk!) so it takes a very long time. Days for my multi-terabyte
backup disk.

> Summing up, what I'm looking for would be something like:
> 
>   btrfs fi du -s --exclusive-relative-to= 

You can do that with FIEMAP data. Feel free to look at extents-lists. Also
feel free to shout "this is a gross hack" and scream at me!

If you really just need it for two subvols like that

extents-expr -s <subvol1> - <subvol2>

will tell you how much space is in extents used in <subvol1> but not used
in <subvol2>.

Graham


Re: Btrfs send with parent different size depending on source of files.

2019-02-18 Thread Graham Cobb
On 18/02/2019 19:58, André Malm wrote:
> What causes the extent to be incomplete? And can I avoid it? 

Does it matter? I presume the send is working OK, it is just that it
sends a little more data than it needs to. Or have you seen any data loss?

Graham


Re: experiences running btrfs on external USB disks?

2018-12-04 Thread Graham Cobb
On 04/12/2018 12:38, Austin S. Hemmelgarn wrote:
> In short, USB is _crap_ for fixed storage, don't use it like that, even
> if you are using filesystems which don't appear to complain.

That's useful advice, thanks.

Do you (or anyone else) have any experience of using btrfs over iSCSI? I
was thinking about this for three different use cases:

1) Giving my workstation a data disk that is actually a partition on a
server -- keeping all the data on the big disks on the server and
reducing power consumption (just a small boot SSD in the workstation).

2) Splitting a btrfs RAID1 between a local disk and a remote iSCSI
mirror to provide  redundancy without putting more disks in the local
system. Of course, this would mean that one of the RAID1 copies would
have higher latency than the other.

3) Like case 1 but actually exposing an LVM logical volume from the
server using iSCSI, rather than a simple disk partition. I would then
put both encryption and RAID running on the server below that logical
volume.

NBD could also be an alternative to iSCSI in these cases as well.

Any thoughts?

Graham


Re: btrfs fi du unreliable?

2018-08-29 Thread Graham Cobb
On 29/08/18 14:31, Jorge Bastos wrote:
> Thanks, that makes sense, so it's only possible to see how much space
> a snapshot is using with quotas enable, I remember reading that
> somewhere before, though there was a new way after reading this latest
> post .

My extents lists scripts (https://github.com/GrahamCobb/extents-lists)
can tell you the answers to questions like this. In particular, see the
extents-to-remove script.

However, be aware of the warning in the documentation:
---
Be warned: the last three examples take a very LONG TIME (and require a
lot of space in $TMPDIR) as they effectively have to get the file
extents for almost every file on the disk (and sort them multiple
times). They take over 12 hours on my system!
---

I don't know if there are any better tools which work faster.


Re: Recommendations for balancing as part of regular maintenance?

2018-01-08 Thread Graham Cobb
On 08/01/18 16:34, Austin S. Hemmelgarn wrote:
> Ideally, I think it should be as generic as reasonably possible,
> possibly something along the lines of:
> 
> A: While not strictly necessary, running regular filtered balances (for
> example `btrfs balance start -dusage=50 -dlimit=2 -musage=50 -mlimit=4`,
> see `man btrfs-balance` for more info on what the options mean) can help
> keep a volume healthy by mitigating the things that typically cause
> ENOSPC errors.  Full balances by contrast are long and expensive
> operations, and should be done only as a last resort.

That recommendation is similar to what I do and it works well for my use
case. I would recommend it to anyone with my usage, but cannot say how
well it would work for other uses. In my case, I run balances like that
once a week: some weeks nothing happens, other weeks 5 or 10 blocks may
get moved.
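
For example, a weekly run can be as simple as a cron entry along these
lines (path and thresholds are illustrative only, based on the options
quoted above):

  # /etc/cron.d/btrfs-balance -- filtered balance every Sunday at 03:00
  0 3 * * 0  root  /bin/btrfs balance start -dusage=50 -dlimit=2 -musage=50 -mlimit=4 /mnt/data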

For reference, my use case is for two separate btrfs filesystems each on
a single large disk (so no RAID) -- the disks are 6TB and 12TB, both
around 80% used -- one is my main personal data disk, the other is my
main online backup disk.

The data disk receives all email delivery (so lots of small files,
coming and going), stores TV programs as PVR storage (many GB sized
files, each one written once, which typically stick around for a while
and eventually get deleted) and is where I do my software development
(sources and build objects). No (significant) database usage. I am
guessing this is pretty typical personal user usage (although it doesn't
store any operating system files). The only unusual thing is that I have
it set up as about 20 subvolumes, and each one has frequent snapshots
(maybe 200 or so subvolumes in total at any time).

The online backup disk receives backups from all my systems in three
main forms: btrfs snapshots (send/receive), rsnapshot copies (rsync),
and DAR archives. Most get updated daily. It contains several hundred
snapshots (most received from the data disk).

It would be interesting to hear if similar balancing is seen as useful
for other very different cases (RAID use, databases or VM disks, etc).

Graham


Re: [RFC] Improve subvolume usability for a normal user

2017-12-05 Thread Graham Cobb
On 05/12/17 18:01, Goffredo Baroncelli wrote:
> On 12/05/2017 04:42 PM, Graham Cobb wrote:
>> On 05/12/17 12:41, Austin S. Hemmelgarn wrote:
>>> On 2017-12-05 03:43, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2017年12月05日 16:25, Misono, Tomohiro wrote:
>>>>> Hello all,
>>>>>
>>>>> I want to address some issues of subvolume usability for a normal user.
>>>>> i.e. a user can create subvolumes, but
>>>>>   - Cannot delete their own subvolume (by default)
>>>>>   - Cannot tell subvolumes from directories (in a straightforward way)
>>>>>   - Cannot check the quota limit when qgroup is enabled
>>>>>
>>>>> Here I show the initial thoughts and approaches to this problem.
>>>>> I want to check if this is a right approach or not before I start
>>>>> writing code.
>>>>>
>>>>> Comments are welcome.
>>>>> Tomohiro Misono
>>>>>
>>>>> ==
>>>>> - Goal and current problem
>>>>> The goal of this RFC is to give a normal user more control to their
>>>>> own subvolumes.
>>>>> Currently the control to subvolumes for a normal user is restricted
>>>>> as below:
>>>>>
>>>>> +-------------+------+------+
>>>>> |   command   | root | user |
>>>>> +-------------+------+------+
>>>>> | sub create  | Y    | Y    |
>>>>> | sub snap    | Y    | Y    |
>>>>> | sub del     | Y    | N    |
>>>>> | sub list    | Y    | N    |
>>>>> | sub show    | Y    | N    |
>>>>> | qgroup show | Y    | N    |
>>>>> +-------------+------+------+
>>>>>
>>>>> In short, I want to change this as below in order to improve user's
>>>>> usability:
>>>>>
>>>>> +-------------+------+--------+
>>>>> |   command   | root | user   |
>>>>> +-------------+------+--------+
>>>>> | sub create  | Y    | Y      |
>>>>> | sub snap    | Y    | Y      |
>>>>> | sub del     | Y    | N -> Y |
>>>>> | sub list    | Y    | N -> Y |
>>>>> | sub show    | Y    | N -> Y |
>>>>> | qgroup show | Y    | N -> Y |
>>>>> +-------------+------+--------+
>>>>>
>>>>> In words,
>>>>> (1) allow deletion of subvolume if a user owns it, and
>>>>> (2) allow getting subvolume/quota info if a user has read access to it
>>>>>   (sub list/qgroup show just lists the subvolumes which are readable
>>>>> by the user)
>>>>>
>>>>> I think other commands not listed above (qgroup limit, send/receive
>>>>> etc.) should
>>>>> be done by root and not be allowed for a normal user.
>>>>>
>>>>>
>>>>> - Outside the scope of this RFC
>>>>> There is a qualitative problem to use qgroup for limiting user disk
>>>>> amount;
>>>>> quota limit can easily be averted by creating a subvolume. I think
>>>>> that forcing
>>>>> inheriting quota of parent subvolume is a solution, but I won't
>>>>> address nor
>>>>> discuss this here.
>>>>>
>>>>>
>>>>> - Proposal
>>>>>   (1) deletion of subvolume
>>>>>
>>>>>    I want to change the default behavior to allow a user to delete
>>>>> their own
>>>>>    subvolumes.
>>>>>       This is not the same behavior as when user_subvol_rm_alowed
>>>>> mount option is
>>>>>    specified; in that case a user can delete the subvolume to which
>>>>> they have
>>>>>    write+exec right.
>>>>>       Since snapshot creation is already restricted to the subvolume
>>>>> owner, it is
>>>>>    consistent that only the owner of the subvolume (or root) can
>>>>> delete it.
>>>>>       The implementation should be straightforward.
>>>>
>>>> Personally speaking, I prefer to do the complex owner check in user
>>>> daemon.
>>>>
>>>> And do the privilege in user daemon (call it btrfsd for example).
>>>>
>>>> So btrfs-progs will works in 2 modes, if root calls it, do as it used
>>>> to do.
>>>> I

Re: [RFC] Improve subvolume usability for a normal user

2017-12-05 Thread Graham Cobb
On 05/12/17 12:41, Austin S. Hemmelgarn wrote:
> On 2017-12-05 03:43, Qu Wenruo wrote:
>>
>>
>> On 2017年12月05日 16:25, Misono, Tomohiro wrote:
>>> Hello all,
>>>
>>> I want to address some issues of subvolume usability for a normal user.
>>> i.e. a user can create subvolumes, but
>>>   - Cannot delete their own subvolume (by default)
>>>   - Cannot tell subvolumes from directories (in a straightforward way)
>>>   - Cannot check the quota limit when qgroup is enabled
>>>
>>> Here I show the initial thoughts and approaches to this problem.
>>> I want to check if this is a right approach or not before I start
>>> writing code.
>>>
>>> Comments are welcome.
>>> Tomohiro Misono
>>>
>>> ==
>>> - Goal and current problem
>>> The goal of this RFC is to give a normal user more control to their
>>> own subvolumes.
>>> Currently the control to subvolumes for a normal user is restricted
>>> as below:
>>>
>>> +-------------+------+------+
>>> |   command   | root | user |
>>> +-------------+------+------+
>>> | sub create  | Y    | Y    |
>>> | sub snap    | Y    | Y    |
>>> | sub del     | Y    | N    |
>>> | sub list    | Y    | N    |
>>> | sub show    | Y    | N    |
>>> | qgroup show | Y    | N    |
>>> +-------------+------+------+
>>>
>>> In short, I want to change this as below in order to improve user's
>>> usability:
>>>
>>> +-------------+------+--------+
>>> |   command   | root | user   |
>>> +-------------+------+--------+
>>> | sub create  | Y    | Y      |
>>> | sub snap    | Y    | Y      |
>>> | sub del     | Y    | N -> Y |
>>> | sub list    | Y    | N -> Y |
>>> | sub show    | Y    | N -> Y |
>>> | qgroup show | Y    | N -> Y |
>>> +-------------+------+--------+
>>>
>>> In words,
>>> (1) allow deletion of subvolume if a user owns it, and
>>> (2) allow getting subvolume/quota info if a user has read access to it
>>>   (sub list/qgroup show just lists the subvolumes which are readable
>>> by the user)
>>>
>>> I think other commands not listed above (qgroup limit, send/receive
>>> etc.) should
>>> be done by root and not be allowed for a normal user.
>>>
>>>
>>> - Outside the scope of this RFC
>>> There is a qualitative problem to use qgroup for limiting user disk
>>> amount;
>>> quota limit can easily be averted by creating a subvolume. I think
>>> that forcing
>>> inheriting quota of parent subvolume is a solution, but I won't
>>> address nor
>>> discuss this here.
>>>
>>>
>>> - Proposal
>>>   (1) deletion of subvolume
>>>
>>>    I want to change the default behavior to allow a user to delete
>>> their own
>>>    subvolumes.
>>>       This is not the same behavior as when user_subvol_rm_alowed
>>> mount option is
>>>    specified; in that case a user can delete the subvolume to which
>>> they have
>>>    write+exec right.
>>>       Since snapshot creation is already restricted to the subvolume
>>> owner, it is
>>>    consistent that only the owner of the subvolume (or root) can
>>> delete it.
>>>       The implementation should be straightforward.
>>
>> Personally speaking, I prefer to do the complex owner check in user
>> daemon.
>>
>> And do the privilege in user daemon (call it btrfsd for example).
>>
>> So btrfs-progs will works in 2 modes, if root calls it, do as it used
>> to do.
>> If normal user calls it, proxy the request to btrfsd, and btrfsd does
>> the privilege checking and call ioctl (with root privilege).
>>
>> Then no impact to kernel, all complex work is done in user space.
> Exactly how hard is it to just check ownership of the root inode of a
> subvolume from the ioctl context?  You could just as easily push all the
> checking out to the VFS layer by taking an open fd for the subvolume
> root (and probably implicitly closing it) instead of taking a path, and
> that would give you all the benefits of ACL's and whatever security
> modules the local system is using.  

+1 - stop inventing new access control rules for each different action!




Re: Why isnt NOCOW attributes propogated on snapshot transfers?

2017-10-16 Thread Graham Cobb
On 16/10/17 14:28, David Sterba wrote:
> On Sun, Oct 15, 2017 at 04:19:23AM +0300, Cerem Cem ASLAN wrote:
>> `btrfs send | btrfs receive` removes NOCOW attributes. Is it a bug or
>> a feature? If it's a feature, how can we keep these attributes if we
>> need to?
> 
> This is a known defficiency of send protocol v1. And there are more,
> listed on
> https://btrfs.wiki.kernel.org/index.php/Design_notes_on_Send/Receive#Send_stream_v2_draft

It is not mentioned on the list (and I haven't tested to find out)...
but if xattrs are supported in the V1 protocol, any chance of `send`
converting these file flags to xattr, and `receive` converting back? No
need for a protocol bump and would even preserve the information
(although not the effect) in the case of an old receiver.


Re: [PATCH] btrfs-progs: add option to only list parent subvolumes

2017-09-30 Thread Graham Cobb
On 30/09/17 19:17, Holger Hoffstätte wrote:
> On 09/30/17 19:56, Holger Hoffstätte wrote:
>> shell hackery as alternative. Anyway, I was sure that at the time the
>> other letters sounded even worse/were taken, but that may just have been
>> in my head. ;-)
>>
>> I just rechecked and -S is still available, so that's good.
> 
> Except that it isn't really, since there is already an 'S'
> case in cmds-subvolume.c as shortcut to --sort:

That's a shame (and it is also a shame to waste a single letter option
without documenting it!).

I still would encourage you to avoid -P. I think there is user confusion
by "parent" having more than one meaning even within btrfs. And I feel
it also tends to perpetuate the mistaken belief that snapshots are
somehow "special", and different from other subvolumes (rather than just
a piece of information about how two subvolumes are related). It also
allows -P to be used one day for the "search by parent UUID" feature.

Given the constraints, I would suggest -n. It is mostly arbitrary but it
is the second letter of snapshot and also the first of "not a snapshot".

Thanks for considering.

Graham


Re: [PATCH] btrfs-progs: add option to only list parent subvolumes

2017-09-30 Thread Graham Cobb
On 30/09/17 14:08, Holger Hoffstätte wrote:
> A "root" subvolume is identified by a null parent UUID, so adding a new
> subvolume filter and flag -P ("Parent") does the trick. 

I don't like the naming. The flag you are proposing is really nothing to
do with whether a subvolume is a parent or not: it is about whether it
is a snapshot or not (many subvolumes are both snapshots and also
parents of other snapshots, and many non-snapshots are not the parent of
any subvolumes).

I have two suggestions:

1) Use -S (meaning "not a snapshot", the inverse of -s). Along with this
change. I would change the usage text to say something like:

 -s list subvolumes originally created as snapshots
 -S list subvolumes originally created not as snapshots

Presumably specifying both -s and -S should be an error.

2) Add a -P (parent) option but make it take an argument: the UUID of
the parent to match. This would display only subvolumes originally
created as snapshots of the specified subvolume (which may or may not
still exist, of course). A null value ('' -- or a special text like
'NULL' or 'NONE' if you prefer) would create the search you were looking
for: subvolumes with a null Parent UUID.

The second option is more code, of course, but I see being able to list
all the snapshots of a particular subvolume as quite useful.

If you do choose the second option you need to decide what to do if the
-P is specified more than once. Probably treat it as an error (unless
you want to allow a list of UUIDs any of which can match). You might
also want to reject an attempt to specify both -s and -P.

Graham


Re: difference between -c and -p for send-receive?

2017-09-19 Thread Graham Cobb
On 19/09/17 01:41, Dave wrote:
> Would it be correct to say the following?

Like Duncan, I am just a user, and I haven't checked the code. I
recommend Duncan's explanation, but in case you are looking for
something simpler, how about thinking with the following analogy...

Think of -p as like doing an incremental backup: it tells send to just
send the instructions for the changes to get from the "parent" subvolume
to the current subvolume. Without -p it is like a full backup:
everything in the current subvolume is sent.

-c is different: it says "and by the way, these files also already exist
on the destination so they might be useful to skip actually sending some
of the file contents". Imagine that whenever a file content is about to
be sent (whether incremental or full), btrfs-send checks to see if the
data is in one of the -c subvolumes and, if it is, it sends "get the
data by reflinking to this file over here" instead of sending the data
itself. -c is really just an optimisation to save sending data if you
know the data is already available somewhere else on the destination.
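
A sketch in command form (snapshot paths invented):

  # full send: everything in the snapshot is sent
  btrfs send /snaps/data.1 | btrfs receive /backup

  # incremental send: only the changes needed to get from the parent
  # snapshot to this one are sent
  btrfs send -p /snaps/data.1 /snaps/data.2 | btrfs receive /backup

  # full send, but data already present in the clone source on the
  # destination can be reflinked there instead of being sent again
  btrfs send -c /snaps/other.2 /snaps/data.2 | btrfs receive /backup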

Be aware that this is really just an analogy (like "hard linking" is an
analogy for reflinking using the clone range ioctl). Duncan's email
provides more real details.

In particular, this analogy doesn't explain the original questioner's
problem. In the analogy, -c might work without the files actually being
present on the source (as long as they are on the destination). But, in
reality, because the underlying mechanism is extent range cloning, the
files have to be present on **both** the source and the destination in
order for btrfs-send to work out what commands to send.

By the way, like Duncan, I was surprised that the man page suggests that
-c without -p causes one of the clones to be treated as a parent. I have
not checked the code to see if that is actually how it works.

Graham


Re: ERROR: parent determination failed (btrfs send-receive)

2017-09-18 Thread Graham Cobb
On 18/09/17 07:10, Dave wrote:
> For my understanding, what are the restrictions on deleting snapshots?
> 
> What scenarios can lead to "ERROR: parent determination failed"?

The man page for btrfs-send is reasonably clear on the requirements
btrfs imposes. If you want to use incremental sends (i.e. the -c or -p
options) then the specified snapshots must exist on both the source and
destination. If you don't have a suitable existing snapshot then don't
use -c or -p and just do a full send.
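
As a quick check (paths invented), you can confirm that a candidate
parent really exists on both sides before relying on it:

  # on the source: note the snapshot's UUID
  btrfs subvolume show /snaps/home.old | grep UUID
  # on the destination: the same value should appear as "Received UUID"
  btrfs subvolume show /backup/home.old | grep UUID
  # if no such common snapshot exists, drop -p/-c and do a full send
  btrfs send /snaps/home.new | btrfs receive /backup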

> I use snap-sync to create and send snapshots.
> 
> GitHub - wesbarnett/snap-sync: Use snapper snapshots to backup to external 
> drive
> https://github.com/wesbarnett/snap-sync

I am not familiar with this tool. Your question should be sent to the
author of the tool, if that is what is deciding what -p and -c options
are being used.

Personally I use and recommend btrbk. I have never had this issue and
the configuration options let me limit the snapshots it saves on both
the source and destination disks separately (so I keep fewer on the
source than on the backup disk).

Graham


Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-14 Thread Graham Cobb
On 14/08/17 16:53, Austin S. Hemmelgarn wrote:
> Quite a few applications actually _do_ have some degree of secondary
> verification or protection from a crash.  

I am glad your applications do and you have no need of this feature.
You are welcome not to use it. I, on the other hand, definitely want
this feature and would have it enabled by default on all my systems
despite the need for manual actions after some unclean shutdowns.

> Go look at almost any database
> software.  It usually will not have checksumming, but it will almost
> always have support for a journal, which is enough to cover the
> particular data loss scenario we're talking about (unexpected unclean
> shutdown).

No, the problem we are talking about is the data-at-rest corruption that
checksumming is designed to deal with. That is why I want it. The
unclean shutdown is a side issue that means there is a trade-off to
using it.

No one is suggesting that checksums are any significant help with the
unclean shutdown case, just that the existence of that atomicity issue
does not **prevent** them being very useful for the function for which
they were designed. The degree to which any particular sysadmin will
choose to enable or disable checksums on nodatacow files will depend on
how much they value the checksum protection vs. the impact of manually
fixing problems after some unclean shutdowns.

In my particular case, many of these nodatacow files are large, very
long-lived and only in use intermittently. I would like my monthly
"btrfs scrub" to know they haven't gone bad but they are extremely
unlikely to be in the middle of a write during an unclean shutdown so I
am likely to have very few false errors. They are all backed up, but
without checksumming I don't know that the backup needs to be restored
(or even that I am not backing up now-bad data).

Graham


Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-14 Thread Graham Cobb
On 14/08/17 15:23, Austin S. Hemmelgarn wrote:
> Assume you have higher level verification.  

But almost no applications do. In real life, the decision
making/correction process will be manual and labour-intensive (for
example, running fsck on a virtual disk or restoring a file from backup).

> Would you rather not be able
> to read the data regardless of if it's correct or not, or be able to
> read it and determine yourself if it's correct or not?  

It must be controllable on a per-file basis, of course. For the tiny
number of files where the app can both spot the problem and correct it
(for example if it has a journal) the current behaviour could be used.

But, on MY system, I absolutely would **always** select the first option
(-EIO). I need to know that a potential problem may have occurred and
will take manual action to decide what to do. Of course, this also needs
a special utility (as Christoph proposed) to be able to force the read
(to allow me to examine the data) and to be able to reset the checksum
(although that is presumably as simple as rewriting the data).

This is what happens normally with any filesystem when a disk block goes
bad, but with the additional benefit of being able to examine a
"possibly valid" version of the data block before overwriting it.

> Looking at this from a different angle: Without background, what would
> you assume the behavior to be for this?  For most people, the assumption
> would be that this provides the same degree of data safety that the
> checksums do when the data is CoW.  

Exactly. The naive expectation is that turning off datacow does not
prevent the bitrot checking from working. Also, the naive expectation
(for any filesystem operation) is that if there is any doubt about the
reliability of the data, the error is reported for the user to deal with.


Re: kernel btrfs file system wedged -- is it toast?

2017-07-21 Thread Graham Cobb
On 21/07/17 07:06, Paul Jackson wrote:
> What in god green's earth can kernel file system code be
> doing that takes fifteen minutes (so far, in this case) or
> fifty minutes (in the case I first reported on this thread?

I find that just doing a balance on a disk with lots of snapshots can
cause this sort of effect.  If I understand correctly, this is because
btrfs does not have an efficient structure to help find all the
references in different subvolumes to an extent which is being
manipulated and many trees have to be searched. My understanding may be
wrong but, in any case, the effect is that many operations can take
massive amounts of processing time if there are lots of shared extents.

This caused my early experiments with using btrfs snapshots on my main
data disk (full of mail files, etc) to make the system lock up for many
**hours** at a time (preventing mail processing, etc).

I took the advice on this list to significantly decrease the number of
snapshots I kept on the disk (I keep more on a separate backup disk,
which has no day-to-day transactions happening, and can also better
tolerate any issues).

I also created a very hacky script to try to limit the impact of the
balances which I do weekly. See
https://github.com/GrahamCobb/btrfs-balance-slowly if you are
interested. Between these things, the serious disruption is now gone.
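
If you do not want to pull in the script, the same general approach can
be approximated with a loop that balances a few block groups at a time
and pauses in between (path, limits and timings invented; the real script
is more careful about checking results and handling errors):

  for i in $(seq 1 20); do
      btrfs balance start -dusage=50 -dlimit=5 -musage=50 -mlimit=2 /mnt/data
      sleep 600
  done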

Note: despite all the hangs, I never saw any disk corruption.

Graham


Re: backing up a collection of snapshot subvolumes

2017-04-25 Thread Graham Cobb
On 25/04/17 05:02, J. Hart wrote:
> I have a remote machine with a filesystem for which I periodically take
> incremental snapshots for historical reasons.  These snapshots are
> stored in an archival filesystem tree on a file server.  Older snapshots
> are removed and newer ones added on a rotational basis.  I need to be
> able to backup this archive by syncing it with a set of backup drives.
> Due to the size, I need to back it up incrementally rather than sending
> the entire content each time.  Due to the snapshot rotation, I need to
> be able to update the state of the archive backup filesystem as a whole,
> in much the same manner that rsync handles file trees.

If I have understood your requirement correctly, this seems to be
exactly matched to the capabilities of btrbk. I use btrbk to maintain a
similar backup disk which contains a full copy of my main data disk
along with various snapshots.

> It seems that I cannot use "btrfs send", as the archive directory
> contains the snapshots as subvolumes.

I'm not sure what you mean. If your problem is that btrfs send does not
cross subvolume boundaries then that is true: you would need to
configure btrbk to back up each subvolume. I have a cron job that checks
that all subvolumes (except the snapshots btrbk creates) are listed in
my btrbk configuration file.

Graham



Re: backing up a file server with many subvolumes

2017-03-27 Thread Graham Cobb
On 27/03/17 13:00, J. Hart wrote:
> That is a very interesting idea.  I'll try some experiments with this.

You might want to look into two tools which I have found useful for
similar backups:

1) rsnapshot -- this uses rsync for backing up multiple systems and has
been stable for quite a long time. If the target disk is btrfs it is
fairly easy to configure so that it uses btrfs snapshots to create and
remove the snapshot directories, speeding up the process. This doesn't
really use any complex btrfs features and has been stable for me even on
my Debian stable (kernel 3.16.39) system.

2) btrbk -- this allows you to create and manage btrfs snapshots on the
source disk as well as backup snapshots on a separate btrfs disk. You
can separately control how many snapshots you keep online on both the
source and the backup disk. This is particularly useful for cases where
you want to take very frequent snapshots (say hourly) for which rsync
may be too slow (and rsync does not take a consistent snapshot, of course).

There are many other tools, of course (I also take daily backups with
dar to an ext4 system, without using any btrfs features at all, just in
case a new version of btrfs suddenly decided to correct all copies of
IHATEBTRFS on the disk to ILOVEBTRFS, for example :-) ).

Graham

Note to self: re-read this message periodically to check that feature
hasn't appeared yet.


Re: BTRFS and cyrus mail server

2017-02-08 Thread Graham Cobb
On 08/02/17 18:38, Libor Klepáč wrote:
> I'm interested in using:
...
>  - send/receive for offisite backup

I don't particularly recommend that. I do use send/receive for onsite
backups (I actually use btrbk). But for offsite I use a traditional
backup tool (I use dar). For three main reasons:

1) Paranoia: I want a backup that does not use btrfs just in case there
turned out to be some problem with btrfs which could corrupt the backup.
I can't think of anything but I did say it was paranoia!

2) send/receive in incremental mode (the obvious way to use it for
offsite backups) relies on the target being up to date and properly
synchronised with the source. If, for any reason, it gets out of sync,
you have to start again with sending a full backup - a lot of data.
Traditional backup formats are more forgiving and having a corrupted
incremental does not normally prevent you getting access to data stored
in the other incrementals. This would particularly be a risk if you
thought about storing the actual send streams instead of doing the
receive: a single bit error in one could make all the subsequent streams
useless.

3) send/receive doesn't work particularly well with encryption. I store
my offsite backups in a cloud service and I want them encrypted both in
transit and when stored. To get the same with send/receive requires
putting together your own encrypted communication channel (e.g. using
ssh) and requires that you have a remote server, with an encrypted
filesystem receiving the data (and it has to be accessible in the clear
on that server). Traditional backups can just be stored offsite as
encrypted files without ever having to be in the clear anywhere except
onsite.

Just my reasons.



Re: btrfs receive leaves new subvolume modifiable during operation

2017-02-06 Thread Graham Cobb
On 05/02/17 12:08, Kai Krakow wrote:
> Wrong. If you tend to not be in control of the permissions below a
> mountpoint, you prevent access to it by restricting permissions on a
> parent directory of the mountpoint. It's that easy and it always has
> been. That is standard practice. While your backup is running, you have
> no control of it - thus use this standard practice!

Sorry, you are missing the point. This isn't about backups, it is about
snapshots.

To the sysadmin who is not a developer and does not know how receive is
actually implemented, send/receive appears to work exactly like taking a
readonly snapshot, but between two different disks. That is the mental
model they have of the process.

Taking a snapshot does not require hiding the target: it either works or
it doesn't, and it cannot be interfered with. The sysadmin's natural
expectation is that send/receive works the same way.

You may say, from your position of knowledge about how it is
implemented, that is an unrealistic expectation but it is a natural and
common expectation. I very firmly believe that 80% of ordinary btrfs
sysadmins would be surprised by this behaviour.

But, in any case, we can all agree that this unexpected behaviour needs
to be documented.


Re: btrfs receive leaves new subvolume modifiable during operation

2017-02-03 Thread Graham Cobb
On 03/02/17 16:01, Austin S. Hemmelgarn wrote:
> Ironically, I ended up having time sooner than I thought.  The message
> doesn't appear to be in any of the archives yet, but the message ID is:
> <20170203134858.75210-1-ahferro...@gmail.com>

Ah. I didn't notice it until after I had sent my message.

> I actually like how you explained things a bit better though, so if you
> are OK with it I'll update the patch I sent using your description (and
> credit you in the commit message too of course).

You are welcome to use any of my phrasing or approach, of course!




Re: btrfs receive leaves new subvolume modifiable during operation

2017-02-03 Thread Graham Cobb
On 03/02/17 12:44, Austin S. Hemmelgarn wrote:
> I can look at making a patch for this, but it may be next week before I
> have time (I'm not great at multi-tasking when it comes to software
> development, and I'm in the middle of helping to fix a bug in Ansible
> right now).

That would be great, Austin! It is about 15 years since I last submitted
a patch under kernel development patch rules and things have changed a
fair bit in that time. So if you are set up to do it that sounds good.

As a starting point, I have created a suggested text (patch attached).




diff --git a/Documentation/btrfs-receive.asciidoc b/Documentation/btrfs-receive.asciidoc
index 6be4aa6..db525d9 100644
--- a/Documentation/btrfs-receive.asciidoc
+++ b/Documentation/btrfs-receive.asciidoc
@@ -31,7 +31,7 @@ the stream, and print the stream metadata, one operation per line.
 
 3. default subvolume has changed or you didn't mount the filesystem at the toplevel subvolume
 
-A subvolume is made read-only after the receiving process finishes succesfully.
+A subvolume is made read-only after the receiving process finishes succesfully (see BUGS below).
 
 `Options`
 
@@ -73,6 +73,16 @@ EXIT STATUS
 *btrfs receive* returns a zero exit status if it succeeds. Non zero is
 returned in case of failure.
 
+BUGS
+
+*btrfs receive* sets the subvolume read-only after it completes successfully.
+However, while the receive is in progress, users who have write access to files
+or directories in the receiving 'path' can add, remove or modify files, in which
+case the resulting read-only subvolume will not be a copy of the sending subvolume.
+
+If the intention is to create an exact copy, the receiving 'path' should be protected
+from access by users until the receive has completed and the subvolume set to read-only.
+
 AVAILABILITY
 
 *btrfs* is part of btrfs-progs.


Re: btrfs receive leaves new subvolume modifiable during operation

2017-02-02 Thread Graham Cobb
On 02/02/17 00:02, Duncan wrote:
> If it's a workaround, then many of the Linux procedures we as admins and 
> users use every day are equally workarounds.  Setting 007 perms on a dir 
> that doesn't have anything immediately security vulnerable in it, simply 
> to keep other users from even potentially seeing or being able to write 
> to something N layers down the subdir tree, is standard practice.

No. There is no need to normally place a read-only snapshot below a
no-execute directory just to prevent write access to it. That is not
part of the admin's expectation.

> Which is my point.  This is no different than standard security practice, 
> that an admin should be familiar with and using without even having to 
> think about it.  Btrfs is simply making the same assumptions that 
> everyone else does, that an admin knows what they are doing and sets the 
> upstream permissions with that in mind.  If they don't, how is that 
> btrfs' fault?

Because btrfs intends the receive snapshot to be read-only. That is the
expectation of the sysadmin. It is an important feature which makes
send/receive well suited to creating user-readable-but-not-modifiable
backups (without it, send/receive is still useful for many things, but
less so for creating backups). That feature has a bug.

Just because you don't personally use the feature, doesn't mean it isn't
a bug! Many of us do rely on that feature.

Even though it is security-related, I agree it isn't the highest
priority btrfs bug. It can probably wait until receive is being worked
on for other reasons. But if it isn't going to be fixed any time soon,
it should be documented in the Wiki and the man page, with the suggested
workround for anyone who needs to make sure the receive won't be
tampered with.


Re: btrfs receive leaves new subvolume modifiable during operation

2017-02-01 Thread Graham Cobb
On 01/02/17 22:27, Duncan wrote:
> Graham Cobb posted on Wed, 01 Feb 2017 17:43:32 + as excerpted:
> 
>> This first bug is more serious because it appears to allow a
>> non-privileged user to disrupt the correct operation of receive,
>> creating a form of denial-of-service of a send/receive based backup
>> process. If I decided that I didn't want my pron collection (or my
>> incriminating emails) appearing in the backups I could just make sure
>> that I removed them from the receive snapshots while they were still
>> writeable.
> 
> I'll prefix this question by noting that my own use-case doesn't use send/
> receive, so while I know about it in general from following the list, 
> I've no personal experience with it...
> 
> With that said, couldn't the entire problem be eliminated by properly 
> setting the permissions on a directory/subvol upstream of the received 
> snapshot?  

I (honestly) don't know. But even if that does work, it is clearly only
a workaround for the bug. Where in the documentation does it warn the
system manager about the problem? Where does it tell them that they had
better make sure they only receive into a directory tree which does not
allow users read or execute access (not just not write access!)? What if
part of the point of the backup strategy is that users have read access
to these snapshots so they can restore their own files?

The possibility of a knowledgeable system manager being able to
work around the problem by limiting how they use it doesn't stop it being
a bug.


Re: btrfs receive leaves new subvolume modifiable during operation

2017-02-01 Thread Graham Cobb
On 01/02/17 12:28, Austin S. Hemmelgarn wrote:
> On 2017-02-01 00:09, Duncan wrote:
>> Christian Lupien posted on Tue, 31 Jan 2017 18:32:58 -0500 as excerpted:
>>
>>> I have been testing btrfs send/receive. I like it.
>>>
>>> During those tests I discovered that it is possible to access and modify
>>> (add files, delete files ...) of the new receive snapshot during the
>>> transfer. After the transfer it becomes readonly but it could already
>>> have been modified.
>>>
>>> So you can end up with a source and a destination which are not the
>>> same. Therefore during a subsequent incremental transfers I can get
>>> receive to crash (trying to unlink a file that is not in the parent but
>>> should).
>>>
>>> Is this behavior by design or will it be prevented in the future?
>>>
>>> I can of course just not modify the subvolume during receive but is
>>> there a way to make sure no user/program modifies it?
>>
>> I'm just a btrfs-using list regular not a dev, but AFAIK, the behavior is
>> likely to be by design and difficult to change, because the send stream
>> is simply a stream of userspace-context commands for receive to act upon,
>> and any other suitably privileged userspace program could run the same
>> commands.  (If your btrfs-progs is new enough receive even has a dump
>> option, that prints the metadata operations in human readable form, one
>> operation per line.)
>>
>> So making the receive snapshot read-only during the transfer would
>> prevent receive itself working.
> That's correct.  Fixing this completely would require implementing
> receive on the kernel side, which is not a practical option for multiple
> reasons.

I am with Christian on this. Both the effects he discovered go against
my expectation of how send/receive would or should work.

This first bug is more serious because it appears to allow a
non-privileged user to disrupt the correct operation of receive,
creating a form of denial-of-service of a send/receive based backup
process. If I decided that I didn't want my pron collection (or my
incriminating emails) appearing in the backups I could just make sure
that I removed them from the receive snapshots while they were still
writeable.

You may be right that fixing this would require receive in the kernel,
and that is undesirable, although it seems to me that it should be
possible to do something like allow receive to create the snapshot with
a special flag that would cause the kernel to treat it as read-only to
any requests not delivered through the same file descriptor, or
something like that (or, if that can't be done, at least require root
access to make any changes). In any case, I believe it should be treated
as a bug, even if low priority, with an explicit warning about the
possible corruption of receive-based backups in the btrfs-receive man page.

>>> I can also get in the same kind of trouble by modifying a parent (after
>>> changing its property temporarily to ro=false). send/receive is checking
>>> that the same parent uuid is available on both sides but not that
>>> generation has not changed. Of course in this case it requires direct
>>> user intervention. Never changing the ro property of subvolumes would
>>> prevent the problem.
>>>
>>> Again is this by design?
>>
>> Again, yes.  The ability to toggle snapshots between ro/rw is a useful
>> feature and was added deliberately.  This one would seem to me to be much
>> like the (no doubt apocryphal) guy who went to the doctor complaining
>> that when he beat his head against the wall, it hurt.  The doctor said,
>> "Stop doing that then."
> Agreed, especially considering that some of the most interesting
> use-cases for send/receive (which requires the sent subvolume to be
> read-only) require the subvolume to be made writable again on the other
> end.

I agree that there are good reasons why subvolumes should be switchable
between ro and rw. However, receive should detect and issue warnings
when this problem has happened (for example by checking the generation).
Again, this may be low priority, and may need to wait for a send stream
format change, but it can't be claimed that this is correct behaviour.

Graham


Re: btrfs recovery

2017-01-31 Thread Graham Cobb
On 30/01/17 22:37, Michael Born wrote:
> Also, I'm not interested in restoring the old Suse 13.2 system. I just
> want some configuration files from it.

If all you really want is to get some important information from some
specific config files, and it is so important it is worth an hour or so
of your time, you could consider a brute-force method such as just
grep-ing the whole image file for a string you know should appear in the
relevant config file and dumping the blocks around those locations to
see if you can see the data you need.

Unfortunately this won't work if you had file compression on. Or if
there is no reasonably unique text to search for, of course. Just a thought.
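
A sketch of what I mean (image path and search string invented; -a is
needed or grep will just report that the binary file matches):

  # print the byte offset of each match within the image
  grep -aob 'PermitRootLogin' /path/to/image.img
  # dump a few KiB around a hit at, say, offset 123456789 for inspection
  dd if=/path/to/image.img bs=4096 skip=$((123456789 / 4096)) count=4 | strings | less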



Re: [Not TLS] Re: mount option nodatacow for VMs on SSD?

2016-11-28 Thread Graham Cobb
On 28/11/16 02:56, Duncan wrote:
> It should still be worth turning on autodefrag on an existing somewhat 
> fragmented filesystem.  It just might take some time to defrag files you 
> do modify, and won't touch those you don't, which in some cases might 
> make it worth defragging those manually.  Or simply create new 
> filesystems, mount them with autodefrag, and copy everything over so 
> you're starting fresh, as I do.

Could that "copy" be (a series of) send/receive, so that snapshots and
reflinks are preserved?  Does autodefrag work in that case or does the
send/receive somehow override that and end up preserving the original
(fragmented) extent structure?

Graham



Re: [PATCH 3/3] btrfs-progs: Add command to check if balance op is req

2016-10-28 Thread Graham Cobb
On 28/10/16 16:20, David Sterba wrote:
> I tend to agree with this approach. The usecase, with some random sample
> balance options:
> 
>  $ btrfs balance start --analyze -dusage=10 -musage=5 /path

Wouldn't a "balance analyze" command be better than "balance start
--analyze"? I would have guessed the latter started the balance but
printed some analysis as well (before or, probably more usefully,
afterwards).

There might, of course, be some point in a (future)

$ btrfs balance start --if-needed -dusage=10 -musage=5 /path

command.



Re: Incremental send robustness question

2016-10-13 Thread Graham Cobb
On 13/10/16 00:47, Sean Greenslade wrote:
> I may just end up doing that. Hugo's responce gave me some crazy ideas
> involving a custom build of split that waits for a command after each
> output file fills, which would of course require an equally weird build
> of cat that would stall the pipe indefinitely until all the files showed
> up. Driving the HDD over would probably be a little simpler. =P

I am sure it is, if that is an option.  I had considered doing something
similar: doing an initial send to a big file on a spare disk, then
sending the disk to Amazon to import using their Import/Export Disk
service, then creating a server and a filesystem in AWS and doing a
btrfs receive from the imported file. The plan would then be to do
incremental sends over my home broadband line for subsequent backups.

>>> And while we're at it, what are the failure modes for incremental sends?
>>> Will it throw an error if the parents don't match, or will there just be
>>> silent failures?
>>
>> Create a list of possibilities, create some test filesystems, try it.
> 
> I may just do that, presuming I can find the spare time. Given that I'm
> building a backup solution around this tech, it would definitely bolster
> my confidence in it if I knew what its failure modes looked like.

That's a good idea.  In the end I decided that relying on btrfs send for
my offsite cold storage backups was probably not a good idea.  Btrfs is
great, and I heavily use snapshotting and send/receive locally. But
there is always a small nagging fear that a btrfs bug could introduce
some problem which can survive send/receive and mean the backup was
corrupted as well (or that an incremental won't load for some reason).
For that reason I decided to deliberately use a different technology for
backups.  I now use dar to create backups and then upload the files to a
cloud cold storage service for safekeeping.

There are other reasons as well: encryption is easier to handle, cold
storage for files is cheaper than having disk images which need to be
online to load the incrementals, no need for a virtual server, handling
security for backups from different servers with different levels of
risk is easier, etc. There are also downsides: verifying the backups are
readable/restorable is harder, bandwidth usage is less efficient (dar
sends more data than btrfs send would as it is working at the file
level, not the extent level).

By the way, to test out your various failure modes I recommend creating
some small btrfs filesystems on loop devices -- just make sure you
create each one from scratch and do not copy disk images (so that they
all have unique UUIDs).
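
A minimal sketch of what I mean (file names and sizes are arbitrary,
and you need root for the mkfs and mount steps):

  # create two small, completely independent btrfs filesystems on loop devices
  truncate -s 1G send-test.img recv-test.img
  SRC=$(losetup --find --show send-test.img)
  DST=$(losetup --find --show recv-test.img)
  mkfs.btrfs "$SRC"    # each mkfs run generates its own filesystem UUID
  mkfs.btrfs "$DST"
  mkdir -p /mnt/send-test /mnt/recv-test
  mount "$SRC" /mnt/send-test
  mount "$DST" /mnt/recv-test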





Re: multi-device btrfs with single data mode and disk failure

2016-09-20 Thread Graham Cobb
On 20/09/16 19:53, Alexandre Poux wrote:
> As for moving data to an another volume, since it's only data and
> nothing fancy (no subvolume or anything), a simple rsync would do the trick.
> My problem in this case is that I don't have enough available space
> elsewhere to move my data.
> That's why I'm trying this hard to recover the partition...

I am sure you have already thought about this, but... it might be
easier, and maybe even faster, to back up the data to a cloud server,
then recreate the filesystem and download the data again.

Backblaze B2 is very cheap for upload and storage (I don't know about
download charges, though).  And rclone works well for rsync-style
copies (although you might want to use tar or dar if you need to
preserve file attributes).

And if that works, rclone + B2 might make a reasonable offsite backup
solution for the future!
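
For concreteness, something along these lines is what I have in mind
(assuming an rclone remote called "b2" has already been configured for
a B2 bucket; the names here are made up):

  # straight rsync-style copy of the data
  rclone sync /mnt/failing-volume/data b2:rescue-bucket/data
  # or archive first if you need to preserve ownership and attributes
  tar -cf /tmp/rescue.tar -C /mnt/failing-volume data
  rclone copy /tmp/rescue.tar b2:rescue-bucket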

Graham


Re: Security implications of btrfs receive?

2016-09-07 Thread Graham Cobb
On 07/09/16 16:06, Austin S. Hemmelgarn wrote:
> It hasn't, because there's not any way it can be completely fixed.  This
> particular case is an excellent example of why it's so hard to fix.  To
> close this particular hole, BTRFS itself would have to become aware of
> whether whoever is running an ioctl is running in a chroot or not, which
> is non-trivial to determine to begin with, and even harder when you
> factor in the fact that chroot() is a VFS level thing, not a underlying
> filesystem thing, while ioctls are much lower level.

Actually, I think the btrfs-receive case might be a little easier to fix
and, from my quick reading of the code before doing a test, I thought it
even worked this way already...

I think the fix would be to require user mode to provide an FD with
read access to the source blocks before any blocks can be cloned.  There
is little problem with allowing the ioctl to identify the matching
subvolume for a UUID, but it should require user mode to open the source
file for read access using a path before allowing any blocks to be
cloned.  That way, any VFS-level checks would be done.  What is more,
btrfs-receive could then do a path check to make sure the file being
cloned from is within the path that was provided on the command line
(see the sketch below).
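
The path check itself is conceptually trivial.  Expressed as shell just
to illustrate the idea (in reality it would live in the receive code;
$clone_source and $receive_path are hypothetical names):

  src=$(realpath "$clone_source")
  root=$(realpath "$receive_path")
  case "$src" in
      "$root"/*) ;;    # clone source is inside the receive path: allow it
      *) echo "clone source outside receive path, refusing" >&2; exit 1 ;;
  esac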



Re: Security implications of btrfs receive?

2016-09-07 Thread Graham Cobb
On 07/09/16 16:20, Austin S. Hemmelgarn wrote:
> I should probably add to this that you shouldn't be accepting
> send/receive data streams from untrusted sources anyway.  While it
> probably won't crash your system, it's not intended for use as something
> like a network service.  If you're sending a subvolume over an untrusted
> network, you should be tunneling it through SSH or something similar,
> and then using that to provide source verification and data integrity
> guarantees, and if you can't trust the system's your running backups
> for, then you have bigger issues to deal with.

In my personal case I'm not talking about accepting streams from
untrusted sources (although that is also a perfectly reasonable question
to discuss).  My concern is whether, if one of my (well managed and
trusted, but never perfect) systems is hacked, the intruder can use it
as an entry point to attack my other systems.

In particular, I never trust my systems which live on the internet with
automated access to my personal systems (without a human providing
additional passwords/keys) although I do allow some automated accesses
the other way around.  I am trying to determine if sharing
btrfs-send-based backups would open a vulnerability.

There are articles on the web suggesting that centralised
btrfs-send-based backups are a good idea (using ssh access with separate
keys for each system which automatically invoke btrfs-receive into a
system-specific path).  My tests so far suggest that this may not be as
secure as the articles imply.
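
For reference, the setup those articles describe boils down to an entry
like this (all on one line) in the backup server's authorized_keys
file, with the key, user and paths invented here for illustration:

  command="btrfs receive /backups/client1",no-pty,no-agent-forwarding,no-port-forwarding ssh-ed25519 AAAA... backup@client1

The client then just pipes 'btrfs send ... | ssh' to the server and the
forced command decides where the stream lands.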

In any case, I think this is a topic worth investigating further, if any
graduate student is looking for a PhD topic!



Re: Security implications of btrfs receive?

2016-09-06 Thread Graham Cobb
Thanks to Austin and Duncan for their replies.

On 06/09/16 13:15, Austin S. Hemmelgarn wrote:
> On 2016-09-05 05:59, Graham Cobb wrote:
>> Does the "path" argument of btrfs-receive mean that *all* operations are
>> confined to that path?  For example, if a UUID or transid is sent which
>> refers to an entity outside the path will that other entity be affected
>> or used?
> As far as I know, no, it won't be affected.
>> Is it possible for a file to be created containing shared
>> extents from outside the path?
> As far as I know, the only way for this to happen is if you're
> referencing a parent subvolume for a relative send that is itself
> sharing extents outside of the path.  From a practical perspective,
> unless you're doing deduplication on the receiving end, the this
> shouldn't be possible.

Unfortunately that is not the case.  I decided to do some tests to see
what happens.  It is possible for a receive into one path to reference
and access a subvolume from a different path on the same btrfs disk.  I
have created a bash script to demonstrate this at:

https://gist.github.com/GrahamCobb/c7964138057e4e092a75319c9fb240a3

This does require the attacker to know the (source) subvolume UUID they
want to copy.  I am not sure how hard UUIDs are to guess.

By the way, this is exactly the same whether or not the --chroot option
is specified on the "btrfs receive" command.

The next question this raises for me is whether processes in a chroot
or in a container (or in a mandatory access control environment) can
access files outside the chroot/container if they know the UUID of a
subvolume.  After all, btrfs-receive uses ioctls that any program
running as root can use.

Graham


Security implications of btrfs receive?

2016-09-05 Thread Graham Cobb
Does anyone know of a security analysis of btrfs receive?

I assume that just using btrfs receive requires root (is that so?).  But
I was thinking of setting up a backup server which would receive
snapshots from various client systems, each in their own path, and I
wondered how much the security of the backup server (and other clients'
backups) was dependent on the security of the client.

Does the "path" argument of btrfs-receive mean that *all* operations are
confined to that path?  For example, if a UUID or transid is sent which
refers to an entity outside the path will that other entity be affected
or used? Is it possible for a file to be created containing shared
extents from outside the path? Is it possible to confuse/affect
filesystem metadata which would affect the integrity of subvolumes or
files outside the path or prevent other clients from doing something
legitimate?

Do the answers change if the --chroot option is given?  I am confused
about the -m option -- does that mean that the root mount point has to
be visible in the chroot?

Lastly, even if receive is designed to be very secure, it could
exercise code paths in the btrfs kernel code which are not used during
normal file operations and so could trigger bugs not normally seen.
Has any work been done on testing for that (for example, tests using
malicious streams, including ones which btrfs-send cannot generate)?

I am just wondering whether any work has been done/published on this area.

Regards
Graham


Re: Extents for a particular subvolume

2016-08-15 Thread Graham Cobb
On 03/08/16 22:55, Graham Cobb wrote:
> On 03/08/16 21:37, Adam Borowski wrote:
>> On Wed, Aug 03, 2016 at 08:56:01PM +0100, Graham Cobb wrote:
>>> Are there any btrfs commands (or APIs) to allow a script to create a
>>> list of all the extents referred to within a particular (mounted)
>>> subvolume?  And is it a reasonably efficient process (i.e. doesn't
>>> involve backrefs and, preferably, doesn't involve following directory
>>> trees)?

In case anyone else is interested in this, I ended up creating some
simple scripts to allow me to do this.  They are slow because they walk
the directory tree and they use filefrag to get the extent data, but
they do let me answer questions like:

* How much space am I wasting by keeping historical snapshots?
* How much data is being shared between two subvolumes?
* How much of the data in my latest snapshot is unique to that snapshot?
* How much data would I actually free up if I removed (just) these
particular subvolumes?

If they are useful to anyone else you can find them at:

https://github.com/GrahamCobb/extents-lists
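
The general shape of the approach, for anyone who does not want to read
the repository, is roughly this (a simplified sketch, not the actual
code):

  # walk one subvolume and record the extents of every file
  find /path/to/subvolume -xdev -type f -print0 |
      while IFS= read -r -d '' f; do
          filefrag -v "$f"    # lists each extent's physical offset and length
      done > subvol-extents.txt
  # the real scripts then parse and merge the physical ranges so that
  # totals, overlaps and unique data between subvolumes can be computed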

If anyone knows of more efficient ways to get this information please
let me know. And, of course, feel free to suggest improvements/bugfixes!




Re: Extents for a particular subvolume

2016-08-03 Thread Graham Cobb
On 03/08/16 21:37, Adam Borowski wrote:
> On Wed, Aug 03, 2016 at 08:56:01PM +0100, Graham Cobb wrote:
>> Are there any btrfs commands (or APIs) to allow a script to create a
>> list of all the extents referred to within a particular (mounted)
>> subvolume?  And is it a reasonably efficient process (i.e. doesn't
>> involve backrefs and, preferably, doesn't involve following directory
>> trees)?
> 
> Since the size of your output is linear to the number of extents which is
> between the number of files and sum of their sizes, I see no gain in
> trying to avoid following the directory tree.

Thanks for the help, Adam.  There are a lot of files and a lot of
directories - find, "ls -R" and similar operations take a very long
time. I was hoping that I could query some sort of extent tree for the
subvolume and get the answer back in seconds instead of multiple minutes.

But I can follow the directory tree if I need to.

>> I am not looking to relate the extents to files/inodes/paths.  My
>> particular need, at the moment, is to work out how much of two snapshots
>> is shared data, but I can think of other uses for the information.
> 
> Thus, unlike the question you asked above, you're not interested in _all_
> extents, merely those which changed.
> 
> You may want to look at "btrfs subv find-new" and "btrfs send --no-data".

Unfortunately, the subvolumes do not have an ancestor-descendant
relationship (although they do have some common ancestors), so I don't
think find-new is much help (as far as I can see).

But just looking at the size of the output from "send -c" would work
well enough for the particular problem I am trying to solve tonight,
although I will need to take read-only snapshots of the subvolumes to
allow send to work. Thanks for the suggestion.
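
In other words, something like this (a sketch; the subvolume names are
invented, and the stream includes some metadata overhead so it only
gives an estimate):

  # how much of B is not already shared with A?
  btrfs subvolume snapshot -r /mnt/pool/A /mnt/pool/A-ro
  btrfs subvolume snapshot -r /mnt/pool/B /mnt/pool/B-ro
  btrfs send -c /mnt/pool/A-ro /mnt/pool/B-ro | wc -c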

I would still be interested in the extent list, though.  The main
problem with find-new and send is that they don't tell me how much has
been deleted, only added.  I am thinking about using the extents to get
a much better handle on what is using up space and what I could recover
if I removed (or moved to another volume) various groups of related
subvolumes.

Thanks again for the help.


Extents for a particular subvolume

2016-08-03 Thread Graham Cobb
Are there any btrfs commands (or APIs) to allow a script to create a
list of all the extents referred to within a particular (mounted)
subvolume?  And is it a reasonably efficient process (i.e. doesn't
involve backrefs and, preferably, doesn't involve following directory
trees)?

I am not looking to relate the extents to files/inodes/paths.  My
particular need, at the moment, is to work out how much of two snapshots
is shared data, but I can think of other uses for the information.

Graham


Re: [PATCH] btrfs-progs: fi defrag: change default extent target size to 32 MiB

2016-07-28 Thread Graham Cobb
On 28/07/16 12:17, David Sterba wrote:
> diff --git a/cmds-filesystem.c b/cmds-filesystem.c
> index ef1f550b51c0..6b381c582ea7 100644
> --- a/cmds-filesystem.c
> +++ b/cmds-filesystem.c
> @@ -968,7 +968,7 @@ static const char * const cmd_filesystem_defrag_usage[] = {
>   "-f flush data to disk immediately after defragmenting",
>   "-s start   defragment only from byte onward",
>   "-l len defragment only up to len bytes",
> - "-t sizetarget extent size hint",
> + "-t sizetarget extent size hint (default: 32 MiB)",

As a user... might it be better to say the default is 32M as that is the
format the option requires?



Re: mount btrfs takes 30 minutes, btrfs check runs out of memory

2016-07-21 Thread Graham Cobb
On 21/07/16 09:19, Qu Wenruo wrote:
> We don't usually get such large extent tree dump from a real world use
> case.

Let us know if you want some more :-)

I have a heavily used single disk BTRFS filesystem with about 3.7TB in
use and about 9 million extents.  I am happy to provide an extent dump
if it is useful to you.  Particularly if you don't need me to actually
unmount it (i.e. you can live with some inconsistencies).
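
I assume the dump you would want would be produced by something like
this, run against the device of the mounted filesystem (older
btrfs-progs spell it btrfs-debug-tree), with the usual caveat that
concurrent writes can leave it slightly inconsistent:

  btrfs inspect-internal dump-tree -t extent /dev/sdXn | gzip > extent-tree.txt.gz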




Re: Is "btrfs balance start" truly asynchronous?

2016-06-21 Thread Graham Cobb
On 21/06/16 12:51, Austin S. Hemmelgarn wrote:
> The scrub design works, but the whole state file thing has some rather
> irritating side effects and other implications, and developed out of
> requirements that aren't present for balance (it might be nice to check
> how many chunks actually got balanced after the fact, but it's not
> absolutely necessary).

Actually, that would be **really** useful.  I have been experimenting
with cancelling balances after a certain time (as part of my
"balance-slowly" script).  I have got it working using plain bash
scripting, but my script cannot tell whether any work was actually done
by the balance run which was cancelled (and if no work was done but it
timed out anyway, there is probably no point trying again with the same
timeout later!).
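
What I am doing at the moment is essentially this (simplified from the
script, with an illustrative timeout):

  btrfs balance start -dusage=20,limit=20 /mnt/data &
  sleep 600
  btrfs balance status /mnt/data    # "N out of about M chunks balanced" while running
  btrfs balance cancel /mnt/data    # finishes the current block group, then stops

The status output shows progress while the balance is still running,
but once the cancel completes there is nothing left to tell the script
how many chunks were actually relocated.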

Graham





Re: Reducing impact of periodic btrfs balance

2016-05-26 Thread Graham Cobb
On 19/05/16 02:33, Qu Wenruo wrote:
> 
> 
> Graham Cobb wrote on 2016/05/18 14:29 +0100:
>> A while ago I had a "no space" problem (despite fi df, fi show and fi
>> usage all agreeing I had over 1TB free).  But this email isn't about
>> that.
>>
>> As part of fixing that problem, I tried to do a "balance -dusage=20" on
>> the disk.  I was expecting it to have system impact, but it was a major
>> disaster.  The balance didn't just run for a long time, it locked out
>> all activity on the disk for hours.  A simple "touch" command to create
>> one file took over an hour.
> 
> It seems that balance blocked a transaction for a long time, which makes
> your touch operation to wait for that transaction to end.

I have been reading volumes.c, but I don't have a feel for which
transactions are likely to block for a really long time (hours).

If this can occur, I think the warnings to users about balance need to
be extended to include this issue.  Currently the user mode code warns
users that unfiltered balances may take a long time, but it doesn't warn
that the disk may be unusable during that time.

>> 3) My btrfs-balance-slowly script would work better if there was a
>> time-based limit filter for balance, not just the current count-based
>> filter.  I would like to be able to say, for example, run balance for no
>> more than 10 minutes (completing the operation in progress, of course)
>> then return.
> 
> As btrfs balance is done in block group unit, I'm afraid such thing
> would be a little tricky to implement.

It would be really easy to add a jiffies-based limit into the checks in
should_balance_chunk.  Of course, this would only test the limit in
between block groups but that is what I was looking for -- a time-based
version of the current limit filter.

On the other hand, the time limit could just be added into the user mode
code: after the timer expires it could issue a "balance pause".  Would
the effect be identical in terms of timing, resources required, etc?
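
In user-mode terms that would be something like this (a sketch,
assuming pause is the right operation -- see the question below):

  btrfs balance start -dusage=20 /mnt/data &
  sleep 600                        # the time limit
  btrfs balance pause /mnt/data
  # ... let the normal workload run for a while ...
  btrfs balance resume /mnt/data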

Would it be better to do a "balance pause" or a "balance cancel"?  The
goal would be to suspend balance processing and allow the system to do
something else for a while (say 20 minutes) and then go back to doing
more balance later.  What is the difference between resuming a paused
balance and starting a new balance, bearing in mind that this is a
heavily used disk so we can expect lots of transactions to have happened
in the meantime (otherwise we wouldn't need this capability)?

Graham


Re: [Not TLS] Re: Reducing impact of periodic btrfs balance

2016-05-19 Thread Graham Cobb
On 19/05/16 05:09, Duncan wrote:
> So to Graham, are these 1.5K snapshots all of the same subvolume, or 
> split into snapshots of several subvolumes?  If it's all of the same 
> subvolume or of only 2-3 subvolumes, you still have some work to do in 
> terms of getting down to recommended snapshot levels.  Also, if you have 
> quotas on and don't specifically need them, try turning them off and see 
> if that alone makes it workable.

I have just under 20 subvolumes but the snapshots are only taken if
something has changed (actually I use btrbk: I am not sure if it takes
the snapshot and then removes it if nothing changed or whether it knows
not to even take it).  The most frequently changing subvolumes have just
under 400 snapshots each.  I have played with snapshot retention and
think it unlikely I would want to reduce it further.

I have quotas turned off.  At least, I am not using quotas -- how can I
double-check that they are really turned off?

I know that very large numbers of snapshots are not recommended, and I
expected the balance to be slow.  I was quite prepared for it to take
many days.  My full backups take several days and even incrementals take
several hours. What I did not expect, and think is a MUCH more serious
problem, is that the balance prevented use of the disk, holding up all
writes to the disk for (quite literally) hours each.  I have not seen
that effect mentioned anywhere!

That means that for a large, busy data disk, it is impossible to do a
balance unless the server is taken down to single-user mode for the time
the balance takes (presumably still days).  I assume this would also
apply to doing a RAID rebuild (I am not using multiple disks at the moment).

At the moment I am still using my previous backup strategy, alongside
the snapshots (that is: rsync-based rsnapshots to another disk daily and
with fairly long retentions, and separate daily full/incremental backups
using dar to a NAS in another building).  I was hoping the btrfs
snapshots might replace the daily rsync snapshots but it doesn't look
like that will work out.

Thanks to all for the replies.

Graham


Reducing impact of periodic btrfs balance

2016-05-18 Thread Graham Cobb
Hi,

I have a 6TB btrfs filesystem I created last year (about 60% used).  It
is my main data disk for my home server so it gets a lot of usage
(particularly mail). I do frequent snapshots (using btrbk) so I have a
lot of snapshots (about 1500 now, although it was about double that
until I cut back the retention times recently).

A while ago I had a "no space" problem (despite fi df, fi show and fi
usage all agreeing I had over 1TB free).  But this email isn't about that.

As part of fixing that problem, I tried to do a "balance -dusage=20" on
the disk.  I was expecting it to have system impact, but it was a major
disaster.  The balance didn't just run for a long time, it locked out
all activity on the disk for hours.  A simple "touch" command to create
one file took over an hour.

More seriously, because of that, mail was being lost: all mail delivery
timed out and the timeout error was interpreted as a fatal delivery
error causing mail to be discarded, mailing lists to cancel
subscriptions, etc. The balance never completed, of course.  I
eventually got it cancelled.

I have since managed to complete the "balance -dusage=20" by running it
repeatedly with "limit=N" (for small N).  I wrote a script to automate
that process, and rerun it every week.  If anyone is interested, the
script is on GitHub: https://github.com/GrahamCobb/btrfs-balance-slowly
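
The heart of the script is just a loop along these lines (simplified;
the real thing adds waits, time limits and logging, and parsing the
output like this is exactly the fragility I mention in point 4 below):

  # keep doing small chunks of work until a pass relocates nothing
  while btrfs balance start -dusage=20,limit=5 /mnt/data 2>&1 |
            grep -q 'had to relocate [1-9]'; do
      sleep 1200    # give the normal workload some breathing room between passes
  done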

Out of that experience, I have a couple of thoughts about how to
possibly make balance more friendly.

1) It looks like the balance process (effectively) locks all file
(extent?) creation for long periods of time.  Would it be possible for
balance to make more effort to yield locks so that other
processes/threads can get in to continue creating/writing files while it
is running?

2) btrfs scrub has options to set the IO priority (ionice).  Could
balance have something similar?  Or would reducing the IO priority make
things worse because locks would be held for longer?

3) My btrfs-balance-slowly script would work better if there was a
time-based limit filter for balance, not just the current count-based
filter.  I would like to be able to say, for example, run balance for no
more than 10 minutes (completing the operation in progress, of course)
then return.

4) My btrfs-balance-slowly script would be more reliable if there was a
way to get an indication of whether there was more work to be done,
instead of parsing the output for the number of relocations.

Any thoughts about these?  Or other things I could be doing to reduce
the impact on my services?

Graham