Re: Filesystem Corruption

2018-12-03 Thread remi
On Mon, Dec 3, 2018, at 4:31 AM, Stefan Malte Schumacher wrote:

> I have noticed an unusual amount of crc-errors in downloaded rars,
> beginning about a week ago. But lets start with the preliminaries. I
> am using Debian Stretch.
> Kernel: Linux mars 4.9.0-8-amd64 #1 SMP Debian 4.9.110-3+deb9u4
> (2018-08-21) x86_64 GNU/Linux
> 
> [5390748.884929] Buffer I/O error on dev dm-0, logical block
> 976701312, async page read


Excuse me for butting in when there are *many* more qualified people on this list.

But assuming the rar crc errors are related to your unexplained buffer I/O
errors (and not some weird coincidence of simply bad downloads), I would
start, immediately, by testing the memory.  RAM corruption can wreak havoc
with btrfs (with any filesystem, really, but I think BTRFS has special
challenges in this regard), and this looks like a memory error to me.
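
A rough sketch of how I would test it (package names and sizes are just
examples): boot a memtest86+ image and let it run a few full passes, or,
for a quick in-place check from a running system, something like:

memtester 1024 5

(memtester takes a size in MB and a loop count; it can't test memory
that's already in use, so the offline memtest86+ run is more thorough.)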



Re: Understanding "btrfs filesystem usage"

2018-10-29 Thread Remi Gauvin
On 2018-10-29 02:11 PM, Ulli Horlacher wrote:
> I want to know how many free space is left and have problems in
> interpreting the output of: 
> 
> btrfs filesystem usage
> btrfs filesystem df
> btrfs filesystem show
> 
>

In my not-so-humble opinion, the filesystem usage command has the
easiest-to-understand output.  It lays out all the pertinent information.

You can clearly see 825GiB is allocated, with 494GiB used, so
filesystem show is actually reporting the "Allocated" value as "Used".
Allocated can be thought of as "reserved for".  As the output of the usage
command and df clearly show, you have almost 400GiB of space available.

Note that the btrfs commands are clearly and explicitly displaying
values in binary units (the Mi and Gi prefixes).  If you want the df
output to match, use -h instead of -H (see man df).
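
For example (mount point is just an example):

df -h /mnt/data    # sizes in powers of 1024 (Ki/Mi/Gi), matching btrfs
df -H /mnt/data    # sizes in powers of 1000 (k/M/G)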

An observation:

The disparity between 498GiB used and 823GiB allocated is pretty high.
This is probably the result of using an SSD with an older kernel.  If your
kernel is not very recent (sorry, I forget where this was fixed, somewhere
around 4.14 or 4.15), then consider mounting with the nossd option.  You
can improve the current situation by running a balance.

Something like:
btrfs balance start -dusage=55 <mountpoint>

You do *not* want to end up with all your space allocated to data but
not actually used by data.  Bad things can happen if you run out of
unallocated space for more metadata.  (Not catastrophic, but awkward and
unexpected downtime that can be a little tricky to sort out.)
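
A rough sketch of the whole workflow (mount point is just an example):

btrfs filesystem usage /mnt/data            # note the Unallocated line
btrfs balance start -dusage=55 /mnt/data
btrfs filesystem usage /mnt/data            # Unallocated should have grown

The -dusage=55 filter only rewrites data chunks that are at most 55% full,
so it's much cheaper than a full balance.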



Re: Have 15GB missing in btrfs filesystem.

2018-10-27 Thread Remi Gauvin
On 2018-10-27 04:19 PM, Marc MERLIN wrote:

> Thanks for confirming. Because I always have snapshots for btrfs
> send/receive, defrag will duplicate as you say, but once the older
> snapshots get freed up, the duplicate blocks should go away, correct?
> 
> Back to usage, thanks for pointing out that command:
> saruman:/mnt/btrfs_pool1# btrfs fi usage .
> Overall:
> Device size:   228.67GiB
> Device allocated:  203.54GiB
> Device unallocated: 25.13GiB
> Device missing:0.00B
> Used:  192.01GiB
> Free (estimated):   32.44GiB  (min: 19.88GiB)
> Data ratio: 1.00
> Metadata ratio: 2.00
> Global reserve:512.00MiB  (used: 0.00B)
> 
> Data,single: Size:192.48GiB, Used:185.16GiB
>/dev/mapper/pool1   192.48GiB
> 
> Metadata,DUP: Size:5.50GiB, Used:3.42GiB
>/dev/mapper/pool111.00GiB
> 
> System,DUP: Size:32.00MiB, Used:48.00KiB
>/dev/mapper/pool164.00MiB
> 
> Unallocated:
>/dev/mapper/pool125.13GiB
> 
> 
> I'm still seeing that I'm using 192GB, but 203GB allocated.
> Do I have 25GB usable:
> Device unallocated: 25.13GiB
> 
> Or 35GB usable?
> Device size:   228.67GiB
>   -
> Used:192.01GiB
>   = 36GB ?
> 


The answer is somewhere between the two.  (BTRFS's estimate of 32.44GiB
free is probably as close as you'll get to a prediction.)

So you have 7.32GiB that is allocated but still free for data, and 25GiB of
completely unallocated disk space.  However, as you add more data, or
create more snapshots and cause metadata duplication, some of that 25GiB
will be allocated for metadata.  Remember that metadata is duplicated,
so the 3.42GiB of metadata you are using now is actually occupying 6.84GiB
of disk space, out of the allocated 11GiB.
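
For what it's worth, my reading is that the Free (estimated) figures above
line up with exactly that arithmetic (small rounding differences aside):

  estimated: (192.48 - 185.16) + 25.13     = 32.45GiB  (unallocated used as single data)
  minimum:   (192.48 - 185.16) + 25.13 / 2 = 19.89GiB  (unallocated used as DUP metadata)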

You want to be careful that unallocated space doesn't run out.  If the
system runs out of usable space for metadata, it can be tricky to get
yourself out of the corner.  That is why a large discrepancy between
Data Size and Used would be a concern.  If those 25GiB of space were
allocated to data, you would get out-of-space errors even while the 25GiB
was still unused.

On that note, you seem to have a rather high metadata-to-data ratio
(at least, compared to my limited experience).  Are you using noatime
on your filesystems?  Without it, snapshots will end up causing
duplicated metadata every time atime is updated.
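
If not, it's just a mount option; a typical fstab line would look
something like this (the UUID and mount point are placeholders):

UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /data  btrfs  defaults,noatime  0  0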





Re: Have 15GB missing in btrfs filesystem.

2018-10-27 Thread Remi Gauvin
On 2018-10-27 01:42 PM, Marc MERLIN wrote:

> 
> I've been using btrfs for a long time now but I've never had a
> filesystem where I had 15GB apparently unusable (7%) after a balance.
> 

The space isn't unusable.  It's just allocated.  (It's used in the sense
that it's reserved for data chunks.)  Start writing data to the drive,
and the data will fill that space before more gets allocated.  (Unless
you are using an older kernel and the filesystem gets mounted with the ssd
option, in which case you'll want to add the nossd option to prevent that
behaviour.)

You can use btrfs fi usage to display that more clearly.
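
For example (path is just an example):

btrfs filesystem usage /mnt/btrfs_pool1

The Unallocated line is the space not yet reserved for anything, and the
Data Size vs Used values show how full the already-allocated chunks are.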


> I can try a defrag next, but since I have COW for snapshots, it's not
> going to help much, correct?

The defrag will end up using more space, as the fragmented parts of
files will get duplicated.  That being said, if you have the luxury to
defrag *before* taking new snapshots, that would be the time to do it.
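
If you do go that route, a minimal sketch (path is just an example):

btrfs filesystem defragment -r /path/to/subvolume

The -r flag recurses through the directory tree instead of defragmenting
only the single path given.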


Re: Two partitionless BTRFS drives no longer seen as containing BTRFS filesystem

2018-10-06 Thread Remi Gauvin
On 2018-10-06 07:23 PM, evan d wrote:
> I have two hard drives that were never partitioned, but set up as two
> independent BRTFS filesystems.  Both drives were used in the same
> machine running Arch Linux and the drives contain(ed) largely static
> data.
> 
> I decommissioned the machine they were originally used in and on
> installing in a newer Arch build found that BRTFS reported no
> filesystem on either of the drives.
> 
> uname -a:
> Linux z87i-pro 4.18.9-arch1-1-ARCH #1 SMP PREEMPT Wed Sep 19 21:19:17
> UTC 2018 x86_64 GNU/Linux
> 
> btrfs --version: btrfs-progs v4.17.1
> btrfs fi show: returns no data

Did you try a btrfs device scan?

(Normally that would be done on boot, but it may not happen, depending on
how your Arch install is configured or whether the devices are available
early enough in the boot process.)
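
Something like (run as root):

btrfs device scan
btrfs filesystem show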

Re: btrfs problems

2018-09-20 Thread Remi Gauvin
On 2018-09-20 05:35 PM, Adrian Bastholm wrote:
> Thanks a lot for the detailed explanation.
> Aabout "stable hardware/no lying hardware". I'm not running any raid
> hardware, was planning on just software raid. three drives glued
> together with "mkfs.btrfs -d raid5 /dev/sdb /dev/sdc /dev/sdd". Would
> this be a safer bet, or would You recommend running the sausage method
> instead, with "-d single" for safety ? I'm guessing that if one of the
> drives dies the data is completely lost
> Another variant I was considering is running a raid1 mirror on two of
> the drives and maybe a subvolume on the third, for less important
> stuff

In case you were not aware, it's perfectly acceptable with BTRFS to use
RAID 1 over 3 devices.  Even more amazing, regardless of how many
devices you start with (2, 3, 4, whatever), you can add a single drive to
the array to increase capacity (at 50%, of course; i.e., adding a 4TB
drive will give you 2TB of usable space, assuming the other drives add up
to at least 4TB to match it).
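
Growing the array is roughly this (device name and mount point are just
examples):

btrfs device add /dev/sde /mnt/array
btrfs balance start /mnt/array

The balance afterwards is optional, but it spreads the existing chunks
across the new device instead of only using it for new writes.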



Re: very poor performance / a lot of writes to disk with space_cache (but not with space_cache=v2)

2018-09-19 Thread Remi Gauvin
On 2018-09-19 04:43 AM, Tomasz Chmielewski wrote:
> I have a mysql slave which writes to a RAID-1 btrfs filesystem (with
> 4.17.14 kernel) on 3 x ~1.9 TB SSD disks; filesystem is around 40% full.
> 
> The slave receives around 0.5-1 MB/s of data from the master over the
> network, which is then saved to MySQL's relay log and executed. In ideal
> conditions (i.e. no filesystem overhead) we should expect some 1-3 MB/s
> of data written to disk.
> 
> MySQL directory and files in it are chattr +C (since the directory was
> created, so all files are really +C); there are no snapshots.

Not related to the issue you are reporting, but I thought it's worth
mentioning (since not many do) that using chattr +C on a BTRFS RAID 1
is a dangerous thing.  Without COW, the two copies are never synchronized,
even if a scrub is executed.  So any kind of unclean shutdown that
interrupts writes (not to mention the extreme case of a temporarily
disconnected drive) will result in files that are inconsistent.  (That is,
depending on which disk happens to be read at the time, the data will be
different on each read.)
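
You can check whether the attribute really is set with something like
(the path is just an example):

lsattr -d /var/lib/mysql

A 'C' in the attribute column means nodatacow (and therefore no checksums)
for files created in that directory.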



Re: Re-mounting removable btrfs on different device

2018-09-06 Thread Remi Gauvin
On 2018-09-06 11:32 PM, Duncan wrote:

> Without the mentioned patches, the only way (other than reboot) is to 
> remove and reinsert the btrfs kernel module (assuming it's a module, not 
> built-in), thus forcing it to forget state.
> 
> Of course if other critical mounted filesystems (such as root) are btrfs, 
> or if btrfs is a kernel-built-in not a module and thus can't be removed, 
> the above doesn't work and a reboot is necessary.  Thus the need for 
> those patches you mentioned.
> 

Good to know, thanks.
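
For reference, a minimal sketch of that workaround (mount point is just an
example; this assumes btrfs is built as a module and no btrfs filesystem
is still mounted):

umount /mnt/archive
modprobe -r btrfs
modprobe btrfs
btrfs device scan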

Re-mounting removable btrfs on different device

2018-09-06 Thread Remi Gauvin
I'm trying to use a BTRFS filesystem on a removable drive.

The first time the drive was added to the system, it was /dev/sdb.

Files were added and device unmounted without error.

But when I re-attach the drive, it becomes /dev/sdg (the kernel is fussy
about re-using /dev/sdb).

btrfs fi show output:

Label: 'Archive 01'  uuid: 221222e7-70e7-4d67-9aca-42eb134e2041
Total devices 1 FS bytes used 515.40GiB
devid1 size 931.51GiB used 522.02GiB path /dev/sdg1

This causes BTRFS to fail mounting the device with the following errors:

sd 3:0:0:0: [sdg] Attached SCSI disk
blk_partition_remap: fail for partition 1
BTRFS error (device sdb1): bdev /dev/sdg1 errs: wr 0, rd 1, flush 0,
corrupt 0, gen 0
blk_partition_remap: fail for partition 1
BTRFS error (device sdb1): bdev /dev/sdg1 errs: wr 0, rd 2, flush 0,
corrupt 0, gen 0
blk_partition_remap: fail for partition 1
BTRFS error (device sdb1): bdev /dev/sdg1 errs: wr 0, rd 3, flush 0,
corrupt 0, gen 0
blk_partition_remap: fail for partition 1
BTRFS error (device sdb1): bdev /dev/sdg1 errs: wr 0, rd 4, flush 0,
corrupt 0, gen 0
ata4: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
ata4: irq_stat 0x00400040, connection status changed
ata4: SError: { HostInt PHYRdyChg 10B8B DevExch }


I've seen some patches on this list to add a btrfs device forget option,
which I presume would help with a situation like this.  Is there a way
to do that manually?

Re: btrfs fi du unreliable?

2018-08-29 Thread Remi Gauvin
On 2018-08-29 08:00 AM, Jorge Bastos wrote:

> 
> Look for example at snapshots from July 21st and 22nd, total used
> space went from 199 to 277GiB, this is mostly from new added files, as
> I confirmed from browsing those snapshots, there were no changes on
> the 23rd, and a lot of files were deleted before the 24th, so should't
> there be about 80GiB of exclusive content for the 22nd, or am I
> misunderstanding how this is reported? ? Those were new files only,
> never existed on previous snapshots, If I delete both snapshots from
> the 22nd and the 23rd I expect to get about 80GiB freed space.

Exclusive means... exclusive... to that one snapshot/subvolume.  If the
data also exists on the 23rd's snapshot, it's not exclusive.

If you wanted to report how much data is exclusive to a group of
snapshots (say, July 22nd *and* 23rd), you would have to make them
members of a parent qgroup; then you could see the exclusive value of
the whole group.
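
A rough sketch of that (the mount point and the subvolume qgroup IDs
0/258 and 0/259 are placeholders; the real IDs come from
btrfs subvolume list):

btrfs quota enable /mnt/pool
btrfs qgroup create 1/100 /mnt/pool
btrfs qgroup assign 0/258 1/100 /mnt/pool
btrfs qgroup assign 0/259 1/100 /mnt/pool
btrfs qgroup show -p /mnt/pool

The exclusive column of the 1/100 group then shows roughly how much space
deleting both snapshots together would free.  (Note that enabling quotas
has some performance cost on large filesystems.)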


BTRFS and databases

2018-08-02 Thread Remi Gauvin
On 2018-08-02 03:07 AM, Qu Wenruo wrote:


> For data, since we have cow (along with csum), it should be no problem
> to recover.
> 
> And since datacow is used, transaction on each device should be atomic,
> thus we should be able to handle one-time device out-of-sync case.
> (For multiple out-of-sync events, we don't have any good way though).
> 
> Or did I miss something from previous discussion?

As far as I know, that is indeed correct and works very well.  The
question was specifically about using nodatacow for databases, and
that's the question I was responding to.  In its current state, I do not
believe btrfs nodatacow is in any way appropriate for database/VM
hosting when combined with multi-device.






Re: BTRFS and databases

2018-08-01 Thread Remi Gauvin
On 2018-07-31 11:45 PM, MegaBrutal wrote:

> I know that with nodatacow, I take away most of the benefits of BTRFS
> (those are actually hurting database performance – the exact CoW
> nature that is elsewhere a blessing, with databases it's a drawback).
> But are there any advantages of still sticking to BTRFS for a database
> albeit CoW is disabled, or should I just return to the old and
> reliable ext4 for those applications?
> 

Be very careful about nodatacow and btrfs 'raid'.  BTRFS has no data
resyncing mechanism for raid, so if your mirrors end up different
somehow, your array is going to be inconsistent.

Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Remi Gauvin

> Acceptable, but not really apply to software based RAID1.
> 

Which completely disregards the minor detail that all the software
RAIDs I know of can handle exactly this kind of situation without
losing or corrupting a single byte of data (errors on the remaining
hard drive notwithstanding).

Exactly what methods they employ to do so I'm not an expert on, but it
*does* work, contrary to your repeated assertions otherwise.

In any case, thank you for the patch you wrote.  I will, however,
propose a different solution.

Given the reliance of BTRFS on csums, and the lack of any
resynchronization (no matter how the drives got out of sync), I think
nodatacow should simply be ignored in the case of RAID, just as the
data blocks get copied anyway when there is a snapshot.

In the current implementation of RAID on btrfs, RAID and nodatacow are
effectively mutually exclusive.  Consider the kinds of use cases
nodatacow is usually recommended for: VM images and databases.  Even
though those files should have their own mechanisms for dealing with
incomplete writes and data verification, BTRFS RAID creates a unique
situation where parts of the file can be inconsistent, with different
data being read depending on which device is doing the reading.

Regardless of which method, short term or long term, developers choose
to address this, I have to stress that I consider this next part very
important.

The status page really needs to be updated to reflect this gotcha.  It
*will* bite people in ways they do not expect, and disastrously.




[PATCH RFC] btrfs: Do extra device generation check at mount time

2018-06-28 Thread Remi Gauvin
On 2018-06-28 10:36 AM, Adam Borowski wrote:

> 
> Uhm, that'd be a nasty regression for the regular (no-nodatacow) case. 
> The vast majority of data is fine, and extents that have been written to
> while a device is missing will be either placed elsewhere (if the filesystem
> knew it was degraded) or read one of the copies to notice a wrong checksum
> and automatically recover (if the device was still falsely believed to be
> good at write time).
> 
> We currently don't have selective scrub yet so resyncing such single-copy

That might not be the case, though I don't really know the numbers
myself, and repeating this is hearsay:

crc32 is not infallible; 1 in so many billion errors will go undetected
by it.  In the case of a dropped device with write failures, when you
*know* the data supposedly written to the disk is bad, re-syncing from
the believed-good copy (so long as it passes checksum verification, of
course) is the only way to be certain that the data is good.
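
For a rough sense of scale (my own back-of-envelope, not hard numbers):
a 32-bit CRC has 2^32, roughly 4.3 billion, possible values, so a random
corruption has on the order of a 1-in-4-billion chance of slipping past
the check.  Tiny per block, but not zero across millions of blocks.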


Otherwise, you can be left with a Schroedinger's bit somewhere.  (It's
not 0 or 1, but both, depending on which device the filesystem is
reading from at the time.)




Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Remi Gauvin
On 2018-06-28 10:17 AM, Chris Murphy wrote:

> 2. The new data goes in a single chunk; even if the user does a manual
> balance (resync) their data isn't replicated. They must know to do a
> -dconvert balance to replicate the new data. Again this is a net worse
> behavior than mdadm out of the box, putting user data at risk.

I'm not sure this is the case.  Even though writes failed to the
disconnected device, btrfs seemed to keep on going as though the device
*were* still there.

When the array was re-mounted with both devices (never mounted
degraded) and a scrub was run, the scrub took a *long* time fixing errors,
at a whopping 3MB/s, and reported having fixed millions of them.



Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-27 Thread remi



On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:

> 
> Please get yourself clear of what other raid1 is doing.

A drive failure where the drive is still there when the computer reboots is a 
situation that *any* RAID 1 (or for that matter, RAID 5, RAID 6, anything but 
RAID 0) will recover from perfectly without breaking a sweat.  Some will rebuild 
the array automatically, others will automatically kick out the misbehaving 
drive.  *None* of them will take back the drive with old data and start 
commingling that data with the good copy.  This behaviour from BTRFS is completely 
abnormal and defeats even the most basic expectations of RAID.

I'm not the one who has to clear his expectations here.



Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-27 Thread Remi Gauvin
On 2018-06-27 09:58 PM, Qu Wenruo wrote:
> 
> 
> On 2018年06月28日 09:42, Remi Gauvin wrote:
>> There seems to be a major design flaw with BTRFS that needs to be better
>> documented, to avoid massive data loss.
>>
>> Tested with Raid 1 on Ubuntu Kernel 4.15
>>
>> The use case being tested was a Virtualbox VDI file created with
>> NODATACOW attribute, (as is often suggested, due to the painful
>> performance penalty of COW on these files.)
> 
> NODATACOW implies NODATASUM.
> 

Yes, yes; none of which changes the simple fact that if you use this
option, which is often touted as outright necessary for some types of
files, BTRFS raid is worse than useless: not only will it not protect
your data at all from bitrot (as expected), it will actively go out of
its way to corrupt it!

This is not expected behaviour from 'RAID', and I despair that this seems
to be something I have to explain!






Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-27 Thread Remi Gauvin
There seems to be a major design flaw with BTRFS that needs to be better
documented, to avoid massive data loss.

Tested with Raid 1 on Ubuntu Kernel 4.15

The use case being tested was a Virtualbox VDI file created with
NODATACOW attribute, (as is often suggested, due to the painful
performance penalty of COW on these files.)
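
(For anyone reproducing this: nodatacow on such a file is typically set by
applying chattr +C to an empty directory before the VDI is created, along
these lines, with the directory name being just an example:

mkdir /var/lib/vbox-images
chattr +C /var/lib/vbox-images

Files created inside then inherit the attribute; +C has no reliable effect
on a file that already contains data.)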

However, if a device is temporarily dropped (in this case, tested by
disconnecting drives) and re-connects automatically on the next boot, BTRFS
does not in any way synchronize the VDI file, or have any means of knowing
that one of the copies is out of date and bad.

The result of trying to use said VDI file is interestingly insane.
Scrub did not do anything to rectify the situation.

