Re: a new kind of "No space left on device" error

2018-10-29 Thread Henk Slager
On Mon, Oct 29, 2018 at 7:20 AM Dave  wrote:
>
> This is one I have not seen before.
>
> When running a simple, well-tested and well-used script that makes
> backups using btrfs send | receive, I got these two errors:
>
> At subvol snapshot
> ERROR: rename o131621-1091-0 ->
> usr/lib/node_modules/node-gyp/gyp/pylib/gyp/MSVSVersion.py failed: No
> space left on device
>
> At subvol snapshot
> ERROR: rename o259-1095-0 -> myser/.bash_profile failed: No space left on 
> device
>
> I have run this script many, many times and never seen errors like
> this. There is plenty of room on the device:
>
> # btrfs fi df /mnt/
> Data, single: total=18.01GiB, used=16.53GiB
> System, DUP: total=8.00MiB, used=16.00KiB
> Metadata, DUP: total=1.00GiB, used=145.12MiB
> GlobalReserve, single: total=24.53MiB, used=0.00B
>
> # df -h /mnt/
> Filesystem  Size  Used Avail Use% Mounted on
> /dev/sdc254G   17G   36G  33% /mnt
>
> The send | receive appears to have mostly succeeded because the final
> expected size is about 17G, as shown above. That will use only about
> 1/3 of the available disk space, when completed. I don't see any
> reason for "No space left on device" errors, but maybe somebody here
> can spot a problem I am missing.
What kernel and progs versions?
What are the mount options for the filesystem?
Can you tell something about the device /dev/sdc2 (SSD, HDD, SD-card,
USB stick, LAN storage, etc.)?
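Something like this collects all of that in one go (the mount point is
just an example):

# uname -r; btrfs --version
# findmnt -no OPTIONS /mnt
# btrfs fi usage /mnt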
Could it be that your ENOSPC has the same cause as this:
https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg81554.html


Re: Untar on empty partition returns ENOSPACE

2018-10-18 Thread Henk Slager
On Thu, Oct 18, 2018 at 6:04 AM Jean-Denis Girard  wrote:
>
> Hi list,
>
> My goal is to duplicate some SD cards, to prepare 50 similar Raspberry Pi.
>
> First, I made a tar of my master SD card (unmounted). Then I made a
> script, which creates 2 partitions (50 MB for boot, 14 GB for /),
> creates the file-systems (vfat and btrfs, default options), mounts the
> two partitions:
>
> mount $part2 $mnt
> -ocompress=zstd,space_cache=v2,autodefrag,noatime,nodiratime
> mkdir $mnt/boot
> mount $part1 $mnt/boot
>
> When untarring, I get many errors, like:
> tar:
> ./usr/lib/libreoffice/share/gallery/arrows/A45-TrendArrow-Red-GoUp.png :
> open impossible: No space left on device
> tar:
> ./usr/lib/libreoffice/share/gallery/arrows/A53-TrendArrow-LightBlue-TwoDirections.svg
> : open impossible: No space left on device
> tar:
> ./usr/lib/libreoffice/share/gallery/arrows/A27-CurvedArrow-DarkRed.png :
> open impossible: No space left on device
> tar:
> ./usr/lib/libreoffice/share/gallery/arrows/A41-CurvedArrow-Gray-Left.svg
> : open impossible: No space left on device
>
> Which usually results in unusable SD card.
>
> When the first errors occur, less than 1 GB has been written. I tried to
> change mount options, especially commit interval, but still got ENOSPACE.
>
> What I ended up doing is limiting write speed with:
> xzcat $archive | pv -L 4M | tar x
> (with -L 5M I start to get a couple of errors)
>
> Is there a better work around? Or a patch to test that could help?
>
> The machine that runs the script has:
> [jdg@tiare tar-install]$ uname -r
> 4.19.0-rc7-snx
> [jdg@tiare tar-install]$ btrfs version
> btrfs-progs v4.17.1
This looks like it has a similar root cause as this:
https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg81536.html

For 50 clones, you could prepare 1 master (sparse) image and change
the UUID with btrfstune on every SD-card after writing the image
(optionally as a sparse write with e.g. dd_rescue).
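A rough sketch of that flow, with placeholder device names, and with
btrfstune run while the filesystem is not mounted:

# dd if=master.img of=/dev/mmcblk0 bs=4M conv=fsync status=progress
# btrfstune -u /dev/mmcblk0p2    # give this card's btrfs partition a new UUID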

But it looks like some change is needed in the kernel code, although I
have no clue at the moment where exactly.


Re: btrfs send receive: No space left on device

2018-10-17 Thread Henk Slager
On Wed, Oct 17, 2018 at 10:29 AM Libor Klepáč  wrote:
>
> Hello,
> i have new 32GB SSD in my intel nuc, installed debian9 on it, using btrfs as 
> a rootfs.
> Then i created subvolumes /system and /home and moved system there.
>
> System was installed using kernel 4.9.x and filesystem created using 
> btrfs-progs 4.7.x
> Details follow:
> main filesystem
>
> # btrfs filesystem usage /mnt/btrfs/ssd/
> Overall:
> Device size:  29.08GiB
> Device allocated:  4.28GiB
> Device unallocated:   24.80GiB
> Device missing:  0.00B
> Used:  2.54GiB
> Free (estimated): 26.32GiB  (min: 26.32GiB)
> Data ratio:   1.00
> Metadata ratio:   1.00
> Global reserve:   16.00MiB  (used: 0.00B)
>
> Data,single: Size:4.00GiB, Used:2.48GiB
>/dev/sda3   4.00GiB
>
> Metadata,single: Size:256.00MiB, Used:61.05MiB
>/dev/sda3 256.00MiB
>
> System,single: Size:32.00MiB, Used:16.00KiB
>/dev/sda3  32.00MiB
>
> Unallocated:
>/dev/sda3  24.80GiB
>
> #/etc/fstab
> UUID=d801da52-813d-49da-bdda-87fc6363e0ac   /mnt/btrfs/ssd  btrfs 
> noatime,space_cache=v2,compress=lzo,commit=300,subvolid=5 0   0
> UUID=d801da52-813d-49da-bdda-87fc6363e0ac   /   btrfs 
>   noatime,space_cache=v2,compress=lzo,commit=300,subvol=/system 0 
>   0
> UUID=d801da52-813d-49da-bdda-87fc6363e0ac   /home   btrfs 
>   noatime,space_cache=v2,compress=lzo,commit=300,subvol=/home 0   0
>
> -
> Then i installed kernel from backports:
> 4.18.0-0.bpo.1-amd64 #1 SMP Debian 4.18.6-1~bpo9+1
> and btrfs-progs 4.17
>
> For backups , i have created 16GB iscsi device on my qnap and mounted it, 
> created filesystem, mounted like this:
> LABEL=backup  /mnt/btrfs/backup   btrfs 
>   noatime,space_cache=v2,compress=lzo,subvolid=5,nofail,noauto 0  
>  0
>
> After send-receive operation on /home subvolume, usage looks like this:
>
> # btrfs filesystem usage /mnt/btrfs/backup/
> Overall:
> Device size:  16.00GiB
> Device allocated:  1.27GiB
> Device unallocated:   14.73GiB
> Device missing:  0.00B
> Used:844.18MiB
> Free (estimated): 14.92GiB  (min: 14.92GiB)
> Data ratio:   1.00
> Metadata ratio:   1.00
> Global reserve:   16.00MiB  (used: 0.00B)
>
> Data,single: Size:1.01GiB, Used:833.36MiB
>/dev/sdb1.01GiB
>
> Metadata,single: Size:264.00MiB, Used:10.80MiB
>/dev/sdb  264.00MiB
>
> System,single: Size:4.00MiB, Used:16.00KiB
>/dev/sdb4.00MiB
>
> Unallocated:
>/dev/sdb   14.73GiB
>
>
> Problem is, during send-receive of system subvolume, it runs out of space:
>
> # btrbk run /mnt/btrfs/ssd/system/ -v
> btrbk command line client, version 0.26.1  (Wed Oct 17 09:51:20 2018)
> Using configuration: /etc/btrbk/btrbk.conf
> Using transaction log: /var/log/btrbk.log
> Creating subvolume snapshot for: /mnt/btrfs/ssd/system
> [snapshot] source: /mnt/btrfs/ssd/system
> [snapshot] target: /mnt/btrfs/ssd/_snapshots/system.20181017T0951
> Checking for missing backups of subvolume "/mnt/btrfs/ssd/system" in 
> "/mnt/btrfs/backup/"
> Creating subvolume backup (send-receive) for: 
> /mnt/btrfs/ssd/_snapshots/system.20181016T2034
> No common parent subvolume present, creating full backup...
> [send/receive] source: /mnt/btrfs/ssd/_snapshots/system.20181016T2034
> [send/receive] target: /mnt/btrfs/backup/system.20181016T2034
> mbuffer: error: outputThread: error writing to  at offset 0x4b5bd000: 
> Broken pipe
> mbuffer: warning: error during output to : Broken pipe
> WARNING: [send/receive] (send=/mnt/btrfs/ssd/_snapshots/system.20181016T2034, 
> receive=/mnt/btrfs/backup) At subvol 
> /mnt/btrfs/ssd/_snapshots/system.20181016T2034
> WARNING: [send/receive] (send=/mnt/btrfs/ssd/_snapshots/system.20181016T2034, 
> receive=/mnt/btrfs/backup) At subvol system.20181016T2034
> ERROR: rename o77417-5519-0 -> 
> lib/modules/4.18.0-0.bpo.1-amd64/kernel/drivers/watchdog/pcwd_pci.ko failed: 
> No space left on device
> ERROR: Failed to send/receive btrfs subvolume: 
> /mnt/btrfs/ssd/_snapshots/system.20181016T2034  -> /mnt/btrfs/backup
> [delete] options: commit-after
> [delete] target: /mnt/btrfs/backup/system.20181016T2034
> WARNING: Deleted partially received (garbled) subvolume: 
> /mnt/btrfs/backup/system.20181016T2034
> ERROR: Error while resuming backups, aborting
> Created 0/2 missing backups
> WARNING: Skipping cleanup of snapshots for subvolume "/mnt/btrfs/ssd/system", 
> as at least one target aborted earlier
> Completed within: 116s  (Wed Oct 17 09:53:16 2018)
> 

Re: Is autodefrag recommended?

2017-09-05 Thread Henk Slager
On Tue, Sep 5, 2017 at 1:45 PM, Austin S. Hemmelgarn
 wrote:

>>   - You end up duplicating more data than is strictly necessary. This
>> is, IIRC, something like 128 KiB for a write.
>
> FWIW, I'm pretty sure you can mitigate this first issue by running a regular
> defrag on a semi-regular basis (monthly is what I would probably suggest).

No, both autodefrag and regular defrag duplicate data, so if you keep
snapshots around for weeks or months, it can eat up a significant
amount of space.


Re: Is autodefrag recommended?

2017-09-04 Thread Henk Slager
On Mon, Sep 4, 2017 at 12:34 PM, Duncan <1i5t5.dun...@cox.net> wrote:

> * Autodefrag works very well when these internal-rewrite-pattern files
> are relatively small, say a quarter GiB or less, but, again with near-
> capacity throughput, not necessarily so well with larger databases or VM
> images of a GiB or larger.  (The quarter-gig to gig size is intermediate,
> not as often a problem and not a problem for many, but it can be for
> slower devices, while those on fast ssds may not see a problem until
> sizes reach multiple GiB.)

I have seen you stating this before, about some quarter-GiB filesize
or so, but it is irrelevant; it is simply not how it works. See Hugo's
explanation for how it works. I can post/store an actual filefrag
output of a VM image that has been around for 2 years on one of my
btrfs filesystems, then you can do some statistics on it and see from
there how it works.
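For reference, the statistic I mean is simply the extent count that
filefrag reports per file; the path here is just an example:

# filefrag /var/lib/libvirt/images/vm.img

filefrag prints the total number of extents for the file, and with -v
it also lists every extent with its length.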


Re: speed up big btrfs volumes with ssds

2017-09-04 Thread Henk Slager
On Sun, Sep 3, 2017 at 8:32 PM, Stefan Priebe - Profihost AG
 wrote:
> Hello,
>
> i'm trying to speed up big btrfs volumes.
>
> Some facts:
> - Kernel will be 4.13-rc7
> - needed volume size is 60TB
>
> Currently without any ssds i get the best speed with:
> - 4x HW Raid 5 with 1GB controller memory of 4TB 3,5" devices
>
> and using btrfs as raid 0 for data and metadata on top of those 4 raid 5.
>
> I can live with a data loss every now and and than ;-) so a raid 0 on
> top of the 4x radi5 is acceptable for me.
>
> Currently the write speed is not as good as i would like - especially
> for random 8k-16k I/O.
>
> My current idea is to use a pcie flash card with bcache on top of each
> raid 5.

Whether it can speed things up depends quite a lot on what the
use-case is; for some not-so-parallel access patterns it might work.
So this 60TB is then 20 4TB disks or so, and the 4x 1GB controller
cache is simply not very helpful I think; the working set doesn't fit
in it I guess. If there are mostly single or only a few users of the
fs, a single PCIe flash card bcache-ing 4 devices can work, but with
SATA SSDs I would use 1 SSD per HW RAID5.

Then roughly make sure the complete set of metadata blocks fits in the
cache; for an fs of this size let's estimate 150G. Then maybe the same
or double that for data, so an SSD of 500G would be a first try.

You give the impression that reliability for this fs is not the
highest prio, so if you go full risk, put bcache in write-back mode;
then you will have your desired random 8k-16k I/O speedup once the
cache is warmed up. But any SW or HW failure will normally result in
total fs loss if SSD and HDD get out of sync somehow. Bcache
write-through might also be acceptable; you will need extensive
monitoring and tuning of all (bcache) parameters etc. to be sure of
the right choice of size and setup.
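A rough sketch of how such a setup could be created (device names are
placeholders: the PCIe flash card as cache, the 4 HW RAID5 logical
devices as backing; make-bcache wipes the named devices, so this is
only for a fresh setup):

# make-bcache -C /dev/nvme0n1 -B /dev/sda /dev/sdb /dev/sdc /dev/sdd
# echo writeback > /sys/block/bcache0/bcache/cache_mode   # repeat for bcache1..3, only if the risk is acceptable
# mkfs.btrfs -d raid0 -m raid0 /dev/bcache0 /dev/bcache1 /dev/bcache2 /dev/bcache3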


Re: Is autodefrag recommended?

2017-09-04 Thread Henk Slager
On Mon, Sep 4, 2017 at 11:31 AM, Marat Khalili  wrote:
> Hello list,
> good time of the day,
>
> More than once I see mentioned in this list that autodefrag option solves
> problems with no apparent drawbacks, but it's not the default. Can you
> recommend to just switch it on indiscriminately on all installations?

Of course it has drawbacks; it depends on the use-cases on the
filesystem what your trade-off is. If the filesystem was created a
long time ago and has 4k leaves, then on HDD you get, over time,
excessive fragmentation and scattered 4k blocks all over the disk for
a file with a lot of random writes (standard CoW for the whole fs),
like a 50G VM image: easily 500k extents.

With autodefrag on from the beginning of fs creation, most
extent/block sizes will be 128k or 256k, on that order, and then the
amount of extents for the same VM image is roughly 50k. So
statistically the average block size is not 4k but 128k, which at
least means less free-space fragmentation (I use SSD caching of the
HDD, otherwise even those 50k extents result in totally unacceptable
performance). But also for the newer standard 16k leaves, it is more
or less the same story.
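For an existing fragmented image you can get to similar extent sizes
with a one-shot defrag and a target extent size, e.g. (the path is a
placeholder, and as noted elsewhere in the thread this duplicates data
relative to existing snapshots):

# btrfs filesystem defragment -t 256K /var/lib/libvirt/images/vm.img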

The drawbacks for me are:
1. I use nightly differential snapshotting for backup/replication over
a metered mobile network link, and autodefrag causes a certain amount
of unnecessary fake content difference due to send|receive being based
on CoW. The amount of extra data volume it causes is still acceptable,
though. If I let the guest OS defragment its fs inside the VM image,
for example, the data volume per day becomes unacceptable.
2. It causes extra HDD activity, so noise, power consumption etc.,
which might be unacceptable for some use-cases.

> I'm currently on kernel 4.4, can switch to 4.10 if necessary (it's Ubuntu
> that gives us this strange choice, no idea why it's not 4.9). Only spinning
> rust here, no SSDs.
Kernel 4.4 is new enough w.r.t. autodefrag, but if you can switch to
4.8 or newer, I would do so.


Re: read-only for no good reason on 4.9.30

2017-09-04 Thread Henk Slager
On Mon, Sep 4, 2017 at 7:19 AM, Russell Coker
 wrote:
> I have a system with less than 50% disk space used.  It just started rejecting
> writes due to lack of disk space.  I ran "btrfs balance" and then it started
> working correctly again.  It seems that a btrfs filesystem if left alone will
> eventually get fragmented enough that it rejects writes (I've had similar
> issues with other systems running BTRFS with other kernel versions).
>
> Is this a known issue?
This is a known issue, but with kernel 4.8 or later it is more rare.
At least I myself have never had it since then on 10+ active btrfs
filesystems.

> Is there any good way of recognising when it's likely to happen?  Is there
> anything I can do other than rewriting a medium size file to determine when
> it's happened?
I am not aware of an easy way of recognizing it, but there are some
python tools that can show you the usage per chunk and that can give
an indication.

A way to prevent the issue might be adding/changing mount options,
although I have no hard proof, only good experiences.
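What you did with the full balance can also be done more cheaply on a
regular basis with usage filters, so that only mostly-empty chunks get
compacted; the thresholds here are just a starting point:

# btrfs balance start -dusage=25 /path/to/fs
# btrfs balance start -musage=25 /path/to/fs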

What are the mount options for this filesystem?
I think compress=lzo, noatime, and autodefrag might help for various
reasons. I use the first 2 on SSDs and the last one on HDDs (with and
without bcache).
Maybe someone else has more tips or comments.


Re: Crashed filesystem, nothing helps

2017-08-07 Thread Henk Slager
On Mon, Aug 7, 2017 at 7:12 AM, Thomas Wurfbaum  wrote:
> Now i do a btrfs-find-root, but it runs now since 5 day without a result.
> How long should i wait? Or is it already to late to hope?
>
> mainframe:~ # btrfs-find-root.static /dev/sdb1
> parent transid verify failed on 29376512 wanted 1327723 found 1489835
> parent transid verify failed on 29376512 wanted 1327723 found 1489835
> parent transid verify failed on 29376512 wanted 1327723 found 1489835
> parent transid verify failed on 29376512 wanted 1327723 found 1489835
> Ignoring transid failure
> Superblock thinks the generation is 1490226
> Superblock thinks the level is 1
>
> The process is still running...

I also have a filesystem that has a big difference in wanted and found
transid; in my case the wanted is much higher than the found. I also
tried to repair it for more than 2 weeks in total, but both progs and
kernel fail. I haven't seen that you tried mounting it with just the
ro mount option. I think in your case chances are low, but it worked
in my case: I could copy all data (7TB unique, non-shared through
snapshot or reflink) to a new fs with rsync. Btrfs send and other
actions done by btrfs-progs quite soon failed on 1 or several parent
transid inconsistencies.
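As a sketch of that copy-out path (device names and mount points are
placeholders):

# mount -o ro /dev/sdX /mnt/broken
# mount /dev/sdY /mnt/fresh        # a freshly created btrfs
# rsync -aHAX --progress /mnt/broken/ /mnt/fresh/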

Others on this forum have stated that when there is such a big
difference in transid, the fs is simply unrepairable. I must admit
that this is the truth in practice. I tried restore with btrfs-restore
once on another fs that was damaged, and at that time the restore gave
quite different content for some files compared to a copy of the same
file from the fs mounted via the kernel.


Re: [PATCH] btrfs-progs: eliminate bogus IOC_DEV_INFO call

2017-07-28 Thread Henk Slager
On Thu, Jul 27, 2017 at 9:24 PM, Hans van Kranenburg
 wrote:
> Device ID numbers always start at 1, not at 0. The first IOC_DEV_INFO
> call does not make sense, since it will always return ENODEV.

When a btrfs replace is ongoing, there is a device ID 0.

> ioctl(3, BTRFS_IOC_DEV_INFO, {devid=0}) = -1 ENODEV (No such device)
>
> Signed-off-by: Hans van Kranenburg 
> ---
>  cmds-fi-usage.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/cmds-fi-usage.c b/cmds-fi-usage.c
> index 101a0c4..52c4c62 100644
> --- a/cmds-fi-usage.c
> +++ b/cmds-fi-usage.c
> @@ -535,7 +535,7 @@ static int load_device_info(int fd, struct device_info 
> **device_info_ptr,
> return 1;
> }
>
> -   for (i = 0, ndevs = 0 ; i <= fi_args.max_id ; i++) {
> +   for (i = 1, ndevs = 0 ; i <= fi_args.max_id ; i++) {
> if (ndevs >= fi_args.num_devices) {
> error("unexpected number of devices: %d >= %llu", 
> ndevs,
> (unsigned long long)fi_args.num_devices);
> --
> 2.11.0
>


Re: [PATCH 1/2] btrfs-progs: Fix false alert about EXTENT_DATA shouldn't be hole

2017-07-03 Thread Henk Slager
On Mon, Jun 19, 2017 at 1:26 PM, Henk Slager <eye...@gmail.com> wrote:
> On 16-06-17 03:43, Qu Wenruo wrote:
>> Since incompat feature NO_HOLES still allow us to have explicit hole
>> file extent, current check is too restrict and will cause false alert
>> like:
>>
>> root 5 EXTENT_DATA[257, 0] shouldn't be hole
>>
>> Fix it by removing the restrict hole file extent check.
>>
>> Reported-by: Henk Slager <eye...@gmail.com>
>> Signed-off-by: Qu Wenruo <quwen...@cn.fujitsu.com>
>> ---
>>  cmds-check.c | 6 +-
>>  1 file changed, 1 insertion(+), 5 deletions(-)
>>
>> diff --git a/cmds-check.c b/cmds-check.c
>> index c052f66e..7bd57677 100644
>> --- a/cmds-check.c
>> +++ b/cmds-check.c
>> @@ -4841,11 +4841,7 @@ static int check_file_extent(struct btrfs_root *root, 
>> struct btrfs_key *fkey,
>>   }
>>
>>   /* Check EXTENT_DATA hole */
>> - if (no_holes && is_hole) {
>> - err |= FILE_EXTENT_ERROR;
>> - error("root %llu EXTENT_DATA[%llu %llu] shouldn't be hole",
>> -   root->objectid, fkey->objectid, fkey->offset);
>> - } else if (!no_holes && *end != fkey->offset) {
>> + if (!no_holes && *end != fkey->offset) {
>>   err |= FILE_EXTENT_ERROR;
>>   error("root %llu EXTENT_DATA[%llu %llu] interrupt",
>> root->objectid, fkey->objectid, fkey->offset);
>
>
> Thanks for the patch, I applied it on v4.11 btrfs-progs and re-ran the check:
> # btrfs check -p --readonly /dev/mapper/smr
While looking at new kernel+progs releases and my script logs, I see
I made a copy-paste error; of course it is the lowmem mode:
# btrfs check -p --mode lowmem --readonly /dev/mapper/smr
> on filesystem mentioned in:
> https://www.spinics.net/lists/linux-btrfs/msg66374.html
>
> and now the "shouldn't be hole" errors don't show up anymore.
>
> Tested-by: Henk Slager <eye...@gmail.com>
>


Re: [PATCH v3 2/4] btrfs-progs: lowmem check: Fix false alert about referencer count mismatch

2017-07-02 Thread Henk Slager
On Mon, Jun 26, 2017 at 12:37 PM, Lu Fengqi  wrote:
> The normal back reference counting doesn't care about the extent referred
> by the extent data in the shared leaf. The check_extent_data_backref
> function need to skip the leaf that owner mismatch with the root_id.
>
> Reported-by: Marc MERLIN 
> Signed-off-by: Lu Fengqi 
> ---
>  cmds-check.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/cmds-check.c b/cmds-check.c
> index 70d2b7f2..f42968cd 100644
> --- a/cmds-check.c
> +++ b/cmds-check.c
> @@ -10692,7 +10692,8 @@ static int check_extent_data_backref(struct 
> btrfs_fs_info *fs_info,
> leaf = path.nodes[0];
> slot = path.slots[0];
>
> -   if (slot >= btrfs_header_nritems(leaf))
> +   if (slot >= btrfs_header_nritems(leaf) ||
> +   btrfs_header_owner(leaf) != root_id)
> goto next;
> btrfs_item_key_to_cpu(leaf, &key, slot);
> if (key.objectid != objectid || key.type != 
> BTRFS_EXTENT_DATA_KEY)
> --
> 2.13.1

With this patch applied to v4.11, I ran:
# btrfs check -p --mode lowmem /dev/mapper/smr

No 'referencer count mismatch' anymore, but, likely due to other
hidden corruption, the check took more time than I had planned, so
after 5 days I cancelled it.

As a summary, kernel and lowmem check appear to report the same issue;
for the lowmem check it is this (repeating):
[...]
parent transid verify failed on 6350669414400 wanted 24678 found 24184
parent transid verify failed on 6350645837824 wanted 24678 found 23277
Ignoring transid failure
leaf parent key incorrect 6350645837824
ERROR: extent[6349151535104 16384] backref lost (owner: 2, level: 0)
ERROR: check leaf failed root 2 bytenr 6349151535104 level 0, force
continue check
parent transid verify failed on 6350645837824 wanted 24678 found 23277
Ignoring transid failure
leaf parent key incorrect 6350645837824
ERROR: extent[6349150486528 16384] backref lost (owner: 2, level: 0)
ERROR: check leaf failed root 2 bytenr 6349150486528 level 0, force
continue check
^C

My plan is now to image the whole 8TB fs to extra/new storage hardware
with dd and then see if I can get the copy fixed. But it might take a
year before I do so (it is not critical w.r.t. data loss; it's cold
storage, a multi-year btrfs features test).
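The imaging step itself would be something like this (the target
device is a placeholder, and neither filesystem mounted), or GNU
ddrescue if read errors turn up after all:

# dd if=/dev/mapper/smr of=/dev/mapper/smr-copy bs=16M conv=fsync status=progress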


Re: csum failed root -9

2017-06-19 Thread Henk Slager
>> I think I leave it as is for the time being, unless there is some news
>> how to fix things with low risk (or maybe via a temp overlay snapshot
>> with DM). But the lowmem check took 2 days, that's not really fun.
>> The goal for the 8TB fs is to have an up to 7 year snapshot history at
>> sometime, now the oldest snapshot is from early 2014, so almost
>> halfway :)
>
> Btrfs is still much too unstable to trust 7 years worth of backup to
> it. You will probably loose it at some point, especially while many
> snapshots are still such a huge performance breaker in btrfs. I suggest
> trying out also other alternatives like borg backup for such a project.

Maybe I should clarify that I don't use snapshotting for archiving
explicitly. So in the latest snapshot there are still old but unused
files from many years back, like a disk image from a WindowsXP laptop
(already recycled), for example. Userdata that is in older snapshots
but not in newer ones is what I consider useless data today, so I had
deleted that explicitly. But who knows, maybe for some statistic or
other btrfs experiment it might be interesting to have a long chain of
many snapshot increments.

Another reason is the SMR characteristics of the disk, which made me
decide to designate this fs write-only. If I remove snapshots, the fs
gets free-space fragmentation and then writing to it will be much
slower. This disk was relatively cheap and I don't want to experience
the slowness and the longer on-time.

I snapshot no more than 3 subvolumes monthly, so after 7 years the fs
has 252 snapshots (3 x 12 x 7), which is considered no problem for btrfs.
I think borg backup is interesting, but from kernel 3.11 to 4.11 (even
using raid5 up to 4.1) I have managed to keep it running/cloning
multi-site with just a few relatively simple scripts and the btrfs
kernel+tools themselves, also working on a low-power ARM platform. I
don't like yet another command set, and borg uses its own extra repo
or small database for tracking diffs (I haven't used it, so I am not
sure). But what I need, differential/incremental + compression, is
just built into btrfs, which I anyhow use for local snapshotting. I
also finally moved some ARM boards to a btrfs rootfs recently; I am
not sure if/when I am going to use other backup tooling besides just
rsync and btrfs features.


Re: csum failed root -9

2017-06-19 Thread Henk Slager
On Thu, Jun 15, 2017 at 9:13 AM, Qu Wenruo <quwen...@cn.fujitsu.com> wrote:
>
>
> At 06/14/2017 09:39 PM, Henk Slager wrote:
>>
>> On Tue, Jun 13, 2017 at 12:47 PM, Henk Slager <eye...@gmail.com> wrote:
>>>
>>> On Tue, Jun 13, 2017 at 7:24 AM, Kai Krakow <hurikha...@gmail.com> wrote:
>>>>
>>>> Am Mon, 12 Jun 2017 11:00:31 +0200
>>>> schrieb Henk Slager <eye...@gmail.com>:
>>>>
>>>>> Hi all,
>>>>>
>>>>> there is 1-block corruption a 8TB filesystem that showed up several
>>>>> months ago. The fs is almost exclusively a btrfs receive target and
>>>>> receives monthly sequential snapshots from two hosts but 1 received
>>>>> uuid. I do not know exactly when the corruption has happened but it
>>>>> must have been roughly 3 to 6 months ago. with monthly updated
>>>>> kernel+progs on that host.
>>>>>
>>>>> Some more history:
>>>>> - fs was created in november 2015 on top of luks
>>>>> - initially bcache between the 2048-sector aligned partition and luks.
>>>>> Some months ago I removed 'the bcache layer' by making sure that cache
>>>>> was clean and then zeroing 8K bytes at start of partition in an
>>>>> isolated situation. Then setting partion offset to 2064 by
>>>>> delete-recreate in gdisk.
>>>>> - in december 2016 there were more scrub errors, but related to the
>>>>> monthly snapshot of december2016. I have removed that snapshot this
>>>>> year and now only this 1-block csum error is the only issue.
>>>>> - brand/type is seagate 8TB SMR. At least since kernel 4.4+ that
>>>>> includes some SMR related changes in the blocklayer this disk works
>>>>> fine with btrfs.
>>>>> - the smartctl values show no error so far but I will run an extended
>>>>> test this week after another btrfs check which did not show any error
>>>>> earlier with the csum fail being there
>>>>> - I have noticed that the board that has the disk attached has been
>>>>> rebooted due to power-failures many times (unreliable power switch and
>>>>> power dips from energy company) and the 150W powersupply is broken and
>>>>> replaced since then. Also due to this, I decided to remove bcache
>>>>> (which has been in write-through and write-around only).
>>>>>
>>>>> Some btrfs inpect-internal exercise shows that the problem is in a
>>>>> directory in the root that contains most of the data and snapshots.
>>>>> But an  rsync -c  with an identical other clone snapshot shows no
>>>>> difference (no writes to an rw snapshot of that clone). So the fs is
>>>>> still OK as file-level backup, but btrfs replace/balance will fatal
>>>>> error on just this 1 csum error. It looks like that this is not a
>>>>> media/disk error but some HW induced error or SW/kernel issue.
>>>>> Relevant btrfs commands + dmesg info, see below.
>>>>>
>>>>> Any comments on how to fix or handle this without incrementally
>>>>> sending all snapshots to a new fs (6+ TiB of data, assuming this won't
>>>>> fail)?
>>>>>
>>>>>
>>>>> # uname -r
>>>>> 4.11.3-1-default
>>>>> # btrfs --version
>>>>> btrfs-progs v4.10.2+20170406
>>>>
>>>>
>>>> There's btrfs-progs v4.11 available...
>>>
>>>
>>> I started:
>>> # btrfs check -p --readonly /dev/mapper/smr
>>> but it stopped with printing 'Killed' while checking extents. The
>>> board has 8G RAM, no swap (yet), so I just started lowmem mode:
>>> # btrfs check -p --mode lowmem --readonly /dev/mapper/smr
>>>
>>> Now after a 1 day 77 lines like this are printed:
>>> ERROR: extent[5365470154752, 81920] referencer count mismatch (root:
>>> 6310, owner: 1771130, offset: 33243062272) wanted: 1, have: 2
>>>
>>> It is still running, hopefully it will finish within 2 days. But
>>> lateron I can compile/use latest progs from git. Same for kernel,
>>> maybe with some tweaks/patches, but I think I will also plug the disk
>>> into a faster machine then ( i7-4770 instead of the J1900 ).
>>>
>>>>> fs profile is dup for system+meta, single for data
>>>>>
>>>>> # btrfs scrub start /local/smr

Re: [PATCH 1/2] btrfs-progs: Fix false alert about EXTENT_DATA shouldn't be hole

2017-06-19 Thread Henk Slager
On 16-06-17 03:43, Qu Wenruo wrote:
> Since incompat feature NO_HOLES still allow us to have explicit hole
> file extent, current check is too restrict and will cause false alert
> like:
>
> root 5 EXTENT_DATA[257, 0] shouldn't be hole
>
> Fix it by removing the restrict hole file extent check.
>
> Reported-by: Henk Slager <eye...@gmail.com>
> Signed-off-by: Qu Wenruo <quwen...@cn.fujitsu.com>
> ---
>  cmds-check.c | 6 +-
>  1 file changed, 1 insertion(+), 5 deletions(-)
>
> diff --git a/cmds-check.c b/cmds-check.c
> index c052f66e..7bd57677 100644
> --- a/cmds-check.c
> +++ b/cmds-check.c
> @@ -4841,11 +4841,7 @@ static int check_file_extent(struct btrfs_root *root, 
> struct btrfs_key *fkey,
>   }
>  
>   /* Check EXTENT_DATA hole */
> - if (no_holes && is_hole) {
> - err |= FILE_EXTENT_ERROR;
> - error("root %llu EXTENT_DATA[%llu %llu] shouldn't be hole",
> -   root->objectid, fkey->objectid, fkey->offset);
> - } else if (!no_holes && *end != fkey->offset) {
> + if (!no_holes && *end != fkey->offset) {
>   err |= FILE_EXTENT_ERROR;
>   error("root %llu EXTENT_DATA[%llu %llu] interrupt",
> root->objectid, fkey->objectid, fkey->offset);


Thanks for the patch, I applied it on v4.11 btrfs-progs and re-ran the check:
# btrfs check -p --readonly /dev/mapper/smr

on filesystem mentioned in:
https://www.spinics.net/lists/linux-btrfs/msg66374.html

and now the "shouldn't be hole" errors don't show up anymore.

Tested-by: Henk Slager <eye...@gmail.com>



Re: csum failed root -9

2017-06-14 Thread Henk Slager
On Tue, Jun 13, 2017 at 12:47 PM, Henk Slager <eye...@gmail.com> wrote:
> On Tue, Jun 13, 2017 at 7:24 AM, Kai Krakow <hurikha...@gmail.com> wrote:
>> Am Mon, 12 Jun 2017 11:00:31 +0200
>> schrieb Henk Slager <eye...@gmail.com>:
>>
>>> Hi all,
>>>
>>> there is 1-block corruption a 8TB filesystem that showed up several
>>> months ago. The fs is almost exclusively a btrfs receive target and
>>> receives monthly sequential snapshots from two hosts but 1 received
>>> uuid. I do not know exactly when the corruption has happened but it
>>> must have been roughly 3 to 6 months ago. with monthly updated
>>> kernel+progs on that host.
>>>
>>> Some more history:
>>> - fs was created in november 2015 on top of luks
>>> - initially bcache between the 2048-sector aligned partition and luks.
>>> Some months ago I removed 'the bcache layer' by making sure that cache
>>> was clean and then zeroing 8K bytes at start of partition in an
>>> isolated situation. Then setting partion offset to 2064 by
>>> delete-recreate in gdisk.
>>> - in december 2016 there were more scrub errors, but related to the
>>> monthly snapshot of december2016. I have removed that snapshot this
>>> year and now only this 1-block csum error is the only issue.
>>> - brand/type is seagate 8TB SMR. At least since kernel 4.4+ that
>>> includes some SMR related changes in the blocklayer this disk works
>>> fine with btrfs.
>>> - the smartctl values show no error so far but I will run an extended
>>> test this week after another btrfs check which did not show any error
>>> earlier with the csum fail being there
>>> - I have noticed that the board that has the disk attached has been
>>> rebooted due to power-failures many times (unreliable power switch and
>>> power dips from energy company) and the 150W powersupply is broken and
>>> replaced since then. Also due to this, I decided to remove bcache
>>> (which has been in write-through and write-around only).
>>>
>>> Some btrfs inpect-internal exercise shows that the problem is in a
>>> directory in the root that contains most of the data and snapshots.
>>> But an  rsync -c  with an identical other clone snapshot shows no
>>> difference (no writes to an rw snapshot of that clone). So the fs is
>>> still OK as file-level backup, but btrfs replace/balance will fatal
>>> error on just this 1 csum error. It looks like that this is not a
>>> media/disk error but some HW induced error or SW/kernel issue.
>>> Relevant btrfs commands + dmesg info, see below.
>>>
>>> Any comments on how to fix or handle this without incrementally
>>> sending all snapshots to a new fs (6+ TiB of data, assuming this won't
>>> fail)?
>>>
>>>
>>> # uname -r
>>> 4.11.3-1-default
>>> # btrfs --version
>>> btrfs-progs v4.10.2+20170406
>>
>> There's btrfs-progs v4.11 available...
>
> I started:
> # btrfs check -p --readonly /dev/mapper/smr
> but it stopped with printing 'Killed' while checking extents. The
> board has 8G RAM, no swap (yet), so I just started lowmem mode:
> # btrfs check -p --mode lowmem --readonly /dev/mapper/smr
>
> Now after a 1 day 77 lines like this are printed:
> ERROR: extent[5365470154752, 81920] referencer count mismatch (root:
> 6310, owner: 1771130, offset: 33243062272) wanted: 1, have: 2
>
> It is still running, hopefully it will finish within 2 days. But
> lateron I can compile/use latest progs from git. Same for kernel,
> maybe with some tweaks/patches, but I think I will also plug the disk
> into a faster machine then ( i7-4770 instead of the J1900 ).
>
>>> fs profile is dup for system+meta, single for data
>>>
>>> # btrfs scrub start /local/smr
>>
>> What looks strange to me is that the parameters of the error reports
>> seem to be rotated by one... See below:
>>
>>> [27609.626555] BTRFS error (device dm-0): parent transid verify failed
>>> on 6350718500864 wanted 23170 found 23076
>>> [27609.685416] BTRFS info (device dm-0): read error corrected: ino 1
>>> off 6350718500864 (dev /dev/mapper/smr sector 11681212672)
>>> [27609.685928] BTRFS info (device dm-0): read error corrected: ino 1
>>> off 6350718504960 (dev /dev/mapper/smr sector 11681212680)
>>> [27609.686160] BTRFS info (device dm-0): read error corrected: ino 1
>>> off 6350718509056 (dev /dev/mapper/smr sector 11681212688)
>>> [27609.687136] BTRFS info (device dm-0): read error corrected

Re: csum failed root -9

2017-06-13 Thread Henk Slager
On Tue, Jun 13, 2017 at 7:24 AM, Kai Krakow <hurikha...@gmail.com> wrote:
> Am Mon, 12 Jun 2017 11:00:31 +0200
> schrieb Henk Slager <eye...@gmail.com>:
>
>> Hi all,
>>
>> there is 1-block corruption a 8TB filesystem that showed up several
>> months ago. The fs is almost exclusively a btrfs receive target and
>> receives monthly sequential snapshots from two hosts but 1 received
>> uuid. I do not know exactly when the corruption has happened but it
>> must have been roughly 3 to 6 months ago. with monthly updated
>> kernel+progs on that host.
>>
>> Some more history:
>> - fs was created in november 2015 on top of luks
>> - initially bcache between the 2048-sector aligned partition and luks.
>> Some months ago I removed 'the bcache layer' by making sure that cache
>> was clean and then zeroing 8K bytes at start of partition in an
>> isolated situation. Then setting partion offset to 2064 by
>> delete-recreate in gdisk.
>> - in december 2016 there were more scrub errors, but related to the
>> monthly snapshot of december2016. I have removed that snapshot this
>> year and now only this 1-block csum error is the only issue.
>> - brand/type is seagate 8TB SMR. At least since kernel 4.4+ that
>> includes some SMR related changes in the blocklayer this disk works
>> fine with btrfs.
>> - the smartctl values show no error so far but I will run an extended
>> test this week after another btrfs check which did not show any error
>> earlier with the csum fail being there
>> - I have noticed that the board that has the disk attached has been
>> rebooted due to power-failures many times (unreliable power switch and
>> power dips from energy company) and the 150W powersupply is broken and
>> replaced since then. Also due to this, I decided to remove bcache
>> (which has been in write-through and write-around only).
>>
>> Some btrfs inpect-internal exercise shows that the problem is in a
>> directory in the root that contains most of the data and snapshots.
>> But an  rsync -c  with an identical other clone snapshot shows no
>> difference (no writes to an rw snapshot of that clone). So the fs is
>> still OK as file-level backup, but btrfs replace/balance will fatal
>> error on just this 1 csum error. It looks like that this is not a
>> media/disk error but some HW induced error or SW/kernel issue.
>> Relevant btrfs commands + dmesg info, see below.
>>
>> Any comments on how to fix or handle this without incrementally
>> sending all snapshots to a new fs (6+ TiB of data, assuming this won't
>> fail)?
>>
>>
>> # uname -r
>> 4.11.3-1-default
>> # btrfs --version
>> btrfs-progs v4.10.2+20170406
>
> There's btrfs-progs v4.11 available...

I started:
# btrfs check -p --readonly /dev/mapper/smr
but it stopped with printing 'Killed' while checking extents. The
board has 8G RAM, no swap (yet), so I just started lowmem mode:
# btrfs check -p --mode lowmem --readonly /dev/mapper/smr

Now, after 1 day, 77 lines like this have been printed:
ERROR: extent[5365470154752, 81920] referencer count mismatch (root:
6310, owner: 1771130, offset: 33243062272) wanted: 1, have: 2

It is still running; hopefully it will finish within 2 days. But later
on I can compile/use the latest progs from git. Same for the kernel,
maybe with some tweaks/patches, but I think I will also plug the disk
into a faster machine then (i7-4770 instead of the J1900).

>> fs profile is dup for system+meta, single for data
>>
>> # btrfs scrub start /local/smr
>
> What looks strange to me is that the parameters of the error reports
> seem to be rotated by one... See below:
>
>> [27609.626555] BTRFS error (device dm-0): parent transid verify failed
>> on 6350718500864 wanted 23170 found 23076
>> [27609.685416] BTRFS info (device dm-0): read error corrected: ino 1
>> off 6350718500864 (dev /dev/mapper/smr sector 11681212672)
>> [27609.685928] BTRFS info (device dm-0): read error corrected: ino 1
>> off 6350718504960 (dev /dev/mapper/smr sector 11681212680)
>> [27609.686160] BTRFS info (device dm-0): read error corrected: ino 1
>> off 6350718509056 (dev /dev/mapper/smr sector 11681212688)
>> [27609.687136] BTRFS info (device dm-0): read error corrected: ino 1
>> off 6350718513152 (dev /dev/mapper/smr sector 11681212696)
>> [37663.606455] BTRFS error (device dm-0): parent transid verify failed
>> on 6350453751808 wanted 23170 found 23075
>> [37663.685158] BTRFS info (device dm-0): read error corrected: ino 1
>> off 6350453751808 (dev /dev/mapper/smr sector 11679647008)
>> [37663.685386] BTRFS info (device dm-0): read error corrected: ino 1

csum failed root -9

2017-06-12 Thread Henk Slager
Hi all,

there is a 1-block corruption on an 8TB filesystem that showed up
several months ago. The fs is almost exclusively a btrfs receive
target and receives monthly sequential snapshots from two hosts but 1
received uuid. I do not know exactly when the corruption happened, but
it must have been roughly 3 to 6 months ago, with monthly updated
kernel+progs on that host.

Some more history:
- fs was created in november 2015 on top of luks
- initially bcache between the 2048-sector aligned partition and luks.
Some months ago I removed 'the bcache layer' by making sure that cache
was clean and then zeroing 8K bytes at start of partition in an
isolated situation, then setting the partition offset to 2064 by
delete-recreate in gdisk.
- in December 2016 there were more scrub errors, but related to the
monthly snapshot of December 2016. I have removed that snapshot this
year, and now this 1-block csum error is the only remaining issue.
- brand/type is seagate 8TB SMR. At least since kernel 4.4+ that
includes some SMR related changes in the blocklayer this disk works
fine with btrfs.
- the smartctl values show no error so far, but I will run an extended
test this week after another btrfs check, which earlier did not show
any error even with the csum failure present
- I have noticed that the board that has the disk attached has been
rebooted due to power-failures many times (unreliable power switch and
power dips from the energy company), and the 150W power supply has
broken and been replaced since then. Also due to this, I decided to
remove bcache
(which has been in write-through and write-around only).

Some btrfs inspect-internal exercise shows that the problem is in a
directory in the root that contains most of the data and snapshots.
But an rsync -c with an identical other clone snapshot shows no
difference (no writes to an rw snapshot of that clone). So the fs is
still OK as a file-level backup, but btrfs replace/balance will
fatally error out on just this 1 csum error. It looks like this is not
a media/disk error but some HW-induced error or a SW/kernel issue.
Relevant btrfs commands + dmesg info, see below.

Any comments on how to fix or handle this without incrementally
sending all snapshots to a new fs (6+ TiB of data, assuming this won't
fail)?


# uname -r
4.11.3-1-default
# btrfs --version
btrfs-progs v4.10.2+20170406

fs profile is dup for system+meta, single for data

# btrfs scrub start /local/smr

[27609.626555] BTRFS error (device dm-0): parent transid verify failed
on 6350718500864 wanted 23170 found 23076
[27609.685416] BTRFS info (device dm-0): read error corrected: ino 1
off 6350718500864 (dev /dev/mapper/smr sector 11681212672)
[27609.685928] BTRFS info (device dm-0): read error corrected: ino 1
off 6350718504960 (dev /dev/mapper/smr sector 11681212680)
[27609.686160] BTRFS info (device dm-0): read error corrected: ino 1
off 6350718509056 (dev /dev/mapper/smr sector 11681212688)
[27609.687136] BTRFS info (device dm-0): read error corrected: ino 1
off 6350718513152 (dev /dev/mapper/smr sector 11681212696)
[37663.606455] BTRFS error (device dm-0): parent transid verify failed
on 6350453751808 wanted 23170 found 23075
[37663.685158] BTRFS info (device dm-0): read error corrected: ino 1
off 6350453751808 (dev /dev/mapper/smr sector 11679647008)
[37663.685386] BTRFS info (device dm-0): read error corrected: ino 1
off 6350453755904 (dev /dev/mapper/smr sector 11679647016)
[37663.685587] BTRFS info (device dm-0): read error corrected: ino 1
off 6350453760000 (dev /dev/mapper/smr sector 11679647024)
[37663.685798] BTRFS info (device dm-0): read error corrected: ino 1
off 6350453764096 (dev /dev/mapper/smr sector 11679647032)
[43497.234598] BTRFS error (device dm-0): bdev /dev/mapper/smr errs:
wr 0, rd 0, flush 0, corrupt 1, gen 0
[43497.234605] BTRFS error (device dm-0): unable to fixup (regular)
error at logical 7175413624832 on dev /dev/mapper/smr

# < figure out which chunk with help of btrfs py lib >

chunk vaddr 7174898057216 type 1 stripe 0 devid 1 offset 6696948727808
length 1073741824 used 1073741824 used_pct 100
chunk vaddr 7175971799040 type 1 stripe 0 devid 1 offset 6698022469632
length 1073741824 used 1073741824 used_pct 100

# btrfs balance start -v -dvrange=7174898057216..7174898057217 /local/smr

[74250.913273] BTRFS info (device dm-0): relocating block group
7174898057216 flags data
[74255.941105] BTRFS warning (device dm-0): csum failed root -9 ino
257 off 515567616 csum 0x589cb236 expected csum 0xee19bf74 mirror 1
[74255.965804] BTRFS warning (device dm-0): csum failed root -9 ino
257 off 515567616 csum 0x589cb236 expected csum 0xee19bf74 mirror 1


Re: btrfs check --check-data-csum malfunctioning?

2017-05-24 Thread Henk Slager
On Wed, Apr 19, 2017 at 11:44 AM, Henk Slager <eye...@gmail.com> wrote:

> I also have a WD40EZRX and the fs on it is also almost exclusively a
> btrfs receive target and it has now for the second time csum (just 5 )
> errors. Extended selftest at 16K hours shows no problem and I am not
> fully sure if this is a magnetic media error case or something else.

I have now located the 20K (sequential) of bad csums in a 4G file, and
the physical chunk address. I then read that 1G chunk to a file and
wrote it back to the same disk location. No I/O errors in dmesg, so my
assumption is that the 20K bad spot has been remapped to good spares.
Or it was a btrfs or luks fault, or just a spurious random write due
to some SW/HW glitch.
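The read/write-back can be done with a dd round trip of roughly this
form (the device name and OFFSET, the physical device offset of the
1GiB chunk, are placeholders; the fs was not mounted at that point):

# dd if=/dev/sdX of=/tmp/chunk.img bs=1M count=1024 iflag=skip_bytes skip=$OFFSET
# dd if=/tmp/chunk.img of=/dev/sdX bs=1M oflag=seek_bytes seek=$OFFSET conv=fsync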

As a sort of lock on the bad area, I did a cp --reflink of the 4G file
to the root of the fs, and a read-writeback of the 20K spot in the 4G
file in the send-source fs. So now, after another differential
receive, I remove all but the latest snapshot. The 5 csum errors will
then sit there, pinned, as long as I don't balance. Then, just before
I do a btrfs replace (if I decide to), I delete the 4G file and make
sure the cleaner has finished, so that the replace will not fail on
the 5 bad csums.

The fs on the WD40EZRX is just another clone/backup, but with quite a
complex subvolume tree. The above actions + replace are more fun, and
faster, than cloning again or recreating the tree with rsync etc. I
have done similar things in the past, when csum errors were clearly
due to btrfs bugs but with good HDDs.


Re: btrfs check --check-data-csum malfunctioning?

2017-04-19 Thread Henk Slager
> At 04/18/2017 08:41 PM, Werner Braun wrote:
>>
>> Hi,
>>
>> i have a WD WD40EZRX with strange beaviour off btrfs check vs. btrfs scrub
>>
>> running btrfs check --check-data-csum returns no errors on the disk
>>
>> running btrfs scrub on the disk finds tons of errors
>>
>> i could clear the disk and send it to anyone intrested in ;-)
>>
>>
> That's because --check-data-csum will only check the first copy of data if
> the first copy is good.

So is the conclusion that all the csum errors are in the metadata?
What is the profile of the fs? (Not dup for metadata, I assume?)

I also have a WD40EZRX and the fs on it is also almost exclusively a
btrfs receive target; it has now, for the second time, csum errors
(just 5). An extended self-test at 16K hours shows no problem and I am
not fully sure whether this is a magnetic media error or something else.


Re: send snapshot from snapshot incremental

2017-03-29 Thread Henk Slager
On Wed, Mar 29, 2017 at 12:01 AM, Jakob Schürz
 wrote:
[...]
> There is Subvolume A on the send- and the receive-side.
> There is also Subvolume AA on the send-side from A.
> The parent-ID from send-AA is the ID from A.
> The received-ID from A on received-side A is the ID from A.
>
> To send the AA, i use the command
> btrfs send -p A AA|btrfs receive /path/to/receiveFS/
>
> The received-ID from AA on received-side A is the ID from AA.
>
>
> Now i take a snapshot from AA on the send-side -> Called AAA
Is AAA readonly?

> If i try to send AAA to the receiving FS with
> btrfs send -p AA AAA|btrfs receive /path/to/receiveFS/
> no parent snapshot is found.
Something else is wrong, as this should simply work. Assuming the
subvol AAA was created on the send side like this:
btrfs sub snap -r AA AAA

then the send -p should work.

> I should better take the -c Option.
>
> btrfs send -c AA AAA|btrfs receive /path/to/receiveFS/
>
> Am I right?
> (Sorry, cannot test it now, i do not have my external Drive here)
> I might remember, that this didn't work in the past...

I once had a working (test) situation with the -c option where send
and receive were on the same machine. But I use send|receive mostly
over long-distance and/or mobile (metered) links, so I don't prefer
the -c option, as besides the diffs in data, way more metadata
exchange is done. And normally, in order to be able to do meaningful
incremental-difference checking, you need a single chain of
transactions.

It looks like you want to, or have been, changing snapshots on the
receive side. If you want those changes played back to the send side,
you still need at least 1 older read-only pair on the send and receive
side to do the increment from, not only for the -p option but also for
the -c option. And after that, you likely need to do some merging on
the send side.

I think 2 redundancy needs are being mixed up, hence the confusion
w.r.t. send|receive:
1- redundancy at the content level, e.g. you make some changes to a
script, C source or whatever document: read-only snapshots within the
same filesystem on the send side alone should be OK for that.
2- redundancy at the device level: a HDD or SSD might fail, so you
want a backup; RAID is meant for that.

Using btrfs send|receive for both 1 and 2 is fine and works well, I
can say, as long as you keep strict rules about what your
master/working filesystem/subvolume is and always keep that pair of
identical older read-only snapshots on the send and receive side.
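As a minimal sketch of that scheme, with all snapshots read-only and
paths as placeholders:

# btrfs sub snap -r /data /data/.snaps/2017-03-01
# btrfs send /data/.snaps/2017-03-01 | btrfs receive /backup/.snaps
(a month later)
# btrfs sub snap -r /data /data/.snaps/2017-04-01
# btrfs send -p /data/.snaps/2017-03-01 /data/.snaps/2017-04-01 | btrfs receive /backup/.snaps

Only once the new pair exists on both sides can the older 2017-03-01
pair be removed.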


Re: btrfs fi du is unreliable

2016-12-04 Thread Henk Slager
On Sun, Dec 4, 2016 at 7:30 PM, Chris Murphy  wrote:
> Hi,
>
> [chris@f25s ~]$ uname -r
> 4.8.11-300.fc25.x86_64
> [chris@f25s ~]$ rpm -q btrfs-progs
> btrfs-progs-4.8.5-1.fc26.x86_64
>
>
> I'm not finding any pattern to this so far, but it's definitely not
> always reliable. Here is today's example.
>
> [chris@f25s ~]$ sudo btrfs fi du -s /mnt/second/jackson.2015/
>  Total   Exclusive  Set shared  Filename
>  111.67GiB   111.67GiB   0.00B  /mnt/second/jackson.2015/
>
> Super. This is correct. The jackson.2015 subvolume contains only
> exclusively used data, there are no snapshots, and there are no
> reflinks involved for any files.
>
>
> [chris@f25s ~]$ sudo btrfs send  /mnt/second/jackson.2015/ | sudo
> btrfs receive /mnt/int/
>
> Send the subvolume to a different btrfs volume. This is not an
> incremental send. After completion however:
>
>
> [chris@f25s ~]$ sudo btrfs fi du -s /mnt/int/jackson.2015/
>  Total   Exclusive  Set shared  Filename
>  111.67GiB37.20GiB29.93GiB  /mnt/int/jackson.2015/
>
> That makes zero sense. What's going on here?

I tried to reproduce, with progs from git v4.8.5 and kernel
4.8.10-1-default (tumbleweed):

# /net/src/btrfs-progs/btrfs send testfidu |
/net/src/btrfs-progs/btrfs receive /local/omedia/

# /net/src/btrfs-progs/btrfs fi du -s ./testfidu/
 Total   Exclusive  Set shared  Filename
  42.56GiB42.56GiB   0.00B  ./testfidu/

# /net/src/btrfs-progs/btrfs fi du -s /local/omedia/testfidu/
 Total   Exclusive  Set shared  Filename
  42.56GiB42.56GiB   0.00B  /local/omedia/testfidu/

There are no btrfs changes between kernels 4.8.10 and 4.8.11. There is
no compress mount option in my case; that is the only thing I can
currently think of that could make your Set shared number non-zero.


btrfs restore differs from normal copy

2016-12-04 Thread Henk Slager
Hi all,

I noticed that a monthly differential snapshot creation ended with an
error, although the created snapshot itself seemed OK. To test and be
more confident, I also transferred the diff between the 2016-12-01 and
2016-12-02 snapshots, and that went without an error from send or
receive. The send runs on a server and heavily relies on 'Received
UUID'. The receive runs at the same time on a fanless system that is
only a monthly btrfs backup receive target, if all works well.

After the 2016-12-01 snapshot receive, the fs refused to mount after
reboot, and as with an earlier mount refusal, I did a clear_cache
mount and that made it work again. A scrub revealed 2 csum errors,
each in a VM image file (35GB and 16GB), so I thought I'd use btrfs
restore this time to fix the backup and prepare for next month,
without growing the fs by 51GB, and also to get hints on the root
cause of the csum errors.

The 16GB VM image file only exists in the 2016-12-01 snapshot (after I
deleted the 2016-12-02 snapshot), so with a regex I just fetched that
file. I had to press 'y' once on the "We seem to be looping a lot
..." prompt.
I ran btrfs restore with the -o -i -s -v options (and --path-regex).
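The invocation was roughly of this form; the device, output directory
and the regex contents are placeholders, and note that --path-regex
must also match every parent directory of the wanted file, hence the
nested empty alternations:

# btrfs restore -o -i -s -v --path-regex '^/(|2016-12-01(|/images(|/vm\.img)))$' /dev/sdX /tmp/restored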

# dd_rescue -W  
showed about 11GB diff, which is way more than the 4k block with failed csum.

I did the same for another, older VM image file (no csum error blocks)
in the 2016-12-01 snapshot. That gave a 2GB diff. If I mount the fs
and just copy the file instead of using btrfs restore, the diff is 0,
as expected.

On both computers:
# uname -r
4.8.10-1-default
# btrfs --version
btrfs-progs v4.8.2+20161031

The mount options of the receive fs (everything is default CoW):
rw,noatime,compress=lzo,nossd,space_cache,subvolid=5,subvol=/

Any idea why btrfs restore produces wrong file data?


Re: Btrfs progs release 4.8.3

2016-11-13 Thread Henk Slager
On Fri, Nov 11, 2016 at 4:38 PM, David Sterba  wrote:
> Hi,
>
> btrfs-progs version 4.8.3 have been released. Handful of fixes and lots of
> cleanups.
>
> Changes:
>   * check:
> * support for clearing space cache (v1)
> * size reduction of inode backref structure
>   * send:
> * fix handling of multiple snapshots (-p and -c options)
> * transfer buffer increased (should reduce number of context switches)
> * reuse existing file for output (-f), eg. when root cannot create files 
> (NFS)
>   * dump-tree:
> * print missing items for various structures
> * new: dev stats, balance status item
> * sync key names with kernel (the persistent items)
>   * subvol show: now able to print the toplevel subvolume -- the creation time
> might be wrong though
>   * mkfs:
> * store the creation time of toplevel root inode

It looks like commit 5c4d53450b2c6ff7169c99f9158c14ae96b7b0a8
("btrfs-progs: mkfs: store creation time of the toplevel subvolume")
is not enough to display the creation time of a newly created fs with
tools v4.8.3, or am I missing something?

With kernel 4.8.6-2-default as well as 4.9.0-rc4-1-default:

# /net/src/btrfs-progs/mkfs.btrfs -L ctimetest -m single /dev/loop0
btrfs-progs v4.8.3
See http://btrfs.wiki.kernel.org for more information.

Performing full device TRIM (100.00GiB) ...
Label:  ctimetest
UUID:   d65486f0-368b-4b2a-962b-176cd945feb5
Node size:  16384
Sector size:4096
Filesystem size:100.00GiB
Block group profiles:
  Data:             single            8.00MiB
  Metadata:         single            8.00MiB
  System:           single            4.00MiB
SSD detected:   no
Incompat features:  extref, skinny-metadata
Number of devices:  1
Devices:
    ID        SIZE  PATH
1   100.00GiB  /dev/loop0

# mount /dev/loop0 /mnt
# cd /mnt
# ls > test
# sync
# /net/src/btrfs-progs/btrfs sub sh .
/mnt
Name:   
UUID:   -
Parent UUID:-
Received UUID:  -
Creation time:  -
Subvolume ID:   5
Generation: 9
Gen at creation:0
Parent ID:  0
Top level ID:   0
Flags:  -
Snapshot(s):

I noticed that btrfs_set_stack_timespec_sec(&root_item.otime, now);
is called twice during mkfs.btrfs, but during btrfs sub sh,
get_ri.otime is 0 just before it is formatted/printed. My idea was to
patch the code (kernel and/or progs) such that I can also put a time
in some existing filesystems.

> * print UUID in the summary
>   * build: travis CI for devel
>   * other:
> * lots of cleanups and refactoring
> * switched to on-stack path structure
> * fixes from coverity, asan, ubsan
> * new tests
> * updates in testing infrastructure
> * fixed convert test 005
>
> Changes since rc1:
>   * fixed convert test 005
>   * updates in testing infrastructure
>   * mkfs: print UUID in the summary
>
> Tarballs: https://www.kernel.org/pub/linux/kernel/people/kdave/btrfs-progs/
> Git: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git
>
> Shortlog:
>
> David Sterba (147):
>   btrfs-progs: tests: switch to dump- commands from inspect
>   btrfs-progs: convert: switch more messages to common helpers
>   btrfs-progs: qgroups show: handle errors when resolving root id
>   btrfs-progs: remove unused function btrfs_get_path_rootid
>   btrfs-progs: rename lookup_ino_rootid
>   btrfs-progs: use existing rootid resolving helper in 
> btrfs_list_get_path_rootid
>   btrfs-progs: opencode trivial helper __free_all_subvolumn
>   btrfs-progs: remove leading underscores from several helper
>   btrfs-progs: use symbolic tree name when searching
>   btrfs-progs: list: constify prefix arugment
>   btrfs-progs: use enum for list layout type
>   btrfs-progs: list: rename some helpers
>   btrfs-progs: list: switch to common message helpers
>   btrfs-progs: subvol list: setup list filters later
>   btrfs-progs: return void from btrfs_list_setup_filter
>   btrfs-progs: subvol list: cleanup layout argument setup
>   btrfs-progs: subvol list: remove useless comments
>   btrfs-progs: subvol list: simplify value assignments
>   btrfs-progs: subvol list: consilidate naming of otime varaibles
>   btrfs-progs: subvol list: add simplified helper for adding root backrefs
>   btrfs-progs: subvol list: consolidate uuid types accross functions
>   btrfs-progs: remove trivial helper root_lookup_init
>   btrfs-progs: subvol list: remove ugly goto construct
>   btrfs-progs: subvol list: better error message if subvol insertion fails
>   btrfs-progs: subvol show: print more details about toplevel subvolume
>   btrfs-progs: dump-tree: print missing dev_item data
>   btrfs-progs: dump-tree: print missing chunk data
> 

Re: Small fs

2016-09-12 Thread Henk Slager
> FWIW, I use BTRFS for /boot, but it's not for snapshotting or even the COW,
> it's for DUP mode and the error recovery it provides.  Most people don't
> think about this if it hasn't happened to them, but if you get a bad read
> from /boot when loading the kernel or initrd, it can essentially nuke your
> whole system.  I run BTRFS for /boot in DUP mode with mixed-bg (because I
> only use 512MB for boot) to mitigate the chance that a failed read has any
> impact, and ensure that if it does, it will refuse to boot instead of
> booting with a corrupted kernel or initrd.

Suppose kernel and initrd are on a BTRFS fs with data, metadata and
system all single profile. Will a bootloader then just continue
booting up a system even when there are csum errors in kernel and/or
initrd files?  Suppose the bootloader is grub2.


Re: About minimal device number for RAID5/6

2016-08-15 Thread Henk Slager
On Mon, Aug 15, 2016 at 8:30 PM, Hugo Mills  wrote:
> On Mon, Aug 15, 2016 at 10:32:25PM +0800, Anand Jain wrote:
>>
>>
>> On 08/15/2016 10:10 PM, Austin S. Hemmelgarn wrote:
>> >On 2016-08-15 10:08, Anand Jain wrote:
>> >>
>> >>
>> IMHO it's better to warn user about 2 devices RAID5 or 3 devices RAID6.
>> 
>> Any comment is welcomed.
>> 
>> >>>Based on looking at the code, we do in fact support 2/3 devices for
>> >>>raid5/6 respectively.
>> >>>
>> >>>Personally, I agree that we should warn when trying to do this, but I
>> >>>absolutely don't think we should stop it from happening.

About a year ago I had a raid5 array in a disk upgrade situation from
5x 2TB to 4x 4TB. As an intermediate state I had 2x 2TB + 2x 4TB for
several weeks. The 2x 2TB were getting really full and the fs was
slow. Just wondering if an ENOSPC would happen, I started a file-write
task writing several hundred GB and to my surprise it simply worked. At
some point, chunks occupying only the 4TB disks must have been
created. I also saw the expected write rate on the 4TB disks. CPU load
was not especially high as far as I remember, similar to a raid1 fs.

So it is good that in such a situation one can still use the fs. I
don't remember whether the allocated/free space accounting was correct,
probably not, but I did not fill up the whole fs to see/experience
that.

I have no strong opinion on whether we should warn about the number of
devices at mkfs time for raid56. It's just that the other known issues
with raid56 draw more attention.
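
As a side note, the degenerate 2-device raid5 case is cheap to play
with using loop devices, so no spare disks are needed (sizes and paths
are just an example, assuming the two loop devices end up as
loop0/loop1):

# truncate -s 4G /var/tmp/d1.img /var/tmp/d2.img
# losetup -f /var/tmp/d1.img
# losetup -f /var/tmp/d2.img
# mkfs.btrfs -d raid5 -m raid5 /dev/loop0 /dev/loop1
# mount /dev/loop0 /mnt
# btrfs fi usage /mnt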

>> >> How does 2 disks RAID5 work ?
>> >One disk is your data, the other is your parity.
>>
>>
>> >In essence, it works
>> >like a really computationally expensive version of RAID1 with 2 disks,
>> >which is why it's considered a degenerate configuration.
>>
>>How do you generate parity with only one data ?
>
>For plain parity calculations, parity is the value p which solves
> the expression:
>
> x_1 XOR x_2 XOR ... XOR x_n XOR p = 0
>
> for corresponding bits in the n data volumes. With one data volume,
> n=1, and hence p = x_1.
>
>What's the problem? :)
>
>Hugo.
>
>> -Anand
>>
>>
>> > Three disks in
>> >RAID6 is similar, but has a slight advantage at the moment in BTRFS
>> >because it's the only way to configure three disks so you can lose two
>> >and not lose any data as we have no support for higher order replication
>> >than 2 copies yet.
>
> --
> Hugo Mills | I always felt that as a C programmer, I was becoming
> hugo@... carfax.org.uk | typecast.
> http://carfax.org.uk/  |
> PGP: E2AB1DE4  |


Re: Send-recieve performance

2016-07-22 Thread Henk Slager
On Wed, Jul 20, 2016 at 11:15 AM, Libor Klepáč  wrote:
> Hello,
> we use backuppc to backup our hosting machines.
>
> I have recently migrated it to btrfs, so we can use send-recieve for offsite 
> backups of our backups.
>
> I have several btrfs volumes, each hosts nspawn container, which runs in 
> /system subvolume and has backuppc data in /backuppc subvolume
> .
> I use btrbk to do snapshots and transfer.
> Local side is set to keep 5 daily snapshots, remote side to hold some 
> history. (not much yet, i'm using it this way for few weeks).
>
> If you know backuppc behaviour: for every backup (even incremental), it 
> creates full directory tree of each backed up machine even if it has no 
> modified files and places one small file in each, which holds some info for 
> backuppc.
> So after few days i ran into ENOSPACE on one volume, because my metadata 
> grow, because of inlineing.
> I switched from mdata=DUP to mdata=single (now I see it's possible to change 
> inline file size, right?).

I would try mounting both the send and receive volumes with max_inline=0.
Then, for all small new and changed files, the file data will be
stored in data chunks and not inline in the metadata chunks.

That you changed the metadata profile from dup to single is unrelated
in principle. single for metadata instead of dup is half the write I/O
for the harddisks, so in that sense it might speed up send actions a
bit. I guess almost all time is spent in seeks.
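
A minimal sketch of what I mean, assuming /dev/sdg keeps the options
you listed and only max_inline is added (mount options apply to the
whole fs, so the subvol mounts get it too):

# umount /mnt/btrfs/hosting
# mount -o noatime,max_inline=0 /dev/sdg /mnt/btrfs/hosting

Note that already-inlined file data only moves out of the metadata when
those files are rewritten.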

> My problem is, that on some volumes, send-recieve is relatively fast (rate in 
> MB/s or hundreds of kB/s) but on biggest volume (biggest in space and biggest 
> in contained filesystem trees) rate is just 5-30kB/s.
>
> Here is btrbk progress copyed
> 785MiB 47:52:00 [12.9KiB/s] [4.67KiB/s]
>
> ie. 758MB in 48 hours.
>
> Reciever has high IO/wait - 90-100%, when i push data using btrbk.
> When I run dd over ssh it can do 50-75MB/s.

It looks like the send side is the speed bottleneck; you can test
and isolate it by doing a dummy send, piping it to  | mbuffer >
/dev/null  and seeing what speed you get.
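
Something like this, with the snapshot names as placeholders:

# btrfs send -p <previous-ro-snapshot> <latest-ro-snapshot> | mbuffer > /dev/null

If that already only does a few kB/s, the receiving side and the
network are not the problem.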

> Sending machine is debian jessie with kernel 4.5.0-0.bpo.2-amd64 (upstream 
> 4.5.3) , btrfsprogs 4.4.1. It is virtual machine running on volume exported 
> from MD3420, 4 SAS disks in RAID10.
>
> Recieving machine is debian jessie on Dell T20 with 4x3TB disks in MD RAID5 , 
> kernel is 4.4.0-0.bpo.1-amd64 (upstream 4.4.6), btrfsprgos 4.4.1
>
> BTRFS volumes were created using those listed versions.
>
> Sender:
> -
> #mount | grep hosting
> /dev/sdg on /mnt/btrfs/hosting type btrfs 
> (rw,noatime,space_cache,subvolid=5,subvol=/)
> /dev/sdg on /var/lib/container/hosting type btrfs 
> (rw,noatime,space_cache,subvolid=259,subvol=/system)
> /dev/sdg on /var/lib/container/hosting/var/lib/backuppc type btrfs 
> (rw,noatime,space_cache,subvolid=260,subvol=/backuppc)
>
> #btrfs filesystem usage /mnt/btrfs/hosting
> Overall:
> Device size: 840.00GiB
> Device allocated:815.03GiB
> Device unallocated:   24.97GiB
> Device missing:  0.00B
> Used:522.76GiB
> Free (estimated):283.66GiB  (min: 271.18GiB)
> Data ratio:   1.00
> Metadata ratio:   1.00
> Global reserve:  512.00MiB  (used: 0.00B)
>
> Data,single: Size:710.98GiB, Used:452.29GiB
>/dev/sdg  710.98GiB
>
> Metadata,single: Size:103.98GiB, Used:70.46GiB
>/dev/sdg  103.98GiB

This is a very large metadata/data ratio. Large and scattered
metadata, even on fast rotational media, results in a slow send
operation in my experience (incremental send, about 10G metadata). So
hopefully, once all the small files and the many directories from
backuppc are in data chunks and the metadata is significantly smaller,
send will be faster. However, maybe it is just the huge number of
files, and not the inlining of small files, that makes the metadata so
big.

I assume incremental send of snapshots is being done.

> System,DUP: Size:32.00MiB, Used:112.00KiB
>/dev/sdg   64.00MiB
>
> Unallocated:
>/dev/sdg   24.97GiB
>
> # btrfs filesystem show /mnt/btrfs/hosting
> Label: 'BackupPC-BcomHosting'  uuid: edecc92a-646a-4585-91a0-9cbb556303e9
> Total devices 1 FS bytes used 522.75GiB
> devid1 size 840.00GiB used 815.03GiB path /dev/sdg
>
> #Reciever:
> #mount | grep hosting
> /dev/mapper/vgPecDisk2-lvHostingBackupBtrfs on /mnt/btrfs/hosting type btrfs 
> (rw,noatime,space_cache,subvolid=5,subvol=/)
>
> #btrfs filesystem usage /mnt/btrfs/hosting/
> Overall:
> Device size: 896.00GiB
> Device allocated:604.07GiB
> Device unallocated:  291.93GiB
> Device missing:  0.00B
> Used:565.98GiB
> Free (estimated):313.62GiB  (min: 167.65GiB)
> Data ratio:   1.00
> Metadata ratio:  

Re: Status of SMR with BTRFS

2016-07-17 Thread Henk Slager
>>> It's a Seagate Expansion Desktop 5TB (USB3). It is probably a
>>> ST5000DM000.
>>
>>
>> this is TGMR not SMR disk:
>>
>> http://www.seagate.com/www-content/product-content/desktop-hdd-fam/en-us/docs/100743772a.pdf
>> So it still confirms to standard record strategy ...
>
>
> I am not convinced. I had not heared TGMR before. But I find TGMR as a
> technology for the head.
> https://pics.computerbase.de/4/0/3/4/4/29-1080.455720475.jpg
>
> In any case: the drive behaves like a SMR drive: I ran a benchmark on it
> with up to 200MB/s.
> When copying a file onto the drive in parallel the rate in the benchmark
> dropped to 7MB/s, while that particular file was copied at 40MB/s.

It is very well possible that for a normal drive of 4TB or so you get
this kind of behaviour. Suppose you have 2 tasks, one writing with 4k
blocksize to a 1G file at the beginning of the disk and the second with
4k blocksize to a 1G file at the end of the disk. At the beginning you
get a sustained ~150MB/s, at the end ~75MB/s, but between every 4k
write (or read) you move the head(s), so ~4ms is lost each time; in the
worst case that limits each stream to the order of 1MB/s (4KiB per
~4ms), and only the drive's caching and reordering keep the real
numbers above that.

I was wondering how big the zones etc are and hopefully this is still true:
http://blog.schmorp.de/data/smr/fast15-paper-aghayev.pdf


> https://github.com/kdave/drafts/blob/master/btrfs/smr-mode.txt
> And this does sound like improvements to BTRFS can be done for SMR in a
> generic, not vendor/device specific manner.

Maybe have a look at recent patches from Hannes R from SUSE (to 4.7
kernel AFAIK) and see what will be possible with Btrfs once this
'zone-handling' is all working on the lower layers. Currently, there
is nothing special in Btrfs for SMR drives in recent kernels, but in
my experience it works, if you keep device-managed SMR
characteristics/limitations in mind. Maybe like a tape-archive or
dvd-burner.


Re: Status of SMR with BTRFS

2016-07-17 Thread Henk Slager
On Sun, Jul 17, 2016 at 10:26 AM, Matthias Prager
 wrote:

> from my experience btrfs does work as badly with SMR drives (I only had
> the opportunity to test on a 8TB Seagate device-managed drive) as ext4.
> The initial performance is fine (for a few gigabytes / minutes), but
> drops of a cliff as soon as the internal buffer-region for
> non-sequential writes fills up (even though I tested large file SMB
> transfers).

What kernel (version) did you use?
I hope it included:
http://git.kernel.org/cgit/linux/kernel/git/mkp/linux.git/commit/?h=bugzilla-93581&id=7c4fbd50bfece00abf529bc96ac989dd2bb83ca4

so >= 4.4, as without this patch it is quite problematic, if not
impossible, to use this 8TB Seagate SMR drive with linux without doing
other patches or setting/module changes.

Since this patch, I have been using the drive for cold storage
archiving, connected to a Baytrail SoC SATA port. I use bcache
(writethrough or writearound) on an 8TB GPT partition that holds a LUKS
container that is Btrfs formatted with m-dup, d-single and mounted with
compress=lzo,noatime,nossd. It is only powered on once a month for a
day or so, and then it mostly receives incremental snapshots or some
SSD or flash images of 10-50G.
I have more or less kept all the snapshots so far, so chunks keep being
added to previously unwritten space, so writes stay as sequential as
possible.
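
For what it's worth, that stack was set up roughly like this (device
names are placeholders, and the bcache cache device attachment is left
out here):

# make-bcache -B /dev/sdX1
# cryptsetup luksFormat /dev/bcache0
# cryptsetup luksOpen /dev/bcache0 smr8tb
# mkfs.btrfs -m dup -d single /dev/mapper/smr8tb
# mount -o compress=lzo,noatime,nossd /dev/mapper/smr8tb /mnt/archive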

If free space were heavily fragmented, the files themselves heavily
fragmented, and the disk very full, adding new files or modifying them
would be very slow. You then see, for many seconds, that the drive is
active but there is no traffic on the SATA link. There is also the risk
that the default '/sys/block/$(kerneldevname)/device/timeout' of 30
secs is too low and that the kernel resets the SATA link.
A SATA link reset still happened twice in the last half year; I haven't
really looked at the details so far, just rebooted at some later point,
but I will set the timeout higher, e.g. 180, and then see if ATA
errors/resets still occur. It might be FW crashes as well.
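
Raising the timeout is just this (sdX being the SMR drive; it does not
survive a reboot, so put it in a boot script or udev rule if it helps):

# cat /sys/block/sdX/device/timeout
30
# echo 180 > /sys/block/sdX/device/timeout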

> The only file system that worked really well with the 8TB Seagate SMR
> drive was f2fs. I used 'mkfs.f2fs -o 0 -a 0 -s 9 /dev/sdx' to create one
> and mounted it with noatime. -o means no additional over provisioning
> (the 5% default is a lot of wasted space on a 8TB drive), -a 0 tells
> f2fs not to use separate areas on the disks at the same time (which does
> not perform well on hdds only on ssds) and finally -s 9 tells f2fs to
> layout the file system in 1GB chunks.
> I hammered this file system for some days (via SMB and via shred-script)
> and it worked really well (performance and stability wise).

Interesting that f2fs works well, although thinking about it a bit, I
am not so surprised that it works better than ext4.

> I am considering using SMR drives for the next upgrades in my storage
> server in the basement - the only things missing in f2fs are checksums
> and raid1 support. But in my current setup (md-raid1+ext4) I don't get
> checksums either so f2fs+smr is still on my road-map. Long term, I would
> really like to switch to btrfs with it's built-in check summing (which
> unfortunately does not work with NOCOW) and raid1. But some of the file
> systems are almost 100% filled and I'm not trusting btrfs's stability
> yet (and the manageability / handling of btrfs lacks behind compared to
> say zfs).

At least this SMR drive is not advised for use in raid setups. As a
not-so-active array it might work if you use the right timeouts and
scterc etc., but I have seen how long the wait on the SATA link can be,
and that makes me realize that the 'Archive Drive' stamp put on it by
Seagate has a clear reason.


Re: rollback to a snapshot and delete old top volume - missing of "@"

2016-07-08 Thread Henk Slager
On Fri, Jul 8, 2016 at 11:50 PM, Chris Murphy  wrote:
> On Fri, Jul 8, 2016 at 3:39 PM, Chris Murphy  wrote:
>> On Fri, Jul 8, 2016 at 2:08 PM, Kai Herlemann  wrote:
>>
>>> If here any developers read along: I'd like to suggest that there's
>>> automatically made a subvolume "@" by default, which is set as default
>>> subvolume, or a tip to the distribution, that it would made sense to do that
>>> with the installation. It would protect other users against confusion and
>>> work like I had it.
>>
>> I think that upstream won't do that or recommend it. There is already
>> a subvolume created at mkfs time, that's subvolid=5 (a.k.a. 0) and it
>> is set as the default subvolume. I don't see the point in having two
>> of them. If you want it, make it. If your distro wants it, it should
>> be done in the installer, not mkfs.
>>
>> Further I think it's inappropriate to take 'btrfs sub set-default'
>> away from the user. That is a user owned setting. It is not OK for
>> some utility to assert domain over that setting, and depend on it for
>> proper booting. It makes the entire boot process undiscoverable,
>> breaks self-describing boot process which are simpler to understand
>> and troubleshoot, in favor of secret decoder ring booting that now
>> requires even more esoteric knowledge on the part of users. So I think
>> it's a bad design.
>>
>> Instead those utilities should employ rootflags=subvol or subvolid to
>> explicitly use a particular fs tree for rootfs, rather that hide this
>> fact by using subvolume set-default.
>
> The only distro installer I know that works this way out of the box is
> Fedora/Red Hat's Anaconda. It leaves the default subvolume as 5, but
> does not install the OS there. Instead each mountpoint is created as a
> subvolume in that top level, and rootflags kernel parameter and fstab
> are used to assemble those subvolumes per the FHS virtually. It's
> completely discoverable, you can follow each step along the way, it's
> not obscured.
>
> The additional benefit is no nested subvolumes.
>
> A possible improvement for those distros that will likely continue
> doing things the way they are, would be if the kernel code stated what
> fs tree ID was being mounted when the default subvolume is not 5, and
> neither subvol nor subvolid mount options were used. *shrug*

On a running system as non-root:
$ mount | grep "on / type btrfs"
/dev/sda1 on / type btrfs
(rw,noatime,compress=lzo,ssd,discard,space_cache,subvolid=2429,subvol=/@/latestrootfs)

On an image of a disk, or some separate disk with the rootfs tree
mounted somewhere, I agree that it might look 'hidden'; you will have
to realize that the filesystem is Btrfs and that the default subvol
might not be 5, but btrfs sub list / gives the answer to what more is
in the pool.
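
So for such an image or foreign disk, something like this shows what is
going on (device name is a placeholder):

# mount /dev/sdX1 /mnt
# btrfs subvolume get-default /mnt
# btrfs subvolume list /mnt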


Re: Frequent btrfs corruption on a USB flash drive

2016-07-08 Thread Henk Slager
>> Device is GOOD
>>
>> I also created a big file with dd using /dev/urandom with the same size
>> as my flash drive, copied it once and read it three times. The SHA-1
>> checksum is always the same and matches the original one on the hard disk.
>>
>> So after much testing I feel I can conclude that my USB flash drive is
>> not fake and it is not defective.
>>
> For what it's worth, there's multiple other things that could cause similar
> issues.  I've had a number of cases where bad USB hubs or poorly designed
> (or just buggy or failing) USB controllers caused similar data corruption,
> the most recent one being an issue with both a bad USB 2.0 hub (which did
> not properly implement the USB standard, counterfeit USB devices come in all
> types) and a malfunctioning USB 3.0 controller (which did not properly
> account for things that didn't properly implement the standard and had no
> recovery code to handle this in the drivers).  I ended up in most cases
> checking the ports using other USB devices (at least a keyboard, a mouse,
> and a USB serial adapter).

Similar as Austin, I also want to note that there might be USB related
issues that only pop-up after some time and not in tests.

For example, this weekend I connected a 2.5inch 500G drive with its
Y-cable to a H87M-Pro board that is fed by a 80+Gold PSU, despite its
many 'bad sectors' I remembered from 2 years ago in a btrfs raid1
setup. This 500G disk has worked well for almost 2 years connected to
a 7-inch eeepc4G, XFS formatted. But with the H87M-Pro I just now saw
that it dropped off the USB every now and then, causing trouble for
Btrfs.

For connecting harddisks to phones, I once bought an externally powered
hub, and I put that between the board and the 500G disk => that made
it all stable, no disconnects, and Btrfs works fine as expected. I had
similar issues on another PC with a Sandisk Extreme 64G USB3 stick,
but that was likely a protocol issue.

So maybe try to use the stick with your use case in another HW setup,
hopefully then it is stable for a longer time than the few days now.


Re: Out of space error even though there's 100 GB unused?

2016-07-08 Thread Henk Slager
On Fri, Jul 8, 2016 at 9:22 AM, Stanislaw Kaminski
 wrote:
> Huh.
>
> I left defrag running overnight, and now I'm back to my >200 GiB free
> space. Also, I got no "out of space" messages in Transmission, and it
> successfully downloaded few GBs.
>
> But in dmesg I have 209 traces, see attached.
>
> Does that say anything to you?

I get some vague clue about the problem, but no-one seems to know
exactly the root cause(s).
The 4.6.2 code where the warning comes from is this:
...
/*
 * Called if we need to clear a data reservation for this inode
 * Normally in a error case.
 *
 * This one will *NOT* use accurate qgroup reserved space API, just for case
 * which we can't sleep and is sure it won't affect qgroup reserved space.
 * Like clear_bit_hook().
 */
void btrfs_free_reserved_data_space_noquota(struct inode *inode, u64 start,
u64 len)
{
   struct btrfs_root *root = BTRFS_I(inode)->root;
   struct btrfs_space_info *data_sinfo;

   /* Make sure the range is aligned to sectorsize */
   len = round_up(start + len, root->sectorsize) -
   round_down(start, root->sectorsize);
   start = round_down(start, root->sectorsize);

   data_sinfo = root->fs_info->data_sinfo;
   spin_lock(&data_sinfo->lock);
   if (WARN_ON(data_sinfo->bytes_may_use < len))
   data_sinfo->bytes_may_use = 0;
   else
   data_sinfo->bytes_may_use -= len;
   trace_btrfs_space_reservation(root->fs_info, "space_info",
   data_sinfo->flags, len, 0);
   spin_unlock(&data_sinfo->lock);
}
...

I think the system is still 'recovering' from getting stuck earlier.
What exactly that is, I don't know. You would probably need to enable
more debugging facilities in order to figure out which file or
inode the problem comes from. I don't know if you can compile a 4.7
kernel for this Kirkwood SoC, but that would be one way forward. (BTW,
is it a 88F6281 or a 88F6282?)

As far as I remember, Josef Bacik has posted some patches that could
benefit this case. I am not sure if they made it into 4.7, but that is
what I think I would try.

Otherwise, there are workarounds:
- you could have a look at the CPU load during defrag and normal
operation and see how it relates to the rate at which this warning is
issued
- add the mount option noatime
- as it looks like this fs is (also) a torrent-client target, you can
put the torrents in a directory or subvol with the NoCoW flag set, or
mount the whole fs with nodatacow (see the sketch after this list)
- clear the cache again and then mount with space_cache=v2; new mounts
will then use v2 automatically. The only thing I can say is that it
helped me get out of a kernel crash situation with 4.6.2. 4.7.0-rc5
did not crash on the same fs, so I got it working again
(de-allocations and cleanups; the fs is almost exclusively a
btrfs-receive target)
- connect the fs to a multi-core x86_64 system running the same kernel
version for some time and see if you can reproduce the same type of
WARN_ONs
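
A sketch for the NoCoW and space_cache=v2 items (mountpoint and device
are placeholders; the NoCoW flag only applies to files created after it
is set on the directory):

# mkdir /mnt/data/torrents
# chattr +C /mnt/data/torrents
# umount /mnt/data
# mount -o clear_cache /dev/sdX /mnt/data
# umount /mnt/data
# mount -o space_cache=v2 /dev/sdX /mnt/data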


Re: rollback to a snapshot and delete old top volume - missing of "@"

2016-07-07 Thread Henk Slager
On Thu, Jul 7, 2016 at 7:40 PM, Chris Murphy <li...@colorremedies.com> wrote:
> On Thu, Jul 7, 2016 at 10:01 AM, Henk Slager <eye...@gmail.com> wrote:
>
>> What the latest debian likes as naming convention I dont know, but in
>> openSuSE @ is a directory in the toplevel volume (ID=5 or ID=0 as
>> alias) and that directory contains subvolumes.

Sorry, I mixed up the latest openSUSE and my own adaptations to older installations.

> No, opensuse doesn't use @ at all. They use a subvolume called
> .snapshots to contain snapper snapshots.

On current fresh install "openSUSE Tumbleweed (20160703) (x86_64)" you get this:

# btrfs sub list /
ID 257 gen 24369 top level 5 path @
ID 258 gen 24369 top level 257 path @/.snapshots
ID 259 gen 24369 top level 258 path @/.snapshots/1/snapshot
ID 265 gen 25404 top level 257 path @/tmp
ID 267 gen 24369 top level 257 path @/var/cache
ID 268 gen 20608 top level 257 path @/var/crash
ID 269 gen 20608 top level 257 path @/var/lib/libvirt/images
ID 270 gen 3903 top level 257 path @/var/lib/mailman
ID 271 gen 2996 top level 257 path @/var/lib/mariadb
ID 272 gen 3904 top level 257 path @/var/lib/mysql
ID 273 gen 3903 top level 257 path @/var/lib/named
ID 274 gen 8 top level 257 path @/var/lib/pgsql
ID 275 gen 25404 top level 257 path @/var/log
ID 276 gen 20611 top level 257 path @/var/opt
ID 277 gen 25404 top level 257 path @/var/spool
ID 278 gen 24369 top level 257 path @/var/tmp
ID 300 gen 10382 top level 258 path @/.snapshots/15/snapshot
[..]

@ is the only thing in the toplevel

I have changed it a bit for this particular PC, so that more is in one subvol.
Just after the default install, the subvol with ID 259 is made default and rw.

I had also updated my older linux installs a bit like this, but with @
as a directory, not a subvolume, so that at least I can easily swap the
'latestrootfs' subvol with something else. My interpretation of the
OP's report was that he basically wants something like that too.

> On a system using snapper, its snapshots should probably be deleted
> via snapper so it's aware of the state change.

You can do that, but you can also use btrfs sub del in re-organisation
actions like the one described here. If you delete the .xml files in
the .snapshots subvol, snapper starts counting from 1 again. Changing
the latest .xml file can make it start counting from some higher
number, if that is important for e.g. a many-months history.

> And very clearly from the OP's output from 'btrfs sub list' there are
> no subvolumes with @ in the path, so there is no subvolume @, nor are
> there any subvolumes contained in a directory @.
>
> Assuming the posted output from btrfs sub list is the complete output,
> .snapshots is a directory and there are three subvolumes in it. I
> suspect the OP is unfamiliar with snapper conventions and is trying to
> delete a snapshot outside of snapper, and is used to some other
> (Debian or Ubuntu) convention where snapshots somehow relate to @,
> which is a mimicking of how ZFS does things.
>
> Anyway the reason why the command fails is stated in the error
> message. The system appears to be installed in the top level of the
> file system (subvolid=5), and that can't be deleted. First it's the
> immutable first subvolume of a Btrfs file system, and second it's
> populated with other subvolumes which would inhibit its removal even
> if it weren't the top level subvolume.
>
> What can be done is delete the directories in the top level, retaining
> the subvolumes that are there.

Indeed, yes, as a last cleanup step.


Re: Out of space error even though there's 100 GB unused?

2016-07-07 Thread Henk Slager
On Thu, Jul 7, 2016 at 5:17 PM, Stanislaw Kaminski
 wrote:
> Hi Chris, Alex, Hugo,
>
> Running now: Linux archb3 4.6.2-1-ARCH #1 PREEMPT Mon Jun 13 02:11:34
> MDT 2016 armv5tel GNU/Linux
>
> Seems to be working fine. I started a defrag, and it seems I'm getting
> my space back:
> $ sudo btrfs fi usage /home
> Overall:
> Device size:   1.81TiB
> Device allocated:  1.73TiB
> Device unallocated:   80.89GiB
> Device missing:  0.00B
> Used:  1.65TiB
> Free (estimated):159.63GiB  (min: 119.19GiB)
> Data ratio:   1.00
> Metadata ratio:   2.00
> Global reserve:  512.00MiB  (used: 240.00KiB)
>
> Data,single: Size:1.72TiB, Used:1.65TiB
>/dev/sda4   1.72TiB
>
> Metadata,DUP: Size:3.50GiB, Used:2.16GiB
>/dev/sda4   7.00GiB
>
> System,DUP: Size:32.00MiB, Used:224.00KiB
>/dev/sda4  64.00MiB
>
> Unallocated:
>/dev/sda4  80.89GiB
>
> I deleted some unfinished torrent, ~10 GB in size, but as you can see,
> "Free space" has grown by 60 GB (re-checked now and it's 1 GB more now
> - so definitely caused by defrag).
>
> What has changed between 4.6.2 and 4.6.3?

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/diff/?id=v4.6.3&id2=v4.6.2&dt=2

I see no change for Btrfs

> Cheers,
> Stan
>
> 2016-07-07 12:28 GMT+02:00 Stanislaw Kaminski :
>> Too early report, the issue is back. Back to testing
>>
>> 2016-07-07 12:18 GMT+02:00 Stanislaw Kaminski :
>>> Hi all,
>>> I downgraded to 4.4.1-1 - all fine, 4.5.5.-1 - also fine, then got
>>> back to 4.6.3-2 - and it's still fine. Apparently running under
>>> different kernel somehow fixed the glitch (as far as I can test...).
>>>
>>> That leaves me with the other question: before issues, I 1.6 TiB was
>>> used, now all the tools report 1.7 TiB issued (except for btrfs fs du
>>> /home, this reports 1.6 TiB). How is that possible?
>>>
>>> Cheers,
>>> Stan
>>>
>>> 2016-07-06 19:42 GMT+02:00 Chris Murphy :
 On Wed, Jul 6, 2016 at 3:55 AM, Stanislaw Kaminski
  wrote:

> Device unallocated:   97.89GiB

 There should be no problem creating any type of block group from this
 much space. It's a bug.

 I would try regression testing. Kernel 4.5.7 has some changes that may
 or may not relate to this (they should only relate when there is no
 unallocated space left) so you could try 4.5.6 and 4.5.7. And also
 4.4.14.

 But also the kernel messages are important. There is this obscure
 enospc with error -28, so either with or without enospc_debug mount
 option is useful to try in 4.6.3 (I think it's less useful in older
 kernels).

 But do try nospace_cache first. If that works, you could then mount
 with clear_cache one time and see if that provides an enduring fix. It
 can take some time to rebuild the cache after clear_cache is used.



 --
 Chris Murphy


Re: fstrim problem/bug

2016-07-07 Thread Henk Slager
On Thu, Jul 7, 2016 at 11:46 AM, M G Berberich  wrote:
> Hello,
>
> On a filesystem with 40 G free space and 54 G used, ‘fstrim -v’ gave
> this result:
>
> # fstrim -v /
> /: 0 B (0 bytes) trimmed
>
> After running balance it gave a more sensible
>
> # fstrim -v /
> /: 37.3 GiB (40007368704 bytes) trimmed
>
> As far as I understand, fstrim should report any unused block to the
> disk, so its controller can reuse these blocks. I expected ’fstrim -v’
> to report about 40 G trimmed. The fact, that after balance fstrim
> reports a sensible amount of trimmed bytes leads to the conclusion,
> that fstrim on btrfs does not report unused blocks to the disk (as it
> should), but only the blocks of unused chunks. As the fstrim-command
> only does a ‘ioctl(fd, FITRIM, ))’ this seems to be a bug in the
> fstrim kernel-code.
> In the field this means, that without regularly running balance,
> fstrim does not work on btrfs.

hmm, yes indeed I see this as well:

# btrfs fi us /
Overall:
   Device size:  55.00GiB
   Device allocated: 46.55GiB
   Device unallocated:8.45GiB
   Device missing:  0.00B
   Used: 39.64GiB
   Free (estimated): 13.96GiB  (min: 13.96GiB)
   Data ratio:   1.00
   Metadata ratio:   1.00
   Global reserve:  480.00MiB  (used: 0.00B)

Data,single: Size:43.77GiB, Used:38.25GiB
  /dev/sda1  43.77GiB

Metadata,single: Size:2.75GiB, Used:1.39GiB
  /dev/sda1   2.75GiB

System,single: Size:32.00MiB, Used:16.00KiB
  /dev/sda1  32.00MiB

Unallocated:
  /dev/sda1   8.45GiB
# fstrim -v /
/: 9,3 GiB (10014126080 bytes) trimmed
# fallocate -l 5G testfile
# fstrim -v /
/: 4,3 GiB (4644130816 bytes) trimmed

Where the difference between 8.45GiB and 9,3 GiB comes from, I
currently don't understand.


Re: rollback to a snapshot and delete old top volume - missing of "@"

2016-07-07 Thread Henk Slager
On Thu, Jul 7, 2016 at 2:17 PM, Kai Herlemann  wrote:
> Hi,
>
> I want to rollback a snapshot and have done this by execute "btrfs sub
> set-default / 618".
Maybe just a typo here; the command syntax is:
# sudo btrfs sub set-default
btrfs subvolume set-default: too few arguments
usage: btrfs subvolume set-default <subvolid> <path>

   Set the default subvolume of a filesystem

> Now I want to delete the old top volume to save space, but google and
> manuals didn't helped.
>
> I mounted for the following the root volume at /mnt/gparted with subvolid=0,
> subvol=/ has the same effect.
> Usually, the top volume is saved in /@, so I would be able to delete it by
> execute "btrfs sub delete /@" (or move at first @ to @_badroot and the
> snapshot to @). But that isn't possible, the output of that command is
> "ERROR: cannot access subvolume /@: No such file or directory".
> I've posted the output of "btrfs sub list /mnt/gparted" at
> http://pastebin.com/r7WNbJq8. As you can see, there's no subvolume named @.

I think one or the other typed command didn't have its expected
effect. Just to make sure I get the right state, can you do:

mkdir -p /fsroot
mount -o subvolid=0 UUID=<UUID> /fsroot
btrfs sub list /fsroot
btrfs subvolume get-default /

What the latest Debian likes as a naming convention I don't know, but
in openSUSE @ is a directory in the toplevel volume (ID=5, or ID=0 as
an alias) and that directory contains subvolumes. You can do whatever
you like best, but at least make sure you have fstab mount entries for
subvolumes like var/cache/apt and usr/src, otherwise this magnificent
rootfs-tree snapshotting gets you into trouble.

I think your current default subvolume is still 5, so you would need:

fstab:
UUID=<UUID>   /                 btrfs   defaults                          0 0
#UUID=<UUID>  /home             btrfs   defaults,subvol=@/home            0 0
UUID=<UUID>   /usr/src          btrfs   defaults,subvol=@/usr/src         0 0
UUID=<UUID>   /var/cache/apt    btrfs   defaults,subvol=@/var/cache/apt   0 0
UUID=<UUID>   /.snapshots       btrfs   defaults,subvol=@/.snapshots      0 0
UUID=<UUID>   /fsroot           btrfs   noauto,subvolid=0                 0 0


mkdir -p /fsroot
mount -o subvolid=0 UUID=<UUID> /fsroot

mkdir -p /usr/src
mkdir -p /var/cache/apt
mkdir -p /.snapshots

mkdir -p /fsroot/@/usr
mkdir -p /fsroot/@/var/cache

btrfs sub create /fsroot/@/usr/src
btrfs sub create /fsroot/@/var/cache/apt
btrfs sub create /fsroot/@/.snapshots

#snapshots might need different, the proposed one works at least for snapper

btrfs sub snap / /fsroot/@/latestrootfs
btrfs sub set-default <ID of @/latestrootfs> /
btrfs fi sync /

# for the home fs it is similar to the root fs

reboot

Then, when you want to roll back, you can set a snapshot to rw (or
rename latestrootfs and snapshot the snapshot to that name), make it
the default subvol and reboot (or maybe also do some temporary chroot
tricks, I have not tried that).
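
A rollback would then look roughly like this (snapshot number 42 is of
course hypothetical; the new subvolume ID can be looked up with
btrfs sub list):

# mount -o subvolid=0 UUID=<UUID> /fsroot
# mv /fsroot/@/latestrootfs /fsroot/@/oldrootfs
# btrfs sub snap /fsroot/@/.snapshots/42/snapshot /fsroot/@/latestrootfs
# btrfs sub list /fsroot | grep latestrootfs
# btrfs sub set-default <ID of the new @/latestrootfs> /fsroot
# reboot

and delete @/oldrootfs later, after checking that everything is fine.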

> I have the same problem with my /home/ partition.
>
> Output of "uname -a" (self-compiled kernel):
> Linux debian-linux 4.1.26 #1 SMP Wed Jun 8 18:40:04 CEST 2016 x86_64
> GNU/Linux
>
> Output of "btrfs --version":
> btrfs-progs v4.5.2
>
> Output of "btrfs fi show":
> Label: none  uuid: f778877c-d50b-48c8-8951-6635c6e23c61
>   Total devices 1 FS bytes used 43.70GiB
>   devid1 size 55.62GiB used 47.03GiB path /dev/sda1
>
> Output of "btrfs fi df /":
> Data, single: total=44.00GiB, used=42.48GiB
> System, single: total=32.00MiB, used=16.00KiB
> Metadata, single: total=3.00GiB, used=1.22GiB
> GlobalReserve, single: total=416.00MiB, used=0.00B
>
> Output of dmesg attached.
>
> Thank you,
> Kai
>


Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.

2016-07-06 Thread Henk Slager
On Wed, Jul 6, 2016 at 2:20 PM, Tomasz Kusmierz <tom.kusmi...@gmail.com> wrote:
>
>> On 6 Jul 2016, at 02:25, Henk Slager <eye...@gmail.com> wrote:
>>
>> On Wed, Jul 6, 2016 at 2:32 AM, Tomasz Kusmierz <tom.kusmi...@gmail.com> 
>> wrote:
>>>
>>> On 6 Jul 2016, at 00:30, Henk Slager <eye...@gmail.com> wrote:
>>>
>>> On Mon, Jul 4, 2016 at 11:28 PM, Tomasz Kusmierz <tom.kusmi...@gmail.com>
>>> wrote:
>>>
>>> I did consider that, but:
>>> - some files were NOT accessed by anything with 100% certainty (well if
>>> there is a rootkit on my system or something in that shape than maybe yes)
>>> - the only application that could access those files is totem (well
>>> Nautilius checks extension -> directs it to totem) so in that case we would
>>> hear about out break of totem killing people files.
>>> - if it was a kernel bug then other large files would be affected.
>>>
>>> Maybe I’m wrong and it’s actually related to the fact that all those files
>>> are located in single location on file system (single folder) that might
>>> have a historical bug in some structure somewhere ?
>>>
>>>
>>> I find it hard to imagine that this has something to do with the
>>> folderstructure, unless maybe the folder is a subvolume with
>>> non-default attributes or so. How the files in that folder are created
>>> (at full disktransferspeed or during a day or even a week) might give
>>> some hint. You could run filefrag and see if that rings a bell.
>>>
>>> files that are 4096 show:
>>> 1 extent found
>>
>> I actually meant filefrag for the files that are not (yet) truncated
>> to 4k. For example for virtual machine imagefiles (CoW), one could see
>> an MBR write.
> 117 extents found
> filesize 15468645003
>
> good / bad ?

117 extents for a ~15G file is fine; with the -v option you could see
the fragmentation at the start, but this won't lead to any hint about
why you have the truncate issue.
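
i.e. something like (path is a placeholder):

# filefrag -v /mnt/share/victimfolder/somefile.mkv | head -n 20

which shows whether the file starts with a lot of tiny extents or with
a few large ones.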

>>> I did forgot to add that file system was created a long time ago and it was
>>> created with leaf & node size = 16k.
>>>
>>>
>>> If this long time ago is >2 years then you have likely specifically
>>> set node size = 16k, otherwise with older tools it would have been 4K.
>>>
>>> You are right I used -l 16K -n 16K
>>>
>>> Have you created it as raid10 or has it undergone profile conversions?
>>>
>>> Due to lack of spare disks
>>> (it may sound odd for some but spending for more than 6 disks for home use
>>> seems like an overkill)
>>> and due to last I’ve had I had to migrate all data to new file system.
>>> This played that way that I’ve:
>>> 1. from original FS I’ve removed 2 disks
>>> 2. Created RAID1 on those 2 disks,
>>> 3. shifted 2TB
>>> 4. removed 2 disks from source FS and adde those to destination FS
>>> 5 shifted 2 further TB
>>> 6 destroyed original FS and adde 2 disks to destination FS
>>> 7 converted destination FS to RAID10
>>>
>>> FYI, when I convert to raid 10 I use:
>>> btrfs balance start -mconvert=raid10 -dconvert=raid10 -sconvert=raid10 -f
>>> /path/to/FS
>>>
>>> this filesystem has 5 sub volumes. Files affected are located in separate
>>> folder within a “victim folder” that is within a one sub volume.
>>>
>>>
>>> It could also be that the ondisk format is somewhat corrupted (btrfs
>>> check should find that ) and that that causes the issue.
>>>
>>>
>>> root@noname_server:/mnt# btrfs check /dev/sdg1
>>> Checking filesystem on /dev/sdg1
>>> UUID: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1
>>> checking extents
>>> checking free space cache
>>> checking fs roots
>>> checking csums
>>> checking root refs
>>> found 4424060642634 bytes used err is 0
>>> total csum bytes: 4315954936
>>> total tree bytes: 4522786816
>>> total fs tree bytes: 61702144
>>> total extent tree bytes: 41402368
>>> btree space waste bytes: 72430813
>>> file data blocks allocated: 4475917217792
>>> referenced 4420407603200
>>>
>>> No luck there :/
>>
>> Indeed looks all normal.
>>
>>> In-lining on raid10 has caused me some trouble (I had 4k nodes) over
>>> time, it has happened over a year ago with kernels recent at that
>>> time, but the fs was converted from raid5
>>>
>>> Could you please

Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.

2016-07-05 Thread Henk Slager
On Wed, Jul 6, 2016 at 2:32 AM, Tomasz Kusmierz <tom.kusmi...@gmail.com> wrote:
>
> On 6 Jul 2016, at 00:30, Henk Slager <eye...@gmail.com> wrote:
>
> On Mon, Jul 4, 2016 at 11:28 PM, Tomasz Kusmierz <tom.kusmi...@gmail.com>
> wrote:
>
> I did consider that, but:
> - some files were NOT accessed by anything with 100% certainty (well if
> there is a rootkit on my system or something in that shape than maybe yes)
> - the only application that could access those files is totem (well
> Nautilius checks extension -> directs it to totem) so in that case we would
> hear about out break of totem killing people files.
> - if it was a kernel bug then other large files would be affected.
>
> Maybe I’m wrong and it’s actually related to the fact that all those files
> are located in single location on file system (single folder) that might
> have a historical bug in some structure somewhere ?
>
>
> I find it hard to imagine that this has something to do with the
> folderstructure, unless maybe the folder is a subvolume with
> non-default attributes or so. How the files in that folder are created
> (at full disktransferspeed or during a day or even a week) might give
> some hint. You could run filefrag and see if that rings a bell.
>
> files that are 4096 show:
> 1 extent found

I actually meant filefrag for the files that are not (yet) truncated
to 4k. For example for virtual machine imagefiles (CoW), one could see
an MBR write.

> I did forgot to add that file system was created a long time ago and it was
> created with leaf & node size = 16k.
>
>
> If this long time ago is >2 years then you have likely specifically
> set node size = 16k, otherwise with older tools it would have been 4K.
>
> You are right I used -l 16K -n 16K
>
> Have you created it as raid10 or has it undergone profile conversions?
>
> Due to lack of spare disks
> (it may sound odd for some but spending for more than 6 disks for home use
> seems like an overkill)
> and due to last I’ve had I had to migrate all data to new file system.
> This played that way that I’ve:
> 1. from original FS I’ve removed 2 disks
> 2. Created RAID1 on those 2 disks,
> 3. shifted 2TB
> 4. removed 2 disks from source FS and adde those to destination FS
> 5 shifted 2 further TB
> 6 destroyed original FS and adde 2 disks to destination FS
> 7 converted destination FS to RAID10
>
> FYI, when I convert to raid 10 I use:
> btrfs balance start -mconvert=raid10 -dconvert=raid10 -sconvert=raid10 -f
> /path/to/FS
>
> this filesystem has 5 sub volumes. Files affected are located in separate
> folder within a “victim folder” that is within a one sub volume.
>
>
> It could also be that the ondisk format is somewhat corrupted (btrfs
> check should find that ) and that that causes the issue.
>
>
> root@noname_server:/mnt# btrfs check /dev/sdg1
> Checking filesystem on /dev/sdg1
> UUID: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1
> checking extents
> checking free space cache
> checking fs roots
> checking csums
> checking root refs
> found 4424060642634 bytes used err is 0
> total csum bytes: 4315954936
> total tree bytes: 4522786816
> total fs tree bytes: 61702144
> total extent tree bytes: 41402368
> btree space waste bytes: 72430813
> file data blocks allocated: 4475917217792
>  referenced 4420407603200
>
> No luck there :/

Indeed looks all normal.

> In-lining on raid10 has caused me some trouble (I had 4k nodes) over
> time, it has happened over a year ago with kernels recent at that
> time, but the fs was converted from raid5
>
> Could you please elaborate on that ? you also ended up with files that got
> truncated to 4096 bytes ?

I did not have files truncated to 4k, but your case makes me think of
small-file inlining. The default max_inline mount option is 8k and that
means that files of 0 to ~3k end up in metadata. I had size corruptions
for several of those small files that were updated quite frequently,
also within the commit time AFAIK. Btrfs check lists this as errors
400, although fs operation is not disturbed. I don't know what happens
if those small files are updated/rewritten and are just below or just
above the max_inline limit.
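
Whether a particular small file is currently inlined can be seen with
filefrag as well (filename is just an example); an inlined file shows a
single extent with 'inline' among the flags:

# filefrag -v /path/to/some-small-file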

The only thing I was thinking of is that your files started out small,
so inline, and were then extended to multi-GB. In the past, there were
'bad extent/chunk type' issues and it was suggested that the fs would
have been an ext4-converted one (which had non-compliant mixed
metadata and data), but for most reporters that was not the case. So
there was/is something unclear, but a full balance or so fixed it as
far as I remember. But this is guessing; I do not have any failure
cases like the one you see.

> You might want to run the python scripts from here:
> https://github.com/knorrie/python-btrfs

Re: btrfs defrag questions

2016-07-05 Thread Henk Slager
On Tue, Jul 5, 2016 at 1:15 AM, Dmitry Katsubo <dm...@mail.ru> wrote:
> On 2016-07-01 22:46, Henk Slager wrote:
>> (email ends up in gmail spamfolder)
>> On Fri, Jul 1, 2016 at 10:14 PM, Dmitry Katsubo <dm...@mail.ru> wrote:
>>> Hello everyone,
>>>
>>> Question #1:
>>>
>>> While doing defrag I got the following message:
>>>
>>> # btrfs fi defrag -r /home
>>> ERROR: defrag failed on /home/user/.dropbox-dist/dropbox: Success
>>> total 1 failures
>>>
>>> I feel that something went wrong, but the message is a bit misleading.
>>>
>>> Provided that Dropbox is running in the system, does it mean that it
>>> cannot be defagmented?
>>
>> I think it is a matter of newlines in btrfs-progs and/or stdout/stderr mixup.
>>
>> You should run the command with -v and probably also with -f, so that
>> it gets hopefully clearer what is wrong.
>
> Running with "-v -f" (or just "-v") result the same output:
>
> ...
> /home/user/.dropbox-dist/dropbox-lnx.x86-5.4.24/select.so
> /home/user/.dropbox-dist/dropbox-lnx.x86-5.4.24/grp.so
> /home/user/.dropbox-dist/dropbox-lnx.x86-5.4.24/posixffi.libc._posixffi_libcERROR:
>  defrag failed on /home/user/.dropbox-dist/dropbox-lnx.x86-5.4.24/dropbox: 
> Success
> .so
> /home/user/.dropbox-dist/dropbox-lnx.x86-5.4.24/_functools.so
> /home/user/.dropbox-dist/dropbox-lnx.x86-5.4.24/dropbox
> /home/user/.dropbox-dist/dropbox-lnx.x86-5.4.24/_csv.so
> ...
>
> This is not a matter of newlines:
>
> $ grep -rnH 'defrag failed' btrfs-progs
> btrfs-progs/cmds-filesystem.c:1021:   error("defrag failed on %s: %s", 
> fpath, strerror(e));
> btrfs-progs/cmds-filesystem.c:1161:   error("defrag 
> failed on %s: %s", argv[i], strerror(e));
>
>> That it fails on dropbox is an error I think, but maybe known: Could
>> be mount option is compress and that that causes trouble for defrag
>> although that should not happen.
>
> True, compression is enabled.

The reason I mentioned compression is that it adds another layer
between the disk sectors and the (4K) pages in RAM. My thinking is that
if defrag is done by the kernel (which is the case AFAIK), it should
in theory be possible to hook in at some layer, so that in a case like
the one in this email defrag should be able to work.

The big question is whether it is worth the (probably complex)
implementation effort, independent of whether you consider this a bug
or a feature enhancement.

My experience with manual defrag (so initiated via btrfs-progs) is that
it can take a long time and the result can be worse than a copy-mv
sequence (a real copy, no --reflink, so done with cat or dd or rsync),
which is also way faster. This is for files in the GiB range and with
snapshots existing (no compression). With no compression and no
snapshots, it might be different.
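
The copy-mv sequence I mean is simply (filename is an example; note
that this breaks any reflink/snapshot sharing for that file):

# cat bigfile.img > bigfile.img.new
# sync
# mv bigfile.img.new bigfile.img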

>> You can defrag just 1 file, so maybe you could try to make a reproducible 
>> case.
>
> When I run it on one file, it works as expected:
>
> # btrfs fi defrag -r -v 
> /home/user/.dropbox-dist/dropbox-lnx.x86-5.4.24/dropbox
> ERROR: cannot open /home/user/.dropbox-dist/dropbox-lnx.x86-5.4.24/dropbox: 
> Text file busy
>
>> What kernel?
>> What btrfs-progs?
>
> kernel v4.4.6
> btrfs-tools v4.5.2
>
>>> Question #2:
>>>
>>> Suppose that in above example /home/ftp is mounted as another btrfs
>>> array (not subvolume). Will 'btrfs fi defrag -r /home' defragment it
>>> (recursively) as well?
>>
>> I dont know, I dont think so, but you can simply try.
>
> Many thanks, now I see how can I check this. Unfortunately it does not
> descend into submounted directories.
>
> --
> With best regards,
> Dmitry


Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.

2016-07-05 Thread Henk Slager
On Mon, Jul 4, 2016 at 11:28 PM, Tomasz Kusmierz <tom.kusmi...@gmail.com> wrote:
> I did consider that, but:
> - some files were NOT accessed by anything with 100% certainty (well if there 
> is a rootkit on my system or something in that shape than maybe yes)
> - the only application that could access those files is totem (well Nautilius 
> checks extension -> directs it to totem) so in that case we would hear about 
> out break of totem killing people files.
> - if it was a kernel bug then other large files would be affected.
>
> Maybe I’m wrong and it’s actually related to the fact that all those files 
> are located in single location on file system (single folder) that might have 
> a historical bug in some structure somewhere ?

I find it hard to imagine that this has something to do with the
folder structure, unless maybe the folder is a subvolume with
non-default attributes or so. How the files in that folder were created
(at full disk transfer speed, or over a day or even a week) might give
some hint. You could run filefrag and see if that rings a bell.

> I did forgot to add that file system was created a long time ago and it was 
> created with leaf & node size = 16k.

If this long time ago is >2 years then you have likely specifically
set node size = 16k, otherwise with older tools it would have been 4K.
Have you created it as raid10 or has it undergone profile conversions?

It could also be that the ondisk format is somewhat corrupted (btrfs
check should find that) and that that causes the issue.

In-lining on raid10 has caused me some trouble (I had 4k nodes) over
time; it happened over a year ago with kernels recent at that time, but
that fs was converted from raid5.

You might want to run the python scripts from here:
https://github.com/knorrie/python-btrfs

so that maybe you see how block-groups/chunks are filled etc.
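
Roughly like this; the example script name below is just an
illustration, so check the examples/ directory of the repo for what is
actually there in your checkout:

# git clone https://github.com/knorrie/python-btrfs
# cd python-btrfs/examples
# sudo ./show_usage.py /mnt/share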

> (ps. this email client on OS X is driving me up the wall … have to correct 
> the corrections all the time :/)
>
>> On 4 Jul 2016, at 22:13, Henk Slager <eye...@gmail.com> wrote:
>>
>> On Sun, Jul 3, 2016 at 1:36 AM, Tomasz Kusmierz <tom.kusmi...@gmail.com> 
>> wrote:
>>> Hi,
>>>
>>> My setup is that I use one file system for / and /home (on SSD) and a
>>> larger raid 10 for /mnt/share (6 x 2TB).
>>>
>>> Today I've discovered that 14 of files that are supposed to be over
>>> 2GB are in fact just 4096 bytes. I've checked the content of those 4KB
>>> and it seems that it does contain information that were at the
>>> beginnings of the files.
>>>
>>> I've experienced this problem in the past (3 - 4 years ago ?) but
>>> attributed it to different problem that I've spoke with you guys here
>>> about (corruption due to non ECC ram). At that time I did deleted
>>> files affected (56) and similar problem was discovered a year but not
>>> more than 2 years ago and I believe I've deleted the files.
>>>
>>> I periodically (once a month) run a scrub on my system to eliminate
>>> any errors sneaking in. I believe I did a balance a half a year ago ?
>>> to reclaim space after I deleted a large database.
>>>
>>> root@noname_server:/mnt/share# btrfs fi show
>>> Label: none  uuid: 060c2345-5d2f-4965-b0a2-47ed2d1a5ba2
>>>Total devices 1 FS bytes used 177.19GiB
>>>devid3 size 899.22GiB used 360.06GiB path /dev/sde2
>>>
>>> Label: none  uuid: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1
>>>Total devices 6 FS bytes used 4.02TiB
>>>devid1 size 1.82TiB used 1.34TiB path /dev/sdg1
>>>devid2 size 1.82TiB used 1.34TiB path /dev/sdh1
>>>devid3 size 1.82TiB used 1.34TiB path /dev/sdi1
>>>devid4 size 1.82TiB used 1.34TiB path /dev/sdb1
>>>devid5 size 1.82TiB used 1.34TiB path /dev/sda1
>>>devid6 size 1.82TiB used 1.34TiB path /dev/sdf1
>>>
>>> root@noname_server:/mnt/share# uname -a
>>> Linux noname_server 4.4.0-28-generic #47-Ubuntu SMP Fri Jun 24
>>> 10:09:13 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>>> root@noname_server:/mnt/share# btrfs --version
>>> btrfs-progs v4.4
>>> root@noname_server:/mnt/share#
>>>
>>>
>>> Problem is that stuff on this filesystem moves so slowly that it's
>>> hard to remember historical events ... it's like AWS glacier. What I
>>> can state with 100% certainty is that:
>>> - files that are affected are 2GB and over (safe to assume 4GB and over)
>>> - files affected were just read (and some not even read) never written
>>> after putting into storage
>>> - In the past I've assumed that fi

Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.

2016-07-04 Thread Henk Slager
On Sun, Jul 3, 2016 at 1:36 AM, Tomasz Kusmierz  wrote:
> Hi,
>
> My setup is that I use one file system for / and /home (on SSD) and a
> larger raid 10 for /mnt/share (6 x 2TB).
>
> Today I've discovered that 14 of files that are supposed to be over
> 2GB are in fact just 4096 bytes. I've checked the content of those 4KB
> and it seems that it does contain information that were at the
> beginnings of the files.
>
> I've experienced this problem in the past (3 - 4 years ago ?) but
> attributed it to different problem that I've spoke with you guys here
> about (corruption due to non ECC ram). At that time I did deleted
> files affected (56) and similar problem was discovered a year but not
> more than 2 years ago and I believe I've deleted the files.
>
> I periodically (once a month) run a scrub on my system to eliminate
> any errors sneaking in. I believe I did a balance a half a year ago ?
> to reclaim space after I deleted a large database.
>
> root@noname_server:/mnt/share# btrfs fi show
> Label: none  uuid: 060c2345-5d2f-4965-b0a2-47ed2d1a5ba2
> Total devices 1 FS bytes used 177.19GiB
> devid3 size 899.22GiB used 360.06GiB path /dev/sde2
>
> Label: none  uuid: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1
> Total devices 6 FS bytes used 4.02TiB
> devid1 size 1.82TiB used 1.34TiB path /dev/sdg1
> devid2 size 1.82TiB used 1.34TiB path /dev/sdh1
> devid3 size 1.82TiB used 1.34TiB path /dev/sdi1
> devid4 size 1.82TiB used 1.34TiB path /dev/sdb1
> devid5 size 1.82TiB used 1.34TiB path /dev/sda1
> devid6 size 1.82TiB used 1.34TiB path /dev/sdf1
>
> root@noname_server:/mnt/share# uname -a
> Linux noname_server 4.4.0-28-generic #47-Ubuntu SMP Fri Jun 24
> 10:09:13 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
> root@noname_server:/mnt/share# btrfs --version
> btrfs-progs v4.4
> root@noname_server:/mnt/share#
>
>
> Problem is that stuff on this filesystem moves so slowly that it's
> hard to remember historical events ... it's like AWS glacier. What I
> can state with 100% certainty is that:
> - files that are affected are 2GB and over (safe to assume 4GB and over)
> - files affected were just read (and some not even read) never written
> after putting into storage
> - In the past I've assumed that files affected are due to size, but I
> have quite few ISO files some backups of virtual machines ... no
> problems there - seems like problem originates in one folder & size >
> 2GB & extension .mkv

In case some application is the root cause of the issue, I would
suggest keeping some ro snapshots made by a tool like snapper, but
maybe you do that already. It also sounds like this could be a kernel
bug; snapshots won't help that much then, I think.


Re: btrfs defrag questions

2016-07-03 Thread Henk Slager
On Sun, Jul 3, 2016 at 12:33 PM, Kai Krakow  wrote:
> Am Fri, 1 Jul 2016 22:14:00 +0200
> schrieb Dmitry Katsubo :
>
>> Hello everyone,
>>
>> Question #1:
>>
>> While doing defrag I got the following message:
>>
>> # btrfs fi defrag -r /home
>> ERROR: defrag failed on /home/user/.dropbox-dist/dropbox: Success
>> total 1 failures
>>
>> I feel that something went wrong, but the message is a bit misleading.
>>
>> Provided that Dropbox is running in the system, does it mean that it
>> cannot be defagmented?
>
> That is probably true. Files that are mapped into memory (like running
> executables) cannot be changed on disk. You could make a copy of that
> file, remove the original, and rename the new into place. As long as
> the executable is running it will stay on disk but you can now
> defragment the file and next time dropbox is started it will use the
> new one.

I get:
ERROR: cannot open ./dropbox: Text file busy

when I run:
btrfs fi defrag -v ./dropbox

This is with kernel 4.6.2 and progs 4.6.1, dropbox running and mount
option compress=lzo
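
A minimal sketch of the copy-and-rename workaround described above (the
path is the one from the original report, file names are examples):

cd /home/user/.dropbox-dist
cp dropbox dropbox.new
mv dropbox.new dropbox
btrfs filesystem defragment -v dropbox

The running process keeps the old, now unlinked inode open, while the
directory entry points at the new copy, which can be defragmented and
will be used on the next start.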


Re: btrfs defrag questions

2016-07-01 Thread Henk Slager
(email ends up in gmail spamfolder)
On Fri, Jul 1, 2016 at 10:14 PM, Dmitry Katsubo  wrote:
> Hello everyone,
>
> Question #1:
>
> While doing defrag I got the following message:
>
> # btrfs fi defrag -r /home
> ERROR: defrag failed on /home/user/.dropbox-dist/dropbox: Success
> total 1 failures
>
> I feel that something went wrong, but the message is a bit misleading.
>
> Provided that Dropbox is running in the system, does it mean that it
> cannot be defagmented?

I think it is a matter of newlines in btrfs-progs and/or stdout/stderr mixup.

You should run the command with -v and probably also with -f, so that
it hopefully becomes clearer what is wrong.
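
For example, just on the failing file (path taken from the error
message above):

btrfs filesystem defragment -v -f /home/user/.dropbox-dist/dropbox

-v prints the file being processed and -f flushes its data before
moving on, so an error should be clearly attributable to that file.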

That it fails on dropbox is an error I think, but maybe a known one: it
could be that the compress mount option causes trouble for defrag,
although that should not happen.

You can defrag just 1 file, so maybe you could try to make a reproducible case.
What kernel?
What btrfs-progs?


> Question #2:
>
> Suppose that in above example /home/ftp is mounted as another btrfs
> array (not subvolume). Will 'btrfs fi defrag -r /home' defragment it
> (recursively) as well?

I don't know; I don't think so, but you can simply try.


Re: noholes and incremental send streamsize

2016-07-01 Thread Henk Slager
I have filed:

https://bugzilla.kernel.org/show_bug.cgi?id=121321

On Sat, Jun 25, 2016 at 1:17 AM, Henk Slager <eye...@gmail.com> wrote:
> Hi,
>
> the virtual machine images files I create and use are mostly sparse,
> so that not too much space on a filesystem with snapshots and on
> filesystems that are receive targets is used.
> But I noticed that with just starting up and shutting down a virtual
> machine, the difference between 2 nightly snapshots as generated by
> send -p is huge. It turns out that the cause is the noholes flag of
> the filesystem. I have set this flag to have a bit less metadata
> hopefully, but it turns out to have a very negative side effect.

As a workaround, I flipped the no-holes flag back to zero, so that for
new files the problem is avoided, I assume.

btrfs check detects this as a problem, as extents with no 'metadata
administration' are found while the no-holes flag is not set. But so
far the kernel can apparently handle the situation, so I currently
won't re-create the multi-TB filesystems just for this issue.

> The sparse part of a file that is changed between 2 snapshots is added
> to the difference between the 2 snapshots when doing send -p, in full
> 'no-sparse' size (as zeros I think, as the data compresses extremely
> well).
> So in case of a 50G VM that has roughly 8G of actual written
> filesystem data (a new OS install), I saw 40G of generated btrfs
> sendstream data by just a start/stop of the VM.
>
> I think this behavior is not correct: Sparse is read as zeros normally
> AFAIK, so if the file locations that are sparse are the same for a
> snapshotted file and its parent, there should not be a difference.
>
> The following sequence shows the problem; if the btrfstune command is
> skipped, streamsize is as expected.
>
>
> uname -r
> btrfs --version
> truncate -s 5G /holetest.img
> losetup /dev/loop0 /holetest.img
> mkfs.btrfs /dev/loop0
>
> btrfstune -n /dev/loop0
>
> btrfs-show-super /dev/loop0 | grep incompat_flags
> mount /dev/loop0 /mnt
> btrfs sub create /mnt/vol1
> truncate -s 4G /mnt/vol1/test1
> btrfs sub snap -r /mnt/vol1 /mnt/vol1.ro0
> dd if=/dev/urandom of=/mnt/vol1/test1 bs=1M count=1 conv=notrunc
> sync
> btrfs sub snap -r /mnt/vol1 /mnt/vol1.ro1
> dd if=/dev/urandom of=/mnt/vol1/test1 bs=1M count=1 conv=notrunc
> sync
> btrfs sub snap -r /mnt/vol1 /mnt/vol1.ro2
> btrfs send -p /mnt/vol1.ro1 /mnt/vol1.ro2 > /sendsize.btrfs
> ls -al /sendsize.btrfs
> umount /mnt
> losetup -d /dev/loop0
> rm /holetest.img /sendsize.btrfs
>
>
> => output
>
> # uname -r
> 4.7.0-rc4-kade
> # btrfs --version
> btrfs-progs v4.6.1
> # truncate -s 5G /holetest.img
> # losetup /dev/loop0 /holetest.img
> # mkfs.btrfs /dev/loop0
> btrfs-progs v4.5.3+20160516

I used an alias for 'btrfs' to point to the binary in the btrfs-progs
git, but forgot to alias mkfs.btrfs; anyhow, the problem is there for
multiple kernel/progs versions.

> See http://btrfs.wiki.kernel.org for more information.
>
> Performing full device TRIM (5.00GiB) ...
> Label:  (null)
> UUID:   338e948c-ee8c-4b9f-926e-b6f8fe140ae6
> Node size:  16384
> Sector size:4096
> Filesystem size:5.00GiB
> Block group profiles:
>  Data: single8.00MiB
>  Metadata: DUP 264.00MiB
>  System:   DUP  12.00MiB
> SSD detected:   no
> Incompat features:  extref, skinny-metadata
> Number of devices:  1
> Devices:
>   IDSIZE  PATH
>1 5.00GiB  /dev/loop0
>
> #
> # btrfstune -n /dev/loop0
> #
> # btrfs-show-super /dev/loop0 | grep incompat_flags
> incompat_flags  0x341
> # mount /dev/loop0 /mnt
> # btrfs sub create /mnt/vol1
> Create subvolume '/mnt/vol1'
> # truncate -s 4G /mnt/vol1/test1
> # btrfs sub snap -r /mnt/vol1 /mnt/vol1.ro0
> Create a readonly snapshot of '/mnt/vol1' in '/mnt/vol1.ro0'
> # dd if=/dev/urandom of=/mnt/vol1/test1 bs=1M count=1 conv=notrunc
> 1+0 records in
> 1+0 records out
> 1048576 bytes (1,0 MB, 1,0 MiB) copied, 0,0779673 s, 13,4 MB/s
> # sync
> # btrfs sub snap -r /mnt/vol1 /mnt/vol1.ro1
> Create a readonly snapshot of '/mnt/vol1' in '/mnt/vol1.ro1'
> # dd if=/dev/urandom of=/mnt/vol1/test1 bs=1M count=1 conv=notrunc
> 1+0 records in
> 1+0 records out
> 1048576 bytes (1,0 MB, 1,0 MiB) copied, 0,0634306 s, 16,5 MB/s
> # sync
> # btrfs sub snap -r /mnt/vol1 /mnt/vol1.ro2
> Create a readonly snapshot of '/mnt/vol1' in '/mnt/vol1.ro2'
> # btrfs send -p /mnt/vol1.ro1 /mnt/vol1.ro2 > /sendsize.btrfs
> At subvol /mnt/vol1.ro2
> # ls -al /sendsize.btrfs
> -rw-r--r-- 1 root root 4298025877 jun 25 00:26 /sendsize.btrfs


Re: Btrfs full balance command fails due to ENOSPC (bug 121071)

2016-06-28 Thread Henk Slager
On Tue, Jun 28, 2016 at 3:46 PM, Francesco Turco <ftu...@fastmail.fm> wrote:
> On 2016-06-27 23:26, Henk Slager wrote:
>> btrfs-debug does not show metadata ans system chunks; the balancing
>> problem might come from those.
>> This script does show all chunks:
>> https://github.com/knorrie/btrfs-heatmap/blob/master/show_usage.py
>>
>> You might want to use vrange or drange balance filters so that you can
>> just target a certain chunk and maybe that gives a hint where the
>> problem might be. But anyhow, the behavior experienced is a bug.
>
> Updated the bug with the output log from your script. I simply ran it as:
>
> ./show_usage.py /
>
> I don't know how to use vrange/drange balance filters. Can you show me
> how to do that, please?

The original dmesg log shows that balance gets into trouble at block
group 46435139584. This is SYSTEM|DUP, and in the later block-group
list generated with Hans' py script it is no longer there under this
same vaddr, so btrfs (or a new manual balance) has managed to relocate
it, despite the ENOSPC.

One theory I once had is that at the beginning of the disk there were,
or are, small chunks of 8MiB, whereas the rest of the disk gets 1G or
at least bigger chunks once the fs is used and filled up. Those initial
small chunks tend to be system and/or metadata. If, later, after heavy
use, a full balance relocates the small chunks, there is then
unallocated space at the start, but virtually nothing fits there if the
policy is 'first allocate big chunks'. The allocator could then report
an ENOSPC, assuming it doesn't try exhaustively in order to keep the
code simple and fast.
But it is only a theory; one would need to trace it back etc. in such a
case. I never had such a case, so I can't prove it.

Suppose you want to relocate the metadata block group (you have only
one; it is at the same location in the 2 lists from the bug report):

btrfs balance start -v -mvrange=29360128..29360129 /

This 1-byte range lies in the virtual address range 29360128 ..
29360128+1G-1, so it will relocate the metadata block group. After a
successful balance, you will see its vaddr increased and its device
addresses (paddr) changed as well.

If you want to balance based on device address and, for example,
relocate just one of the DUP copies of the metadata:
btrfs balance start -v -mdevid=1,drange=37748736..37748737 /

All this does not solve the bug, but hopefully gives us a better
understanding of cases where balance fails and no file creation is
possible anymore.
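
As a general pattern (a sketch; the vaddr is the one from the metadata
example above, and any block group start address from the chunk listing
can be substituted):

V=29360128
btrfs balance start -v -mvrange=$V..$((V + 1)) /

The same works for data block groups with -dvrange, and combined with
devid/drange as shown above you can walk through suspect chunks one by
one.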



>
> --
> Website: http://www.fturco.net/
> GPG key: 6712 2364 B2FE 30E1 4791 EB82 7BB1 1F53 29DE CD34


Re: Bug in 'btrfs filesystem du' ?

2016-06-28 Thread Henk Slager
On Tue, Jun 28, 2016 at 2:56 PM, M G Berberich <bt...@oss.m-berberich.de> wrote:
> Hello,
>
> Am Montag, den 27. Juni schrieb Henk Slager:
>> On Mon, Jun 27, 2016 at 3:33 PM, M G Berberich <bt...@oss.m-berberich.de> 
>> wrote:
>> > Am Montag, den 27. Juni schrieb M G Berberich:
>> >> after a balance ‘btrfs filesystem du’ probably shows false data about
>> >> shared data.
>> >
>> > Oh, I forgot: I have btrfs-progs v4.5.2 and kernel 4.6.2.
>>
>> With  btrfs-progs v4.6.1 and kernel 4.7-rc5, the numbers are correct
>> about shared data.
>
> I tested with kernels 4.6.3 and 4.7-rc5 and with btrfs-progs 4.5.2 and
> 4.61.
Also with kernel 4.6.2-1-default and btrfs-progs v4.5.3+20160516
(current stock opensuse tumbleweed) I cannot reproduce the problem.

> The later kernel with two patches to make the kernel work:
> https://lkml.org/lkml/2016/6/1/310 https://lkml.org/lkml/2016/6/1/311 .
... so these seem to cause the problem


> You can see the script¹ I used (do-btrfs-du-test) and the logs at
> http://m-berberich.de/btrfs/
>
> In all four cases, ‘btrfs fi du -s .’ reports
>
>  Total   Exclusive  Set shared  Filename
>   59.38MiB   0.00B29.69MiB  .
>
> befor balance and
>
>  Total   Exclusive  Set shared  Filename
>   59.38MiB59.38MiB   0.00B  .
>
> after balance.
>
> Disclaimer: The script works for me, no guaranty at all.
>
> MfG
> bmg
> __
> ¹ Disclaimer: The script works for me, no guaranty at all.
> --
> „Des is völlig wurscht, was heut beschlos- | M G Berberich
>  sen wird: I bin sowieso dagegn!“  | m...@m-berberich.de
> (SPD-Stadtrat Kurt Schindler; Regensburg)  |


Re: Btrfs full balance command fails due to ENOSPC (bug 121071)

2016-06-27 Thread Henk Slager
On Mon, Jun 27, 2016 at 9:24 PM, Chris Murphy  wrote:
> On Mon, Jun 27, 2016 at 12:32 PM, Francesco Turco  wrote:
>> On 2016-06-27 20:18, Chris Murphy wrote:
>>> If you can grab btrfs-debugfs from
>>> https://github.com/kdave/btrfs-progs/blob/master/btrfs-debugfs
>>>
>>> And then attach the output to the bug report it might be useful for a
>>> developer. But really your case is an odd duck, because there's fully
>>> 14GiB unallocated, so it should be able to create a new one without
>>> problem.
>>>
>>> $ sudo ./btrfs-debugfs -b /
>>
>> Done! Thank you, I was not aware of the existence of btrfs-debug...
>
> I'm not certain what the "1 enospc errors during balance' refers to.
> That message happens several times, the balance operation isn't
> aborted, and doesn't come with any call traces (those appear later).
> Further, the btrfs-debugfs output suggests the balance worked - each
> bg is continguously located after the last and they're all new bg
> offset values compared to what's found in the dmesg.

btrfs-debug does not show metadata and system chunks; the balancing
problem might come from those.
This script does show all chunks:
https://github.com/knorrie/btrfs-heatmap/blob/master/show_usage.py

You might want to use vrange or drange balance filters so that you can
just target a certain chunk and maybe that gives a hint where the
problem might be. But anyhow, the behavior experienced is a bug.

> This might be that obscure -28 enospc bug that affects some file
> systems and hasn't been tracked down yet. If I recall correctly it's a
> misleading error, and the only work around to get rid of it is migrate
> to a new Btrfs file system. I don't think the file system is at any
> risk in the current state, but I'm not certain as it's already an edge
> case. I'd just make sure you keep suitably current backups and keep
> using it.
>
>
>
> --
> Chris Murphy


Re: Adventures in btrfs raid5 disk recovery

2016-06-27 Thread Henk Slager
On Mon, Jun 27, 2016 at 6:17 PM, Chris Murphy  wrote:
> On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn
>  wrote:
>> On 2016-06-25 12:44, Chris Murphy wrote:
>>>
>>> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
>>>  wrote:
>>>
 Well, the obvious major advantage that comes to mind for me to
 checksumming
 parity is that it would let us scrub the parity data itself and verify
 it.
>>>
>>>
>>> OK but hold on. During scrub, it should read data, compute checksums
>>> *and* parity, and compare those to what's on-disk - > EXTENT_CSUM in
>>> the checksum tree, and the parity strip in the chunk tree. And if
>>> parity is wrong, then it should be replaced.
>>
>> Except that's horribly inefficient.  With limited exceptions involving
>> highly situational co-processors, computing a checksum of a parity block is
>> always going to be faster than computing parity for the stripe.  By using
>> that to check parity, we can safely speed up the common case of near zero
>> errors during a scrub by a pretty significant factor.
>
> OK I'm in favor of that. Although somehow md gets away with this by
> computing and checking parity for its scrubs, and still manages to
> keep drives saturated in the process - at least HDDs, I'm not sure how
> it fares on SSDs.

What I read in this thread clarifies the different flavors of errors I
saw when trying btrfs raid5 while corrupting one device, or just
unexpectedly removing a device and replacing it with a fresh one.
Especially the lack of parity csums is something I was not aware of,
and I think this is really wrong.

Consider a 4-disk btrfs raid10 and a 3-disk btrfs raid5. Both protect
against the loss of 1 device or bad blocks on 1 device. In the current
design (unoptimized for performance), raid10 reads from 2 disks and
raid5 as well (as far as I remember), per task/process.
Which pair of strips raid10 picks is pseudo-random AFAIK, so one could
get low throughput if some device in the array is older/slower and
that one is picked. Going from the device layer to the fs logical layer
is just a simple function, namely a copy, which gives the option to
keep data in place (zero-copy). The data is at least read for the csum
check, and in case of failure the btrfs code picks the alternative
strip and corrects it, etc.

For raid5, assuming it avoids reading the parity in principle, it is
also a strip pair and a csum check. In case of csum failure, one needs
the parity strip and a parity calculation. To me, it looks like the
'fear' of this calculation has made raid56 a sort of add-on, instead of
a more integral part.

Looking at the raid6 performance test at boot in dmesg, it reports
30 GByte/s, even higher than memory bandwidth. So although a
calculation is needed when data0strip+paritystrip is used instead of
data0strip+data1strip, I think that, looking at the total cost, it can
be cheaper than spending time on seeks, at least on HDDs. If the parity
calculation is treated in a transparent way, the same as a copy, then
there is more flexibility in selecting disks (and strips), which
enables easier design and performance optimizations, I think.

>> The ideal situation that I'd like to see for scrub WRT parity is:
>> 1. Store checksums for the parity itself.
>> 2. During scrub, if the checksum is good, the parity is good, and we just
>> saved the time of computing the whole parity block.
>> 3. If the checksum is not good, then compute the parity.  If the parity just
>> computed matches what is there already, the checksum is bad and should be
>> rewritten (and we should probably recompute the whole block of checksums
>> it's in), otherwise, the parity was bad, write out the new parity and update
>> the checksum.

On this 3rd point: if the parity matches but the csum is not good, then
there is a btrfs design error or some hardware/CPU/memory problem.
Compare with btrfs raid10: if the copies match but the csum is wrong,
then something is fatally wrong. Just the first step, the csum check,
would be enough: if it is wrong, you regenerate the assumed-corrupt
strip from the 3 others (and for 3-disk raid5 from the 2 others),
whether that means copying or a parity calculation.

>> 4. Have an option to skip the csum check on the parity and always compute
>> it.


Re: Bug in 'btrfs filesystem du' ?

2016-06-27 Thread Henk Slager
On Mon, Jun 27, 2016 at 3:33 PM, M G Berberich  wrote:
> Am Montag, den 27. Juni schrieb M G Berberich:
>> after a balance ‘btrfs filesystem du’ probably shows false data about
>> shared data.
>
> Oh, I forgot: I have btrfs-progs v4.5.2 and kernel 4.6.2.

With  btrfs-progs v4.6.1 and kernel 4.7-rc5, the numbers are correct
about shared data.


noholes and incremental send streamsize

2016-06-24 Thread Henk Slager
Hi,

the virtual machine images files I create and use are mostly sparse,
so that not too much space on a filesystem with snapshots and on
filesystems that are receive targets is used.
But I noticed that with just starting up and shutting down a virtual
machine, the difference between 2 nightly snapshots as generated by
send -p is huge. It turns out that the cause is the noholes flag of
the filesystem. I have set this flag to have a bit less metadata
hopefully, but it turns out to have a very negative side effect.

The sparse part of a file that is changed between 2 snapshots is added
to the difference between the 2 snapshots when doing send -p, in full
'no-sparse' size (as zeros I think, as the data compresses extremely
well).
So in case of a 50G VM that has roughly 8G of actual written
filesystem data (a new OS install), I saw 40G of generated btrfs
sendstream data by just a start/stop of the VM.

I think this behavior is not correct: Sparse is read as zeros normally
AFAIK, so if the file locations that are sparse are the same for a
snapshotted file and its parent, there should not be a difference.

The following sequence shows the problem; if the btrfstune command is
skipped, streamsize is as expected.


uname -r
btrfs --version
truncate -s 5G /holetest.img
losetup /dev/loop0 /holetest.img
mkfs.btrfs /dev/loop0

btrfstune -n /dev/loop0

btrfs-show-super /dev/loop0 | grep incompat_flags
mount /dev/loop0 /mnt
btrfs sub create /mnt/vol1
truncate -s 4G /mnt/vol1/test1
btrfs sub snap -r /mnt/vol1 /mnt/vol1.ro0
dd if=/dev/urandom of=/mnt/vol1/test1 bs=1M count=1 conv=notrunc
sync
btrfs sub snap -r /mnt/vol1 /mnt/vol1.ro1
dd if=/dev/urandom of=/mnt/vol1/test1 bs=1M count=1 conv=notrunc
sync
btrfs sub snap -r /mnt/vol1 /mnt/vol1.ro2
btrfs send -p /mnt/vol1.ro1 /mnt/vol1.ro2 > /sendsize.btrfs
ls -al /sendsize.btrfs
umount /mnt
losetup -d /dev/loop0
rm /holetest.img /sendsize.btrfs


=> output

# uname -r
4.7.0-rc4-kade
# btrfs --version
btrfs-progs v4.6.1
# truncate -s 5G /holetest.img
# losetup /dev/loop0 /holetest.img
# mkfs.btrfs /dev/loop0
btrfs-progs v4.5.3+20160516
See http://btrfs.wiki.kernel.org for more information.

Performing full device TRIM (5.00GiB) ...
Label:  (null)
UUID:   338e948c-ee8c-4b9f-926e-b6f8fe140ae6
Node size:  16384
Sector size:4096
Filesystem size:5.00GiB
Block group profiles:
 Data: single8.00MiB
 Metadata: DUP 264.00MiB
 System:   DUP  12.00MiB
SSD detected:   no
Incompat features:  extref, skinny-metadata
Number of devices:  1
Devices:
  IDSIZE  PATH
   1 5.00GiB  /dev/loop0

#
# btrfstune -n /dev/loop0
#
# btrfs-show-super /dev/loop0 | grep incompat_flags
incompat_flags  0x341
# mount /dev/loop0 /mnt
# btrfs sub create /mnt/vol1
Create subvolume '/mnt/vol1'
# truncate -s 4G /mnt/vol1/test1
# btrfs sub snap -r /mnt/vol1 /mnt/vol1.ro0
Create a readonly snapshot of '/mnt/vol1' in '/mnt/vol1.ro0'
# dd if=/dev/urandom of=/mnt/vol1/test1 bs=1M count=1 conv=notrunc
1+0 records in
1+0 records out
1048576 bytes (1,0 MB, 1,0 MiB) copied, 0,0779673 s, 13,4 MB/s
# sync
# btrfs sub snap -r /mnt/vol1 /mnt/vol1.ro1
Create a readonly snapshot of '/mnt/vol1' in '/mnt/vol1.ro1'
# dd if=/dev/urandom of=/mnt/vol1/test1 bs=1M count=1 conv=notrunc
1+0 records in
1+0 records out
1048576 bytes (1,0 MB, 1,0 MiB) copied, 0,0634306 s, 16,5 MB/s
# sync
# btrfs sub snap -r /mnt/vol1 /mnt/vol1.ro2
Create a readonly snapshot of '/mnt/vol1' in '/mnt/vol1.ro2'
# btrfs send -p /mnt/vol1.ro1 /mnt/vol1.ro2 > /sendsize.btrfs
At subvol /mnt/vol1.ro2
# ls -al /sendsize.btrfs
-rw-r--r-- 1 root root 4298025877 jun 25 00:26 /sendsize.btrfs


Re: dd on wrong device, 1.9 GiB from the beginning has been overwritten, how to restore partition?

2016-06-12 Thread Henk Slager
On Sun, Jun 12, 2016 at 11:22 PM, Maximilian Böhm  wrote:
> Hi there, I did something terribly wrong, all blame on me. I wanted to
> write to an USB stick but /dev/sdc wasn't the stick in this case but
> an attached HDD with GPT and an 8 TB btrfs partition…

GPT has a secondary copy at the end of the device, so maybe gdisk can
reconstruct the first, main one at the beginning of the disk; I don't
know all gdisk commands. But if you once created just 1 partition of
maximum size for the whole disk with modern tools, your btrfs fs starts
at sector 2048 (with logical sector size 512).


> $ sudo dd bs=4M if=manjaro-kde-16.06.1-x86_64.iso of=/dev/sdc
> 483+1 Datensätze ein
> 483+1 Datensätze aus
> 2028060672 bytes (2,0 GB, 1,9 GiB) copied, 16,89 s, 120 MB/s

To confuse btrfs+tools as little as possible, I would first overwrite
/dev/sdc again from the start with the same amount of bytes, but this
time from /dev/zero.
Then re-create / 'newly overlay' the original partition at offset 1M,
up to the end. Alternatively:
$ losetup /dev/loopX -o 1M /dev/sdc
Then your broken fs will be on /dev/sdc1 or /dev/loopX.

Or overlay it with a dm snapshot, set the original read-only and then
work on the rw flavor, so you keep the current broken HDD/fs as-is and
then do the above.

> $ sudo btrfs check --repair /dev/sdc
> enabling repair mode
> No valid Btrfs found on /dev/sdc
> Couldn't open file system

Forget --repair, I would say; hopefully btrfs restore can still find /
copy most of the data.

> $ sudo btrfs-find-root /dev/sdc
> No valid Btrfs found on /dev/sdc
> ERROR: open ctree failed
>
> $ sudo btrfs-show-super /dev/sdc --all
> superblock: bytenr=65536, device=/dev/sdc
> -
> ERROR: bad magic on superblock on /dev/sdc at 65536
>
> superblock: bytenr=67108864, device=/dev/sdc
> -
> ERROR: bad magic on superblock on /dev/sdc at 67108864
>
> superblock: bytenr=274877906944, device=/dev/sdc
> -
> ERROR: bad magic on superblock on /dev/sdc at 274877906944

run
$ sudo btrfs-show-super /dev/sdc1 --all
or
$ sudo btrfs-show-super /dev/loopX --all
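
As a concrete read-only sketch of the above (the loop device number is
an example; -r keeps the mapping read-only so nothing on the disk gets
modified):

losetup -r -o 1M /dev/loop0 /dev/sdc
btrfs-show-super /dev/loop0 --all
mkdir /tmp/restore-list
btrfs restore -v -D /dev/loop0 /tmp/restore-list

btrfs restore with -D (dry run) only lists what it would be able to
copy out, which gives an idea of how much is still reachable before
attempting anything destructive.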



> System infos:
>
> $ uname -a
> Linux Mongo 4.6.2-1-MANJARO #1 SMP PREEMPT Wed Jun 8 11:00:08 UTC 2016
> x86_64 GNU/Linux
>
> $ btrfs --version
> btrfs-progs v4.5.3
>
> Don't think dmesg is necessary here.
>
>
> OK, the btrfs wiki says there is a second superblock at 64 MiB
> (overwritten too in my case) and a third at 256 GiB ("0x40").
> But how to restore it? And how to restore the general btrfs header
> metadata? How to restore GPT without doing something terrible again?

See above; you may have to look up various sizes, GPT details etc.
first, but I think your 3rd superblock should be there.


Re: Replacing drives with larger ones in a 4 drive raid1

2016-06-12 Thread Henk Slager
On Sun, Jun 12, 2016 at 7:03 PM, boli  wrote:
>>> It's done now, and took close to 99 hours to rebalance 8.1 TB of data from 
>>> a 4x6TB raid1 (12 TB capacity) with 1 drive missing onto the remaining 
>>> 3x6TB raid1 (9 TB capacity).
>>
>> Indeed, it not clear why it takes 4 days for such an action. You
>> indicated that you cannot add an online 5th drive, so then you and
>> intermediate compaction of the fs to less drives is a way to handle
>> this issue. There are 2 ways however:
>>
>> 1) Keeping the to-be-replaced drive online until a btrfs dev remove of
>> it from the fs of it is finished and only then replace a 6TB with an
>> 8TB in the drivebay. So in this case, one needs enough free capacity
>> on the fs (which you had) and full btrfs raid1 redundancy is there all
>> the time.
>>
>> 2) Take a 6TB out of the drivebay first and then do the btrfs dev
>> remove, in this case on a really missing disk. This way, the fs is in
>> degraded mode (or mounted as such) and the action of remove missing is
>> also a sort of 'reconstruction'. I don't know the details of the code,
>> but I can imagine that it has performance implications.
>
> Thanks for reminding me about option 1). So in summary, without temporarily 
> adding an additional drive, there are 3 ways to replace a drive:
>
> 1) Logically removing old drive (triggers 1st rebalance), physically removing 
> it, then adding new drive physically and logically (triggers 2nd rebalance)
>
> 2) Physically removing old drive, mounting degraded, logically removing it 
> (triggers 1st rebalance, while degraded), then adding new drive physically 
> and logically (2nd rebalance)
>
> 3) Physically replacing old with new drive, mounting degraded, then logically 
> replacing old with new drive (triggers rebalance while degraded)
>
>
> I did option 2, which seems to be the worst of the three, as there was no 
> redundancy for a couple days, and 2 rebalances are needed, which potentially 
> take a long time.
>
> Option 1 also has 2 rebalances, but redundancy is always maintained.
>
> Option 3 needs just 1 rebalance, but (like option 1) does not maintain 
> redundancy at all times.
>
> That's where an extra drive bay would come in handy, allowing to maintain 
> redundancy while still just needing one "rebalance"? Question mark because 
> you mentioned "highspeed data transfer" rather than "rebalance" when doing a 
> btrfs-replace, which sounds very efficient (in case of -r option these 
> transfers would be from multiple drives).

I haven't used -r with replace other than for testing purposes inside
virtual machines. I think the '...transfers would be from multiple
drives...' might not be a speed advantage with the current state of
the code. If the drives are still healthy and the purpose of the
replace is a capacity increase, my experience is that without the -r
option (and using an extra SATA port) the transfer runs mostly at the
drive's maximum magnetic-media transfer speed. The same applies to
cases where you want to add LUKS or bcache headers in front of the
block device that hosts the fs/devid1 data.

But now that you have all data on 3x 6TB drives anyway, you could save
balancing time by just doing a btrfs replace from 6TB to 8TB three
times, and then for the 4th 8TB just add it and let btrfs do the
spreading/balancing over time by itself.
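
As a sketch, that sequence would be something like this (device names
and mount point are placeholders):

btrfs replace start /dev/sdold1 /dev/sdnew1 /mnt/pool
btrfs replace status /mnt/pool
btrfs filesystem resize <devid-of-new-drive>:max /mnt/pool
# ... repeat for the other two 6TB drives, then:
btrfs device add /dev/sdnew4 /mnt/pool

The resize after each replace is needed because replace keeps the old
device size; 'max' grows the fs on that devid to the full 8TB.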

> The man page mentioned that the replacement drive needs to be at least as 
> large as the original, which makes me wonder if it's still a "highspeed data 
> transfer" if the new drive is larger, or if it does a rebalance in that case. 
> If not then that'd be pretty much what I'm looking for. More on that below.
>
>>> If the goal is to replace 4x 6TB drive (raid1) with 4x 8TB drive (still 
>>> raid1), is there a way to remove one 6 TB drive at a time, recreate its 
>>> exact contents from the other 3 drives onto a new 8 TB drive, without doing 
>>> a full rebalance? That is: without writing any substantial amount of data 
>>> onto the remaining 3 drives.
>>
>> There isn't such a way. This goal has a violation in itself with
>> respect to redundancy (btrfs raid1).
>
> True, it would be "hack" to minimize the amount of data to rebalance (thus 
> saving time), with the (significant) downside of not maintaining redundancy 
> at all times.
> Personally I'd probably be willing to take the risk, since I have a few other 
> copies of this data.
>
>> man btrfs-replace and option -r I would say. But still, having a 5th
>> drive online available makes things much easier and faster and solid
>> and is the way to do a drive replace. You can then do a normal replace
>> and there is just highspeed data transfer for the old and the new disk
>> and only for parts/blocks of the disk that contain filedata. So it is
>> not a sector-by-sector copying also deleted blocks, but from end-user
>> perspective is an exact copy. There are patches ('hot spare') that
>> assume it to be this way, but they aren't in the mainline kernel yet.
>
> Hmm, so maybe I should think about 

Re: Files seen by some apps and not others

2016-06-12 Thread Henk Slager
Bearcat Şándor  gmail.com> writes:

> Is there a fix for the bad tree block error, which seems to be the
> root (pun intended) of all this?

I think the root cause is some memory corruption. It might be a known
case; maybe someone else recognizes something.

Anyhow, if you can't and won't reproduce it, the best approach is to
test memory/hardware, check any software that might have overwritten
something in memory, use a recent (mainline/stable) kernel and see if
it runs stably.

Re: Replacing drives with larger ones in a 4 drive raid1

2016-06-12 Thread Henk Slager
On Sun, Jun 12, 2016 at 12:35 PM, boli  wrote:
>> It has now been doing "btrfs device delete missing /mnt" for about 90 hours.
>>
>> These 90 hours seem like a rather long time, given that a rebalance/convert 
>> from 4-disk-raid5 to 4-disk-raid1 took about 20 hours months ago, and a 
>> scrub takes about 7 hours (4-disk-raid1).
>>
>> OTOH the filesystem will be rather full with only 3 of 4 disks available, so 
>> I do expect it to take somewhat "longer than usual".
>>
>> Would anyone venture a guess as to how long it might take?
>
> It's done now, and took close to 99 hours to rebalance 8.1 TB of data from a 
> 4x6TB raid1 (12 TB capacity) with 1 drive missing onto the remaining 3x6TB 
> raid1 (9 TB capacity).

Indeed, it is not clear why it takes 4 days for such an action. You
indicated that you cannot add an online 5th drive, so an intermediate
compaction of the fs onto fewer drives is a way to handle this issue.
There are 2 ways, however:

1) Keep the to-be-replaced drive online until a btrfs dev remove of it
from the fs is finished, and only then replace a 6TB with an 8TB in the
drive bay. In this case one needs enough free capacity on the fs (which
you had), and full btrfs raid1 redundancy is there all the time.

2) Take a 6TB out of the drive bay first and then do the btrfs dev
remove, in this case on a really missing disk. This way the fs is in
degraded mode (or mounted as such), and the 'remove missing' action is
also a sort of 'reconstruction'. I don't know the details of the code,
but I can imagine that it has performance implications.

> Now I made sure quotas were off, then started a screen to fill the new 8 TB 
> disk with zeros, detached it and and checked iotop to get a rough estimate on 
> how long it will take (I'm aware it will become slower in time).
>
> After that I'll add this 8 TB disk to the btrfs raid1 (for yet another 
> rebalance).
>
> The next 3 disks will be replaced with "btrfs replace", so only one rebalance 
> each is needed.
>
> I assume each "btrfs replace" would do a full rebalance, and thus assign 
> chunks according to the normal strategy of choosing the two drives with the 
> most free space, which in this case would be a chunk to the new drive, and a 
> mirrored chunk to that existing 3 drive with most free space.
>
> What I'm wondering is this:
> If the goal is to replace 4x 6TB drive (raid1) with 4x 8TB drive (still 
> raid1), is there a way to remove one 6 TB drive at a time, recreate its exact 
> contents from the other 3 drives onto a new 8 TB drive, without doing a full 
> rebalance? That is: without writing any substantial amount of data onto the 
> remaining 3 drives.

There isn't such a way. This goal inherently conflicts with maintaining
redundancy (btrfs raid1) at all times.

> It seems to me that would be a lot more efficient, but it would go against 
> the normal chunk assignment strategy.

See man btrfs-replace and its -r option, I would say. But still, having
a 5th drive available online makes things much easier, faster and more
solid, and is the way to do a drive replace. You can then do a normal
replace, and there is just a high-speed data transfer between the old
and the new disk, and only for the parts/blocks of the disk that
contain file data. So it is not a sector-by-sector copy that also
copies deleted blocks, but from the end-user perspective it is an exact
copy. There are patches ('hot spare') that assume it works this way,
but they aren't in the mainline kernel yet.

btrfs replace should work OK for a btrfs raid1 fs (at least I can
confirm it worked OK for btrfs raid10 half a year ago), provided the fs
is mostly idle during the replace (almost no new files added). Still,
you might want to have the replace-related fixes that were added in
kernel 4.7-rc2.

Another, less likely, reason for the performance issue would be if the
fs was converted from raid5 and has a 4k nodesize; btrfs-show-super can
show you that. It should not matter, but my experience with a delete /
add sequence in such a case is that it is very slow.


Re: Cannot balance FS (No space left on device)

2016-06-10 Thread Henk Slager
On Fri, Jun 10, 2016 at 8:04 PM, ojab //  wrote:
> [Please CC me since I'm not subscribed to the list]
> Hi,
> I've tried to `/usr/bin/btrfs fi defragment -r` my btrfs partition,
> but it's failed w/ "No space left on device" and now I can't get any
> free space on that partition (deleting some files or adding new device
> doesn't help). During defrag I've used `space_cache=v2` mount option,
> but remounted FS w/ `clear_cache` flag since then. Also I've deleted
> about 50Gb of files and added new 250Gb disk since then:
>
>>$ df -h /mnt/xxx/
>>Filesystem  Size  Used Avail Use% Mounted on
>>/dev/sdc1   2,1T  1,8T   37G  99% /mnt/xxx
>>$ sudo /usr/bin/btrfs fi show
>>Label: none  uuid: 8a65465d-1a8c-4f80-abc6-c818c38567c3
>>Total devices 3 FS bytes used 1.78TiB
>>devid1 size 931.51GiB used 931.51GiB path /dev/sdc1
>>devid2 size 931.51GiB used 931.51GiB path /dev/sdb1
>>devid3 size 230.41GiB used 0.00B path /dev/sdd1
>>$ sudo /usr/bin/btrfs fi usage /mnt/xxx/
>>Overall:
>>Device size:   2.04TiB
>>Device allocated:  1.82TiB
>>Device unallocated:230.41GiB
>>Device missing:0.00B
>>Used:  1.78TiB
>>Free (estimated):  267.23GiB  (min: 152.03GiB)
>>Data ratio:1.00
>>Metadata ratio:2.00
>>Global reserve:512.00MiB  (used: 0.00B)
>>
>>Data,RAID0: Size:1.81TiB, Used:1.78TiB
>>   /dev/sdb1   928.48GiB
>>   /dev/sdc1   928.48GiB
>>
>>Metadata,RAID1: Size:3.00GiB, Used:2.30GiB
>>   /dev/sdb1   3.00GiB
>>   /dev/sdc1   3.00GiB
>>
>>System,RAID1: Size:32.00MiB, Used:176.00KiB
>>   /dev/sdb132.00MiB
>>   /dev/sdc132.00MiB
>>
>>Unallocated:
>>   /dev/sdb1   1.01MiB
>>   /dev/sdc1   1.00MiB
>>   /dev/sdd1   230.41GiB
>>$ sudo /usr/bin/btrfs balance start -dusage=66 /mnt/xxx/
>>Done, had to relocate 0 out of 935 chunks
>>$ sudo /usr/bin/btrfs balance start -dusage=67 /mnt/xxx/
>>ERROR: error during balancing '/mnt/xxx/': No space left on device
>>There may be more info in syslog - try dmesg | tail
>
> I assume that there is something wrong with metadata, since I can copy
> files to FS.
> I'm on 4.6.2 vanilla kernel and using btrfs-progs-4.6, btrfs-debugfs
> output can be found here:
> https://gist.githubusercontent.com/ojab/1a8b1f83341403a169a8e66995c7c3da/raw/61621d22f706d7543a93a3d005415543af9a0db0/gistfile1.txt.
> Any hint what else can I try to fix the issue?

I have seldom seen an fs so full, very regular numbers :)

But can you provide the output of this script:
https://github.com/knorrie/btrfs-heatmap/blob/master/show_usage.py

It gives better info w.r.t. devices and it is then easier to say what
has to be done.

But you have btrfs raid0 data (2 stripes) and raid1 metadata; they both
currently want 2 devices, and there is only one device with room for
your 2G chunks. So in theory you need 2 empty devices added for a
balance to succeed. If you can accept reduced redundancy for some time,
you could shrink the fs used space on hdd1 to half, do the same for the
partition itself, then add a hdd2 partition and add that to the fs. Or
just add another HDD.
Then your 50GB of deletions could come into effect once you start
balancing. Also have a look at the balance stripe filters, I would say.
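
For example (a sketch, assuming a kernel/progs combination that has the
stripes balance filter; mount point as above):

btrfs balance start -v -dstripes=1..2 /mnt/xxx

This only touches data block groups striped over 1 or 2 devices, so
chunks that already span all three devices are not balanced again.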


Re: Kernel crash on mount after SMR disk trouble

2016-06-10 Thread Henk Slager
On Sat, May 14, 2016 at 10:19 AM, Jukka Larja  wrote:
> In short:
>
> I added two 8TB Seagate Archive SMR disk to btrfs pool and tried to delete
> one of the old disks. After some errors I ended up with file system that can
> be mounted read-only, but crashes the kernel if mounted normally. Tried
> btrfs check --repair (which noted that space cache needs to be zeroed) and
> zeroing space cache (via mount parameter), but that didn't change anything.
>
> Longer version:
>
> I was originally running Debian Jessie with some pretty recent kernel (maybe
> 4.4), but somewhat older btrfs tools. After the trouble started, I tried

You should have at least kernel 4.4; the critical patch for supporting
this drive was added in 4.4-rc3 or 4.4-rc4, I don't remember exactly.
Only if you somehow disable NCQ completely in your Linux system (kernel
and more), or use a HW chipset/bridge that does that for you, might it
work.

> updating (now running Kernel 4.5.1 and tools 4.4.1). I checked the new disks
> with badblocks (no problems found), but based on some googling, Seagate's
> SMR disks seem to have various problems, so the root cause is probably one
> type or another of disk errors.

Seagate provides a special variant of the Linux ext4 filesystem that
should play well with their SMR drive. Also, the advice is not to use
this drive in an array setup; the risk is way too high that the drives
can't keep up with the demands of the higher layers and then get resets
or their FW crashes. You should also have had a look at your system's
and the drive's timeouts (see scterc). To summarize: adding those
drives to a btrfs raid array is asking for trouble.

I am using one such drive with an Intel J1900 SoC (Atom, SATA2) and it
works, although I still get the typical error occasionally. As it is
just a btrfs receive target, with one fs (dup/dup/single) for the whole
drive, all CoW, it survives those lockups or crashes; I just restart
the board+drive. In general, reading back multi-TB ro snapshots works
fine and is on par with Gbps LAN speeds.

> Here's the output of btrfs fi show:
>
> Label: none  uuid: 8b65962d-0982-449b-ac6f-1acc8397ceb9
> Total devices 12 FS bytes used 13.15TiB
> devid1 size 3.64TiB used 3.36TiB path /dev/sde1
> devid2 size 3.64TiB used 3.36TiB path /dev/sdg1
> devid3 size 3.64TiB used 3.36TiB path /dev/sdh1
> devid4 size 3.64TiB used 3.34TiB path /dev/sdf1
> devid5 size 1.82TiB used 1.44TiB path /dev/sdi1
> devid6 size 1.82TiB used 1.54TiB path /dev/sdl1
> devid7 size 1.82TiB used 1.51TiB path /dev/sdk1
> devid8 size 1.82TiB used 1.54TiB path /dev/sdj1
> devid9 size 3.64TiB used 3.31TiB path /dev/sdb1
> devid   10 size 3.64TiB used 3.36TiB path /dev/sda1
> devid   11 size 7.28TiB used 168.00GiB path /dev/sdc1
> devid   12 size 7.28TiB used 168.00GiB path /dev/sdd1
>
> Last two devices (11 and 12) are the new disks. After adding them, I first
> copied some new data in (about 130 GBs), which seemed to go fine. Then I
> tried to remove disk 5. After some time (about 30 GiBs written to 11 and
> 12), there were some errors and disk 11 or 12 dropped out and fs went
> read-only. After some trouble-shooting (googling), I decided the new disks
> were too iffy to trust and tried to remove them.
>
> I don't remember exactly what errors I got, but device delete operation was
> interrupted due to errors at least once or twice, before more serious
> trouble began. In between the attempts I updated the HBA's (an LSI 9300)
> firmware. After final device delete attempt the end result was that
> attempting to mount causes kernel to crash. I then tried updating kernel and
> running check --repair, but that hasn't helped. Mounting read-only seems to
> work perfectly, but I haven't tried copying everything to /dev/null or
> anything like that (just few files).
>
> The log of the crash (it is very repeatable) can be seen here:
> http://jane.aarghimedes.fi/~jlarja/tempe/btrfs-trouble/btrfs_crash_log.txt
>
> Snipped from start of that:
>
> touko 12 06:41:22 jane kernel: BTRFS info (device sda1): disk space caching
> is enabled
> touko 12 06:41:24 jane kernel: BTRFS info (device sda1): bdev /dev/sdd1
> errs: wr 0, rd 0, flush 1, corrupt 0, gen 0
> touko 12 06:41:39 jane kernel: BUG: unable to handle kernel NULL pointer
> dereference at 01f0
> touko 12 06:41:39 jane kernel: IP: []
> can_overcommit+0x1e/0xf0 [btrfs]
> touko 12 06:41:39 jane kernel: PGD 0
> touko 12 06:41:39 jane kernel: Oops:  [#1] SMP
>
> My dmesg log is here:
> http://jane.aarghimedes.fi/~jlarja/tempe/btrfs-trouble/dmesg.log
>
> Other information:
> Linux jane 4.5.0-1-amd64 #1 SMP Debian 4.5.1-1 (2016-04-14) x86_64 GNU/Linux
> btrfs-progs v4.4.1
>
> btrfs fi df /mnt/Allosaurus/
> Data, RAID1: total=13.13TiB, used=13.07TiB
> Data, single: total=8.00MiB, used=0.00B
> System, RAID1: 

Re: Allocator behaviour during device delete

2016-06-10 Thread Henk Slager
On Thu, Jun 9, 2016 at 3:54 PM, Brendan Hide  wrote:
>
>
> On 06/09/2016 03:07 PM, Austin S. Hemmelgarn wrote:
>>
>> On 2016-06-09 08:34, Brendan Hide wrote:
>>>
>>> Hey, all
>>>
>>> I noticed this odd behaviour while migrating from a 1TB spindle to SSD
>>> (in this case on a LUKS-encrypted 200GB partition) - and am curious if
>>> this behaviour I've noted below is expected or known. I figure it is a
>>> bug. Depending on the situation, it *could* be severe. In my case it was
>>> simply annoying.
>>>
>>> ---
>>> Steps
>>>
>>> After having added the new device (btrfs dev add), I deleted the old
>>> device (btrfs dev del)
>>>
>>> Then, whilst waiting for that to complete, I started a watch of "btrfs
>>> fi show /". Note that the below is very close to the output at the time
>>> - but is not actually copy/pasted from the output.
>>>
 Label: 'tricky-root'  uuid: bcbe47a5-bd3f-497a-816b-decb4f822c42
 Total devices 2 FS bytes used 115.03GiB
 devid1 size 0.00GiB used 298.06GiB path /dev/sda2
 devid2 size 200.88GiB used 0.00GiB path
 /dev/mapper/cryptroot
>>>
>>>
>>>
>>> devid1 is the old disk while devid2 is the new SSD
>>>
>>> After a few minutes, I saw that the numbers have changed - but that the
>>> SSD still had no data:
>>>
 Label: 'tricky-root'  uuid: bcbe47a5-bd3f-497a-816b-decb4f822c42
 Total devices 2 FS bytes used 115.03GiB
 devid1 size 0.00GiB used 284.06GiB path /dev/sda2
 devid2 size 200.88GiB used 0.00GiB path
 /dev/mapper/cryptroot
>>>
>>>
>>> The "FS bytes used" amount was changing a lot - but mostly stayed near
>>> the original total, which is expected since there was very little
>>> happening other than the "migration".
>>>
>>> I'm not certain of the exact point where it started using the new disk's
>>> space. I figure that may have been helpful to pinpoint. :-/
>>
>> OK, I'm pretty sure I know what was going on in this case.  Your
>> assumption that device delete uses the balance code is correct, and that
>> is why you see what's happening happening.  There are two key bits that
>> are missing though:
>> 1. Balance will never allocate chunks when it doesn't need to.

In relation to the discussions w.r.t. ENOSPC and devices full of
chunks: regarding this statement 1., I see different behavior with
kernel 4.6.0 and tools 4.5.3.
On an idle fs with some fragmentation, I did balance -dusage=5; it
completes successfully and leaves a new empty chunk (highest vaddr).
Then balance -dusage=6, which handles 2 chunks with that usage level:
- the zero-filled last chunk is replaced with a new empty chunk (higher vaddr)
- the 2 usage=6 chunks are gone
- one chunk with the lowest vaddr saw its usage increase from 47 to 60
- several metadata chunks have changed slightly in usage

It could be a 2-step data move, but from just the states before and
after balance I can't prove that.
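
For reference, this kind of before/after comparison can be repeated
roughly like this (mount point is an example, show_usage.py is the
script referenced earlier in this thread):

sudo ./show_usage.py /mnt > chunks-before.txt
sudo btrfs balance start -dusage=6 /mnt
sudo ./show_usage.py /mnt > chunks-after.txt
diff -u chunks-before.txt chunks-after.txt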

>> 2. The space usage listed in fi show is how much space is allocated to
>> chunks, not how much is used in those chunks.
>>
>> In this case, based on what you've said, you had a lot of empty or
>> mostly empty chunks.  As a result of this, the device delete was both
>> copying data, and consolidating free space.  If you have a lot of empty
>> or mostly empty chunks, it's not unusual for a device delete to look
>> like this until you start hitting chunks that have actual data in them.
>> The pri8mary point of this behavior is that it makes it possible to
>> directly switch to a smaller device without having to run a balance and
>> then a resize before replacing the device, and then resize again
>> afterwards.
>
>
> Thanks, Austin. Your explanation is along the lines of my thinking though.
>
> The new disk should have had *some* data written to it at that point, as it
> started out at over 600GiB in allocation (should have probably mentioned
> that already). Consolidating or not, I would consider data being written to
> the old disk to be a bug, even if it is considered minor.
>
> I'll set up a reproducible test later today to prove/disprove the theory. :)
>
> --
> __
> Brendan Hide
> http://swiftspirit.co.za/
> http://www.webafrica.co.za/?AFF1E97


Re: fsck: to repair or not to repair

2016-06-10 Thread Henk Slager
On Fri, Jun 10, 2016 at 7:22 PM, Adam Borowski  wrote:
> On Fri, Jun 10, 2016 at 01:12:42PM -0400, Austin S. Hemmelgarn wrote:
>> On 2016-06-10 12:50, Adam Borowski wrote:
>> >And, as of coreutils 8.25, the default is no reflink, with "never" not being
>> >recognized even as a way to avoid an alias.  As far as I remember, this
>> >applies to every past version with support for reflinks too.
>> >
>> Odd, I could have sworn that was an option...
>>
>> And I do know there was talk at least at one point of adding it and
>> switching to reflink=auto by default.
>
> Yes please!
>
> It's hard to come with a good reason for not reflinking when it's possible
> -- the only one I see is if you have a nocow VM and want to slightly improve
> speed at a cost of lots of disk space.  And even then, there's cat a >b for
> that.

For a nocow VM image file, reflink anyhow does not work, so a
cp --reflink=auto would then just duplicate the whole thing,
effectively doing a 'cp --reflink=never' ('never' does work for
--sparse), either silently or with a warning/note.

For a cow VM image file, the only thing I do and want w.r.t. cp is
reflink=always, so I also vote for auto being on by default.

If you want to 'defrag' a VM image file, using cat or dd and enough RAM
does a better and faster job than cp or btrfs' manual defrag.
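
For illustration (file names are examples, and the rewrite assumes the
VM is shut down):

cp --reflink=always vm.img vm.clone   # shares extents, fails where reflink is not possible
cp --reflink=auto vm.img vm.copy      # reflinks when possible, otherwise falls back to a full copy
cat vm.img > vm.new && mv vm.new vm.img   # 'defrag by rewrite' as mentioned above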

> And the cost on non-btrfs non-unmerged-xfs is a single syscall per file,
> that's utterly negligible compared to actually copying the data.
>
> --
> An imaginary friend squared is a real enemy.


Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2016-06-10 Thread Henk Slager
On Thu, Jun 9, 2016 at 5:41 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> Hans van Kranenburg posted on Thu, 09 Jun 2016 01:10:46 +0200 as
> excerpted:
>
>> The next question is what files these extents belong to. To find out, I
>> need to open up the extent items I get back and follow a backreference
>> to an inode object. Might do that tomorrow, fun.
>>
>> To be honest, I suspect /var/log and/or the file storage of mailman to
>> be the cause of the fragmentation, since there's logging from postfix,
>> mailman and nginx going on all day long in a slow but steady tempo.
>> While using btrfs for a number of use cases at work now, we normally
>> don't use it for the root filesystem. And the cases where it's used as
>> root filesystem don't do much logging or mail.
>
> FWIW, that's one reason I have a dedicated partition (and filesystem) for
> logs, here.  (The other reason is that should something go runaway log-
> spewing, I get a warning much sooner when my log filesystem fills up, not
> much later, with much worse implications, when the main filesystem fills
> up!)
>
>> And no, autodefrag is not in the mount options currently. Would that be
>> helpful in this case?
>
> It should be helpful, yes.  Be aware that autodefrag works best with
> smaller (sub-half-gig) files, however, and that it used to cause
> performance issues with larger database and VM files, in particular.

I don't know why you relate file size and autodefrag. Maybe because you
say '... used to cause ...'.

Autodefrag detects random writes and then tries to defrag a certain
range. Its scope size is 256K as far as I can see from the code, and
over time VM images on a btrfs fs (CoW, hourly ro snapshots) end up
with a lot of 256K (or slightly smaller) extents according to what
filefrag reports. I once wanted to try changing the 256K to 1M or even
4M, but I haven't come to that.
A 32G VM image would consist of 131072 extents at 256K, 32768 extents
at 1M, and 8192 extents at 4M.

> There used to be a warning on the wiki about that, that was recently
> removed, so apparently it's not the issue that it was, but you might wish
> to monitor any databases or VMs with gig-plus files to see if it's going
> to be a performance issue, once you turn on autodefrag.

For very active databases, I don't know what the effects are, with or
without autodefrag (either on SSD and/or HDD).
At least on HDD-only, i.e. no persistent SSD caching and no autodefrag,
VMs will soon end up with unacceptable performance.

> The other issue with autodefrag is that if it hasn't been on and things
> are heavily fragmented, it can at first drive down performance as it
> rewrites all these heavily fragmented files, until it catches up and is
> mostly dealing only with the normal refragmentation load.

I assume you mean that one only gets a performance drop if you
actually do new writes to the fragmented files after autodefrag is
turned on. It shouldn't start defragging by itself AFAIK.

> Of course the
> best way around that is to run autodefrag from the first time you mount
> the filesystem and start writing to it, so it never gets overly
> fragmented in the first place.  For a currently in-use and highly
> fragmented filesystem, you have two choices, either backup and do a fresh
> mkfs.btrfs so you can start with a clean filesystem and autodefrag from
> the beginning, or doing manual defrag.
>
> However, be aware that if you have snapshots locking down the old extents
> in their fragmented form, a manual defrag will copy the data to new
> extents without releasing the old ones as they're locked in place by the
> snapshots, thus using additional space.  Worse, if the filesystem is
> already heavily fragmented and snapshots are locking most of those
> fragments in place, defrag likely won't help a lot, because the free
> space as well will be heavily fragmented.   So starting off with a clean
> and new filesystem and using autodefrag from the beginning really is your
> best bet.

If it is about a multi-TB fs, I think the most important thing is to
have enough unfragmented free space available, preferably at the
beginning of the device if it is a plain HDD. Maybe a   balance
-ddrange=1M..<20% of device>   can do that, I haven't tried.
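
Something along these lines might do it (untested; byte values are
used because I am not sure the range filters accept suffixes like G;
858993459200 is 800GiB, roughly 20% of a 4TiB device):

  btrfs balance start -ddrange=1048576..858993459200 /mnt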
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs

2016-06-05 Thread Henk Slager
>> > - OTOH, defrag seems to be viable for important use cases (VM
>> > images,
>> >   DBs,... everything where large files are internally re-written
>> >   randomly).
>> >   Sure there is nodatacow, but with that one effectively completely
>> >   looses one of the core features/promises of btrfs (integrity by
>> >   checksumming)... and as I've showed in an earlier large
>> > discussion,
>> >   none of the typical use cases for nodatacow has any high-level
>> >   checksumming, and even if, it's not used per default, or doesn't
>> > give
>> >   the same benefits at it would on the fs level, like using it for
>> > RAID
>> >   recovery).
>> The argument of nodatacow being viable for anything is a pretty
>> significant secondary discussion that is itself entirely orthogonal
>> to
>> the point you appear to be trying to make here.
>
> Well the point here was:
> - many people (including myself) like btrfs, it's
>   (promised/future/current) features
> - it's intended as a general purpose fs
> - this includes the case of having such file/IO patterns as e.g. for VM
>   images or DBs
> - this is currently not really doable without loosing one of the
>   promises (integrity)
>
> So the point I'm trying to make:
> People do probably not care so much whether their VM image/etc. is
> COWed or not, snapshots/etc. still work with that,... but they may
> likely care if the integrity feature is lost.
> So IMHO, nodatacow + checksumming deserves to be amongst the top
> priorities.

Have you tried blockdevice/HDD caching like bcache or dm-cache in
combination with VMs on BTRFS?  Or a ZVOL for VMs in ZFS with L2ARC?
I assume the primary reason for wanting nodatacow + checksumming is to
avoid long seek times on HDDs due to growing fragmentation of the VM
images over time. But even if you have nodatacow + checksumming
implemented, it is still HDD access, and a VM image file itself is not
guaranteed to be contiguous.
It is clear that for VM images the number of extents will be large
over time (like 50k or so, autodefrag on), but with a modern SSD used
as cache it doesn't matter. It is still way faster than just HDD(s),
even when those only hold a freshly copied image with <100 extents.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: "No space left on device" and balance doesn't work

2016-06-02 Thread Henk Slager
On Thu, Jun 2, 2016 at 3:55 PM, MegaBrutal <megabru...@gmail.com> wrote:
> 2016-06-02 0:22 GMT+02:00 Henk Slager <eye...@gmail.com>:
>> What is the kernel version used?
>> Is the fs on a mechanical disk or SSD?
>> What are the mount options?
>> How old is the fs?
>
> Linux 4.4.0-22-generic (Ubuntu 16.04).
> Mechanical disks in LVM.
> Mount: /dev/mapper/centrevg-rootlv on / type btrfs
> (rw,relatime,space_cache,subvolid=257,subvol=/@)
> I don't know how to retrieve the exact FS age, but it was created in
> 2014 August.
>
> Snapshots (their names encode their creation dates):
>
> ID 908 gen 487349 top level 5 path @-snapshot-2016050301
...
> ID 937 gen 521829 top level 5 path @-snapshot-2016060201
>
> Removing old snapshots is the most feasible solution, but I can also
> increase the FS size. It's easy since it's in LVM, and there is plenty
> of space in the volume group.
>
> Probably I should rewrite my alert script to check btrfs fi show
> instead of plain df.

Yes, I think that makes sense, to decide on chunk level. You can see
how big the chunks are with the linked show_usage.py program; most of
the 33 should be 1GiB, as already very well explained by Austin.

The setup looks pretty normal and btrfs should be able to handle it,
but unfortunately your fs is a typical example of why one currently
needs to monitor/tune a btrfs fs for its 'health' in order to keep it
running long-term. You might want to change the mount option relatime
to noatime, so that you have fewer writes to metadata chunks. It
should lower the scattering inside the metadata chunks.
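
For example (keep whatever other options you already use; the device
and subvol are taken from your report, so adjust as needed):

  mount -o remount,noatime /

and in /etc/fstab something like:

  /dev/mapper/centrevg-rootlv  /  btrfs  noatime,subvol=@  0  0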
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [BUG] receive not seeing file that exists

2016-06-02 Thread Henk Slager
On Thu, Jun 2, 2016 at 9:26 AM, Benedikt Morbach
 wrote:
> Hi all,
>
> I've encountered a bug in btrfs-receive. When receiving a certain
> incremental send, it will error with:
>
> ERROR: cannot open
> backup/detritus/root/root.20160524T1800/var/log/journal/9cbb44cf160f4c1089f77e32ed376a0b/user-1000.journal:
> No such file or directory
>
> even though that path exists and the parent subvolume is identical on
> both ends (I checked manually).
>
> I've noticed this happen before on the same directory (and google
> confirms it has also happened to others) and /var/log/journal/ and its
> children are the only directories with 'chattr +C' on this system, so
> it might be related to that?

Now that I see this report, I realize that I also hit this issue. I
was compiling a kernel with 'make -j4 all'. Under some circumstances,
this leads to 'package temperature too high' and CPU speed throttling;
with -j1 or -j2 I haven't seen it (the root cause is the power supply,
I think).

Anyhow, while this compile was running, my nightly snapshotting and
incremental send|receive was started. I saw an MCE HW error in the
kernel log at that point as well, so I restarted. The incremental send
had also failed, so I thought it was due to the MCE issue. But even
with no MCE HW issues logged, tools 4.5.3 + kernel 4.5.4 and also
tools 4.5.3 + kernel 4.6.0 showed the same issue.

I run send and receive on the same PC in this case, but split the
stream to a file in addition. I noticed the file was already corrupt
(too short), so I concluded the issue was in send. I set up an extra
hourly backup crontask for this problem subvol and it failed almost
every hour. For another subvolume on the new, 3-day-old fs, it was not
a problem. The fs is a few TB and has default mkfs settings + no-holes.
The nodesize was increased from 4k to 16k; that was a reason to
re-create it.

For the problem subvol, and also others that I don't back up
incrementally, I set the subvol to ro on the old fs, sent the stream
file to temporary storage, received it back on the new fs, set it to
rw, created an initial backup snapshot of it and sent that over to the
backup fs. That all worked fine. Several programs write and delete
roughly 10 files/hour, so it is not a very active part of the fs. It
was quite random at which file the incremental stream got corrupted.
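
Roughly like this (paths are examples, not my actual layout):

  btrfs property set -ts /oldfs/subvol ro true
  btrfs send /oldfs/subvol > /tempstorage/subvol.stream
  btrfs receive /newfs < /tempstorage/subvol.stream
  btrfs property set -ts /newfs/subvol ro false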

My best guess was that the use of  btrfs property set  might be the
issue, so I rsynced the data in the subvol into a new subvol and did
initial backup snapshot transfer. This was with tools-4.5.3 +
kernel-4.5.4 and it runs now fine for 10 days.

I had limited time to research this issue for the subvol and also
cannot provide send-stream data for the subvol. But I still have a 12G
btrfs stream of a .git kernel-build tree that also got this  btrfs
property set ro=true treatment. So I might try to reproduce the bug
with that one.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: "No space left on device" and balance doesn't work

2016-06-01 Thread Henk Slager
On Wed, Jun 1, 2016 at 11:06 PM, MegaBrutal  wrote:
> Hi Peter,
>
> I tried. I either get "Done, had to relocate 0 out of 33 chunks" or
> "ERROR: error during balancing '/': No space left on device", and
> nothing changes.
>
>
> 2016-06-01 22:29 GMT+02:00 Peter Becker :
>> try this:
>>
>> btrfs fi balance start -musage=0 /
>> btrfs fi balance start -dusage=0 /
>>
>> btrfs fi balance start -musage=1 /
>> btrfs fi balance start -dusage=1 /
>>
>> btrfs fi balance start -musage=5 /
>> btrfs fi balance start -musage=10 /
>> btrfs fi balance start -musage=20 /
>>
>>
>> btrfs fi balance start -dusage=5 /
>> btrfs fi balance start -dusage=10 /
>> btrfs fi balance start -dusage=20 /
>> 
>>
>> 2016-06-01 20:30 GMT+02:00 MegaBrutal :
>>> Hi all,
>>>
>>> I have a 20 GB file system and df says I have about 2,6 GB free space,
>>> yet I can't do anything on the file system because I get "No space
>>> left on device" errors. I read that balance may help to remedy the
>>> situation, but it actually doesn't.
>>>
>>>
>>> Some data about the FS:
>>>
>>>
>>> root@ReThinkCentre:~# df -h /
>>> FájlrendszerMéret Fogl. Szab. Fo.% Csatol. pont
>>> /dev/mapper/centrevg-rootlv   20G   18G  2,6G  88% /
>>>
>>> root@ReThinkCentre:~# btrfs fi show /
>>> Label: 'RootFS'  uuid: 3f002b8d-8a1f-41df-ad05-e3c91d7603fb
>>> Total devices 1 FS bytes used 15.42GiB
>>> devid1 size 20.00GiB used 20.00GiB path 
>>> /dev/mapper/centrevg-rootlv

The device is completely filled with chunks (size and used are the
same) and none of the chunks is empty so a balance won't work at all.

A way to get out of this situation is to add a temporary extra device
(e.g. an 8GB USB stick or a loop device on some larger USB disk) to the
fs and then do the various balance operations. Removing as many
snapshots as possible will ease the balance the most, depending on how
old the snapshots are.
Once you see that the total amount of space used by chunks is (a few
GiB) less than 19GiB, you can remove the temporary extra device from
the fs again.
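
A rough sketch with a loop device on some other filesystem (paths are
examples, untested):

  truncate -s 8G /media/usbdisk/btrfs-spill.img
  losetup -f --show /media/usbdisk/btrfs-spill.img   # say it prints /dev/loop0
  btrfs device add /dev/loop0 /
  btrfs balance start -dusage=20 /
  btrfs device delete /dev/loop0 /
  losetup -d /dev/loop0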

It is then still possible that you run into the same situation again;
this is a long-term bug/problem. A brand-new kernel, with the ENOSPC
patches included, might help.

What is the kernel version used?
Is the fs on a mechanical disk or SSD?
What are the mount options?
How old is the fs?

You might want to run this python script, so you get an idea of what
the chunk fill-level is:
https://github.com/knorrie/btrfs-heatmap/blob/master/show_usage.py

You could also mount the fs with enospc_debug and see what is reported in dmesg
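
For example:

  mount -o remount,enospc_debug /
  dmesg | tail -n 50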

>>> root@ReThinkCentre:~# btrfs fi df /
>>> Data, single: total=16.69GiB, used=14.14GiB
>>> System, DUP: total=32.00MiB, used=16.00KiB
>>> Metadata, DUP: total=1.62GiB, used=1.28GiB
>>> GlobalReserve, single: total=352.00MiB, used=0.00B
>>>
>>> root@ReThinkCentre:~# btrfs version
>>> btrfs-progs v4.4
>>>
>>>
>>> This happens when I try to balance:
>>>
>>> root@ReThinkCentre:~# btrfs fi balance start -dusage=66 /
>>> Done, had to relocate 0 out of 33 chunks
>>> root@ReThinkCentre:~# btrfs fi balance start -dusage=67 /
>>> ERROR: error during balancing '/': No space left on device
>>> There may be more info in syslog - try dmesg | tail
>>>
>>>
>>> "dmesg | tail" does not show anything related to this.
>>>
>>> It is important to note that the file system currently has 32
>>> snapshots of / at the moment, and snapshots taking up all the free
>>> space is a plausible explanation. Maybe deleting some of the oldest
>>> snapshots or just increasing the file system would help the situation.
>>> However, I'm still interested, if the file system is full, why does df
>>> show there is free space, and how could I show the situation without
>>> having the mentioned options? I actually have an alert set up which
>>> triggers when the FS usage reaches 90%, so then I know I have to
>>> delete some old snapshots. It worked so far, I cleaned the snapshots
>>> at 90%, FS usage fell back, everyone was happy. But now the alert
>>> didn't even trigger because the FS is at 88% usage, so it shouldn't be
>>> full yet.
>>>
>>>
>>> Best regards and kecske,
>>> MegaBrutal
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>> the body of a message to majord...@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs filesystem usage - Wrong Unallocated indications - RAID10

2016-05-25 Thread Henk Slager
There is a division by 2 missing in the code. With that added, the
RAID10 numbers make more sense. See also:
http://permalink.gmane.org/gmane.comp.file-systems.btrfs/53989

More detail in here:
https://www.spinics.net/lists/linux-btrfs/msg52882.html

And if you want to look at allocation in a different way, this might
be interesting:
https://github.com/knorrie/btrfs-heatmap/blob/master/show_usage.py

On Mon, May 23, 2016 at 3:34 PM, Marco Lorenzo Crociani
 wrote:
> Hi,
> as I wrote today in IRCI experienced an issue with 'btrfs filesystem usage'.
> I have a 4 partitions RAID10 btrfs filesystem almost full.
> 'btrfs filesystem usage' reports wrong "Unallocated" indications.
>
> Linux 4.5.3
> btrfs-progs v4.5.3
>
>
> # btrfs fi usage /data/
>
> Overall:
> Device size:  13.93TiB
> Device allocated:  13.77TiB
> Device unallocated: 167.54GiB
> Device missing: 0.00B
> Used:  13.44TiB
> Free (estimated): 244.39GiB(min: 244.39GiB)
> Data ratio:  2.00
> Metadata ratio:  2.00
> Global reserve: 512.00MiB(used: 0.00B)
>
> Data,single: Size:8.00MiB, Used:0.00B
>/dev/sda4   8.00MiB
>
> Data,RAID10: Size:6.87TiB, Used:6.71TiB
>/dev/sda4   1.72TiB
>/dev/sdb3   1.72TiB
>/dev/sdc3   1.72TiB
>/dev/sdd3   1.72TiB
>
> Metadata,single: Size:8.00MiB, Used:0.00B
>/dev/sda4   8.00MiB
>
> Metadata,RAID10: Size:19.00GiB, Used:14.15GiB
>/dev/sda4   4.75GiB
>/dev/sdb3   4.75GiB
>/dev/sdc3   4.75GiB
>/dev/sdd3   4.75GiB
>
> System,single: Size:4.00MiB, Used:0.00B
>/dev/sda4   4.00MiB
>
> System,RAID10: Size:16.00MiB, Used:768.00KiB
>/dev/sda4   4.00MiB
>/dev/sdb3   4.00MiB
>/dev/sdc3   4.00MiB
>/dev/sdd3   4.00MiB
>
> Unallocated:
>/dev/sda4   1.76TiB
>/dev/sdb3   1.76TiB
>/dev/sdc3   1.76TiB
>/dev/sdd3   1.76TiB
>
> --
> # btrfs fi show /data/
> Label: 'data'  uuid: df6639d5-3ef2-4ff6-a871-9ede440e2dae
> Total devices 4 FS bytes used 6.72TiB
> devid1 size 3.48TiB used 3.44TiB path /dev/sda4
> devid2 size 3.48TiB used 3.44TiB path /dev/sdb3
> devid3 size 3.48TiB used 3.44TiB path /dev/sdc3
> devid4 size 3.48TiB used 3.44TiB path /dev/sdd3
>
> --
> # btrfs fi df /data/
> Data, RAID10: total=6.87TiB, used=6.71TiB
> Data, single: total=8.00MiB, used=0.00B
> System, RAID10: total=16.00MiB, used=768.00KiB
> System, single: total=4.00MiB, used=0.00B
> Metadata, RAID10: total=19.00GiB, used=14.15GiB
> Metadata, single: total=8.00MiB, used=0.00B
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> --
> # df -h
> /dev/sda4 7,0T  6,8T245G  97% /data
>
> Regards,
>
> --
> Marco Crociani
> Prisma Telecom Testing S.r.l.
> via Petrocchi, 4  20127 MILANO  ITALY
> Phone:  +39 02 26113507
> Fax:  +39 02 26113597
> e-mail:  mar...@prismatelecomtesting.com
> web:  http://www.prismatelecomtesting.com
>

Re: Copy on write of unmodified data

2016-05-25 Thread Henk Slager
On Wed, May 25, 2016 at 10:58 AM, H. Peter Anvin  wrote:
> Hi,
>
> I'm looking at using a btrfs with snapshots to implement a generational
> backup capacity.  However, doing it the naïve way would have the side
> effect that for a file that has been partially modified, after
> snapshotting the file would be written with *mostly* the same data.  How
> does btrfs' COW algorithm deal with that?  If necessary I might want to
> write some smarter user space utilities for this.

Assuming 'snapshots' (plural) refers to incremental snapshots of a
subvolume, you might want to use the send ioctl of the kernel. The
output of the userspace btrfs-progs  btrfs send --no-data  might give
some hints.
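
For example (made-up paths; both snapshots must be read-only and live
on the same fs):

  btrfs send --no-data -p /snaps/backup-prev /snaps/backup-cur > /tmp/incr-nodata.stream

The resulting stream contains the send commands but no file data.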
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Hot data tracking / hybrid storage

2016-05-20 Thread Henk Slager
 bcache protective superblocks is a one-time procedure which can be done
 online. The bcache devices act as normal HDD if not attached to a
 caching SSD. It's really less pain than you may think. And it's a
 solution available now. Converting back later is easy: Just detach the
 HDDs from the SSDs and use them for some other purpose if you feel so
 later. Having the bcache protective superblock still in place doesn't
 hurt then. Bcache is a no-op without caching device attached.
>>>
>>>
>>> No, bcache is _almost_ a no-op without a caching device.  From a
>>> userspace
>>> perspective, it does nothing, but it is still another layer of
>>> indirection
>>> in the kernel, which does have a small impact on performance.  The same
>>> is
>>> true of using LVM with a single volume taking up the entire partition, it
>>> looks almost no different from just using the partition, but it will
>>> perform
>>> worse than using the partition directly.  I've actually done profiling of
>>> both to figure out base values for the overhead, and while bcache with no
>>> cache device is not as bad as the LVM example, it can still be a roughly
>>> 0.5-2% slowdown (it gets more noticeable the faster your backing storage
>>> is).
>>>
>>> You also lose the ability to mount that filesystem directly on a kernel
>>> without bcache support (this may or may not be an issue for you).
>>
>>
>> The bcache (protective) superblock is in an 8KiB block in front of the
>> file system device. In case the current, non-bcached HDD's use modern
>> partitioning, you can do a 5-minute remove or add of bcache, without
>> moving/copying filesystem data. So in case you have a bcache-formatted
>> HDD that had just 1 primary partition (512 byte logical sectors), the
>> partition start is at sector 2048 and the filesystem start is at 2064.
>> Hard removing bcache (so making sure the module is not
>> needed/loaded/used the next boot) can be done done by changing the
>> start-sector of the partition from 2048 to 2064. In gdisk one has to
>> change the alignment to 16 first, otherwise this it refuses. And of
>> course, also first flush+stop+de-register bcache for the HDD.
>>
>> The other way around is also possible, i.e. changing the start-sector
>> from 2048 to 2032. So that makes adding bcache to an existing
>> filesystem a 5 minute action and not a GBs- or TBs copy action. It is
>> not online of course, but just one reboot is needed (or just umount,
>> gdisk, partprobe, add bcache etc).
>> For RAID setups, one could just do 1 HDD first.
>
> My argument about the overhead was not about the superblock, it was about
> the bcache layer itself.  It isn't practical to just access the data
> directly if you plan on adding a cache device, because then you couldn't do
> so online unless you're going through bcache.  This extra layer of
> indirection in the kernel does add overhead, regardless of the on-disk
> format.

Yes, sorry, I took a shortcut in the discussion and jumped to a
method for avoiding this 0.5-2% slowdown that you mention (or a
kernel crashing in bcache code due to a corrupt SB on a backing device
or corrupted caching device contents).
I am actually a bit surprised that there is a measurable slowdown,
considering that it is basically just one 8KiB offset at a certain
layer in the kernel stack, but I haven't looked at that code.

> Secondarily, having a HDD with just one partition is not a typical use case,
> and that argument about the slack space resulting from the 1M alignment only
> holds true if you're using an MBR instead of a GPT layout (or for that
> matter, almost any other partition table format), and you're not booting
> from that disk (because GRUB embeds itself there). It's also fully possible
> to have an MBR formatted disk which doesn't have any spare space there too
> (which is how most flash drives get formatted).

I don't know partition tables other than MBR and GPT, but this bcache
SB 'insertion' works with both. Indeed, if GRUB is involved it can get
complicated; I have avoided that. If there is less than 8KiB slack
space on a HDD, I would worry about alignment/performance first; then
there is likely a reason to fully rewrite the HDD with a standard 1M
alignment.
If there are more partitions, and there is a partition in front of the
one you would like to be bcached, I would personally shrink it by 8KiB
(like NTFS or swap or ext4) if that saves me terabytes of data
transfers.

> This also doesn't change the fact that without careful initial formatting
> (it is possible on some filesystems to embed the bcache SB at the beginning
> of the FS itself, many of them have some reserved space at the beginning of
> the partition for bootloaders, and this space doesn't have to exist when
> mounting the FS) or manual alteration of the partition, it's not possible to
> mount the FS on a system without bcache support.

If we consider a non-bootable single HDD btrfs FS, are you then
suggesting that the bcache SB could be placed in the 

Re: Hot data tracking / hybrid storage

2016-05-20 Thread Henk Slager
On Fri, May 20, 2016 at 7:59 PM, Austin S. Hemmelgarn
 wrote:
> On 2016-05-20 13:02, Ferry Toth wrote:
>>
>> We have 4 1TB drives in MBR, 1MB free at the beginning, grub on all 4,
>> then 8GB swap, then all the rest btrfs (no LVM used). The 4 btrfs
>> partitions are in the same pool, which is in btrfs RAID10 format. /boot
>> is in subvolume @boot.
>
> If you have GRUB installed on all 4, then you don't actually have the full
> 2047 sectors between the MBR and the partition free, as GRUB is embedded in
> that space.  I forget exactly how much space it takes up, but I know it's
> not the whole 1023.5K  I would not suggest risking usage of the final 8k
> there though.  You could however convert to raid1 temporarily, and then for
> each device, delete it, reformat for bcache, then re-add it to the FS.  This
> may take a while, but should be safe (of course, it's only an option if
> you're already using a kernel with bcache support).

There is more than enough space in that 2047-sector area for
inserting a bcache SB, but initially I also found it risky and was not
so sure. I anyhow don't want GRUB in the MBR, but in the filesystem/OS
partition that it should boot, otherwise multi-OS on the same SSD or
HDD gets into trouble.

For the described system, assuming a few minutes offline or
'maintenance' mode is acceptable, I personally would just shrink the
swap by 8KiB, lower its end sector by 16 and also lower the
start sector of the btrfs partition by 16, and then add bcache. The
location of GRUB should not actually matter.
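
A rough sketch of what I mean (untested on your exact layout; device
names are examples, sector counts assume 512-byte sectors and the
default 8KiB bcache data offset, so double-check everything and have
backups first):

  # offline: umount the btrfs fs, swapoff the swap partition
  # in gdisk/fdisk: end the swap partition 16 sectors earlier, then
  # recreate the btrfs partition starting 16 sectors earlier, keeping
  # its end sector unchanged
  make-bcache -B /dev/sdX3                    # writes the SB into the freed 8KiB
  echo /dev/sdX3 > /sys/fs/bcache/register
  mount /dev/bcache0 /mnt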

>> In this configuration nothing would beat btrfs if I could just add 2
>> SSD's to the pool that would be clever enough to be paired in RAID1 and
>> would be preferred for small (<1GB) file writes. Then balance should be
>> able to move not often used files to the HDD.
>>
>> None of the methods mentioned here sound easy or quick to do, or even
>> well tested.

I agree that all the methods are actually quite complicated,
especially compared to ZFS and its tools. Adding an L2ARC there is as
simple and easy as you want and describe.

The statement I wanted to make is that adding bcache for a (btrfs)
filesystem can be done without touching the FS itself, provided that
one can allow some offline time for the FS.

> It really depends on what you're used to.  I would consider most of the
> options easy, but one of the areas I'm strongest with is storage management,
> and I've repaired damaged filesystems and partition tables by hand with a
> hex editor before, so I'm not necessarily a typical user.  If I was going to
> suggest something specifically, it would be dm-cache, because it requires no
> modification to the backing store at all, but that would require running on
> LVM if you want it to be easy to set up (it's possible to do it without LVM,
> but you need something to call dmsetup before mounting the filesystem, which
> is not easy to configure correctly), and if you're on an enterprise distro,
> it may not be supported.
>
> If you wanted to, it's possible, and not all that difficult, to convert a
> BTRFS system to BTRFS on top of LVM online, but you would probably have to
> split out the boot subvolume to a separate partition (depends on which
> distro you're on, some have working LVM support in GRUB, some don't).  If
> you're on a distro which does have LVM support in GRUB, the procedure would
> be:
> 1. Convert the BTRFS array to raid1. This lets you run with only 3 disks
> instead of 4.
> 2. Delete one of the disks from the array.
> 3. Convert the disk you deleted from the array to a LVM PV and add it to a
> VG.
> 4. Create a new logical volume occupying almost all of the PV you just added
> (having a little slack space is usually a good thing).
> 5. Add use btrfs replace to add the LV to the BTRFS array while deleting one
> of the others.
> 6. Repeat from step 3-5 for each disk, but stop at step 4 when you have
> exactly one disk that isn't on LVM (so for four disks, stop at step four
> when you have 2 with BTRFS+LVM, one with just the LVM logical volume, and
> one with just BTRFS).
> 7. Reinstall GRUB (it should pull in LVM support now).
> 8. Use BTRFS replace to move the final BTRFS disk to the empty LVM volume.
> 9. Convert the now empty final disk to LVM using steps 3-4
> 10. Add the LV to the BTRFS array and rebalance to raid10.
> 11. Reinstall GRUB again (just to be certain).
>
> I've done essentially the same thing on numerous occasions when
> reprovisioning for various reasons, and it's actually one of the things
> outside of the xfstests that I check with my regression testing (including
> simulating a couple of the common failure modes).  It takes a while
> (especially for big arrays with lots of data), but it works, and is
> relatively safe (you are guaranteed to be able to rebuild a raid1 array of 3
> disks from just 2, so losing the disk in the process of copying it will not
> result in data loss unless you hit a kernel bug).
--
To unsubscribe from this 

Re: Hot data tracking / hybrid storage

2016-05-19 Thread Henk Slager
On Thu, May 19, 2016 at 8:51 PM, Austin S. Hemmelgarn
 wrote:
> On 2016-05-19 14:09, Kai Krakow wrote:
>>
>> Am Wed, 18 May 2016 22:44:55 + (UTC)
>> schrieb Ferry Toth :
>>
>>> Op Tue, 17 May 2016 20:33:35 +0200, schreef Kai Krakow:
>>>
 Am Tue, 17 May 2016 07:32:11 -0400 schrieb "Austin S. Hemmelgarn"
 :

> On 2016-05-17 02:27, Ferry Toth wrote:
>>>
>>>  [...]
>>>  [...]
>
>  [...]
>>>
>>>  [...]
>>>  [...]
>>>  [...]
>
> On the other hand, it's actually possible to do this all online
> with BTRFS because of the reshaping and device replacement tools.
>
> In fact, I've done even more complex reprovisioning online before
> (for example, my home server system has 2 SSD's and 4 HDD's,
> running BTRFS on top of LVM, I've at least twice completely
> recreated the LVM layer online without any data loss and minimal
> performance degradation).
>>>
>>>  [...]
>
> I have absolutely no idea how bcache handles this, but I doubt
> it's any better than BTRFS.


 Bcache should in theory fall back to write-through as soon as an
 error counter exceeds a threshold. This is adjustable with sysfs
 io_error_halftime and io_error_limit. Tho I never tried what
 actually happens when either the HDD (in bcache writeback-mode) or
 the SSD fails. Actually, btrfs should be able to handle this (tho,
 according to list reports, it doesn't handle errors very well at
 this point).

 BTW: Unnecessary copying from SSD to HDD doesn't take place in
 bcache default mode: It only copies from HDD to SSD in writeback
 mode (data is written to the cache first, then persisted to HDD in
 the background). You can also use "write through" (data is written
 to SSD and persisted to HDD at the same time, reporting persistence
 to the application only when both copies were written) and "write
 around" mode (data is written to HDD only, and only reads are
 written to the SSD cache device).

 If you want bcache behave as a huge IO scheduler for writes, use
 writeback mode. If you have write-intensive applications, you may
 want to choose write-around to not wear out the SSDs early. If you
 want writes to be cached for later reads, you can choose
 write-through mode. The latter two modes will ensure written data
 is always persisted to HDD with the same guaranties you had without
 bcache. The last mode is default and should not change behavior of
 btrfs if the HDD fails, and if the SSD fails bcache would simply
 turn off and fall back to HDD.
>>>
>>>
>>> Hello Kai,
>>>
>>> Yeah, lots of modes. So that means, none works well for all cases?
>>
>>
>> Just three, and they all work well. It's just a decision wearing vs.
>> performance/safety. Depending on your workload you might benefit more or
>> less from write-behind caching - that's when you want to turn the knob.
>> Everything else works out of the box. In case of an SSD failure,
>> write-back is just less safe while the other two modes should keep your
>> FS intact in that case.
>>
>>> Our server has lots of old files, on smb (various size), imap
>>> (1's small, 1000's large), postgresql server, virtualbox images
>>> (large), 50 or so snapshots and running synaptics for system upgrades
>>> is painfully slow.
>>
>>
>> I don't think that bcache even cares to cache imap accesses to mail
>> bodies - it won't help performance. Network is usually much slower than
>> SSD access. But it will cache fs meta data which will improve imap
>> performance a lot.
>
> Bcache caches anything that falls within it's heuristics as candidates for
> caching.  It pays no attention to what type of data you're accessing, just
> the access patterns.  This is also the case for dm-cache, and for Windows
> ReadyBoost (or whatever the hell they're calling it these days).  Unless
> you're shifting very big e-mails, it's pretty likely that ones that get
> accessed more than once in a short period of time will end up being cached.
>>
>>
>>> We are expecting slowness to be caused by fsyncs which appear to be
>>> much worse on a raid10 with snapshots. Presumably the whole thing
>>> would be fast enough with ssd's but that would be not very cost
>>> efficient.
>>>
>>> All the overhead of the cache layer could be avoided if btrfs would
>>> just prefer to write small, hot, files to the ssd in the first place
>>> and clean up while balancing. A combination of 2 ssd's and 4 hdd's
>>> would be very nice (the mobo has 6 x sata, which is pretty common)
>>
>>
>> Well, I don't want to advertise bcache. But there's nothing you
>> couldn't do with it in your particular case:
>>
>> Just attach two HDDs to one SSD. Bcache doesn't use a 1:1 relation
>> here, you can use 1:n where n is the backing devices. There's no need
>> to clean up using balancing because bcache will track hot data by
>> default. 

Re: fsck: to repair or not to repair

2016-05-12 Thread Henk Slager
On Wed, May 11, 2016 at 11:10 PM, Nikolaus Rath  wrote:
> Hello,
>
> I recently ran btrfsck on one of my file systems, and got the following
> messages:
>
> checking extents
> checking free space cache
> checking fs roots
> root 5 inode 3149867 errors 400, nbytes wrong
> root 5 inode 3150237 errors 400, nbytes wrong
> root 5 inode 3150238 errors 400, nbytes wrong
> root 5 inode 3150242 errors 400, nbytes wrong
> root 5 inode 3150260 errors 400, nbytes wrong
> [ lots of similar message with different inode numbers ]
> root 5 inode 15595011 errors 400, nbytes wrong
> root 5 inode 15595016 errors 400, nbytes wrong
> Checking filesystem on /dev/mapper/vg0-nikratio_crypt
> UUID: 8742472d-a9b0-4ab6-b67a-5d21f14f7a38
> found 263648960636 bytes used err is 1
> total csum bytes: 395314372
> total tree bytes: 908644352
> total fs tree bytes: 352735232
> total extent tree bytes: 95039488
> btree space waste bytes: 156301160
> file data blocks allocated: 675209801728
>  referenced 410351722496
> Btrfs v3.17
>
>
>
> Can someone explain to me the risk that I run by attempting a repair,
> and (conversely) what I put at stake when continuing to use this file
> system as-is?

It has once been mentioned on this mailing list that if 'errors 400,
nbytes wrong' is the only kind of error on an fs, btrfs check --repair
can fix it (that was around the time of the tools 4.4 release, by Qu
AFAIK).
I had (have?) about 7 of those errors in small files on an fs that is
2.5 years old and has quite a few older ro snapshots. I once tried to
fix them with 4.5.0 tools plus some patches, but they did not actually
get fixed. At least with 4.5.2 or 4.5.3 tools it should be possible to
fix them in your case. Maybe you first want to test it on an overlay
of the device, or copy the whole fs with dd. It depends on how much
time you can allow the fs to be offline etc.; it is up to you.
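
An overlay can be set up with a device-mapper snapshot, roughly like
this (a sketch using the device name from your output above; the fs
must be unmounted, and all repair writes then go to the throw-away COW
file instead of the real device):

  truncate -s 10G /tmp/cow.img
  losetup -f --show /tmp/cow.img        # say it prints /dev/loop0
  dmsetup create overlay --table "0 $(blockdev --getsz /dev/mapper/vg0-nikratio_crypt) snapshot /dev/mapper/vg0-nikratio_crypt /dev/loop0 N 8"
  btrfs check --repair /dev/mapper/overlay
  dmsetup remove overlay
  losetup -d /dev/loop0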

In my case, I recreated the files in the working subvol, but as long
as I don't remove the older snapshots, the errors 400 will still be
there I assume. At least I don't experience any negative impact of it,
so I keep it like it is until at some point in time the older
snapshots get removed or I am somehow forced to clone back the data
into a fresh fs. I am running mostly latest stable or sometimes
mainline kernel.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: FYI: Kernel crash info

2016-05-10 Thread Henk Slager
On Tue, May 10, 2016 at 9:35 PM,   wrote:
> He guys!
>
>
> while testing/stressing (dd'ing 200GB random to the drive) a brand new
> 8TB seagate drive i ran into an kernel ooops.
>
> i think it happend after i finished dd'ing and while removing the drive.
> saw it a few minutes afterwards.

Strictly speaking, this is not a kernel crash; it is just a failure to
write to the device, reporting an IO error, with the file system going
read-only as a result.

What does   dmesg | grep ST8000   show?

If this 8TB is an SMR drive ('Archive'), be aware that under certain
circumstances, for example unplugging/unpowering too quickly after
massive random writes, this kind of IO failure is to be expected.


> uname -a
> Linux MacBookPro 4.4.0-22-generic #39~14.04.1-Ubuntu SMP Thu May 5
> 19:19:06 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
> master@MacBookPro:~$ btrfs --version
> btrfs-progs v4.4
>
>
> [ 6971.087519] BTRFS: error (device sdb1) in
> btrfs_commit_transaction:2124: errno=-5 IO failure (Error while writing
> out transaction)
> [ 6971.087525] BTRFS info (device sdb1): forced readonly
> [ 6971.087528] BTRFS warning (device sdb1): Skipping commit of aborted
> transaction.
> [ 6971.087530] [ cut here ]
> [ 6971.087559] WARNING: CPU: 1 PID: 8161 at
> /build/linux-lts-xenial-T5gd_J/linux-lts-xenial-4.4.0/fs/btrfs/transaction.c:1746
> cleanup_transaction+0x87/0x2e0 [btrfs]()
> [ 6971.087561] BTRFS: Transaction aborted (error -5)
> [ 6971.087563] Modules linked in: ufs qnx4 hfsplus hfs minix ntfs msdos
> jfs xfs libcrc32c cpuid uas usb_storage ipt_MASQUERADE
> nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4
> nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4
> xt_CHECKSUM iptable_mangle xt_tcpudp bridge stp llc ip6table_filter
> ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables
> rpcsec_gss_krb5 nfsv4 rfcomm bnep nfsd auth_rpcgss nfs_acl nfs lockd
> grace sunrpc fscache intel_rapl x86_pkg_temp_thermal intel_powerclamp
> joydev coretemp applesmc kvm_intel input_polldev kvm irqbypass
> crct10dif_pclmul snd_hda_codec_cirrus snd_hda_codec_generic
> snd_hda_codec_hdmi crc32_pclmul dm_multipath aesni_intel uvcvideo
> aes_x86_64 videobuf2_vmalloc lrw videobuf2_memops btusb gf128mul
> snd_seq_midi videobuf2_v4l2 glue_helper btrtl snd_seq_midi_event
> snd_hda_intel ablk_helper videobuf2_core v4l2_common btbcm cryptd
> snd_rawmidi snd_hda_codec btintel videodev snd_hda_core hid_appleir
> bluetooth wl(POE) snd_hwdep lpc_ich snd_pcm input_leds snd_seq cfg80211
> media bcm5974 snd_seq_device snd_timer snd soundcore acpi_als sbs
> kfifo_buf apple_gmux industrialio sbshc mac_hid mei_me parport_pc
> apple_bl mei shpchp ppdev lp parport btrfs xor raid6_pq hid_generic
> hid_apple usbhid hid amdkfd amd_iommu_v2 radeon i2c_algo_bit ttm
> drm_kms_helper syscopyarea sysfillrect sysimgblt tg3 fb_sys_fops drm
> firewire_ohci ptp firewire_core pata_acpi pps_core crc_itu_t video fjes
> [ 6971.087672] CPU: 1 PID: 8161 Comm: umount Tainted: P   OE
> 4.4.0-22-generic #39~14.04.1-Ubuntu
> [ 6971.087675] Hardware name: Apple Inc.
> MacBookPro8,3/Mac-942459F5819B171B, BIOS
> MBP81.88Z.0047.B2C.1510261540 10/26/15
> [ 6971.087677]   88045806fc38 813cde6c
> 88045806fc80
> [ 6971.087681]  c04741c0 88045806fc70 8107d856
> 88045b52f960
> [ 6971.087684]  88045ab22800 88045b5a6000 fffb
> fffb
> [ 6971.087687] Call Trace:
> [ 6971.087694]  [] dump_stack+0x63/0x87
> [ 6971.087699]  [] warn_slowpath_common+0x86/0xc0
> [ 6971.087703]  [] warn_slowpath_fmt+0x4c/0x50
> [ 6971.087722]  [] cleanup_transaction+0x87/0x2e0 [btrfs]
> [ 6971.087727]  [] ? prepare_to_wait_event+0xf0/0xf0
> [ 6971.087731]  [] ? __wake_up+0x44/0x50
> [ 6971.087750]  []
> btrfs_commit_transaction+0x267/0xa80 [btrfs]
> [ 6971.087764]  [] btrfs_sync_fs+0x55/0x110 [btrfs]
> [ 6971.087768]  [] sync_filesystem+0x71/0xa0
> [ 6971.087773]  [] generic_shutdown_super+0x27/0x100
> [ 6971.087776]  [] kill_anon_super+0x12/0x20
> [ 6971.087790]  [] btrfs_kill_super+0x18/0x120 [btrfs]
> [ 6971.087794]  [] deactivate_locked_super+0x43/0x70
> [ 6971.087797]  [] deactivate_super+0x46/0x60
> [ 6971.087801]  [] cleanup_mnt+0x3f/0x80
> [ 6971.087805]  [] __cleanup_mnt+0x12/0x20
> [ 6971.087809]  [] task_work_run+0x77/0x90
> [ 6971.087812]  [] exit_to_usermode_loop+0x73/0xa2
> [ 6971.087816]  [] syscall_return_slowpath+0x4e/0x60
> [ 6971.087820]  [] int_ret_from_sys_call+0x25/0x8f
> [ 6971.087823] ---[ end trace 02b0451a7c4939ae ]---
> [ 6971.087826] BTRFS: error (device sdb1) in cleanup_transaction:1746:
> errno=-5 IO failure
> [ 6971.087828] BTRFS info (device sdb1): delayed_refs has NO entry
> [ 6973.206207] BTRFS error (device sdb1): cleaner transaction attach
> returned -30
>
>
> master@MacBookPro:~$ apt-cache show linux-image-generic-lts-xenial
> Package: linux-image-generic-lts-xenial
> Priority: optional
> 

Re: Question: raid1 behaviour on failure

2016-04-28 Thread Henk Slager
On Thu, Apr 28, 2016 at 7:09 AM, Matthias Bodenbinder
<matth...@bodenbinder.de> wrote:
> Am 26.04.2016 um 18:19 schrieb Henk Slager:
>> It looks like a JMS567 + SATA port multipliers behaind it are used in
>> this drivebay. The command   lsusb -v  could show that. So your HW
>> setup is like JBOD, not RAID.
>
> Here is the output of lsusb -v:
>
>
> Bus 003 Device 004: ID 152d:0567 JMicron Technology Corp. / JMicron USA 
> Technology Corp.
> Device Descriptor:
>   bLength18
>   bDescriptorType 1
>   bcdUSB   3.00
>   bDeviceClass0 (Defined at Interface level)
>   bDeviceSubClass 0
>   bDeviceProtocol 0
>   bMaxPacketSize0 9
>   idVendor   0x152d JMicron Technology Corp. / JMicron USA Technology 
> Corp.
>   idProduct  0x0567
>   bcdDevice2.05
>   iManufacturer  10 JMicron
>   iProduct   11 USB to ATA/ATAPI Bridge
>   iSerial 5 152D00539000
>   bNumConfigurations  1

OK, that is how the drivebay presents itself. It does not really
correspond to this:
http://www.jmicron.com/PDF/brief/jms567.pdf
It looks more like a jms562 is used, but I don't know what is on the
PCB and in the FW

Anyhow, hot (un)plug capability on the 4 internal SATA i/f is not
explicitly mentioned. If you expect or want that, ask Fantec I would
say.

>   Configuration Descriptor:
> bLength 9
> bDescriptorType 2
> wTotalLength   44
> bNumInterfaces  1
> bConfigurationValue 1
> iConfiguration  0
> bmAttributes 0xc0
>   Self Powered
> MaxPower2mA
> Interface Descriptor:
>   bLength 9
>   bDescriptorType 4
>   bInterfaceNumber0
>   bAlternateSetting   0
>   bNumEndpoints   2
>   bInterfaceClass 8 Mass Storage
>   bInterfaceSubClass  6 SCSI
>   bInterfaceProtocol 80 Bulk-Only
>   iInterface  0
>   Endpoint Descriptor:
> bLength 7
> bDescriptorType 5
> bEndpointAddress 0x81  EP 1 IN
> bmAttributes2
>   Transfer TypeBulk
>   Synch Type   None
>   Usage Type   Data
> wMaxPacketSize 0x0400  1x 1024 bytes
> bInterval   0
> bMaxBurst  15
>   Endpoint Descriptor:
> bLength 7
> bDescriptorType 5
> bEndpointAddress 0x02  EP 2 OUT
> bmAttributes2
>   Transfer TypeBulk
>   Synch Type   None
>   Usage Type   Data
> wMaxPacketSize 0x0400  1x 1024 bytes
> bInterval   0
> bMaxBurst  15
> Binary Object Store Descriptor:
>   bLength 5
>   bDescriptorType15
>   wTotalLength   22
>   bNumDeviceCaps  2
>   USB 2.0 Extension Device Capability:
> bLength 7
> bDescriptorType16
> bDevCapabilityType  2
> bmAttributes   0x0002
>   Link Power Management (LPM) Supported
>   SuperSpeed USB Device Capability:
> bLength10
> bDescriptorType16
> bDevCapabilityType  3
> bmAttributes 0x00
> wSpeedsSupported   0x000e
>   Device can operate at Full Speed (12Mbps)
>   Device can operate at High Speed (480Mbps)
>   Device can operate at SuperSpeed (5Gbps)
> bFunctionalitySupport   1
>   Lowest fully-functional device speed is Full Speed (12Mbps)
> bU1DevExitLat  10 micro seconds
> bU2DevExitLat2047 micro seconds
> Device Status: 0x0001
>   Self Powered
>
>
>
>> IMHO, using such a setup for software RAID (like btrfs RAID1)
>> fundamentally violates the concept of RAID (redundant array of
>> independent disks). It depends on where you define the system border
>> of the (independent) disks.
>> If it is at:
>>
>> A) the 4 (or 3 disk in this case) SATA+power interfaces inside the drivebay 
>> or
>>
>> B) inside the PC's chipset.
>>
>> In case A) there is a shared removable link (USB) inside the
>> filesystem processing machine.
>> In case B) the disks aren't really independent as they share a
>> removable link (and as proven by the (un)plug of 1 device affecting
>> all others).
>> --
>> To unsubscribe from this list: send the line "unsubscribe lin

Re: Question: raid1 behaviour on failure

2016-04-26 Thread Henk Slager
On Thu, Apr 21, 2016 at 7:27 PM, Matthias Bodenbinder
<matth...@bodenbinder.de> wrote:
> Am 21.04.2016 um 13:28 schrieb Henk Slager:
>>> Can anyone explain this behavior?
>>
>> All 4 drives (WD20, WD75, WD50, SP2504C) get a disconnect twice in
>> this test. What is on WD20 is unclear to me, but the raid1 array is
>> {WD75, WD50, SP2504C}
>> So the test as described by Matthias is not what actually happens.
>> In fact, the whole btrfs fs is 'disconnected on the lower layers of
>> the kernel' but there is no unmount.  You can see the scsi items go
>> from 8?.0.0.x to
>> 9.0.0.x to 10.0.0.x. In the 9.0.0.x state, the tools show then 1 dev
>> missing (WD75), but in fact the whole fs state is messed up. So as
>> indicated by Anand already, it is a bad test and it is what one can
>> expect from an unpatched 4.4.0 kernel. ( I'm curious to know how md
>> raidX would handle this ).
>>
>> a) My best guess is that the 4 drives are in a USB connected drivebay
>> and that Matthias unplugged WD75 (so cut its power and SATA
>> connection), did the file copy trial and then plugged in the WD75
>> again into the drivebay. The (un)plug of a harddisk is then assumed to
>> trigger a USB link re-init by the chipset in the drivebay.
>>
>> b) Another possibility is that due to (un)plug of WD75 cause the host
>> USB chipset to re-init the USB link due to (too big?) changes in
>> electrical current. And likely separate USB cables and maybe some
>> SATA.
>>
>> c) Or some flaw in the LMDE2 distribution in combination with btrfs. I
>> don't what is in the  linux-image-4.4.0-0.bpo.1-amd64
>>
>
> Just to clarify my setup. I HDs are mounted into a FANTEC QB-35US3-6G case. 
> According to the handbook it has "Hot-Plug for  USB / eSATA interface".
>
> It is equipped with 4 HDs. 3 of them are part of the raid1. The fourth HD is 
> a 2 TB device with ext4 filesystem and no relevance for this thread.

It looks like a JMS567 + SATA port multipliers behind it are used in
this drivebay. The command   lsusb -v   could show that. So your HW
setup is like JBOD, not RAID.

IMHO, using such a setup for software RAID (like btrfs RAID1)
fundamentally violates the concept of RAID (redundant array of
independent disks). It depends on where you define the system border
of the (independent) disks.
If it is at:

A) the 4 (or 3 disk in this case) SATA+power interfaces inside the drivebay or

B) inside the PC's chipset.

In case A) there is a shared removable link (USB) inside the
filesystem processing machine.
In case B) the disks aren't really independent as they share a
removable link (and as proven by the (un)plug of 1 device affecting
all others).
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Question: raid1 behaviour on failure

2016-04-26 Thread Henk Slager
On Sat, Apr 23, 2016 at 9:07 AM, Matthias Bodenbinder
 wrote:
>
> Here is my newest test. The backports provide a 4.5 kernel:
>
> 
> kernel: 4.5.0-0.bpo.1-amd64
> btrfs-tools: 4.4-1~bpo8+1
> 
>
> This time the raid1 is automatically unmounted after I unplug the device and 
> it can not be mounted while the device is missing. See below.
>
> Matthias
>
>
> 
> 1) turn on the Fantec case:
>
> Apr 23 08:45:38 rakete kernel: usb 3-1: new SuperSpeed USB device number 2 
> using xhci_hcd
> Apr 23 08:45:38 rakete kernel: usb 3-1: New USB device found, idVendor=152d, 
> idProduct=0567
> Apr 23 08:45:38 rakete kernel: usb 3-1: New USB device strings: Mfr=10, 
> Product=11, SerialNumber=5
> Apr 23 08:45:38 rakete kernel: usb 3-1: Product: USB to ATA/ATAPI Bridge
> Apr 23 08:45:38 rakete kernel: usb 3-1: Manufacturer: JMicron
> Apr 23 08:45:38 rakete kernel: usb 3-1: SerialNumber: 152D00539000
> Apr 23 08:45:38 rakete mtp-probe[3641]: checking bus 3, device 2: 
> "/sys/devices/pci:00/:00:1c.5/:04:00.0/usb3/3-1"
> Apr 23 08:45:38 rakete mtp-probe[3641]: bus: 3, device: 2 was not an MTP 
> device
> Apr 23 08:45:38 rakete kernel: usb-storage 3-1:1.0: USB Mass Storage device 
> detected
> Apr 23 08:45:38 rakete kernel: usb-storage 3-1:1.0: Quirks match for vid 152d 
> pid 0567: 500
> Apr 23 08:45:38 rakete kernel: scsi host8: usb-storage 3-1:1.0
> Apr 23 08:45:38 rakete kernel: usbcore: registered new interface driver 
> usb-storage
> Apr 23 08:45:38 rakete kernel: usbcore: registered new interface driver uas
> Apr 23 08:45:39 rakete kernel: scsi 8:0:0:0: Direct-Access WDC WD20 
> 02FAEX-007BA00125 PQ: 0 ANSI: 6
> Apr 23 08:45:39 rakete kernel: scsi 8:0:0:1: Direct-Access WDC WD75 
> 00AACS-00C7B00125 PQ: 0 ANSI: 6
> Apr 23 08:45:39 rakete kernel: scsi 8:0:0:2: Direct-Access WDC WD50 
> 01AALS-00L3B20125 PQ: 0 ANSI: 6
> Apr 23 08:45:39 rakete kernel: scsi 8:0:0:3: Direct-Access SAMSUNG  
> SP2504C  0125 PQ: 0 ANSI: 6
> Apr 23 08:45:39 rakete kernel: sd 8:0:0:0: Attached scsi generic sg6 type 0
> Apr 23 08:45:39 rakete kernel: sd 8:0:0:0: [sdf] 3907029168 512-byte logical 
> blocks: (2.00 TB/1.82 TiB)
> Apr 23 08:45:39 rakete kernel: sd 8:0:0:1: Attached scsi generic sg7 type 0
> Apr 23 08:45:39 rakete kernel: sd 8:0:0:0: [sdf] Write Protect is off
> Apr 23 08:45:39 rakete kernel: sd 8:0:0:0: [sdf] Mode Sense: 67 00 10 08
> Apr 23 08:45:39 rakete kernel: sd 8:0:0:2: Attached scsi generic sg8 type 0
> Apr 23 08:45:39 rakete kernel: sd 8:0:0:1: [sdg] 1465149168 512-byte logical 
> blocks: (750 GB/699 GiB)
> Apr 23 08:45:39 rakete kernel: sd 8:0:0:1: [sdg] Write Protect is off
> Apr 23 08:45:39 rakete kernel: sd 8:0:0:1: [sdg] Mode Sense: 67 00 10 08
> Apr 23 08:45:39 rakete kernel: sd 8:0:0:2: [sdh] 976773168 512-byte logical 
> blocks: (500 GB/466 GiB)
> Apr 23 08:45:39 rakete kernel: sd 8:0:0:3: Attached scsi generic sg9 type 0
> Apr 23 08:45:39 rakete kernel: sd 8:0:0:1: [sdg] No Caching mode page found
> Apr 23 08:45:39 rakete kernel: sd 8:0:0:1: [sdg] Assuming drive cache: write 
> through
> Apr 23 08:45:39 rakete kernel: sd 8:0:0:2: [sdh] Write Protect is off
> Apr 23 08:45:39 rakete kernel: sd 8:0:0:2: [sdh] Mode Sense: 67 00 10 08
> Apr 23 08:45:39 rakete kernel: sd 8:0:0:3: [sdi] 488395055 512-byte logical 
> blocks: (250 GB/233 GiB)
> Apr 23 08:45:39 rakete kernel: sd 8:0:0:0: [sdf] No Caching mode page found
> Apr 23 08:45:39 rakete kernel: sd 8:0:0:0: [sdf] Assuming drive cache: write 
> through
> Apr 23 08:45:39 rakete kernel: sd 8:0:0:3: [sdi] Write Protect is off
> Apr 23 08:45:39 rakete kernel: sd 8:0:0:3: [sdi] Mode Sense: 67 00 10 08
> Apr 23 08:45:39 rakete kernel: sd 8:0:0:2: [sdh] No Caching mode page found
> Apr 23 08:45:39 rakete kernel: sd 8:0:0:2: [sdh] Assuming drive cache: write 
> through
> Apr 23 08:45:39 rakete kernel: sd 8:0:0:3: [sdi] No Caching mode page found
> Apr 23 08:45:39 rakete kernel: sd 8:0:0:3: [sdi] Assuming drive cache: write 
> through
> Apr 23 08:45:39 rakete kernel:  sdf: sdf1
> Apr 23 08:45:39 rakete kernel: sd 8:0:0:0: [sdf] Attached SCSI disk
> Apr 23 08:45:39 rakete kernel: sd 8:0:0:1: [sdg] Attached SCSI disk
> Apr 23 08:45:40 rakete kernel: sd 8:0:0:2: [sdh] Attached SCSI disk
> Apr 23 08:45:40 rakete kernel: sd 8:0:0:3: [sdi] Attached SCSI disk
> Apr 23 08:45:40 rakete kernel: BTRFS: device fsid 
> 16d5891f-5d52-4b29-8591-588ddf11e73d devid 1 transid 89 /dev/sdg
> Apr 23 08:45:40 rakete kernel: BTRFS: device fsid 
> 16d5891f-5d52-4b29-8591-588ddf11e73d devid 2 transid 89 /dev/sdh
> Apr 23 08:45:40 rakete kernel: BTRFS: device fsid 
> 16d5891f-5d52-4b29-8591-588ddf11e73d devid 3 transid 89 /dev/sdi
> Apr 23 08:45:40 rakete kernel: EXT4-fs (sdf1): mounted filesystem with 
> ordered data mode. Opts: (null)
> Apr 23 08:45:40 rakete udisksd[2422]: Mounted /dev/sdf1 at 
> /media/matthias/BACKUP on behalf of uid 1000
>
> 
>
> 7# mount /mnt/raid1/
>
> Apr 23 08:47:31 rakete kernel: BTRFS 

Re: Question: raid1 behaviour on failure

2016-04-21 Thread Henk Slager
On Thu, Apr 21, 2016 at 8:23 AM, Satoru Takeuchi
 wrote:
> On 2016/04/20 14:17, Matthias Bodenbinder wrote:
>>
>> Am 18.04.2016 um 09:22 schrieb Qu Wenruo:
>>>
>>> BTW, it would be better to post the dmesg for better debug.
>>
>>
>> So here we. I did the same test again. Here is a full log of what i did.
>> It seems to be mean like a bug in btrfs.
>> Sequenz of events:
>> 1. mount the raid1 (2 disc with different size)
>> 2. unplug the biggest drive (hotplug)
>> 3. try to copy something to the degraded raid1
>> 4. plugin the device again (hotplug)
>>
>> This scenario does not work. The disc array is NOT redundant! I can not
>> work with it while a drive is missing and I can not reattach the device so
>> that everything works again.
>>
>> The btrfs module crashes during the test.
>>
>> I am using LMDE2 with backports:
>> btrfs-tools 4.4-1~bpo8+1
>> linux-image-4.4.0-0.bpo.1-amd64
>>
>> Matthias
>>
>>
>> rakete - root - /root
>> 1# mount /mnt/raid1/
>>
>> Journal:
>>
>> Apr 20 07:01:16 rakete kernel: BTRFS info (device sdi): enabling auto
>> defrag
>> Apr 20 07:01:16 rakete kernel: BTRFS info (device sdi): disk space caching
>> is enabled
>> Apr 20 07:01:16 rakete kernel: BTRFS: has skinny extents
>>
>> rakete - root - /mnt/raid1
>> 3# ll
>> insgesamt 0
>> drwxrwxr-x 1 root root   36 Nov 14  2014 AfterShot2(64-bit)
>> drwxrwxr-x 1 root root 5082 Apr 17 09:06 etc
>> drwxr-xr-x 1 root root  108 Mär 24 07:31 var
>>
>> 4# btrfs fi show
>> Label: none  uuid: 16d5891f-5d52-4b29-8591-588ddf11e73d
>> Total devices 3 FS bytes used 1.60GiB
>> devid1 size 698.64GiB used 3.03GiB path /dev/sdg
>> devid2 size 465.76GiB used 3.03GiB path /dev/sdh
>> devid3 size 232.88GiB used 0.00B path /dev/sdi
>>
>> 
>> unplug device sdg:
>>
>> Apr 20 07:03:05 rakete kernel: Buffer I/O error on dev sdf1, logical block
>> 243826688, lost sync page write
>> Apr 20 07:03:05 rakete kernel: JBD2: Error -5 detected when updating
>> journal superblock for sdf1-8.
>> Apr 20 07:03:05 rakete kernel: Aborting journal on device sdf1-8.
>> Apr 20 07:03:05 rakete kernel: Buffer I/O error on dev sdf1, logical block
>> 243826688, lost sync page write
>> Apr 20 07:03:05 rakete kernel: JBD2: Error -5 detected when updating
>> journal superblock for sdf1-8.
>> Apr 20 07:03:05 rakete umount[16405]: umount: /mnt/raid1: target is busy
>> Apr 20 07:03:05 rakete umount[16405]: (In some cases useful info about
>> processes that
>> Apr 20 07:03:05 rakete umount[16405]: use the device is found by lsof(8)
>> or fuser(1).)
>> Apr 20 07:03:05 rakete systemd[1]: mnt-raid1.mount mount process exited,
>> code=exited status=32
>> Apr 20 07:03:05 rakete systemd[1]: Failed unmounting /mnt/raid1.
>> Apr 20 07:03:24 rakete kernel: usb 3-1: new SuperSpeed USB device number 3
>> using xhci_hcd
>> Apr 20 07:03:24 rakete kernel: usb 3-1: New USB device found,
>> idVendor=152d, idProduct=0567
>> Apr 20 07:03:24 rakete kernel: usb 3-1: New USB device strings: Mfr=10,
>> Product=11, SerialNumber=5
>> Apr 20 07:03:24 rakete kernel: usb 3-1: Product: USB to ATA/ATAPI Bridge
>> Apr 20 07:03:24 rakete kernel: usb 3-1: Manufacturer: JMicron
>> Apr 20 07:03:24 rakete kernel: usb 3-1: SerialNumber: 152D00539000
>> Apr 20 07:03:24 rakete kernel: usb-storage 3-1:1.0: USB Mass Storage
>> device detected
>> Apr 20 07:03:24 rakete kernel: usb-storage 3-1:1.0: Quirks match for vid
>> 152d pid 0567: 500
>> Apr 20 07:03:24 rakete kernel: scsi host9: usb-storage 3-1:1.0
>> Apr 20 07:03:24 rakete mtp-probe[16424]: checking bus 3, device 3:
>> "/sys/devices/pci:00/:00:1c.5/:04:00.0/usb3/3-1"
>> Apr 20 07:03:24 rakete mtp-probe[16424]: bus: 3, device: 3 was not an MTP
>> device
>> Apr 20 07:03:25 rakete kernel: scsi 9:0:0:0: Direct-Access WDC WD20
>> 02FAEX-007BA00125 PQ: 0 ANSI: 6
>> Apr 20 07:03:25 rakete kernel: scsi 9:0:0:1: Direct-Access WDC WD50
>> 01AALS-00L3B20125 PQ: 0 ANSI: 6
>> Apr 20 07:03:25 rakete kernel: scsi 9:0:0:2: Direct-Access SAMSUNG
>> SP2504C  0125 PQ: 0 ANSI: 6
>> Apr 20 07:03:25 rakete kernel: sd 9:0:0:0: Attached scsi generic sg6 type
>> 0
>> Apr 20 07:03:25 rakete kernel: sd 9:0:0:1: Attached scsi generic sg7 type
>> 0
>> Apr 20 07:03:25 rakete kernel: sd 9:0:0:0: [sdf] 3907029168 512-byte
>> logical blocks: (2.00 TB/1.82 TiB)
>> Apr 20 07:03:25 rakete kernel: sd 9:0:0:0: [sdf] Write Protect is off
>> Apr 20 07:03:25 rakete kernel: sd 9:0:0:0: [sdf] Mode Sense: 67 00 10 08
>> Apr 20 07:03:25 rakete kernel: sd 9:0:0:2: Attached scsi generic sg8 type
>> 0
>> Apr 20 07:03:25 rakete kernel: sd 9:0:0:1: [sdj] 976773168 512-byte
>> logical blocks: (500 GB/466 GiB)
>> Apr 20 07:03:25 rakete kernel: sd 9:0:0:0: [sdf] No Caching mode page
>> found
>> Apr 20 07:03:25 rakete kernel: sd 9:0:0:0: [sdf] Assuming drive cache:
>> write through
>> Apr 20 07:03:25 rakete kernel: sd 9:0:0:1: [sdj] Write Protect is off
>> Apr 20 07:03:25 rakete kernel: sd 9:0:0:1: [sdj] 

Re: btrfs-image and btrfs send related queries

2016-04-19 Thread Henk Slager
> I have /dev/sdb , /dev/sdc. Using wipefs -fa I cleared both devices and
> created btrfs on /dev/sdb. Mounted and written some files and unmounted it.
>
> Then I ran btrfs-image /dev/sdc /img1.img and got the dump.

It looks like you imaged the wrong device; that would explain the I/O
errors later on.
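
For reference, a minimal sketch of the intended round trip, using the
device names from the report (the filesystem was created on /dev/sdb):

  btrfs-image /dev/sdb /img1.img      # dump the metadata of the fs on /dev/sdb
  wipefs -fa /dev/sdb
  btrfs-image -r /img1.img /dev/sdc   # restore the metadata-only dump
  mount /dev/sdc /btrfs1              # directory tree and sizes appear, file data does not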

> Once image created I again ran wipefs -fa /dev/sdb.
>
> Then I ran btrfs-image -r /img1.img /dev/sdc and mounted /dev/sdc.
>
> ls on the dumped filesystem shows the original file sizes but no file content.
> I guess btrfs-image doesn't modify the file stat data, so it shows the sizes
> as original.
>
> However running cat on some of files give i/o error
>
> qd67:/btrfs1 # cat shadow.h
> cat: shadow.h: Input/output error
>
> These errors are not on all files; on other files, since the dump doesn't
> contain any data, cat just returns nothing, as below.
>
> qd67:/btrfs1 # cat stab.h
> qd67:/btrfs1 # cat stdio_ext.h
>
> Not sure why the I/O errors occur.


Re: [resend] btrfs-send -c fails: reproduction case

2016-04-18 Thread Henk Slager
>>> Reproduction case after running into the same problem as Paride
>>> Legovini:
>>> http://article.gmane.org/gmane.comp.file-systems.btrfs/48706/match=send

Your case is not the same as in this thread from Paride, IMO. The error
message is the same, but that doesn't mean the call tree leading to it
is the same. My first impression from your 1st and also your 2nd (in the
resend thread) example was that btrfs' error message is correct and the
usage of -c is wrong.
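
As a hedged sketch of typical -p / -c usage (snapshot paths are made up
here), note that every snapshot passed with -c must already exist,
read-only and unmodified, on both the sending and the receiving side:

  btrfs send -p /pool/snap.parent -c /pool/other.snap /pool/snap.current \
    | btrfs receive /backup/pool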

More importantly, the problem that Paride describes is solved
for kernel/tools v4.5.0/v4.5.1.

[...]
> Every time I start asking about snapshots people tell me how I could
> set up my process differently which is annoying because:
> 1) They assume I'm doing a single, incremental backup process, which is wrong
> 2) I already have years/terabytes pf data I need to transfer, so I
> can't change what I did in the past
> 3) They inevitably recommend their favorite method, which doesn't work
> for me, because I have different requirements

You would have to write a clear set of requirements, as an extension
to what btrfs send|receive can currently do.
From the examples it is simply not clear what you want.

[...]
> Anyway, I'm doing a one-time transfer of a large number of snapshots
> which are currently stuck on a .img file. Over the years they've been

So, this is yet another problem; What is the background here?

> re-organized repeatedly and I've snapshotted many from writable to
> read-only (most of the cases of the parent being deleted, I think). My
> goal is to get them into a new filesystem, but I can't do a full send
> in every case, because then it would balloon up to a petabyte or so.

This gives some hint of what the situation is. How many ro snapshots are
there? How much data is there in total, as uncompressed disk blocks?

>> Does that make more sense now, or were you actually already doing it that
>> way and still getting the errors, and just didn't say so, or did I just
>> confuse you even further (hopefully not!)?
>
> I was already making sure all -c references were both present and
> unmodified, I think the confusion is mostly around whether the parent
> required to use -c, and whether it's an implicit reference volume in
> particular. If it's required, it's impossible to preserve
> de-duplication after deleting the original parent which would be
> really bad.

Can you give examples, for your real data, of how bad it is? And of how
you expect or want it to be?


Re: btrfs-image and btrfs send related queries

2016-04-18 Thread Henk Slager
On Mon, Apr 18, 2016 at 4:26 PM, Roman Mamedov <r...@romanrm.net> wrote:
> On Mon, 18 Apr 2016 16:13:28 +0200
> Henk Slager <eye...@gmail.com> wrote:
>
>> (your email keeps ending up in gmail spam folder)
>>
>> On Mon, Apr 18, 2016 at 9:24 AM, sri <toyours_srid...@yahoo.co.in> wrote:
>> > I tried btrfs-image and created image file and ran btrfs-image -r to a
>> > different disk. Once recovered and mounted, I can able to see data is
>> > not zeroed out as mentioned in btrfs-image man page.
>>
>> "different disk"  you mention, that is important info. If you doe the
>> restore to a image file, that image file is sparse and all data blocks
>> are read as zeros.
>>
>> However, if you restore to a block device, then you can assume it just
>> writes the device blocks for metadata and leaves the rest untouched.
>> So trim the whole device first, or brute-force overwrite it completely
>> with zeros.
>>
>> So maybe the man page needs some corrections / extra notes.
>>
>> > I tried on same machine.
>
> Does btrfs-image store/restore the FS UUID? If it does, then potentially both
> the source FS and the restored one were visible at the same time to the kernel
> with identical UUIDs, and maybe it was actually accessing/mounting the source
> one.

Very good point! The UUIDs are the same. I remember I used a VM to
separate the source FS from the restored FS.

Also, the assumption I made about restoring to a block device is not
correct: If you restore to a loopdev that has a file with all
non-zeros as backing-store, the files in the mounted restored FS are
read as zeros.


Re: btrfs-image and btrfs send related queries

2016-04-18 Thread Henk Slager
(your email keeps ending up in gmail spam folder)

On Mon, Apr 18, 2016 at 9:24 AM, sri  wrote:
> I tried btrfs-image and created image file and ran btrfs-image -r to a
> different disk. Once recovered and mounted, I can able to see data is
> not zeroed out as mentioned in btrfs-image man page.

"different disk"  you mention, that is important info. If you doe the
restore to a image file, that image file is sparse and all data blocks
are read as zeros.

However, if you restore to a block device, then you can assume it just
writes the device blocks for metadata and leaves the rest untouched.
So trim the whole device first, or brute-force overwrite it completely
with zeros.
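
A quick sketch of either approach (the device name is just an example):

  blkdiscard /dev/sdX                                 # trim/discard the whole device
  # or, slower but working on any block device:
  dd if=/dev/zero of=/dev/sdX bs=1M status=progress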

So maybe the man page needs some corrections / extra notes.

> I tried on same machine.
>


Re: RAID6, errors at missing device replacement

2016-04-15 Thread Henk Slager
On Fri, Apr 15, 2016 at 9:49 PM, Yauhen Kharuzhy
 wrote:
> Hi.
>
> I have discovered case when replacement of missing devices causes
> metadata corruption. Does anybody know anything about this?

I can confirm that there is corruption when doing replacement for
both raid5 and raid6, and not only in metadata.
If the replace is done in a very stepwise way, with no other
transactions ongoing on the fs, and the device
'failure'/removal is done in a planned way, the replace can be
successful.

For a raid5 extension from 3x100GB -> 4x100GB, a balance with the stripe
filter worked as expected (some 4.4 kernel). I still had these images
stored and tried how the fs would survive an overwrite of 1 device with
a DVD image (kernel 4.6.0-rc1). To summarize, I had to do a replace and
a scrub, and although there were tons of errors, some very weird/wrong,
all files seemed to still be there. Until I unmounted and tried to
remount: the fs was totally corrupted, with no way to recover.

> I use 4.4.5 kernel with latest global spare patches.
>
> If we have RAID6 (may be reproducible on RAID5 too) and try to replace
> one missing drive by other and after this try to remove another drive
> and replace it, plenty of errors are shown in the log:
>
> [  748.641766] BTRFS error (device sdf): failed to rebuild valid
> logical 7366459392 for dev /dev/sde
> [  748.678069] BTRFS error (device sdf): failed to rebuild valid
> logical 7381139456 for dev /dev/sde
> [  748.693559] BTRFS error (device sdf): failed to rebuild valid
> logical 7290974208 for dev /dev/sde
> [  752.039100] BTRFS error (device sdf): bad tree block start
> 13048831955636601734 6919258112
> [  752.647869] BTRFS error (device sdf): bad tree block start
> 12819300352 6919290880
> [  752.658520] BTRFS error (device sdf): bad tree block start
> 31618367488 6919290880
> [  752.712633] BTRFS error (device sdf): bad tree block start
> 31618367488 6919290880
>
> After device replacement finish, scrub shows uncorrectable errors.
> Btrfs check complains about errors too:
> root@test:~/# btrfs check -p /dev/sdc
> Checking filesystem on /dev/sdc
> UUID: 833fef31-5536-411c-8f58-53b527569fa5
> checksum verify failed on 9359163392 found E4E3BDB6 wanted 
> checksum verify failed on 9359163392 found E4E3BDB6 wanted 
> checksum verify failed on 9359163392 found 4D1F4197 wanted DE0E50EC
> bytenr mismatch, want=9359163392, have=9359228928
>
> Errors found in extent allocation tree or chunk allocation
> checking free space cache [.]
> checking fs roots [.]
> checking csums
> checking root refs
> found 1049788420 bytes used err is 0
> total csum bytes: 1024000
> total tree bytes: 1179648
> total fs tree bytes: 16384
> total extent tree bytes: 16384
> btree space waste bytes: 124962
> file data blocks allocated: 1049755648
>  referenced 1049755648
>
> After first replacement metadata seems not spread across all devices:
> Label: none  uuid: 3db39446-6810-47bf-8732-d5a8793500f3
> Total devices 4 FS bytes used 1002.00MiB
> devid1 size 8.00GiB used 1.28GiB path /dev/sdc
> devid2 size 8.00GiB used 1.28GiB path /dev/sdd
> devid3 size 8.00GiB used 1.28GiB path /dev/sdf
> devid4 size 8.00GiB used 1.25GiB path /dev/sdg
>
> # btrfs device usage /mnt/
> /dev/sdc, ID: 1
>Device size: 8.00GiB
>Data,RAID6:  1.00GiB
>Metadata,RAID6:256.00MiB
>System,RAID6:   32.00MiB
>Unallocated: 6.72GiB
>
> /dev/sdd, ID: 2
>Device size: 8.00GiB
>Data,RAID6:  1.00GiB
>Metadata,RAID6:256.00MiB
>System,RAID6:   32.00MiB
>Unallocated: 6.72GiB
>
> /dev/sdf, ID: 3
>Device size: 8.00GiB
>Data,RAID6:  1.00GiB
>Metadata,RAID6:256.00MiB
>System,RAID6:   32.00MiB
>Unallocated: 6.72GiB
>
> /dev/sdg, ID: 4
>Device size: 8.00GiB
>Data,RAID6:  1.00GiB
>Metadata,RAID6:256.00MiB
>Unallocated: 6.75GiB
>
>
> Steps to reproduce:
> 1) Create and mount RAID6
> 2) remove drive belonging to RAID, try write and let kernel code close
> the device
> 3) replace missing device by 'btrfs replace start' command
> 4) remove drive in another slot, try write, wait for closing of it
> 5) start replacing of missing drive -> ERRORS.
>
> If full balance after step 3) was done, no errors appeared.
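
A rough command sketch of steps 3) and 5) plus that full balance
(device names and devids are purely illustrative):

  btrfs replace start -B 1 /dev/sdh /mnt   # step 3: replace missing devid 1
  btrfs balance start /mnt                 # full balance before failing the next drive
  btrfs replace start -B 2 /dev/sdi /mnt   # step 5: replace the next missing device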

I used kernel 4.6.0-rc3 running in a VirtualBox VM, deleted and added
drives as one would do in a live system, rsyncing files to the fs in
the meantime. Both the 1st and 2nd replaced devices show device errors
later on, but the steps 1) to 5) seem to have worked fine, and btrfs
device usage shows correct and regular numbers. So the step 5) ERRORS
don't seem to occur.
BUT:
- when scrub is run, it just stops way too early, with no errors in dmesg
- umount works
- mounting again then appears to succeed, but no mount is actually done 

Re: btrfs-image and btrfs send related queries

2016-04-15 Thread Henk Slager
On Fri, Apr 15, 2016 at 2:49 PM, Hugo Mills  wrote:
> On Fri, Apr 15, 2016 at 12:41:36PM +, sri wrote:
>> Hi,
>>
>> I have couple of queries related to btrfs-image, btrfs send and with
>> combination of two.
>> 1)
>> I would like to know if a btrfs source file system is spread across more
>> than 1 disks, does btrfs-image require same number of disks to create
>> empty file system without files content??
>
>I don't _think_ you need as many devices as there were originally.

Indeed, if you run btrfs-image -r on a dump from a multi-device
fs, you get 1 big fs image. I once did that for a 4x4TB RAID10 fs
(tools v4.3.x at that time, I believe), resulting in a 17TB (sparse)
file. I was expecting that option -m would create multiple files;
however, scanning the source code revealed that there are things not
implemented (it resulted in a 34T file that was simply not a valid
fs). Either I did something wrong, or there is a bug.
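
A minimal sketch of that dump/restore flow (paths are hypothetical):

  btrfs-image -c9 /dev/sda /tmp/fs-meta.dump           # metadata-only dump, compressed
  btrfs-image -r /tmp/fs-meta.dump /tmp/restored.img   # one big sparse image file
  # mounting it loop,ro lets you browse the tree, but data blocks read back
  # as zeros; a dump of a multi-device fs may need extra work, as described above
  mount -o loop,ro /tmp/restored.img /mnt/inspect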

For my case, it was much quicker to patch the kernel so that it
worked with the 17T file. There are then still issues w.r.t. devices,
and since the data is missing, only a limited set of tool actions or
issues can be researched with such a generated image. But for a
multi-TB fs, the data volume is acceptable (roughly on the order of
1G or 10G).

I think it would make sense that the btrfs-image restore output can be
split into multiple files, so that the multidevice aspects are better
represented (or modelled).

>> 2) would btrfs-image can be modified to keep only given subvolume foot
>> print and related meta data to bring back file system live on destination
>> disk?
>>
>>To elaborate more on this, Lets say I have 5 subvolumes on source btrfs
>> and i run btrfs-image written to destination disk say /dev/sdd. In this
>> process, can btrfs-image modified to just have only 1 subvolume and skipp
>> other 4 subvolumes and write to destination i.. /dev/sdd so that when I
>> mount /dev/sdd , I will have btrfs with only 1 subvolume with no data.
>
>For a first approximation, you could just drop any FS tree from the
> image which wasn't the target one. After that, it turns into a
> complicated accounting exercise to drop all of the back-refs to the
> missing FS trees, and to drop all the extent records for the
> non-shared data and the metadata for the missing FS trees.
>
>It's probably going to be complicated, and will basically involve
> rewriting most of the image to avoid the metadata you didn't want.
>
>> 3) If 3 can successful, can btrfs-image further changed to include data of
>> selected subvolume which gives files data also written to /dev/sdd which
>> would be kind of a backup of a subvolume taken out of a btrfs file system
>> which is having more than 1 subvolumes.
>
>If you're going to do all the hard work of (2), then (3) is a
> reasonable logical(?) extension.
>
>On the other hand, what's wrong with simply using send/receive? It
> gives you a data structure (a FAR-format send stream) which contains
> everything you need to reconstruct a subvolume on a btrfs different
> to the original.
>
>Hugo.
>
> --
> Hugo Mills | Mary had a little lamb
> hugo@... carfax.org.uk | You've heard this tale before
> http://carfax.org.uk/  | But did you know she passed her plate
> PGP: E2AB1DE4  | And had a little more?


Re: enospace regression in 4.4

2016-04-13 Thread Henk Slager
On Tue, Apr 12, 2016 at 5:52 PM, Julian Taylor
 wrote:
> smaller testcase that shows the immediate enospc after fallocate -> rm,
> though I don't know if it is really related to the full filesystem
> bugging out as the balance does work if you wait a few seconds after the
> balance.
> But this sequence of commands did work in 4.2.
>
>  $ sudo btrfs fi show /dev/mapper/lvm-testing
> Label: none  uuid: 25889ba9-a957-415a-83b0-e34a62cb3212
> Total devices 1 FS bytes used 225.18MiB
> devid1 size 5.00GiB used 788.00MiB path /dev/mapper/lvm-testing
>
>  $ fallocate -l 4.4G test.dat
>  $ rm -f test.dat
>  $ sudo btrfs fi balance start -dusage=0 .
> ERROR: error during balancing '.': No space left on device
> There may be more info in syslog - try dmesg | tail

The effect is the same with kernel/progs v4.6.0-rc3 / v4.5.1.
It also doesn't matter whether   fallocate -l 4400M test.dat   or   dd
if=/dev/zero of=test.dat bs=1M count=4400   is used to create test.dat
(I was looking at the --dig-holes and --punch-hole options earlier and
was wondering whether the use of fallocate would make a difference).


Re: enospace regression in 4.4

2016-04-12 Thread Henk Slager
On Tue, Apr 12, 2016 at 5:52 PM, Julian Taylor
 wrote:
> smaller testcase that shows the immediate enospc after fallocate -> rm,
> though I don't know if it is really related to the full filesystem
> bugging out as the balance does work if you wait a few seconds after the
> balance.
> But this sequence of commands did work in 4.2.
>
>  $ sudo btrfs fi show /dev/mapper/lvm-testing
> Label: none  uuid: 25889ba9-a957-415a-83b0-e34a62cb3212
> Total devices 1 FS bytes used 225.18MiB
> devid1 size 5.00GiB used 788.00MiB path /dev/mapper/lvm-testing
>
>  $ fallocate -l 4.4G test.dat
>  $ rm -f test.dat
>  $ sudo btrfs fi balance start -dusage=0 .
> ERROR: error during balancing '.': No space left on device
> There may be more info in syslog - try dmesg | tail

It seems that kernel 4.4.6 waits longer with de-allocating empty
chunks, and the balance kicks in at a time when the 5 GiB is still
completely filled with chunks. As balance needs unallocated space (at
the device level; how much depends on the profiles), this error can be
expected.

> On 04/12/2016 12:24 PM, Julian Taylor wrote:
>> hi,
>> I have a system with two filesystems which are both affected by the
>> notorious enospace bug when there is plenty of unallocated space
>> available. The system is a raid0 on two 900 GiB disks and an iscsi
>> single/dup 1.4TiB.
>> To deal with the problem I use a cronjob that uses fallocate to give me
>> an advance notice on the issue so I can apply the only workaround that
>> works for me, which is shrink the fs to the minimum and grow it again.
>> This has worked fine for a couple of month.
>>
>> I now updated from 4.2 to 4.4.6 and it appears my cronjob actually
>> triggers an immediate enospc in the balance after removing the
>> fallocated file and the shrink/resize workaround does not work anymore.

The filesystem itself is not resized AFAIU, correct?

>> it is mounted with enospc_debug but that just says "2 enospc in
>> balance". Nothing else useful in the log.
>>
>> I had to revert back to 4.2 to get the system running again so it is
>> currently not available for more testing, but I may be able to do more
>> tests if required in future.
>>
>> The cronjob does this once a day:
>>
>> #!/bin/bash
>> sync
>>
>> check() {
>>   date
>>   mnt=$1
>>   time btrfs fi balance start -mlimit=2 $mnt
>>   btrfs fi balance start -dusage=5 $mnt
>>   sync
>>   freespace=$(df -B1 $mnt | tail -n 1 | awk '{print $4 -
>> 50*1024*1024*1024}')
>>   fallocate -l $freespace $mnt/falloc
>>   /usr/sbin/filefrag $mnt/falloc
>>   rm -f $mnt/falloc
>>   btrfs fi balance start -dusage=0 $mnt

See the comment for the smaller test case: maybe you could put a delay
larger than the commit time before this balance, to give the kernel
itself the chance to clean up empty chunks.
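
For example, a hedged tweak of the snippet above (the delay value is a
guess, just longer than the default 30s commit interval):

   rm -f $mnt/falloc
   sync
   sleep 60                               # let the cleaner thread drop empty chunks
   btrfs fi balance start -dusage=0 $mnt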

>>   time btrfs fi balance start -mlimit=2 $mnt
>>   time btrfs fi balance start -dlimit=10 $mnt
>>   date
>> }
>>
>> check /data
>> check /data/nas

It could be that with kernel 4.4.6 or newer, the original enospc
(so not the ones due to balances) does not pop up anymore. That would
mean the cronjob workaround itself now creates a problem. Can you give
some background on what other (types of) enospc occurred in the past,
and whether this was with the 4.2 kernel or older?

You could shrink a filesystem by a few GiB (without changing the
size of the underlying device), so that once it really gets filled up
and hits enospc, you resize to max again and delete files or snapshots
or something. Of course that is no option for a 24/7 unattended system,
but it may work as a test on a client laptop.
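
A sketch of that shrink-then-grow trick (the reserve size is illustrative):

  btrfs filesystem resize -5G /data      # keep a few GiB unallocatable as a reserve
  # ... later, when a real enospc hits:
  btrfs filesystem resize max /data      # free the reserve, then delete files/snapshots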

>> btrfs info:
>>
>>
>>  ~ $ btrfs --version
>> btrfs-progs v4.4
>> sagan5 ~ $ sudo btrfs fi show
>> Label: none  uuid: e4aef349-7a56-4287-93b1-79233e016aae
>>   Total devices 2 FS bytes used 898.18GiB
>>   devid1 size 880.00GiB used 473.03GiB path /dev/mapper/data-linear1
>>   devid2 size 880.00GiB used 473.03GiB path /dev/mapper/data-linear2
>>
>> Label: none  uuid: 14040f9b-53c8-46cf-be6b-35de746c3153
>>   Total devices 1 FS bytes used 557.19GiB
>>   devid1 size 1.36TiB used 585.95GiB path /dev/sdd
>>
>>  ~ $ sudo btrfs fi df /data
>> Data, RAID0: total=938.00GiB, used=895.09GiB
>> System, RAID1: total=32.00MiB, used=112.00KiB
>> Metadata, RAID1: total=4.00GiB, used=3.10GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>> sagan5 ~ $ sudo btrfs fi usage /data
>> Overall:
>> Device size: 1.72TiB
>> Device allocated:  946.06GiB
>> Device unallocated:813.94GiB
>> Device missing:0.00B
>> Used:  901.27GiB
>> Free (estimated):  856.85GiB  (min: 449.88GiB)
>> Data ratio: 1.00
>> Metadata ratio: 2.00
>> Global reserve:512.00MiB  (used: 0.00B)
>>
>> Data,RAID0: Size:938.00GiB, Used:895.09GiB
>>/dev/dm-1   469.00GiB
>>/dev/mapper/data-linear1

Re: csum failed on innexistent inode

2016-04-11 Thread Henk Slager
On Mon, Apr 11, 2016 at 3:48 AM, Jérôme Poulin <jeromepou...@gmail.com> wrote:
> Sorry for the confusion, allow me to clarify and I will summarize with
> what I learned since I now understand that corruption was present
> before disk went bad.
>
> Note that this BTRFS was once on a MD RAID5 on LVM on LUKS before
> being moved in-place to LVM on LUKS on BTRFS RAID10. But since balance
> worked at the time.

I haven't used LVM for years, but those in-place actions normally work
if size calculations etc are correct. Otherwise you would know
immediately.

> Also note that this computer was booted twice for about 30 minutes
> period with bad ram before it was replaced.

This is very important info. It is clear now that there was bad memory
and that it was in use for just half an hour.

> I think my checksums errors were present, but unknown to me, before
> the hardware disk failure. The bad memory might be the root cause of
> this problem but I can't be sure.

When I look at all the info now and also think of my own experience
with a bad RAM module and btrfs, I think this bad memory is the root
cause. I have seen btrfs RAID10 correct a few errors (likely coming
from earlier crashes with btrfs RAID5 on older disks). If it can't
correct, there is something else wrong, likely affecting more
devices than the RAID profile is able to correct for.

> On Sun, Apr 10, 2016 at 1:25 PM, Henk Slager <eye...@gmail.com> wrote:
>> It was not fully clear what the sequence of events were:
>> - HW problem
>> - btrfs SW problem
>> - 1st scrub
>> - the --repair-sector with hdparm
>> - 2nd scrub
>> - 3rd scrub?
>>
>
> 1. Errors in dmesg and confirmation from smartd that hardware problems
> were present.
> 2. Attempt to repair sector using --repair-sector which reset the
> sector to zeroes.
> 3. Scrub detected errors and fixed some but there were 18 uncorrectable.
> 4. Disk has been changed using btrfs replace. Corruption still present.
> 5. Balance attempted but aborts when encountering the first uncorrectable 
> error.
> 6. Tentative to locate bad sector/inode without success leading to
> another scrub with the same errors.
> 7. Attempt to reset stats and scrub again. Still getting the same errors.
> 8. New disk added and data profile converted from RAID10 to RAID1,
> balance abort on first uncorrectable error.
>
>
>> There is also DM between the harddisk and btrfs and I am not sure if
>> whether the hdparm action did repair or further corrupt things.
>>
>
> I confirmed after using --repair-sector that the sector has been reset
> to zeroes using --read-sector. I also tried read-sector first which
> failed and added an entry to the SMART log. After repair-sector,
> read-sector returned the zeroed sector.
>
>> How do you know for sure that the contents of the 'logical blocks' are
>> the same on both devices?
>>
>
> After a balance, here is what dmesg shows (complete warning output):
> BTRFS warning (device dm-36): csum failed ino 330 off 1809084416 csum
> 4147641019 expected csum 1755301217
> BTRFS warning (device dm-36): csum failed ino 330 off 1809195008 csum
> 1515428513 expected csum 2566472073
> BTRFS warning (device dm-36): csum failed ino 330 off 1809199104 csum
> 1927504681 expected csum 2566472073
> BTRFS warning (device dm-36): csum failed ino 330 off 1809211392 csum
> 3086571080 expected csum 2566472073
> BTRFS warning (device dm-36): csum failed ino 330 off 1809149952 csum
> 3254083717 expected csum 2566472073
> BTRFS warning (device dm-36): csum failed ino 330 off 1809162240 csum
> 3157020538 expected csum 2566472073
> BTRFS warning (device dm-36): csum failed ino 330 off 1809166336 csum
> 1092724678 expected csum 2566472073
> BTRFS warning (device dm-36): csum failed ino 330 off 1809178624 csum
> 4235459038 expected csum 2566472073
> BTRFS warning (device dm-36): csum failed ino 330 off 1809182720 csum
> 1764946502 expected csum 2566472073
> BTRFS warning (device dm-36): csum failed ino 330 off 1809084416 csum
> 4147641019 expected csum 1755301217
>
>
> After a scrub (complete error output):
> BTRFS error (device dm-36): bdev /dev/dm-32 errs: wr 0, rd 0, flush 0,
> corrupt 1, gen 0
> BTRFS error (device dm-36): bdev /dev/dm-32 errs: wr 0, rd 0, flush 0,
> corrupt 2, gen 0
> BTRFS error (device dm-36): unable to fixup (regular) error at logical
> 1296334876672 on dev /dev/dm-32
> BTRFS error (device dm-36): unable to fixup (regular) error at logical
> 1296334987264 on dev /dev/dm-32
> BTRFS error (device dm-36): bdev /dev/dm-32 errs: wr 0, rd 0, flush 0,
> corrupt 3, gen 0
> BTRFS error (device dm-36): unable to fixup (regular) error at logical
> 1296334991360 on dev /dev/dm-32
> BTRFS error (device dm-36): bdev /de

Re: csum failed on innexistent inode

2016-04-10 Thread Henk Slager
>> You might want this patch:
>> http://www.spinics.net/lists/linux-btrfs/msg53552.html
>>
>> As a workaround, you can reset the counters on the new/healthy device with:
>>
>> btrfs device stats [-z] <path>|<device>
>>
>
> I did reset the stats and launched another scrub, and still, since the
> logical blocks are the same on both devices and checksum is different,
> is really seems like my problem was originally created when I booted
> this computer with bad memory (maybe?), could it have been that the
> checksum was saved on disk as bad in the first place and BTRFS doesn't
> want to read it back?

It was not fully clear what the sequence of events were:
- HW problem
- btrfs SW problem
- 1st scrub
- the --repair-sector with hdparm
- 2nd scrub
- 3rd scrub?

There is also DM between the harddisk and btrfs and I am not sure if
whether the hdparm action did repair or further corrupt things.

How do you know for sure that the contents of the 'logical blocks' are
the same on both devices?

If btrfs wants to read a disk block and its csum doesn't match, then it
is an I/O error, with the same effect as an uncorrected bad sector in
the old days. But in this case your (former/old) disk might still be OK;
as you suggest, it might be due to some other error (HW or SW) that the
content and csum don't match. It is hard to trace back based on the
info in this email thread. It looks like replace just copied the
problem, and it now seems to be an issue at the filesystem level.

> Is it possible to reset the checksum on those? I couldn't find what
> file or metadata the blocks were pointing too.

Could it be that they in the meantime have been removed?
It might be that you again need to run scrub in order to try to find
the problem spot/files.

Fixing individual csums has been asked about before; I don't remember
whether anyone fixed them with their own extra scripts/C code or
whatever. A brute-force method is to recalculate and rewrite all
csums with  btrfs check --init-csum-tree ; you probably know that. But
maybe you want an rsync -c comparison against backups first. Kernel/tools
versions and btrfs fi usage output might also give some hints.
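
A sketch of that comparison and the brute-force csum rebuild (paths are
hypothetical; --init-csum-tree rewrites the whole csum tree, so only do
this once you trust the data):

  rsync -ancv /mnt/btrfs/ /backup/btrfs/    # -n dry run, -c compare by checksum
  umount /mnt/btrfs
  btrfs check --init-csum-tree /dev/dm-36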


Re: btrfs send/receive using generation number as source

2016-04-08 Thread Henk Slager
On Fri, Apr 8, 2016 at 1:01 PM, Martin Steigerwald  wrote:
> Hello!
>
> As far as I understood, for differential btrfs send/receive – I didn´t use it
> yet – I need to keep a snapshot on the source device to then tell btrfs send
> to send the differences between the snapshot and the current state.

During the incremental send operation you need 2 ro snapshots
available (a parent and a current snapshot); after that, you only
need to keep the current one, promote it to parent snapshot, and
keep it around until the next incremental send. So indeed that locks up
space, and you might run out of free space if there is a long time
before the next incremental send|receive and the changes in the
filesystem are large in volume.
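
A hedged sketch of that rolling-parent scheme (paths and names hypothetical):

  btrfs subvolume snapshot -r /data /data/.snap-new
  btrfs send -p /data/.snap-parent /data/.snap-new | btrfs receive /backup
  btrfs subvolume delete /data/.snap-parent   # the old parent is no longer needed
  mv /data/.snap-new /data/.snap-parent       # promote the current snapshot to parent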

Alternatively, you could do non-incremental send, if the fs is
relatively small and you have some method to dedupe on the receiving
filesystem. But the rsync method is by far preferred in this case I
would say.

> Now the BTRFS filesystems on my SSDs are often quite full, thus I do not keep
> any snapshots except for one during rsync or borgbackup script run-time.
>
> Is it possible to tell btrfs send to use generation number xyz to calculate
> the difference? This way, I wouldn´t have to keep a snapshot around, I
> believe.
>
> I bet not, at the time cause -c wants a snapshot. Ah and it wants a snapshot
> of the same state on the destination as well. Well on the destination I let
> the script make a snapshot after the backup so… what I would need is to

You can use -p for incremental send and you can also send back (new)
increments from backup to master.

> remember the generation number of the source snapshot that the script creates
> to backup from and then tell btrfs send that generation number + the
> destination snapshots.
>
> Well, or get larger SSDs or get rid of some data on them.

I switched from ext4 to a btrfs rootfs on an old netbook which has only
4G of soldered flash and no option for extension (except via USB/SD card,
which turned out to be not reliable enough over a longer period of
time).
Basically, the compress=lzo mount option extends the lifetime of this
netbook while still using a modern full-sized Linux distro. But I
guess you have already compressed/compacted what is possible.


Re: Scrub priority, am I using it wrong?

2016-04-05 Thread Henk Slager
On Tue, Apr 5, 2016 at 4:37 AM, Duncan <1i5t5.dun...@cox.net> wrote:
> Gareth Pye posted on Tue, 05 Apr 2016 09:36:48 +1000 as excerpted:
>
>> I've got a btrfs file system set up on 6 drbd disks running on 2Tb
>> spinning disks. The server is moderately loaded with various regular
>> tasks that use a fair bit of disk IO, but I've scheduled my weekly btrfs
>> scrub for the best quiet time in the week.
>>
>> The command that is run is:
>> /usr/local/bin/btrfs scrub start -Bd -c idle /data
>>
>> Which is my best attempt to try and get it to have a low impact on user
>> operations
>>
>> But iotop shows me:
>>
>> 1765 be/4 root   14.84 M/s0.00 B/s  0.00 % 96.65 % btrfs scrub
>> start -Bd -c idle /data
>>  1767 be/4 root   14.70 M/s0.00 B/s  0.00 % 95.35 % btrfs
>> scrub start -Bd -c idle /data
>>  1768 be/4 root   13.47 M/s0.00 B/s  0.00 % 92.59 % btrfs
>> scrub start -Bd -c idle /data
>>  1764 be/4 root   12.61 M/s0.00 B/s  0.00 % 88.77 % btrfs
>> scrub start -Bd -c idle /data
>>  1766 be/4 root   11.24 M/s0.00 B/s  0.00 % 85.18 % btrfs
>> scrub start -Bd -c idle /data
>>  1763 be/4 root7.79 M/s0.00 B/s  0.00 % 63.30 % btrfs
>> scrub start -Bd -c idle /data
>> 28858 be/4 root0.00 B/s  810.50 B/s  0.00 % 61.32 % [kworker/
> u16:25]
>>
>>
>> Which doesn't look like an idle priority to me. And the system sure
>> feels like a system with a lot of heavy io going on. Is there something
>> I'm doing wrong?

The throughput numbers make me think that the
filesystem is raid5 or raid6. On single, raid1 or raid10 one easily
gets around 100M/s without noticing or feeling heavy IO ongoing,
mostly independent of the scrub options.

> Two points:
>
> 1) It appears btrfs scrub start's -c option only takes numeric class, so
> try -c3 instead of -c idle.

Thanks to Duncan for pointing this out. I don't remember exactly, but
I think I also had issues with this in the past, but did not realize
it or look further into it.

> Works for me with the numeric class (same results as you with spelled out
> class), tho I'm on ssd with multiple independent btrfs on partitions, the
> biggest of which is 24 GiB, 18.something GiB used, which scrubs in all of
> 20 seconds, so I don't need and hadn't tried the -c option at all until
> now.
>
> 2) What a difference an ssd makes!
>
> $$ sudo btrfs scrub start -c3 /p
> scrub started on /p, [...]
>
> $$ sudo iotop -obn1
> Total DISK READ : 626.53 M/s | Total DISK WRITE :   0.00 B/s
> Actual DISK READ: 596.93 M/s | Actual DISK WRITE:   0.00 B/s
>  TID  PRIO  USER DISK READ  DISK WRITE  SWAPIN  IOCOMMAND
>  872 idle root  268.40 M/s0.00 B/s  0.00 %  0.00 % btrfs scrub
> start -c3 /p
>  873 idle root  358.13 M/s0.00 B/s  0.00 %  0.00 % btrfs scrub
> start -c3 /p
>
> CPU bound, 0% IOWait even at idle IO priority, in addition to the
> hundreds of M/s values per thread/device, here.  You OTOH are showing
> under 20 M/s per thread/device on spinning rust, with an IOWait near 90%,
> thus making it IO bound.

This low M/s and high IOWait is the kind of behavior I noticed with 3x
2TB raid5 when scrubbing or balancing (no bcache activated, kernel
4.3.3).


Re: csum failed on innexistent inode

2016-04-04 Thread Henk Slager
On Mon, Apr 4, 2016 at 9:50 AM, Jérôme Poulin  wrote:
> Hi all,
>
> I have a BTRFS on disks running in RAID10 meta+data, one of the disk
> has been going bad and scrub was showing 18 uncorrectable errors
> (which is weird in RAID10). I tried using --repair-sector with hdparm
> even if it shouldn't be necessary since BTRFS would overwrite the
> sector. Repair sector fixed the sector in SMART but BTRFS was still
> showing 18 uncorr. errors.
>
> I finally decided to give up this opportunity to test the error
> correction property of BTRFS (this is a home system, backed up) and
> installed a brand new disk in the machine. After running btrfs
> replace, everything was fine, I decided to run btrfs scrub again and I
> still have the same 18 uncorrectable errors.

You might want this patch:
http://www.spinics.net/lists/linux-btrfs/msg53552.html

As a workaround, you can reset the counters on the new/healthy device with:

btrfs device stats [-z] <path>|<device>

> Later on, since I had a new disk with more space, I decided to run a
> balance to free up the new space but the balance has stopped with csum
> errors too. Here are the output of multiple programs.
>
> How is it possible to get rid of the referenced csum errors if they do
> not exist? Also, the expected checksum looks suspiciously the same for
> multiple errors. Could it be bad RAM in that case? Can I convince
> BTRFS to update the csum?
>
> # btrfs inspect-internal logical-resolve -v 1809149952 /mnt/btrfs/
> ioctl ret=-1, error: No such file or directory
> # btrfs inspect-internal inode-resolve -v 296 /mnt/btrfs/
> ioctl ret=-1, error: No such file or directory
>
>
> dmesg after first bad sector:
> avr 01 18:29:52 p4.i.ticpu.net kernel: BTRFS info (device dm-43): read
> error corrected: ino 1 off 655368716288 (dev /dev/dm-42 sector
> 2939136)
> avr 01 18:29:52 p4.i.ticpu.net kernel: BTRFS info (device dm-43): read
> error corrected: ino 1 off 655368720384 (dev /dev/dm-42 sector
> 2939144)
> avr 01 18:29:52 p4.i.ticpu.net kernel: BTRFS info (device dm-43): read
> error corrected: ino 1 off 655368724480 (dev /dev/dm-42 sector
> 2939152)
> avr 01 18:29:52 p4.i.ticpu.net kernel: BTRFS info (device dm-43): read
> error corrected: ino 1 off 655368728576 (dev /dev/dm-42 sector
> 2939160)
>
> dmesg after balance:
> [1738474.444648] BTRFS warning (device dm-40): csum failed ino 296 off
> 1809195008 csum 1515428513 expected csum 2566472073
> [1738474.444649] BTRFS warning (device dm-40): csum failed ino 296 off
> 1809084416 csum 4147641019 expected csum 1755301217
> [1738474.444702] BTRFS warning (device dm-40): csum failed ino 296 off
> 1809199104 csum 1927504681 expected csum 2566472073
> [1738474.444717] BTRFS warning (device dm-40): csum failed ino 296 off
> 1809211392 csum 3086571080 expected csum 2566472073
> [1738474.444917] BTRFS warning (device dm-40): csum failed ino 296 off
> 1809084416 csum 4147641019 expected csum 1755301217
> [1738474.444962] BTRFS warning (device dm-40): csum failed ino 296 off
> 1809195008 csum 1515428513 expected csum 2566472073
> [1738474.444998] BTRFS warning (device dm-40): csum failed ino 296 off
> 1809199104 csum 1927504681 expected csum 2566472073
> [1738474.445034] BTRFS warning (device dm-40): csum failed ino 296 off
> 1809211392 csum 3086571080 expected csum 2566472073
> [1738474.473286] BTRFS warning (device dm-40): csum failed ino 296 off
> 1809149952 csum 3254083717 expected csum 2566472073
> [1738474.473357] BTRFS warning (device dm-40): csum failed ino 296 off
> 1809162240 csum 3157020538 expected csum 2566472073
>
> btrfs check:
> ./btrfs check /dev/mapper/luksbtrfsdata2
> Checking filesystem on /dev/mapper/luksbtrfsdata2
> UUID: 805f6ad7-1188-448d-aee4-8ddeeb70c8a7
> checking extents
> bad metadata [1453741768704, 1453741785088) crossing stripe boundary
> bad metadata [1454487764992, 1454487781376) crossing stripe boundary
> bad metadata [1454828552192, 1454828568576) crossing stripe boundary
> bad metadata [1454879735808, 1454879752192) crossing stripe boundary
> bad metadata [1455087222784, 1455087239168) crossing stripe boundary
> bad metadata [1456269426688, 1456269443072) crossing stripe boundary
> bad metadata [1456273227776, 1456273244160) crossing stripe boundary
> bad metadata [1456404234240, 1456404250624) crossing stripe boundary
> bad metadata [1456418914304, 1456418930688) crossing stripe boundary

Those are false alerts; this patch handles them:
https://patchwork.kernel.org/patch/8706891/

> checking free space cache
> checking fs roots
> checking csums
> checking root refs
> found 689292505473 bytes used err is 0
> total csum bytes: 660112536
> total tree bytes: 1764098048
> total fs tree bytes: 961921024
> total extent tree bytes: 79331328
> btree space waste bytes: 232774315
> file data blocks allocated: 4148513517568
>  referenced 972284129280
>
> btrfs scrub:
> I don't have the output handy but the dmesg output were pairs of
> logical blocks like balance and no errors were corrected.

Re: btrfsck: backpointer mismatch (and multiple other errors)

2016-04-02 Thread Henk Slager
On Sat, Apr 2, 2016 at 11:00 AM, Kai Krakow <hurikha...@gmail.com> wrote:
> Am Fri, 1 Apr 2016 01:27:21 +0200
> schrieb Henk Slager <eye...@gmail.com>:
>
>> It is not clear to me what 'Gentoo patch-set r1' is and does. So just
>> boot a vanilla v4.5 kernel from kernel.org and see if you get csum
>> errors in dmesg.
>
> It is the gentoo patchset, I don't think anything there relates to
> btrfs:
> https://dev.gentoo.org/~mpagano/genpatches/trunk/4.5/
>
>> Also, where does 'duplicate object' come from? dmesg ? then please
>> post its surroundings, straight from dmesg.
>
> It was in dmesg. I already posted it in the other thread and Qu took
> note of it. Apparently, I didn't manage to capture anything else than:
>
> btrfs_run_delayed_refs:2927: errno=-17 Object already exists
>
> It hit me unexpected. This was the first time btrfs went RO for me. It
> was with kernel 4.4.5 I think.
>
> I suspect this is the outcome of unnoticed corruptions that sneaked in
> earlier over some period of time. The system had no problems until this
> incident, and only then I discovered the huge pile of corruptions when I
> ran btrfsck.
>
> I'm also pretty convinced now that VirtualBox itself is not the problem
> but only victim of these corruptions, that's why it primarily shows up
> in the VDI file.
>
> However, I now found csum errors in unrelated files (see other post in
> this thread), even for files not touched in a long time.

OK, this is some good further status and background. That there are
more csum errors elsewhere is quite worrying, I would say. You said the
HW is tested, but are you sure there are no rare undetected failures,
due to overclocking or just aging or whatever? It might just be that
spurious HW errors are only now starting to happen and are unrelated to
the kernel upgrade from 4.4.x to 4.5.
I once had a RAM module go bad; Windows 7 ran fine (at least no
crashes), but when I booted with Linux/btrfs, all kinds of strange
btrfs errors started to appear, including csum errors.

The other thing you could think about is the SSD cache partition. I
don't remember whether blocks going from RAM to SSD get an extra CRC
attached (independent of btrfs). But if data gets corrupted while in the
SSD, you could get very nasty errors; how nasty depends a bit on the
various bcache settings. It is not unthinkable that corrupted dirty data
gets written to the harddisks. But at least btrfs (scrub) can detect
that (the situation you are in now).

To further isolate just btrfs, you could temporarily rule out
bcache by making sure the cache is clean, then increasing the
start sectors of the second partitions on the harddisks by 16 (8KiB),
and then rebooting. Of course, after any write to the partitions, you'll
have to recreate all of the bcache setup.

But maybe bugs in older kernels have silently corrupted the fs, and
kernel 4.5 cannot handle that anymore, so any use of the fs increases
the corruption.


Re: Another ENOSPC situation

2016-04-01 Thread Henk Slager
On Fri, Apr 1, 2016 at 10:40 PM, Marc Haber <mh+linux-bt...@zugschlus.de> wrote:
> On Fri, Apr 01, 2016 at 09:20:52PM +0200, Henk Slager wrote:
>> On Fri, Apr 1, 2016 at 6:50 PM, Marc Haber <mh+linux-bt...@zugschlus.de> 
>> wrote:
>> > On Fri, Apr 01, 2016 at 06:30:20PM +0200, Marc Haber wrote:
>> >> On Fri, Apr 01, 2016 at 05:44:30PM +0200, Henk Slager wrote:
>> >> > On Fri, Apr 1, 2016 at 3:40 PM, Marc Haber 
>> >> > <mh+linux-bt...@zugschlus.de> wrote:
>> >> > > btrfs balance -mprofiles seems to do something. one kworked and one
>> >> > > btrfs-transaction process hog one CPU core each for hours, while
>> >> > > blocking the filesystem for minutes apiece, which leads to the host
>> >> > > being nearly unuseable up to the point of "clock and mouse pointer
>> >> > > frozen for nearly ten minutes".
>> >> >
>> >> > I assume you still have your every 10 minutes snapshotting running
>> >> > while balancing?
>> >>
>> >> No, I disabled the cronjob before trying the balance. I might be
>> >> crazy, but not stup^wunexperienced.
>> >
>> > That being said, I would still expect the code not to allow _this_
>> > kind of effect on the entire system when two alledgely incompatible
>> > operations run simultaneously. I mean, Linux is a multi-user,
>> > multi-tasking operating system where one simply cannot expect all
>> > processes to be cooperative to each other. We have the operating
>> > systems to prevent this kind of issues, not to cause them.
>>
>> Maybe look at it differently: Does user mh have trouble using this
>> laptop w.r.t. storing files?
>
> No. I would have cried murder otherwise.
>
>> In openSUSE Tumbleweed (the snapshot from end of march), root access
>> is needed to change the default snapshotting config, otherwise you
>> will have a 10 year history. After that change has been done according
>> to needs of the user, there is no need to run manual balance.
>
> So you are saying the balancing a filesystem should never be
> necessary? Or what are you trying to say?

There is a package  btrfsmaintenance  which does the balancing for the
user, after it is configured by root according to the user's wishes and
needs.

The key thing I want to say is that you should change your snapshotting
rate and/or policy. It has been hinted at before, and it is more a
psychological issue than a technical one, I think.


Re: Another ENOSPC situation

2016-04-01 Thread Henk Slager
On Fri, Apr 1, 2016 at 6:50 PM, Marc Haber <mh+linux-bt...@zugschlus.de> wrote:
> On Fri, Apr 01, 2016 at 06:30:20PM +0200, Marc Haber wrote:
>> On Fri, Apr 01, 2016 at 05:44:30PM +0200, Henk Slager wrote:
>> > On Fri, Apr 1, 2016 at 3:40 PM, Marc Haber <mh+linux-bt...@zugschlus.de> 
>> > wrote:
>> > > btrfs balance -mprofiles seems to do something. one kworked and one
>> > > btrfs-transaction process hog one CPU core each for hours, while
>> > > blocking the filesystem for minutes apiece, which leads to the host
>> > > being nearly unuseable up to the point of "clock and mouse pointer
>> > > frozen for nearly ten minutes".
>> >
>> > I assume you still have your every 10 minutes snapshotting running
>> > while balancing?
>>
>> No, I disabled the cronjob before trying the balance. I might be
>> crazy, but not stup^wunexperienced.
>
> That being said, I would still expect the code not to allow _this_
> kind of effect on the entire system when two alledgely incompatible
> operations run simultaneously. I mean, Linux is a multi-user,
> multi-tasking operating system where one simply cannot expect all
> processes to be cooperative to each other. We have the operating
> systems to prevent this kind of issues, not to cause them.

Maybe look at it differently: Does user mh have trouble using this
laptop w.r.t. storing files?

In openSUSE Tumbleweed (the snapshot from end of March), root access
is needed to change the default snapshotting config; otherwise you
will have a 10-year history. After that change has been done according
to the needs of the user, there is no need to run a manual balance.


Re: Another ENOSPC situation

2016-04-01 Thread Henk Slager
On Fri, Apr 1, 2016 at 3:40 PM, Marc Haber  wrote:
> Hi,
>
> just for a change, this is another btrfs on a different host. The host
> is also running Debian unstable with mainline kernels, the btrfs in
> question was created (not converted) in March 2015 with btrfs-tools
> 3.17. It is the root fs of my main work notebook which is under
> workstation load, with lots of snapshots being created and deleted.
>
> Balance immediately fails with ENOSPC
>
> balance -dprofiles=single -dusage=1 goes through "fine" ("had to
> relocate 0 out of 602 chunks")
>
> balance -dprofiles=single -dusage=2 also ENOSPCes immediately.
>
> [4/502]mh@swivel:~$ sudo btrfs fi usage /
> Overall:
> Device size: 600.00GiB
> Device allocated:600.00GiB
> Device unallocated:1.00MiB
> Device missing:  0.00B
> Used:413.40GiB
> Free (estimated):148.20GiB  (min: 148.20GiB)
> Data ratio:   1.00
> Metadata ratio:   2.00
> Global reserve:  512.00MiB  (used: 0.00B)
>
> Data,single: Size:553.93GiB, Used:405.73GiB
>/dev/mapper/swivelbtr 553.93GiB
>
> Metadata,DUP: Size:23.00GiB, Used:3.83GiB
>/dev/mapper/swivelbtr  46.00GiB
>
> System,DUP: Size:32.00MiB, Used:112.00KiB
>/dev/mapper/swivelbtr  64.00MiB
>
> Unallocated:
>/dev/mapper/swivelbtr   1.00MiB
> [5/503]mh@swivel:~$
>
> btrfs balance -mprofiles seems to do something. one kworked and one
> btrfs-transaction process hog one CPU core each for hours, while
> blocking the filesystem for minutes apiece, which leads to the host
> being nearly unuseable up to the point of "clock and mouse pointer
> frozen for nearly ten minutes".

I assume you still have your every 10 minutes snapshotting running
while balancing?

> The btrfs balance cancel I issued after four hours of this state took
> eleven minutes alone to complete.
>
> These are all log entries that were obtained after starting btrfs
> balance -mprofiles on 09:43
> Apr  1 12:18:21 swivel kernel: [253651.970413] BTRFS info (device dm-14): 
> found 3523 extents
> Apr  1 12:18:21 swivel kernel: [253652.035572] BTRFS info (device dm-14): 
> relocating block group 1538365849600 flags 36
> Apr  1 13:30:57 swivel kernel: [258007.653597] BTRFS info (device dm-14): 
> found 3585 extents
> Apr  1 13:30:57 swivel kernel: [258007.746541] BTRFS info (device dm-14): 
> relocating block group 1536755236864 flags 36
> Apr  1 13:49:39 swivel kernel: [259130.296184] BTRFS info (device dm-14): 
> found 3047 extents
> Apr  1 13:49:39 swivel kernel: [259130.357314] BTRFS info (device dm-14): 
> relocating block group 1528702173184 flags 36
> Apr  1 14:30:00 swivel kernel: [261550.776348] BTRFS info (device dm-14): 
> found 4200 extents
>
> This kernel trace from 11:16 is not btrfs-related, is it? I guess it's
> bluetooth related since it happened simultaneously to the bluetooth
> device popping out an in:
> Apr  1 11:16:38 swivel kernel: [249948.993751] usb 1-1.4: USB disconnect, 
> device number 39
> Apr  1 11:16:38 swivel systemd[1]: Starting Load/Save RF Kill Switch Status...
> Apr  1 11:16:38 swivel systemd[1]: Started Load/Save RF Kill Switch Status.
> Apr  1 11:16:38 swivel systemd[1]: bluetooth.target: Unit not needed anymore. 
> Stopping.
> Apr  1 11:16:38 swivel systemd[1]: Stopped target Bluetooth.
> Apr  1 11:16:38 swivel laptop-mode: Laptop mode
> Apr  1 11:16:38 swivel laptop-mode: enabled, not active
> Apr  1 11:16:39 swivel kernel: [249949.211549] usb 1-1.4: new full-speed USB 
> device number 40 using ehci-pci
> Apr  1 11:16:39 swivel kernel: [249949.308386] usb 1-1.4: New USB device 
> found, idVendor=0a5c, idProduct=217f
> Apr  1 11:16:39 swivel kernel: [249949.308397] usb 1-1.4: New USB device 
> strings: Mfr=1, Product=2, SerialNumber=3
> Apr  1 11:16:39 swivel kernel: [249949.308402] usb 1-1.4: Product: Broadcom 
> Bluetooth Device
> Apr  1 11:16:39 swivel kernel: [249949.308407] usb 1-1.4: Manufacturer: 
> Broadcom Corp
> Apr  1 11:16:39 swivel kernel: [249949.308412] usb 1-1.4: SerialNumber: 
> CCAF78F1274F
> Apr  1 11:16:39 swivel systemd[1]: Reached target Bluetooth.
> Apr  1 11:16:39 swivel kernel: [249949.507794] [ cut here 
> ]
> Apr  1 11:16:39 swivel kernel: [249949.507810] WARNING: CPU: 1 PID: 11 at 
> arch/x86/kernel/cpu/perf_event_intel_ds.c:325 reserve_ds_buffers+0x102/0x326()
> Apr  1 11:16:39 swivel kernel: [249949.507813] alloc_bts_buffer: BTS buffer 
> allocation failure
> Apr  1 11:16:39 swivel kernel: [249949.507816] Modules linked in: cpuid 
> hid_generic usbhid hid e1000e tun ctr ccm rfcomm bridge stp llc 
> cpufreq_userspace cpufreq_stats cpufreq_conservative cpufreq_powersave 
> nf_conntrack_netlink nfnetlink bnep binfmt_misc intel_rapl 
> x86_pkg_temp_thermal arc4 intel_powerclamp kvm_intel kvm irqbypass iwldvm 
> snd_hda_codec_conexant 

Re: btrfsck: backpointer mismatch (and multiple other errors)

2016-03-31 Thread Henk Slager
On Thu, Mar 31, 2016 at 10:44 PM, Kai Krakow  wrote:
> Hello!
>
> I already reported this in another thread but it was a bit confusing by
> intermixing multiple volumes. So let's start a new thread:
>
> Since one of the last kernel upgrades, I'm experiencing one VDI file
> (containing a NTFS image with Windows 7) getting damaged when running
> the machine in VirtualBox. I got knowledge about this after
> experiencing an error "duplicate object" and btrfs went RO. I fixed it
> by deleting the VDI and restoring from backup - but no I get csum
> errors as soon as some VM IO goes into the VDI file.
>
> The FS is still usable. One effect is, that after reading all files
> with rsync (to copy to my backup), each call of "du" or "df" hangs, also
> similar calls to "btrfs {sub|fi} ..." show the same effect. I guess one
> outcome of this is, that the FS does not properly unmount during
> shutdown.
>
> Kernel is 4.5.0 by now (the FS is much much older, dates back to 3.x
> series, and never had problems), including Gentoo patch-set r1.

One possibility could be that the VirtualBox kernel modules somehow
corrupt btrfs kernel memory since kernel 4.5.

In order to make this reproducible for others (or to attempt to), you
could unload the VirtualBox modules, restore the VDI file
from backup (or use whatever other big file), and then make
pseudo-random but reproducible writes to the file.
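
One way to get a reproducible pseudo-random write pattern could be fio
with a fixed seed (a sketch; file path, size and seed are made up):

  fio --name=repro --filename=/mnt/test/big.vdi --rw=randwrite \
      --bs=4k --size=4g --randseed=1234 --ioengine=psync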

It is not clear to me what 'Gentoo patch-set r1' is and does. So just
boot a vanilla v4.5 kernel from kernel.org and see if you get csum
errors in dmesg.

Also, where does 'duplicate object' come from? dmesg? Then please
post its surroundings, straight from dmesg.

> The device layout is:
>
> $ lsblk -o NAME,MODEL,FSTYPE,LABEL,MOUNTPOINT
> NAMEMODELFSTYPE LABEL  MOUNTPOINT
> sda Crucial_CT128MX1
> ├─sda1   vfat   ESP/boot
> ├─sda2
> └─sda3   bcache
>   ├─bcache0  btrfs  system
>   ├─bcache1  btrfs  system
>   └─bcache2  btrfs  system /usr/src
> sdb SAMSUNG HD103SJ
> ├─sdb1   swap   swap0  [SWAP]
> └─sdb2   bcache
>   └─bcache2  btrfs  system /usr/src
> sdc SAMSUNG HD103SJ
> ├─sdc1   swap   swap1  [SWAP]
> └─sdc2   bcache
>   └─bcache1  btrfs  system
> sdd SAMSUNG HD103UJ
> ├─sdd1   swap   swap2  [SWAP]
> └─sdd2   bcache
>   └─bcache0  btrfs  system
>
> Mount options are:
>
> $ mount|fgrep btrfs
> /dev/bcache2 on / type btrfs 
> (rw,noatime,compress=lzo,nossd,discard,space_cache,autodefrag,subvolid=256,subvol=/gentoo/rootfs)
>
> The FS uses mraid=1 and draid=0.
>
> Output of btrfsck is:
> (also available here:
> https://gist.github.com/kakra/bfcce4af242f6548f4d6b45c8afb46ae)
>
> $ btrfsck /dev/disk/by-label/system
> checking extents
> ref mismatch on [10443660537856 524288] extent item 1, found 2
This 10443660537856 number is bigger than the 1832931324360 number
found for total bytes. AFAIK, this is already wrong.

[...]

> checking fs roots
> root 4336 inode 4284125 errors 1000, some csum missing
What is in this inode?

> Checking filesystem on /dev/disk/by-label/system
> UUID: d2bb232a-2e8f-4951-8bcc-97e237f1b536
> found 1832931324360 bytes used err is 1
> total csum bytes: 1730105656
> total tree bytes: 6494474240
> total fs tree bytes: 3789783040
> total extent tree bytes: 608219136
> btree space waste bytes: 1221460063
> file data blocks allocated: 2406059724800
>  referenced 2040857763840
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: bad metadata crossing stripe boundary

2016-03-31 Thread Henk Slager
>> Would you please try the following patch based on v4.5 btrfs-progs?
>> https://patchwork.kernel.org/patch/8706891/
>>
>> According to your output, all the output is false alert.
>> All the extent starting bytenr can be divided by 64K, and I think at
>> initial time, its 'max_size' may be set to 0, causing "start + 0 - 1"
>> to be inside previous 64K range.
>>
>> The patch would update cross_stripe every time the extent is updated,
>> so such temporary false alert should disappear.
>
> Applied and no more reports of crossing stripe boundary - thanks.
>
> Will this go into 4.5.1 or 4.5.2?

It is not in 4.5.1


Re: "bad metadata" not fixed by btrfs repair

2016-03-31 Thread Henk Slager
On Mon, Mar 28, 2016 at 4:37 PM, Marc Haber  wrote:
> Hi,
>
> I have a btrfs which btrfs check --repair doesn't fix:
>
> # btrfs check --repair /dev/mapper/fanbtr
> bad metadata [4425377054720, 4425377071104) crossing stripe boundary
> bad metadata [4425380134912, 4425380151296) crossing stripe boundary
> bad metadata [4427532795904, 4427532812288) crossing stripe boundary
> bad metadata [4568321753088, 4568321769472) crossing stripe boundary
> bad metadata [4568489656320, 4568489672704) crossing stripe boundary
> bad metadata [4571474493440, 4571474509824) crossing stripe boundary
> bad metadata [4571946811392, 4571946827776) crossing stripe boundary
> bad metadata [4572782919680, 4572782936064) crossing stripe boundary
> bad metadata [4573086351360, 4573086367744) crossing stripe boundary
> bad metadata [4574221041664, 4574221058048) crossing stripe boundary
> bad metadata [4574373412864, 4574373429248) crossing stripe boundary
> bad metadata [4574958649344, 4574958665728) crossing stripe boundary
> bad metadata [4575996018688, 4575996035072) crossing stripe boundary
> bad metadata [4580376772608, 4580376788992) crossing stripe boundary

In this case, for all of the reported ... [X, Y) ... ranges,
X is 64K aligned and Y - X = 16K,
so these are also false alerts.
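
For anyone who wants to double-check such a report, a minimal bash
sketch (the numbers are taken from the first line above); an extent
[X, Y) really crosses a 64K stripe boundary only if X and Y - 1 land
in different 64K blocks:

  X=4425377054720; Y=4425377071104   # first reported range above
  if (( X / 65536 != (Y - 1) / 65536 )); then
      echo "really crosses a 64K stripe boundary"
  else
      echo "false alert"
  fi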


Re: "bad metadata" not fixed by btrfs repair

2016-03-31 Thread Henk Slager
On Thu, Mar 31, 2016 at 4:23 AM, Qu Wenruo <quwen...@cn.fujitsu.com> wrote:
>
>
> Henk Slager wrote on 2016/03/30 16:03 +0200:
>>
>> On Wed, Mar 30, 2016 at 9:00 AM, Qu Wenruo <quwen...@cn.fujitsu.com>
>> wrote:
>>>
>>> First of all.
>>>
>>> The "crossing stripe boundary" error message itself is *HARMLESS* for
>>> recent
>>> kernels.
>>>
>>> It only means, that metadata extent won't be checked by scrub on recent
>>> kernels.
>>> Because scrub by its codes, has a limitation that, it can only check tree
>>> blocks which are inside a 64K block.
>>>
>>> Old kernel won't have anything wrong, until that tree block is being
>>> scrubbed.
>>> When scrubbed, old kernel just BUG_ON().
>>>
>>> Now recent kernel will handle such limitation by checking extent
>>> allocation
>>> and avoid crossing boundary, so new created fs with new kernel won't
>>> cause
>>> such error message at all.
>>>
>>> But for old created fs, the problem can't be avoided, but at least, new
>>> kernels will not BUG_ON() when you scrub these extents, they just get
>>> ignored (not that good, but at least no BUG_ON).
>>>
>>> And new fsck will check such case, gives such warning.
>>>
>>> Overall, you're OK if you are using recent kernels.
>>>
>>> Marc Haber wrote on 2016/03/29 08:43 +0200:
>>>>
>>>>
>>>> On Mon, Mar 28, 2016 at 03:35:32PM -0400, Austin S. Hemmelgarn wrote:
>>>>>
>>>>>
>>>>> Did you convert this filesystem from ext4 (or ext3)?
>>>>
>>>>
>>>>
>>>> No.
>>>>
>>>>> You hadn't mentioned what version of btrfs-progs you're using, and that
>>>>> is
>>>>> somewhat important for recovery.  I'm not sure if current versions of
>>>>> btrfs
>>>>> check can fix this issue, but I know for a fact that older versions
>>>>> (prior
>>>>> to at least 4.1) can not fix it.
>>>>
>>>>
>>>>
>>>> 4.1 for creation and btrfs check.
>>>
>>>
>>>
>>> I assume that you have run older kernel on it, like v4.1 or v4.2.
>>>
>>> In those old kernels, it lacks the check to avoid such extent allocation
>>> check.
>>>
>>>>
>>>>> As far as what the kernel is involved with, the easy way to check is if
>>>>> it's
>>>>> operating on a mounted filesystem or not.  If it only operates on
>>>>> mounted
>>>>> filesystems, it almost certainly goes through the kernel, if it only
>>>>> operates on unmounted filesystems, it's almost certainly done in
>>>>> userspace
>>>>> (except dev scan and technically fi show).
>>>>
>>>>
>>>>
>>>> Then btrfs check is a userspace-only matter, as it wants the fs
>>>> unmounted, and it is irrelevant that I did btrfs check from a rescue
>>>> system with an older kernel, 3.16 if I recall correctly.
>>>
>>>
>>>
>>> Not recommended to use older kernel to RW mount or use older fsck to do
>>> repair.
>>> As it's possible that older kernel/btrfsck may allocate extent that cross
>>> the 64K boundary.
>>>
>>>>
>>>>> 2. Regarding general support:  If you're using an enterprise
>>>>> distribution
>>>>> (RHEL, SLES, CentOS, OEL, or something similar), you are almost
>>>>> certainly
>>>>> going to get better support from your vendor than from the mailing list
>>>>> or
>>>>> IRC.
>>>>
>>>>
>>>>
>>>> My "productive" desktops (fan is one of them) run Debian unstable with
>>>> a current vanilla kernel. At the moment, I can't use 4.5 because it
>>>> acts up with KVM.  When I need a rescue system, I use grml, which
>>>> unfortunately hasn't released since November 2014 and is still with
>>>> kernel 3.16
>>>
>>>
>>>
>>> To fix your problem(make these error message just disappear, even they
>>> are
>>> harmless on recent kernels), the most easy one, is to balance your
>>> metadata.
>>
>>
>> I did a balance with filter -musage=100  (kernel/tools 4.5/4.5) of the
>> filesystem mentioned in here:
>> http://www.sp

Re: "bad metadata" not fixed by btrfs repair

2016-03-31 Thread Henk Slager
On Thu, Mar 31, 2016 at 2:28 AM, Qu Wenruo <quwen...@cn.fujitsu.com> wrote:
>
>
> Henk Slager wrote on 2016/03/30 16:03 +0200:
>>
>> On Wed, Mar 30, 2016 at 9:00 AM, Qu Wenruo <quwen...@cn.fujitsu.com>
>> wrote:
>>>
>>> First of all.
>>>
>>> The "crossing stripe boundary" error message itself is *HARMLESS* for
>>> recent
>>> kernels.
>>>
>>> It only means, that metadata extent won't be checked by scrub on recent
>>> kernels.
>>> Because scrub by its codes, has a limitation that, it can only check tree
>>> blocks which are inside a 64K block.
>>>
>>> Old kernel won't have anything wrong, until that tree block is being
>>> scrubbed.
>>> When scrubbed, old kernel just BUG_ON().
>>>
>>> Now recent kernel will handle such limitation by checking extent
>>> allocation
>>> and avoid crossing boundary, so new created fs with new kernel won't
>>> cause
>>> such error message at all.
>>>
>>> But for old created fs, the problem can't be avoided, but at least, new
>>> kernels will not BUG_ON() when you scrub these extents, they just get
>>> ignored (not that good, but at least no BUG_ON).
>>>
>>> And new fsck will check such case, gives such warning.
>>>
>>> Overall, you're OK if you are using recent kernels.
>>>
>>> Marc Haber wrote on 2016/03/29 08:43 +0200:
>>>>
>>>>
>>>> On Mon, Mar 28, 2016 at 03:35:32PM -0400, Austin S. Hemmelgarn wrote:
>>>>>
>>>>>
>>>>> Did you convert this filesystem from ext4 (or ext3)?
>>>>
>>>>
>>>>
>>>> No.
>>>>
>>>>> You hadn't mentioned what version of btrfs-progs you're using, and that
>>>>> is
>>>>> somewhat important for recovery.  I'm not sure if current versions of
>>>>> btrfs
>>>>> check can fix this issue, but I know for a fact that older versions
>>>>> (prior
>>>>> to at least 4.1) can not fix it.
>>>>
>>>>
>>>>
>>>> 4.1 for creation and btrfs check.
>>>
>>>
>>>
>>> I assume that you have run older kernel on it, like v4.1 or v4.2.
>>>
>>> In those old kernels, it lacks the check to avoid such extent allocation
>>> check.
>>>
>>>>
>>>>> As far as what the kernel is involved with, the easy way to check is if
>>>>> it's
>>>>> operating on a mounted filesystem or not.  If it only operates on
>>>>> mounted
>>>>> filesystems, it almost certainly goes through the kernel, if it only
>>>>> operates on unmounted filesystems, it's almost certainly done in
>>>>> userspace
>>>>> (except dev scan and technically fi show).
>>>>
>>>>
>>>>
>>>> Then btrfs check is a userspace-only matter, as it wants the fs
>>>> unmounted, and it is irrelevant that I did btrfs check from a rescue
>>>> system with an older kernel, 3.16 if I recall correctly.
>>>
>>>
>>>
>>> Not recommended to use older kernel to RW mount or use older fsck to do
>>> repair.
>>> As it's possible that older kernel/btrfsck may allocate extent that cross
>>> the 64K boundary.
>>>
>>>>
>>>>> 2. Regarding general support:  If you're using an enterprise
>>>>> distribution
>>>>> (RHEL, SLES, CentOS, OEL, or something similar), you are almost
>>>>> certainly
>>>>> going to get better support from your vendor than from the mailing list
>>>>> or
>>>>> IRC.
>>>>
>>>>
>>>>
>>>> My "productive" desktops (fan is one of them) run Debian unstable with
>>>> a current vanilla kernel. At the moment, I can't use 4.5 because it
>>>> acts up with KVM.  When I need a rescue system, I use grml, which
>>>> unfortunately hasn't released since November 2014 and is still with
>>>> kernel 3.16
>>>
>>>
>>>
>>> To fix your problem(make these error message just disappear, even they
>>> are
>>> harmless on recent kernels), the most easy one, is to balance your
>>> metadata.
>>
>>
>> I did a balance with filter -musage=100  (kernel/tools 4.5/4.5) of the
>> filesystem mentioned in here:
>> http://www.spinics.net/lists/linux-btrfs/msg51405.html
>>
>> but still   bad metadata [ ),  crossing stripe boundary   messages,
>> double amount compared to 2 months ago
>
>
> Would you please give an example of the output?
> So I can check if it's really crossing the boundary.

This is the first of the 105 messages:
bad metadata [8263437058048, 8263437062144) crossing stripe boundary

For all of the reported ... [X, Y) ... ranges,
X is 64K aligned and Y - X = 4K.

So in my case, they are all false alerts.


Re: fallocate mode flag for "unshare blocks"?

2016-03-31 Thread Henk Slager
On Thu, Mar 31, 2016 at 5:31 PM, Andreas Dilger  wrote:
> On Mar 31, 2016, at 1:55 AM, Christoph Hellwig  wrote:
>>
>> On Wed, Mar 30, 2016 at 05:32:42PM -0700, Liu Bo wrote:
>>> Well, btrfs fallocate doesn't allocate space if it's a shared one
>>> because it thinks the space is already allocated.  So a later overwrite
>>> over this shared extent may hit enospc errors.
>>
>> And this makes it an incorrect implementation of posix_fallocate,
>> which glibcs implements using fallocate if available.
>
> It isn't really useful for a COW filesystem to implement fallocate()
> to reserve blocks.  Even if it did allocate all of the blocks on the
> initial fallocate() call, when it comes time to overwrite these blocks
> new blocks need to be allocated as the old ones will not be overwritten.

There are also use-cases on btrfs with CoW disabled, like operations
on virtual machine images that aren't snapshotted.
Those files tend to be big, and having fallocate() implemented and
working like it does on e.g. XFS, in order to achieve space and speed
efficiency, makes sense IMHO.
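
A minimal sketch of that use-case, with a file name and size I picked
for illustration; note that the NoCOW attribute has to be set while the
file is still empty:

  touch /data/vm.img
  chattr +C /data/vm.img           # disable copy-on-write for this file
  fallocate -l 40G /data/vm.img    # reserve the full image size up front
  lsattr /data/vm.img              # should list the 'C' attribute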


Re: "bad metadata" not fixed by btrfs repair

2016-03-30 Thread Henk Slager
On Wed, Mar 30, 2016 at 9:00 AM, Qu Wenruo  wrote:
> First of all.
>
> The "crossing stripe boundary" error message itself is *HARMLESS* for recent
> kernels.
>
> It only means, that metadata extent won't be checked by scrub on recent
> kernels.
> Because scrub by its codes, has a limitation that, it can only check tree
> blocks which are inside a 64K block.
>
> Old kernel won't have anything wrong, until that tree block is being
> scrubbed.
> When scrubbed, old kernel just BUG_ON().
>
> Now recent kernel will handle such limitation by checking extent allocation
> and avoid crossing boundary, so new created fs with new kernel won't cause
> such error message at all.
>
> But for old created fs, the problem can't be avoided, but at least, new
> kernels will not BUG_ON() when you scrub these extents, they just get
> ignored (not that good, but at least no BUG_ON).
>
> And new fsck will check such case, gives such warning.
>
> Overall, you're OK if you are using recent kernels.
>
> Marc Haber wrote on 2016/03/29 08:43 +0200:
>>
>> On Mon, Mar 28, 2016 at 03:35:32PM -0400, Austin S. Hemmelgarn wrote:
>>>
>>> Did you convert this filesystem from ext4 (or ext3)?
>>
>>
>> No.
>>
>>> You hadn't mentioned what version of btrfs-progs you're using, and that
>>> is
>>> somewhat important for recovery.  I'm not sure if current versions of
>>> btrfs
>>> check can fix this issue, but I know for a fact that older versions
>>> (prior
>>> to at least 4.1) can not fix it.
>>
>>
>> 4.1 for creation and btrfs check.
>
>
> I assume that you have run older kernel on it, like v4.1 or v4.2.
>
> In those old kernels, it lacks the check to avoid such extent allocation
> check.
>
>>
>>> As far as what the kernel is involved with, the easy way to check is if
>>> it's
>>> operating on a mounted filesystem or not.  If it only operates on mounted
>>> filesystems, it almost certainly goes through the kernel, if it only
>>> operates on unmounted filesystems, it's almost certainly done in
>>> userspace
>>> (except dev scan and technically fi show).
>>
>>
>> Then btrfs check is a userspace-only matter, as it wants the fs
>> unmounted, and it is irrelevant that I did btrfs check from a rescue
>> system with an older kernel, 3.16 if I recall correctly.
>
>
> Not recommended to use older kernel to RW mount or use older fsck to do
> repair.
> As it's possible that older kernel/btrfsck may allocate extent that cross
> the 64K boundary.
>
>>
>>> 2. Regarding general support:  If you're using an enterprise distribution
>>> (RHEL, SLES, CentOS, OEL, or something similar), you are almost certainly
>>> going to get better support from your vendor than from the mailing list
>>> or
>>> IRC.
>>
>>
>> My "productive" desktops (fan is one of them) run Debian unstable with
>> a current vanilla kernel. At the moment, I can't use 4.5 because it
>> acts up with KVM.  When I need a rescue system, I use grml, which
>> unfortunately hasn't released since November 2014 and is still with
>> kernel 3.16
>
>
> To fix your problem(make these error message just disappear, even they are
> harmless on recent kernels), the most easy one, is to balance your metadata.

I did a balance with the -musage=100 filter (kernel/tools 4.5/4.5) on
the filesystem mentioned here:
http://www.spinics.net/lists/linux-btrfs/msg51405.html

but I still get 'bad metadata [ ) crossing stripe boundary' messages,
about double the amount compared to 2 months ago.
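
For reference, the invocation was roughly the following (the mount
point here is a placeholder):

  btrfs balance start -musage=100 /mnt/pool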

The kernel operating this fs has always been at most 1 month behind the
'Latest Stable Kernel' (kernel.org terminology).

> As I explained, the bug only lies in metadata, and balance will allocate new
> tree blocks, then copy old data into new locations.
>
> In the allocation process of recent kernel, it will avoid such cross
> boundary, and to fix your problem.
>
> But if you are using old kernels, don't scrub your metadata.
>
> Thanks,
> Qu
>>
>>
>> Greetings
>> Marc
>>
>
>


Re: RAID Assembly with Missing Empty Drive

2016-03-27 Thread Henk Slager
On Sun, Mar 27, 2016 at 4:59 PM, John Marrett  wrote:
>>> If you do want to use a newer one, I'd build against kernel.org, just
>>> because the developers only use that base. And use 4.4.6 or 4.5.
>>
>> At this point I could remove the overlays and recover the filesystem
>> permanently, however I'm also deeply indebted to the btrfs community
>> and want to give anything I can back. I've built (but not installed ;)
>> ) a straight kernel.org 4.5 w/my missing device check patch applied.
>> Is there any interest or value in attempting to switch to this kernel,
>> add/delete a device and see if I experience the same errors as before
>> I tried replace? What information should I gather if I do this?
>
> I've built and installed a 4.5 straight from kernel.org with my patch.
>
> I encountered the same errors in recovery when I use add/delete
> instead of using replace, here's the sequence of commands:
>
> ubuntu@btrfs-recovery:~$ sudo mount -o degraded,ro /dev/sda /mnt
> ubuntu@btrfs-recovery:~$ sudo mount -o remount,rw /mnt
> # Remove first empty device
> ubuntu@btrfs-recovery:~$ sudo btrfs device delete missing /mnt
> # Add blank drive
> ubuntu@btrfs-recovery:~$ sudo btrfs device add /dev/sde /mnt
> # Remove second missing device with data
> ubuntu@btrfs-recovery:~$ sudo btrfs device delete missing /mnt
>
> And the resulting error:
>
> ubuntu@btrfs-recovery:~$ sudo btrfs device delete missing /mnt
> ERROR: error removing the device 'missing' - Input/output error
>
> Here's what we see in dmesg after deleting the missing device:
>
> [  588.231341] BTRFS info (device sdd): relocating block group
> 10560347308032 flags 17
> [  664.306122] BTRFS warning (device sdd): csum failed ino 257 off
> 695730176 csum 2566472073 expected csum 2706136415
> [  664.306164] BTRFS warning (device sdd): csum failed ino 257 off
> 695734272 csum 2566472073 expected csum 2558511802
> [  664.306182] BTRFS warning (device sdd): csum failed ino 257 off
> 695746560 csum 2566472073 expected csum 3360772439
> [  664.306191] BTRFS warning (device sdd): csum failed ino 257 off
> 695750656 csum 2566472073 expected csum 1205516886
> [  664.344179] BTRFS warning (device sdd): csum failed ino 257 off
> 695730176 csum 2566472073 expected csum 2706136415
> [  664.344213] BTRFS warning (device sdd): csum failed ino 257 off
> 695734272 csum 2566472073 expected csum 2558511802
> [  664.344224] BTRFS warning (device sdd): csum failed ino 257 off
> 695746560 csum 2566472073 expected csum 3360772439
> [  664.344233] BTRFS warning (device sdd): csum failed ino 257 off
> 695750656 csum 2566472073 expected csum 1205516886
> [  664.344684] BTRFS warning (device sdd): csum failed ino 257 off
> 695730176 csum 2566472073 expected csum 2706136415
> [  664.344693] BTRFS warning (device sdd): csum failed ino 257 off
> 695734272 csum 2566472073 expected csum 2558511802
>
> Is there anything of value I can do here to help address this possible
> issue in btrfs itself, or should I remove the overlays, replace the
> device and move on?
>
> Please let me know,

I think it is great that with your local patch you managed to get into
a writable situation.
In theory, with for example a new spare disk already attached and on
standby (the hot-spare patchset etc.), a direct replace of the failing
disk, either internally or manually with btrfs replace, would have
prevented the few csum and other small errors. It could be that the
errors have another cause than the initially failing harddisk, but that
won't be easy to track down conclusively. The ddrescue action and the
local patch also make tracing back difficult, and it was all based on
an outdated kernel+tools.

I think it is best that you simply repeat the fix on the real disks and
make sure you have an up-to-date/latest kernel+tools when repairing the
few damaged files.
With   btrfs inspect-internal inode-resolve 257 <mountpoint>
you can see which file(s) are damaged.
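
A small sketch for collecting all affected files in one go, assuming
the filesystem is mounted at /mnt and taking the inode numbers from the
'csum failed ino ...' lines in dmesg:

  for ino in $(dmesg | grep -o 'csum failed ino [0-9]*' | awk '{print $4}' | sort -u); do
      btrfs inspect-internal inode-resolve "$ino" /mnt
  done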


Re: RAID-1 refuses to balance large drive

2016-03-25 Thread Henk Slager
On Fri, Mar 25, 2016 at 2:16 PM, Patrik Lundquist
 wrote:
> On 23 March 2016 at 20:33, Chris Murphy  wrote:
>>
>> On Wed, Mar 23, 2016 at 1:10 PM, Brad Templeton  wrote:
>> >
>> > I am surprised to hear it said that having the mixed sizes is an odd
>> > case.
>>
>> Not odd as in wrong, just uncommon compared to other arrangements being 
>> tested.
>
> I think mixed drive sizes in raid1 is a killer feature for a home NAS,
> where you replace an old smaller drive with the latest and largest
> when you need more storage.
>
> My raid1 currently consists of 6TB+3TB+3*2TB.

For the original OP's situation, with the chunks all filled up with
extents and the devices all filled up with chunks, 'integrating' a new
6TB drive into a 4TB+3TB+2TB raid1 array could probably be done in a
somewhat unusual way in order to avoid an immediate need to balance:
- 'plug-in' the 6TB
- btrfs-replace  4TB by 6TB
- btrfs fi resize max 6TB_devID
- btrfs-replace  2TB by 4TB
- btrfs fi resize max 4TB_devID
- 'unplug' the 2TB

Then there would be 2 devices with roughly 2TB of space available each,
which is good for continued btrfs raid1 writes.
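
Roughly in commands (the devids and device names here are assumptions;
check 'btrfs fi show' for the real ones):

  # assumed: devid 1 = old 4TB (/dev/sdb), devid 3 = old 2TB, new 6TB = /dev/sde
  btrfs replace start -B 1 /dev/sde /mnt
  btrfs filesystem resize 1:max /mnt          # the new 6TB inherited devid 1
  btrfs replace start -B -f 3 /dev/sdb /mnt   # -f: the old 4TB still carries a btrfs signature
  btrfs filesystem resize 3:max /mnt          # the 4TB now has devid 3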

An offline variant with dd instead of btrfs replace could also be done
(I used to do that sometimes before btrfs replace was implemented).
My experience is that btrfs replace runs at roughly maximum speed (i.e.
the harddisk's sequential media transfer speed) during the whole
replace process, and it does in a more direct way what you actually
want. So in total the device replace/upgrade is usually much faster
than with the add+delete method, and raid1 redundancy stays active the
whole time. Of course it means first making sure the system runs an
up-to-date/latest kernel+tools.

