Re: trouble mounting btrfs filesystem....

2018-08-14 Thread Dmitrii Tcvetkov
> Scott E. Blomquist writes:
>  > Hi All,
>  > 
>  > Early this morning there was a power glitch that affected our
>  > system.
>  > 
>  > The second enclosure went offline but the file system stayed up
>  > for a bit before rebooting and recovering the 2 missing arrays
>  > sdb1 and sdc1.
>  > 
>  > When mounting we get
>  > 
>  > Aug 12 14:52:43 localhost kernel: [ 8536.649270] BTRFS info (device sda1): has skinny extents
>  > Aug 12 14:54:52 localhost kernel: [ 8665.900321] BTRFS error (device sda1): parent transid verify failed on 177443463479296 wanted 2159304 found 2159295
>  > Aug 12 14:54:52 localhost kernel: [ 8665.985512] BTRFS error (device sda1): parent transid verify failed on 177443463479296 wanted 2159304 found 2159295
>  > Aug 12 14:54:52 localhost kernel: [ 8666.056845] BTRFS error (device sda1): failed to read block groups: -5
>  > Aug 12 14:54:52 localhost kernel: [ 8666.254178] BTRFS error (device sda1): open_ctree failed
>  > 
>  > We are here...
>  > 
>  > # uname -a
>  > Linux localhost 4.17.14-custom #1 SMP Sun Aug 12 11:54:00 EDT
>  > 2018 x86_64 x86_64 x86_64 GNU/Linux
>  > 
>  > # btrfs --version
>  > btrfs-progs v4.17.1
>  > 
>  > # btrfs filesystem show
>  > Label: none  uuid: 8337c837-58cb-430a-a929-7f6d2f50bdbb
>  > Total devices 3 FS bytes used 75.05TiB
>  > devid1 size 47.30TiB used 42.07TiB path /dev/sda1
>  > devid2 size 21.83TiB used 16.61TiB path /dev/sdb1
>  > devid3 size 21.83TiB used 16.61TiB path /dev/sdc1
>  > 
>  > Thanks for any help.
>  > 
>  > sb. Scott Blomquist  
> Hi All,
> 
> Is there any more info needed here?
> 
> I can restore from backup if needed but that will take a bit of time.
> 
> Checking around it looks like I could try...
> 
> btrfs-zero-log /dev/sda1
> 
> Or maybe ..
> 
>btrfsck --repair /dev/sda1
> 
> I am just not sure here and would prefer to do the right thing.
> 
> Any help would be much appreciated.
> 
> Thanks,
> 
> sb. Scott Blomquist
> 
> 

I'm not a dev, just a user.
btrfs-zero-log is for a very specific case[1], not for transid errors.
Transid errors mean that some metadata writes are missing; if they
prevent you from mounting the filesystem it's pretty much fatal. If
btrfs could have recovered the metadata from a good copy, it would
already have done so.

"wanted 2159304 found 2159295" means that some metadata is stale by 
9 commits. You could try to mount it with "ro,usebackuproot" mount
options as readonly mount is less strict. If that works you can try
"usebackuproot" without ro option. But 9 commits is probably too much
and there isn't enough data to rollback so far.
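
For example (a sketch; /dev/sda1 is the device from your report and /mnt
is assumed to be a free mount point):

  # try read-only first with backup roots
  mount -o ro,usebackuproot /dev/sda1 /mnt
  # if that works, unmount and retry read-write
  umount /mnt
  mount -o usebackuproot /dev/sda1 /mnt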

[1] https://btrfs.wiki.kernel.org/index.php/Btrfs-zero-log


Re: 4.17-rc1 FS went read-only during balance

2018-04-23 Thread Dmitrii Tcvetkov
> > TL;DR It seems as regression in 4.17, but I managed to find a
> > workaround to make filesystem rw mountable again.
> >
> > Kernel built from tag v4.17-rc1
> > btrfs-progs 4.16
> >
> > Tonight two my machines (PC (ECC RAM) and laptop(non-ECC RAM)) were
> > doing usual weekly balance with this command via cron:
> > btrfs balance start -musage=50 -dusage=50 
> > Both machines run same kernel version. 
> >
> > On PC that caused root and "data" filesystems to go readonly. Root
> > is on an SSD with data single and metadata DUP, "data" filesystem
> > is on 2 HDDs with RAID1 for data and metadata.
> >
> > On laptop only /home went ro, it's on NVMe SSD with data single and
> > metadata DUP. 
> >
> > Btrfs check of PC rootfs was without any errors in both modes, I did
> > them once each before reboot on readonly filesystem with --force
> > flag and then from live usb. Same output without any errors.
> >
> > After reboot kernel refused rw mount rootfs with the same error as
> > during cron balance, ro mount was accepted, error during rw mount:
> > BTRFS: error (device dm-17) in merge_reloc_roots:2465: errno=-117  
> >>> 
>  117 means EUCLEAN, which could be caused by the newly introduced
>  first_key and level check.
> >>> 
>  Please apply this hotfix to fix it.
>  btrfs: Only check first key for committed tree blocks
>  (Which is included in latest pull request)
> >>> 
>  Also, please consider enable CONFIG_BTRFS_DEBUG to provide extra
>  debug info.
> >>> 
>  Thanks,
>  Qu
> >>>
> >>> I tried 4.17-rc2 (as the pull request was pulled) with
> >>> CONFIG_BTRFS_DEBUG on LVM snapshot of laptop home partition (/dev/vdb)
> >>> in a VM (VM kernel sees only snapshot so no UUID collisions). Dmesg
> >>> attached.
> >>
> >> Thanks for the info and your previous btrfs-image.
> >>
> >> The image itself shows nothing wrong, so it should be runtime problem.
> >> Would you please apply these two debug patches?
> >> https://patchwork.kernel.org/patch/10335133/
> >> https://patchwork.kernel.org/patch/10335135/
> >>
> >> And the attached diff file?
> >>
> >> My guess is the parent node is not initialized correctly in this case.
> >>
> >> Thanks,
> >> Qu  
> > 
> > Dmesg from kernel with all three patches applied attached.
> >   
> Thanks for the debug info, it really helps a lot!
> 
> It turns out that I'm just a super idiot, a typo in replace_path()
> caused this, and it could not be trigger unless we enter it from
> relocation recovery.
> 
> Please try the attached patch to see if it solves the problem.
> 
> Thanks,
> Qu
Glad to help. The patch solved the problem: the rw mount is successful,
the balance finished with no errors or debug output, and btrfs check is
clean in both modes.

[2.842718] BTRFS: device label home devid 1 transid 277952 /dev/vdb
[2.924965] BTRFS: device label root devid 1 transid 84092 /dev/vda2
[3.072271] BTRFS info (device vda2): use lzo compression, level 0
[3.072897] BTRFS info (device vda2): enabling auto defrag
[3.073476] BTRFS info (device vda2): using free space tree
[3.074049] BTRFS info (device vda2): has skinny extents
[5.411821] BTRFS info (device vda2): using free space tree
[   24.925293] BTRFS info (device vdb): using free space tree
[   24.925324] BTRFS info (device vdb): has skinny extents
[   31.711868] BTRFS info (device vdb): continuing balance
[   31.721658] BTRFS info (device vdb): checking UUID tree
[   31.822920] BTRFS info (device vdb): relocating block group 69889687552 flags data
[   33.730399] BTRFS info (device vdb): found 12 extents
[   36.950699] BTRFS info (device vdb): found 12 extents
[   37.030813] BTRFS info (device vdb): relocating block group 67742203904 flags metadata|dup
[   37.104174] BTRFS info (device vdb): relocating block group 67708649472 flags system|dup
[   37.189843] BTRFS info (device vdb): found 1 extents





Re: 4.17-rc1 FS went read-only during balance

2018-04-22 Thread Dmitrii Tcvetkov
> I saved /home filesystem from laptop in unmountable 
> by 4.17-rc1 state and can test patches and/or create 
> btrfs-image if it's needed.

Here is a link to the image (103 MB):
https://demfloro.ru/static/home-btrfs.image


4.17-rc1 FS went read-only during balance

2018-04-21 Thread Dmitrii Tcvetkov
TL;DR: It seems to be a regression in 4.17, but I managed to find a
workaround to make the filesystem rw-mountable again.

Kernel built from tag v4.17-rc1
btrfs-progs 4.16

Tonight two of my machines (a PC with ECC RAM and a laptop with non-ECC
RAM) were doing the usual weekly balance via cron with this command:
btrfs balance start -musage=50 -dusage=50 <mountpoint>
Both machines run the same kernel version.

On the PC that caused the root and "data" filesystems to go read-only.
Root is on an SSD with single data and DUP metadata; the "data"
filesystem is on 2 HDDs with RAID1 for both data and metadata.

On the laptop only /home went ro; it's on an NVMe SSD with single data
and DUP metadata.

Btrfs check of the PC rootfs reported no errors in both modes. I ran it
once each: before the reboot, on the read-only filesystem with the
--force flag, and then from a live USB. Same output, no errors.

After the reboot the kernel refused to mount the rootfs rw with the same
error as during the cron balance; an ro mount was accepted. Error during
the rw mount:
BTRFS: error (device dm-17) in merge_reloc_roots:2465: errno=-117 unknown
BTRFS info (device dm-17): forced readonly
BTRFS info (device dm-17): delayed_refs has NO entry
BTRFS error (device dm-17): cleaner transaction attach returned -3

Mounting rw with the skip_balance option didn't help either.

After that I mounted the rootfs rw with a 4.16.2 kernel; the mount was
successful and the kernel finished the balance. After that the filesystem
is mountable rw by the 4.17-rc1 kernel without errors, and btrfs check is
clean too.
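
In other words, the workaround on each affected filesystem was roughly
(a sketch; device and mount point are placeholders):

  # boot a 4.16.x kernel, then:
  mount /dev/sdX /mnt
  btrfs balance status /mnt   # wait until the resumed balance finishes
  # reboot into 4.17-rc1; the filesystem mounts rw again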

The data filesystem behaves the same; an rw mount on the 4.17-rc1 kernel yields:
[ 2321.370113] BTRFS: error (device dm-17) in merge_reloc_roots:2465: errno=-117 unknown
[ 2321.370119] BTRFS warning (device dm-17): failed to recover relocation: -30
[ 2321.370137] BTRFS info (device dm-17): delayed_refs has NO entry
[ 2321.370155] BTRFS error (device dm-17): cleaner transaction attach returned -30
[ 2321.414219] BTRFS error (device dm-17): open_ctree failed

An rw mount on 4.16.2 goes fine, and after the balance finishes the
filesystem is mountable by 4.17-rc1 again. I saved the /home filesystem
from the laptop in the state unmountable by 4.17-rc1 and can test patches
and/or create a btrfs-image if needed.
Apr 20 23:46:00 fire kernel: BTRFS: device label root devid 1 transid 350197 
/dev/dm-2
Apr 20 23:46:00 fire kernel: BTRFS info (device dm-2): enabling auto defrag
Apr 20 23:46:00 fire kernel: BTRFS info (device dm-2): use lzo compression, 
level 0
Apr 20 23:46:00 fire kernel: BTRFS info (device dm-2): using free space tree
Apr 20 23:46:00 fire kernel: BTRFS info (device dm-2): has skinny extents
Apr 20 23:46:00 fire kernel: BTRFS info (device dm-2): using free space tree
Apr 20 23:46:10 fire kernel: BTRFS: device label home devid 2 transid 358906 
/dev/dm-5
Apr 20 23:46:13 fire kernel: BTRFS: device label home devid 1 transid 358906 
/dev/dm-12
Apr 20 23:46:13 fire kernel: BTRFS info (device dm-12): use zstd compression, 
level 0
Apr 20 23:46:13 fire kernel: BTRFS info (device dm-12): enabling auto defrag
Apr 20 23:46:13 fire kernel: BTRFS info (device dm-12): using free space tree
Apr 20 23:46:13 fire kernel: BTRFS info (device dm-12): has skinny extents
Apr 20 23:52:32 fire kernel: BTRFS: device label storage devid 1 transid 357668 
/dev/dm-17
Apr 20 23:52:32 fire kernel: BTRFS: device label backup devid 1 transid 5383 
/dev/dm-18
Apr 20 23:52:41 fire kernel: BTRFS: device label storage devid 2 transid 357668 
/dev/dm-21
Apr 20 23:52:41 fire kernel: BTRFS: device label backup devid 2 transid 5383 
/dev/dm-22
Apr 20 23:52:42 fire kernel: BTRFS info (device dm-17): enabling auto defrag
Apr 20 23:52:42 fire kernel: BTRFS info (device dm-17): use zstd compression, 
level 0
Apr 20 23:52:42 fire kernel: BTRFS info (device dm-17): using free space tree
Apr 20 23:52:42 fire kernel: BTRFS info (device dm-17): has skinny extents
Apr 20 23:52:45 fire kernel: BTRFS info (device dm-18): enabling auto defrag
Apr 20 23:52:45 fire kernel: BTRFS info (device dm-18): use zstd compression, 
level 0
Apr 20 23:52:45 fire kernel: BTRFS info (device dm-18): using free space tree
Apr 20 23:52:45 fire kernel: BTRFS info (device dm-18): has skinny extents
Apr 21 01:30:00 fire kernel: BTRFS info (device dm-2): relocating block group 
27309113344 flags system|dup
Apr 21 01:30:00 fire kernel: BTRFS info (device dm-12): relocating block group 
62910365696 flags system|raid1
Apr 21 01:30:00 fire kernel: BTRFS info (device dm-17): relocating block group 
2140869230592 flags metadata|raid1
Apr 21 01:30:00 fire kernel: BTRFS info (device dm-2): relocating block group 
27040677888 flags metadata|dup
Apr 21 01:30:00 fire kernel: BTRFS info (device dm-12): relocating block group 
61836623872 flags data|raid1
Apr 21 01:30:01 fire kernel: BTRFS info (device dm-12): found 5 extents
Apr 21 01:30:01 fire kernel: BTRFS: error (device dm-2) in 
merge_reloc_roots:2465: errno=-117 unknown
Apr 21 01:30:01 fire kernel: BTRFS info (device dm-2): forced readonly
Apr 21 01:30:03 fire kernel: BTRFS info (device dm-12): found 5 extents
Apr 

Re: [PATCH V3] Btrfs: enchanse raid1/10 balance heuristic

2017-12-30 Thread Dmitrii Tcvetkov
On Sat, 30 Dec 2017 23:32:04 +0300
Timofey Titovets <nefelim...@gmail.com> wrote:

> Currently btrfs raid1/10 balancer balance requests to mirrors,
> based on pid % num of mirrors.
> 
> Make logic understood:
>  - if one of underline devices are non rotational
>  - Queue leght to underline devices
> 
> By default try use pid % num_mirrors guessing, but:
>  - If one of mirrors are non rotational, repick optimal to it
>  - If underline mirror have less queue leght then optimal,
>repick to that mirror
> 
> For avoid round-robin request balancing,
> lets round down queue leght:
>  - By 8 for rotational devs
>  - By 2 for all non rotational devs
> 
> Changes:
>   v1 -> v2:
> - Use helper part_in_flight() from genhd.c
>   to get queue lenght
> - Move guess code to guess_optimal()
> - Change balancer logic, try use pid % mirror by default
>   Make balancing on spinning rust if one of underline devices
>   are overloaded
>   v2 -> v3:
> - Fix arg for RAID10 - use sub_stripes, instead of num_stripes
> 
> Signed-off-by: Timofey Titovets <nefelim...@gmail.com>

Reviewed-by: Dmitrii Tcvetkov <demfl...@demfloro.ru>




Re: [PATCH v2] Btrfs: enchanse raid1/10 balance heuristic

2017-12-30 Thread Dmitrii Tcvetkov
On Sat, 30 Dec 2017 03:15:20 +0300
Timofey Titovets <nefelim...@gmail.com> wrote:

> 2017-12-29 22:14 GMT+03:00 Dmitrii Tcvetkov <demfl...@demfloro.ru>:
> > On Fri, 29 Dec 2017 21:44:19 +0300
> > Dmitrii Tcvetkov <demfl...@demfloro.ru> wrote:  
> >> > +/**
> >> > + * guess_optimal - return guessed optimal mirror
> >> > + *
> >> > + * Optimal expected to be pid % num_stripes
> >> > + *
> >> > + * That's generaly ok for spread load
> >> > + * Add some balancer based on queue leght to device
> >> > + *
> >> > + * Basic ideas:
> >> > + *  - Sequential read generate low amount of request
> >> > + *so if load of drives are equal, use pid % num_stripes
> >> > balancing
> >> > + *  - For mixed rotate/non-rotate mirrors, pick non-rotate as
> >> > optimal
> >> > + *and repick if other dev have "significant" less queue
> >> > lenght
> >> > + *  - Repick optimal if queue leght of other mirror are less
> >> > + */
> >> > +static int guess_optimal(struct map_lookup *map, int optimal)
> >> > +{
> >> > +   int i;
> >> > +   int round_down = 8;
> >> > +   int num = map->num_stripes;  
> >>
> >> num has to be initialized from map->sub_stripes if we're reading
> >> RAID10, otherwise there will be NULL pointer dereference
> >>  
> >
> > Check can be like:
> > if (map->type & BTRFS_BLOCK_GROUP_RAID10)
> > num = map->sub_stripes;
> >  
> >>@@ -5804,10 +5914,12 @@ static int __btrfs_map_block(struct
> >>btrfs_fs_info *fs_info,
> >>   stripe_index += mirror_num - 1;
> >>   else {
> >>   int old_stripe_index = stripe_index;
> >>+  optimal = guess_optimal(map,
> >>+  current->pid %
> >>map->num_stripes);
> >>   stripe_index = find_live_mirror(fs_info, map,
> >> stripe_index,
> >> map->sub_stripes,
> >> stripe_index +
> >>-current->pid %
> >>map->sub_stripes,
> >>+optimal,
> >> dev_replace_is_ongoing);
> >>   mirror_num = stripe_index - old_stripe_index
> >> + 1; }
> >>--
> >>2.15.1  
> >
> > Also here calculation should be with map->sub_stripes too.
> 
> Why you think we need such check?
> I.e. guess_optimal always called for find_live_mirror()
> Both in same context, like that:
> 
> if (map->type & BTRFS_BLOCK_GROUP_RAID10) {
>   u32 factor = map->num_stripes / map->sub_stripes;
> 
>   stripe_nr = div_u64_rem(stripe_nr, factor, &stripe_index);
>   stripe_index *= map->sub_stripes;
> 
>   if (need_full_stripe(op))
> num_stripes = map->sub_stripes;
>   else if (mirror_num)
> stripe_index += mirror_num - 1;
>   else {
> int old_stripe_index = stripe_index;
> stripe_index = find_live_mirror(fs_info, map,
>   stripe_index,
>   map->sub_stripes, stripe_index +
>   current->pid % map->sub_stripes,
>   dev_replace_is_ongoing);
> mirror_num = stripe_index - old_stripe_index + 1;
> }
> 
> That useless to check that internally

My bad, so we only need to call
guess_optimal(map, current->pid % map->sub_stripes)
in the RAID10 branch.


Re: [PATCH v2] Btrfs: enchanse raid1/10 balance heuristic

2017-12-29 Thread Dmitrii Tcvetkov
On Fri, 29 Dec 2017 21:44:19 +0300
Dmitrii Tcvetkov <demfl...@demfloro.ru> wrote:
> > +/**
> > + * guess_optimal - return guessed optimal mirror
> > + *
> > + * Optimal expected to be pid % num_stripes
> > + *
> > + * That's generaly ok for spread load
> > + * Add some balancer based on queue leght to device
> > + *
> > + * Basic ideas:
> > + *  - Sequential read generate low amount of request
> > + *so if load of drives are equal, use pid % num_stripes
> > balancing
> > + *  - For mixed rotate/non-rotate mirrors, pick non-rotate as
> > optimal
> > + *and repick if other dev have "significant" less queue lenght
> > + *  - Repick optimal if queue leght of other mirror are less
> > + */
> > +static int guess_optimal(struct map_lookup *map, int optimal)
> > +{
> > +   int i;
> > +   int round_down = 8;
> > +   int num = map->num_stripes;  
> 
> num has to be initialized from map->sub_stripes if we're reading
> RAID10, otherwise there will be NULL pointer dereference
> 

The check could be something like:
if (map->type & BTRFS_BLOCK_GROUP_RAID10)
        num = map->sub_stripes;

>@@ -5804,10 +5914,12 @@ static int __btrfs_map_block(struct
>btrfs_fs_info *fs_info,
>   stripe_index += mirror_num - 1;
>   else {
>   int old_stripe_index = stripe_index;
>+  optimal = guess_optimal(map,
>+  current->pid %
>map->num_stripes);
>   stripe_index = find_live_mirror(fs_info, map,
> stripe_index,
> map->sub_stripes,
> stripe_index +
>-current->pid %
>map->sub_stripes,
>+optimal,
> dev_replace_is_ongoing);
>   mirror_num = stripe_index - old_stripe_index
> + 1; }
>-- 
>2.15.1

Also, the calculation here should use map->sub_stripes too.


Re: [PATCH v2] Btrfs: enchanse raid1/10 balance heuristic

2017-12-29 Thread Dmitrii Tcvetkov
On Fri, 29 Dec 2017 05:09:14 +0300
Timofey Titovets  wrote:

> Currently btrfs raid1/10 balancer balance requests to mirrors,
> based on pid % num of mirrors.
> 
> Make logic understood:
>  - if one of underline devices are non rotational
>  - Queue leght to underline devices
> 
> By default try use pid % num_mirrors guessing, but:
>  - If one of mirrors are non rotational, repick optimal to it
>  - If underline mirror have less queue leght then optimal,
>repick to that mirror
> 
> For avoid round-robin request balancing,
> lets round down queue leght:
>  - By 8 for rotational devs
>  - By 2 for all non rotational devs
> 
> Changes:
>   v1 -> v2:
> - Use helper part_in_flight() from genhd.c
>   to get queue lenght
> - Move guess code to guess_optimal()
> - Change balancer logic, try use pid % mirror by default
>   Make balancing on spinning rust if one of underline devices
>   are overloaded
> 
> Signed-off-by: Timofey Titovets 
> ---
>  block/genhd.c  |   1 +
>  fs/btrfs/volumes.c | 116
> - 2 files
> changed, 115 insertions(+), 2 deletions(-)
> 
> diff --git a/block/genhd.c b/block/genhd.c
> index 96a66f671720..a77426a7 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct hd_struct *part,
> 			atomic_read(&part->in_flight[1]);
>   }
>  }
> +EXPORT_SYMBOL_GPL(part_in_flight);
>  
>  struct hd_struct *__disk_get_part(struct gendisk *disk, int partno)
>  {
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 9a04245003ab..1c84534df9a5 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -27,6 +27,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include "ctree.h"
>  #include "extent_map.h"
> @@ -5216,6 +5217,112 @@ int btrfs_is_parity_mirror(struct
> btrfs_fs_info *fs_info, u64 logical, u64 len) return ret;
>  }
>  
> +/**
> + * bdev_get_queue_len - return rounded down in flight queue lenght
> of bdev
> + *
> + * @bdev: target bdev
> + * @round_down: round factor big for hdd and small for ssd, like 8
> and 2
> + */
> +static int bdev_get_queue_len(struct block_device *bdev, int
> round_down) +{
> + int sum;
> + struct hd_struct *bd_part = bdev->bd_part;
> + struct request_queue *rq = bdev_get_queue(bdev);
> + uint32_t inflight[2] = {0, 0};
> +
> + part_in_flight(rq, bd_part, inflight);
> +
> + sum = max_t(uint32_t, inflight[0], inflight[1]);
> +
> + /*
> +  * Try prevent switch for every sneeze
> +  * By roundup output num by some value
> +  */
> + return ALIGN_DOWN(sum, round_down);
> +}
> +
> +/**
> + * guess_optimal - return guessed optimal mirror
> + *
> + * Optimal expected to be pid % num_stripes
> + *
> + * That's generaly ok for spread load
> + * Add some balancer based on queue leght to device
> + *
> + * Basic ideas:
> + *  - Sequential read generate low amount of request
> + *so if load of drives are equal, use pid % num_stripes balancing
> + *  - For mixed rotate/non-rotate mirrors, pick non-rotate as optimal
> + *and repick if other dev have "significant" less queue lenght
> + *  - Repick optimal if queue leght of other mirror are less
> + */
> +static int guess_optimal(struct map_lookup *map, int optimal)
> +{
> + int i;
> + int round_down = 8;
> + int num = map->num_stripes;

num has to be initialized from map->sub_stripes if we're reading RAID10,
otherwise there will be a NULL pointer dereference

> + int qlen[num];
> + bool is_nonrot[num];
> + bool all_bdev_nonrot = true;
> + bool all_bdev_rotate = true;
> + struct block_device *bdev;
> +
> + if (num == 1)
> + return optimal;
> +
> + /* Check accessible bdevs */
> + for (i = 0; i < num; i++) {
> + /* Init for missing bdevs */
> + is_nonrot[i] = false;
> + qlen[i] = INT_MAX;
> + bdev = map->stripes[i].dev->bdev;
> + if (bdev) {
> + qlen[i] = 0;
> + is_nonrot[i] =
> blk_queue_nonrot(bdev_get_queue(bdev));
> + if (is_nonrot[i])
> + all_bdev_rotate = false;
> + else
> + all_bdev_nonrot = false;
> + }
> + }
> +
> + /*
> +  * Don't bother with computation
> +  * if only one of two bdevs are accessible
> +  */
> + if (num == 2 && qlen[0] != qlen[1]) {
> + if (qlen[0] < qlen[1])
> + return 0;
> + else
> + return 1;
> + }
> +
> + if (all_bdev_nonrot)
> + round_down = 2;
> +
> + for (i = 0; i < num; i++) {
> + if (qlen[i])
> + continue;
> + bdev = map->stripes[i].dev->bdev;
> + qlen[i] = bdev_get_queue_len(bdev, round_down);
> + }
> +

Re: [PATCH] Btrfs: enchanse raid1/10 balance heuristic for non rotating devices

2017-12-28 Thread Dmitrii Tcvetkov
On Thu, 28 Dec 2017 01:39:31 +0300
Timofey Titovets  wrote:

> Currently btrfs raid1/10 balancer blance requests to mirrors,
> based on pid % num of mirrors.
> 
> Update logic and make it understood if underline device are non rotational.
> 
> If one of mirrors are non rotational, then all read requests will be moved to
> non rotational device.
> 
> If both of mirrors are non rotational, calculate sum of
> pending and in flight request for queue on that bdev and use
> device with least queue leght.
> 
> P.S.
> Inspired by md-raid1 read balancing
> 
> Signed-off-by: Timofey Titovets 
> ---
>  fs/btrfs/volumes.c | 59
> ++ 1 file changed, 59
> insertions(+)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 9a04245003ab..98bc2433a920 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -5216,13 +5216,30 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info
> *fs_info, u64 logical, u64 len) return ret;
>  }
>  
> +static inline int bdev_get_queue_len(struct block_device *bdev)
> +{
> + int sum = 0;
> + struct request_queue *rq = bdev_get_queue(bdev);
> +
> + sum += rq->nr_rqs[BLK_RW_SYNC] + rq->nr_rqs[BLK_RW_ASYNC];
> + sum += rq->in_flight[BLK_RW_SYNC] + rq->in_flight[BLK_RW_ASYNC];
> +

This won't work as expected if the bdev is controlled by blk-mq: these
counters will be zero. AFAIK to get this info in a block-layer-agnostic
way part_in_flight()[1] has to be used; it extracts these counters
appropriately.

But it needs to be EXPORT_SYMBOL()'ed in block/genhd.c so we can continue
to build btrfs as a module.

> + /*
> +  * Try prevent switch for every sneeze
> +  * By roundup output num by 2
> +  */
> + return ALIGN(sum, 2);
> +}
> +
>  static int find_live_mirror(struct btrfs_fs_info *fs_info,
>   struct map_lookup *map, int first, int num,
>   int optimal, int dev_replace_is_ongoing)
>  {
>   int i;
>   int tolerance;
> + struct block_device *bdev;
>   struct btrfs_device *srcdev;
> + bool all_bdev_nonrot = true;
>  
>   if (dev_replace_is_ongoing &&
>   fs_info->dev_replace.cont_reading_from_srcdev_mode ==
> @@ -5231,6 +5248,48 @@ static int find_live_mirror(struct btrfs_fs_info
> *fs_info, else
>   srcdev = NULL;
>  
> + /*
> +  * Optimal expected to be pid % num
> +  * That's generaly ok for spinning rust drives
> +  * But if one of mirror are non rotating,
> +  * that bdev can show better performance
> +  *
> +  * if one of disks are non rotating:
> +  *  - set optimal to non rotating device
> +  * if both disk are non rotating
> +  *  - set optimal to bdev with least queue
> +  * If both disks are spinning rust:
> +  *  - leave old pid % nu,
> +  */
> + for (i = 0; i < num; i++) {
> + bdev = map->stripes[i].dev->bdev;
> + if (!bdev)
> + continue;
> + if (blk_queue_nonrot(bdev_get_queue(bdev)))
> + optimal = i;
> + else
> + all_bdev_nonrot = false;
> + }
> +
> + if (all_bdev_nonrot) {
> + int qlen;
> + /* Forse following logic choise by init with some big number
> */
> + int optimal_dev_rq_count = 1 << 24;

Probably better to use the INT_MAX macro instead.

[1] https://elixir.free-electrons.com/linux/v4.15-rc5/source/block/genhd.c#L68



Re: WARN_ON in __writeback_inodes_sb_nr when btrfs mounted with flushoncommit

2017-12-14 Thread Dmitrii Tcvetkov
On Thu, 14 Dec 2017 15:21:52 +0200
Nikolay Borisov <nbori...@suse.com> wrote:

> On 14.12.2017 13:02, Dmitrii Tcvetkov wrote:
> > Since 4.15-rc1 if btrfs filesystem is mounted with flushoncommit mount
> > option then during fsync this trace appears in dmesg:
> > 
> > [   17.323092] WARNING: CPU: 0 PID: 364 at fs/fs-writeback.c:2339
> > __writeback_inodes_sb_nr+0xbf/0xd0 [   17.323925] Modules linked in:
> > [   17.324697] CPU: 0 PID: 364 Comm: systemd-journal Not tainted 4.15.0-rc3
> > #2 [   17.325424] Hardware name: To be filled by O.E.M. To be filled by
> > O.E.M./SABERTOOTH 990FX R2.0, BIOS 2901 05/04/2016 [   17.326177] RIP:
> > 0010:__writeback_inodes_sb_nr+0xbf/0xd0 [   17.326875] RSP:
> > 0018:8bcd40a77d08 EFLAGS: 00010246 [   17.327598] RAX: 
> > RBX: 8a3fa9764488 RCX:  [   17.328321] RDX:
> > 0002 RSI: 18ae RDI: 8a3fa96c7070 [   17.329012]
> > RBP: 8bcd40a77d0c R08: ff80 R09: 00ff
> > [   17.329740] R10: 8bcd40a77c10 R11: 1000 R12:
> >  [   17.330439] R13: 8a3fa915e698 R14: 8a3fb04ed780
> > R15: 8a3fa9a16610 [   17.331169] FS:  7f72d53338c0()
> > GS:8a3fbec0() knlGS: [   17.331880] CS:  0010
> > DS:  ES:  CR0: 80050033 [   17.332624] CR2:
> > 7f72d09a5000 CR3: 000329334000 CR4: 000406f0 [   17.83]
> > Call Trace: [   17.334113]  btrfs_commit_transaction+0x857/0x920
> > [   17.334874]  btrfs_sync_file+0x30c/0x3e0 [   17.335622]
> > do_fsync+0x33/0x60 [   17.336332]  SyS_fsync+0x7/0x10
> > [   17.337069]  do_syscall_64+0x63/0x360
> > [   17.337776]  entry_SYSCALL64_slow_path+0x25/0x25
> > [   17.338513] RIP: 0033:0x7f72d4f29094
> > [   17.339244] RSP: 002b:7ffd71b078f8 EFLAGS: 0246 ORIG_RAX:
> > 004a [   17.339962] RAX: ffda RBX: 
> > RCX: 7f72d4f29094 [   17.340718] RDX: 0009 RSI:
> > 5630b6f8b090 RDI: 0010 [   17.341431] RBP: 5630b6f8b090
> > R08: 000f R09:  [   17.342169] R10:
> >  R11: 0246 R12: 0010 [   17.342902]
> > R13: 5630b6f88f60 R14: 0001 R15: 0001
> > [   17.343604] Code: df 0f b6 d1 e8 a3 fc ff ff 48 89 ee 48 89 df e8 78 f5
> > ff ff 48 8b 44 24 48 65 48 33 04 25 28 00 00 00 75 0b 48 83 c4 50 5b 5d c3
> > <0f> ff eb ca e8 38 1e ec ff 0f 1f 84 00 00 00 00 00 41 54 55 48
> > [   17.344408] ---[ end trace ff4cf41ec70ec0a7 ]---  
> 
> So this is due to writeback_inodes_sb being called without holding
> s_umount. So 4.15-rc1 the first kernel that started exhibiting this or
> did you also see it with earlier kernel
>

I didn't test kernels during the merge window; the behaviour has been
present since 4.15-rc1 and is still there in current mainline. I can't
reproduce it on 4.14 and earlier.




WARN_ON in __writeback_inodes_sb_nr when btrfs mounted with flushoncommit

2017-12-14 Thread Dmitrii Tcvetkov
Since 4.15-rc1 if btrfs filesystem is mounted with flushoncommit mount option
then during fsync this trace appears in dmesg:

[   17.323092] WARNING: CPU: 0 PID: 364 at fs/fs-writeback.c:2339 
__writeback_inodes_sb_nr+0xbf/0xd0
[   17.323925] Modules linked in:
[   17.324697] CPU: 0 PID: 364 Comm: systemd-journal Not tainted 4.15.0-rc3 #2
[   17.325424] Hardware name: To be filled by O.E.M. To be filled by 
O.E.M./SABERTOOTH 990FX R2.0, BIOS 2901 05/04/2016
[   17.326177] RIP: 0010:__writeback_inodes_sb_nr+0xbf/0xd0
[   17.326875] RSP: 0018:8bcd40a77d08 EFLAGS: 00010246
[   17.327598] RAX:  RBX: 8a3fa9764488 RCX: 
[   17.328321] RDX: 0002 RSI: 18ae RDI: 8a3fa96c7070
[   17.329012] RBP: 8bcd40a77d0c R08: ff80 R09: 00ff
[   17.329740] R10: 8bcd40a77c10 R11: 1000 R12: 
[   17.330439] R13: 8a3fa915e698 R14: 8a3fb04ed780 R15: 8a3fa9a16610
[   17.331169] FS:  7f72d53338c0() GS:8a3fbec0() 
knlGS:
[   17.331880] CS:  0010 DS:  ES:  CR0: 80050033
[   17.332624] CR2: 7f72d09a5000 CR3: 000329334000 CR4: 000406f0
[   17.83] Call Trace:
[   17.334113]  btrfs_commit_transaction+0x857/0x920
[   17.334874]  btrfs_sync_file+0x30c/0x3e0
[   17.335622]  do_fsync+0x33/0x60
[   17.336332]  SyS_fsync+0x7/0x10
[   17.337069]  do_syscall_64+0x63/0x360
[   17.337776]  entry_SYSCALL64_slow_path+0x25/0x25
[   17.338513] RIP: 0033:0x7f72d4f29094
[   17.339244] RSP: 002b:7ffd71b078f8 EFLAGS: 0246 ORIG_RAX: 
004a
[   17.339962] RAX: ffda RBX:  RCX: 7f72d4f29094
[   17.340718] RDX: 0009 RSI: 5630b6f8b090 RDI: 0010
[   17.341431] RBP: 5630b6f8b090 R08: 000f R09: 
[   17.342169] R10:  R11: 0246 R12: 0010
[   17.342902] R13: 5630b6f88f60 R14: 0001 R15: 0001
[   17.343604] Code: df 0f b6 d1 e8 a3 fc ff ff 48 89 ee 48 89 df e8 78 f5 ff
ff 48 8b 44 24 48 65 48 33 04 25 28 00 00 00 75 0b 48 83 c4 50 5b 5d c3 <0f> ff
eb ca e8 38 1e ec ff 0f 1f 84 00 00 00 00 00 41 54 55 48
[   17.344408] ---[ end trace ff4cf41ec70ec0a7 ]---

If the fs is mounted without flushoncommit there are no warnings. Other
mount options don't influence the behaviour.

Steps to reproduce:
mkfs.btrfs <device>
mount -o flushoncommit <device> <mountpoint>
echo test > <mountpoint>/test
btrfs filesystem sync <mountpoint>


Re: FAQ / encryption / error handling?

2017-11-27 Thread Dmitrii Tcvetkov
On Mon, 27 Nov 2017 09:06:12 +0100
Daniel Pocock  wrote:

> Hi all,
> 
> The FAQ has a couple of sections on encryption (general and dm-crypt)
> 
> One thing that isn't explained there: if you create multiple encrypted
> volumes (e.g. using dm-crypt) and use Btrfs to combine them into
> RAID1, how does error recovery work when a read operation returns
> corrupted data?
> 
> Without encryption, reading from one disk would give a checksum
> mismatch and Btrfs would read from the other disk to (hopefully) get
> a good copy of the data.
> 
> With this encryption scenario, the failure would potentially be
> detected in the decryption layer code and instead of returning bad
> data to Btrfs, it would return some error code. In that case, will
> Btrfs attempt to read from the other volume and allow the application
> to proceed as if nothing was wrong?
> 
> Regards,
> 
> Daniel

A default (aes-xts-plain64) dm-crypt setup can't verify the integrity of
an encrypted block; in case of silent corruption it will decrypt it to
garbage, which btrfs will catch through its checksums. With AEAD
encryption (dm-crypt plus dm-integrity) the layer can verify integrity
itself, but I'm not sure right now which exact error it returns to the
upper layer as I haven't used it yet.

I use btrfs raid1 on top of LVM on top of dm-crypt devices and it has
handled bad blocks on the physical devices fine (there was a burst of
about 900 reallocated sectors on one device, which btrfs caught and
fixed).
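
As a rough sketch of such a stack (hypothetical device names; two LUKS
devices opened and combined into btrfs raid1, without the LVM layer):

  cryptsetup open /dev/sda2 crypt1
  cryptsetup open /dev/sdb2 crypt2
  mkfs.btrfs -d raid1 -m raid1 /dev/mapper/crypt1 /dev/mapper/crypt2
  mount /dev/mapper/crypt1 /mnt

Silent corruption below either crypt device then decrypts to garbage;
btrfs notices the checksum mismatch and reads the other mirror instead.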


Re: Read before you deploy btrfs + zstd

2017-11-16 Thread Dmitrii Tcvetkov
On Wed, 15 Nov 2017 20:23:44 + (UTC)
Duncan <1i5t5.dun...@cox.net> wrote:
> Tho from my understanding and last I read, btrfs restore (I believe
> it was) hadn't been updated to handle zstd yet, tho btrfs check and
> btrfs filesystem defrag had been, and of course btrfs balance if the
> kernel handles it since all balance does is call the kernel to do it.
> 
> So just confirming, does btrfs restore handle zstd from -progs 4.13?  
> 
> Because once a filesystem goes unmountable, restore is what I've had
> best luck with, so if it doesn't understand zstd, no zstd for me.  
> (Regardless, being the slightly cautious type I'll very likely wait a 
> couple kernel cycles before switching from lzo to zstd here, just to
> see if any weird reports about it from the first-out testers hit the
> list in the mean time.)
> 
> Meanwhile, it shouldn't need said but just in case, if you're using
> it, be sure you have backups /not/ using zstd, for at least a couple
> kernel cycles. =:^)
> 

Btrfs-progs 4.13 can optionally be built with libzstd; btrfs restore from
such a build can restore from zstd-compressed filesystems (I tested that
just now with a temporary fs mounted with the compress-force=zstd mount
option).

Btrfs-progs 4.14 will require libzstd by default at build time.
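
For example (a sketch; /dev/sdX stands for the unmountable filesystem and
/mnt/recovery for an empty directory on another filesystem):

  btrfs restore -v /dev/sdX /mnt/recovery

If the progs were built with libzstd, compressed extents (including zstd)
are decompressed transparently while the files are written out.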


Re: Should cp default to reflink?

2017-11-07 Thread Dmitrii Tcvetkov
On Mon, 6 Nov 2017 15:37:21 -0700
Chris Murphy wrote:
> Seems to me any request to duplicate should be optimized by default
> with an auto reflink when possible, and require an explicit option to
> inhibit.

"cp --reflink=auto" by default might create unexpected behaviour of
slower balance on a filesystem with a lot of reflinks. Especially if
the filesystem already has many shapshots.


Re: yet another "out of space" on a filesystem with >100 GB free space, and strange files which exist but don't exist

2017-10-04 Thread Dmitrii Tcvetkov
> "Ghost file" is still there:
> 
> # ls -l 
> /var/lib/lxd/containers/mongo-repl04b/rootfs/var/lib/mongodb|grep set
> ls: cannot access 
> '/var/lib/lxd/containers/mongo-repl04b/rootfs/var/lib/mongodb/WiredTiger.turtle.set':
>  
> No such file or directory
> -? ? ?  ??? 
> WiredTiger.turtle.set

I had a similar issue a couple of times, both after an unclean shutdown
(power loss); only btrfs check --repair helped with it, but I'd suggest
waiting for someone else's input on that as I'm not a developer.
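
What I ran was roughly (a sketch; the filesystem must be unmounted and
/dev/sdX is a placeholder for its device):

  btrfs check /dev/sdX            # read-only check first
  btrfs check --repair /dev/sdX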


Re: Help me understand what is going on with my RAID1 FS

2017-09-10 Thread Dmitrii Tcvetkov
> > Drive1  Drive2Drive3
> > X   X
> > X X
> > X X
> > 
> > Where X is a chunk of raid1 block group.  
> 
> But this table clearly shows that adding third drive increases free
> space by 50%. You need to reallocate data to actually make use of it,
> but it was done in this case.

It increases it, but I don't see how this space is useful in any way
unless the data is in the single profile. After a full balance the chunks
will be spread over 3 devices, but how does that help in the raid1 data
profile case?


Re: Help me understand what is going on with my RAID1 FS

2017-09-10 Thread Dmitrii Tcvetkov
> @Kai and Dmitrii
> thank you for your explanations if I understand you correctly, you're
> saying that btrfs makes no attempt to "optimally" use the physical
> devices it has in the FS, once a new RAID1 block group needs to be
> allocated it will semi-randomly pick two devices with enough space and
> allocate two equal sized chunks, one on each. This new chunk may or
> may not fall onto my newly added 8 TB drive. Am I understanding this
> correctly?
If I remember correctly, the chunk allocator allocates new chunks on the
device which has the most unallocated space.

> Is there some sort of balance filter that would speed up this sort of
> balancing? Will balance be smart enough to make the "right" decision?
> As far as I read the chunk allocator used during balance is the same
> that is used during normal operation. If the allocator is already
> sub-optimal during normal operations, what's the guarantee that it
> will make a "better" decision during balancing?

I don't really see how that is possible with the raid1 profile. How can
you fill all three devices if data can only be placed as two copies?
There will be a moment when two of the three disks are full and BTRFS
can't allocate a new raid1 block group because only one drive has
unallocated space left.

> 
> When I say "right" and "better" I mean this:
> Drive1(8) Drive2(3) Drive3(3)
> X1X1
> X2X2
> X3X3
> X4X4
> I was convinced until now that the chunk allocator at least tries a
> best possible allocation. I'm sure it's complicated to develop a
> generic algorithm to fit all setups, but it should be possible.
 

The problem is that each raid1 block group consists of two chunks on two
separate devices, so it can't fully utilize three devices no matter what.
If that doesn't suit you then you need to add a 4th disk. After that the
FS will be able to use all unallocated space on all disks in the raid1
profile. But even then you'll only be able to safely lose one disk, since
BTRFS will still be storing only 2 copies of the data.

This behaviour is not relevant for the single or raid0 profiles of
multi-device BTRFS filesystems.


Re: Help me understand what is going on with my RAID1 FS

2017-09-10 Thread Dmitrii Tcvetkov
>Actually based on http://carfax.org.uk/btrfs-usage/index.html I
>would've expected 6 TB of usable space. Here I get 6.4 which is odd,
>but that only 1.5 TB is available is even stranger.
>
>Could anyone explain what I did wrong or why my expectations are wrong?
>
>Thank you in advance

I'd say df and the website calculate different things. In btrfs the raid1
profile stores exactly 2 copies of the data, each copy on a separate
device. So by adding a third drive, no matter how big, the effective free
space didn't expand, because btrfs still needs space on one of the other
two drives to store the second copy of each raid1 chunk placed on that
third drive.

Basically:

Drive1  Drive2  Drive3
X       X
X               X
        X       X

Where each X is a chunk of a raid1 block group.
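
As a rough calculation (assuming the 8 TB + 3 TB + 3 TB drives mentioned
earlier in this thread): with 2 copies, the usable raid1 space is limited
both by half of the raw capacity and by how much the smaller drives can
mirror of the largest one, roughly

  usable = min(total_raw / 2, total_raw - largest_device)
         = min(14 / 2, 14 - 8) = 6 TB

which matches the 6 TB from the calculator rather than what df reports.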


Re: BTRFS RAID 1 not mountable: open_ctree failed, super_num_devices 3 mismatch with num_devices 2 found here

2017-08-24 Thread Dmitrii Tcvetkov
>  I rebootet with HWE K4.11
> 
> and took a pic of the error message (see attachment).
> 
> It seems btrfs still sees the removed NVME. 
> There is a mismatch from super_num_devices (3) to num_devices (2)
> with indicates something strage is going on here, imho. 
> 
> Then i returned and booted K4.4, which boots fine.
> 
> root@vHost1:~# btrfs dev stat /
> [/dev/nvme0n1p1].write_io_errs   0
> [/dev/nvme0n1p1].read_io_errs0
> [/dev/nvme0n1p1].flush_io_errs   0
> [/dev/nvme0n1p1].corruption_errs 0
> [/dev/nvme0n1p1].generation_errs 0
> [/dev/sda1].write_io_errs   0
> [/dev/sda1].read_io_errs0
> [/dev/sda1].flush_io_errs   0
> [/dev/sda1].corruption_errs 0
> [/dev/sda1].generation_errs 0
> 
> Btw i edited the subject to match the correct error.
> 
> 
> Sash

That's very odd: if super_num_devices in the superblocks doesn't match
the real number of devices, then the 4.4 kernel shouldn't mount the
filesystem either.

We probably need help from one of the btrfs developers since I'm not one,
I'm just a btrfs user.
Can you provide the outputs of:
btrfs inspect-internal dump-super -f /dev/sda1
btrfs inspect-internal dump-super -f /dev/nvme0n1p1

Depending on the version of btrfs-progs you may need to use
btrfs-dump-super instead of "btrfs inspect-internal dump-super".

>3rd i saw https://patchwork.kernel.org/patch/9419189/ from Roman. Did
>he receive any comments on his patch? This one could help on this
>problem, too. 

I don't know about this patch from Roman per se, but there is a
patchset[1] which is aimed at the 4.14 merge window AFAIK.

[1] https://www.spinics.net/lists/linux-btrfs/msg66891.html


Re: user snapshots

2017-08-23 Thread Dmitrii Tcvetkov
> >Also in https://btrfs.wiki.kernel.org/index.php/Mount_options
> >"user_subvol_rm_allowed (...) Use with caution."
> >
> >Why? What is the problem?  
> 
> Because with the mount option any user can delete any subvolume,
> including root one (subvol_id=5)

Apologies, it works somewhat differently: the filesystem doesn't allow
deleting the subvolume with id 5, and POSIX access permissions are
checked before a subvolume is deleted under the user_subvol_rm_allowed
mount option.

From btrfs-progs cmds-subvolume.c:

	res = ioctl(fd, BTRFS_IOC_SNAP_DESTROY, &args);
	if (res < 0) {
		error("cannot delete '%s/%s': %s", dname, vname,
		      strerror(errno));
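
A quick illustration (a sketch; hypothetical user "alice" with write
access to a directory on a filesystem mounted with the option):

  mount -o user_subvol_rm_allowed /dev/sdX /mnt
  mkdir /mnt/alice && chown alice /mnt/alice
  sudo -u alice btrfs subvolume create /mnt/alice/subvol
  sudo -u alice btrfs subvolume delete /mnt/alice/subvol   # allowed once POSIX checks pass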


Re: user snapshots

2017-08-23 Thread Dmitrii Tcvetkov


>Also in https://btrfs.wiki.kernel.org/index.php/Mount_options
>"user_subvol_rm_allowed (...) Use with caution."
>
>Why? What is the problem?

Because with that mount option any user can delete any subvolume,
including the root one (subvol_id=5).


Re: degraded BTRFS RAID 1 not mountable: open_ctree failed, unable to find block group for 0

2017-08-22 Thread Dmitrii Tcvetkov
On Tue, 22 Aug 2017 11:31:23 +0200
g6094...@freenet.de wrote:
> So 1st should be investigating why did the disk not get removed
> correctly? Btrfs dev del should remove the device corretly, right? Is
> there a bug?

It should and probably did. To check that we need to see the output of
btrfs filesystem show <mountpoint>
and the output of
btrfs filesystem usage <mountpoint>

If there are non-raid1 chunks then you need to do a soft balance:
btrfs balance start -mconvert=raid1,soft -dconvert=raid1,soft <mountpoint>

The balance should finish very quickly, as you probably have only one
single data chunk and one single metadata chunk. They appeared during
writes while the filesystem was mounted read-write in degraded mode.
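
The whole sequence is roughly (a sketch, assuming the filesystem is
mounted at /mnt):

  btrfs filesystem usage /mnt   # look for Data,single / Metadata,single lines
  btrfs balance start -mconvert=raid1,soft -dconvert=raid1,soft /mnt
  btrfs filesystem usage /mnt   # the single chunks should be gone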


Re: [PATCH v4 0/6] Chunk level degradable check

2017-07-10 Thread Dmitrii Tcvetkov
On Wed, 28 Jun 2017 13:43:29 +0800
Qu Wenruo  wrote:

> The patchset can be fetched from my github repo:
> https://github.com/adam900710/linux/tree/degradable
> 
> The patchset is based on David's for-4.13-part1 branch.
> 
> Btrfs currently uses num_tolerated_disk_barrier_failures to do global
> check for tolerated missing device.
> 
> Although the one-size-fit-all solution is quite safe, it's too strict
> if data and metadata has different duplication level.
> 
> For example, if one use Single data and RAID1 metadata for 2 disks, it
> means any missing device will make the fs unable to be degraded
> mounted.
> 
> But in fact, some times all single chunks may be in the existing
> device and in that case, we should allow it to be rw degraded mounted.
> 
> Such case can be easily reproduced using the following script:
>  # mkfs.btrfs -f -m raid1 -d sing /dev/sdb /dev/sdc
>  # wipefs -f /dev/sdc
>  # mount /dev/sdb -o degraded,rw
> 
> If using btrfs-debug-tree to check /dev/sdb, one should find that the
> data chunk is only in sdb, so in fact it should allow degraded mount.
> 
> This patchset will introduce a new per-chunk degradable check for
> btrfs, allow above case to succeed, and it's quite small anyway.
> 
> And enhance kernel error message for missing device, at least user
> can know what's making mount failed, other than meaningless
> "failed to read system chunk/chunk tree -5".
> 
> v2:
>   Update after almost 2 years.
>   Add the last patch to enhance the kernel output, so user can know
>   it's missing devices that prevents btrfs to be mounted.
> v3:
>   Remove one duplicated missing device output
>   Use the advice from Anand Jain, not to add new members in
> btrfs_device, but use a new structure extra_rw_degrade_errors, to
> record error when sending down/waiting device.
> v3.1:
>   Reduce the critical section in btrfs_check_rw_degradable(), follow
> other caller to only acquire the lock when searching, as extent_map
> has refcount to avoid concurrency already.
>   The modification itself won't affect the behavior, so tested-by
> tags are added to each patch.
> v4:
>   Thanks Anand for this dev flush work, which makes us more easier to
>   detect flush error in previous transaction.
>   Now this patchset won't need to alloc memory, and can just use
>   btrfs_device->last_flush_error to check if last flush finished
>   correctly.
>   New rebase, so old tested by tags are all removed, sorry guys.
> 
> Qu Wenruo (6):
>   btrfs: Introduce a function to check if all chunks a OK for degraded
> rw mount
>   btrfs: Do chunk level rw degrade check at mount time
>   btrfs: Do chunk level degradation check for remount
>   btrfs: Allow barrier_all_devices to do chunk level device check
>   btrfs: Cleanup num_tolerated_disk_barrier_failures
>   btrfs: Enhance missing device kernel message
> 
>  fs/btrfs/ctree.h   |  2 --
>  fs/btrfs/disk-io.c | 81 
>  fs/btrfs/disk-io.h |  2 --
>  fs/btrfs/super.c   |  3 +-
>  fs/btrfs/volumes.c | 99
> +-
> fs/btrfs/volumes.h |  3 ++ 6 files changed, 85 insertions(+), 105
> deletions(-)
> 

Tested on top of current mainline master (commit 
af3c8d98508d37541d4bf57f13a984a7f73a328c). Didn't find any regressions.


Re: [PATCH v3.1 1/7] btrfs: Introduce a function to check if all chunks a OK for degraded rw mount

2017-05-01 Thread Dmitrii Tcvetkov
> >> +bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info)
> >> +{
> >> +struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree;
> >> +struct extent_map *em;
> >> +u64 next_start = 0;
> >> +bool ret = true;
> >> +
> >> +read_lock(&map_tree->map_tree.lock);
> >> +em = lookup_extent_mapping(&map_tree->map_tree, 0, (u64)-1);
> >> +read_unlock(&map_tree->map_tree.lock);
> >> +/* No chunk at all? Return false anyway */
> >> +if (!em) {
> >> +ret = false;
> >> +goto out;
> >> +}
> >> +while (em) {
> >> +struct map_lookup *map;
> >> +int missing = 0;
> >> +int max_tolerated;
> >> +int i;
> >> +
> >> +map = (struct map_lookup *) em->bdev;  
> >
> >
> >any idea why not   map = em->map_lookup;  here?  
> 
> 
> My fault, will update the patch.
> 
> Thanks,
> Qu

Sorry to bother you, but it looks like this patchset got forgotten. It
still applies to 4.11, but I'm afraid it won't after the 4.12 merge
window. Any update on it?


Re: force btrfs to release underlying block device(s)

2017-04-02 Thread Dmitrii Tcvetkov
> Tho another part of that patchset, the per-chunk availability check
> for degraded filesystems that allows writable mount of multi-device 
> filesystems with single chunks, etc, as long as all chunks are
> available, has seen renewed activity recently as the problem it
> addresses, formerly two-device raid1 filesystems going read-only
> after one degraded-writable mount, has become an increasingly
> frequently list-reported problem.  That smaller patchset has I
> believe now been review and is I believe now in btrfs-next, scheduled
> for merge in 4.12.

Unfortunately I couldn't find it in either the btrfs-next or the
linux-next branch. The last version of the patchset, v3.1, was
published[1] on 08.03.2017.

Qu Wenruo was going[2] to update the patch but that hasn't happened yet.

[1] https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg62255.html
[2] https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg62302.html


Re: [PATCH v3 0/7] Chunk level degradable check

2017-03-08 Thread Dmitrii Tcvetkov
On Wed, 8 Mar 2017 10:41:17 +0800
Qu Wenruo  wrote:
> This patchset will introduce a new per-chunk degradable check for
> btrfs, allow above case to succeed, and it's quite small anyway.

> v2:
>   Update after almost 2 years.
>   Add the last patch to enhance the kernel output, so user can know
>   it's missing devices prevent btrfs to mount.
> v3:
>   Remove one duplicated missing device output
>   Use the advice from Anand Jain, not to add new members in
> btrfs_device, but use a new structure extra_rw_degrade_errors, to
> record error when sending down/waiting device.

Tested the raid1/raid10 cases for losing 1 or more devices: the behaviour
of the patchset with regard to allowing degraded mounts is still correct.


Re: [PATCH v2 0/6] Chunk level degradable check

2017-03-06 Thread Dmitrii Tcvetkov
On Mon, 6 Mar 2017 16:58:49 +0800
Qu Wenruo <quwen...@cn.fujitsu.com> wrote:

> Btrfs currently uses num_tolerated_disk_barrier_failures to do global
> check for tolerated missing device.
> 
> Although the one-size-fit-all solution is quite safe, it's too strict
> if data and metadata has different duplication level.
> 
> For example, if one use Single data and RAID1 metadata for 2 disks, it
> means any missing device will make the fs unable to be degraded
> mounted.
> 
> But in fact, some times all single chunks may be in the existing
> device and in that case, we should allow it to be rw degraded mounted.
> 
> Such case can be easily reproduced using the following script:
>  # mkfs.btrfs -f -m raid1 -d sing /dev/sdb /dev/sdc
>  # wipefs -f /dev/sdc
>  # mount /dev/sdb -o degraded,rw
> 
> If using btrfs-debug-tree to check /dev/sdb, one should find that the
> data chunk is only in sdb, so in fact it should allow degraded mount.
> 
> This patchset will introduce a new per-chunk degradable check for
> btrfs, allow above case to succeed, and it's quite small anyway.
> 
> And enhance kernel error message for missing device, at least kernel
> can know what's making mount failed, other than meaningless
> "failed to read system chunk/chunk tree -5".

Hello,

Tested the patchset for raid1 and raid10. It successfully allows a
degraded mount with single chunks on filesystems that are missing one
drive.

Feel free to add Tested-By: Dmitrii Tcvetkov <demfl...@demfloro.ru>


Couldn't delete a directory after power failure

2016-12-06 Thread Dmitrii Tcvetkov
Hello,

I had this problem today after a power failure of the PC. The Bitcoin
wallet couldn't use its database, it said the database was corrupted, so
I decided to delete the blockchain database from the wallet. But I
couldn't delete the "chainstate" directory.

By the time I finished writing this message to the list the problem was
solved by btrfs check and I deleted the corrupted directory, so no need
for support :). Just posting in case this trace and the Sysrq+w output
may be useful for somebody.

07:28:40-user@host ~ $ btrfs version
btrfs-progs v4.8.5
07:29:00-user@host ~ $ uname -a
Linux host 4.9.0-rc8 #2 SMP PREEMPT Tue Dec 6 23:10:04 MSK 2016 x86_64 GNU/Linux
07:29:10 -user@host ~ $ sudo btrfs fi df storage
Data, single: total=494.00GiB, used=392.04GiB
System, DUP: total=32.00MiB, used=96.00KiB
Metadata, DUP: total=1.50GiB, used=532.12MiB
GlobalReserve, single: total=506.92MiB, used=0.00B
07:29:12-user@host ~ $ sudo btrfs fi sh storage
Label: 'storage'  uuid: f387eb37-f009-4723-9fda-2cc8f94c8b8d
Total devices 1 FS bytes used 392.55GiB
devid1 size 996.26GiB used 497.06GiB path /dev/mapper/container
07:29:12-user@host ~/storage/.bitcoin $ ls -l
drwx-- 1 user user   22250 Dec  7 07:20 chainstate
-rw--- 1 user user   0 Oct 20 23:11 db.log
-rw-r--r-- 1 user user 7513379 Dec  7 07:23 debug.log
-rw--- 1 user user   28534 Dec  6 20:02 fee_estimates.dat
-rw--- 1 user user 4372424 Dec  7 01:56 peers.dat
-rw--- 1 user user  139264 Dec  7 01:57 wallet.dat
07:29:13-user@host ~/storage/.bitcoin $ rm -rf chainstate/
Segmentation fault
07:29:19-user@host ~/storage/.bitcoin $ ls -l
total 4436
drwx-- 1 user user1892 Dec  7 07:29 chainstate
-rw--- 1 user user   0 Oct 20 23:11 db.log
-rw--- 1 user user   28534 Dec  6 20:02 fee_estimates.dat
-rw--- 1 user user 4372424 Dec  7 01:56 peers.dat
-rw--- 1 user user  139264 Dec  7 01:57 wallet.dat
07:29:24-user@host ~/storage/.bitcoin $ rm -rf chainstate/

After that rm hung, and a subsequent ls of the chainstate directory also hung.

dmesg with Sysrq+W included:

[  190.429798] BTRFS: device label storage devid 1 transid 486419 /dev/dm-6
[  190.459791] BTRFS info (device dm-6): enabling auto defrag
[  190.459796] BTRFS info (device dm-6): force lzo compression
[  190.459797] BTRFS info (device dm-6): using free space tree
[  190.459799] BTRFS info (device dm-6): has skinny extents
[  197.896560] BTRFS info (device dm-6): checking UUID tree
[  715.237873] BTRFS error (device dm-6): err add delayed dir index item(index: 
667) into the deletion tree of the delayed node(root id: 3106, inode id: 1613, 
errno: -17)
[  715.237885] [ cut here ]
[  715.239455] kernel BUG at fs/btrfs/delayed-inode.c:1555!
[  715.241014] invalid opcode:  [#1] PREEMPT SMP
[  715.242575] Modules linked in: radeon ttm
[  715.244143] CPU: 6 PID: 2257 Comm: rm Not tainted 4.9.0-rc8 #2
[  715.245750] Hardware name: To be filled by O.E.M. To be filled by 
O.E.M./SABERTOOTH 990FX R2.0, BIOS 2501 04/08/2014
[  715.247352] task: 9134f45c3200 task.stack: 9ebb0392c000
[  715.248931] RIP: 0010:[]  [] 
btrfs_delete_delayed_dir_index+0x219/0x220
[  715.250508] RSP: 0018:9ebb0392fd68  EFLAGS: 00010286
[  715.252114] RAX:  RBX: 91355e687b00 RCX: 
[  715.253706] RDX:  RSI: 91357ed8c7a8 RDI: 91357ed8c7a8
[  715.255301] RBP: 9134cecc8130 R08: 0003a131 R09: 0005
[  715.256904] R10: 0040 R11: b9f6a12d R12: 9134cecc8178
[  715.258544] R13: 029b R14: 913570bbe000 R15: 91356ae3f500
[  715.260149] FS:  7f102b9c6480() GS:91357ed8() 
knlGS:
[  715.261762] CS:  0010 DS:  ES:  CR0: 80050033
[  715.263358] CR2: 006ceff4 CR3: 000290ddd000 CR4: 000406e0
[  715.264892] Stack:
[  715.266422]  0004 4dff913555d705f0 6006 
029b
[  715.267981]  f32e21fa 913546f60a50 9ebb0392fe40 
913546ea44f0
[  715.269546]  00040ffd 064d 9134d45dc5b0 
b928143c
[  715.271134] Call Trace:
[  715.272683]  [] ? __btrfs_unlink_inode+0x1ac/0x4b0
[  715.274246]  [] ? btrfs_unlink_inode+0x12/0x40
[  715.275797]  [] ? btrfs_unlink+0x61/0xb0
[  715.277371]  [] ? vfs_unlink+0xb9/0x180
[  715.278903]  [] ? do_unlinkat+0x28d/0x310
[  715.280426]  [] ? entry_SYSCALL_64_fastpath+0x13/0x94
[  715.281950] Code: ff 0f 0b 48 8b 55 10 49 8b be f0 01 00 00 41 89 c1 4c 8b 
45 00 48 c7 c6 10 b8 95 b9 48 8b 8a 48 03 00 00 4c 89 ea e8 77 55 f7 ff <0f> 0b 
e8 d0 1e de ff 53 48 89 fb e8 c7 d8 ff ff 48 85 c0 74 32 
[  715.283627] RIP  [] 
btrfs_delete_delayed_dir_index+0x219/0x220
[  715.285213]  RSP 
[  715.293587] ---[ end trace fbbdb097ac89a28e ]---
[  808.445004] sysrq: SysRq : Show Blocked State
[  808.445008]   taskPC stack   pid father
[  808.445085] btrfs-transacti D0   936  2 

Re: How to cancel btrfs balance on unmounted filesystem

2016-03-31 Thread Dmitrii Tcvetkov
Hello.
There is no tool to cancel a balance on an unmounted filesystem. But you
can use the skip_balance mount option for this.
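
A sketch (device and mount point are placeholders):

  mount -o skip_balance /dev/sdX /mnt
  # the interrupted balance stays paused instead of resuming;
  # cancel it once the filesystem is mounted
  btrfs balance cancel /mnt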


 Original Message 
From: Marc Haber 
Sent: March 31, 2016 9:21:12 AM GMT+03:00
To: linux-btrfs@vger.kernel.org
Subject: How to cancel btrfs balance on unmounted filesystem

Hi,

one of my problem btrfs instances went into a hung process state
while blancing metadata. This process is recorded in the file system
somehow and the balance restarts immediately after mounting the
filesystem with no chance to issue a btrfs balance cancel command
before the system hangs again.

Is there any possiblity to cancel the pending balance without mounting
the fs first?

I have also filed https://bugzilla.kernel.org/show_bug.cgi?id=115581
to adress this in a more elegant way.

Greetings
Marc

