Re: [PATCH] btrfs-progs: mkfs: fix xattr enumeration

2019-09-09 Thread Vladimir Panteleev
Hi Nikolay,

Unfortunately, as I mentioned in the issue, I have also been unable to
reproduce this locally.

Please see here:
https://github.com/kdave/btrfs-progs/issues/194

The reporter tested the patch and confirmed that it worked.
Additionally, they provided strace output which appears to confirm
that the bug described in the patch's commit message matches the
behavior they are observing on their machine.

The issue might be reproducible in an environment closer to the
reporter's (it looks like some Fedora distro, judging by the uname).

On Mon, 9 Sep 2019 at 11:22, Nikolay Borisov  wrote:
>
>
>
> On 6.09.19 г. 12:58 ч., Vladimir Panteleev wrote:
> > Use the return value of listxattr instead of tokenizing.
> >
> > The end of the extended attribute list is indicated by the return
> > value, not an empty list item (two consecutive NULs). Using strtok
> > in this way sometimes caused add_xattr_item to reuse stack data
> > in xattr_list from the previous invocation, thus querying attributes
> > that are not actually in the file's xattr list.
> >
> > Issue: #194
> > Signed-off-by: Vladimir Panteleev 
>
> Can you elaborate on how to trigger this? I tried by creating a folder
> with 2 files, setting 5 xattrs on the first file and 1 on the second.
> Then I ran mkfs.btrfs -r /path/to/dir /dev/vdc and stepped through the
> code with gdb, and didn't see any issues. Ideally I'd like to see a
> regression test for this issue.
>
> Your code looks correct.
>
> > ---
> >  mkfs/rootdir.c | 11 +--
> >  1 file changed, 5 insertions(+), 6 deletions(-)
> >
> > diff --git a/mkfs/rootdir.c b/mkfs/rootdir.c
> > index 51411e02..c86159e7 100644
> > --- a/mkfs/rootdir.c
> > +++ b/mkfs/rootdir.c
> > @@ -228,10 +228,9 @@ static int add_xattr_item(struct btrfs_trans_handle *trans,
> >   int ret;
> >   int cur_name_len;
> >   char xattr_list[XATTR_LIST_MAX];
> > + char *xattr_list_end;
> >   char *cur_name;
> >   char cur_value[XATTR_SIZE_MAX];
> > - char delimiter = '\0';
> > - char *next_location = xattr_list;
> >
> >   ret = llistxattr(file_name, xattr_list, XATTR_LIST_MAX);
> >   if (ret < 0) {
> > @@ -243,10 +242,10 @@ static int add_xattr_item(struct btrfs_trans_handle *trans,
> >   if (ret == 0)
> >   return ret;
> >
> > - cur_name = strtok(xattr_list, &delimiter);
> > - while (cur_name != NULL) {
> > + xattr_list_end = xattr_list + ret;
> > + cur_name = xattr_list;
> > + while (cur_name < xattr_list_end) {
> >   cur_name_len = strlen(cur_name);
> > - next_location += cur_name_len + 1;
> >
> >   ret = lgetxattr(file_name, cur_name, cur_value, XATTR_SIZE_MAX);
> >   if (ret < 0) {
> > @@ -266,7 +265,7 @@ static int add_xattr_item(struct btrfs_trans_handle *trans,
> >   file_name);
> >   }
> >
> > - cur_name = strtok(next_location, &delimiter);
> > + cur_name += cur_name_len + 1;
> >   }
> >
> >   return ret;
> >


[PATCH] btrfs-progs: mkfs: fix xattr enumeration

2019-09-06 Thread Vladimir Panteleev
Use the return value of listxattr instead of tokenizing.

The end of the extended attribute list is indicated by the return
value, not an empty list item (two consecutive NULs). Using strtok
in this way sometimes caused add_xattr_item to reuse stack data
in xattr_list from the previous invocation, thus querying attributes
that are not actually in the file's xattr list.

Issue: #194
Signed-off-by: Vladimir Panteleev 
---
 mkfs/rootdir.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/mkfs/rootdir.c b/mkfs/rootdir.c
index 51411e02..c86159e7 100644
--- a/mkfs/rootdir.c
+++ b/mkfs/rootdir.c
@@ -228,10 +228,9 @@ static int add_xattr_item(struct btrfs_trans_handle *trans,
int ret;
int cur_name_len;
char xattr_list[XATTR_LIST_MAX];
+   char *xattr_list_end;
char *cur_name;
char cur_value[XATTR_SIZE_MAX];
-   char delimiter = '\0';
-   char *next_location = xattr_list;
 
ret = llistxattr(file_name, xattr_list, XATTR_LIST_MAX);
if (ret < 0) {
@@ -243,10 +242,10 @@ static int add_xattr_item(struct btrfs_trans_handle *trans,
if (ret == 0)
return ret;
 
-   cur_name = strtok(xattr_list, &delimiter);
-   while (cur_name != NULL) {
+   xattr_list_end = xattr_list + ret;
+   cur_name = xattr_list;
+   while (cur_name < xattr_list_end) {
cur_name_len = strlen(cur_name);
-   next_location += cur_name_len + 1;
 
ret = lgetxattr(file_name, cur_name, cur_value, XATTR_SIZE_MAX);
if (ret < 0) {
@@ -266,7 +265,7 @@ static int add_xattr_item(struct btrfs_trans_handle *trans,
file_name);
}
 
-   cur_name = strtok(next_location, &delimiter);
+   cur_name += cur_name_len + 1;
}
 
return ret;
-- 
2.23.0



[PATCH 1/1] btrfs-progs: docs: document btrfs-balance exit status in detail

2019-08-24 Thread Vladimir Panteleev
Signed-off-by: Vladimir Panteleev 
---
 Documentation/btrfs-balance.asciidoc | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/Documentation/btrfs-balance.asciidoc b/Documentation/btrfs-balance.asciidoc
index bfb76742..8afd76da 100644
--- a/Documentation/btrfs-balance.asciidoc
+++ b/Documentation/btrfs-balance.asciidoc
@@ -354,8 +354,16 @@ This should lead to decrease in the 'total' numbers in the *btrfs filesystem df*
 
 EXIT STATUS
 ---
-*btrfs balance* returns a zero exit status if it succeeds. Non zero is
-returned in case of failure.
+Unless indicated otherwise below, all *btrfs balance* subcommands
+return a zero exit status if they succeed, and non zero in case of
+failure.
+
+The *pause*, *cancel*, and *resume* subcommands exit with a status of
+*2* if they fail because a balance operation was not running.
+
+The *status* subcommand exits with a status of *0* if a balance
+operation is not running, *1* if the command-line usage is incorrect
+or a balance operation is still running, and *2* on other errors.
 
 AVAILABILITY
 
-- 
2.23.0



[PATCH 0/1] btrfs-progs: docs: document btrfs-balance exit status in detail

2019-08-24 Thread Vladimir Panteleev
The `balance status` exit code is a bit of a mess, and the opposite of
pause/cancel/resume. I assume it's too late to fix it, so documenting
it is the best we can do.

Vladimir Panteleev (1):
  btrfs-progs: docs: document btrfs-balance exit status in detail

 Documentation/btrfs-balance.asciidoc | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

-- 
2.23.0



Re: [PATCH 4/5] btrfs: do not account global reserve in can_overcommit

2019-08-21 Thread Vladimir Panteleev
Hi Josef,

On Fri, 16 Aug 2019 at 15:21, Josef Bacik  wrote:
> Fix the can_overcommit code to simply see if our current usage + what we
> want is less than our current free space plus whatever slack space we
> have in the disk is.  This solves the problem we were seeing in
> production and keeps us from flushing as aggressively as we approach our
> actual metadata size usage.

FYI: I'm not sure whether it was the intended effect of this patch,
but it fixes the problem I was seeing, where deleting a device or
rebalancing a filesystem with many snapshots and extents would fail
with -ENOSPC. With this patch the operation succeeds, although it
seems to require more RAM than merely upping the global reserve size
(which also allowed the operation to succeed).

Hope this helps. Thanks!


Re: [PATCH 1/1] btrfs: Simplify parsing max_inline mount option

2019-08-21 Thread Vladimir Panteleev
On Wed, 21 Aug 2019 at 14:35, David Sterba  wrote:
> match_strdup takes an opaque type
> substring_t that is used by the parser.

Sorry, how would one determine that the type is opaque?

> I've checked some other
> usages in the tree and the match_strdup/memparse/kstrtoull is quite
> common.

Sorry, I also see there are places where substring_t's .from / .to are
still accessed directly (including in btrfs code). Do you think they
ought to use match_strdup instead?


Re: [PATCH] btrfs-progs: balance: check for full-balance before background fork

2019-08-20 Thread Vladimir Panteleev

Hi David,

On 20/08/2019 14.32, David Sterba wrote:

On Sat, Aug 17, 2019 at 11:14:34PM +, Vladimir Panteleev wrote:

- Don't use grep -q, as it causes a SIGPIPE during the countdown, and
   the balance thus doesn't start.


This needs -q, otherwise the text appears in the output of make. Fixed.


What of the SIGPIPE problem mentioned in the commit message?

If using -q is preferred despite that, then the note about it should 
probably be removed from the commit message, and the "cancel" 
afterwards should probably be removed as well (along with its note in 
the commit message), as the SIGPIPE will prevent the balance from 
ever starting.


Perhaps redirecting the output of grep to /dev/null is a better option.

> Applied, thanks.

Not a big issue, but for some reason my email address was mangled 
(@panteleev.md instead of @vladimir.panteleev.md). It looks fine when 
I look at https://patchwork.kernel.org/patch/11099359/mbox/.


--
Best regards,
 Vladimir


[PATCH] btrfs-progs: tests: fix cli-tests/003-fi-resize-args

2019-08-17 Thread Vladimir Panteleev
grep's exit code was never checked (and -o errexit is not in effect),
so the test was ineffective and had silently regressed.

Add the missing exit code check, and update the expected error
messages so that the test passes again.

Signed-off-by: Vladimir Panteleev 
---
 tests/cli-tests/003-fi-resize-args/test.sh | 24 ++
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/tests/cli-tests/003-fi-resize-args/test.sh b/tests/cli-tests/003-fi-resize-args/test.sh
index 4249c1ce..c9267035 100755
--- a/tests/cli-tests/003-fi-resize-args/test.sh
+++ b/tests/cli-tests/003-fi-resize-args/test.sh
@@ -16,21 +16,29 @@ run_check_mount_test_dev
 # missing the one of the required arguments
for sep in '' '--'; do
	run_check_stdout "$TOP/btrfs" filesystem resize $sep |
-		grep -q "btrfs filesystem resize: too few arguments"
+		grep -q "btrfs filesystem resize: exactly 2 arguments expected, 0 given" ||
+		_fail "no expected error message in the output"
	run_check_stdout "$TOP/btrfs" filesystem resize $sep "$TEST_MNT" |
-		grep -q "btrfs filesystem resize: too few arguments"
+		grep -q "btrfs filesystem resize: exactly 2 arguments expected, 1 given" ||
+		_fail "no expected error message in the output"
	run_check_stdout "$TOP/btrfs" filesystem resize $sep -128M |
-		grep -q "btrfs filesystem resize: too few arguments"
+		grep -q "btrfs filesystem resize: exactly 2 arguments expected, 1 given" ||
+		_fail "no expected error message in the output"
	run_check_stdout "$TOP/btrfs" filesystem resize $sep +128M |
-		grep -q "btrfs filesystem resize: too few arguments"
+		grep -q "btrfs filesystem resize: exactly 2 arguments expected, 1 given" ||
+		_fail "no expected error message in the output"
	run_check_stdout "$TOP/btrfs" filesystem resize $sep 512M |
-		grep -q "btrfs filesystem resize: too few arguments"
+		grep -q "btrfs filesystem resize: exactly 2 arguments expected, 1 given" ||
+		_fail "no expected error message in the output"
	run_check_stdout "$TOP/btrfs" filesystem resize $sep 1:-128M |
-		grep -q "btrfs filesystem resize: too few arguments"
+		grep -q "btrfs filesystem resize: exactly 2 arguments expected, 1 given" ||
+		_fail "no expected error message in the output"
	run_check_stdout "$TOP/btrfs" filesystem resize $sep 1:512M |
-		grep -q "btrfs filesystem resize: too few arguments"
+		grep -q "btrfs filesystem resize: exactly 2 arguments expected, 1 given" ||
+		_fail "no expected error message in the output"
	run_check_stdout "$TOP/btrfs" filesystem resize $sep 1:+128M |
-		grep -q "btrfs filesystem resize: too few arguments"
+		grep -q "btrfs filesystem resize: exactly 2 arguments expected, 1 given" ||
+		_fail "no expected error message in the output"
done
 
 # valid resize
-- 
2.22.0



[PATCH] btrfs-progs: balance: check for full-balance before background fork

2019-08-17 Thread Vladimir Panteleev
Move the full-balance warning to before the fork, so that the user can
see and react to it.

Notes on test:

- Don't use grep -q, as it causes a SIGPIPE during the countdown, and
  the balance thus doesn't start.

- The "balance cancel" is superfluous as the last command, but it
  provides some idempotence and allows adding more tests below it.

Fixes: https://github.com/kdave/btrfs-progs/issues/168

Signed-off-by: Vladimir Panteleev 
---
 cmds/balance.c| 36 +--
 .../002-balance-full-no-filters/test.sh   |  5 +++
 2 files changed, 23 insertions(+), 18 deletions(-)

diff --git a/cmds/balance.c b/cmds/balance.c
index 6f2d4803..32830002 100644
--- a/cmds/balance.c
+++ b/cmds/balance.c
@@ -437,24 +437,6 @@ static int do_balance(const char *path, struct btrfs_ioctl_balance_args *args,
	if (fd < 0)
		return 1;

-	if (!(flags & BALANCE_START_FILTERS) && !(flags & BALANCE_START_NOWARN)) {
-		int delay = 10;
-
-		printf("WARNING:\n\n");
-		printf("\tFull balance without filters requested. This operation is very\n");
-		printf("\tintense and takes potentially very long. It is recommended to\n");
-		printf("\tuse the balance filters to narrow down the scope of balance.\n");
-		printf("\tUse 'btrfs balance start --full-balance' option to skip this\n");
-		printf("\twarning. The operation will start in %d seconds.\n", delay);
-		printf("\tUse Ctrl-C to stop it.\n");
-		while (delay) {
-			printf("%2d", delay--);
-			fflush(stdout);
-			sleep(1);
-		}
-		printf("\nStarting balance without any filters.\n");
-	}
-
ret = ioctl(fd, BTRFS_IOC_BALANCE_V2, args);
if (ret < 0) {
/*
@@ -634,6 +616,24 @@ static int cmd_balance_start(const struct cmd_struct *cmd,
}
}
 
+	if (!(start_flags & BALANCE_START_FILTERS) && !(start_flags & BALANCE_START_NOWARN)) {
+		int delay = 10;
+
+		printf("WARNING:\n\n");
+		printf("\tFull balance without filters requested. This operation is very\n");
+		printf("\tintense and takes potentially very long. It is recommended to\n");
+		printf("\tuse the balance filters to narrow down the scope of balance.\n");
+		printf("\tUse 'btrfs balance start --full-balance' option to skip this\n");
+		printf("\twarning. The operation will start in %d seconds.\n", delay);
+		printf("\tUse Ctrl-C to stop it.\n");
+		while (delay) {
+			printf("%2d", delay--);
+			fflush(stdout);
+			sleep(1);
+		}
+		printf("\nStarting balance without any filters.\n");
+	}
+
if (force)
args.flags |= BTRFS_BALANCE_FORCE;
if (verbose)
diff --git a/tests/cli-tests/002-balance-full-no-filters/test.sh b/tests/cli-tests/002-balance-full-no-filters/test.sh
index 9c31dd6f..daadcc44 100755
--- a/tests/cli-tests/002-balance-full-no-filters/test.sh
+++ b/tests/cli-tests/002-balance-full-no-filters/test.sh
@@ -18,4 +18,9 @@ run_check $SUDO_HELPER "$TOP/btrfs" balance start "$TEST_MNT"
 run_check $SUDO_HELPER "$TOP/btrfs" balance --full-balance "$TEST_MNT"
 run_check $SUDO_HELPER "$TOP/btrfs" balance "$TEST_MNT"
 
+run_check_stdout $SUDO_HELPER "$TOP/btrfs" balance start --background "$TEST_MNT" |
+   grep -F "Full balance without filters requested." ||
+   _fail "full balance warning not in the output"
+run_mayfail $SUDO_HELPER "$TOP/btrfs" balance cancel "$TEST_MNT"
+
 run_check_umount_test_dev
-- 
2.22.0



Re: [PATCH 1/1] btrfs: Simplify parsing max_inline mount option

2019-08-15 Thread Vladimir Panteleev
On Thu, 15 Aug 2019 at 04:54, Anand Jain  wrote:
>   This causes a regression; max_inline = 0 is a valid parameter.

Thank you for catching that. Apologies for making such a rudimentary mistake.

I will apply more diligence and resubmit.


[PATCH 0/1] btrfs: Simplify parsing max_inline mount option

2019-08-14 Thread Vladimir Panteleev
A nit I noticed when writing the global_reserve_size patch.

I'm assuming that rejecting garbage in mount options (where it was
silently accepted before) does not break the first rule of kernel
development?

Vladimir Panteleev (1):
  btrfs: Simplify parsing max_inline mount option

 fs/btrfs/super.c | 24 +---
 1 file changed, 9 insertions(+), 15 deletions(-)

-- 
2.22.0



[PATCH 1/1] btrfs: Simplify parsing max_inline mount option

2019-08-14 Thread Vladimir Panteleev
- Avoid an allocation;
- Properly handle non-numerical argument and garbage after numerical
  argument.

Signed-off-by: Vladimir Panteleev 
---
 fs/btrfs/super.c | 24 +---
 1 file changed, 9 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index f56617dfb3cf..6fe8ef6667f3 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -426,7 +426,7 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
unsigned long new_flags)
 {
substring_t args[MAX_OPT_ARGS];
-   char *p, *num;
+   char *p, *retptr;
u64 cache_gen;
int intarg;
int ret = 0;
@@ -630,22 +630,16 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
info->thread_pool_size = intarg;
break;
case Opt_max_inline:
-   num = match_strdup(&args[0]);
-   if (num) {
-   info->max_inline = memparse(num, NULL);
-   kfree(num);
-
-   if (info->max_inline) {
-   info->max_inline = min_t(u64,
-   info->max_inline,
-   info->sectorsize);
-   }
-   btrfs_info(info, "max_inline at %llu",
-  info->max_inline);
-   } else {
-   ret = -ENOMEM;
+   info->max_inline = memparse(args[0].from, &retptr);
+   if (retptr != args[0].to || info->max_inline == 0) {
+   ret = -EINVAL;
goto out;
}
+   info->max_inline = min_t(u64,
+   info->max_inline,
+   info->sectorsize);
+   btrfs_info(info, "max_inline at %llu",
+  info->max_inline);
break;
case Opt_alloc_start:
btrfs_info(info,
-- 
2.22.0



Re: [PATCH 1/1] btrfs: Add global_reserve_size mount option

2019-08-12 Thread Vladimir Panteleev
Hi Nikolay,

Thank you for looking at my patch!

You are completely correct that this papers over a bug I do not
understand, and I would very much like to understand and fix the
underlying bug instead of settling for a workaround.

Unfortunately, after three days of looking at the BTRFS code (and
getting to where I am now), I have realized that, as a developer with
no experience in filesystems or kernel development, it would take me a
lot more time, possibly several weeks, to understand BTRFS well enough
to contribute a meaningful fix. This is not something I would be
opposed to, as I have the time and am personally invested in BTRFS,
but it would certainly be a lot easier if I could at least get
occasional confirmation that my findings and understanding so far are
correct and that I am on the right track. Unfortunately, the people in
a position to do this seem to be too busy with far more important
issues than helping debug my particular edge case, and the previous
thread has not received any replies since my last few posts there, so
this patch is the least I could contribute so far.

FWIW #1: My current best guess at why the problem occurs, using my
current level of understanding of BTRFS, is that the filesystem in
question (16TB of historical snapshots) has so many subvolumes and
fragmentation that balance or device delete operations allocate so
much metadata space while processing the chunk (by allocating new
blocks for splitting filled metadata tree nodes) that the global
reserve is overrun. Corrections or advice on how to verify this theory
would be appreciated! (Or perhaps I should just use my patch to fix my
filesystem and move on with my life. Would be good to know when I can
wipe the disks containing the test case FS which reproduces the bug
and use them for something else.)

FWIW #2: I noticed that Josef Bacik proposed a change back in 2013 to
increase the global reserve size to 1G. The comments on that patch
were the reason I proposed making it configurable rather than raising
the size again: https://patchwork.kernel.org/patch/2517071/

Thanks!

On Mon, 12 Aug 2019 at 08:37, Nikolay Borisov  wrote:
>
>
>
> On 10.08.19 г. 15:41 ч., Vladimir Panteleev wrote:
> > In some circumstances (filesystems with many extents and backrefs),
> > the global reserve gets overrun causing balance and device deletion
> > operations to fail with -ENOSPC. Providing a way for users to increase
> > the global reserve size can allow them to complete the operation.
> >
> > Signed-off-by: Vladimir Panteleev 
>
> I'm inclined to NAK this patch, on the basis that it papers over
> deficiencies in the current ENOSPC handling algorithms. Furthermore, in
> your cover letter you state that you don't completely understand the
> root cause. So at the very best this is papering over a bug.
>
> > ---
> >  fs/btrfs/block-rsv.c |  2 +-
> >  fs/btrfs/ctree.h |  3 +++
> >  fs/btrfs/disk-io.c   |  1 +
> >  fs/btrfs/super.c | 17 -
> >  4 files changed, 21 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/btrfs/block-rsv.c b/fs/btrfs/block-rsv.c
> > index 698470b9f32d..5e5f5521de0e 100644
> > --- a/fs/btrfs/block-rsv.c
> > +++ b/fs/btrfs/block-rsv.c
> > @@ -272,7 +272,7 @@ void btrfs_update_global_block_rsv(struct btrfs_fs_info *fs_info)
> >   spin_lock(&sinfo->lock);
> >   spin_lock(&block_rsv->lock);
> >
> > - block_rsv->size = min_t(u64, num_bytes, SZ_512M);
> > + block_rsv->size = min_t(u64, num_bytes, fs_info->global_reserve_size);
> >
> >   if (block_rsv->reserved < block_rsv->size) {
> >   num_bytes = btrfs_space_info_used(sinfo, true);
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 299e11e6c554..d975d4f5723c 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -775,6 +775,8 @@ struct btrfs_fs_info {
> >*/
> >   u64 max_inline;
> >
> > + u64 global_reserve_size;
> > +
> >   struct btrfs_transaction *running_transaction;
> >   wait_queue_head_t transaction_throttle;
> >   wait_queue_head_t transaction_wait;
> > @@ -1359,6 +1361,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
> >
> >  #define BTRFS_DEFAULT_COMMIT_INTERVAL(30)
> >  #define BTRFS_DEFAULT_MAX_INLINE (2048)
> > +#define BTRFS_DEFAULT_GLOBAL_RESERVE_SIZE (SZ_512M)
> >
> >  #define btrfs_clear_opt(o, opt)  ((o) &= ~BTRFS_MOUNT_##opt)
> >  #define btrfs_set_opt(o, opt)((o) |= BTRFS_MOUNT_##opt)
> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > index 

[PATCH 0/1] btrfs: Add global_reserve_size mount option

2019-08-10 Thread Vladimir Panteleev
Hi,

This is a follow-up to my previous thread, '"kernel BUG" and
segmentation fault with "device delete"'. Since my last message there,
I discovered that the problem I was seeing was solved merely by
increasing the size of the global reserve to 1G. I don't claim to have
anywhere near a complete understanding of btrfs, its implementation,
or the problem at hand, but I figured a bad patch is better than no
patch, so here's a patch which does allow working around my problem
without building a custom kernel. Or, if anyone has better ideas how
to understand or address this problem, I'd be happy to help or test
things.

Vladimir Panteleev (1):
  btrfs: Add global_reserve_size mount option

 fs/btrfs/block-rsv.c |  2 +-
 fs/btrfs/ctree.h |  3 +++
 fs/btrfs/disk-io.c   |  1 +
 fs/btrfs/super.c | 17 -
 4 files changed, 21 insertions(+), 2 deletions(-)

-- 
2.22.0



[PATCH 1/1] btrfs: Add global_reserve_size mount option

2019-08-10 Thread Vladimir Panteleev
In some circumstances (filesystems with many extents and backrefs),
the global reserve gets overrun causing balance and device deletion
operations to fail with -ENOSPC. Providing a way for users to increase
the global reserve size can allow them to complete the operation.

Signed-off-by: Vladimir Panteleev 
---
 fs/btrfs/block-rsv.c |  2 +-
 fs/btrfs/ctree.h |  3 +++
 fs/btrfs/disk-io.c   |  1 +
 fs/btrfs/super.c | 17 -
 4 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/block-rsv.c b/fs/btrfs/block-rsv.c
index 698470b9f32d..5e5f5521de0e 100644
--- a/fs/btrfs/block-rsv.c
+++ b/fs/btrfs/block-rsv.c
@@ -272,7 +272,7 @@ void btrfs_update_global_block_rsv(struct btrfs_fs_info *fs_info)
spin_lock(&sinfo->lock);
spin_lock(&block_rsv->lock);
 
-   block_rsv->size = min_t(u64, num_bytes, SZ_512M);
+   block_rsv->size = min_t(u64, num_bytes, fs_info->global_reserve_size);
 
if (block_rsv->reserved < block_rsv->size) {
num_bytes = btrfs_space_info_used(sinfo, true);
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 299e11e6c554..d975d4f5723c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -775,6 +775,8 @@ struct btrfs_fs_info {
 */
u64 max_inline;
 
+   u64 global_reserve_size;
+
struct btrfs_transaction *running_transaction;
wait_queue_head_t transaction_throttle;
wait_queue_head_t transaction_wait;
@@ -1359,6 +1361,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
 
 #define BTRFS_DEFAULT_COMMIT_INTERVAL  (30)
 #define BTRFS_DEFAULT_MAX_INLINE   (2048)
+#define BTRFS_DEFAULT_GLOBAL_RESERVE_SIZE (SZ_512M)
 
 #define btrfs_clear_opt(o, opt)((o) &= ~BTRFS_MOUNT_##opt)
 #define btrfs_set_opt(o, opt)  ((o) |= BTRFS_MOUNT_##opt)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5f7ee70b3d1a..06f835a44b8a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2723,6 +2723,7 @@ int open_ctree(struct super_block *sb,
atomic64_set(&fs_info->tree_mod_seq, 0);
fs_info->sb = sb;
fs_info->max_inline = BTRFS_DEFAULT_MAX_INLINE;
+   fs_info->global_reserve_size = BTRFS_DEFAULT_GLOBAL_RESERVE_SIZE;
fs_info->metadata_ratio = 0;
fs_info->defrag_inodes = RB_ROOT;
atomic64_set(&fs_info->free_chunk_space, 0);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 78de9d5d80c6..f44223a44cb8 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -327,6 +327,7 @@ enum {
Opt_treelog, Opt_notreelog,
Opt_usebackuproot,
Opt_user_subvol_rm_allowed,
+   Opt_global_reserve_size,
 
/* Deprecated options */
Opt_alloc_start,
@@ -394,6 +395,7 @@ static const match_table_t tokens = {
{Opt_notreelog, "notreelog"},
{Opt_usebackuproot, "usebackuproot"},
{Opt_user_subvol_rm_allowed, "user_subvol_rm_allowed"},
+   {Opt_global_reserve_size, "global_reserve_size=%s"},
 
/* Deprecated options */
{Opt_alloc_start, "alloc_start=%s"},
@@ -426,7 +428,7 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
unsigned long new_flags)
 {
substring_t args[MAX_OPT_ARGS];
-   char *p, *num;
+   char *p, *num, *retptr;
u64 cache_gen;
int intarg;
int ret = 0;
@@ -746,6 +748,15 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
case Opt_user_subvol_rm_allowed:
btrfs_set_opt(info->mount_opt, USER_SUBVOL_RM_ALLOWED);
break;
+   case Opt_global_reserve_size:
+			info->global_reserve_size = memparse(args[0].from, &retptr);
+			if (retptr != args[0].to || info->global_reserve_size == 0) {
+   ret = -EINVAL;
+   goto out;
+   }
+   btrfs_info(info, "global_reserve_size at %llu",
+  info->global_reserve_size);
+   break;
case Opt_enospc_debug:
btrfs_set_opt(info->mount_opt, ENOSPC_DEBUG);
break;
@@ -1336,6 +1347,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
seq_puts(seq, ",clear_cache");
if (btrfs_test_opt(info, USER_SUBVOL_RM_ALLOWED))
seq_puts(seq, ",user_subvol_rm_allowed");
+   if (info->global_reserve_size != BTRFS_DEFAULT_GLOBAL_RESERVE_SIZE)
+		seq_printf(seq, ",global_reserve_size=%llu", info->global_reserve_size);
if (btrfs_test_opt(info, ENOSPC_DEBUG))
seq_puts(seq, ",enospc_debug");
 

Re: "kernel BUG" and segmentation fault with "device delete"

2019-08-08 Thread Vladimir Panteleev

Did more digging today. Here is where the -ENOSPC is coming from:

btrfs_run_delayed_refs ->  // WARN here
__btrfs_run_delayed_refs ->
btrfs_run_delayed_refs_for_head ->
run_one_delayed_ref ->
run_delayed_data_ref ->
__btrfs_inc_extent_ref ->
insert_extent_backref ->
insert_extent_data_ref ->
btrfs_insert_empty_item ->
btrfs_insert_empty_items ->
btrfs_search_slot ->
split_leaf ->
alloc_tree_block_no_bg_flush ->
btrfs_alloc_tree_block ->
use_block_rsv ->
block_rsv_use_bytes / reserve_metadata_bytes

In use_block_rsv, first block_rsv_use_bytes (with the 
BTRFS_BLOCK_RSV_DELREFS one) fails, then reserve_metadata_bytes fails, 
then block_rsv_use_bytes with global_rsv fails again.


My understanding of this in plain English is as follows: btrfs attempted 
to finalize a transaction and add the queued backreferences. When doing 
so, it ran out of space in a B-tree, and attempted to allocate a new 
tree block; however, in doing so, it hit the limit it reserved for 
itself for how much space it was going to use during that operation, so 
it gave up on the whole thing, which led everything to go downhill from 
there. Is this anywhere close to being accurate?


BTW, the DELREFS rsv is 0 / 7GB reserved/free. So, it looks like it 
didn't expect to allocate the new tree node at all? Perhaps it should be 
using some other rsv for those?


Am I on the right track, or should I be discussing this elsewhere / with 
someone else?


On 20/07/2019 10.59, Vladimir Panteleev wrote:

Hi,

I've done a few experiments and here are my findings.

First I probably should describe the filesystem: it is a snapshot 
archive, containing a lot of snapshots for 4 subvolumes, totaling 2487 
subvolumes/snapshots. There are also a few files (inside the snapshots) 
that are probably very fragmented. This is probably what causes the bug.


Observations:

- If I delete all snapshots, the bug disappears (device delete succeeds).
- If I delete all but any single subvolume's snapshots, the bug disappears.
- If I delete one of two subvolumes' snapshots, the bug disappears, but 
stays if I delete one of the other two subvolumes' snapshots.


It looks like two subvolumes' snapshots' data participates in causing 
the bug.


In theory, I guess it would be possible to reduce the filesystem to the 
minimal one causing the bug by iteratively deleting snapshots / files 
and checking if the bug manifests, but it would be extremely 
time-consuming, probably requiring weeks.


Anything else I can do to help diagnose / fix it? Or should I just order 
more HDDs and clone the RAID10 the right way?


On 06/07/2019 05.51, Qu Wenruo wrote:



On 2019/7/6 at 1:13 PM, Vladimir Panteleev wrote:
[...]

I'm not sure if it's the degraded mount causing the problem, as the
enospc_debug output looks like reserved/pinned/over-reserved space has
taken up all the space, while no new chunk gets allocated.


The problem happens after replace-ing the missing device (which succeeds
in full) and then attempting to remove it, i.e. without a degraded 
mount.



Would you please try to balance metadata to see if the ENOSPC still
happens?


The problem also manifests when attempting to rebalance the metadata.


Have you tried to balance just one or two metadata block groups?
E.g using -mdevid or -mvrange?

And did the problem always happen at the same block group?

Thanks,
Qu


Thanks!







--
Best regards,
 Vladimir


Re: "kernel BUG" and segmentation fault with "device delete"

2019-07-20 Thread Vladimir Panteleev

Hi,

I've done a few experiments and here are my findings.

First I probably should describe the filesystem: it is a snapshot 
archive, containing a lot of snapshots for 4 subvolumes, totaling 2487 
subvolumes/snapshots. There are also a few files (inside the snapshots) 
that are probably very fragmented. This is probably what causes the bug.


Observations:

- If I delete all snapshots, the bug disappears (device delete succeeds).
- If I delete all but any single subvolume's snapshots, the bug disappears.
- If I delete one of two subvolumes' snapshots, the bug disappears, but 
stays if I delete one of the other two subvolumes' snapshots.


It looks like two subvolumes' snapshots' data participates in causing 
the bug.


In theory, I guess it would be possible to reduce the filesystem to the 
minimal one causing the bug by iteratively deleting snapshots / files 
and checking if the bug manifests, but it would be extremely 
time-consuming, probably requiring weeks.


Anything else I can do to help diagnose / fix it? Or should I just order 
more HDDs and clone the RAID10 the right way?


On 06/07/2019 05.51, Qu Wenruo wrote:



On 2019/7/6 at 1:13 PM, Vladimir Panteleev wrote:
[...]

I'm not sure if it's the degraded mount causing the problem, as the
enospc_debug output looks like reserved/pinned/over-reserved space has
taken up all the space, while no new chunk gets allocated.


The problem happens after replace-ing the missing device (which succeeds
in full) and then attempting to remove it, i.e. without a degraded mount.


Would you please try to balance metadata to see if the ENOSPC still
happens?


The problem also manifests when attempting to rebalance the metadata.


Have you tried to balance just one or two metadata block groups?
E.g. using -mdevid or -mvrange?

And did the problem always happen at the same block group?

Thanks,
Qu


Thanks!





--
Best regards,
 Vladimir


Re: "kernel BUG" and segmentation fault with "device delete"

2019-07-06 Thread Vladimir Panteleev

On 06/07/2019 05.51, Qu Wenruo wrote:

The problem also manifests when attempting to rebalance the metadata.


Have you tried to balance just one or two metadata block groups?
E.g. using -mdevid or -mvrange?


If I use -mdevid with the device ID of the device I'm trying to remove 
(2), I see the crash.


If I use -mvrange with a range covering one byte past a problematic 
virtual address, I see the crash.


Not sure if you had anything else / specific in mind. Here is a log of 
my experiments so far:


https://dump.thecybershadow.net/da241fb4b6e743b01a7e9f8734f70d6e/scratch.txt
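In dry-run form, the two invocations were roughly as follows (a sketch 
only: the helper prints the commands instead of executing them; /mnt/a is 
my mount point, devid 2 is the device being removed, and the vrange start 
is one of the problematic block group addresses):

```shell
# Dry-run sketch: print the balance commands instead of executing them.
# Swap the echo for the real command to actually run them.
run() { echo "+ $*"; }

# Balance only metadata chunks that have a stripe on devid 2:
run btrfs balance start -mdevid=2 /mnt/a

# Balance only metadata block groups overlapping a 1-byte range just past
# a problematic virtual address:
run btrfs balance start -mvrange=1998263943168..1998263943169 /mnt/a
```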


And did the problem always happen at the same block group?


Upon reviewing my logs, it looks like for "device remove" it always 
tries to move block group 1998263943168 first, upon which it crashes.


For "balance", it seems to vary; it looks like there is at least one other 
problematic block group at 48009543942144.


Happy to do more experiments or test kernel patches. I have a VM set up 
with a COW view of the devices, so I can do destructive tests too.
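For reference, one way to build such a COW view (a dry-run sketch: the 
helper only prints the commands, and the device name, overlay size, 
sector count and chunk size are placeholders for illustration):

```shell
# Dry-run sketch: print each command instead of executing it.
run() { echo "+ $*"; }

sectors=15628053168   # placeholder for: blockdev --getsz /dev/sdd1

# Sparse overlay file to absorb all writes, attached as a loop device:
run truncate -s 64G /tmp/sdd1-cow.img
run losetup --find --show /tmp/sdd1-cow.img   # suppose it returns /dev/loop0

# Non-persistent ("N") dm snapshot: reads fall through to /dev/sdd1,
# writes land in the overlay, and the original device is never modified:
run dmsetup create sdd1-view \
    --table "0 $sectors snapshot /dev/sdd1 /dev/loop0 N 8"
```

The resulting /dev/mapper/sdd1-view node can then be handed to the VM in 
place of the real disk.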


--
Best regards,
 Vladimir



signature.asc
Description: OpenPGP digital signature


Re: "kernel BUG" and segmentation fault with "device delete"

2019-07-05 Thread Vladimir Panteleev

On 06/07/2019 05.01, Qu Wenruo wrote:

After stubbing out btrfs_check_rw_degradable (because btrfs currently
can't realize when it has all drives needed for RAID10),


The point is, btrfs_check_rw_degradable() is already doing per-chunk
level rw degradable checking.

I would highly recommend not to comment out the function completely.
It has been a long (well, not that long) way from old fs level tolerance
to current per-chunk tolerance check.


Very grateful for this :)


I totally understand for RAID10 we can at most drop half of its stripes
as long as we have one device for each substripe.
If you really want that feature to allow RAID10 to tolerate more missing
devices, please do proper chunk stripe check.


This was my understanding of the situation as well; in any case, it was 
a temporary patch just so I could rebalance the RAID10 blocks to RAID1.



The fs should have enough space to allocate new metadata chunk (it's
metadata chunk lacking space and caused ENOSPC).

I'm not sure if it's the degraded mount causing the problem, as the
enospc_debug output looks like reserved/pinned/over-reserved space has
taken up all the space, while no new chunk gets allocated.


The problem happens after replace-ing the missing device (which succeeds 
in full) and then attempting to remove it, i.e. without a degraded mount.



Would you please try to balance metadata to see if the ENOSPC still happens?


The problem also manifests when attempting to rebalance the metadata.

Thanks!

--
Best regards,
 Vladimir



signature.asc
Description: OpenPGP digital signature


Re: "kernel BUG" and segmentation fault with "device delete"

2019-07-05 Thread Vladimir Panteleev

On 06/07/2019 02.38, Chris Murphy wrote:

On Fri, Jul 5, 2019 at 6:05 PM Vladimir Panteleev
 wrote:

Unfortunately as mentioned before that wasn't an option. I was
performing this operation on a DM snapshot target backed by a file that
certainly could not fit the result of a RAID10-to-RAID1 rebalance.


Then the total operation isn't possible. Maybe you could have made the
volume a seed, and then create a single device sprout on a new single
target, and later convert that sprout to raid1. But I'm not sure of
the state of multiple device seeds.


That's an interesting idea, thanks; I'll be sure to explore it if I run 
into this situation again.
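From the documentation, my understanding of the seed/sprout flow is 
roughly the following (an untested dry-run sketch: the helper only prints 
the commands, and /dev/sdx, /dev/sdy and the mount point are placeholders):

```shell
# Dry-run sketch: print each command instead of executing it.
run() { echo "+ $*"; }

run btrfstune -S 1 /dev/sdd1                 # mark the old filesystem as a seed
run mount /dev/sdd1 /mnt/seed                # seed filesystems mount read-only
run btrfs device add /dev/sdx /mnt/seed      # attach the new "sprout" device
run mount -o remount,rw /mnt/seed            # writes now go to the sprout
run btrfs device delete /dev/sdd1 /mnt/seed  # migrate data off the seed
# Later, to get redundancy back on the sprout:
run btrfs device add /dev/sdy /mnt/seed
run btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/seed
```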



What I found surprising, was that "btrfs device delete missing" deletes
exactly one device, instead of all missing devices. But, that might be
simply because a device with RAID10 blocks should not have been
mountable rw with two missing drives in the first place.


It's a really good question for developers if there is a good reason
to permit rw mount of a volume that's missing two or more devices for
raid 1, 10, or 5; and missing three or more for raid6. I cannot think
of a good reason to allow degraded,rw mounts for a raid10 missing two
devices.


Sorry, the code currently indeed does not permit mounting a RAID10 
filesystem with more than one missing device in rw. I needed to patch my 
kernel to force it to allow it, as I was working on the assumption that 
the two remaining drives contained a copy of all data (which turned out 
to be true).



Wow that's really interesting. So you did 'btrfs replace start' for
one of the missing drive devid's, with a loop device as the
replacement, and that worked and finished?!


Yes, that's right.
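Concretely, the sequence was roughly as below (a dry-run sketch from 
memory: the helper only prints the commands, and the image path and size 
are placeholders; the image must be at least as large as the missing 
device):

```shell
# Dry-run sketch: print each command instead of executing it.
run() { echo "+ $*"; }

run truncate -s 7451G /mnt/scratch/devid2.img       # placeholder size (>= 7.28TiB)
run losetup --find --show /mnt/scratch/devid2.img   # suppose it returns /dev/loop0
run mount -o degraded /dev/sdd1 /mnt/a
run btrfs replace start 2 /dev/loop0 /mnt/a         # 2 = devid of the missing device
run btrfs replace status /mnt/a
```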


Does this three device volume mount rw and not degraded? I guess it
must have because 'btrfs fi us' worked on it.

 devid1 size 7.28TiB used 2.71TiB path /dev/sdd1
 devid2 size 7.28TiB used 22.01GiB path /dev/loop0
 devid3 size 7.28TiB used 2.69TiB path /dev/sdf1


Indeed: with the loop device attached, I can mount the filesystem rw 
just fine without any mount flags, with a stock kernel.



OK so what happens now if you try to 'btrfs device remove /dev/loop0' ?


Unfortunately it fails in the same way (warning followed by "kernel 
BUG"). The same thing happens if I try to rebalance the metadata.



Well there's definitely something screwy if Btrfs needs something on a
missing drive, which is indicated by its refusal to remove it from the
volume, and yet at same time it's possible to e.g. rsync every file to
/dev/null without any errors. That's a bug somewhere.


As I understand it, btrfs doesn't actually "need" any data from that 
device; it's just having trouble updating some metadata as it tries to 
move one redundant copy of the data from there to somewhere else. It's 
not refusing to remove the device either; rather, it tries to do so and 
fails.



I'm not a developer but a dev very well might need to have a simple
reproducer for this in order to locate the problem. But the call trace
might tell them what they need to know. I'm not sure.


What I'm going to try to do next is to create another COW layer on top 
of the three devices I have, attach them to a virtual machine, and boot 
that (as it's not fun to reboot the physical machine each time the code 
crashes). Then I could maybe poke the related kernel code to try to 
understand the problem better.


--
Best regards,
 Vladimir


Re: "kernel BUG" and segmentation fault with "device delete"

2019-07-05 Thread Vladimir Panteleev

Hi Chris,

First, thank you very much for taking the time to reply. I greatly 
appreciate it.


On 05/07/2019 21.43, Chris Murphy wrote:

There's no parity on either raid10 or raid1.


Right, thank you for the correction. Of course, I meant the duplicate 
copies of the RAID1 data.



But I can't tell from the
above exactly when each drive was disconnected. In this scenario you
need to convert to raid1 first, wait for that to complete successfully
before you can do a device remove. That's clear.  Also clear is you
must use 'btrfs device remove' and it must complete before that device
is disconnected.


Unfortunately, as mentioned before, that wasn't an option. I was 
performing this operation on a DM snapshot target backed by a file that 
certainly could not fit the result of a RAID10-to-RAID1 rebalance.



What I've never tried, but the man page implies, is you can specify
two devices at one time for 'btrfs device remove' if the profile and
the number of devices permits it.


What I found surprising, was that "btrfs device delete missing" deletes 
exactly one device, instead of all missing devices. But, that might be 
simply because a device with RAID10 blocks should not have been 
mountable rw with two missing drives in the first place.



After stubbing out btrfs_check_rw_degradable (because btrfs currently
can't realize when it has all drives needed for RAID10),


Uhh? This implies it was still raid10 when you disconnected two drives
of a four drive raid10. That's definitely data loss territory.
However, your 'btrfs fi us' command suggests only raid1 chunks. What
I'm suspicious of is this:


Data,RAID1: Size:2.66TiB, Used:2.66TiB
  /dev/sdd1   2.66TiB
  /dev/sdf1   2.66TiB


All data block groups are only on sdf1 and sdd1.


Metadata,RAID1: Size:57.00GiB, Used:52.58GiB
   /dev/sdd1  57.00GiB
  /dev/sdf1  37.00GiB
   missing  20.00GiB


There's metadata still on one of the missing devices. You need to
physically reconnect this device. The device removal did not complete
before this device was physically disconnected.


System,RAID1: Size:8.00MiB, Used:416.00KiB
   /dev/sdd1   8.00MiB
   missing   8.00MiB


This is actually worse, potentially because it means there's only one
copy of the system chunk on sdd1. It has not been replicated to sdf1,
but is on the missing device.


I'm sorry, but that's not right. As I mentioned in my second email, if I 
use btrfs device replace, then it successfully rebuilds all missing 
data. So, there is no lost data with no remaining copies; btrfs is 
simply having some trouble moving it off of that device.


Here is the filesystem info with a loop device replacing the missing drive:

https://dump.thecybershadow.net/9a0c88c3720c55bcf7fee98630c2a8e1/00%3A02%3A17-upload.txt


Depending on degraded operation for this task is the wrong strategy.
You needed to 'btrfs device delete/remove' before physically
disconnecting these drives.

OK you definitely did this incorrectly if you're expecting to
disconnect two devices at the same time, and then "btrfs device delete
missing" instead of explicitly deleting drives by ID before you
physically disconnect them.


I don't disagree in general; however, I did make sure that all data was 
accessible with two devices before proceeding with this endeavor.



It sounds to me like you had a successful conversion from 4 disk
raid10 to a 4 disk raid1. But then you're assuming there are
sufficient copies of all data and metadata on each drive. That is not
the case with Btrfs. The drives are not mirrored. The block groups are
mirrored. Btrfs raid1 tolerates exactly 1 device loss. Not two.


Whether through dumb luck or just due to the series of steps I happened 
to take, it doesn't look like I've lost any data so far. But thank you 
for the correction regarding how RAID1 works in btrfs; I had indeed been 
misunderstanding it.



We need to see a list of commands issued in order, along with the
physical connected state of each drive. I thought I understood what
you did from the previous email, but this paragraph contradicts my
understanding, especially when you say "correct approach would be
first to convert all RAID 10 to RAID1 and then remove devices but that
wasn't an option"

OK so what did you do, in order, each command, interleaving the
physical device removals.


Well, at this point, I'm still quite confident that the BTRFS kernel bug 
is unrelated to this entire RAID10 thing, but I'll do so if you like. 
Unfortunately I do not have an exact record of this, but I can do my 
best to reconstruct it from memory.


The reason I'm doing this in the first place is that I'm trying to split 
a 4-drive RAID10 array that was getting full. The goal was to move some 
data off of it to a new array, then delete it from its original 
location. I couldn't use rsync because most of the data was in 
snapshots, and I couldn't use btrfs send/receive because it bugs out 
with the old "chown oXXX-XXX-0 failed: No such file or directory" 
bug. S

Re: "kernel BUG" and segmentation fault with "device delete"

2019-07-05 Thread Vladimir Panteleev

On 05/07/2019 09.42, Andrei Borzenkov wrote:

On Fri, Jul 5, 2019 at 7:45 AM Vladimir Panteleev
 wrote:


Hi,

I'm trying to convert a data=RAID10,metadata=RAID1 (4 disks) array to
RAID1 (2 disks). The array was less than half full, and I disconnected
two parity drives,


btrfs does not have dedicated parity drives; it is quite possible that
some chunks had their mirror pieces on these two drives, meaning you
effectively induced data loss. You had to perform "btrfs device
delete" *first*, then disconnect unused drive after this process has
completed.


Hi Andrei,

Thank you for replying. However, I'm pretty sure this is not the case as 
you describe it, and in fact, unrelated to the actual problem I'm having.


- I can access all the data on the volumes just fine.

- All the RAID10 block profiles had been successfully converted to 
RAID1. Currently, there are no RAID10 blocks left anywhere on the 
filesystem.


- Only the data was in the RAID10 profile. Metadata was and is in RAID1. 
It is also metadata which btrfs cannot move away from the missing device.


If you can propose a test to verify your hypothesis, I'd be happy to 
check. But, as far as my understanding of btrfs allows me to see, your 
conclusion rests on a bad assumption.


Also, IIRC, your suggestion is not applicable: btrfs refuses to remove a 
device from a 4-device filesystem with RAID10 blocks, as that would put 
it under the minimum number of devices for RAID10. I think the "correct" 
approach would be first to convert all RAID10 blocks to RAID1 and only 
then remove the devices; however, this was not an option for me due to 
other constraints I was working under at the time.
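For reference, the convert-first sequence would have looked roughly like 
this (a dry-run sketch: the helper only prints the commands, and the 
mount point and device names are placeholders):

```shell
# Dry-run sketch: print each command instead of executing it.
run() { echo "+ $*"; }

# 1. Convert data out of RAID10 while all four devices are still attached:
run btrfs balance start -dconvert=raid1 /mnt/a
# 2. Only then shrink the array, one device at a time:
run btrfs device remove /dev/sde1 /mnt/a
run btrfs device remove /dev/sdf1 /mnt/a
```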


--
Best regards,
 Vladimir


Re: "kernel BUG" and segmentation fault with "device delete"

2019-07-05 Thread Vladimir Panteleev

On 05/07/2019 04.39, Vladimir Panteleev wrote:
The process reached a point where the last missing device shows as 
containing 20 GB of RAID1 metadata. At this point, attempting to delete 
the device causes the operation to fail shortly with "No space left", 
followed by a "kernel BUG at fs/btrfs/relocation.c:2499!", and causes 
the "btrfs device delete" command to crash with a segmentation fault.


Same effect if I try to use btrfs-replace on the missing device (which 
works) and then try to delete it (which fails in the same way).


Also same effect if I try to balance the metadata (balance start -v -m 
/mnt/a).


At this point this doesn't look like it is at all related to RAID10 or 
btrfs_check_rw_degradable, just a bug somewhere with handling something 
weird in the filesystem.


I'm out of ideas, so suggestions welcome. :)

--
Best regards,
 Vladimir


"kernel BUG" and segmentation fault with "device delete"

2019-07-04 Thread Vladimir Panteleev

Hi,

I'm trying to convert a data=RAID10,metadata=RAID1 (4 disks) array to 
RAID1 (2 disks). The array was less than half full, and I disconnected 
two parity drives, leaving two that contained one copy of all data.


After stubbing out btrfs_check_rw_degradable (because btrfs currently 
can't realize when it has all drives needed for RAID10), I've 
successfully mounted rw+degraded, balance-converted all RAID10 data to 
RAID1, and then btrfs-device-delete-d one of the missing drives. It 
fails at deleting the second.


The process reached a point where the last missing device shows as 
containing 20 GB of RAID1 metadata. At this point, attempting to delete 
the device causes the operation to fail shortly with "No space left", 
followed by a "kernel BUG at fs/btrfs/relocation.c:2499!", and causes 
the "btrfs device delete" command to crash with a segmentation fault.


Here is the information about the filesystem:

https://dump.thecybershadow.net/55d558b4d0a59643e24c6b4ee9019dca/04%3A28%3A23-upload.txt

And here is the dmesg output (with enospc_debug):

https://dump.thecybershadow.net/9d3811b85d078908141a30886df8894c/04%3A28%3A53-upload.txt

Attempting to unmount the filesystem causes another warning:

https://dump.thecybershadow.net/6d6f2353cd07cd8464ece7e4df90816e/04%3A30%3A30-upload.txt

The umount command then hangs indefinitely.

Linux 5.1.15-arch1-1-ARCH, btrfs-progs v5.1.1

--
Best regards,
 Vladimir


4.20: "btrfs_run_delayed_refs:2978: errno=-28 No space left" >100GB unallocated / >400G free?

2019-02-25 Thread Vladimir Panteleev

Hello,

I am having this problem again with my RAID10 btrfs filesystem.

I have some nightly cronjobs which create and copy over snapshots.
Lately, they cause the filesystem to crash with "No space left", despite 
there being more than a hundred GB unallocated on all drives.


Following the advice in the previous thread [0], I enabled enospc_debug. 
Here is the dmesg output before the crash:


[55162.474274] BTRFS info (device sde1): space_info total=248034361344, 
used=246433710080, pinned=954941440, reserved=41959424, 
may_use=603439104, readonly=131072
[55162.484670] BTRFS info (device sde1): space_info 4 has 0 free, is not 
full
[55162.484673] BTRFS info (device sde1): space_info total=248034361344, 
used=246433710080, pinned=954941440, reserved=42254336, 
may_use=603324416, readonly=131072
[55162.613325] BTRFS info (device sde1): space_info 4 has 180224 free, 
is not full
[55162.613327] BTRFS info (device sde1): space_info total=248034361344, 
used=246433710080, pinned=954941440, reserved=45547520, 
may_use=599851008, readonly=131072

[55236.827153] [ cut here ]
[55236.827155] BTRFS: block rsv returned -28
[55236.827221] WARNING: CPU: 11 PID: 22800 at 
fs/btrfs/extent-tree.c:8230 btrfs_alloc_tree_block+0x21b/0x5b0 [btrfs]
[55236.827222] Modules linked in: xt_REDIRECT tcp_diag inet_diag 
scsi_transport_iscsi fuse ccm xt_nat vhost_net vhost tap iptable_mangle 
xt_CHECKSUM xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp 
ebtable_filter ebtables devlink ip6table_filter ip6_tables 
iptable_filter sit tunnel4 ip_tunnel bnep uinput it87 hwmon_vid 8021q 
garp mrp ipt_MASQUERADE iptable_nat nls_iso8859_1 nf_nat_ipv4 nls_cp437 
nf_nat vfat nf_conntrack fat nf_defrag_ipv6 nf_defrag_ipv4 intel_rapl 
x86_pkg_temp_thermal intel_powerclamp coretemp arc4 kvm_intel amdgpu kvm 
irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ath9k 
aesni_intel uvcvideo chash amd_iommu_v2 aes_x86_64 crypto_simd 
ath9k_common gpu_sched cryptd videobuf2_vmalloc ath9k_hw 
videobuf2_memops mxm_wmi iTCO_wdt glue_helper ath3k btusb 
snd_hda_codec_hdmi videobuf2_v4l2 iTCO_vendor_support i2c_algo_bit 
intel_cstate ath snd_hda_codec_realtek snd_hda_codec_generic mac80211 
ttm videobuf2_common btrtl intel_uncore btbcm btintel snd_usb_audio 
drm_kms_helper
[55236.827246]  snd_hda_intel videodev snd_usbmidi_lib bluetooth 
intel_rapl_perf snd_hda_codec snd_rawmidi i2c_i801 drm snd_seq_device 
snd_hda_core mousedev pl2303 snd_hwdep media rndis_host input_leds 
cdc_ether joydev usbnet r8169 snd_pcm cfg80211 ecdh_generic mii agpgart 
crc16 lpc_ich realtek snd_timer syscopyarea sysfillrect sysimgblt snd 
e1000e mei_me libphy mei fb_sys_fops soundcore rfkill wmi pcc_cpufreq 
bridge stp llc evdev tun mac_hid sg crypto_user ip_tables x_tables btrfs 
libcrc32c crc32c_generic xor raid6_pq sr_mod sd_mod cdrom hid_generic 
usbhid hid isci libsas ahci scsi_transport_sas libahci xhci_pci libata 
crc32c_intel firewire_ohci xhci_hcd firewire_core ehci_pci scsi_mod 
crc_itu_t ehci_hcd
[55236.827269] CPU: 11 PID: 22800 Comm: btrfs Not tainted 
4.20.11-arch2-1-ARCH #1
[55236.827270] Hardware name: Gigabyte Technology Co., Ltd. To be filled 
by O.E.M./X79S-UP5, BIOS F5f 03/19/2014

[55236.827280] RIP: 0010:btrfs_alloc_tree_block+0x21b/0x5b0 [btrfs]
[55236.827282] Code: 48 c7 c7 40 c5 72 c0 89 44 24 28 e8 7f f1 5e f3 8b 
54 24 28 85 c0 0f 84 36 ff ff ff 89 d6 48 c7 c7 78 16 6e c0 e8 af 14 e5 
f2 <0f> 0b e9 21 ff ff ff 49 8b 84 24 f0 01 00 00 48 8b 74 24 37 48 89

[55236.827282] RSP: 0018:be0ca27b76e0 EFLAGS: 00010282
[55236.827283] RAX:  RBX: a381f884c000 RCX: 

[55236.827284] RDX: 0007 RSI: b44a42fe RDI: 

[55236.827285] RBP: 4000 R08: 0001 R09: 
3749
[55236.827285] R10: 0004 R11:  R12: 
a381e3c72000
[55236.827286] R13: a381dfcbdf08 R14: a381f884c130 R15: 
0001
[55236.827287] FS:  7fc4f3b6c8c0() GS:a381ffac() 
knlGS:

[55236.827288] CS:  0010 DS:  ES:  CR0: 80050033
[55236.827288] CR2: 01083ba8a398 CR3: 000d62dde004 CR4: 
001626e0

[55236.827289] Call Trace:
[55236.827297]  ? __set_page_dirty_nobuffers+0x10e/0x150
[55236.827306]  alloc_tree_block_no_bg_flush+0x47/0x50 [btrfs]
[55236.827315]  __btrfs_cow_block+0x11b/0x500 [btrfs]
[55236.827324]  btrfs_cow_block+0xdc/0x1a0 [btrfs]
[55236.827332]  btrfs_search_slot+0x368/0x990 [btrfs]
[55236.827342]  lookup_inline_extent_backref+0x186/0x610 [btrfs]
[55236.827354]  ? set_extent_bit+0x19/0x20 [btrfs]
[55236.827364]  ? update_block_group.isra.24+0x10d/0x3a0 [btrfs]
[55236.827374]  __btrfs_free_extent.isra.25+0xed/0x940 [btrfs]
[55236.827377]  ? _raw_spin_lock+0x13/0x30
[55236.827378]  ? _raw_spin_unlock+0x16/0x30
[55236.827390]  ? btrfs_merge_delayed_refs+0x315/0x350 [btrfs]
[55236.827400]  __btrfs_run_delayed_refs+0x6f2/0x10e0 [btrfs]
[55236.827410]  btrfs_r

Re: 4.13: "error in btrfs_run_delayed_refs:3009: errno=-28 No space left" with 1.3TB unallocated / 737G free?

2017-10-22 Thread Vladimir Panteleev

On 2017-10-19 12:14, Martin Raiber wrote:

You could also mount with
"enospc_debug" to give the devs more infos about this issue.
I am having more ENOSPC issues with 4.9.x than with the latest 4.14.


Here is the dmesg output with -o enospc_debug, hopefully it will be 
useful for someone:

https://dump.thecybershadow.net/266ad878bed7921ccca3a7f624df0cc7/scratch.txt


for me a work-around for something like this has been to reduce the
amount of dirty memory


In the end I deleted some snapshots, which seemed to free up enough 
metadata space for the balance to continue. After 25 hours, a balance 
with -dusage=10 finished and freed up 344GB of "Device unallocated" space.
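For reference, this kind of balance is usually run with a stepped usage 
filter, so each pass only relocates mostly-empty block groups (a dry-run 
sketch: the helper only prints the commands, and the thresholds are 
illustrative):

```shell
# Dry-run sketch: print each command instead of executing it.
run() { echo "+ $*"; }

for usage in 0 5 10 25 50; do
    run btrfs balance start -dusage=$usage /mnt/a
done
```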


--
Best regards,
 Vladimir
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 4.13: "error in btrfs_run_delayed_refs:3009: errno=-28 No space left" with 1.3TB unallocated / 737G free?

2017-10-19 Thread Vladimir Panteleev

On Tue, 17 Oct 2017 16:21:04 -0700, Duncan wrote:

* try the balance on 4.14-rc5+, where the known bug should be fixed


Thanks! However, I'm getting the same error on 4.14.0-rc5-g9aa0d2dde6eb. 
The stack trace is different, though:


[25886.024757] BTRFS: Transaction aborted (error -28)
[25886.024793] [ cut here ]
[25886.024807] WARNING: CPU: 3 PID: 1904 at fs/btrfs/extent-tree.c:7062 
__btrfs_free_extent.isra.24+0xc23/0xda0 [btrfs]
[25886.024808] Modules linked in: ctr fuse xt_nat vhost_net vhost tap 
xt_CHECKSUM iptable_mangle xt_conntrack ipt_REJECT nf_reject_ipv4 
xt_tcpudp ebtable_filter ebtables ip6table_filter ip6_tables 
iptable_filter devlink tun nls_utf8 cifs ccm dns_resolver fscache uinput 
it87 hwmon_vid ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat 
nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack 
libcrc32c crc32c_generic sit tunnel4 ip_tunnel snd_hda_codec_hdmi 8021q 
mrp snd_hda_codec_realtek snd_hda_codec_generic iTCO_wdt 
iTCO_vendor_support nls_iso8859_1 nls_cp437 mxm_wmi vfat fat 
nvidia_drm(PO) intel_rapl nvidia_modeset(PO) x86_pkg_temp_thermal 
intel_powerclamp nvidia(PO) coretemp kvm_intel kvm irqbypass 
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc arc4 aesni_intel 
aes_x86_64 ath9k crypto_simd
[25886.024830]  glue_helper cryptd ath9k_common ath9k_hw intel_cstate 
ath3k ath intel_rapl_perf btusb snd_hda_intel btrtl btbcm pl2303 btintel 
drm_kms_helper uvcvideo snd_usb_audio snd_hda_codec videobuf2_vmalloc 
snd_usbmidi_lib mac80211 bluetooth videobuf2_memops snd_rawmidi 
videobuf2_v4l2 ecdh_generic usbserial crc16 i2c_i801 lpc_ich cdc_acm 
snd_seq_device snd_hda_core videobuf2_core drm e1000e cfg80211 snd_hwdep 
snd_pcm syscopyarea r8169 sysfillrect snd_timer videodev sysimgblt mii 
rfkill mei_me ptp fb_sys_fops snd mousedev input_leds joydev evdev 
ioatdma led_class mac_hid media soundcore mei pps_core dca shpchp wmi 
bridge tpm_infineon tpm_tis tpm_tis_core stp llc tpm button sch_fq_codel 
sg ip_tables x_tables sr_mod cdrom btrfs xor zstd_decompress 
zstd_compress xxhash raid6_pq sd_mod hid_generic
[25886.024855]  hid_dr ff_memless usbhid hid crc32c_intel isci xhci_pci 
ahci libsas ehci_pci xhci_hcd scsi_transport_sas libahci ehci_hcd 
usbcore libata usb_common scsi_mod serio
[25886.024863] CPU: 3 PID: 1904 Comm: btrfs-transacti Tainted: P 
  O4.14.0-rc5-g9aa0d2dde6eb #2
[25886.024864] Hardware name: Gigabyte Technology Co., Ltd. To be filled 
by O.E.M./X79S-UP5, BIOS F5f 03/19/2014

[25886.024865] task: 880eb8f1d880 task.stack: c9000c81c000
[25886.024871] RIP: 0010:__btrfs_free_extent.isra.24+0xc23/0xda0 [btrfs]
[25886.024871] RSP: 0018:c9000c81fc28 EFLAGS: 00010282
[25886.024873] RAX: 0026 RBX: 0854ddb4 RCX: 

[25886.024873] RDX:  RSI: 880fff2cdc48 RDI: 
880fff2cdc48
[25886.024874] RBP: c9000c81fcd0 R08: 0613 R09: 
0007
[25886.024875] R10: 1000 R11: 0001 R12: 
880ec87c6000
[25886.024876] R13: ffe4 R14:  R15: 
880ff4f4a690
[25886.024877] FS:  () GS:880fff2c() 
knlGS:

[25886.024878] CS:  0010 DS:  ES:  CR0: 80050033
[25886.024879] CR2: 7f1c6cb9c0d0 CR3: 02c09003 CR4: 
001606e0

[25886.024880] Call Trace:
[25886.024887]  ? btrfs_previous_extent_item+0xe1/0x110 [btrfs]
[25886.024895]  ? btrfs_merge_delayed_refs+0x8c/0x550 [btrfs]
[25886.024901]  __btrfs_run_delayed_refs+0x6ee/0x12f0 [btrfs]
[25886.024909]  btrfs_run_delayed_refs+0x6b/0x250 [btrfs]
[25886.024916]  btrfs_commit_transaction+0x48/0x920 [btrfs]
[25886.024922]  ? start_transaction+0x99/0x420 [btrfs]
[25886.024929]  transaction_kthread+0x182/0x1b0 [btrfs]
[25886.024932]  kthread+0x125/0x140
[25886.024939]  ? btrfs_cleanup_transaction+0x520/0x520 [btrfs]
[25886.024940]  ? kthread_create_on_node+0x70/0x70
[25886.024942]  ret_from_fork+0x25/0x30
[25886.024944] Code: d7 e0 0f ff eb d0 44 89 ee 48 c7 c7 68 b7 40 a0 e8 
c4 8d d7 e0 0f ff e9 7c fb ff ff 44 89 ee 48 c7 c7 68 b7 40 a0 e8 ae 8d 
d7 e0 <0f> ff e9 00 f5 ff ff 8b 55 20 48 89 c1 49 89 d8 48 c7 c6 48 b8

[25886.024961] ---[ end trace 3570a54b286cb501 ]---
[25886.024966] BTRFS: error (device sda1) in __btrfs_free_extent:7062: 
errno=-28 No space left

[25886.024968] BTRFS info (device sda1): forced readonly
[25886.024969] BTRFS: error (device sda1) in 
btrfs_run_delayed_refs:3089: errno=-28 No space left


Aside from rebuilding the filesystem, what are my options? Should I try 
to temporarily add a file from another volume as a device and retry the 
balance? If so, what would be a good size for the temporary device?
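For concreteness, the sequence I have in mind is roughly this (a dry-run 
sketch: the helper only prints the commands, and the paths and the 20G 
size are guesses pending advice):

```shell
# Dry-run sketch: print each command instead of executing it.
run() { echo "+ $*"; }

run truncate -s 20G /other-volume/btrfs-tmp.img
run losetup --find --show /other-volume/btrfs-tmp.img   # suppose /dev/loop0
run btrfs device add /dev/loop0 /mnt/a
run btrfs balance start -dusage=10 /mnt/a
# Once the balance frees enough unallocated space:
run btrfs device remove /dev/loop0 /mnt/a
run losetup -d /dev/loop0
```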


--
Best regards,
 Vladimir
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


4.13: "error in btrfs_run_delayed_refs:3009: errno=-28 No space left" with 1.3TB unallocated / 737G free?

2017-10-17 Thread Vladimir Panteleev

Hi,

I'm experiencing some issues with a btrfs filesystem - mounting and 
other operations taking forever, a balance that takes hours to start and 
never completes (due to the error in the subject line), and I've also 
seen the NULL pointer dereference in free_reloc_roots which Naohiro Aota 
fixed in 4.13.5 (I was on 4.13.4 at the time).


The output of the relevant commands:

# uname -a
Linux home.thecybershadow.net 4.13.6-1-ARCH #1 SMP PREEMPT Thu Oct 12 
12:42:27 CEST 2017 x86_64 GNU/Linux


# btrfs --version
btrfs-progs v4.13

# df -h /mnt/a
Filesystem  Size  Used Avail Use% Mounted on
/dev/sdi1   5.5T  4.8T  737G  87% /mnt/a

# btrfs filesystem show /mnt/a
Label: none  uuid: f4162b8e-930d-49fa-bbea-56ea3bb544fa
Total devices 4 FS bytes used 4.74TiB
devid2 size 2.73TiB used 2.73TiB path /dev/sdi1
devid3 size 2.73TiB used 2.73TiB path /dev/sdb1
devid4 size 2.73TiB used 2.73TiB path /dev/sda1
devid7 size 2.73TiB used 2.73TiB path /dev/sde1

# btrfs filesystem df /mnt/a
Data, RAID10: total=5.34TiB, used=4.62TiB
System, RAID1: total=8.00MiB, used=608.00KiB
Metadata, RAID1: total=122.03GiB, used=121.37GiB
GlobalReserve, single: total=512.00MiB, used=149.00MiB

# btrfs filesystem usage /mnt/a
Overall:
Device size:  10.92TiB
Device allocated: 10.91TiB
Device unallocated:2.00GiB
Device missing:  0.00B
Used:  9.47TiB
Free (estimated):737.75GiB  (min: 737.75GiB)
Data ratio:   2.00
Metadata ratio:   2.00
Global reserve:  512.00MiB  (used: 149.00MiB)

Data,RAID10: Size:5.34TiB, Used:4.62TiB
   /dev/sda1   1.33TiB
   /dev/sdb1   1.33TiB
   /dev/sde1   1.33TiB
   /dev/sdi1   1.33TiB

Metadata,RAID1: Size:122.03GiB, Used:121.37GiB
   /dev/sda1  59.51GiB
   /dev/sdb1  61.51GiB
   /dev/sde1  61.52GiB
   /dev/sdi1  61.52GiB

System,RAID1: Size:8.00MiB, Used:608.00KiB
   /dev/sda1   8.00MiB
   /dev/sdb1   8.00MiB

Unallocated:
   /dev/sda1   1.34TiB
   /dev/sdb1   1.33TiB
   /dev/sde1   1.33TiB
   /dev/sdi1   1.33TiB

# btrfs device usage /mnt/a
/dev/sda1, ID: 4
   Device size: 2.73TiB
   Device slack:3.50KiB
   Data,RAID10: 1.33TiB
   Metadata,RAID1: 59.51GiB
   System,RAID1:8.00MiB
   Unallocated: 1.34TiB

/dev/sdb1, ID: 3
   Device size: 2.73TiB
   Device slack:3.50KiB
   Data,RAID10: 1.33TiB
   Metadata,RAID1: 61.51GiB
   System,RAID1:8.00MiB
   Unallocated: 1.33TiB

/dev/sde1, ID: 7
   Device size: 2.73TiB
   Device slack:  0.00B
   Data,RAID10: 1.33TiB
   Metadata,RAID1: 61.52GiB
   Unallocated: 1.33TiB

/dev/sdi1, ID: 2
   Device size: 2.73TiB
   Device slack:3.50KiB
   Data,RAID10: 1.33TiB
   Metadata,RAID1: 61.52GiB
   Unallocated: 1.33TiB

I admit I'm somewhat confused by some of these figures. If I'm reading 
this correctly, the metadata blocks are almost full; however, that 
shouldn't be an issue because there seems to be plenty of unallocated 
space. Or is there? The "Unallocated" figure in the "Overall:" section 
(2.00GiB) doesn't seem to match that in the "Unallocated:" section 
(1.33TiB per device), and my gut estimate is that the actual amount of 
data on the disks roughly matches df's output. In any case, I've tried 
deleting some 200GiB worth of files (which has not helped the situation 
in any way that I've noticed) and running a balance with -dusage=0 and 
-dusage=10 (which never completes due to the errors below).


Here's the timeline (with dmesg excerpts) of trying to mount the 
filesystem. The previous unmount was unclean due to the NULL pointer 
dereference, and it seems to attempt to resume the balance started a few 
mounts ago right after mounting.


(running mount command)

[  128.514335] BTRFS info (device sdb1): disk space caching is enabled
[  128.514340] BTRFS info (device sdb1): has skinny extents
[  128.827147] BTRFS info (device sdb1): bdev /dev/sdb1 errs: wr 61870, 
rd 18726, flush 1320, corrupt 0, gen 275
[  128.827154] BTRFS info (device sdb1): bdev /dev/sde1 errs: wr 0, rd 
36, flush 0, corrupt 0, gen 0
[  128.827161] BTRFS info (device sdb1): bdev /dev/sda1 errs: wr 0, rd 
457, flush 0, corrupt 0, gen 0
[  184.666715] BTRFS warning (device sdb1): block group 2378367500288 
has wrong amount of free space
[  184.666717] BTRFS warning (device sdb1): failed to load free space 
cache for block group 2378367500288, rebuilding it now
[  184.732017] BTRFS warning (device sdb1): block group 7880254160896 
has wrong amount of free space
[  184.732018] BTRFS warning (device sdb1): failed to load free space 
cache for block group 7880254160896, rebuilding it no