On 2020/12/29 8:59 AM, David Arendt wrote:
Hi,
Just for information: On my system the error appeared on a filesystem
using space cache v1. I think my problem might then be unrelated to this
one.
Then this is more interesting.
There are two locations which can return -ENOENT in delete_v1_space_cache():
- No file extent found
This means something is wrong in backref walk.
I don't believe this is even possible; otherwise qgroup and balance
would be completely broken.
- delete_block_group_cache() failed to grab the free space cache inode
There is another possibility: the free space cache inode exists in the
commit root (which data relocation reads from), but not in our current
root.
In that case, the -ENOENT is safe to ignore.
I guess you hit the 2nd case, as your next balance finishes without a
problem.
If it happens again, I will try to collect more information.
Maybe I should try a clear_cache to ensure that the space cache is not
wrong.
clear_cache itself won't remove all of the existing cache; it only
removes a block group's cache when that block group gets dirty.
Thus we use btrfs check to remove the free space cache completely.
Thanks,
Qu
Bye,
David Arendt
On 12/29/20 1:44 AM, Qu Wenruo wrote:
On 2020/12/29 7:39 AM, Qu Wenruo wrote:
On 2020/12/29 3:58 AM, Stéphane Lesimple wrote:
I know it fails in relocate_block_group(), which returns -2. I'm
currently adding a couple of printk()s here and there to try to
pinpoint it better.
Okay, so btrfs_relocate_block_group() starts with stage
MOVE_DATA_EXTENTS, which
completes successfully, as relocate_block_group() returns 0:
BTRFS info (device <unknown>): relocate_block_group:
prepare_to_realocate = 0
BTRFS info (device <unknown>): relocate_block_group loop: progress =
1, btrfs_start_transaction = ok
[...]
BTRFS info (device <unknown>): relocate_block_group loop: progress =
168, btrfs_start_transaction = ok
BTRFS info (device <unknown>): relocate_block_group: returning err = 0
BTRFS info (device dm-10): stage = move data extents,
relocate_block_group = 0
BTRFS info (device dm-10): found 167 extents, stage: move data extents
Then it proceeds to the UPDATE_DATA_PTRS stage and calls
relocate_block_group()
again. This time it fails at the 92nd iteration of the loop:
BTRFS info (device <unknown>): relocate_block_group loop: progress =
92, btrfs_start_transaction = ok
BTRFS info (device <unknown>): relocate_block_group loop:
extents_found = 92, item_size(53) >= sizeof(*ei)(24), flags = 1, ret
= 0
BTRFS info (device <unknown>): add_data_references:
btrfs_find_all_leafs = 0
BTRFS info (device <unknown>): add_data_references loop:
read_tree_block ok
BTRFS info (device <unknown>): add_data_references loop:
delete_v1_space_cache = -2
Damn it, if we find no v1 space cache for the block group, it means
we're fine to continue...
BTRFS info (device <unknown>): relocate_block_group loop:
add_data_references = -2
Then the -ENOENT goes all the way up the call stack and aborts the
balance.
So it fails in delete_v1_space_cache(), though it is worth noting that
the
FS we're talking about is actually using space_cache v2.
Space cache v2; no wonder there's no v1 space cache.
Does it help? Shall I dig deeper?
You're already at the point!
Mind if I craft a fix with your Signed-off-by?
The problem is more complex than I thought, but still we at least have
some workaround.
Firstly, this happens when an old fs gets v2 space cache enabled but
still has some v1 space cache left.
A newer kernel mounting with v2 should clean up v1 properly, but older
kernels don't do the proper cleaning, thus leaving some v1 cache behind.
Then running btrfs balance on such an old fs leads to the -ENOENT error.
We can't ignore the error, as we have no way to relocate such leftover
v1 cache (normally we delete it completely, but with v2 cache enabled,
we can't).
So all I can do is add a warning message for the problem.
To solve your problem, I also submitted a patch to btrfs-progs, to force
v1 space cache cleaning even if the fs has v2 space cache enabled.
Or, you can disable v2 space cache first, using "btrfs check
--clear-space-cache v2", then "btrfs check --clear-space-cache v1", and
finally mount the fs with "space_cache=v2" again.
To verify there is no v1 space cache left, you can run the following
command:
# btrfs ins dump-tree -t root <device> | grep EXTENT_DATA
It should output nothing.
Then please check whether you can balance all your data.
Thanks,
Qu
Thanks,
Qu
Regards,
Stéphane.