On Fri, Feb 19, 2021 at 10:01 PM Qu Wenruo <quwenruo.bt...@gmx.com> wrote:
> On 2021/2/20 12:28 PM, Erik Jensen wrote:
> > [...]
> > Brainstorming some ideas, is compacting the address space something
> > that could be done offline? E.g., maybe some two-pass process: first
> > something balance-like that bumps all of the metadata up to a compact
> > region of address space, starting at a new 16TiB boundary, and then a
> > follow up pass that just strips the top bits off?
>
> We need btrfs-progs support for off-line balancing.
>
> I used to have this idea, but saw very limited use for it.
>
> This would be the safest bet, but needs a lot of work, although in user
> space.

Would any of the chunks have to actually be physically moved on disk,
as happens in a real balance, or would it just be a matter of
adjusting the bytenrs in the relevant data structures? If the latter,
it seems like it could do something relatively straightforward: start
with the lowest in-use bytenr and adjust it to the first possible
bytenr, adjust the second-lowest to be just after it, et cetera.
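
To make what I'm picturing concrete, here's a very rough userspace
sketch of just the remapping step (the struct and function names are
made up, and the hard part, rewriting every bytenr reference in the
trees to match, is only a comment):

/*
 * Hypothetical sketch only: walk the chunks in logical (bytenr) order
 * and pack each one immediately after the previous one.  Every bytenr
 * reference in the chunk tree, extent tree, etc. would then need the
 * same old->new translation applied, which is the part omitted here.
 */
#include <stdint.h>
#include <stddef.h>

struct chunk_entry {
        uint64_t old_logical;   /* current chunk start (bytenr) */
        uint64_t new_logical;   /* compacted chunk start */
        uint64_t length;        /* chunk length in bytes */
};

/* chunks[] must already be sorted by old_logical, lowest first */
void assign_compacted_bytenrs(struct chunk_entry *chunks, size_t nr,
                              uint64_t first_bytenr)
{
        uint64_t next = first_bytenr;

        for (size_t i = 0; i < nr; i++) {
                chunks[i].new_logical = next;
                next += chunks[i].length;
        }
}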

While I'm sure this would still be a complex challenge, and would need
to take precautions like marking the filesystem unmountable while it's
working and keeping a journal of its progress in case of interruption,
maybe it'd be less onerous than reimplementing all of the rebalance logic
in userspace?

> > Or maybe once all of the bytenrs are brought within 16TiB of each
> > other by balance, btrfs could just keep track of an offset that needs
> > to be applied when mapping page cache indexes?
>
> But further balance/new chunk allocation can still go beyond the limit.
>
> This is the biggest problem, and one other filesystems don't need to
> deal with: we can dynamically allocate chunks while they can't.

That's true, but no more so than for the offline address-space
compaction option above, or for doing a backup/format/restore cycle.
Obviously it would be ideal if the issue didn't occur in the first
place, but given that it does, it would be nice if there were *some*
way to get the filesystem back into a usable state for a while at
least, even if it required temporarily hooking the drives up to a
64-bit system to do so.

Now, if I had known about the issue beforehand, I probably would have
unmounted the filesystem and used dd when changing my drive
encryption, rather than calling btrfs replace a bunch of times, in
which case I probably never would have triggered the issue in the
first place. :)

> > Or maybe btrfs could use multiple virtual inodes on 32-bit systems,
> > one for each 16TiB block of address space with metadata in it? If this
> > were to ever grow to need more than a handful of virtual inodes, it
> > seems like a balance *would* actually help in this case by compacting
> > the metadata higher in the address space, allowing the virtual inodes
> > for lower in the address space to be dropped.
>
> This may be a good idea.
>
> But the problem of test coverage is always here.
>
> We can spend tons of lines of code, but in the end it will not really
> be well tested, as it's really hard

I guess this would involve replacing btrfs_fs_info::btree_inode with
an xarray of inodes on 32-bit systems, allocating inodes as needed?
Inode structs seem to have a lot going on, and I definitely don't have
the knowledge base to judge whether this would be a tractable change
to implement. (E.g., would calling new_inode(fs_info->sb) whenever
needed cause any issues, or would it just work as expected?) It also
looks like chunk metadata can span more than one page, so another
question is whether those allocations can ever cross a 16 TiB
boundary. If so, it appears that would be much harder to make work.
(Presumably such boundary-spanning allocations could be prevented
going forward, but there could still be existing filesystems that
would have to be rejected.)
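
Just to check that I understand the shape of it, something like the
following is what I had in mind. Everything here is hypothetical: the
btree_inodes xarray field doesn't exist, I'm hand-waving the inode
setup, and a real version would need to handle two tasks racing to
create the same slice (xa_cmpxchg() or a lock):

#include <linux/fs.h>
#include <linux/xarray.h>
#include <linux/err.h>
#include "ctree.h"  /* btrfs_fs_info, assuming a new btree_inodes xarray */

/*
 * Hypothetical: one virtual btree inode per 16 TiB slice of the logical
 * address space, keyed by bytenr >> 44 (2^44 bytes == 16 TiB), so page
 * cache indexes within each inode stay below 2^32 on 32-bit systems.
 */
static struct inode *btree_inode_for_bytenr(struct btrfs_fs_info *fs_info,
                                            u64 bytenr)
{
        unsigned long slice = bytenr >> 44;
        struct inode *inode;

        inode = xa_load(&fs_info->btree_inodes, slice);
        if (inode)
                return inode;

        inode = new_inode(fs_info->sb);
        if (!inode)
                return ERR_PTR(-ENOMEM);

        /* ... the same setup btrfs does for its btree_inode today ... */

        if (xa_err(xa_store(&fs_info->btree_inodes, slice, inode, GFP_NOFS))) {
                iput(inode);
                return ERR_PTR(-ENOMEM);
        }
        return inode;
}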

> > Or maybe btrfs could just not use the page cache for the metadata
> > inode once the offset exceeds 16TiB, and only cache at the block
> > layer? This would surely hurt performance, but at least the filesystem
> > could still be accessed.
>
> I don't believe it's really possible, unless we override the XArray
> thing provided by MM completely and implement a btrfs-only structure.
>
> That's too costly.

Makes sense.

> > Given that this issue appears to be not due to the size of the
> > filesystem, but merely how much I've used it, having the only solution
> > be to copy all of the data off, reformat the drives, and then restore
> > every time filesystem usage exceeds a certain threshold is… not very
> > satisfying.
>
> Yeah, definitely not a good experience.
>
> >
> > Finally, I've never done kernel dev before, but I do have some C
> > experience, so if there is a solution that falls into the category of
> > seeming reasonable, likely to be accepted if implemented, but being
> > unlikely to get implemented given the low priority of supporting
> > 32-bit systems, let me know and maybe I can carve out some time to
> > give it a try.
> >
> BTW, if you want things like 64K page size, while still keeping the 4K
> sector size of your existing btrfs, then I guess you may be interested
> in the recent subpage support.
>
> That allows btrfs to mount a 4K sector size fs with a 64K page size.
>
> Unfortunately it's still WIP, but it may fit your use case, as ARM
> supports multiple page sizes (4K, 16K, 64K).
> (Although we are only going to support 64K pages for now.)

So, basically I'd need this change plus the Bootlin large page patch,
and then hope I never cross the 256 TiB mark for chunk metadata? (Or
at least, not until I find an AArch64 board that fits my needs.) Would
this conflict with your graceful error/warning patch at all? Is there
an easy way to see what my highest bytenr is today?
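
(Sanity-checking my own arithmetic on where those limits come from,
assuming the constraint really is a 32-bit page cache index multiplied
by the page size:)

#include <stdio.h>

int main(void)
{
        unsigned long long indexes = 1ULL << 32; /* 32-bit page cache index */

        /* addressable metadata address space = indexes * page size */
        printf("4K pages:  %llu TiB\n", indexes * 4096 >> 40);  /* 16 TiB  */
        printf("64K pages: %llu TiB\n", indexes * 65536 >> 40); /* 256 TiB */
        return 0;
}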

Also, I see read-only support went into 5.12. Do you have any idea
when write support will be ready for general use?

Thanks!
