On 02/13/18 13:54, Qu Wenruo wrote:
> On 2018年02月13日 20:26, Holger Hoffstätte wrote:
>> On 02/13/18 12:40, Qu Wenruo wrote:
>>>>> The problem is not about how much space it takes, but how many extents
>>>>> are here in the filesystem.
>>
>> I have no idea why btrfs' mount even needs to touch all block groups to
>> get going (which seems to be the root of the problem), but here's a
>> not so crazy idea for more "mechanical sympathy". Feel free to mock
>> me if this is terribly wrong or not possible. ;)
>>
>> Mounting of even large filesystems (with many extents) seems to be fine
>> on SSDS, but not so fine on rotational storage. We've heard that from
>> several people with large (multi-TB) filesystems, and obviously it's
>> even more terrible on 5400RPM drives because their seeks are sooo sloow.
>>
>> If the problem is that the bgs are touched/iterated in "tree order",
>> would it then not be possible to sort the block groups in physical order
>> before trying to load whatever mount needs to load?
> 
> This is in fact a good idea.
> Make block group into its own tree.

Well, that's not what I was thinking about at all..yet. :)
(keep in mind I'm not really that familiar with the internals).

Out of curiosity I ran a bit of perf on my own mount process, which is
fast (~700 ms) despite being a ~1.1TB fs, mixture of lots of large and
small files. Unfortunately it's also very fresh since I recreated it just
this weekend, so everything is neatly packed together and fast.

In contrast a friend's fs is ~800 GB, but has 11 GB metadata and is pretty
old and fragmented (but running an up-to-date kernel). His fs mounts in ~5s.

My perf run shows that the only interesting part responsible for mount time
is the nested loop in btrfs_read_block_groups calling find_first_block_group
(which got inlined & is not in the perf callgraph) over and over again,
accounting for 75% of time spent.

I now understand your comment why the real solution to this problem
is to move bgs into their own tree, and agree: both kitchens and databases
have figured out a long time ago that the key to fast scan and lookup
performance is to not put different things in the same storage container;
in the case of analytical DBMS this is columnar storage. :)

But what I originally meant was something much simpler and more
brute-force-ish. I see that btrfs_read_block_groups adds readahead
(is that actually effective?) but what I was looking for was the equivalent
of a DBMS' sequential scan. Right now finding (and loading) a bg seems to
involve a nested loop of tree lookups. It seems easier to rip through the
entire tree in nice 8MB chunks and discard what you don't need instead
of seeking around trying to find all the right bits in scattered order.

Could we alleviate cold mounts by starting more readaheads in
btrfs_read_block_groups, so that the extent tree is scanned more linearly?

cheers,
Holger

Attachment: signature.asc
Description: OpenPGP digital signature

Reply via email to