On 2018年02月14日 00:24, Holger Hoffstätte wrote:
> On 02/13/18 13:54, Qu Wenruo wrote:
>> On 2018年02月13日 20:26, Holger Hoffstätte wrote:
>>> On 02/13/18 12:40, Qu Wenruo wrote:
>>>>>> The problem is not about how much space it takes, but how many extents
>>>>>> are here in the filesystem.
>>>
>>> I have no idea why btrfs' mount even needs to touch all block groups to
>>> get going (which seems to be the root of the problem), but here's a
>>> not so crazy idea for more "mechanical sympathy". Feel free to mock
>>> me if this is terribly wrong or not possible. ;)
>>>
>>> Mounting of even large filesystems (with many extents) seems to be fine
>>> on SSDS, but not so fine on rotational storage. We've heard that from
>>> several people with large (multi-TB) filesystems, and obviously it's
>>> even more terrible on 5400RPM drives because their seeks are sooo sloow.
>>>
>>> If the problem is that the bgs are touched/iterated in "tree order",
>>> would it then not be possible to sort the block groups in physical order
>>> before trying to load whatever mount needs to load?
>>
>> This is in fact a good idea.
>> Make block group into its own tree.
> 
> Well, that's not what I was thinking about at all..yet. :)
> (keep in mind I'm not really that familiar with the internals).
> 
> Out of curiosity I ran a bit of perf on my own mount process, which is
> fast (~700 ms) despite being a ~1.1TB fs, mixture of lots of large and
> small files. Unfortunately it's also very fresh since I recreated it just
> this weekend, so everything is neatly packed together and fast.
> 
> In contrast a friend's fs is ~800 GB, but has 11 GB metadata and is pretty
> old and fragmented (but running an up-to-date kernel). His fs mounts in ~5s.
> 
> My perf run shows that the only interesting part responsible for mount time
> is the nested loop in btrfs_read_block_groups calling find_first_block_group
> (which got inlined & is not in the perf callgraph) over and over again,
> accounting for 75% of time spent.
> 
> I now understand your comment why the real solution to this problem
> is to move bgs into their own tree, and agree: both kitchens and databases
> have figured out a long time ago that the key to fast scan and lookup
> performance is to not put different things in the same storage container;
> in the case of analytical DBMS this is columnar storage. :)
> 
> But what I originally meant was something much simpler and more
> brute-force-ish. I see that btrfs_read_block_groups adds readahead
> (is that actually effective?) but what I was looking for was the equivalent
> of a DBMS' sequential scan. Right now finding (and loading) a bg seems to
> involve a nested loop of tree lookups. It seems easier to rip through the
> entire tree in nice 8MB chunks and discard what you don't need instead
> of seeking around trying to find all the right bits in scattered order.

The problem is, the tree (extent tree) containing block groups is very,
very very large.

It's a tree shared by all subvolumes. And since tree nodes and leaves
can be scattered around the whole disk, it's pretty hard to do batch
readahead.

> 
> Could we alleviate cold mounts by starting more readaheads in
> btrfs_read_block_groups, so that the extent tree is scanned more linearly?

Since extent tree is not linear, it won't be as effective as we believe.

Thanks,
Qu

> 
> cheers,
> Holger
> 

Attachment: signature.asc
Description: OpenPGP digital signature

Reply via email to