On 02/13/18 13:54, Qu Wenruo wrote: > On 2018年02月13日 20:26, Holger Hoffstätte wrote: >> On 02/13/18 12:40, Qu Wenruo wrote: >>>>> The problem is not about how much space it takes, but how many extents >>>>> are here in the filesystem. >> >> I have no idea why btrfs' mount even needs to touch all block groups to >> get going (which seems to be the root of the problem), but here's a >> not so crazy idea for more "mechanical sympathy". Feel free to mock >> me if this is terribly wrong or not possible. ;) >> >> Mounting of even large filesystems (with many extents) seems to be fine >> on SSDS, but not so fine on rotational storage. We've heard that from >> several people with large (multi-TB) filesystems, and obviously it's >> even more terrible on 5400RPM drives because their seeks are sooo sloow. >> >> If the problem is that the bgs are touched/iterated in "tree order", >> would it then not be possible to sort the block groups in physical order >> before trying to load whatever mount needs to load? > > This is in fact a good idea. > Make block group into its own tree.
Well, that's not what I was thinking about at all..yet. :) (keep in mind I'm not really that familiar with the internals). Out of curiosity I ran a bit of perf on my own mount process, which is fast (~700 ms) despite being a ~1.1TB fs, mixture of lots of large and small files. Unfortunately it's also very fresh since I recreated it just this weekend, so everything is neatly packed together and fast. In contrast a friend's fs is ~800 GB, but has 11 GB metadata and is pretty old and fragmented (but running an up-to-date kernel). His fs mounts in ~5s. My perf run shows that the only interesting part responsible for mount time is the nested loop in btrfs_read_block_groups calling find_first_block_group (which got inlined & is not in the perf callgraph) over and over again, accounting for 75% of time spent. I now understand your comment why the real solution to this problem is to move bgs into their own tree, and agree: both kitchens and databases have figured out a long time ago that the key to fast scan and lookup performance is to not put different things in the same storage container; in the case of analytical DBMS this is columnar storage. :) But what I originally meant was something much simpler and more brute-force-ish. I see that btrfs_read_block_groups adds readahead (is that actually effective?) but what I was looking for was the equivalent of a DBMS' sequential scan. Right now finding (and loading) a bg seems to involve a nested loop of tree lookups. It seems easier to rip through the entire tree in nice 8MB chunks and discard what you don't need instead of seeking around trying to find all the right bits in scattered order. Could we alleviate cold mounts by starting more readaheads in btrfs_read_block_groups, so that the extent tree is scanned more linearly? cheers, Holger
signature.asc
Description: OpenPGP digital signature