On Sat, Sep 14, 2019 at 12:29:09PM -0600, Chris Murphy wrote:
> On Fri, Sep 13, 2019 at 5:04 AM Austin S. Hemmelgarn
> <ahferro...@gmail.com> wrote:
> >
> > Do you have a source for this claim of a 128MB max extent size? Because
> > everything I've seen indicates the max extent size is a full data chunk
> > (so 1GB for the common case, potentially up to about 5GB for really big
> > filesystems)
>
> Yeah a block group can be a kind of "super extent". I think the
> EXTENT_DATA maxes out at 128M but they are often contiguous, for
> example
>
> item 308 key (5741459 EXTENT_DATA 0) itemoff 39032 itemsize 53
>     generation 241638 type 1 (regular)
>     extent data disk byte 193851400192 nr 134217728
>     extent data offset 0 nr 134217728 ram 134217728
>     extent compression 0 (none)
> item 309 key (5741459 EXTENT_DATA 134217728) itemoff 38979 itemsize 53
>     generation 241638 type 1 (regular)
>     extent data disk byte 193985617920 nr 134217728
>     extent data offset 0 nr 134217728 ram 134217728
>     extent compression 0 (none)
> item 310 key (5741459 EXTENT_DATA 268435456) itemoff 38926 itemsize 53
>     generation 241638 type 1 (regular)
>     extent data disk byte 194119835648 nr 134217728
>     extent data offset 0 nr 134217728 ram 134217728
>     extent compression 0 (none)
>
> Where FIEMAP has a different view (via filefrag -v)
>
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..  131071:   47327002..  47458073: 131072:
>    1:   131072..  294911:   47518701..  47682540: 163840:   47458074:
>    2:   294912..  360447:   50279681..  50345216:  65536:   47682541:
>    3:   360448..  499871:   50377984..  50517407: 139424:   50345217: last,eof
> Fedora-Workstation-Live-x86_64-31_Beta-1.1.iso: 4 extents found
>
> Those extents are all bigger than 128M. But they're each made up of
> contiguous EXTENT_DATA items.
>
> Also, the EXTENT_DATA size goes to a 128K max for any compressed
> files, so you get an explosive number of EXTENT_DATA items on
> compressed file systems, and thus metadata to rewrite.
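
For scale: FIEMAP merges physically adjacent EXTENT_DATA items, so a single "extent" in filefrag output can be several items on disk. A quick illustration using the lengths from the filefrag output above (filefrag lengths are in 4 KiB blocks; this is just my arithmetic, assuming the 128M per-item cap described above, so it's only a lower bound):

#!/usr/bin/env python3
# How many 128 MiB EXTENT_DATA items, at minimum, back each of the
# FIEMAP extents shown above.  filefrag -v lengths are in 4 KiB blocks.
BLOCK = 4096
MAX_EXTENT = 128 * 1024 * 1024          # per-item EXTENT_DATA cap

for blocks in (131072, 163840, 65536, 139424):
    size = blocks * BLOCK
    items = -(-size // MAX_EXTENT)      # ceiling division
    print(f"{size >> 20:4d} MiB -> at least {items} EXTENT_DATA items")

So each of those four "extents" is really two to five contiguous EXTENT_DATA items, consistent with the dump-tree items above.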
The compressed extents tend to be physically contiguous as well, so quantitatively they aren't much of a problem. There's more space used in metadata, but that is compensated for by less space used in data. In the subvol trees, extents that are logically contiguous in a file are also adjacent in key order, so they are packed densely in subvol metadata pages. In the extent, csum, and free space trees, when there is physical contiguity--or even just proximity--the extents' items are packed into the same metadata pages, keeping their costs down. That's true of all small extents, not just compressed ones. Note that contiguity isn't necessary for metadata space efficiency--the extents just have to be close together; they don't need to be seamless or in order (at least not for this reason).

Writes that are separated in _time_ are a different problem, and potentially much worse than the compression case. If a file consists of lots of extents that were written with significant allocations to other files in between, that file becomes a metadata monster that can create massive commit latencies when it is deleted or modified. If you unpack tarballs, build sources, rsync backup trees, or run any two or more writing tasks at the same time on a big btrfs filesystem, you can run into cases where the metadata:data ratio goes above 1.0 during updates _and_ the metadata is randomly distributed physically. Commits after a big delete run for hours.

> I wonder if instead of a rewrite of defragmenting, if there could be
> improvements to the allocator to write bigger extents. I guess the
> problem really comes from file appends? Smarter often means slower but
> perhaps it could be a variation on autodefrag?

Physically dispersed files can be fixed by defrag, but directory trees are a little different. The current defrag doesn't look at the physical distances between files, only at the extents within a single file, so it doesn't help when you have a big fragmented directory tree of many small, individually unfragmented files. IOW defrag helps with 'rm -f' performance but not 'rm -rf' performance.

Other filesystems have allocator heuristics that reserve space near growing files, or try to pre-divide the free space to spread out files that belong to different directories or are created by different processes. This is an attempt to fix the problem before it occurs, and sometimes it works; however, the heuristics have to match reality or they just make things worse, and the extra complexity breeds bugs--e.g. the recent fix for a bug where the allocator tried to give every thread its own block group, i.e. 20 threads writing 4K each could ENOSPC if there was less than 20GB of unallocated space.

I think the best approach may be to attack the problem quantitatively with an autodefrag agent: keep the write path fast and simple, but detect areas where problems are occurring--i.e. where the ratio of extent metadata locality to physical locality is low--and clean them up with some minimal data relocation (a rough sketch of the per-file half of that measurement follows below). Note that's somewhat different from what the current kernel autodefrag does. In absolute terms autodefrag is worse--ideally we'd just put the data in the right place from the start, not write it in the wrong place and then spend more iops fixing it later--but not all iops have equal cost. In some cases there is an opportunity to trade cheap iops at one time for expensive iops at a different time, and a userspace agent can invest more time, memory, and code complexity in that trade than the kernel can.
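
To make "detect areas where problems are occurring" a bit more concrete, here's a very rough sketch of the per-file half of that measurement. It only looks at how physically scattered each file's own extents are, by parsing filefrag -v output; the harder between-files part (small files far from their logical neighbors in the trees) would need dump-tree or the tree search ioctl and isn't shown. The 256 KiB threshold and the scoring are made-up numbers for illustration, not anything that exists in btrfs-progs or the kernel today:

#!/usr/bin/env python3
# Very rough sketch of the kind of scan a userspace autodefrag agent
# might do: walk a tree and flag files whose data is split into many
# small, physically non-adjacent runs, based on `filefrag -v` output
# (offsets and lengths are reported in 4 KiB blocks).  The threshold
# and scoring are illustrative guesses.
import os
import re
import subprocess
import sys

BLOCK = 4096
MIN_AVG_RUN = 256 * 1024          # flag files with smaller average runs

# Matches the extent lines of `filefrag -v`, e.g.:
#   "   1:   131072..  294911:   47518701..  47682540: 163840:   47458074:"
EXTENT_RE = re.compile(
    r"^\s*\d+:\s*\d+\.\.\s*\d+:\s*(\d+)\.\.\s*(\d+):\s*(\d+):")

def runs_for(path):
    """Return (bytes_mapped, number_of_physically_discontiguous_runs)."""
    out = subprocess.run(["filefrag", "-v", path], capture_output=True,
                         text=True, check=True).stdout
    total, runs, prev_end = 0, 0, None
    for line in out.splitlines():
        m = EXTENT_RE.match(line)
        if not m:
            continue
        phys_start, phys_end, length = map(int, m.groups())
        total += length * BLOCK
        # A new run starts whenever this extent is not physically
        # adjacent to the previous one.
        if prev_end is None or phys_start != prev_end + 1:
            runs += 1
        prev_end = phys_end
    return total, runs

def main(root):
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                total, runs = runs_for(path)
            except (OSError, subprocess.CalledProcessError):
                continue
            if runs > 1 and total / runs < MIN_AVG_RUN:
                print(f"defrag candidate: {path} "
                      f"({runs} runs, avg {total // runs} bytes/run)")

if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else ".")

Something along these lines only finds the 'rm -f' kind of problem; the 'rm -rf' kind needs the tree-level view mentioned above.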
Some back-of-the-envelope math says we don't need to do very much post-processing work to deal with the very worst cases: keep extent sizes over a few hundred KB, and keep small files not more than about 5-10 metadata items away from their logical neighbors, and we avoid the worst-case 12.0 metadata-to-data ratios during updates. Compared to those, other inefficiencies are trivial.
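
For concreteness, here's one way to land on a worst-case figure in that neighborhood--this is my own illustrative accounting, not numbers pulled from the code: a lone 4 KiB extent that has drifted away from its neighbors can dirty roughly one 16 KiB metadata page in each of the subvol, extent, and csum trees when it's updated or deleted, so:

    3 trees * 16 KiB dirtied per tree / 4 KiB data   = 12.0
    3 trees * 16 KiB dirtied per tree / 256 KiB data = 0.1875

i.e. keeping extents in the hundreds-of-KB range (and small files' items near their neighbors, so they share dirtied pages) takes the worst case from roughly 12x metadata write amplification down to a few percent.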