Package: linux-image-amd64
Version: 4.9+80

Debian's use of the SLAB allocator, combined with ongoing kernel changes, 
means the ext4 inode cache wastes ~21% of the space allocated to it on 
recent amd64 kernels, a regression from the ~2% waste in jessie.

SLAB enforces an order-0 allocation (i.e. a single 4KB page on x86[-64]) 
for slabs containing VFS-reclaimable objects such as ext4_inode_info:
http://elixir.free-electrons.com/linux/v4.9.25/source/mm/slab.c#L1827
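
To illustrate, here is a toy userspace model of that order selection - my 
paraphrase of v4.9's calculate_slab_order(), not the real code:

  #include <stdio.h>

  /* Toy version of SLAB's order selection for a reclaimable cache:
   * accept the first (lowest) page order whose slab holds at least
   * one object, as calculate_slab_order() does when the cache has
   * SLAB_RECLAIM_ACCOUNT set (paraphrased, not the actual logic). */
  int main(void)
  {
          const unsigned page_size = 4096;  /* x86-64 */
          const unsigned objsize = 1072;    /* ext4_inode_info, see below */
          unsigned order;

          for (order = 0; order <= 3; order++) {
                  unsigned slab = page_size << order;
                  unsigned num = slab / objsize;
                  if (num >= 1) {
                          printf("order %u: %u obj/slab, %u bytes (%.1f%%) wasted\n",
                                 order, num, slab - num * objsize,
                                 100.0 * (slab - num * objsize) / slab);
                          break;  /* reclaimable cache: stop at first fit */
                  }
          }
          return 0;
  }

For a 1072-byte object it prints "order 0: 3 obj/slab, 880 bytes (21.5%) 
wasted".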

In jessie's Linux 3.16 kernel, an ext4_inode_cache entry is ~1000 bytes, so 
four fit nicely in a slab. Additions to this structure and its members have 
increased it to ~1072 bytes in 4.9.25 (on a machine with 32 logical cores):

  # grep ext4_inode_cache /proc/slabinfo
  # name             <active_objs> <num_objs> <objsize> <objperslab>
  ext4_inode_cache             956        987      1072            3  …

…leaving 4096 - 3×1072 = 880 bytes (~21%) wasted per slab in Debian 
stretch (and jessie-backports).

Having 3 objects rather than 4 per slab may make partially-used slabs 
easier to reclaim, but inodes can't linger in the cache for as long, and 
re-creating them evicts other data, leading to increased disk activity. 
Slab cache allocation takes time, and if the slabs were denser, more 
inodes (or other content) could fit in the CPU cache.

By comparison, mainline's default SLUB allocator (used by Ubuntu) seems to 
use a 4-page/16KB or 8-page/32KB slab for this cache, which fits 15 or 30 
ext4_inode_cache objects respectively (~2% waste). Objects per slab have 
also decreased since 3.16, but the result is not nearly as wasteful.

The inode cache is initially small, but may grow to ~50% of RAM under 
heavy workloads, e.g. rsync runs on a fileserver.
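
If it helps anyone reproduce this, here is a small sketch that tracks that 
growth from userspace; it just parses the ext4_inode_cache line of 
/proc/slabinfo shown above (reading that file typically requires root):

  #include <stdio.h>

  int main(void)
  {
          FILE *f = fopen("/proc/slabinfo", "r");
          char line[512];
          unsigned long active, num, objsize;

          if (!f) {
                  perror("/proc/slabinfo");
                  return 1;
          }
          while (fgets(line, sizeof line, f)) {
                  /* format: name <active_objs> <num_objs> <objsize> ... */
                  if (sscanf(line, "ext4_inode_cache %lu %lu %lu",
                             &active, &num, &objsize) == 3) {
                          printf("ext4 inode objects: %lu active, ~%lu KB total\n",
                                 active, num * objsize / 1024);
                          break;
                  }
          }
          fclose(f);
          return 0;
  }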

== Possible workarounds/resolutions ==

A custom-compiled kernel with the right options reduces ext4_inode_cache 
object size to 1024 bytes or less - for me, it cut ~160MB from the slab 
cache on an active 32GB web app/file server with nightly rsync. (It may 
also reduce CPU and disk utilization, but the load in question is not 
constant enough to benchmark.)

Some flags have a big impact on ext4_inode_info
(and subsidiary structs such as rw_semaphore):
http://elixir.free-electrons.com/linux/v4.9.25/source/fs/ext4/ext4.h#L937
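
As a compilable illustration of the sort of fields those options gate, 
based on my reading of 4.9's fs/ext4/ext4.h and the VFS struct inode - the 
types are local stand-ins and the list is not exhaustive:

  #include <stdio.h>

  typedef long long qsize_t;              /* stand-in for the kernel type */
  struct fscrypt_info;                    /* opaque, pointer-only */

  #define CONFIG_QUOTA 1                  /* flip to 0 to compare sizes */
  #define CONFIG_EXT4_FS_ENCRYPTION 1

  struct ext4_option_gated_fields {       /* extract, not the full struct */
  #if CONFIG_QUOTA
          qsize_t i_reserved_quota;       /* ext4's own quota field */
          void *i_dquot[3];               /* from the embedded VFS inode */
  #endif
  #if CONFIG_EXT4_FS_ENCRYPTION
          struct fscrypt_info *i_crypt_info;
  #endif
          unsigned long i_flags;          /* placeholder unconditional field */
  };

  int main(void)
  {
          printf("%zu bytes\n", sizeof(struct ext4_option_gated_fields));
          return 0;
  }

With both options enabled, the gated fields in this extract alone come to 
8 + 24 + 8 = 40 bytes on amd64, before counting anything the options add 
to other structures.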

The precise sizes change with kernel version and CPU configuration. For 
jessie-backports' Linux 4.7.8, disabling ext4 encryption 
(CONFIG_EXT4_FS_ENCRYPTION) _and_ either:
  a) VFS quota (CONFIG_QUOTA; OCFS2 must be disabled first), or
  b) optimistic rw_semaphore spinning (CONFIG_RWSEM_SPIN_ON_OWNER)
reduced ext4_inode_cache objects to 1008-1016 bytes - sufficient to fit 
four inodes in a slab. This worked on 4.8.7 as well, reducing the size to 
exactly 1024 bytes.

But custom compilation is time-consuming and workload-dependent. Dropping 
ext4 encryption and quota is fine for our purposes, but Debian as a whole 
may not want to.

Disabling optimistic semaphore owner spinning - perhaps only below a 
certain number of cores? - may be part of a general solution. There's no 
menu option for CONFIG_RWSEM_SPIN_ON_OWNER, so it has to be set in the 
build config, or possibly on the command line.
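
For context on why that option matters here: it adds two fields to every 
rw_semaphore, and ext4_inode_info embeds three of those, so the growth is 
tripled. A simplified, compilable sketch, with stand-in types sized as on 
non-debug amd64 (the real layout is in include/linux/rwsem.h):

  #include <stdio.h>

  typedef struct { long counter; } atomic_long_t;        /* stand-ins */
  struct list_head { struct list_head *next, *prev; };
  typedef struct { unsigned int slock; } raw_spinlock_t;
  struct optimistic_spin_queue { int tail; };

  #define CONFIG_RWSEM_SPIN_ON_OWNER 1    /* flip to 0 to compare */

  struct rw_semaphore {
          atomic_long_t count;
          struct list_head wait_list;
          raw_spinlock_t wait_lock;
  #if CONFIG_RWSEM_SPIN_ON_OWNER
          struct optimistic_spin_queue osq;   /* MCS spinner queue */
          void *owner;                        /* writer to spin on */
  #endif
  };

  int main(void)
  {
          printf("one rwsem: %zu bytes; three: %zu\n",
                 sizeof(struct rw_semaphore),
                 3 * sizeof(struct rw_semaphore));
          return 0;
  }

On amd64 this gives 40 bytes per semaphore with the option on vs. 32 with 
it off, i.e. roughly 24 bytes across the three semaphores in each inode.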

https://lkml.org/lkml/2014/8/3/120 suggests optimistic spinning improves 
some contention-heavy workloads - or at least benchmarks thereof - but it 
may not be worth the trade-off by default. Incidentally, I found no 
documentation warning that it may increase memory usage.

Getting into more significant code changes: Ted Ts'o shrank 
ext4_inode_info by 8% six years ago:
http://linux-ext4.vger.kernel.narkive.com/D3sK9Flg/patch-0-6-shrinking-the-size-of-ext4-inode-info

…but it has since grown ~22%, due to features such as ext4 encryption, 
project-based quota, and the aforementioned optimistic spinning on the three 
read-write semaphores in the struct:
https://github.com/torvalds/linux/commit/4fc828e24cd9c385d3a44e1b499ec7fc70239d8a
https://github.com/torvalds/linux/commit/ce069fc920e5734558b3d9cbef1ab06cf01ee793
https://lwn.net/Articles/697603/

Ted mentioned that "it would be possible to further slim down the 
ext4_inode_cache by another 100 bytes or so, by breaking the ext4_inode_info 
into the portion of the inode required [when] a file is opened for writing, 
and everything else."

This might be worth it, given that we're on the borderline, and particularly 
if rw_semaphore is included; there are attempts to make those even bigger:
http://lists-archives.com/linux-kernel/28643980-locking-rwsem-enable-count-based-spinning-on-reader.html

Adding a #define to configure out project quota (kprojid_t i_projid) may 
cut a few bytes - or more, given alignment? I don't know whether this 
would negatively impact filesystems that use project quotas, beyond the 
feature not working. At least it would give another knob to tweak.
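
A hypothetical sketch of that knob - CONFIG_EXT4_PROJECT_QUOTA is an 
invented name, ext4 has no such option today, and kprojid_t here is a 
local stand-in:

  #include <stdio.h>

  typedef struct { unsigned int val; } kprojid_t;  /* stand-in, 4 bytes */

  #define CONFIG_EXT4_PROJECT_QUOTA 0     /* invented; flip to 1 to keep */

  struct ext4_inode_info_fragment {
          unsigned long i_flags;           /* example neighbouring field */
  #if CONFIG_EXT4_PROJECT_QUOTA
          kprojid_t i_projid;              /* 4 bytes + alignment padding */
  #endif
          void *i_crypt_info;              /* example neighbouring field */
  };

  int main(void)
  {
          printf("%zu bytes\n", sizeof(struct ext4_inode_info_fragment));
          return 0;
  }

With the field in place this fragment is 24 bytes (4 of them padding); 
without it, 16 - so what removal actually saves depends on the field's 
neighbours, as suspected above.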

Adjusting struct alignment may also be beneficial, either in all cases or 
based on the presence/absence of flags, as in 
https://patchwork.ozlabs.org/patch/62051/
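
To show how much ordering alone can matter, a self-contained example - 
nothing here is ext4 code, it just demonstrates the padding rules on 
amd64:

  #include <stdio.h>

  /* Same four members, different order: 8-byte members must start
   * on 8-byte boundaries, so each int followed by a long leaves a
   * 4-byte hole. */
  struct holey   { long a; int b; long c; int d; };  /* 32 bytes */
  struct ordered { long a; long c; int b; int d; };  /* 24 bytes */

  int main(void)
  {
          printf("%zu vs. %zu bytes\n",
                 sizeof(struct holey), sizeof(struct ordered));
          return 0;
  }

(pahole, from the dwarves package, reports such holes directly on a 
compiled kernel.)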

ext4_inode_info appears to contain a copy of the 256-byte on-disk format. 
Maybe it's feasible to use some of this in-place rather than duplicating it 
and writing it back later? Or it could be separated into its own object; 
it's a nice round size.  (In-place use may violate style guidelines, if 
nothing else…)

Lastly, 32-bit and uniprocessor kernels have a far smaller 
ext4_inode_cache object - I got one down to 560 bytes (7 obj/slab, ~4% 
waste) - and may remain beneficial where RAM is strictly limited (VMs in 
particular).

== SLAB vs. SLUB ==

Debian's use of SLAB allocation (vs. SLUB) might also be reconsidered, but 
I'm not sure this would be as useful as simply reducing the inode size.

Both allocators appear to have improved over time (e.g. SLAB got 1-byte 
freelist entries). If anything, SLAB has had more work recently.

The view in 2012 appeared to be that SLUB was less suitable for 
multiprocessor systems than SLAB:
https://lists.debian.org/debian-kernel/2012/03/msg00944.html

And while Linus seems to want to get rid of SLAB:
http://marc.info/?l=linux-mm&m=147423350524545&w=2

... it seems SuSE also still uses it:
http://marc.info/?l=linux-mm&m=147426644529856&w=2

In fact, the fragmentation problem discussed there might have been avoided 
with SLAB because its order-0 slabs would have soaked up free 4K blocks:
http://marc.info/?l=linux-mm&m=147422898523307&w=2

Reducing structure size would benefit every allocator, so that should 
probably be the focus.
--
Laurence "GreenReaper" Parry - Inkbunny administrator
greenreaper.co.uk - wikifur.com - flayrah.com - inkbunny.net
"Eternity lies ahead of us, and behind. Have you drunk your fill?" 
