Status changed to 'Confirmed' because the bug affects multiple users.
** Changed in: zfs-linux (Ubuntu)
Status: New => Confirmed
--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to zfs-linux in Ubuntu.
https://bugs.launchpad.net/bugs/2099745
Title:
Poor code generation in shipped zfs.ko resulting in ~5x zstd
decompression (and likely compression too) slowdown. Freshly built
zfs.ko from zfs-dkms is fine
Status in zfs-linux package in Ubuntu:
Confirmed
Bug description:
1. ********* System info *********:
- lscpu | grep -E "Model name|Flags"
Model name: Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
Flags: fpu vme de pse tsc msr pae mce cx8 apic
sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good
nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl
vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe
popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch
cpuid_fault epb pti ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid
ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap
clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp
hwp_notify hwp_act_window hwp_epp vnmi md_clear flush_l1d arch_capabilities
- lsb_release -rd
Description: Ubuntu 24.04.2 LTS
Release: 24.04
- apt-cache policy linux-modules-6.8.0-53-generic zfs-dkms
linux-modules-6.8.0-53-generic:
Installed: 6.8.0-53.55
Candidate: 6.8.0-53.55
Version table:
*** 6.8.0-53.55 500
500 http://gb.archive.ubuntu.com/ubuntu noble-updates/main amd64 Packages
100 /var/lib/dpkg/status
zfs-dkms:
Installed: 2.2.2-0ubuntu9.1
Candidate: 2.2.2-0ubuntu9.1
Version table:
*** 2.2.2-0ubuntu9.1 500
500 http://gb.archive.ubuntu.com/ubuntu noble-updates/universe amd64 Packages
100 /var/lib/dpkg/status
2.2.2-0ubuntu9 500
500 http://gb.archive.ubuntu.com/ubuntu noble/universe amd64 Packages
2. ************ Summary *************:
- Poor code generation in the main zstd decompression loop of the shipped
zfs.ko results in a ~5x decompression slowdown compared to a freshly built
zfs.ko from zfs-dkms. This issue likely affects zstd compression as well.
3. *********** Reproduction *********:
- Create a test dataset with encryption disabled (to rule out encryption as
a possible cause):
sudo zfs create -o encryption=off -o recordsize=128k -o compression=zstd rpool/test
- Get some compressible file around ~500MB in size (anything will do as
long as it is compressible). For example:
wget https://download.qt.io/archive/qt/4.8/4.8.2/qt-everywhere-opensource-src-4.8.2.tar.gz
gzip -d qt-everywhere-opensource-src-4.8.2.tar.gz
- Benchmark sequential reads with fio:
mv -v qt-everywhere-opensource-src-4.8.2.tar qt-everywhere-opensource-src-4.8.2.tar.0.0
zpool sync
# Run the fio command a few times to ensure the data is in the ARC.
fio --name=qt-everywhere-opensource-src-4.8.2.tar --bs=128k --rw=read --numjobs=1 --iodepth=1 --size=500M --group_reporting --loops=10
4. ********** Results **********:
- The fio command in step 3 benchmarks sequential cached reads from the ARC.
We are therefore measuring single-threaded zstd decompression performance
in zfs.ko.
- With the shipped zfs.ko:
qt-everywhere-opensource-src-4.8.2.tar: (g=0): rw=read, bs=(R) 128KiB-128KiB,
(W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=psync, iodepth=1
fio-3.36
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=236MiB/s][r=1886 IOPS][eta 00m:00s]
qt-everywhere-opensource-src-4.8.2.tar: (groupid=0, jobs=1): err= 0:
pid=4121: Thu Feb 27 15:50:38 2025
read: IOPS=1625, BW=203MiB/s (213MB/s)(5000MiB/24610msec)
clat (usec): min=18, max=4338, avg=613.58, stdev=469.32
lat (usec): min=18, max=4339, avg=613.62, stdev=469.32
clat percentiles (usec):
| 1.00th=[ 21], 5.00th=[ 23], 10.00th=[ 23], 20.00th=[ 25],
| 30.00th=[ 208], 40.00th=[ 523], 50.00th=[ 668], 60.00th=[ 766],
| 70.00th=[ 889], 80.00th=[ 1037], 90.00th=[ 1172], 95.00th=[ 1385],
| 99.00th=[ 1713], 99.50th=[ 1811], 99.90th=[ 2180], 99.95th=[ 2409],
| 99.99th=[ 2966]
bw ( KiB/s): min=111360, max=355072, per=100.00%, avg=208216.82,
stdev=83337.72, samples=49
iops : min= 870, max= 2774, avg=1626.69, stdev=651.08, samples=49
lat (usec) : 20=0.16%, 50=26.87%, 100=1.18%, 250=2.73%, 500=7.59%
lat (usec) : 750=19.73%, 1000=18.87%
lat (msec) : 2=22.64%, 4=0.25%, 10=0.01%
cpu : usr=0.20%, sys=99.50%, ctx=178, majf=15, minf=42
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
issued rwts: total=40000,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=203MiB/s (213MB/s), 203MiB/s-203MiB/s (213MB/s-213MB/s),
io=5000MiB (5243MB), run=24610-24610msec
- With zfs.ko from zfs-dkms:
qt-everywhere-opensource-src-4.8.2.tar: (g=0): rw=read, bs=(R) 128KiB-128KiB,
(W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=psync, iodepth=1
fio-3.36
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=1152MiB/s][r=9219 IOPS][eta 00m:00s]
qt-everywhere-opensource-src-4.8.2.tar: (groupid=0, jobs=1): err= 0:
pid=692842: Fri Feb 21 13:11:39 2025
read: IOPS=8871, BW=1109MiB/s (1163MB/s)(5000MiB/4509msec)
clat (usec): min=9, max=770, avg=112.26, stdev=96.85
lat (usec): min=9, max=770, avg=112.29, stdev=96.85
clat percentiles (usec):
| 1.00th=[ 11], 5.00th=[ 12], 10.00th=[ 12], 20.00th=[ 16],
| 30.00th=[ 23], 40.00th=[ 25], 50.00th=[ 122], 60.00th=[ 151],
| 70.00th=[ 174], 80.00th=[ 200], 90.00th=[ 229], 95.00th=[ 249],
| 99.00th=[ 322], 99.50th=[ 553], 99.90th=[ 644], 99.95th=[ 676],
| 99.99th=[ 725]
bw ( MiB/s): min= 936, max= 1466, per=100.00%, avg=1109.14,
stdev=148.75, samples=9
iops : min= 7488, max=11734, avg=8873.11, stdev=1190.03, samples=9
lat (usec) : 10=0.10%, 20=20.52%, 50=22.47%, 100=3.51%, 250=48.45%
lat (usec) : 500=4.35%, 750=0.60%, 1000=0.01%
cpu : usr=0.67%, sys=99.29%, ctx=15, majf=0, minf=42
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
issued rwts: total=40000,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=1109MiB/s (1163MB/s), 1109MiB/s-1109MiB/s (1163MB/s-1163MB/s),
io=5000MiB (5243MB), run=4509-4509msec
- There is therefore a ~5x decompression performance difference between the
shipped zfs.ko and zfs.ko from zfs-dkms.
As a baseline, I have built and benchmarked zstd 1.4.5 (the version used
in ZFS) on the same file:
$ ./zstd-git/programs/zstd --version
*** zstd command line interface 64-bits v1.4.5, by Yann Collet ***
$ ./zstd-git/programs/zstd -b3 -B128KB qt-everywhere-opensource-src-4.8.2.tar.0.0
3#src-4.8.2.tar.0.0 : 647249920 -> 247589078 (2.614), 370.5 MB/s, 1861.8 MB/s
5. ********** Investigation ************:
- I used perf to profile the fio command, both with the shipped zfs.ko and
the freshly built zfs.ko from zfs-dkms.
The two flame graphs are attached.
- With both versions of zfs.ko, the majority of the time is spent inside
ZSTD_decompressSequences_bmi2.constprop.0.
As expected, in both cases the BMI2 version of ZSTD_decompressSequences
is selected.
- However, in the shipped zfs.ko the main loop of
ZSTD_decompressSequences_bmi2.constprop.0 contains many calls
to small functions that should have been inlined (MEM_64bits, MEM_32bits,
BIT_reloadDStream, BIT_readBits, BIT_readBitsFast,
ZSTD_copy16). MEM_64bits and MEM_32bits are particularly bad since their
definitions are:
MEM_STATIC unsigned MEM_32bits(void) { return sizeof(size_t)==4; }
MEM_STATIC unsigned MEM_64bits(void) { return sizeof(size_t)==8; }
- In the freshly built zfs.ko from zfs-dkms these small functions have been
inlined, which has allowed the compiler to actually make use of BMI2
instructions (shlx and shrx).
- As far as I can tell, all shipped zfs.ko builds are affected by this
problem. I have looked at zfs.ko in:
- 5.15.0-133-generic
- 6.8.0-56-generic
- 6.11.0-18-generic
- 6.12.0-15-generic
and they all appear to have the same issue.
- Zstd compression is also likely affected, since the main loop of
ZSTD_encodeSequences_bmi2 appears to have the same issue.
I have, however, not tested it.
- There remain a few calls to ZSTD_copy4, ZSTD_copy8 and ZSTD_copy16 that
should be inlined.
A small patch (attached) adding MEM_STATIC to ZSTD_copy(4,8,16) results
in a 12% decompression performance improvement:
qt-everywhere-opensource-src-4.8.2.tar: (g=0): rw=read, bs=(R) 128KiB-128KiB,
(W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=psync, iodepth=1
fio-3.36
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=1304MiB/s][r=10.4k IOPS][eta 00m:00s]
qt-everywhere-opensource-src-4.8.2.tar: (groupid=0, jobs=1): err= 0:
pid=31764: Thu Feb 27 15:45:04 2025
read: IOPS=10.1k, BW=1257MiB/s (1318MB/s)(5000MiB/3977msec)
clat (usec): min=17, max=686, avg=98.69, stdev=66.19
lat (usec): min=17, max=686, avg=98.72, stdev=66.19
clat percentiles (usec):
| 1.00th=[ 21], 5.00th=[ 22], 10.00th=[ 22], 20.00th=[ 23],
| 30.00th=[ 47], 40.00th=[ 89], 50.00th=[ 105], 60.00th=[ 118],
| 70.00th=[ 133], 80.00th=[ 153], 90.00th=[ 165], 95.00th=[ 190],
| 99.00th=[ 237], 99.50th=[ 482], 99.90th=[ 570], 99.95th=[ 586],
| 99.99th=[ 627]
bw ( MiB/s): min= 1172, max= 1383, per=100.00%, avg=1266.07, stdev=87.32,
samples=7
iops : min= 9380, max=11064, avg=10128.57, stdev=698.54, samples=7
lat (usec) : 20=0.62%, 50=29.98%, 100=16.23%, 250=52.35%, 500=0.40%
lat (usec) : 750=0.44%
cpu : usr=0.73%, sys=99.22%, ctx=31, majf=15, minf=41
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
issued rwts: total=40000,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=1257MiB/s (1318MB/s), 1257MiB/s-1257MiB/s (1318MB/s-1318MB/s),
io=5000MiB (5243MB), run=3977-3977msec
6. ********** Attached files *********:
- Flame graph of the fio command described in step 4 with both the shipped
zfs.ko and the freshly built zfs.ko from zfs-dkms
- zfs-shipped.ko and zfs-dkms.ko
- patch0-MEM_STATIC-on-ZSTD_copy4-8-16.patch
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/zfs-linux/+bug/2099745/+subscriptions