On 2024-02-18 22:35:03 [+0200], Lasse Collin wrote:
> The balance between the hottest locations in the decompressor code
> varies depending on the input file. Linux kernel source compresses very
> well (ratio is about 0.10). This reduces the benefit of branchless
> code. On my main computer I still get about 2 % time reduction with =3.

Okay, so the input matters, too. I tried 1GiB urandom (so it does not
compress so well) but that went quicker than expected… Anyway.

I found 3 idle x86 boxes and re-ran the test with Linux' perf on them and
on the arm64 box. I all flavours for the two archives. On RiscV I did the
'xz -t' thing because perf seems not to be supported well or I lack
access.

The task is pinned to a single CPU, which means it can't be migrated to
another core and xz observes only one "core" (and does not spawn
threads). So it is single threaded.

Intel(R) Xeon(R) Platinum 8176M CPU:

| Performance counter stats for './xz_0x000_gcc -t linux-6.7.5.tar.xz' (5 runs):
|
|      13.384,81 msec task-clock   # 1,000 CPUs utilized ( +- 0,05% )
|             21 context-switches  # 1,569 /sec ( +- 2,61% )
|              0 cpu-migrations    # 0,000 /sec
|            119 page-faults       # 8,891 /sec ( +- 0,34% )
| 28.041.975.275 cycles            # 2,095 GHz ( +- 0,05% )
| 32.576.330.155 instructions      # 1,16 insn per cycle ( +- 0,00% )
|  4.304.914.251 branches          # 321,627 M/sec ( +- 0,00% )
|    567.850.712 branch-misses     # 13,19% of all branches ( +- 0,02% )
|
|       13,38558 +- 0,00707 seconds time elapsed ( +- 0,05% )
|
| Performance counter stats for './xz_0x003_gcc -t linux-6.7.5.tar.xz' (5 runs):
|
|      12.853,67 msec task-clock   # 1,000 CPUs utilized ( +- 0,03% )
|             18 context-switches  # 1,400 /sec ( +- 5,72% )
|              0 cpu-migrations    # 0,000 /sec
|            220 page-faults       # 17,116 /sec ( +- 45,95% )
| 26.929.223.135 cycles            # 2,095 GHz ( +- 0,03% )
| 42.017.609.529 instructions      # 1,56 insn per cycle ( +- 0,00% )
|  3.226.245.101 branches          # 250,998 M/sec ( +- 0,00% )
|    299.814.626 branch-misses     # 9,29% of all branches ( +- 0,11% )
|
|       12,85438 +- 0,00395 seconds time elapsed ( +- 0,03% )

Missed branches dropped; instructions went up, but insn per cycle
improved. Fewer idle cycles. Worth ~0.5 sec.

| Performance counter stats for './xz_0x00f_gcc -t linux-6.7.5.tar.xz' (5 runs):
|
|      12.872,36 msec task-clock   # 1,000 CPUs utilized ( +- 0,01% )
|             17 context-switches  # 1,321 /sec ( +- 6,55% )
|              0 cpu-migrations    # 0,000 /sec
|            220 page-faults       # 17,091 /sec ( +- 45,98% )
| 26.968.386.196 cycles            # 2,095 GHz ( +- 0,01% )
| 44.566.213.262 instructions      # 1,65 insn per cycle ( +- 0,00% )
|  2.957.642.049 branches          # 229,767 M/sec ( +- 0,00% )
|    249.987.257 branch-misses     # 8,45% of all branches ( +- 0,05% )
|
|       12,87303 +- 0,00115 seconds time elapsed ( +- 0,01% )

Slightly worse vs previous.

| Performance counter stats for './xz_0x1f0_gcc -t linux-6.7.5.tar.xz' (5 runs):
|
|       9.740,84 msec task-clock   # 1,000 CPUs utilized ( +- 0,02% )
|             21 context-switches  # 2,156 /sec ( +- 6,14% )
|              0 cpu-migrations    # 0,000 /sec
|            216 page-faults       # 22,175 /sec ( +- 46,95% )
| 20.407.560.821 cycles            # 2,095 GHz ( +- 0,02% )
| 34.751.763.859 instructions      # 1,70 insn per cycle ( +- 0,00% )
|  3.182.093.181 branches          # 326,676 M/sec ( +- 0,00% )
|    271.587.827 branch-misses     # 8,53% of all branches ( +- 0,06% )
|
|        9,74159 +- 0,00223 seconds time elapsed ( +- 0,02% )

Missed branches increased but instructions dropped, and insn per cycle
improved a bit. Worth about 3 secs.

| Performance counter stats for './xz_0x1f0_clang -t linux-6.7.5.tar.xz' (5 runs):
|
|      10.400,65 msec task-clock   # 1,000 CPUs utilized ( +- 0,03% )
|             21 context-switches  # 2,019 /sec ( +- 4,15% )
|              0 cpu-migrations    # 0,000 /sec
|            218 page-faults       # 20,960 /sec ( +- 46,47% )
| 21.789.921.119 cycles            # 2,095 GHz ( +- 0,03% )
| 38.046.946.649 instructions      # 1,75 insn per cycle ( +- 0,00% )
|  3.691.511.759 branches          # 354,931 M/sec ( +- 0,00% )
|    272.904.230 branch-misses     # 7,39% of all branches ( +- 0,03% )
|
|       10,40140 +- 0,00305 seconds time elapsed ( +- 0,03% )

clang generated more instructions with a better insn/cycle ratio, but it
costs about 0.6s vs gcc.

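The derived columns (insn per cycle, branch-miss rate) follow directly from the raw counters, so they can be recomputed when comparing variants side by side. A minimal sketch in Python, using only the counter values from the Platinum 8176M runs on linux-6.7.5.tar.xz quoted above; the dictionary layout and the function names are made up for illustration and are not part of perf or xz.

```python
# Raw counters copied from the perf output above (Xeon Platinum 8176M,
# linux-6.7.5.tar.xz). Layout and names are illustrative assumptions.
COUNTERS = {
    "xz_0x000_gcc": {"cycles": 28_041_975_275, "instructions": 32_576_330_155,
                     "branches": 4_304_914_251, "branch_misses": 567_850_712},
    "xz_0x003_gcc": {"cycles": 26_929_223_135, "instructions": 42_017_609_529,
                     "branches": 3_226_245_101, "branch_misses": 299_814_626},
    "xz_0x00f_gcc": {"cycles": 26_968_386_196, "instructions": 44_566_213_262,
                     "branches": 2_957_642_049, "branch_misses": 249_987_257},
    "xz_0x1f0_gcc": {"cycles": 20_407_560_821, "instructions": 34_751_763_859,
                     "branches": 3_182_093_181, "branch_misses": 271_587_827},
    "xz_0x1f0_clang": {"cycles": 21_789_921_119, "instructions": 38_046_946_649,
                       "branches": 3_691_511_759, "branch_misses": 272_904_230},
}


def insn_per_cycle(c):
    """Instructions retired per CPU cycle."""
    return c["instructions"] / c["cycles"]


def branch_miss_rate(c):
    """Fraction of all branches that were mispredicted."""
    return c["branch_misses"] / c["branches"]


def summarize(counters):
    """Return (name, insn per cycle, branch-miss percentage) per variant."""
    return [(name, insn_per_cycle(c), 100.0 * branch_miss_rate(c))
            for name, c in counters.items()]


if __name__ == "__main__":
    for name, ipc, miss in summarize(COUNTERS):
        print(f"{name:<16} {ipc:5.2f} insn/cycle  {miss:6.2f}% branch-misses")
```

The point of looking at both columns together is that the variants trade more instructions for fewer mispredicted branches, so neither number alone predicts the elapsed time.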
Now the other one:

| Performance counter stats for './xz_0x000_gcc -t warzone2100-data_4.3.3-3_all.xz' (5 runs):
|
|       6.345,58 msec task-clock   # 1,000 CPUs utilized ( +- 0,01% )
|             14 context-switches  # 2,206 /sec ( +- 3,50% )
|              0 cpu-migrations    # 0,000 /sec
|            111 page-faults       # 17,492 /sec ( +- 0,53% )
| 13.294.316.865 cycles            # 2,095 GHz ( +- 0,01% )
| 14.333.630.221 instructions      # 1,08 insn per cycle ( +- 0,00% )
|  1.883.687.210 branches          # 296,850 M/sec ( +- 0,00% )
|    312.352.872 branch-misses     # 16,58% of all branches ( +- 0,02% )
|
|       6,346194 +- 0,000638 seconds time elapsed ( +- 0,01% )
|
| Performance counter stats for './xz_0x00f_gcc -t warzone2100-data_4.3.3-3_all.xz' (5 runs):
|
|       5.152,52 msec task-clock   # 1,000 CPUs utilized ( +- 0,05% )
|             12 context-switches  # 2,329 /sec ( +- 4,25% )
|              0 cpu-migrations    # 0,000 /sec
|            213 page-faults       # 41,339 /sec ( +- 47,86% )
| 10.794.789.805 cycles            # 2,095 GHz ( +- 0,05% )
| 21.297.180.861 instructions      # 1,97 insn per cycle ( +- 0,00% )
|  1.134.077.104 branches          # 220,101 M/sec ( +- 0,01% )
|     65.695.965 branch-misses     # 5,79% of all branches ( +- 0,02% )
|
|        5,15311 +- 0,00266 seconds time elapsed ( +- 0,05% )
|
| Performance counter stats for './xz_0x1f0_gcc -t warzone2100-data_4.3.3-3_all.xz' (5 runs):
|
|       3.732,30 msec task-clock   # 1,000 CPUs utilized ( +- 0,13% )
|             15 context-switches  # 4,019 /sec ( +- 5,33% )
|              0 cpu-migrations    # 0,000 /sec
|            106 page-faults       # 28,401 /sec ( +- 0,55% )
|  7.819.284.450 cycles            # 2,095 GHz ( +- 0,13% )
| 15.658.698.884 instructions      # 2,00 insn per cycle ( +- 0,00% )
|  1.157.490.199 branches          # 310,128 M/sec ( +- 0,00% )
|     65.438.661 branch-misses     # 5,65% of all branches ( +- 0,03% )
|
|        3,73292 +- 0,00499 seconds time elapsed ( +- 0,13% )

Still a win.

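To put a number on "still a win", the elapsed-time lines can be turned into a relative time reduction against the 0x000 baseline. A minimal sketch in Python, assuming the numbers are printed with '.' as the thousands separator and ',' as the decimal separator, as in the x86 outputs here (the arm64 output uses '.' as the decimal separator and would need a different parser). The helper names are hypothetical.

```python
import re

# Matches e.g. "| 6,346194 +- 0,000638 seconds time elapsed ( +- 0,01% )".
ELAPSED_RE = re.compile(r"([\d.,]+)\s*\+-\s*([\d.,]+)\s+seconds time elapsed")


def parse_localized(text):
    """Parse a number that uses '.' for thousands and ',' for decimals."""
    return float(text.replace(".", "").replace(",", "."))


def elapsed_seconds(line):
    """Return (mean, deviation) in seconds from a 'time elapsed' line."""
    match = ELAPSED_RE.search(line)
    if match is None:
        raise ValueError(f"no elapsed time found in: {line!r}")
    return parse_localized(match.group(1)), parse_localized(match.group(2))


def time_reduction(baseline, candidate):
    """Fraction of the baseline run time saved by the candidate."""
    return (baseline - candidate) / baseline


# Elapsed-time lines from the warzone2100 runs on the Platinum 8176M above.
LINES = {
    "xz_0x000_gcc": "| 6,346194 +- 0,000638 seconds time elapsed ( +- 0,01% )",
    "xz_0x00f_gcc": "| 5,15311 +- 0,00266 seconds time elapsed ( +- 0,05% )",
    "xz_0x1f0_gcc": "| 3,73292 +- 0,00499 seconds time elapsed ( +- 0,13% )",
}

if __name__ == "__main__":
    base, _ = elapsed_seconds(LINES["xz_0x000_gcc"])
    for name, line in LINES.items():
        mean, dev = elapsed_seconds(line)
        saved = 100.0 * time_reduction(base, mean)
        print(f"{name:<14} {mean:8.5f} s (+- {dev:.5f})  {saved:5.1f}% less time")
```

For this archive the sketch gives roughly 19 % less time for 0x00f and roughly 41 % for 0x1f0, both relative to 0x000.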
An older Xeon/Sandybridge:

| Performance counter stats for './xz_0x000_gcc -t linux-6.7.5.tar.xz' (5 runs):
|     14,83757 +- 0,00216 seconds time elapsed ( +- 0,01% )
| Performance counter stats for './xz_0x001_gcc -t linux-6.7.5.tar.xz' (5 runs):
|     15,881129 +- 0,000770 seconds time elapsed ( +- 0,00% )
| Performance counter stats for './xz_0x003_gcc -t linux-6.7.5.tar.xz' (5 runs):
|     15,589420 +- 0,000867 seconds time elapsed ( +- 0,01% )
| Performance counter stats for './xz_0x007_gcc -t linux-6.7.5.tar.xz' (5 runs):
|     15,59517 +- 0,00257 seconds time elapsed ( +- 0,02% )
| Performance counter stats for './xz_0x00f_gcc -t linux-6.7.5.tar.xz' (5 runs):
|     15,99202 +- 0,00258 seconds time elapsed ( +- 0,02% )
| Performance counter stats for './xz_0x010_gcc -t linux-6.7.5.tar.xz' (5 runs):
|     13,0439 +- 0,0111 seconds time elapsed ( +- 0,08% )
| Performance counter stats for './xz_0x030_gcc -t linux-6.7.5.tar.xz' (5 runs):
|     12,23834 +- 0,00391 seconds time elapsed ( +- 0,03% )
| Performance counter stats for './xz_0x070_gcc -t linux-6.7.5.tar.xz' (5 runs):
|     12,1047 +- 0,0205 seconds time elapsed ( +- 0,17% )
| Performance counter stats for './xz_0x0f0_gcc -t linux-6.7.5.tar.xz' (5 runs):
|     12,07072 +- 0,00405 seconds time elapsed ( +- 0,03% )
| Performance counter stats for './xz_0x1f0_gcc -t linux-6.7.5.tar.xz' (5 runs):
|     12,1289 +- 0,0103 seconds time elapsed ( +- 0,08% )

and the other:

| Performance counter stats for './xz_0x000_gcc -t warzone2100-data_4.3.3-3_all.xz' (5 runs):
|     6,63439 +- 0,00177 seconds time elapsed ( +- 0,03% )
| Performance counter stats for './xz_0x001_gcc -t warzone2100-data_4.3.3-3_all.xz' (5 runs):
|     6,42421 +- 0,00847 seconds time elapsed ( +- 0,13% )
| Performance counter stats for './xz_0x003_gcc -t warzone2100-data_4.3.3-3_all.xz' (5 runs):
|     6,3814 +- 0,0116 seconds time elapsed ( +- 0,18% )
| Performance counter stats for './xz_0x007_gcc -t warzone2100-data_4.3.3-3_all.xz' (5 runs):
|     6,41950 +- 0,00239 seconds time elapsed ( +- 0,04% )
| Performance counter stats for './xz_0x00f_gcc -t warzone2100-data_4.3.3-3_all.xz' (5 runs):
|     6,55812 +- 0,00165 seconds time elapsed ( +- 0,03% )
| Performance counter stats for './xz_0x010_gcc -t warzone2100-data_4.3.3-3_all.xz' (5 runs):
|     4,8010 +- 0,0157 seconds time elapsed ( +- 0,33% )
| Performance counter stats for './xz_0x030_gcc -t warzone2100-data_4.3.3-3_all.xz' (5 runs):
|     4,73339 +- 0,00700 seconds time elapsed ( +- 0,15% )
| Performance counter stats for './xz_0x070_gcc -t warzone2100-data_4.3.3-3_all.xz' (5 runs):
|     4,76041 +- 0,00702 seconds time elapsed ( +- 0,15% )
| Performance counter stats for './xz_0x0f0_gcc -t warzone2100-data_4.3.3-3_all.xz' (5 runs):
|     4,62235 +- 0,00723 seconds time elapsed ( +- 0,16% )
| Performance counter stats for './xz_0x1f0_gcc -t warzone2100-data_4.3.3-3_all.xz' (5 runs):
|     4,62489 +- 0,00535 seconds time elapsed ( +- 0,12% )

Ryzen:

| Performance counter stats for './xz_0x000_gcc -t linux-6.7.5.tar.xz' (5 runs):
|     6,53743 +- 0,00711 seconds time elapsed ( +- 0,11% )
| Performance counter stats for './xz_0x00f_gcc -t linux-6.7.5.tar.xz' (5 runs):
|     6,17059 +- 0,00146 seconds time elapsed ( +- 0,02% )
| Performance counter stats for './xz_0x1f0_gcc -t linux-6.7.5.tar.xz' (5 runs):
|     4,541942 +- 0,000630 seconds time elapsed ( +- 0,01% )
| Performance counter stats for './xz_0x000_gcc -t warzone2100-data_4.3.3-3_all.xz' (5 runs):
|     3,18848 +- 0,00251 seconds time elapsed ( +- 0,08% )
| Performance counter stats for './xz_0x00f_gcc -t warzone2100-data_4.3.3-3_all.xz' (5 runs):
|     2,61733 +- 0,00146 seconds time elapsed ( +- 0,06% )
| Performance counter stats for './xz_0x1f0_gcc -t warzone2100-data_4.3.3-3_all.xz' (5 runs):
|     1,82759 +- 0,00217 seconds time elapsed ( +- 0,12% )

Arm64:

| Performance counter stats for './xz_0x000_clang -t linux-6.7.5.tar.xz' (5 runs):
|     12.19798 +- 0.00455 seconds time elapsed ( +- 0.04% )
| Performance counter stats for './xz_0x000_gcc -t linux-6.7.5.tar.xz' (5 runs):
|     12.07622 +- 0.00374 seconds time elapsed ( +- 0.03% )
| Performance counter stats for './xz_0x001_clang -t linux-6.7.5.tar.xz' (5 runs):
|     12.80433 +- 0.00322 seconds time elapsed ( +- 0.03% )
| Performance counter stats for './xz_0x001_gcc -t linux-6.7.5.tar.xz' (5 runs):
|     12.82816 +- 0.00543 seconds time elapsed ( +- 0.04% )
| Performance counter stats for './xz_0x003_clang -t linux-6.7.5.tar.xz' (5 runs):
|     12.81225 +- 0.00492 seconds time elapsed ( +- 0.04% )
| Performance counter stats for './xz_0x003_gcc -t linux-6.7.5.tar.xz' (5 runs):
|     12.79457 +- 0.00355 seconds time elapsed ( +- 0.03% )
| Performance counter stats for './xz_0x007_clang -t linux-6.7.5.tar.xz' (5 runs):
|     12.93820 +- 0.00639 seconds time elapsed ( +- 0.05% )
| Performance counter stats for './xz_0x007_gcc -t linux-6.7.5.tar.xz' (5 runs):
|     12.76739 +- 0.00127 seconds time elapsed ( +- 0.01% )
| Performance counter stats for './xz_0x00f_clang -t linux-6.7.5.tar.xz' (5 runs):
|     13.13949 +- 0.00285 seconds time elapsed ( +- 0.02% )
| Performance counter stats for './xz_0x00f_gcc -t linux-6.7.5.tar.xz' (5 runs):
|     12.90021 +- 0.00531 seconds time elapsed ( +- 0.04% )
| Performance counter stats for './xz_0x000_clang -t warzone2100-data_4.3.3-3_all.xz' (5 runs):
|     4.483373 +- 0.000590 seconds time elapsed ( +- 0.01% )
| Performance counter stats for './xz_0x000_gcc -t warzone2100-data_4.3.3-3_all.xz' (5 runs):
|     4.906357 +- 0.000577 seconds time elapsed ( +- 0.01% )
| Performance counter stats for './xz_0x001_clang -t warzone2100-data_4.3.3-3_all.xz' (5 runs):
|     4.85769 +- 0.00148 seconds time elapsed ( +- 0.03% )
| Performance counter stats for './xz_0x001_gcc -t warzone2100-data_4.3.3-3_all.xz' (5 runs):
|     4.926150 +- 0.000405 seconds time elapsed ( +- 0.01% )
| Performance counter stats for './xz_0x003_clang -t warzone2100-data_4.3.3-3_all.xz' (5 runs):
|     4.86843 +- 0.00161 seconds time elapsed ( +- 0.03% )
| Performance counter stats for './xz_0x003_gcc -t warzone2100-data_4.3.3-3_all.xz' (5 runs):
|     4.945385 +- 0.000988 seconds time elapsed ( +- 0.02% )
| Performance counter stats for './xz_0x007_clang -t warzone2100-data_4.3.3-3_all.xz' (5 runs):
|     4.814933 +- 0.000952 seconds time elapsed ( +- 0.02% )
| Performance counter stats for './xz_0x007_gcc -t warzone2100-data_4.3.3-3_all.xz' (5 runs):
|     4.95251 +- 0.00154 seconds time elapsed ( +- 0.03% )
| Performance counter stats for './xz_0x00f_clang -t warzone2100-data_4.3.3-3_all.xz' (5 runs):
|     4.897356 +- 0.000741 seconds time elapsed ( +- 0.02% )
| Performance counter stats for './xz_0x00f_gcc -t warzone2100-data_4.3.3-3_all.xz' (5 runs):
|     4.949230 +- 0.000245 seconds time elapsed ( +- 0.00% )

Here it does not matter whether I look at clang or gcc, or at which of
the files: the 000 variant is slightly better.

For RiscV I have only the "xz -t" numbers, and here it says:

| ----=== ./xz_0x000_gcc ===----
| linux-6.7.5.tar.xz: 134,9 MiB / 1.386,4 MiB = 0,097, 31 MiB/s, 0:44
| warzone2100-data_4.3.3-3_all.xz: 136,0 MiB / 180,3 MiB = 0,754, 13 MiB/s, 0:14
| ----=== ./xz_0x003_gcc ===----
| linux-6.7.5.tar.xz: 134,9 MiB / 1.386,4 MiB = 0,097, 29 MiB/s, 0:46
| warzone2100-data_4.3.3-3_all.xz: 136,0 MiB / 180,3 MiB = 0,754, 12 MiB/s, 0:15
| ----=== ./xz_0x00f_gcc ===----
| linux-6.7.5.tar.xz: 134,9 MiB / 1.386,4 MiB = 0,097, 29 MiB/s, 0:47
| warzone2100-data_4.3.3-3_all.xz: 136,0 MiB / 180,3 MiB = 0,754, 12 MiB/s, 0:15

So it appears that here, too, the 000 variant performs a bit better. I
don't know how accurate the numbers are here. I could try to re-run them
with perf to get a higher runtime resolution and to see how much the run
time varies.

Sebastian