On 2024-02-18 22:35:03 [+0200], Lasse Collin wrote:
> The balance between the hottest locations in the decompressor code
> varies depending on the input file. Linux kernel source compresses very
> well (ratio is about 0.10). This reduces the benefit of branchless
> code. On my main computer I still get about 2 % time reduction with =3.

Okay, so the input matters, too. I tried 1GiB urandom (so it does not
compress so well) but that went quicker than expected… Anyway.
I found 3 idle x86 boxes and re-run a test with linux' perf on them and
the arm64 box. I all flavours for the two archives. On RiscV I did the
'xz -t' thing because perf seems not to be supported well or I lack
access.

The task is pinned to a single CPU means the task can't be migrated to
another core and xz observes only one "core" (and does not spawn
threads). So it is single threaded.

Intel(R) Xeon(R) Platinum 8176M CPU:

|  Performance counter stats for './xz_0x000_gcc -t linux-6.7.5.tar.xz' (5 
runs):    
|                  
|          13.384,81 msec task-clock                       #    1,000 CPUs 
utilized               ( +-  0,05% )
|                 21      context-switches                 #    1,569 /sec      
                  ( +-  2,61% )
|                  0      cpu-migrations                   #    0,000 /sec      
    
|                119      page-faults                      #    8,891 /sec      
                  ( +-  0,34% )
|     28.041.975.275      cycles                           #    2,095 GHz       
                  ( +-  0,05% )
|     32.576.330.155      instructions                     #    1,16  insn per 
cycle              ( +-  0,00% )
|      4.304.914.251      branches                         #  321,627 M/sec     
                  ( +-  0,00% )
|        567.850.712      branch-misses                    #   13,19% of all 
branches             ( +-  0,02% )
| 
|           13,38558 +- 0,00707 seconds time elapsed  ( +-  0,05% )
|
|  Performance counter stats for './xz_0x003_gcc -t linux-6.7.5.tar.xz' (5 
runs):
| 
|          12.853,67 msec task-clock                       #    1,000 CPUs 
utilized               ( +-  0,03% )
|                 18      context-switches                 #    1,400 /sec      
                  ( +-  5,72% )
|                  0      cpu-migrations                   #    0,000 /sec
|                220      page-faults                      #   17,116 /sec      
                  ( +- 45,95% )
|     26.929.223.135      cycles                           #    2,095 GHz       
                  ( +-  0,03% )
|     42.017.609.529      instructions                     #    1,56  insn per 
cycle              ( +-  0,00% )
|      3.226.245.101      branches                         #  250,998 M/sec     
                  ( +-  0,00% )
|        299.814.626      branch-misses                    #    9,29% of all 
branches             ( +-  0,11% )
| 
|           12,85438 +- 0,00395 seconds time elapsed  ( +-  0,03% )

missed branches dropped, gained instructions but isn per cycle improved.
Less idle cycles. Worth, ~0.5 sec.

|  Performance counter stats for './xz_0x00f_gcc -t linux-6.7.5.tar.xz' (5 
runs):
| 
|          12.872,36 msec task-clock                       #    1,000 CPUs 
utilized               ( +-  0,01% )
|                 17      context-switches                 #    1,321 /sec      
                  ( +-  6,55% )
|                  0      cpu-migrations                   #    0,000 /sec
|                220      page-faults                      #   17,091 /sec      
                  ( +- 45,98% )
|     26.968.386.196      cycles                           #    2,095 GHz       
                  ( +-  0,01% )
|     44.566.213.262      instructions                     #    1,65  insn per 
cycle              ( +-  0,00% )
|      2.957.642.049      branches                         #  229,767 M/sec     
                  ( +-  0,00% )
|        249.987.257      branch-misses                    #    8,45% of all 
branches             ( +-  0,05% )
| 
|           12,87303 +- 0,00115 seconds time elapsed  ( +-  0,01% )

Slightly worse vs previous.

|  Performance counter stats for './xz_0x1f0_gcc -t linux-6.7.5.tar.xz' (5 
runs):
| 
|           9.740,84 msec task-clock                       #    1,000 CPUs 
utilized               ( +-  0,02% )
|                 21      context-switches                 #    2,156 /sec      
                  ( +-  6,14% )
|                  0      cpu-migrations                   #    0,000 /sec
|                216      page-faults                      #   22,175 /sec      
                  ( +- 46,95% )
|     20.407.560.821      cycles                           #    2,095 GHz       
                  ( +-  0,02% )
|     34.751.763.859      instructions                     #    1,70  insn per 
cycle              ( +-  0,00% )
|      3.182.093.181      branches                         #  326,676 M/sec     
                  ( +-  0,00% )
|        271.587.827      branch-misses                    #    8,53% of all 
branches             ( +-  0,06% )
| 
|            9,74159 +- 0,00223 seconds time elapsed  ( +-  0,02% )

Missed branches increased but instructions dropped, insn per cycles
improved a bit. Worth almost 3secs.

|  Performance counter stats for './xz_0x1f0_clang -t linux-6.7.5.tar.xz' (5 
runs):
| 
|          10.400,65 msec task-clock                       #    1,000 CPUs 
utilized               ( +-  0,03% )
|                 21      context-switches                 #    2,019 /sec      
                  ( +-  4,15% )
|                  0      cpu-migrations                   #    0,000 /sec
|                218      page-faults                      #   20,960 /sec      
                  ( +- 46,47% )
|     21.789.921.119      cycles                           #    2,095 GHz       
                  ( +-  0,03% )
|     38.046.946.649      instructions                     #    1,75  insn per 
cycle              ( +-  0,00% )
|      3.691.511.759      branches                         #  354,931 M/sec     
                  ( +-  0,00% )
|        272.904.230      branch-misses                    #    7,39% of all 
branches             ( +-  0,03% )
| 
|           10,40140 +- 0,00305 seconds time elapsed  ( +-  0,03% )
|

clang made more instructions, better insn/cycle ratio but it costs 0.5s vs gcc.

Now the other one:

|  Performance counter stats for './xz_0x000_gcc -t 
warzone2100-data_4.3.3-3_all.xz' (5 runs):
| 
|           6.345,58 msec task-clock                       #    1,000 CPUs 
utilized               ( +-  0,01% )
|                 14      context-switches                 #    2,206 /sec      
                  ( +-  3,50% )
|                  0      cpu-migrations                   #    0,000 /sec
|                111      page-faults                      #   17,492 /sec      
                  ( +-  0,53% )
|     13.294.316.865      cycles                           #    2,095 GHz       
                  ( +-  0,01% )
|     14.333.630.221      instructions                     #    1,08  insn per 
cycle              ( +-  0,00% )
|      1.883.687.210      branches                         #  296,850 M/sec     
                  ( +-  0,00% )
|        312.352.872      branch-misses                    #   16,58% of all 
branches             ( +-  0,02% )
| 
|           6,346194 +- 0,000638 seconds time elapsed  ( +-  0,01% )
| 
|  Performance counter stats for './xz_0x00f_gcc -t 
warzone2100-data_4.3.3-3_all.xz' (5 runs):
| 
|           5.152,52 msec task-clock                       #    1,000 CPUs 
utilized               ( +-  0,05% )
|                 12      context-switches                 #    2,329 /sec      
                  ( +-  4,25% )
|                  0      cpu-migrations                   #    0,000 /sec
|                213      page-faults                      #   41,339 /sec      
                  ( +- 47,86% )
|     10.794.789.805      cycles                           #    2,095 GHz       
                  ( +-  0,05% )
|     21.297.180.861      instructions                     #    1,97  insn per 
cycle              ( +-  0,00% )
|      1.134.077.104      branches                         #  220,101 M/sec     
                  ( +-  0,01% )
|         65.695.965      branch-misses                    #    5,79% of all 
branches             ( +-  0,02% )
| 
|            5,15311 +- 0,00266 seconds time elapsed  ( +-  0,05% )
| 
|  Performance counter stats for './xz_0x1f0_gcc -t 
warzone2100-data_4.3.3-3_all.xz' (5 runs):
| 
|           3.732,30 msec task-clock                       #    1,000 CPUs 
utilized               ( +-  0,13% )
|                 15      context-switches                 #    4,019 /sec      
                  ( +-  5,33% )
|                  0      cpu-migrations                   #    0,000 /sec
|                106      page-faults                      #   28,401 /sec      
                  ( +-  0,55% )
|      7.819.284.450      cycles                           #    2,095 GHz       
                  ( +-  0,13% )
|     15.658.698.884      instructions                     #    2,00  insn per 
cycle              ( +-  0,00% )
|      1.157.490.199      branches                         #  310,128 M/sec     
                  ( +-  0,00% )
|         65.438.661      branch-misses                    #    5,65% of all 
branches             ( +-  0,03% )
| 
|            3,73292 +- 0,00499 seconds time elapsed  ( +-  0,13% )

Still a win.
An older Xeon/Sandybridge:
| Performance counter stats for './xz_0x000_gcc -t linux-6.7.5.tar.xz' (5 runs):
|          14,83757 +- 0,00216 seconds time elapsed  ( +-  0,01% )
| Performance counter stats for './xz_0x001_gcc -t linux-6.7.5.tar.xz' (5 runs):
|         15,881129 +- 0,000770 seconds time elapsed  ( +-  0,00% )
| Performance counter stats for './xz_0x003_gcc -t linux-6.7.5.tar.xz' (5 runs):
|         15,589420 +- 0,000867 seconds time elapsed  ( +-  0,01% )
| Performance counter stats for './xz_0x007_gcc -t linux-6.7.5.tar.xz' (5 runs):
|          15,59517 +- 0,00257 seconds time elapsed  ( +-  0,02% )
| Performance counter stats for './xz_0x00f_gcc -t linux-6.7.5.tar.xz' (5 runs):
|          15,99202 +- 0,00258 seconds time elapsed  ( +-  0,02% )
| Performance counter stats for './xz_0x010_gcc -t linux-6.7.5.tar.xz' (5 runs):
|           13,0439 +- 0,0111 seconds time elapsed  ( +-  0,08% )
| Performance counter stats for './xz_0x030_gcc -t linux-6.7.5.tar.xz' (5 runs):
|          12,23834 +- 0,00391 seconds time elapsed  ( +-  0,03% )
| Performance counter stats for './xz_0x070_gcc -t linux-6.7.5.tar.xz' (5 runs):
|           12,1047 +- 0,0205 seconds time elapsed  ( +-  0,17% )
| Performance counter stats for './xz_0x0f0_gcc -t linux-6.7.5.tar.xz' (5 runs):
|          12,07072 +- 0,00405 seconds time elapsed  ( +-  0,03% )
| Performance counter stats for './xz_0x1f0_gcc -t linux-6.7.5.tar.xz' (5 runs):
|           12,1289 +- 0,0103 seconds time elapsed  ( +-  0,08% )

and the other:
| Performance counter stats for './xz_0x000_gcc -t 
warzone2100-data_4.3.3-3_all.xz' (5 runs):
|           6,63439 +- 0,00177 seconds time elapsed  ( +-  0,03% )
| Performance counter stats for './xz_0x001_gcc -t 
warzone2100-data_4.3.3-3_all.xz' (5 runs):
|           6,42421 +- 0,00847 seconds time elapsed  ( +-  0,13% )
| Performance counter stats for './xz_0x003_gcc -t 
warzone2100-data_4.3.3-3_all.xz' (5 runs):
|            6,3814 +- 0,0116 seconds time elapsed  ( +-  0,18% )
| Performance counter stats for './xz_0x007_gcc -t 
warzone2100-data_4.3.3-3_all.xz' (5 runs):
|           6,41950 +- 0,00239 seconds time elapsed  ( +-  0,04% )
| Performance counter stats for './xz_0x00f_gcc -t 
warzone2100-data_4.3.3-3_all.xz' (5 runs):
|           6,55812 +- 0,00165 seconds time elapsed  ( +-  0,03% )
| Performance counter stats for './xz_0x010_gcc -t 
warzone2100-data_4.3.3-3_all.xz' (5 runs):
|            4,8010 +- 0,0157 seconds time elapsed  ( +-  0,33% )
| Performance counter stats for './xz_0x030_gcc -t 
warzone2100-data_4.3.3-3_all.xz' (5 runs):
|           4,73339 +- 0,00700 seconds time elapsed  ( +-  0,15% )
| Performance counter stats for './xz_0x070_gcc -t 
warzone2100-data_4.3.3-3_all.xz' (5 runs):
|           4,76041 +- 0,00702 seconds time elapsed  ( +-  0,15% )
| Performance counter stats for './xz_0x0f0_gcc -t 
warzone2100-data_4.3.3-3_all.xz' (5 runs):
|           4,62235 +- 0,00723 seconds time elapsed  ( +-  0,16% )
| Performance counter stats for './xz_0x1f0_gcc -t 
warzone2100-data_4.3.3-3_all.xz' (5 runs):
|           4,62489 +- 0,00535 seconds time elapsed  ( +-  0,12% )

Ryzen:
| Performance counter stats for './xz_0x000_gcc -t linux-6.7.5.tar.xz' (5 runs):
|           6,53743 +- 0,00711 seconds time elapsed  ( +-  0,11% )
| Performance counter stats for './xz_0x00f_gcc -t linux-6.7.5.tar.xz' (5 runs):
|           6,17059 +- 0,00146 seconds time elapsed  ( +-  0,02% )
| Performance counter stats for './xz_0x1f0_gcc -t linux-6.7.5.tar.xz' (5 runs):
|          4,541942 +- 0,000630 seconds time elapsed  ( +-  0,01% )
| Performance counter stats for './xz_0x000_gcc -t 
warzone2100-data_4.3.3-3_all.xz' (5 runs):
|           3,18848 +- 0,00251 seconds time elapsed  ( +-  0,08% )
| Performance counter stats for './xz_0x00f_gcc -t 
warzone2100-data_4.3.3-3_all.xz' (5 runs):
|           2,61733 +- 0,00146 seconds time elapsed  ( +-  0,06% )
| Performance counter stats for './xz_0x1f0_gcc -t 
warzone2100-data_4.3.3-3_all.xz' (5 runs):
|           1,82759 +- 0,00217 seconds time elapsed  ( +-  0,12% )

Arm64:

| Performance counter stats for './xz_0x000_clang -t linux-6.7.5.tar.xz' (5 
runs):
|          12.19798 +- 0.00455 seconds time elapsed  ( +-  0.04% )
| Performance counter stats for './xz_0x000_gcc -t linux-6.7.5.tar.xz' (5 runs):
|          12.07622 +- 0.00374 seconds time elapsed  ( +-  0.03% )
| Performance counter stats for './xz_0x001_clang -t linux-6.7.5.tar.xz' (5 
runs):
|          12.80433 +- 0.00322 seconds time elapsed  ( +-  0.03% )
| Performance counter stats for './xz_0x001_gcc -t linux-6.7.5.tar.xz' (5 runs):
|          12.82816 +- 0.00543 seconds time elapsed  ( +-  0.04% )
| Performance counter stats for './xz_0x003_clang -t linux-6.7.5.tar.xz' (5 
runs):
|          12.81225 +- 0.00492 seconds time elapsed  ( +-  0.04% )
| Performance counter stats for './xz_0x003_gcc -t linux-6.7.5.tar.xz' (5 runs):
|          12.79457 +- 0.00355 seconds time elapsed  ( +-  0.03% )
| Performance counter stats for './xz_0x007_clang -t linux-6.7.5.tar.xz' (5 
runs):
|          12.93820 +- 0.00639 seconds time elapsed  ( +-  0.05% )
| Performance counter stats for './xz_0x007_gcc -t linux-6.7.5.tar.xz' (5 runs):
|          12.76739 +- 0.00127 seconds time elapsed  ( +-  0.01% )
| Performance counter stats for './xz_0x00f_clang -t linux-6.7.5.tar.xz' (5 
runs):
|          13.13949 +- 0.00285 seconds time elapsed  ( +-  0.02% )
| Performance counter stats for './xz_0x00f_gcc -t linux-6.7.5.tar.xz' (5 runs):
|          12.90021 +- 0.00531 seconds time elapsed  ( +-  0.04% )

| Performance counter stats for './xz_0x000_clang -t 
warzone2100-data_4.3.3-3_all.xz' (5 runs):
|          4.483373 +- 0.000590 seconds time elapsed  ( +-  0.01% )
| Performance counter stats for './xz_0x000_gcc -t 
warzone2100-data_4.3.3-3_all.xz' (5 runs):
|          4.906357 +- 0.000577 seconds time elapsed  ( +-  0.01% )
| Performance counter stats for './xz_0x001_clang -t 
warzone2100-data_4.3.3-3_all.xz' (5 runs):
|           4.85769 +- 0.00148 seconds time elapsed  ( +-  0.03% )
| Performance counter stats for './xz_0x001_gcc -t 
warzone2100-data_4.3.3-3_all.xz' (5 runs):
|          4.926150 +- 0.000405 seconds time elapsed  ( +-  0.01% )
| Performance counter stats for './xz_0x003_clang -t 
warzone2100-data_4.3.3-3_all.xz' (5 runs):
|           4.86843 +- 0.00161 seconds time elapsed  ( +-  0.03% )
| Performance counter stats for './xz_0x003_gcc -t 
warzone2100-data_4.3.3-3_all.xz' (5 runs):
|          4.945385 +- 0.000988 seconds time elapsed  ( +-  0.02% )
| Performance counter stats for './xz_0x007_clang -t 
warzone2100-data_4.3.3-3_all.xz' (5 runs):
|          4.814933 +- 0.000952 seconds time elapsed  ( +-  0.02% )
| Performance counter stats for './xz_0x007_gcc -t 
warzone2100-data_4.3.3-3_all.xz' (5 runs):
|           4.95251 +- 0.00154 seconds time elapsed  ( +-  0.03% )
| Performance counter stats for './xz_0x00f_clang -t 
warzone2100-data_4.3.3-3_all.xz' (5 runs):
|          4.897356 +- 0.000741 seconds time elapsed  ( +-  0.02% )
| Performance counter stats for './xz_0x00f_gcc -t 
warzone2100-data_4.3.3-3_all.xz' (5 runs):
|          4.949230 +- 0.000245 seconds time elapsed  ( +-  0.00% )

Here it does not matter if I look at clang/gcc or one of the files, the
000 varient is slightly better.

For RiscV I have only the "xz -t" numbers and here it says
| ----=== ./xz_0x000_gcc ===----
| linux-6.7.5.tar.xz: 134,9 MiB / 1.386,4 MiB = 0,097, 31 MiB/s, 0:44
| warzone2100-data_4.3.3-3_all.xz: 136,0 MiB / 180,3 MiB = 0,754, 13 MiB/s, 0:14

| ----=== ./xz_0x003_gcc ===----
| linux-6.7.5.tar.xz: 134,9 MiB / 1.386,4 MiB = 0,097, 29 MiB/s, 0:46
| warzone2100-data_4.3.3-3_all.xz: 136,0 MiB / 180,3 MiB = 0,754, 12 MiB/s, 0:15

| ----=== ./xz_0x00f_gcc ===----
| linux-6.7.5.tar.xz: 134,9 MiB / 1.386,4 MiB = 0,097, 29 MiB/s, 0:47
| warzone2100-data_4.3.3-3_all.xz: 136,0 MiB / 180,3 MiB = 0,754, 12 MiB/s, 0:15

So appears that here also, the 000 variant performs a bit better. I
don't know how accurate numbers are here. I could try to re-run them
with perf to get a higher runtime resolution and to see how much the
run time varies.

Sebastian

Reply via email to