Hi,

I've been doing some performance benchmarking of zstd on ARM platforms and
found an interesting performance trajectory across GCC versions. GCC-14
brought a significant improvement in decompression performance on ARM
(about +20% compared to GCC-13), but GCC-15 not only loses that
improvement, it actually performs worse than GCC-13 - up to 27% slower than
GCC-14 on Apple Silicon.

The interesting part is that building with -fno-if-conversion
-fno-late-combine-instructions almost completely recovers the GCC-14
performance level. On my M1 Max, GCC-15 drops from the GCC-14 baseline of
1659 MB/s down to 1210 MB/s for decompression, but with both passes
disabled it recovers to 1638 MB/s (essentially back to GCC-14 levels).

Here's what I've tested so far:

Test setup:
- Platforms: Apple Silicon M1 Max (primary), AWS Graviton 5 (m9g.4xlarge)
- Compilers: GCC 13.4.0, GCC 14.3.0, GCC 15.2.0, and Clang 17.0.0 for
comparison
- Build: -O3 -fPIC -march=native on Apple, -march=armv9-a on Graviton
- Workload: zstd 1.5.7 level 3 compression on a 10MB test file, 10
iterations

Apple Silicon M1 Max (-march=native) results:

+----------+--------------------------------+-------------+---------------+------------------+-------------------+
| Compiler | Flags                          | Compression | Decompression | vs GCC-13        | vs GCC-14         |
+----------+--------------------------------+-------------+---------------+------------------+-------------------+
| GCC-13   | base                           | 498.7 MB/s  | 1385.2 MB/s   | (baseline)       | +0.2% / -16.5%    |
| GCC-13   | -fno-if-conversion             | 494.9 MB/s  | 1365.3 MB/s   | -0.8% / -1.4%    | -0.6% / -17.7%    |
| GCC-14   | base                           | 497.7 MB/s  | 1659.1 MB/s   | -0.2% / +19.8% * | (baseline)        |
| GCC-14   | -fno-if-conversion             | 500.4 MB/s  | 1649.5 MB/s   | +0.3% / +19.1%   | +0.5% / -0.6%     |
| GCC-15   | base                           | 486.2 MB/s  | 1209.8 MB/s   | -2.5% / -12.7%   | -2.3% / -27.1% ** |
| GCC-15   | -fno-if-conversion             | 493.9 MB/s  | 1548.2 MB/s   | -1.0% / +11.8%   | -0.8% / -6.7%     |
| GCC-15   | -fno-late-combine-instructions | 500.7 MB/s  | 1314.4 MB/s   | +0.4% / -5.1%    | +0.6% / -20.8%    |
| GCC-15   | both flags                     | 495.8 MB/s  | 1638.4 MB/s   | -0.6% / +18.3%   | -0.4% / -1.2% *** |
| Clang 17 | base                           | 491.2 MB/s  | 1340.4 MB/s   | -1.5% / -3.2%    | -1.3% / -19.2%    |
+----------+--------------------------------+-------------+---------------+------------------+-------------------+
  * GCC-14 boost    ** GCC-15 regression    *** Workaround recovery
  "vs" columns: compression / decompression change

AWS Graviton 5 (Neoverse V3, -march=armv9-a) results:

+----------+-----------------------------------------------------------------------+-------------+---------------+------------------+--------------------+
| Compiler | Flags                                                                 | Compression | Decompression | vs GCC-13        | vs GCC-14          |
+----------+-----------------------------------------------------------------------+-------------+---------------+------------------+--------------------+
| GCC-13   | base                                                                  | 755.4 MB/s  | 1484.2 MB/s   | (baseline)       | +6.8% / -28.2%     |
| GCC-13   | -fno-if-conversion -fno-if-conversion2                                | 839.4 MB/s  | 1498.5 MB/s   | +11.1% / +1.0%   | +18.7% / -27.5%    |
| GCC-14   | base                                                                  | 707.3 MB/s  | 2066.9 MB/s   | -6.4% / +39.2% * | (baseline)         |
| GCC-14   | -fno-if-conversion -fno-if-conversion2                                | 837.8 MB/s  | 2194.8 MB/s   | +10.9% / +47.9%  | +18.4% / +6.2%     |
| GCC-15   | base                                                                  | 659.9 MB/s  | 1497.6 MB/s   | -12.6% / +0.9%   | -6.7% / -27.5% **  |
| GCC-15   | -fno-late-combine-instructions -fno-if-conversion -fno-if-conversion2 | 827.5 MB/s  | 2214.1 MB/s   | +9.5% / +49.2%   | +17.0% / +7.1% *** |
+----------+-----------------------------------------------------------------------+-------------+---------------+------------------+--------------------+
  * GCC-14 boost    ** GCC-15 regression    *** Workaround recovery
  "vs" columns: compression / decompression change

Looking at the trajectory on the M1 Max with GCC-13 as the baseline: GCC-14
improved decompression by 19.8%, while GCC-15 falls below the GCC-13
baseline (-12.7% vs GCC-13, or -27.1% vs GCC-14). On Graviton, GCC-15 lands
roughly back at the GCC-13 level (+0.9%), which is still 27.5% slower than
GCC-14.
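
For anyone who wants to check the percentages: they are plain relative
changes in throughput against the baseline compiler. A minimal Python
sketch, using the M1 Max decompression numbers from the table above (the
function name rel_change is just for illustration, it is not part of any
benchmark tool):

```python
def rel_change(value_mb_s: float, baseline_mb_s: float) -> float:
    """Relative throughput change versus a baseline, in percent."""
    return (value_mb_s / baseline_mb_s - 1.0) * 100.0


# M1 Max decompression throughput in MB/s, copied from the results table.
gcc13, gcc14, gcc15 = 1385.2, 1659.1, 1209.8

print(f"GCC-14 vs GCC-13: {rel_change(gcc14, gcc13):+.1f}%")  # +19.8%
print(f"GCC-15 vs GCC-13: {rel_change(gcc15, gcc13):+.1f}%")  # -12.7%
print(f"GCC-15 vs GCC-14: {rel_change(gcc15, gcc14):+.1f}%")  # -27.1%
```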

For GCC-15 on the M1 Max, disabling if-conversion alone gets decompression
back above the GCC-13 baseline (+11.8% vs GCC-13), and with both
-fno-if-conversion and -fno-late-combine-instructions it nearly recovers
the GCC-14 performance level (+18.3% vs GCC-13, vs +19.8% for GCC-14).
Interestingly, -fno-if-conversion has minimal effect on GCC-13 and GCC-14
on the M1 Max (under 1.5% either way), although on Graviton it speeds up
compression by 11-18% for both and GCC-14 decompression by 6.2%.
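
One way to compare the flags is to ask how much of the GCC-14 to GCC-15
decompression gap each one wins back. The sketch below does that for the M1
Max numbers in the table; the helper name gap_recovered and the "share of
the gap" metric are my own framing for illustration, not something the
benchmark scripts report:

```python
def gap_recovered(with_flags: float, regressed: float, reference: float) -> float:
    """Percent of the (reference - regressed) throughput gap that is regained."""
    return (with_flags - regressed) / (reference - regressed) * 100.0


# M1 Max decompression throughput in MB/s, copied from the results table.
GCC14_BASE = 1659.1
GCC15_BASE = 1209.8
gcc15_variants = {
    "-fno-if-conversion": 1548.2,
    "-fno-late-combine-instructions": 1314.4,
    "both flags": 1638.4,
}

for flags, mb_s in gcc15_variants.items():
    share = gap_recovered(mb_s, GCC15_BASE, GCC14_BASE)
    print(f"{flags:32s} {share:5.1f}% of the gap recovered")
```

By this measure -fno-if-conversion alone regains about 75% of the gap,
-fno-late-combine-instructions alone about 23%, and both together about
95%.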

So it seems like GCC-14 might have introduced some optimization that really
helped zstd decompression on ARM (perhaps related to if-conversion
changes?), but then GCC-15 changed things again - possibly the new
late-combine-instructions pass interacting badly with if-conversion - and
lost that benefit. The flags to disable these passes in GCC-15 essentially
recover the GCC-14 behavior, which suggests whatever GCC-14 was doing right
is being undone by these optimization passes in GCC-15.

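If you're hitting this issue, the workaround is to add the two flags to
your GCC-15 build, e.g. on Graviton:

  -O3 -fPIC -march=armv9-a -fno-if-conversion -fno-late-combine-instructions

On the M1 Max (with -march=native) that gets decompression back to within
1-2% of GCC-14 (-1.2%). The Graviton numbers above were measured with
-fno-if-conversion2 added as well.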
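
If the same build has to work with several GCC versions, the flags can be
added conditionally. Below is a minimal Python sketch for a build script.
The helper names are hypothetical, the version string is assumed to come
from something like `gcc -dumpfullversion`, and restricting the workaround
to major version 15 is an assumption: only 15.2.0 was measured here, and as
far as I know the late-combine pass is new in GCC 15, so older compilers
may not accept -fno-late-combine-instructions.

```python
import re

# Flags that recovered GCC-14-level decompression speed in the tests above.
WORKAROUND_FLAGS = ["-fno-if-conversion", "-fno-late-combine-instructions"]


def gcc_major(version: str) -> int:
    """Extract the major version from a string such as '15.2.0'."""
    match = re.match(r"\s*(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized GCC version string: {version!r}")
    return int(match.group(1))


def extra_cflags(version: str) -> list[str]:
    """Return the workaround flags for GCC 15 only, else an empty list."""
    # Assumption: only GCC 15 needs (and is known to accept) both flags.
    return list(WORKAROUND_FLAGS) if gcc_major(version) == 15 else []


for v in ("13.4.0", "14.3.0", "15.2.0"):
    print(v, " ".join(extra_cflags(v)) or "(no extra flags)")
```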

I'm curious whether this is expected behavior or if there's room to improve
the heuristics in these passes for ARM. A couple of questions come to mind:

- Should if-conversion and late-combine be more conservative on ARM given
how good modern branch prediction is on these chips?
- zstd decompression is pretty memory-bound - could the passes be taking
memory access patterns into account when making optimization decisions?

I'm happy to dig deeper if it would help. I can generate assembly dumps to
see exactly what code patterns are being pessimized, or I can put together
a more detailed bug report with instruction-level analysis. Just let me
know what would be most useful.

The regression is very reproducible - I've seen it consistently across both
the 1.5.7 release and the 1.6.0 dev branch of zstd, and across multiple
test runs. I've got benchmark scripts, raw CSV data from 52+ runs, and I
can test on other ARM platforms if that would help establish the pattern
more clearly.

For reference, here are the full system details:

Apple Silicon M1 Max:
- macOS 26.2 (Darwin 26.2.0)
- GCC 13.4.0, 14.3.0, and 15.2.0 via Homebrew
- Testing with zstd 1.5.7, compression level 3, 10 iterations per run

AWS Graviton 5:
- m9g instances
- ARM Neoverse V3, compiled with -march=armv9-a
- Ubuntu 22.04.5 LTS
- GCC 13.4.0, 14.3.0, and 15.2.0 built from source
- Similar regression patterns observed with same workaround flags

This matters beyond just benchmarks - zstd is pretty widely deployed for
database compression (RocksDB and friends), network protocols like HTTP/3,
backup tools, container images, etc.

If there's anything else I can provide to help track this down, just let me
know, but this seems easy to reproduce.

Best regards,
Corentin Chary
