Hi,
I've been doing some performance benchmarking of zstd on ARM platforms and
found an interesting performance trajectory across GCC versions. GCC-14
brought a significant improvement in decompression performance on ARM
(about +20% over GCC-13), but GCC-15 not only loses that improvement, it
regresses below GCC-13: up to 27% slower than GCC-14 on Apple Silicon.
The interesting part is that disabling if-conversion and the late-combine
pass with -fno-if-conversion -fno-late-combine-instructions almost
completely recovers the GCC-14 performance level. On my M1 Max, GCC-15
drops decompression from the GCC-14 baseline of 1659 MB/s down to 1210
MB/s, but with those flags added it recovers to 1638 MB/s (essentially
back to GCC-14 levels).
Here's what I've tested so far:
Test setup:
- Platforms: Apple Silicon M1 Max (primary), AWS Graviton 5 (m9g.4xlarge)
- Compilers: GCC 13.4.0, GCC 14.3.0, GCC 15.2.0, and Clang 17.0.0 for
comparison
- Build: -O3 -fPIC, plus -march=native on Apple and -march=armv9-a on
Graviton
- Workload: zstd 1.5.7, compression and decompression of a 10MB test file
at level 3, 10 iterations per configuration (sample invocation below)
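Each run has roughly the following shape (my actual runs go through
wrapper scripts that capture CSV; I'm using zstd's MOREFLAGS hook for
extra compiler flags and its built-in -b benchmark mode, and
testfile.10mb below is a placeholder name for the 10MB input):
# Build with the compiler under test; zstd's default CFLAGS already use -O3.
cd zstd
make clean
make CC=gcc-15 MOREFLAGS="-fPIC -march=native"
# Built-in benchmark at level 3; prints both compression and
# decompression throughput. Repeated 10 times for the statistics.
for i in $(seq 1 10); do ./zstd -b3 testfile.10mb; done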
Apple Silicon M1 Max (-march=native) results:
+----------+--------------------------------+-------------+---------------+------------------+-------------------+
| Compiler | Flags                          | Compression | Decompression | vs GCC-13        | vs GCC-14         |
+----------+--------------------------------+-------------+---------------+------------------+-------------------+
| GCC-13   | base                           | 498.7 MB/s  | 1385.2 MB/s   | (baseline)       | +0.2% / -16.5%    |
| GCC-13   | -fno-if-conversion             | 494.9 MB/s  | 1365.3 MB/s   | -0.8% / -1.4%    | -0.6% / -17.7%    |
| GCC-14   | base                           | 497.7 MB/s  | 1659.1 MB/s   | -0.2% / +19.8% * | (baseline)        |
| GCC-14   | -fno-if-conversion             | 500.4 MB/s  | 1649.5 MB/s   | +0.3% / +19.1%   | +0.5% / -0.6%     |
| GCC-15   | base                           | 486.2 MB/s  | 1209.8 MB/s   | -2.5% / -12.7%   | -2.3% / -27.1% ** |
| GCC-15   | -fno-if-conversion             | 493.9 MB/s  | 1548.2 MB/s   | -1.0% / +11.8%   | -0.8% / -6.7%     |
| GCC-15   | -fno-late-combine-instructions | 500.7 MB/s  | 1314.4 MB/s   | +0.4% / -5.1%    | +0.6% / -20.8%    |
| GCC-15   | both flags                     | 495.8 MB/s  | 1638.4 MB/s   | -0.6% / +18.3%   | -0.4% / -1.2% *** |
| Clang 17 | base                           | 491.2 MB/s  | 1340.4 MB/s   | -1.5% / -3.2%    | -1.3% / -19.2%    |
+----------+--------------------------------+-------------+---------------+------------------+-------------------+
* GCC-14 boost   ** GCC-15 regression   *** workaround recovery
(each percentage pair is the compression / decompression delta)
AWS Graviton 5 (Neoverse V3, -march=armv9-a) results:
+----------+-----------------------------------------------------------------------+-------------+---------------+------------------+--------------------+
| Compiler | Flags                                                                 | Compression | Decompression | vs GCC-13        | vs GCC-14          |
+----------+-----------------------------------------------------------------------+-------------+---------------+------------------+--------------------+
| GCC-13   | base                                                                  | 755.4 MB/s  | 1484.2 MB/s   | (baseline)       | +6.8% / -28.2%     |
| GCC-13   | -fno-if-conversion -fno-if-conversion2                                | 839.4 MB/s  | 1498.5 MB/s   | +11.1% / +1.0%   | +18.7% / -27.5%    |
| GCC-14   | base                                                                  | 707.3 MB/s  | 2066.9 MB/s   | -6.4% / +39.2% * | (baseline)         |
| GCC-14   | -fno-if-conversion -fno-if-conversion2                                | 837.8 MB/s  | 2194.8 MB/s   | +10.9% / +47.9%  | +18.4% / +6.2%     |
| GCC-15   | base                                                                  | 659.9 MB/s  | 1497.6 MB/s   | -12.6% / +0.9%   | -6.7% / -27.5% **  |
| GCC-15   | -fno-late-combine-instructions -fno-if-conversion -fno-if-conversion2 | 827.5 MB/s  | 2214.1 MB/s   | +9.5% / +49.2%   | +17.0% / +7.1% *** |
+----------+-----------------------------------------------------------------------+-------------+---------------+------------------+--------------------+
* GCC-14 boost   ** GCC-15 regression   *** workaround recovery
To put numbers on the trajectory: with GCC-13 as the baseline, GCC-14
brought a significant decompression improvement (+19.8%). GCC-15 not only
loses that improvement, it falls below the GCC-13 baseline (-12.7% vs
GCC-13, or -27.1% vs GCC-14).
For GCC-15, disabling if-conversion alone gets decompression back above the
GCC-13 baseline (+11.8% vs GCC-13), and with both -fno-if-conversion and
-fno-late-combine-instructions, it nearly recovers the GCC-14 performance
level (+18.3% vs GCC-13, vs +19.8% for GCC-14). Interestingly, these flags
have minimal effect on GCC-13 and GCC-14 themselves.
So it looks like GCC-14 introduced an optimization that really helped zstd
decompression on ARM (perhaps related to if-conversion changes?), and
GCC-15 then lost that benefit, possibly through the new
late-combine-instructions pass interacting badly with if-conversion.
Disabling those passes in GCC-15 essentially recovers the GCC-14 behavior,
which suggests whatever GCC-14 was doing right is being undone by these
passes in GCC-15.
If you're hitting this issue, the workaround is straightforward:
-O3 -fPIC -march=armv9-a -fno-if-conversion -fno-late-combine-instructions
That gets you back to within 1-2% of GCC-14 performance.
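For example, rebuilding zstd itself with the workaround (again via
MOREFLAGS, zstd's hook for extra compiler flags):
cd zstd && make clean
make CC=gcc-15 \
    MOREFLAGS="-fPIC -march=armv9-a -fno-if-conversion -fno-late-combine-instructions"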
I'm curious whether this is expected behavior or if there's room to improve
the heuristics in these passes for ARM. A couple of questions come to mind:
- Should if-conversion and late-combine be more conservative on ARM, given
how good modern branch prediction is on these cores? (A toy example of the
pattern in question follows this list.)
- zstd decompression is largely memory-bound: could these passes take
memory-access patterns into account when deciding whether to convert?
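To make the first question concrete, here's a toy kernel (made up purely
for illustration, not zstd code) that can be compiled both ways to see the
branch-vs-conditional-select choice in the generated AArch64:
# branchy.c is a made-up kernel, only meant to show the csel-vs-branch choice.
cat > branchy.c <<'EOF'
/* A value-dependent branch of the kind if-conversion can turn into
   csel/csinc on AArch64; -fno-tree-vectorize keeps the scalar form. */
unsigned sum_big(const unsigned char *p, unsigned n) {
    unsigned s = 0;
    for (unsigned i = 0; i < n; i++)
        if (p[i] > 128)
            s += p[i];
    return s;
}
EOF
gcc-15 -O3 -march=armv9-a -fno-tree-vectorize -S branchy.c -o ifcvt.s
gcc-15 -O3 -march=armv9-a -fno-tree-vectorize \
    -fno-if-conversion -fno-late-combine-instructions -S branchy.c -o noifcvt.s
diff ifcvt.s noifcvt.s
My hunch is that the problematic spots in zstd's decoder guard loads and
stores rather than simple arithmetic like this, but assembly dumps of the
real code would be needed to confirm that.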
I'm happy to dig deeper if it would help. I can generate assembly dumps to
see exactly what code patterns are being pessimized, or I can put together
a more detailed bug report with instruction-level analysis. Just let me
know what would be most useful.
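Concretely, something along these lines (huf_decompress.c being the hot
translation unit, the include paths, and the late_combine dump name are
assumptions on my part; -fdump-rtl-all is the safe fallback):
# Side-by-side assembly for a hot decompression TU, with and without
# the workaround flags.
gcc-15 -O3 -fPIC -march=armv9-a -Izstd/lib -Izstd/lib/common -S \
    zstd/lib/decompress/huf_decompress.c -o huf_base.s
gcc-15 -O3 -fPIC -march=armv9-a -Izstd/lib -Izstd/lib/common -S \
    -fno-if-conversion -fno-late-combine-instructions \
    zstd/lib/decompress/huf_decompress.c -o huf_workaround.s
# RTL dumps around the suspect passes: ce1/ce2/ce3 are the RTL
# if-conversion passes; late_combine is the pass new in GCC-15.
gcc-15 -O3 -fPIC -march=armv9-a -Izstd/lib -Izstd/lib/common -c \
    -fdump-rtl-ce1 -fdump-rtl-ce2 -fdump-rtl-ce3 -fdump-rtl-late_combine \
    zstd/lib/decompress/huf_decompress.c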
The regression is very reproducible - I've seen it consistently across both
the 1.5.7 release and the 1.6.0 dev branch of zstd, and across multiple
test runs. I've got benchmark scripts, raw CSV data from 52+ runs, and I
can test on other ARM platforms if that would help establish the pattern
more clearly.
For reference, here are the full system details:
Apple Silicon M1 Max:
- macOS 26.2 (Darwin 26.2.0)
- GCC 13.4.0, 14.3.0, and 15.2.0 via Homebrew
- Testing with zstd 1.5.7, compression level 3, 10 iterations per run
AWS Graviton 5:
- m9g instances
- ARM Neoverse V3, compiled with -march=armv9-a
- Ubuntu 22.04.5 LTS
- GCC 13.4.0, 14.3.0, and 15.2.0 built from source
- Similar regression patterns observed with same workaround flags
This matters beyond benchmarks: zstd is widely deployed for database
compression (RocksDB and friends), network protocols like HTTP/3, backup
tools, container images, and more.
If there's anything else I can provide to help track this down, just let
me know; the regression should be easy to reproduce with the setup above.
Best regards,
Corentin Chary