phongn opened a new pull request, #13167:
URL: https://github.com/apache/trafficserver/pull/13167
## Summary
Add a SIMD-accelerated bulk ASCII tolower helper `ts::memcpy_tolower` in
`tscore`, and use it in place of the byte-at-a-time loop on the URL
canonicalization fast path that produces the cache-key digest. The work is
trivially data-parallel, so a 16-byte SSE2/NEON kernel gives a straightforward
speedup once the input is long enough to amortize the setup. Behavior matches
`ParseRules::ink_tolower`: bytes in `A..Z` map to `a..z`, all others (including
`0x80..0xFF`) pass through unchanged.
## Implementation
- New header `include/tscore/ink_memcpy_tolower.h`. Header-only, inline.
- 16-byte SIMD body on `__SSE2__` (baseline x86_64 ABI) and on `__ARM_NEON`
/ `__aarch64__` (baseline ARMv8 ABI), guarded so neither intrinsic header is
pulled in on the other architecture or on platforms without either.
- Scalar tail handles the trailing 0–15 bytes after the SIMD body and serves
as the sole implementation on platforms without either ISA.
- `src/proxy/hdrs/URL.cc` drops its static-inline `memcpy_tolower` and calls
`ts::memcpy_tolower` instead.
## Performance (Xeon E5-2683 v4, AVX2 box)
| Size | Scalar | SIMD | Speedup |
|---:|---:|---:|---:|
| 4 B | 5.1 ns | 5.5 ns | 0.93× (tail-only; no change) |
| 8 B | 9.9 ns | 8.4 ns | 1.18× |
| **16 B** | 14.8 ns | 3.9 ns | **3.79×** |
| 24 B | 20.0 ns | 10.2 ns | 1.96× (1 SIMD block + 8B tail) |
| **32 B** | 26.8 ns | 5.1 ns | **5.25×** |
| 64 B | 53.9 ns | 7.5 ns | **7.18×** |
| 256 B | 168.7 ns | 22.6 ns | **7.47×** |
| 1024 B | 677 ns | 94.7 ns | **7.15×** |
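URL hot path inputs: HTTP schemes ("http"/"https") are 4–5 bytes and stay on
the scalar tail, so their cost is essentially unchanged (0.93× at 4 B). Typical
host names (16+ bytes) reach the SIMD body; the measured speedup ranges from
about 2× (24 B, one block plus an 8-byte tail) to about 7× (64 B and up).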
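The per-size notes in the table follow from how a length divides into 16-byte blocks plus a scalar tail. A small sketch of that arithmetic, with a hypothetical `split_length` helper that is not part of the PR:

```cpp
#include <cstddef>

namespace sketch {

// Hypothetical helper: how many full 16-byte SIMD blocks an input of
// length n yields, and how many bytes are left for the scalar tail.
struct Split {
  std::size_t simd_blocks;
  std::size_t tail_bytes;
};

constexpr Split split_length(std::size_t n)
{
  return Split{n / 16, n % 16};
}

// 4 B ("http"): no SIMD block, everything goes through the tail.
static_assert(split_length(4).simd_blocks == 0 && split_length(4).tail_bytes == 4, "tail only");
// 24 B: one SIMD block followed by an 8-byte tail.
static_assert(split_length(24).simd_blocks == 1 && split_length(24).tail_bytes == 8, "block + tail");
// 32 B: two SIMD blocks and no tail.
static_assert(split_length(32).simd_blocks == 2 && split_length(32).tail_bytes == 0, "blocks only");

} // namespace sketch
```

Sizes that are exact multiples of 16 skip the tail entirely, which is consistent with the 16 B and 32 B rows doing better than the 24 B row.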
## Test plan
- [x] New microbench `tools/benchmark/benchmark_memcpy_tolower` runs 268
correctness assertions covering:
  - Sizes 0, 1, 5, 15, 16, 17, 23, 31, 32, 33, 64, 257 (bracketing the SIMD
    body) against the scalar reference.
  - An exhaustive sweep of all 256 byte values verifying that only `A..Z`
    are remapped, which guards against any future widening of the case-fold
    range.
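The two kinds of checks listed above could be organized along these lines. This is a sketch under assumptions, not the benchmark's actual code: `count_failures`, `reference_tolower`, and the `tolower_fn` signature are hypothetical names, and it does not reproduce the 268 assertions.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

namespace sketch {

// Assumed signature for the implementation under test.
using tolower_fn = void (*)(char *, const char *, std::size_t);

// Scalar reference used as the oracle for the size sweep.
inline void reference_tolower(char *dst, const char *src, std::size_t n)
{
  for (std::size_t i = 0; i < n; ++i) {
    unsigned char c = static_cast<unsigned char>(src[i]);
    dst[i] = static_cast<char>((c >= 'A' && c <= 'Z') ? (c | 0x20) : c);
  }
}

// Returns the number of failed checks for the given implementation.
inline int count_failures(tolower_fn impl)
{
  int failures = 0;

  // Sizes bracketing the 16-byte SIMD body, compared with the reference.
  const std::size_t sizes[] = {0, 1, 5, 15, 16, 17, 23, 31, 32, 33, 64, 257};
  for (std::size_t n : sizes) {
    std::vector<char> in(n), want(n), got(n);
    for (std::size_t i = 0; i < n; ++i) {
      // Deterministic mix of byte values, including non-ASCII ones.
      in[i] = static_cast<char>(static_cast<unsigned char>(i * 37 + 11));
    }
    reference_tolower(want.data(), in.data(), n);
    impl(got.data(), in.data(), n);
    if (n != 0 && std::memcmp(got.data(), want.data(), n) != 0) {
      ++failures;
    }
  }

  // Exhaustive sweep: every byte value once, checked against the range
  // rule directly so a widened fold range cannot hide behind the oracle.
  char all[256], out[256];
  for (int i = 0; i < 256; ++i) {
    all[i] = static_cast<char>(static_cast<unsigned char>(i));
  }
  impl(out, all, 256);
  for (int i = 0; i < 256; ++i) {
    int expected = (i >= 'A' && i <= 'Z') ? i + 32 : i;
    if (static_cast<unsigned char>(out[i]) != expected) {
      ++failures;
    }
  }
  return failures;
}

} // namespace sketch
```

Passing the implementation as a function pointer lets the same checks run against a scalar or a SIMD variant; the reference itself should report zero failures.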
- [x] `cmake --build build -t format` clean.
- [x] `src/proxy/hdrs/libhdrs.a` builds clean with the updated URL.cc.
- [ ] Jenkins CI green.
## Notes for reviewers
- No new compile flags or dependencies. Just baseline SSE2 (x86_64) and
baseline NEON (ARMv8); both are guaranteed by their respective ABIs.
- The header includes `<emmintrin.h>` / `<arm_neon.h>` only inside the `#if`
that needs them, so other architectures don't pull them in.
- Other call sites that do a similar byte-at-a-time tolower loop
(`HPACK.cc`, `QPACK.cc`, `UrlRewrite.cc`) could also benefit, but they're left
untouched here to keep the PR focused. Easy follow-ups.
- An AVX2 path would add another ~2× for inputs ≥ 32 bytes but would require
either a `-mavx2` build flag (limiting binary portability) or a runtime
resolver. Out of scope here.
🤖 Generated with [Claude Code](https://claude.com/claude-code)