phongn opened a new pull request, #13167:
URL: https://github.com/apache/trafficserver/pull/13167

   ## Summary
   
   Add a SIMD-accelerated bulk ASCII tolower helper `ts::memcpy_tolower` in 
`tscore`, and use it in place of the byte-at-a-time loop on the URL 
canonicalization fast path that produces the cache-key digest. Each byte is 
converted independently, so a 16-byte SSE2/NEON kernel speeds up any input of 
at least 16 bytes; shorter inputs stay on the scalar loop. Behavior matches 
`ParseRules::ink_tolower`: bytes in `A..Z` map to `a..z`, all others (including 
`0x80..0xFF`) pass through unchanged.
   
   ## Implementation
   
   - New header `include/tscore/ink_memcpy_tolower.h`. Header-only, inline.
   - 16-byte SIMD body on `__SSE2__` (baseline x86_64 ABI) and on `__ARM_NEON` 
/ `__aarch64__` (baseline ARMv8 ABI), guarded so neither intrinsic header is 
pulled in on the other architecture or on platforms without either.
   - Scalar tail handles the trailing 0–15 bytes after the SIMD body and serves 
as the sole implementation on platforms without either ISA.
   - `src/proxy/hdrs/URL.cc` drops its static-inline `memcpy_tolower` and calls 
`ts::memcpy_tolower` instead.
   
   ## Performance (Xeon E5-2683 v4, AVX2 box)
   
   | Size | Scalar | SIMD | Speedup | Notes |
   |---:|---:|---:|---:|:---|
   | 4 B | 5.1 ns | 5.5 ns | 0.93× | tail only; scalar path unchanged |
   | 8 B | 9.9 ns | 8.4 ns | 1.18× | |
   | 16 B | 14.8 ns | 3.9 ns | **3.79×** | |
   | 24 B | 20.0 ns | 10.2 ns | 1.96× | 1 SIMD block + 8 B tail |
   | 32 B | 26.8 ns | 5.1 ns | **5.25×** | |
   | 64 B | 53.9 ns | 7.5 ns | **7.18×** | |
   | 256 B | 168.7 ns | 22.6 ns | **7.47×** | |
   | 1024 B | 677 ns | 94.7 ns | **7.15×** | |
   
   URL hot path inputs: HTTP schemes ("http"/"https") are 4–5 bytes and stay on 
the scalar tail, so they see no speedup. Host names of 16 bytes or more reach 
the SIMD body and gain roughly 2–7× in the table above, depending on length 
and on how many tail bytes remain.
   
   ## Test plan
   
   - [x] New microbench `tools/benchmark/benchmark_memcpy_tolower` runs 268 
correctness assertions covering:
     - Sizes 0, 1, 5, 15, 16, 17, 23, 31, 32, 33, 64, 257 (bracketing the 16- and 
32-byte block boundaries) against the scalar reference.
     - An exhaustive sweep of all 256 byte values verifying that only `A..Z` 
are remapped — guards against any future widening of the case-fold range.
   - [x] `cmake --build build -t format` clean.
   - [x] `src/proxy/hdrs/libhdrs.a` builds clean with the updated URL.cc.
   - [ ] Jenkins CI green.
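
For illustration, the two correctness checks listed above could be organised along these lines. This is a sketch, not the contents of `benchmark_memcpy_tolower`: the names `ref_tolower` and `count_failures`, the input pattern, and the 32-byte sweep buffer are all assumptions. The size list is the one given in the test plan.

```cpp
#include <cstddef>
#include <vector>

namespace sketch // hypothetical namespace
{
// Scalar reference: only 'A'..'Z' are remapped.
inline char
ref_tolower(char c)
{
  return (c >= 'A' && c <= 'Z') ? static_cast<char>(c + ('a' - 'A')) : c;
}

// Runs one check per size plus one per byte value against any callable
// fn(dst, src, n), and returns how many checks disagreed with the reference.
template <typename Fn>
int
count_failures(Fn &&fn)
{
  int failures = 0;

  // Sizes on both sides of the 16- and 32-byte boundaries.
  const std::size_t sizes[] = {0, 1, 5, 15, 16, 17, 23, 31, 32, 33, 64, 257};
  for (std::size_t n : sizes) {
    std::vector<char> src(n), dst(n), want(n);
    for (std::size_t i = 0; i < n; ++i) {
      // Cycles through 'A'..'z', including the punctuation between 'Z' and 'a'.
      src[i]  = static_cast<char>('A' + (i * 7) % 58);
      want[i] = ref_tolower(src[i]);
    }
    fn(dst.data(), src.data(), n);
    if (dst != want) {
      ++failures;
    }
  }

  // Exhaustive sweep: every byte value, repeated so that a 16-byte block
  // path (if any) sees it as well as the tail.
  for (int b = 0; b < 256; ++b) {
    const char        c = static_cast<char>(b);
    std::vector<char> src(32, c), dst(32), want(32, ref_tolower(c));
    fn(dst.data(), src.data(), src.size());
    if (dst != want) {
      ++failures;
    }
  }
  return failures;
}
} // namespace sketch
```

Passing the function under test as `fn` makes the same harness usable for the scalar loop and for a SIMD implementation. An implementation that folds any byte outside `A..Z` fails the sweep.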
   
   ## Notes for reviewers
   
   - No new compile flags or dependencies. Just baseline SSE2 (x86_64) and 
baseline NEON (ARMv8); both are guaranteed by their respective ABIs.
   - The header includes `<emmintrin.h>` / `<arm_neon.h>` only inside the `#if` 
that needs them, so other architectures don't pull them in.
   - Other call sites that do a similar byte-at-a-time tolower loop 
(`HPACK.cc`, `QPACK.cc`, `UrlRewrite.cc`) could also benefit, but they're left 
untouched here to keep the PR focused. Easy follow-ups.
   - An AVX2 path could add up to another ~2× for inputs ≥ 32 bytes but would 
require either a `-mavx2` build flag (limiting binary portability) or a 
runtime resolver. Out of scope here.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)
   

