Re: [I] perf: Use AArch64 SVE gather to speed up RLE dictionary decoding [arrow-rs]

via GitHub Fri, 29 May 2026 18:57:36 -0700


wuleiwuleiwulei commented on issue #10036:
URL: https://github.com/apache/arrow-rs/issues/10036#issuecomment-4581269323


   > I think this code path was optimized for auto-vectorization recently in 
[#9746](https://github.com/apache/arrow-rs/pull/9746). At least on x86 it can 
use gather instructions, given the right target features. The bounds checks are 
preventing even more optimized code, how do you plan to handle these checks in 
the SVE version?
   
   Thanks — I dug into #9746 and it changes the picture, so let me re-frame.
   
   #9746 already hoists the bounds check the right way: the per-element 
`iter().all(|i| i < dict_len)` becomes a vectorizable u32 max-reduction, and 
the gather then uses `get_unchecked` after that single check. So on x86 with 
AVX2/AVX-512 in the target features, LLVM autovectorizes the gather and the 
bounds checks are out of the hot path. That resolves the safety concern for 
free — and it also means my original numbers (measured against the pre-#9746 
scalar loop) are stale, so I'll redo them against current main.
   
   Where I still expect an SVE path to add value:
   
   - On x86, the autovectorized `get_unchecked` gather lowers to a hardware 
gather (`vpgatherdd` etc.) when AVX2/512 is in the target features.
   - On AArch64, base Armv8-A / NEON has no gather instruction; the only Arm 
gather is in SVE, which is an *optional* extension and not part of the aarch64 
baseline. A portable `aarch64-unknown-linux-gnu` binary therefore won't have 
`+sve` in its target features, so even after #9746 the autovectorizer cannot 
emit an SVE gather and falls back to scalar loads. Runtime-detected SVE asm is 
meant to fill exactly that gap for portable binaries, without requiring `-C 
target-cpu=...` at build time.
   
   On bounds checks specifically: I'd reuse #9746's existing max-reduction 
guard unchanged and place the SVE gather at the same `get_unchecked` site, so 
the SVE path adds no new per-element checks and no new unsoundness — the single 
hoisted check already proves every index in the (now 16-wide) chunk is in range.
   
   Before pushing anything I'll:
   1. Rebase on current main (post-#9746) and re-measure, so "before" already 
includes the autovectorization win and the reported delta is purely the SVE 
contribution.
   2. Inspect the disassembly of the post-#9746 path on a portable aarch64 
build to confirm whether an SVE gather is emitted, and share it here.
   
   If the re-baselined numbers don't show a meaningful SVE win on top of #9746, 
I'll close this rather than add asm for little gain. Does that plan sound right 
to you?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [I] perf: Use AArch64 SVE gather to speed up RLE dictionary decoding [arrow-rs]

Reply via email to