wuleiwuleiwulei commented on issue #10036: URL: https://github.com/apache/arrow-rs/issues/10036#issuecomment-4581269323
> I think this code path was optimized for auto-vectorization recently in [#9746](https://github.com/apache/arrow-rs/pull/9746). At least on x86 it can use gather instructions, given the right target features. The bounds checks are preventing even more optimized code, how do you plan to handle these checks in the SVE version? Thanks — I dug into #9746 and it changes the picture, so let me re-frame. #9746 already hoists the bounds check the right way: the per-element `iter().all(|i| i < dict_len)` becomes a vectorizable u32 max-reduction, and the gather then uses `get_unchecked` after that single check. So on x86 with AVX2/AVX-512 in the target features, LLVM autovectorizes the gather and the bounds checks are out of the hot path. That resolves the safety concern for free — and it also means my original numbers (measured against the pre-#9746 scalar loop) are stale, so I'll redo them against current main. Where I still expect an SVE path to add value: - On x86, the autovectorized `get_unchecked` gather lowers to a hardware gather (`vpgatherdd` etc.) when AVX2/512 is in the target features. - On AArch64, base Armv8-A / NEON has no gather instruction; the only Arm gather is in SVE, which is an *optional* extension and not part of the aarch64 baseline. A portable `aarch64-unknown-linux-gnu` binary therefore won't have `+sve` in its target features, so even after #9746 the autovectorizer cannot emit an SVE gather and falls back to scalar loads. Runtime-detected SVE asm is meant to fill exactly that gap for portable binaries, without requiring `-C target-cpu=...` at build time. On bounds checks specifically: I'd reuse #9746's existing max-reduction guard unchanged and place the SVE gather at the same `get_unchecked` site, so the SVE path adds no new per-element checks and no new unsoundness — the single hoisted check already proves every index in the (now 16-wide) chunk is in range. Before pushing anything I'll: 1. Rebase on current main (post-#9746) and re-measure, so "before" already includes the autovectorization win and the reported delta is purely the SVE contribution. 2. Inspect the disassembly of the post-#9746 path on a portable aarch64 build to confirm whether an SVE gather is emitted, and share it here. If the re-baselined numbers don't show a meaningful SVE win on top of #9746, I'll close this rather than add asm for little gain. Does that plan sound right to you? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
