wuleiwuleiwulei opened a new issue, #10036:
URL: https://github.com/apache/arrow-rs/issues/10036

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   When reading dictionary-encoded columns from Parquet, 
`RleDecoder::get_batch_with_dict` (in `parquet/src/encodings/rle.rs`) is on a 
very hot path. In the bit-packed branch, the decoder unpacks the indices into a 
scratch buffer and then materializes the output with a scalar, per-element 
dictionary lookup:
   
   ```rust
   buffer[values_read..values_read + num_values]
       .iter_mut()
       .zip(index_buf[..num_values].iter())
       .for_each(|(b, i)| b.clone_from(&dict[*i as usize]));
   ```
   
   This is a sequence of data-dependent loads (a gather), and it dominates 
decode time for dictionary columns with primitive value types. On AArch64 CPUs 
that implement SVE (e.g. Kunpeng 920B / Neoverse-class server cores), this 
loop leaves the hardware gather capability completely unused, so dictionary 
decode is slower than necessary on this architecture.
   
   `perf` profiling of a TPC-H workload on AArch64 SVE hardware shows this 
gather as one of the top hotspots in the Parquet read path for 
dictionary-encoded primitive columns.
   
   **Describe the solution you'd like**
   
   Add an AArch64-only SVE fast path for the dictionary gather in 
`get_batch_with_dict`, keeping the scalar implementation as the fallback:
   
   - A small `#[cfg(target_arch = "aarch64")]` module that gathers 4-byte 
(i32/f32) and 8-byte (i64/f64) dictionary values using SVE indexed loads 
(`ld1w` / `ld1d` with a vector index), processing one vector-length of elements 
per iteration via `whilelt` predication (vector-length agnostic).
   - Runtime SVE detection via 
`std::arch::is_aarch64_feature_detected!("sve")`, cached in an `AtomicU8` so 
the check amortizes to a single relaxed load on the hot path.
   - The fast path only engages when `size_of::<T>()` is 4 or 8; all other 
types, and all non-AArch64 / non-SVE targets, fall back to the existing scalar 
`clone_from` loop. Results are bit-for-bit identical to the scalar path; only 
the gather is accelerated.
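
   The vector-length-agnostic loop in the first bullet can be modeled in 
portable scalar Rust. The sketch below is not SVE code and not the proposed 
implementation: `gather_vla_model` and its `lanes` parameter are hypothetical, 
with `lanes` standing in for the hardware vector length in 32-bit elements. It 
only illustrates how a `whilelt`-style predicate covers the tail without a 
separate scalar epilogue.

```rust
/// Scalar model of a predicated gather loop (illustrative only, not SVE).
/// `lanes` is a stand-in for the hardware vector length in elements.
fn gather_vla_model(dict: &[u32], indices: &[i32], out: &mut [u32], lanes: usize) {
    assert!(lanes > 0, "vector length must be non-zero");
    assert!(out.len() >= indices.len(), "output too small");
    let n = indices.len();
    let mut base = 0;
    while base < n {
        for lane in 0..lanes {
            // `whilelt` analogue: a lane is active only while base + lane < n.
            let pos = base + lane;
            if pos < n {
                // Gather analogue: each active lane loads dict[index].
                out[pos] = dict[indices[pos] as usize];
            }
        }
        // Advance by one full vector, whatever its length happens to be.
        base += lanes;
    }
}
```

   Calling this with the same `dict` and `indices` but different `lanes` 
values (for example 4, 8 and 16) should produce the same output, which is the 
property that lets one binary run on SVE hardware of any vector width.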
   
   This is purely additive: no public API change, and no behaviour change on 
any existing platform.
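
   A minimal sketch of the cached runtime detection and size-based dispatch 
described above, assuming nothing about the actual patch. All names here 
(`SVE_STATE`, `sve_available`, `use_sve_path`, `gather`, `gather_scalar`) are 
hypothetical, and the SVE kernel itself is omitted, so the sketch always ends 
up in the scalar loop. On non-AArch64 targets the detection function compiles 
to a constant `false`.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Cache states for the detection result (hypothetical encoding).
const UNKNOWN: u8 = 0;
const ABSENT: u8 = 1;
const PRESENT: u8 = 2;

static SVE_STATE: AtomicU8 = AtomicU8::new(UNKNOWN);

#[cfg(target_arch = "aarch64")]
fn detect_sve() -> bool {
    std::arch::is_aarch64_feature_detected!("sve")
}

#[cfg(not(target_arch = "aarch64"))]
fn detect_sve() -> bool {
    false
}

/// After the first call this costs a single relaxed atomic load.
fn sve_available() -> bool {
    match SVE_STATE.load(Ordering::Relaxed) {
        PRESENT => true,
        ABSENT => false,
        _ => {
            // Racing threads may both detect; they store the same answer.
            let found = detect_sve();
            SVE_STATE.store(if found { PRESENT } else { ABSENT }, Ordering::Relaxed);
            found
        }
    }
}

/// Only 4-byte and 8-byte element types are eligible for the fast path.
fn use_sve_path<T>() -> bool {
    matches!(std::mem::size_of::<T>(), 4 | 8) && sve_available()
}

/// Scalar fallback, equivalent to the existing `clone_from` loop.
fn gather_scalar<T: Clone>(dict: &[T], indices: &[i32], out: &mut [T]) {
    for (slot, &i) in out.iter_mut().zip(indices) {
        slot.clone_from(&dict[i as usize]);
    }
}

fn gather<T: Clone>(dict: &[T], indices: &[i32], out: &mut [T]) {
    if use_sve_path::<T>() {
        // A real implementation would hand off to the SVE kernel and return.
        // This sketch has no such kernel, so it falls through to the scalar loop.
    }
    gather_scalar(dict, indices, out);
}
```

   Relaxed ordering is enough in this sketch because the cached value is a 
pure function of the hardware: a thread that still sees `UNKNOWN` simply 
repeats the detection and stores the same result.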
   
   **Measured improvement.** Benchmarked on Kunpeng 920B (SVE, 256-bit) over 
the full TPC-H query set against ~140 GB of data. Build flags were identical 
for the baseline and the patched build; the only difference is this SVE fast 
path. Per-function times were measured with `perf`, aggregated by symbol; each 
value is the mean of 3 runs. The SVE path was confirmed active at runtime via 
`is_aarch64_feature_detected!("sve")`.
   
   - **Target function (`get_batch_with_dict`), summed over all 22 queries:** 
**2875 ms → 1622 ms** — a **43.6% reduction (1.77× faster)** on the optimized 
kernel.
   - **End-to-end TPC-H (22 queries):** **+1.83% overall** (table below); 20/22 
queries are faster, and the 2 regressions (Q3 at −1.59%, Q10 at −1.01%) are 
within run-to-run noise. The end-to-end figure is smaller because dictionary 
decode is only a fraction of total query time; the kernel-level number above 
isolates the actual win.
   
   | Query | before (s) | after (s) | Δ (faster) |
   | ----- | ---------- | --------- | ---------- |
   | Q1    | 6.037      | 5.987     | +0.83%     |
   | Q2    | 1.208      | 1.190     | +1.45%     |
   | Q3    | 5.584      | 5.673     | −1.59%     |
   | Q4    | 3.190      | 3.127     | +1.98%     |
   | Q5    | 4.432      | 4.407     | +0.57%     |
   | Q6    | 1.473      | 1.398     | +5.06%     |
   | Q7    | 6.441      | 6.265     | +2.73%     |
   | Q8    | 6.057      | 5.983     | +1.23%     |
   | Q9    | 23.494     | 22.921    | +2.44%     |
   | Q10   | 6.302      | 6.366     | −1.01%     |
   | Q11   | 2.465      | 2.436     | +1.18%     |
   | Q12   | 3.070      | 2.953     | +3.82%     |
   | Q13   | 9.954      | 9.708     | +2.46%     |
   | Q14   | 3.676      | 3.631     | +1.24%     |
   | Q15   | 2.862      | 2.798     | +2.25%     |
   | Q16   | 3.458      | 3.402     | +1.62%     |
   | Q17   | 3.349      | 3.327     | +0.66%     |
   | Q18   | 10.431     | 10.278    | +1.46%     |
   | Q19   | 4.756      | 4.610     | +3.07%     |
   | Q20   | 4.888      | 4.845     | +0.87%     |
   | Q21   | 50.797     | 49.641    | +2.28%     |
   | Q22   | 3.937      | 3.837     | +2.55%     |
   | Total | 167.86     | 164.78    | +1.83%     |
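
   For readers who want to check the percentages, the arithmetic behind the 
kernel figure and the Δ column can be reproduced with two small helpers. The 
function names are hypothetical; the inputs are the totals reported above.

```rust
/// Percentage by which `after` is lower than `before`.
fn percent_reduction(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

/// How many times faster `after` is than `before`.
fn speedup(before: f64, after: f64) -> f64 {
    before / after
}
```

   With the kernel totals, `percent_reduction(2875.0, 1622.0)` rounds to 43.6 
and `speedup(2875.0, 1622.0)` rounds to 1.77. With the Total row, 
`percent_reduction(167.86, 164.78)` rounds to 1.83.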
   
   **Describe alternatives you've considered**
   
   - **Rely on autovectorization** — the compiler does not turn this 
arbitrary-index gather into SVE gather instructions.
   - **`std::simd` / portable SIMD** — gather with arbitrary indices is not 
available on stable, and portable fixed-width SIMD cannot express SVE's 
vector-length-agnostic (VLA) gather.
   - **Stable `std::arch` SVE intrinsics** — SVE intrinsics are still unstable 
in Rust, which is why a small, audited `asm!` block is used; it can be swapped 
for intrinsics once they stabilize. This is the main difference from existing 
SIMD in the repo: `arrow-arith`'s AVX paths are `target_feature`-gated at 
compile time and `parquet`'s `simdutf8` path is feature-gated, whereas here 
runtime detection is needed because SVE availability and vector width aren't 
known at compile time for portable binaries.
   - **NEON** — fixed-width NEON has no true gather instruction, so it offers 
little benefit for this access pattern.
   - **Leave as-is** — simplest, but forfeits a meaningful win on a growing 
class of AArch64 SVE server CPUs.
   
   **Additional context**
   
   - Scope is limited to `RleDecoder::get_batch_with_dict`; the encoder, `get`, 
`get_batch`, and `skip` are untouched.
   - Prior art in the repo for arch-specific SIMD acceleration: 
`arrow-arith/src/aggregate.rs` (AVX512/AVX dispatch) and 
`parquet/src/util/utf8.rs` (`simdutf8`). This proposal follows the same spirit, 
adding runtime-detected SVE for AArch64.
   - The SVE path uses `unsafe` inline assembly. Safety contract for each 
helper: `indices` must point to `count` valid `i32`s, every index must be 
non-negative and less than the dictionary length so that `dict` is valid for 
reads at each of them, and `output` must have `count` writable slots; the 
public entry point only dispatches into it after confirming SVE availability 
and that `size_of::<T>()` is 4 or 8.
   - I'm happy to open a PR with the implementation, an SVE-specific test plus 
a Criterion benchmark, and CI notes for exercising the AArch64 path. I've 
implemented this with runtime detection (zero cost on other targets, automatic 
on SVE hardware); happy to gate it behind a Cargo feature instead if you'd 
prefer a more conservative default.
   - This is my first contribution to arrow-rs, so apologies in advance if I've 
missed any conventions — happy to adjust the issue/PR format, benchmarks, or 
anything else per your guidance. Just let me know.
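
   To make the safety contract above concrete, here is a hedged sketch of a 
safe wrapper that checks the contract before calling an unsafe helper. Both 
functions are hypothetical: `gather_u32_raw` is a scalar stand-in for the 
`asm!` kernel, and the up-front index scan is one possible way to establish 
the precondition, not necessarily what the PR would do.

```rust
/// Scalar stand-in for an unsafe gather helper (hypothetical name and signature).
///
/// # Safety
/// - `indices` must point to `count` readable `i32` values.
/// - Every index must be non-negative and a valid element offset into `dict`.
/// - `output` must be valid for writes of `count` elements.
unsafe fn gather_u32_raw(dict: *const u32, indices: *const i32, output: *mut u32, count: usize) {
    for k in 0..count {
        // SAFETY: guaranteed by the caller per the contract above.
        unsafe {
            let i = *indices.add(k) as usize;
            *output.add(k) = *dict.add(i);
        }
    }
}

/// Safe entry point: validates the contract, then calls the unsafe helper.
fn gather_u32_checked(dict: &[u32], indices: &[i32], output: &mut [u32]) -> Result<(), String> {
    if output.len() < indices.len() {
        return Err(format!("output has {} slots, need {}", output.len(), indices.len()));
    }
    if let Some(bad) = indices.iter().find(|&&i| i < 0 || i as usize >= dict.len()) {
        return Err(format!("index {bad} out of range for dictionary of {}", dict.len()));
    }
    // SAFETY: lengths and index bounds were checked just above.
    unsafe {
        gather_u32_raw(dict.as_ptr(), indices.as_ptr(), output.as_mut_ptr(), indices.len());
    }
    Ok(())
}
```

   The extra pass over `indices` has a cost of its own, so whether to 
validate up front or rely on invariants established earlier in the decoder is 
a design choice for review.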

