[PR] perf(parquet): vectorize ARM64 NEON bool unpacking for ~4x throughput [arrow-go]

via GitHub Fri, 27 Mar 2026 09:45:41 -0700


zeroshade opened a new pull request, #731:
URL: https://github.com/apache/arrow-go/pull/731


   ### Rationale for this change
   Improve the `bytes_to_bools` implementation on ARM64 by using actual NEON SIMD
   instructions. The result is a ~4x throughput improvement on ARM64.
   
   ### What changes are included in this PR?
   Rewrote the assembly using the `DUP` + `CMTST` NEON pattern. For each input byte:
   
   1. `ld1r {v2.8b}, [ptr]`: broadcast one input byte across all 8 SIMD lanes
   2. `cmtst v2.8b, v2.8b, v0.8b`: parallel bit-test against the mask [1,2,4,8,16,32,64,128]
   3. `and v2.8b, v2.8b, v1.8b`: normalize 0xFF → 0x01 for valid Go bool values
   4. `st1 {v2.8b}, [ptr], #8`: store 8 output bools at once with post-increment
   
   A scalar tail handles the last few bits when fewer than 8 output slots remain.
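
The per-byte pattern above can be modeled in plain Go to make the lane semantics concrete. This is only an illustrative sketch, not the code in this PR: `bytesToBoolsModel` and `bitMask` are hypothetical names, and the inner loop stands in for what the NEON instructions do across 8 lanes at once.

```go
package main

import "fmt"

// bitMask mirrors the mask vector [1,2,4,8,16,32,64,128]: lane i tests bit i.
var bitMask = [8]byte{1, 2, 4, 8, 16, 32, 64, 128}

// bytesToBoolsModel is a hypothetical pure-Go model of the vectorized loop.
// It unpacks bits LSB-first from in into out and stops once out is full.
func bytesToBoolsModel(in []byte, out []bool) {
	i := 0 // index of the next input byte
	o := 0 // index of the next output slot

	// Main loop: one input byte fills 8 output slots, like one
	// broadcast / bit-test / normalize / store round.
	for ; i < len(in) && len(out)-o >= 8; i++ {
		b := in[i] // step 1: every lane sees the same byte
		for lane := 0; lane < 8; lane++ {
			// steps 2-3: test one bit per lane and normalize to a bool
			out[o+lane] = b&bitMask[lane] != 0
		}
		o += 8 // step 4: advance the output by 8 slots
	}

	// Scalar tail: fewer than 8 output slots remain.
	if i < len(in) {
		b := in[i]
		for lane := 0; o < len(out) && lane < 8; lane, o = lane+1, o+1 {
			out[o] = b&bitMask[lane] != 0
		}
	}
}

func main() {
	out := make([]bool, 11) // 8 slots from the main loop, 3 from the tail
	bytesToBoolsModel([]byte{0b10100101, 0b00000110}, out)
	fmt.Println(out)
}
```

With 11 output slots, the first byte goes through the 8-wide path and the remaining 3 slots are filled by the tail, which is the case the scalar fallback exists for.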
   
   ### Are these changes tested?
   All existing tests continue to pass, and new tests were added to further validate the change:
   
   - Added TestBytesToBoolsCorrectness: validates every bit position against the reference Go implementation for sizes 1–1024 bytes
   - Added TestBytesToBoolsOutlenSmaller: edge case where the output is smaller than 8× the input
   - Added BenchmarkBytesToBools: parametric benchmark at 64B, 256B, 1KB, 4KB, and 16KB
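
As a rough illustration of the kind of check these tests describe, the sketch below compares a routine against a bit-at-a-time reference for input sizes 1 to 1024, including an output length that ends mid-byte. It is not the test code from this PR: `unpackFunc`, `referenceUnpack`, and `firstMismatch` are hypothetical names, and the reference is used as a stand-in for the routine under test.

```go
package main

import (
	"fmt"
	"math/rand"
)

// unpackFunc is a hypothetical signature for the routine under test.
type unpackFunc func(in []byte, out []bool)

// referenceUnpack extracts one bit at a time, LSB-first.
func referenceUnpack(in []byte, out []bool) {
	for i := range out {
		if i/8 >= len(in) {
			return
		}
		out[i] = (in[i/8]>>(i%8))&1 == 1
	}
}

// firstMismatch runs fn and the reference on the same input and returns the
// index of the first differing output slot, or -1 if they agree.
func firstMismatch(fn unpackFunc, in []byte, outLen int) int {
	got := make([]bool, outLen)
	want := make([]bool, outLen)
	fn(in, got)
	referenceUnpack(in, want)
	for i := range want {
		if got[i] != want[i] {
			return i
		}
	}
	return -1
}

func main() {
	rng := rand.New(rand.NewSource(1))  // fixed seed keeps failures reproducible
	impl := unpackFunc(referenceUnpack) // stand-in; swap in the routine under test

	for size := 1; size <= 1024; size++ {
		in := make([]byte, size)
		rng.Read(in)
		// Full-length output, plus a shorter output that ends mid-byte.
		for _, outLen := range []int{size * 8, size*8 - 3} {
			if idx := firstMismatch(impl, in, outLen); idx >= 0 {
				fmt.Printf("size=%d outLen=%d: mismatch at bit %d\n", size, outLen, idx)
				return
			}
		}
	}
	fmt.Println("all sizes agree")
}
```

Reporting the first mismatching bit index, rather than a plain pass/fail, makes it easier to tell a lane-ordering bug from a tail-handling bug.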
   
   ### Are there any user-facing changes?
   No, this is purely a performance optimization:
   
   *Benchmark Results (Apple M4, darwin/arm64)*
   ```
                                  baseline (scalar)   optimized (NEON)
                                      sec/op              sec/op    vs base
   
   BytesToBools/bytes=64-10           82.69n              21.57n     -73.91% (p=0.008)
   BytesToBools/bytes=256-10         333.60n              86.43n     -74.09% (p=0.008)
   BytesToBools/bytes=1K-10           1.322µ              327.4n     -75.23% (p=0.008)
   BytesToBools/bytes=4K-10           5.293µ              1.297µ     -75.50% (p=0.008)
   BytesToBools/bytes=16K-10         21.343µ              5.184µ     -75.71% (p=0.008)
   geomean                            1.327µ              333.1n     -74.90%
   ```
   
   Throughput: 735 MiB/s → 2,863 MiB/s (+298%)
   Zero allocations in both versions. All results statistically significant.
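
For readers relating the "vs base" column to the ~4x figure in the title, the arithmetic can be sketched as follows. The helper names `speedup` and `deltaPercent` are hypothetical; the inputs are the geomean timings from the table above, converted to nanoseconds.

```go
package main

import "fmt"

// speedup returns how many times faster the new timing is than the old one.
func speedup(oldNs, newNs float64) float64 {
	return oldNs / newNs
}

// deltaPercent returns the relative change in time per op, in percent.
// Negative values mean the new version is faster.
func deltaPercent(oldNs, newNs float64) float64 {
	return (newNs - oldNs) / oldNs * 100
}

func main() {
	// Geomean row from the table: 1.327µs -> 333.1ns.
	oldNs, newNs := 1327.0, 333.1
	fmt.Printf("delta:   %.2f%%\n", deltaPercent(oldNs, newNs))
	fmt.Printf("speedup: %.2fx\n", speedup(oldNs, newNs))
}
```

A time reduction of about 75% per op corresponds to a speedup of roughly 1 / (1 - 0.75) = 4x, which is why the two ways of stating the result look so different.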


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
