Hi Dan,
Thanks for proposing this.
The #1 requirement I have is to use xsimd for this, like we do for other
SIMD optimizations in Arrow.
https://xsimd.readthedocs.io/
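Roughly, your find_avx2 below would look like this on top of xsimd (untested
sketch; the explicit avx2 batch type, the per-lane variable shift, and
xsimd::all are things to double-check against the xsimd version we vendor):

  using batch = xsimd::batch<uint32_t, xsimd::avx2>;  // 8 x 32-bit lanes

  bool find_xsimd(uint64_t hash) const {
    const uint32_t bucket_index = uint32_t(((hash >> 32) * num_blocks_) >> 32);
    const batch key(uint32_t(hash));
    const batch salt = batch::load_unaligned(SALT);
    // mask[i] = 1u << ((key * SALT[i]) >> 27), as in the intrinsics sketch
    const batch mask = batch(1u) << ((key * salt) >> 27);
    const batch block = batch::load_unaligned(
        reinterpret_cast<const uint32_t*>(data_ + bucket_index * 32));
    // Hit iff every mask bit is already set in the block.
    return xsimd::all((block & mask) == mask);
  }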
Regards
Antoine.
On 16/05/2026 at 02:55, Dan Mattheiss wrote:
Hello everyone,
*cc @mapleFU and @pitrou, who’ve been the most active reviewers on
parquet-cpp bloom filter code recently.*
arrow-go shipped AVX2/SSE4/NEON SBBF probes in 18.3.0
(apache/arrow-go#336). Impala and Kudu have had AVX2 SBBF probes for years.
arrow-cpp still ships the scalar reference in
BlockSplitBloomFilter::FindHash. I’d like to discuss porting the arrow-go
approach to C++ before opening a PR. This is a read-path implementation
change with no on-disk format implications, so I’m raising it here rather
than on parquet-dev.
*Scope*
On-disk format unchanged. SALT, XXH64, bucket index unchanged. Only the
probe implementation changes — an AVX2 variant selected at runtime via
arrow::internal::CpuInfo, with the existing scalar path retained for
non-AVX2 builds. No API change, no behavior change for any reader.
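For concreteness, the dispatch I have in mind is roughly the shape below.
find_scalar and find_avx2 are placeholder names, and the guard macro and
CpuInfo call would mirror whatever Arrow already does for other
runtime-dispatched kernels:

  bool BlockSplitBloomFilter::FindHash(uint64_t hash) const {
  #if defined(ARROW_HAVE_RUNTIME_AVX2)
    // Runtime check; in practice resolved once per filter, not per probe.
    if (::arrow::internal::CpuInfo::GetInstance()->IsSupported(
            ::arrow::internal::CpuInfo::AVX2)) {
      return find_avx2(hash);
    }
  #endif
    return find_scalar(hash);  // existing reference path, unchanged
  }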
*The change*
About a dozen instructions, no branches:
bool find_avx2(uint64_t hash) const {
  // Bucket selection is unchanged: multiply-shift on the high 32 bits.
  uint32_t bucket_index = uint32_t(((hash >> 32) * num_blocks_) >> 32);
  uint32_t key = uint32_t(hash);
  // Broadcast the key and multiply it by all eight SALT constants at once.
  __m256i hash_v = _mm256_set1_epi32(int32_t(key));
  // SALT is assumed 32-byte aligned here; otherwise use _mm256_loadu_si256.
  __m256i salt = _mm256_load_si256(reinterpret_cast<const __m256i*>(SALT));
  __m256i prod = _mm256_mullo_epi32(hash_v, salt);
  // The top five bits of each product pick one bit within each 32-bit word.
  __m256i shift = _mm256_srli_epi32(prod, 27);
  __m256i ones = _mm256_set1_epi32(1);
  __m256i mask = _mm256_sllv_epi32(ones, shift);
  // One 256-bit load of the bucket, then a single carry-flag test.
  const __m256i* blk =
      reinterpret_cast<const __m256i*>(data_ + bucket_index * 32);
  return _mm256_testc_si256(_mm256_loadu_si256(blk), mask);
}
_mm256_mullo_epi32(hash_v, salt) produces eight 32-bit products, each equal
to the scalar key * SALT[i]. _mm256_testc_si256 returns 1 iff (~block & mask)
== 0, which is the same condition the scalar path checks with its eight
AND-with-early-exit iterations.
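For reference, the scalar loop it has to match is, paraphrased (not the
verbatim Arrow source):

  bool find_scalar(uint64_t hash) const {
    const uint32_t bucket_index = uint32_t(((hash >> 32) * num_blocks_) >> 32);
    const uint32_t key = uint32_t(hash);
    const uint32_t* block =
        reinterpret_cast<const uint32_t*>(data_ + bucket_index * 32);
    for (int i = 0; i < 8; ++i) {
      const uint32_t mask = 1u << ((key * SALT[i]) >> 27);
      if ((block[i] & mask) == 0) return false;  // early exit on a missing bit
    }
    return true;
  }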
*Correctness*
Bit-identical to the existing scalar reference: a diff test against the
vendored FindHash from cpp/src/parquet/bloom_filter.cc reports 0 mismatches
across 167M (query, filter) pairs. The same diff test would run in CI to
guard against drift between the two paths.
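The CI check itself would be small; something along these lines (names
hypothetical, the real harness lives in the repo linked at the bottom):

  #include <cassert>
  #include <cstdint>
  #include <random>

  // Feed both probe paths the same random hashes against the same filter
  // and fail on any divergence.
  void DiffTest(const BlockSplitBloomFilter& filter, size_t num_queries) {
    std::mt19937_64 rng(42);  // fixed seed so CI failures reproduce
    for (size_t i = 0; i < num_queries; ++i) {
      const uint64_t hash = rng();
      assert(filter.find_scalar(hash) == filter.find_avx2(hash));
    }
  }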
*Numbers*
Intel Xeon @ 2.8 GHz (Cascade Lake-class, 33 MiB L3, virtualised), g++ -O3
-mavx2 -mbmi2. Post-hash probe latency; XXH64 cost excluded from the timed
loop.
|Regime |Filter footprint|scalar |AVX2 single-key |AVX2 4-way bulk|
|------------------|----------------|--------|----------------|---------------|
|Small (in-cache) |0.5 MiB |12.73 ns|3.70 ns (3.4×) |2.36 ns (5.4×) |
|Medium (out-of-L3)|128 MiB |35.84 ns|29.18 ns (1.2×) |22.79 ns (1.6×)|
|Large (deep DRAM) |1 GiB |41.41 ns|35.90 ns (1.15×)|27.75 ns (1.5×)|
In-cache, the win is ALU collapse: eight dependent loads with branches become
two SIMD ops after one cache-line load. Out of L3, both paths wait on the
same DRAM load and the per-probe gain narrows; the 4-way bulk path keeps four
cache misses in flight and recovers most of it. This matters most for WHERE
col IN (...) and large-table scans that touch many row-group filters.
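For context, the 4-way bulk number comes from a probe of roughly this shape
(hypothetical sketch, not the verbatim benchmark code): issue all four block
loads before any mask math so the misses overlap.

  void find_bulk4_avx2(const uint64_t* hashes, bool* out) const {
    const __m256i salt =
        _mm256_load_si256(reinterpret_cast<const __m256i*>(SALT));
    const __m256i ones = _mm256_set1_epi32(1);
    __m256i blocks[4];
    for (int i = 0; i < 4; ++i) {
      const uint32_t bucket = uint32_t(((hashes[i] >> 32) * num_blocks_) >> 32);
      // Start the (possibly DRAM-bound) block loads up front.
      blocks[i] = _mm256_loadu_si256(
          reinterpret_cast<const __m256i*>(data_ + bucket * 32));
    }
    for (int i = 0; i < 4; ++i) {
      const __m256i key = _mm256_set1_epi32(int32_t(uint32_t(hashes[i])));
      const __m256i shift =
          _mm256_srli_epi32(_mm256_mullo_epi32(key, salt), 27);
      out[i] = _mm256_testc_si256(blocks[i], _mm256_sllv_epi32(ones, shift)) != 0;
    }
  }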
*Not in scope for this discussion*
• AVX-512 was evaluated and rejected: the frequency throttling on Sapphire
Rapids outweighs the width gain on this workload, and the SBBF mask compute
doesn’t saturate ZMM. The variants-rejected table in the linked repo has the
data.
• NEON / SVE2 not measured. Intrinsics map cleanly on paper but I don’t
want to send hand-wave estimates. Happy to do this as a follow-up if
there’s interest.
*What I’d like feedback on*
1. Preference on landing probe + insert together, or probe first as a
smaller PR?
2. Should NEON block the AVX2 PR, or is it acceptable as a follow-up?
3. Any concerns with the diff-test-in-CI approach as the regression guard
between scalar and AVX2 paths?
*Links*
• arrow-go prior art: https://github.com/apache/arrow-go/pull/336
• Bench, vendored references, diff test:
https://github.com/dmatth1/quickbloom (under investigations/parquet_port/)
• Longer write-up: https://dmatth1.github.io/posts/parquet-bloom/
Thanks,
Dan