Re: RFR: 8380841: Add AVX2 optimized intrinsic for binarySearch on primitive arrays

Galder Zamarreño Fri, 17 Apr 2026 06:16:56 -0700

On Tue, 7 Apr 2026 19:21:25 GMT, Kerem Kat <[email protected]> wrote:

> I propose a new intrinsic that replaces the scalar binary search in 
> `Arrays.binarySearch0()` with an N-ary SIMD search in AVX2 (AVX512 and other 
> architectures are out of scope of the current issue).
> 
> ## Stub code
> 
> "Binary" search does an 8-ary search in the stub: each iteration loads 7 
> evenly-spaced pivots from the current `[low, high]` range into a register, 
> compares the key against all 7 in parallel, and narrows the range to one of 8 
> subranges based on the result. The range shrinks by 8x per iteration instead 
> of 2x (`log8(N)` vs `log2(N)` iterations), at the cost of 7 loads + one 
> batched compare per iteration. The tail of the SIMD search falls through to a 
> per-type scalar binary search, based on the element count (currently 128 for 
> both `DWORD_NARY_THRESHOLD` and `QWORD_NARY_THRESHOLD`).
> 
> Initially the `long` path used 4-ary with 3 pivots in one YMM to avoid the 
> cost of an extra load. When I tried the per-iteration packing, the cost of 
> `vpinsrq` plus the serial dependency chain through the pack made each `long` 
> N-ary iteration too expensive to amortize over only a 4x range reduction. 
> Switching `long` to 8-ary across two YMM registers brought it in line with 
> the dword path.
> 
> To squeeze performance out of the stub, it uses type-specialized loops to 
> avoid per-iteration dispatch, arranges the pivot-offset `lea`s so three of 
> every four are independent of each other (letting the OoO engine compute the 
> offsets in parallel and issue the loads simultaneously), and uses branchless 
> partitioning via `popcnt` on the `vpcmpgt` mask (no per-pivot branches in the 
> inner loop).
> 
> ## Calling the stub
> 
> Calling the stub for small arrays regressed against C2-inlined Java, because 
> C2 fully optimizes inlined Java (hoisting array bases, lengths, and keys 
> across the caller's loop) while the stub call is opaque and forces reloads 
> every iteration.
> 
> In my first attempt at a fix, I added a manual `If` diamond that called 
> `binarySearch0` on the slow path, but C2 re-intrinsified the fallback 
> recursively, producing nested diamonds. Then I found that the proper fix uses 
> C2's `for_predicated_intrinsic` mechanism: 
> `inline_array_binary_search_predicate` returns a slow-path control via 
> `generate_fair_guard`, and the framework wires the slow path through a 
> non-intrinsifying call generator. Initially I used `generate_slow_guard` 
> here, then switched to `generate_fair_guard` (`PROB_FAIR`) so C2 optimizes 
> both branches as hot. When the inlined binary search is treated as cold, the 
> fallback ends up...


Couldn't find anything wrong with the approach nor the code, but I'm not an 
expert.

> The dip at 512 elements for long correlates with branch predictor pressure: 
> -prof perf reports ~0.04% branch misses for the baseline vs. ~1% for the 
> intrinsic. Happy to investigate further if needed.

Is this reproducible? Do you always see the dip? If I were you I would try to 
look into it while the PR gets reviewed.

-------------

PR Review: https://git.openjdk.org/jdk/pull/30612#pullrequestreview-4129068171

Re: RFR: 8380841: Add AVX2 optimized intrinsic for binarySearch on primitive arrays

Reply via email to