On Tue, 7 Apr 2026 19:21:25 GMT, Kerem Kat <[email protected]> wrote: > I propose a new intrinsic that replaces the scalar binary search in > `Arrays.binarySearch0()` with an N-ary SIMD search in AVX2 (AVX512 and other > architectures are out of scope of the current issue). > > ## Stub code > > "Binary" search does an 8-ary search in the stub: each iteration loads 7 > evenly-spaced pivots from the current `[low, high]` range into a register, > compares the key against all 7 in parallel, and narrows the range to one of 8 > subranges based on the result. The range shrinks by 8x per iteration instead > of 2x (`log8(N)` vs `log2(N)` iterations), at the cost of 7 loads + one > batched compare per iteration. The tail of the SIMD search falls through to a > per-type scalar binary search, based on the element count (currently 128 for > both `DWORD_NARY_THRESHOLD` and `QWORD_NARY_THRESHOLD`). > > Initially the `long` path used 4-ary with 3 pivots in one YMM to avoid the > cost of an extra load. When I tried the per-iteration packing, the cost of > `vpinsrq` plus the serial dependency chain through the pack made each `long` > N-ary iteration too expensive to amortize over only a 4x range reduction. > Switching `long` to 8-ary across two YMM registers brought it in line with > the dword path. > > To squeeze performance out of the stub, it uses type-specialized loops to > avoid per-iteration dispatch, arranges the pivot-offset `lea`s so three of > every four are independent of each other (letting the OoO engine compute the > offsets in parallel and issue the loads simultaneously), and uses branchless > partitioning via `popcnt` on the `vpcmpgt` mask (no per-pivot branches in the > inner loop). > > ## Calling the stub > > Calling the stub for small arrays regressed against C2-inlined Java, because > C2 fully optimizes inlined Java (hoisting array bases, lengths, and keys > across the caller's loop) while the stub call is opaque and forces reloads > every iteration. > > In my first attempt at a fix, I added a manual `If` diamond that called > `binarySearch0` on the slow path, but C2 re-intrinsified the fallback > recursively, producing nested diamonds. Then I found that the proper fix uses > C2's `for_predicated_intrinsic` mechanism: > `inline_array_binary_search_predicate` returns a slow-path control via > `generate_fair_guard`, and the framework wires the slow path through a > non-intrinsifying call generator. Initially I used `generate_slow_guard` > here, then switched to `generate_fair_guard` (`PROB_FAIR`) so C2 optimizes > both branches as hot. When the inlined binary search is treated as cold, the > fallback ends up...
Couldn't find anything wrong with the approach nor the code, but I'm not an expert. > The dip at 512 elements for long correlates with branch predictor pressure: > -prof perf reports ~0.04% branch misses for the baseline vs. ~1% for the > intrinsic. Happy to investigate further if needed. Is this reproducible? Do you always see the dip? If I were you I would try to look into it while the PR gets reviewed. ------------- PR Review: https://git.openjdk.org/jdk/pull/30612#pullrequestreview-4129068171
