I propose a new intrinsic that replaces the scalar binary search in 
`Arrays.binarySearch0()` with an N-ary SIMD search in AVX2 (AVX512 and other 
architectures are out of scope of the current issue).

## Stub code

"Binary" search does an 8-ary search in the stub: each iteration loads 7 
evenly-spaced pivots from the current `[low, high]` range into a register, 
compares the key against all 7 in parallel, and narrows the range to one of 8 
subranges based on the result. The range shrinks by 8x per iteration instead of 
2x (`log8(N)` vs `log2(N)` iterations), at the cost of 7 loads + one batched 
compare per iteration. The tail of the SIMD search falls through to a per-type 
scalar binary search, based on the element count (currently 128 for both 
`DWORD_NARY_THRESHOLD` and `QWORD_NARY_THRESHOLD`).

Initially the `long` path used 4-ary with 3 pivots in one YMM to avoid the cost 
of an extra load. When I tried the per-iteration packing, the cost of `vpinsrq` 
plus the serial dependency chain through the pack made each `long` N-ary 
iteration too expensive to amortize over only a 4x range reduction. Switching 
`long` to 8-ary across two YMM registers brought it in line with the dword path.

To squeeze performance out of the stub, it uses type-specialized loops to avoid 
per-iteration dispatch, arranges the pivot-offset `lea`s so three of every four 
are independent of each other (letting the OoO engine compute the offsets in 
parallel and issue the loads simultaneously), and uses branchless partitioning 
via `popcnt` on the `vpcmpgt` mask (no per-pivot branches in the inner loop).

## Calling the stub

Calling the stub for small arrays regressed against C2-inlined Java, because C2 
fully optimizes inlined Java (hoisting array bases, lengths, and keys across 
the caller's loop) while the stub call is opaque and forces reloads every 
iteration.

In my first attempt at a fix, I added a manual `If` diamond that called 
`binarySearch0` on the slow path, but C2 re-intrinsified the fallback 
recursively, producing nested diamonds. Then I found that the proper fix uses 
C2's `for_predicated_intrinsic` mechanism: 
`inline_array_binary_search_predicate` returns a slow-path control via 
`generate_fair_guard`, and the framework wires the slow path through a 
non-intrinsifying call generator. Initially I used `generate_slow_guard` here, 
then switched to `generate_fair_guard` (`PROB_FAIR`) so C2 optimizes both 
branches as hot. When the inlined binary search is treated as cold, the 
fallback ends up worse than just calling the stub.

To make the slow-path inlining possible without writing a binary search in IR 
by hand, `binarySearch0` was split into a tiny `@IntrinsicCandidate` wrapper 
that delegates to a `@ForceInline binarySearch0Impl` containing the actual 
loop. C2 inlines the wrapper trivially, then force-inlines the impl, producing 
the same generated code as the baseline.

Furthermore, a compile-time check in `inline_array_binary_search` returns 
`false` when C2 can statically prove the search range is below the per-type 
breakeven, avoiding the predicated intrinsic entirely for known-small arrays.

On Windows x64 the calling convention difference is handled via 
`setup_arg_regs` register shuffle with the 5th argument loaded from the stack, 
similar to how `checkcast_arraycopy` handles it. Other platforms fall back to 
the Java implementation.

## Tuning the thresholds

Per-type breakeven thresholds (int=256, long=768, short=512, char=512) were 
determined by JMH on Sapphire Rapids (AWS EC2 `c7i.48xlarge`). The intrinsic is 
gated on a new diagnostic flag `UseAVX2BinarySearchIntrinsic` (default true).

I do realize this is a single-microarch tuning. Is there a specific target CPU 
these should be tuned for, or a list of AVX2 CPUs the benchmarks should be run 
against before merging? The breakeven points may vary across different brands 
and microarchitectures.

## Benchmarks

The attached speedup plot summarizes the JMH result on Sapphire Rapids. The 
y-axis is **speedup (baseline / intrinsic)** so values above 1.0 mean the 
intrinsic is faster. The x-axis is **log2 scale** of array element count (and 
size), ranging from 64 to 8M elements. The plot was generated from JMH JSON 
output with 2 forks, 5 measurement iterations.

Please observe the following in the chart:

1. Small sizes below threshold fall back to the force-inlined function, which 
C2 further optimizes, so they are around `1x` speedup, within noise.
2. The dip at 512 elements for `long` correlates with branch predictor 
pressure: `-prof perf` reports `~0.04%` branch misses for the baseline vs. 
`~1%` for the intrinsic. Happy to investigate further if needed.
3. Average speedup is `~1.5x` across types for sizes above the threshold, with 
max speedup `~2.35x` observed for `int[1024]`.
4. Around the L2-cache boundary the speedup dips toward `1x`: the N-ary pivot 
loads no longer hit L1, both versions stall on L2, and the intrinsic's compute 
advantage is masked. Above L2 (>2M bytes) the speedup recovers to `~1.5x` as 
the workload becomes a steady-state stream from L3/memory and the intrinsic's 
lower iteration count amortizes the latency more effectively than the 
baseline's longer scalar loop.
5. There is a similar dip around the L1-cache boundary (48K bytes), most 
visible for `int[16384]`. When the array just exceeds L1, the intrinsic's 7 
scattered pivot loads per iteration each risk an L1 miss, while the baseline's 
single load per iteration spreads the cache pressure thinner. Once the array 
grows further and both versions are L2-resident, the intrinsic's compute 
advantage (lower iterations and lower total memory accesses) reasserts itself 
and the speedup recovers.

<img width="1800" height="1500" alt="bench" 
src="https://github.com/user-attachments/assets/2cd1cc31-0687-4015-aa33-7454efc71032";
 />

-------------

Commit messages:
 - 8380841: Add AVX2 optimized intrinsic for binarySearch on primitive arrays

Changes: https://git.openjdk.org/jdk/pull/30612/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=30612&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8380841
  Stats: 924 lines in 15 files changed: 923 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/30612.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/30612/head:pull/30612

PR: https://git.openjdk.org/jdk/pull/30612

Reply via email to