I propose a new intrinsic that replaces the scalar binary search in `Arrays.binarySearch0()` with an N-ary SIMD search in AVX2 (AVX512 and other architectures are out of scope of the current issue).
## Stub code "Binary" search does an 8-ary search in the stub: each iteration loads 7 evenly-spaced pivots from the current `[low, high]` range into a register, compares the key against all 7 in parallel, and narrows the range to one of 8 subranges based on the result. The range shrinks by 8x per iteration instead of 2x (`log8(N)` vs `log2(N)` iterations), at the cost of 7 loads + one batched compare per iteration. The tail of the SIMD search falls through to a per-type scalar binary search, based on the element count (currently 128 for both `DWORD_NARY_THRESHOLD` and `QWORD_NARY_THRESHOLD`). Initially the `long` path used 4-ary with 3 pivots in one YMM to avoid the cost of an extra load. When I tried the per-iteration packing, the cost of `vpinsrq` plus the serial dependency chain through the pack made each `long` N-ary iteration too expensive to amortize over only a 4x range reduction. Switching `long` to 8-ary across two YMM registers brought it in line with the dword path. To squeeze performance out of the stub, it uses type-specialized loops to avoid per-iteration dispatch, arranges the pivot-offset `lea`s so three of every four are independent of each other (letting the OoO engine compute the offsets in parallel and issue the loads simultaneously), and uses branchless partitioning via `popcnt` on the `vpcmpgt` mask (no per-pivot branches in the inner loop). ## Calling the stub Calling the stub for small arrays regressed against C2-inlined Java, because C2 fully optimizes inlined Java (hoisting array bases, lengths, and keys across the caller's loop) while the stub call is opaque and forces reloads every iteration. In my first attempt at a fix, I added a manual `If` diamond that called `binarySearch0` on the slow path, but C2 re-intrinsified the fallback recursively, producing nested diamonds. Then I found that the proper fix uses C2's `for_predicated_intrinsic` mechanism: `inline_array_binary_search_predicate` returns a slow-path control via `generate_fair_guard`, and the framework wires the slow path through a non-intrinsifying call generator. Initially I used `generate_slow_guard` here, then switched to `generate_fair_guard` (`PROB_FAIR`) so C2 optimizes both branches as hot. When the inlined binary search is treated as cold, the fallback ends up worse than just calling the stub. To make the slow-path inlining possible without writing a binary search in IR by hand, `binarySearch0` was split into a tiny `@IntrinsicCandidate` wrapper that delegates to a `@ForceInline binarySearch0Impl` containing the actual loop. C2 inlines the wrapper trivially, then force-inlines the impl, producing the same generated code as the baseline. Furthermore, a compile-time check in `inline_array_binary_search` returns `false` when C2 can statically prove the search range is below the per-type breakeven, avoiding the predicated intrinsic entirely for known-small arrays. On Windows x64 the calling convention difference is handled via `setup_arg_regs` register shuffle with the 5th argument loaded from the stack, similar to how `checkcast_arraycopy` handles it. Other platforms fall back to the Java implementation. ## Tuning the thresholds Per-type breakeven thresholds (int=256, long=768, short=512, char=512) were determined by JMH on Sapphire Rapids (AWS EC2 `c7i.48xlarge`). The intrinsic is gated on a new diagnostic flag `UseAVX2BinarySearchIntrinsic` (default true). I do realize this is a single-microarch tuning. Is there a specific target CPU these should be tuned for, or a list of AVX2 CPUs the benchmarks should be run against before merging? The breakeven points may vary across different brands and microarchitectures. ## Benchmarks The attached speedup plot summarizes the JMH result on Sapphire Rapids. The y-axis is **speedup (baseline / intrinsic)** so values above 1.0 mean the intrinsic is faster. The x-axis is **log2 scale** of array element count (and size), ranging from 64 to 8M elements. The plot was generated from JMH JSON output with 2 forks, 5 measurement iterations. Please observe the following in the chart: 1. Small sizes below threshold fall back to the force-inlined function, which C2 further optimizes, so they are around `1x` speedup, within noise. 2. The dip at 512 elements for `long` correlates with branch predictor pressure: `-prof perf` reports `~0.04%` branch misses for the baseline vs. `~1%` for the intrinsic. Happy to investigate further if needed. 3. Average speedup is `~1.5x` across types for sizes above the threshold, with max speedup `~2.35x` observed for `int[1024]`. 4. Around the L2-cache boundary the speedup dips toward `1x`: the N-ary pivot loads no longer hit L1, both versions stall on L2, and the intrinsic's compute advantage is masked. Above L2 (>2M bytes) the speedup recovers to `~1.5x` as the workload becomes a steady-state stream from L3/memory and the intrinsic's lower iteration count amortizes the latency more effectively than the baseline's longer scalar loop. 5. There is a similar dip around the L1-cache boundary (48K bytes), most visible for `int[16384]`. When the array just exceeds L1, the intrinsic's 7 scattered pivot loads per iteration each risk an L1 miss, while the baseline's single load per iteration spreads the cache pressure thinner. Once the array grows further and both versions are L2-resident, the intrinsic's compute advantage (lower iterations and lower total memory accesses) reasserts itself and the speedup recovers. <img width="1800" height="1500" alt="bench" src="https://github.com/user-attachments/assets/2cd1cc31-0687-4015-aa33-7454efc71032" /> ------------- Commit messages: - 8380841: Add AVX2 optimized intrinsic for binarySearch on primitive arrays Changes: https://git.openjdk.org/jdk/pull/30612/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=30612&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8380841 Stats: 924 lines in 15 files changed: 923 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/30612.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/30612/head:pull/30612 PR: https://git.openjdk.org/jdk/pull/30612
