On Fri, 17 Apr 2026 13:59:12 GMT, Kerem Kat <[email protected]> wrote:
>> I propose a new intrinsic that replaces the scalar binary search in >> `Arrays.binarySearch0()` with an N-ary SIMD search in AVX2 (AVX512 and other >> architectures are out of scope of the current issue). >> >> ## Stub code >> >> "Binary" search does an 8-ary search in the stub: each iteration loads 7 >> evenly-spaced pivots from the current `[low, high]` range into a register, >> compares the key against all 7 in parallel, and narrows the range to one of >> 8 subranges based on the result. The range shrinks by 8x per iteration >> instead of 2x (`log8(N)` vs `log2(N)` iterations), at the cost of 7 loads + >> one batched compare per iteration. The tail of the SIMD search falls through >> to a per-type scalar binary search, based on the element count (currently >> 128 for both `DWORD_NARY_THRESHOLD` and `QWORD_NARY_THRESHOLD`). >> >> Initially the `long` path used 4-ary with 3 pivots in one YMM to avoid the >> cost of an extra load. When I tried the per-iteration packing, the cost of >> `vpinsrq` plus the serial dependency chain through the pack made each `long` >> N-ary iteration too expensive to amortize over only a 4x range reduction. >> Switching `long` to 8-ary across two YMM registers brought it in line with >> the dword path. >> >> To squeeze performance out of the stub, it uses type-specialized loops to >> avoid per-iteration dispatch, arranges the pivot-offset `lea`s so three of >> every four are independent of each other (letting the OoO engine compute the >> offsets in parallel and issue the loads simultaneously), and uses branchless >> partitioning via `popcnt` on the `vpcmpgt` mask (no per-pivot branches in >> the inner loop). >> >> ## Calling the stub >> >> Calling the stub for small arrays regressed against C2-inlined Java, because >> C2 fully optimizes inlined Java (hoisting array bases, lengths, and keys >> across the caller's loop) while the stub call is opaque and forces reloads >> every iteration. >> >> In my first attempt at a fix, I added a manual `If` diamond that called >> `binarySearch0` on the slow path, but C2 re-intrinsified the fallback >> recursively, producing nested diamonds. Then I found that the proper fix >> uses C2's `for_predicated_intrinsic` mechanism: >> `inline_array_binary_search_predicate` returns a slow-path control via >> `generate_fair_guard`, and the framework wires the slow path through a >> non-intrinsifying call generator. Initially I used `generate_slow_guard` >> here, then switched to `generate_fair_guard` (`PROB_FAIR`) so C2 optimizes >> both branches as hot. When the inlined binary search is treated... > > Thanks for the scan! On that same machine, the dip is reproducible. @krk It looks like you put a lot of effort into this patch, so thank you for that :) I am wondering though if it is desirable to add more and more intrinsics. Each one of them is drilling a hole through the JVM. And assembly code is harder to review, it is easier to introduce bugs. Could we instead use the Vector API, once it is fully available? Before we keep reviewing: You'd need to run `/template append` and confirm that you did not use AI for the patch. ------------- PR Comment: https://git.openjdk.org/jdk/pull/30612#issuecomment-4293814220
