On Wed, 6 May 2026 17:32:56 GMT, Kerem Kat <[email protected]> wrote: >> I propose a new intrinsic that replaces the scalar binary search in >> `Arrays.binarySearch0()` with an N-ary SIMD search in AVX2 (AVX512 and other >> architectures are out of scope of the current issue). >> >> ## Stub code >> >> "Binary" search does an 8-ary search in the stub: each iteration loads 7 >> evenly-spaced pivots from the current `[low, high]` range into a register, >> compares the key against all 7 in parallel, and narrows the range to one of >> 8 subranges based on the result. The range shrinks by 8x per iteration >> instead of 2x (`log8(N)` vs `log2(N)` iterations), at the cost of 7 loads + >> one batched compare per iteration. The tail of the SIMD search falls through >> to a per-type scalar binary search, based on the element count (currently >> 128 for both `DWORD_NARY_THRESHOLD` and `QWORD_NARY_THRESHOLD`). >> >> Initially the `long` path used 4-ary with 3 pivots in one YMM to avoid the >> cost of an extra load. When I tried the per-iteration packing, the cost of >> `vpinsrq` plus the serial dependency chain through the pack made each `long` >> N-ary iteration too expensive to amortize over only a 4x range reduction. >> Switching `long` to 8-ary across two YMM registers brought it in line with >> the dword path. >> >> To squeeze performance out of the stub, it uses type-specialized loops to >> avoid per-iteration dispatch, arranges the pivot-offset `lea`s so three of >> every four are independent of each other (letting the OoO engine compute the >> offsets in parallel and issue the loads simultaneously), and uses branchless >> partitioning via `popcnt` on the `vpcmpgt` mask (no per-pivot branches in >> the inner loop). >> >> ## Calling the stub >> >> Calling the stub for small arrays regressed against C2-inlined Java, because >> C2 fully optimizes inlined Java (hoisting array bases, lengths, and keys >> across the caller's loop) while the stub call is opaque and forces reloads >> every iteration. >> >> In my first attempt at a fix, I added a manual `If` diamond that called >> `binarySearch0` on the slow path, but C2 re-intrinsified the fallback >> recursively, producing nested diamonds. Then I found that the proper fix >> uses C2's `for_predicated_intrinsic` mechanism: >> `inline_array_binary_search_predicate` returns a slow-path control via >> `generate_fair_guard`, and the framework wires the slow path through a >> non-intrinsifying call generator. Initially I used `generate_slow_guard` >> here, then switched to `generate_fair_guard` (`PROB_FAIR`) so C2 optimizes >> both branches as hot. When the inlined binary search is treated... > > Kerem Kat has updated the pull request with a new target base due to a merge > or a rebase. The incremental webrev excludes the unrelated changes brought in > by the merge/rebase. The pull request contains two additional commits since > the last revision: > > - Merge branch 'master' into opt-binsearch-avx2-8380841 > - 8380841: Add AVX2 optimized intrinsic for binarySearch on primitive arrays
I spotted a further issue with your implementation. Unfortunately, it is currently causing the tests to fail with a somewhat uninformative error (n.b. they still ought to be failing but with a more useful error). Your generator routine declared a StubCodeMark for the stub as follows: StubCodeMark mark(this, "StubRoutines", "array_binary_search"); That API is used to mark an unmanaged (i.e. non-enumerated) stub. The managed stubs declared in `stubDeclarations.hpp` or `stubDeclarations_<arch>.hpp` should be marked using the relevant `StubId`enum tag -- in your case this would be: StubCodeMark mark(this, StubId::stubgen_array_binary_search_id); With your current usage the `StubCodeMark` records the stub id as `StubId::NO_STUBID`, which is fine for unmanaged stubs but not ok for managed stubs. The failure in the latest GHA run happens because the verification code implemented by `~StubCodeMark` identifies that this stub belongs to a StubGenerator blob and tries to map the stub id to a blob id to make sure they align -- only `NO_STUBID` is not associated with any blob leading to an unexpected (but still appropriate) assert. What the verify code should do is detect that you have used the wrong `StubCodeMark` constructor, labelling a managed stub with `NO_STUBID`, and tell you to fix that call. I have a PR which will fix that for any future usage. Meanwhile, if you make the change recommended above then you will see the tests fail with a more meaningful error: java.lang.RuntimeException: # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (/home/adinn/redhat/openjdk/jdkdev/opt-binsearch-avx2-8380841/src/hotspot/share/code/aotCodeCache.cpp:2706), pid=820290, tid=820293 # assert(range.start_index() != -1) failed: missing store_archive_data for generated stub array_binary_search_stub (stub gen) The solution for that is to insert calls to `load_archive_data` and `store-archive_data` into your generator routine following the model of other StubGenerator stubs. ------------- PR Comment: https://git.openjdk.org/jdk/pull/30612#issuecomment-4396473167
