Re: RFR: 8380841: Add AVX2 optimized intrinsic for binarySearch on primitive arrays [v2]

Andrew Dinn Thu, 07 May 2026 03:57:49 -0700

On Wed, 6 May 2026 17:32:56 GMT, Kerem Kat <[email protected]> wrote:

>> I propose a new intrinsic that replaces the scalar binary search in 
>> `Arrays.binarySearch0()` with an N-ary SIMD search in AVX2 (AVX512 and other 
>> architectures are out of scope of the current issue).
>> 
>> ## Stub code
>> 
>> "Binary" search does an 8-ary search in the stub: each iteration loads 7 
>> evenly-spaced pivots from the current `[low, high]` range into a register, 
>> compares the key against all 7 in parallel, and narrows the range to one of 
>> 8 subranges based on the result. The range shrinks by 8x per iteration 
>> instead of 2x (`log8(N)` vs `log2(N)` iterations), at the cost of 7 loads + 
>> one batched compare per iteration. The tail of the SIMD search falls through 
>> to a per-type scalar binary search, based on the element count (currently 
>> 128 for both `DWORD_NARY_THRESHOLD` and `QWORD_NARY_THRESHOLD`).
>> 
>> Initially the `long` path used 4-ary with 3 pivots in one YMM to avoid the 
>> cost of an extra load. When I tried the per-iteration packing, the cost of 
>> `vpinsrq` plus the serial dependency chain through the pack made each `long` 
>> N-ary iteration too expensive to amortize over only a 4x range reduction. 
>> Switching `long` to 8-ary across two YMM registers brought it in line with 
>> the dword path.
>> 
>> To squeeze performance out of the stub, it uses type-specialized loops to 
>> avoid per-iteration dispatch, arranges the pivot-offset `lea`s so three of 
>> every four are independent of each other (letting the OoO engine compute the 
>> offsets in parallel and issue the loads simultaneously), and uses branchless 
>> partitioning via `popcnt` on the `vpcmpgt` mask (no per-pivot branches in 
>> the inner loop).
>> 
>> ## Calling the stub
>> 
>> Calling the stub for small arrays regressed against C2-inlined Java, because 
>> C2 fully optimizes inlined Java (hoisting array bases, lengths, and keys 
>> across the caller's loop) while the stub call is opaque and forces reloads 
>> every iteration.
>> 
>> In my first attempt at a fix, I added a manual `If` diamond that called 
>> `binarySearch0` on the slow path, but C2 re-intrinsified the fallback 
>> recursively, producing nested diamonds. Then I found that the proper fix 
>> uses C2's `for_predicated_intrinsic` mechanism: 
>> `inline_array_binary_search_predicate` returns a slow-path control via 
>> `generate_fair_guard`, and the framework wires the slow path through a 
>> non-intrinsifying call generator. Initially I used `generate_slow_guard` 
>> here, then switched to `generate_fair_guard` (`PROB_FAIR`) so C2 optimizes 
>> both branches as hot. When the inlined binary search is treated...
>
> Kerem Kat has updated the pull request with a new target base due to a merge 
> or a rebase. The incremental webrev excludes the unrelated changes brought in 
> by the merge/rebase. The pull request contains two additional commits since 
> the last revision:
> 
>  - Merge branch 'master' into opt-binsearch-avx2-8380841
>  - 8380841: Add AVX2 optimized intrinsic for binarySearch on primitive arrays


I spotted a further issue with your implementation. Unfortunately, it is 
currently causing the tests to fail with a somewhat uninformative error (n.b. 
they still ought to be failing but with a more useful error).

Your generator routine declared a StubCodeMark for the stub as follows:

  StubCodeMark mark(this, "StubRoutines", "array_binary_search");


That API is used to mark an unmanaged (i.e. non-enumerated) stub. The managed 
stubs declared in `stubDeclarations.hpp` or `stubDeclarations_<arch>.hpp` 
should be marked using the relevant `StubId`enum tag -- in your case this would 
be:

  StubCodeMark mark(this, StubId::stubgen_array_binary_search_id);


With your current usage the `StubCodeMark` records the stub id as 
`StubId::NO_STUBID`, which is fine for unmanaged stubs but not ok for managed 
stubs.

The failure in the latest GHA run happens because the verification code 
implemented by `~StubCodeMark` identifies that this stub belongs to a 
StubGenerator blob and tries to map the stub id to a blob id to make sure they 
align -- only `NO_STUBID` is not associated with any blob leading to an 
unexpected (but still appropriate) assert. 

What the verify code should do is detect that you have used the wrong 
`StubCodeMark` constructor, labelling a managed stub with `NO_STUBID`, and tell 
you to fix that call. I have a PR which will fix that for any future usage.

Meanwhile, if you make the change recommended above then you will see the tests 
fail with a more meaningful error:

java.lang.RuntimeException: # A fatal error has been detected by the Java 
Runtime Environment:
#
#  Internal Error 
(/home/adinn/redhat/openjdk/jdkdev/opt-binsearch-avx2-8380841/src/hotspot/share/code/aotCodeCache.cpp:2706),
 pid=820290, tid=820293
#  assert(range.start_index() != -1) failed: missing store_archive_data for 
generated stub array_binary_search_stub (stub gen)

The solution for that is to insert calls to `load_archive_data` and 
`store-archive_data` into your generator routine following the model of other 
StubGenerator stubs.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/30612#issuecomment-4396473167

Re: RFR: 8380841: Add AVX2 optimized intrinsic for binarySearch on primitive arrays [v2]

Reply via email to