Re: RFR: 8380841: Add AVX2 optimized intrinsic for binarySearch on primitive arrays [v2]

Andrew Dinn Wed, 06 May 2026 11:03:52 -0700

On Wed, 6 May 2026 17:32:56 GMT, Kerem Kat <[email protected]> wrote:

>> I propose a new intrinsic that replaces the scalar binary search in 
>> `Arrays.binarySearch0()` with an N-ary SIMD search in AVX2 (AVX512 and other 
>> architectures are out of scope of the current issue).
>> 
>> ## Stub code
>> 
>> "Binary" search does an 8-ary search in the stub: each iteration loads 7 
>> evenly-spaced pivots from the current `[low, high]` range into a register, 
>> compares the key against all 7 in parallel, and narrows the range to one of 
>> 8 subranges based on the result. The range shrinks by 8x per iteration 
>> instead of 2x (`log8(N)` vs `log2(N)` iterations), at the cost of 7 loads + 
>> one batched compare per iteration. The tail of the SIMD search falls through 
>> to a per-type scalar binary search, based on the element count (currently 
>> 128 for both `DWORD_NARY_THRESHOLD` and `QWORD_NARY_THRESHOLD`).
>> 
>> Initially the `long` path used 4-ary with 3 pivots in one YMM to avoid the 
>> cost of an extra load. When I tried the per-iteration packing, the cost of 
>> `vpinsrq` plus the serial dependency chain through the pack made each `long` 
>> N-ary iteration too expensive to amortize over only a 4x range reduction. 
>> Switching `long` to 8-ary across two YMM registers brought it in line with 
>> the dword path.
>> 
>> To squeeze performance out of the stub, it uses type-specialized loops to 
>> avoid per-iteration dispatch, arranges the pivot-offset `lea`s so three of 
>> every four are independent of each other (letting the OoO engine compute the 
>> offsets in parallel and issue the loads simultaneously), and uses branchless 
>> partitioning via `popcnt` on the `vpcmpgt` mask (no per-pivot branches in 
>> the inner loop).
>> 
>> ## Calling the stub
>> 
>> Calling the stub for small arrays regressed against C2-inlined Java, because 
>> C2 fully optimizes inlined Java (hoisting array bases, lengths, and keys 
>> across the caller's loop) while the stub call is opaque and forces reloads 
>> every iteration.
>> 
>> In my first attempt at a fix, I added a manual `If` diamond that called 
>> `binarySearch0` on the slow path, but C2 re-intrinsified the fallback 
>> recursively, producing nested diamonds. Then I found that the proper fix 
>> uses C2's `for_predicated_intrinsic` mechanism: 
>> `inline_array_binary_search_predicate` returns a slow-path control via 
>> `generate_fair_guard`, and the framework wires the slow path through a 
>> non-intrinsifying call generator. Initially I used `generate_slow_guard` 
>> here, then switched to `generate_fair_guard` (`PROB_FAIR`) so C2 optimizes 
>> both branches as hot. When the inlined binary search is treated...
>
> Kerem Kat has updated the pull request with a new target base due to a merge 
> or a rebase. The incremental webrev excludes the unrelated changes brought in 
> by the merge/rebase. The pull request contains two additional commits since 
> the last revision:
> 
>  - Merge branch 'master' into opt-binsearch-avx2-8380841
>  - 8380841: Add AVX2 optimized intrinsic for binarySearch on primitive arrays


I have not yet looked into the details of he stub implementation but there are 
a few things that need considering before adding a new stub.

Firstly, should the stub be generic or architecture-specific. This appears to 
be an x86-specific implementation yet you have declared it at the generic level 
in `stubDeclarations.hpp` and allocated storage for the entry in shared class 
`StubRoutines`. Likewise you have implemented to call out ot the stub at the 
generic level in `libraryKit`. Is that because you expect other implementations 
on other architectures or just because you have not thought about the question 
and the implementation?

Secondly, if you add a new compiler stub to the runtime you need to implement 
support for saving and restoring the stub to/from the AOT cache which, as a 
side effect, registers the stub entry address where it can be relocated when 
AOT code that calls out to it is saved or loaded. If you look at the other stub 
implementations you will see that they eploy strategic calls to 
`store_archive_data` and `load_archive_data`. You will need to do the same.

Thirdly, if the stub employs external (static) data tables then you may also 
need to register some of the addresses involved with the AOT cache address 
table. I can advise further on that and on the previous two concerns when I 
look into the details.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/30612#issuecomment-4390671723

Re: RFR: 8380841: Add AVX2 optimized intrinsic for binarySearch on primitive arrays [v2]

Reply via email to