> I propose a new intrinsic that replaces the scalar binary search in 
> `Arrays.binarySearch0()` with an N-ary SIMD search in AVX2 (AVX512 and other 
> architectures are out of scope of the current issue).
> 
> ## Stub code
> 
> "Binary" search does an 8-ary search in the stub: each iteration loads 7 
> evenly-spaced pivots from the current `[low, high]` range into a register, 
> compares the key against all 7 in parallel, and narrows the range to one of 8 
> subranges based on the result. The range shrinks by 8x per iteration instead 
> of 2x (`log8(N)` vs `log2(N)` iterations), at the cost of 7 loads + one 
> batched compare per iteration. The tail of the SIMD search falls through to a 
> per-type scalar binary search, based on the element count (currently 128 for 
> both `DWORD_NARY_THRESHOLD` and `QWORD_NARY_THRESHOLD`).
> 
> Initially the `long` path used 4-ary with 3 pivots in one YMM to avoid the 
> cost of an extra load. When I tried the per-iteration packing, the cost of 
> `vpinsrq` plus the serial dependency chain through the pack made each `long` 
> N-ary iteration too expensive to amortize over only a 4x range reduction. 
> Switching `long` to 8-ary across two YMM registers brought it in line with 
> the dword path.
> 
> To squeeze performance out of the stub, it uses type-specialized loops to 
> avoid per-iteration dispatch, arranges the pivot-offset `lea`s so three of 
> every four are independent of each other (letting the OoO engine compute the 
> offsets in parallel and issue the loads simultaneously), and uses branchless 
> partitioning via `popcnt` on the `vpcmpgt` mask (no per-pivot branches in the 
> inner loop).
> 
> ## Calling the stub
> 
> Calling the stub for small arrays regressed against C2-inlined Java, because 
> C2 fully optimizes inlined Java (hoisting array bases, lengths, and keys 
> across the caller's loop) while the stub call is opaque and forces reloads 
> every iteration.
> 
> In my first attempt at a fix, I added a manual `If` diamond that called 
> `binarySearch0` on the slow path, but C2 re-intrinsified the fallback 
> recursively, producing nested diamonds. Then I found that the proper fix uses 
> C2's `for_predicated_intrinsic` mechanism: 
> `inline_array_binary_search_predicate` returns a slow-path control via 
> `generate_fair_guard`, and the framework wires the slow path through a 
> non-intrinsifying call generator. Initially I used `generate_slow_guard` 
> here, then switched to `generate_fair_guard` (`PROB_FAIR`) so C2 optimizes 
> both branches as hot. When the inlined binary search is treated as cold, the 
> fallback ends up...

Kerem Kat has updated the pull request with a new target base due to a merge or 
a rebase. The incremental webrev excludes the unrelated changes brought in by 
the merge/rebase. The pull request contains two additional commits since the 
last revision:

 - Merge branch 'master' into opt-binsearch-avx2-8380841
 - 8380841: Add AVX2 optimized intrinsic for binarySearch on primitive arrays

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/30612/files
  - new: https://git.openjdk.org/jdk/pull/30612/files/14980904..699412d1

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=30612&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=30612&range=00-01

  Stats: 131183 lines in 2268 files changed: 48080 ins; 72961 del; 10142 mod
  Patch: https://git.openjdk.org/jdk/pull/30612.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/30612/head:pull/30612

PR: https://git.openjdk.org/jdk/pull/30612

Reply via email to