Re: Branchless binary search in Java?

Uwe Schindler Thu, 27 Jul 2023 06:19:53 -0700

See Shipilevs blog: 
https://shipilev.net/jvm/anatomy-quarks/30-conditional-moves/


He also has some examples and also there's a command line option to tell 
hotspot when to use cmov: -XX:ConditionalMoveLimit

Uwe

Am 27. Juli 2023 14:43:19 MESZ schrieb Uwe Schindler <[email protected]>:
>Hi Mike,
>
>actually Hotspot is using CMOV. Some nodes after bytecode analysis are 
>converted to CMoveNode and it has implementations to generate machine code (as 
>far as i see) for x86, s390 and arm.
>
>The generic code is here: 
>https://github.com/openjdk/jdk/blob/486c7844f902728ce580c3994f58e3e497834952/src/hotspot/share/opto/movenode.cpp
>
>Actually it is used in some cases, but I did not find out when it uses it to 
>generate instructions from bytecode. It also has some code to optimize the 
>cmov away if the result is known before (or not, e.g. for floats it does not 
>do this).
>
>I think best would be to ask on the hotspot compiler list on suggestions how 
>to write Java code to trigger the JVM to insert the CMOV.
>
>Uwe
>
>Am 27.07.2023 um 13:40 schrieb Michael McCandless:
>> Hi Team,
>> 
>> At Amazon (customer facing product search team) we've been playing with / 
>> benchmarking Tantivy (exciting Rust search engine loosely inspired by 
>> Lucene: https://github.com/quickwit-oss/tantivy, created by Paul Masurel and 
>> developed now by Quickwit and the Tantivy open source dev community) vs 
>> Lucene, by building a benchmark that blends Tantivy's search-benchmark-game 
>> (https://github.com/quickwit-oss/search-benchmark-game) and Lucene's nightly 
>> benchmarks (luceneutil: https://github.com/mikemccand/luceneutil and 
>> https://home.apache.org/~mikemccand/lucenebench).
>> 
>> It's great fun, and we would love more eyeballs to spot any remaining unfair 
>> aspects of the comparison (thank you Uwe and Adrien for catching things 
>> already!).  We are trying hard for an apples to apples comparison: same 
>> (enwiki) corpus, same queries, confirming we get precisely the same top N 
>> hits and total hit counts, latest versions of both engines.
>> 
>> Indeed, Tantivy is substantially (2-3X?) faster than Lucene for many 
>> queries, and we've been trying to understand why.
>> 
>> Sometimes it is due to a more efficient algorithms, e.g. the count() API for 
>> pure disjunctive queries, which Adrien is now porting to Lucene (thank 
>> you!), showing sizable (~80% faster in one query) speedups: 
>> https://github.com/apache/lucene/issues/12358. Other times may be due to 
>> Rust's more efficient/immediate/Python-like GC, or direct access to SIMD 
>> (Lucene now has this for aKNN search -- big speedup -- and maybe soon for 
>> postings too?), and unsafe code, different skip data / block postings 
>> encoding, or ...
>> 
>> Specifically, one of the fascinating Tantivy optimizations is the branchless 
>> binary search: https://quickwit.io/blog/search-a-sorted-block. Here's 
>> another blog post about it (implemented in C++): 
>> https://probablydance.com/2023/04/27/beautiful-branchless-binary-search/
>> 
>> The idea is to get rustc/gcc to compile down to x86-64's CMOVcc 
>> ("conditional move") instruction (I'm not sure if ARM has an equivalent?  
>> Maybe "conditional execution"?).  The idea is to avoid a "true" branch of 
>> the instruction stream (costly pipeline flush on mis-prediction, which is 
>> likely in a binary search or priority queue context) by instead 
>> conditionally moving a value from one location to another (register or 
>> memory).  Tantivy uses this for skipping through postings, in a single layer 
>> in-memory skip list structure (versus Lucene's on-disk (memory-mapped, by 
>> default) multi-layer skip list) -- see the above blog post.
>> 
>> This made me wonder: does javac's hotspot compiler use CMOVcc?  I see javac 
>> bug fixes like https://github.com/openjdk/mobile/commit/a03e9220 which seems 
>> to imply C2 does in fact compile to CMOVcc sometimes.  So then I wondered 
>> whether a branchless binary search in Java is a possibility?  Has anyone 
>> played with this?
>> 
>> Before Robert gets too upset :)  Even if we could build such a thing, the 
>> right place for such optimizations is likely the JDK itself (see the similar 
>> discussion about SIMD-optimized sorting: 
>> https://github.com/apache/lucene/issues/12399). Still, I'm curious if anyone 
>> has explored this, and maybe saw some performance gains from way up in 
>> javaland where we can barely see the bare metal shining beneath us :)
>> 
>> Sorry for the long post!
>> 
>> Mike McCandless
>> 
>> http://blog.mikemccandless.com
>
>-- 
>Uwe Schindler
>Achterdiek 19, D-28357 Bremen
>https://www.thetaphi.de
>eMail:[email protected]

--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de

Re: Branchless binary search in Java?

Reply via email to