For the curious, I managed to make Java use cmov instructions for binary
search, see https://github.com/apache/lucene/pull/13692. binarySearch4,
binarySearch5 and binarySearch6 internally use the cmov instruction.

However, translating it into speedups for query evaluation proved quite
challenging, as most queries in our benchmarks find the desired doc ID in
the couple doc IDs after the current doc ID, which makes linear search hard
to beat. I thought I'd still share it on this thread as we might have other
places where making binary search branchless could prove useful. Some of
you might also have other ideas how we could make advancing within a block
faster.

On Mon, Aug 14, 2023 at 5:31 PM Michael McCandless <
luc...@mikemccandless.com> wrote:

> Oh I realized this super interesting read (thanks Rob!) was accidentally
> sent privately to me, so I'm reply-all'ing explicitly so everyone else gets
> to read this too :)
>
> I especially love the charts and math at the end!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Jul 27, 2023 at 8:02 AM Rob Audenaerde <rob.audenae...@gmail.com>
> wrote:
>
>> Super interesting read!
>>
>> Btw. following the links of the internet I somehow ended up here, which
>> is a nice in-depth comparison on binary-search approaches:
>>
>> https://en.algorithmica.org/hpc/data-structures/binary-search/
>>
>>
>>
>> On Thu, Jul 27, 2023 at 1:40 PM Michael McCandless <
>> luc...@mikemccandless.com> wrote:
>>
>>> Hi Team,
>>>
>>> At Amazon (customer facing product search team) we've been playing with
>>> / benchmarking Tantivy (exciting Rust search engine loosely inspired by
>>> Lucene: https://github.com/quickwit-oss/tantivy, created by Paul
>>> Masurel and developed now by Quickwit and the Tantivy open source dev
>>> community) vs Lucene, by building a benchmark that blends
>>> Tantivy's search-benchmark-game (
>>> https://github.com/quickwit-oss/search-benchmark-game) and Lucene's
>>> nightly benchmarks (luceneutil: https://github.com/mikemccand/luceneutil
>>> and https://home.apache.org/~mikemccand/lucenebench).
>>>
>>> It's great fun, and we would love more eyeballs to spot any remaining
>>> unfair aspects of the comparison (thank you Uwe and Adrien for catching
>>> things already!).  We are trying hard for an apples to apples comparison:
>>> same (enwiki) corpus, same queries, confirming we get precisely the same
>>> top N hits and total hit counts, latest versions of both engines.
>>>
>>> Indeed, Tantivy is substantially (2-3X?) faster than Lucene for many
>>> queries, and we've been trying to understand why.
>>>
>>> Sometimes it is due to a more efficient algorithms, e.g. the count() API
>>> for pure disjunctive queries, which Adrien is now porting to Lucene (thank
>>> you!), showing sizable (~80% faster in one query) speedups:
>>> https://github.com/apache/lucene/issues/12358.  Other times may be due
>>> to Rust's more efficient/immediate/Python-like GC, or direct access to SIMD
>>> (Lucene now has this for aKNN search -- big speedup -- and maybe soon for
>>> postings too?), and unsafe code, different skip data / block postings
>>> encoding, or ...
>>>
>>> Specifically, one of the fascinating Tantivy optimizations is the
>>> branchless binary search: https://quickwit.io/blog/search-a-sorted-block.
>>> Here's another blog post about it (implemented in C++):
>>> https://probablydance.com/2023/04/27/beautiful-branchless-binary-search/
>>>
>>> The idea is to get rustc/gcc to compile down to x86-64's CMOVcc
>>> ("conditional move") instruction (I'm not sure if ARM has an equivalent?
>>> Maybe "conditional execution"?).  The idea is to avoid a "true" branch of
>>> the instruction stream (costly pipeline flush on mis-prediction, which is
>>> likely in a binary search or priority queue context) by instead
>>> conditionally moving a value from one location to another (register or
>>> memory).  Tantivy uses this for skipping through postings, in a single
>>> layer in-memory skip list structure (versus Lucene's on-disk
>>> (memory-mapped, by default) multi-layer skip list) -- see the above blog
>>> post.
>>>
>>> This made me wonder: does javac's hotspot compiler use CMOVcc?  I see
>>> javac bug fixes like https://github.com/openjdk/mobile/commit/a03e9220
>>> which seems to imply C2 does in fact compile to CMOVcc sometimes.  So then
>>> I wondered whether a branchless binary search in Java is a possibility?
>>> Has anyone played with this?
>>>
>>> Before Robert gets too upset :)  Even if we could build such a thing,
>>> the right place for such optimizations is likely the JDK itself (see the
>>> similar discussion about SIMD-optimized sorting:
>>> https://github.com/apache/lucene/issues/12399).  Still, I'm curious if
>>> anyone has explored this, and maybe saw some performance gains from way up
>>> in javaland where we can barely see the bare metal shining beneath us :)
>>>
>>> Sorry for the long post!
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>

-- 
Adrien

Reply via email to