Oh I realized this super interesting read (thanks Rob!) was accidentally
sent privately to me, so I'm reply-all'ing explicitly so everyone else gets
to read this too :)

I especially love the charts and math at the end!

Mike McCandless

http://blog.mikemccandless.com


On Thu, Jul 27, 2023 at 8:02 AM Rob Audenaerde <rob.audenae...@gmail.com>
wrote:

> Super interesting read!
>
> Btw. following the links of the internet I somehow ended up here, which is
> a nice in-depth comparison on binary-search approaches:
>
> https://en.algorithmica.org/hpc/data-structures/binary-search/
>
>
>
> On Thu, Jul 27, 2023 at 1:40 PM Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> Hi Team,
>>
>> At Amazon (customer facing product search team) we've been playing with /
>> benchmarking Tantivy (exciting Rust search engine loosely inspired by
>> Lucene: https://github.com/quickwit-oss/tantivy, created by Paul Masurel
>> and developed now by Quickwit and the Tantivy open source dev community) vs
>> Lucene, by building a benchmark that blends Tantivy's search-benchmark-game
>> (https://github.com/quickwit-oss/search-benchmark-game) and Lucene's
>> nightly benchmarks (luceneutil: https://github.com/mikemccand/luceneutil
>> and https://home.apache.org/~mikemccand/lucenebench).
>>
>> It's great fun, and we would love more eyeballs to spot any remaining
>> unfair aspects of the comparison (thank you Uwe and Adrien for catching
>> things already!).  We are trying hard for an apples to apples comparison:
>> same (enwiki) corpus, same queries, confirming we get precisely the same
>> top N hits and total hit counts, latest versions of both engines.
>>
>> Indeed, Tantivy is substantially (2-3X?) faster than Lucene for many
>> queries, and we've been trying to understand why.
>>
>> Sometimes it is due to a more efficient algorithms, e.g. the count() API
>> for pure disjunctive queries, which Adrien is now porting to Lucene (thank
>> you!), showing sizable (~80% faster in one query) speedups:
>> https://github.com/apache/lucene/issues/12358.  Other times may be due
>> to Rust's more efficient/immediate/Python-like GC, or direct access to SIMD
>> (Lucene now has this for aKNN search -- big speedup -- and maybe soon for
>> postings too?), and unsafe code, different skip data / block postings
>> encoding, or ...
>>
>> Specifically, one of the fascinating Tantivy optimizations is the
>> branchless binary search: https://quickwit.io/blog/search-a-sorted-block.
>> Here's another blog post about it (implemented in C++):
>> https://probablydance.com/2023/04/27/beautiful-branchless-binary-search/
>>
>> The idea is to get rustc/gcc to compile down to x86-64's CMOVcc
>> ("conditional move") instruction (I'm not sure if ARM has an equivalent?
>> Maybe "conditional execution"?).  The idea is to avoid a "true" branch of
>> the instruction stream (costly pipeline flush on mis-prediction, which is
>> likely in a binary search or priority queue context) by instead
>> conditionally moving a value from one location to another (register or
>> memory).  Tantivy uses this for skipping through postings, in a single
>> layer in-memory skip list structure (versus Lucene's on-disk
>> (memory-mapped, by default) multi-layer skip list) -- see the above blog
>> post.
>>
>> This made me wonder: does javac's hotspot compiler use CMOVcc?  I see
>> javac bug fixes like https://github.com/openjdk/mobile/commit/a03e9220
>> which seems to imply C2 does in fact compile to CMOVcc sometimes.  So then
>> I wondered whether a branchless binary search in Java is a possibility?
>> Has anyone played with this?
>>
>> Before Robert gets too upset :)  Even if we could build such a thing, the
>> right place for such optimizations is likely the JDK itself (see the
>> similar discussion about SIMD-optimized sorting:
>> https://github.com/apache/lucene/issues/12399).  Still, I'm curious if
>> anyone has explored this, and maybe saw some performance gains from way up
>> in javaland where we can barely see the bare metal shining beneath us :)
>>
>> Sorry for the long post!
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>

Reply via email to