Hi Team,

At Amazon (customer facing product search team) we've been playing with /
benchmarking Tantivy (exciting Rust search engine loosely inspired by
Lucene: https://github.com/quickwit-oss/tantivy, created by Paul Masurel
and developed now by Quickwit and the Tantivy open source dev community) vs
Lucene, by building a benchmark that blends Tantivy's search-benchmark-game
(https://github.com/quickwit-oss/search-benchmark-game) and Lucene's
nightly benchmarks (luceneutil: https://github.com/mikemccand/luceneutil
and https://home.apache.org/~mikemccand/lucenebench).

It's great fun, and we would love more eyeballs to spot any remaining
unfair aspects of the comparison (thank you Uwe and Adrien for catching
things already!).  We are trying hard for an apples to apples comparison:
same (enwiki) corpus, same queries, confirming we get precisely the same
top N hits and total hit counts, latest versions of both engines.

Indeed, Tantivy is substantially (2-3X?) faster than Lucene for many
queries, and we've been trying to understand why.

Sometimes it is due to more efficient algorithms, e.g. the count() API
for pure disjunctive queries, which Adrien is now porting to Lucene (thank
you!), showing sizable (~80% faster in one query) speedups:
https://github.com/apache/lucene/issues/12358.  Other times it may be due
to Rust's deterministic, GC-free memory management, or direct access to
SIMD (Lucene now has this for aKNN search -- big speedup -- and maybe soon
for postings too?), unsafe code, different skip data / block postings
encodings, or ...

Specifically, one of the fascinating Tantivy optimizations is the
branchless binary search: https://quickwit.io/blog/search-a-sorted-block.
Here's another blog post about it (implemented in C++):
https://probablydance.com/2023/04/27/beautiful-branchless-binary-search/

The idea is to get rustc/gcc to compile the search loop down to x86-64's
CMOVcc ("conditional move") instruction (AArch64 has an equivalent, CSEL
"conditional select"; 32-bit ARM has predicated/conditional execution).
This avoids a "true" branch in the instruction stream (a costly pipeline
flush on mis-prediction, which is likely in a binary search or priority
queue context) by instead conditionally moving a value from one location
(register or memory) to another.  Tantivy uses this for skipping through
postings, in a single-layer in-memory skip list structure (versus Lucene's
on-disk (memory-mapped, by default) multi-layer skip list) -- see the
above blog post.

This made me wonder: does the JVM's HotSpot JIT compiler use CMOVcc?  I
see JDK bug fixes like https://github.com/openjdk/mobile/commit/a03e9220
which seem to imply C2 does in fact compile to CMOVcc sometimes.  So then
I wondered: is a branchless binary search in Java a possibility?  Has
anyone played with this?
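To make the question concrete, here is a sketch of what a branchless
lower-bound binary search might look like in Java (class and method names
are mine, and whether C2 actually emits CMOVcc for the conditional update
depends on its profiling heuristics -- so this is illustrative, not a
claim that it compiles branch-free):

```java
// Sketch of a "branchless" lower-bound binary search, following the
// structure from the Tantivy / probablydance blog posts.  The loop's
// only conditional is a value select, which HotSpot's C2 *may* compile
// to a CMOVcc rather than a conditional jump.
public class BranchlessSearch {

    /** Returns the index of the first element >= target in sorted array a. */
    static int branchlessLowerBound(int[] a, int target) {
        int base = 0;
        int len = a.length;
        while (len > 1) {
            int half = len >>> 1;
            // Value select, not control flow: advance base by half
            // iff the probed element is still below target.
            base += (a[base + half - 1] < target) ? half : 0;
            len -= half;
        }
        // Fix-up for the last remaining candidate (if any).
        return (base < a.length && a[base] < target) ? base + 1 : base;
    }

    public static void main(String[] args) {
        int[] a = {2, 3, 5, 7, 11, 13};
        System.out.println(branchlessLowerBound(a, 5));   // 2
        System.out.println(branchlessLowerBound(a, 6));   // 3
        System.out.println(branchlessLowerBound(a, 100)); // 6
    }
}
```

To see what C2 actually generates, one could run with
-XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly (needs the hsdis
disassembler plugin) and look for cmov in the compiled loop.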

Before Robert gets too upset :)  Even if we could build such a thing, the
right place for such optimizations is likely the JDK itself (see the
similar discussion about SIMD-optimized sorting:
https://github.com/apache/lucene/issues/12399).  Still, I'm curious if
anyone has explored this, and maybe saw some performance gains from way up
in javaland where we can barely see the bare metal shining beneath us :)

Sorry for the long post!

Mike McCandless

http://blog.mikemccandless.com
