Hi Team,

At Amazon (on the customer-facing product search team) we've been playing with / benchmarking Tantivy (an exciting Rust search engine loosely inspired by Lucene: https://github.com/quickwit-oss/tantivy, created by Paul Masurel and now developed by Quickwit and the Tantivy open source dev community) against Lucene, by building a benchmark that blends Tantivy's search-benchmark-game (https://github.com/quickwit-oss/search-benchmark-game) and Lucene's nightly benchmarks (luceneutil: https://github.com/mikemccand/luceneutil and https://home.apache.org/~mikemccand/lucenebench).
It's great fun, and we would love more eyeballs to spot any remaining unfair aspects of the comparison (thank you Uwe and Adrien for catching things already!). We are trying hard for an apples-to-apples comparison: same (enwiki) corpus, same queries, confirming we get precisely the same top N hits and total hit counts, latest versions of both engines.

Indeed, Tantivy is substantially (2-3X?) faster than Lucene for many queries, and we've been trying to understand why. Sometimes it is due to a more efficient algorithm, e.g. the count() API for pure disjunctive queries, which Adrien is now porting to Lucene (thank you!), showing sizable speedups (~80% faster in one query): https://github.com/apache/lucene/issues/12358. Other times it may be due to Rust's immediate, deterministic memory reclamation (no tracing GC pauses -- closer in spirit to Python's reference counting), direct access to SIMD (Lucene now has this for aKNN search -- big speedup -- and maybe soon for postings too?), unsafe code, different skip data / block postings encodings, or ...

Specifically, one of the fascinating Tantivy optimizations is the branchless binary search: https://quickwit.io/blog/search-a-sorted-block. Here's another blog post about it (implemented in C++): https://probablydance.com/2023/04/27/beautiful-branchless-binary-search/

The idea is to get rustc/gcc to compile the search down to x86-64's CMOVcc ("conditional move") instruction (I'm not sure of the ARM equivalent -- maybe CSEL, or the older conditional execution?). This avoids a "true" branch in the instruction stream (a costly pipeline flush on mis-prediction, which is likely in a binary search or priority queue context) by instead conditionally moving a value from one location (register or memory) to another. Tantivy uses this for skipping through postings, in a single-layer in-memory skip list structure (versus Lucene's on-disk (memory-mapped, by default) multi-layer skip list) -- see the above blog posts.

This made me wonder: does Java's HotSpot JIT compiler (C2) ever use CMOVcc?
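For concreteness, here's a toy sketch of the branchless idea in plain Java (my own lowerBound, NOT Tantivy's or Lucene's actual code). The loop body is written as a ternary in the hope that the JIT turns it into a conditional move rather than a branch -- whether C2 actually does so depends on its heuristics and the observed branch profile:

```java
// Toy branchless lower-bound: return the first index whose value is >= target
// (or a.length if no such index exists). The loop shrinks a window [base,
// base + n) by half each iteration with no data-dependent branch in the body;
// the ternary *may* compile to CMOVcc on x86-64, but that is up to the JIT.
class BranchlessSearch {
  static int lowerBound(int[] a, int target) {
    if (a.length == 0) {
      return 0;
    }
    int base = 0;
    int n = a.length;
    while (n > 1) {
      int half = n >>> 1;
      // Branch-free update: either skip past the first half, or keep it.
      base = (a[base + half - 1] < target) ? base + half : base;
      n -= half;
    }
    // One final comparison resolves the last remaining candidate.
    return (a[base] < target) ? base + 1 : base;
  }

  public static void main(String[] args) {
    int[] postings = {3, 9, 14, 27, 41, 58};
    System.out.println(lowerBound(postings, 27)); // prints 3
  }
}
```

To see whether a cmov actually shows up, one could dump the generated machine code with -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly (which requires the hsdis disassembler plugin) after warming the method up enough for C2 to compile it.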
I see HotSpot bug fixes like https://github.com/openjdk/mobile/commit/a03e9220, which seem to imply that C2 does in fact compile to CMOVcc sometimes. So then I wondered: is a branchless binary search in Java a possibility? Has anyone played with this?

Before Robert gets too upset :) even if we could build such a thing, the right place for such optimizations is likely the JDK itself (see the similar discussion about SIMD-optimized sorting: https://github.com/apache/lucene/issues/12399). Still, I'm curious if anyone has explored this, and maybe saw some performance gains from way up in javaland, where we can barely see the bare metal shining beneath us :)

Sorry for the long post!

Mike McCandless
http://blog.mikemccandless.com