Hi,
First of all: we have to be a bit careful with the results. E.g., the
port's SegmentReader::get_live_docs() returns null, so the code does not
handle deleted documents. Of course this is not relevant if the original
Lucene index also has no deletions, but you need to keep an eye on it.
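To illustrate the liveDocs pitfall: in Lucene Java, a null liveDocs means "no deletions in this segment", and consumers must null-check before filtering docs. Here is a minimal standalone sketch of that pattern (the Bits interface below is a stand-in for org.apache.lucene.util.Bits, not the real class):

```java
import java.util.ArrayList;
import java.util.List;

public class LiveDocsSketch {
    // Stand-in for org.apache.lucene.util.Bits.
    interface Bits {
        boolean get(int index);
        int length();
    }

    // Collect only live doc ids. liveDocs == null means the segment
    // has no deletions, so every doc is live -- the null check is the
    // part the C++ port effectively hardcodes by always returning null.
    static List<Integer> collectLive(int maxDoc, Bits liveDocs) {
        List<Integer> live = new ArrayList<>();
        for (int doc = 0; doc < maxDoc; doc++) {
            if (liveDocs == null || liveDocs.get(doc)) {
                live.add(doc);
            }
        }
        return live;
    }

    public static void main(String[] args) {
        // Segment without deletions: null liveDocs, all docs live.
        System.out.println(collectLive(3, null)); // prints [0, 1, 2]

        // Segment where doc 1 is deleted.
        Bits liveDocs = new Bits() {
            public boolean get(int i) { return i != 1; }
            public int length() { return 3; }
        };
        System.out.println(collectLive(3, liveDocs)); // prints [0, 2]
    }
}
```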
What's great: the indexing part is perfectly optimized in Lucene Java.
He also figured out that our highly multithreaded IndexWriter is almost
impossible to rewrite in C++. He reported segmentation faults occurring
all the time when he tried to parallelize the code, and weeks spent
debugging them. So Java's concurrency, backed by the Java memory model,
is actually much easier to handle. What he has really shown is this: you
can make queries faster by specialization (see below).
It is also nice what he found: LZ4 and indexing itself are as fast in
C++ as in Java, so we see that HotSpot is doing a good job. Only some
smaller optimizations are possible, because Lucene Java sometimes copies
data needlessly (which is still a limitation of our IndexInput design).
What is a good outcome: if we completely drop IndexInput and all
directory abstractions at some point in Lucene 11 with Java 24, and
work solely on MemorySegments, we could improve a lot. I agree with his
observation that we still copy a lot of data from memory segments to the
heap when decoding PFOR and similar stuff, instead of accessing the
memory directly using VarHandles/MemorySegment. So we should really get
rid of the IndexInput abstractions (Robert and I always go crazy when we
see the IndexInput bullshit with seek and unaligned accesses...).
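To make the copy-vs-direct point concrete, here is a minimal sketch using the Java 22+ FFM API: decode ints straight from a MemorySegment (standing in for a memory-mapped postings file) with an unaligned layout, with no intermediate heap buffer. This is only an illustration of the access pattern, not Lucene code, and the method names are made up:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class DirectDecodeSketch {

    // Writes the values into an off-heap segment (pretend it is a
    // memory-mapped postings file), then sums them by reading each int
    // directly from the segment. No byte[] copy to the heap -- the
    // unaligned layout tolerates arbitrary byte offsets, which is what
    // postings decoding needs.
    static long decodeAndSum(int[] values) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocate((long) values.length * Integer.BYTES);
            for (int i = 0; i < values.length; i++) {
                seg.set(ValueLayout.JAVA_INT, (long) i * Integer.BYTES, values[i]);
            }
            long sum = 0;
            for (int i = 0; i < values.length; i++) {
                sum += seg.get(ValueLayout.JAVA_INT_UNALIGNED, (long) i * Integer.BYTES);
            }
            return sum;
        }
    }

    public static void main(String[] args) {
        System.out.println(decodeAndSum(new int[] {0, 10, 20, 30})); // prints 60
    }
}
```

An IndexInput-style API instead forces readBytes() into a heap array first and then decodes from that copy; dropping the abstraction removes exactly that copy.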
What is of course very crazy, and the main reason for his improvements:
he figured out that the query part can be made faster with tricks like
avoiding virtual function calls. This is not possible in Java, and it
has the downside that the whole of Lucene must be recompiled on the C++
side whenever you add a new query type (as everything is hardcoded). So
he loses a lot of flexibility.
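A minimal Java sketch of that specialization trade-off (hypothetical names, not Lucene's actual scoring API): the generic path dispatches every call through an interface, while the specialized path hardcodes one query type so there is nothing left to dispatch:

```java
public class SpecializationSketch {
    // Generic abstraction: every score() call is a virtual call.
    interface Scorer {
        float score(int doc);
    }

    // Flexible path: works for any Scorer, pays interface dispatch.
    static float sumGeneric(Scorer scorer, int maxDoc) {
        float total = 0;
        for (int doc = 0; doc < maxDoc; doc++) {
            total += scorer.score(doc);
        }
        return total;
    }

    // Specialized path: the "query type" (a constant score) is
    // hardcoded, so the whole loop collapses -- this is the kind of
    // per-query-type specialization the C++ port bakes in at compile
    // time, at the cost of recompiling when a new type is added.
    static float sumConstantScore(float constant, int maxDoc) {
        return constant * maxDoc;
    }

    public static void main(String[] args) {
        float generic = sumGeneric(doc -> 2.0f, 1000);
        float specialized = sumConstantScore(2.0f, 1000);
        System.out.println(generic + " " + specialized); // prints 2000.0 2000.0
    }
}
```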
P.S.: Maybe we should make the BulkScorer window size configurable...
P.P.S.: He did not implement HNSW at all yet, so he does not use SIMD. I
wonder why Lucene is not faster for stuff that autovectorizes nicely
(like bit counts in FixedBitSet, ...).
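For reference, the kind of loop I mean, modeled loosely on FixedBitSet's cardinality() over its long[] backing words (a standalone sketch, not the real implementation) -- a plain Long.bitCount reduction that HotSpot can turn into SIMD popcount instructions on recent CPUs:

```java
public class PopcountSketch {
    // Count set bits across all 64-bit words, like
    // FixedBitSet.cardinality(). Simple reduction loops like this are
    // exactly what C2's autovectorizer handles well.
    static long cardinality(long[] bits) {
        long count = 0;
        for (long word : bits) {
            count += Long.bitCount(word);
        }
        return count;
    }

    public static void main(String[] args) {
        long[] bits = new long[16];
        bits[0] = 0b1011L; // 3 bits set
        bits[5] = -1L;     // all 64 bits set
        System.out.println(cardinality(bits)); // prints 67
    }
}
```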
Uwe
Am 22.07.2024 um 17:30 schrieb Michael McCandless:
Thanks for sharing Adrien, this is really cool! It's neat that the
relative gains of Java vs C are quite a bit less than they were ~11
years ago when I played with a much smaller subset of queries. Also,
COUNT on disjunction queries with Lucene Cyborg got slower. What a
feat, to port so much of our complex Search code to C!
Mike McCandless
http://blog.mikemccandless.com
On Mon, Jul 22, 2024 at 9:43 AM Adrien Grand <jpou...@gmail.com> wrote:
Hello everyone,
I recently stumbled on this paper after Ishan shared it on
LinkedIn:
https://github.com/0ctopus13prime/lucene-cyborg-paper/blob/main/LuceneCyborg_Hybrid_Search_Engine_Written_in_Java_and_C%2B%2B.pdf.
This is quite impressive: this person did a high-fidelity rewrite
of Lucene in C++; it can even read indexes created by Lucene
as-is. Then they ran the Tantivy benchmark to compare performance
with Lucene, Tantivy and PISA. There are many takeaways; it is an
interesting read.
--
Adrien
--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de