Hi,
First of all: we have to be a bit careful with the results. E.g., the
port's SegmentReader::get_live_docs() returns null, so the code does not
handle deleted documents. Of course this is not relevant if the original
Lucene index also has no deletions, but you need to keep an eye on it.
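To illustrate the liveDocs pitfall: in Lucene Java, a null liveDocs means "no deletions in this segment", and consumers must null-check before filtering docs. Here is a minimal standalone sketch of that pattern (the Bits interface below is a stand-in for org.apache.lucene.util.Bits, not the real class):

```java
import java.util.ArrayList;
import java.util.List;

public class LiveDocsSketch {
    // Stand-in for org.apache.lucene.util.Bits.
    interface Bits {
        boolean get(int index);
        int length();
    }

    // Collect only live doc ids. liveDocs == null means the segment
    // has no deletions, so every doc is live -- the null check is the
    // part the C++ port effectively hardcodes by always returning null.
    static List<Integer> collectLive(int maxDoc, Bits liveDocs) {
        List<Integer> live = new ArrayList<>();
        for (int doc = 0; doc < maxDoc; doc++) {
            if (liveDocs == null || liveDocs.get(doc)) {
                live.add(doc);
            }
        }
        return live;
    }

    public static void main(String[] args) {
        // Segment without deletions: null liveDocs, all docs live.
        System.out.println(collectLive(3, null)); // prints [0, 1, 2]

        // Segment where doc 1 is deleted.
        Bits liveDocs = new Bits() {
            public boolean get(int i) { return i != 1; }
            public int length() { return 3; }
        };
        System.out.println(collectLive(3, liveDocs)); // prints [0, 2]
    }
}
```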
What's great: the indexing part is perfectly optimized in Lucene Java.
He also figured out that our highly multithreaded IndexWriter is almost
impossible to rewrite in C++. He reported segmentation faults occurring
all the time when he tried to parallelize the code, and weeks spent
debugging them. So Java's concurrency, backed by the Java memory model,
is actually much easier to handle. What he has really shown is this: you
can make queries faster by specialization (see below).
It is also nice what he found: LZ4 and indexing itself are as fast in
C++ as in Java, so we see that HotSpot is doing a good job. Only some
smaller optimizations are possible, because Lucene Java sometimes copies
data needlessly (which is still a limitation of our IndexInput design).
What is a good outcome: if we completely drop IndexInput and all
directory abstractions at some point in Lucene 11 with Java 24, and
work solely on MemorySegments, we could improve a lot. I agree with his
observation that we still copy a lot of data from memory segments to the
heap when decoding PFOR and similar stuff, instead of accessing the
memory directly using VarHandles/MemorySegment. So we should really get
rid of the IndexInput abstractions (Robert and I always go crazy when we
see the IndexInput bullshit with seek and unaligned accesses...).
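To make the copy-vs-direct point concrete, here is a minimal sketch using the Java 22+ FFM API: decode ints straight from a MemorySegment (standing in for a memory-mapped postings file) with an unaligned layout, with no intermediate heap buffer. This is only an illustration of the access pattern, not Lucene code, and the method names are made up:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class DirectDecodeSketch {

    // Writes the values into an off-heap segment (pretend it is a
    // memory-mapped postings file), then sums them by reading each int
    // directly from the segment. No byte[] copy to the heap -- the
    // unaligned layout tolerates arbitrary byte offsets, which is what
    // postings decoding needs.
    static long decodeAndSum(int[] values) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocate((long) values.length * Integer.BYTES);
            for (int i = 0; i < values.length; i++) {
                seg.set(ValueLayout.JAVA_INT, (long) i * Integer.BYTES, values[i]);
            }
            long sum = 0;
            for (int i = 0; i < values.length; i++) {
                sum += seg.get(ValueLayout.JAVA_INT_UNALIGNED, (long) i * Integer.BYTES);
            }
            return sum;
        }
    }

    public static void main(String[] args) {
        System.out.println(decodeAndSum(new int[] {0, 10, 20, 30})); // prints 60
    }
}
```

An IndexInput-style API instead forces readBytes() into a heap array first and then decodes from that copy; dropping the abstraction removes exactly that copy.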
What is of course very crazy, and the main reason for his improvements:
he figured out that the query part can be made faster with tricks like
avoiding virtual function calls. This is not possible in Java, and it
has the downside that the whole of Lucene must be recompiled on the C++
side whenever you add a new query type (as everything is hardcoded). So
he loses a lot of flexibility.
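A minimal Java sketch of that specialization trade-off (hypothetical names, not Lucene's actual scoring API): the generic path dispatches every call through an interface, while the specialized path hardcodes one query type so there is nothing left to dispatch:

```java
public class SpecializationSketch {
    // Generic abstraction: every score() call is a virtual call.
    interface Scorer {
        float score(int doc);
    }

    // Flexible path: works for any Scorer, pays interface dispatch.
    static float sumGeneric(Scorer scorer, int maxDoc) {
        float total = 0;
        for (int doc = 0; doc < maxDoc; doc++) {
            total += scorer.score(doc);
        }
        return total;
    }

    // Specialized path: the "query type" (a constant score) is
    // hardcoded, so the whole loop collapses -- this is the kind of
    // per-query-type specialization the C++ port bakes in at compile
    // time, at the cost of recompiling when a new type is added.
    static float sumConstantScore(float constant, int maxDoc) {
        return constant * maxDoc;
    }

    public static void main(String[] args) {
        float generic = sumGeneric(doc -> 2.0f, 1000);
        float specialized = sumConstantScore(2.0f, 1000);
        System.out.println(generic + " " + specialized); // prints 2000.0 2000.0
    }
}
```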
P.S.: Maybe we should make the BulkScorer window size configurable...
P.P.S.: He did not implement HNSW at all yet, so he does not use SIMD. I
wonder why Lucene is not faster for stuff that autovectorizes nicely
(like bit counts in FixedBitSet, ...).
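For reference, the kind of loop I mean, modeled loosely on FixedBitSet's cardinality() over its long[] backing words (a standalone sketch, not the real implementation) -- a plain Long.bitCount reduction that HotSpot can turn into SIMD popcount instructions on recent CPUs:

```java
public class PopcountSketch {
    // Count set bits across all 64-bit words, like
    // FixedBitSet.cardinality(). Simple reduction loops like this are
    // exactly what C2's autovectorizer handles well.
    static long cardinality(long[] bits) {
        long count = 0;
        for (long word : bits) {
            count += Long.bitCount(word);
        }
        return count;
    }

    public static void main(String[] args) {
        long[] bits = new long[16];
        bits[0] = 0b1011L; // 3 bits set
        bits[5] = -1L;     // all 64 bits set
        System.out.println(cardinality(bits)); // prints 67
    }
}
```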
Uwe
Am 22.07.2024 um 17:30 schrieb Michael McCandless:
Thanks for sharing Adrien, this is really cool! It's neat that the
relative gains of Java vs C are quite a bit less than they were ~11
years ago when I played with a much smaller subset of queries. Also,
COUNT on disjunction queries with Lucene Cyborg got slower. What a
feat, to port so much of our complex Search code to C!
Mike McCandless
http://blog.mikemccandless.com
On Mon, Jul 22, 2024 at 9:43 AM Adrien Grand <jpou...@gmail.com> wrote:
Hello everyone,
I recently stumbled on this paper after Ishan shared it on
LinkedIn:
https://github.com/0ctopus13prime/lucene-cyborg-paper/blob/main/LuceneCyborg_Hybrid_Search_Engine_Written_in_Java_and_C%2B%2B.pdf.
This is quite impressive: this person did a high-fidelity rewrite
of Lucene in C++; it can even read indexes created by Lucene
as-is. Then they ran the Tantivy benchmark to compare performance
with Lucene, Tantivy and PISA. There are many takeaways; it is an
interesting read.
--
Adrien
--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de