I managed to increase my native implementation's speedup over Lucene from
10% to ~35% by adding bulk scoring:
https://github.com/mccullocht/lucene-knn-oxide/pull/1

I tested it with and without prefetching -- it's about 10% faster with
prefetching -- but I think much of the benefit comes from loading the query
vector once for N vectors and reducing data dependencies between iterations
of the loop. I think there's a reasonable path to a smaller but still
substantial win in core Lucene: a bulk scoring interface on
RandomVectorScorer, changes to HnswGraphSearcher to use that interface,
and VectorUtil changes to support one-to-many scoring.
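
To make that concrete, here's a minimal sketch of what the bulk interface
and a one-to-many kernel could look like (names here are hypothetical,
not existing Lucene API):

  // Hypothetical bulk hook: score several candidate ordinals against
  // the single query vector in one call.
  public interface BulkRandomVectorScorer {
    // scores[i] receives the score for nodes[i].
    void bulkScore(int[] nodes, float[] scores, int numNodes) throws IOException;
  }

  // Hypothetical one-to-many kernel for VectorUtil: each query element
  // is loaded once and feeds two independent accumulators, so loop
  // iterations no longer serialize on a single running sum.
  static void dotProductBulk(float[] q, float[] d0, float[] d1, float[] scores) {
    float s0 = 0f, s1 = 0f;
    for (int i = 0; i < q.length; i++) {
      float qi = q[i];
      s0 += qi * d0[i];
      s1 += qi * d1[i];
    }
    scores[0] = s0;
    scores[1] = s1;
  }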

If there's interest in accepting a change like this (provided it is 10%+
faster), I'm happy to open an issue and work on it.

On Tue, Jul 8, 2025 at 9:04 AM Chris Hegarty
<christopher.hega...@elastic.co.invalid> wrote:

> Thanks Trevor,
>
> > On 8 Jul 2025, at 16:46, Trevor McCulloch 
> > <trevor.mccull...@mongodb.com.INVALID>
> wrote:
> >
> > Hey Chris,
> >
> > I was looking at your microbenchmark results [1] and noticed that dot
> product times are lower than a typical L3 cache miss time of ~150ns. This
> may reconcile our results: a native SIMD-accelerated scorer off heap is a
> lot faster in isolation, but performance at scale isn't substantially
> better because the improvement gets eaten by memory latency when loading
> document vectors.
>
> Yeah, I had the same thought.
>
> > HNSW's access pattern is random so this feels likely. To that end I got
> an additional ~10% improvement in my macro benchmark by inserting
> prefetching intrinsics right before scoring [2]. It would certainly be
> possible to do this directly using cpu native intrinsics (_mm_prefetch on
> x86_64 or _prefetch on aarch64) in your native scorer.
>
> In a previous version I was prefetching, but then removed it. I think I
> need to use a larger dataset for this micro-benchmark.
>
> > I'll try to find some time this week to try this further up in the HNSW
> traversal loop. Since we are effectively doing one-to-many scoring against
> a list of vertices, it might be faster to loop twice: once to check the
> visited set and prefetch anything that will be scored, and then again to
> score. I should also stand this up on a Linux box, as perf counters may be
> helpful here.
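>
> To make the two-pass idea concrete, roughly (prefetch(ord) here is a
> hypothetical hook on the scorer, not existing Lucene API):
>
>   int numToScore = 0;
>   for (int i = 0; i < friendCount; i++) {
>     int friendOrd = friendOrds[i];
>     if (visited.getAndSet(friendOrd) == false) {
>       // Pass 1: record the node and start pulling its vector's cache
>       // lines in early, well before we need them.
>       toScore[numToScore++] = friendOrd;
>       scorer.prefetch(friendOrd); // hypothetical prefetch hook
>     }
>   }
>   for (int i = 0; i < numToScore; i++) {
>     // Pass 2: by now the prefetched lines have had time to arrive.
>     float score = scorer.score(toScore[i]);
>     // ... update the candidate queue / results as usual
>   }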
>
> Interesting idea!
>
> -Chris.
>
> > Trevor
> >
> > [1]
> https://github.com/elastic/elasticsearch/pull/130635#issuecomment-3036314864
> > [2]
> https://github.com/mccullocht/lucene-knn-oxide/compare/main...prefetch
> >
> > On Mon, Jul 7, 2025 at 1:43 PM Chris Hegarty
> <christopher.hega...@elastic.co.invalid> wrote:
> > Thanks Trevor, this is helpful.  It also reminded me to reply here too
> on the off-heap scoring.
> >
> > Scoring off-heap, and thus avoiding a copy, gains approx 2x in the
> vector comparisons - this is quite substantial and maybe align with what
> Trevor sees? However, because of a Hotspot bug, using MemorySegment is not
> yet an option for scoring float32’s off-heap [1]. We’ll get the Hotspot bug
> fixed, but in the mean time to help evaluate the potential gain I just
> wrote native float32 vector comparison operations to see how they perform
> in Elasticsearch [2].
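>
> For reference, the copy-free loop has roughly this shape -- a hedged
> sketch, not the actual Elasticsearch/Lucene code; the Hotspot bug in [1]
> is exactly about making loops like this vectorize well:
>
>   // Score a document vector straight off the mapped MemorySegment
>   // instead of materializing it into a heap float[] first.
>   static float dotProduct(MemorySegment vectors, long offset, float[] query) {
>     float sum = 0f;
>     for (int i = 0; i < query.length; i++) {
>       sum += query[i]
>           * vectors.get(ValueLayout.JAVA_FLOAT_UNALIGNED,
>               offset + (long) i * Float.BYTES);
>     }
>     return sum;
>   }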
> >
> > -Chris.
> >
> > [1] https://mail.openjdk.org/pipermail/panama-dev/2025-June/021070.html
> > [2] https://github.com/elastic/elasticsearch/pull/130541
> >
> > > On 7 Jul 2025, at 21:01, Trevor McCulloch <
> trevor.mccull...@mongodb.com.INVALID> wrote:
> > >
> > > For comparison I put together a codec that is a copy of Lucene99 but
> with most of KnnVectorsReader.search() implemented in native code and
> called via FFM as a way of examining overhead from the JVM. I didn't have
> the same data set, but on 1M 1536d float vectors with default Lucene99
> settings it was about 10% faster in native code -- not nothing, but not
> very much considering the additional complexity of using native code. I was
> able to avoid copying data from the mmap backing store in order to perform
> distance comparisons which was probably a significant chunk of the win. The
> vast majority of CPU time (~80%) is spent doing vector comparison and
> something like 95-96% of that CPU time is spent scoring on L0, the bottom
> graph layer. Decoding edges from the graph is ~1.5%. I think that so long
> as the vector scoring code is competitive and Lucene is doing a similar
> number of comparisons, the margin should be pretty close.
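>
> (For anyone unfamiliar with the FFM call shape, a minimal sketch -- the
> library and symbol names below are made up for illustration:)
>
>   // Bind a native `float dot_product(const float*, const float*, int)`
>   // once up front, then invoke it per comparison with MemorySegments
>   // pointing into the mmapped vector data -- no copy onto the heap.
>   Linker linker = Linker.nativeLinker();
>   SymbolLookup lib = SymbolLookup.libraryLookup("knn_native", Arena.global());
>   MethodHandle dot = linker.downcallHandle(
>       lib.find("dot_product").orElseThrow(),
>       FunctionDescriptor.of(ValueLayout.JAVA_FLOAT,
>           ValueLayout.ADDRESS, ValueLayout.ADDRESS, ValueLayout.JAVA_INT));
>   // querySeg/docSeg are MemorySegments over the query and doc vectors.
>   float score = (float) dot.invokeExact(querySeg, docSeg, dims);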
> > >
> > > I looked through their open source implementation and did not see
> anything that led me to believe that their HNSW implementation is
> substantially different in an algorithmic sense. They did have some logic
> around choosing different representations of the visited set depending on
> the expected number of nodes to visit (choosing between ~FixedBitSet and a
> hash set).
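>
> (In Lucene terms that heuristic would look something like the following
> -- the threshold is illustrative:)
>
>   // Dense bits when we expect to visit a meaningful fraction of the
>   // graph; a sparse structure when the graph is huge and the expected
>   // visit count is comparatively tiny.
>   BitSet visited = expectedVisited > graphSize / 64
>       ? new FixedBitSet(graphSize)
>       : new SparseFixedBitSet(graphSize);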
> > >
> > > On Mon, Jun 23, 2025 at 6:10 AM Benjamin Trent <ben.w.tr...@gmail.com>
> wrote:
> > > To my knowledge, FAISS isn't utilizing hand-rolled SIMD calculations.
> Do we know if it was compiled with `-ffast-math`?
> > >
> > > Vespa does utilize SIMD optimizations for vector comparisons.
> > >
> > > Some more ways I think Lucene is slower (though I am not sure the 2x
> is fully explained):
> > >
> > >  - Reading floats onto a heap float[] instead of accessing
> MemorySegments directly when scoring
> > >  - We store the graph in a unique way that requires a decoding step
> when exploring a new candidate: reading vints and doing a binary search
> (see the sketch after this list). I think all other HNSW impls use flat
> arrays of int/long values.
> > >  - We always use SparseFixedBitSet, which for smaller indices (<1M
> docs) can have a noticeable impact on performance. I have seen this in my
> own benchmarking (switching to FixedBitSet measurably improved query times
> on smaller data sets)
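>
> As a hedged illustration of the decoding cost in the second bullet, this
> is the generic shape of vint decoding, not Lucene's exact code:
>
>   // Each neighbor lookup walks a branchy, data-dependent byte loop,
>   // where a flat int[] adjacency list would be one indexed read.
>   static int readVInt(ByteBuffer in) {
>     int value = 0;
>     int shift = 0;
>     byte b;
>     do {
>       b = in.get();
>       value |= (b & 0x7F) << shift;
>       shift += 7;
>     } while ((b & 0x80) != 0);
>     return value;
>   }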
> > >
> > > These are all fairly "cheap", which might explain the ~10% FAISS
> difference. However, I am not sure they fully explain the 2x difference
> with Vespa.
> > >
> > > On Thu, Jun 19, 2025 at 3:37 PM Adrien Grand <jpou...@gmail.com>
> wrote:
> > > Thanks Mike, this is useful information. Then I'll try to reproduce
> this benchmark to better understand what is happening.
> > >
> > > On Thu, Jun 19, 2025 at 8:16 PM Michael Sokolov <msoko...@gmail.com>
> wrote:
> > > We've recently been comparing Lucene's HNSW with FAISS's, and there is
> > > not a 2x difference there. FAISS does seem to be around 10-15% faster,
> > > I think?  The 2x difference is roughly what I was seeing in comparisons
> > > with hnswlib prior to the dot-product improvements we made in Lucene.
> > >
> > > On Thu, Jun 19, 2025 at 2:12 PM Adrien Grand <jpou...@gmail.com>
> wrote:
> > > >
> > > > Chris,
> > > >
> > > > FWIW I was looking at luceneknn (
> https://github.com/erikbern/ann-benchmarks/blob/f402b2cc17b980d7cd45241ab5a7a4cc0f965e55/ann_benchmarks/algorithms/luceneknn/Dockerfile#L15)
> which is on 9.7, though I don't know whether it enables the incubating
> vector API at runtime.
> > > >
> > > > I hope that mentioning ANN benchmarks did not add noise to this
> thread, I was mostly looking at whether I could find another benchmark that
> suggests that Lucene is significantly slower in similar conditions. Does it
> align with other people's experience that Lucene is 2x slower or more
> compared with other good HNSW implementations?
> > > >
> > > > Adrien
> > > >
> > > > Le jeu. 19 juin 2025, 18:44, Chris Hegarty
> <christopher.hega...@elastic.co.invalid> a écrit :
> > > >>
> > > >> Hi Adrien,
> > > >>
> > > >> > Even though it uses Elasticsearch to run the benchmark, it really
> benchmarks Lucene functionality,
> > > >>
> > > >> Agreed.
> > > >>
> > > >> > This seems consistent with results from
> https://ann-benchmarks.com/index.html though I don't know if the cause of
> the performance difference is the same or not.
> > > >>
> > > >> On ann-benchmarks specifically: unless I'm looking in the wrong
> place, they're using Elasticsearch 8.7.0 [1], which predates our usage of
> the Panama Vector API for vector search. We added support for that in
> Lucene 9.7.0 -> Elasticsearch 8.9.0. So those benchmarks are wildly out of
> date, no?
> > > >>
> > > >> -Chris.
> > > >>
> > > >> [1]
> https://github.com/erikbern/ann-benchmarks/blob/f402b2cc17b980d7cd45241ab5a7a4cc0f965e55/ann_benchmarks/algorithms/elasticsearch/Dockerfile#L2
> > > >>
> > > >>
> > > >> > On 19 Jun 2025, at 16:39, Adrien Grand <jpou...@gmail.com> wrote:
> > > >> >
> > > >> > Hello all,
> > > >> >
> > > >> > I have been looking at this benchmark against Vespa recently:
> https://blog.vespa.ai/elasticsearch-vs-vespa-performance-comparison/.
> (The report is behind an annoying email wall, but I'm copying relevant data
> below, so hopefully you don't need to download the report.) Even though it
> uses Elasticsearch to run the benchmark, it really benchmarks Lucene
> functionality, I don't believe that Elasticsearch does anything that
> meaningfully alters the results that you would get if you were to run
> Lucene directly.
> > > >> >
> > > >> > The benchmark seems designed to highlight the benefits of Vespa's
> realtime design, that's fair game I guess. But it also runs some queries in
> read-only scenarios when I was expecting Lucene to perform better.
> > > >> >
> > > >> > One thing that got me curious is that it reports about 2x worse
> latency and throughput for pure unfiltered vector search on a force-merged
> index (so single segment/graph). Does anybody know why Lucene's HNSW may
> perform slower than Vespa's HNSW? This seems consistent with results from
> https://ann-benchmarks.com/index.html though I don't know if the cause of
> the performance difference is the same or not.
> > > >> >
> > > >> > For reference, here are details that apply to both Lucene and
> Vespa's vector search:
> > > >> >  - HNSW,
> > > >> >  - float32 vectors, no quantization,
> > > >> >  - embeddings generated using Snowflake's Arctic-embed-xs model,
> > > >> >  - 1M docs,
> > > >> >  - 384 dimensions,
> > > >> >  - dot product,
> > > >> >  - m = 16,
> > > >> >  - max connections = 200,
> > > >> >  - search for top 10 hits,
> > > >> >  - no filter,
> > > >> >  - single client, so no search concurrency,
> > > >> >  - purple column is force-merged, so single segment/graph like
> Vespa.
> > > >> >
> > > >> > I never seriously looked at Lucene's vector search performance,
> so I'm very happy to be educated if I'm making naive assumptions!
> > > >> >
> > > >> > Somewhat related, is this the reason why I'm seeing many threads
> around bringing 3rd party implementations into Lucene, including ones that
> are very similar to Lucene on paper? To speed up vector search?
> > > >> >
> > > >> > --
> > > >> > Adrien
> > > >> > <vespa-vs-es-screenshot.png>
> > > >> >
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > >
> > >
> > >
> > >
> > > --
> > > Adrien
> >
> >
> >
>
>
>
>
