This is a good point, and I agree it’s not a perfect fit for Lucene testing/development. The repository indeed focuses on datasets that can be held in memory -- by looking only at results on ann-benchmarks, we might miss important considerations like how well the method scales to indexes larger than main memory, how well it fits with the rest of the search framework, etc.
I’m thinking of it as a supplementary tool that can give some performance insights. It reports search accuracy in addition to speed, which luceneutil doesn’t do (yet :)). The datasets aren’t huge but also aren’t 'toy datasets' -- they range from 1-10 million vectors with dimension 100+, and are based on pretty diverse + realistic data. In the future we could extend luceneutil to cover some of this, like reporting recall on a few datasets.

So ann-benchmarks can be useful within a certain scope:
* Comparing our implementation against reference libraries to double-check/debug the algorithm
* Testing on diverse datasets, with the ability to measure search speed *and* recall

Julie

On Wed, Apr 28, 2021 at 7:16 AM Robert Muir <rcm...@gmail.com> wrote:
> Thanks for doing this benchmarking. But I am very concerned whether
> ann-benchmarks is a good one to be using.
>
> While it may be hip/trendy/popular, it clearly states that it is only
> for toy datasets that fit in RAM:
> https://github.com/erikbern/ann-benchmarks/blob/master/README.md#principles
>
> On Tue, Apr 27, 2021 at 4:46 PM Julie Tibshirani <juliet...@gmail.com> wrote:
> >
> > One last follow-up: Robert's comments got me interested in better
> > quantifying the performance against other approaches. I hooked up Lucene
> > HNSW to ann-benchmarks, a commonly used repo for benchmarking nearest
> > neighbor search libraries against large datasets. These two issues
> > describe the results:
> > * Search recall + QPS (https://issues.apache.org/jira/browse/LUCENE-9937)
> > * Index speed (https://issues.apache.org/jira/browse/LUCENE-9941)
> >
> > Thanks Mike for your insights so far on the search ticket.
> >
> > Julie
> >
> > On Tue, Apr 6, 2021 at 12:37 PM Julie Tibshirani <juliet...@gmail.com> wrote:
> >>
> >> I filed one more JIRA about the approach to specifying the NN
> >> algorithm: https://issues.apache.org/jira/browse/LUCENE-9905.
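As an aside on the "speed *and* recall" point above: the recall figure ann-benchmarks reports is just the overlap between the approximate top-k result and the exact top-k result. A minimal, self-contained sketch with toy data (hypothetical helper names; not code from ann-benchmarks or luceneutil):

```python
import math
import random

def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true k nearest neighbors that the approximate search returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

# Toy dataset: 1000 random 16-dim vectors plus a query vector.
random.seed(42)
data = [[random.gauss(0, 1) for _ in range(16)] for _ in range(1000)]
query = [random.gauss(0, 1) for _ in range(16)]

def dist(v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, query)))

# Exact ground truth comes from a brute-force scan.
order = sorted(range(len(data)), key=lambda i: dist(data[i]))
exact = order[:10]

# An approximate index (e.g. HNSW) would return something close to `exact`;
# here we deliberately swap one true neighbor for the farthest point to
# mimic an imperfect search.
approx = exact[:9] + [order[-1]]
print(recall_at_k(approx, exact, 10))  # 0.9
```

In the real benchmark the ground truth is precomputed per dataset and the approximate ids come from the library under test; the metric itself is only this set intersection.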
> >>
> >> As a summary, here's the current list of vector API issues we're tracking:
> >> * Reconsider the format name (https://issues.apache.org/jira/browse/LUCENE-9855)
> >> * Revise approach to specifying NN algorithm (https://issues.apache.org/jira/browse/LUCENE-9905)
> >> * Move VectorValues#search to VectorReader (https://issues.apache.org/jira/browse/LUCENE-9908)
> >> * Should VectorValues expose both iteration and random access? (https://issues.apache.org/jira/browse/LUCENE-9583)
> >>
> >> Julie
> >>
> >> On Tue, Apr 6, 2021 at 5:31 AM Adrien Grand <jpou...@gmail.com> wrote:
> >>>
> >>> I created a JIRA about moving VectorValues#search to VectorReader:
> >>> https://issues.apache.org/jira/browse/LUCENE-9908.
> >>>
> >>> On Tue, Mar 16, 2021 at 7:14 PM Adrien Grand <jpou...@gmail.com> wrote:
> >>>>
> >>>> Hello Mike,
> >>>>
> >>>> On Tue, Mar 16, 2021 at 5:05 PM Michael Sokolov <msoko...@gmail.com> wrote:
> >>>>>
> >>>>> I think the reason we have search() on VectorValues is that we have
> >>>>> LeafReader.getVectorValues() (by analogy to the DocValues iterators),
> >>>>> but no way to access the VectorReader. Do you think we should also
> >>>>> have LeafReader.getVectorReader()? Today it's only on CodecReader.
> >>>>
> >>>> I was more thinking of moving VectorValues#search to
> >>>> LeafReader#searchNearestVectors or something along those lines. I agree
> >>>> that VectorReader should only be exposed on CodecReader.
> >>>>
> >>>>> Re: SearchStrategy.NONE; the idea is we support efficient access to
> >>>>> floating point values. Using BinaryDocValues for this will always
> >>>>> require an additional decoding step. I can see that the naming is
> >>>>> confusing there. The intent is that you index the vector values, but
> >>>>> no additional indexing data structure.
> >>>>
> >>>> I wonder if things would be simpler if we were more opinionated and
> >>>> made vectors specifically about nearest-neighbor search.
> >>>> Then we have a clearer message: use vectors for NN search and doc
> >>>> values otherwise. As far as I know, reinterpreting bytes as floats
> >>>> shouldn't add much overhead. The main problem I know of is that the
> >>>> JVM won't auto-vectorize if you read floats dynamically from a byte[],
> >>>> but this is something that should be alleviated by the JDK vector API?
> >>>>
> >>>>> Also: the reason HNSW is mentioned in these SearchStrategy enums is
> >>>>> to make room for other vector indexing approaches, like LSH. There
> >>>>> was a lot of discussion that we wanted an API that allowed for
> >>>>> experimenting with other techniques for indexing and searching
> >>>>> vector values.
> >>>>
> >>>> Actually this is the thing that feels odd to me: if we end up with
> >>>> constants for both LSH and HNSW, then we are adding the requirement
> >>>> that all vector formats must implement both LSH and HNSW, as they will
> >>>> need to support all SearchStrategy constants. Would it be possible to
> >>>> have a single API and then two implementations of VectorsFormat,
> >>>> LSHVectorsFormat on the one hand and HNSWVectorsFormat on the other?
> >>>>
> >>>>> Adrien, you made an analogy to PerFieldPostingsFormat (and DocValues),
> >>>>> but I think the situation is more akin to Points, where we have the
> >>>>> options on IndexableField. The metadata we store there (dimension and
> >>>>> score function) don't really result in different formats, i.e. code
> >>>>> paths for indexing and storage; they are more like parameters to the
> >>>>> format, in my mind. Perhaps the situation will look different when we
> >>>>> get our second vector indexing strategy (like LSH).
> >>>>
> >>>> Having the dimension count and the score function on the FieldType
> >>>> actually makes sense to me. I was more wondering whether maxConn and
> >>>> beamWidth actually belong on the FieldType, or if they should be made
> >>>> constructor arguments of Lucene90VectorFormat.
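The exchange above turns on the cost of decoding floats from raw bytes, which a BinaryDocValues-based approach would pay on every read. A language-neutral sketch of that round-trip (shown in Python rather than Java, with made-up values; this illustrates the extra decode step, not Lucene's actual on-disk encoding):

```python
import struct

# Hypothetical 4-dim vector. A binary-payload field (e.g. BinaryDocValues)
# would store it as raw little-endian float32 bytes.
vector = [0.25, -1.5, 3.0, 0.125]
stored = struct.pack('<4f', *vector)   # encode once at index time; 16 bytes

# Reading the vector back requires an explicit decode on every access --
# this per-read step is the overhead being discussed. (All values above are
# exactly representable in float32, so the round-trip is lossless here.)
decoded = list(struct.unpack('<4f', stored))
print(decoded)  # [0.25, -1.5, 3.0, 0.125]
```

A dedicated vector format avoids this by handing out the floats directly, which is also what lets the JIT (or the JDK vector API) operate on them without a per-element conversion.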
> >>>>
> >>>> --
> >>>> Adrien
> >>>
> >>> --
> >>> Adrien
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>