Thanks for doing this benchmarking. But I am very concerned that ann-benchmarks is not a good benchmark to be using.
While it may be hip/trendy/popular, it clearly states that it is only for toy datasets that fit in RAM:
https://github.com/erikbern/ann-benchmarks/blob/master/README.md#principles

On Tue, Apr 27, 2021 at 4:46 PM Julie Tibshirani <[email protected]> wrote:
>
> One last follow-up: Robert's comments got me interested in better quantifying
> the performance against other approaches. I hooked up Lucene HNSW to
> ann-benchmarks, a commonly used repo for benchmarking nearest neighbor search
> libraries against large datasets. These two issues describe the results:
> * Search recall + QPS (https://issues.apache.org/jira/browse/LUCENE-9937)
> * Index speed (https://issues.apache.org/jira/browse/LUCENE-9941)
>
> Thanks Mike for your insights so far on the search ticket.
>
> Julie
>
> On Tue, Apr 6, 2021 at 12:37 PM Julie Tibshirani <[email protected]> wrote:
>>
>> I filed one more JIRA about the approach to specifying the NN algorithm:
>> https://issues.apache.org/jira/browse/LUCENE-9905.
>>
>> As a summary, here's the current list of vector API issues we're tracking:
>> * Reconsider the format name
>> (https://issues.apache.org/jira/browse/LUCENE-9855)
>> * Revise approach to specifying NN algorithm
>> (https://issues.apache.org/jira/browse/LUCENE-9905)
>> * Move VectorValues#search to VectorReader
>> (https://issues.apache.org/jira/browse/LUCENE-9908)
>> * Should VectorValues expose both iteration and random access?
>> (https://issues.apache.org/jira/browse/LUCENE-9583)
>>
>> Julie
>>
>> On Tue, Apr 6, 2021 at 5:31 AM Adrien Grand <[email protected]> wrote:
>>>
>>> I created a JIRA about moving VectorValues#search to VectorReader:
>>> https://issues.apache.org/jira/browse/LUCENE-9908.
>>>
>>> On Tue, Mar 16, 2021 at 7:14 PM Adrien Grand <[email protected]> wrote:
>>>>
>>>> Hello Mike,
>>>>
>>>> On Tue, Mar 16, 2021 at 5:05 PM Michael Sokolov <[email protected]> wrote:
>>>>>
>>>>> I think the reason we have search() on VectorValues is that we have
>>>>> LeafReader.getVectorValues() (by analogy to the DocValues iterators),
>>>>> but no way to access the VectorReader. Do you think we should also
>>>>> have LeafReader.getVectorReader()? Today it's only on CodecReader.
>>>>
>>>>
>>>> I was more thinking of moving VectorValues#search to
>>>> LeafReader#searchNearestVectors or something along those lines. I agree
>>>> that VectorReader should only be exposed on CodecReader.
>>>>
>>>>>
>>>>> Re: SearchStrategy.NONE; the idea is we support efficient access to
>>>>> floating point values. Using BinaryDocValues for this will always
>>>>> require an additional decoding step. I can see that the naming is
>>>>> confusing there. The intent is that you index the vector values, but
>>>>> no additional indexing data structure.
>>>>
>>>>
>>>> I wonder if things would be simpler if we were more opinionated and made
>>>> vectors specifically about nearest-neighbor search. Then we have a clearer
>>>> message, use vectors for NN search and doc values otherwise. As far as I
>>>> know, reinterpreting bytes as floats shouldn't add much overhead. The main
>>>> problem I know of is that the JVM won't auto-vectorize if you read floats
>>>> dynamically from a byte[], but this is something that should be alleviated
>>>> by the JDK vector API?
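As an aside on the decoding point quoted above: if vectors were stored in BinaryDocValues, every read would have to turn the raw bytes back into a float[], roughly like the sketch below. This is only my illustration -- the helper class, the dim parameter, and the little-endian byte order are assumptions, not anything the codec prescribes.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;
    import org.apache.lucene.index.BinaryDocValues;
    import org.apache.lucene.util.BytesRef;

    final class VectorDecoding {
      // Decode the current doc's vector from a BinaryDocValues field.
      // Every access pays this bytes->floats copy, which is the extra step a
      // dedicated vector format can avoid by exposing float[] values directly.
      static float[] decodeVector(BinaryDocValues values, int dim) throws IOException {
        BytesRef ref = values.binaryValue();        // raw bytes for the current doc
        float[] vector = new float[dim];
        ByteBuffer.wrap(ref.bytes, ref.offset, ref.length)
            .order(ByteOrder.LITTLE_ENDIAN)         // assumed encoding order
            .asFloatBuffer()
            .get(vector);                           // bulk-convert bytes to floats
        return vector;
      }
    }

A dedicated vector format can hand back the float[] (or a reinterpreted view of the bytes) without that per-access copy, and the byte[]-based read path is where the auto-vectorization caveat Adrien mentions comes in.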
>>>>> Also: the reason HNSW is
>>>>> mentioned in these SearchStrategy enums is to make room for other
>>>>> vector indexing approaches, like LSH. There was a lot of discussion
>>>>> that we wanted an API that allowed for experimenting with other
>>>>> techniques for indexing and searching vector values.
>>>>
>>>>
>>>> Actually this is the thing that feels odd to me: if we end up with
>>>> constants for both LSH and HNSW, then we are adding the requirement that
>>>> all vector formats must implement both LSH and HNSW as they will need to
>>>> support all SearchStrategy constants? Would it be possible to have a
>>>> single API and then two implementations of VectorsFormat, LSHVectorsFormat
>>>> on the one hand and HNSWVectorsFormat on the other hand?
>>>>
>>>>> Adrien, you made an analogy to PerFieldPostingsFormat (and DocValues),
>>>>> but I think the situation is more akin to Points, where we have the
>>>>> options on IndexableField. The metadata we store there (dimension and
>>>>> score function) don't really result in different formats, ie code
>>>>> paths for indexing and storage; they are more like parameters to the
>>>>> format, in my mind. Perhaps the situation will look different when we
>>>>> get our second vector indexing strategy (like LSH).
>>>>
>>>>
>>>> Having the dimension count and the score function on the FieldType
>>>> actually makes sense to me. I was more wondering whether maxConn and
>>>> beamWidth actually belong to the FieldType, or if they should be made
>>>> constructor arguments of Lucene90VectorFormat.
>>>>
>>>> --
>>>> Adrien
>>>
>>>
>>> --
>>> Adrien
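P.S. On the maxConn/beamWidth question at the end of the thread: as I read it, the idea is that tuning knobs live on the format instance rather than on the FieldType, with each implementation carrying only its own parameters. A purely hypothetical sketch -- none of these classes exist in Lucene as written here, and the LSH parameter is invented just to make the contrast concrete:

    // Hypothetical sketch: hyper-parameters are constructor arguments of the
    // format, not options on FieldType; each implementation has its own knobs.
    abstract class VectorsFormat {}

    final class HNSWVectorsFormat extends VectorsFormat {
      final int maxConn;    // max connections per graph node
      final int beamWidth;  // candidate queue size used while building the graph
      HNSWVectorsFormat(int maxConn, int beamWidth) {
        this.maxConn = maxConn;
        this.beamWidth = beamWidth;
      }
    }

    final class LSHVectorsFormat extends VectorsFormat {
      final int numHashTables;  // invented parameter, purely for illustration
      LSHVectorsFormat(int numHashTables) {
        this.numHashTables = numHashTables;
      }
    }

Index-time code could then pick, say, new HNSWVectorsFormat(16, 100) for one field and new LSHVectorsFormat(8) for another, without SearchStrategy constants forcing every format to support every algorithm.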
