Maybe one practical way to make this benchmark more interesting would be to starve the machine of RAM (by mlock()ing a huge amount of available RAM and leaving only a small amount left), so that datasets don't fit in RAM during benchmarking.
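For illustration, here is a minimal Python sketch of that idea. It is not the luceneutil script. The names `bytes_to_lock` and `hog_ram` are made up for this example, and it assumes a Unix-like system whose libc exposes mlock() and a memlock limit (`ulimit -l`) high enough for the requested size.

```python
import ctypes
import ctypes.util
import mmap
import os

GIB = 1024 ** 3


def bytes_to_lock(available_bytes: int, leave_free_bytes: int) -> int:
    """Return how many bytes to pin so that about leave_free_bytes stay usable."""
    return max(0, available_bytes - leave_free_bytes)


def hog_ram(num_bytes: int) -> mmap.mmap:
    """Map num_bytes of anonymous memory and pin it in RAM with mlock().

    Hypothetical helper: keep the returned mapping alive for the whole
    benchmark run; closing it (or exiting the process) releases the memory.
    """
    if num_bytes <= 0:
        raise ValueError("num_bytes must be positive")
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    buf = mmap.mmap(-1, num_bytes)  # anonymous, zero-filled mapping
    view = ctypes.c_char.from_buffer(buf)
    addr = ctypes.addressof(view)
    del view  # drop the buffer export so buf.close() works later
    # int mlock(const void *addr, size_t len); returns 0 on success
    if libc.mlock(ctypes.c_void_p(addr), ctypes.c_size_t(num_bytes)) != 0:
        err = ctypes.get_errno()
        buf.close()
        raise OSError(err, os.strerror(err))
    return buf
```

Usage would look roughly like `pinned = hog_ram(bytes_to_lock(mem_available, 4 * GIB))`, where `mem_available` is something like the MemAvailable value from /proc/meminfo on Linux. The benchmark then runs while `pinned` is still referenced.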
There is a little script to do this in the luceneutil repo:
https://github.com/mikemccand/luceneutil/blob/master/src/python/ramhog.c

On Wed, Apr 28, 2021 at 1:16 PM Julie Tibshirani <juliet...@gmail.com> wrote:
>
> This is a good point, and I agree it’s not a perfect fit for Lucene testing/
> development. The repository indeed focuses on datasets that can be held in
> memory -- by only looking at the results of methods on ann-benchmarks, we
> might be missing important considerations like how well the method scales to
> indexes that are larger than main memory, how well it fits with the rest of
> the search framework, etc.
>
> I’m thinking of it as a supplementary tool that can give some performance
> insights. It reports search accuracy in addition to speed which luceneutil
> doesn’t do (yet :)). The datasets aren’t huge but also aren’t 'toy datasets'
> -- they range from 1-10 million vectors with dimension 100+, and are based on
> pretty diverse + realistic data. In the future we could extend luceneutil to
> cover some of this, like reporting recall on a few datasets.
>
> So ann-benchmarks can be useful within a certain scope:
> * Comparing our implementation against reference libraries to double-check/
> debug the algorithm
> * Testing on diverse datasets, with ability to measure search speed *and*
> recall
>
> Julie
>
> On Wed, Apr 28, 2021 at 7:16 AM Robert Muir <rcm...@gmail.com> wrote:
>>
>> Thanks for doing this benchmarking. But I am very concerned about whether
>> ann-benchmarks is a good one to be using.
>>
>> While it may be hip/trendy/popular, it clearly states that it is only
>> for toy datasets that fit in RAM:
>> https://github.com/erikbern/ann-benchmarks/blob/master/README.md#principles
>>
>>
>>
>> On Tue, Apr 27, 2021 at 4:46 PM Julie Tibshirani <juliet...@gmail.com> wrote:
>> >
>> > One last follow-up: Robert's comments got me interested in better
>> > quantifying the performance against other approaches.
>> > I hooked up Lucene
>> > HNSW to ann-benchmarks, a commonly used repo for benchmarking nearest
>> > neighbor search libraries against large datasets. These two issues
>> > describe the results:
>> > * Search recall + QPS (https://issues.apache.org/jira/browse/LUCENE-9937)
>> > * Index speed (https://issues.apache.org/jira/browse/LUCENE-9941)
>> >
>> > Thanks Mike for your insights so far on the search ticket.
>> >
>> > Julie
>> >
>> > On Tue, Apr 6, 2021 at 12:37 PM Julie Tibshirani <juliet...@gmail.com>
>> > wrote:
>> >>
>> >> I filed one more JIRA about the approach to specifying the NN algorithm:
>> >> https://issues.apache.org/jira/browse/LUCENE-9905.
>> >>
>> >> As a summary, here's the current list of vector API issues we're tracking:
>> >> * Reconsider the format name
>> >> (https://issues.apache.org/jira/browse/LUCENE-9855)
>> >> * Revise approach to specifying NN algorithm
>> >> (https://issues.apache.org/jira/browse/LUCENE-9905)
>> >> * Move VectorValues#search to VectorReader
>> >> (https://issues.apache.org/jira/browse/LUCENE-9908)
>> >> * Should VectorValues expose both iteration and random access?
>> >> (https://issues.apache.org/jira/browse/LUCENE-9583)
>> >>
>> >> Julie
>> >>
>> >> On Tue, Apr 6, 2021 at 5:31 AM Adrien Grand <jpou...@gmail.com> wrote:
>> >>>
>> >>> I created a JIRA about moving VectorValues#search to VectorReader:
>> >>> https://issues.apache.org/jira/browse/LUCENE-9908.
>> >>>
>> >>> On Tue, Mar 16, 2021 at 7:14 PM Adrien Grand <jpou...@gmail.com> wrote:
>> >>>>
>> >>>> Hello Mike,
>> >>>>
>> >>>> On Tue, Mar 16, 2021 at 5:05 PM Michael Sokolov <msoko...@gmail.com>
>> >>>> wrote:
>> >>>>>
>> >>>>> I think the reason we have search() on VectorValues is that we have
>> >>>>> LeafReader.getVectorValues() (by analogy to the DocValues iterators),
>> >>>>> but no way to access the VectorReader. Do you think we should also
>> >>>>> have LeafReader.getVectorReader()? Today it's only on CodecReader.
>> >>>>
>> >>>>
>> >>>> I was more thinking of moving VectorValues#search to
>> >>>> LeafReader#searchNearestVectors or something along those lines. I agree
>> >>>> that VectorReader should only be exposed on CodecReader.
>> >>>>
>> >>>>>
>> >>>>> Re: SearchStrategy.NONE; the idea is we support efficient access to
>> >>>>> floating point values. Using BinaryDocValues for this will always
>> >>>>> require an additional decoding step. I can see that the naming is
>> >>>>> confusing there. The intent is that you index the vector values, but
>> >>>>> no additional indexing data structure.
>> >>>>
>> >>>>
>> >>>> I wonder if things would be simpler if we were more opinionated and
>> >>>> made vectors specifically about nearest-neighbor search. Then we have a
>> >>>> clearer message, use vectors for NN search and doc values otherwise. As
>> >>>> far as I know, reinterpreting bytes as floats shouldn't add much
>> >>>> overhead. The main problem I know of is that the JVM won't
>> >>>> auto-vectorize if you read floats dynamically from a byte[], but this
>> >>>> is something that should be alleviated by the JDK vector API?
>> >>>>
>> >>>>> Also: the reason HNSW is
>> >>>>> mentioned in these SearchStrategy enums is to make room for other
>> >>>>> vector indexing approaches, like LSH. There was a lot of discussion
>> >>>>> that we wanted an API that allowed for experimenting with other
>> >>>>> techniques for indexing and searching vector values.
>> >>>>
>> >>>>
>> >>>> Actually this is the thing that feels odd to me: if we end up with
>> >>>> constants for both LSH and HNSW, then we are adding the requirement
>> >>>> that all vector formats must implement both LSH and HNSW as they will
>> >>>> need to support all SearchStrategy constants? Would it be possible to
>> >>>> have a single API and then two implementations of VectorsFormat,
>> >>>> LSHVectorsFormat on the one hand and HNSWVectorsFormat on the other
>> >>>> hand?
>> >>>>
>> >>>>> Adrien, you made an analogy to PerFieldPostingsFormat (and DocValues),
>> >>>>> but I think the situation is more akin to Points, where we have the
>> >>>>> options on IndexableField. The metadata we store there (dimension and
>> >>>>> score function) don't really result in different formats, ie code
>> >>>>> paths for indexing and storage; they are more like parameters to the
>> >>>>> format, in my mind. Perhaps the situation will look different when we
>> >>>>> get our second vector indexing strategy (like LSH).
>> >>>>
>> >>>>
>> >>>> Having the dimension count and the score function on the FieldType
>> >>>> actually makes sense to me. I was more wondering whether maxConn and
>> >>>> beamWidth actually belong to the FieldType, or if they should be made
>> >>>> constructor arguments of Lucene90VectorFormat.
>> >>>>
>> >>>> --
>> >>>> Adrien
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> Adrien
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org