Re: Questions about the new vector API

Robert Muir Wed, 28 Apr 2021 10:23:53 -0700

maybe one practical to make this benchmark more interesting would be
to starve the machine of RAM (by mlock()ing a huge amount of available
RAM and leaving only a small amount left), so that datasets don't fit
in ram during benchmarking.


there is a little script to do this in the luceneutil repo:
https://github.com/mikemccand/luceneutil/blob/master/src/python/ramhog.c

On Wed, Apr 28, 2021 at 1:16 PM Julie Tibshirani <juliet...@gmail.com> wrote:
>
> This is a good point, and I agree it’s not a perfect fit for Lucene testing/ 
> development. The repository indeed focuses on datasets that can be held in 
> memory -- by only looking at the results of methods on ann-benchmarks, we 
> might be missing important considerations like how well the method scales to 
> indexes that are larger than main memory, how well it fits with the rest of 
> the search framework, etc.
>
> I’m thinking of it as a supplementary tool that can give some performance 
> insights. It reports search accuracy in addition to speed which luceneutil 
> doesn’t do (yet :)). The datasets aren’t huge but also aren’t 'toy datasets' 
> -- they range from 1-10 million vectors with dimension 100+, and are based on 
> pretty diverse + realistic data. In the future we could extend luceneutil to 
> cover some of this, like reporting recall on a few datasets.
>
> So ann-benchmarks can be useful within a certain scope:
> * Comparing our implementation against reference libraries to double-check/ 
> debug the algorithm
> * Testing on diverse datasets, with ability to measure search speed *and* 
> recall
>
> Julie
>
> On Wed, Apr 28, 2021 at 7:16 AM Robert Muir <rcm...@gmail.com> wrote:
>>
>> Thanks for doing this benchmarking. But I am very concerned
>> ann-benchmarks is a good one to be using.
>>
>> While it may be hip/trendy/popular, it clearly states that it is only
>> for toy datasets that fit in RAM:
>> https://github.com/erikbern/ann-benchmarks/blob/master/README.md#principles
>>
>>
>>
>> On Tue, Apr 27, 2021 at 4:46 PM Julie Tibshirani <juliet...@gmail.com> wrote:
>> >
>> > One last follow-up: Robert's comments got me interested in better 
>> > quantifying the performance against other approaches. I hooked up Lucene 
>> > HNSW to ann-benchmarks, a commonly used repo for benchmarking nearest 
>> > neighbor search libraries against large datasets. These two issues 
>> > describe the results:
>> > * Search recall + QPS (https://issues.apache.org/jira/browse/LUCENE-9937)
>> > * Index speed (https://issues.apache.org/jira/browse/LUCENE-9941)
>> >
>> > Thanks Mike for your insights so far on the search ticket.
>> >
>> > Julie
>> >
>> > On Tue, Apr 6, 2021 at 12:37 PM Julie Tibshirani <juliet...@gmail.com> 
>> > wrote:
>> >>
>> >> I filed one more JIRA about the approach to specifying the NN algorithm: 
>> >> https://issues.apache.org/jira/browse/LUCENE-9905.
>> >>
>> >> As a summary, here's the current list of vector API issues we're tracking:
>> >> * Reconsider the format name 
>> >> (https://issues.apache.org/jira/browse/LUCENE-9855)
>> >> * Revise approach to specifying NN algorithm 
>> >> (https://issues.apache.org/jira/browse/LUCENE-9905)
>> >> * Move VectorValues#search to VectorReader 
>> >> (https://issues.apache.org/jira/browse/LUCENE-9908)
>> >> * Should VectorValues expose both iteration and random access? 
>> >> (https://issues.apache.org/jira/browse/LUCENE-9583)
>> >>
>> >> Julie
>> >>
>> >> On Tue, Apr 6, 2021 at 5:31 AM Adrien Grand <jpou...@gmail.com> wrote:
>> >>>
>> >>> I created a JIRA about moving VectorValues#search to VectorReader: 
>> >>> https://issues.apache.org/jira/browse/LUCENE-9908.
>> >>>
>> >>> On Tue, Mar 16, 2021 at 7:14 PM Adrien Grand <jpou...@gmail.com> wrote:
>> >>>>
>> >>>> Hello Mike,
>> >>>>
>> >>>> On Tue, Mar 16, 2021 at 5:05 PM Michael Sokolov <msoko...@gmail.com> 
>> >>>> wrote:
>> >>>>>
>> >>>>> I think the reason we have search() on VectorValues is that we have
>> >>>>> LeafReader.getVectorValues() (by analogy to the DocValues iterators),
>> >>>>> but no way to access the VectorReader. Do you think we should also
>> >>>>> have LeafReader.getVectorReader()? Today it's only on CodecReader.
>> >>>>
>> >>>>
>> >>>> I was more thinking of moving VectorValues#search to 
>> >>>> LeafReader#searchNearestVectors or something along those lines. I agree 
>> >>>> that VectorReader should only be exposed on CodecReader.
>> >>>>
>> >>>>>
>> >>>>> Re: SearchStrategy.NONE; the idea is we support efficient access to
>> >>>>> floating point values. Using BinaryDocValues for this will always
>> >>>>> require an additional decoding step. I can see that the naming is
>> >>>>> confusing there. The intent is that you index the vector values, but
>> >>>>> no additional indexing data structure.
>> >>>>
>> >>>>
>> >>>> I wonder if things would be simpler if we were more opinionated and 
>> >>>> made vectors specifically about nearest-neighbor search. Then we have a 
>> >>>> clearer message, use vectors for NN search and doc values otherwise. As 
>> >>>> far as I know, reinterpreting bytes as floats shouldn't add much 
>> >>>> overhead. The main problem I know of is that the JVM won't 
>> >>>> auto-vectorize if you read floats dynamically from a byte[], but this 
>> >>>> is something that should be alleviated by the JDK vector API?
>> >>>>
>> >>>>> Also: the reason HNSW is
>> >>>>> mentioned in these SearchStrategy enums is to make room for other
>> >>>>> vector indexing approaches, like LSH. There was a lot of discussion
>> >>>>> that we wanted an API that allowed for experimenting with other
>> >>>>> techniques for indexing and searching vector values.
>> >>>>
>> >>>>
>> >>>> Actually this is the thing that feels odd to me: if we end up with 
>> >>>> constants for both LSH and HNSW, then we are adding the requirement 
>> >>>> that all vector formats must implement both LSH and HNSW as they will 
>> >>>> need to support all SearchStrategy constants? Would it be possible to 
>> >>>> have a single API and then two implementations of VectorsFormat, 
>> >>>> LSHVectorsFormat on the one hand and HNSWVectorsFormat on the other 
>> >>>> hand?
>> >>>>
>> >>>>> Adrien, you made an analogy to PerFieldPostingsFormat (and DocValues),
>> >>>>> but I think the situation is more akin to Points, where we have the
>> >>>>> options on IndexableField. The metadata we store there (dimension and
>> >>>>> score function) don't really result in different formats, ie code
>> >>>>> paths for indexing and storage; they are more like parameters to the
>> >>>>> format, in my mind. Perhaps the situation will look different when we
>> >>>>> get our second vector indexing strategy (like LSH).
>> >>>>
>> >>>>
>> >>>> Having the dimension count and the score function on the FieldType 
>> >>>> actually makes sense to me. I was more wondering whether maxConn and 
>> >>>> beamWidth actually belong to the FieldType, or if they should be made 
>> >>>> constructor arguments of Lucene90VectorFormat.
>> >>>>
>> >>>> --
>> >>>> Adrien
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> Adrien
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Re: Questions about the new vector API

Reply via email to