Thanks for doing this benchmarking. But I am very concerned that ann-benchmarks is not a good benchmark to be using.
While it may be hip/trendy/popular, it clearly states that it is only for toy datasets that fit in RAM:
https://github.com/erikbern/ann-benchmarks/blob/master/README.md#principles

On Tue, Apr 27, 2021 at 4:46 PM Julie Tibshirani <[email protected]> wrote:
>
> One last follow-up: Robert's comments got me interested in better quantifying
> the performance against other approaches. I hooked up Lucene HNSW to
> ann-benchmarks, a commonly used repo for benchmarking nearest neighbor search
> libraries against large datasets. These two issues describe the results:
> * Search recall + QPS (https://issues.apache.org/jira/browse/LUCENE-9937)
> * Index speed (https://issues.apache.org/jira/browse/LUCENE-9941)
>
> Thanks Mike for your insights so far on the search ticket.
>
> Julie
>
> On Tue, Apr 6, 2021 at 12:37 PM Julie Tibshirani <[email protected]> wrote:
>>
>> I filed one more JIRA about the approach to specifying the NN algorithm:
>> https://issues.apache.org/jira/browse/LUCENE-9905.
>>
>> As a summary, here's the current list of vector API issues we're tracking:
>> * Reconsider the format name
>> (https://issues.apache.org/jira/browse/LUCENE-9855)
>> * Revise approach to specifying NN algorithm
>> (https://issues.apache.org/jira/browse/LUCENE-9905)
>> * Move VectorValues#search to VectorReader
>> (https://issues.apache.org/jira/browse/LUCENE-9908)
>> * Should VectorValues expose both iteration and random access?
>> (https://issues.apache.org/jira/browse/LUCENE-9583)
>>
>> Julie
>>
>> On Tue, Apr 6, 2021 at 5:31 AM Adrien Grand <[email protected]> wrote:
>>>
>>> I created a JIRA about moving VectorValues#search to VectorReader:
>>> https://issues.apache.org/jira/browse/LUCENE-9908.
>>>
>>> On Tue, Mar 16, 2021 at 7:14 PM Adrien Grand <[email protected]> wrote:
>>>>
>>>> Hello Mike,
>>>>
>>>> On Tue, Mar 16, 2021 at 5:05 PM Michael Sokolov <[email protected]> wrote:
>>>>>
>>>>> I think the reason we have search() on VectorValues is that we have
>>>>> LeafReader.getVectorValues() (by analogy to the DocValues iterators),
>>>>> but no way to access the VectorReader. Do you think we should also
>>>>> have LeafReader.getVectorReader()? Today it's only on CodecReader.
>>>>
>>>>
>>>> I was more thinking of moving VectorValues#search to
>>>> LeafReader#searchNearestVectors or something along those lines. I agree
>>>> that VectorReader should only be exposed on CodecReader.
>>>>
>>>>>
>>>>> Re: SearchStrategy.NONE; the idea is we support efficient access to
>>>>> floating point values. Using BinaryDocValues for this will always
>>>>> require an additional decoding step. I can see that the naming is
>>>>> confusing there. The intent is that you index the vector values, but
>>>>> no additional indexing data structure.
>>>>
>>>>
>>>> I wonder if things would be simpler if we were more opinionated and made
>>>> vectors specifically about nearest-neighbor search. Then we have a clearer
>>>> message, use vectors for NN search and doc values otherwise. As far as I
>>>> know, reinterpreting bytes as floats shouldn't add much overhead. The main
>>>> problem I know of is that the JVM won't auto-vectorize if you read floats
>>>> dynamically from a byte[], but this is something that should be alleviated
>>>> by the JDK vector API?
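As an aside on the decoding point quoted above: if vectors were stored in BinaryDocValues, every read would have to turn the raw bytes back into a float[], roughly like the sketch below. This is only my illustration -- the helper class, the dim parameter, and the little-endian byte order are assumptions, not anything the codec prescribes.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;
    import org.apache.lucene.index.BinaryDocValues;
    import org.apache.lucene.util.BytesRef;

    final class VectorDecoding {
      // Decode the current doc's vector from a BinaryDocValues field.
      // Every access pays this bytes->floats copy, which is the extra step a
      // dedicated vector format can avoid by exposing float[] values directly.
      static float[] decodeVector(BinaryDocValues values, int dim) throws IOException {
        BytesRef ref = values.binaryValue();        // raw bytes for the current doc
        float[] vector = new float[dim];
        ByteBuffer.wrap(ref.bytes, ref.offset, ref.length)
            .order(ByteOrder.LITTLE_ENDIAN)         // assumed encoding order
            .asFloatBuffer()
            .get(vector);                           // bulk-convert bytes to floats
        return vector;
      }
    }

A dedicated vector format can hand back the float[] (or a reinterpreted view of the bytes) without that per-access copy, and the byte[]-based read path is where the auto-vectorization caveat Adrien mentions comes in.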
>>>>> Also: the reason HNSW is
>>>>> mentioned in these SearchStrategy enums is to make room for other
>>>>> vector indexing approaches, like LSH. There was a lot of discussion
>>>>> that we wanted an API that allowed for experimenting with other
>>>>> techniques for indexing and searching vector values.
>>>>
>>>>
>>>> Actually this is the thing that feels odd to me: if we end up with
>>>> constants for both LSH and HNSW, then we are adding the requirement that
>>>> all vector formats must implement both LSH and HNSW as they will need to
>>>> support all SearchStrategy constants? Would it be possible to have a
>>>> single API and then two implementations of VectorsFormat, LSHVectorsFormat
>>>> on the one hand and HNSWVectorsFormat on the other hand?
>>>>
>>>>> Adrien, you made an analogy to PerFieldPostingsFormat (and DocValues),
>>>>> but I think the situation is more akin to Points, where we have the
>>>>> options on IndexableField. The metadata we store there (dimension and
>>>>> score function) don't really result in different formats, ie code
>>>>> paths for indexing and storage; they are more like parameters to the
>>>>> format, in my mind. Perhaps the situation will look different when we
>>>>> get our second vector indexing strategy (like LSH).
>>>>
>>>>
>>>> Having the dimension count and the score function on the FieldType
>>>> actually makes sense to me. I was more wondering whether maxConn and
>>>> beamWidth actually belong to the FieldType, or if they should be made
>>>> constructor arguments of Lucene90VectorFormat.
>>>>
>>>> --
>>>> Adrien
>>>
>>>
>>> --
>>> Adrien
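P.S. On the maxConn/beamWidth question at the end of the thread: as I read it, the idea is that tuning knobs live on the format instance rather than on the FieldType, with each implementation carrying only its own parameters. A purely hypothetical sketch -- none of these classes exist in Lucene as written here, and the LSH parameter is invented just to make the contrast concrete:

    // Hypothetical sketch: hyper-parameters are constructor arguments of the
    // format, not options on FieldType; each implementation has its own knobs.
    abstract class VectorsFormat {}

    final class HNSWVectorsFormat extends VectorsFormat {
      final int maxConn;    // max connections per graph node
      final int beamWidth;  // candidate queue size used while building the graph
      HNSWVectorsFormat(int maxConn, int beamWidth) {
        this.maxConn = maxConn;
        this.beamWidth = beamWidth;
      }
    }

    final class LSHVectorsFormat extends VectorsFormat {
      final int numHashTables;  // invented parameter, purely for illustration
      LSHVectorsFormat(int numHashTables) {
        this.numHashTables = numHashTables;
      }
    }

Index-time code could then pick, say, new HNSWVectorsFormat(16, 100) for one field and new LSHVectorsFormat(8) for another, without SearchStrategy constants forcing every format to support every algorithm.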
