Re: Questions about the new vector API

Adrien Grand Tue, 06 Apr 2021 05:31:03 -0700

I created a JIRA about moving VectorValues#search to VectorReader:
https://issues.apache.org/jira/browse/LUCENE-9908.


On Tue, Mar 16, 2021 at 7:14 PM Adrien Grand <jpou...@gmail.com> wrote:

> Hello Mike,
>
> On Tue, Mar 16, 2021 at 5:05 PM Michael Sokolov <msoko...@gmail.com>
> wrote:
>
>> I think the reason we have search() on VectorValues is that we have
>> LeafReader.getVectorValues() (by analogy to the DocValues iterators),
>> but no way to access the VectorReader. Do you think we should also
>> have LeafReader.getVectorReader()? Today it's only on CodecReader.
>>
>
> I was more thinking of moving VectorValues#search to
> LeafReader#searchNearestVectors or something along those lines. I agree
> that VectorReader should only be exposed on CodecReader.
>
>
>> Re: SearchStrategy.NONE; the idea is we support efficient access to
>> floating point values. Using BinaryDocValues for this will always
>> require an additional decoding step. I can see that the naming is
>> confusing there. The intent is that you index the vector values, but
>> no additional indexing data structure.
>
>
> I wonder if things would be simpler if we were more opinionated and made
> vectors specifically about nearest-neighbor search. Then we have a
> clearer message: use vectors for NN search and doc values otherwise. As far
> as I know, reinterpreting bytes as floats shouldn't add much overhead. The
> main problem I know of is that the JVM won't auto-vectorize if you read
> floats dynamically from a byte[], but this is something that should be
> alleviated by the JDK vector API?
>
>> Also: the reason HNSW is
>> mentioned in these SearchStrategy enums is to make room for other
>> vector indexing approaches, like LSH. There was a lot of discussion
>> that we wanted an API that allowed for experimenting with other
>> techniques for indexing and searching vector values.
>>
>
> Actually this is the thing that feels odd to me: if we end up with
> constants for both LSH and HNSW, then we are adding the requirement that
> all vector formats must implement both LSH and HNSW as they will need to
> support all SearchStrategy constants? Would it be possible to have a single
> API and then two implementations of VectorsFormat, LSHVectorsFormat on the
> one hand and HNSWVectorsFormat on the other hand?
>
>> Adrien, you made an analogy to PerFieldPostingsFormat (and DocValues),
>> but I think the situation is more akin to Points, where we have the
>> options on IndexableField. The metadata we store there (dimension and
>> score function) don't really result in different formats, i.e. code
>> paths for indexing and storage; they are more like parameters to the
>> format, in my mind. Perhaps the situation will look different when we
>> get our second vector indexing strategy (like LSH).
>
>
> Having the dimension count and the score function on the FieldType
> actually makes sense to me. I was more wondering whether maxConn
> and beamWidth actually belong to the FieldType, or if they should be made
> constructor arguments of Lucene90VectorFormat.
>
> --
> Adrien
>
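
To make the point about reinterpreting bytes as floats concrete, here is a minimal sketch of decoding a float vector from a byte[] (as it might come out of a binary doc value) before scoring it. The class and method names are hypothetical, and the little-endian layout is an assumption for illustration only; none of this is Lucene code. The "additional decoding step" mentioned above is the decode() call.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper, not part of Lucene: stores a float vector as raw bytes and reads it back. */
final class BytesAsFloats {

  /** Encodes a float vector as little-endian bytes (byte order is an assumption of this sketch). */
  static byte[] encode(float[] vector) {
    ByteBuffer buffer =
        ByteBuffer.allocate(vector.length * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
    // The float view shares the backing array, so this writes straight into it.
    buffer.asFloatBuffer().put(vector);
    return buffer.array();
  }

  /** Decodes the bytes back into floats; this is the extra step compared to native float storage. */
  static float[] decode(byte[] bytes) {
    float[] vector = new float[bytes.length / Float.BYTES];
    ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().get(vector);
    return vector;
  }

  /** Plain scalar dot product over already-decoded floats. */
  static float dotProduct(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }
}
```

Decoding into a float[] up front keeps the scoring loop a simple loop over arrays, which is the shape the JIT has the best chance of vectorizing; whether the copy is cheap enough in practice would need to be measured.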


-- 
Adrien
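
For readers following the thread, the "single API, two format implementations" idea, with maxConn and beamWidth as constructor arguments rather than FieldType options, could be sketched roughly as follows. Every type name here (NearestVectorSearcher, SketchVectorFormat, SketchHnswFormat, SketchLshFormat, BruteForce) and the numHashes parameter are hypothetical and exist only for illustration; this is not Lucene's API, and both formats fall back to a brute-force scan as a stand-in for a real HNSW graph or LSH index.

```java
import java.util.Arrays;

/** Hypothetical single search API shared by all formats; not Lucene's actual interface. */
interface NearestVectorSearcher {
  /** Returns the ordinals of the k vectors with the highest dot product against the query. */
  int[] search(float[] query, int k);
}

/** Hypothetical format abstraction: each implementation owns its index structure and tuning. */
interface SketchVectorFormat {
  String name();

  NearestVectorSearcher open(float[][] vectors);
}

/** Stand-in for a real index: exhaustive scan, sorted by descending score. */
final class BruteForce {
  static int[] topK(float[][] vectors, float[] query, int k) {
    Integer[] ords = new Integer[vectors.length];
    float[] scores = new float[vectors.length];
    for (int i = 0; i < vectors.length; i++) {
      ords[i] = i;
      for (int j = 0; j < query.length; j++) {
        scores[i] += vectors[i][j] * query[j];
      }
    }
    Arrays.sort(ords, (x, y) -> Float.compare(scores[y], scores[x]));
    int[] result = new int[Math.min(k, ords.length)];
    for (int i = 0; i < result.length; i++) {
      result[i] = ords[i];
    }
    return result;
  }
}

/** HNSW tuning lives on the format's constructor, not on the field type. */
final class SketchHnswFormat implements SketchVectorFormat {
  private final int maxConn;
  private final int beamWidth;

  SketchHnswFormat(int maxConn, int beamWidth) {
    this.maxConn = maxConn;
    this.beamWidth = beamWidth;
  }

  @Override
  public String name() {
    return "HNSW(maxConn=" + maxConn + ", beamWidth=" + beamWidth + ")";
  }

  @Override
  public NearestVectorSearcher open(float[][] vectors) {
    // Placeholder: a real implementation would search a graph built with maxConn/beamWidth.
    return (query, k) -> BruteForce.topK(vectors, query, k);
  }
}

/** A second format with its own, unrelated parameter; callers see the same search API. */
final class SketchLshFormat implements SketchVectorFormat {
  private final int numHashes; // hypothetical LSH parameter

  SketchLshFormat(int numHashes) {
    this.numHashes = numHashes;
  }

  @Override
  public String name() {
    return "LSH(numHashes=" + numHashes + ")";
  }

  @Override
  public NearestVectorSearcher open(float[][] vectors) {
    // Placeholder: a real implementation would probe hash buckets instead of scanning.
    return (query, k) -> BruteForce.topK(vectors, query, k);
  }
}
```

With this shape, the field type would carry only the dimension count and the score function, and no format would need to support a strategy it does not implement.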
