I think the codec name is important, and the current naming does not seem appropriate anyway. I would like to try to get consensus on that in LUCENE-9855 <https://issues.apache.org/jira/browse/LUCENE-9855>.

On Sat, Mar 20, 2021 at 16:04, Tomoko Uchida <[email protected]> wrote:

> I think it makes sense that we use "ANN" or "NearestNeighbor" for ANN-related APIs; this may give a proper level of abstraction to them.
> On the other hand, it sounds slightly odd to me to use it as a Codec name... Maybe we should use names that represent its data structure, instead of methods/algorithms?
> I'd propose "DenseVector" here if "Vector" is too obscure, but it is also just an idea.
>
> Tomoko
>
> On Thu, Mar 18, 2021 at 5:34, Robert Muir <[email protected]> wrote:
>
>> I'm gonna toss out one last question while we are here: is Vector(s)Format really a good name to use?
>>
>> We already have the "term vectors" API, and the "vector highlighter" that uses it. There's also the traditional "vector-space" scoring model. With Java 16, we get a "vector API" from Java itself, too.
>>
>> I think the name is overloaded too many times already, and this one is the straw that breaks the camel's back for me.
>>
>> So I'm just throwing out there the idea: if this API is about ANN, maybe it should claim its own name (NeighborsFormat?) that is less ambiguous.
>>
>> On Wed, Mar 17, 2021 at 9:51 AM Michael Sokolov <[email protected]> wrote:
>>
>>> I see, right, we can create a Codec that applies the values taken from the schema for a given field, sure, that works.
>>>
>>> On Wed, Mar 17, 2021 at 3:10 AM Adrien Grand <[email protected]> wrote:
>>>
>>> > Configuring the codec based on the schema is something that Solr does via SchemaCodecFactory. https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/core/SchemaCodecFactory.java
>>> >
>>> > Would a similar approach work in your case?
>>> >
>>> > On Tue, Mar 16, 2021 at 22:21, Michael Sokolov <[email protected]> wrote:
>>> >
>>> >> > I was more thinking of moving VectorValues#search to LeafReader#searchNearestVectors or something along those lines. I agree that VectorReader should only be exposed on CodecReader.
>>> >>
>>> >> Ah, OK, yes that makes sense to me. I guess we were maybe reluctant to add such visible API changes early on in the project.
>>> >>
>>> >> > I wonder if things would be simpler if we were more opinionated and made vectors specifically about nearest-neighbor search. Then we have a clearer message: use vectors for NN search and doc values otherwise. As far as I know, reinterpreting bytes as floats shouldn't add much overhead. The main problem I know of is that the JVM won't auto-vectorize if you read floats dynamically from a byte[], but this is something that should be alleviated by the JDK vector API?
>>> >>
>>> >> > Actually this is the thing that feels odd to me: if we end up with constants for both LSH and HNSW, then we are adding the requirement that all vector formats must implement both LSH and HNSW, as they will need to support all SearchStrategy constants?
>>> >>
>>> >> Hmm, I see I didn't think this all the way through... I guess I had it in mind that there would probably only ever be a single format with internal variants for different vector index types, but as I have worked more with Lucene's index formats I see that is awkward, and I'm certainly open to restructuring it in a more natural way. Similarly for the NONE format - BinaryDocValues can be used for such (non-searchable) vectors. Indeed we had such an implementation, and although we recently switched it to use the NONE format for uniformity, it could easily be switched back.
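
For illustration, here is a minimal sketch of what reading a vector back out of a BinaryDocValues field involves, i.e. the reinterpret-bytes-as-floats step discussed above. The field name, the caller-supplied dimension, and the little-endian float layout are assumptions made for the sketch, not something Lucene prescribes.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.util.BytesRef;

class VectorDocValuesReader {
  // Decodes one document's vector from a BinaryDocValues field. The "vector"
  // field name, the dimension argument, and the little-endian encoding are
  // illustrative assumptions.
  static float[] readVector(LeafReader reader, int docId, int dimension) throws IOException {
    BinaryDocValues values = reader.getBinaryDocValues("vector");
    if (values == null || values.advanceExact(docId) == false) {
      return null; // no vector stored for this document
    }
    BytesRef bytes = values.binaryValue();
    float[] vector = new float[dimension];
    ByteBuffer.wrap(bytes.bytes, bytes.offset, bytes.length)
        .order(ByteOrder.LITTLE_ENDIAN)
        .asFloatBuffer()
        .get(vector);
    return vector;
  }
}

The wrap/asFloatBuffer call is the per-document decoding cost being referred to; a dedicated vectors API can hand back the floats directly and skip it.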

>>> >> Regarding the graph construction parameters (maxConn and beamWidth), I'm not sure what the right approach is exactly. We struggled to find the best API for this. I guess my concern about the PerField* approach is that (at least as I think I understand it) it needs to be configured in code when creating a Codec. But we would like to be able to read such parameters from a schema configuration. I think of them as being in the same spirit as an Analyzer. However, I may not have fully appreciated the intention of, or how to make the best use of, PerField formats. It is true that we don't really need to write these parameters to the index; we're free to use different values when merging, for example.
>>> >>
>>> >> -Mike
>>> >>
>>> >> On Tue, Mar 16, 2021 at 2:15 PM Adrien Grand <[email protected]> wrote:
>>> >>
>>> >> > Hello Mike,
>>> >> >
>>> >> > On Tue, Mar 16, 2021 at 5:05 PM Michael Sokolov <[email protected]> wrote:
>>> >> >
>>> >> >> I think the reason we have search() on VectorValues is that we have LeafReader.getVectorValues() (by analogy to the DocValues iterators), but no way to access the VectorReader. Do you think we should also have LeafReader.getVectorReader()? Today it's only on CodecReader.
>>> >> >
>>> >> > I was more thinking of moving VectorValues#search to LeafReader#searchNearestVectors or something along those lines. I agree that VectorReader should only be exposed on CodecReader.
>>> >> >
>>> >> >> Re: SearchStrategy.NONE; the idea is we support efficient access to floating point values. Using BinaryDocValues for this will always require an additional decoding step. I can see that the naming is confusing there. The intent is that you index the vector values, but no additional indexing data structure.
>>> >> >
>>> >> > I wonder if things would be simpler if we were more opinionated and made vectors specifically about nearest-neighbor search. Then we have a clearer message: use vectors for NN search and doc values otherwise. As far as I know, reinterpreting bytes as floats shouldn't add much overhead. The main problem I know of is that the JVM won't auto-vectorize if you read floats dynamically from a byte[], but this is something that should be alleviated by the JDK vector API?
>>> >> >
>>> >> >> Also: the reason HNSW is mentioned in these SearchStrategy enums is to make room for other vector indexing approaches, like LSH. There was a lot of discussion that we wanted an API that allowed for experimenting with other techniques for indexing and searching vector values.
>>> >> >
>>> >> > Actually this is the thing that feels odd to me: if we end up with constants for both LSH and HNSW, then we are adding the requirement that all vector formats must implement both LSH and HNSW, as they will need to support all SearchStrategy constants? Would it be possible to have a single API and then two implementations of VectorsFormat, LSHVectorsFormat on the one hand and HNSWVectorsFormat on the other hand?
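
To make the "single API, two implementations" idea concrete, here is a toy sketch. None of these classes, constructors, or parameters exist in Lucene; the LSH parameter in particular is purely illustrative.

// One vectors API; the indexing technique is chosen by picking an implementation,
// not by a SearchStrategy constant. Toy sketch only.
abstract class VectorsFormat {
  abstract int[] search(float[] query, int topK); // returns doc ids of the nearest vectors
}

final class HnswVectorsFormat extends VectorsFormat {
  final int maxConn;   // maximum connections per graph node
  final int beamWidth; // candidate queue size used while building the graph

  HnswVectorsFormat(int maxConn, int beamWidth) {
    this.maxConn = maxConn;
    this.beamWidth = beamWidth;
  }

  @Override
  int[] search(float[] query, int topK) {
    return new int[0]; // graph traversal would go here
  }
}

final class LshVectorsFormat extends VectorsFormat {
  final int numHashTables; // illustrative parameter only

  LshVectorsFormat(int numHashTables) {
    this.numHashTables = numHashTables;
  }

  @Override
  int[] search(float[] query, int topK) {
    return new int[0]; // hash-bucket lookup would go here
  }
}

In this shape, neither implementation needs to know about the other's constants, which sidesteps the concern that every vector format would have to support every SearchStrategy value.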

>>> >> >
>>> >> >> Adrien, you made an analogy to PerFieldPostingsFormat (and DocValues), but I think the situation is more akin to Points, where we have the options on IndexableField. The metadata we store there (dimension and score function) doesn't really result in different formats, i.e. code paths for indexing and storage; it is more like parameters to the format, in my mind. Perhaps the situation will look different when we get our second vector indexing strategy (like LSH).
>>> >> >
>>> >> > Having the dimension count and the score function on the FieldType actually makes sense to me. I was more wondering whether maxConn and beamWidth actually belong on the FieldType, or whether they should be made constructor arguments of Lucene90VectorFormat.
>>> >> >
>>> >> > --
>>> >> > Adrien
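
Combining the two suggestions above, constructor arguments on the format plus per-field dispatch in the spirit of PerFieldPostingsFormat and Solr's SchemaCodecFactory, the configuration side might look roughly like the sketch below. It reuses the toy classes from the previous sketch, and the field names and parameter values are made up; this is not a real Lucene API.

import java.util.Map;

// Per-field selection of a vectors format, with maxConn/beamWidth living on the
// format instance rather than on the FieldType. Sketch only.
final class SchemaDrivenVectorsFormat {
  private final Map<String, VectorsFormat> overrides;
  private final VectorsFormat defaultFormat;

  SchemaDrivenVectorsFormat(Map<String, VectorsFormat> overrides, VectorsFormat defaultFormat) {
    this.overrides = overrides;
    this.defaultFormat = defaultFormat;
  }

  // Analogous to PerFieldPostingsFormat#getPostingsFormatForField: pick the
  // concrete format, and therefore its construction parameters, per field.
  VectorsFormat getVectorsFormatForField(String field) {
    return overrides.getOrDefault(field, defaultFormat);
  }
}

// Example wiring, e.g. built from parsed schema settings:
//   Map<String, VectorsFormat> overrides = Map.of(
//       "title_embedding", new HnswVectorsFormat(32, 200),
//       "image_signature", new LshVectorsFormat(8));
//   SchemaDrivenVectorsFormat perField =
//       new SchemaDrivenVectorsFormat(overrides, new HnswVectorsFormat(16, 100));

The overrides map could be populated from a schema file at startup, which is essentially the approach Adrien points to with SchemaCodecFactory, while the index itself never needs to record maxConn or beamWidth.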
