Re: Questions about the new vector API

Tomoko Uchida Sat, 20 Mar 2021 00:05:26 -0700

I think it makes sense to use "ANN" or "NearestNeighbor" for ANN-related
APIs; that may give them the proper level of abstraction.
On the other hand, it sounds slightly odd to me as a Codec name... Maybe
we should use names that represent the data structure, rather than the
methods/algorithms?
I'd propose "DenseVector" here if "Vector" is too obscure, but this is
also just an idea.


Tomoko


On Thu, Mar 18, 2021 at 5:34, Robert Muir <[email protected]> wrote:

> I'm gonna toss out one last question while we are here: Is Vector(s)Format
> really a good name to use?
>
> We already have "term vectors API", and "vector highlighter" that uses it.
> There's also the traditional "vector-space" scoring model. With java 16, we
> get a "vector api" from java itself, too.
>
> I think the name is overloaded too many times already, and this one is the
> straw that breaks the camel's back for me.
>
> So I'm just throwing the idea out there: if this API is about ANN, maybe
> it should claim its own name (NeighborsFormat?) that is less ambiguous.
>
> On Wed, Mar 17, 2021 at 9:51 AM Michael Sokolov <[email protected]>
> wrote:
>
>> I see, right, we can create a Codec that applies the values taken from
>> the schema for a given field, sure, that works.
>>
>> On Wed, Mar 17, 2021 at 3:10 AM Adrien Grand <[email protected]> wrote:
>> >
>> > Configuring the codec based on the schema is something that Solr does
>> > via SchemaCodecFactory.
>> > https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/core/SchemaCodecFactory.java
>> >
>> > Would a similar approach work in your case?
>> >
>> > On Tue, Mar 16, 2021 at 22:21, Michael Sokolov <[email protected]> wrote:
>> >>
>> >> > I was more thinking of moving VectorValues#search to
>> >> > LeafReader#searchNearestVectors or something along those lines. I agree
>> >> > that VectorReader should only be exposed on CodecReader.
>> >>
>> >> Ah, OK, yes that makes sense to me. I guess we were maybe reluctant to
>> >> add such visible API changes early on in the project.
>> >>
>> >> > I wonder if things would be simpler if we were more opinionated and
>> >> > made vectors specifically about nearest-neighbor search. Then we have a
>> >> > clearer message, use vectors for NN search and doc values otherwise. As far
>> >> > as I know, reinterpreting bytes as floats shouldn't add much overhead. The
>> >> > main problem I know of is that the JVM won't auto-vectorize if you read
>> >> > floats dynamically from a byte[], but this is something that should be
>> >> > alleviated by the JDK vector API?
>> >>
>> >> > Actually this is the thing that feels odd to me: if we end up with
>> >> > constants for both LSH and HNSW, then we are adding the requirement that
>> >> > all vector formats must implement both LSH and HNSW as they will need to
>> >> > support all SearchStrategy constants?
>> >>
>> >> Hmm I see I didn't think this all the way through ... I guess I had it
>> >> in mind that there would probably only ever be a single format with
>> >> internal variants for different vector index types, but as I have
>> >> worked more with Lucene's index formats I see that is awkward, and I'm
>> >> certainly open to restructuring it in a more natural way. Similarly
>> >> for the NONE format - BinaryDocValues can be used for such
>> >> (non-searchable) vectors. Indeed we had such an implementation and
>> >> although we recently switched it to use the NONE format for
>> >> uniformity, it could easily be switched back.
>> >>
>> >> Regarding the graph construction parameters (maxConn and beamWidth)
>> >> I'm not sure what the right approach is exactly. We struggled to find
>> >> the best API for this. I guess my concern about the PerField* approach
>> >> is (at least as I think I understand it) it needs to be configured in
>> >> code when creating a Codec. But we would like to be able to read such
>> >> parameters from a schema configuration. I think of them as in the same
>> >> spirit as an Analyzer. However I may not have fully appreciated the
>> >> intention of, or how to make the best use of PerField formats. It is
>> >> true we don't really need to write these parameters to the index;
>> >> we're free to use different values when merging for example.
>> >>
>> >> -Mike
>> >>
>> >> On Tue, Mar 16, 2021 at 2:15 PM Adrien Grand <[email protected]> wrote:
>> >> >
>> >> > Hello Mike,
>> >> >
>> >> > On Tue, Mar 16, 2021 at 5:05 PM Michael Sokolov <[email protected]> wrote:
>> >> >>
>> >> >> I think the reason we have search() on VectorValues is that we have
>> >> >> LeafReader.getVectorValues() (by analogy to the DocValues iterators),
>> >> >> but no way to access the VectorReader. Do you think we should also
>> >> >> have LeafReader.getVectorReader()? Today it's only on CodecReader.
>> >> >
>> >> >
>> >> > I was more thinking of moving VectorValues#search to
>> >> > LeafReader#searchNearestVectors or something along those lines. I agree
>> >> > that VectorReader should only be exposed on CodecReader.
>> >> >
>> >> >>
>> >> >> Re: SearchStrategy.NONE; the idea is we support efficient access to
>> >> >> floating point values. Using BinaryDocValues for this will always
>> >> >> require an additional decoding step. I can see that the naming is
>> >> >> confusing there. The intent is that you index the vector values, but
>> >> >> no additional indexing data structure.
>> >> >
>> >> >
>> >> > I wonder if things would be simpler if we were more opinionated and
>> >> > made vectors specifically about nearest-neighbor search. Then we have a
>> >> > clearer message, use vectors for NN search and doc values otherwise. As far
>> >> > as I know, reinterpreting bytes as floats shouldn't add much overhead. The
>> >> > main problem I know of is that the JVM won't auto-vectorize if you read
>> >> > floats dynamically from a byte[], but this is something that should be
>> >> > alleviated by the JDK vector API?
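A minimal Java sketch of the extra decoding step discussed here, assuming a vector is stored as raw little-endian float32 bytes in a binary doc value. The byte order and the class name `FloatBytes` are illustrative assumptions, not Lucene's actual encoding:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper; shows the bytes-to-floats reinterpretation only. */
final class FloatBytes {
    /** Packs a float vector into a byte[] (4 bytes per value, little-endian). */
    static byte[] encode(float[] vector) {
        ByteBuffer buf = ByteBuffer.allocate(vector.length * Float.BYTES)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        // The float view shares storage with buf, so this fills buf's backing array.
        buf.asFloatBuffer().put(vector);
        return buf.array();
    }

    /** The additional step a byte[]-based representation needs before scoring. */
    static float[] decode(byte[] bytes) {
        float[] out = new float[bytes.length / Float.BYTES];
        ByteBuffer.wrap(bytes)
                  .order(ByteOrder.LITTLE_ENDIAN)
                  .asFloatBuffer()
                  .get(out); // bulk copy into a float[] that scoring loops can use
        return out;
    }
}
```

Decoding into a `float[]` once per vector keeps the scoring loop over plain floats, which is the form the JIT is more likely to auto-vectorize.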
>> >> >
>> >> >> Also: the reason HNSW is
>> >> >> mentioned in these SearchStrategy enums is to make room for other
>> >> >> vector indexing approaches, like LSH. There was a lot of discussion
>> >> >> that we wanted an API that allowed for experimenting with other
>> >> >> techniques for indexing and searching vector values.
>> >> >
>> >> >
>> >> > Actually this is the thing that feels odd to me: if we end up with
>> >> > constants for both LSH and HNSW, then we are adding the requirement that
>> >> > all vector formats must implement both LSH and HNSW as they will need to
>> >> > support all SearchStrategy constants? Would it be possible to have a single
>> >> > API and then two implementations of VectorsFormat, LSHVectorsFormat on the
>> >> > one hand and HNSWVectorsFormat on the other hand?
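The "single API, two implementations" shape suggested here could look roughly like the sketch below. Every name and signature is hypothetical and none of it is Lucene's API. Both implementations fall back to an exact scan as a placeholder, since the point is only that the strategy is chosen by picking a class rather than by an enum constant every format must handle:

```java
import java.util.Arrays;

/** Hypothetical single search API; name and signature are assumptions. */
interface VectorsFormatSketch {
    /** Returns ordinals of up to k stored vectors nearest to the query, closest first. */
    int[] search(float[][] vectors, float[] query, int k);
}

/** Exact scan shared by both placeholder implementations below. */
final class ExactScan {
    static int[] nearest(float[][] vectors, float[] query, int k) {
        Integer[] ords = new Integer[vectors.length];
        for (int i = 0; i < ords.length; i++) ords[i] = i;
        // Sort ordinals by squared Euclidean distance to the query.
        Arrays.sort(ords, (Integer a, Integer b) ->
            Double.compare(sqDist(vectors[a], query), sqDist(vectors[b], query)));
        int n = Math.min(k, ords.length);
        int[] top = new int[n];
        for (int i = 0; i < n; i++) top[i] = ords[i];
        return top;
    }

    static double sqDist(float[] x, float[] y) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            sum += d * d;
        }
        return sum;
    }
}

/** Placeholder: a real version would traverse an HNSW graph instead. */
final class HnswVectorsFormatSketch implements VectorsFormatSketch {
    public int[] search(float[][] vectors, float[] query, int k) {
        return ExactScan.nearest(vectors, query, k);
    }
}

/** Placeholder: a real version would probe LSH buckets instead. */
final class LshVectorsFormatSketch implements VectorsFormatSketch {
    public int[] search(float[][] vectors, float[] query, int k) {
        return ExactScan.nearest(vectors, query, k);
    }
}
```

With this shape, adding a third strategy means adding a third class, and no existing format has to learn about it.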
>> >> >
>> >> >> Adrien, you made an analogy to PerFieldPostingsFormat (and DocValues),
>> >> >> but I think the situation is more akin to Points, where we have the
>> >> >> options on IndexableField. The metadata we store there (dimension and
>> >> >> score function) don't really result in different formats, ie code
>> >> >> paths for indexing and storage; they are more like parameters to the
>> >> >> format, in my mind. Perhaps the situation will look different when we
>> >> >> get our second vector indexing strategy (like LSH).
>> >> >
>> >> >
>> >> > Having the dimension count and the score function on the FieldType
>> >> > actually makes sense to me. I was more wondering whether maxConn and
>> >> > beamWidth actually belong to the FieldType, or if they should be made
>> >> > constructor arguments of Lucene90VectorFormat.
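A rough sketch of the constructor-argument alternative, combined with a per-field lookup that could be filled from a schema. The classes below are hypothetical stand-ins, not Lucene or Solr types, and the numeric values in the usage note are arbitrary:

```java
import java.util.Map;

/** Hypothetical format whose graph-construction parameters are constructor arguments. */
final class HnswFormatConfigSketch {
    final int maxConn;   // max connections per graph node
    final int beamWidth; // candidates tracked while building the graph

    HnswFormatConfigSketch(int maxConn, int beamWidth) {
        if (maxConn <= 0 || beamWidth <= 0) {
            throw new IllegalArgumentException("maxConn and beamWidth must be positive");
        }
        this.maxConn = maxConn;
        this.beamWidth = beamWidth;
    }
}

/** Hypothetical per-field selection, in the spirit of a PerField* format. */
final class PerFieldConfigSketch {
    private final Map<String, HnswFormatConfigSketch> byField;
    private final HnswFormatConfigSketch fallback;

    PerFieldConfigSketch(Map<String, HnswFormatConfigSketch> byField,
                         HnswFormatConfigSketch fallback) {
        this.byField = Map.copyOf(byField); // e.g. populated from schema configuration
        this.fallback = fallback;
    }

    /** Returns the field's configured format, or the default when none is set. */
    HnswFormatConfigSketch forField(String field) {
        return byField.getOrDefault(field, fallback);
    }
}
```

For example, `new PerFieldConfigSketch(Map.of("title_vec", new HnswFormatConfigSketch(32, 200)), new HnswFormatConfigSketch(16, 100))` would give `title_vec` its own parameters while every other field uses the default, without storing either value on the FieldType.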
>> >> >
>> >> > --
>> >> > Adrien
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: [email protected]
>> >> For additional commands, e-mail: [email protected]
>> >>
>>
