Re: Questions about the new vector API

Robert Muir Wed, 17 Mar 2021 13:34:35 -0700

I'm gonna toss out one last question while we are here: Is Vector(s)Format
really a good name to use?


We already have "term vectors API", and "vector highlighter" that uses it.
There's also the traditional "vector-space" scoring model. With java 16, we
get a "vector api" from java itself, too.

I think the name is overloaded too many times already, and this one is the
straw that breaks the camel's back for me.

So I'm just throwing out there the idea: if this api is about ANN, maybe it
should claim its own name (NeighborsFormat?) that is less ambiguous.

On Wed, Mar 17, 2021 at 9:51 AM Michael Sokolov <msoko...@gmail.com> wrote:

> I see, right, we can create a Codec that applies the values taken from
> the schema for a given field, sure, that works.
>
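A minimal sketch of what such a schema-driven codec could look like. All names here (SchemaCodec, GraphParams, HnswFormat, the "title_vec" field) are hypothetical stand-ins, not Lucene or Solr classes, and the numeric values are arbitrary. The point is only that per-field maxConn and beamWidth can be read from a schema and handed to a format's constructor.

```java
import java.util.Map;

public class SchemaCodecSketch {
  /** Hypothetical per-field graph parameters, e.g. parsed from a schema file. */
  static final class GraphParams {
    final int maxConn;
    final int beamWidth;

    GraphParams(int maxConn, int beamWidth) {
      this.maxConn = maxConn;
      this.beamWidth = beamWidth;
    }
  }

  /** Hypothetical vector format that takes its parameters as constructor arguments. */
  static final class HnswFormat {
    final int maxConn;
    final int beamWidth;

    HnswFormat(int maxConn, int beamWidth) {
      this.maxConn = maxConn;
      this.beamWidth = beamWidth;
    }
  }

  /** Hypothetical codec-like class that resolves a format per field from the schema. */
  static final class SchemaCodec {
    private final Map<String, GraphParams> schema;
    private final GraphParams defaults;

    SchemaCodec(Map<String, GraphParams> schema, GraphParams defaults) {
      this.schema = schema;
      this.defaults = defaults;
    }

    /** Fields missing from the schema fall back to the default parameters. */
    HnswFormat vectorFormatForField(String field) {
      GraphParams p = schema.getOrDefault(field, defaults);
      return new HnswFormat(p.maxConn, p.beamWidth);
    }
  }

  public static void main(String[] args) {
    // Illustrative values only; not recommended settings.
    SchemaCodec codec =
        new SchemaCodec(Map.of("title_vec", new GraphParams(32, 200)), new GraphParams(16, 100));
    HnswFormat format = codec.vectorFormatForField("title_vec");
    System.out.println(format.maxConn + " " + format.beamWidth);
  }
}
```

Since the parameters live only in the codec configuration, nothing in this sketch needs to be written to the index, which matches the observation in the thread that different values could be used at merge time.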
> On Wed, Mar 17, 2021 at 3:10 AM Adrien Grand <jpou...@gmail.com> wrote:
> >
> > Configuring the codec based on the schema is something that Solr does
> > via SchemaCodecFactory.
> > https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/core/SchemaCodecFactory.java
> >
> > Would a similar approach work in your case?
> >
> > On Tue, Mar 16, 2021 at 10:21 PM Michael Sokolov <msoko...@gmail.com> wrote:
> >>
> >> > I was more thinking of moving VectorValues#search to
> >> > LeafReader#searchNearestVectors or something along those lines. I agree
> >> > that VectorReader should only be exposed on CodecReader.
> >>
> >> Ah, OK, yes that makes sense to me. I guess we were maybe reluctant to
> >> add such visible API changes early on in the project.
> >>
> >> > I wonder if things would be simpler if we were more opinionated and
> >> > made vectors specifically about nearest-neighbor search. Then we have a
> >> > clearer message, use vectors for NN search and doc values otherwise. As far
> >> > as I know, reinterpreting bytes as floats shouldn't add much overhead. The
> >> > main problem I know of is that the JVM won't auto-vectorize if you read
> >> > floats dynamically from a byte[], but this is something that should be
> >> > alleviated by the JDK vector API?
> >>
> >> > Actually this is the thing that feels odd to me: if we end up with
> >> > constants for both LSH and HNSW, then we are adding the requirement that
> >> > all vector formats must implement both LSH and HNSW as they will need to
> >> > support all SearchStrategy constants?
> >>
> >> Hmm I see I didn't think this all the way through ... I guess I had it
> >> in mind that there would probably only ever be a single format with
> >> internal variants for different vector index types, but as I have
> >> worked more with Lucene's index formats I see that is awkward, and I'm
> >> certainly open to restructuring it in a more natural way. Similarly
> >> for the NONE format - BinaryDocValues can be used for such
> >> (non-searchable) vectors. Indeed we had such an implementation and
> >> although we recently switched it to use the NONE format for
> >> uniformity, it could easily be switched back.
> >>
> >> Regarding the graph construction parameters (maxConn and beamWidth)
> >> I'm not sure what the right approach is exactly. We struggled to find
> >> the best API for this. I guess my concern about the PerField* approach
> >> is (at least as I think I understand it) it needs to be configured in
> >> code when creating a Codec. But we would like to be able to read such
> >> parameters from a schema configuration. I think of them as in the same
> >> spirit as an Analyzer. However I may not have fully appreciated the
> >> intention of, or how to make the best use of PerField formats. It is
> >> true we don't really need to write these parameters to the index;
> >> we're free to use different values when merging for example.
> >>
> >> -Mike
> >>
> >> On Tue, Mar 16, 2021 at 2:15 PM Adrien Grand <jpou...@gmail.com> wrote:
> >> >
> >> > Hello Mike,
> >> >
> >> > On Tue, Mar 16, 2021 at 5:05 PM Michael Sokolov <msoko...@gmail.com> wrote:
> >> >>
> >> >> I think the reason we have search() on VectorValues is that we have
> >> >> LeafReader.getVectorValues() (by analogy to the DocValues iterators),
> >> >> but no way to access the VectorReader. Do you think we should also
> >> >> have LeafReader.getVectorReader()? Today it's only on CodecReader.
> >> >
> >> >
> >> > I was more thinking of moving VectorValues#search to
> >> > LeafReader#searchNearestVectors or something along those lines. I agree
> >> > that VectorReader should only be exposed on CodecReader.
> >> >
> >> >>
> >> >> Re: SearchStrategy.NONE; the idea is we support efficient access to
> >> >> floating point values. Using BinaryDocValues for this will always
> >> >> require an additional decoding step. I can see that the naming is
> >> >> confusing there. The intent is that you index the vector values, but
> >> >> no additional indexing data structure.
> >> >
> >> >
> >> > I wonder if things would be simpler if we were more opinionated and
> >> > made vectors specifically about nearest-neighbor search. Then we have a
> >> > clearer message, use vectors for NN search and doc values otherwise. As far
> >> > as I know, reinterpreting bytes as floats shouldn't add much overhead. The
> >> > main problem I know of is that the JVM won't auto-vectorize if you read
> >> > floats dynamically from a byte[], but this is something that should be
> >> > alleviated by the JDK vector API?
> >> >
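To make the byte-reinterpretation point concrete, here is a minimal sketch using only java.nio: a float vector is packed into a byte[] (as it might be before being stored in a binary doc value) and decoded back into a float[] before scoring. The class and method names are hypothetical, and little-endian order is an arbitrary choice for the example; this is not Lucene's actual encoding.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ByteVectors {
  /** Packs a float vector into a byte[] of length 4 * vector.length. */
  static byte[] encode(float[] vector) {
    ByteBuffer buf =
        ByteBuffer.allocate(vector.length * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
    // The float view shares the backing array, so this writes into buf.
    buf.asFloatBuffer().put(vector);
    return buf.array();
  }

  /** Reinterprets the bytes as floats with one bulk copy into a float[]. */
  static float[] decode(byte[] bytes) {
    float[] vector = new float[bytes.length / Float.BYTES];
    ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().get(vector);
    return vector;
  }

  /** Plain loop over float[] arrays; both inputs must have the same length. */
  static float dotProduct(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  public static void main(String[] args) {
    byte[] stored = encode(new float[] {1f, 2f, 3f});
    float[] decoded = decode(stored);
    System.out.println(dotProduct(decoded, new float[] {4f, 5f, 6f})); // prints 32.0
  }
}
```

Decoding once into a float[] keeps the scoring loop on a plain array instead of reading floats one at a time from the byte[], which is the access pattern the paragraph above flags as a problem for auto-vectorization.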
> >> >> Also: the reason HNSW is
> >> >> mentioned in these SearchStrategy enums is to make room for other
> >> >> vector indexing approaches, like LSH. There was a lot of discussion
> >> >> that we wanted an API that allowed for experimenting with other
> >> >> techniques for indexing and searching vector values.
> >> >
> >> >
> >> > Actually this is the thing that feels odd to me: if we end up with
> >> > constants for both LSH and HNSW, then we are adding the requirement that
> >> > all vector formats must implement both LSH and HNSW as they will need to
> >> > support all SearchStrategy constants? Would it be possible to have a single
> >> > API and then two implementations of VectorsFormat, LSHVectorsFormat on the
> >> > one hand and HNSWVectorsFormat on the other hand?
> >> >
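The "single API, two implementations" idea can be sketched as follows. The interface and classes below are hypothetical stand-ins written for illustration, not Lucene's API, and numHashTables is an assumed LSH parameter. Each implementation carries only its own tuning parameters, so no shared SearchStrategy enum has to list every algorithm.

```java
public class VectorFormatsSketch {
  /** Hypothetical single API that every vector format would implement. */
  interface VectorsFormat {
    String name();

    String describe();
  }

  /** Hypothetical HNSW-backed format; graph parameters are constructor arguments. */
  static final class HnswVectorsFormat implements VectorsFormat {
    final int maxConn;
    final int beamWidth;

    HnswVectorsFormat(int maxConn, int beamWidth) {
      this.maxConn = maxConn;
      this.beamWidth = beamWidth;
    }

    @Override
    public String name() {
      return "HNSW";
    }

    @Override
    public String describe() {
      return "HNSW(maxConn=" + maxConn + ", beamWidth=" + beamWidth + ")";
    }
  }

  /** Hypothetical LSH-backed format with its own, unrelated parameter. */
  static final class LshVectorsFormat implements VectorsFormat {
    final int numHashTables;

    LshVectorsFormat(int numHashTables) {
      this.numHashTables = numHashTables;
    }

    @Override
    public String name() {
      return "LSH";
    }

    @Override
    public String describe() {
      return "LSH(numHashTables=" + numHashTables + ")";
    }
  }

  public static void main(String[] args) {
    // Illustrative values only; callers depend on the interface, not the algorithm.
    VectorsFormat[] formats = {new HnswVectorsFormat(16, 100), new LshVectorsFormat(8)};
    for (VectorsFormat format : formats) {
      System.out.println(format.describe());
    }
  }
}
```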
> >> >> Adrien, you made an analogy to PerFieldPostingsFormat (and
> >> >> DocValues),
> >> >> but I think the situation is more akin to Points, where we have the
> >> >> options on IndexableField. The metadata we store there (dimension and
> >> >> score function) don't really result in different formats, ie code
> >> >> paths for indexing and storage; they are more like parameters to the
> >> >> format, in my mind. Perhaps the situation will look different when we
> >> >> get our second vector indexing strategy (like LSH).
> >> >
> >> >
> >> > Having the dimension count and the score function on the FieldType
> >> > actually makes sense to me. I was more wondering whether maxConn and
> >> > beamWidth actually belong to the FieldType, or if they should be made
> >> > constructor arguments of Lucene90VectorFormat.
> >> >
> >> > --
> >> > Adrien
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: dev-h...@lucene.apache.org
> >>
>
