I'm gonna toss out one last question while we are here: Is Vector(s)Format really a good name to use?
We already have "term vectors API", and "vector highlighter" that uses it. There's also the traditional "vector-space" scoring model. With Java 16, we get a "Vector API" from Java itself, too. I think the name is overloaded too many times already, and this one is the straw that breaks the camel's back for me. So I'm just throwing the idea out there: if this API is about ANN, maybe it should claim its own name (NeighborsFormat?) that is less ambiguous.

On Wed, Mar 17, 2021 at 9:51 AM Michael Sokolov <msoko...@gmail.com> wrote:

> I see, right, we can create a Codec that applies the values taken from
> the schema for a given field, sure, that works.
>
> On Wed, Mar 17, 2021 at 3:10 AM Adrien Grand <jpou...@gmail.com> wrote:
> >
> > Configuring the codec based on the schema is something that Solr does
> > via SchemaCodecFactory.
> > https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/core/SchemaCodecFactory.java
> >
> > Would a similar approach work in your case?
> >
> > On Tue, Mar 16, 2021 at 10:21 PM, Michael Sokolov <msoko...@gmail.com> wrote:
> >>
> >> > I was more thinking of moving VectorValues#search to
> >> > LeafReader#searchNearestVectors or something along those lines. I
> >> > agree that VectorReader should only be exposed on CodecReader.
> >>
> >> Ah, OK, yes that makes sense to me. I guess we were maybe reluctant to
> >> add such visible API changes early on in the project.
> >>
> >> > I wonder if things would be simpler if we were more opinionated and
> >> > made vectors specifically about nearest-neighbor search. Then we have
> >> > a clearer message, use vectors for NN search and doc values otherwise.
> >> > As far as I know, reinterpreting bytes as floats shouldn't add much
> >> > overhead. The main problem I know of is that the JVM won't
> >> > auto-vectorize if you read floats dynamically from a byte[], but this
> >> > is something that should be alleviated by the JDK vector API?
> >>
> >> > Actually this is the thing that feels odd to me: if we end up with
> >> > constants for both LSH and HNSW, then we are adding the requirement
> >> > that all vector formats must implement both LSH and HNSW as they will
> >> > need to support all SearchStrategy constants?
> >>
> >> Hmm I see I didn't think this all the way through ... I guess I had it
> >> in mind that there would probably only ever be a single format with
> >> internal variants for different vector index types, but as I have
> >> worked more with Lucene's index formats I see that is awkward, and I'm
> >> certainly open to restructuring it in a more natural way. Similarly
> >> for the NONE format - BinaryDocValues can be used for such
> >> (non-searchable) vectors. Indeed we had such an implementation and
> >> although we recently switched it to use the NONE format for
> >> uniformity, it could easily be switched back.
> >>
> >> Regarding the graph construction parameters (maxConn and beamWidth)
> >> I'm not sure what the right approach is exactly. We struggled to find
> >> the best API for this. I guess my concern about the PerField* approach
> >> is (at least as I think I understand it) it needs to be configured in
> >> code when creating a Codec. But we would like to be able to read such
> >> parameters from a schema configuration. I think of them as in the same
> >> spirit as an Analyzer. However I may not have fully appreciated the
> >> intention of, or how to make the best use of PerField formats. It is
> >> true we don't really need to write these parameters to the index;
> >> we're free to use different values when merging for example.
> >>
> >> -Mike
> >>
> >> On Tue, Mar 16, 2021 at 2:15 PM Adrien Grand <jpou...@gmail.com> wrote:
> >> >
> >> > Hello Mike,
> >> >
> >> > On Tue, Mar 16, 2021 at 5:05 PM Michael Sokolov <msoko...@gmail.com> wrote:
> >> >>
> >> >> I think the reason we have search() on VectorValues is that we have
> >> >> LeafReader.getVectorValues() (by analogy to the DocValues iterators),
> >> >> but no way to access the VectorReader. Do you think we should also
> >> >> have LeafReader.getVectorReader()? Today it's only on CodecReader.
> >> >
> >> > I was more thinking of moving VectorValues#search to
> >> > LeafReader#searchNearestVectors or something along those lines. I
> >> > agree that VectorReader should only be exposed on CodecReader.
> >> >
> >> >> Re: SearchStrategy.NONE; the idea is we support efficient access to
> >> >> floating point values. Using BinaryDocValues for this will always
> >> >> require an additional decoding step. I can see that the naming is
> >> >> confusing there. The intent is that you index the vector values, but
> >> >> no additional indexing data structure.
> >> >
> >> > I wonder if things would be simpler if we were more opinionated and
> >> > made vectors specifically about nearest-neighbor search. Then we have
> >> > a clearer message, use vectors for NN search and doc values otherwise.
> >> > As far as I know, reinterpreting bytes as floats shouldn't add much
> >> > overhead. The main problem I know of is that the JVM won't
> >> > auto-vectorize if you read floats dynamically from a byte[], but this
> >> > is something that should be alleviated by the JDK vector API?
> >> >
> >> >> Also: the reason HNSW is
> >> >> mentioned in these SearchStrategy enums is to make room for other
> >> >> vector indexing approaches, like LSH. There was a lot of discussion
> >> >> that we wanted an API that allowed for experimenting with other
> >> >> techniques for indexing and searching vector values.
> >> >
> >> > Actually this is the thing that feels odd to me: if we end up with
> >> > constants for both LSH and HNSW, then we are adding the requirement
> >> > that all vector formats must implement both LSH and HNSW as they will
> >> > need to support all SearchStrategy constants? Would it be possible to
> >> > have a single API and then two implementations of VectorsFormat,
> >> > LSHVectorsFormat on the one hand and HNSWVectorsFormat on the other
> >> > hand?
> >> >
> >> >> Adrien, you made an analogy to PerFieldPostingsFormat (and DocValues),
> >> >> but I think the situation is more akin to Points, where we have the
> >> >> options on IndexableField. The metadata we store there (dimension and
> >> >> score function) don't really result in different formats, ie code
> >> >> paths for indexing and storage; they are more like parameters to the
> >> >> format, in my mind. Perhaps the situation will look different when we
> >> >> get our second vector indexing strategy (like LSH).
> >> >
> >> > Having the dimension count and the score function on the FieldType
> >> > actually makes sense to me. I was more wondering whether maxConn and
> >> > beamWidth actually belong to the FieldType, or if they should be made
> >> > constructor arguments of Lucene90VectorFormat.
> >> >
> >> > --
> >> > Adrien
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: dev-h...@lucene.apache.org
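
For what it's worth, the "single API, one implementation per index type" shape from the quoted thread could look roughly like the sketch below. It is only an illustration under loud assumptions: NeighborsFormat, HnswNeighborsFormat, LshNeighborsFormat, PerFieldNeighborsFormat, the numHashTables parameter and the field names are all hypothetical, none of them exist in Lucene, and the numeric values are arbitrary. The point is just that maxConn and beamWidth become constructor arguments of one concrete format, and that the field-to-format map could be filled in from a schema.

```java
import java.util.Map;

// Hypothetical stand-in for a nearest-neighbor index format. Not a Lucene class.
interface NeighborsFormat {
    String describe();
}

// Hypothetical HNSW-backed format: the graph-construction parameters are
// constructor arguments of the format instead of attributes on the FieldType.
final class HnswNeighborsFormat implements NeighborsFormat {
    private final int maxConn;
    private final int beamWidth;

    HnswNeighborsFormat(int maxConn, int beamWidth) {
        if (maxConn <= 0 || beamWidth <= 0) {
            throw new IllegalArgumentException("maxConn and beamWidth must be positive");
        }
        this.maxConn = maxConn;
        this.beamWidth = beamWidth;
    }

    @Override
    public String describe() {
        return "hnsw(maxConn=" + maxConn + ", beamWidth=" + beamWidth + ")";
    }
}

// Hypothetical LSH-backed format with its own, unrelated parameter.
final class LshNeighborsFormat implements NeighborsFormat {
    private final int numHashTables;

    LshNeighborsFormat(int numHashTables) {
        if (numHashTables <= 0) {
            throw new IllegalArgumentException("numHashTables must be positive");
        }
        this.numHashTables = numHashTables;
    }

    @Override
    public String describe() {
        return "lsh(numHashTables=" + numHashTables + ")";
    }
}

// Per-field dispatch, loosely in the spirit of PerFieldPostingsFormat: the
// field-to-format map could be built from a schema rather than hard-coded.
final class PerFieldNeighborsFormat {
    private final Map<String, NeighborsFormat> byField;
    private final NeighborsFormat fallback;

    PerFieldNeighborsFormat(Map<String, NeighborsFormat> byField, NeighborsFormat fallback) {
        this.byField = Map.copyOf(byField);
        this.fallback = fallback;
    }

    NeighborsFormat formatForField(String field) {
        return byField.getOrDefault(field, fallback);
    }
}

final class NeighborsFormatSketch {
    public static void main(String[] args) {
        // Arbitrary example values; in practice these would come from configuration.
        PerFieldNeighborsFormat perField = new PerFieldNeighborsFormat(
            Map.of("title_vec", new HnswNeighborsFormat(32, 200),
                   "body_vec", new LshNeighborsFormat(8)),
            new HnswNeighborsFormat(16, 100));

        for (String field : new String[] {"title_vec", "body_vec", "other_vec"}) {
            System.out.println(field + " -> " + perField.formatForField(field).describe());
        }
    }
}
```

With this shape no format has to understand another format's constants, which is the concern raised about a shared SearchStrategy enum covering both HNSW and LSH.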