I see, right, we can create a Codec that applies the values taken from the schema for a given field. Sure, that works.
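
A minimal sketch of that idea, assuming the schema can be reduced to a map from field name to graph construction parameters. It is plain Java with no Lucene dependency: HnswParams and SchemaVectorParams are hypothetical names invented for illustration, and the field name and numeric values are arbitrary. A schema-driven codec would consult a lookup like this when choosing the vector format for a field.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in, not a Lucene class: the HNSW graph construction
// parameters (maxConn and beamWidth) for a single field.
final class HnswParams {
    final int maxConn;
    final int beamWidth;

    HnswParams(int maxConn, int beamWidth) {
        if (maxConn <= 0 || beamWidth <= 0) {
            throw new IllegalArgumentException("maxConn and beamWidth must be positive");
        }
        this.maxConn = maxConn;
        this.beamWidth = beamWidth;
    }
}

// Hypothetical view of a schema: field name -> parameters, with a fallback
// for fields the schema does not mention.
final class SchemaVectorParams {
    private final Map<String, HnswParams> perField = new HashMap<>();
    private final HnswParams defaults;

    SchemaVectorParams(HnswParams defaults) {
        this.defaults = defaults;
    }

    // Records the parameters read from the schema for one field.
    SchemaVectorParams register(String field, int maxConn, int beamWidth) {
        perField.put(field, new HnswParams(maxConn, beamWidth));
        return this;
    }

    // The lookup a per-field codec would perform when asked for a field's format.
    HnswParams paramsFor(String field) {
        return perField.getOrDefault(field, defaults);
    }

    public static void main(String[] args) {
        // Arbitrary illustrative values, not recommended settings.
        SchemaVectorParams schema = new SchemaVectorParams(new HnswParams(16, 100))
                .register("title_vector", 32, 200);

        HnswParams configured = schema.paramsFor("title_vector");
        HnswParams fallback = schema.paramsFor("body_vector");
        System.out.println("title_vector: " + configured.maxConn + ", " + configured.beamWidth);
        System.out.println("body_vector: " + fallback.maxConn + ", " + fallback.beamWidth);
    }
}
```

Since the parameters live outside the index in this sketch, different values can be supplied at merge time, which matches the point made below that they do not need to be written to the index.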

On Wed, Mar 17, 2021 at 3:10 AM Adrien Grand <[email protected]> wrote:
>
> Configuring the codec based on the schema is something that Solr does via
> SchemaCodecFactory.
> https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/core/SchemaCodecFactory.java
>
> Would a similar approach work in your case?
>
> On Tue, Mar 16, 2021 at 10:21 PM, Michael Sokolov <[email protected]> wrote:
>>
>> > I was more thinking of moving VectorValues#search to
>> > LeafReader#searchNearestVectors or something along those lines. I agree
>> > that VectorReader should only be exposed on CodecReader.
>>
>> Ah, OK, yes that makes sense to me. I guess we were maybe reluctant to
>> add such visible API changes early on in the project.
>>
>> > I wonder if things would be simpler if we were more opinionated and made
>> > vectors specifically about nearest-neighbor search. Then we have a clearer
>> > message, use vectors for NN search and doc values otherwise. As far as I
>> > know, reinterpreting bytes as floats shouldn't add much overhead. The main
>> > problem I know of is that the JVM won't auto-vectorize if you read floats
>> > dynamically from a byte[], but this is something that should be alleviated
>> > by the JDK vector API?
>>
>> > Actually this is the thing that feels odd to me: if we end up with
>> > constants for both LSH and HNSW, then we are adding the requirement that
>> > all vector formats must implement both LSH and HNSW as they will need to
>> > support all SearchStrategy constants?
>>
>> Hmm I see I didn't think this all the way through ... I guess I had it
>> in mind that there would probably only ever be a single format with
>> internal variants for different vector index types, but as I have
>> worked more with Lucene's index formats I see that is awkward, and I'm
>> certainly open to restructuring it in a more natural way. Similarly
>> for the NONE format - BinaryDocValues can be used for such
>> (non-searchable) vectors. Indeed we had such an implementation and
>> although we recently switched it to use the NONE format for
>> uniformity, it could easily be switched back.
>>
>> Regarding the graph construction parameters (maxConn and beamWidth)
>> I'm not sure what the right approach is exactly. We struggled to find
>> the best API for this. I guess my concern about the PerField* approach
>> is (at least as I think I understand it) it needs to be configured in
>> code when creating a Codec. But we would like to be able to read such
>> parameters from a schema configuration. I think of them as in the same
>> spirit as an Analyzer. However I may not have fully appreciated the
>> intention of, or how to make the best use of PerField formats. It is
>> true we don't really need to write these parameters to the index;
>> we're free to use different values when merging for example.
>>
>> -Mike
>>
>> On Tue, Mar 16, 2021 at 2:15 PM Adrien Grand <[email protected]> wrote:
>> >
>> > Hello Mike,
>> >
>> > On Tue, Mar 16, 2021 at 5:05 PM Michael Sokolov <[email protected]> wrote:
>> >>
>> >> I think the reason we have search() on VectorValues is that we have
>> >> LeafReader.getVectorValues() (by analogy to the DocValues iterators),
>> >> but no way to access the VectorReader. Do you think we should also
>> >> have LeafReader.getVectorReader()? Today it's only on CodecReader.
>> >
>> >
>> > I was more thinking of moving VectorValues#search to
>> > LeafReader#searchNearestVectors or something along those lines. I agree
>> > that VectorReader should only be exposed on CodecReader.
>> >
>> >>
>> >> Re: SearchStrategy.NONE; the idea is we support efficient access to
>> >> floating point values. Using BinaryDocValues for this will always
>> >> require an additional decoding step. I can see that the naming is
>> >> confusing there. The intent is that you index the vector values, but
>> >> no additional indexing data structure.
>> >
>> >
>> > I wonder if things would be simpler if we were more opinionated and made
>> > vectors specifically about nearest-neighbor search. Then we have a clearer
>> > message, use vectors for NN search and doc values otherwise. As far as I
>> > know, reinterpreting bytes as floats shouldn't add much overhead. The main
>> > problem I know of is that the JVM won't auto-vectorize if you read floats
>> > dynamically from a byte[], but this is something that should be alleviated
>> > by the JDK vector API?
>> >
>> >> Also: the reason HNSW is
>> >> mentioned in these SearchStrategy enums is to make room for other
>> >> vector indexing approaches, like LSH. There was a lot of discussion
>> >> that we wanted an API that allowed for experimenting with other
>> >> techniques for indexing and searching vector values.
>> >
>> >
>> > Actually this is the thing that feels odd to me: if we end up with
>> > constants for both LSH and HNSW, then we are adding the requirement that
>> > all vector formats must implement both LSH and HNSW as they will need to
>> > support all SearchStrategy constants? Would it be possible to have a
>> > single API and then two implementations of VectorsFormat, LSHVectorsFormat
>> > on the one hand and HNSWVectorsFormat on the other hand?
>> >
>> >> Adrien, you made an analogy to PerFieldPostingsFormat (and DocValues),
>> >> but I think the situation is more akin to Points, where we have the
>> >> options on IndexableField. The metadata we store there (dimension and
>> >> score function) don't really result in different formats, ie code
>> >> paths for indexing and storage; they are more like parameters to the
>> >> format, in my mind. Perhaps the situation will look different when we
>> >> get our second vector indexing strategy (like LSH).
>> >
>> >
>> > Having the dimension count and the score function on the FieldType
>> > actually makes sense to me. I was more wondering whether maxConn and
>> > beamWidth actually belong to the FieldType, or if they should be made
>> > constructor arguments of Lucene90VectorFormat.
>> >
>> > --
>> > Adrien
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
