I think the codec name is important, and the current naming does not seem appropriate anyway. I would like to try to get consensus on that in LUCENE-9855 <https://issues.apache.org/jira/browse/LUCENE-9855>.

On Sat, Mar 20, 2021 at 16:04, Tomoko Uchida <[email protected]> wrote:

> I think it makes sense that we use "ANN" or "NearestNeighbor" for ANN-related APIs; this may give a proper level of abstraction to them.
> On the other hand, it sounds slightly odd to me to use it as a Codec name... Maybe we should use names that represent its data structure, instead of methods/algorithms?
> I'd propose "DenseVector" here if "Vector" is too obscure, but it is also just an idea.
>
> Tomoko
>
> On Thu, Mar 18, 2021 at 5:34, Robert Muir <[email protected]> wrote:
>
>> I'm gonna toss out one last question while we are here: is Vector(s)Format really a good name to use?
>>
>> We already have the "term vectors" API, and the "vector highlighter" that uses it. There's also the traditional "vector-space" scoring model. With Java 16, we get a "vector API" from Java itself, too.
>>
>> I think the name is overloaded too many times already, and this one is the straw that breaks the camel's back for me.
>>
>> So I'm just throwing out there the idea: if this API is about ANN, maybe it should claim its own name (NeighborsFormat?) that is less ambiguous.
>>
>> On Wed, Mar 17, 2021 at 9:51 AM Michael Sokolov <[email protected]> wrote:
>>
>>> I see, right, we can create a Codec that applies the values taken from the schema for a given field, sure, that works.
>>>
>>> On Wed, Mar 17, 2021 at 3:10 AM Adrien Grand <[email protected]> wrote:
>>>
>>> > Configuring the codec based on the schema is something that Solr does via SchemaCodecFactory. https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/core/SchemaCodecFactory.java
>>> >
>>> > Would a similar approach work in your case?
>>> >
>>> > On Tue, Mar 16, 2021 at 22:21, Michael Sokolov <[email protected]> wrote:
>>> >
>>> >> > I was more thinking of moving VectorValues#search to LeafReader#searchNearestVectors or something along those lines. I agree that VectorReader should only be exposed on CodecReader.
>>> >>
>>> >> Ah, OK, yes that makes sense to me. I guess we were maybe reluctant to add such visible API changes early on in the project.
>>> >>
>>> >> > I wonder if things would be simpler if we were more opinionated and made vectors specifically about nearest-neighbor search. Then we have a clearer message: use vectors for NN search and doc values otherwise. As far as I know, reinterpreting bytes as floats shouldn't add much overhead. The main problem I know of is that the JVM won't auto-vectorize if you read floats dynamically from a byte[], but this is something that should be alleviated by the JDK vector API?
>>> >>
>>> >> > Actually this is the thing that feels odd to me: if we end up with constants for both LSH and HNSW, then we are adding the requirement that all vector formats must implement both LSH and HNSW, as they will need to support all SearchStrategy constants?
>>> >>
>>> >> Hmm, I see I didn't think this all the way through... I guess I had it in mind that there would probably only ever be a single format with internal variants for different vector index types, but as I have worked more with Lucene's index formats I see that is awkward, and I'm certainly open to restructuring it in a more natural way. Similarly for the NONE format - BinaryDocValues can be used for such (non-searchable) vectors. Indeed we had such an implementation, and although we recently switched it to use the NONE format for uniformity, it could easily be switched back.
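
For illustration, here is a minimal sketch of what reading a vector back out of a BinaryDocValues field involves, i.e. the reinterpret-bytes-as-floats step discussed above. The field name, the caller-supplied dimension, and the little-endian float layout are assumptions made for the sketch, not something Lucene prescribes.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.util.BytesRef;

class VectorDocValuesReader {
  // Decodes one document's vector from a BinaryDocValues field. The "vector"
  // field name, the dimension argument, and the little-endian encoding are
  // illustrative assumptions.
  static float[] readVector(LeafReader reader, int docId, int dimension) throws IOException {
    BinaryDocValues values = reader.getBinaryDocValues("vector");
    if (values == null || values.advanceExact(docId) == false) {
      return null; // no vector stored for this document
    }
    BytesRef bytes = values.binaryValue();
    float[] vector = new float[dimension];
    ByteBuffer.wrap(bytes.bytes, bytes.offset, bytes.length)
        .order(ByteOrder.LITTLE_ENDIAN)
        .asFloatBuffer()
        .get(vector);
    return vector;
  }
}

The wrap/asFloatBuffer call is the per-document decoding cost being referred to; a dedicated vectors API can hand back the floats directly and skip it.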

>>> >> Regarding the graph construction parameters (maxConn and beamWidth), I'm not sure what the right approach is exactly. We struggled to find the best API for this. I guess my concern about the PerField* approach is that (at least as I think I understand it) it needs to be configured in code when creating a Codec. But we would like to be able to read such parameters from a schema configuration. I think of them as being in the same spirit as an Analyzer. However, I may not have fully appreciated the intention of, or how to make the best use of, PerField formats. It is true that we don't really need to write these parameters to the index; we're free to use different values when merging, for example.
>>> >>
>>> >> -Mike
>>> >>
>>> >> On Tue, Mar 16, 2021 at 2:15 PM Adrien Grand <[email protected]> wrote:
>>> >>
>>> >> > Hello Mike,
>>> >> >
>>> >> > On Tue, Mar 16, 2021 at 5:05 PM Michael Sokolov <[email protected]> wrote:
>>> >> >
>>> >> >> I think the reason we have search() on VectorValues is that we have LeafReader.getVectorValues() (by analogy to the DocValues iterators), but no way to access the VectorReader. Do you think we should also have LeafReader.getVectorReader()? Today it's only on CodecReader.
>>> >> >
>>> >> > I was more thinking of moving VectorValues#search to LeafReader#searchNearestVectors or something along those lines. I agree that VectorReader should only be exposed on CodecReader.
>>> >> >
>>> >> >> Re: SearchStrategy.NONE; the idea is we support efficient access to floating point values. Using BinaryDocValues for this will always require an additional decoding step. I can see that the naming is confusing there. The intent is that you index the vector values, but no additional indexing data structure.
>>> >> >
>>> >> > I wonder if things would be simpler if we were more opinionated and made vectors specifically about nearest-neighbor search. Then we have a clearer message: use vectors for NN search and doc values otherwise. As far as I know, reinterpreting bytes as floats shouldn't add much overhead. The main problem I know of is that the JVM won't auto-vectorize if you read floats dynamically from a byte[], but this is something that should be alleviated by the JDK vector API?
>>> >> >
>>> >> >> Also: the reason HNSW is mentioned in these SearchStrategy enums is to make room for other vector indexing approaches, like LSH. There was a lot of discussion that we wanted an API that allowed for experimenting with other techniques for indexing and searching vector values.
>>> >> >
>>> >> > Actually this is the thing that feels odd to me: if we end up with constants for both LSH and HNSW, then we are adding the requirement that all vector formats must implement both LSH and HNSW, as they will need to support all SearchStrategy constants? Would it be possible to have a single API and then two implementations of VectorsFormat, LSHVectorsFormat on the one hand and HNSWVectorsFormat on the other hand?
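
To make the "single API, two implementations" idea concrete, here is a toy sketch. None of these classes, constructors, or parameters exist in Lucene; the LSH parameter in particular is purely illustrative.

// One vectors API; the indexing technique is chosen by picking an implementation,
// not by a SearchStrategy constant. Toy sketch only.
abstract class VectorsFormat {
  abstract int[] search(float[] query, int topK); // returns doc ids of the nearest vectors
}

final class HnswVectorsFormat extends VectorsFormat {
  final int maxConn;   // maximum connections per graph node
  final int beamWidth; // candidate queue size used while building the graph

  HnswVectorsFormat(int maxConn, int beamWidth) {
    this.maxConn = maxConn;
    this.beamWidth = beamWidth;
  }

  @Override
  int[] search(float[] query, int topK) {
    return new int[0]; // graph traversal would go here
  }
}

final class LshVectorsFormat extends VectorsFormat {
  final int numHashTables; // illustrative parameter only

  LshVectorsFormat(int numHashTables) {
    this.numHashTables = numHashTables;
  }

  @Override
  int[] search(float[] query, int topK) {
    return new int[0]; // hash-bucket lookup would go here
  }
}

In this shape, neither implementation needs to know about the other's constants, which sidesteps the concern that every vector format would have to support every SearchStrategy value.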

>>> >> >
>>> >> >> Adrien, you made an analogy to PerFieldPostingsFormat (and DocValues), but I think the situation is more akin to Points, where we have the options on IndexableField. The metadata we store there (dimension and score function) doesn't really result in different formats, i.e. code paths for indexing and storage; it is more like parameters to the format, in my mind. Perhaps the situation will look different when we get our second vector indexing strategy (like LSH).
>>> >> >
>>> >> > Having the dimension count and the score function on the FieldType actually makes sense to me. I was more wondering whether maxConn and beamWidth actually belong on the FieldType, or whether they should be made constructor arguments of Lucene90VectorFormat.
>>> >> >
>>> >> > --
>>> >> > Adrien
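
Combining the two suggestions above, constructor arguments on the format plus per-field dispatch in the spirit of PerFieldPostingsFormat and Solr's SchemaCodecFactory, the configuration side might look roughly like the sketch below. It reuses the toy classes from the previous sketch, and the field names and parameter values are made up; this is not a real Lucene API.

import java.util.Map;

// Per-field selection of a vectors format, with maxConn/beamWidth living on the
// format instance rather than on the FieldType. Sketch only.
final class SchemaDrivenVectorsFormat {
  private final Map<String, VectorsFormat> overrides;
  private final VectorsFormat defaultFormat;

  SchemaDrivenVectorsFormat(Map<String, VectorsFormat> overrides, VectorsFormat defaultFormat) {
    this.overrides = overrides;
    this.defaultFormat = defaultFormat;
  }

  // Analogous to PerFieldPostingsFormat#getPostingsFormatForField: pick the
  // concrete format, and therefore its construction parameters, per field.
  VectorsFormat getVectorsFormatForField(String field) {
    return overrides.getOrDefault(field, defaultFormat);
  }
}

// Example wiring, e.g. built from parsed schema settings:
//   Map<String, VectorsFormat> overrides = Map.of(
//       "title_embedding", new HnswVectorsFormat(32, 200),
//       "image_signature", new LshVectorsFormat(8));
//   SchemaDrivenVectorsFormat perField =
//       new SchemaDrivenVectorsFormat(overrides, new HnswVectorsFormat(16, 100));

The overrides map could be populated from a schema file at startup, which is essentially the approach Adrien points to with SchemaCodecFactory, while the index itself never needs to record maxConn or beamWidth.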
