Hello Mike,

On Tue, Mar 16, 2021 at 5:05 PM Michael Sokolov <msoko...@gmail.com> wrote:
> I think the reason we have search() on VectorValues is that we have
> LeafReader.getVectorValues() (by analogy to the DocValues iterators),
> but no way to access the VectorReader. Do you think we should also
> have LeafReader.getVectorReader()? Today it's only on CodecReader.

I was more thinking of moving VectorValues#search to
LeafReader#searchNearestVectors or something along those lines. I agree
that VectorReader should only be exposed on CodecReader.

> Re: SearchStrategy.NONE; the idea is we support efficient access to
> floating point values. Using BinaryDocValues for this will always
> require an additional decoding step. I can see that the naming is
> confusing there. The intent is that you index the vector values, but
> no additional indexing data structure.

I wonder if things would be simpler if we were more opinionated and made
vectors specifically about nearest-neighbor search. Then we have a
clearer message, use vectors for NN search and doc values otherwise. As
far as I know, reinterpreting bytes as floats shouldn't add much
overhead. The main problem I know of is that the JVM won't
auto-vectorize if you read floats dynamically from a byte[], but this is
something that should be alleviated by the JDK vector API?

> Also: the reason HNSW is
> mentioned in these SearchStrategy enums is to make room for other
> vector indexing approaches, like LSH. There was a lot of discussion
> that we wanted an API that allowed for experimenting with other
> techniques for indexing and searching vector values.

Actually this is the thing that feels odd to me: if we end up with
constants for both LSH and HNSW, then we are adding the requirement that
all vector formats must implement both LSH and HNSW as they will need to
support all SearchStrategy constants? Would it be possible to have a
single API and then two implementations of VectorsFormat,
LSHVectorsFormat on the one hand and HNSWVectorsFormat on the other
hand?
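
As a rough illustration of the "single API, two implementations" idea,
here is a minimal, self-contained Java sketch. Every name in it
(Searcher, Format, ExactScanFormat, SignLshFormat) is made up for the
example and is not Lucene's API. The exact scan is only a stand-in for a
graph-based index such as HNSW, and the LSH variant is a toy
random-hyperplane scheme. The point is just that the caller picks a
format and no implementation has to understand another one's strategy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// All names here are hypothetical; none of this is Lucene's actual API.
class VectorsFormatSketch {

    /** The single search API that every format would implement. */
    interface Searcher {
        /** Returns up to k vector ordinals, closest (Euclidean) first. */
        int[] search(float[] target, int k);
    }

    /** One implementation per indexing approach, instead of one enum constant each. */
    interface Format {
        Searcher index(float[][] vectors);
    }

    static float squaredDistance(float[] a, float[] b) {
        float sum = 0;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Ranks the candidate ordinals by exact distance and keeps the best k. */
    static int[] topK(float[][] vectors, List<Integer> candidates, float[] target, int k) {
        return candidates.stream()
            .sorted(Comparator.comparingDouble(ord -> squaredDistance(vectors[ord], target)))
            .limit(k)
            .mapToInt(Integer::intValue)
            .toArray();
    }

    /** Stand-in for a graph-based format such as HNSW: here it is just an exact scan. */
    static final class ExactScanFormat implements Format {
        public Searcher index(float[][] vectors) {
            List<Integer> all = new ArrayList<>();
            for (int i = 0; i < vectors.length; i++) all.add(i);
            return (target, k) -> topK(vectors, all, target, k);
        }
    }

    /** Toy random-hyperplane LSH: only vectors sharing the query's sign signature are ranked. */
    static final class SignLshFormat implements Format {
        private final int numBits;
        private final long seed;

        SignLshFormat(int numBits, long seed) {
            this.numBits = numBits;
            this.seed = seed;
        }

        public Searcher index(float[][] vectors) {
            int dim = vectors[0].length;
            Random random = new Random(seed);
            float[][] planes = new float[numBits][dim];
            for (float[] plane : planes) {
                for (int i = 0; i < dim; i++) plane[i] = (float) random.nextGaussian();
            }
            Map<Integer, List<Integer>> buckets = new HashMap<>();
            for (int ord = 0; ord < vectors.length; ord++) {
                buckets.computeIfAbsent(signature(planes, vectors[ord]), s -> new ArrayList<>()).add(ord);
            }
            return (target, k) -> {
                List<Integer> bucket =
                    buckets.getOrDefault(signature(planes, target), Collections.emptyList());
                return topK(vectors, bucket, target, k);
            };
        }

        /** One bit per hyperplane: 1 if the vector lies on its non-negative side. */
        static int signature(float[][] planes, float[] v) {
            int bits = 0;
            for (float[] plane : planes) {
                float dot = 0;
                for (int i = 0; i < v.length; i++) dot += plane[i] * v[i];
                bits = (bits << 1) | (dot >= 0 ? 1 : 0);
            }
            return bits;
        }
    }

    public static void main(String[] args) {
        float[][] vectors = {{0f, 0f}, {1f, 1f}, {5f, 5f}};
        float[] target = {0.9f, 0.9f};
        // The caller picks an implementation; there is no strategy enum to switch on.
        Format[] formats = {new ExactScanFormat(), new SignLshFormat(4, 42L)};
        for (Format format : formats) {
            int[] hits = format.index(vectors).search(target, 2);
            System.out.println(format.getClass().getSimpleName() + " -> " + Arrays.toString(hits));
        }
    }
}
```

With this shape, adding a third approach would mean adding a third
Format implementation rather than a new constant that every existing
format has to handle.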

> Adrien, you made an analogy to PerFieldPostingsFormat (and DocValues),
> but I think the situation is more akin to Points, where we have the
> options on IndexableField. The metadata we store there (dimension and
> score function) don't really result in different formats, ie code
> paths for indexing and storage; they are more like parameters to the
> format, in my mind. Perhaps the situation will look different when we
> get our second vector indexing strategy (like LSH).

Having the dimension count and the score function on the FieldType
actually makes sense to me. I was more wondering whether maxConn and
beamWidth actually belong to the FieldType, or if they should be made
constructor arguments of Lucene90VectorFormat.

--
Adrien
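
To make the maxConn/beamWidth question concrete, here is a minimal Java
sketch of the split being discussed: dimension and score function stay
on the field type, while the graph tuning parameters become constructor
arguments of the format. All class names and the values 16, 100, 32 and
200 are assumptions chosen for illustration; this is not how
Lucene90VectorFormat is actually written.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical names throughout; this is not Lucene's actual API.
class FormatParamsSketch {

    enum ScoreFunction { EUCLIDEAN, DOT_PRODUCT }

    /** Schema-level metadata: what the vectors are, independent of how they are indexed. */
    static final class VectorFieldType {
        final int dimension;
        final ScoreFunction scoreFunction;

        VectorFieldType(int dimension, ScoreFunction scoreFunction) {
            this.dimension = dimension;
            this.scoreFunction = scoreFunction;
        }
    }

    /** Index-structure tuning lives on the format, passed as constructor arguments. */
    static final class HnswFormat {
        final int maxConn;
        final int beamWidth;

        HnswFormat(int maxConn, int beamWidth) {
            if (maxConn <= 0 || beamWidth <= 0) {
                throw new IllegalArgumentException("maxConn and beamWidth must be positive");
            }
            this.maxConn = maxConn;
            this.beamWidth = beamWidth;
        }
    }

    /** Per-field choice of format, by analogy with PerFieldPostingsFormat. */
    static final class PerFieldFormats {
        private final HnswFormat defaultFormat;
        private final Map<String, HnswFormat> overrides = new HashMap<>();

        PerFieldFormats(HnswFormat defaultFormat) {
            this.defaultFormat = defaultFormat;
        }

        void override(String field, HnswFormat format) {
            overrides.put(field, format);
        }

        HnswFormat formatFor(String field) {
            return overrides.getOrDefault(field, defaultFormat);
        }
    }

    public static void main(String[] args) {
        // The field type says nothing about graph parameters.
        VectorFieldType type = new VectorFieldType(128, ScoreFunction.EUCLIDEAN);
        PerFieldFormats formats = new PerFieldFormats(new HnswFormat(16, 100));
        formats.override("title_vector", new HnswFormat(32, 200));

        HnswFormat chosen = formats.formatFor("title_vector");
        System.out.println("dimension=" + type.dimension
            + " maxConn=" + chosen.maxConn
            + " beamWidth=" + chosen.beamWidth);
    }
}
```

The trade-off is that tuning a single field then requires a per-field
format wrapper instead of a FieldType setter.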