I created a JIRA about moving VectorValues#search to VectorReader: https://issues.apache.org/jira/browse/LUCENE-9908.

On Tue, Mar 16, 2021 at 7:14 PM Adrien Grand <jpou...@gmail.com> wrote:

> Hello Mike,
>
> On Tue, Mar 16, 2021 at 5:05 PM Michael Sokolov <msoko...@gmail.com>
> wrote:
>
>> I think the reason we have search() on VectorValues is that we have
>> LeafReader.getVectorValues() (by analogy to the DocValues iterators),
>> but no way to access the VectorReader. Do you think we should also
>> have LeafReader.getVectorReader()? Today it's only on CodecReader.
>>
>
> I was more thinking of moving VectorValues#search to
> LeafReader#searchNearestVectors or something along those lines. I agree
> that VectorReader should only be exposed on CodecReader.
>
>
>> Re: SearchStrategy.NONE; the idea is we support efficient access to
>> floating point values. Using BinaryDocValues for this will always
>> require an additional decoding step. I can see that the naming is
>> confusing there. The intent is that you index the vector values, but
>> no additional indexing data structure.
>
>
> I wonder if things would be simpler if we were more opinionated and made
> vectors specifically about nearest-neighbor search. Then we have a
> clearer message, use vectors for NN search and doc values otherwise. As far
> as I know, reinterpreting bytes as floats shouldn't add much overhead. The
> main problem I know of is that the JVM won't auto-vectorize if you read
> floats dynamically from a byte[], but this is something that should be
> alleviated by the JDK vector API?
>
>> Also: the reason HNSW is
>> mentioned in these SearchStrategy enums is to make room for other
>> vector indexing approaches, like LSH. There was a lot of discussion
>> that we wanted an API that allowed for experimenting with other
>> techniques for indexing and searching vector values.
>>
>
> Actually this is the thing that feels odd to me: if we end up with
> constants for both LSH and HNSW, then we are adding the requirement that
> all vector formats must implement both LSH and HNSW as they will need to
> support all SearchStrategy constants? Would it be possible to have a single
> API and then two implementations of VectorsFormat, LSHVectorsFormat on the
> one hand and HNSWVectorsFormat on the other hand?
>
>> Adrien, you made an analogy to PerFieldPostingsFormat (and DocValues),
>> but I think the situation is more akin to Points, where we have the
>> options on IndexableField. The metadata we store there (dimension and
>> score function) don't really result in different formats, ie code
>> paths for indexing and storage; they are more like parameters to the
>> format, in my mind. Perhaps the situation will look different when we
>> get our second vector indexing strategy (like LSH).
>
>
> Having the dimension count and the score function on the FieldType
> actually makes sense to me. I was more wondering whether maxConn
> and beamWidth actually belong to the FieldType, or if they should be made
> constructor arguments of Lucene90VectorFormat.
>
> --
> Adrien
>

--
Adrien
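
To make the "reinterpreting bytes as floats" point in the thread above concrete, here is a minimal sketch of the extra decoding step you pay when vectors are stored as raw bytes (for example in a binary doc-values field). The class and method names are hypothetical and are not Lucene API. The little-endian byte order is also just an assumption for the example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of Lucene: shows the decoding step only.
class FloatDecoding {

    // Pack a float vector into bytes, as one might before storing it in a binary field.
    public static byte[] encode(float[] vector) {
        ByteBuffer buffer = ByteBuffer.allocate(vector.length * Float.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN); // assumed byte order
        buffer.asFloatBuffer().put(vector);      // the view shares the backing array
        return buffer.array();
    }

    // Reinterpret a byte range as floats. This is the additional step discussed above.
    public static float[] decode(byte[] bytes, int offset, int length) {
        float[] vector = new float[length / Float.BYTES];
        ByteBuffer.wrap(bytes, offset, length)
                .order(ByteOrder.LITTLE_ENDIAN)
                .asFloatBuffer()
                .get(vector);
        return vector;
    }

    // Plain scalar dot product over already-decoded vectors.
    public static float dotProduct(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] query = {1.0f, 2.5f, -3.0f};
        byte[] stored = encode(query);
        float[] decoded = decode(stored, 0, stored.length);
        System.out.println(dotProduct(query, decoded));
    }
}
```

The sketch decodes into a float[] once and then scores over that array, so the scoring loop is an ordinary loop over floats rather than one that reads floats out of a byte[] on each iteration.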