Also, Tomoko re:LUCENE-9322, did it succeed? I guess we won't know for sure unless someone revives https://issues.apache.org/jira/browse/LUCENE-9136 or something like that
On Tue, Mar 16, 2021 at 12:04 PM Michael Sokolov <msoko...@gmail.com> wrote: > > Consistent plural naming makes sense to me. I think it ended up > singular because I am biased to avoid plural names unless there is a > useful distinction to be made. But consistency should trump my > predilections. > > I think the reason we have search() on VectorValues is that we have > LeafReader.getVectorValues() (by analogy to the DocValues iterators), > but no way to access the VectorReader. Do you think we should also > have LeafReader.getVectorReader()? Today it's only on CodecReader. > > Re: SearchStrategy.NONE; the idea is we support efficient access to > floating point values. Using BinaryDocValues for this will always > require an additional decoding step. I can see that the naming is > confusing there. The intent is that you index the vector values, but > no additional indexing data structure. Also: the reason HNSW is > mentioned in these SearchStrategy enums is to make room for other > vector indexing approaches, like LSH. There was a lot of discussion > that we wanted an API that allowed for experimenting with other > techniques for indexing and searching vector values. > > Adrien, you made an analogy to PerFieldPostingsFormat (and DocValues), > but I think the situation is more akin to Points, where we have the > options on IndexableField. The metadata we store there (dimension and > score function) don't really result in different formats, ie code > paths for indexing and storage; they are more like parameters to the > format, in my mind. Perhaps the situation will look different when we > get our second vector indexing strategy (like LSH). > > > On Tue, Mar 16, 2021 at 10:19 AM Tomoko Uchida > <tomoko.uchida.1...@gmail.com> wrote: > > > > > Should we rename VectorFormat to VectorsFormat? This would be more > > > consistent with other file formats that use the plural, like > > > PostingsFormat, DocValuesFormat, TermVectorsFormat, etc.? > > > > +1 for using plural form for consistency - if we reconsider the names, how > > about VectorValuesFormat so that it follows the naming convention for > > XXXValues? > > > > DocValuesFormat / DocValues > > PointValuesFormat / PointValues > > VectorValuesFormat / VectorValues (currently, VectorFormat / VectorValues) > > > > > Should SearchStrategy constants avoid explicit references to HNSW? > > > > Also +1 for decoupling HNSW specific implementations from general vectors, > > though I am not fully sure if we can strictly separate the similarity > > metrics and search algorithms for vectors. > > LUCENE-9322 (unified vectors API) was resolved months ago, does it achieve > > its goal? I haven't followed the issue in months because of my laziness... > > > > Thanks, > > Tomoko > > > > > > 2021年3月16日(火) 19:32 Adrien Grand <jpou...@gmail.com>: > >> > >> Hello, > >> > >> I've tried to catch up on the vector API and I have the following > >> questions. I've tried to read through discussions on JIRA first in case it > >> had been covered, but it's possible I missed some relevant ones. > >> > >> Should VectorValues#search be on VectorReader instead? It felt a bit odd > >> to me to have the search logic on the iterator. > >> > >> Do we need SearchStrategy.NONE? Documentation suggests that it allows > >> storing vectors but that NN search won't be supported. This looks like a > >> use-case for binary doc values to me? It also slightly caught me by > >> surprise due to the inconsistency with IndexOptions.NONE, which means "do > >> not index this field" (and likewise for DocValuesType.NONE), so I first > >> assumed that SearchStrategy.NONE also meant "do not index this field as a > >> vector". > >> > >> While postings and doc-value formats allow per-field configuration via > >> PerFieldPostingsFormat/PerFieldDocValuesFormat, vectors use a different > >> mechanism where VectorField#createHnswType sets attributes on the field > >> type that the vectors writer then reads. Should we have a > >> PerFieldVectorsFormat instead and configure these options via the vectors > >> format? > >> > >> Should SearchStrategy constants avoid explicit references to HNSW? The > >> rest of the API seems to try to be agnostic of the way that NN search is > >> implemented. Could we make SearchStrategy only about the similarity metric > >> that is used for vectors? This particular point seems discussed on > >> LUCENE-9322 but I couldn't find the conclusion. > >> > >> Should we rename VectorFormat to VectorsFormat? This would be more > >> consistent with other file formats that use the plural, like > >> PostingsFormat, DocValuesFormat, TermVectorsFormat, etc.? > >> > >> -- > >> Adrien --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org