There's also some good discussion on https://issues.apache.org/jira/browse/LUCENE-9583 about the random-access vs. iterator pattern that never got fully resolved. We said we would revisit it after KNN (LUCENE-9004) landed, and now it has. Random access is pretty well established there, so maybe we should abandon the iterator API, since it is redundant (you can always iterate over a random-access API if you know the size)?
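
To make the redundancy argument concrete, here is a rough sketch of iteration layered on top of random access. The names RandomAccessVectors, OrdinalIterator, and ArrayVectors are made up for illustration and are not the actual Lucene interfaces; the only assumption is that the random-access view exposes a size and a lookup by ordinal.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical random-access view over vectors; NOT the real Lucene API. */
interface RandomAccessVectors {
  /** Number of vectors, addressed by ordinals 0..size()-1. */
  int size();

  /** Returns the vector stored at the given ordinal. */
  float[] vectorValue(int ord);
}

/** Hypothetical adapter: forward iteration built only from size() and vectorValue(). */
final class OrdinalIterator implements Iterator<float[]> {
  private final RandomAccessVectors vectors;
  private int ord = 0; // next ordinal to return

  OrdinalIterator(RandomAccessVectors vectors) {
    this.vectors = vectors;
  }

  @Override
  public boolean hasNext() {
    return ord < vectors.size();
  }

  @Override
  public float[] next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    return vectors.vectorValue(ord++);
  }
}

/** Toy in-memory implementation, used only to exercise the adapter. */
final class ArrayVectors implements RandomAccessVectors {
  private final float[][] data;

  ArrayVectors(float[][] data) {
    this.data = data;
  }

  @Override
  public int size() {
    return data.length;
  }

  @Override
  public float[] vectorValue(int ord) {
    return data[ord];
  }
}

final class IterateOverRandomAccess {
  public static void main(String[] args) {
    RandomAccessVectors vectors =
        new ArrayVectors(new float[][] {{1f, 2f}, {3f, 4f}, {5f, 6f}});
    // Visit every vector in ordinal order using only the random-access calls.
    Iterator<float[]> it = new OrdinalIterator(vectors);
    while (it.hasNext()) {
      System.out.println(java.util.Arrays.toString(it.next()));
    }
  }
}
```

Note that this only walks ordinals. It says nothing about how ordinals map to doc IDs, which a real replacement for the iterator would presumably still need to cover.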

On Tue, Mar 16, 2021 at 12:10 PM Michael Sokolov <msoko...@gmail.com> wrote:
>
> Also, Tomoko re:LUCENE-9322, did it succeed? I guess we won't know for
> sure unless someone revives
> https://issues.apache.org/jira/browse/LUCENE-9136 or something like
> that
>
> On Tue, Mar 16, 2021 at 12:04 PM Michael Sokolov <msoko...@gmail.com> wrote:
> >
> > Consistent plural naming makes sense to me. I think it ended up
> > singular because I am biased to avoid plural names unless there is a
> > useful distinction to be made. But consistency should trump my
> > predilections.
> >
> > I think the reason we have search() on VectorValues is that we have
> > LeafReader.getVectorValues() (by analogy to the DocValues iterators),
> > but no way to access the VectorReader. Do you think we should also
> > have LeafReader.getVectorReader()? Today it's only on CodecReader.
> >
> > Re: SearchStrategy.NONE; the idea is we support efficient access to
> > floating point values. Using BinaryDocValues for this will always
> > require an additional decoding step. I can see that the naming is
> > confusing there. The intent is that you index the vector values, but
> > no additional indexing data structure. Also: the reason HNSW is
> > mentioned in these SearchStrategy enums is to make room for other
> > vector indexing approaches, like LSH. There was a lot of discussion
> > that we wanted an API that allowed for experimenting with other
> > techniques for indexing and searching vector values.
> >
> > Adrien, you made an analogy to PerFieldPostingsFormat (and DocValues),
> > but I think the situation is more akin to Points, where we have the
> > options on IndexableField. The metadata we store there (dimension and
> > score function) don't really result in different formats, ie code
> > paths for indexing and storage; they are more like parameters to the
> > format, in my mind. Perhaps the situation will look different when we
> > get our second vector indexing strategy (like LSH).
> >
> >
> > On Tue, Mar 16, 2021 at 10:19 AM Tomoko Uchida
> > <tomoko.uchida.1...@gmail.com> wrote:
> > >
> > > > Should we rename VectorFormat to VectorsFormat? This would be more
> > > > consistent with other file formats that use the plural, like
> > > > PostingsFormat, DocValuesFormat, TermVectorsFormat, etc.?
> > >
> > > +1 for using plural form for consistency - if we reconsider the names,
> > > how about VectorValuesFormat so that it follows the naming convention for
> > > XXXValues?
> > >
> > > DocValuesFormat / DocValues
> > > PointValuesFormat / PointValues
> > > VectorValuesFormat / VectorValues (currently, VectorFormat / VectorValues)
> > >
> > > > Should SearchStrategy constants avoid explicit references to HNSW?
> > >
> > > Also +1 for decoupling HNSW specific implementations from general
> > > vectors, though I am not fully sure if we can strictly separate the
> > > similarity metrics and search algorithms for vectors.
> > > LUCENE-9322 (unified vectors API) was resolved months ago, does it
> > > achieve its goal? I haven't followed the issue in months because of my
> > > laziness...
> > >
> > > Thanks,
> > > Tomoko
> > >
> > >
> > > On Tue, Mar 16, 2021 at 19:32, Adrien Grand <jpou...@gmail.com> wrote:
> > >>
> > >> Hello,
> > >>
> > >> I've tried to catch up on the vector API and I have the following
> > >> questions. I've tried to read through discussions on JIRA first in case
> > >> it had been covered, but it's possible I missed some relevant ones.
> > >>
> > >> Should VectorValues#search be on VectorReader instead? It felt a bit odd
> > >> to me to have the search logic on the iterator.
> > >>
> > >> Do we need SearchStrategy.NONE? Documentation suggests that it allows
> > >> storing vectors but that NN search won't be supported. This looks like a
> > >> use-case for binary doc values to me? It also slightly caught me by
> > >> surprise due to the inconsistency with IndexOptions.NONE, which means
> > >> "do not index this field" (and likewise for DocValuesType.NONE), so I
> > >> first assumed that SearchStrategy.NONE also meant "do not index this
> > >> field as a vector".
> > >>
> > >> While postings and doc-value formats allow per-field configuration via
> > >> PerFieldPostingsFormat/PerFieldDocValuesFormat, vectors use a different
> > >> mechanism where VectorField#createHnswType sets attributes on the field
> > >> type that the vectors writer then reads. Should we have a
> > >> PerFieldVectorsFormat instead and configure these options via the
> > >> vectors format?
> > >>
> > >> Should SearchStrategy constants avoid explicit references to HNSW? The
> > >> rest of the API seems to try to be agnostic of the way that NN search is
> > >> implemented. Could we make SearchStrategy only about the similarity
> > >> metric that is used for vectors? This particular point seems discussed
> > >> on LUCENE-9322 but I couldn't find the conclusion.
> > >>
> > >> Should we rename VectorFormat to VectorsFormat? This would be more
> > >> consistent with other file formats that use the plural, like
> > >> PostingsFormat, DocValuesFormat, TermVectorsFormat, etc.?
> > >>
> > >> --
> > >> Adrien