ann-benchmarks.com maintains open benchmarks of a bunch of ANN (approximate NN) algorithms. When we started this effort, HNSW was at the top of the heap in most of the benchmarks.
On Tue, Mar 16, 2021 at 12:28 PM Robert Muir <rcm...@gmail.com> wrote: > > Where are the alternative algorithms that work on sequential iterators and > don't need random access? > > Seems like these should be the ones we initially add to lucene, and HNSW > should be put aside for now? (is it a toy, or can we do it without jazillions > of random accesses?) > > On Tue, Mar 16, 2021 at 12:15 PM Michael Sokolov <msoko...@gmail.com> wrote: >> >> There's also some good discussion on >> https://issues.apache.org/jira/browse/LUCENE-9583 about random access >> vs iterator pattern that never got fully resolved. We said we would >> revisit after KNN (LUCENE-9004) landed, and now it has. The usage of >> random access is pretty well-established there, maybe we should >> abandon the iterator API since it is redundant (you can always iterate >> over a random access API if you know the size)? >> >> On Tue, Mar 16, 2021 at 12:10 PM Michael Sokolov <msoko...@gmail.com> wrote: >> > >> > Also, Tomoko re:LUCENE-9322, did it succeed? I guess we won't know for >> > sure unless someone revives >> > https://issues.apache.org/jira/browse/LUCENE-9136 or something like >> > that >> > >> > On Tue, Mar 16, 2021 at 12:04 PM Michael Sokolov <msoko...@gmail.com> >> > wrote: >> > > >> > > Consistent plural naming makes sense to me. I think it ended up >> > > singular because I am biased to avoid plural names unless there is a >> > > useful distinction to be made. But consistency should trump my >> > > predilections. >> > > >> > > I think the reason we have search() on VectorValues is that we have >> > > LeafReader.getVectorValues() (by analogy to the DocValues iterators), >> > > but no way to access the VectorReader. Do you think we should also >> > > have LeafReader.getVectorReader()? Today it's only on CodecReader. >> > > >> > > Re: SearchStrategy.NONE; the idea is we support efficient access to >> > > floating point values. Using BinaryDocValues for this will always >> > > require an additional decoding step. I can see that the naming is >> > > confusing there. The intent is that you index the vector values, but >> > > no additional indexing data structure. Also: the reason HNSW is >> > > mentioned in these SearchStrategy enums is to make room for other >> > > vector indexing approaches, like LSH. There was a lot of discussion >> > > that we wanted an API that allowed for experimenting with other >> > > techniques for indexing and searching vector values. >> > > >> > > Adrien, you made an analogy to PerFieldPostingsFormat (and DocValues), >> > > but I think the situation is more akin to Points, where we have the >> > > options on IndexableField. The metadata we store there (dimension and >> > > score function) don't really result in different formats, ie code >> > > paths for indexing and storage; they are more like parameters to the >> > > format, in my mind. Perhaps the situation will look different when we >> > > get our second vector indexing strategy (like LSH). >> > > >> > > >> > > On Tue, Mar 16, 2021 at 10:19 AM Tomoko Uchida >> > > <tomoko.uchida.1...@gmail.com> wrote: >> > > > >> > > > > Should we rename VectorFormat to VectorsFormat? This would be more >> > > > > consistent with other file formats that use the plural, like >> > > > > PostingsFormat, DocValuesFormat, TermVectorsFormat, etc.? >> > > > >> > > > +1 for using plural form for consistency - if we reconsider the names, >> > > > how about VectorValuesFormat so that it follows the naming convention >> > > > for XXXValues? >> > > > >> > > > DocValuesFormat / DocValues >> > > > PointValuesFormat / PointValues >> > > > VectorValuesFormat / VectorValues (currently, VectorFormat / >> > > > VectorValues) >> > > > >> > > > > Should SearchStrategy constants avoid explicit references to HNSW? >> > > > >> > > > Also +1 for decoupling HNSW specific implementations from general >> > > > vectors, though I am not fully sure if we can strictly separate the >> > > > similarity metrics and search algorithms for vectors. >> > > > LUCENE-9322 (unified vectors API) was resolved months ago, does it >> > > > achieve its goal? I haven't followed the issue in months because of my >> > > > laziness... >> > > > >> > > > Thanks, >> > > > Tomoko >> > > > >> > > > >> > > > 2021年3月16日(火) 19:32 Adrien Grand <jpou...@gmail.com>: >> > > >> >> > > >> Hello, >> > > >> >> > > >> I've tried to catch up on the vector API and I have the following >> > > >> questions. I've tried to read through discussions on JIRA first in >> > > >> case it had been covered, but it's possible I missed some relevant >> > > >> ones. >> > > >> >> > > >> Should VectorValues#search be on VectorReader instead? It felt a bit >> > > >> odd to me to have the search logic on the iterator. >> > > >> >> > > >> Do we need SearchStrategy.NONE? Documentation suggests that it allows >> > > >> storing vectors but that NN search won't be supported. This looks >> > > >> like a use-case for binary doc values to me? It also slightly caught >> > > >> me by surprise due to the inconsistency with IndexOptions.NONE, which >> > > >> means "do not index this field" (and likewise for >> > > >> DocValuesType.NONE), so I first assumed that SearchStrategy.NONE also >> > > >> meant "do not index this field as a vector". >> > > >> >> > > >> While postings and doc-value formats allow per-field configuration >> > > >> via PerFieldPostingsFormat/PerFieldDocValuesFormat, vectors use a >> > > >> different mechanism where VectorField#createHnswType sets attributes >> > > >> on the field type that the vectors writer then reads. Should we have >> > > >> a PerFieldVectorsFormat instead and configure these options via the >> > > >> vectors format? >> > > >> >> > > >> Should SearchStrategy constants avoid explicit references to HNSW? >> > > >> The rest of the API seems to try to be agnostic of the way that NN >> > > >> search is implemented. Could we make SearchStrategy only about the >> > > >> similarity metric that is used for vectors? This particular point >> > > >> seems discussed on LUCENE-9322 but I couldn't find the conclusion. >> > > >> >> > > >> Should we rename VectorFormat to VectorsFormat? This would be more >> > > >> consistent with other file formats that use the plural, like >> > > >> PostingsFormat, DocValuesFormat, TermVectorsFormat, etc.? >> > > >> >> > > >> -- >> > > >> Adrien >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >> For additional commands, e-mail: dev-h...@lucene.apache.org >> --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org