Maybe that is so, but we should factor in everything: such as large scale indexing, not requiring whole data set to be in RAM, etc. Hey, it's Lucene!
Because HNSW has dominated the nightly benchmarks, I have been digging through stacktraces and trying to figure out ways to make it work efficiently, and I'm not sure what to do. Especially merge is painful: it seems to cause a storm of page faults/random accesses due to how it works, and I don't know yet how to make it better. It seems to rebuild the entire graph, spraying random accesses across a "slow-wrapper" that binary searches each sub on every access. I don't see any way to even amortize the pain with some kind of bulk merge trick. So if we find algorithms that scale better, I think we should lend a preference towards them. For example, algorithms that allow per-segment/sequential index and merge. On Tue, Mar 16, 2021 at 1:06 PM Michael Sokolov <msoko...@gmail.com> wrote: > ann-benchmarks.com maintains open benchmarks of a bunch of ANN > (approximate NN) algorithms. When we started this effort, HNSW was at > the top of the heap in most of the benchmarks. > > On Tue, Mar 16, 2021 at 12:28 PM Robert Muir <rcm...@gmail.com> wrote: > > > > Where are the alternative algorithms that work on sequential iterators > and don't need random access? > > > > Seems like these should be the ones we initially add to lucene, and HNSW > should be put aside for now? (is it a toy, or can we do it without > jazillions of random accesses?) > > > > On Tue, Mar 16, 2021 at 12:15 PM Michael Sokolov <msoko...@gmail.com> > wrote: > >> > >> There's also some good discussion on > >> https://issues.apache.org/jira/browse/LUCENE-9583 about random access > >> vs iterator pattern that never got fully resolved. We said we would > >> revisit after KNN (LUCENE-9004) landed, and now it has. The usage of > >> random access is pretty well-established there, maybe we should > >> abandon the iterator API since it is redundant (you can always iterate > >> over a random access API if you know the size)? > >> > >> On Tue, Mar 16, 2021 at 12:10 PM Michael Sokolov <msoko...@gmail.com> > wrote: > >> > > >> > Also, Tomoko re:LUCENE-9322, did it succeed? I guess we won't know for > >> > sure unless someone revives > >> > https://issues.apache.org/jira/browse/LUCENE-9136 or something like > >> > that > >> > > >> > On Tue, Mar 16, 2021 at 12:04 PM Michael Sokolov <msoko...@gmail.com> > wrote: > >> > > > >> > > Consistent plural naming makes sense to me. I think it ended up > >> > > singular because I am biased to avoid plural names unless there is a > >> > > useful distinction to be made. But consistency should trump my > >> > > predilections. > >> > > > >> > > I think the reason we have search() on VectorValues is that we have > >> > > LeafReader.getVectorValues() (by analogy to the DocValues > iterators), > >> > > but no way to access the VectorReader. Do you think we should also > >> > > have LeafReader.getVectorReader()? Today it's only on CodecReader. > >> > > > >> > > Re: SearchStrategy.NONE; the idea is we support efficient access to > >> > > floating point values. Using BinaryDocValues for this will always > >> > > require an additional decoding step. I can see that the naming is > >> > > confusing there. The intent is that you index the vector values, but > >> > > no additional indexing data structure. Also: the reason HNSW is > >> > > mentioned in these SearchStrategy enums is to make room for other > >> > > vector indexing approaches, like LSH. There was a lot of discussion > >> > > that we wanted an API that allowed for experimenting with other > >> > > techniques for indexing and searching vector values. > >> > > > >> > > Adrien, you made an analogy to PerFieldPostingsFormat (and > DocValues), > >> > > but I think the situation is more akin to Points, where we have the > >> > > options on IndexableField. The metadata we store there (dimension > and > >> > > score function) don't really result in different formats, ie code > >> > > paths for indexing and storage; they are more like parameters to the > >> > > format, in my mind. Perhaps the situation will look different when > we > >> > > get our second vector indexing strategy (like LSH). > >> > > > >> > > > >> > > On Tue, Mar 16, 2021 at 10:19 AM Tomoko Uchida > >> > > <tomoko.uchida.1...@gmail.com> wrote: > >> > > > > >> > > > > Should we rename VectorFormat to VectorsFormat? This would be > more consistent with other file formats that use the plural, like > PostingsFormat, DocValuesFormat, TermVectorsFormat, etc.? > >> > > > > >> > > > +1 for using plural form for consistency - if we reconsider the > names, how about VectorValuesFormat so that it follows the naming > convention for XXXValues? > >> > > > > >> > > > DocValuesFormat / DocValues > >> > > > PointValuesFormat / PointValues > >> > > > VectorValuesFormat / VectorValues (currently, VectorFormat / > VectorValues) > >> > > > > >> > > > > Should SearchStrategy constants avoid explicit references to > HNSW? > >> > > > > >> > > > Also +1 for decoupling HNSW specific implementations from general > vectors, though I am not fully sure if we can strictly separate the > similarity metrics and search algorithms for vectors. > >> > > > LUCENE-9322 (unified vectors API) was resolved months ago, does > it achieve its goal? I haven't followed the issue in months because of my > laziness... > >> > > > > >> > > > Thanks, > >> > > > Tomoko > >> > > > > >> > > > > >> > > > 2021年3月16日(火) 19:32 Adrien Grand <jpou...@gmail.com>: > >> > > >> > >> > > >> Hello, > >> > > >> > >> > > >> I've tried to catch up on the vector API and I have the > following questions. I've tried to read through discussions on JIRA first > in case it had been covered, but it's possible I missed some relevant ones. > >> > > >> > >> > > >> Should VectorValues#search be on VectorReader instead? It felt a > bit odd to me to have the search logic on the iterator. > >> > > >> > >> > > >> Do we need SearchStrategy.NONE? Documentation suggests that it > allows storing vectors but that NN search won't be supported. This looks > like a use-case for binary doc values to me? It also slightly caught me by > surprise due to the inconsistency with IndexOptions.NONE, which means "do > not index this field" (and likewise for DocValuesType.NONE), so I first > assumed that SearchStrategy.NONE also meant "do not index this field as a > vector". > >> > > >> > >> > > >> While postings and doc-value formats allow per-field > configuration via PerFieldPostingsFormat/PerFieldDocValuesFormat, vectors > use a different mechanism where VectorField#createHnswType sets attributes > on the field type that the vectors writer then reads. Should we have a > PerFieldVectorsFormat instead and configure these options via the vectors > format? > >> > > >> > >> > > >> Should SearchStrategy constants avoid explicit references to > HNSW? The rest of the API seems to try to be agnostic of the way that NN > search is implemented. Could we make SearchStrategy only about the > similarity metric that is used for vectors? This particular point seems > discussed on LUCENE-9322 but I couldn't find the conclusion. > >> > > >> > >> > > >> Should we rename VectorFormat to VectorsFormat? This would be > more consistent with other file formats that use the plural, like > PostingsFormat, DocValuesFormat, TermVectorsFormat, etc.? > >> > > >> > >> > > >> -- > >> > > >> Adrien > >> > >> --------------------------------------------------------------------- > >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > >> For additional commands, e-mail: dev-h...@lucene.apache.org > >> > > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: dev-h...@lucene.apache.org > >