Yeah, HNSW is problematic in a few ways: (1) merging is costly due to the need to completely recreate the graph. (2) searching across a segmented index sacrifices much of the performance benefit of HNSW since the cost of searching HNSW graphs scales ~logarithmically with the size of the graph, so splitting into multiple graphs and then merge sorting results is pretty expensive. I guess the random access / scan forward dynamic is another problematic area.
On Tue, Mar 16, 2021 at 1:28 PM Robert Muir <rcm...@gmail.com> wrote: > > Maybe that is so, but we should factor in everything: such as large scale > indexing, not requiring whole data set to be in RAM, etc. Hey, it's Lucene! > > Because HNSW has dominated the nightly benchmarks, I have been digging > through stacktraces and trying to figure out ways to make it work > efficiently, and I'm not sure what to do. > Especially merge is painful: it seems to cause a storm of page faults/random > accesses due to how it works, and I don't know yet how to make it better. > It seems to rebuild the entire graph, spraying random accesses across a > "slow-wrapper" that binary searches each sub on every access. > I don't see any way to even amortize the pain with some kind of bulk merge > trick. > > So if we find algorithms that scale better, I think we should lend a > preference towards them. For example, algorithms that allow > per-segment/sequential index and merge. > > On Tue, Mar 16, 2021 at 1:06 PM Michael Sokolov <msoko...@gmail.com> wrote: >> >> ann-benchmarks.com maintains open benchmarks of a bunch of ANN >> (approximate NN) algorithms. When we started this effort, HNSW was at >> the top of the heap in most of the benchmarks. >> >> On Tue, Mar 16, 2021 at 12:28 PM Robert Muir <rcm...@gmail.com> wrote: >> > >> > Where are the alternative algorithms that work on sequential iterators and >> > don't need random access? >> > >> > Seems like these should be the ones we initially add to lucene, and HNSW >> > should be put aside for now? (is it a toy, or can we do it without >> > jazillions of random accesses?) >> > >> > On Tue, Mar 16, 2021 at 12:15 PM Michael Sokolov <msoko...@gmail.com> >> > wrote: >> >> >> >> There's also some good discussion on >> >> https://issues.apache.org/jira/browse/LUCENE-9583 about random access >> >> vs iterator pattern that never got fully resolved. We said we would >> >> revisit after KNN (LUCENE-9004) landed, and now it has. The usage of >> >> random access is pretty well-established there, maybe we should >> >> abandon the iterator API since it is redundant (you can always iterate >> >> over a random access API if you know the size)? >> >> >> >> On Tue, Mar 16, 2021 at 12:10 PM Michael Sokolov <msoko...@gmail.com> >> >> wrote: >> >> > >> >> > Also, Tomoko re:LUCENE-9322, did it succeed? I guess we won't know for >> >> > sure unless someone revives >> >> > https://issues.apache.org/jira/browse/LUCENE-9136 or something like >> >> > that >> >> > >> >> > On Tue, Mar 16, 2021 at 12:04 PM Michael Sokolov <msoko...@gmail.com> >> >> > wrote: >> >> > > >> >> > > Consistent plural naming makes sense to me. I think it ended up >> >> > > singular because I am biased to avoid plural names unless there is a >> >> > > useful distinction to be made. But consistency should trump my >> >> > > predilections. >> >> > > >> >> > > I think the reason we have search() on VectorValues is that we have >> >> > > LeafReader.getVectorValues() (by analogy to the DocValues iterators), >> >> > > but no way to access the VectorReader. Do you think we should also >> >> > > have LeafReader.getVectorReader()? Today it's only on CodecReader. >> >> > > >> >> > > Re: SearchStrategy.NONE; the idea is we support efficient access to >> >> > > floating point values. Using BinaryDocValues for this will always >> >> > > require an additional decoding step. I can see that the naming is >> >> > > confusing there. The intent is that you index the vector values, but >> >> > > no additional indexing data structure. Also: the reason HNSW is >> >> > > mentioned in these SearchStrategy enums is to make room for other >> >> > > vector indexing approaches, like LSH. There was a lot of discussion >> >> > > that we wanted an API that allowed for experimenting with other >> >> > > techniques for indexing and searching vector values. >> >> > > >> >> > > Adrien, you made an analogy to PerFieldPostingsFormat (and DocValues), >> >> > > but I think the situation is more akin to Points, where we have the >> >> > > options on IndexableField. The metadata we store there (dimension and >> >> > > score function) don't really result in different formats, ie code >> >> > > paths for indexing and storage; they are more like parameters to the >> >> > > format, in my mind. Perhaps the situation will look different when we >> >> > > get our second vector indexing strategy (like LSH). >> >> > > >> >> > > >> >> > > On Tue, Mar 16, 2021 at 10:19 AM Tomoko Uchida >> >> > > <tomoko.uchida.1...@gmail.com> wrote: >> >> > > > >> >> > > > > Should we rename VectorFormat to VectorsFormat? This would be >> >> > > > > more consistent with other file formats that use the plural, like >> >> > > > > PostingsFormat, DocValuesFormat, TermVectorsFormat, etc.? >> >> > > > >> >> > > > +1 for using plural form for consistency - if we reconsider the >> >> > > > names, how about VectorValuesFormat so that it follows the naming >> >> > > > convention for XXXValues? >> >> > > > >> >> > > > DocValuesFormat / DocValues >> >> > > > PointValuesFormat / PointValues >> >> > > > VectorValuesFormat / VectorValues (currently, VectorFormat / >> >> > > > VectorValues) >> >> > > > >> >> > > > > Should SearchStrategy constants avoid explicit references to HNSW? >> >> > > > >> >> > > > Also +1 for decoupling HNSW specific implementations from general >> >> > > > vectors, though I am not fully sure if we can strictly separate the >> >> > > > similarity metrics and search algorithms for vectors. >> >> > > > LUCENE-9322 (unified vectors API) was resolved months ago, does it >> >> > > > achieve its goal? I haven't followed the issue in months because of >> >> > > > my laziness... >> >> > > > >> >> > > > Thanks, >> >> > > > Tomoko >> >> > > > >> >> > > > >> >> > > > 2021年3月16日(火) 19:32 Adrien Grand <jpou...@gmail.com>: >> >> > > >> >> >> > > >> Hello, >> >> > > >> >> >> > > >> I've tried to catch up on the vector API and I have the following >> >> > > >> questions. I've tried to read through discussions on JIRA first in >> >> > > >> case it had been covered, but it's possible I missed some relevant >> >> > > >> ones. >> >> > > >> >> >> > > >> Should VectorValues#search be on VectorReader instead? It felt a >> >> > > >> bit odd to me to have the search logic on the iterator. >> >> > > >> >> >> > > >> Do we need SearchStrategy.NONE? Documentation suggests that it >> >> > > >> allows storing vectors but that NN search won't be supported. This >> >> > > >> looks like a use-case for binary doc values to me? It also >> >> > > >> slightly caught me by surprise due to the inconsistency with >> >> > > >> IndexOptions.NONE, which means "do not index this field" (and >> >> > > >> likewise for DocValuesType.NONE), so I first assumed that >> >> > > >> SearchStrategy.NONE also meant "do not index this field as a >> >> > > >> vector". >> >> > > >> >> >> > > >> While postings and doc-value formats allow per-field configuration >> >> > > >> via PerFieldPostingsFormat/PerFieldDocValuesFormat, vectors use a >> >> > > >> different mechanism where VectorField#createHnswType sets >> >> > > >> attributes on the field type that the vectors writer then reads. >> >> > > >> Should we have a PerFieldVectorsFormat instead and configure these >> >> > > >> options via the vectors format? >> >> > > >> >> >> > > >> Should SearchStrategy constants avoid explicit references to HNSW? >> >> > > >> The rest of the API seems to try to be agnostic of the way that NN >> >> > > >> search is implemented. Could we make SearchStrategy only about the >> >> > > >> similarity metric that is used for vectors? This particular point >> >> > > >> seems discussed on LUCENE-9322 but I couldn't find the conclusion. >> >> > > >> >> >> > > >> Should we rename VectorFormat to VectorsFormat? This would be more >> >> > > >> consistent with other file formats that use the plural, like >> >> > > >> PostingsFormat, DocValuesFormat, TermVectorsFormat, etc.? >> >> > > >> >> >> > > >> -- >> >> > > >> Adrien >> >> >> >> --------------------------------------------------------------------- >> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >> >> For additional commands, e-mail: dev-h...@lucene.apache.org >> >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >> For additional commands, e-mail: dev-h...@lucene.apache.org >> --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org