Hi Patrick, this seems like an important topic to explore! Speeding up merging by allowing multiple graphs per segment could be an acceptable compromise. We'd probably have to run a large-scale experiment to get a good handle on the performance trade-off. My intuition though is that it could significantly hurt search latency: the HNSW algorithm has roughly logarithmic complexity, so it's always better to search a single large graph rather than several smaller ones sequentially. For example, if a single large graph gave ~log n, then searching 4 small ones would be ~4 * log(n/4), which is roughly 4 times the cost.
Another option is to still merge all graphs into one, but in a way that mostly just reuses the existing segment graphs. I'm mentioning this option for completeness, but don't have a clear idea of how it would work. There is limited academic work on merging nearest neighbor graphs (I only found https://arxiv.org/pdf/1908.00814.pdf, which is interesting but not clear if it could be adapted to HNSW?) Julie On Tue, Oct 18, 2022 at 9:43 PM Patrick Zhai <zhai7...@gmail.com> wrote: > Hi Folks > > I've talked with Mike Sokolov and learnt some KNN knowledge from him > (thank you!) during ApacheCon and one thing I learnt was that our KNN > implementation was kind of suffering from long merging time because we > currently rebuild the graph from scratch every time we merge. I noticed > there's one effort that is trying to reuse a graph from one segment to save > part of the time: https://github.com/apache/lucene/issues/11354. > > But I wonder whether it makes sense for us to take a step even further: to > be able to delay the HNSW graph merge or only do partial merge and allow > multiple HNSW graphs stay in one segment? For example, if we're merging 8 > equal sized segments and we can tolerate up to 4 hnsw graphs, then we only > need to re-insert half of the documents (after we're able to reuse old > graphs). This could slow down the search within the segment by a factor of > logK, but could potentially save a lot of merging time, especially when the > merge policy is aggressive? > > Just want to throw this idea out and please feel free to comment! > > Best > Patrick >