Idea about faster vector format merge

Patrick Zhai Tue, 18 Oct 2022 21:43:31 -0700

Hi Folks

I talked with Mike Sokolov during ApacheCon and learnt a lot about KNN from
him (thank you!). One thing I learnt is that our KNN implementation suffers
from long merge times because we currently rebuild the HNSW graph from
scratch every time we merge. I noticed there is already one effort to reuse
the graph from one segment to save part of that time:
https://github.com/apache/lucene/issues/11354.


But I wonder whether it makes sense for us to go a step further: delay the
HNSW graph merge, or do only a partial merge, and allow multiple HNSW graphs
to stay in one segment. For example, if we're merging 8 equal-sized segments
and can tolerate up to 4 HNSW graphs, then we only need to re-insert half of
the documents (once we're able to reuse old graphs). This could slow down
search within the segment by a factor of logK, but could potentially save a
lot of merging time, especially when the merge policy is aggressive.
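
To make the arithmetic in the example concrete, here is a rough sketch of the
insert counts. It is only a back-of-the-envelope model, not Lucene code: the
function names are made up for illustration, and it assumes equal-sized
segments, no deleted documents, and that every graph kept in the merged
segment is reused unchanged from exactly one source segment. It does not
model the search-time cost at all.

```python
def docs_to_reinsert(num_segments: int, docs_per_segment: int, max_graphs: int) -> int:
    """Documents that must be inserted into an existing graph during a merge.

    Hypothetical model: each of the max_graphs graphs kept in the merged
    segment starts from one source segment's graph (reused as-is), and the
    documents of all remaining source segments are re-inserted.
    """
    if not 1 <= max_graphs <= num_segments:
        raise ValueError("max_graphs must be between 1 and num_segments")
    reused_segments = max_graphs  # one reused source graph per kept graph
    return (num_segments - reused_segments) * docs_per_segment


def docs_to_rebuild(num_segments: int, docs_per_segment: int) -> int:
    """Inserts needed when the graph is rebuilt from scratch on every merge."""
    return num_segments * docs_per_segment


segments, docs = 8, 1000
total = docs_to_rebuild(segments, docs)
for graphs in (1, 2, 4):
    cost = docs_to_reinsert(segments, docs, graphs)
    print(f"{graphs} graph(s): {cost} of {total} docs re-inserted ({cost / total:.1%})")
```

Under these assumptions, keeping a single graph while reusing one old graph
re-inserts 7/8 of the documents, while tolerating 4 graphs re-inserts 4/8,
which is the "half" in the example above.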

Just want to throw this idea out and please feel free to comment!

Best
Patrick
