[
https://issues.apache.org/jira/browse/LUCENE-10318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580935#comment-17580935
]
Jack Mazanec commented on LUCENE-10318:
---------------------------------------
Hi [~julietibs], I was thinking about something similar and would be interested
in working on this. I can run some experiments to see whether this would improve
performance, if you haven't already started doing so.
Additionally, I am wondering if it would make sense to extend this to support
graphs that contain deleted nodes. I can think of an approach, but it is a
little messy. It would follow the same idea for merging: add the vectors from
the smaller graph into the larger graph. However, before adding them, all of
the deleted nodes would need to be removed from the larger graph.
To remove a node from the graph, I think we would need to remove it from the
list of neighbor arrays for each level it appears in. In addition, because
removal would break the ordinals, we would have to update all of the ordinals
in the graph, which for OnHeapHnswGraph would mean updating all of the
nodes-by-level lists and also potentially each neighbor in each NeighborArray
in the graph.
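To make the ordinal bookkeeping concrete, here is a rough sketch of what a removal would have to do. This uses a made-up, single-level adjacency-list graph, not Lucene's actual OnHeapHnswGraph API; it just illustrates that deleting one node forces every higher ordinal to shift down, touching every neighbor list:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified single-level graph: node ordinal -> list of
// neighbor ordinals. Illustrative only, not Lucene's real data structure.
class GraphCompactor {
    // Remove `deleted` and shift every ordinal greater than it down by one.
    static List<List<Integer>> removeNode(List<List<Integer>> neighbors, int deleted) {
        List<List<Integer>> compacted = new ArrayList<>();
        for (int ord = 0; ord < neighbors.size(); ord++) {
            if (ord == deleted) continue; // drop the deleted node's own neighbor list
            List<Integer> remapped = new ArrayList<>();
            for (int n : neighbors.get(ord)) {
                if (n == deleted) continue;            // sever links to the deleted node
                remapped.add(n > deleted ? n - 1 : n); // shift ordinals above the gap
            }
            compacted.add(remapped);
        }
        return compacted;
    }
}
```

Note that every connection in the graph is visited once, which is where the "at least iterate over each connection" cost below comes from.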
Because removing a node could cause a number of nodes in the graph to lose a
neighbor, we would need to repair the graph. To do this, I think we could
create a repair_list that tracks the nodes that lost a connection to the
deleted node. To fill the list, we would iterate over all of the nodes in the
graph and check whether any of their m connections point to the deleted node
(I think this could be done while the ordinals are being updated). If so, we
remove the connection and add the node to the repair_list.
Once the repair_list is complete, for each node in the list we would search
the graph for new neighbors to fill the node's connections back up to the
desired amount.
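A hedged sketch of that repair step, again on a hypothetical map-based graph; a brute-force distance scan stands in for the real HNSW graph search, and all names here are made up for illustration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: sever links to a deleted node, collect the nodes
// that lost a connection, then refill each one back up to maxConn neighbors.
class GraphRepairer {
    // Returns the repair_list: every node that had a connection to `deleted`.
    static List<Integer> collectRepairList(Map<Integer, Set<Integer>> neighbors, int deleted) {
        List<Integer> repairList = new ArrayList<>();
        for (Map.Entry<Integer, Set<Integer>> e : neighbors.entrySet()) {
            if (e.getValue().remove(deleted)) { // sever the link if present
                repairList.add(e.getKey());     // this node lost a connection
            }
        }
        neighbors.remove(deleted);
        return repairList;
    }

    // Refill connections; a linear scan substitutes for the HNSW search here.
    static void repair(Map<Integer, Set<Integer>> neighbors, Map<Integer, float[]> vectors,
                       List<Integer> repairList, int maxConn) {
        for (int node : repairList) {
            Set<Integer> conns = neighbors.get(node);
            while (conns.size() < maxConn) {
                int best = -1;
                float bestDist = Float.MAX_VALUE;
                for (int cand : neighbors.keySet()) {
                    if (cand == node || conns.contains(cand)) continue;
                    float d = squaredDist(vectors.get(node), vectors.get(cand));
                    if (d < bestDist) { bestDist = d; best = cand; }
                }
                if (best < 0) break; // no candidates left to connect to
                conns.add(best);
            }
        }
    }

    static float squaredDist(float[] a, float[] b) {
        float s = 0;
        for (int i = 0; i < a.length; i++) { float d = a[i] - b[i]; s += d * d; }
        return s;
    }
}
```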
At this point, I would expect the time it takes to finish merging to be equal
to the time it takes to insert the live vectors from the smaller graph, plus
the nodes in the repair list, into the larger graph.
All that being said, I am not sure whether removing deleted nodes from the
graph would be faster than just building the graph from scratch. From the
logic above, we would need to at least iterate over each connection in the
graph and potentially perform several list deletions. My guess is that it
would be faster when the repair list is small, but probably not when it is
large.
I am going to try to start playing around with this idea, but please let me
know what you think!
> Reuse HNSW graphs when merging segments?
> ----------------------------------------
>
> Key: LUCENE-10318
> URL: https://issues.apache.org/jira/browse/LUCENE-10318
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Julie Tibshirani
> Priority: Major
>
> Currently when merging segments, the HNSW vectors format rebuilds the entire
> graph from scratch. In general, building these graphs is very expensive, and
> it'd be nice to optimize it in any way we can. I was wondering if during
> merge, we could choose the largest segment with no deletes, and load its HNSW
> graph into heap. Then we'd add vectors from the other segments to this graph,
> through the normal build process. This could cut down on the number of
> operations we need to perform when building the graph.
> This is just an early idea, I haven't run experiments to see if it would
> help. I'd guess that whether it helps would also depend on details of the
> MergePolicy.
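For what it's worth, the seeding step described in the quoted issue could be sketched roughly as follows. Segment, seedFromLargest, and the adjacency representation are all made-up names for illustration, not Lucene's actual merge APIs; the real work would be inserting the remaining segments' vectors into the seeded graph through the normal build process:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hedged sketch: pick the largest delete-free segment and reuse its graph as
// the starting point for the merged graph, instead of building from scratch.
class MergeSketch {
    record Segment(Map<Integer, Set<Integer>> graph, boolean hasDeletes) {}

    static Map<Integer, Set<Integer>> seedFromLargest(List<Segment> segments) {
        Segment best = null;
        for (Segment s : segments) {
            if (s.hasDeletes()) continue; // only a delete-free graph can be reused as-is
            if (best == null || s.graph().size() > best.graph().size()) best = s;
        }
        if (best == null) return new HashMap<>(); // fall back to building from scratch
        Map<Integer, Set<Integer>> merged = new HashMap<>();
        best.graph().forEach((k, v) -> merged.put(k, new HashSet<>(v))); // deep-ish copy
        return merged;
    }
}
```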
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]