benwtrent opened a new issue, #12440:
URL: https://github.com/apache/lucene/issues/12440
### Description
It is well known that HNSW graphs are not merged in linear time when merging segments. We lose a fair bit of useful information, as every per-segment graph is effectively thrown away (except for a single graph). This means the useful, distributed work done in each segment is ignored.
Do we think it's even possible to improve HNSW merging?
There is existing research around merging hierarchical graphs, but not
specifically HNSW:
- https://arxiv.org/abs/1908.00814
- https://arxiv.org/abs/1912.01059 (specifically, how they merge the
parallel-built graphs, not the GPU-focused parts)
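One way to reuse the distributed work the issue describes would be to seed the merged graph with the largest segment's graph and insert only the remaining segments' vectors into it, rather than rebuilding from scratch. Below is a toy sketch of that idea; it is not Lucene code, the names (`GraphMergeSketch`, `merge`, `insert`) are invented, and the brute-force insertion stands in for a real HNSW beam-search insert.

```java
import java.util.*;

/**
 * Hypothetical sketch: seed a merged graph with the largest segment's
 * graph and insert the other segments' vectors into it. A single
 * adjacency map stands in for one HNSW layer.
 */
public class GraphMergeSketch {

  // Squared Euclidean distance (a stand-in for Lucene's vector similarity).
  static float dist(float[] a, float[] b) {
    float s = 0;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      s += d * d;
    }
    return s;
  }

  // Insert `node` by linking it to its k nearest already-present nodes.
  // A real HNSW insert would use a greedy beam search; brute force keeps
  // the sketch short.
  static void insert(Map<Integer, List<Integer>> graph, float[][] vectors, int node, int k) {
    PriorityQueue<Integer> nearest = new PriorityQueue<>(
        Comparator.comparingDouble(n -> dist(vectors[node], vectors[n])));
    nearest.addAll(graph.keySet());
    List<Integer> links = new ArrayList<>();
    for (int i = 0; i < k && !nearest.isEmpty(); i++) {
      links.add(nearest.poll());
    }
    graph.put(node, links);
    // Back-links, so the graph stays undirected in this toy model.
    for (int n : links) {
      graph.get(n).add(node);
    }
  }

  // Merge: copy the largest segment's graph verbatim, then insert the
  // remaining nodes. Only the inserted nodes pay graph-construction cost.
  static Map<Integer, List<Integer>> merge(
      Map<Integer, List<Integer>> largest, Collection<Integer> others,
      float[][] vectors, int k) {
    Map<Integer, List<Integer>> merged = new HashMap<>();
    for (Map.Entry<Integer, List<Integer>> e : largest.entrySet()) {
      merged.put(e.getKey(), new ArrayList<>(e.getValue()));
    }
    for (int node : others) {
      insert(merged, vectors, node, k);
    }
    return merged;
  }
}
```

The open question the papers above tackle is whether the seeded graph's quality degrades versus a full rebuild, since the reused edges were chosen without knowledge of the incoming vectors.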
One other option that comes to mind is keeping neighbor scores and using
them on merge, taking advantage of a
[FINGER](https://www.amazon.science/publications/finger-fast-inference-for-graph-based-approximate-nearest-neighbor-search)
type of idea. I am not 100% sold on using this during search, as it increases
memory usage. But if we store these scores in a separate file that is not
loaded during search but is loaded during merge, we could reduce the number
of distance calculations required.
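The score-reuse idea above amounts to a lookup table of edge scores that each segment writes at build time and the merger reads back, so re-scoring an edge that a segment already scored becomes a map lookup instead of a distance computation. A minimal sketch, with invented names (`ScoreCacheSketch`, `score`) and an in-memory map standing in for the proposed on-disk file:

```java
import java.util.*;

/**
 * Hypothetical sketch of a merge-time score cache. In the issue's
 * proposal this cache would live in a separate file that is loaded
 * only during merge, never during search.
 */
public class ScoreCacheSketch {

  // Precomputed edge scores keyed by an order-independent (node, node)
  // pair packed into a long.
  final Map<Long, Float> cachedScores = new HashMap<>();

  // Counts the expensive distance computations we could not avoid.
  int distanceComputations = 0;

  static long key(int a, int b) {
    int lo = Math.min(a, b), hi = Math.max(a, b);
    return ((long) lo << 32) | (hi & 0xffffffffL);
  }

  // Return the score for edge (a, b), reusing the segment's stored work
  // when available and falling back to a fresh computation otherwise.
  float score(float[][] vectors, int a, int b) {
    Long k = key(a, b);
    Float cached = cachedScores.get(k);
    if (cached != null) {
      return cached; // free: the segment already did this work
    }
    distanceComputations++;
    float s = 0;
    for (int i = 0; i < vectors[a].length; i++) {
      float d = vectors[a][i] - vectors[b][i];
      s += d * d;
    }
    cachedScores.put(k, s);
    return s;
  }
}
```

The trade-off is the extra disk space for the score file versus the distance computations saved; only edges that actually get re-examined during merge benefit.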
Of course, all these kinds of changes require due diligence, and they do NOT
obviate the need for a new, more page-fault-friendly algorithm in Lucene
(IVF, SPANN, DiskANN, or something similar).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]