Hello Kannan,

The fact that adding 10k docs to an empty HNSW graph is faster than adding
10k docs to a large HNSW graph is expected, but the 120x factor you are
reporting sounds high. Maybe your dataset is larger than your page cache,
forcing the OS to read vectors directly from disk?
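
For reference, a quick back-of-the-envelope estimate from the numbers
below (assuming uncompressed 4-byte floats):

    60 indices x 500k vectors x 768 dims x 4 bytes ≈ 92 GB of raw vector data

If the machine also hosts your 30GB heap, the page cache likely cannot
hold all vectors, and HNSW's random access pattern during graph
construction would then hit disk more and more often as the merged graph
grows.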

If this doesn't sound right, running your application under a profiler
would help identify the merge bottleneck.
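
Since you are on OpenJDK 17, Java Flight Recorder is a low-overhead
option (sketch; <pid> is the indexing JVM's process id):

    jcmd <pid> JFR.start duration=120s filename=merge-profile.jfr

Opening the recording in JDK Mission Control should show whether the
merge threads are CPU-bound or mostly waiting on I/O, which would
confirm or rule out the page cache theory above.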

On Fri, Apr 19, 2024 at 4:17 PM Krishnamurthy, Kannan
<kannan.krishnamur...@cengage.com.invalid> wrote:

> Greetings,
>
> We are experiencing slow HNSW graph creation times during index merges.
> Specifically, we have noticed that HNSW graph creation becomes
> progressively slower once the graph reaches a certain size.
>
> Our indexing workflow creates around 60 indices, each containing
> approximately 500k vectors of 768 float dimensions. We then merge all
> these small indices into a single large index, forcing a single final
> segment. During the merge step, HNSW graph creation starts off with good
> performance, taking about 15 seconds to process 10k documents. However,
> once the graph reaches around 7.5M documents, performance degrades
> significantly: 10k documents now take about 30 minutes to process, and
> the processing time continues to grow as the graph gets larger. We have
> observed similar behavior with different settings: M=16 with a beam
> width of 100, and M=32 with a beam width of 50.
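>
> For reference, the merge step is essentially the following (simplified
> sketch using org.apache.lucene.index; directory setup and writer
> configuration are omitted, and the variable names are placeholders):
>
>     IndexWriterConfig cfg = new IndexWriterConfig();
>     try (IndexWriter writer = new IndexWriter(targetDir, cfg)) {
>       writer.addIndexes(smallIndexDirs); // fold in the ~60 small indices
>       writer.forceMerge(1);              // merge down to one segment
>     }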
>
> We are using Lucene version 9.8.0 and Java version `openjdk 17.0.3`. Our
> Java heap is set to 30GB, and we do not use any compression for the
> vectors. Additionally, we have not observed any long or continuous
> garbage collection pauses.
>
> Greatly appreciate any pointers or thoughts on how to further debug this
> issue or improve the performance.
>
> Thanks
> Kannan Krishnamurthy.
>
>

-- 
Adrien
