Regarding the build time, did you configure a SerialMergeScheduler? Otherwise merges run in separate threads, which would explain the speedup, since adding vectors to the graph gets more and more expensive as the graph grows.
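
In case it helps, here is a rough sketch of what I mean, not the benchmark's actual code. It assumes you can get at the IndexWriterConfig the benchmark uses; the index path and the class name are made up for illustration. By default Lucene uses ConcurrentMergeScheduler, so merges triggered by flushes run in background threads while indexing continues. With SerialMergeScheduler they run on the indexing thread, which should make the two timings comparable.

```java
import java.nio.file.Path;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.SerialMergeScheduler;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SerialMergeSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical index location, for illustration only.
    Path indexPath = Path.of("/tmp/hnsw-serial-merge");

    IndexWriterConfig iwc = new IndexWriterConfig();
    // Small writer buffer, as in the second run, so several segments get flushed.
    iwc.setRAMBufferSizeMB(50);
    // Run merges sequentially on the calling thread instead of in background threads.
    iwc.setMergeScheduler(new SerialMergeScheduler());

    try (Directory dir = FSDirectory.open(indexPath);
         IndexWriter writer = new IndexWriter(dir, iwc)) {
      long start = System.nanoTime();
      // ... add the vector documents here, as the benchmark already does ...
      writer.forceMerge(1); // merge everything down to a single segment
      long elapsedMs = (System.nanoTime() - start) / 1_000_000;
      System.out.println("index + merge took " + elapsedMs + " ms");
    }
  }
}
```

If the 50MB run is no longer faster once merges are serialized, the earlier speedup most likely came from merge work overlapping with indexing.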

On Wed, Oct 11, 2023 at 05:07, Patrick Zhai <zhai7...@gmail.com> wrote:

> Hi folks,
> I was running the HNSW benchmark today and found some weird results. I want
> to share them here and see whether people have any ideas.
>
> The setup is:
> the 384-dimension vectors available in luceneutil, 100k documents,
> and the Lucene main branch.
> max_conn=64, fanout=0, beam_width=250
>
> I first tried the default setting, where we use a 1994MB writer
> buffer, so with 100k documents no merge happens and I
> have 1 segment at the end.
> This gives me 0.755 recall and 101113ms index building time.
>
> Then I tried a 50MB writer buffer with a forcemerge at the end.
> With 100k documents, I get several segments (the final index is around
> 300MB, so I guess 5 or 6) before the merge, and then merge them into 1.
> This gives me 0.692 recall, but it took only 81562ms (including 34394ms
> doing the merge) to index.
> I have also tried disabling the initialize-from-graph feature (so that
> a merge always rebuilds the whole graph) and changing the random
> seed, but I still get similar results.
>
> I'm wondering:
> 1. Why does recall drop that much in the latter setup?
> 2. Why is the index time so much better? I think we still need to rebuild the
> whole graph, or maybe it's just because we're using more off-heap memory
> (and less heap) when merging (do we?)?
>
> Best
> Patrick