Hi Ben,

Thanks! I think that's the issue. I was using an old local checkout. I will try with the latest commit and report back if the results still look weird.
On Wed, Oct 11, 2023, 12:26 Benjamin Trent <ben.w.tr...@gmail.com> wrote:

> Heya Patrick,
>
> What version of Lucene Util are you using? There was a bug where
> `forceMerge` was not actually using your configured maxConn & beamWidth.
> See: https://github.com/mikemccand/luceneutil/pull/232
>
> Do you have that commit, and have you rebuilt the KnnGraphTester?
>
> On Wed, Oct 11, 2023 at 10:10 AM Patrick Zhai <zhai7...@gmail.com> wrote:
>
>> Hi Adrien,
>> I'm using the default CMS, but I doubt whether any merge is triggered
>> in the background at all. Since no merge policy is changed, the default
>> TMP will likely only merge segments once they reach 10, I believe. The
>> index is about 300MB and the buffer size is around 50MB, so I don't
>> think we will have enough segments to trigger a merge while I'm
>> building the index.
>>
>> On Wed, Oct 11, 2023, 02:47 Adrien Grand <jpou...@gmail.com> wrote:
>>
>>> Regarding building time, did you configure a SerialMergeScheduler?
>>> Otherwise merges run in separate threads, which would explain the
>>> speedup, as adding vectors to the graph gets more and more expensive
>>> as the size of the graph increases.
>>>
>>> On Wed, Oct 11, 2023, 05:07 Patrick Zhai <zhai7...@gmail.com> wrote:
>>>
>>>> Hi folks,
>>>> I was running the HNSW benchmark today and found some weird results.
>>>> I want to share them here and see whether people have any ideas.
>>>>
>>>> The setup is:
>>>> the 384-dimension vectors available in luceneutil, 100k documents,
>>>> and the Lucene main branch.
>>>> max_conn=64, fanout=0, beam_width=250
>>>>
>>>> I first tried the default setting with a 1994MB writer buffer, so
>>>> with 100k documents no merge happens and I end up with 1 segment.
>>>> This gives me 0.755 recall and a 101113ms index building time.
>>>> Then I tried a 50MB writer buffer followed by a forceMerge at the
>>>> end. With 100k documents I get several segments before the merge
>>>> (the final index is around 300MB, so I guess 5 or 6), which are then
>>>> merged into 1 at the end.
>>>> This gives me 0.692 recall, but it took only 81562ms (including
>>>> 34394ms doing the merge) to index.
>>>> I have also tried disabling the initialize-from-graph feature (such
>>>> that when we merge we always rebuild the whole graph) and changing
>>>> the random seed, but I still get similar results.
>>>>
>>>> I'm wondering:
>>>> 1. Why does recall drop that much in the latter setup?
>>>> 2. Why is index time so much better? I think we still need to
>>>> rebuild the whole graph, or maybe it's just because we're using more
>>>> off-heap memory (and less heap) during the merge (do we?)?
>>>>
>>>> Best,
>>>> Patrick
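[Editor's note: for readers reproducing this outside KnnGraphTester, here is a minimal sketch of the IndexWriterConfig the thread describes: a ~50MB RAM buffer, a SerialMergeScheduler (per Adrien's suggestion, so merge cost shows up on the indexing thread instead of running in the background), and the HNSW parameters maxConn=64 / beamWidth=250. This assumes Lucene 9.x-era APIs where the per-field vectors format is `Lucene95HnswVectorsFormat`; the exact codec and format class names vary by Lucene version.]

```java
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene95.Lucene95Codec;
import org.apache.lucene.codecs.lucene95.Lucene95HnswVectorsFormat;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.SerialMergeScheduler;

public class HnswBenchmarkConfig {
  public static IndexWriterConfig newConfig() {
    IndexWriterConfig iwc = new IndexWriterConfig();
    // ~50MB buffer: with 100k 384-dim vectors (~300MB index) this flushes
    // several segments before the final forceMerge, as in the second run.
    iwc.setRAMBufferSizeMB(50);
    // Run merges on the indexing thread so background merging cannot
    // skew the wall-clock indexing time being measured.
    iwc.setMergeScheduler(new SerialMergeScheduler());
    // Use maxConn=64 and beamWidth=250 for every vector field, matching
    // the benchmark settings in the thread.
    iwc.setCodec(new Lucene95Codec() {
      @Override
      public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
        return new Lucene95HnswVectorsFormat(64, 250);
      }
    });
    return iwc;
  }
}
```

Without the SerialMergeScheduler line, the default ConcurrentMergeScheduler merges in separate threads, which is one of the candidate explanations discussed above for the faster multi-segment build.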