[ https://issues.apache.org/jira/browse/LUCENE-10375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475446#comment-17475446 ]
Michael Sokolov commented on LUCENE-10375:
------------------------------------------

Ooh, exciting. That code was complicated and tricky to get right too. I guess in hindsight it's not too surprising that it added some overhead. I will check out the PR; thanks for this!

> Speed up HNSW merge by writing combined vector data
> ---------------------------------------------------
>
>                 Key: LUCENE-10375
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10375
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Julie Tibshirani
>            Priority: Major
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> When merging segments together, the HNSW writer creates a VectorValues
> instance that gives a merged view of all the segments' VectorValues. This
> merged instance is used when constructing the new HNSW graph. Graph building
> needs random access, and the merged VectorValues support this by mapping from
> merged ordinals -> segments and segment ordinals.
> This mapping seems to add overhead. The nightly indexing benchmarks sometimes
> show substantial time in Arrays.binarySearch (used to map an ordinal to a
> segment):
> https://blunders.io/jfr-demo/indexing-1kb-vectors-2022.01.09.18.03.19/top_down_cpu_samples.
> Instead of using a merged VectorValues to create the graph, maybe we could
> first write all the segment vectors to a file, and use that file to build the
> graph.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
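
For readers unfamiliar with the mapping discussed in the issue, here is a minimal sketch of how a merged ordinal could be resolved to a segment and a segment-local ordinal with Arrays.binarySearch. The class MergedOrdinalMap and its methods are hypothetical names chosen for illustration, not Lucene's actual code. The sketch assumes every segment holds at least one vector. The flatOffset method illustrates the proposed alternative under the further assumption that the combined file stores fixed-dimension float32 vectors contiguously, so a lookup becomes plain arithmetic with no search.

```java
import java.util.Arrays;

/** Hypothetical illustration only; this is not a Lucene class. */
final class MergedOrdinalMap {
  // starts[i] is the merged ordinal of the first vector in segment i.
  private final int[] starts;
  private final int total;

  MergedOrdinalMap(int[] segmentSizes) {
    starts = new int[segmentSizes.length];
    int sum = 0;
    for (int i = 0; i < segmentSizes.length; i++) {
      if (segmentSizes[i] <= 0) {
        // Assumption: empty segments are excluded, so starts[] has no duplicates.
        throw new IllegalArgumentException("segment " + i + " is empty");
      }
      starts[i] = sum;
      sum += segmentSizes[i];
    }
    total = sum;
  }

  /** Maps a merged ordinal to the index of the segment that contains it. */
  int segmentOf(int mergedOrd) {
    if (mergedOrd < 0 || mergedOrd >= total) {
      throw new IndexOutOfBoundsException("ordinal " + mergedOrd);
    }
    int pos = Arrays.binarySearch(starts, mergedOrd);
    // Exact hit: mergedOrd is the first ordinal of segment pos.
    // Miss: binarySearch returns -(insertionPoint) - 1, and the
    // containing segment is insertionPoint - 1, i.e. -pos - 2.
    return pos >= 0 ? pos : -pos - 2;
  }

  /** Maps a merged ordinal to its ordinal within the containing segment. */
  int segmentOrdOf(int mergedOrd) {
    return mergedOrd - starts[segmentOf(mergedOrd)];
  }

  /**
   * Byte offset of a vector in a combined file, assuming contiguous
   * float32 vectors of a fixed dimension. No per-lookup search is needed.
   */
  static long flatOffset(int mergedOrd, int dim) {
    return (long) mergedOrd * dim * Float.BYTES;
  }
}
```

With segment sizes {3, 2, 4}, merged ordinal 4 resolves to segment 1 and segment ordinal 1. Each such lookup costs a binary search over the segment starts, which is the overhead the benchmark profile points at, whereas the flat-file offset is a single multiplication.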