[ 
https://issues.apache.org/jira/browse/LUCENE-10375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478243#comment-17478243
 ] 

ASF subversion and git services commented on LUCENE-10375:
----------------------------------------------------------

Commit f68cdd4c03e2a47b23091f669763c7a164834f87 in lucene's branch 
refs/heads/branch_9x from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=f68cdd4 ]

LUCENE-10375: Write merged vectors to file before building graph (#601)

When merging segments together, the `KnnVectorsWriter` creates a `VectorValues`
instance with a merged view of all the segments' vectors. This merged instance
is used when constructing the new HNSW graph. Graph building needs random
access, and the merged VectorValues support this by mapping from merged
ordinals to segments and segment ordinals. This mapping can add significant
overhead when building the graph.
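
To make the cost concrete, here is a minimal sketch of the kind of lookup
described above. It is not Lucene's code: the class `MergedOrdinalLookup`, its
methods, and the assumption that merged ordinals are the segments' ordinals
concatenated in order (with every segment non-empty) are all hypothetical and
chosen for illustration. Only `Arrays.binarySearch` is a real JDK call.

```java
import java.util.Arrays;

// Hypothetical illustration; not Lucene's VectorValuesMerger.
final class MergedOrdinalLookup {
    // starts[i] is the first merged ordinal that belongs to segment i.
    // Assumes every segment holds at least one vector, so starts is strictly increasing.
    private final int[] starts;

    MergedOrdinalLookup(int[] segmentSizes) {
        starts = new int[segmentSizes.length];
        int sum = 0;
        for (int i = 0; i < segmentSizes.length; i++) {
            starts[i] = sum;
            sum += segmentSizes[i];
        }
    }

    // Index of the segment that contains the given merged ordinal.
    int segmentOf(int mergedOrd) {
        int idx = Arrays.binarySearch(starts, mergedOrd);
        // On a miss, binarySearch returns -(insertionPoint) - 1;
        // the owning segment is the one just before the insertion point.
        return idx >= 0 ? idx : -idx - 2;
    }

    // Ordinal of the vector within its own segment.
    int segmentOrdinal(int mergedOrd) {
        return mergedOrd - starts[segmentOf(mergedOrd)];
    }
}
```

Graph construction reads vectors in an unpredictable order, so under this
scheme every vector access pays for one such search before the vector itself
can be read.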

This change updates the HNSW merging logic to first write the combined segment
vectors to a file, then use that file to build the graph. This helps speed
up segment merging, and also lets us simplify `VectorValuesMerger`, which
provides the merged view of vector values.
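
As a rough sketch of the idea, and not the code in this commit, the example
below writes all segments' vectors back to back into one file and then reads a
vector by merged ordinal with plain offset arithmetic. Lucene does its I/O
through its own directory and input/output abstractions; this sketch uses
`java.nio` instead, and the class `FlatVectorFile`, its methods, and the
fixed-width float layout are assumptions made for illustration.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration; not the Lucene implementation.
final class FlatVectorFile implements AutoCloseable {
    private final FileChannel channel;
    private final int dim;

    private FlatVectorFile(FileChannel channel, int dim) {
        this.channel = channel;
        this.dim = dim;
    }

    // Writes every vector of every segment, in order, as dim consecutive floats.
    // Assumes each vector has exactly dim components.
    static FlatVectorFile write(Path path, float[][][] segments, int dim) throws IOException {
        FileChannel ch = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.READ, StandardOpenOption.WRITE);
        ByteBuffer buf = ByteBuffer.allocate(dim * Float.BYTES);
        for (float[][] segment : segments) {
            for (float[] vector : segment) {
                buf.clear();
                buf.asFloatBuffer().put(vector); // fills buf; buf's own position stays at 0
                while (buf.hasRemaining()) {
                    ch.write(buf);
                }
            }
        }
        return new FlatVectorFile(ch, dim);
    }

    // Random access by merged ordinal: one multiplication, no per-segment lookup.
    float[] vectorValue(int ord) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(dim * Float.BYTES);
        long start = (long) ord * dim * Float.BYTES;
        while (buf.hasRemaining()) {
            if (channel.read(buf, start + buf.position()) < 0) {
                throw new EOFException("ordinal out of range: " + ord);
            }
        }
        buf.flip();
        float[] out = new float[dim];
        buf.asFloatBuffer().get(out);
        return out;
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```

The trade-off in this sketch is one extra sequential pass over the vector data
at write time in exchange for constant-time address computation on every read
during graph construction.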

> Speed up HNSW merge by writing combined vector data
> ---------------------------------------------------
>
>                 Key: LUCENE-10375
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10375
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Julie Tibshirani
>            Priority: Major
>          Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> When merging segments together, the HNSW writer creates a VectorValues 
> instance that gives a merged view of all the segments' VectorValues. This 
> merged instance is used when constructing the new HNSW graph. Graph building 
> needs random access, and the merged VectorValues support this by mapping from 
> merged ordinals -> segments and segment ordinals.
> This mapping seems to add overhead. The nightly indexing benchmarks sometimes 
> show substantial time in Arrays.binarySearch (used to map an ordinal to a 
> segment): 
> https://blunders.io/jfr-demo/indexing-1kb-vectors-2022.01.09.18.03.19/top_down_cpu_samples.
> Instead of using a merged VectorValues to create the graph, maybe we could 
> first write all the segment vectors to a file, and use that file to build the 
> graph.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
