[jira] [Commented] (LUCENE-10592) Should we build HNSW graph on the fly during indexing

Adrien Grand (Jira) Tue, 07 Jun 2022 10:13:11 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-10592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551199#comment-17551199
 ]


Adrien Grand commented on LUCENE-10592:
---------------------------------------

+1

In general I have a preference for "pull" APIs like we have for points and doc 
values: they make it possible, for instance, to iterate over the data twice 
without materializing a temporary representation of it. That said, 
it's indeed bad how indexing is super fast today but flushing is dog slow. It 
creates surprising situations where flushes might get stalled because too many 
flushes are still in progress and the overall indexing rate is very irregular. 
So I'd be supportive of moving to a push API that helps us move more of the 
cost of indexing vectors from flushing to indexing.
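
To make the trade-off concrete, here is a minimal Java sketch of the two API shapes. All type and method names below are hypothetical and are not Lucene APIs; the counters merely stand in for the cost of inserting a vector into the graph, so the sketch only shows where that cost would be paid, not how an HNSW graph is built.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: these types are illustrative and are not Lucene APIs.

/** Pull style: the consumer receives the buffered vectors at flush time and iterates them itself. */
interface PullVectorConsumer {
    void flush(Iterable<float[]> bufferedVectors);
}

/** Push style: the consumer is handed each vector as it is indexed. */
interface PushVectorConsumer {
    void addVector(float[] vector);

    void flush();
}

/** Counts where the simulated per-vector graph work happens. */
final class CostLog {
    int workAtIndexTime;
    int workAtFlushTime;
}

final class PullGraphBuilder implements PullVectorConsumer {
    final CostLog log = new CostLog();

    @Override
    public void flush(Iterable<float[]> bufferedVectors) {
        // A pull API allows several passes over the same data without a temporary copy.
        int count = 0;
        for (float[] ignored : bufferedVectors) {
            count++; // first pass, e.g. to size data structures
        }
        List<float[]> nodes = new ArrayList<>(count);
        for (float[] vector : bufferedVectors) {
            nodes.add(vector); // second pass: stand-in for inserting a node into the graph
            log.workAtFlushTime++;
        }
    }
}

final class PushGraphBuilder implements PushVectorConsumer {
    final CostLog log = new CostLog();
    private final List<float[]> nodes = new ArrayList<>();

    @Override
    public void addVector(float[] vector) {
        nodes.add(vector); // stand-in for inserting a node into the graph
        log.workAtIndexTime++; // the cost is paid as each document is indexed
    }

    @Override
    public void flush() {
        // Only writing out the already-built structure would remain here.
    }
}
```

With the pull shape all per-vector work lands in flush(), which matches the slow-flush behavior described above; with the push shape the same work is spread across addVector() calls, at the price of committing the API to implementations that can build incrementally.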

I guess that one argument against it could be that we're optimizing for one 
particular implementation, and future implementations might better benefit from 
a pull API. I know too little about vector search to have a sense of how likely 
we are to switch to a completely different algorithm in the near future, but in 
my opinion it'd be ok to reconsider the API then since codec APIs are expert.

> Should we build HNSW graph on the fly during indexing
> -----------------------------------------------------
>
>                 Key: LUCENE-10592
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10592
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Mayya Sharipova
>            Assignee: Mayya Sharipova
>            Priority: Minor
>
> Currently, when we index vectors for KnnVectorField, we buffer those vectors 
> in memory and build an HNSW graph at flush, during segment construction. 
> As building an HNSW graph is very expensive, this makes the flush operation 
> take a lot of time. This also makes overall indexing performance quite 
> unpredictable (as the number of flushes is determined by memory usage and the 
> presence of concurrent searches), e.g. some indexing operations return almost 
> instantly while others that trigger a flush take a lot of time. 
> Building an HNSW graph on the fly as we index vectors avoids this 
> problem and spreads the load of HNSW graph construction evenly across indexing.
> This will also supersede LUCENE-10194



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
