[jira] [Commented] (LUCENE-9583) How should we expose VectorValues.RandomAccess?

Julie Tibshirani (Jira) Wed, 10 Nov 2021 15:13:05 -0800


    [ 
https://issues.apache.org/jira/browse/LUCENE-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441993#comment-17441993
 ]


Julie Tibshirani commented on LUCENE-9583:
------------------------------------------

Sorry for the long radio silence on this one! I actually tried removing the 
public RandomAccess interface and got stuck. I was able to rework most 
interfaces so that only HNSW logic needed random access, and the public 
VectorValues interface could drop support. The part that presented a problem 
was merging, which uses a merged view of all the segments' VectorValues. We use 
this merged VectorValues to build the merged HNSW graph, so it needs to support 
random access. But these segments could use any vector format, not 
just HNSW, so I couldn't guarantee they supported random access.
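
To make the constraint concrete, here is a minimal sketch of a merged view 
over per-segment vectors. It is not Lucene code: RandomAccessVectors, 
MergedView, and of() are hypothetical names chosen for illustration, and the 
sketch ignores deleted documents and index sorting. It only shows that a 
lookup by merged ordinal must be delegated to a lookup by ordinal in one 
segment, so every segment has to support random access.

```java
import java.util.List;

/** Hypothetical sketch, not Lucene code: a merged, ordinal-addressable view. */
final class MergedViewSketch {

  /** Assumed per-segment contract: vectors addressable by a local ordinal. */
  interface RandomAccessVectors {
    int size();

    float[] vectorValue(int ord);
  }

  /** Wraps in-memory arrays as a segment, purely for demonstration. */
  static RandomAccessVectors of(float[]... vectors) {
    return new RandomAccessVectors() {
      @Override
      public int size() {
        return vectors.length;
      }

      @Override
      public float[] vectorValue(int ord) {
        return vectors[ord];
      }
    };
  }

  /** Concatenates segments; merged ordinals follow segment order. */
  static final class MergedView implements RandomAccessVectors {
    private final List<RandomAccessVectors> segments;
    // starts[i] is the merged ordinal of segment i's first vector;
    // the last entry is the total number of vectors.
    private final int[] starts;

    MergedView(List<RandomAccessVectors> segments) {
      this.segments = segments;
      this.starts = new int[segments.size() + 1];
      for (int i = 0; i < segments.size(); i++) {
        starts[i + 1] = starts[i] + segments.get(i).size();
      }
    }

    @Override
    public int size() {
      return starts[starts.length - 1];
    }

    @Override
    public float[] vectorValue(int ord) {
      if (ord < 0 || ord >= size()) {
        throw new IndexOutOfBoundsException("ord=" + ord);
      }
      // Find the segment containing ord; empty segments are skipped.
      int seg = 0;
      while (ord >= starts[seg + 1]) {
        seg++;
      }
      // This delegation is why each segment must support random access.
      return segments.get(seg).vectorValue(ord - starts[seg]);
    }
  }
}
```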

[~jpountz] had an idea to first write out the merged VectorValues to a file, 
then build the merged HNSW graph based on the combined VectorValues. This seems 
worth exploring to me. Perhaps it could also save effort while building the 
merged graph, since we wouldn't need to translate from the merged view to the 
individual vector values.
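
A rough sketch of that idea, using only the JDK rather than Lucene's own 
I/O classes: the merged vectors are consumed once through a forward-only 
iterator and written to a flat file, after which any vector can be read back 
by ordinal. The class and method names, and the file layout (dim big-endian 
floats per vector, no header), are assumptions made for illustration only.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch, not Lucene code: spill vectors, then read by ordinal. */
final class MergedVectorSpill {

  /** Writes each vector as dim consecutive floats; returns the vector count. */
  static int writeAll(Iterator<float[]> forwardOnly, int dim, Path file)
      throws IOException {
    ByteBuffer buf = ByteBuffer.allocate(dim * Float.BYTES);
    int count = 0;
    try (FileChannel out = FileChannel.open(file, StandardOpenOption.CREATE,
        StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
      while (forwardOnly.hasNext()) {
        float[] v = forwardOnly.next();
        if (v.length != dim) {
          throw new IllegalArgumentException("expected dimension " + dim);
        }
        buf.clear();
        buf.asFloatBuffer().put(v); // fills buf's bytes; buf's position stays 0
        while (buf.hasRemaining()) {
          out.write(buf);
        }
        count++;
      }
    }
    return count;
  }

  /** Random access over the spilled file: ordinal -> fixed-width offset. */
  static final class RandomAccessVectors implements AutoCloseable {
    private final FileChannel in;
    private final int dim;
    private final int size;
    private final ByteBuffer buf;

    RandomAccessVectors(Path file, int dim, int size) throws IOException {
      this.in = FileChannel.open(file, StandardOpenOption.READ);
      this.dim = dim;
      this.size = size;
      this.buf = ByteBuffer.allocate(dim * Float.BYTES);
    }

    float[] vectorValue(int ord) throws IOException {
      if (ord < 0 || ord >= size) {
        throw new IndexOutOfBoundsException("ord=" + ord);
      }
      long start = (long) ord * dim * Float.BYTES;
      buf.clear();
      while (buf.hasRemaining()) {
        if (in.read(buf, start + buf.position()) < 0) {
          throw new EOFException("truncated vector file");
        }
      }
      buf.flip();
      float[] v = new float[dim];
      buf.asFloatBuffer().get(v);
      return v;
    }

    @Override
    public void close() throws IOException {
      in.close();
    }
  }

  public static void main(String[] args) throws IOException {
    Path file = Files.createTempFile("merged-vectors", ".bin");
    try {
      List<float[]> merged = List.of(
          new float[] {1f, 2f}, new float[] {3f, 4f}, new float[] {5f, 6f});
      int size = writeAll(merged.iterator(), 2, file);
      try (RandomAccessVectors vectors = new RandomAccessVectors(file, 2, size)) {
        System.out.println(Arrays.toString(vectors.vectorValue(2))); // prints [5.0, 6.0]
      }
    } finally {
      Files.deleteIfExists(file);
    }
  }
}
```

With this shape, a graph builder would only see ordinals into the combined 
file, so the source segments would need nothing beyond forward iteration.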

> How should we expose VectorValues.RandomAccess?
> -----------------------------------------------
>
>                 Key: LUCENE-9583
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9583
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: main (10.0)
>            Reporter: Michael Sokolov
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> In the newly-added {{VectorValues}} API, we have a {{RandomAccess}} 
> sub-interface. [~jtibshirani] pointed out this is not needed by some 
> vector-indexing strategies which can operate solely using a forward-iterator 
> (it is needed by HNSW), and so in the interest of simplifying the public API 
> we should not expose this internal detail (which by the way surfaces internal 
> ordinals that are somewhat uninteresting outside the random access API).
> I looked into how to move this inside the HNSW-specific code and remembered 
> that we do also currently make use of the RA API when merging vector fields 
> over sorted indexes. Without it, we would need to load all vectors into RAM 
> while flushing/merging, as we currently do in 
> {{BinaryDocValuesWriter.BinaryDVs}}. I wonder if it's worth paying this cost 
> for the simpler API.
> Another thing I noticed while reviewing this is that I moved the KNN 
> {{search(float[] target, int topK, int fanout)}} method from {{VectorValues}} 
> to {{VectorValues.RandomAccess}}. This I think we could move back, and 
> handle the HNSW requirements for search elsewhere. I wonder if that would 
> alleviate the major concern here? 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
