[jira] [Commented] (LUCENE-9905) Revise approach to specifying NN algorithm

ASF subversion and git services (Jira) Sat, 19 Mar 2022 21:03:07 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-9905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509368#comment-17509368
 ]


ASF subversion and git services commented on LUCENE-9905:
---------------------------------------------------------

Commit a4b30b4cf4c0c9b4de0c27893ea2350af498b1c0 in lucene's branch 
refs/heads/main from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=a4b30b4 ]

LUCENE-9905: Fix check in TestPerFieldKnnVectorsFormat#testMergeUsesNewFormat

Previously, the assertion checked whether two sets were equal, which caused
rare failures. Now we use 'contains' from the Hamcrest matchers.
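
To illustrate the difference between the two kinds of check, here is a minimal
sketch using plain java.util collections rather than Hamcrest. It does not
reproduce the actual test: the class name, method names, and set contents are
all hypothetical, and it assumes (the commit message does not say) that the
observed set can sometimes hold more entries than the expected one.

```java
import java.util.Set;

public class SetCheckSketch {
    // Strict check: true only when both sets hold exactly the same elements.
    static boolean equalSets(Set<String> expected, Set<String> actual) {
        return expected.equals(actual);
    }

    // Looser check: true when 'actual' includes the given element,
    // regardless of what else it holds.
    static boolean containsElement(Set<String> actual, String element) {
        return actual.contains(element);
    }

    public static void main(String[] args) {
        // Hypothetical values; the real test data is not shown in this message.
        Set<String> expected = Set.of("new_format");
        Set<String> actual = Set.of("new_format", "old_format");

        System.out.println(equalSets(expected, actual));           // false
        System.out.println(containsElement(actual, "new_format")); // true
    }
}
```

Under that assumption, an equality check fails whenever an extra entry shows
up, while a membership check only requires the entry of interest.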


> Revise approach to specifying NN algorithm
> ------------------------------------------
>
>                 Key: LUCENE-9905
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9905
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: 9.0
>            Reporter: Julie Tibshirani
>            Priority: Blocker
>             Fix For: 9.0
>
>          Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
> In LUCENE-9322 we decided that the new vectors API shouldn’t assume a 
> particular nearest-neighbor search data structure and algorithm. This 
> flexibility is important since NN search is a developing area and we'd like 
> to be able to experiment and evolve the algorithm. Right now we only have one 
> algorithm (HNSW), but we want to maintain the ability to use another.
>
> Currently the algorithm to use is specified through {{SearchStrategy}}, for 
> example {{SearchStrategy.EUCLIDEAN_HNSW}}. So a single format implementation 
> is expected to handle multiple algorithms. Instead we could have one format 
> implementation per algorithm. Our current implementation would be 
> HNSW-specific like {{HnswVectorFormat}}, and to experiment with another 
> algorithm you could create a new implementation like {{ClusterVectorFormat}}. 
> This would be better aligned with the codec framework, and help avoid 
> exposing algorithm details in the API.
>
> A concrete proposal (note many of these names will change when LUCENE-9855 is 
> addressed):
> # Rename {{Lucene90VectorFormat}} to {{Lucene90HnswVectorFormat}}. Also add 
> HNSW to name of {{Lucene90VectorWriter}} and {{Lucene90VectorReader}}.
> # Remove references to HNSW in {{SearchStrategy}}, so there is just 
> {{SearchStrategy.EUCLIDEAN}}, etc. Rename {{SearchStrategy}} to something 
> like {{SimilarityFunction}}.
> # Remove {{FieldType}} attributes related to HNSW parameters (maxConn and 
> beamWidth). Instead make these arguments to {{Lucene90HnswVectorFormat}}.
> # Introduce {{PerFieldVectorFormat}} to allow a different NN approach or 
> parameters to be configured per-field (?)
> One note: the current HNSW-based format includes logic for storing a numeric 
> vector per document, as well as constructing and storing an HNSW graph. When 
> adding another implementation, it’d be nice to be able to reuse logic for 
> reading/writing numeric vectors. I don’t think we need to design for this 
> right now, but we can keep it in mind for the future?
> This issue is based on a thread [~jpountz] started: 
> [https://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOuQv5y2Vw39%3DXdOuqXGtDbM4qXx5-pmYiB1X4jPEdiFQ%40mail.gmail.com%3E]
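
The four-point proposal quoted above can be sketched in a few lines of
self-contained Java. This is only an illustration of the shape of the idea,
not Lucene's API: every type below is a hypothetical stand-in that borrows
names from the issue text, DOT_PRODUCT and numClusters are assumed for the
example, and the parameter values are arbitrary.

```java
import java.util.HashMap;
import java.util.Map;

public class VectorFormatSketch {
    // Item 2: the similarity is named without any reference to an algorithm.
    // EUCLIDEAN comes from the issue text; DOT_PRODUCT is an assumed extra value.
    enum SimilarityFunction { EUCLIDEAN, DOT_PRODUCT }

    // Hypothetical stand-in for a codec format; not Lucene's real class.
    interface VectorFormat {
        String describe();
    }

    // Items 1 and 3: an HNSW-specific format that takes its tuning
    // parameters as constructor arguments instead of field attributes.
    static final class HnswVectorFormat implements VectorFormat {
        final int maxConn;
        final int beamWidth;

        HnswVectorFormat(int maxConn, int beamWidth) {
            this.maxConn = maxConn;
            this.beamWidth = beamWidth;
        }

        @Override
        public String describe() {
            return "hnsw(maxConn=" + maxConn + ", beamWidth=" + beamWidth + ")";
        }
    }

    // A second, purely hypothetical algorithm with its own parameter.
    static final class ClusterVectorFormat implements VectorFormat {
        final int numClusters;

        ClusterVectorFormat(int numClusters) {
            this.numClusters = numClusters;
        }

        @Override
        public String describe() {
            return "cluster(numClusters=" + numClusters + ")";
        }
    }

    // Item 4: choose a format per field, falling back to a default.
    static final class PerFieldVectorFormat {
        private final Map<String, VectorFormat> byField = new HashMap<>();
        private final VectorFormat defaultFormat;

        PerFieldVectorFormat(VectorFormat defaultFormat) {
            this.defaultFormat = defaultFormat;
        }

        PerFieldVectorFormat register(String field, VectorFormat format) {
            byField.put(field, format);
            return this;
        }

        VectorFormat formatFor(String field) {
            return byField.getOrDefault(field, defaultFormat);
        }
    }

    public static void main(String[] args) {
        // Arbitrary illustrative parameter values and field names.
        PerFieldVectorFormat perField =
            new PerFieldVectorFormat(new HnswVectorFormat(16, 100))
                .register("image_vector", new ClusterVectorFormat(256));

        // The field itself would carry only a similarity, not an algorithm.
        SimilarityFunction similarity = SimilarityFunction.EUCLIDEAN;

        System.out.println(similarity + " / " + perField.formatFor("image_vector").describe());
        System.out.println(similarity + " / " + perField.formatFor("text_vector").describe());
    }
}
```

In this sketch the algorithm and its parameters live entirely in the format
implementation, so trying a different NN approach means registering a
different format rather than changing the field definition.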



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
