[ https://issues.apache.org/jira/browse/LUCENE-9905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509371#comment-17509371 ]
ASF subversion and git services commented on LUCENE-9905:
---------------------------------------------------------

Commit f09b08563074379da2b6f24054ffe7243ba365a1 in lucene's branch refs/heads/branch_9x from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=f09b085 ]

LUCENE-9905: Fix check in TestPerFieldKnnVectorsFormat#testMergeUsesNewFormat

Before the assertion checked if two sets were equal, which resulted in rare
failures. Now we use 'contains' from hamcrest matchers.

> Revise approach to specifying NN algorithm
> ------------------------------------------
>
>                 Key: LUCENE-9905
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9905
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: 9.0
>            Reporter: Julie Tibshirani
>            Priority: Blocker
>             Fix For: 9.0
>
>          Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
> In LUCENE-9322 we decided that the new vectors API shouldn’t assume a
> particular nearest-neighbor search data structure and algorithm. This
> flexibility is important since NN search is a developing area and we'd like
> to be able to experiment and evolve the algorithm. Right now we only have one
> algorithm (HNSW), but we want to maintain the ability to use another.
>
> Currently the algorithm to use is specified through {{SearchStrategy}}, for
> example {{SearchStrategy.EUCLIDEAN_HNSW}}. So a single format implementation
> is expected to handle multiple algorithms. Instead we could have one format
> implementation per algorithm. Our current implementation would be
> HNSW-specific like {{HnswVectorFormat}}, and to experiment with another
> algorithm you could create a new implementation like {{ClusterVectorFormat}}.
> This would be better aligned with the codec framework, and help avoid
> exposing algorithm details in the API.
>
> A concrete proposal (note many of these names will change when LUCENE-9855 is
> addressed):
> # Rename {{Lucene90VectorFormat}} to {{Lucene90HnswVectorFormat}}. Also add
> HNSW to name of {{Lucene90VectorWriter}} and {{Lucene90VectorReader}}.
> # Remove references to HNSW in {{SearchStrategy}}, so there is just
> {{SearchStrategy.EUCLIDEAN}}, etc. Rename {{SearchStrategy}} to something
> like {{SimilarityFunction}}.
> # Remove {{FieldType}} attributes related to HNSW parameters (maxConn and
> beamWidth). Instead make these arguments to {{Lucene90HnswVectorFormat}}.
> # Introduce {{PerFieldVectorFormat}} to allow a different NN approach or
> parameters to be configured per-field (?)
>
> One note: the current HNSW-based format includes logic for storing a numeric
> vector per document, as well as constructing + storing a HNSW graph. When
> adding another implementation, it’d be nice to be able to reuse logic for
> reading/writing numeric vectors. I don’t think we need to design for this
> right now, but we can keep it in mind for the future?
>
> This issue is based on a thread [~jpountz] started:
> [https://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOuQv5y2Vw39%3DXdOuqXGtDbM4qXx5-pmYiB1X4jPEdiFQ%40mail.gmail.com%3E]
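
For illustration only, the shape of the four-step proposal in the issue description can be sketched in plain Java. This is not Lucene's API: the class VectorFormatSketch and every nested type, method, and parameter value in it (VectorFormat, name(), register(), getFormatForField(), numClusters, DOT_PRODUCT, the field names) are hypothetical stand-ins chosen to mirror the names used in the issue text. The sketch assumes one format class per algorithm, HNSW parameters passed as constructor arguments, and a per-field wrapper that routes each field to its own format.

```java
import java.util.HashMap;
import java.util.Map;

public class VectorFormatSketch {

    /** Hypothetical stand-in for the renamed SearchStrategy: it names a similarity only, never an algorithm. */
    enum SimilarityFunction { EUCLIDEAN, DOT_PRODUCT }

    /** Hypothetical, heavily simplified format abstraction: one implementation per NN algorithm. */
    interface VectorFormat {
        String name();
    }

    /** HNSW-specific format. maxConn and beamWidth are constructor arguments rather than FieldType attributes. */
    static final class HnswVectorFormat implements VectorFormat {
        final int maxConn;
        final int beamWidth;

        HnswVectorFormat(int maxConn, int beamWidth) {
            this.maxConn = maxConn;
            this.beamWidth = beamWidth;
        }

        @Override
        public String name() {
            return "Hnsw(maxConn=" + maxConn + ", beamWidth=" + beamWidth + ")";
        }
    }

    /** Placeholder for an experimental alternative algorithm; numClusters is an invented parameter. */
    static final class ClusterVectorFormat implements VectorFormat {
        final int numClusters;

        ClusterVectorFormat(int numClusters) {
            this.numClusters = numClusters;
        }

        @Override
        public String name() {
            return "Cluster(numClusters=" + numClusters + ")";
        }
    }

    /** Routes each field to its own format and falls back to a default for unregistered fields. */
    static final class PerFieldVectorFormat implements VectorFormat {
        private final VectorFormat defaultFormat;
        private final Map<String, VectorFormat> perField = new HashMap<>();

        PerFieldVectorFormat(VectorFormat defaultFormat) {
            this.defaultFormat = defaultFormat;
        }

        /** Registers a format for one field and returns this wrapper so calls can be chained. */
        PerFieldVectorFormat register(String field, VectorFormat format) {
            perField.put(field, format);
            return this;
        }

        VectorFormat getFormatForField(String field) {
            return perField.getOrDefault(field, defaultFormat);
        }

        @Override
        public String name() {
            return "PerField";
        }
    }

    public static void main(String[] args) {
        // The similarity is chosen per field and says nothing about the algorithm.
        Map<String, SimilarityFunction> similarity = new HashMap<>();
        similarity.put("title_vector", SimilarityFunction.EUCLIDEAN);
        similarity.put("body_vector", SimilarityFunction.DOT_PRODUCT);

        // The algorithm and its parameters are chosen through the format.
        PerFieldVectorFormat format = new PerFieldVectorFormat(new HnswVectorFormat(16, 100))
                .register("title_vector", new ClusterVectorFormat(256));

        for (String field : new String[] {"title_vector", "body_vector"}) {
            System.out.println(field + ": " + similarity.get(field)
                    + " via " + format.getFormatForField(field).name());
        }
    }
}
```

In this sketch, trying a different algorithm for one field means registering a different format object for it. The similarity enum and the field definition do not change, which is the separation the proposal appears to be aiming for.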