[ https://issues.apache.org/jira/browse/LUCENE-9905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17359411#comment-17359411 ]
ASF subversion and git services commented on LUCENE-9905:
---------------------------------------------------------

Commit e9339253f5ebcd88282297bdadcbe1705e15f91b in lucene's branch refs/heads/main from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e933925 ]

LUCENE-9905: Make sure to use configured vector format when merging (#176)

Before when creating a VectorWriter for merging, we would always load the
default implementation. So if the format was configured with parameters, they
were ignored. This issue was caught by
`TestKnnGraph#testMergeProducesSameGraph`.

> Revise approach to specifying NN algorithm
> ------------------------------------------
>
>                Key: LUCENE-9905
>                URL: https://issues.apache.org/jira/browse/LUCENE-9905
>            Project: Lucene - Core
>         Issue Type: Improvement
>   Affects Versions: main (9.0)
>           Reporter: Julie Tibshirani
>           Priority: Blocker
>         Time Spent: 5h 40m
> Remaining Estimate: 0h
>
> In LUCENE-9322 we decided that the new vectors API shouldn’t assume a
> particular nearest-neighbor search data structure and algorithm. This
> flexibility is important since NN search is a developing area and we'd like
> to be able to experiment and evolve the algorithm. Right now we only have one
> algorithm (HNSW), but we want to maintain the ability to use another.
> Currently the algorithm to use is specified through {{SearchStrategy}}, for
> example {{SearchStrategy.EUCLIDEAN_HNSW}}. So a single format implementation
> is expected to handle multiple algorithms. Instead we could have one format
> implementation per algorithm. Our current implementation would be
> HNSW-specific like {{HnswVectorFormat}}, and to experiment with another
> algorithm you could create a new implementation like {{ClusterVectorFormat}}.
> This would be better aligned with the codec framework, and help avoid
> exposing algorithm details in the API.
> A concrete proposal (note many of these names will change when LUCENE-9855 is
> addressed):
> # Rename {{Lucene90VectorFormat}} to {{Lucene90HnswVectorFormat}}. Also add
> HNSW to name of {{Lucene90VectorWriter}} and {{Lucene90VectorReader}}.
> # Remove references to HNSW in {{SearchStrategy}}, so there is just
> {{SearchStrategy.EUCLIDEAN}}, etc. Rename {{SearchStrategy}} to something
> like {{SimilarityFunction}}.
> # Remove {{FieldType}} attributes related to HNSW parameters (maxConn and
> beamWidth). Instead make these arguments to {{Lucene90HnswVectorFormat}}.
> # Introduce {{PerFieldVectorFormat}} to allow a different NN approach or
> parameters to be configured per-field (?)
> One note: the current HNSW-based format includes logic for storing a numeric
> vector per document, as well as constructing + storing a HNSW graph. When
> adding another implementation, it’d be nice to be able to reuse logic for
> reading/writing numeric vectors. I don’t think we need to design for this
> right now, but we can keep it in mind for the future?
> This issue is based on a thread [~jpountz] started:
> [https://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOuQv5y2Vw39%3DXdOuqXGtDbM4qXx5-pmYiB1X4jPEdiFQ%40mail.gmail.com%3E]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
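
For readers following the commit message and the proposal in the quoted issue, here is a rough Java sketch of the idea: one format implementation per algorithm, HNSW parameters passed as constructor arguments, a per-field lookup, and a merge path that asks the configured format for its writer instead of building a default one. Everything in it is an assumption made for illustration. `VectorFormatSketch`, `VectorFormat`, `HnswFormat`, `PerFieldFormat`, the `describeWriter` method, the two `mergeWriter...` helpers, and the default values 16/16 are hypothetical stand-ins, not the Lucene classes or the code changed in the commit.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration only; none of these are Lucene classes.
final class VectorFormatSketch {

  /** One format implementation per nearest-neighbor algorithm. */
  interface VectorFormat {
    /** Stands in for "create a writer"; returns a description of the writer. */
    String describeWriter();
  }

  /** HNSW-specific format: maxConn and beamWidth are constructor arguments. */
  static final class HnswFormat implements VectorFormat {
    // Arbitrary defaults chosen for this sketch.
    static final int DEFAULT_MAX_CONN = 16;
    static final int DEFAULT_BEAM_WIDTH = 16;

    final int maxConn;
    final int beamWidth;

    HnswFormat() {
      this(DEFAULT_MAX_CONN, DEFAULT_BEAM_WIDTH);
    }

    HnswFormat(int maxConn, int beamWidth) {
      this.maxConn = maxConn;
      this.beamWidth = beamWidth;
    }

    @Override
    public String describeWriter() {
      return "hnsw(maxConn=" + maxConn + ", beamWidth=" + beamWidth + ")";
    }
  }

  /** Per-field lookup: each field may use its own format or parameters. */
  static final class PerFieldFormat {
    private final Map<String, VectorFormat> byField = new HashMap<>();
    private final VectorFormat fallback = new HnswFormat();

    PerFieldFormat configure(String field, VectorFormat format) {
      byField.put(field, format);
      return this;
    }

    VectorFormat formatFor(String field) {
      return byField.getOrDefault(field, fallback);
    }
  }

  /** The behaviour the commit message describes as the bug: defaults are always used. */
  static String mergeWriterIgnoringConfig(PerFieldFormat config, String field) {
    return new HnswFormat().describeWriter();
  }

  /** The intended behaviour: the merge path uses the configured format. */
  static String mergeWriterUsingConfig(PerFieldFormat config, String field) {
    return config.formatFor(field).describeWriter();
  }

  public static void main(String[] args) {
    PerFieldFormat config =
        new PerFieldFormat().configure("embedding", new HnswFormat(32, 200));
    System.out.println("ignoring config: " + mergeWriterIgnoringConfig(config, "embedding"));
    System.out.println("using config:    " + mergeWriterUsingConfig(config, "embedding"));
  }
}
```

In this sketch the two merge helpers differ only in where the writer comes from, which is the distinction the commit message draws: when the merge path constructs a default format, any configured parameters are silently dropped.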