[ https://issues.apache.org/jira/browse/LUCENE-9905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358832#comment-17358832 ]
ASF subversion and git services commented on LUCENE-9905: --------------------------------------------------------- Commit 05ae738fc90bbacae333b4818d7dc2880f27f3fd in lucene's branch refs/heads/main from Julie Tibshirani [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=05ae738 ] LUCENE-9905: Move HNSW build parameters to codec (#166) Previously, the max connections and beam width parameters could be configured as field type attributes. This PR moves them to be parameters on Lucene90HnswVectorFormat, to avoid exposing details of the vector format implementation in the API. > Revise approach to specifying NN algorithm > ------------------------------------------ > > Key: LUCENE-9905 > URL: https://issues.apache.org/jira/browse/LUCENE-9905 > Project: Lucene - Core > Issue Type: Improvement > Affects Versions: main (9.0) > Reporter: Julie Tibshirani > Priority: Blocker > Time Spent: 4h 40m > Remaining Estimate: 0h > > In LUCENE-9322 we decided that the new vectors API shouldn’t assume a > particular nearest-neighbor search data structure and algorithm. This > flexibility is important since NN search is a developing area and we'd like > to be able to experiment and evolve the algorithm. Right now we only have one > algorithm (HNSW), but we want to maintain the ability to use another. > Currently the algorithm to use is specified through {{SearchStrategy}}, for > example {{SearchStrategy.EUCLIDEAN_HNSW}}. So a single format implementation > is expected to handle multiple algorithms. Instead we could have one format > implementation per algorithm. Our current implementation would be > HNSW-specific like {{HnswVectorFormat}}, and to experiment with another > algorithm you could create a new implementation like {{ClusterVectorFormat}}. > This would be better aligned with the codec framework, and help avoid > exposing algorithm details in the API. > A concrete proposal (note many of these names will change when LUCENE-9855 is > addressed): > # Rename {{Lucene90VectorFormat}} to {{Lucene90HnswVectorFormat}}. Also add > HNSW to name of {{Lucene90VectorWriter}} and {{Lucene90VectorReader}}. > # Remove references to HNSW in {{SearchStrategy}}, so there is just > {{SearchStrategy.EUCLIDEAN}}, etc. Rename {{SearchStrategy}} to something > like {{SimilarityFunction}}. > # Remove {{FieldType}} attributes related to HNSW parameters (maxConn and > beamWidth). Instead make these arguments to {{Lucene90HnswVectorFormat}}. > # Introduce {{PerFieldVectorFormat}} to allow a different NN approach or > parameters to be configured per-field \(?\) > One note: the current HNSW-based format includes logic for storing a numeric > vector per document, as well as constructing + storing a HNSW graph. When > adding another implementation, it’d be nice to be able to reuse logic for > reading/ writing numeric vectors. I don’t think we need to design for this > right now, but we can keep it in mind for the future? > This issue is based on a thread [~jpountz] started: > [https://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOuQv5y2Vw39%3DXdOuqXGtDbM4qXx5-pmYiB1X4jPEdiFQ%40mail.gmail.com%3E] -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org