[ https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xin-Chun Zhang updated LUCENE-9136:
-----------------------------------
    Summary: Introduce IVFFlat to Lucene for ANN similarity search  (was: Introduce IVFFlat for ANN similarity search)

> Introduce IVFFlat to Lucene for ANN similarity search
> -----------------------------------------------------
>
>                 Key: LUCENE-9136
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9136
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Xin-Chun Zhang
>            Priority: Major
>
> Representation learning (RL) has been an established discipline in machine
> learning for decades, but it has drawn tremendous attention lately with the
> emergence of deep learning. The central problem of RL is to determine an
> optimal representation of the input data. Once the data are embedded as
> high-dimensional vectors, vector retrieval (VR) methods are applied to
> search for relevant items.
> With the rapid development of RL over the past few years, the technique has
> been used extensively in industry, from online advertising to computer vision
> and speech recognition. There are many open source implementations of VR
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, giving potential
> users various choices. However, the aforementioned implementations are all
> written in C++, and there is no plan to support a Java interface
> [https://github.com/facebookresearch/faiss/issues/105].
> The algorithms for vector retrieval can be roughly classified into four
> categories:
> # Tree-based algorithms, such as KD-tree;
> # Hashing methods, such as LSH (Locality-Sensitive Hashing);
> # Quantization-based algorithms, such as IVFFlat;
> # Graph-based algorithms, such as HNSW, SSG, NSG.
> IVFFlat and HNSW are the most popular of these algorithms.
> Recently, the implementation of ANN algorithms for Lucene, such as HNSW
> (Hierarchical Navigable Small World, LUCENE-9004), has made great progress.
> IVFFlat has a smaller index size but requires k-means clustering, while HNSW
> is faster at query time but requires extra storage for its graph [indexing 1M
> vectors|https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors].
> Each has its merits and drawbacks. Since HNSW is already under development,
> it may be worth providing IVFFlat as an alternative.
> I will soon commit my own implementation.
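
For readers unfamiliar with IVFFlat, the trade-off described in the issue (a k-means training step in exchange for a compact index) can be illustrated with a minimal Java sketch. This is not the reporter's implementation and not a Lucene API: the class name IvfFlatSketch, its constructor parameters (nlist, iterations, seed) and the nprobe search parameter are all hypothetical names chosen for illustration, and the k-means step is a plain Lloyd iteration with random initialization. The vectors are kept uncompressed ("flat"); the only extra index data are the centroids and one inverted list of vector ids per centroid.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

/** Illustrative IVFFlat sketch; every name here is hypothetical, not a Lucene API. */
public class IvfFlatSketch {
    private final float[][] vectors;          // raw ("flat") vectors, stored uncompressed
    private final float[][] centroids;        // k-means cluster centers (coarse quantizer)
    private final List<List<Integer>> lists;  // inverted lists: centroid index -> vector ids

    public IvfFlatSketch(float[][] vectors, int nlist, int iterations, long seed) {
        this.vectors = vectors;
        int dim = vectors[0].length;
        Random rnd = new Random(seed);
        centroids = new float[nlist][];
        for (int c = 0; c < nlist; c++) {
            // Seed each centroid with a randomly chosen input vector.
            centroids[c] = vectors[rnd.nextInt(vectors.length)].clone();
        }
        int[] assign = new int[vectors.length];
        for (int it = 0; it <= iterations; it++) {
            // Assignment step: map every vector to its nearest centroid.
            for (int i = 0; i < vectors.length; i++) assign[i] = nearestCentroid(vectors[i]);
            if (it == iterations) break; // final pass only refreshes the assignment
            // Update step: move each centroid to the mean of its members.
            float[][] sums = new float[nlist][dim];
            int[] counts = new int[nlist];
            for (int i = 0; i < vectors.length; i++) {
                counts[assign[i]]++;
                for (int d = 0; d < dim; d++) sums[assign[i]][d] += vectors[i][d];
            }
            for (int c = 0; c < nlist; c++) {
                if (counts[c] == 0) continue; // an empty cluster keeps its old centroid
                for (int d = 0; d < dim; d++) centroids[c][d] = sums[c][d] / counts[c];
            }
        }
        lists = new ArrayList<>();
        for (int c = 0; c < nlist; c++) lists.add(new ArrayList<>());
        for (int i = 0; i < vectors.length; i++) lists.get(assign[i]).add(i);
    }

    private static float squaredDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int d = 0; d < a.length; d++) {
            float diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    private int nearestCentroid(float[] v) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(v, centroids[c]) < squaredDistance(v, centroids[best])) best = c;
        }
        return best;
    }

    /** Returns ids of up to k nearest vectors, scanning only the nprobe closest lists. */
    public int[] search(float[] query, int k, int nprobe) {
        Integer[] order = new Integer[centroids.length];
        for (int c = 0; c < order.length; c++) order[c] = c;
        // Rank centroids by distance to the query.
        Arrays.sort(order, Comparator.comparingDouble(
                (Integer c) -> squaredDistance(query, centroids[c])));
        // Gather candidates from the nprobe closest inverted lists.
        List<Integer> candidates = new ArrayList<>();
        for (int p = 0; p < Math.min(nprobe, order.length); p++) {
            candidates.addAll(lists.get(order[p]));
        }
        // Exact distance ranking over the candidates only.
        candidates.sort(Comparator.comparingDouble(
                (Integer id) -> squaredDistance(query, vectors[id])));
        int n = Math.min(k, candidates.size());
        int[] result = new int[n];
        for (int i = 0; i < n; i++) result[i] = candidates.get(i);
        return result;
    }
}
```

In this sketch, setting nprobe equal to nlist scans every list and so degenerates to an exact brute-force search. Smaller nprobe values visit fewer lists and may miss true neighbors, which is where the approximation in ANN comes from.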