[ https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053623#comment-17053623 ]
Xin-Chun Zhang commented on LUCENE-9136: ---------------------------------------- 1. My personal git branch: [https://github.com/irvingzhang/lucene-solr/tree/jira/lucene-9136-ann-ivfflat]. 2. The vector format is as follows, !image-2020-03-07-01-25-58-047.png|width=535,height=297! Structure of IVF index meta is as follows, !image-2020-03-07-01-27-12-859.png|width=606,height=276! Structure of IVF data: !image-2020-03-07-01-22-06-132.png|width=529,height=309! 3. Ann-benchmark tool could be found in: [https://github.com/irvingzhang/ann-benchmarks]. Benchmark results (Single Thread, 2.5GHz * 2CPU, 16GB RAM, nprobe=8,16,32,64,128,256, centroids=4*sqrt(N), where N the size of dataset): 1) Glove-1.2M-25D-Angular: index build + training cost 706s, qps: 18.8~49.6, recall: 76.8%~99.7% !https://intranetproxy.alipay.com/skylark/lark/0/2020/png/35769/1583504416262-89784074-c9dc-4489-99a1-5e4b3c76e5fc.png|width=624,height=430! 2) Glove-1.2M-100D-Angular: index build + training cost 2487s, qps: 12.2~38.3, recall 65.8%~96.3% !https://intranetproxy.alipay.com/skylark/lark/0/2020/png/35769/1583510066130-b4fbcb29-8ad7-4ff2-99ce-c52f7c27826e.png|width=679,height=468! 3) Sift-1M-128D-Euclidean: index build + training cost 2397s, qps 14.8~38.2, recall 71.1%~99.2% !https://intranetproxy.alipay.com/skylark/lark/0/2020/png/35769/1583515010497-20b74f41-72c3-48ce-a929-1cbfbd6a6423.png|width=691,height=476! > Introduce IVFFlat to Lucene for ANN similarity search > ----------------------------------------------------- > > Key: LUCENE-9136 > URL: https://issues.apache.org/jira/browse/LUCENE-9136 > Project: Lucene - Core > Issue Type: New Feature > Reporter: Xin-Chun Zhang > Priority: Major > Attachments: image-2020-03-07-01-22-06-132.png, > image-2020-03-07-01-25-58-047.png, image-2020-03-07-01-27-12-859.png > > Time Spent: 50m > Remaining Estimate: 0h > > Representation learning (RL) has been an established discipline in the > machine learning space for decades but it draws tremendous attention lately > with the emergence of deep learning. The central problem of RL is to > determine an optimal representation of the input data. By embedding the data > into a high dimensional vector, the vector retrieval (VR) method is then > applied to search the relevant items. > With the rapid development of RL over the past few years, the technique has > been used extensively in industry from online advertising to computer vision > and speech recognition. There exist many open source implementations of VR > algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various > choices for potential users. However, the aforementioned implementations are > all written in C++, and no plan for supporting Java interface, making it hard > to be integrated in Java projects or those who are not familier with C/C++ > [[https://github.com/facebookresearch/faiss/issues/105]]. > The algorithms for vector retrieval can be roughly classified into four > categories, > # Tree-base algorithms, such as KD-tree; > # Hashing methods, such as LSH (Local Sensitive Hashing); > # Product quantization based algorithms, such as IVFFlat; > # Graph-base algorithms, such as HNSW, SSG, NSG; > where IVFFlat and HNSW are the most popular ones among all the VR algorithms. > IVFFlat is better for high-precision applications such as face recognition, > while HNSW performs better in general scenarios including recommendation and > personalized advertisement. *The recall ratio of IVFFlat could be gradually > increased by adjusting the query parameter (nprobe), while it's hard for HNSW > to improve its accuracy*. In theory, IVFFlat could achieve 100% recall ratio. > Recently, the implementation of HNSW (Hierarchical Navigable Small World, > LUCENE-9004) for Lucene, has made great progress. The issue draws attention > of those who are interested in Lucene or hope to use HNSW with Solr/Lucene. > As an alternative for solving ANN similarity search problems, IVFFlat is also > very popular with many users and supporters. Compared with HNSW, IVFFlat has > smaller index size but requires k-means clustering, while HNSW is faster in > query (no training required) but requires extra storage for saving graphs > [indexing 1M > vectors|[https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors]]. > Another advantage is that IVFFlat can be faster and more accurate when > enables GPU parallel computing (current not support in Java). Both algorithms > have their merits and demerits. Since HNSW is now under development, it may > be better to provide both implementations (HNSW && IVFFlat) for potential > users who are faced with very different scenarios and want to more choices. > The latest branch is > [*lucene-9136-ann-ivfflat*]([https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat)|https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat] -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org