[ https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019507#comment-17019507 ]

Xin-Chun Zhang edited comment on LUCENE-9136 at 1/21/20 1:40 PM:
-----------------------------------------------------------------

I have worked on this issue for about three to four days, and it now works 
fine for searching.

My personal dev branch is available on GitHub: 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index 
format of IVFFlat (a single meta file with the suffix .ifi) is shown in the 
class Lucene90IvfFlatIndexFormat. In my implementation, the clustering process 
is optimized for the case where the number of vectors per segment is very 
large (e.g. > 200,000): a shuffled subset of the vectors is selected for 
training, which saves both time and memory. The insertion performance of 
IVFFlat is better because it does no extra work on insertion, whereas HNSW 
needs to maintain its graph. However, IVFFlat spends more time in flushing 
because of the k-means clustering.
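
The sampling step can be illustrated with a minimal, self-contained sketch; 
the class name TrainingSampler and the SAMPLE_LIMIT threshold below are 
hypothetical and not taken from the branch:

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative only: train k-means on a shuffled subset of the segment's
// vectors when the segment is large, instead of clustering every vector.
final class TrainingSampler {
  // Hypothetical threshold; the branch may use a different value.
  static final int SAMPLE_LIMIT = 200_000;

  static List<float[]> sampleForTraining(List<float[]> vectors, long seed) {
    if (vectors.size() <= SAMPLE_LIMIT) {
      return vectors; // small segment: train on all vectors
    }
    List<float[]> shuffled = new ArrayList<>(vectors);
    Collections.shuffle(shuffled, new Random(seed)); // random permutation
    return shuffled.subList(0, SAMPLE_LIMIT);        // shuffled subset
  }
}
{code}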

My test cases show that the query performance of IVFFlat is better than that 
of HNSW, even though HNSW uses a cache for its graphs while IVFFlat has no 
cache at all, and its recall is quite high (average query time < 10 ms and 
recall > 96% over a set of 50,000 random vectors with 100 dimensions). My test 
class for IVFFlat is 
[TestKnnIvfFlat|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java], 
under the directory lucene/core/src/test/org/apache/lucene/util/ivfflat/. A 
performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.
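
For context, recall in such tests is typically measured against exact 
brute-force results; a minimal sketch of the metric (not the test code from 
the branch, names hypothetical):

{code:java}
import java.util.Set;

// Illustrative only: recall@k is the fraction of the exact (brute-force)
// top-k results that the approximate search also returned.
final class RecallMetric {
  static double recallAtK(Set<Integer> exactTopK, Set<Integer> approxTopK, int k) {
    long hits = approxTopK.stream().filter(exactTopK::contains).count();
    return (double) hits / k;
  }
}
{code}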

The work is still at an early stage. There must be some bugs that need to be 
fixed, and I would like to hear more comments. Everyone is welcome to 
participate in this issue.



> Introduce IVFFlat to Lucene for ANN similarity search
> -----------------------------------------------------
>
>                 Key: LUCENE-9136
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9136
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Xin-Chun Zhang
>            Priority: Major
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades, but it has drawn tremendous attention 
> lately with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. Once the data are 
> embedded into high-dimensional vectors, vector retrieval (VR) methods can 
> then be applied to search for relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry, from online advertising to computer 
> vision and speech recognition. There are many open source implementations of 
> VR algorithms, such as Facebook's FAISS and Microsoft's SPTAG, which offer 
> various choices to potential users. However, these implementations are all 
> written in C++ with no plan to support a Java interface, which makes them 
> hard to integrate into Java projects and hard to use for those who are not 
> familiar with C/C++ [https://github.com/facebookresearch/faiss/issues/105]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories:
>  # Tree-based algorithms, such as the KD-tree;
>  # Hashing methods, such as LSH (Locality-Sensitive Hashing);
>  # Quantization-based algorithms, such as IVFFlat;
>  # Graph-based algorithms, such as HNSW, SSG, and NSG;
> where IVFFlat and HNSW are the most popular among all the VR algorithms.
> Recently, the implementation of HNSW (Hierarchical Navigable Small World, 
> LUCENE-9004) for Lucene has made great progress. That issue has drawn the 
> attention of those who are interested in Lucene or hope to use HNSW with 
> Solr/Lucene. As an alternative for solving ANN similarity search problems, 
> IVFFlat is also very popular and has many users and supporters. Compared 
> with HNSW, IVFFlat has a smaller index size but requires k-means clustering, 
> while HNSW answers queries faster (no training required) but needs extra 
> storage to save its graphs ([indexing 1M 
> vectors|https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors]).
>  Another advantage is that IVFFlat can be faster and more accurate when GPU 
> parallel computing is enabled (currently not supported in Java). Both 
> algorithms have their merits and demerits. Since HNSW is already under 
> development, it may be better to provide both implementations (HNSW and 
> IVFFlat) for potential users, who face very different scenarios and want 
> more choices.
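
To make the comparison above concrete: at query time IVFFlat probes the few 
centroids nearest to the query and then scans only their inverted lists 
exactly, which is where its recall/latency trade-off comes from. A minimal 
conceptual sketch (all class, method, and parameter names here are 
hypothetical, not taken from the branch or from FAISS):

{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Conceptual sketch of the IVFFlat query flow: probe the nProbe centroids
// nearest to the query, then run an exact ("flat") scan over only the
// vectors assigned to those centroids.
final class IvfFlatSearchSketch {

  /** Returns the ordinals of the (at most) k nearest vectors found. */
  static List<Integer> search(float[] query, float[][] centroids, float[][] vectors,
                              int[] assignment, int nProbe, int k) {
    // 1) rank the centroids by distance to the query
    List<Integer> centroidIds = new ArrayList<>();
    for (int c = 0; c < centroids.length; c++) centroidIds.add(c);
    centroidIds.sort(Comparator.comparingDouble(c -> squaredL2(query, centroids[c])));

    // 2) mark the nProbe closest centroids as probed
    boolean[] probed = new boolean[centroids.length];
    for (int p = 0; p < Math.min(nProbe, centroidIds.size()); p++) {
      probed[centroidIds.get(p)] = true;
    }

    // 3) exact scan restricted to vectors whose centroid was probed
    List<Integer> candidates = new ArrayList<>();
    for (int v = 0; v < vectors.length; v++) {
      if (probed[assignment[v]]) candidates.add(v);
    }
    candidates.sort(Comparator.comparingDouble(v -> squaredL2(query, vectors[v])));
    return candidates.subList(0, Math.min(k, candidates.size()));
  }

  private static double squaredL2(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; sum += d * d; }
    return sum;
  }
}
{code}

Recall and latency are then traded off mainly through nProbe: probing more 
centroids scans more candidate vectors.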


