[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17037727#comment-17037727
 ] 

Xin-Chun Zhang commented on LUCENE-9136:
----------------------------------------

Hi, [~jtibshirani], thanks for your suggestions!

??"I wonder if this clustering-based approach could fit more closely in the 
current search framework. In the current prototype, we keep all the cluster 
information on-heap. We could instead try storing each cluster as its own 
'term' with a postings list. The kNN query would then be modelled as an 'OR' 
over these terms."??

In the previous implementation 
([https://github.com/irvingzhang/lucene-solr/commit/eb5f79ea7a705595821f73f80a0c5752061869b2]),
 the cluster information is divided into two parts – meta (.ifi) and data(.ifd) 
as shown in the following figure, where each cluster with a postings list is 
stored in the data file (.ifd) and not kept on-heap. A major concern of this 
implementation is its reading performance of cluster data since reading is a 
very frequent behavior on kNN search. I will test and check the performance. 

!image-2020-02-16-15-05-02-451.png!

??"Because of this concern, it could be nice to include benchmarks for index 
time (in addition to QPS)..."??

Many thanks! I will check the links you mentioned and consider optimize the 
clustering cost. In addition, more benchmarks will be added soon.

> Introduce IVFFlat to Lucene for ANN similarity search
> -----------------------------------------------------
>
>                 Key: LUCENE-9136
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9136
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Xin-Chun Zhang
>            Priority: Major
>         Attachments: 1581409981369-9dea4099-4e41-4431-8f45-a3bb8cac46c0.png, 
> image-2020-02-16-15-05-02-451.png
>
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades but it draws tremendous attention lately 
> with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. By embedding the data 
> into a high dimensional vector, the vector retrieval (VR) method is then 
> applied to search the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++, and no plan for supporting Java interface, making it hard 
> to be integrated in Java projects or those who are not familier with C/C++  
> [[https://github.com/facebookresearch/faiss/issues/105]]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-base algorithms, such as KD-tree;
>  # Hashing methods, such as LSH (Local Sensitive Hashing);
>  # Product quantization based algorithms, such as IVFFlat;
>  # Graph-base algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular ones among all the VR algorithms.
> IVFFlat is better for high-precision applications such as face recognition, 
> while HNSW performs better in general scenarios including recommendation and 
> personalized advertisement. *The recall ratio of IVFFlat could be gradually 
> increased by adjusting the query parameter (nprobe), while it's hard for HNSW 
> to improve its accuracy*. In theory, IVFFlat could achieve 100% recall ratio. 
> Recently, the implementation of HNSW (Hierarchical Navigable Small World, 
> LUCENE-9004) for Lucene, has made great progress. The issue draws attention 
> of those who are interested in Lucene or hope to use HNSW with Solr/Lucene. 
> As an alternative for solving ANN similarity search problems, IVFFlat is also 
> very popular with many users and supporters. Compared with HNSW, IVFFlat has 
> smaller index size but requires k-means clustering, while HNSW is faster in 
> query (no training required) but requires extra storage for saving graphs 
> [indexing 1M 
> vectors|[https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors]].
>  Another advantage is that IVFFlat can be faster and more accurate when 
> enables GPU parallel computing (current not support in Java). Both algorithms 
> have their merits and demerits. Since HNSW is now under development, it may 
> be better to provide both implementations (HNSW && IVFFlat) for potential 
> users who are faced with very different scenarios and want to more choices.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to