[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057803#comment-17057803
 ] 

Jim Ferenczi commented on LUCENE-9136:
--------------------------------------

> ??I was thinking we could actually reuse the existing `PostingsFormat` and 
>`DocValuesFormat` implementations.??

That's one of the main reasons this approach is interesting for Lucene. The 
main operation at query time is a basic inverted-list search, so it would be 
a shame not to reuse the existing formats that were designed for this 
purpose. In general I think this approach (k-means clustering at index 
time) is very compelling since it's a light layer on top of existing 
functionality. The computational cost is significant (running k-means and 
assigning vectors to centroids), but we can ensure that it remains acceptable 
by capping the number of centroids or by using a hybrid approach with a 
small-world graph like Julie suggested. 
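
The two index-time steps mentioned above (run k-means, then assign each vector to its nearest centroid) can be sketched in plain Java. This is an illustrative sketch only; the class name, the flat `float[][]` layout, and the single assignment pass are assumptions for the example, not details of the actual patch:

```java
// Illustrative sketch of the assignment step of index-time k-means:
// each vector is tagged with the id of its nearest centroid, which is
// what the inverted lists would later be keyed on. Not from the patch.
public class KMeansSketch {

  // Index of the centroid closest to v under squared euclidean distance.
  static int nearest(float[] v, float[][] centroids) {
    int best = 0;
    float bestDist = Float.MAX_VALUE;
    for (int c = 0; c < centroids.length; c++) {
      float d = 0f;
      for (int i = 0; i < v.length; i++) {
        float diff = v[i] - centroids[c][i];
        d += diff * diff;
      }
      if (d < bestDist) {
        bestDist = d;
        best = c;
      }
    }
    return best;
  }

  // One assignment pass: vector i goes to cluster assignment[i].
  static int[] assign(float[][] vectors, float[][] centroids) {
    int[] assignment = new int[vectors.length];
    for (int i = 0; i < vectors.length; i++) {
      assignment[i] = nearest(vectors[i], centroids);
    }
    return assignment;
  }
}
```

At query time the same nearest-centroid computation (over the nprobe closest centroids instead of a single one) would select which inverted lists to scan.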

Regarding the link with the graph-based approach, I wonder what the new ANN 
Codec should expose. If the goal is to provide approximate nearest-neighbor 
capabilities to Lucene, I don't think we want to leak any implementation 
details there.

It's difficult to tell now since both efforts are in the design phase, but I 
think we should aim at something very simple that only exposes an approximate 
nearest-neighbor search. Something like:
{code:java}
interface VectorFormat {
  TopDocs ann(int topN, int maxDocsToVisit);
  float[] getVector(int docID);
}{code}
should be enough. Most of the formats we have in Lucene have sensible defaults 
or compute parameters based on the shape of the data, so I don't think we 
should expose tons of options here. This is another advantage of this approach 
in my opinion, since we can compute the number of centroids needed for each 
segment automatically. Research in this area is also moving fast, so we need 
to remain open to new approaches without having to add a new format every 
time. 
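
As a concrete (and purely hypothetical) example of such an automatically computed parameter, one common rule of thumb is to scale the centroid count with the square root of the segment's vector count, clamped by a cap; none of these numbers are from the patch:

```java
// Hypothetical per-segment default, not from the patch: pick roughly
// sqrt(numVectors) centroids, clamped to a cap so the k-means training
// cost stays bounded as segments grow.
public class CentroidDefaults {
  static int defaultCentroidCount(int numVectors, int maxCentroids) {
    int k = (int) Math.round(Math.sqrt((double) numVectors));
    return Math.max(1, Math.min(k, maxCentroids));
  }
}
```

With a rule like this, a 1M-vector segment would train 1000 centroids while tiny segments still get at least one, so no user-facing option is needed.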

> Actually, we need random access to the vector values! For a typical search 
> engine, we are going to retrieve the best-matched documents after obtaining 
> the top-k docIDs. Retrieving vectors via these docIDs requires random access 
> to the vector values.

You can sort the top-k hits (which should be a small set) by docID and then 
perform the lookups sequentially? That's how we retrieve stored fields for the 
top documents in a normal search. This is again an advantage over the 
graph-based approach, because it is compliant with Lucene's search model, 
which requires forward iteration.
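
That pattern can be sketched as follows; the method is just a stand-in for however the format would feed docIDs to a forward-only vector reader, and is not code from either patch:

```java
import java.util.Arrays;

// Sketch of the suggested lookup pattern: sort the (small) top-k docIDs
// so per-document vector retrieval can advance in increasing-docID order,
// matching how stored fields are fetched for top hits in a normal search.
public class ForwardLookup {
  static int[] forwardOrder(int[] topDocIds) {
    int[] sorted = topDocIds.clone();
    Arrays.sort(sorted);
    return sorted;
  }
}
```

The caller would then walk the sorted array once, advancing a forward-only reader to each docID in turn instead of seeking backwards.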

To move forward on this issue, I'd like to list the things that need 
clarification, in my opinion:
 * Do we need a new VectorFormat that can be shared with the graph-based 
approach?
 ** This decision and the design of the VectorFormat are important to ensure 
that both efforts can move independently. Currently it is not clear whether 
this approach can move forward if the graph-based approach is stalled or needs 
more work. 
I tend to think that having a simple format upfront can drive the decisions we 
make on both approaches, so we should tackle this first.
 * What is the acceptable state for this approach to be considered ready to 
merge?

 ** Lots of optimizations have been mentioned in both issues, but I think we 
should aim for simplicity first. That's the beauty of the k-means approach: 
it's simple to understand and reason about. We should have a first version 
that reuses the internal data formats, since they fit perfectly. I think 
that's what Julie's PR brings here, while leaving room for further 
improvements, like any Lucene feature.
 ** We should decouple the progress here from the progress in the other Lucene 
issue. This is linked to question 1, but I think it's key to moving forward.

In general I feel like the branch proposed by [~irvingzhang] and the additional 
changes by [~jtibshirani] are moving in the right direction. The QPS 
improvements over a brute-force approach are already compelling, as outlined in 
[https://github.com/apache/lucene-solr/pull/1314], so I don't think it will be 
difficult to reach a consensus on whether this would be useful to add to 
Lucene.

> Introduce IVFFlat to Lucene for ANN similarity search
> -----------------------------------------------------
>
>                 Key: LUCENE-9136
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9136
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Xin-Chun Zhang
>            Priority: Major
>         Attachments: glove-100-angular.png, glove-25-angular.png, 
> image-2020-03-07-01-22-06-132.png, image-2020-03-07-01-25-58-047.png, 
> image-2020-03-07-01-27-12-859.png, sift-128-euclidean.png
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades, but it has drawn tremendous attention 
> lately with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. By embedding the data 
> into a high-dimensional vector, vector retrieval (VR) methods can then be 
> applied to search for relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry, from online advertising to computer vision 
> and speech recognition. There exist many open-source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++ with no plans to support a Java interface, making them 
> hard to integrate into Java projects or to use for those who are not 
> familiar with C/C++ 
> [[https://github.com/facebookresearch/faiss/issues/105]]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-based algorithms, such as KD-tree;
>  # Hashing methods, such as LSH (Locality-Sensitive Hashing);
>  # Product quantization based algorithms, such as IVFFlat;
>  # Graph-based algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular ones among all the VR algorithms.
> IVFFlat is better for high-precision applications such as face recognition, 
> while HNSW performs better in general scenarios including recommendation and 
> personalized advertisement. *The recall ratio of IVFFlat can be gradually 
> increased by adjusting the query parameter (nprobe), while it is hard for 
> HNSW to improve its accuracy*. In theory, IVFFlat can achieve a 100% recall 
> ratio. 
> Recently, the implementation of HNSW (Hierarchical Navigable Small World, 
> LUCENE-9004) for Lucene has made great progress. The issue draws the 
> attention of those who are interested in Lucene or hope to use HNSW with 
> Solr/Lucene. As an alternative for solving ANN similarity-search problems, 
> IVFFlat is also very popular, with many users and supporters. Compared with 
> HNSW, IVFFlat has a smaller index size but requires k-means clustering, 
> while HNSW is faster at query time (no training required) but requires extra 
> storage for saving graphs [indexing 1M 
> vectors|https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors].
>  Another advantage is that IVFFlat can be faster and more accurate when GPU 
> parallel computing is enabled (currently not supported in Java). Both 
> algorithms have their merits and demerits. Since HNSW is now under 
> development, it may be better to provide both implementations (HNSW and 
> IVFFlat) for potential users who face very different scenarios and want 
> more choices.
> The latest branch is 
> [*lucene-9136-ann-ivfflat*|https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
