[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-03-05 Thread Xin-Chun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052727#comment-17052727
 ] 

Xin-Chun Zhang edited comment on LUCENE-9136 at 3/6/20, 3:34 AM:
-

Hi, [~jtibshirani], thanks for your excellent work!

??I was thinking we could actually reuse the existing `PostingsFormat` and 
`DocValuesFormat` implementations.??

Yes, the code could be simpler if we reused these formats. But I agree with 
[~tomoko] that since ANN search is a pretty new feature for Lucene, it's better 
to use a dedicated format for maintainability reasons. Moreover, if we are going 
to use a dedicated vector format for HNSW, that format should also be applied to 
IVFFlat, because IVFFlat and HNSW serve the same purpose of ANN search. It may 
be strange to users if IVFFlat and HNSW behave completely differently.

 

??In particular, it doesn’t require random access for doc values, they are only 
accessed through forward iteration.??

Actually, we do need random access to the vector values! In a typical search 
engine, we retrieve the best-matched documents after obtaining the top-K 
docIDs, and fetching the vectors for those docIDs requires random access to the 
vector values.
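
To make the access pattern concrete, here is a minimal sketch (the reader 
interface is hypothetical, not an existing Lucene API): the candidate docIDs 
arrive ranked by score, not by docID, so the vector reader must support 
arbitrary seeks.

{code:java}
import java.io.IOException;

/** Hypothetical reader: fetches the raw vector of any docID in any order. */
interface RandomAccessVectorReader {
  float[] vectorValue(int docID) throws IOException;
}

final class Reranker {
  /** Re-scores the top-K candidates. Since docIDs come in score order,
   *  vectorValue() is called with essentially random docIDs. */
  static int bestOf(int[] topKDocIDs, float[] query,
                    RandomAccessVectorReader reader) throws IOException {
    int bestDoc = -1;
    float best = Float.NEGATIVE_INFINITY;
    for (int docID : topKDocIDs) {
      float[] v = reader.vectorValue(docID); // random access, not iteration
      float score = dot(query, v);
      if (score > best) { best = score; bestDoc = docID; }
    }
    return bestDoc;
  }

  private static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
    return sum;
  }
}
{code}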


was (Author: irvingzhang):
Hi, [~jtibshirani], thanks for your excellent work!

??I was thinking we could actually reuse the existing `PostingsFormat` and 
`DocValuesFormat` implementations.??

Yes, the code could be simpler if we reused these formats. But I agree with 
[~tomoko] that since ANN search is a pretty new feature for Lucene, it's better 
to use a dedicated format for maintainability reasons. Moreover, if we are going 
to use a dedicated vector format for HNSW, it could also be applied to IVFFlat, 
because IVFFlat and HNSW serve the same purpose of ANN search. It may be strange 
to users if IVFFlat and HNSW behave completely differently.

 

??In particular, it doesn’t require random access for doc values, they are only 
accessed through forward iteration.??

Actually, we do need random access to the vector values! In a typical search 
engine, we retrieve the best-matched documents after obtaining the top-K 
docIDs, and fetching the vectors for those docIDs requires random access to the 
vector values.

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
> Attachments: 1581409981369-9dea4099-4e41-4431-8f45-a3bb8cac46c0.png, 
> image-2020-02-16-15-05-02-451.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades, but it has drawn tremendous attention 
> lately with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. By embedding the data 
> into high-dimensional vectors, vector retrieval (VR) methods can then be 
> applied to search for relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++ with no plan to support a Java interface, which makes them 
> hard to integrate into Java projects and hard to use for those who are not 
> familiar with C/C++ [[https://github.com/facebookresearch/faiss/issues/105]]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-based algorithms, such as the KD-tree;
>  # Hashing methods, such as LSH (Locality-Sensitive Hashing);
>  # Product quantization based algorithms, such as IVFFlat;
>  # Graph-based algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular ones among all the VR algorithms.
> IVFFlat is better for high-precision applications such as face recognition, 
> while HNSW performs better in general scenarios including recommendation and 
> personalized advertisement. *The recall ratio of IVFFlat could be gradually 
> increased by adjusting the query parameter (nprobe), while it's hard for HNSW 
> to improve its accuracy*. In theory, IVFFlat could achieve 100% recall ratio. 
> Recently, the implementation of HNSW (Hierarchical Navigable Small World, 
> LUCENE-9004) for Lucene has made great progress. The issue has drawn the 
> attention of those who are interested in Lucene or hope to use HNSW with 
> Solr/Lucene. As an alternative for solving ANN similarity search problems, 
> this issue proposes introducing IVFFlat.

[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-03-03 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050663#comment-17050663
 ] 

Julie Tibshirani edited comment on LUCENE-9136 at 3/4/20 12:39 AM:
---

Hello [~irvingzhang], thanks for the updates and for trying out the idea. I was 
thinking we could actually reuse the existing `PostingsFormat` and 
`DocValuesFormat` implementations. This would have a few advantages:
* It simplifies the implementation, since we avoid duplicating a good chunk of 
codec writing + reading logic.
* I agree with your point that we don’t expect the cluster information to take 
up too much memory, since it just contains a map from centroid to the IDs of 
documents that belong to the cluster. I think there are still benefits to 
keeping the main data structures off-heap -- we’d be better able to scale to 
large numbers of documents, especially when multiple vector fields are defined 
at once. There is also no ‘loading’ step, where the data must be read into an 
on-heap data structure before it can be used in queries.

I created this draft PR to sketch out the approach: 
https://github.com/apache/lucene-solr/pull/1314. I’m looking forward to hearing 
your thoughts! I included some benchmarks, and the QPS looks okay. The indexing 
time is ~1 hour for 1 million points -- as we discussed earlier this is a major 
concern, and it will be important to think about how it can be improved.
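
For illustration, the reuse could look roughly like the sketch below at indexing 
time (the field names and the plain float32 encoding are assumptions for this 
example, not necessarily what the draft PR does): the raw vector goes into 
binary doc values, which the codec keeps off-heap, and the cluster assignment 
becomes an ordinary indexed term with a postings list.

{code:java}
import java.nio.ByteBuffer;

import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.util.BytesRef;

/** Illustrative mapping of IVFFlat data onto existing Lucene structures. */
final class IvfDocBuilder {
  static Document build(float[] vector, int centroidId) {
    ByteBuffer buf = ByteBuffer.allocate(vector.length * Float.BYTES);
    for (float v : vector) buf.putFloat(v);
    Document doc = new Document();
    // stored off-heap by the existing DocValuesFormat
    doc.add(new BinaryDocValuesField("vector", new BytesRef(buf.array())));
    // one postings list per centroid via the existing PostingsFormat
    doc.add(new StringField("centroid", Integer.toString(centroidId), Field.Store.NO));
    return doc;
  }
}
{code}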


was (Author: jtibshirani):
Hello [~irvingzhang], thanks for the updates and for trying out the idea. I was 
thinking we could actually reuse the existing `PostingsFormat` and 
`DocValuesFormat` implementations. This would have a few advantages:
* It simplifies the implementation, since we avoid duplicating a good chunk of 
codec writing + reading logic.
* I agree with your point that we don’t expect the cluster information to take 
up too much memory, since it just contains a map from centroid to the IDs of 
documents that belong to the cluster. I think there are still benefits to 
keeping the main data structures off-heap -- we’d be better able to scale to 
large numbers of documents, especially when multiple vector fields are defined 
at once. There is also no ‘loading’ step, where the data must be read into an 
on-heap data structure before it can be used in queries.

I created this draft PR to sketch out the approach: 
https://github.com/apache/lucene-solr/pull/1314. I’m looking forward to hearing 
your thoughts! I included some benchmarks, and the QPS looks okay. The indexing 
time is ~1 hour for 1 million points -- as we discussed earlier this is quite 
high, and it will be important to think about how it can be improved.


[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-03-03 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050663#comment-17050663
 ] 

Julie Tibshirani edited comment on LUCENE-9136 at 3/4/20 12:38 AM:
---

Hello [~irvingzhang], thanks for the updates and for trying out the idea. I was 
thinking we could actually reuse the existing `PostingsFormat` and 
`DocValuesFormat` implementations. This would have a few advantages:
* It simplifies the implementation, since we avoid duplicating a good chunk of 
codec writing + reading logic.
* I agree with your point that we don’t expect the cluster information to take 
up too much memory, since it just contains a map from centroid to the IDs of 
documents that belong to the cluster. I think there are still benefits to 
keeping the main data structures off-heap -- we’d be better able to scale to 
large numbers of documents, especially when multiple vector fields are defined 
at once. There is also no ‘loading’ step, where the data must be read into an 
on-heap data structure before it can be used in queries.

I created this draft PR to sketch out the idea: 
https://github.com/apache/lucene-solr/pull/1314. I’m looking forward to hearing 
your thoughts! I included some benchmarks, and the QPS looks okay. The indexing 
time is ~1 hour for 1 million points -- as we discussed earlier this is quite 
high, and it will be important to think about how it can be improved.


was (Author: jtibshirani):
Hello @Xin-Chun Zhang, thanks for the updates and for trying out the idea. I 
was thinking we could actually reuse the existing `PostingsFormat` and 
`DocValuesFormat` implementations. This would have a few advantages:
* It simplifies the implementation, since we avoid duplicating a good chunk of 
codec writing + reading logic.
* I agree with your point that we don’t expect the cluster information to take 
up too much memory, since it just contains a map from centroid to the IDs of 
documents that belong to the cluster. I think there are still benefits to 
keeping the main data structures off-heap -- we’d be better able to scale to 
large numbers of documents, especially when multiple vector fields are defined 
at once. There is also no ‘loading’ step, where the data must be read into an 
on-heap data structure before it can be used in queries.

I created this draft PR to sketch out the idea: 
https://github.com/apache/lucene-solr/pull/1314. I’m looking forward to hearing 
your thoughts! I included some benchmarks, and the QPS looks okay. The indexing 
time is ~1 hour for 1 million points -- as we discussed earlier this is quite 
high, and it will be important to think about how it can be improved.


[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-03-03 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050663#comment-17050663
 ] 

Julie Tibshirani edited comment on LUCENE-9136 at 3/4/20 12:38 AM:
---

Hello [~irvingzhang], thanks for the updates and for trying out the idea. I was 
thinking we could actually reuse the existing `PostingsFormat` and 
`DocValuesFormat` implementations. This would have a few advantages:
* It simplifies the implementation, since we avoid duplicating a good chunk of 
codec writing + reading logic.
* I agree with your point that we don’t expect the cluster information to take 
up too much memory, since it just contains a map from centroid to the IDs of 
documents that belong to the cluster. I think there are still benefits to 
keeping the main data structures off-heap -- we’d be better able to scale to 
large numbers of documents, especially when multiple vector fields are defined 
at once. There is also no ‘loading’ step, where the data must be read into an 
on-heap data structure before it can be used in queries.

I created this draft PR to sketch out the approach: 
https://github.com/apache/lucene-solr/pull/1314. I’m looking forward to hearing 
your thoughts! I included some benchmarks, and the QPS looks okay. The indexing 
time is ~1 hour for 1 million points -- as we discussed earlier this is quite 
high, and it will be important to think about how it can be improved.


was (Author: jtibshirani):
Hello [~irvingzhang], thanks for the updates and for trying out the idea. I was 
thinking we could actually reuse the existing `PostingsFormat` and 
`DocValuesFormat` implementations. This would have a few advantages:
* It simplifies the implementation, since we avoid duplicating a good chunk of 
codec writing + reading logic.
* I agree with your point that we don’t expect the cluster information to take 
up too much memory, since it just contains a map from centroid to the IDs of 
documents that belong to the cluster. I think there are still benefits to 
keeping the main data structures off-heap -- we’d be better able to scale to 
large numbers of documents, especially when multiple vector fields are defined 
at once. There is also no ‘loading’ step, where the data must be read into an 
on-heap data structure before it can be used in queries.

I created this draft PR to sketch out the idea: 
https://github.com/apache/lucene-solr/pull/1314. I’m looking forward to hearing 
your thoughts! I included some benchmarks, and the QPS looks okay. The indexing 
time is ~1 hour for 1 million points -- as we discussed earlier this is quite 
high, and it will be important to think about how it can be improved.


[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-02-24 Thread Xin-Chun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17037727#comment-17037727
 ] 

Xin-Chun Zhang edited comment on LUCENE-9136 at 2/24/20 10:01 AM:
--

Hi, [~jtibshirani], thanks for your suggestions!

??"I wonder if this clustering-based approach could fit more closely in the 
current search framework. In the current prototype, we keep all the cluster 
information on-heap. We could instead try storing each cluster as its own 
'term' with a postings list. The kNN query would then be modelled as an 'OR' 
over these terms."??

In the previous implementation 
([https://github.com/irvingzhang/lucene-solr/commit/eb5f79ea7a705595821f73f80a0c5752061869b2]),
 the cluster information is divided into two parts – meta (.ifi) and data (.ifd) 
– as shown in the following figure, where each cluster's postings list is stored 
in the data file (.ifd) and not kept on-heap. A major concern with this 
implementation is the read performance for cluster data, since reads are very 
frequent in kNN search. I will test and check the performance. 

!image-2020-02-16-15-05-02-451.png!
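
For concreteness, reading a probed cluster's postings on demand might look 
roughly like the following sketch (the file layout, delta encoding, and names 
are assumptions based on the meta/data split described above, not the actual 
branch code):

{code:java}
import java.io.IOException;
import java.util.Map;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

/** Sketch of the meta/data split: the .ifi meta file yields a
 *  centroid-to-offset map at open time; a cluster's docIDs are read from
 *  the .ifd data file only when that cluster is actually probed. */
final class ClusterReader {
  private final Map<Integer, Long> centroidToOffset; // loaded from .ifi
  private final IndexInput data;                     // the .ifd file

  ClusterReader(Directory dir, String dataFileName,
                Map<Integer, Long> centroidToOffset) throws IOException {
    this.data = dir.openInput(dataFileName, IOContext.READ);
    this.centroidToOffset = centroidToOffset;
  }

  int[] readDocIDs(int centroid) throws IOException {
    data.seek(centroidToOffset.get(centroid)); // one random seek per probe
    int size = data.readVInt();
    int[] docIDs = new int[size];
    int doc = 0;
    for (int i = 0; i < size; i++) {
      doc += data.readVInt(); // docIDs stored as deltas
      docIDs[i] = doc;
    }
    return docIDs;
  }
}
{code}

Every probe pays one seek plus a sequential read, which is why the read 
performance of the data file is the main thing to measure.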

??"Because of this concern, it could be nice to include benchmarks for index 
time (in addition to QPS)..."??

Many thanks! I will check the links you mentioned and consider optimizing the 
clustering cost. In addition, more benchmarks will be added soon.

 
h2. *UPDATE – Feb. 24, 2020*

I have added a new implementation of the IVF index, marked as V2 under the 
package org.apache.lucene.codecs.lucene90. In the current implementation, the 
IVF index is divided into two files with suffixes .ifi and .ifd, respectively. 
The .ifd file is read only when cluster information is needed. The experiments 
were conducted on the sift1M dataset (test code: 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/KnnIvfPerformTester.java]);
 detailed results are as follows,
 # add document -- 3921 ms;
 # commit -- 3912286 ms (mainly spent on k-means training: 10 iterations, 4000 
centroids, 512,000 vectors used for training in total);
 # R@100 search time and recall ratio are listed in the following table

 
||nprobe||avg. search time (ms)||recall ratio (%)||
|8|28.0755|44.154|
|16|27.1745|57.9945|
|32|32.986|71.7003|
|64|40.4082|83.50471|
|128|50.9569|92.07929|
|256|73.923|97.150894|

 Compared with the on-heap implementation of the IVF index, the query time 
increases significantly (22%~71%). Actually, the IVF index consists of unique 
docIDs and will not take up too much memory. *There is still a small debate 
about whether to keep the cluster information on-heap or not. I hope to hear 
more suggestions.*
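
For intuition, the query path behind the nprobe knob looks like the following 
self-contained sketch (in-memory arrays stand in for the .ifd postings and the 
vector storage; all names are hypothetical): raising nprobe scans more clusters 
exhaustively, which is exactly why recall climbs toward 100% at the price of 
search time.

{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Sketch of IVFFlat search: the coarse step ranks centroids, the fine step
 *  scans the nprobe nearest clusters exhaustively ("flat"). */
final class IvfFlatSearchSketch {
  static List<Integer> search(float[] query, int k, int nprobe,
                              float[][] centroids, int[][] clusterDocs,
                              float[][] vectorByDoc) {
    // 1. Coarse quantizer: order centroids by distance to the query.
    List<Integer> order = new ArrayList<>();
    for (int c = 0; c < centroids.length; c++) order.add(c);
    order.sort(Comparator.comparingDouble(c -> l2(query, centroids[c])));

    // 2. Flat scan inside the nprobe nearest clusters, keeping the k closest.
    PriorityQueue<double[]> top = // max-heap on distance: peek() is the worst
        new PriorityQueue<>((a, b) -> Double.compare(b[0], a[0]));
    for (int i = 0; i < Math.min(nprobe, order.size()); i++) {
      for (int docID : clusterDocs[order.get(i)]) {
        top.offer(new double[] {l2(query, vectorByDoc[docID]), docID});
        if (top.size() > k) top.poll();
      }
    }
    List<Integer> result = new ArrayList<>();
    while (!top.isEmpty()) result.add(0, (int) top.poll()[1]); // closest first
    return result;
  }

  private static double l2(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; sum += d * d; }
    return sum;
  }
}
{code}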

 

 


was (Author: irvingzhang):
Hi, [~jtibshirani], thanks for your suggestions!

??"I wonder if this clustering-based approach could fit more closely in the 
current search framework. In the current prototype, we keep all the cluster 
information on-heap. We could instead try storing each cluster as its own 
'term' with a postings list. The kNN query would then be modelled as an 'OR' 
over these terms."??

In the previous implementation 
([https://github.com/irvingzhang/lucene-solr/commit/eb5f79ea7a705595821f73f80a0c5752061869b2]),
 the cluster information is divided into two parts – meta (.ifi) and data (.ifd) 
– as shown in the following figure, where each cluster's postings list is stored 
in the data file (.ifd) and not kept on-heap. A major concern with this 
implementation is the read performance for cluster data, since reads are very 
frequent in kNN search. I will test and check the performance. 

!image-2020-02-16-15-05-02-451.png!

??"Because of this concern, it could be nice to include benchmarks for index 
time (in addition to QPS)..."??

Many thanks! I will check the links you mentioned and consider optimizing the 
clustering cost. In addition, more benchmarks will be added soon.


[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-02-14 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17037325#comment-17037325
 ] 

Julie Tibshirani edited comment on LUCENE-9136 at 2/15/20 1:44 AM:
---

Hello [~irvingzhang], to me this looks like a really interesting direction! We 
also found in our research that k-means clustering (IVFFlat) could achieve high 
recall with a relatively low number of distance computations. It performs well 
compared to KD-trees and LSH, although it tends to require substantially more 
distance computations than HNSW. A nice property of the approach is that it's 
based on a classic algorithm, k-means – it is easy to understand, and has few 
tuning parameters.

I wonder if this clustering-based approach could fit more closely in the 
current search framework. In the current prototype, we keep all the cluster 
information on-heap. We could instead try storing each cluster as its own 
'term' with a postings list. The kNN query would then be modelled as an 'OR' 
over these terms.
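
As a rough sketch with existing query machinery (the field name and the 
centroid-to-term encoding here are illustrative assumptions), the kNN query 
over the closest centroids could be built like this:

{code:java}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

/** Sketch: the kNN query becomes a disjunction over the terms of the
 *  centroids nearest to the query vector; each term's postings list
 *  enumerates the documents of one cluster. */
final class CentroidOrQuery {
  static Query build(int[] nearestCentroidIds) {
    BooleanQuery.Builder builder = new BooleanQuery.Builder();
    for (int id : nearestCentroidIds) {
      builder.add(new TermQuery(new Term("centroid", Integer.toString(id))),
                  BooleanClause.Occur.SHOULD);
    }
    return builder.build();
  }
}
{code}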

A major concern about clustering-based approaches is the high indexing cost. 
K-means is a heavy operation in itself. And even if we only use a subsample of 
documents during k-means, we must compare each indexed document to all 
centroids to assign it to the right cluster. With the heuristic of using 
sqrt\(n) centroids, this could give poor scaling behavior when indexing large 
segments. Because of this concern, it could be nice to include benchmarks for 
index time (in addition to QPS). A couple more thoughts on this point:
 * FAISS helps address the concern by using an ANN algorithm to do the cluster 
assignment. In particular, it provides an option to use k-means clustering 
(IVFFlat), but do the cluster assignment using HNSW: 
[https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index#how-big-is-the-dataset].
 This seemed like a potentially interesting direction.
 * There could also be ways to streamline the k-means step. As an example, I 
experimented with FAISS's implementation of IVFFlat, and found that I could run 
very few k-means iterations, but still achieve similar performance. Here are 
some results on a dataset of ~1.2 million GloVe word vectors, using sqrt\(n) 
centroids. The cell values represent recall for a kNN search with k=10:

*{{approach          10 probes  20 probes  100 probes  200 probes}}*
 {{random centroids  0.578      0.68       0.902       0.961}}
 {{k-means, 1 iter   0.729      0.821      0.961       0.987}}
 {{k-means, 2 iters  0.775      0.854      0.968       0.989}}
 {{k-means, 20 iters 0.806      0.875      0.972       0.991}}
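
To make the cost structure behind these numbers explicit, here is a minimal 
k-means sketch (illustrative only, not FAISS's implementation): each iteration 
costs one full assignment pass of n * k distance computations, so stopping 
after one or two iterations removes most of the training time while, per the 
table above, recall degrades only modestly.

{code:java}
import java.util.Random;

/** Minimal k-means: random initial centroids (the "random centroids" row),
 *  then `iters` rounds of assign-and-average. */
final class KMeansSketch {
  static float[][] train(float[][] points, int k, int iters, long seed) {
    Random rnd = new Random(seed);
    int dim = points[0].length;
    float[][] centroids = new float[k][];
    for (int c = 0; c < k; c++) {
      centroids[c] = points[rnd.nextInt(points.length)].clone();
    }
    for (int it = 0; it < iters; it++) {
      float[][] sums = new float[k][dim];
      int[] counts = new int[k];
      for (float[] p : points) {               // assignment: n * k distances
        int best = nearest(p, centroids);
        counts[best]++;
        for (int d = 0; d < dim; d++) sums[best][d] += p[d];
      }
      for (int c = 0; c < k; c++) {            // update: move to cluster mean
        if (counts[c] > 0) {
          for (int d = 0; d < dim; d++) centroids[c][d] = sums[c][d] / counts[c];
        }
      }
    }
    return centroids;
  }

  static int nearest(float[] p, float[][] centroids) {
    int best = 0;
    double bestDist = Double.MAX_VALUE;
    for (int c = 0; c < centroids.length; c++) {
      double sum = 0;
      for (int d = 0; d < p.length; d++) {
        double diff = p[d] - centroids[c][d];
        sum += diff * diff;
      }
      if (sum < bestDist) { bestDist = sum; best = c; }
    }
    return best;
  }
}
{code}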

 


was (Author: jtibshirani):
Hello [~irvingzhang], to me this looks like a really interesting direction! We 
also found in our research that k-means clustering (IVFFlat) could achieve high 
recall with a relatively low number of distance computations. It performs well 
compared to KD-trees and LSH, although it tends to require substantially more 
distance computations than HNSW. A nice property of the approach is that it's 
based on a classic algorithm, k-means – it is easy to understand, and has few 
tuning parameters.

I wonder if this clustering-based approach could fit more closely in the 
current search framework. In the current prototype, we keep all the cluster 
information on-heap. We could instead try storing each cluster as its own 
'term' with a postings list. The kNN query would then be modelled as an 'OR' 
over these terms.

A major concern about clustering-based approaches is the high indexing cost. 
K-means is a heavy operation in itself. And even if we only use a subsample of 
documents during k-means, we must compare each indexed document to all 
centroids to assign it to the right cluster. With the heuristic of using 
sqrt\(n) centroids, this could give poor scaling behavior at index time. A 
couple thoughts on this point:
 * FAISS helps address this concern by using an ANN algorithm to do the cluster 
assignment. In particular, it provides an option to use k-means clustering 
(IVFFlat), but do the cluster assignment using HNSW: 
[https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index#how-big-is-the-dataset].
 This seemed like a potentially interesting direction.
 * There could also be ways to streamline the k-means step. I experimented with 
FAISS's implementation of IVFFlat, and found that I could run very few k-means 
iterations, but still achieve similar performance. Here are some results on a 
dataset of ~1.2 million GloVe word vectors, using sqrt\(n) centroids. The cell 
values represent recall for a kNN search with k=10:

*{{approach          10 probes  20 probes  100 probes  200 probes}}*
 {{random centroids  0.578      0.68       0.902       0.961}}
 {{k-means, 1 iter   0.729      0.821      0.961       0.987}}
 {{k-means, 2 iters  0.775      0.854      0.968       0.989}}
 {{k-means, 20 iters 0.806      0.875      0.972       0.991}}

[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-02-14 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17037325#comment-17037325
 ] 

Julie Tibshirani edited comment on LUCENE-9136 at 2/14/20 10:04 PM:


Hello [~irvingzhang], to me this looks like a really interesting direction! We 
also found in our research that k-means clustering (IVFFlat) could achieve high 
recall with a relatively low number of distance computations. It performs well 
compared to KD-trees and LSH, although it tends to require substantially more 
distance computations than HNSW. A nice property of the approach is that it's 
based on a classic algorithm, k-means – it is easy to understand, and has few 
tuning parameters.

I wonder if this clustering-based approach could fit more closely in the 
current search framework. In the current prototype, we keep all the cluster 
information on-heap. We could instead try storing each cluster as its own 
'term' with a postings list. The kNN query would then be modelled as an 'OR' 
over these terms.

A major concern about clustering-based approaches is the high indexing cost. 
K-means is a heavy operation in itself. And even if we only use a subsample of 
documents during k-means, we must compare each indexed document to all 
centroids to assign it to the right cluster. With the heuristic of using 
sqrt\(n) centroids, this could give poor scaling behavior at index time. A 
couple thoughts on this point:
 * FAISS helps address this concern by using an ANN algorithm to do the cluster 
assignment. In particular, it provides an option to use k-means clustering 
(IVFFlat), but do the cluster assignment using HNSW: 
[https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index#how-big-is-the-dataset].
 This seemed like a potentially interesting direction.
 * There could also be ways to streamline the k-means step. I experimented with 
FAISS's implementation of IVFFlat, and found that I could run very few k-means 
iterations, but still achieve similar performance. Here are some results on a 
dataset of ~1.2 million GloVe word vectors, using sqrt\(n) centroids. The cell 
values represent recall for a kNN search with k=10:

*{{approach          10 probes  20 probes  100 probes  200 probes}}*
 {{random centroids  0.578      0.68       0.902       0.961}}
 {{k-means, 1 iter   0.729      0.821      0.961       0.987}}
 {{k-means, 2 iters  0.775      0.854      0.968       0.989}}
 {{k-means, 20 iters 0.806      0.875      0.972       0.991}}

 


was (Author: jtibshirani):
Hello [~irvingzhang], to me this looks like a really interesting direction! We 
also found in our research that k-means clustering (IVFFlat) could achieve high 
recall with a relatively low number of distance computations. It performs well 
compared to KD-trees and LSH, although it tends to require substantially more 
distance computations than HNSW. A nice property of the approach is that it's 
based on a classic algorithm, k-means – it is easy to understand, and has few 
tuning parameters.

I wonder if this clustering-based approach could fit more closely in the 
current search framework. In the current prototype, we keep all the cluster 
information on-heap. We could instead try storing each cluster as its own 
'term' with a postings list. The kNN query would then be modelled as an 'OR' 
over these terms.

A major concern about clustering-based approaches is the high indexing cost. 
K-means is a heavy operation in itself. And even if we only use a subsample of 
documents during k-means, we must compare each indexed document to all 
centroids to assign it to the right cluster. With the heuristic of using 
sqrt(n) centroids, this could give poor scaling behavior at index time. A 
couple thoughts on this point:
 * FAISS helps address this concern by using an ANN algorithm to do the cluster 
assignment. In particular, it provides an option to use k-means clustering 
(IVFFlat), but do the cluster assignment using HNSW: 
[https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index#how-big-is-the-dataset].
 This seemed like a potentially interesting direction.
 * There could also be ways to streamline the k-means step. I experimented with 
FAISS's implementation of IVFFlat, and found that I could run very few k-means 
iterations, but still achieve similar performance. Here are some results on a 
dataset of ~1.2 million GloVe word vectors, using sqrt(n) centroids. The cell 
values represent recall for a kNN search with k=10:

*{{approach          10 probes  20 probes  100 probes  200 probes}}*
 {{random centroids  0.578      0.68       0.902       0.961}}
 {{k-means, 1 iter   0.729      0.821      0.961       0.987}}
 {{k-means, 2 iters  0.775      0.854      0.968       0.989}}
 {{k-means, 20 iters 0.806      0.875      0.972       0.991}}

 


[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-02-14 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17037325#comment-17037325
 ] 

Julie Tibshirani edited comment on LUCENE-9136 at 2/14/20 10:03 PM:


Hello [~irvingzhang], to me this looks like a really interesting direction! We 
also found in our research that k-means clustering (IVFFlat) could achieve high 
recall with a relatively low number of distance computations. It performs well 
compared to KD-trees and LSH, although it tends to require substantially more 
distance computations than HNSW. A nice property of the approach is that it's 
based on a classic algorithm, k-means – it is easy to understand, and has few 
tuning parameters.

I wonder if this clustering-based approach could fit more closely in the 
current search framework. In the current prototype, we keep all the cluster 
information on-heap. We could instead try storing each cluster as its own 
'term' with a postings list. The kNN query would then be modelled as an 'OR' 
over these terms.

A major concern about clustering-based approaches is the high indexing cost. 
K-means is a heavy operation in itself. And even if we only use a subsample of 
documents during k-means, we must compare each indexed document to all 
centroids to assign it to the right cluster. With the heuristic of using 
sqrt(n) centroids, this could give poor scaling behavior at index time. A 
couple thoughts on this point:
 * FAISS helps address this concern by using an ANN algorithm to do the cluster 
assignment. In particular, it provides an option to use k-means clustering 
(IVFFlat), but do the cluster assignment using HNSW: 
[https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index#how-big-is-the-dataset].
 This seemed like a potentially interesting direction.
 * There could also be ways to streamline the k-means step. I experimented with 
FAISS's implementation of IVFFlat, and found that I could run very few k-means 
iterations, but still achieve similar performance. Here are some results on a 
dataset of ~1.2 million GloVe word vectors, using sqrt(n) centroids. The cell 
values represent recall for a kNN search with k=10:

*{{approach          10 probes  20 probes  100 probes  200 probes}}*
 {{random centroids  0.578      0.68       0.902       0.961}}
 {{k-means, 1 iter   0.729      0.821      0.961       0.987}}
 {{k-means, 2 iters  0.775      0.854      0.968       0.989}}
 {{k-means, 20 iters 0.806      0.875      0.972       0.991}}

 


was (Author: jtibshirani):
Hello [~irvingzhang], to me this looks like a really interesting direction! We 
also found in our research that k-means clustering (IVFFlat) could achieve high 
recall with a relatively low number of distance computations. It performs well 
compared to KD-trees and LSH, although it tends to require more distance 
computations than HNSW. A nice property of the approach is that it's based on a 
classic algorithm, k-means -- it is easy to understand, and has few tuning 
parameters.

I wonder if this clustering-based approach could fit more closely in the 
current search framework. In the current prototype, we keep all the cluster 
information on-heap. We could instead try storing each cluster as its own 
'term' with a postings list. The kNN query would then be modelled as an 'OR' 
over these terms.

A major concern about clustering-based approaches is the high indexing cost. 
K-means is a heavy operation in itself. And even if we only use a subsample of 
documents during k-means, we must compare each indexed document to all 
centroids to assign it to the right cluster. With the heuristic of using 
sqrt(n) centroids, this could give poor scaling behavior at index time. A 
couple thoughts on this point:
 * FAISS helps address this concern by using an ANN algorithm to do the cluster 
assignment. In particular, it provides an option to use k-means clustering 
(IVFFlat), but do the cluster assignment using HNSW: 
https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index#how-big-is-the-dataset.
 This seemed like a potentially interesting direction.
 * There could also be ways to streamline the k-means step. I experimented with 
FAISS's implementation of IVFFlat, and found that I could run very few k-means 
iterations, but still achieve similar performance. Here are some results on a 
dataset of ~1.2 million GloVe word vectors, using sqrt(n) centroids. The cell 
values represent recall for a kNN search with k=10:

*{{approach          10 probes  20 probes  100 probes  200 probes}}*
{{random centroids  0.578      0.68       0.902       0.961}}
{{k-means, 1 iter   0.729      0.821      0.961       0.987}}
{{k-means, 2 iters  0.775      0.854      0.968       0.989}}
{{k-means, 20 iters 0.806      0.875      0.972       0.991}}

 


[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-02-12 Thread Xin-Chun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019507#comment-17019507
 ] 

Xin-Chun Zhang edited comment on LUCENE-9136 at 2/12/20 9:33 AM:
-

I worked on this issue for about three to four days, and it now works fine for 
searching.

My personal dev branch is available on GitHub: 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index 
format (only one meta file with suffix .ifi) of IVFFlat is shown in the class 
Lucene90IvfFlatIndexFormat. In my implementation, the clustering process is 
optimized when the number of vectors is sufficiently large (e.g. > 200,000 per 
segment): a shuffled subset is selected for training, thereby saving time and 
memory. The insertion performance of IVFFlat is better because nothing extra is 
done on insertion, while HNSW needs to maintain its graph. However, IVFFlat 
consumes more time in flushing because of the k-means clustering.
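
As a sketch, the shuffled-subset selection could look like this (the cap on 
training vectors is an assumption for illustration, not the branch's exact 
heuristic):

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Sketch: for large segments, k-means trains on a bounded shuffled sample,
 *  so training cost depends on the sample size, not the segment size. */
final class TrainingSampler {
  static List<float[]> sample(List<float[]> vectors, int maxTrainingVectors, long seed) {
    if (vectors.size() <= maxTrainingVectors) {
      return vectors; // small segment: train on all vectors
    }
    List<float[]> shuffled = new ArrayList<>(vectors);
    Collections.shuffle(shuffled, new Random(seed));
    return shuffled.subList(0, maxTrainingVectors);
  }
}
{code}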

My test cases show that the query performance of IVFFlat is better than that of 
HNSW, even though HNSW uses a cache for graphs while IVFFlat has none. Its 
recall is pretty high (avg time < 10ms and recall>96% over a set of 5 random 
vectors with 100 dimensions). My test class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 Performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still at an early stage. There must be some bugs that need to be 
fixed, and I would like to hear more comments. Everyone is welcome to 
participate in this issue.


was (Author: irvingzhang):
I worked on this issue for about three to four days, and it now works fine for 
searching.

My personal dev branch is available on GitHub: 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index 
format (only one meta file with suffix .ifi) of IVFFlat is shown in the class 
Lucene90IvfFlatIndexFormat. In my implementation, the clustering process is 
optimized when the number of vectors is sufficiently large (e.g. > 10,000,000 per 
segment): a shuffled subset is selected for training, thereby saving time and 
memory. The insertion performance of IVFFlat is better because nothing extra is 
done on insertion, while HNSW needs to maintain its graph. However, IVFFlat 
consumes more time in flushing because of the k-means clustering.

My test cases show that the query performance of IVFFlat is better than that of 
HNSW, even though HNSW uses a cache for graphs while IVFFlat has none. Its 
recall is pretty high (avg time < 10ms and recall>96% over a set of 5 random 
vectors with 100 dimensions). My test class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 Performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still at an early stage. There must be some bugs that need to be 
fixed, and I would like to hear more comments. Everyone is welcome to 
participate in this issue.


[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-02-11 Thread Xin-Chun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019507#comment-17019507
 ] 

Xin-Chun Zhang edited comment on LUCENE-9136 at 2/11/20 8:21 AM:
-

I worked on this issue for about three to four days, and it now works fine for 
searching.

My personal dev branch is available on GitHub: 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index 
format (only one meta file with suffix .ifi) of IVFFlat is shown in the class 
Lucene90IvfFlatIndexFormat. In my implementation, the clustering process is 
optimized when the number of vectors is sufficiently large (e.g. > 10,000,000 per 
segment): a shuffled subset is selected for training, thereby saving time and 
memory. The insertion performance of IVFFlat is better because nothing extra is 
done on insertion, while HNSW needs to maintain its graph. However, IVFFlat 
consumes more time in flushing because of the k-means clustering.

My test cases show that the query performance of IVFFlat is better than that of 
HNSW, even though HNSW uses a cache for graphs while IVFFlat has none. Its 
recall is pretty high (avg time < 10ms and recall>96% over a set of 5 random 
vectors with 100 dimensions). My test class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 Performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still at an early stage. There must be some bugs that need to be 
fixed, and I would like to hear more comments. Everyone is welcome to 
participate in this issue.


was (Author: irvingzhang):
I worked on this issue for about three to four days, and it now works fine for 
searching.

My personal dev branch is available on GitHub: 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index 
format (only one meta file with suffix .ifi) of IVFFlat is shown in the class 
Lucene90IvfFlatIndexFormat. In my implementation, the clustering process is 
optimized when the number of vectors is sufficiently large (e.g. > 5,000,000 per 
segment): a shuffled subset is selected for training, thereby saving time and 
memory. The insertion performance of IVFFlat is better because nothing extra is 
done on insertion, while HNSW needs to maintain its graph. However, IVFFlat 
consumes more time in flushing because of the k-means clustering.

My test cases show that the query performance of IVFFlat is better than that of 
HNSW, even though HNSW uses a cache for graphs while IVFFlat has none. Its 
recall is pretty high (avg time < 10ms and recall>96% over a set of 5 random 
vectors with 100 dimensions). My test class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 Performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still at an early stage. There must be some bugs that need to be 
fixed, and I would like to hear more comments. Everyone is welcome to 
participate in this issue.


[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-02-11 Thread Xin-Chun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019507#comment-17019507
 ] 

Xin-Chun Zhang edited comment on LUCENE-9136 at 2/11/20 8:02 AM:
-

I worked on this issue for about three to four days, and it now works fine for 
searching.

My personal dev branch is available on GitHub: 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index 
format (only one meta file with suffix .ifi) of IVFFlat is shown in the class 
Lucene90IvfFlatIndexFormat. In my implementation, the clustering process is 
optimized when the number of vectors is sufficiently large (e.g. > 5,000,000 per 
segment): a shuffled subset is selected for training, thereby saving time and 
memory. The insertion performance of IVFFlat is better because nothing extra is 
done on insertion, while HNSW needs to maintain its graph. However, IVFFlat 
consumes more time in flushing because of the k-means clustering.

My test cases show that the query performance of IVFFlat is better than that of 
HNSW, even though HNSW uses a cache for graphs while IVFFlat has none. Its 
recall is pretty high (avg time < 10ms and recall>96% over a set of 5 random 
vectors with 100 dimensions). My test class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 Performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still at an early stage. There must be some bugs that need to be 
fixed, and I would like to hear more comments. Everyone is welcome to 
participate in this issue.


was (Author: irvingzhang):
I worked on this issue for about three to four days, and it now works fine for 
searching.

My personal dev branch is available on GitHub: 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index 
format (only one meta file with suffix .ifi) of IVFFlat is shown in the class 
Lucene90IvfFlatIndexFormat. In my implementation, the clustering process is 
optimized when the number of vectors is very large (e.g. > 200,000 per 
segment): a shuffled subset is selected for training, thereby saving time and 
memory. The insertion performance of IVFFlat is better because nothing extra is 
done on insertion, while HNSW needs to maintain its graph. However, IVFFlat 
consumes more time in flushing because of the k-means clustering.

My test cases show that the query performance of IVFFlat is better than that of 
HNSW, even though HNSW uses a cache for graphs while IVFFlat has none. Its 
recall is pretty high (avg time < 10ms and recall>96% over a set of 5 random 
vectors with 100 dimensions). My test class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 Performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still at an early stage. There must be some bugs that need to be 
fixed, and I would like to hear more comments. Everyone is welcome to 
participate in this issue.


[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-01-21 Thread Xin-Chun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019507#comment-17019507
 ] 

Xin-Chun Zhang edited comment on LUCENE-9136 at 1/21/20 1:40 PM:
-

I worked on this issue for about three to four days, and it now works fine for 
searching.

My personal dev branch is available on GitHub: 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index 
format (only one meta file with suffix .ifi) of IVFFlat is shown in the class 
Lucene90IvfFlatIndexFormat. In my implementation, the clustering process is 
optimized when the number of vectors is very large (e.g. > 200,000 per 
segment): a shuffled subset is selected for training, thereby saving time and 
memory. The insertion performance of IVFFlat is better because nothing extra is 
done on insertion, while HNSW needs to maintain its graph. However, IVFFlat 
consumes more time in flushing because of the k-means clustering.

My test cases show that the query performance of IVFFlat is better than that of 
HNSW, even though HNSW uses a cache for graphs while IVFFlat has none. Its 
recall is pretty high (avg time < 10ms and recall>96% over a set of 5 random 
vectors with 100 dimensions). My test class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 Performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still at an early stage. There must be some bugs that need to be 
fixed, and I would like to hear more comments. Everyone is welcome to 
participate in this issue.


was (Author: irvingzhang):
I worked on this issue for about three to four days, and it now works fine for 
searching.

My personal dev branch is available on GitHub: 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index 
format of IVFFlat is shown in the class Lucene90IvfFlatIndexFormat. In my 
implementation, the clustering process is optimized when the number of vectors 
is very large (e.g. > 200,000 per segment): a shuffled subset is selected for 
training, thereby saving time and memory. The insertion performance of IVFFlat 
is better because nothing extra is done on insertion, while HNSW needs to 
maintain its graph. However, IVFFlat consumes more time in flushing because of 
the k-means clustering.

My test cases show that the query performance of IVFFlat is better than HNSW, 
even if HNSW uses a cache for graphs while IVFFlat has no cache. And its recall 
is pretty high (avg time < 10ms and recall>96% over a set of 5 random 
vectors with 100 dimensions). My test class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 Performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still in its early stage. There must be some bugs that need to be 
fixed and and I would like to hear more comments. Everyone is welcomed to 
participate in this issue.

[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-01-21 Thread Xin-Chun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019507#comment-17019507
 ] 

Xin-Chun Zhang edited comment on LUCENE-9136 at 1/21/20 1:26 PM:
-

I worked on this issue for about three to four days, and it now works fine for 
searching.

My personal dev branch is available on GitHub: 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index 
format of IVFFlat is defined in the class Lucene90IvfFlatIndexFormat. In my 
implementation, the clustering process is optimized when the number of vectors 
is very large (e.g. > 200,000 per segment): a shuffled subset is selected for 
training, which saves both time and memory. The insertion performance of 
IVFFlat is better because insertion requires no extra work, whereas HNSW needs 
to maintain its graph. However, IVFFlat spends more time in flushing because 
of the k-means clustering.

My test cases show that the query performance of IVFFlat is better than that 
of HNSW, even though HNSW uses a cache for its graphs while IVFFlat has no 
cache. Its recall is also pretty high (avg. time < 10 ms and recall > 96% over 
a set of 5 random vectors with 100 dimensions). My test class for IVFFlat is 
under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 A performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still at an early stage. There must be some bugs that need to be 
fixed, and I would like to hear more comments. Everyone is welcome to 
participate in this issue.


was (Author: irvingzhang):
I worked on this issue for about three to four days, and it now works fine for 
searching.

My personal dev branch is available on GitHub: 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index 
format of IVFFlat is defined in the class Lucene90IvfFlatIndexFormat. In my 
implementation, the clustering process is optimized when the number of vectors 
is very large (e.g. > 40,000 per segment): a shuffled subset is selected for 
training, which saves both time and memory. The insertion performance of 
IVFFlat is better because insertion requires no extra work, whereas HNSW needs 
to maintain its graph. However, IVFFlat spends more time in flushing because 
of the k-means clustering.

My test cases show that the query performance of IVFFlat is better than that 
of HNSW, even though HNSW uses a cache for its graphs while IVFFlat has no 
cache. Its recall is also pretty high (avg. time < 10 ms and recall > 96% over 
a set of 5 random vectors with 100 dimensions). My test class for IVFFlat is 
under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 A performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still at an early stage. There must be some bugs that need to be 
fixed, and I would like to hear more comments. Everyone is welcome to 
participate in this issue.

[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-01-20 Thread Xin-Chun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019507#comment-17019507
 ] 

Xin-Chun Zhang edited comment on LUCENE-9136 at 1/21/20 7:40 AM:
-

I worked on this issue for about three to four days, and it now works fine for 
searching.

My personal dev branch is available on GitHub: 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index 
format of IVFFlat is defined in the class Lucene90IvfFlatIndexFormat. In my 
implementation, the clustering process is optimized when the number of vectors 
is very large (e.g. > 40,000 per segment): a shuffled subset is selected for 
training, which saves both time and memory. The insertion performance of 
IVFFlat is better because insertion requires no extra work, whereas HNSW needs 
to maintain its graph. However, IVFFlat spends more time in flushing because 
of the k-means clustering.

My test cases show that the query performance of IVFFlat is better than that 
of HNSW, even though HNSW uses a cache for its graphs while IVFFlat has no 
cache. Its recall is also pretty high (avg. time < 10 ms and recall > 96% over 
a set of 5 random vectors with 100 dimensions). My test class for IVFFlat is 
under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 A performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still at an early stage. There must be some bugs that need to be 
fixed, and I would like to hear more comments. Everyone is welcome to 
participate in this issue.


was (Author: irvingzhang):
I worked on this issue for about three to four days, and it now works fine for 
searching.

My personal dev branch is available on GitHub: 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index 
format of IVFFlat is defined in the class Lucene90IvfFlatIndexFormat. In my 
implementation, the clustering process is optimized when the number of vectors 
is very large (e.g. > 40,000 per segment): a shuffled subset is selected for 
training, which saves both time and memory. The insertion performance of 
IVFFlat is better because insertion requires no extra work, whereas HNSW needs 
to maintain its graph. However, IVFFlat spends more time in flushing because 
of the k-means clustering.

My test cases show that the query performance of IVFFlat is better than that 
of HNSW, even though HNSW uses a cache for its graphs while IVFFlat has no 
cache. Its recall is also pretty high (avg. time < 10 ms and recall > 97% over 
a set of 5 random vectors with 100 dimensions). My test class for IVFFlat is 
under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 A performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still at an early stage. There must be some bugs that need to be 
fixed, and I would like to hear more comments. Everyone is welcome to 
participate in this issue.

[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-01-20 Thread Xin-Chun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019507#comment-17019507
 ] 

Xin-Chun Zhang edited comment on LUCENE-9136 at 1/21/20 6:31 AM:
-

I worked on this issue for about three to four days, and it now works fine for 
searching.

My personal dev branch is available on GitHub: 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index 
format of IVFFlat is defined in the class Lucene90IvfFlatIndexFormat. In my 
implementation, the clustering process is optimized when the number of vectors 
is very large (e.g. > 40,000 per segment): a shuffled subset is selected for 
training, which saves both time and memory. The insertion performance of 
IVFFlat is better because insertion requires no extra work, whereas HNSW needs 
to maintain its graph. However, IVFFlat spends more time in flushing because 
of the k-means clustering.

My test cases show that the query performance of IVFFlat is better than that 
of HNSW, even though HNSW uses a cache for its graphs while IVFFlat has no 
cache. Its recall is also pretty high (avg. time < 10 ms and recall > 97% over 
a set of 5 random vectors with 100 dimensions). My test class for IVFFlat is 
under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 A performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still at an early stage. There must be some bugs that need to be 
fixed, and I would like to hear more comments. Everyone is welcome to 
participate in this issue.


was (Author: irvingzhang):
I worked on this issue for about three to four days, and it now works fine for 
searching.

My personal dev branch is available on GitHub: 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index 
format of IVFFlat is defined in the class Lucene90IvfFlatIndexFormat. In my 
implementation, the clustering process is optimized when the number of vectors 
is very large (e.g. > 40,000 per segment): a shuffled subset is selected for 
training, which saves both time and memory. The insertion performance of 
IVFFlat is better because insertion requires no extra work, whereas HNSW needs 
to maintain its graph. However, IVFFlat spends more time in flushing because 
of the k-means clustering.

My test cases show that the query performance of IVFFlat is better than that 
of HNSW, even though HNSW uses a cache for its graphs while IVFFlat has no 
cache. Its recall is also pretty high (recall > 97% over a set of 5 random 
vectors with 100 dimensions). My test class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 A performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still at an early stage. There must be some bugs that need to be 
fixed, and I would like to hear more comments. Everyone is welcome to 
participate in this issue.

[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-01-20 Thread Xin-Chun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019507#comment-17019507
 ] 

Xin-Chun Zhang edited comment on LUCENE-9136 at 1/21/20 3:44 AM:
-

I worked on this issue for about three to four days, and it now works fine for 
searching.

My personal dev branch is available on GitHub: 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index 
format of IVFFlat is defined in the class Lucene90IvfFlatIndexFormat. In my 
implementation, the clustering process is optimized when the number of vectors 
is very large (e.g. > 40,000 per segment): a shuffled subset is selected for 
training, which saves both time and memory. The insertion performance of 
IVFFlat is better because insertion requires no extra work, whereas HNSW needs 
to maintain its graph. However, IVFFlat spends more time in flushing because 
of the k-means clustering.

My test cases show that the query performance of IVFFlat is better than that 
of HNSW, even though HNSW uses a cache for its graphs while IVFFlat has no 
cache. Its recall is also pretty high (recall > 97% over a set of 5 random 
vectors with 100 dimensions). My test class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 A performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still at an early stage. There must be some bugs that need to be 
fixed, and I would like to hear more comments. Everyone is welcome to 
participate in this issue.


was (Author: irvingzhang):
I worked on this issue for about three to four days, and it now works fine for 
searching.

My personal dev branch is available on GitHub: 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index 
format of IVFFlat is defined in the class Lucene90IvfFlatIndexFormat. In my 
implementation, the clustering process is optimized when the number of vectors 
is very large (e.g. > 40,000 per segment): a shuffled subset is selected for 
training, which saves both time and memory. The insertion performance of 
IVFFlat is better because insertion requires no extra work, whereas HNSW needs 
to maintain its graph. However, IVFFlat spends more time in flushing because 
of the k-means clustering.

My test cases show that the query performance of IVFFlat is better than that 
of HNSW, even though HNSW uses a cache for its graphs while IVFFlat has no 
cache. Its recall is also pretty high (recall > 97% over a set of 5 random 
vectors with 100 dimensions). My test class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 A performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still at an early stage. There must be some bugs that need to be 
fixed, and I would like to hear more comments. Anyone is welcome to 
participate in further development.

[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-01-20 Thread Xin-Chun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019507#comment-17019507
 ] 

Xin-Chun Zhang edited comment on LUCENE-9136 at 1/21/20 3:41 AM:
-

I worked on this issue for about three to four days, and it now works fine for 
searching.

My personal dev branch is available on GitHub: 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index 
format of IVFFlat is defined in the class Lucene90IvfFlatIndexFormat. In my 
implementation, the clustering process is optimized when the number of vectors 
is very large (e.g. > 40,000 per segment): a shuffled subset is selected for 
training, which saves both time and memory. The insertion performance of 
IVFFlat is better because insertion requires no extra work, whereas HNSW needs 
to maintain its graph. However, IVFFlat spends more time in flushing because 
of the k-means clustering.

My test cases show that the query performance of IVFFlat is better than that 
of HNSW, even though HNSW uses a cache for its graphs while IVFFlat has no 
cache. Its recall is also pretty high (recall > 97% over a set of 5 random 
vectors with 100 dimensions). My test class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 A performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still at an early stage. There must be some bugs that need to be 
fixed, and I would like to hear more comments. Anyone is welcome to 
participate in further development.


was (Author: irvingzhang):
I worked on this issue for about three to four days, and it now works fine for 
searching.

My personal dev branch is available on GitHub: 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index 
format of IVFFlat is defined in the class Lucene90IvfFlatIndexFormat. In my 
implementation, the clustering process is optimized when the number of vectors 
is very large (e.g. > 40,000 per segment): a shuffled subset is selected for 
training, which saves both time and memory. The insertion performance of 
IVFFlat is better because insertion requires no extra work, whereas HNSW needs 
to maintain its graph. However, IVFFlat spends more time in flushing because 
of the k-means clustering.

My test cases show that the query performance of IVFFlat is better than that 
of HNSW, and its recall is pretty high (recall > 97% over a set of 5 random 
vectors with 100 dimensions). My test class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 A performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still at an early stage. There must be some bugs that need to be 
fixed, and I would like to hear more comments.

[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-01-20 Thread Xin-Chun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019507#comment-17019507
 ] 

Xin-Chun Zhang edited comment on LUCENE-9136 at 1/21/20 3:38 AM:
-

I worked on this issue for about three to four days, and it now works fine for 
searching.

My personal dev branch is available on GitHub: 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index 
format of IVFFlat is defined in the class Lucene90IvfFlatIndexFormat. In my 
implementation, the clustering process is optimized when the number of vectors 
is very large (e.g. > 40,000 per segment): a shuffled subset is selected for 
training, which saves both time and memory. The insertion performance of 
IVFFlat is better because insertion requires no extra work, whereas HNSW needs 
to maintain its graph. However, IVFFlat spends more time in flushing because 
of the k-means clustering.

My test cases show that the query performance of IVFFlat is better than that 
of HNSW, and its recall is pretty high (recall > 97% over a set of 5 random 
vectors with 100 dimensions). My test class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 A performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still at an early stage. There must be some bugs that need to be 
fixed, and I would like to hear more comments.


was (Author: irvingzhang):
I worked on this issue for about three to four days, and it now works fine for 
searching. In my implementation, the clustering process is optimized when the 
number of vectors is large (e.g. > 40,000 per segment): a shuffled subset is 
selected for training rather than the whole set of vectors, reducing both time 
and memory. The insertion performance of IVFFlat is better because insertion 
requires no extra work, whereas HNSW needs to maintain the graph. However, 
IVFFlat spends more time in flushing because of the k-means clustering. The 
designed format of the IVFFlat index is presented in the class 
Lucene90IvfFlatIndexFormat. My test cases show that the query performance of 
IVFFlat is slightly better than that of HNSW. My personal branch is here: 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The test 
class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 A performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still at an early stage. It currently contains some code similar 
to HNSW's, which could be refactored. Moreover, there must be some bugs that 
need to be fixed, and I would like to hear more comments.

[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-01-20 Thread Xin-Chun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019507#comment-17019507
 ] 

Xin-Chun Zhang edited comment on LUCENE-9136 at 1/20/20 2:34 PM:
-

I worked on this issue for about three to four days, and it now works fine for 
searching. In my implementation, the clustering process is optimized when the 
number of vectors is large (e.g. > 40,000 per segment): a shuffled subset is 
selected for training rather than the whole set of vectors, reducing both time 
and memory. The insertion performance of IVFFlat is better because insertion 
requires no extra work, whereas HNSW needs to maintain the graph. However, 
IVFFlat spends more time in flushing because of the k-means clustering. The 
designed format of the IVFFlat index is presented in the class 
Lucene90IvfFlatIndexFormat. My test cases show that the query performance of 
IVFFlat is slightly better than that of HNSW. My personal branch is here: 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The test 
class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 A performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still at an early stage. It currently contains some code similar 
to HNSW's, which could be refactored. Moreover, there must be some bugs that 
need to be fixed, and I would like to hear more comments.


was (Author: irvingzhang):
I worked on this issue for about three to four days, and it now works fine for 
searching. My [personal 
branch|https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136] is 
here. The clustering process is optimized when the number of vectors is large 
(e.g. > 40,000 per segment). The query performance of IVFFlat seems slightly 
better than that of HNSW. The insert performance of IVFFlat is also better 
than HNSW's, since it performs no extra work on insertion while HNSW needs to 
maintain the graph. However, IVFFlat spends more time in flushing due to the 
k-means clustering. The designed format of the IVFFlat index is presented in 
the class Lucene90IvfFlatIndexFormat. My test class for IVFFlat is under the 
directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 A performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still at an early stage. The performance could be further 
optimized. It currently contains some code similar to HNSW's, which could be 
refactored. Moreover, there must be some bugs that need to be fixed, and I 
would like to hear more comments.

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades but it draws tremendous attention lately 
> with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. By embedding the data 
> into a high dimensional vector, the vector retrieval (VR) method is then 
> applied to search the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++ with no plan to support a Java interface, making them hard 
> to integrate into Java projects and inaccessible to those who are not 
> familiar with C/C++ [https://github.com/facebookresearch/faiss/issues/105].
> The algorithms for vector retrieval can be roughly classified into four 
> categories:
>  # Tree-based algorithms, such as the KD-tree;
>  # Hashing methods, such as LSH (Locality-Sensitive Hashing);
>  # Product quantization algorithms, such as IVFFlat;
>  # Graph-based algorithms, such as HNSW, SSG, and NSG;
> where IVFFlat and HNSW are the most popular among all the VR algorithms.
> Recently, the implementation of HNSW (Hierarchical Navigable Small World, 
> LUCENE-9004) for Lucene has made grea