[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search
[ https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052727#comment-17052727 ] Xin-Chun Zhang edited comment on LUCENE-9136 at 3/6/20, 3:34 AM: - Hi, [~jtibshirani], thanks for your excellent work! ??I was thinking we could actually reuse the existing `PostingsFormat` and `DocValuesFormat` implementations.?? Yes, the code could be simpler by reusing these formats. But I agree with [~tomoko] that ANN search is a fairly new feature for Lucene, so it's better to use a dedicated format for maintainability reasons. Moreover, if we are going to use a dedicated vector format for HNSW, this format should also be applied to IVFFlat, because IVFFlat and HNSW serve the same purpose of ANN search. It may be confusing to users if IVFFlat and HNSW behave completely differently. ??In particular, it doesn’t require random access for doc values, they are only accessed through forward iteration.?? Actually, we need random access to the vector values! For a typical search engine, we are going to retrieve the best-matched documents after obtaining the top-k docIDs. Retrieving vectors via these docIDs requires random access to the vector values. was (Author: irvingzhang): Hi, [~jtibshirani], thanks for your excellent work! ??I was thinking we could actually reuse the existing `PostingsFormat` and `DocValuesFormat` implementations.?? Yes, the code could be simpler by reusing these formats. But I agree with [~tomoko] that ANN search is a fairly new feature for Lucene, so it's better to use a dedicated format for maintainability reasons. Moreover, if we are going to use a dedicated vector format for HNSW, this could also be applied to IVFFlat, because IVFFlat and HNSW serve the same purpose of ANN search. It may be confusing to users if IVFFlat and HNSW behave completely differently. ??In particular, it doesn’t require random access for doc values, they are only accessed through forward iteration.?? Actually, we need random access to the vector values! For a typical search engine, we are going to retrieve the best-matched documents after obtaining the top-k docIDs. Retrieving vectors via these docIDs requires random access to the vector values. > Introduce IVFFlat to Lucene for ANN similarity search > - > > Key: LUCENE-9136 > URL: https://issues.apache.org/jira/browse/LUCENE-9136 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Xin-Chun Zhang >Priority: Major > Attachments: 1581409981369-9dea4099-4e41-4431-8f45-a3bb8cac46c0.png, > image-2020-02-16-15-05-02-451.png > > Time Spent: 50m > Remaining Estimate: 0h > > Representation learning (RL) has been an established discipline in the > machine learning space for decades, but it has drawn tremendous attention lately > with the emergence of deep learning. The central problem of RL is to > determine an optimal representation of the input data. By embedding the data > into a high-dimensional vector, the vector retrieval (VR) method is then > applied to search for the relevant items. > With the rapid development of RL over the past few years, the technique has > been used extensively in industry, from online advertising to computer vision > and speech recognition. There exist many open-source implementations of VR > algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various > choices for potential users.
However, the aforementioned implementations are > all written in C++ with no plan to support a Java interface, making them hard > to integrate into Java projects and inaccessible to those who are not familiar with C/C++ > [[https://github.com/facebookresearch/faiss/issues/105]]. > The algorithms for vector retrieval can be roughly classified into four > categories: > # Tree-based algorithms, such as KD-tree; > # Hashing methods, such as LSH (Locality-Sensitive Hashing); > # Product-quantization-based algorithms, such as IVFFlat; > # Graph-based algorithms, such as HNSW, SSG, NSG; > where IVFFlat and HNSW are the most popular ones among all the VR algorithms. > IVFFlat is better for high-precision applications such as face recognition, > while HNSW performs better in general scenarios including recommendation and > personalized advertisement. *The recall ratio of IVFFlat can be gradually > increased by adjusting the query parameter (nprobe), while it's hard for HNSW > to improve its accuracy*. In theory, IVFFlat could achieve a 100% recall ratio. > Recently, the implementation of HNSW (Hierarchical Navigable Small World, > LUCENE-9004) for Lucene has made great progress. That issue has drawn the attention > of those who are interested in Lucene or hope to use HNSW with Solr/Lucene. > As an alternative for solving ANN similarity search pro
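To make the random-access point in the comment above concrete: after an ANN search returns the top-k docIDs, the raw vectors have to be re-read by docID (for re-ranking or for returning them to the caller). The Java sketch below illustrates that access pattern; the reader interface and class names are hypothetical and are not taken from the linked branch or from Lucene's existing doc-values API.
{code:java}
import java.io.IOException;

/** Hypothetical random-access view over stored vectors (not a Lucene API). */
interface RandomAccessVectorReader {
  /** Returns the raw vector stored for the given docID. */
  float[] vectorValue(int docId) throws IOException;
}

final class TopKVectorFetcher {
  private final RandomAccessVectorReader reader;

  TopKVectorFetcher(RandomAccessVectorReader reader) {
    this.reader = reader;
  }

  /** Fetches the vectors for the docIDs returned by an ANN search, in result order. */
  float[][] fetch(int[] topDocIds) throws IOException {
    float[][] vectors = new float[topDocIds.length][];
    for (int i = 0; i < topDocIds.length; i++) {
      // The docIDs arrive ordered by score, not by docID, so forward-only
      // iteration is not sufficient here -- each lookup is a random access.
      vectors[i] = reader.vectorValue(topDocIds[i]);
    }
    return vectors;
  }
}
{code}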
[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search
[ https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050663#comment-17050663 ] Julie Tibshirani edited comment on LUCENE-9136 at 3/4/20 12:39 AM: --- Hello [~irvingzhang], thanks for the updates and for trying out the idea. I was thinking we could actually reuse the existing `PostingsFormat` and `DocValuesFormat` implementations. This would have a few advantages: * It simplifies the implementation, since we avoid duplicating a good chunk of codec writing + reading logic. * I agree with your point that we don’t expect the cluster information to take up too much memory, since it just contains a map from centroid to the IDs of documents that belong to the cluster. I think there are still benefits to keeping the main data structures off-heap -- we’d be better able to scale to large numbers of documents, especially when multiple vector fields are defined at once. There is also no ‘loading’ step, where the data must be read into an on-heap data structure before it can be used in queries. I created this draft PR to sketch out the approach: https://github.com/apache/lucene-solr/pull/1314. I’m looking forward to hearing your thoughts! I included some benchmarks, and the QPS looks okay. The indexing time is ~1 hour for 1 million points -- as we discussed earlier this is a major concern, and it will be important to think about how it can be improved. was (Author: jtibshirani): Hello [~irvingzhang], thanks for the updates and for trying out the idea. I was thinking we could actually reuse the existing `PostingsFormat` and `DocValuesFormat` implementations. This would have a few advantages: * It simplifies the implementation, since we avoid duplicating a good chunk of codec writing + reading logic. * I agree with your point that we don’t expect the cluster information to take up too much memory, since it just contains a map from centroid to the IDs of documents that belong to the cluster. I think there are still benefits to keeping the main data structures off-heap -- we’d be better able to scale to large numbers of documents, especially when multiple vector fields are defined at once. There is also no ‘loading’ step, where the data must be read into an on-heap data structure before it can be used in queries. I created this draft PR to sketch out the approach: https://github.com/apache/lucene-solr/pull/1314. I’m looking forward to hearing your thoughts! I included some benchmarks, and the QPS looks okay. The indexing time is ~1 hour for 1 million points -- as we discussed earlier this is quite high, and it will be important to think about how it can be improved.
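The idea in the comment above of storing each cluster as its own 'term' with a postings list can be sketched with Lucene's standard query classes. The Java fragment below only illustrates the query shape and assumes centroids are encoded as string terms; the actual representation should be taken from the draft PR, not from this sketch.
{code:java}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

/** Sketch: an IVFFlat probe expressed as an OR over per-cluster "terms". */
final class ClusterProbeQuery {

  /** Builds a disjunction over the clusters closest to the query vector. */
  static Query probe(String field, int[] nearestCentroidIds) {
    BooleanQuery.Builder builder = new BooleanQuery.Builder();
    for (int centroidId : nearestCentroidIds) {
      // One hypothetical term per cluster; its postings list enumerates the
      // docIDs assigned to that cluster. Matching docs would then be re-scored
      // by exact distance between their vector and the query vector.
      builder.add(new TermQuery(new Term(field, Integer.toString(centroidId))),
          BooleanClause.Occur.SHOULD);
    }
    return builder.build();
  }
}
{code}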
[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search
[ https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050663#comment-17050663 ] Julie Tibshirani edited comment on LUCENE-9136 at 3/4/20 12:38 AM: --- Hello [~irvingzhang], thanks for the updates and for trying out the idea. I was thinking we could actually reuse the existing `PostingsFormat` and `DocValuesFormat` implementations. This would have a few advantages: * It simplifies the implementation, since we avoid duplicating a good chunk of codec writing + reading logic. * I agree with your point that we don’t expect the cluster information to take up too much memory, since it just contains a map from centroid to the IDs of documents that belong to the cluster. I think there are still benefits to keeping the main data structures off-heap -- we’d be better able to scale to large numbers of documents, especially when multiple vector fields are defined at once. There is also no ‘loading’ step, where the data must be read into an on-heap data structure before it can be used in queries. I created this draft PR to sketch out the idea: https://github.com/apache/lucene-solr/pull/1314. I’m looking forward to hearing your thoughts! I included some benchmarks, and the QPS looks okay. The indexing time is ~1 hour for 1 million points -- as we discussed earlier this is quite high, and it will be important to think about how it can be improved. was (Author: jtibshirani): Hello @Xin-Chun Zhang, thanks for the updates and for trying out the idea. I was thinking we could actually reuse the existing `PostingsFormat` and `DocValuesFormat` implementations. This would have a few advantages: * It simplifies the implementation, since we avoid duplicating a good chunk of codec writing + reading logic. * I agree with your point that we don’t expect the cluster information to take up too much memory, since it just contains a map from centroid to the IDs of documents that belong to the cluster. I think there are still benefits to keeping the main data structures off-heap -- we’d be better able to scale to large numbers of documents, especially when multiple vector fields are defined at once. There is also no ‘loading’ step, where the data must be read into an on-heap data structure before it can be used in queries. I created this draft PR to sketch out the idea: https://github.com/apache/lucene-solr/pull/1314. I’m looking forward to hearing your thoughts! I included some benchmarks, and the QPS looks okay. The indexing time is ~1 hour for 1 million points -- as we discussed earlier this is quite high, and it will be important to think about how it can be improved.
[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search
[ https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050663#comment-17050663 ] Julie Tibshirani edited comment on LUCENE-9136 at 3/4/20 12:38 AM: --- Hello [~irvingzhang], thanks for the updates and for trying out the idea. I was thinking we could actually reuse the existing `PostingsFormat` and `DocValuesFormat` implementations. This would have a few advantages: * It simplifies the implementation, since we avoid duplicating a good chunk of codec writing + reading logic. * I agree with your point that we don’t expect the cluster information to take up too much memory, since it just contains a map from centroid to the IDs of documents that belong to the cluster. I think there are still benefits to keeping the main data structures off-heap -- we’d be better able to scale to large numbers of documents, especially when multiple vector fields are defined at once. There is also no ‘loading’ step, where the data must be read into an on-heap data structure before it can be used in queries. I created this draft PR to sketch out the approach: https://github.com/apache/lucene-solr/pull/1314. I’m looking forward to hearing your thoughts! I included some benchmarks, and the QPS looks okay. The indexing time is ~1 hour for 1 million points -- as we discussed earlier this is quite high, and it will be important to think about how it can be improved. was (Author: jtibshirani): Hello [~irvingzhang], thanks for the updates and for trying out the idea. I was thinking we could actually reuse the existing `PostingsFormat` and `DocValuesFormat` implementations. This would have a few advantages: * It simplifies the implementation, since we avoid duplicating a good chunk of codec writing + reading logic. * I agree with your point that we don’t expect the cluster information to take up too much memory, since it just contains a map from centroid to the IDs of documents that belong to the cluster. I think there are still benefits to keeping the main data structures off-heap -- we’d be better able to scale to large numbers of documents, especially when multiple vector fields are defined at once. There is also no ‘loading’ step, where the data must be read into an on-heap data structure before it can be used in queries. I created this draft PR to sketch out the idea: https://github.com/apache/lucene-solr/pull/1314. I’m looking forward to hearing your thoughts! I included some benchmarks, and the QPS looks okay. The indexing time is ~1 hour for 1 million points -- as we discussed earlier this is quite high, and it will be important to think about how it can be improved.
[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search
[ https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17037727#comment-17037727 ] Xin-Chun Zhang edited comment on LUCENE-9136 at 2/24/20 10:01 AM: -- Hi, [~jtibshirani], thanks for your suggestions! ??"I wonder if this clustering-based approach could fit more closely in the current search framework. In the current prototype, we keep all the cluster information on-heap. We could instead try storing each cluster as its own 'term' with a postings list. The kNN query would then be modelled as an 'OR' over these terms."?? In the previous implementation ([https://github.com/irvingzhang/lucene-solr/commit/eb5f79ea7a705595821f73f80a0c5752061869b2]), the cluster information is divided into two parts – meta (.ifi) and data (.ifd) as shown in the following figure, where each cluster with its postings list is stored in the data file (.ifd) and not kept on-heap. A major concern with this implementation is the read performance of the cluster data, since reads are very frequent during kNN search. I will test and check the performance. !image-2020-02-16-15-05-02-451.png! ??"Because of this concern, it could be nice to include benchmarks for index time (in addition to QPS)..."?? Many thanks! I will check the links you mentioned and consider optimizing the clustering cost. In addition, more benchmarks will be added soon. h2. *UPDATE – Feb. 24, 2020* I have added a new implementation of the IVF index, marked as ***V2 under the package org.apache.lucene.codecs.lucene90. In the current implementation, the IVF index is divided into two files with the suffixes .ifi and .ifd, respectively. The .ifd file is read whenever cluster information is needed. The experiments were conducted on the sift1M dataset (test code: [https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/KnnIvfPerformTester.java]); detailed results are as follows:
# add documents -- 3921 ms;
# commit -- 3912286 ms (mainly spent on k-means training: 10 iterations, 4000 centroids, 512,000 vectors in total used for training);
# average search time and recall ratio for R@100 are listed in the following table:
||nprobe||avg. search time (ms)||recall ratio (%)||
|8|28.0755|44.154|
|16|27.1745|57.9945|
|32|32.986|71.7003|
|64|40.4082|83.50471|
|128|50.9569|92.07929|
|256|73.923|97.150894|
Compared with the on-heap implementation of the IVF index, the query time increases significantly (22%~71%). Actually, the IVF index is composed of unique docIDs and will not take up too much memory. *It is still an open question whether to keep the cluster information on-heap or not. I hope to hear more suggestions.* was (Author: irvingzhang): Hi, [~jtibshirani], thanks for your suggestions! ??"I wonder if this clustering-based approach could fit more closely in the current search framework. In the current prototype, we keep all the cluster information on-heap. We could instead try storing each cluster as its own 'term' with a postings list. The kNN query would then be modelled as an 'OR' over these terms."?? In the previous implementation ([https://github.com/irvingzhang/lucene-solr/commit/eb5f79ea7a705595821f73f80a0c5752061869b2]), the cluster information is divided into two parts – meta (.ifi) and data (.ifd) as shown in the following figure, where each cluster with its postings list is stored in the data file (.ifd) and not kept on-heap.
A major concern with this implementation is the read performance of the cluster data, since reads are very frequent during kNN search. I will test and check the performance. !image-2020-02-16-15-05-02-451.png! ??"Because of this concern, it could be nice to include benchmarks for index time (in addition to QPS)..."?? Many thanks! I will check the links you mentioned and consider optimizing the clustering cost. In addition, more benchmarks will be added soon.
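The nprobe trade-off in the table above follows from the shape of the IVFFlat search loop: probing more clusters means more exact distance computations and higher recall. Below is a rough Java sketch of that loop; the in-memory layout (centroid array, per-cluster docID lists, one vector per docID) is purely illustrative and is not the on-disk .ifi/.ifd format from the branch.
{code:java}
import java.util.Arrays;
import java.util.Comparator;
import java.util.PriorityQueue;

/** Illustrative IVFFlat search loop: probe nprobe clusters, rescore exactly. */
final class IvfFlatSearchSketch {

  private static final class Candidate {
    final int docId;
    final float dist;
    Candidate(int docId, float dist) { this.docId = docId; this.dist = dist; }
  }

  /** Returns the docIDs of the (at most) k nearest vectors, scanning only nprobe clusters. */
  static int[] search(float[] query, float[][] centroids, int[][] clusterDocs,
                      float[][] vectors, int nprobe, int k) {
    // 1. Rank all centroids by distance to the query and keep the nprobe closest.
    Integer[] order = new Integer[centroids.length];
    for (int i = 0; i < order.length; i++) {
      order[i] = i;
    }
    Arrays.sort(order, Comparator.comparingDouble(c -> l2(query, centroids[c])));

    // 2. Scan only the posting lists of the probed clusters, keeping the k closest
    //    vectors by exact (flat) distance. Max-heap: the root is the farthest kept.
    PriorityQueue<Candidate> top =
        new PriorityQueue<>((a, b) -> Float.compare(b.dist, a.dist));
    for (int p = 0; p < Math.min(nprobe, centroids.length); p++) {
      for (int docId : clusterDocs[order[p]]) {
        top.offer(new Candidate(docId, l2(query, vectors[docId])));
        if (top.size() > k) {
          top.poll(); // drop the current farthest candidate
        }
      }
    }

    // 3. Drain the heap; the farthest comes out first, so fill from the back.
    int[] result = new int[top.size()];
    for (int i = result.length - 1; i >= 0; i--) {
      result[i] = top.poll().docId;
    }
    return result;
  }

  private static float l2(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }
}
{code}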
[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search
[ https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17037325#comment-17037325 ] Julie Tibshirani edited comment on LUCENE-9136 at 2/15/20 1:44 AM: --- Hello [~irvingzhang], to me this looks like a really interesting direction! We also found in our research that k-means clustering (IVFFlat) could achieve high recall with a relatively low number of distance computations. It performs well compared to KD-trees and LSH, although it tends to require substantially more distance computations than HNSW. A nice property of the approach is that it's based on a classic algorithm, k-means – it is easy to understand, and has few tuning parameters. I wonder if this clustering-based approach could fit more closely in the current search framework. In the current prototype, we keep all the cluster information on-heap. We could instead try storing each cluster as its own 'term' with a postings list. The kNN query would then be modelled as an 'OR' over these terms. A major concern about clustering-based approaches is the high indexing cost. K-means is a heavy operation in itself. And even if we only use a subsample of documents during k-means, we must compare each indexed document to all centroids to assign it to the right cluster. With the heuristic of using sqrt\(n) centroids, this could give poor scaling behavior when indexing large segments. Because of this concern, it could be nice to include benchmarks for index time (in addition to QPS). A couple more thoughts on this point: * FAISS helps address the concern by using an ANN algorithm to do the cluster assignment. In particular, it provides an option to use k-means clustering (IVFFlat), but do the cluster assignment using HNSW: [https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index#how-big-is-the-dataset]. This seemed like a potentially interesting direction. * There could also be ways to streamline the k-means step. As an example, I experimented with FAISS's implementation of IVFFlat, and found that I could run very few k-means iterations, but still achieve similar performance. Here are some results on a dataset of ~1.2 million GloVe word vectors, using sqrt\(n) centroids. The cell values represent recall for a kNN search with k=10:
||approach||10 probes||20 probes||100 probes||200 probes||
|random centroids|0.578|0.68|0.902|0.961|
|k-means, 1 iter|0.729|0.821|0.961|0.987|
|k-means, 2 iters|0.775|0.854|0.968|0.989|
|k-means, 20 iters|0.806|0.875|0.972|0.991|
was (Author: jtibshirani): Hello [~irvingzhang], to me this looks like a really interesting direction! We also found in our research that k-means clustering (IVFFlat) could achieve high recall with a relatively low number of distance computations. It performs well compared to KD-trees and LSH, although it tends to require substantially more distance computations than HNSW. A nice property of the approach is that it's based on a classic algorithm, k-means – it is easy to understand, and has few tuning parameters. I wonder if this clustering-based approach could fit more closely in the current search framework. In the current prototype, we keep all the cluster information on-heap. We could instead try storing each cluster as its own 'term' with a postings list. The kNN query would then be modelled as an 'OR' over these terms. A major concern about clustering-based approaches is the high indexing cost. K-means is a heavy operation in itself.
And even if we only use a subsample of documents during k-means, we must compare each indexed document to all centroids to assign it to the right cluster. With the heuristic of using sqrt\(n) centroids, this could give poor scaling behavior at index time. A couple thoughts on this point: * FAISS helps address this concern by using an ANN algorithm to do the cluster assignment. In particular, it provides an option to use k-means clustering (IVFFlat), but do the cluster assignment using HNSW: [https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index#how-big-is-the-dataset]. This seemed like a potentially interesting direction. * There could also be ways to streamline the k-means step. I experimented with FAISS's implementation of IVFFlat, and found that I could run very few k-means iterations, but still achieve similar performance. Here are some results on a dataset of ~1.2 million GloVe word vectors, using sqrt\(n) centroids. The cell values represent recall for a kNN search with k=10:
||approach||10 probes||20 probes||100 probes||200 probes||
|random centroids|0.578|0.68|0.902|0.961|
|k-means, 1 iter|0.729|0.821|0.961|0.987|
|k-means, 2 iters|0.775|0.854|0.968|0.989|
|k-means, 20 iters|0.806|0.875|0.972|0.991|
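To show where the indexing cost discussed above comes from, here is a minimal Lloyd-style k-means sketch in Java using the sqrt(n)-centroid heuristic and a capped iteration count. It is illustrative only: it is not FAISS's implementation and the benchmark numbers above were not produced with it. Each iteration costs roughly O(numVectors * numCentroids * dim), which is why running only one or two iterations makes such a difference.
{code:java}
import java.util.Random;

/** Minimal Lloyd k-means with a fixed iteration budget (illustration only). */
final class KMeansSketch {

  static float[][] train(float[][] vectors, int iterations, long seed) {
    int numCentroids = (int) Math.sqrt(vectors.length); // sqrt(n) heuristic
    int dim = vectors[0].length;
    Random random = new Random(seed);

    // Initialize centroids from randomly chosen vectors; running zero
    // iterations below corresponds to the "random centroids" row above.
    float[][] centroids = new float[numCentroids][];
    for (int c = 0; c < numCentroids; c++) {
      centroids[c] = vectors[random.nextInt(vectors.length)].clone();
    }

    for (int iter = 0; iter < iterations; iter++) {
      float[][] sums = new float[numCentroids][dim];
      int[] counts = new int[numCentroids];
      // Assignment step: compare every training vector against every centroid.
      for (float[] v : vectors) {
        int best = nearest(v, centroids);
        counts[best]++;
        for (int d = 0; d < dim; d++) {
          sums[best][d] += v[d];
        }
      }
      // Update step: move each centroid to the mean of its assigned vectors.
      for (int c = 0; c < numCentroids; c++) {
        if (counts[c] > 0) {
          for (int d = 0; d < dim; d++) {
            centroids[c][d] = sums[c][d] / counts[c];
          }
        }
      }
    }
    return centroids;
  }

  static int nearest(float[] v, float[][] centroids) {
    int best = 0;
    float bestDist = Float.MAX_VALUE;
    for (int c = 0; c < centroids.length; c++) {
      float dist = 0;
      for (int d = 0; d < v.length; d++) {
        float diff = v[d] - centroids[c][d];
        dist += diff * diff;
      }
      if (dist < bestDist) {
        bestDist = dist;
        best = c;
      }
    }
    return best;
  }
}
{code}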
[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search
[ https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17037325#comment-17037325 ] Julie Tibshirani edited comment on LUCENE-9136 at 2/14/20 10:04 PM: Hello [~irvingzhang], to me this looks like a really interesting direction! We also found in our research that k-means clustering (IVFFlat) could achieve high recall with a relatively low number of distance computations. It performs well compared to KD-trees and LSH, although it tends to require substantially more distance computations than HNSW. A nice property of the approach is that it's based on a classic algorithm, k-means – it is easy to understand, and has few tuning parameters. I wonder if this clustering-based approach could fit more closely in the current search framework. In the current prototype, we keep all the cluster information on-heap. We could instead try storing each cluster as its own 'term' with a postings list. The kNN query would then be modelled as an 'OR' over these terms. A major concern about clustering-based approaches is the high indexing cost. K-means is a heavy operation in itself. And even if we only use a subsample of documents during k-means, we must compare each indexed document to all centroids to assign it to the right cluster. With the heuristic of using sqrt\(n) centroids, this could give poor scaling behavior at index time. A couple thoughts on this point: * FAISS helps address this concern by using an ANN algorithm to do the cluster assignment. In particular, it provides an option to use k-means clustering (IVFFlat), but do the cluster assignment using HNSW: [https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index#how-big-is-the-dataset]. This seemed like a potentially interesting direction. * There could also be ways to streamline the k-means step. I experimented with FAISS's implementation of IVFFlat, and found that I could run very few k-means iterations, but still achieve similar performance. Here are some results on a dataset of ~1.2 million GloVe word vectors, using sqrt\(n) centroids. The cell values represent recall for a kNN search with k=10:
||approach||10 probes||20 probes||100 probes||200 probes||
|random centroids|0.578|0.68|0.902|0.961|
|k-means, 1 iter|0.729|0.821|0.961|0.987|
|k-means, 2 iters|0.775|0.854|0.968|0.989|
|k-means, 20 iters|0.806|0.875|0.972|0.991|
was (Author: jtibshirani): Hello [~irvingzhang], to me this looks like a really interesting direction! We also found in our research that k-means clustering (IVFFlat) could achieve high recall with a relatively low number of distance computations. It performs well compared to KD-trees and LSH, although it tends to require substantially more distance computations than HNSW. A nice property of the approach is that it's based on a classic algorithm, k-means – it is easy to understand, and has few tuning parameters. I wonder if this clustering-based approach could fit more closely in the current search framework. In the current prototype, we keep all the cluster information on-heap. We could instead try storing each cluster as its own 'term' with a postings list. The kNN query would then be modelled as an 'OR' over these terms. A major concern about clustering-based approaches is the high indexing cost. K-means is a heavy operation in itself. And even if we only use a subsample of documents during k-means, we must compare each indexed document to all centroids to assign it to the right cluster.
With the heuristic of using sqrt(n) centroids, this could give poor scaling behavior at index time. A couple thoughts on this point: * FAISS helps address this concern by using an ANN algorithm to do the cluster assignment. In particular, it provides an option to use k-means clustering (IVFFlat), but do the cluster assignment using HNSW: [https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index#how-big-is-the-dataset]. This seemed like a potentially interesting direction. * There could also be ways to streamline the k-means step. I experimented with FAISS's implementation of IVFFlat, and found that I could run very few k-means iterations, but still achieve similar performance. Here are some results on a dataset of ~1.2 million GloVe word vectors, using sqrt(n) centroids. The cell values represent recall for a kNN search with k=10:
||approach||10 probes||20 probes||100 probes||200 probes||
|random centroids|0.578|0.68|0.902|0.961|
|k-means, 1 iter|0.729|0.821|0.961|0.987|
|k-means, 2 iters|0.775|0.854|0.968|0.989|
|k-means, 20 iters|0.806|0.875|0.972|0.991|
[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search
[ https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17037325#comment-17037325 ] Julie Tibshirani edited comment on LUCENE-9136 at 2/14/20 10:03 PM: Hello [~irvingzhang], to me this looks like a really interesting direction! We also found in our research that k-means clustering (IVFFlat) could achieve high recall with a relatively low number of distance computations. It performs well compared to KD-trees and LSH, although it tends to require substantially more distance computations than HNSW. A nice property of the approach is that it's based on a classic algorithm, k-means – it is easy to understand, and has few tuning parameters. I wonder if this clustering-based approach could fit more closely in the current search framework. In the current prototype, we keep all the cluster information on-heap. We could instead try storing each cluster as its own 'term' with a postings list. The kNN query would then be modelled as an 'OR' over these terms. A major concern about clustering-based approaches is the high indexing cost. K-means is a heavy operation in itself. And even if we only use a subsample of documents during k-means, we must compare each indexed document to all centroids to assign it to the right cluster. With the heuristic of using sqrt(n) centroids, this could give poor scaling behavior at index time. A couple thoughts on this point: * FAISS helps address this concern by using an ANN algorithm to do the cluster assignment. In particular, it provides an option to use k-means clustering (IVFFlat), but do the cluster assignment using HNSW: [https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index#how-big-is-the-dataset]. This seemed like a potentially interesting direction. * There could also be ways to streamline the k-means step. I experimented with FAISS's implementation of IVFFlat, and found that I could run very few k-means iterations, but still achieve similar performance. Here are some results on a dataset of ~1.2 million GloVe word vectors, using sqrt(n) centroids. The cell values represent recall for a kNN search with k=10:
||approach||10 probes||20 probes||100 probes||200 probes||
|random centroids|0.578|0.68|0.902|0.961|
|k-means, 1 iter|0.729|0.821|0.961|0.987|
|k-means, 2 iters|0.775|0.854|0.968|0.989|
|k-means, 20 iters|0.806|0.875|0.972|0.991|
was (Author: jtibshirani): Hello [~irvingzhang], to me this looks like a really interesting direction! We also found in our research that k-means clustering (IVFFlat) could achieve high recall with a relatively low number of distance computations. It performs well compared to KD-trees and LSH, although it tends to require more distance computations than HNSW. A nice property of the approach is that it's based on a classic algorithm, k-means -- it is easy to understand, and has few tuning parameters. I wonder if this clustering-based approach could fit more closely in the current search framework. In the current prototype, we keep all the cluster information on-heap. We could instead try storing each cluster as its own 'term' with a postings list. The kNN query would then be modelled as an 'OR' over these terms. A major concern about clustering-based approaches is the high indexing cost. K-means is a heavy operation in itself. And even if we only use a subsample of documents during k-means, we must compare each indexed document to all centroids to assign it to the right cluster.
With the heuristic of using sqrt(n) centroids, this could give poor scaling behavior at index time. A couple thoughts on this point: * FAISS helps address this concern by using an ANN algorithm to do the cluster assignment. In particular, it provides an option to use k-means clustering (IVFFlat), but do the cluster assignment using HNSW: https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index#how-big-is-the-dataset. This seemed like a potentially interesting direction. * There could also be ways to streamline the k-means step. I experimented with FAISS's implementation of IVFFlat, and found that I could run very few k-means iterations, but still achieve similar performance. Here are some results on a dataset of ~1.2 million GloVe word vectors, using sqrt(n) centroids. The cell values represent recall for a kNN search with k=10:
||approach||10 probes||20 probes||100 probes||200 probes||
|random centroids|0.578|0.68|0.902|0.961|
|k-means, 1 iter|0.729|0.821|0.961|0.987|
|k-means, 2 iters|0.775|0.854|0.968|0.989|
|k-means, 20 iters|0.806|0.875|0.972|0.991|
[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search
[ https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019507#comment-17019507 ] Xin-Chun Zhang edited comment on LUCENE-9136 at 2/12/20 9:33 AM: - I worked on this issue for about three to four days, and it now works fine for searching. My personal dev branch is available on GitHub: [https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index format (only one meta file with suffix .ifi) of IVFFlat is shown in the class Lucene90IvfFlatIndexFormat. In my implementation, the clustering process is optimized when the number of vectors is sufficiently large (e.g. > 200,000 per segment): a shuffled subset is selected for training, thereby saving time and memory. The insertion performance of IVFFlat is better because no extra work is needed on insertion, while HNSW needs to maintain its graph. However, IVFFlat consumes more time in flushing because of the k-means clustering. My test cases show that the query performance of IVFFlat is better than HNSW's, even though HNSW uses a cache for graphs while IVFFlat has no cache. And its recall is pretty high (avg time < 10 ms and recall > 96% over a set of 5 random vectors with 100 dimensions). My test class for IVFFlat is under the directory [https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java]. A performance comparison between IVFFlat and HNSW is in the class TestKnnGraphAndIvfFlat. The work is still at an early stage; there must be some bugs that need to be fixed, and I would like to hear more comments. Everyone is welcome to participate in this issue. was (Author: irvingzhang): I worked on this issue for about three to four days, and it now works fine for searching. My personal dev branch is available on GitHub: [https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index format (only one meta file with suffix .ifi) of IVFFlat is shown in the class Lucene90IvfFlatIndexFormat. In my implementation, the clustering process is optimized when the number of vectors is sufficiently large (e.g. > 10,000,000 per segment): a shuffled subset is selected for training, thereby saving time and memory. The insertion performance of IVFFlat is better because no extra work is needed on insertion, while HNSW needs to maintain its graph. However, IVFFlat consumes more time in flushing because of the k-means clustering. My test cases show that the query performance of IVFFlat is better than HNSW's, even though HNSW uses a cache for graphs while IVFFlat has no cache. And its recall is pretty high (avg time < 10 ms and recall > 96% over a set of 5 random vectors with 100 dimensions). My test class for IVFFlat is under the directory [https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java]. A performance comparison between IVFFlat and HNSW is in the class TestKnnGraphAndIvfFlat. The work is still at an early stage; there must be some bugs that need to be fixed, and I would like to hear more comments. Everyone is welcome to participate in this issue.
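The subsampling optimization described in the comment above (training k-means on a shuffled subset once a segment exceeds a size threshold) can be sketched in a few lines of Java. This is only an illustration: the threshold and sample size are illustrative parameters, not the values hard-coded in the linked branch.
{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Sketch: pick a shuffled training subset for k-means on large segments. */
final class TrainingSampler {

  static List<float[]> selectTrainingSet(List<float[]> segmentVectors,
                                         int threshold, int sampleSize, long seed) {
    if (segmentVectors.size() <= threshold) {
      return segmentVectors; // small segment: train on everything
    }
    List<float[]> shuffled = new ArrayList<>(segmentVectors);
    Collections.shuffle(shuffled, new Random(seed));
    // Keep only the first sampleSize vectors after shuffling, trading a little
    // centroid quality for much lower training time and memory.
    return shuffled.subList(0, Math.min(sampleSize, shuffled.size()));
  }
}
{code}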
[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search
[ https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019507#comment-17019507 ] Xin-Chun Zhang edited comment on LUCENE-9136 at 2/11/20 8:21 AM: - I worked on this issue for about three to four days, and it now works fine for searching. My personal dev branch is available on GitHub: [https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index format (only one meta file with suffix .ifi) of IVFFlat is shown in the class Lucene90IvfFlatIndexFormat. In my implementation, the clustering process is optimized when the number of vectors is sufficiently large (e.g. > 10,000,000 per segment): a shuffled subset is selected for training, thereby saving time and memory. The insertion performance of IVFFlat is better because no extra work is needed on insertion, while HNSW needs to maintain its graph. However, IVFFlat consumes more time in flushing because of the k-means clustering. My test cases show that the query performance of IVFFlat is better than HNSW's, even though HNSW uses a cache for graphs while IVFFlat has no cache. And its recall is pretty high (avg time < 10 ms and recall > 96% over a set of 5 random vectors with 100 dimensions). My test class for IVFFlat is under the directory [https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java]. A performance comparison between IVFFlat and HNSW is in the class TestKnnGraphAndIvfFlat. The work is still at an early stage; there must be some bugs that need to be fixed, and I would like to hear more comments. Everyone is welcome to participate in this issue. was (Author: irvingzhang): I worked on this issue for about three to four days, and it now works fine for searching. My personal dev branch is available on GitHub: [https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index format (only one meta file with suffix .ifi) of IVFFlat is shown in the class Lucene90IvfFlatIndexFormat. In my implementation, the clustering process is optimized when the number of vectors is sufficiently large (e.g. > 5,000,000 per segment): a shuffled subset is selected for training, thereby saving time and memory. The insertion performance of IVFFlat is better because no extra work is needed on insertion, while HNSW needs to maintain its graph. However, IVFFlat consumes more time in flushing because of the k-means clustering. My test cases show that the query performance of IVFFlat is better than HNSW's, even though HNSW uses a cache for graphs while IVFFlat has no cache. And its recall is pretty high (avg time < 10 ms and recall > 96% over a set of 5 random vectors with 100 dimensions). My test class for IVFFlat is under the directory [https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java]. A performance comparison between IVFFlat and HNSW is in the class TestKnnGraphAndIvfFlat. The work is still at an early stage; there must be some bugs that need to be fixed, and I would like to hear more comments. Everyone is welcome to participate in this issue.
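For reference, a recall figure like the "recall > 96%" quoted above is typically computed by comparing the approximate top-k docIDs against an exact brute-force top-k for each query and averaging the overlap. The small Java sketch below shows that measurement in isolation; the actual measurement for these numbers lives in TestKnnGraphAndIvfFlat on the linked branch, not here.
{code:java}
import java.util.HashSet;
import java.util.Set;

/** Sketch: average recall@k over a batch of queries. */
final class RecallSketch {

  static double recallAtK(int[][] approximateTopK, int[][] exactTopK) {
    double total = 0;
    for (int q = 0; q < approximateTopK.length; q++) {
      // Ground truth for this query: the exact k nearest neighbors.
      Set<Integer> truth = new HashSet<>();
      for (int docId : exactTopK[q]) {
        truth.add(docId);
      }
      // Count how many approximate results are true nearest neighbors.
      int hits = 0;
      for (int docId : approximateTopK[q]) {
        if (truth.contains(docId)) {
          hits++;
        }
      }
      total += (double) hits / exactTopK[q].length;
    }
    return total / approximateTopK.length;
  }
}
{code}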
[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search
[ https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019507#comment-17019507 ] Xin-Chun Zhang edited comment on LUCENE-9136 at 2/11/20 8:02 AM: - I worked on this issue for about three to four days, and it now works fine for searching. My personal dev branch is available on GitHub: [https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index format (only one meta file with suffix .ifi) of IVFFlat is shown in the class Lucene90IvfFlatIndexFormat. In my implementation, the clustering process is optimized when the number of vectors is sufficiently large (e.g. > 5,000,000 per segment): a shuffled subset is selected for training, thereby saving time and memory. The insertion performance of IVFFlat is better because no extra work is needed on insertion, while HNSW needs to maintain its graph. However, IVFFlat consumes more time in flushing because of the k-means clustering. My test cases show that the query performance of IVFFlat is better than HNSW's, even though HNSW uses a cache for graphs while IVFFlat has no cache. And its recall is pretty high (avg time < 10 ms and recall > 96% over a set of 5 random vectors with 100 dimensions). My test class for IVFFlat is under the directory [https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java]. A performance comparison between IVFFlat and HNSW is in the class TestKnnGraphAndIvfFlat. The work is still at an early stage; there must be some bugs that need to be fixed, and I would like to hear more comments. Everyone is welcome to participate in this issue. was (Author: irvingzhang): I worked on this issue for about three to four days, and it now works fine for searching. My personal dev branch is available on GitHub: [https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index format (only one meta file with suffix .ifi) of IVFFlat is shown in the class Lucene90IvfFlatIndexFormat. In my implementation, the clustering process is optimized when the number of vectors is very large (e.g. > 200,000 per segment): a shuffled subset is selected for training, thereby saving time and memory. The insertion performance of IVFFlat is better because no extra work is needed on insertion, while HNSW needs to maintain its graph. However, IVFFlat consumes more time in flushing because of the k-means clustering. My test cases show that the query performance of IVFFlat is better than HNSW's, even though HNSW uses a cache for graphs while IVFFlat has no cache. And its recall is pretty high (avg time < 10 ms and recall > 96% over a set of 5 random vectors with 100 dimensions). My test class for IVFFlat is under the directory [https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java]. A performance comparison between IVFFlat and HNSW is in the class TestKnnGraphAndIvfFlat. The work is still at an early stage; there must be some bugs that need to be fixed, and I would like to hear more comments. Everyone is welcome to participate in this issue.
[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search
[ https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019507#comment-17019507 ] Xin-Chun Zhang edited comment on LUCENE-9136 at 1/21/20 1:40 PM: - I worked on this issue for about three to four days, and it now works fine for searching. My personal dev branch is available on GitHub: [https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index format (only one meta file with suffix .ifi) of IVFFlat is shown in the class Lucene90IvfFlatIndexFormat. In my implementation, the clustering process is optimized when the number of vectors is very large (e.g. > 200,000 per segment): a shuffled subset is selected for training, thereby saving time and memory. The insertion performance of IVFFlat is better because no extra work is needed on insertion, while HNSW needs to maintain its graph. However, IVFFlat consumes more time in flushing because of the k-means clustering. My test cases show that the query performance of IVFFlat is better than HNSW's, even though HNSW uses a cache for graphs while IVFFlat has no cache. And its recall is pretty high (avg time < 10 ms and recall > 96% over a set of 5 random vectors with 100 dimensions). My test class for IVFFlat is under the directory [https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java]. A performance comparison between IVFFlat and HNSW is in the class TestKnnGraphAndIvfFlat. The work is still at an early stage; there must be some bugs that need to be fixed, and I would like to hear more comments. Everyone is welcome to participate in this issue. was (Author: irvingzhang): I worked on this issue for about three to four days, and it now works fine for searching. My personal dev branch is available on GitHub: [https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index format of IVFFlat is shown in the class Lucene90IvfFlatIndexFormat. In my implementation, the clustering process is optimized when the number of vectors is very large (e.g. > 200,000 per segment): a shuffled subset is selected for training, thereby saving time and memory. The insertion performance of IVFFlat is better because no extra work is needed on insertion, while HNSW needs to maintain its graph. However, IVFFlat consumes more time in flushing because of the k-means clustering. My test cases show that the query performance of IVFFlat is better than HNSW's, even though HNSW uses a cache for graphs while IVFFlat has no cache. And its recall is pretty high (avg time < 10 ms and recall > 96% over a set of 5 random vectors with 100 dimensions). My test class for IVFFlat is under the directory [https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java]. A performance comparison between IVFFlat and HNSW is in the class TestKnnGraphAndIvfFlat. The work is still at an early stage; there must be some bugs that need to be fixed, and I would like to hear more comments. Everyone is welcome to participate in this issue.
[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search
[ https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019507#comment-17019507 ] Xin-Chun Zhang edited comment on LUCENE-9136 at 1/21/20 1:26 PM: - I worked on this issue for about three to four days, and it now works fine for searching. My personal dev branch is available on GitHub: [https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index format of IVFFlat is shown in the class Lucene90IvfFlatIndexFormat. In my implementation, the clustering process is optimized when the number of vectors is very large (e.g. > 200,000 per segment): a shuffled subset is selected for training, thereby saving time and memory. The insertion performance of IVFFlat is better because it does no extra work on insertion, while HNSW needs to maintain its graph. However, IVFFlat consumes more time in flushing because of the k-means clustering. My test cases show that the query performance of IVFFlat is better than that of HNSW, even though HNSW uses a graph cache while IVFFlat has no cache, and its recall is pretty high (average time < 10 ms and recall > 96% over a set of 5 random vectors with 100 dimensions). My test class for IVFFlat is under the directory [https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java]. A performance comparison between IVFFlat and HNSW is in the class TestKnnGraphAndIvfFlat. The work is still in its early stage; there must be some bugs that need to be fixed, and I would like to hear more comments. Everyone is welcome to participate in this issue.
was (Author: irvingzhang): I worked on this issue for about three to four days, and it now works fine for searching. My personal dev branch is available on GitHub: [https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index format of IVFFlat is shown in the class Lucene90IvfFlatIndexFormat. In my implementation, the clustering process is optimized when the number of vectors is very large (e.g. > 40,000 per segment): a shuffled subset is selected for training, thereby saving time and memory. The insertion performance of IVFFlat is better because it does no extra work on insertion, while HNSW needs to maintain its graph. However, IVFFlat consumes more time in flushing because of the k-means clustering. My test cases show that the query performance of IVFFlat is better than that of HNSW, even though HNSW uses a graph cache while IVFFlat has no cache, and its recall is pretty high (average time < 10 ms and recall > 96% over a set of 5 random vectors with 100 dimensions). My test class for IVFFlat is under the directory [https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java]. A performance comparison between IVFFlat and HNSW is in the class TestKnnGraphAndIvfFlat. The work is still in its early stage; there must be some bugs that need to be fixed, and I would like to hear more comments. Everyone is welcome to participate in this issue.
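For context on the query-performance and recall figures above, here is a minimal, self-contained sketch of how an IVFFlat-style search typically works: the query is compared against the cluster centroids, only the closest clusters are scanned exhaustively, and a small heap keeps the top-k nearest vectors. None of these class or method names come from the linked branch; they are illustrative only. Scanning more clusters (a larger nprobe) generally raises recall at the cost of query latency.

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical IVFFlat-style query sketch; not taken from the linked branch. */
final class IvfFlatSearchSketch {

  static final class ScoredDoc {
    final int docId;
    final float distance;
    ScoredDoc(int docId, float distance) {
      this.docId = docId;
      this.distance = distance;
    }
  }

  /**
   * Scans only the nprobe clusters whose centroids are closest to the query and
   * returns the topK nearest vectors found in them, smallest distance first.
   */
  static List<ScoredDoc> search(float[] query, float[][] centroids, int[][] clusterDocIds,
                                float[][][] clusterVectors, int nprobe, int topK) {
    // Rank clusters by centroid distance to the query.
    Integer[] clusterOrder = new Integer[centroids.length];
    for (int i = 0; i < centroids.length; i++) {
      clusterOrder[i] = i;
    }
    Arrays.sort(clusterOrder,
        Comparator.comparingDouble((Integer c) -> squaredDistance(query, centroids[c])));

    // Exhaustively scan the selected clusters, keeping the topK closest vectors in a max-heap.
    PriorityQueue<ScoredDoc> heap =
        new PriorityQueue<>(Comparator.comparingDouble((ScoredDoc d) -> d.distance).reversed());
    for (int i = 0; i < Math.min(nprobe, clusterOrder.length); i++) {
      int cluster = clusterOrder[i];
      for (int j = 0; j < clusterVectors[cluster].length; j++) {
        heap.offer(new ScoredDoc(clusterDocIds[cluster][j],
            squaredDistance(query, clusterVectors[cluster][j])));
        if (heap.size() > topK) {
          heap.poll(); // drop the current worst candidate
        }
      }
    }
    List<ScoredDoc> results = new ArrayList<>(heap);
    results.sort(Comparator.comparingDouble((ScoredDoc d) -> d.distance));
    return results;
  }

  /** Euclidean distance squared; sufficient for ranking nearest neighbours. */
  static float squaredDistance(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }
}
{code}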
[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search
[ https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019507#comment-17019507 ] Xin-Chun Zhang edited comment on LUCENE-9136 at 1/21/20 7:40 AM: - I worked on this issue for about three to four days, and it now works fine for searching. My personal dev branch is available on GitHub: [https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index format of IVFFlat is shown in the class Lucene90IvfFlatIndexFormat. In my implementation, the clustering process is optimized when the number of vectors is very large (e.g. > 40,000 per segment): a shuffled subset is selected for training, thereby saving time and memory. The insertion performance of IVFFlat is better because it does no extra work on insertion, while HNSW needs to maintain its graph. However, IVFFlat consumes more time in flushing because of the k-means clustering. My test cases show that the query performance of IVFFlat is better than that of HNSW, even though HNSW uses a graph cache while IVFFlat has no cache, and its recall is pretty high (average time < 10 ms and recall > 96% over a set of 5 random vectors with 100 dimensions). My test class for IVFFlat is under the directory [https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java]. A performance comparison between IVFFlat and HNSW is in the class TestKnnGraphAndIvfFlat. The work is still in its early stage; there must be some bugs that need to be fixed, and I would like to hear more comments. Everyone is welcome to participate in this issue.
was (Author: irvingzhang): I worked on this issue for about three to four days, and it now works fine for searching. My personal dev branch is available on GitHub: [https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index format of IVFFlat is shown in the class Lucene90IvfFlatIndexFormat. In my implementation, the clustering process is optimized when the number of vectors is very large (e.g. > 40,000 per segment): a shuffled subset is selected for training, thereby saving time and memory. The insertion performance of IVFFlat is better because it does no extra work on insertion, while HNSW needs to maintain its graph. However, IVFFlat consumes more time in flushing because of the k-means clustering. My test cases show that the query performance of IVFFlat is better than that of HNSW, even though HNSW uses a graph cache while IVFFlat has no cache, and its recall is pretty high (average time < 10 ms and recall > 97% over a set of 5 random vectors with 100 dimensions). My test class for IVFFlat is under the directory [https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java]. A performance comparison between IVFFlat and HNSW is in the class TestKnnGraphAndIvfFlat. The work is still in its early stage; there must be some bugs that need to be fixed, and I would like to hear more comments. Everyone is welcome to participate in this issue.
[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search
[ https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019507#comment-17019507 ] Xin-Chun Zhang edited comment on LUCENE-9136 at 1/21/20 6:31 AM: - I worked on this issue for about three to four days, and it now works fine for searching. My personal dev branch is available on GitHub: [https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index format of IVFFlat is shown in the class Lucene90IvfFlatIndexFormat. In my implementation, the clustering process is optimized when the number of vectors is very large (e.g. > 40,000 per segment): a shuffled subset is selected for training, thereby saving time and memory. The insertion performance of IVFFlat is better because it does no extra work on insertion, while HNSW needs to maintain its graph. However, IVFFlat consumes more time in flushing because of the k-means clustering. My test cases show that the query performance of IVFFlat is better than that of HNSW, even though HNSW uses a graph cache while IVFFlat has no cache, and its recall is pretty high (average time < 10 ms and recall > 97% over a set of 5 random vectors with 100 dimensions). My test class for IVFFlat is under the directory [https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java]. A performance comparison between IVFFlat and HNSW is in the class TestKnnGraphAndIvfFlat. The work is still in its early stage; there must be some bugs that need to be fixed, and I would like to hear more comments. Everyone is welcome to participate in this issue.
was (Author: irvingzhang): I worked on this issue for about three to four days, and it now works fine for searching. My personal dev branch is available on GitHub: [https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index format of IVFFlat is shown in the class Lucene90IvfFlatIndexFormat. In my implementation, the clustering process is optimized when the number of vectors is very large (e.g. > 40,000 per segment): a shuffled subset is selected for training, thereby saving time and memory. The insertion performance of IVFFlat is better because it does no extra work on insertion, while HNSW needs to maintain its graph. However, IVFFlat consumes more time in flushing because of the k-means clustering. My test cases show that the query performance of IVFFlat is better than that of HNSW, even though HNSW uses a graph cache while IVFFlat has no cache, and its recall is pretty high (recall > 97% over a set of 5 random vectors with 100 dimensions). My test class for IVFFlat is under the directory [https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java]. A performance comparison between IVFFlat and HNSW is in the class TestKnnGraphAndIvfFlat. The work is still in its early stage; there must be some bugs that need to be fixed, and I would like to hear more comments. Everyone is welcome to participate in this issue.
[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search
[ https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019507#comment-17019507 ] Xin-Chun Zhang edited comment on LUCENE-9136 at 1/21/20 3:44 AM: - I worked on this issue for about three to four days, and it now works fine for searching. My personal dev branch is available on GitHub: [https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index format of IVFFlat is shown in the class Lucene90IvfFlatIndexFormat. In my implementation, the clustering process is optimized when the number of vectors is very large (e.g. > 40,000 per segment): a shuffled subset is selected for training, thereby saving time and memory. The insertion performance of IVFFlat is better because it does no extra work on insertion, while HNSW needs to maintain its graph. However, IVFFlat consumes more time in flushing because of the k-means clustering. My test cases show that the query performance of IVFFlat is better than that of HNSW, even though HNSW uses a graph cache while IVFFlat has no cache, and its recall is pretty high (recall > 97% over a set of 5 random vectors with 100 dimensions). My test class for IVFFlat is under the directory [https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java]. A performance comparison between IVFFlat and HNSW is in the class TestKnnGraphAndIvfFlat. The work is still in its early stage; there must be some bugs that need to be fixed, and I would like to hear more comments. Everyone is welcome to participate in this issue.
was (Author: irvingzhang): I worked on this issue for about three to four days, and it now works fine for searching. My personal dev branch is available on GitHub: [https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index format of IVFFlat is shown in the class Lucene90IvfFlatIndexFormat. In my implementation, the clustering process is optimized when the number of vectors is very large (e.g. > 40,000 per segment): a shuffled subset is selected for training, thereby saving time and memory. The insertion performance of IVFFlat is better because it does no extra work on insertion, while HNSW needs to maintain its graph. However, IVFFlat consumes more time in flushing because of the k-means clustering. My test cases show that the query performance of IVFFlat is better than that of HNSW, even though HNSW uses a graph cache while IVFFlat has no cache, and its recall is pretty high (recall > 97% over a set of 5 random vectors with 100 dimensions). My test class for IVFFlat is under the directory [https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java]. A performance comparison between IVFFlat and HNSW is in the class TestKnnGraphAndIvfFlat. The work is still in its early stage; there must be some bugs that need to be fixed, and I would like to hear more comments. Anyone is welcome to participate in further development.
[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search
[ https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019507#comment-17019507 ] Xin-Chun Zhang edited comment on LUCENE-9136 at 1/21/20 3:41 AM: - I worked on this issue for about three to four days, and it now works fine for searching. My personal dev branch is available on GitHub: [https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index format of IVFFlat is shown in the class Lucene90IvfFlatIndexFormat. In my implementation, the clustering process is optimized when the number of vectors is very large (e.g. > 40,000 per segment): a shuffled subset is selected for training, thereby saving time and memory. The insertion performance of IVFFlat is better because it does no extra work on insertion, while HNSW needs to maintain its graph. However, IVFFlat consumes more time in flushing because of the k-means clustering. My test cases show that the query performance of IVFFlat is better than that of HNSW, even though HNSW uses a graph cache while IVFFlat has no cache, and its recall is pretty high (recall > 97% over a set of 5 random vectors with 100 dimensions). My test class for IVFFlat is under the directory [https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java]. A performance comparison between IVFFlat and HNSW is in the class TestKnnGraphAndIvfFlat. The work is still in its early stage; there must be some bugs that need to be fixed, and I would like to hear more comments. Anyone is welcome to participate in further development.
was (Author: irvingzhang): I worked on this issue for about three to four days, and it now works fine for searching. My personal dev branch is available on GitHub: [https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index format of IVFFlat is shown in the class Lucene90IvfFlatIndexFormat. In my implementation, the clustering process is optimized when the number of vectors is very large (e.g. > 40,000 per segment): a shuffled subset is selected for training, thereby saving time and memory. The insertion performance of IVFFlat is better because it does no extra work on insertion, while HNSW needs to maintain its graph. However, IVFFlat consumes more time in flushing because of the k-means clustering. My test cases show that the query performance of IVFFlat is better than that of HNSW, and its recall is pretty high (recall > 97% over a set of 5 random vectors with 100 dimensions). My test class for IVFFlat is under the directory [https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java]. A performance comparison between IVFFlat and HNSW is in the class TestKnnGraphAndIvfFlat. The work is still in its early stage; there must be some bugs that need to be fixed, and I would like to hear more comments.
[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search
[ https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019507#comment-17019507 ] Xin-Chun Zhang edited comment on LUCENE-9136 at 1/21/20 3:38 AM: - I worked on this issue for about three to four days, and it now works fine for searching. My personal dev branch is available on GitHub: [https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index format of IVFFlat is shown in the class Lucene90IvfFlatIndexFormat. In my implementation, the clustering process is optimized when the number of vectors is very large (e.g. > 40,000 per segment): a shuffled subset is selected for training, thereby saving time and memory. The insertion performance of IVFFlat is better because it does no extra work on insertion, while HNSW needs to maintain its graph. However, IVFFlat consumes more time in flushing because of the k-means clustering. My test cases show that the query performance of IVFFlat is better than that of HNSW, and its recall is pretty high (recall > 97% over a set of 5 random vectors with 100 dimensions). My test class for IVFFlat is under the directory [https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java]. A performance comparison between IVFFlat and HNSW is in the class TestKnnGraphAndIvfFlat. The work is still in its early stage; there must be some bugs that need to be fixed, and I would like to hear more comments.
was (Author: irvingzhang): I worked on this issue for about three to four days, and it now works fine for searching. In my implementation, the clustering process is optimized when the number of vectors is large (e.g. > 40,000 per segment): a shuffled subset is selected for training rather than the whole set of vectors, decreasing time and memory. The insertion performance of IVFFlat is better because it does no extra work on insertion, while HNSW needs to maintain its graph. However, IVFFlat consumes more time in flushing because of the k-means clustering. The designed format of the IVFFlat index is presented in the class Lucene90IvfFlatIndexFormat. My test cases show that the query performance of IVFFlat is slightly better than that of HNSW. My personal branch is here: [https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The test class for IVFFlat is under the directory [https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java]. A performance comparison between IVFFlat and HNSW is in the class TestKnnGraphAndIvfFlat. The work is still in its early stage; it has some code that is similar to HNSW, which could be refactored. Moreover, there must be some bugs that need to be fixed, and I would like to hear more comments.
[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search
[ https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019507#comment-17019507 ] Xin-Chun Zhang edited comment on LUCENE-9136 at 1/20/20 2:34 PM: - I worked on this issue for about three to four days, and it now works fine for searching. In my implementation, the clustering process is optimized when the number of vectors is large (e.g. > 40,000 per segment): a shuffled subset is selected for training rather than the whole set of vectors, decreasing time and memory. The insertion performance of IVFFlat is better because it does no extra work on insertion, while HNSW needs to maintain its graph. However, IVFFlat consumes more time in flushing because of the k-means clustering. The designed format of the IVFFlat index is presented in the class Lucene90IvfFlatIndexFormat. My test cases show that the query performance of IVFFlat is slightly better than that of HNSW. My personal branch is here: [https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The test class for IVFFlat is under the directory [https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java]. A performance comparison between IVFFlat and HNSW is in the class TestKnnGraphAndIvfFlat. The work is still in its early stage; it has some code that is similar to HNSW, which could be refactored. Moreover, there must be some bugs that need to be fixed, and I would like to hear more comments.
was (Author: irvingzhang): I worked on this issue for about three to four days, and it now works fine for searching. My personal branch is here: [https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The clustering process is optimized when the number of vectors is large (e.g. > 40,000 per segment). The query performance of IVFFlat seems slightly better than that of HNSW. The insertion performance of IVFFlat is also better than HNSW's because it does no extra work on insertion, while HNSW needs to maintain its graph. However, IVFFlat consumes more time in flushing due to the k-means clustering. The designed format of the IVFFlat index is presented in the class Lucene90IvfFlatIndexFormat. My test class for IVFFlat is under the directory [https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java]. A performance comparison between IVFFlat and HNSW is in the class TestKnnGraphAndIvfFlat. The work is still in its early stage; the performance could be further optimized, and it has some code that is similar to HNSW, which could be refactored. Moreover, there must be some bugs that need to be fixed, and I would like to hear more comments.