[ https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xin-Chun Zhang updated LUCENE-9136:
-----------------------------------
    Description:
Representation learning (RL) has been an established discipline in the machine learning space for decades, but it has drawn tremendous attention lately with the emergence of deep learning.
The central problem of RL is to determine an optimal representation of the input data. Once the data has been embedded into high-dimensional vectors, a vector retrieval (VR) method is applied to search for the relevant items.
With the rapid development of RL over the past few years, the technique has been used extensively in industry, from online advertising to computer vision and speech recognition. There exist many open source implementations of VR algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various choices for potential users. However, the aforementioned implementations are all written in C++, and there is no plan to support a Java interface [[https://github.com/facebookresearch/faiss/issues/105]].
The algorithms for vector retrieval can be roughly classified into four categories:
 # Tree-based algorithms, such as kd-tree;
 # Hashing methods, such as LSH (locality-sensitive hashing);
 # Quantization-based algorithms, such as IVFFlat;
 # Graph-based algorithms, such as HNSW, SSG, NSG.

IVFFlat and HNSW are the most popular of these algorithms.
Recently, the implementation of ANN algorithms for Lucene, such as HNSW (Hierarchical Navigable Small World) approximate nearest vector search, has made great progress. Compared with HNSW, IVFFlat requires less memory and disk space. Introducing IVFFlat to Lucene will provide one more option for interested users. I'm now working on the implementation of IVFFlat, and I will try my best to reuse the excellent work in LUCENE-9004.

  was:
Representation learning (RL) has been an established discipline in the machine learning space for decades, but it has drawn tremendous attention lately with the emergence of deep learning.
The central problem of RL is to determine an optimal representation of the input data. Once the data has been embedded into high-dimensional vectors, a vector retrieval (VR) method is applied to search for the relevant items.
With the rapid development of RL over the past few years, the technique has been used extensively in industry, from online advertising to computer vision and speech recognition. There exist many open source implementations of VR algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various choices for potential users. However, the aforementioned implementations are all written in C++, and there is no plan to support a Java interface [faiss|[https://github.com/facebookresearch/faiss/issues/105]].
The algorithms for vector retrieval can be roughly classified into four categories:
 # Tree-based algorithms, such as kd-tree;
 # Hashing methods, such as LSH (locality-sensitive hashing);
 # Quantization-based algorithms, such as IVFFlat;
 # Graph-based algorithms, such as HNSW, SSG, NSG.

IVFFlat and HNSW are the most popular of these algorithms.
Recently, the implementation of ANN algorithms for Lucene, such as HNSW (Hierarchical Navigable Small World) approximate nearest vector search, has made great progress. Compared with HNSW, IVFFlat requires less memory and disk space. Introducing IVFFlat to Lucene will provide one more option for interested users. I'm now working on the implementation of IVFFlat, and I will try my best to reuse the excellent work in LUCENE-9004.


> Introduce IVFFlat for ANN similarity search
> -------------------------------------------
>
>                 Key: LUCENE-9136
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9136
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Xin-Chun Zhang
>            Priority: Major

--
This message was sent by Atlassian Jira
(v8.3.4#803005)