[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-01-20 Thread Xin-Chun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019507#comment-17019507
 ] 

Xin-Chun Zhang edited comment on LUCENE-9136 at 1/21/20 7:40 AM:
-

I worked on this issue for about three to four days. And it now works fine for 
searching.

My personal dev branch is available in github 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index 
format of IVFFlat is shown in the class Lucene90IvfFlatIndexFormat. In my 
implementation, the clustering process was optimized when the number of vectors 
is very large (e.g. > 40,000 per segment). A subset after shuffling is selected 
for training, thereby saving time and memory. The insertion performance of 
IVFFlat is better due to no extra executions on insertion while HNSW needs to 
maintain the graph. However, IVFFlat consumes more time in flushing because of 
the k-means clustering.
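
For readers who want a concrete picture of the training step described above, here is a minimal sketch. It is not the code in the branch: the class name `IvfFlatTrainingSketch`, the method names, the `maxTrainingSize` parameter, and the choice of the first k training vectors as initial centroids are all assumptions made for illustration. It shuffles the vector ordinals, keeps a bounded subset, and runs plain Lloyd's k-means on that subset only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of subset-based k-means training; not the branch's implementation. */
final class IvfFlatTrainingSketch {

  /** Returns at most maxTrainingSize vectors, picked by shuffling the ordinals. */
  static float[][] sampleForTraining(float[][] vectors, int maxTrainingSize, long seed) {
    if (vectors.length <= maxTrainingSize) {
      return vectors; // small segment: train on everything
    }
    List<Integer> ords = new ArrayList<>(vectors.length);
    for (int i = 0; i < vectors.length; i++) {
      ords.add(i);
    }
    Collections.shuffle(ords, new Random(seed));
    float[][] subset = new float[maxTrainingSize][];
    for (int i = 0; i < maxTrainingSize; i++) {
      subset[i] = vectors[ords.get(i)];
    }
    return subset;
  }

  static float squaredDistance(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  static int nearestCentroid(float[] v, float[][] centroids) {
    int best = 0;
    float bestDist = Float.MAX_VALUE;
    for (int c = 0; c < centroids.length; c++) {
      float d = squaredDistance(v, centroids[c]);
      if (d < bestDist) {
        bestDist = d;
        best = c;
      }
    }
    return best;
  }

  /**
   * Plain Lloyd's k-means. Assumes training.length >= k; the initial centroids
   * are simply the first k training vectors (an assumption of this sketch).
   */
  static float[][] kMeans(float[][] training, int k, int iterations) {
    int dim = training[0].length;
    float[][] centroids = new float[k][];
    for (int c = 0; c < k; c++) {
      centroids[c] = training[c].clone();
    }
    for (int it = 0; it < iterations; it++) {
      float[][] sums = new float[k][dim];
      int[] counts = new int[k];
      for (float[] v : training) {
        int c = nearestCentroid(v, centroids);
        counts[c]++;
        for (int d = 0; d < dim; d++) {
          sums[c][d] += v[d];
        }
      }
      for (int c = 0; c < k; c++) {
        if (counts[c] == 0) {
          continue; // an empty cluster keeps its previous centroid
        }
        for (int d = 0; d < dim; d++) {
          centroids[c][d] = sums[c][d] / counts[c];
        }
      }
    }
    return centroids;
  }
}
```

In this sketch the cost of each k-means iteration depends on the size of the sampled subset rather than on the full segment, which is where the time and memory saving comes from. The full set of vectors would still have to be assigned to the trained centroids once, at flush time.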

My test cases show that the query performance of IVFFlat is better than HNSW, 
even if HNSW uses a cache for graphs while IVFFlat has no cache. And its recall 
is pretty high (avg time < 10ms and recall>96% over a set of 5 random 
vectors with 100 dimensions). My test class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 Performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.
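
As background for the query numbers above, the following is a rough sketch of how an IVFFlat query generally works: find the few centroids closest to the query, then scan only the vectors in those clusters exhaustively. It is an illustration, not the search code in the branch. The class `IvfFlatSearchSketch`, its fields, and the `nProbe` parameter name (borrowed from the FAISS parameter `nprobe`) are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical in-memory IVFFlat lookup; not the branch's implementation. */
final class IvfFlatSearchSketch {
  final float[][] centroids;
  final float[][] vectors;
  /** invertedLists.get(c) holds the ordinals of the vectors assigned to centroid c. */
  final List<List<Integer>> invertedLists = new ArrayList<>();

  IvfFlatSearchSketch(float[][] centroids, float[][] vectors) {
    this.centroids = centroids;
    this.vectors = vectors;
    for (int c = 0; c < centroids.length; c++) {
      invertedLists.add(new ArrayList<>());
    }
    // Assign every vector to its single nearest centroid.
    for (int ord = 0; ord < vectors.length; ord++) {
      invertedLists.get(nearestCentroids(vectors[ord], 1)[0]).add(ord);
    }
  }

  static float squaredDistance(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /** Ordinals of the nProbe centroids closest to the query, nearest first. */
  int[] nearestCentroids(float[] query, int nProbe) {
    Integer[] order = new Integer[centroids.length];
    for (int c = 0; c < order.length; c++) {
      order[c] = c;
    }
    Arrays.sort(order,
        Comparator.comparingDouble((Integer c) -> squaredDistance(query, centroids[c])));
    int n = Math.min(nProbe, order.length);
    int[] result = new int[n];
    for (int i = 0; i < n; i++) {
      result[i] = order[i];
    }
    return result;
  }

  /** Up to k vector ordinals, nearest first, scanning only the probed clusters. */
  int[] search(float[] query, int k, int nProbe) {
    List<Integer> candidates = new ArrayList<>();
    for (int c : nearestCentroids(query, nProbe)) {
      candidates.addAll(invertedLists.get(c));
    }
    // Sorting all candidates keeps the sketch short; a bounded heap would do less work.
    candidates.sort(
        Comparator.comparingDouble((Integer ord) -> squaredDistance(query, vectors[ord])));
    int n = Math.min(k, candidates.size());
    int[] top = new int[n];
    for (int i = 0; i < n; i++) {
      top[i] = candidates.get(i);
    }
    return top;
  }
}
```

In a structure like this, a larger `nProbe` scans more clusters, which tends to raise recall at the cost of query time. Recall is then usually measured by comparing the returned ordinals against an exhaustive scan of all vectors.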

The work is still in its early stage. There must be some bugs that need to be 
fixed, and I would like to hear more comments. Everyone is welcome to 
participate in this issue.


was (Author: irvingzhang):
I worked on this issue for about three to four days. And it now works fine for 
searching.

My personal dev branch is available in github 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index 
format of IVFFlat is shown in the class Lucene90IvfFlatIndexFormat. In my 
implementation, the clustering process was optimized when the number of vectors 
is very large (e.g. > 40,000 per segment). A subset after shuffling is selected 
for training, thereby saving time and memory. The insertion performance of 
IVFFlat is better due to no extra executions on insertion while HNSW needs to 
maintain the graph. However, IVFFlat consumes more time in flushing because of 
the k-means clustering.

My test cases show that the query performance of IVFFlat is better than HNSW, 
even if HNSW uses a cache for graphs while IVFFlat has no cache. And its recall 
is pretty high (avg time < 10ms and recall>97% over a set of 5 random 
vectors with 100 dimensions). My test class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 Performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still in its early stage. There must be some bugs that need to be 
fixed, and I would like to hear more comments. Everyone is welcome to 
participate in this issue.

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades but it draws tremendous attention lately 
> with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. By embedding the data 
> into a high dimensional vector, the vector retrieval (VR) method is then 
> applied to search the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++, with no plan to support a Java interface, making them hard 
> to integrate into Java projects and hard to use for those who are not familiar with C/C++ 
> [[https://github.com/facebookresearch/faiss/issues/105]]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-based algorithms, such as KD-tree;
>  # H

[jira] [Comment Edited] (LUCENE-9004) Approximate nearest vector search

2020-01-20 Thread Xin-Chun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019519#comment-17019519
 ] 

Xin-Chun Zhang edited comment on LUCENE-9004 at 1/21/20 6:53 AM:
-

I created a related issue [#LUCENE-9136] that attempts to introduce IVFFlat 
algorithm to Lucene. IVFFlat is widely used in many fields, from computer 
vision to speech recognition for its smaller index and memory usage. And it 
supports GPU parallel computing, making it faster and more accurate than HNSW. 

My personal branch is available in github 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index 
format of IVFFlat can be seen in the class Lucene90IvfFlatIndexFormat. In my 
implementation, the clustering process was optimized when the number of vectors 
is very large (e.g. > 40,000 per segment). A subset after shuffling is selected 
for training, thereby saving time and memory. The insertion performance of 
IVFFlat is better due to no extra executions on insertion while HNSW needs to 
maintain the graph. However, IVFFlat consumes more time in flushing because of 
the k-means clustering.

Even if HNSW uses a cache for graphs while IVFFlat has no cache, my test cases 
show that the query performance of IVFFlat is better than HNSW, and its recall 
is pretty high (avg time < 10ms and recall > 96% over a set of 5 random 
vectors with 100 dimensions). My test class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 Performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still in its early stage. There must be some bugs that need to be 
fixed, and I would like to hear more comments. 


was (Author: irvingzhang):
I created a related issue [#LUCENE-9136] that attempts to introduce IVFFlat 
algorithm to Lucene. IVFFlat is widely used in many fields, from computer 
vision to speech recognition for its smaller index and memory usage. And it 
supports GPU parallel computing, making it faster and more accurate than HNSW. 

My personal branch is available in github 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index 
format of IVFFlat can be seen in the class Lucene90IvfFlatIndexFormat. In my 
implementation, the clustering process was optimized when the number of vectors 
is very large (e.g. > 40,000 per segment). A subset after shuffling is selected 
for training, thereby saving time and memory. The insertion performance of 
IVFFlat is better due to no extra executions on insertion while HNSW needs to 
maintain the graph. However, IVFFlat consumes more time in flushing because of 
the k-means clustering.

Even if HNSW uses a cache for graphs while IVFFlat has no cache, my test cases 
show that the query performance of IVFFlat is better than HNSW, and its recall 
is pretty high (avg time < 10ms and recall > 96% over a set of 5 random 
vectors with 100 dimensions). My test class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 Performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still in its early stage. It currently has some code similar to 
HNSW, which could be refactored. Moreover, there must be some bugs that need 
to be fixed, and I would like to hear more comments.

> Approximate nearest vector search
> -
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
> Attachments: hnsw_layered_graph.png
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found an approach based on navigating a graph that partially encodes the 
> nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost. This 
> issue will explore implementing HNSW (hierarchical navi

[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-01-20 Thread Xin-Chun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019507#comment-17019507
 ] 

Xin-Chun Zhang edited comment on LUCENE-9136 at 1/21/20 6:31 AM:
-

I worked on this issue for about three to four days. And it now works fine for 
searching.

My personal dev branch is available in github 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index 
format of IVFFlat is shown in the class Lucene90IvfFlatIndexFormat. In my 
implementation, the clustering process was optimized when the number of vectors 
is very large (e.g. > 40,000 per segment). A subset after shuffling is selected 
for training, thereby saving time and memory. The insertion performance of 
IVFFlat is better due to no extra executions on insertion while HNSW needs to 
maintain the graph. However, IVFFlat consumes more time in flushing because of 
the k-means clustering.

My test cases show that the query performance of IVFFlat is better than HNSW, 
even if HNSW uses a cache for graphs while IVFFlat has no cache. And its recall 
is pretty high (avg time < 10ms and recall>97% over a set of 5 random 
vectors with 100 dimensions). My test class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 Performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still in its early stage. There must be some bugs that need to be 
fixed, and I would like to hear more comments. Everyone is welcome to 
participate in this issue.


was (Author: irvingzhang):
I worked on this issue for about three to four days. And it now works fine for 
searching.

My personal dev branch is available in github 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index 
format of IVFFlat is shown in the class Lucene90IvfFlatIndexFormat. In my 
implementation, the clustering process was optimized when the number of vectors 
is very large (e.g. > 40,000 per segment). A subset after shuffling is selected 
for training, thereby saving time and memory. The insertion performance of 
IVFFlat is better due to no extra executions on insertion while HNSW needs to 
maintain the graph. However, IVFFlat consumes more time in flushing because of 
the k-means clustering.

My test cases show that the query performance of IVFFlat is better than HNSW, 
even if HNSW uses a cache for graphs while IVFFlat has no cache. And its recall 
is pretty high (recall>97% over a set of 5 random vectors with 100 
dimensions). My test class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 Performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still in its early stage. There must be some bugs that need to be 
fixed, and I would like to hear more comments. Everyone is welcome to 
participate in this issue.

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades but it draws tremendous attention lately 
> with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. By embedding the data 
> into a high dimensional vector, the vector retrieval (VR) method is then 
> applied to search the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++, with no plan to support a Java interface, making them hard 
> to integrate into Java projects and hard to use for those who are not familiar with C/C++ 
> [[https://github.com/facebookresearch/faiss/issues/105]]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-based algorithms, such as KD-tree;
>  # Hashing methods, such

[jira] [Comment Edited] (LUCENE-9004) Approximate nearest vector search

2020-01-20 Thread Xin-Chun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019519#comment-17019519
 ] 

Xin-Chun Zhang edited comment on LUCENE-9004 at 1/21/20 6:30 AM:
-

I created a related issue [#LUCENE-9136] that attempts to introduce IVFFlat 
algorithm to Lucene. IVFFlat is widely used in many fields, from computer 
vision to speech recognition for its smaller index and memory usage. And it 
supports GPU parallel computing, making it faster and more accurate than HNSW. 

My personal branch is available in github 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index 
format of IVFFlat can be seen in the class Lucene90IvfFlatIndexFormat. In my 
implementation, the clustering process was optimized when the number of vectors 
is very large (e.g. > 40,000 per segment). A subset after shuffling is selected 
for training, thereby saving time and memory. The insertion performance of 
IVFFlat is better due to no extra executions on insertion while HNSW needs to 
maintain the graph. However, IVFFlat consumes more time in flushing because of 
the k-means clustering.

Even if HNSW uses a cache for graphs while IVFFlat has no cache, my test cases 
show that the query performance of IVFFlat is better than HNSW, and its recall 
is pretty high (avg time < 10ms and recall > 96% over a set of 5 random 
vectors with 100 dimensions). My test class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 Performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still in its early stage. It currently has some code similar to 
HNSW, which could be refactored. Moreover, there must be some bugs that need 
to be fixed, and I would like to hear more comments.


was (Author: irvingzhang):
I created a related issue [#LUCENE-9136] that attempts to introduce IVFFlat 
algorithm to Lucene. IVFFlat is widely used in many fields, from computer 
vision to speech recognition for its smaller index and memory usage. And it 
supports GPU parallel computing, making it faster and more accurate than HNSW. 

My personal branch is available in github 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index 
format of IVFFlat can be seen in the class Lucene90IvfFlatIndexFormat. In my 
implementation, the clustering process was optimized when the number of vectors 
is very large (e.g. > 40,000 per segment). A subset after shuffling is selected 
for training, thereby saving time and memory. The insertion performance of 
IVFFlat is better due to no extra executions on insertion while HNSW needs to 
maintain the graph. However, IVFFlat consumes more time in flushing because of 
the k-means clustering.

Even if HNSW uses a cache for graphs while IVFFlat has no cache, my test cases 
show that the query performance of IVFFlat is better than HNSW, and its recall 
is pretty high (recall>97% over a set of 5 random vectors with 100 
dimensions). My test class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 Performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still in its early stage. It currently has some code similar to 
HNSW, which could be refactored. Moreover, there must be some bugs that need 
to be fixed, and I would like to hear more comments.

> Approximate nearest vector search
> -
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
> Attachments: hnsw_layered_graph.png
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found an approach based on navigating a graph that partially encodes the 
> nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost

[jira] [Comment Edited] (LUCENE-9004) Approximate nearest vector search

2020-01-20 Thread Xin-Chun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019519#comment-17019519
 ] 

Xin-Chun Zhang edited comment on LUCENE-9004 at 1/21/20 3:45 AM:
-

I created a related issue [#LUCENE-9136] that attempts to introduce IVFFlat 
algorithm to Lucene. IVFFlat is widely used in many fields, from computer 
vision to speech recognition for its smaller index and memory usage. And it 
supports GPU parallel computing, making it faster and more accurate than HNSW. 

My personal branch is available in github 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index 
format of IVFFlat can be seen in the class Lucene90IvfFlatIndexFormat. In my 
implementation, the clustering process was optimized when the number of vectors 
is very large (e.g. > 40,000 per segment). A subset after shuffling is selected 
for training, thereby saving time and memory. The insertion performance of 
IVFFlat is better due to no extra executions on insertion while HNSW needs to 
maintain the graph. However, IVFFlat consumes more time in flushing because of 
the k-means clustering.

Even if HNSW uses a cache for graphs while IVFFlat has no cache, my test cases 
show that the query performance of IVFFlat is better than HNSW, and its recall 
is pretty high (recall>97% over a set of 5 random vectors with 100 
dimensions). My test class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 Performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still in its early stage. It currently has some code similar to 
HNSW, which could be refactored. Moreover, there must be some bugs that need 
to be fixed, and I would like to hear more comments.


was (Author: irvingzhang):
I created a related issue [#LUCENE-9136] that attempts to introduce IVFFlat 
algorithm to Lucene. IVFFlat is widely used in many fields, from computer 
vision to speech recognition for its smaller index and memory usage. And it 
supports GPU parallel computing, making it faster and more accurate than HNSW. 

My personal branch is available in github 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The format 
of IVFFlat index can be seen in the class Lucene90IvfFlatIndexFormat. In my 
implementation, the clustering process was optimized when the number of vectors 
is very large (e.g. > 40,000 per segment). A subset after shuffling is selected 
for training, thereby saving time and memory. The insertion performance of 
IVFFlat is better due to no extra executions on insertion while HNSW needs to 
maintain the graph. However, IVFFlat consumes more time in flushing because of 
the k-means clustering.

Even if HNSW uses a cache for graphs while IVFFlat has no cache, my test cases 
show that the query performance of IVFFlat is better than HNSW, and its recall 
is pretty high (recall>97% over a set of 5 random vectors with 100 
dimensions). My test class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 Performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still in its early stage. It currently has some code similar to 
HNSW, which could be refactored. Moreover, there must be some bugs that need 
to be fixed, and I would like to hear more comments.

> Approximate nearest vector search
> -
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
> Attachments: hnsw_layered_graph.png
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found an approach based on navigating a graph that partially encodes the 
> nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost. This 
> issue will e

[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-01-20 Thread Xin-Chun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019507#comment-17019507
 ] 

Xin-Chun Zhang edited comment on LUCENE-9136 at 1/21/20 3:44 AM:
-

I worked on this issue for about three to four days. And it now works fine for 
searching.

My personal dev branch is available in github 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index 
format of IVFFlat is shown in the class Lucene90IvfFlatIndexFormat. In my 
implementation, the clustering process was optimized when the number of vectors 
is very large (e.g. > 40,000 per segment). A subset after shuffling is selected 
for training, thereby saving time and memory. The insertion performance of 
IVFFlat is better due to no extra executions on insertion while HNSW needs to 
maintain the graph. However, IVFFlat consumes more time in flushing because of 
the k-means clustering.

My test cases show that the query performance of IVFFlat is better than HNSW, 
even if HNSW uses a cache for graphs while IVFFlat has no cache. And its recall 
is pretty high (recall>97% over a set of 5 random vectors with 100 
dimensions). My test class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 Performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still in its early stage. There must be some bugs that need to be 
fixed, and I would like to hear more comments. Everyone is welcome to 
participate in this issue.


was (Author: irvingzhang):
I worked on this issue for about three to four days. And it now works fine for 
searching.

My personal dev branch is available in github 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index 
format of IVFFlat is shown in the class Lucene90IvfFlatIndexFormat. In my 
implementation, the clustering process was optimized when the number of vectors 
is very large (e.g. > 40,000 per segment). A subset after shuffling is selected 
for training, thereby saving time and memory. The insertion performance of 
IVFFlat is better due to no extra executions on insertion while HNSW needs to 
maintain the graph. However, IVFFlat consumes more time in flushing because of 
the k-means clustering.

My test cases show that the query performance of IVFFlat is better than HNSW, 
even if HNSW uses a cache for graphs while IVFFlat has no cache. And its recall 
is pretty high (recall>97% over a set of 5 random vectors with 100 
dimensions). My test class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 Performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still in its early stage. There must be some bugs that need to be 
fixed, and I would like to hear more comments. Anyone is welcome to 
participate in further development.

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades but it draws tremendous attention lately 
> with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. By embedding the data 
> into a high dimensional vector, the vector retrieval (VR) method is then 
> applied to search the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++, with no plan to support a Java interface, making them hard 
> to integrate into Java projects and hard to use for those who are not familiar with C/C++ 
> [[https://github.com/facebookresearch/faiss/issues/105]]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-based algorithms, such as KD-tree;
>  # Hashing methods, such as LSH (Loca

[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-01-20 Thread Xin-Chun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019507#comment-17019507
 ] 

Xin-Chun Zhang edited comment on LUCENE-9136 at 1/21/20 3:41 AM:
-

I worked on this issue for about three to four days. And it now works fine for 
searching.

My personal dev branch is available in github 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index 
format of IVFFlat is shown in the class Lucene90IvfFlatIndexFormat. In my 
implementation, the clustering process was optimized when the number of vectors 
is very large (e.g. > 40,000 per segment). A subset after shuffling is selected 
for training, thereby saving time and memory. The insertion performance of 
IVFFlat is better due to no extra executions on insertion while HNSW needs to 
maintain the graph. However, IVFFlat consumes more time in flushing because of 
the k-means clustering.

My test cases show that the query performance of IVFFlat is better than HNSW, 
even if HNSW uses a cache for graphs while IVFFlat has no cache. And its recall 
is pretty high (recall>97% over a set of 5 random vectors with 100 
dimensions). My test class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 Performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still in its early stage. There must be some bugs that need to be 
fixed, and I would like to hear more comments. Anyone is welcome to 
participate in further development.


was (Author: irvingzhang):
I worked on this issue for about three to four days. And it now works fine for 
searching.

My personal dev branch is available in github 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index 
format of IVFFlat is shown in the class Lucene90IvfFlatIndexFormat. In my 
implementation, the clustering process was optimized when the number of vectors 
is very large (e.g. > 40,000 per segment). A subset after shuffling is selected 
for training, thereby saving time and memory. The insertion performance of 
IVFFlat is better due to no extra executions on insertion while HNSW needs to 
maintain the graph. However, IVFFlat consumes more time in flushing because of 
the k-means clustering.

My test cases show that the query performance of IVFFlat is better than HNSW. 
And its recall is pretty high (recall>97% over a set of 5 random vectors 
with 100 dimensions). My test class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 Performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still in its early stage. There must be some bugs that need to be 
fixed, and I would like to hear more comments.

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades but it draws tremendous attention lately 
> with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. By embedding the data 
> into a high dimensional vector, the vector retrieval (VR) method is then 
> applied to search the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++, with no plan to support a Java interface, making them hard 
> to integrate into Java projects or to use for those who are not familiar with C/C++ 
> [[https://github.com/facebookresearch/faiss/issues/105]]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-based algorithms, such as KD-tree;
>  # Hashing methods, such as LSH (Locality-Sensitive Hashing);
>  # Product quantization algorithms, such as IVFFlat;
>  # Graph-based algorithms, such as HNSW,

[jira] [Comment Edited] (LUCENE-9004) Approximate nearest vector search

2020-01-20 Thread Xin-Chun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019519#comment-17019519
 ] 

Xin-Chun Zhang edited comment on LUCENE-9004 at 1/21/20 3:42 AM:
-

I created a related issue [#LUCENE-9136] that attempts to introduce IVFFlat 
algorithm to Lucene. IVFFlat is widely used in many fields, from computer 
vision to speech recognition for its smaller index and memory usage. And it 
supports GPU parallel computing, making it faster and more accurate than HNSW. 

My personal branch is available on GitHub 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The format 
of IVFFlat index can be seen in the class Lucene90IvfFlatIndexFormat. In my 
implementation, the clustering process was optimized when the number of vectors 
is very large (e.g. > 40,000 per segment). A subset after shuffling is selected 
for training, thereby saving time and memory. The insertion performance of 
IVFFlat is better due to no extra executions on insertion while HNSW needs to 
maintain the graph. However, IVFFlat consumes more time in flushing because of 
the k-means clustering.
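
For readers less familiar with IVFFlat, a query is typically answered by ranking the k-means centroids by distance to the query, visiting only the closest few clusters, and scanning their member vectors exactly. The sketch below illustrates that general technique; it is not the branch's implementation, the class and method names are hypothetical, and the parameter name nprobe is borrowed from FAISS usage.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration of an IVFFlat query; not the LUCENE-9136 code. */
final class IvfFlatSearchSketch {

  /** Squared Euclidean distance between two vectors of equal length. */
  static float squaredDistance(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /**
   * Returns up to k vector ordinals, nearest first. invertedLists[c] holds the
   * ordinals of the vectors assigned to centroid c.
   */
  static int[] search(float[][] centroids, int[][] invertedLists, float[][] vectors,
                      float[] query, int nprobe, int k) {
    // Rank centroid indices by distance to the query.
    List<Integer> centroidOrder = new ArrayList<>();
    for (int c = 0; c < centroids.length; c++) {
      centroidOrder.add(c);
    }
    centroidOrder.sort(Comparator.comparingDouble(
        (Integer c) -> squaredDistance(query, centroids[c])));

    // Gather the members of the nprobe closest clusters.
    List<Integer> candidates = new ArrayList<>();
    int probes = Math.min(nprobe, centroidOrder.size());
    for (int p = 0; p < probes; p++) {
      for (int ord : invertedLists[centroidOrder.get(p)]) {
        candidates.add(ord);
      }
    }

    // "Flat" step: exact distances over the uncompressed candidate vectors.
    // Distances are recomputed inside the comparator for brevity, not speed.
    candidates.sort(Comparator.comparingDouble(
        (Integer ord) -> squaredDistance(query, vectors[ord])));

    int n = Math.min(k, candidates.size());
    int[] top = new int[n];
    for (int i = 0; i < n; i++) {
      top[i] = candidates.get(i);
    }
    return top;
  }

  public static void main(String[] args) {
    float[][] centroids = {{0f, 0f}, {10f, 10f}};
    float[][] vectors = {{0f, 1f}, {2f, 0f}, {10f, 9f}, {9f, 10f}, {0.1f, 0.1f}};
    int[][] lists = {{0, 1, 4}, {2, 3}};
    int[] hits = search(centroids, lists, vectors, new float[] {0f, 0f}, 1, 2);
    System.out.println(java.util.Arrays.toString(hits));
  }
}
```

Raising nprobe scans more clusters, which generally trades query time for recall; a graph index such as HNSW has no direct equivalent of this knob.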

Even if HNSW uses a cache for graphs while IVFFlat has no cache, my test cases 
show that the query performance of IVFFlat is better than HNSW, and its recall 
is pretty high (recall>97% over a set of 5 random vectors with 100 
dimensions). My test class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 Performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still in its early stage. It currently has some code that is similar to 
HNSW, which could be refactored. Moreover, there must be some bugs that need 
to be fixed, and I would like to hear more comments.


was (Author: irvingzhang):
I created a related issue [#LUCENE-9136] that attempts to introduce IVFFlat 
algorithm to Lucene. IVFFlat is widely used in many fields, from computer 
vision to speech recognition for its smaller index and memory usage. And it 
supports GPU parallel computing, making it faster and more accurate than HNSW. 

My personal 
[branch|[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]] is 
available on GitHub. The format of IVFFlat index can be seen in the class 
Lucene90IvfFlatIndexFormat. In my implementation, the clustering process was 
optimized when the number of vectors is very large (e.g. > 40,000 per segment). 
A subset after shuffling is selected for training, thereby saving time and 
memory. The insertion performance of IVFFlat is better due to no extra 
executions on insertion while HNSW needs to maintain the graph. However, IVFFlat 
consumes more time in flushing because of the k-means clustering.

My test cases show that the query performance of IVFFlat is better than HNSW, 
and its recall is pretty high (recall>97% over a set of 5 random vectors 
with 100 dimensions). My test class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 Performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still in its early stage. It currently has some code that is similar to 
HNSW, which could be refactored. Moreover, there must be some bugs that need 
to be fixed, and I would like to hear more comments.

> Approximate nearest vector search
> -
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
> Attachments: hnsw_layered_graph.png
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found an approach based on navigating a graph that partially encodes the 
> nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost. This 
> issue will explore implementing HNSW (hierarchical navigable small-world) 
>

[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-01-20 Thread Xin-Chun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019507#comment-17019507
 ] 

Xin-Chun Zhang edited comment on LUCENE-9136 at 1/21/20 3:38 AM:
-

I worked on this issue for about three to four days. And it now works fine for 
searching.

My personal dev branch is available on GitHub 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. The index 
format of IVFFlat is shown in the class Lucene90IvfFlatIndexFormat. In my 
implementation, the clustering process was optimized when the number of vectors 
is very large (e.g. > 40,000 per segment). A subset after shuffling is selected 
for training, thereby saving time and memory. The insertion performance of 
IVFFlat is better due to no extra executions on insertion while HNSW needs to 
maintain the graph. However, IVFFlat consumes more time in flushing because of 
the k-means clustering.

My test cases show that the query performance of IVFFlat is better than HNSW. 
And its recall is pretty high (recall>97% over a set of 5 random vectors 
with 100 dimensions). My test class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 Performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still in its early stage. There must be some bugs that need to be 
fixed, and I would like to hear more comments.


was (Author: irvingzhang):
I worked on this issue for about three to four days. And it now works fine for 
searching. In my implementation, the clustering process was optimized when the 
number of vectors is large (e.g. > 40,000 per segment). A subset after 
shuffling is selected for training rather than the whole set of vectors, 
decreasing time and memory. The insertion performance of IVFFlat is better due 
to no extra executions on insertion while HNSW needs to maintain the graph. 
However, IVFFlat consumes more time in flushing because of the k-means 
clustering. The designed format of IVFFlat index is presented in the class 
Lucene90IvfFlatIndexFormat. My test cases show that the query performance of 
IVFFlat is slightly better than HNSW. My personal branch 
[#[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]] is here. 
Test class for IVFFlat is under the directory of 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 Performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still in its early stage. It currently has some code that is similar to 
HNSW, which could be refactored. Moreover, there must be some bugs that need to 
be fixed, and I would like to hear more comments.

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades but it draws tremendous attention lately 
> with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. By embedding the data 
> into a high dimensional vector, the vector retrieval (VR) method is then 
> applied to search the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++, with no plan to support a Java interface, making them hard 
> to integrate into Java projects or to use for those who are not familiar with C/C++ 
> [[https://github.com/facebookresearch/faiss/issues/105]]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-based algorithms, such as KD-tree;
>  # Hashing methods, such as LSH (Locality-Sensitive Hashing);
>  # Product quantization algorithms, such as IVFFlat;
>  # Graph-based algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular ones among all the VR algorithms.
> Recently, th

[jira] [Comment Edited] (LUCENE-9004) Approximate nearest vector search

2020-01-20 Thread Xin-Chun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019519#comment-17019519
 ] 

Xin-Chun Zhang edited comment on LUCENE-9004 at 1/21/20 3:33 AM:
-

I created a related issue [#LUCENE-9136] that attempts to introduce IVFFlat 
algorithm to Lucene. IVFFlat is widely used in many fields, from computer 
vision to speech recognition for its smaller index and memory usage. And it 
supports GPU parallel computing, making it faster and more accurate than HNSW. 

My personal 
[branch|[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]] is 
available on GitHub. The format of IVFFlat index can be seen in the class 
Lucene90IvfFlatIndexFormat. In my implementation, the clustering process was 
optimized when the number of vectors is very large (e.g. > 40,000 per segment). 
A subset after shuffling is selected for training, thereby saving time and 
memory. The insertion performance of IVFFlat is better due to no extra 
executions on insertion while HNSW needs to maintain the graph. However, IVFFlat 
consumes more time in flushing because of the k-means clustering.

My test cases show that the query performance of IVFFlat is better than HNSW, 
and its recall is pretty high (recall>97% over a set of 5 random vectors 
with 100 dimensions). My test class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 Performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still in its early stage. It currently has some code that is similar to 
HNSW, which could be refactored. Moreover, there must be some bugs that need 
to be fixed, and I would like to hear more comments.


was (Author: irvingzhang):
I created a related issue [#LUCENE-9136] that attempts to introduce IVFFlat 
algorithm to Lucene. IVFFlat is widely used in many fields, from computer 
vision to speech recognition for its smaller index and memory. And it supports 
GPU parallel computing, making it faster and more accurate than HNSW. 

The format of IVFFlat index can be seen in the class 
Lucene90IvfFlatIndexFormat. In my implementation, the clustering process was 
optimized when the number of vectors is very large (e.g. > 40,000 per segment). 
A subset after shuffling is selected for training, thereby saving time and 
memory. The insertion performance of IVFFlat is better due to no extra 
executions on insertion while HNSW needs to maintain the graph. However, IVFFlat 
consumes more time in flushing because of the k-means clustering.

My test cases show that the query performance of IVFFlat is better than HNSW, 
and its recall is pretty high (recall>97% over a set of 5 random vectors 
with 100 dimensions). My test class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 Performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still in its early stage. It currently has some code that is similar to 
HNSW, which could be refactored. Moreover, there must be some bugs that need 
to be fixed, and I would like to hear more comments.

> Approximate nearest vector search
> -
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
> Attachments: hnsw_layered_graph.png
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found an approach based on navigating a graph that partially encodes the 
> nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost. This 
> issue will explore implementing HNSW (hierarchical navigable small-world) 
> graphs for the purpose of approximate nearest vector search (often referred 
> to as KNN or k-nearest-neighbor search).
> At a high level the way this algorithm works is this. First a

[jira] [Comment Edited] (LUCENE-9004) Approximate nearest vector search

2020-01-20 Thread Xin-Chun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019519#comment-17019519
 ] 

Xin-Chun Zhang edited comment on LUCENE-9004 at 1/21/20 3:30 AM:
-

I created a related issue [#LUCENE-9136] that attempts to introduce IVFFlat 
algorithm to Lucene. IVFFlat is widely used in many fields, from computer 
vision to speech recognition for its smaller index and memory. And it supports 
GPU parallel computing, making it faster and more accurate than HNSW. 

The format of IVFFlat index can be seen in the class 
Lucene90IvfFlatIndexFormat. In my implementation, the clustering process was 
optimized when the number of vectors is very large (e.g. > 40,000 per segment). 
A subset after shuffling is selected for training, thereby saving time and 
memory. The insertion performance of IVFFlat is better due to no extra 
executions on insertion while HNSW needs to maintain the graph. However, IVFFlat 
consumes more time in flushing because of the k-means clustering.

My test cases show that the query performance of IVFFlat is better than HNSW, 
and its recall is pretty high (recall>97% over a set of 5 random vectors 
with 100 dimensions). My test class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 Performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still in its early stage. It currently has some code that is similar to 
HNSW, which could be refactored. Moreover, there must be some bugs that need 
to be fixed, and I would like to hear more comments.


was (Author: irvingzhang):
I created a related issue [#LUCENE-9136] that attempts to introduce IVFFlat 
algorithm to Lucene. IVFFlat is widely used in many fields, from computer 
vision to speech recognition for its smaller index and memory. And it supports 
GPU parallel computing, making it faster and more accurate than HNSW. 

The format of IVFFlat index can be seen in the class 
Lucene90IvfFlatIndexFormat. In my implementation, the clustering process was 
optimized when the number of vectors is very large (e.g. > 40,000 per segment). 
A subset after shuffling is selected for training, thereby saving time and 
memory. The insertion performance of IVFFlat is better due to no extra 
executions on insertion while HNSW needs to maintain the graph. However, IVFFlat 
consumes more time in flushing because of the k-means clustering. My test cases 
show that the query performance of IVFFlat is slightly better than HNSW. My 
test class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 Performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still in its early stage. It currently has some code that is similar to 
HNSW, which could be refactored. Moreover, there must be some bugs that need 
to be fixed, and I would like to hear more comments.

> Approximate nearest vector search
> -
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
> Attachments: hnsw_layered_graph.png
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found an approach based on navigating a graph that partially encodes the 
> nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost. This 
> issue will explore implementing HNSW (hierarchical navigable small-world) 
> graphs for the purpose of approximate nearest vector search (often referred 
> to as KNN or k-nearest-neighbor search).
> At a high level the way this algorithm works is this. First assume you have a 
> graph that has a partial encoding of the nearest neighbor relation, with some 
> short and some long-distance links. If this graph is built in the right way 
> (has the hierarchical navigable

[GitHub] [lucene-solr] andyvuong opened a new pull request #1188: SOLR-14044: Support collection and shard deletion in shared storage

2020-01-20 Thread GitBox
andyvuong opened a new pull request #1188: SOLR-14044: Support collection and 
shard deletion in shared storage
URL: https://github.com/apache/lucene-solr/pull/1188
 
 
   This PR adds support for shard and collection deletion in shared storage 
(SOLR-14044) and also includes a major refactor of the existing 
BlobDeleteManager and deletion code.
   
   The change refactors the existing async processing machinery, and 
BlobDeleteManager now manages two deletion pools: the existing one for 
handling normal indexing flow deletion (as we push) and a pool used 
specifically by the Overseer for handling collection and shard deletion. 
   
   Currently working on another test that I should add soon, but the rest is 
reviewable. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Assigned] (SOLR-5146) Figure out what it would take for lazily-loaded cores to play nice with SolrCloud

2020-01-20 Thread David Smiley (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-5146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Smiley reassigned SOLR-5146:
--

Assignee: David Smiley  (was: Shalin Shekhar Mangar)

> Figure out what it would take for lazily-loaded cores to play nice with 
> SolrCloud
> -
>
> Key: SOLR-5146
> URL: https://issues.apache.org/jira/browse/SOLR-5146
> Project: Solr
>  Issue Type: Improvement
>  Components: SolrCloud
>Affects Versions: 4.5, 6.0
>Reporter: Erick Erickson
>Assignee: David Smiley
>Priority: Major
>
> The whole lazy-load core thing was implemented with non-SolrCloud use-cases 
> in mind. There are several user-list threads that ask about using lazy cores 
> with SolrCloud, especially in multi-tenant use-cases.
> This is a marker JIRA to investigate what it would take to make lazy-load 
> cores play nice with SolrCloud. It's especially interesting how this all 
> works with shards, replicas, leader election, recovery, etc.
> NOTE: This is pretty much totally unexplored territory. It may be that a few 
> trivial modifications are all that's needed. OTOH, It may be that we'd have 
> to rip apart SolrCloud to handle this case. Until someone dives into the 
> code, we don't know.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[GitHub] [lucene-solr] ErickErickson commented on issue #1186: LUCENE-9134: Port ant-regenerate tasks to Gradle build

2020-01-20 Thread GitBox
ErickErickson commented on issue #1186: LUCENE-9134: Port ant-regenerate tasks 
to Gradle build
URL: https://github.com/apache/lucene-solr/pull/1186#issuecomment-576393583
 
 
   Thanks for the tip. Maybe after banging my head against this wall for a 
couple of hours I finally understand what the doLast{} bit is about. I’d broken 
out some tasks in the gradle file to do things like clean regenerated files 
first but _not_ wrapped them in doLast{} blocks, so they were being executed at 
config time, and removing things that subsequently didn’t get rebuilt because 
there was no execution phase on that task.
   
   Straightened out now.
   
   You know, if I had actually read the damn output when I added printlns into the 
tasks under the “> Configure project :lucene:core” section, I would have figured it out 
faster. Siii..
   
   Erick
   
   
   > On Jan 20, 2020, at 1:33 PM, Dawid Weiss  wrote:
   > 
   > Looking at the patch this task should not be triggered when you run 
./gradlew assemble... You don't need to complicate it with onlyIf or any such 
directives - a task is only triggered if it's attached to something executed in 
the task graph.
   > 
   
   





[GitHub] [lucene-solr] dweiss commented on issue #1186: LUCENE-9134: Port ant-regenerate tasks to Gradle build

2020-01-20 Thread GitBox
dweiss commented on issue #1186: LUCENE-9134: Port ant-regenerate tasks to 
Gradle build
URL: https://github.com/apache/lucene-solr/pull/1186#issuecomment-576388831
 
 
   Looking at the patch this task should not be triggered when you run 
./gradlew assemble... You don't need to complicate it with onlyIf or any such 
directives - a task is only triggered if it's attached to something executed in 
the task graph. 





[GitHub] [lucene-solr] ErickErickson commented on issue #1186: LUCENE-9134: Port ant-regenerate tasks to Gradle build

2020-01-20 Thread GitBox
ErickErickson commented on issue #1186: LUCENE-9134: Port ant-regenerate tasks 
to Gradle build
URL: https://github.com/apache/lucene-solr/pull/1186#issuecomment-576384783
 
 
   Dawid:
   
   Thanks, I’ll implement your comments.
   
   Meanwhile I’m trying to get the regenerate task to _not_ kick in when 
executing things like “gw assemble” by using constructs like:
   
   regenerate.onlyIf {
 gradle.startParameter.taskRequests.args.findAll { String s -> s == 
"regenerate" }.size() > 0
   }
   
   is there a better way? So far this works on an individual task in a gradle 
file. I’d like a way for it to apply to _all_ tasks in a particular gradle 
file, including dependency resolution…. Digging.
   
   As for the other files that changed, almost all of the differences are 
results of things I did intentionally. 
   
   - Some tasks themselves use a replaceregexp with quoted lines that I decided 
to “fix” indentation on. Unnecessary and distracting I agree.
   
   - The SuppressWarnings were moved around so that once this runs, people 
don’t have to go back in and hand-edit the generated files to suppress them.
   
   - This little jewel: “implemetation’s” is because javacc apparently 
generates the misspelling and it was hand-corrected.
   
   - As for changes like: 
   
   private java.util.List<int[]> jj_expentries = new java.util.ArrayList<>(); 
   changed to:
   private java.util.List<int[]> jj_expentries = new java.util.ArrayList<int[]>();
   
   Removing the redundant type arguments was a hand-edit (see LUCENE-5512), and I don’t see 
the harm in leaving them in. Hmmm. I’ll try doing a replaceregexp on them to 
avoid compiler warnings and thrashing on it.
   
   - I can’t explain at all why invoking javacc in this environment decided to 
move some methods around in StandardSyntaxParser.java. I certainly changed the 
jj file in order to put the SuppressWarnings in, so it makes sense that the 
java file was regenerated. But the methods are identical, just in different 
places, so I’m not worrying about it.
   
   - As for the cast changes, all I can say is that I get no warnings. Hmmm, 
let me add -Xlint:cast to the parameters and see if they come back and fix as 
part of the task if so.
   
   Thanks!
   
   
   > On Jan 20, 2020, at 10:12 AM, Dawid Weiss  wrote:
   > 
   > @dweiss commented on this pull request.
   > 
   > I only had time for a quick scan through the patch. I think it needs some 
love in how we use javacc (the current way is not right). I also didn't quite 
understand from the diff file which changes are part of the patch and which are 
caused by regenerated files... and why those regenerated files are not 
identical to what they were before (does ant regenerate also leave them in a 
changed state?)
   > 
   > In gradle/defaults-java.gradle:
   > 
   > > @@ -6,6 +6,8 @@ allprojects {
   >  targetCompatibility = "11"
   >  
   >  compileJava.options.encoding = "UTF-8"
   > +compileJava.options.compilerArgs << '-Xlint:unchecked'
   > 
   > I added linting options in a separate commit.
   > 
   > In lucene/queryparser/build.gradle:
   > 
   > > +  File inputFile
   > +
   > +  @OutputDirectory
   > +  File target
   > +
   > +  String lineSeparator = System.lineSeparator()
   > +  @TaskAction
   > +  void javacc() {
   > +
   > +
   > +String javaCCClasspath = 
project.project(":lucene:queryparser").configurations.javaCCDeps.asPath
   > +String javaCCHome = javaCCClasspath.substring(0, 
javaCCClasspath.lastIndexOf("/"))
   > +
   > +// This bit seems really awkward, but I didn't find a good way to 
either convince the ant task to accept a different
   > +// name than javacc.jar...
   > +// nocommit So I'm taking the javacc-5.0.jar file that's downloaded 
to Gradle's cache and renaming it.
   > 
   > You shouldn't be doing anything with those files. I think it'd be better 
to use javacc directly rather than through ant - then you could just invoke the 
javacc with the classpath you declared in the dependency, that's it.
   > 
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] ErickErickson commented on a change in pull request #1186: LUCENE-9134: Port ant-regenerate tasks to Gradle build

2020-01-20 Thread GitBox
ErickErickson commented on a change in pull request #1186: LUCENE-9134: Port 
ant-regenerate tasks to Gradle build
URL: https://github.com/apache/lucene-solr/pull/1186#discussion_r368662934
 
 

 ##
 File path: gradle/defaults-java.gradle
 ##
 @@ -6,6 +6,8 @@ allprojects {
 targetCompatibility = "11"
 
 compileJava.options.encoding = "UTF-8"
+compileJava.options.compilerArgs << '-Xlint:unchecked'
 
 Review comment:
   Yep, saw that and I'll update this to master and resolve.





[GitHub] [lucene-solr] ErickErickson commented on a change in pull request #1186: LUCENE-9134: Port ant-regenerate tasks to Gradle build

2020-01-20 Thread GitBox
ErickErickson commented on a change in pull request #1186: LUCENE-9134: Port 
ant-regenerate tasks to Gradle build
URL: https://github.com/apache/lucene-solr/pull/1186#discussion_r368662726
 
 

 ##
 File path: lucene/queryparser/build.gradle
 ##
 @@ -7,3 +7,224 @@ dependencies {
 
   testImplementation project(':lucene:test-framework')
 }
+
+configure(":lucene:queryparser") {
+  configurations {
+javaCCDeps
+  }
+
+  dependencies {
+javaCCDeps "net.java.dev.javacc:javacc:5.0"
+  }
+}
+
+String lineSeparator = System.lineSeparator()
+
+task runJavaccQueryParser(type: JavaCC) {
+  outputs.upToDateWhen { false } //nocommit
+  inputFile 
file('src/java/org/apache/lucene/queryparser/classic/QueryParser.jj')
+  target file('src/java/org/apache/lucene/queryparser/classic')
+  doLast {
+ant.replaceregexp(file: 
"src/java/org/apache/lucene/queryparser/classic/QueryParser.java",
+byline: "true",
+match: "public QueryParser\\(CharStream ",
+replace: "protected QueryParser(CharStream ")
+ant.replaceregexp(file: 
"src/java/org/apache/lucene/queryparser/classic/QueryParser.java",
+byline: "true",
+match: "public QueryParser\\(QueryParserTokenManager ",
+replace: "protected QueryParser(QueryParserTokenManager ")
+  }
+}
+
+task runJavaccSurround(type: JavaCC) {
+  outputs.upToDateWhen { false } //nocommit
+  inputFile 
file('src/java/org/apache/lucene/queryparser/surround/parser/QueryParser.jj')
+  target file('src/java/org/apache/lucene/queryparser/surround/parser')
+}
+
+task runJavaccFlexible(type: JavaCC) {
+  outputs.upToDateWhen { false } //nocommit
+  inputFile 
file('src/java/org/apache/lucene/queryparser/flexible/standard/parser/StandardSyntaxParser.jj')
+  target 
file('src/java/org/apache/lucene/queryparser/flexible/standard/parser')
+  doLast {
+ant.replaceregexp(file: 
"src/java/org/apache/lucene/queryparser/flexible/standard/parser/ParseException.java",
+match: "public class ParseException extends Exception",
+replace: "public class ParseException extends QueryNodeParseException",
+flags: "g",
+byline: "false")
+ant.replaceregexp(file: 
"src/java/org/apache/lucene/queryparser/flexible/standard/parser/ParseException.java",
+match: "package 
org.apache.lucene.queryparser.flexible.standard.parser;",
+replace: "package 
org.apache.lucene.queryparser.flexible.standard.parser;${lineSeparator}${lineSeparator}"
 +
+"import 
org.apache.lucene.queryparser.flexible.messages.Message;${lineSeparator}" +
+"import 
org.apache.lucene.queryparser.flexible.messages.MessageImpl;${lineSeparator}" +
+"import 
org.apache.lucene.queryparser.flexible.core.*;${lineSeparator}" +
+"import org.apache.lucene.queryparser.flexible.core.messages.*;",
+flags: "g",
+byline: "false")
+ant.replaceregexp(file: 
"src/java/org/apache/lucene/queryparser/flexible/standard/parser/ParseException.java",
+match: "^  public ParseException\\(Token 
currentTokenVal.*\$(\\s\\s[^}].*\\n)*  \\}",
+replace: "  public ParseException(Token 
currentTokenVal,${lineSeparator}" +
+"int[][] expectedTokenSequencesVal, String[] tokenImageVal) 
{${lineSeparator}" +
+"  super(new MessageImpl(QueryParserMessages.INVALID_SYNTAX, 
initialise(${lineSeparator}" +
+"  currentTokenVal, expectedTokenSequencesVal, 
tokenImageVal)));${lineSeparator}" +
+"  this.currentToken = currentTokenVal;${lineSeparator}" +
+"  this.expectedTokenSequences = 
expectedTokenSequencesVal;${lineSeparator}" +
+"  this.tokenImage = tokenImageVal;${lineSeparator}" +
+"  }",
+flags: "gm",
+byline: "false")
+
ant.replaceregexp(file:"src/java/org/apache/lucene/queryparser/flexible/standard/parser/ParseException.java",
+match: "^  public ParseException\\(String message.*\$(\\s\\s[^}].*\\n)*  
\\}",
+replace: "  public ParseException(Message message) {${lineSeparator}" +
+"super(message);${lineSeparator}" +
+"  }",
+flags: "gm",
+byline: "false")
+
ant.replaceregexp(file:"src/java/org/apache/lucene/queryparser/flexible/standard/parser/ParseException.java",
+match: "^  public ParseException\\(\\).*\$(\\s\\s[^}].*\\n)*  \\}",
+replace: "  public ParseException() {${lineSeparator}" +
+"super(new MessageImpl(QueryParserMessages.INVALID_SYNTAX, 
\"Error\"));${lineSeparator}" +
+"  }",
+flags: "gm",
+byline: "false")
+ant.replaceregexp(file: 
"src/java/org/apache/lucene/queryparser/flexible/standard/parser/ParseException.java",
+match: "^  public String getMessage\\(\\).*\$(\\s\\s\\s\\s[^}].*\n)*
\\}",
+replace: "  private static String initialise(Token currentToken, int[][] 
expectedTokenSequences, String[] tokenImage) {${lineSeparator}" +
+"String eol = System.getProperty("lineSeparator"

[jira] [Commented] (SOLR-14192) Race condition between SchemaManager and ZkIndexSchemaReader

2020-01-20 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019638#comment-17019638
 ] 

ASF subversion and git services commented on SOLR-14192:


Commit 4c72b3d9704d6dc2fcb7a85b628d7658477a651f in lucene-solr's branch 
refs/heads/branch_8x from Andrzej Bialecki
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4c72b3d ]

SOLR-14192: Race condition between SchemaManager and ZkIndexSchemaReader.


> Race condition between SchemaManager and ZkIndexSchemaReader
> 
>
> Key: SOLR-14192
> URL: https://issues.apache.org/jira/browse/SOLR-14192
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 8.4
>Reporter: Andrzej Bialecki
>Assignee: Andrzej Bialecki
>Priority: Major
> Fix For: 8.5
>
> Attachments: SOLR-14192.patch
>
>
> Spin-off from SOLR-14128 and SOLR-13368.
> In SolrCloud when a SolrCore is created and it uses managed schema then its 
> {{ManagedIndexSchemaFactory}} performs an automatic upgrade of the initial 
> {{schema.xml}} to {{managed-schema}}. This includes removing the original 
> {{schema.xml}} file.
> SOLR-13368 added some locking to make sure the changed resource name (i.e. 
> {{managed-schema}}) becomes visible only when this process is complete, and 
> that in-flight requests to /admin/schema block until this process is 
> complete, to avoid returning inconsistent data. This locking mechanism uses 
> simple Object monitors.
> However, if there's more than 1 node in the cluster the subsequent request to 
> retrieve schema may execute on a core that still hasn't reloaded its schema 
> ({{ZkIndexSchemaReader}} uses a ZK watcher, which may take some time to 
> trigger), and the resource name in that stale schema still points to 
> {{schema.xml}}, which by this time no longer exists because it was removed by 
> {{ManagedIndexSchemaFactory}} in the first core.
> As I see it there are two bugs here:
>  # there's no distributed locking when this upgrade is performed, so it's 
> natural that there are multiple cores racing against each other to perform 
> this upgrade.
>  # the upgrade process removes {{schema.xml}} too early - it triggers all 
> other cores by creating the {{managed-schema}} file, and then other cores 
> reload from the new managed schema - but it should wait until this reload is 
> complete on all cores because only then it's safe to delete the non-managed 
> resource as it's no longer in use by any core.
> Issue 1. can be solved by adding an ephemeral znode lock so that only one 
> core can perform the upgrade. Issue 2. can be solved by using 
> {{ManagedIndexSchema.waitForSchemaZkVersionAgreement}} after upgrade, and 
> deleting {{schema.xml}} only after it's done.
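
The ordering proposed in the description above can be illustrated with an in-process analogy. This is only a sketch under stated assumptions, not the committed fix: the class SchemaUpgradeSketch and all of its members are hypothetical, an AtomicBoolean stands in for the ephemeral znode lock, and a CountDownLatch stands in for waiting on schema version agreement. No ZooKeeper or Solr API is used.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical in-process model of the upgrade ordering; not Solr code. */
final class SchemaUpgradeSketch {
  // Stand-in for the ephemeral znode lock: only the first caller acquires it.
  // Unlike a real ephemeral znode, it is never released in this sketch.
  private final AtomicBoolean upgradeLock = new AtomicBoolean(false);
  // Stand-in for version agreement: one count per other core that must reload.
  private final CountDownLatch pendingReloads;

  volatile boolean managedSchemaExists = false;
  volatile boolean schemaXmlExists = true;

  SchemaUpgradeSketch(int otherCores) {
    pendingReloads = new CountDownLatch(otherCores);
  }

  /** Called by each other core once it has reloaded from managed-schema. */
  void coreReloaded() {
    pendingReloads.countDown();
  }

  /**
   * Returns true if this caller won the lock and published managed-schema.
   * schema.xml is removed only if every other core reloaded within the timeout.
   */
  boolean tryUpgrade(long timeoutMillis) {
    if (!upgradeLock.compareAndSet(false, true)) {
      return false; // issue 1: another core is already performing the upgrade
    }
    managedSchemaExists = true; // publishing the new resource triggers reloads
    boolean allReloaded;
    try {
      allReloaded = pendingReloads.await(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      allReloaded = false;
    }
    if (allReloaded) {
      schemaXmlExists = false; // issue 2: delete only after all cores agree
    }
    return true;
  }
}
```

In this model a core that loses the race simply returns, and a stale core can still read schema.xml until every other core has reported its reload.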



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Commented] (SOLR-14040) solr.xml shareSchema does not work in SolrCloud

2020-01-20 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019625#comment-17019625
 ] 

David Smiley commented on SOLR-14040:
-

This optimization was huge for Salesforce's situation involving ~2000 live 
SolrCores and large schemas referring to heavy objects.  The benefit is in both 
memory and core startup time -- the latter helps make transient cores viable.  
We're ensuring this optimization works in SolrCloud as well as it already does 
in standalone Solr.  There was definitely an analysis proving this out.  We 
ought to re-do one in SolrCloud mode as I fear there are new/additional speed 
bumps slowing core loading.

> solr.xml shareSchema does not work in SolrCloud
> ---
>
> Key: SOLR-14040
> URL: https://issues.apache.org/jira/browse/SOLR-14040
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Schema and Analysis
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> solr.xml has a shareSchema boolean option that can be toggled from the 
> default of false to true in order to share IndexSchema objects within the 
> Solr node.  This is silently ignored in SolrCloud mode.  The pertinent code 
> is {{org.apache.solr.core.ConfigSetService#createConfigSetService}} which 
> creates a CloudConfigSetService that is not related to the SchemaCaching 
> class.  This may not be a big deal in SolrCloud which tends not to deal well 
> with many cores per node but I'm working on changing that.






[jira] [Commented] (SOLR-14192) Race condition between SchemaManager and ZkIndexSchemaReader

2020-01-20 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019576#comment-17019576
 ] 

ASF subversion and git services commented on SOLR-14192:


Commit 6244b7150e6dd544879e142e59c8f98a2694d837 in lucene-solr's branch 
refs/heads/master from Andrzej Bialecki
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=6244b71 ]

SOLR-14192: Race condition between SchemaManager and ZkIndexSchemaReader.








[GitHub] [lucene-solr] dweiss commented on a change in pull request #1186: LUCENE-9134: Port ant-regenerate tasks to Gradle build

2020-01-20 Thread GitBox
dweiss commented on a change in pull request #1186: LUCENE-9134: Port 
ant-regenerate tasks to Gradle build
URL: https://github.com/apache/lucene-solr/pull/1186#discussion_r368595382
 
 

 ##
 File path: lucene/queryparser/build.gradle
 ##
 @@ -7,3 +7,224 @@ dependencies {
 
   testImplementation project(':lucene:test-framework')
 }
+
+configure(":lucene:queryparser") {
+  configurations {
+javaCCDeps
+  }
+
+  dependencies {
+javaCCDeps "net.java.dev.javacc:javacc:5.0"
+  }
+}
+
+String lineSeparator = System.lineSeparator()
+
+task runJavaccQueryParser(type: JavaCC) {
+  outputs.upToDateWhen { false } //nocommit
+  inputFile 
file('src/java/org/apache/lucene/queryparser/classic/QueryParser.jj')
+  target file('src/java/org/apache/lucene/queryparser/classic')
+  doLast {
+ant.replaceregexp(file: 
"src/java/org/apache/lucene/queryparser/classic/QueryParser.java",
+byline: "true",
+match: "public QueryParser\\(CharStream ",
+replace: "protected QueryParser(CharStream ")
+ant.replaceregexp(file: 
"src/java/org/apache/lucene/queryparser/classic/QueryParser.java",
+byline: "true",
+match: "public QueryParser\\(QueryParserTokenManager ",
+replace: "protected QueryParser(QueryParserTokenManager ")
+  }
+}
+
+task runJavaccSurround(type: JavaCC) {
+  outputs.upToDateWhen { false } //nocommit
+  inputFile 
file('src/java/org/apache/lucene/queryparser/surround/parser/QueryParser.jj')
+  target file('src/java/org/apache/lucene/queryparser/surround/parser')
+}
+
+task runJavaccFlexible(type: JavaCC) {
+  outputs.upToDateWhen { false } //nocommit
+  inputFile 
file('src/java/org/apache/lucene/queryparser/flexible/standard/parser/StandardSyntaxParser.jj')
+  target 
file('src/java/org/apache/lucene/queryparser/flexible/standard/parser')
+  doLast {
+ant.replaceregexp(file: 
"src/java/org/apache/lucene/queryparser/flexible/standard/parser/ParseException.java",
+match: "public class ParseException extends Exception",
+replace: "public class ParseException extends QueryNodeParseException",
+flags: "g",
+byline: "false")
+ant.replaceregexp(file: 
"src/java/org/apache/lucene/queryparser/flexible/standard/parser/ParseException.java",
+match: "package 
org.apache.lucene.queryparser.flexible.standard.parser;",
+replace: "package 
org.apache.lucene.queryparser.flexible.standard.parser;${lineSeparator}${lineSeparator}"
 +
+"import 
org.apache.lucene.queryparser.flexible.messages.Message;${lineSeparator}" +
+"import 
org.apache.lucene.queryparser.flexible.messages.MessageImpl;${lineSeparator}" +
+"import 
org.apache.lucene.queryparser.flexible.core.*;${lineSeparator}" +
+"import org.apache.lucene.queryparser.flexible.core.messages.*;",
+flags: "g",
+byline: "false")
+ant.replaceregexp(file: 
"src/java/org/apache/lucene/queryparser/flexible/standard/parser/ParseException.java",
+match: "^  public ParseException\\(Token 
currentTokenVal.*\$(\\s\\s[^}].*\\n)*  \\}",
+replace: "  public ParseException(Token 
currentTokenVal,${lineSeparator}" +
+"int[][] expectedTokenSequencesVal, String[] tokenImageVal) 
{${lineSeparator}" +
+"  super(new MessageImpl(QueryParserMessages.INVALID_SYNTAX, 
initialise(${lineSeparator}" +
+"  currentTokenVal, expectedTokenSequencesVal, 
tokenImageVal)));${lineSeparator}" +
+"  this.currentToken = currentTokenVal;${lineSeparator}" +
+"  this.expectedTokenSequences = 
expectedTokenSequencesVal;${lineSeparator}" +
+"  this.tokenImage = tokenImageVal;${lineSeparator}" +
+"  }",
+flags: "gm",
+byline: "false")
+
ant.replaceregexp(file:"src/java/org/apache/lucene/queryparser/flexible/standard/parser/ParseException.java",
+match: "^  public ParseException\\(String message.*\$(\\s\\s[^}].*\\n)*  
\\}",
+replace: "  public ParseException(Message message) {${lineSeparator}" +
+"super(message);${lineSeparator}" +
+"  }",
+flags: "gm",
+byline: "false")
+
ant.replaceregexp(file:"src/java/org/apache/lucene/queryparser/flexible/standard/parser/ParseException.java",
+match: "^  public ParseException\\(\\).*\$(\\s\\s[^}].*\\n)*  \\}",
+replace: "  public ParseException() {${lineSeparator}" +
+"super(new MessageImpl(QueryParserMessages.INVALID_SYNTAX, 
\"Error\"));${lineSeparator}" +
+"  }",
+flags: "gm",
+byline: "false")
+ant.replaceregexp(file: 
"src/java/org/apache/lucene/queryparser/flexible/standard/parser/ParseException.java",
+match: "^  public String getMessage\\(\\).*\$(\\s\\s\\s\\s[^}].*\n)*
\\}",
+replace: "  private static String initialise(Token currentToken, int[][] 
expectedTokenSequences, String[] tokenImage) {${lineSeparator}" +
+"String eol = System.getProperty("lineSeparator", 
"\n"

[GitHub] [lucene-solr] dweiss commented on a change in pull request #1186: LUCENE-9134: Port ant-regenerate tasks to Gradle build

2020-01-20 Thread GitBox
dweiss commented on a change in pull request #1186: LUCENE-9134: Port 
ant-regenerate tasks to Gradle build
URL: https://github.com/apache/lucene-solr/pull/1186#discussion_r368594079
 
 

 ##
 File path: gradle/defaults-java.gradle
 ##
 @@ -6,6 +6,8 @@ allprojects {
 targetCompatibility = "11"
 
 compileJava.options.encoding = "UTF-8"
+compileJava.options.compilerArgs << '-Xlint:unchecked'
 
 Review comment:
   I added linting options in a separate commit.





[GitHub] [lucene-solr] dweiss closed pull request #1185: Jira/lucene 9151

2020-01-20 Thread GitBox
dweiss closed pull request #1185: Jira/lucene 9151
URL: https://github.com/apache/lucene-solr/pull/1185
 
 
   





[GitHub] [lucene-solr] dweiss commented on issue #1185: Jira/lucene 9151

2020-01-20 Thread GitBox
dweiss commented on issue #1185: Jira/lucene 9151
URL: https://github.com/apache/lucene-solr/pull/1185#issuecomment-576309634
 
 
   Duplicate of PR 1186.





[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-01-20 Thread Xin-Chun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019507#comment-17019507
 ] 

Xin-Chun Zhang edited comment on LUCENE-9136 at 1/20/20 2:34 PM:
-

I worked on this issue for about three to four days, and searching now works. 
In my implementation, the clustering process is optimized for segments with a 
large number of vectors (e.g. > 40,000 per segment): a shuffled subset is 
selected for training rather than the whole set of vectors, which reduces 
training time and memory. Insertion is faster in IVFFlat than in HNSW because 
IVFFlat does no extra work on insertion, while HNSW needs to maintain the 
graph. However, IVFFlat spends more time in flushing because of the k-means 
clustering. The index format of IVFFlat is shown in the class 
Lucene90IvfFlatIndexFormat. My test cases show that the query performance of 
IVFFlat is slightly better than that of HNSW. My personal branch is at 
[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]. 
The test class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 The performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still at an early stage. Some of the code is similar to HNSW and 
could be refactored. Moreover, there must be some bugs that need to be fixed, 
and I would like to hear more comments.
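
The training-subset step mentioned in the comment could look roughly like the following. This is a minimal sketch and not the code on the branch: the class TrainingSubsetSketch, the method selectTrainingSubset, and the constant MAX_TRAINING_VECTORS are hypothetical names, and the cap of 40,000 only echoes the figure quoted in the comment.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper; illustrates sampling before k-means training. */
final class TrainingSubsetSketch {
  /** Assumed cap, taken from the "> 40,000 per segment" example in the comment. */
  static final int MAX_TRAINING_VECTORS = 40_000;

  /**
   * Returns all vectors when there are at most maxSize of them; otherwise
   * returns maxSize vectors drawn by shuffling a copy of the input.
   */
  static List<float[]> selectTrainingSubset(List<float[]> vectors, int maxSize, long seed) {
    if (vectors.size() <= maxSize) {
      return new ArrayList<>(vectors); // small segment: train on everything
    }
    List<float[]> shuffled = new ArrayList<>(vectors); // leave the input order untouched
    Collections.shuffle(shuffled, new Random(seed));
    return new ArrayList<>(shuffled.subList(0, maxSize));
  }
}
```

A caller would pass the result to the k-means trainer, for example selectTrainingSubset(segmentVectors, MAX_TRAINING_VECTORS, seed), so that training cost is bounded by the cap rather than by the segment size.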


was (Author: irvingzhang):
I worked on this issue for about three to four days. And it now works fine for 
searching.  My [personal 
branch|[https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136]] is 
here. The clustering process was optimized when the number of vectors is large 
(e.g. > 40,000 per segment). The query performance of IVFFlat seems slightly 
better than HNSW. The insert performance of IVFFlat is also better than HNSW 
because it does no extra work on insertion, while HNSW needs to maintain the graph. 
However, IVFFlat consumes more time in flushing due to the k-means clustering. 
The designed format of IVFFlat index is presented in the class 
Lucene90IvfFlatIndexFormat. My test class for IVFFlat is under the directory of 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 Performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still at an early stage. The performance could be further 
optimized. Some of the code is similar to HNSW and could be refactored. 
Moreover, there must be some bugs that need to be fixed, and I would like 
to hear more comments.

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades but it draws tremendous attention lately 
> with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. By embedding the data 
> into a high dimensional vector, the vector retrieval (VR) method is then 
> applied to search the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++, with no plan to support a Java interface, making them 
> hard to integrate into Java projects and hard to use for those who are not 
> familiar with C/C++ 
> [[https://github.com/facebookresearch/faiss/issues/105]]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories:
>  # Tree-based algorithms, such as KD-tree;
>  # Hashing methods, such as LSH (Locality-Sensitive Hashing);
>  # Product quantization algorithms, such as IVFFlat;
>  # Graph-based algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular ones among all the VR algorithms.
> Recently, the implementation of HNSW (Hierarchical Navigable Small World, 
> LUCENE-9004) for Lucene, has made grea

[jira] [Commented] (LUCENE-9004) Approximate nearest vector search

2020-01-20 Thread Xin-Chun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019519#comment-17019519
 ] 

Xin-Chun Zhang commented on LUCENE-9004:


I created a related issue [#LUCENE-9136] that attempts to introduce the IVFFlat 
algorithm to Lucene. IVFFlat is widely used in many fields, from computer 
vision to speech recognition, for its smaller index size and memory footprint. 
It also supports GPU parallel computing, making it faster and more accurate 
than HNSW. 

The format of IVFFlat index can be seen in the class 
Lucene90IvfFlatIndexFormat. In my implementation, the clustering process was 
optimized when the number of vectors is very large (e.g. > 40,000 per segment). 
A subset after shuffling is selected for training, thereby saving time and 
memory. The insertion performance of IVFFlat is better because it does no extra 
work on insertion, while HNSW needs to maintain the graph. However, IVFFlat 
consumes more time in flushing because of the k-means clustering. My test cases 
show that the query performance of IVFFlat is slightly better than HNSW. My 
test class for IVFFlat is under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 Performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still at an early stage. Some of the code is similar to HNSW and 
could be refactored. Moreover, there must be some bugs that need to be fixed, 
and I would like to hear more comments.
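
For readers unfamiliar with IVFFlat, the flush-time clustering and the query path can be sketched as below. This is an illustrative toy under stated assumptions, not the implementation on the branch and not the Lucene90IvfFlatIndexFormat layout: the class IvfFlatSketch and its members are hypothetical, vectors are kept in memory, distance is squared Euclidean, and the number of clusters probed per query (nprobe) is a parameter name borrowed from common IVF terminology.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/** Hypothetical in-memory IVFFlat: k-means centroids plus one flat list per centroid. */
final class IvfFlatSketch {
  final float[][] centroids;
  final List<List<float[]>> invertedLists = new ArrayList<>();

  /** Models a flush: cluster the buffered vectors, then assign each to its nearest centroid. */
  IvfFlatSketch(List<float[]> vectors, int numCentroids, int iterations, long seed) {
    centroids = kMeans(vectors, numCentroids, iterations, seed);
    for (int c = 0; c < centroids.length; c++) invertedLists.add(new ArrayList<>());
    for (float[] v : vectors) invertedLists.get(nearest(centroids, v)).add(v);
  }

  static float squaredDistance(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /** Index of the centroid closest to v; ties go to the lower index. */
  static int nearest(float[][] centroids, float[] v) {
    int best = 0;
    for (int c = 1; c < centroids.length; c++) {
      if (squaredDistance(centroids[c], v) < squaredDistance(centroids[best], v)) best = c;
    }
    return best;
  }

  /** Plain Lloyd iterations, seeded with k randomly chosen input vectors. */
  static float[][] kMeans(List<float[]> vectors, int k, int iterations, long seed) {
    Random random = new Random(seed);
    int dim = vectors.get(0).length;
    float[][] result = new float[k][];
    for (int c = 0; c < k; c++) result[c] = vectors.get(random.nextInt(vectors.size())).clone();
    for (int it = 0; it < iterations; it++) {
      float[][] sums = new float[k][dim];
      int[] counts = new int[k];
      for (float[] v : vectors) {
        int c = nearest(result, v);
        counts[c]++;
        for (int i = 0; i < dim; i++) sums[c][i] += v[i];
      }
      for (int c = 0; c < k; c++) {
        if (counts[c] == 0) continue; // an empty cluster keeps its previous centroid
        for (int i = 0; i < dim; i++) result[c][i] = sums[c][i] / counts[c];
      }
    }
    return result;
  }

  /** Scans only the nprobe clusters whose centroids are closest to the query. */
  float[] search(float[] query, int nprobe) {
    Integer[] order = new Integer[centroids.length];
    for (int c = 0; c < order.length; c++) order[c] = c;
    Arrays.sort(order, (x, y) -> Float.compare(
        squaredDistance(centroids[x], query), squaredDistance(centroids[y], query)));
    float[] best = null;
    for (int p = 0; p < Math.min(nprobe, order.length); p++) {
      for (float[] candidate : invertedLists.get(order[p])) {
        if (best == null || squaredDistance(candidate, query) < squaredDistance(best, query)) {
          best = candidate;
        }
      }
    }
    return best;
  }
}
```

The sketch makes the trade-off in the comment visible: adding a vector costs nothing until flush, all clustering work happens in the constructor, and a query with a small nprobe scans only a fraction of the vectors at the price of possibly missing the true nearest neighbor.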

> Approximate nearest vector search
> -
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
> Attachments: hnsw_layered_graph.png
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found that an approach based on navigating a graph that partially encodes the 
> nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost. This 
> issue will explore implementing HNSW (hierarchical navigable small-world) 
> graphs for the purpose of approximate nearest vector search (often referred 
> to as KNN or k-nearest-neighbor search).
> At a high level the way this algorithm works is this. First assume you have a 
> graph that has a partial encoding of the nearest neighbor relation, with some 
> short and some long-distance links. If this graph is built in the right way 
> (has the hierarchical navigable small world property), then you can 
> efficiently traverse it to find nearest neighbors (approximately) in log N 
> time where N is the number of nodes in the graph. I believe this idea was 
> pioneered in  [1]. The great insight in that paper is that if you use the 
> graph search algorithm to find the K nearest neighbors of a new document 
> while indexing, and then link those neighbors (undirectedly, i.e. both ways) to 
> the new document, then the graph that emerges will have the desired 
> properties.
> The implementation I propose for Lucene is as follows. We need two new data 
> structures to encode the vectors and the graph. We can encode vectors using a 
> light wrapper around {{BinaryDocValues}} (we also want to encode the vector 
> dimension and have efficient conversion from bytes to floats). For the graph 
> we can use {{SortedNumericDocValues}} where the values we encode are the 
> docids of the related documents. Encoding the interdocument relations using 
> docids directly will make it relatively fast to traverse the graph since we 
> won't need to lookup through an id-field indirection. This choice limits us 
> to building a graph-per-segment since it would be impractical to maintain a 
> global graph for the whole index in the face of segment merges. However, 
> graph-per-segment is very natural at search time - we can traverse each 
> segment's graph independently and merge results as we do today for term-based 
> search.
> At index time, however, merging graphs is somewhat challenging. While 
> indexing we build a graph incrementally, performing searches to construct 
> links among neighbors. When merging segments we must construct a new
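
The insertion rule quoted above (search the graph for the K nearest neighbors of the new document, then link them both ways) can be sketched as follows. This is a simplified single-layer illustration, not the LUCENE-9004 patch: the class NswGraphSketch and its members are hypothetical, the hierarchy of HNSW is omitted, node 0 is always the entry point, and vectors live in memory instead of doc values.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

/** Hypothetical single-layer navigable-small-world graph with undirected links. */
final class NswGraphSketch {
  final List<float[]> vectors = new ArrayList<>();
  final List<Set<Integer>> neighbors = new ArrayList<>();

  static float squaredDistance(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /** Best-first traversal from node 0; returns up to k node ids, nearest first. */
  List<Integer> search(float[] query, int k) {
    List<Integer> result = new ArrayList<>();
    if (vectors.isEmpty()) return result;
    Comparator<Integer> byDistance =
        Comparator.comparingDouble(id -> squaredDistance(vectors.get(id), query));
    PriorityQueue<Integer> candidates = new PriorityQueue<>(byDistance); // nearest first
    PriorityQueue<Integer> best = new PriorityQueue<>(byDistance.reversed()); // farthest first
    Set<Integer> visited = new HashSet<>();
    candidates.add(0);
    best.add(0);
    visited.add(0);
    while (!candidates.isEmpty()) {
      int current = candidates.poll();
      // Stop once the closest unexplored node is farther than the worst kept result.
      if (best.size() >= k && byDistance.compare(current, best.peek()) > 0) break;
      for (int next : neighbors.get(current)) {
        if (visited.add(next)) {
          candidates.add(next);
          best.add(next);
          if (best.size() > k) best.poll(); // drop the farthest of the kept results
        }
      }
    }
    result.addAll(best);
    result.sort(byDistance);
    return result;
  }

  /** Finds the k nearest existing nodes, then links the new node to them both ways. */
  int add(float[] vector, int k) {
    List<Integer> nearest = search(vector, k); // search before the node is added
    int id = vectors.size();
    vectors.add(vector);
    neighbors.add(new HashSet<>());
    for (int other : nearest) {
      neighbors.get(id).add(other);
      neighbors.get(other).add(id);
    }
    return id;
  }
}
```

Because every insertion runs a search, indexing cost grows with graph size, which is the maintenance overhead that the IVFFlat comments in this thread contrast with flush-time clustering.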

[jira] [Commented] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-01-20 Thread Xin-Chun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019507#comment-17019507
 ] 

Xin-Chun Zhang commented on LUCENE-9136:


I worked on this issue for about three to four days, and searching now works 
fine.  My [personal 
branch|https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136] is 
here. The clustering process is optimized for the case where the number of 
vectors is large (e.g. > 40,000 per segment). The query performance of IVFFlat 
seems slightly better than that of HNSW. The insert performance of IVFFlat is 
also better than that of HNSW, because IVFFlat does no extra work on insertion 
while HNSW needs to maintain the graph. However, IVFFlat spends more time in 
flushing because of the k-means clustering. The index format of IVFFlat is 
defined in the class Lucene90IvfFlatIndexFormat. My test class for IVFFlat is 
under the directory 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/|https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/TestKnnIvfFlat.java].
 A performance comparison between IVFFlat and HNSW is in the class 
TestKnnGraphAndIvfFlat.

The work is still at an early stage. The performance could be further 
optimized. Some of the code is similar to the HNSW code and could be 
refactored. Moreover, there must be some bugs that need to be fixed, and I 
would like to hear more comments.
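For readers new to IVFFlat, the search side can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions, not the code in the branch above: centroids are assumed to be already trained (the k-means step is omitted), and the names `IvfFlatSketch`, `assign`, `search` and the `nprobe` parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical in-memory IVFFlat: vectors are grouped by nearest centroid,
// and a query only scans the groups of its nprobe nearest centroids.
final class IvfFlatSketch {
    static float squaredDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Builds one inverted list per centroid holding the ids of its vectors.
    static List<List<Integer>> assign(float[][] centroids, float[][] vectors) {
        List<List<Integer>> lists = new ArrayList<>();
        for (int c = 0; c < centroids.length; c++) {
            lists.add(new ArrayList<>());
        }
        for (int id = 0; id < vectors.length; id++) {
            int best = 0;
            for (int c = 1; c < centroids.length; c++) {
                if (squaredDistance(centroids[c], vectors[id])
                        < squaredDistance(centroids[best], vectors[id])) {
                    best = c;
                }
            }
            lists.get(best).add(id);
        }
        return lists;
    }

    // Returns up to k vector ids, nearest first, scanning only nprobe lists.
    static int[] search(float[][] centroids, List<List<Integer>> lists,
                        float[][] vectors, float[] query, int nprobe, int k) {
        Integer[] order = new Integer[centroids.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order,
            Comparator.comparingDouble((Integer c) -> squaredDistance(centroids[c], query)));
        List<Integer> candidates = new ArrayList<>();
        for (int p = 0; p < Math.min(nprobe, order.length); p++) {
            candidates.addAll(lists.get(order[p]));
        }
        // "Flat" means the candidates are compared exactly, without compression.
        candidates.sort(
            Comparator.comparingDouble((Integer id) -> squaredDistance(vectors[id], query)));
        int n = Math.min(k, candidates.size());
        int[] result = new int[n];
        for (int i = 0; i < n; i++) {
            result[i] = candidates.get(i);
        }
        return result;
    }
}
```

A larger `nprobe` scans more lists, which raises recall at the cost of query time; with `nprobe` equal to the number of centroids the search becomes exhaustive.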

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades, but it has drawn tremendous attention 
> lately with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. Once the data is 
> embedded into high-dimensional vectors, a vector retrieval (VR) method is 
> applied to search for the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++, and there is no plan to support a Java interface, making 
> them hard to integrate into Java projects and hard to use for those who are 
> not familiar with C/C++  
> [https://github.com/facebookresearch/faiss/issues/105]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-based algorithms, such as KD-tree;
>  # Hashing methods, such as LSH (Locality-Sensitive Hashing);
>  # Quantization-based algorithms, such as IVFFlat;
>  # Graph-based algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular ones among all the VR algorithms.
> Recently, the implementation of HNSW (Hierarchical Navigable Small World, 
> LUCENE-9004) for Lucene has made great progress. The issue draws the 
> attention of those who are interested in Lucene or hope to use HNSW with 
> Solr/Lucene. As an alternative for solving ANN similarity search problems, 
> IVFFlat is also very popular with many users and supporters. Compared with 
> HNSW, IVFFlat has a smaller index size but requires k-means clustering, while 
> HNSW is faster at query time (no training required) but requires extra 
> storage for saving graphs [indexing 1M 
> vectors|https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors].
>  Another advantage is that IVFFlat can be faster and more accurate when GPU 
> parallel computing is enabled (currently not supported in Java). Both 
> algorithms have their merits and demerits. Since HNSW is now under 
> development, it may be better to provide both implementations (HNSW && 
> IVFFlat) for potential users who are faced with very different scenarios and 
> want more choices.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9077) Gradle build

2020-01-20 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019457#comment-17019457
 ] 

ASF subversion and git services commented on LUCENE-9077:
-

Commit 351b30489c2a47f32500abbb7bdc4ea30e37a247 in lucene-solr's branch 
refs/heads/gradle-master from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=351b304 ]

LUCENE-9077: Enable javac linting as in ant. TONS of warnings are currently 
printed.


> Gradle build
> 
>
> Key: LUCENE-9077
> URL: https://issues.apache.org/jira/browse/LUCENE-9077
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Major
> Fix For: master (9.0)
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> This task focuses on providing gradle-based build equivalent for Lucene and 
> Solr (on master branch). See notes below on why this respin is needed.
> The code lives on *gradle-master* branch. It is kept in sync with *master*. 
> Try running the following to see an overview of helper guides concerning 
> typical workflow, testing and ant-migration helpers:
> gradlew :help
> A list of items that need to be added or require work. If you'd like to 
> work on any of these, please add your name to the list. Once you have a 
> patch/ pull request let me (dweiss) know - I'll try to coordinate the merges.
>  * (/) Apply forbiddenAPIs
>  * (/) Generate hardware-aware gradle defaults for parallelism (count of 
> workers and test JVMs).
>  * (/) Fail the build if --tests filter is applied and no tests execute 
> during the entire build (this allows for an empty set of filtered tests at 
> single project level).
>  * (/) Port other settings and randomizations from common-build.xml
>  * (/) Configure security policy/ sandboxing for tests.
>  * (/) test's console output on -Ptests.verbose=true
>  * (/) add a :helpDeps explanation to how the dependency system works 
> (palantir plugin, lockfile) and how to retrieve structured information about 
> current dependencies of a given module (in a tree-like output).
>  * (/) jar checksums, jar checksum computation and validation. This should be 
> done without intermediate folders (directly on dependency sets).
>  * (/) verify min. JVM version and exact gradle version on build startup to 
> minimize odd build side-effects
>  * (/) Repro-line for failed tests/ runs.
>  * (/) add a top-level README note about building with gradle (and the 
> required JVM).
>  * (/) add an equivalent of 'validate-source-patterns' 
> (check-source-patterns.groovy) to precommit.
>  * (/) add an equivalent of 'rat-sources' to precommit.
>  * (/) add an equivalent of 'check-example-lucene-match-version' (solr only) 
> to precommit.
> * (/) javadoc compilation
> Hard-to-implement stuff already investigated:
>  * (/) (done)  -*Printing console output of failed tests.* There doesn't seem 
> to be any way to do this in a reasonably efficient way. There are onOutput 
> listeners but they're slow to operate and solr tests emit *tons* of output so 
> it's an overkill.-
>  * (!) (LUCENE-9120) *Tests working with security-debug logs or other 
> JVM-early log output*. Gradle's test runner works by redirecting Java's 
> stdout/ stderr so this just won't work. Perhaps we can spin the ant-based 
> test runner for such corner-cases.
> Of lesser importance:
>  * Add an equivalent of 'documentation-lint' to precommit.
>  * (/) Do not require files to be committed before running precommit. (staged 
> files are fine).
>  * (/) add rendering of javadocs (gradlew javadoc)
>  * Attach javadocs to maven publications.
>  * Add test 'beasting' (rerunning the same suite multiple times). I'm afraid 
> it'll be difficult to run it sensibly because gradle doesn't offer cwd 
> separation for the forked test runners.
>  * if you diff solr packaged distribution against ant-created distribution 
> there are minor differences in library versions and some JARs are excluded/ 
> moved around. I didn't try to force these as everything seems to work (tests, 
> etc.) – perhaps these differences should  be fixed in the ant build instead.
>  * [EOE] identify and port various "regenerate" tasks from ant builds 
> (javacc, precompiled automata, etc.)
>  * Fill in POM details in gradle/defaults-maven.gradle so that they reflect 
> the previous content better (dependencies aside).
>  * Add any IDE integration layers that should be added (I use IntelliJ and it 
> imports the project out of the box, without the need for any special tuning).
>  * Add Solr packaging for docs/* (see TODO in packaging/build.gradle; 
> currently XSLT...)
>  * I didn't bother adding Solr dist/test-framework to packaging (who'd use it 
> from a binary distribution? 
>  
> *{color:#ff}Note:{color}* this builds on th

[jira] [Commented] (LUCENE-9145) Address warnings found by static analysis

2020-01-20 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019455#comment-17019455
 ] 

ASF subversion and git services commented on LUCENE-9145:
-

Commit 338d386ae08a1edecb89df5497cb46d0abf154ad in lucene-solr's branch 
refs/heads/gradle-master from Mike
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=338d386 ]

LUCENE-9145 First pass addressing static analysis (#1181)

Fixed a bunch of the smaller warnings found by error-prone compiler
plugin, while ignoring a lot of the bigger ones.

> Address warnings found by static analysis
> -
>
> Key: LUCENE-9145
> URL: https://issues.apache.org/jira/browse/LUCENE-9145
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Mike Drob
>Assignee: Mike Drob
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>







[jira] [Commented] (LUCENE-9053) java.lang.AssertionError: inputs are added out of order lastInput=[f0 9d 9c 8b] vs input=[ef ac 81 67 75 72 65]

2020-01-20 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019454#comment-17019454
 ] 

ASF subversion and git services commented on LUCENE-9053:
-

Commit 8147e491ce3905bb3543f2c7e34a4ecb60382b49 in lucene-solr's branch 
refs/heads/gradle-master from Michael McCandless
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=8147e49 ]

LUCENE-9053: improve FST's package-info.java comment to clarify required 
(Unicode code point) sort order for FST.Builder


> java.lang.AssertionError: inputs are added out of order lastInput=[f0 9d 9c 
> 8b] vs input=[ef ac 81 67 75 72 65]
> ---
>
> Key: LUCENE-9053
> URL: https://issues.apache.org/jira/browse/LUCENE-9053
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: gitesh
>Priority: Minor
>
> Even though the inputs are sorted in Unicode order, I get the following 
> exception while creating the FST:
>  
> {code:java}
> import org.apache.lucene.util.BytesRef;
> import org.apache.lucene.util.BytesRefBuilder;
> import org.apache.lucene.util.IntsRefBuilder;
> import org.apache.lucene.util.fst.Builder;
> import org.apache.lucene.util.fst.FST;
> import org.apache.lucene.util.fst.PositiveIntOutputs;
> import org.apache.lucene.util.fst.Util;
>
> // Input values (keys). These must be provided to Builder in Unicode sorted order!
> String[] inputValues = {"𝐴", "figure", "flagship"};
> long[] outputValues = {5, 7, 12};
> PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
> Builder<Long> builder = new Builder<>(FST.INPUT_TYPE.BYTE1, outputs);
> BytesRefBuilder scratchBytes = new BytesRefBuilder();
> IntsRefBuilder scratchInts = new IntsRefBuilder();
> for (int i = 0; i < inputValues.length; i++) {
>   // Convert each key to UTF-8 bytes, then to the IntsRef form Builder.add expects.
>   scratchBytes.copyChars(inputValues[i]);
>   builder.add(Util.toIntsRef(scratchBytes.get(), scratchInts), outputValues[i]);
> }
> FST<Long> fst = builder.finish();
> Long value = Util.get(fst, new BytesRef("figure"));
> System.out.println(value);
> {code}
>  Please note that figure and flagship are using the ligature character fl 
> above.
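The bytes in the subject line hint at why the assertion can fire: `f0 9d 9c 8b` is the UTF-8 form of a supplementary code point (U+1D70B) and `ef ac 81` is U+FB01, the "fi" ligature. The sketch below, which uses only the JDK, shows that the two orders disagree for such a pair. It assumes the keys were sorted with `String.compareTo`, which the report does not state; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo of UTF-16 code unit order versus UTF-8 byte order.
final class SortOrderSketch {
    // Java's natural String order compares UTF-16 code units.
    static int compareUtf16(String a, String b) {
        return a.compareTo(b);
    }

    // UTF-8 byte order (bytes compared as unsigned) equals code point order.
    static int compareUtf8(String a, String b) {
        return Arrays.compareUnsigned(
            a.getBytes(StandardCharsets.UTF_8), b.getBytes(StandardCharsets.UTF_8));
    }

    // Returns a copy of the keys sorted in UTF-8 byte order.
    static String[] sortByUtf8(String[] keys) {
        String[] copy = keys.clone();
        Arrays.sort(copy, SortOrderSketch::compareUtf8);
        return copy;
    }

    public static void main(String[] args) {
        String supplementary = new String(Character.toChars(0x1D70B)); // f0 9d 9c 8b
        String ligature = "\uFB01gure";                                // ef ac 81 67 75 72 65
        // UTF-16: surrogate 0xD835 sorts before 0xFB01.
        System.out.println(compareUtf16(supplementary, ligature) < 0);
        // UTF-8: first byte 0xF0 sorts after 0xEF.
        System.out.println(compareUtf8(supplementary, ligature) < 0);
    }
}
```

Under that assumption, sorting the keys by their UTF-8 bytes (as `sortByUtf8` does) before adding them would give the order that a `BYTE1` builder sees.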






[jira] [Commented] (LUCENE-9077) Gradle build

2020-01-20 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019456#comment-17019456
 ] 

ASF subversion and git services commented on LUCENE-9077:
-

Commit 1ad6bc9361bb6737d85615b7450dd4b28345c572 in lucene-solr's branch 
refs/heads/gradle-master from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=1ad6bc9 ]

LUCENE-9077: Allow locally staged files in git status precommit check.



[jira] [Commented] (SOLR-10217) Add a query for the background set to the significantTerms streaming expression

2020-01-20 Thread Jira


[ 
https://issues.apache.org/jira/browse/SOLR-10217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019424#comment-17019424
 ] 

Jan Høydahl commented on SOLR-10217:


Coming back to this. The use case we had was news articles. You'd like to 
compare a foreground set of last week's content to a background set of, say, 
last month's content instead of the whole corpus. You may also want to limit 
the background set to articles from one newspaper or a set of papers only.

> Add a query for the background set to the significantTerms streaming 
> expression
> ---
>
> Key: SOLR-10217
> URL: https://issues.apache.org/jira/browse/SOLR-10217
> Project: Solr
>  Issue Type: New Feature
>Reporter: Gethin James
>Assignee: Joel Bernstein
>Priority: Major
> Attachments: SOLR-10217.patch, SOLR-10217.patch, SOLR-20217.patch
>
>
> Following the work on SOLR-10156 we now have a significantTerms expression.
> Currently, the background set is always the full index.  It would be great if 
> we could use a query to define the background set.






[jira] [Assigned] (SOLR-14151) Make schema components load from packages

2020-01-20 Thread Noble Paul (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noble Paul reassigned SOLR-14151:
-

Assignee: Noble Paul

> Make schema components load from packages
> -
>
> Key: SOLR-14151
> URL: https://issues.apache.org/jira/browse/SOLR-14151
> Project: Solr
>  Issue Type: Sub-task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
>  Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
> Example:
> {code:xml}
>  
> 
>   
>generateNumberParts="0" catenateWords="0"
>   catenateNumbers="0" catenateAll="0"/>
>   
>   
> 
>   
> {code}
> * When a package is updated, the entire {{IndexSchema}} object is refreshed, 
> but the SolrCore object is not reloaded
> * Any component can be prefixed with the package name
> * The semantics of loading plugins remain the same as that of the components 
> in {{solrconfig.xml}}
> * Plugins can be registered using schema API






[GitHub] [lucene-solr] zsgyulavari commented on issue #1144: SOLR-13756 updated restlet mvn repository url.

2020-01-20 Thread GitBox
zsgyulavari commented on issue #1144: SOLR-13756 updated restlet mvn repository 
url.
URL: https://github.com/apache/lucene-solr/pull/1144#issuecomment-576204938
 
 
   @uschindler, @joel-bernstein I've checked what I can do. The decision is 
that we won't be maintaining a separate public mirror for restlets. Currently 
the cloudera repo resolves last, so it might be a good backup, but if there is 
a trust issue we can/should remove it altogether since the old one is not used 
since 6.6.
   
   On another note I should make it work with gradle as well now I guess. Can 
you please decide on whether I should remove or keep the cloudera repo 
meanwhile?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services




[jira] [Commented] (LUCENE-4702) Terms dictionary compression

2020-01-20 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019337#comment-17019337
 ] 

Adrien Grand commented on LUCENE-4702:
--

I think it's ready now. I restored some of the performance of PKLookup by 
encoding explicitly some cases that are common for ID fields like all docFreqs 
equal to 1, I now get a slowdown of 6% only. I also increased compression a bit 
in the case that the ID scheme generates monotonically increasing ids like 
Flake IDs or auto-increment IDs by delta-encoding the singleton doc IDs that 
are pulsed into the terms dictionary. It saves some more space on the 
wikibigall index, see the "other bytes" section especially.

{noformat}
  index FST:
    72 bytes
  terms:
    6647577 terms
    39885462 bytes (6.0 bytes/term)
  blocks:
    189932 blocks
    184655 terms-only blocks
    5277 sub-block-only blocks
    0 mixed blocks
    0 floor blocks
    189932 non-floor blocks
    0 floor sub-blocks
    14057717 term suffix bytes before compression (43.1 suffix-bytes/block)
    8180420 compressed term suffix bytes (0.58 compression ratio - compression count by algorithm: NO_COMPRESSION: 189932)
    6647577 term stats bytes before compression (35.0 stats-bytes/block)
    189932 compressed term stats bytes (0.03 compression ratio)
    15904094 other bytes (83.7 other-bytes/block)
{noformat}
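To illustrate why delta-encoding helps when IDs increase monotonically, here is a small self-contained sketch. It is not the code from the patch: `DeltaSketch` and its methods are hypothetical, and the 7-bits-per-byte size estimate is only an assumption used to show that small deltas need fewer bytes than the original values.

```java
// Hypothetical illustration of delta-encoding a sorted sequence of doc IDs.
final class DeltaSketch {
    // Stores the first value as-is and every later value as a difference.
    static long[] encode(long[] sortedIds) {
        long[] deltas = new long[sortedIds.length];
        long previous = 0;
        for (int i = 0; i < sortedIds.length; i++) {
            deltas[i] = sortedIds[i] - previous;
            previous = sortedIds[i];
        }
        return deltas;
    }

    // Rebuilds the original values with a running sum.
    static long[] decode(long[] deltas) {
        long[] ids = new long[deltas.length];
        long current = 0;
        for (int i = 0; i < deltas.length; i++) {
            current += deltas[i];
            ids[i] = current;
        }
        return ids;
    }

    // Bytes needed for a non-negative value at 7 payload bits per byte.
    static int vLongSize(long value) {
        int size = 1;
        while ((value & ~0x7FL) != 0) {
            value >>>= 7;
            size++;
        }
        return size;
    }

    static int totalSize(long[] values) {
        int total = 0;
        for (long v : values) {
            total += vLongSize(v);
        }
        return total;
    }
}
```

For the IDs 1000000, 1000001 and 1000003, the raw values need 3 bytes each under this estimate, while the deltas 1000000, 1 and 2 need 3 + 1 + 1 bytes.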

> Terms dictionary compression
> 
>
> Key: LUCENE-4702
> URL: https://issues.apache.org/jira/browse/LUCENE-4702
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Trivial
> Attachments: LUCENE-4702.patch, LUCENE-4702.patch
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> I've done a quick test with the block tree terms dictionary by replacing a 
> call to IndexOutput.writeBytes to write suffix bytes with a call to 
> LZ4.compressHC to test the performance hit. Interestingly, search performance 
> was very good (see comparison table below) and the tim files were 14% smaller 
> (from 150432 bytes overall to 129516).
> {noformat}
> Task              QPS baseline   StdDev  QPS compressed   StdDev  Pct diff
> Fuzzy1                  111.50   (2.0%)           78.78   (1.5%)    -29.4% ( -32% -  -26%)
> Fuzzy2                   36.99   (2.7%)           28.59   (1.5%)    -22.7% ( -26% -  -18%)
> Respell                 122.86   (2.1%)          103.89   (1.7%)    -15.4% ( -18% -  -11%)
> Wildcard                100.58   (4.3%)           94.42   (3.2%)     -6.1% ( -13% -    1%)
> Prefix3                 124.90   (5.7%)          122.67   (4.7%)     -1.8% ( -11% -    9%)
> OrHighLow               169.87   (6.8%)          167.77   (8.0%)     -1.2% ( -15% -   14%)
> LowTerm                1949.85   (4.5%)         1929.02   (3.4%)     -1.1% (  -8% -    7%)
> AndHighLow             2011.95   (3.5%)         1991.85   (3.3%)     -1.0% (  -7% -    5%)
> OrHighHigh              155.63   (6.7%)          154.12   (7.9%)     -1.0% ( -14% -   14%)
> AndHighHigh             341.82   (1.2%)          339.49   (1.7%)     -0.7% (  -3% -    2%)
> OrHighMed               217.55   (6.3%)          216.16   (7.1%)     -0.6% ( -13% -   13%)
> IntNRQ                   53.10  (10.9%)           52.90   (8.6%)     -0.4% ( -17% -   21%)
> MedTerm                 998.11   (3.8%)          994.82   (5.6%)     -0.3% (  -9% -    9%)
> MedSpanNear              60.50   (3.7%)           60.36   (4.8%)     -0.2% (  -8% -    8%)
> HighSpanNear             19.74   (4.5%)           19.72   (5.1%)     -0.1% (  -9% -    9%)
> LowSpanNear             101.93   (3.2%)          101.82   (4.4%)     -0.1% (  -7% -    7%)
> AndHighMed              366.18   (1.7%)          366.93   (1.7%)      0.2% (  -3% -    3%)
> PKLookup                237.28   (4.0%)          237.96   (4.2%)      0.3% (  -7% -    8%)
> MedPhrase               173.17   (4.7%)          174.69   (4.7%)      0.9% (  -8% -   10%)
> LowSloppyPhrase         180.91   (2.6%)          182.79   (2.7%)      1.0% (  -4% -    6%)
> LowPhrase               374.64   (5.5%)          379.11   (5.8%)      1.2% (  -9% -   13%)
> HighTerm                253.14   (7.9%)          256.97  (11.4%)      1.5% ( -16% -   22%)
> HighPhrase               19.52  (10.6%)           19.83  (11.0%)      1.6% ( -18% -   25%)
> MedSloppyPhrase         141.90   (2.6%)          144.11   (2.5%)      1.6% (  -3% -    6%)
> HighSloppyPhrase         25.26   (4.8%)           25.97   (5.0%)      2.8% (  -6% -   13%)
> {noformat}
> Only queries which are very terms-dictionary-intensi

[jira] [Commented] (LUCENE-9077) Gradle build

2020-01-20 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019313#comment-17019313
 ] 

ASF subversion and git services commented on LUCENE-9077:
-

Commit 351b30489c2a47f32500abbb7bdc4ea30e37a247 in lucene-solr's branch 
refs/heads/master from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=351b304 ]

LUCENE-9077: Enable javac linting as in ant. TONS of warnings are currently 
printed.



[GitHub] [lucene-solr] iverase commented on issue #1170: LUCENE-9141: Simplify LatLonShapeXQuery API.

2020-01-20 Thread GitBox
iverase commented on issue #1170: LUCENE-9141: Simplify LatLonShapeXQuery API.
URL: https://github.com/apache/lucene-solr/pull/1170#issuecomment-576177467
 
 
   Updated the PR by making LatLonGeometry an abstract class so the method 
that returns a Component2D can be made protected.





[GitHub] [lucene-solr] iverase opened a new pull request #1187: LUCENE-9152: Improve line intersection detection for polygons

2020-01-20 Thread GitBox
iverase opened a new pull request #1187: LUCENE-9152: Improve line intersection 
detection for polygons
URL: https://github.com/apache/lucene-solr/pull/1187
 
 
   This patch changes the logic so we consider the boundary if no points of the 
triangle are inside the polygon. If all points are inside, then the boundary is 
not considered.





[jira] [Created] (LUCENE-9152) Improve line intersections from polygons

2020-01-20 Thread Ignacio Vera (Jira)
Ignacio Vera created LUCENE-9152:


 Summary: Improve line intersections from polygons
 Key: LUCENE-9152
 URL: https://issues.apache.org/jira/browse/LUCENE-9152
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Ignacio Vera


Currently we always check triangle intersection in polygons without considering 
the boundary. This is not quite right, as it might miss an intersection when a 
polygon and a triangle touch each other.

 

The proposal is the following:

   * If no points of the triangle are inside the polygon, then consider 
the boundary.

   * If all points are inside the polygon, then do not consider the boundary.
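
A minimal sketch of the rule above, written in Java and not taken from the patch. The class BoundaryRuleSketch and its methods are hypothetical names used only for illustration; they are not Lucene APIs. The point-in-polygon test is a plain even-odd ray cast, which may differ from what Lucene actually does. The proposal does not say what happens when only one or two vertices are inside, so the sketch assumes the boundary is not consulted in that case.

```java
/** Hypothetical sketch; class and method names are illustrative, not Lucene APIs. */
final class BoundaryRuleSketch {

  /** Even-odd ray-casting test. Points exactly on an edge may be classified either way. */
  static boolean contains(double[] polyX, double[] polyY, double x, double y) {
    boolean inside = false;
    for (int i = 0, j = polyX.length - 1; i < polyX.length; j = i++) {
      // Does the edge (j -> i) straddle the horizontal line through y?
      boolean straddles = (polyY[i] > y) != (polyY[j] > y);
      if (straddles) {
        // x coordinate where the edge crosses that horizontal line
        double xAtY = (polyX[j] - polyX[i]) * (y - polyY[i]) / (polyY[j] - polyY[i]) + polyX[i];
        if (x < xAtY) {
          inside = !inside;
        }
      }
    }
    return inside;
  }

  /** Counts how many of the three triangle vertices fall inside the polygon. */
  static int countInside(double[] polyX, double[] polyY, double[] triX, double[] triY) {
    int count = 0;
    for (int k = 0; k < 3; k++) {
      if (contains(polyX, polyY, triX[k], triY[k])) {
        count++;
      }
    }
    return count;
  }

  /**
   * The proposed rule: consult the boundary only when no vertex is inside.
   * Assumption: the mixed case (1 or 2 vertices inside) is treated like "all inside".
   */
  static boolean considerBoundary(int verticesInside) {
    return verticesInside == 0;
  }
}
```

For a 10x10 square, a triangle lying entirely inside yields a count of 3 (boundary ignored), while a triangle lying entirely outside yields 0 (boundary consulted, so a touching edge would be reported).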



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Commented] (LUCENE-9077) Gradle build

2020-01-20 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019305#comment-17019305
 ] 

ASF subversion and git services commented on LUCENE-9077:
-

Commit 1ad6bc9361bb6737d85615b7450dd4b28345c572 in lucene-solr's branch 
refs/heads/master from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=1ad6bc9 ]

LUCENE-9077: Allow locally staged files in git status precommit check.


> Gradle build
> 
>
> Key: LUCENE-9077
> URL: https://issues.apache.org/jira/browse/LUCENE-9077
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Major
> Fix For: master (9.0)
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> This task focuses on providing a gradle-based build equivalent for Lucene and 
> Solr (on master branch). See notes below on why this respin is needed.
> The code lives on *gradle-master* branch. It is kept in sync with *master*. 
> Try running the following to see an overview of helper guides concerning 
> typical workflow, testing and ant-migration helpers:
> gradlew :help
> A list of items that need to be added or require work. If you'd like to 
> work on any of these, please add your name to the list. Once you have a 
> patch/ pull request let me (dweiss) know - I'll try to coordinate the merges.
>  * (/) Apply forbiddenAPIs
>  * (/) Generate hardware-aware gradle defaults for parallelism (count of 
> workers and test JVMs).
>  * (/) Fail the build if --tests filter is applied and no tests execute 
> during the entire build (this allows for an empty set of filtered tests at 
> single project level).
>  * (/) Port other settings and randomizations from common-build.xml
>  * (/) Configure security policy/ sandboxing for tests.
>  * (/) test's console output on -Ptests.verbose=true
>  * (/) add a :helpDeps explanation to how the dependency system works 
> (palantir plugin, lockfile) and how to retrieve structured information about 
> current dependencies of a given module (in a tree-like output).
>  * (/) jar checksums, jar checksum computation and validation. This should be 
> done without intermediate folders (directly on dependency sets).
>  * (/) verify min. JVM version and exact gradle version on build startup to 
> minimize odd build side-effects
>  * (/) Repro-line for failed tests/ runs.
>  * (/) add a top-level README note about building with gradle (and the 
> required JVM).
>  * (/) add an equivalent of 'validate-source-patterns' 
> (check-source-patterns.groovy) to precommit.
>  * (/) add an equivalent of 'rat-sources' to precommit.
>  * (/) add an equivalent of 'check-example-lucene-match-version' (solr only) 
> to precommit.
> * (/) javadoc compilation
> Hard-to-implement stuff already investigated:
>  * (/) (done)  -*Printing console output of failed tests.* There doesn't seem 
> to be any way to do this in a reasonably efficient way. There are onOutput 
> listeners but they're slow to operate and solr tests emit *tons* of output so 
> it's an overkill.-
>  * (!) (LUCENE-9120) *Tests working with security-debug logs or other 
> JVM-early log output*. Gradle's test runner works by redirecting Java's 
> stdout/stderr so this just won't work. Perhaps we can spin the ant-based 
> test runner for such corner-cases.
> Of lesser importance:
>  * Add an equivalent of 'documentation-lint' to precommit.
>  * (/) Do not require files to be committed before running precommit. (staged 
> files are fine).
>  * (/) add rendering of javadocs (gradlew javadoc)
>  * Attach javadocs to maven publications.
>  * Add test 'beasting' (rerunning the same suite multiple times). I'm afraid 
> it'll be difficult to run it sensibly because gradle doesn't offer cwd 
> separation for the forked test runners.
>  * if you diff solr packaged distribution against ant-created distribution 
> there are minor differences in library versions and some JARs are excluded/ 
> moved around. I didn't try to force these as everything seems to work (tests, 
> etc.) – perhaps these differences should be fixed in the ant build instead.
>  * [EOE] identify and port various "regenerate" tasks from ant builds 
> (javacc, precompiled automata, etc.)
>  * Fill in POM details in gradle/defaults-maven.gradle so that they reflect 
> the previous content better (dependencies aside).
>  * Add any IDE integration layers that should be added (I use IntelliJ and it 
> imports the project out of the box, without the need for any special tuning).
>  * Add Solr packaging for docs/* (see TODO in packaging/build.gradle; 
> currently XSLT...)
>  * I didn't bother adding Solr dist/test-framework to packaging (who'd use it 
> from a binary distribution?)
>  
> *Note:* this builds on the work done by Mark Mi

[jira] [Updated] (LUCENE-9077) Gradle build

2020-01-20 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated LUCENE-9077:

Description: 
This task focuses on providing a gradle-based build equivalent for Lucene and 
Solr (on master branch). See notes below on why this respin is needed.

The code lives on *gradle-master* branch. It is kept in sync with *master*. 
Try running the following to see an overview of helper guides concerning 
typical workflow, testing and ant-migration helpers:

gradlew :help

A list of items that need to be added or require work. If you'd like to work 
on any of these, please add your name to the list. Once you have a patch/ pull 
request let me (dweiss) know - I'll try to coordinate the merges.
 * (/) Apply forbiddenAPIs
 * (/) Generate hardware-aware gradle defaults for parallelism (count of 
workers and test JVMs).
 * (/) Fail the build if --tests filter is applied and no tests execute during 
the entire build (this allows for an empty set of filtered tests at single 
project level).
 * (/) Port other settings and randomizations from common-build.xml
 * (/) Configure security policy/ sandboxing for tests.
 * (/) test's console output on -Ptests.verbose=true
 * (/) add a :helpDeps explanation to how the dependency system works (palantir 
plugin, lockfile) and how to retrieve structured information about current 
dependencies of a given module (in a tree-like output).
 * (/) jar checksums, jar checksum computation and validation. This should be 
done without intermediate folders (directly on dependency sets).
 * (/) verify min. JVM version and exact gradle version on build startup to 
minimize odd build side-effects
 * (/) Repro-line for failed tests/ runs.
 * (/) add a top-level README note about building with gradle (and the required 
JVM).
 * (/) add an equivalent of 'validate-source-patterns' 
(check-source-patterns.groovy) to precommit.
 * (/) add an equivalent of 'rat-sources' to precommit.
 * (/) add an equivalent of 'check-example-lucene-match-version' (solr only) to 
precommit.
* (/) javadoc compilation

Hard-to-implement stuff already investigated:
 * (/) (done)  -*Printing console output of failed tests.* There doesn't seem 
to be any way to do this in a reasonably efficient way. There are onOutput 
listeners but they're slow to operate and solr tests emit *tons* of output so 
it's an overkill.-
 * (!) (LUCENE-9120) *Tests working with security-debug logs or other JVM-early 
log output*. Gradle's test runner works by redirecting Java's stdout/stderr so 
this just won't work. Perhaps we can spin the ant-based test runner for such 
corner-cases.

Of lesser importance:
 * Add an equivalent of 'documentation-lint' to precommit.
 * (/) Do not require files to be committed before running precommit. (staged 
files are fine).
 * (/) add rendering of javadocs (gradlew javadoc)
 * Attach javadocs to maven publications.
 * Add test 'beasting' (rerunning the same suite multiple times). I'm afraid 
it'll be difficult to run it sensibly because gradle doesn't offer cwd 
separation for the forked test runners.
 * if you diff solr packaged distribution against ant-created distribution 
there are minor differences in library versions and some JARs are excluded/ 
moved around. I didn't try to force these as everything seems to work (tests, 
etc.) – perhaps these differences should be fixed in the ant build instead.
 * [EOE] identify and port various "regenerate" tasks from ant builds (javacc, 
precompiled automata, etc.)
 * Fill in POM details in gradle/defaults-maven.gradle so that they reflect the 
previous content better (dependencies aside).
 * Add any IDE integration layers that should be added (I use IntelliJ and it 
imports the project out of the box, without the need for any special tuning).
 * Add Solr packaging for docs/* (see TODO in packaging/build.gradle; currently 
XSLT...)
 * I didn't bother adding Solr dist/test-framework to packaging (who'd use it 
from a binary distribution?)

 

*Note:* this builds on the work done by Mark Miller and 
Cao Mạnh Đạt but also applies lessons learned from those two efforts:
 * *Do not try to do too many things at once*. If we deviate too far from 
master, the branch will be hard to merge.
 * *Do everything in baby-steps* and add small, independent build fragments 
replacing the old ant infrastructure.
 * *Try to engage people to run, test and contribute early*. It can't be a 
one-man effort. The more people understand and can contribute to the build, the 
healthier it will be.

 

  was:
This task focuses on providing gradle-based build equivalent for Lucene and 
Solr (on master branch). See notes below on why this respin is needed.

The code lives on *gradle-master* branch. It is kept with sync with *master*. 
Try running the following to see an overview of helper guides concerning 
typical workflow, testing and ant-migration helper

[jira] [Commented] (SOLR-14040) solr.xml shareSchema does not work in SolrCloud

2020-01-20 Thread Noble Paul (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019292#comment-17019292
 ] 

Noble Paul commented on SOLR-14040:
---

Has any study been done on the benefit we get from sharing the schema? 
The schema itself is a lightweight object. The only expensive part of loading 
`IndexSchema` is the cost of parsing and loading the XML. If we can use a 
better parser, that will cease to be a problem.

> solr.xml shareSchema does not work in SolrCloud
> ---
>
> Key: SOLR-14040
> URL: https://issues.apache.org/jira/browse/SOLR-14040
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Schema and Analysis
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> solr.xml has a shareSchema boolean option that can be toggled from the 
> default of false to true in order to share IndexSchema objects within the 
> Solr node.  This is silently ignored in SolrCloud mode.  The pertinent code 
> is {{org.apache.solr.core.ConfigSetService#createConfigSetService}} which 
> creates a CloudConfigSetService that is not related to the SchemaCaching 
> class.  This may not be a big deal in SolrCloud which tends not to deal well 
> with many cores per node but I'm working on changing that.
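
To make the idea of sharing schema objects within a node concrete, here is a minimal Java sketch of a load-once cache. It is not Solr's implementation: SharedSchemaCacheSketch, the string key, and the loader function are all hypothetical, and the generic type S merely stands in for something like IndexSchema. The only library call it relies on is ConcurrentHashMap.computeIfAbsent from the JDK.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical sketch of a per-node schema cache; not Solr code. */
final class SharedSchemaCacheSketch<S> {

  // Key is an assumed identifier for a schema resource (for example a config set name).
  private final ConcurrentHashMap<String, S> cache = new ConcurrentHashMap<>();

  // Assumed loader that performs the expensive parse for a given key.
  private final Function<String, S> loader;

  SharedSchemaCacheSketch(Function<String, S> loader) {
    this.loader = loader;
  }

  /** Returns the cached schema for the key, parsing it at most once per key. */
  S get(String key) {
    return cache.computeIfAbsent(key, loader);
  }
}
```

With such a cache, two cores that ask for the same key receive the same object and the parse cost is paid once; whether that saving matters in practice is exactly the question raised in the comment above.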






[GitHub] [lucene-solr] iverase commented on issue #1174: LUCENE-8621: Refactor LatLonShape, XYShape, and all query and utility classes to core

2020-01-20 Thread GitBox
iverase commented on issue #1174: LUCENE-8621: Refactor LatLonShape, XYShape, 
and all query and utility classes to core
URL: https://github.com/apache/lucene-solr/pull/1174#issuecomment-576152492
 
 
   +1 to backport

