[jira] [Resolved] (SOLR-14248) Improve ClusterStateMockUtil and make its methods public

2020-02-07 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar resolved SOLR-14248.
--
Resolution: Fixed

> Improve ClusterStateMockUtil and make its methods public
> 
>
> Key: SOLR-14248
> URL: https://issues.apache.org/jira/browse/SOLR-14248
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Tests
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>Priority: Minor
> Fix For: master (9.0), 8.5
>
> Attachments: SOLR-14248.patch, SOLR-14248.patch
>
>
> While working on SOLR-13996, I needed to mock the cluster state for various
> configurations, so I used ClusterStateMockUtil.
> However, I ran into a few issues that needed to be fixed:
> 1. The methods in this class are protected, making it usable only within the
> same package
> 2. A null router was set for DocCollection objects
> 3. The DocCollection object is created before the slices, so the
> DocCollection.getActiveSlices method returns an empty list because the active
> slices map is created inside the DocCollection constructor
> 4. It did not set a core name for the replicas it created
> 5. It has no support for replica types, so it only creates NRT replicas
> I will use this Jira to fix these problems and make the methods in that class
> public (but marked as experimental).
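For illustration, a hedged sketch of how a test might use the class once the
methods are public. The buildClusterState signature and the state-description
string below are assumptions drawn from the class's existing conventions, not
the final API:

{code}
// Hypothetical usage ("c" = collection, "s" = slice, "r" = replica; a digit
// picks the live node the replica lands on).
ZkStateReader reader =
    ClusterStateMockUtil.buildClusterState("csrr2", "baseUrl1_", "baseUrl2_");
ClusterState state = reader.getClusterState();
DocCollection coll = state.getCollection("collection1");

// With fixes 2-4 above, the router is non-null, replicas carry core names,
// and getActiveSlices() is no longer empty:
assert coll.getRouter() != null;
assert !coll.getActiveSlices().isEmpty();
{code}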



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14248) Improve ClusterStateMockUtil and make its methods public

2020-02-07 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032834#comment-17032834
 ] 

ASF subversion and git services commented on SOLR-14248:


Commit e623eb53207b8dabfe36d6a9679b7590ec4a1d20 in lucene-solr's branch 
refs/heads/branch_8x from Shalin Shekhar Mangar
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e623eb5 ]

SOLR-14248: Improve ClusterStateMockUtil and make its methods public

(cherry picked from commit f5c132be6d3fc20f689e630517e7c6be2166f17e)


> Improve ClusterStateMockUtil and make its methods public
> 
>
> Key: SOLR-14248
> URL: https://issues.apache.org/jira/browse/SOLR-14248
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Tests
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>Priority: Minor
> Fix For: master (9.0), 8.5
>
> Attachments: SOLR-14248.patch, SOLR-14248.patch
>
>
> While working on SOLR-13996, I needed to mock the cluster state for various
> configurations, so I used ClusterStateMockUtil.
> However, I ran into a few issues that needed to be fixed:
> 1. The methods in this class are protected, making it usable only within the
> same package
> 2. A null router was set for DocCollection objects
> 3. The DocCollection object is created before the slices, so the
> DocCollection.getActiveSlices method returns an empty list because the active
> slices map is created inside the DocCollection constructor
> 4. It did not set a core name for the replicas it created
> 5. It has no support for replica types, so it only creates NRT replicas
> I will use this Jira to fix these problems and make the methods in that class
> public (but marked as experimental).






[jira] [Commented] (SOLR-14248) Improve ClusterStateMockUtil and make its methods public

2020-02-07 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032830#comment-17032830
 ] 

ASF subversion and git services commented on SOLR-14248:


Commit f5c132be6d3fc20f689e630517e7c6be2166f17e in lucene-solr's branch 
refs/heads/master from Shalin Shekhar Mangar
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=f5c132b ]

SOLR-14248: Improve ClusterStateMockUtil and make its methods public


> Improve ClusterStateMockUtil and make its methods public
> 
>
> Key: SOLR-14248
> URL: https://issues.apache.org/jira/browse/SOLR-14248
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Tests
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>Priority: Minor
> Fix For: master (9.0), 8.5
>
> Attachments: SOLR-14248.patch, SOLR-14248.patch
>
>
> While working on SOLR-13996, I needed to mock the cluster state for various
> configurations, so I used ClusterStateMockUtil.
> However, I ran into a few issues that needed to be fixed:
> 1. The methods in this class are protected, making it usable only within the
> same package
> 2. A null router was set for DocCollection objects
> 3. The DocCollection object is created before the slices, so the
> DocCollection.getActiveSlices method returns an empty list because the active
> slices map is created inside the DocCollection constructor
> 4. It did not set a core name for the replicas it created
> 5. It has no support for replica types, so it only creates NRT replicas
> I will use this Jira to fix these problems and make the methods in that class
> public (but marked as experimental).






[jira] [Commented] (SOLR-14248) Improve ClusterStateMockUtil and make its methods public

2020-02-07 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032824#comment-17032824
 ] 

Shalin Shekhar Mangar commented on SOLR-14248:
--

The latest patch adds support for replica types and resolves a conflict 
introduced by SOLR-14245. It also adds a test for this class. This is ready to 
go.

> Improve ClusterStateMockUtil and make its methods public
> 
>
> Key: SOLR-14248
> URL: https://issues.apache.org/jira/browse/SOLR-14248
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Tests
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>Priority: Minor
> Fix For: master (9.0), 8.5
>
> Attachments: SOLR-14248.patch, SOLR-14248.patch
>
>
> While working on SOLR-13996, I needed to mock the cluster state for various
> configurations, so I used ClusterStateMockUtil.
> However, I ran into a few issues that needed to be fixed:
> 1. The methods in this class are protected, making it usable only within the
> same package
> 2. A null router was set for DocCollection objects
> 3. The DocCollection object is created before the slices, so the
> DocCollection.getActiveSlices method returns an empty list because the active
> slices map is created inside the DocCollection constructor
> 4. It did not set a core name for the replicas it created
> 5. It has no support for replica types, so it only creates NRT replicas
> I will use this Jira to fix these problems and make the methods in that class
> public (but marked as experimental).






[jira] [Updated] (SOLR-14248) Improve ClusterStateMockUtil and make its methods public

2020-02-07 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-14248:
-
Attachment: SOLR-14248.patch

> Improve ClusterStateMockUtil and make its methods public
> 
>
> Key: SOLR-14248
> URL: https://issues.apache.org/jira/browse/SOLR-14248
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Tests
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>Priority: Minor
> Fix For: master (9.0), 8.5
>
> Attachments: SOLR-14248.patch, SOLR-14248.patch
>
>
> While working on SOLR-13996, I needed to mock the cluster state for various
> configurations, so I used ClusterStateMockUtil.
> However, I ran into a few issues that needed to be fixed:
> 1. The methods in this class are protected, making it usable only within the
> same package
> 2. A null router was set for DocCollection objects
> 3. The DocCollection object is created before the slices, so the
> DocCollection.getActiveSlices method returns an empty list because the active
> slices map is created inside the DocCollection constructor
> 4. It did not set a core name for the replicas it created
> 5. It has no support for replica types, so it only creates NRT replicas
> I will use this Jira to fix these problems and make the methods in that class
> public (but marked as experimental).






[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-07 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032804#comment-17032804
 ] 

Robert Muir commented on LUCENE-9201:
-

{quote}
Package summary: "ant documentation" uses "package.html" as the package summary
description, but "gradlew javadoc" ignores "package.html" (so some packages
lack a summary description in "package-summary.html" when building javadocs
with Gradle). We might be able to make the Gradle Javadoc task properly handle
"package.html" files with some options. Or, should we replace all
"package.html" with "package-info.java" at this time?
{quote}

I found the answer to this. Gradle is fundamentally broken here; it's not
possible to fix it.

When ant runs javadocs, we supply just a source directory (-sourcepath) and a 
list of packages:

{noformat}
javadoc -sourcepath /home/rmuir/workspace/lucene-solr/lucene/core/src/java 
org.apache.lucene org.apache.lucene.analysis 
org.apache.lucene.analysis.standard ...
{noformat}

When gradle runs javadocs, it does not do this; it passes each .java file
individually:

{noformat}
javadoc 
'/home/rmuir/workspace/lucene-solr/lucene/core/src/java/org/apache/lucene/search/SearcherFactory.java'
'/home/rmuir/workspace/lucene-solr/lucene/core/src/java/org/apache/lucene/search/QueryCache.java'
 ...
{noformat}

It seems the whole design is to make it work with their SourceTask/FileTree
crap. And you can't pass individual html files to the javadoc tool to work
around it. It takes only source files or package names.

I can't see any way to pass their task a package list the way we do with ant:
it *REALLY* wants to be based on the FileTree. Maybe we should call the ant
task from gradle? They really messed this up.

The other thing that seems really broken is the missing linkoffline. There are
links between the modules (e.g. lucene-analyzers and lucene-core), and
linkoffline makes that work. But it seems the gradle build is structured to
make per-module output dirs, which won't work here.
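For reference, the ant-style shape that would have to be reproduced, with
-linkoffline wiring up the inter-module links (paths and packages here are
illustrative, not the exact build invocation):

{noformat}
javadoc -sourcepath /home/rmuir/workspace/lucene-solr/lucene/analysis/common/src/java \
  -linkoffline ../core /home/rmuir/workspace/lucene-solr/lucene/build/docs/core \
  org.apache.lucene.analysis.charfilter org.apache.lucene.analysis.core ...
{noformat}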



> Port documentation-lint task to Gradle build
> 
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: javadocGRADLE.png, javadocHTML4.png, javadocHTML5.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Ant build's "documentation-lint" target consists of those two sub targets.
>  * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
>  * "-documentation-lint"(Missing javadocs / broken links check by python 
> scripts)






[jira] [Commented] (LUCENE-9149) Increase data dimension limit in BKD

2020-02-07 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032748#comment-17032748
 ] 

ASF subversion and git services commented on LUCENE-9149:
-

Commit 0bd2496205a1319c34df2b8a236fb87f329bb3f4 in lucene-solr's branch 
refs/heads/branch_8x from Nicholas Knize
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=0bd2496 ]

LUCENE-9149: Increase data dimension limit in BKD


> Increase data dimension limit in BKD
> 
>
> Key: LUCENE-9149
> URL: https://issues.apache.org/jira/browse/LUCENE-9149
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Nick Knize
>Priority: Major
> Attachments: LUCENE-9149.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> LUCENE-8496 added selective indexing: the ability to designate the first K <=
> N dimensions to drive the construction of the BKD internal nodes. Follow-on
> work stored the "data dimensions" only for the leaf nodes, while only the
> "index dimensions" are stored for the internal nodes. While
> {{maxPointsInLeafNode}} is still important for managing the BKD heap memory
> footprint (thus we don't want this to get too large), I'd like to propose
> increasing the {{MAX_DIMENSIONS}} limit (to something not too crazy like 16,
> effectively doubling the data dimension limit) while maintaining
> {{MAX_INDEX_DIMENSIONS}} at 8.
> Doing this will enable us to encode higher-dimension data within a
> lower-dimension index (e.g., 3D tessellated triangles as a 10-dimension point
> using only the first 6 dimensions for index construction).
>  
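To make the triangle example concrete, here is a hedged sketch of such a field
through the existing selective-indexing API once the proposed limit is in place
(the three-argument FieldType.setDimensions comes from LUCENE-8496; 4 bytes per
dimension is an illustrative choice):

{code}
import org.apache.lucene.document.FieldType;

// 10 data dimensions, of which only the first 6 drive BKD index construction.
// Under the current MAX_DIMENSIONS = 8 this would be rejected; the proposal
// raises the data-dimension limit to 16 while keeping index dimensions at 8.
FieldType triangleType = new FieldType();
triangleType.setDimensions(10, 6, 4); // dataDims, indexDims, bytesPerDim
triangleType.freeze();
{code}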






[jira] [Updated] (SOLR-14250) Solr tries to read request body after error response is sent

2020-02-07 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SOLR-14250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl updated SOLR-14250:
---
Description: 
If a client sends an {{HTTP POST}} request with the header
{{Expect: 100-continue}}, the normal flow is for Solr (Jetty) to first respond
with an {{HTTP 100 Continue}} response; the client then sends the body, which
is processed, and a final response is sent by Solr.

However, if such a request leads to an error (e.g. 404 or 401), then Solr will
skip the 100 response and instead send the error response directly. The very
last action of {{SolrDispatchFilter#doFilter}} is to call
{{consumeInputFully()}}. However, this should not be done if an error response
has already been sent; otherwise you'll provoke an exception in Jetty's HTTP
lib:
{noformat}
2020-02-07 23:13:26.459 INFO  (qtp403547747-24) [   ] 
o.a.s.s.SolrDispatchFilter Could not consume full client request => 
java.io.IOException: Committed before 100 Continues
at 
org.eclipse.jetty.http2.server.HttpChannelOverHTTP2.continue100(HttpChannelOverHTTP2.java:362)
java.io.IOException: Committed before 100 Continues
at 
org.eclipse.jetty.http2.server.HttpChannelOverHTTP2.continue100(HttpChannelOverHTTP2.java:362)
 ~[http2-server-9.4.19.v20190610.jar:9.4.19.v20190610]
at org.eclipse.jetty.server.Request.getInputStream(Request.java:872) 
~[jetty-server-9.4.19.v20190610.jar:9.4.19.v20190610]
at 
javax.servlet.ServletRequestWrapper.getInputStream(ServletRequestWrapper.java:185)
 ~[javax.servlet-api-3.1.0.jar:3.1.0]
at 
org.apache.solr.servlet.SolrDispatchFilter$1.getInputStream(SolrDispatchFilter.java:612)
 ~[solr-core-8.4.1.jar:8.4.1 832bf13dd9187095831caf69783179d41059d013 - ishan - 
2020-01-10 13:40:28]
at 
org.apache.solr.servlet.SolrDispatchFilter.consumeInputFully(SolrDispatchFilter.java:454)
 ~[solr-core-8.4.1.jar:8.4.1 832bf13dd9187095831caf69783179d41059d013 - ishan - 
2020-01-10 13:40:28]
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:445)
 ~[solr-core-8.4.1.jar:8.4.1 832bf13dd9187095831caf69783179d41059d013 - ishan - 
2020-01-10 13:40:28]
{noformat}
 

  was:
If a client sends an {{HTTP POST}} request with the header
{{Expect: 100-continue}}, the normal flow is for Solr (Jetty) to first respond
with an {{HTTP 100 Continue}} response; the client then sends the body, which
is processed, and a final response is sent by Solr.

However, if such a request leads to an error (e.g. 404 or 401), then Solr will
skip the 100 response and instead send the error response directly. The very
last action of {{SolrDispatchFilter#doFilter}} is to call
{{consumeInputFully()}}. However, this should not be done if an error response
has already been sent; otherwise you'll provoke an exception in Jetty's HTTP
lib:
{noformat}
solr1_1      | 2020-02-07 23:13:26.459 INFO  (qtp403547747-24) [   ] 
o.a.s.s.SolrDispatchFilter Could not consume full client request => 
java.io.IOException: Committed before 100 Continuessolr1_1      | 2020-02-07 
23:13:26.459 INFO  (qtp403547747-24) [   ] o.a.s.s.SolrDispatchFilter Could not 
consume full client request => java.io.IOException: Committed before 100 
Continuessolr1_1      |  at 
org.eclipse.jetty.http2.server.HttpChannelOverHTTP2.continue100(HttpChannelOverHTTP2.java:362)solr1_1
      | java.io.IOException: Committed before 100 Continuessolr1_1      |  at 
org.eclipse.jetty.http2.server.HttpChannelOverHTTP2.continue100(HttpChannelOverHTTP2.java:362)
 ~[http2-server-9.4.19.v20190610.jar:9.4.19.v20190610]solr1_1      |  at 
org.eclipse.jetty.server.Request.getInputStream(Request.java:872) 
~[jetty-server-9.4.19.v20190610.jar:9.4.19.v20190610]solr1_1      |  at 
javax.servlet.ServletRequestWrapper.getInputStream(ServletRequestWrapper.java:185)
 ~[javax.servlet-api-3.1.0.jar:3.1.0]solr1_1      |  at 
org.apache.solr.servlet.SolrDispatchFilter$1.getInputStream(SolrDispatchFilter.java:612)
 ~[solr-core-8.4.1.jar:8.4.1 832bf13dd9187095831caf69783179d41059d013 - ishan - 
2020-01-10 13:40:28]solr1_1      |  at 
org.apache.solr.servlet.SolrDispatchFilter.consumeInputFully(SolrDispatchFilter.java:454)
 ~[solr-core-8.4.1.jar:8.4.1 832bf13dd9187095831caf69783179d41059d013 - ishan - 
2020-01-10 13:40:28]solr1_1      |  at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:445)
 ~[solr-core-8.4.1.jar:8.4.1 832bf13dd9187095831caf69783179d41059d013 - ishan - 
2020-01-10 13:40:28] {noformat}
 


> Solr tries to read request body after error response is sent
> 
>
> Key: SOLR-14250
> URL: https://issues.apache.org/jira/browse/SOLR-14250
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Jan Høydahl
>

[jira] [Created] (SOLR-14250) Solr tries to read request body after error response is sent

2020-02-07 Thread Jira
Jan Høydahl created SOLR-14250:
--

 Summary: Solr tries to read request body after error response is 
sent
 Key: SOLR-14250
 URL: https://issues.apache.org/jira/browse/SOLR-14250
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Jan Høydahl


If a client sends an {{HTTP POST}} request with the header
{{Expect: 100-continue}}, the normal flow is for Solr (Jetty) to first respond
with an {{HTTP 100 Continue}} response; the client then sends the body, which
is processed, and a final response is sent by Solr.

However, if such a request leads to an error (e.g. 404 or 401), then Solr will
skip the 100 response and instead send the error response directly. The very
last action of {{SolrDispatchFilter#doFilter}} is to call
{{consumeInputFully()}}. However, this should not be done if an error response
has already been sent; otherwise you'll provoke an exception in Jetty's HTTP
lib:
{noformat}
solr1_1      | 2020-02-07 23:13:26.459 INFO  (qtp403547747-24) [   ] 
o.a.s.s.SolrDispatchFilter Could not consume full client request => 
java.io.IOException: Committed before 100 Continuessolr1_1      | 2020-02-07 
23:13:26.459 INFO  (qtp403547747-24) [   ] o.a.s.s.SolrDispatchFilter Could not 
consume full client request => java.io.IOException: Committed before 100 
Continuessolr1_1      |  at 
org.eclipse.jetty.http2.server.HttpChannelOverHTTP2.continue100(HttpChannelOverHTTP2.java:362)solr1_1
      | java.io.IOException: Committed before 100 Continuessolr1_1      |  at 
org.eclipse.jetty.http2.server.HttpChannelOverHTTP2.continue100(HttpChannelOverHTTP2.java:362)
 ~[http2-server-9.4.19.v20190610.jar:9.4.19.v20190610]solr1_1      |  at 
org.eclipse.jetty.server.Request.getInputStream(Request.java:872) 
~[jetty-server-9.4.19.v20190610.jar:9.4.19.v20190610]solr1_1      |  at 
javax.servlet.ServletRequestWrapper.getInputStream(ServletRequestWrapper.java:185)
 ~[javax.servlet-api-3.1.0.jar:3.1.0]solr1_1      |  at 
org.apache.solr.servlet.SolrDispatchFilter$1.getInputStream(SolrDispatchFilter.java:612)
 ~[solr-core-8.4.1.jar:8.4.1 832bf13dd9187095831caf69783179d41059d013 - ishan - 
2020-01-10 13:40:28]solr1_1      |  at 
org.apache.solr.servlet.SolrDispatchFilter.consumeInputFully(SolrDispatchFilter.java:454)
 ~[solr-core-8.4.1.jar:8.4.1 832bf13dd9187095831caf69783179d41059d013 - ishan - 
2020-01-10 13:40:28]solr1_1      |  at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:445)
 ~[solr-core-8.4.1.jar:8.4.1 832bf13dd9187095831caf69783179d41059d013 - ishan - 
2020-01-10 13:40:28] {noformat}
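A hedged sketch of one possible shape of a fix (names mirror SolrDispatchFilter
for illustration only; the actual patch may look different): skip the final
body-draining step when the client asked for {{Expect: 100-continue}} and an
error response was already committed without the interim 100.

{code}
import java.io.IOException;
import javax.servlet.ServletInputStream;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

final class RequestDrainer {
  static void drainIfSafe(HttpServletRequest req, HttpServletResponse rsp) throws IOException {
    boolean expects100 = "100-continue".equalsIgnoreCase(req.getHeader("Expect"));
    if (expects100 && rsp.isCommitted()) {
      // The error response was committed before the interim 100 was sent;
      // touching the input stream now triggers Jetty's
      // "Committed before 100 Continues" IOException, so skip the drain.
      return;
    }
    ServletInputStream in = req.getInputStream();
    byte[] buf = new byte[8192];
    while (in.read(buf) != -1) {
      // consume and discard the rest of the request body
    }
  }
}
{code}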
 






[jira] [Commented] (LUCENE-9146) Switch GitHub PR test from ant precommit to gradle

2020-02-07 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032732#comment-17032732
 ] 

ASF subversion and git services commented on LUCENE-9146:
-

Commit 7c20f6b8c5ec46cdd3f8f32a2fedcb5b0406ba3b in lucene-solr's branch 
refs/heads/master from Anshum Gupta
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=7c20f6b ]

LUCENE-9146: Create gradle precommit action (#1245)



> Switch GitHub PR test from ant precommit to gradle
> --
>
> Key: LUCENE-9146
> URL: https://issues.apache.org/jira/browse/LUCENE-9146
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Mike Drob
>Assignee: Anshum Gupta
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>







[jira] [Commented] (LUCENE-9146) Switch GitHub PR test from ant precommit to gradle

2020-02-07 Thread Anshum Gupta (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032734#comment-17032734
 ] 

Anshum Gupta commented on LUCENE-9146:
--

Merged into master.

> Switch GitHub PR test from ant precommit to gradle
> --
>
> Key: LUCENE-9146
> URL: https://issues.apache.org/jira/browse/LUCENE-9146
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Mike Drob
>Assignee: Anshum Gupta
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>







[GitHub] [lucene-solr] anshumg merged pull request #1245: LUCENE-9146: Create gradle precommit action

2020-02-07 Thread GitBox
anshumg merged pull request #1245: LUCENE-9146: Create gradle precommit action
URL: https://github.com/apache/lucene-solr/pull/1245
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services




[jira] [Comment Edited] (SOLR-14249) Krb5HttpClientBuilder should not buffer requests

2020-02-07 Thread Kevin Risden (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032723#comment-17032723
 ] 

Kevin Risden edited comment on SOLR-14249 at 2/7/20 10:56 PM:
--

So I haven't personally looked at Krb5HttpClientBuilder recently, other than
for the completely unrelated SOLR-13726. Part of the reason that a lot of
clients buffer is due to how Kerberos SPNEGO authentication works.

There are typically two parts:
* a request without authentication, where the server returns a 401 with a
negotiate response
* a request with authentication in response to the negotiate, which the server
can verify

If you don't put any optimizations in place, every request becomes two. A lot
of times a cookie is used here to limit the number of HTTP requests.

The reason the 401 and second request is an issue is when the request is a
non-repeatable one, like a POST body. The client ends up sending the body, gets
a 401, then goes "o crap, I need to send the body again" and can't, because
it's non-repeatable.

So a lot of times the super simple workaround is to buffer the request, do the
401 check dance, and then proceed. This is a way to make a non-repeatable
request semi-repeatable.

This buffering has issues though, as you found: the buffer should be limited in
size, which then limits the usefulness of this technique.

There are a few alternatives to buffering:
* Authenticate upfront with, say, an OPTIONS request, which will get the
cookie. The next request, say a POST, won't have any issue and won't do the 401
dance.
* "Preemptively" do SPNEGO authorization: if you know the SPN needed, create
the right authorization header. This also skips the 401, and the server can
check the header.
* Use the "Expect: 100-continue" header, which asks the server if it can handle
the request before the body is sent, and only sends the body if it can. This
actually holds the data back from being sent in the first place if possible.
** Curl automatically activates "Expect: 100-continue" under a few conditions:
https://gms.tf/when-curl-sends-100-continue.html
** Apache HttpClient does NOT do any special handling of "Expect: 100-continue"
** Not sure if Jetty HttpClient does anything with "Expect: 100-continue"

So long story short: yes, buffering is a problem.


was (Author: risdenk):
So I haven't personally looked at Krb5HttpClientBuilder recently, other than
for the completely unrelated SOLR-13726. Part of the reason that a lot of
clients buffer is due to how Kerberos SPNEGO authentication works.

There are typically two parts:
* a request without authentication, where the server returns a 401 with a
negotiate response
* a request with authentication in response to the negotiate, which the server
can verify

If you don't put any optimizations in place, every request becomes two. A lot
of times a cookie is used here to limit the number of HTTP requests.

The reason the 401 and second request is an issue is when the request is a
non-repeatable one, like a POST body. The client ends up sending the body, gets
a 401, then goes "o crap, I need to send the body again" and can't, because
it's non-repeatable.

So a lot of times the super simple workaround is to buffer the request, do the
401 check dance, and then proceed. This is a way to make a non-repeatable
request semi-repeatable.

This buffering has issues though, as you found: the buffer should be limited in
size, which then limits the usefulness of this technique.

There are a few alternatives to buffering:
* Authenticate upfront with, say, an OPTIONS request, which will get the
cookie. The next request, say a POST, won't have any issue and won't do the 401
dance.
* Use the "Expect: 100-continue" header, which asks the server if it can handle
the request before the body is sent, and only sends the body if it can. This
actually holds the data back from being sent in the first place if possible.
** Curl automatically activates "Expect: 100-continue" under a few conditions:
https://gms.tf/when-curl-sends-100-continue.html
** Apache HttpClient does NOT do any special handling of "Expect: 100-continue"
** Not sure if Jetty HttpClient does anything with "Expect: 100-continue"

So long story short: yes, buffering is a problem.

> Krb5HttpClientBuilder should not buffer requests 
> -
>
> Key: SOLR-14249
> URL: https://issues.apache.org/jira/browse/SOLR-14249
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Authentication, SolrJ
>Affects Versions: 7.4, master (9.0), 8.4.1
>Reporter: Jason Gerlowski
>Priority: Major
> Attachments: SOLR-14249-reproduction.patch
>
>
> When SolrJ clients enable Kerberos authentication, a request interceptor is 
> set up which wraps the actual HttpEntity in a 

[jira] [Comment Edited] (SOLR-14249) Krb5HttpClientBuilder should not buffer requests

2020-02-07 Thread Kevin Risden (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032723#comment-17032723
 ] 

Kevin Risden edited comment on SOLR-14249 at 2/7/20 10:53 PM:
--

So I haven't personally looked at Krb5HttpClientBuilder recently, other than
for the completely unrelated SOLR-13726. Part of the reason that a lot of
clients buffer is due to how Kerberos SPNEGO authentication works.

There are typically two parts:
* a request without authentication, where the server returns a 401 with a
negotiate response
* a request with authentication in response to the negotiate, which the server
can verify

If you don't put any optimizations in place, every request becomes two. A lot
of times a cookie is used here to limit the number of HTTP requests.

The reason the 401 and second request is an issue is when the request is a
non-repeatable one, like a POST body. The client ends up sending the body, gets
a 401, then goes "o crap, I need to send the body again" and can't, because
it's non-repeatable.

So a lot of times the super simple workaround is to buffer the request, do the
401 check dance, and then proceed. This is a way to make a non-repeatable
request semi-repeatable.

This buffering has issues though, as you found: the buffer should be limited in
size, which then limits the usefulness of this technique.

There are a few alternatives to buffering:
* Authenticate upfront with, say, an OPTIONS request, which will get the
cookie. The next request, say a POST, won't have any issue and won't do the 401
dance.
* Use the "Expect: 100-continue" header, which asks the server if it can handle
the request before the body is sent, and only sends the body if it can. This
actually holds the data back from being sent in the first place if possible.
** Curl automatically activates "Expect: 100-continue" under a few conditions:
https://gms.tf/when-curl-sends-100-continue.html
** Apache HttpClient does NOT do any special handling of "Expect: 100-continue"
** Not sure if Jetty HttpClient does anything with "Expect: 100-continue"

So long story short: yes, buffering is a problem.


was (Author: risdenk):
So I haven't personally looked at Krb5HttpClientBuilder recently, other than
for the completely unrelated SOLR-13726. Part of the reason that a lot of
clients buffer is due to how Kerberos SPNEGO authentication works.

There are typically two parts:
* a request without authentication, where the server returns a 401 with a
negotiate response
* a request with authentication in response to the negotiate, which the server
can verify

If you don't put any optimizations in place, every request becomes two. A lot
of times a cookie is used here to limit the number of HTTP requests.

The reason the 401 and second request is an issue is when the request is a
non-repeatable one, like a POST body.

So a lot of times the super simple workaround is to buffer the request, do the
401 check dance, and then proceed. This is a way to make a non-repeatable
request semi-repeatable.

This buffering has issues though, as you found: the buffer should be limited in
size, which then limits the usefulness of this technique.

There are a few alternatives to buffering:
* Authenticate upfront with, say, an OPTIONS request, which will get the
cookie. The next request, say a POST, won't have any issue and won't do the 401
dance.
* Use the "Expect: 100-continue" header, which asks the server if it can handle
the request before the body is sent, and only sends the body if it can. This
actually holds the data back from being sent in the first place if possible.
** Curl automatically activates "Expect: 100-continue" under a few conditions:
https://gms.tf/when-curl-sends-100-continue.html
** Apache HttpClient does NOT do any special handling of "Expect: 100-continue"
** Not sure if Jetty HttpClient does anything with "Expect: 100-continue"

So long story short: yes, buffering is a problem.

> Krb5HttpClientBuilder should not buffer requests 
> -
>
> Key: SOLR-14249
> URL: https://issues.apache.org/jira/browse/SOLR-14249
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Authentication, SolrJ
>Affects Versions: 7.4, master (9.0), 8.4.1
>Reporter: Jason Gerlowski
>Priority: Major
> Attachments: SOLR-14249-reproduction.patch
>
>
> When SolrJ clients enable Kerberos authentication, a request interceptor is 
> set up which wraps the actual HttpEntity in a BufferedHttpEntity.  This 
> BufferedHttpEntity, well, buffers the request body in a {{byte[]}} so it can 
> be repeated if needed.  This works fine for small requests, but when requests 
> get large, storing the entire request in memory causes contention or
> OutOfMemoryErrors.
> The easiest way for this to manifest is to 

[jira] [Commented] (SOLR-14249) Krb5HttpClientBuilder should not buffer requests

2020-02-07 Thread Kevin Risden (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032723#comment-17032723
 ] 

Kevin Risden commented on SOLR-14249:
-

So I haven't personally looked at Krb5HttpClientBuilder recently, other than
for the completely unrelated SOLR-13726. Part of the reason that a lot of
clients buffer is due to how Kerberos SPNEGO authentication works.

There are typically two parts:
* a request without authentication, where the server returns a 401 with a
negotiate response
* a request with authentication in response to the negotiate, which the server
can verify

If you don't put any optimizations in place, every request becomes two. A lot
of times a cookie is used here to limit the number of HTTP requests.

The reason the 401 and second request is an issue is when the request is a
non-repeatable one, like a POST body.

So a lot of times the super simple workaround is to buffer the request, do the
401 check dance, and then proceed. This is a way to make a non-repeatable
request semi-repeatable.

This buffering has issues though, as you found: the buffer should be limited in
size, which then limits the usefulness of this technique.

There are a few alternatives to buffering:
* Authenticate upfront with, say, an OPTIONS request, which will get the
cookie. The next request, say a POST, won't have any issue and won't do the 401
dance.
* Use the "Expect: 100-continue" header, which asks the server if it can handle
the request before the body is sent, and only sends the body if it can. This
actually holds the data back from being sent in the first place if possible.
** Curl automatically activates "Expect: 100-continue" under a few conditions:
https://gms.tf/when-curl-sends-100-continue.html
** Apache HttpClient does NOT do any special handling of "Expect: 100-continue"
** Not sure if Jetty HttpClient does anything with "Expect: 100-continue"

So long story short: yes, buffering is a problem.
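To illustrate the buffering workaround described above, a minimal sketch
assuming Apache HttpClient 4.x (the general pattern, not Solr's exact
Krb5HttpClientBuilder interceptor):

{code}
import java.io.IOException;
import org.apache.http.HttpEntityEnclosingRequest;
import org.apache.http.HttpException;
import org.apache.http.HttpRequest;
import org.apache.http.HttpRequestInterceptor;
import org.apache.http.entity.BufferedHttpEntity;
import org.apache.http.protocol.HttpContext;

public class BufferingInterceptor implements HttpRequestInterceptor {
  @Override
  public void process(HttpRequest request, HttpContext context)
      throws HttpException, IOException {
    if (request instanceof HttpEntityEnclosingRequest) {
      HttpEntityEnclosingRequest enclosing = (HttpEntityEnclosingRequest) request;
      if (enclosing.getEntity() != null && !enclosing.getEntity().isRepeatable()) {
        // Now the entity is repeatable across the 401/Negotiate round trip --
        // but the entire body is held in a byte[], which is exactly the memory
        // problem described in this issue for large requests.
        enclosing.setEntity(new BufferedHttpEntity(enclosing.getEntity()));
      }
    }
  }
}
{code}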

> Krb5HttpClientBuilder should not buffer requests 
> -
>
> Key: SOLR-14249
> URL: https://issues.apache.org/jira/browse/SOLR-14249
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Authentication, SolrJ
>Affects Versions: 7.4, master (9.0), 8.4.1
>Reporter: Jason Gerlowski
>Priority: Major
> Attachments: SOLR-14249-reproduction.patch
>
>
> When SolrJ clients enable Kerberos authentication, a request interceptor is 
> set up which wraps the actual HttpEntity in a BufferedHttpEntity.  This 
> BufferedHttpEntity, well, buffers the request body in a {{byte[]}} so it can 
> be repeated if needed.  This works fine for small requests, but when requests 
> get large, storing the entire request in memory causes contention or
> OutOfMemoryErrors.
> The easiest way for this to manifest is to use ConcurrentUpdateSolrClient,
> which opens a connection to Solr and streams documents out in an
> ever-increasing request entity until the doc queue held by the client is
> emptied.
> I ran into this while troubleshooting a DIH run that would reproducibly load 
> a few hundred thousand documents before progress stalled out.  Solr never 
> crashed and the DIH thread was still alive, but the 
> ConcurrentUpdateSolrClient used by DIH had its "Runner" thread disappear 
> around the time of the stall and an OOM like the one below could be seen in 
> solr-8983-console.log:
> {code}
> WARNING: Uncaught exception in thread: 
> Thread[concurrentUpdateScheduler-28-thread-1,5,TGRP-TestKerberosClientBuffering]
> java.lang.OutOfMemoryError: Java heap space
>   at __randomizedtesting.SeedInfo.seed([371A00FBA76D31DF]:0)
>   at java.base/java.util.Arrays.copyOf(Arrays.java:3745)
>   at 
> java.base/java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:120)
>   at 
> java.base/java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:95)
>   at 
> java.base/java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:156)
>   at 
> org.apache.solr.common.util.FastOutputStream.flush(FastOutputStream.java:213)
>   at 
> org.apache.solr.common.util.FastOutputStream.write(FastOutputStream.java:94)
>   at 
> org.apache.solr.common.util.ByteUtils.writeUTF16toUTF8(ByteUtils.java:145)
>   at org.apache.solr.common.util.JavaBinCodec.writeStr(JavaBinCodec.java:848)
>   at 
> org.apache.solr.common.util.JavaBinCodec.writePrimitive(JavaBinCodec.java:932)
>   at 
> org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:328)
>   at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:228)
>   at 
> org.apache.solr.common.util.JavaBinCodec.writeSolrInputDocument(JavaBinCodec.java:616)
>   at 
> org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:355)
>   at 

[jira] [Commented] (LUCENE-9146) Switch GitHub PR test from ant precommit to gradle

2020-02-07 Thread Anshum Gupta (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032717#comment-17032717
 ] 

Anshum Gupta commented on LUCENE-9146:
--

https://github.com/apache/lucene-solr/pull/1245


> Switch GitHub PR test from ant precommit to gradle
> --
>
> Key: LUCENE-9146
> URL: https://issues.apache.org/jira/browse/LUCENE-9146
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Mike Drob
>Assignee: Anshum Gupta
>Priority: Major
>







[jira] [Assigned] (LUCENE-9146) Switch GitHub PR test from ant precommit to gradle

2020-02-07 Thread Anshum Gupta (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anshum Gupta reassigned LUCENE-9146:


Assignee: Anshum Gupta

> Switch GitHub PR test from ant precommit to gradle
> --
>
> Key: LUCENE-9146
> URL: https://issues.apache.org/jira/browse/LUCENE-9146
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Mike Drob
>Assignee: Anshum Gupta
>Priority: Major
>







[jira] [Updated] (LUCENE-9213) fix documentation-lint on recent java

2020-02-07 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-9213:

Fix Version/s: master (9.0)

> fix documentation-lint on recent java
> -
>
> Key: LUCENE-9213
> URL: https://issues.apache.org/jira/browse/LUCENE-9213
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-9213.patch, LUCENE-9213.patch
>
>
> Currently this is disabled unless you use java 11. It works with java 12. For
> java 13, the python checker needs some slight tweaks.
> Javadocs are formatted differently in each release, but the changes between 12
> and 13 were enough to anger the checker.






[jira] [Resolved] (LUCENE-9213) fix documentation-lint on recent java

2020-02-07 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved LUCENE-9213.
-
Resolution: Fixed

> fix documentation-lint on recent java
> -
>
> Key: LUCENE-9213
> URL: https://issues.apache.org/jira/browse/LUCENE-9213
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-9213.patch, LUCENE-9213.patch
>
>
> Currently this is disabled unless you use java 11. It works with java 12. For
> java 13, the python checker needs some slight tweaks.
> Javadocs are formatted differently in each release, but the changes between 12
> and 13 were enough to anger the checker.






[jira] [Commented] (LUCENE-9213) fix documentation-lint on recent java

2020-02-07 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032706#comment-17032706
 ] 

ASF subversion and git services commented on LUCENE-9213:
-

Commit 69f26d099ec36adec251cbf36594ea375d7fc620 in lucene-solr's branch 
refs/heads/master from Robert Muir
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=69f26d0 ]

LUCENE-9213: fix documentation-lint (and finally precommit) to work on java 12 
and 13

the "missing javadocs" checker needed tweaks to work with the format
changes of java 13.

As a followup we may investigate javadoc (maybe the new doclet api). It
has its own missing checks too now, but they are black vs white (either
fully documented or not checked), whereas this python tool allows us to
"improve", e.g. enforce that all classes have doc, even if all
methods do not yet.


> fix documentation-lint on recent java
> -
>
> Key: LUCENE-9213
> URL: https://issues.apache.org/jira/browse/LUCENE-9213
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-9213.patch, LUCENE-9213.patch
>
>
> Currently this is disabled unless you use java 11. It works with java 12. For
> java 13, the python checker needs some slight tweaks.
> Javadocs are formatted differently in each release, but the changes between 12
> and 13 were enough to anger the checker.






[jira] [Commented] (LUCENE-9213) fix documentation-lint on recent java

2020-02-07 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032702#comment-17032702
 ] 

Robert Muir commented on LUCENE-9213:
-

I want to follow up and investigate the new doclet api; I am concerned about
all the format changes in the html with jdk releases coming so fast. Maybe it
can do the checks we need easily.

But for now this gets {{ant precommit}} working with java 12 and 13 (the
original issue I wanted to solve).

> fix documentation-lint on recent java
> -
>
> Key: LUCENE-9213
> URL: https://issues.apache.org/jira/browse/LUCENE-9213
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-9213.patch, LUCENE-9213.patch
>
>
> Currently this is disabled unless you use java 11. It works with java 12. For
> java 13, the python checker needs some slight tweaks.
> Javadocs are formatted differently in each release, but the changes between 12
> and 13 were enough to anger the checker.






[jira] [Commented] (LUCENE-9149) Increase data dimension limit in BKD

2020-02-07 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032700#comment-17032700
 ] 

ASF subversion and git services commented on LUCENE-9149:
-

Commit 206a70e7b79050db0d351135e406cfb997cbeee1 in lucene-solr's branch 
refs/heads/master from Nicholas Knize
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=206a70e ]

LUCENE-9149: Increase data dimension limit in BKD


> Increase data dimension limit in BKD
> 
>
> Key: LUCENE-9149
> URL: https://issues.apache.org/jira/browse/LUCENE-9149
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Nick Knize
>Priority: Major
> Attachments: LUCENE-9149.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> LUCENE-8496 added selective indexing: the ability to designate the first K <=
> N dimensions to drive the construction of the BKD internal nodes. Follow-on
> work stored the "data dimensions" only for the leaf nodes, while only the
> "index dimensions" are stored for the internal nodes. While
> {{maxPointsInLeafNode}} is still important for managing the BKD heap memory
> footprint (thus we don't want this to get too large), I'd like to propose
> increasing the {{MAX_DIMENSIONS}} limit (to something not too crazy like 16,
> effectively doubling the data dimension limit) while maintaining
> {{MAX_INDEX_DIMENSIONS}} at 8.
> Doing this will enable us to encode higher-dimension data within a
> lower-dimension index (e.g., 3D tessellated triangles as a 10-dimension point
> using only the first 6 dimensions for index construction).
>  






[GitHub] [lucene-solr] asfgit merged pull request #1182: LUCENE-9149: Increase data dimension limit in BKD

2020-02-07 Thread GitBox
asfgit merged pull request #1182: LUCENE-9149: Increase data dimension limit in 
BKD
URL: https://github.com/apache/lucene-solr/pull/1182
 
 
   





[GitHub] [lucene-solr] anshumg opened a new pull request #1245: Create gradle precommit action

2020-02-07 Thread GitBox
anshumg opened a new pull request #1245: Create gradle precommit action
URL: https://github.com/apache/lucene-solr/pull/1245
 
 
   This adds a gradle precommit action w/ Java11 for all branches.





[jira] [Commented] (LUCENE-9213) fix documentation-lint on recent java

2020-02-07 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032680#comment-17032680
 ] 

Robert Muir commented on LUCENE-9213:
-

I tested and got documentation-lint BUILD SUCCESSFUL for lucene and solr with
Java 11, 12, and 13.

> fix documentation-lint on recent java
> -
>
> Key: LUCENE-9213
> URL: https://issues.apache.org/jira/browse/LUCENE-9213
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-9213.patch, LUCENE-9213.patch
>
>
> Currently this is disabled unless you use java 11. It works with java 12. For
> java 13, the python checker needs some slight tweaks.
> Javadocs are formatted differently in each release, but the changes between 12
> and 13 were enough to anger the checker.






[jira] [Commented] (LUCENE-9213) fix documentation-lint on recent java

2020-02-07 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032664#comment-17032664
 ] 

Robert Muir commented on LUCENE-9213:
-

I had to tweak slightly for that case (generics). Now everything passes on Java 
13:
{noformat}
-documentation-lint:
 [echo] Checking for broken links...
 [exec] 
 [exec] Crawl/parse...
 [exec] 
 [exec] Verify...
 [echo] Checking for missing docs...

BUILD SUCCESSFUL
{noformat}

> fix documentation-lint on recent java
> -
>
> Key: LUCENE-9213
> URL: https://issues.apache.org/jira/browse/LUCENE-9213
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-9213.patch, LUCENE-9213.patch
>
>
> Currently this is disabled unless you use java 11. It works with java 12. For
> java 13, the python checker needs some slight tweaks.
> Javadocs are formatted differently in each release, but the changes between 12
> and 13 were enough to anger the checker.






[jira] [Updated] (LUCENE-9213) fix documentation-lint on recent java

2020-02-07 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-9213:

Attachment: LUCENE-9213.patch

> fix documentation-lint on recent java
> -
>
> Key: LUCENE-9213
> URL: https://issues.apache.org/jira/browse/LUCENE-9213
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-9213.patch, LUCENE-9213.patch
>
>
> Currently this is disabled unless you use java 11. It works with java 12. For
> java 13, the python checker needs some slight tweaks.
> Javadocs are formatted differently in each release, but the changes between 12
> and 13 were enough to anger the checker.






[jira] [Commented] (LUCENE-9213) fix documentation-lint on recent java

2020-02-07 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032651#comment-17032651
 ] 

Robert Muir commented on LUCENE-9213:
-

Seems we still have at least one bug left. I hope it does not involve
generics...

{noformat}
 [exec] Verify...
 [echo] Checking for missing docs...
 [exec] 
 [exec] build/docs/core/org/apache/lucene/analysis/CharArrayMap.html
 [exec]   missing Methods: put(java.lang.Object,V)
 [exec] 
 [exec] Missing javadocs were found!
{noformat}

> fix documentation-lint on recent java
> -
>
> Key: LUCENE-9213
> URL: https://issues.apache.org/jira/browse/LUCENE-9213
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-9213.patch
>
>
> Currently this is disabled unless you use java 11. It works with java 12. For
> java 13, the python checker needs some slight tweaks.
> Javadocs are formatted differently in each release, but the changes between 12
> and 13 were enough to anger the checker.






[jira] [Commented] (LUCENE-9213) fix documentation-lint on recent java

2020-02-07 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032648#comment-17032648
 ] 

Robert Muir commented on LUCENE-9213:
-

Attached is the current patch I am testing now. cc [~mikemccand]

> fix documentation-lint on recent java
> -
>
> Key: LUCENE-9213
> URL: https://issues.apache.org/jira/browse/LUCENE-9213
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-9213.patch
>
>
> Currently this is disabled unless you use java 11. It works with java 12. For
> java 13, the python checker needs some slight tweaks.
> Javadocs are formatted differently in each release, but the changes between 12
> and 13 were enough to anger the checker.






[jira] [Updated] (LUCENE-9213) fix documentation-lint on recent java

2020-02-07 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-9213:

Attachment: LUCENE-9213.patch

> fix documentation-lint on recent java
> -
>
> Key: LUCENE-9213
> URL: https://issues.apache.org/jira/browse/LUCENE-9213
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-9213.patch
>
>
> Currently this is disabled unless you use java 11. It works with java 12. For
> java 13, the python checker needs some slight tweaks.
> Javadocs are formatted differently in each release, but the changes between 12
> and 13 were enough to anger the checker.






[jira] [Created] (LUCENE-9213) fix documentation-lint on recent java

2020-02-07 Thread Robert Muir (Jira)
Robert Muir created LUCENE-9213:
---

 Summary: fix documentation-lint on recent java
 Key: LUCENE-9213
 URL: https://issues.apache.org/jira/browse/LUCENE-9213
 Project: Lucene - Core
  Issue Type: Task
Reporter: Robert Muir


Currently this is disabled unless you use java 11. It works with java 12. For 
java 13, the python checker needs some slight tweaks.

Javadocs are formatted differently in each release but the changes between 12 
and 13 were enough to anger the checker.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-14249) Krb5HttpClientBuilder should not buffer requests

2020-02-07 Thread Jason Gerlowski (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gerlowski updated SOLR-14249:
---
Attachment: SOLR-14249-reproduction.patch

> Krb5HttpClientBuilder should not buffer requests 
> -
>
> Key: SOLR-14249
> URL: https://issues.apache.org/jira/browse/SOLR-14249
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Authentication, SolrJ
>Affects Versions: 7.4, master (9.0), 8.4.1
>Reporter: Jason Gerlowski
>Priority: Major
> Attachments: SOLR-14249-reproduction.patch
>
>
> When SolrJ clients enable Kerberos authentication, a request interceptor is 
> set up which wraps the actual HttpEntity in a BufferedHttpEntity.  This 
> BufferedHttpEntity, well, buffers the request body in a {{byte[]}} so it can 
> be repeated if needed.  This works fine for small requests, but when requests 
> get large storing the entire request in memory causes contention or 
> OutOfMemoryErrors.
> The easiest way for this to manifest is to use ConcurrentUpdateSolrClient, 
> which opens a connection to Solr and streams documents out in an ever 
> increasing request entity until the doc queue held by the client is emptied.
> I ran into this while troubleshooting a DIH run that would reproducibly load 
> a few hundred thousand documents before progress stalled out.  Solr never 
> crashed and the DIH thread was still alive, but the 
> ConcurrentUpdateSolrClient used by DIH had its "Runner" thread disappear 
> around the time of the stall and an OOM like the one below could be seen in 
> solr-8983-console.log:
> {code}
> WARNING: Uncaught exception in thread: 
> Thread[concurrentUpdateScheduler-28-thread-1,5,TGRP-TestKerberosClientBuffering]
> java.lang.OutOfMemoryError: Java heap space
>   at __randomizedtesting.SeedInfo.seed([371A00FBA76D31DF]:0)
>   at java.base/java.util.Arrays.copyOf(Arrays.java:3745)
>   at 
> java.base/java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:120)
>   at 
> java.base/java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:95)
>   at 
> java.base/java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:156)
>   at 
> org.apache.solr.common.util.FastOutputStream.flush(FastOutputStream.java:213)
>   at 
> org.apache.solr.common.util.FastOutputStream.write(FastOutputStream.java:94)
>   at 
> org.apache.solr.common.util.ByteUtils.writeUTF16toUTF8(ByteUtils.java:145)
>   at org.apache.solr.common.util.JavaBinCodec.writeStr(JavaBinCodec.java:848)
>   at 
> org.apache.solr.common.util.JavaBinCodec.writePrimitive(JavaBinCodec.java:932)
>   at 
> org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:328)
>   at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:228)
>   at 
> org.apache.solr.common.util.JavaBinCodec.writeSolrInputDocument(JavaBinCodec.java:616)
>   at 
> org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:355)
>   at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:228)
>   at 
> org.apache.solr.common.util.JavaBinCodec.writeMapEntry(JavaBinCodec.java:764)
>   at 
> org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:383)
>   at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:228)
>   at 
> org.apache.solr.common.util.JavaBinCodec.writeIterator(JavaBinCodec.java:705)
>   at 
> org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:367)
>   at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:228)
>   at 
> org.apache.solr.common.util.JavaBinCodec.writeNamedList(JavaBinCodec.java:223)
>   at 
> org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:330)
>   at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:228)
>   at org.apache.solr.common.util.JavaBinCodec.marshal(JavaBinCodec.java:155)
>   at 
> org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.marshal(JavaBinUpdateRequestCodec.java:91)
>   at 
> org.apache.solr.client.solrj.impl.BinaryRequestWriter.write(BinaryRequestWriter.java:83)
>   at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner$1.writeTo(ConcurrentUpdateSolrClient.java:264)
>   at org.apache.http.entity.EntityTemplate.writeTo(EntityTemplate.java:73)
>   at 
> org.apache.http.entity.BufferedHttpEntity.<init>(BufferedHttpEntity.java:62)
>   at 
> org.apache.solr.client.solrj.impl.Krb5HttpClientBuilder.lambda$new$3(Krb5HttpClientBuilder.java:155)
>   at 
> org.apache.solr.client.solrj.impl.Krb5HttpClientBuilder$$Lambda$459/0x000800623840.process(Unknown
>  Source)
>   at 
> org.apache.solr.client.solrj.impl.HttpClientUtil$DynamicInterceptor$1.accept(HttpClientUtil.java:177)
> {code}
> We took heap dumps and were able to confirm that the entire 8gb heap was taken 
> up with a single massive CUSC request body that was being buffered!

[jira] [Created] (SOLR-14249) Krb5HttpClientBuilder should not buffer requests

2020-02-07 Thread Jason Gerlowski (Jira)
Jason Gerlowski created SOLR-14249:
--

 Summary: Krb5HttpClientBuilder should not buffer requests 
 Key: SOLR-14249
 URL: https://issues.apache.org/jira/browse/SOLR-14249
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
  Components: Authentication, SolrJ
Affects Versions: 8.4.1, 7.4, master (9.0)
Reporter: Jason Gerlowski


When SolrJ clients enable Kerberos authentication, a request interceptor is set 
up which wraps the actual HttpEntity in a BufferedHttpEntity.  This 
BufferedHttpEntity, well, buffers the request body in a {{byte[]}} so it can be 
repeated if needed.  This works fine for small requests, but when requests get 
large storing the entire request in memory causes contention or 
OutOfMemoryErrors.

The easiest way for this to manifest is to use ConcurrentUpdateSolrClient, 
which opens a connection to Solr and streams documents out in an ever 
increasing request entity until the doc queue held by the client is emptied.

I ran into this while troubleshooting a DIH run that would reproducibly load a 
few hundred thousand documents before progress stalled out.  Solr never crashed 
and the DIH thread was still alive, but the ConcurrentUpdateSolrClient used by 
DIH had its "Runner" thread disappear around the time of the stall and an OOM 
like the one below could be seen in solr-8983-console.log:
{code}
WARNING: Uncaught exception in thread: 
Thread[concurrentUpdateScheduler-28-thread-1,5,TGRP-TestKerberosClientBuffering]
java.lang.OutOfMemoryError: Java heap space
  at __randomizedtesting.SeedInfo.seed([371A00FBA76D31DF]:0)
  at java.base/java.util.Arrays.copyOf(Arrays.java:3745)
  at 
java.base/java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:120)
  at 
java.base/java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:95)
  at 
java.base/java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:156)
  at 
org.apache.solr.common.util.FastOutputStream.flush(FastOutputStream.java:213)
  at 
org.apache.solr.common.util.FastOutputStream.write(FastOutputStream.java:94)
  at org.apache.solr.common.util.ByteUtils.writeUTF16toUTF8(ByteUtils.java:145)
  at org.apache.solr.common.util.JavaBinCodec.writeStr(JavaBinCodec.java:848)
  at 
org.apache.solr.common.util.JavaBinCodec.writePrimitive(JavaBinCodec.java:932)
  at 
org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:328)
  at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:228)
  at 
org.apache.solr.common.util.JavaBinCodec.writeSolrInputDocument(JavaBinCodec.java:616)
  at 
org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:355)
  at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:228)
  at 
org.apache.solr.common.util.JavaBinCodec.writeMapEntry(JavaBinCodec.java:764)
  at 
org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:383)
  at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:228)
  at 
org.apache.solr.common.util.JavaBinCodec.writeIterator(JavaBinCodec.java:705)
  at 
org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:367)
  at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:228)
  at 
org.apache.solr.common.util.JavaBinCodec.writeNamedList(JavaBinCodec.java:223)
  at 
org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:330)
  at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:228)
  at org.apache.solr.common.util.JavaBinCodec.marshal(JavaBinCodec.java:155)
  at 
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.marshal(JavaBinUpdateRequestCodec.java:91)
  at 
org.apache.solr.client.solrj.impl.BinaryRequestWriter.write(BinaryRequestWriter.java:83)
  at 
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner$1.writeTo(ConcurrentUpdateSolrClient.java:264)
  at org.apache.http.entity.EntityTemplate.writeTo(EntityTemplate.java:73)
  at 
org.apache.http.entity.BufferedHttpEntity.<init>(BufferedHttpEntity.java:62)
  at 
org.apache.solr.client.solrj.impl.Krb5HttpClientBuilder.lambda$new$3(Krb5HttpClientBuilder.java:155)
  at 
org.apache.solr.client.solrj.impl.Krb5HttpClientBuilder$$Lambda$459/0x000800623840.process(Unknown
 Source)
  at 
org.apache.solr.client.solrj.impl.HttpClientUtil$DynamicInterceptor$1.accept(HttpClientUtil.java:177)
{code}

We took heap dumps and were able to confirm that the entire 8gb heap was taken 
up with a single massive CUSC request body that was being buffered!

(As an aside, I had no idea that OutOfMemoryErrors could happen without 
killing the entire JVM.  But apparently they can.  CUSC.Runner propagates the 
OOM as it should and the OOM kills the Runner thread.  Since that thread is the 
gc-root for the massive BufferedHttpEntity though, a garbage collection frees 
that memory once the thread dies, which is presumably why the JVM survived.)

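To make the failure mode concrete, here is a minimal sketch of the 
buffering-interceptor pattern described above. It assumes Apache HttpClient 4.x 
APIs; the class name is hypothetical and this is not Solr's actual 
Krb5HttpClientBuilder code.

{code}
import java.io.IOException;

import org.apache.http.HttpEntityEnclosingRequest;
import org.apache.http.HttpException;
import org.apache.http.HttpRequest;
import org.apache.http.HttpRequestInterceptor;
import org.apache.http.entity.BufferedHttpEntity;
import org.apache.http.protocol.HttpContext;

// Hypothetical sketch: wraps every outgoing entity in a BufferedHttpEntity so
// the request can be repeated (e.g. after an authentication round-trip). The
// wrap forces the entire body into a byte[], which is what blows up for
// streaming requests.
public class BufferingRequestInterceptor implements HttpRequestInterceptor {
  @Override
  public void process(HttpRequest request, HttpContext context)
      throws HttpException, IOException {
    if (request instanceof HttpEntityEnclosingRequest) {
      HttpEntityEnclosingRequest enclosing = (HttpEntityEnclosingRequest) request;
      if (enclosing.getEntity() != null) {
        // BufferedHttpEntity's constructor eagerly copies the wrapped entity's
        // content into memory; an open-ended ConcurrentUpdateSolrClient stream
        // therefore has to fit on the heap in its entirety.
        enclosing.setEntity(new BufferedHttpEntity(enclosing.getEntity()));
      }
    }
  }
}
{code}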
[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-07 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032615#comment-17032615
 ] 

Robert Muir commented on LUCENE-9201:
-

OK, thanks for working on the PR. At a glance it looks good to me. But we may 
get better feedback from Dawid Weiss when he is back online in a few days.

I will try to investigate more of the problems that you uncovered...

> Port documentation-lint task to Gradle build
> 
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: javadocGRADLE.png, javadocHTML4.png, javadocHTML5.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Ant build's "documentation-lint" target consists of those two sub targets.
>  * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
>  * "-documentation-lint"(Missing javadocs / broken links check by python 
> scripts)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-5146) Figure out what it would take for lazily-loaded cores to play nice with SolrCloud

2020-02-07 Thread Ilan Ginzburg (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-5146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032610#comment-17032610
 ] 

Ilan Ginzburg commented on SOLR-5146:
-

Thanks [~erickerickson] for the wider context overview. If we solve the leader 
issue, ensuring the index is up to date (and making it so if it's not) is likely 
a lot easier with SHARED collections and replicas, i.e. index files written to a 
Blob storage that becomes the "source of truth" 
(https://github.com/apache/lucene-solr/tree/jira/SOLR-13101).

My understanding, [~dsmiley], is that a replica being unloaded totally, i.e. 
files are on disk but nothing is in memory, would require changes to the current 
strategy of always having replica-specific Zookeeper connections/state for the 
leader election process.

> Figure out what it would take for lazily-loaded cores to play nice with 
> SolrCloud
> -
>
> Key: SOLR-5146
> URL: https://issues.apache.org/jira/browse/SOLR-5146
> Project: Solr
>  Issue Type: Improvement
>  Components: SolrCloud
>Affects Versions: 4.5, 6.0
>Reporter: Erick Erickson
>Assignee: David Smiley
>Priority: Major
>
> The whole lazy-load core thing was implemented with non-SolrCloud use-cases 
> in mind. There are several user-list threads that ask about using lazy cores 
> with SolrCloud, especially in multi-tenant use-cases.
> This is a marker JIRA to investigate what it would take to make lazy-load 
> cores play nice with SolrCloud. It's especially interesting how this all 
> works with shards, replicas, leader election, recovery, etc.
> NOTE: This is pretty much totally unexplored territory. It may be that a few 
> trivial modifications are all that's needed. OTOH, It may be that we'd have 
> to rip apart SolrCloud to handle this case. Until someone dives into the 
> code, we don't know.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-07 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032601#comment-17032601
 ] 

Tomoko Uchida commented on LUCENE-9201:
---

Thank you [~rcmuir] for your work and comments.

I updated the PR (refactored the gradle tasks and ported the ant build details 
as much as I can). I hope it is a good starting point, even if not perfect. Some 
of the ant scripts' hacks, especially the "ecj-macro" stuff, are still not 
ported, because I cannot figure out how to translate them to gradle.

> Port documentation-lint task to Gradle build
> 
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: javadocGRADLE.png, javadocHTML4.png, javadocHTML5.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Ant build's "documentation-lint" target consists of those two sub targets.
>  * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
>  * "-documentation-lint"(Missing javadocs / broken links check by python 
> scripts)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-9194) Simplify XYShapeXQuery API

2020-02-07 Thread Ignacio Vera (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ignacio Vera resolved LUCENE-9194.
--
Fix Version/s: 8.5
 Assignee: Ignacio Vera
   Resolution: Fixed

master: 73dbf6d06108e9f18423521e339230bda37f8524

branch 8.x: 5c1f2ca22a756b16f0e35aa5dde221578fe1ce76

> Simplify XYShapeXQuery API 
> ---
>
> Key: LUCENE-9194
> URL: https://issues.apache.org/jira/browse/LUCENE-9194
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Ignacio Vera
>Assignee: Ignacio Vera
>Priority: Minor
> Fix For: 8.5
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Similar to what was done in LUCENE-9141 simplify XYShape queries.
>  
> This change will allow as well to make most of the internal geo classes 
> package private.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] markharwood edited a comment on issue #1234: Add compression for Binary doc value fields

2020-02-07 Thread GitBox
markharwood edited a comment on issue #1234: Add compression for Binary doc 
value fields
URL: https://github.com/apache/lucene-solr/pull/1234#issuecomment-583539216
 
 
   >Strange that Mark would measure 4x slowdown from decoding the lengths... 
Perhaps the random bytes are not totally incompressible, just barely 
compressible?
   
   I may have been too hasty in that reply - I've not been able to reproduce 
that, and the raw vs compressed timings are very similar in the additional tests 
I've done, so they echo what @jpountz expects. My first (faster) run had random 
bytes selected in the range 0-20 rather than the 0-127 range where I'm seeing 
parity.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] iverase merged pull request #1224: LUCENE-9194: Simplify XYShapeQuery API

2020-02-07 Thread GitBox
iverase merged pull request #1224: LUCENE-9194: Simplify XYShapeQuery API
URL: https://github.com/apache/lucene-solr/pull/1224
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] markharwood edited a comment on issue #1234: Add compression for Binary doc value fields

2020-02-07 Thread GitBox
markharwood edited a comment on issue #1234: Add compression for Binary doc 
value fields
URL: https://github.com/apache/lucene-solr/pull/1234#issuecomment-583539216
 
 
   >Strange that Mark would measure 4x slowdown from decoding the lengths... 
Perhaps the random bytes are not totally incompressible, just barely 
compressible?
   
   I may have been too hasty in that reply - I've not been able to reproduce 
that, and the timings are very similar in the additional tests I've done, so 
they echo what @jpountz expects. My first (faster) run had random bytes selected 
in the range 0-20 rather than the 0-127 range where I'm seeing parity.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] markharwood commented on issue #1234: Add compression for Binary doc value fields

2020-02-07 Thread GitBox
markharwood commented on issue #1234: Add compression for Binary doc value 
fields
URL: https://github.com/apache/lucene-solr/pull/1234#issuecomment-583539216
 
 
   >Strange that Mark would measure 4x slowdown from decoding the lengths... 
Perhaps the random bytes are not totally incompressible, just barely 
compressible?
   
   I may have been too hasty in that reply - I've not been able to reproduce 
that, and the timings are very similar in the additional tests I've done, so 
they echo what @jpountz expects.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] msokolov commented on issue #1234: Add compression for Binary doc value fields

2020-02-07 Thread GitBox
msokolov commented on issue #1234: Add compression for Binary doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#issuecomment-583538389
 
 
   Strange that Mark would measure 4x slowdown from decoding the lengths... 
Perhaps the random bytes are not totally incompressible, just barely 
compressible? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] jpountz commented on issue #1234: Add compression for Binary doc value fields

2020-02-07 Thread GitBox
jpountz commented on issue #1234: Add compression for Binary doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#issuecomment-583536606
 
 
   @msokolov FWIW LZ4 only removes duplicate strings from a stream: when it 
finds one it inserts a reference to a previous sequence of bytes. In the 
special case that the content is incompressible, the LZ4 compressed data just 
consists of the number of bytes followed by the bytes, so the only overhead 
compared to reading the bytes directly is the decoding of the number of bytes, 
which should be rather low.
   
   I don't have a preference regarding whether we should have an explicit 
"not-compressed" case, but I understand how not having one helps keep things 
simpler.
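
For illustration, a minimal sketch of why the incompressible case stays cheap, 
using Lucene's DataInput for the read calls; this is a simplified model, not the 
actual LZ4 decompression code.

{code}
import java.io.IOException;
import org.apache.lucene.store.DataInput;

// Simplified model, not the actual LZ4 code: for incompressible input an
// LZ4-style stream degenerates to "literal length + raw bytes", so decoding is
// essentially one vInt read plus a straight copy.
final class LiteralOnlyBlockReader {
  static byte[] decode(DataInput in) throws IOException {
    int length = in.readVInt();   // the only overhead vs. reading raw bytes
    byte[] out = new byte[length];
    in.readBytes(out, 0, length); // then a plain copy of the literal bytes
    return out;
  }
}
{code}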


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields

2020-02-07 Thread GitBox
jpountz commented on a change in pull request #1234: Add compression for Binary 
doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r376529195
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java
 ##
 @@ -742,6 +755,131 @@ public BytesRef binaryValue() throws IOException {
 };
   }
 }
+  }  
+  
+  // Decompresses blocks of binary values to retrieve content
+  class BinaryDecoder {
+
+private final LongValues addresses;
+private final IndexInput compressedData;
+// Cache of last uncompressed block 
+private long lastBlockId = -1;
+private int []uncompressedDocEnds = new 
int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK];
+private int uncompressedBlockLength = 0;
+private int numDocsInBlock = 0;
+private final byte[] uncompressedBlock;
+private final BytesRef uncompressedBytesRef;
+
+public BinaryDecoder(LongValues addresses, IndexInput compressedData, int 
biggestUncompressedBlockSize) {
+  super();
+  this.addresses = addresses;
+  this.compressedData = compressedData;
+  // pre-allocate a byte array large enough for the biggest uncompressed 
block needed.
+  this.uncompressedBlock = new byte[biggestUncompressedBlockSize];
+  uncompressedBytesRef = new BytesRef(uncompressedBlock);
+  
+}
+
+BytesRef decode(int docNumber) throws IOException {
+  int blockId = docNumber >> Lucene80DocValuesFormat.BINARY_BLOCK_SHIFT; 
+  int docInBlockId = docNumber % 
Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK;
+  assert docInBlockId < 
Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK;
+  
+  
+  // already read and uncompressed?
+  if (blockId != lastBlockId) {
+lastBlockId = blockId;
+long blockStartOffset = addresses.get(blockId);
+compressedData.seek(blockStartOffset);
+
+numDocsInBlock = compressedData.readVInt();
+assert numDocsInBlock <= 
Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK;
+uncompressedDocEnds = new int[numDocsInBlock];
+uncompressedBlockLength = 0;
+
+int onlyLength = -1;
+for (int i = 0; i < numDocsInBlock; i++) {
+  if (i == 0) {
+// The first length value is special. It is shifted and has a bit 
to denote if
+// all other values are the same length
+int lengthPlusSameInd = compressedData.readVInt();
+int sameIndicator = lengthPlusSameInd & 1;
+int firstValLength = lengthPlusSameInd >>1;
 
 Review comment:
   Since you are stealing a bit, we should do an unsigned shift (`>>>`) instead.
   
   This would never be a problem in practice, but imagine that the length was a 
31-bit integer. Shifting left by one bit at index time would make this number 
negative. So here we need an unsigned shift rather than a signed shift that 
preserves the sign.
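
   A tiny worked example of the difference, with a hypothetical length value:

{code}
int length = 0x40000000;           // a length that uses 31 bits
int encoded = (length << 1) | 1;   // steal the low bit: 0x80000001, now negative
int signedShift = encoded >> 1;    // 0xC0000000 - sign-extended, wrong
int unsignedShift = encoded >>> 1; // 0x40000000 - zero-filled, recovers the length
{code}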


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields

2020-02-07 Thread GitBox
jpountz commented on a change in pull request #1234: Add compression for Binary 
doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r376527753
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java
 ##
 @@ -742,6 +755,131 @@ public BytesRef binaryValue() throws IOException {
 };
   }
 }
+  }  
+  
+  // Decompresses blocks of binary values to retrieve content
+  class BinaryDecoder {
+
+private final LongValues addresses;
+private final IndexInput compressedData;
+// Cache of last uncompressed block 
+private long lastBlockId = -1;
+private int []uncompressedDocEnds = new 
int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK];
+private int uncompressedBlockLength = 0;
+private int numDocsInBlock = 0;
+private final byte[] uncompressedBlock;
+private final BytesRef uncompressedBytesRef;
+
+public BinaryDecoder(LongValues addresses, IndexInput compressedData, int 
biggestUncompressedBlockSize) {
+  super();
+  this.addresses = addresses;
+  this.compressedData = compressedData;
+  // pre-allocate a byte array large enough for the biggest uncompressed 
block needed.
+  this.uncompressedBlock = new byte[biggestUncompressedBlockSize];
+  uncompressedBytesRef = new BytesRef(uncompressedBlock);
+  
+}
+
+BytesRef decode(int docNumber) throws IOException {
+  int blockId = docNumber >> Lucene80DocValuesFormat.BINARY_BLOCK_SHIFT; 
+  int docInBlockId = docNumber % 
Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK;
+  assert docInBlockId < 
Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK;
+  
+  
+  // already read and uncompressed?
+  if (blockId != lastBlockId) {
+lastBlockId = blockId;
+long blockStartOffset = addresses.get(blockId);
+compressedData.seek(blockStartOffset);
+
+numDocsInBlock = compressedData.readVInt();
+assert numDocsInBlock <= 
Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK;
+uncompressedDocEnds = new int[numDocsInBlock];
+uncompressedBlockLength = 0;
+
+int onlyLength = -1;
+for (int i = 0; i < numDocsInBlock; i++) {
+  if (i == 0) {
+// The first length value is special. It is shifted and has a bit 
to denote if
+// all other values are the same length
+int lengthPlusSameInd = compressedData.readVInt();
+int sameIndicator = lengthPlusSameInd & 1;
+int firstValLength = lengthPlusSameInd >>1;
+if (sameIndicator == 1) {
+  onlyLength = firstValLength;
+}
+uncompressedBlockLength += firstValLength;
+  } else {
+if (onlyLength == -1) {
+  // Various lengths are stored - read each from disk
+  uncompressedBlockLength += compressedData.readVInt();
+} else {
+  // Only one length 
+  uncompressedBlockLength += onlyLength;
+}
+  }
+  uncompressedDocEnds[i] = uncompressedBlockLength;
 
 Review comment:
   maybe we could call it `uncompressedDocStarts` and set the value at index 
`i+1`, which would then help remove the else block of the `docInBlockId > 0` 
condition below?
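
   A sketch of how that suggestion could look; `lengthOf(i)` stands in for 
however the per-document length is decoded and is hypothetical:

{code}
// docStarts[0] stays 0, so lookups need no special case for the first doc.
int[] docStarts = new int[numDocsInBlock + 1];
for (int i = 0; i < numDocsInBlock; i++) {
  uncompressedBlockLength += lengthOf(i); // hypothetical per-doc length decode
  docStarts[i + 1] = uncompressedBlockLength;
}
// uniform lookup for any docInBlockId:
int offset = docStarts[docInBlockId];
int valueLength = docStarts[docInBlockId + 1] - offset;
{code}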


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields

2020-02-07 Thread GitBox
jpountz commented on a change in pull request #1234: Add compression for Binary 
doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r376532189
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java
 ##
 @@ -742,6 +755,131 @@ public BytesRef binaryValue() throws IOException {
 };
   }
 }
+  }  
+  
+  // Decompresses blocks of binary values to retrieve content
+  class BinaryDecoder {
+
+private final LongValues addresses;
+private final IndexInput compressedData;
+// Cache of last uncompressed block 
+private long lastBlockId = -1;
+private int []uncompressedDocEnds = new 
int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK];
+private int uncompressedBlockLength = 0;
+private int numDocsInBlock = 0;
+private final byte[] uncompressedBlock;
+private final BytesRef uncompressedBytesRef;
+
+public BinaryDecoder(LongValues addresses, IndexInput compressedData, int 
biggestUncompressedBlockSize) {
+  super();
+  this.addresses = addresses;
+  this.compressedData = compressedData;
+  // pre-allocate a byte array large enough for the biggest uncompressed 
block needed.
+  this.uncompressedBlock = new byte[biggestUncompressedBlockSize];
+  uncompressedBytesRef = new BytesRef(uncompressedBlock);
+  
+}
+
+BytesRef decode(int docNumber) throws IOException {
+  int blockId = docNumber >> Lucene80DocValuesFormat.BINARY_BLOCK_SHIFT; 
+  int docInBlockId = docNumber % 
Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK;
+  assert docInBlockId < 
Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK;
+  
+  
+  // already read and uncompressed?
+  if (blockId != lastBlockId) {
+lastBlockId = blockId;
+long blockStartOffset = addresses.get(blockId);
+compressedData.seek(blockStartOffset);
+
+numDocsInBlock = compressedData.readVInt();
+assert numDocsInBlock <= 
Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK;
+uncompressedDocEnds = new int[numDocsInBlock];
 
 Review comment:
   can we reuse the same array across blocks?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields

2020-02-07 Thread GitBox
jpountz commented on a change in pull request #1234: Add compression for Binary 
doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r376531952
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java
 ##
 @@ -742,6 +755,131 @@ public BytesRef binaryValue() throws IOException {
 };
   }
 }
+  }  
+  
+  // Decompresses blocks of binary values to retrieve content
+  class BinaryDecoder {
+
+private final LongValues addresses;
+private final IndexInput compressedData;
+// Cache of last uncompressed block 
+private long lastBlockId = -1;
+private int []uncompressedDocEnds = new 
int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK];
 
 Review comment:
   in the past we've put these constants in the meta file and BinaryEntry so 
that it's easier to change values over time


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields

2020-02-07 Thread GitBox
jpountz commented on a change in pull request #1234: Add compression for Binary 
doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r376528169
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java
 ##
 @@ -742,6 +755,131 @@ public BytesRef binaryValue() throws IOException {
 };
   }
 }
+  }  
+  
+  // Decompresses blocks of binary values to retrieve content
+  class BinaryDecoder {
+
+private final LongValues addresses;
+private final IndexInput compressedData;
+// Cache of last uncompressed block 
+private long lastBlockId = -1;
+private int []uncompressedDocEnds = new 
int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK];
+private int uncompressedBlockLength = 0;
+private int numDocsInBlock = 0;
+private final byte[] uncompressedBlock;
+private final BytesRef uncompressedBytesRef;
+
+public BinaryDecoder(LongValues addresses, IndexInput compressedData, int 
biggestUncompressedBlockSize) {
+  super();
+  this.addresses = addresses;
+  this.compressedData = compressedData;
+  // pre-allocate a byte array large enough for the biggest uncompressed 
block needed.
+  this.uncompressedBlock = new byte[biggestUncompressedBlockSize];
+  uncompressedBytesRef = new BytesRef(uncompressedBlock);
+  
+}
+
+BytesRef decode(int docNumber) throws IOException {
+  int blockId = docNumber >> Lucene80DocValuesFormat.BINARY_BLOCK_SHIFT; 
+  int docInBlockId = docNumber % 
Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK;
+  assert docInBlockId < 
Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK;
+  
+  
+  // already read and uncompressed?
+  if (blockId != lastBlockId) {
+lastBlockId = blockId;
+long blockStartOffset = addresses.get(blockId);
+compressedData.seek(blockStartOffset);
+
+numDocsInBlock = compressedData.readVInt();
 
 Review comment:
   do we really need to record the number of documents in the block? Shouldn't 
it be 32 for all blocks except the last one? Maybe at index-time we could append 
dummy values to the last block to make sure it has 32 values too, and then we 
wouldn't need this vInt anymore?
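
   A sketch of that padding idea, assuming the constant quoted above; this is 
illustrative, not actual codec code:

{code}
import java.io.IOException;
import org.apache.lucene.store.IndexOutput;

// Pad the final block with zero-length entries so every block holds exactly
// BINARY_DOCS_PER_COMPRESSED_BLOCK values; readers could then assume a fixed
// doc count and the per-block vInt would become unnecessary.
void padLastBlock(IndexOutput out, int docsInLastBlock) throws IOException {
  for (int i = docsInLastBlock;
       i < Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK; i++) {
    out.writeVInt(0); // dummy zero-length value
  }
}
{code}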


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] rmuir commented on issue #1236: Add back assertions removed by LUCENE-9187.

2020-02-07 Thread GitBox
rmuir commented on issue #1236: Add back assertions removed by LUCENE-9187.
URL: https://github.com/apache/lucene-solr/pull/1236#issuecomment-583534489
 
 
   +1, thanks


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] markharwood commented on issue #1234: Add compression for Binary doc value fields

2020-02-07 Thread GitBox
markharwood commented on issue #1234: Add compression for Binary doc value 
fields
URL: https://github.com/apache/lucene-solr/pull/1234#issuecomment-583529462
 
 
   >Did you also test read performance in this incompressible case?
   
   Just tried it and it does look 4x faster reading raw random bytes Vs 
compressed random bytes
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] jpountz commented on issue #1234: Add compression for Binary doc value fields

2020-02-07 Thread GitBox
jpountz commented on issue #1234: Add compression for Binary doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#issuecomment-583529199
 
 
   In the case of content that can't be compressed, the compressed data will 
consist of the number of bytes, followed by the bytes. So decompressing 
consists of decoding the length and then reading the bytes. The only overhead 
compared to reading bytes directly is the decoding of the number of bytes, so I 
would expect the overhead to be rather small.
   
   I don't have a strong preference regarding whether this case should be 
handled explicitly or not. It's true that not having a special "not-compressed" 
case helps keep the logic simpler.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9194) Simplify XYShapeXQuery API

2020-02-07 Thread Ignacio Vera (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032560#comment-17032560
 ] 

Ignacio Vera commented on LUCENE-9194:
--

PR related to this change: https://github.com/apache/lucene-solr/pull/1224

> Simplify XYShapeXQuery API 
> ---
>
> Key: LUCENE-9194
> URL: https://issues.apache.org/jira/browse/LUCENE-9194
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Ignacio Vera
>Priority: Minor
>
> Similar to what was done in LUCENE-9141 simplify XYShape queries.
>  
> This change will allow as well to make most of the internal geo classes 
> package private.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] msokolov edited a comment on issue #1234: Add compression for Binary doc value fields

2020-02-07 Thread GitBox
msokolov edited a comment on issue #1234: Add compression for Binary doc value 
fields
URL: https://github.com/apache/lucene-solr/pull/1234#issuecomment-583519622
 
 
   > The LZ4 compressed versions of this content were only marginally bigger 
than their raw counterparts 
   
   Did you also test read performance in this incompressible case?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] madrob opened a new pull request #1244: SOLR-14247 Remove unneeded sleeps

2020-02-07 Thread GitBox
madrob opened a new pull request #1244: SOLR-14247 Remove unneeded sleeps
URL: https://github.com/apache/lucene-solr/pull/1244
 
 
   This test is slow because it sleeps a lot. With the sleeps removed, it still 
passes consistently on my machine, but I would like other folks to confirm this 
on different hardware as well.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to 
Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms 
to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request 
title.
   - [x] I have given Solr maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `master` branch.
   - [x] I have run `ant precommit` and the appropriate test suite.
   - [ ] ~I have added tests for my changes.~
   - [ ] ~I have added documentation for the [Ref 
Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) 
(for Solr changes only).~
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] msokolov commented on issue #1234: Add compression for Binary doc value fields

2020-02-07 Thread GitBox
msokolov commented on issue #1234: Add compression for Binary doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#issuecomment-583519622
 
 
   > The LZ4 compressed versions of this content were only marginally bigger 
than their raw counterparts 
   Did you also test read performance in this incompressible case?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] alessandrobenedetti commented on issue #357: [SOLR-12238] Synonym Queries boost by payload

2020-02-07 Thread GitBox
alessandrobenedetti commented on issue #357: [SOLR-12238] Synonym Queries boost 
by payload 
URL: https://github.com/apache/lucene-solr/pull/357#issuecomment-583518344
 
 
   I have applied the changes to address the feedback points and consequently 
added additional tests to cover some missing scenarios.
   We should be almost ready to go :)


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] alessandrobenedetti commented on a change in pull request #357: [SOLR-12238] Synonym Queries boost by payload

2020-02-07 Thread GitBox
alessandrobenedetti commented on a change in pull request #357: [SOLR-12238] 
Synonym Queries boost by payload 
URL: https://github.com/apache/lucene-solr/pull/357#discussion_r376513280
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/util/QueryBuilder.java
 ##
 @@ -450,9 +485,13 @@ protected Query analyzePhrase(String field, TokenStream 
stream, int slop) throws
 position += 1;
   }
   builder.add(new Term(field, termAtt.getBytesRef()), position);
+  phraseBoost = boostAtt.getBoost();
 
 Review comment:
   I implemented a simple multiplicative boost.
   It's backward compatible with the designed use case (multi-term synonym -> 
single concept -> single boost -> e.g. panthera onca => jaguar|0.95, big 
cat|0.85, black panther|0.65).
   
   But it also works in non-synonym cases, if the user needs a per-token boost 
in phrase and span queries.
   It's in the upcoming commit; let me know if you believe something different 
is necessary.
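
   A sketch of the multiplicative combination described above, based on the 
quoted analyzePhrase loop; variable names follow the surrounding diff and the 
final BoostQuery wrap is illustrative:

{code}
float phraseBoost = 1.0f;
while (stream.incrementToken()) {
  // ... position bookkeeping elided ...
  builder.add(new Term(field, termAtt.getBytesRef()), position);
  phraseBoost *= boostAtt.getBoost(); // combine per-token boosts multiplicatively
}
Query query = builder.build();
return phraseBoost == 1.0f ? query : new BoostQuery(query, phraseBoost);
{code}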


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14245) Validate Replica / ReplicaInfo on creation

2020-02-07 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032531#comment-17032531
 ] 

ASF subversion and git services commented on SOLR-14245:


Commit f8163439ffbb36876f236551f8322a5e5851ba87 in lucene-solr's branch 
refs/heads/branch_8x from Andrzej Bialecki
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=f816343 ]

SOLR-14245: Validate Replica / ReplicaInfo on creation.


> Validate Replica / ReplicaInfo on creation
> --
>
> Key: SOLR-14245
> URL: https://issues.apache.org/jira/browse/SOLR-14245
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: Andrzej Bialecki
>Assignee: Andrzej Bialecki
>Priority: Minor
> Fix For: 8.5
>
>
> Replica / ReplicaInfo should be immutable and their fields should be 
> validated on creation.
> Some users reported that very rarely during a failed collection CREATE or 
> DELETE, or when the Overseer task queue becomes corrupted, Solr may write to 
> ZK incomplete replica infos (eg. node_name = null).
> This problem is difficult to reproduce but we should add safeguards anyway to 
> prevent writing such corrupted replica info to ZK.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] alessandrobenedetti commented on a change in pull request #357: [SOLR-12238] Synonym Queries boost by payload

2020-02-07 Thread GitBox
alessandrobenedetti commented on a change in pull request #357: [SOLR-12238] 
Synonym Queries boost by payload 
URL: https://github.com/apache/lucene-solr/pull/357#discussion_r376503587
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/util/QueryBuilder.java
 ##
 @@ -509,33 +549,40 @@ protected Query analyzeGraphBoolean(String field, 
TokenStream source, BooleanCla
 end = articulationPoints[i];
   }
   lastState = end;
-  final Query queryPos;
+  final Query positionalQuery;
   if (graph.hasSidePath(start)) {
-final Iterator it = graph.getFiniteStrings(start, end);
+final Iterator sidePathsIterator = 
graph.getFiniteStrings(start, end);
 Iterator queries = new Iterator() {
   @Override
   public boolean hasNext() {
-return it.hasNext();
+return sidePathsIterator.hasNext();
   }
 
   @Override
   public Query next() {
-TokenStream ts = it.next();
-return createFieldQuery(ts, BooleanClause.Occur.MUST, field, 
getAutoGenerateMultiTermSynonymsPhraseQuery(), 0);
+TokenStream sidePath = sidePathsIterator.next();
+return createFieldQuery(sidePath, BooleanClause.Occur.MUST, field, 
getAutoGenerateMultiTermSynonymsPhraseQuery(), 0);
   }
 };
-queryPos = newGraphSynonymQuery(queries);
+positionalQuery = newGraphSynonymQuery(queries);
   } else {
-Term[] terms = graph.getTerms(field, start);
+List attributes = graph.getTerms(start);
 
 Review comment:
   A tentative change is coming in the next commit; I also added a few tests to 
cover that else branch.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14245) Validate Replica / ReplicaInfo on creation

2020-02-07 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032528#comment-17032528
 ] 

ASF subversion and git services commented on SOLR-14245:


Commit 9a190935869a5fba8c4935f85988fe712066c465 in lucene-solr's branch 
refs/heads/master from Andrzej Bialecki
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=9a19093 ]

SOLR-14245: Validate Replica / ReplicaInfo on creation.


> Validate Replica / ReplicaInfo on creation
> --
>
> Key: SOLR-14245
> URL: https://issues.apache.org/jira/browse/SOLR-14245
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: Andrzej Bialecki
>Assignee: Andrzej Bialecki
>Priority: Minor
> Fix For: 8.5
>
>
> Replica / ReplicaInfo should be immutable and their fields should be 
> validated on creation.
> Some users reported that very rarely during a failed collection CREATE or 
> DELETE, or when the Overseer task queue becomes corrupted, Solr may write to 
> ZK incomplete replica infos (eg. node_name = null).
> This problem is difficult to reproduce but we should add safeguards anyway to 
> prevent writing such corrupted replica info to ZK.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] alessandrobenedetti commented on a change in pull request #357: [SOLR-12238] Synonym Queries boost by payload

2020-02-07 Thread GitBox
alessandrobenedetti commented on a change in pull request #357: [SOLR-12238] 
Synonym Queries boost by payload 
URL: https://github.com/apache/lucene-solr/pull/357#discussion_r376478661
 
 

 ##
 File path: solr/core/src/test-files/solr/collection1/conf/schema12.xml
 ##
 @@ -238,6 +227,18 @@
 
   
 
+  
 
 Review comment:
   Fixed in the next coming commit!


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] alessandrobenedetti commented on a change in pull request #357: [SOLR-12238] Synonym Queries boost by payload

2020-02-07 Thread GitBox
alessandrobenedetti commented on a change in pull request #357: [SOLR-12238] 
Synonym Queries boost by payload 
URL: https://github.com/apache/lucene-solr/pull/357#discussion_r376476976
 
 

 ##
 File path: 
lucene/analysis/common/src/java/org/apache/lucene/analysis/boost/DelimitedBoostTokenFilter.java
 ##
 @@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.boost;
+
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+import org.apache.lucene.search.BoostAttribute;
+
+import java.io.IOException;
+
+
+/**
+ * Characters before the delimiter are the "token", those after are the boost.
+ * 
+ * For example, if the delimiter is '|', then for the string "foo|0.7", foo is 
the token
+ * and 0.7 is the boost.
+ * 
+ * Note make sure your Tokenizer doesn't split on the delimiter, or this won't 
work
+ */
+public final class DelimitedBoostTokenFilter extends TokenFilter {
+  private final char delimiter;
+  private final CharTermAttribute termAtt = 
addAttribute(CharTermAttribute.class);
+  private final BoostAttribute boostAtt = addAttribute(BoostAttribute.class);
+
+  public DelimitedBoostTokenFilter(TokenStream input, char delimiter) {
+super(input);
+this.delimiter = delimiter;
+  }
+
+  @Override
+  public boolean incrementToken() throws IOException {
+if (input.incrementToken()) {
+  final char[] buffer = termAtt.buffer();
+  final int length = termAtt.length();
+  for (int i = 0; i < length; i++) {
+if (buffer[i] == delimiter) {
+  float boost = Float.parseFloat(new String(buffer, i + 1, (length - 
(i + 1;
+  boostAtt.setBoost(boost);
+  termAtt.setLength(i);
+  return true;
+}
+  }
+  // we have not seen the delimiter
+  boostAtt.setBoost(1.0f);
 
 Review comment:
   Fixed in the next coming commit
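
   For context, a minimal usage sketch of the filter above; it assumes a 
tokenizer that does not split on the '|' delimiter, and the sample input is 
hypothetical:

{code}
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.boost.DelimitedBoostTokenFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

Tokenizer tokenizer = new WhitespaceTokenizer();
tokenizer.setReader(new StringReader("jaguar|0.95 cat|0.85"));
TokenStream stream = new DelimitedBoostTokenFilter(tokenizer, '|');
// Each emitted token now carries its weight in the BoostAttribute:
// "jaguar" with boost 0.95 and "cat" with boost 0.85.
{code}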


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] alessandrobenedetti commented on a change in pull request #357: [SOLR-12238] Synonym Queries boost by payload

2020-02-07 Thread GitBox
alessandrobenedetti commented on a change in pull request #357: [SOLR-12238] 
Synonym Queries boost by payload 
URL: https://github.com/apache/lucene-solr/pull/357#discussion_r376476198
 
 

 ##
 File path: solr/core/src/java/org/apache/solr/schema/TextField.java
 ##
 @@ -43,6 +43,7 @@
 public class TextField extends FieldType {
   protected boolean autoGeneratePhraseQueries;
   protected boolean enableGraphQueries;
+  protected boolean synonymBoostByPayload;
 
 Review comment:
   agreed and fixed!


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] romseygeek commented on a change in pull request #357: [SOLR-12238] Synonym Queries boost by payload

2020-02-07 Thread GitBox
romseygeek commented on a change in pull request #357: [SOLR-12238] Synonym 
Queries boost by payload 
URL: https://github.com/apache/lucene-solr/pull/357#discussion_r376473778
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/util/QueryBuilder.java
 ##
 @@ -450,9 +485,13 @@ protected Query analyzePhrase(String field, TokenStream stream, int slop) throws
 position += 1;
   }
   builder.add(new Term(field, termAtt.getBytesRef()), position);
+  phraseBoost = boostAtt.getBoost();
 
 Review comment:
   I think this isn't quite right, because we need to combine boosts together 
somehow; currently your phrase boost is just the boost of the last term in the 
phrase.
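
For illustration, one way to combine them (a minimal sketch under the assumption that boosts should multiply across the phrase terms; apart from `phraseBoost` and `boostAtt`, the names here are mine, not the PR's):

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BoostAttribute;
    import org.apache.lucene.search.BoostQuery;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.Query;

    final class BoostedPhraseSketch {
      // Folds every term's boost into a running product instead of
      // overwriting phraseBoost with the boost of the last term only.
      static Query analyzePhrase(String field, TokenStream stream, int slop) throws IOException {
        PhraseQuery.Builder builder = new PhraseQuery.Builder();
        builder.setSlop(slop);
        CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
        BoostAttribute boostAtt = stream.addAttribute(BoostAttribute.class);
        float phraseBoost = 1.0f;
        int position = -1;
        stream.reset();
        while (stream.incrementToken()) {
          position += 1;  // simplified: ignores PositionIncrementAttribute
          builder.add(new Term(field, termAtt.getBytesRef()), position);
          phraseBoost *= boostAtt.getBoost();  // combine, don't overwrite
        }
        stream.end();
        Query phrase = builder.build();
        return phraseBoost == 1.0f ? phrase : new BoostQuery(phrase, phraseBoost);
      }
    }

Multiplication is only one plausible combination; max or sum would weight multi-term synonyms differently.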


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] romseygeek commented on a change in pull request #357: [SOLR-12238] Synonym Queries boost by payload

2020-02-07 Thread GitBox
romseygeek commented on a change in pull request #357: [SOLR-12238] Synonym 
Queries boost by payload 
URL: https://github.com/apache/lucene-solr/pull/357#discussion_r376474333
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/util/QueryBuilder.java
 ##
 @@ -509,33 +549,40 @@ protected Query analyzeGraphBoolean(String field, TokenStream source, BooleanCla
         end = articulationPoints[i];
       }
       lastState = end;
-      final Query queryPos;
+      final Query positionalQuery;
       if (graph.hasSidePath(start)) {
-        final Iterator<TokenStream> it = graph.getFiniteStrings(start, end);
+        final Iterator<TokenStream> sidePathsIterator = graph.getFiniteStrings(start, end);
         Iterator<Query> queries = new Iterator<Query>() {
           @Override
           public boolean hasNext() {
-            return it.hasNext();
+            return sidePathsIterator.hasNext();
           }
 
           @Override
           public Query next() {
-            TokenStream ts = it.next();
-            return createFieldQuery(ts, BooleanClause.Occur.MUST, field, getAutoGenerateMultiTermSynonymsPhraseQuery(), 0);
+            TokenStream sidePath = sidePathsIterator.next();
+            return createFieldQuery(sidePath, BooleanClause.Occur.MUST, field, getAutoGenerateMultiTermSynonymsPhraseQuery(), 0);
           }
         };
-        queryPos = newGraphSynonymQuery(queries);
+        positionalQuery = newGraphSynonymQuery(queries);
       } else {
-        Term[] terms = graph.getTerms(field, start);
+        List<AttributeSource> attributes = graph.getTerms(start);
 
 Review comment:
   This is what GraphTokenStreamFiniteStrings returns currently, for multiple 
tokens at the same position.  Maybe `TermAndBoost[]` would make more sense 
though.
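
As a reference point, such a carrier could be as small as this (a hypothetical sketch; `TermAndBoost` is only the name floated above, not a class in this PR):

    import org.apache.lucene.index.Term;

    // Lightweight value class: a term plus its boost, instead of carrying
    // a whole AttributeSource per token.
    final class TermAndBoost {
      final Term term;
      final float boost;

      TermAndBoost(Term term, float boost) {
        this.term = term;
        this.boost = boost;
      }
    }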


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] alessandrobenedetti commented on issue #357: [SOLR-12238] Synonym Queries boost by payload

2020-02-07 Thread GitBox
alessandrobenedetti commented on issue #357: [SOLR-12238] Synonym Queries boost 
by payload 
URL: https://github.com/apache/lucene-solr/pull/357#issuecomment-583474019
 
 
   Hi @romseygeek, @dsmiley,
   first of all, thank you again for your patience and very useful insights.
   I have incorporated Alan's changes and cleaned everything up.
   
   My unresolved questions:
   - BoostAttribute doesn't use BytesRef but a float directly; is that a concern? 
We expect to use it at query time, so we might actually see a minimal 
query-time benefit from not encoding/decoding.
   - Alan expressed concerns over SpanBoostQuery, mentioning it is somewhat 
broken; what should we do in that regard? Right now the created span query 
seems to work as expected with boosted synonyms (see the related test). If 
SpanBoostQuery is broken, I suspect it should be resolved in another ticket.
   - From an original comment in the test code 
org.apache.solr.search.TestSolrQueryParser#testSynonymQueryStyle:
   "confirm autoGeneratePhraseQueries always builds OR queries"
   I changed that; was there any reason for that behaviour?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-12238) Synonym Query Style Boost By Payload

2020-02-07 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-12238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032489#comment-17032489
 ] 

Alessandro Benedetti commented on SOLR-12238:
-

Hi [~dsmiley], [~romseygeek], first of all, thank you again for your patience 
and very useful insights.
The child Lucene issue and pull request have been updated incorporating Alan's 
suggestions.


> Synonym Query Style Boost By Payload
> 
>
> Key: SOLR-12238
> URL: https://issues.apache.org/jira/browse/SOLR-12238
> Project: Solr
>  Issue Type: Improvement
>  Components: query parsers
>Affects Versions: 7.2
>Reporter: Alessandro Benedetti
>Priority: Major
> Attachments: SOLR-12238.patch, SOLR-12238.patch, SOLR-12238.patch, 
> SOLR-12238.patch
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> This improvement is built on top of the Synonym Query Style feature and 
> brings the possibility of boosting synonym queries using the payload 
> associated.
> It introduces two new modalities for the Synonym Query Style :
> PICK_BEST_BOOST_BY_PAYLOAD -> build a Disjunction query with the clauses 
> boosted by payload
> AS_DISTINCT_TERMS_BOOST_BY_PAYLOAD -> build a Boolean query with the clauses 
> boosted by payload
> These new synonym query styles assume payloads are available, so they must 
> be used in conjunction with a token filter able to produce payloads.
> A synonym.txt example could be:
> # Synonyms used by Payload Boost
> tiger => tiger|1.0, Big_Cat|0.8, Shere_Khan|0.9
> leopard => leopard, Big_Cat|0.8, Bagheera|0.9
> lion => lion|1.0, panthera leo|0.99, Simba|0.8
> snow_leopard => panthera uncia|0.99, snow leopard|1.0
> A simple token filter to populate the payloads from such a synonym.txt is:
> <filter class="solr.DelimitedPayloadTokenFilterFactory" delimiter="|"/>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] dsmiley commented on a change in pull request #357: [SOLR-12238] Synonym Queries boost by payload

2020-02-07 Thread GitBox
dsmiley commented on a change in pull request #357: [SOLR-12238] Synonym 
Queries boost by payload 
URL: https://github.com/apache/lucene-solr/pull/357#discussion_r376460611
 
 

 ##
 File path: solr/core/src/test-files/solr/collection1/conf/schema12.xml
 ##
 @@ -238,6 +227,18 @@
 
   
 
+  <!-- fieldType XML stripped by the mail archiver -->
 
 Review comment:
   You can remove "payload" everywhere from this PR now; no?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] dsmiley commented on a change in pull request #357: [SOLR-12238] Synonym Queries boost by payload

2020-02-07 Thread GitBox
dsmiley commented on a change in pull request #357: [SOLR-12238] Synonym 
Queries boost by payload 
URL: https://github.com/apache/lucene-solr/pull/357#discussion_r376450137
 
 

 ##
 File path: 
lucene/analysis/common/src/java/org/apache/lucene/analysis/boost/DelimitedBoostTokenFilter.java
 ##
 @@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.boost;
+
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+import org.apache.lucene.search.BoostAttribute;
+
+import java.io.IOException;
+
+
+/**
+ * Characters before the delimiter are the "token", those after are the boost.
+ * <p>
+ * For example, if the delimiter is '|', then for the string "foo|0.7", foo is the token
+ * and 0.7 is the boost.
+ * <p>
+ * Note: make sure your Tokenizer doesn't split on the delimiter, or this won't work.
+ */
+public final class DelimitedBoostTokenFilter extends TokenFilter {
+  private final char delimiter;
+  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
+  private final BoostAttribute boostAtt = addAttribute(BoostAttribute.class);
+
+  public DelimitedBoostTokenFilter(TokenStream input, char delimiter) {
+    super(input);
+    this.delimiter = delimiter;
+  }
+
+  @Override
+  public boolean incrementToken() throws IOException {
+    if (input.incrementToken()) {
+      final char[] buffer = termAtt.buffer();
+      final int length = termAtt.length();
+      for (int i = 0; i < length; i++) {
+        if (buffer[i] == delimiter) {
+          float boost = Float.parseFloat(new String(buffer, i + 1, (length - (i + 1))));
+          boostAtt.setBoost(boost);
+          termAtt.setLength(i);
+          return true;
+        }
+      }
+      // we have not seen the delimiter
+      boostAtt.setBoost(1.0f);
 
 Review comment:
  Shouldn't be needed; leave the boost be -- it defaults to 1.0 anyway.
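
To see both behaviours at once, a minimal sketch (assuming only the package and constructor from the diff above; `tiger|0.9` carries an explicit boost, `lion` falls back to 1.0):

    import java.io.StringReader;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.boost.DelimitedBoostTokenFilter;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.search.BoostAttribute;

    public class DelimitedBoostDemo {
      public static void main(String[] args) throws Exception {
        // Whitespace tokenization keeps "tiger|0.9" as one token, so the
        // filter can split term and boost on '|'.
        Tokenizer tokenizer = new WhitespaceTokenizer();
        tokenizer.setReader(new StringReader("tiger|0.9 lion"));
        try (TokenStream ts = new DelimitedBoostTokenFilter(tokenizer, '|')) {
          CharTermAttribute termAtt = ts.getAttribute(CharTermAttribute.class);
          BoostAttribute boostAtt = ts.getAttribute(BoostAttribute.class);
          ts.reset();
          while (ts.incrementToken()) {
            // prints: tiger -> 0.9, then lion -> 1.0
            System.out.println(termAtt + " -> " + boostAtt.getBoost());
          }
          ts.end();
        }
      }
    }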


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] dsmiley commented on a change in pull request #357: [SOLR-12238] Synonym Queries boost by payload

2020-02-07 Thread GitBox
dsmiley commented on a change in pull request #357: [SOLR-12238] Synonym 
Queries boost by payload 
URL: https://github.com/apache/lucene-solr/pull/357#discussion_r376455962
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/util/QueryBuilder.java
 ##
 @@ -509,33 +549,40 @@ protected Query analyzeGraphBoolean(String field, TokenStream source, BooleanCla
         end = articulationPoints[i];
       }
       lastState = end;
-      final Query queryPos;
+      final Query positionalQuery;
       if (graph.hasSidePath(start)) {
-        final Iterator<TokenStream> it = graph.getFiniteStrings(start, end);
+        final Iterator<TokenStream> sidePathsIterator = graph.getFiniteStrings(start, end);
         Iterator<Query> queries = new Iterator<Query>() {
           @Override
           public boolean hasNext() {
-            return it.hasNext();
+            return sidePathsIterator.hasNext();
           }
 
           @Override
           public Query next() {
-            TokenStream ts = it.next();
-            return createFieldQuery(ts, BooleanClause.Occur.MUST, field, getAutoGenerateMultiTermSynonymsPhraseQuery(), 0);
+            TokenStream sidePath = sidePathsIterator.next();
+            return createFieldQuery(sidePath, BooleanClause.Occur.MUST, field, getAutoGenerateMultiTermSynonymsPhraseQuery(), 0);
           }
         };
-        queryPos = newGraphSynonymQuery(queries);
+        positionalQuery = newGraphSynonymQuery(queries);
       } else {
-        Term[] terms = graph.getTerms(field, start);
+        List<AttributeSource> attributes = graph.getTerms(start);
 
 Review comment:
  I think I mentioned that a List of AttributeSource is weird (I've never seen 
this) and it's heavyweight. Why not a TokenStream or TermAndBoost[]?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] dsmiley commented on a change in pull request #357: [SOLR-12238] Synonym Queries boost by payload

2020-02-07 Thread GitBox
dsmiley commented on a change in pull request #357: [SOLR-12238] Synonym 
Queries boost by payload 
URL: https://github.com/apache/lucene-solr/pull/357#discussion_r376459427
 
 

 ##
 File path: solr/core/src/java/org/apache/solr/schema/TextField.java
 ##
 @@ -43,6 +43,7 @@
 public class TextField extends FieldType {
   protected boolean autoGeneratePhraseQueries;
   protected boolean enableGraphQueries;
+  protected boolean synonymBoostByPayload;
 
 Review comment:
  I thought we switched the approach from a payload to a boost attribute? 
Besides, it's not clear we need this toggle at all, since the user could arrange 
for this behavior simply by having the new DelimitedBoost filter in the 
chain.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9171) Synonyms Boost by Payload

2020-02-07 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032462#comment-17032462
 ] 

Alessandro Benedetti commented on LUCENE-9171:
--

Hi [~romseygeek], first of all, thank you again for your patience and very 
useful insights.
I have incorporated your changes and cleaned everything up.
You'll find the original PR updated.

My unresolved questions:

- BoostAttribute doesn't use BytesRef but a float directly; is that a concern? 
We expect to use it at query time, so we might actually see a minimal 
query-time benefit from not encoding/decoding.

- You expressed concerns over SpanBoostQuery, mentioning it is somewhat broken; 
what should we do in that regard? Right now the created span query seems to 
work as expected with boosted synonyms (see the related test). If 
SpanBoostQuery is broken, I suspect it should be resolved in another ticket.

- From an original comment in the test code 
org.apache.solr.search.TestSolrQueryParser#testSynonymQueryStyle:
"confirm autoGeneratePhraseQueries always builds OR queries"
I changed that; was there any reason for it?

> Synonyms Boost by Payload
> -
>
> Key: LUCENE-9171
> URL: https://issues.apache.org/jira/browse/LUCENE-9171
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/queryparser
>Reporter: Alessandro Benedetti
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I have been working on the additional capability of boosting queries by term 
> payloads, through a parameter that enables it in the Lucene QueryBuilder.
> This has been done targeting the Synonyms Query.
> It is parametric, so it is meant to make no difference unless the feature is 
> enabled.
> Solr has its bits to comply through its SynonymQueryStyles



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] markharwood commented on issue #1234: Add compression for Binary doc value fields

2020-02-07 Thread GitBox
markharwood commented on issue #1234: Add compression for Binary doc value 
fields
URL: https://github.com/apache/lucene-solr/pull/1234#issuecomment-583449275
 
 
   There was a suggestion from @jimczi that we fall back to writing raw data if 
content doesn't compress well. I'm not sure this logic is worth developing for 
the reasons outlined below:
   
   I wrote a [compression 
buffer](https://gist.github.com/markharwood/91cc8d96d6611ad97df11f244b1b1d0f) 
to see what the compression algo outputs before deciding whether to write the 
compressed or  raw data to disk.
   I tested with the most uncompressible content I could imagine:
   
    public static void fillRandom(byte[] buffer, int length) {
        for (int i = 0; i < length; i++) {
            buffer[i] = (byte) (Math.random() * Byte.MAX_VALUE);
        }
    }
   
   The LZ4-compressed versions of this content were only marginally bigger than 
their raw counterparts, adding about 0.4% overhead to the original content 
(e.g. 96,921 compressed vs 96,541 raw bytes).
   On that basis I'm not sure it's worth doubling the memory cost of the 
indexing logic (we would require a temporary output buffer at least the same 
size as the raw data being compressed) and adding extra byte shuffling.
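
For concreteness, the fallback under discussion would look roughly like this (an illustrative sketch, not code from the PR):

    public class RawFallback {
      // Keep the compressed bytes only when they are actually smaller;
      // otherwise store the raw bytes. A 1-byte marker tells the reader
      // which form follows.
      public static byte[] maybeCompress(byte[] raw, byte[] compressed) {
        boolean useCompressed = compressed.length < raw.length;
        byte[] payload = useCompressed ? compressed : raw;
        byte[] out = new byte[payload.length + 1];
        out[0] = (byte) (useCompressed ? 1 : 0);
        System.arraycopy(payload, 0, out, 1, payload.length);
        return out;
      }
    }

Given the 0.4% figure above, the marker-plus-raw path would rarely pay for itself.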


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-07 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032436#comment-17032436
 ] 

Robert Muir commented on LUCENE-9201:
-

Now that overview.html works, I tried investigating the package.html problems. I 
can reproduce it, and the problem is specific to Gradle. Switching to 
package-info.java is definitely a solution, but I can't stand unexplained 
mysteries.

> Port documentation-lint task to Gradle build
> 
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: javadocGRADLE.png, javadocHTML4.png, javadocHTML5.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Ant build's "documentation-lint" target consists of those two sub targets.
>  * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
>  * "-documentation-lint"(Missing javadocs / broken links check by python 
> scripts)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14248) Improve ClusterStateMockUtil and make its methods public

2020-02-07 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032431#comment-17032431
 ] 

Shalin Shekhar Mangar commented on SOLR-14248:
--

This patch fixes all the problems except for #5. The way it fixes #3 is a hack 
but that's the best I could do without creating a builder class for 
DocCollection. I've left a todo comment in there to describe the hack and 
eventual fix.

> Improve ClusterStateMockUtil and make its methods public
> 
>
> Key: SOLR-14248
> URL: https://issues.apache.org/jira/browse/SOLR-14248
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Tests
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>Priority: Minor
> Fix For: master (9.0), 8.5
>
> Attachments: SOLR-14248.patch
>
>
> While working on SOLR-13996, I had the need to mock the cluster state for 
> various configurations and I used ClusterStateMockUtil.
> However, I ran into a few issues that needed to be fixed:
> 1. The methods in this class are protected making it useful only within the 
> same package
> 2. A null router was set for DocCollection objects
> 3. The DocCollection object is created before the slices so the 
> DocCollection.getActiveSlices method returns empty list because the active 
> slices map is created inside the DocCollection constructor
> 4. It did not set core name for the replicas it created
> 5. It has no support for replica types so it only creates nrt replicas
> I will use this Jira to fix these problems and make the methods in that class 
> public (but marked as experimental)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-14248) Improve ClusterStateMockUtil and make its methods public

2020-02-07 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-14248:
-
Attachment: SOLR-14248.patch

> Improve ClusterStateMockUtil and make its methods public
> 
>
> Key: SOLR-14248
> URL: https://issues.apache.org/jira/browse/SOLR-14248
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Tests
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>Priority: Minor
> Fix For: master (9.0), 8.5
>
> Attachments: SOLR-14248.patch
>
>
> While working on SOLR-13996, I had the need to mock the cluster state for 
> various configurations and I used ClusterStateMockUtil.
> However, I ran into a few issues that needed to be fixed:
> 1. The methods in this class are protected making it useful only within the 
> same package
> 2. A null router was set for DocCollection objects
> 3. The DocCollection object is created before the slices so the 
> DocCollection.getActiveSlices method returns empty list because the active 
> slices map is created inside the DocCollection constructor
> 4. It did not set core name for the replicas it created
> 5. It has no support for replica types so it only creates nrt replicas
> I will use this Jira to fix these problems and make the methods in that class 
> public (but marked as experimental)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (SOLR-14248) Improve ClusterStateMockUtil and make its methods public

2020-02-07 Thread Shalin Shekhar Mangar (Jira)
Shalin Shekhar Mangar created SOLR-14248:


 Summary: Improve ClusterStateMockUtil and make its methods public
 Key: SOLR-14248
 URL: https://issues.apache.org/jira/browse/SOLR-14248
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
  Components: Tests
Reporter: Shalin Shekhar Mangar
Assignee: Shalin Shekhar Mangar
 Fix For: master (9.0), 8.5


While working on SOLR-13996, I had the need to mock the cluster state for 
various configurations and I used ClusterStateMockUtil.

However, I ran into a few issues that needed to be fixed:
1. The methods in this class are protected making it useful only within the 
same package
2. A null router was set for DocCollection objects
3. The DocCollection object is created before the slices so the 
DocCollection.getActiveSlices method returns empty list because the active 
slices map is created inside the DocCollection constructor
4. It did not set core name for the replicas it created
5. It has no support for replica types so it only creates nrt replicas

I will use this Jira to fix these problems and make the methods in that class 
public (but marked as experimental)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] romseygeek opened a new pull request #1243: LUCENE-9212: Intervals.multiterm() should take CompiledAutomaton

2020-02-07 Thread GitBox
romseygeek opened a new pull request #1243: LUCENE-9212: Intervals.multiterm() 
should take CompiledAutomaton
URL: https://github.com/apache/lucene-solr/pull/1243
 
 
   Currently it takes `Automaton` and then compiles it internally, but we need 
to do things
   like check for binary-vs-unicode status; it should just take 
`CompiledAutomaton` instead,
   and put responsibility for determinization, binaryness, etc, on the caller.
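
A rough sketch of the intended call pattern (assuming the new overload added by this PR and Lucene's existing automaton APIs; the pattern string is only a label):

    import org.apache.lucene.queries.intervals.Intervals;
    import org.apache.lucene.queries.intervals.IntervalsSource;
    import org.apache.lucene.util.automaton.Automaton;
    import org.apache.lucene.util.automaton.CompiledAutomaton;
    import org.apache.lucene.util.automaton.RegExp;

    public class MultitermIntervalSketch {
      public static IntervalsSource wildcardSource() {
        // The caller now owns determinization and binary-vs-unicode
        // decisions before handing the automaton over.
        Automaton automaton = new RegExp("foo.*").toAutomaton();
        CompiledAutomaton compiled = new CompiledAutomaton(automaton);
        return Intervals.multiterm(compiled, "foo.*");
      }
    }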


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-9212) Intervals.multiterm() should take a CompiledAutomaton

2020-02-07 Thread Alan Woodward (Jira)
Alan Woodward created LUCENE-9212:
-

 Summary: Intervals.multiterm() should take a CompiledAutomaton
 Key: LUCENE-9212
 URL: https://issues.apache.org/jira/browse/LUCENE-9212
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Alan Woodward
Assignee: Alan Woodward


LUCENE-9028 added a `multiterm` factory method for intervals that accepts an 
arbitrary Automaton, and converts it internally into a CompiledAutomaton.  This 
isn't necessarily correct behaviour, however, because Automatons can be defined 
in both binary and unicode space, and there's no way of telling which it is 
when it comes to compiling them.  In particular, for automatons produced by 
FuzzyTermsEnum, we need to convert them to unicode before compilation.

The `multiterm` factory should just take `CompiledAutomaton` directly, and we 
should deprecate the methods that take `Automaton` and remove them in master.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-02-07 Thread Xin-Chun Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin-Chun Zhang updated LUCENE-9136:
---
Description: 
Representation learning (RL) has been an established discipline in the machine 
learning space for decades but it draws tremendous attention lately with the 
emergence of deep learning. The central problem of RL is to determine an 
optimal representation of the input data. By embedding the data into a high 
dimensional vector, the vector retrieval (VR) method is then applied to search 
the relevant items.

With the rapid development of RL over the past few years, the technique has 
been used extensively in industry from online advertising to computer vision 
and speech recognition. There exist many open source implementations of VR 
algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
choices for potential users. However, the aforementioned implementations are 
all written in C++ with no plan to support a Java interface, which makes them 
hard to integrate into Java projects and hard to use for those who are not 
familiar with C/C++ [https://github.com/facebookresearch/faiss/issues/105].

The algorithms for vector retrieval can be roughly classified into four 
categories,
 # Tree-based algorithms, such as KD-tree;
 # Hashing methods, such as LSH (Locality-Sensitive Hashing);
 # Product quantization based algorithms, such as IVFFlat;
 # Graph-based algorithms, such as HNSW, SSG, NSG;

where IVFFlat and HNSW are the most popular ones among all the VR algorithms.

Recently, the implementation of HNSW (Hierarchical Navigable Small World, 
LUCENE-9004) for Lucene has made great progress. The issue draws the attention 
of those who are interested in Lucene or hope to use HNSW with Solr/Lucene.

As an alternative for solving ANN similarity search problems, IVFFlat is also 
very popular with many users and supporters. Compared with HNSW, IVFFlat has a 
smaller index size but requires k-means clustering, while HNSW is faster in 
query (no training required) but requires extra storage for saving graphs 
[indexing 1M vectors|https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors]. 
The recall ratio of IVFFlat can be gradually increased by adjusting the query 
parameter (nprobe), while it's hard for HNSW to improve its accuracy. In 
theory, IVFFlat could achieve a 100% recall ratio. Another advantage is that 
IVFFlat can be faster and more accurate when GPU parallel computing is enabled 
(currently not supported in Java). Both algorithms have their merits and 
demerits. Since HNSW is now under development, it may be better to provide both 
implementations (HNSW and IVFFlat) for potential users who face very different 
scenarios and want more choices.
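
To make the nprobe/recall trade-off concrete, a toy sketch of the IVFFlat query path (illustrative only; names and structure are mine, not from any Lucene patch):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Comparator;
    import java.util.List;
    import java.util.PriorityQueue;

    public class IvfFlatSketch {
      static float squaredDistance(float[] a, float[] b) {
        float sum = 0;
        for (int i = 0; i < a.length; i++) {
          float d = a[i] - b[i];
          sum += d * d;
        }
        return sum;
      }

      // centroids[c] comes from a prior k-means step; cells.get(c) holds the
      // vectors assigned to centroid c. Raising nprobe scans more cells and
      // raises recall, at the cost of more distance computations.
      static List<float[]> search(float[] query, float[][] centroids,
                                  List<List<float[]>> cells, int nprobe, int k) {
        Integer[] order = new Integer[centroids.length];
        for (int c = 0; c < centroids.length; c++) order[c] = c;
        Arrays.sort(order,
            Comparator.comparingDouble(c -> squaredDistance(query, centroids[c])));

        // Exhaustively scan ("Flat") only the nprobe nearest cells.
        PriorityQueue<float[]> topK = new PriorityQueue<>(
            Comparator.comparingDouble((float[] v) -> squaredDistance(query, v)).reversed());
        for (int i = 0; i < Math.min(nprobe, order.length); i++) {
          for (float[] v : cells.get(order[i])) {
            topK.add(v);
            if (topK.size() > k) topK.poll();  // evict current farthest
          }
        }
        return new ArrayList<>(topK);
      }
    }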


[jira] [Created] (LUCENE-9211) Adding compression to BinaryDocValues storage

2020-02-07 Thread Mark Harwood (Jira)
Mark Harwood created LUCENE-9211:


 Summary: Adding compression to BinaryDocValues storage
 Key: LUCENE-9211
 URL: https://issues.apache.org/jira/browse/LUCENE-9211
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/codecs
Reporter: Mark Harwood
Assignee: Mark Harwood


While SortedSetDocValues can be used today to store identical values in a 
compact form, this is not effective for data with many unique values.

The proposal is that BinaryDocValues should be stored in LZ4-compressed blocks, 
which can dramatically reduce disk storage costs in many cases. Blocks of a 
number of documents would be stored as a single compressed blob, along with 
metadata recording the offsets at which the original document values can be 
found in the uncompressed content.

There's a trade-off here between efficient compression (more docs-per-block = 
better compression) and fast retrieval times (fewer docs-per-block = faster 
read access for single values). A fixed block size of 32 docs seems like it 
would be a reasonable compromise for most scenarios.
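
To make the proposed layout concrete, a rough sketch (using java.util.zip as a stand-in for LZ4, which lives in Lucene's codec internals; names are mine, not from the PR):

    import java.io.ByteArrayOutputStream;
    import java.util.zip.Deflater;
    import java.util.zip.Inflater;

    public class BlockedBinaryDocValuesSketch {
      // Concatenate one block of values, remember per-doc offsets (the
      // metadata), and compress the concatenation as a single blob.
      static byte[] compressBlock(byte[][] values, int[] offsetsOut) throws Exception {
        ByteArrayOutputStream concat = new ByteArrayOutputStream();
        for (int i = 0; i < values.length; i++) {
          offsetsOut[i] = concat.size();  // start of value i in the raw bytes
          concat.write(values[i]);
        }
        byte[] raw = concat.toByteArray();
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(raw);
        deflater.finish();
        byte[] scratch = new byte[raw.length + raw.length / 10 + 64];
        int len = deflater.deflate(scratch);
        deflater.end();
        byte[] out = new byte[len];
        System.arraycopy(scratch, 0, out, 0, len);
        return out;
      }

      // Reading one doc inflates the whole block, which is exactly the
      // docs-per-block vs. read-latency trade-off described above.
      static byte[] readValue(byte[] block, int[] offsets, int rawLength, int doc) throws Exception {
        Inflater inflater = new Inflater();
        inflater.setInput(block);
        byte[] raw = new byte[rawLength];
        inflater.inflate(raw);
        inflater.end();
        int start = offsets[doc];
        int end = (doc + 1 < offsets.length) ? offsets[doc + 1] : rawLength;
        byte[] value = new byte[end - start];
        System.arraycopy(raw, start, value, 0, value.length);
        return value;
      }
    }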

A PR is up for review here [https://github.com/apache/lucene-solr/pull/1234]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] markharwood commented on issue #1234: Add compression for Binary doc value fields

2020-02-07 Thread GitBox
markharwood commented on issue #1234: Add compression for Binary doc value 
fields
URL: https://github.com/apache/lucene-solr/pull/1234#issuecomment-583313015
 
 
   I've reclaimed my Jira log-in and opened 
https://issues.apache.org/jira/browse/LUCENE-9211


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-12930) Add developer documentation to source repo

2020-02-07 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-12930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032251#comment-17032251
 ] 

ASF subversion and git services commented on SOLR-12930:


Commit c0d1f302360ef97b5cfdcbdf82365f8ec1d6c2ed in lucene-solr's branch 
refs/heads/master from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c0d1f30 ]

SOLR-12930: Exclude dev-docs from binary archive.


> Add developer documentation to source repo
> --
>
> Key: SOLR-12930
> URL: https://issues.apache.org/jira/browse/SOLR-12930
> Project: Solr
>  Issue Type: Improvement
>  Components: Tests
>Reporter: Mark Miller
>Priority: Major
> Attachments: solr-dev-docs.zip
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-12930) Add developer documentation to source repo

2020-02-07 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-12930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032250#comment-17032250
 ] 

ASF subversion and git services commented on SOLR-12930:


Commit d62f63076585769f757dcaf9919d2f07fab113d3 in lucene-solr's branch 
refs/heads/branch_8x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=d62f630 ]

SOLR-12930: Exclude dev-docs from binary archive.


> Add developer documentation to source repo
> --
>
> Key: SOLR-12930
> URL: https://issues.apache.org/jira/browse/SOLR-12930
> Project: Solr
>  Issue Type: Improvement
>  Components: Tests
>Reporter: Mark Miller
>Priority: Major
> Attachments: solr-dev-docs.zip
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org