[jira] [Resolved] (SOLR-14248) Improve ClusterStateMockUtil and make its methods public
[ https://issues.apache.org/jira/browse/SOLR-14248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar resolved SOLR-14248.
------------------------------------------
    Resolution: Fixed

> Improve ClusterStateMockUtil and make its methods public
> --------------------------------------------------------
>
>                 Key: SOLR-14248
>                 URL: https://issues.apache.org/jira/browse/SOLR-14248
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public (Default Security Level. Issues are Public)
>          Components: Tests
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>            Priority: Minor
>             Fix For: master (9.0), 8.5
>
>         Attachments: SOLR-14248.patch, SOLR-14248.patch
>
> While working on SOLR-13996, I needed to mock the cluster state for various configurations, and I used ClusterStateMockUtil. However, I ran into a few issues that needed to be fixed:
> 1. The methods in this class are protected, making it usable only within the same package
> 2. A null router was set for DocCollection objects
> 3. The DocCollection object is created before the slices, so DocCollection.getActiveSlices returns an empty list because the active-slices map is built inside the DocCollection constructor
> 4. It did not set a core name for the replicas it created
> 5. It has no support for replica types, so it only creates NRT replicas
>
> I will use this Jira to fix these problems and make the methods in that class public (but marked as experimental)

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
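Point 3 above is an instance of a general Java pitfall: a constructor snapshots a collection, so anything added to the backing list afterwards is invisible through the snapshot accessor. A minimal toy sketch of the bug and the fix (class and method names here are illustrative, not Solr's actual DocCollection API):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model of the pitfall: the parent snapshots its children's state in
// the constructor, so children added afterwards are never visible through
// getActiveSlices(). Names are invented for this illustration.
class MockCollection {
    private final List<String> slices;        // mutable backing list
    private final List<String> activeSlices;  // snapshot taken at construction

    MockCollection(List<String> slices) {
        this.slices = slices;
        // Computed once, here, like DocCollection's active-slices map.
        this.activeSlices = List.copyOf(slices);
    }

    List<String> getActiveSlices() { return activeSlices; }
    List<String> getSlices() { return Collections.unmodifiableList(slices); }
}

public class SnapshotPitfall {
    public static void main(String[] args) {
        List<String> slices = new ArrayList<>();
        MockCollection brokenOrder = new MockCollection(slices); // built first...
        slices.add("shard1");                                    // ...slices added later
        System.out.println(brokenOrder.getActiveSlices());       // prints [] -- the bug

        List<String> ready = List.of("shard1", "shard2");
        MockCollection fixedOrder = new MockCollection(ready);   // slices built first
        System.out.println(fixedOrder.getActiveSlices());        // prints [shard1, shard2]
    }
}
```

The fix in the patch follows the second shape: build the slices fully, then construct the collection object from them.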
[jira] [Commented] (SOLR-14248) Improve ClusterStateMockUtil and make its methods public
[ https://issues.apache.org/jira/browse/SOLR-14248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032834#comment-17032834 ]

ASF subversion and git services commented on SOLR-14248:

Commit e623eb53207b8dabfe36d6a9679b7590ec4a1d20 in lucene-solr's branch refs/heads/branch_8x from Shalin Shekhar Mangar
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e623eb5 ]

SOLR-14248: Improve ClusterStateMockUtil and make its methods public

(cherry picked from commit f5c132be6d3fc20f689e630517e7c6be2166f17e)
[jira] [Commented] (SOLR-14248) Improve ClusterStateMockUtil and make its methods public
[ https://issues.apache.org/jira/browse/SOLR-14248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032830#comment-17032830 ]

ASF subversion and git services commented on SOLR-14248:

Commit f5c132be6d3fc20f689e630517e7c6be2166f17e in lucene-solr's branch refs/heads/master from Shalin Shekhar Mangar
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=f5c132b ]

SOLR-14248: Improve ClusterStateMockUtil and make its methods public
[jira] [Commented] (SOLR-14248) Improve ClusterStateMockUtil and make its methods public
[ https://issues.apache.org/jira/browse/SOLR-14248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032824#comment-17032824 ]

Shalin Shekhar Mangar commented on SOLR-14248:

The latest patch adds support for replica types and resolves a conflict introduced by SOLR-14245. It also adds a test for this class. This is ready to go.
[jira] [Updated] (SOLR-14248) Improve ClusterStateMockUtil and make its methods public
[ https://issues.apache.org/jira/browse/SOLR-14248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar updated SOLR-14248:
-----------------------------------------
    Attachment: SOLR-14248.patch
[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032804#comment-17032804 ]

Robert Muir commented on LUCENE-9201:

{quote}
Package summary: "ant documentation" uses "package.html" as package summary description, but "gradlew javadoc" ignores "package.html" (so some packages lack a summary description in "package-summary.html" when building javadocs with Gradle). We might be able to make the Gradle Javadoc task handle "package.html" files properly with some options. Or, should we replace all "package.html" with "package-info.java" at this time?
{quote}

I found the answer to this. Gradle is fundamentally broken here; it's not possible to fix it. When ant runs javadoc, we supply just a source directory (-sourcepath) and a list of packages:

{noformat}
javadoc -sourcepath /home/rmuir/workspace/lucene-solr/lucene/core/src/java org.apache.lucene org.apache.lucene.analysis org.apache.lucene.analysis.standard ...
{noformat}

When gradle runs javadoc, it does not do this; it passes each .java file individually:

{noformat}
javadoc '/home/rmuir/workspace/lucene-solr/lucene/core/src/java/org/apache/lucene/search/SearcherFactory.java' '/home/rmuir/workspace/lucene-solr/lucene/core/src/java/org/apache/lucene/search/QueryCache.java' ...
{noformat}

It seems the whole design is to make it work with their SourceTask/FileTree crap. And you can't pass individual html files to the javadoc tool to work around it: it takes only source files or package names. I can't see any way to pass their task a package list the way we do with ant: it *REALLY* wants to be based on the FileTree. Maybe we should call the ant task from gradle? They really messed this up.

The other thing that seems really broken is the missing linkoffline. There are links between the modules (e.g. lucene-analyzers and lucene-core), and linkoffline makes that work. But it seems the gradle build is structured to make per-module output dirs, which won't work here.

> Port documentation-lint task to Gradle build
> --------------------------------------------
>
>                 Key: LUCENE-9201
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9201
>             Project: Lucene - Core
>          Issue Type: Sub-task
>    Affects Versions: master (9.0)
>            Reporter: Tomoko Uchida
>            Assignee: Tomoko Uchida
>            Priority: Major
>         Attachments: javadocGRADLE.png, javadocHTML4.png, javadocHTML5.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Ant build's "documentation-lint" target consists of these two sub-targets:
> * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
> * "-documentation-lint" (Missing javadocs / broken links check by python scripts)
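The package.html behavior can be checked directly against the JDK's documentation tool: when javadoc is handed a package name plus -sourcepath (the ant-style invocation above), it folds package.html into the generated package-summary.html. A small sketch using javax.tools.ToolProvider; the temp paths, package name, and marker string are invented for the demo:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import javax.tools.DocumentationTool;
import javax.tools.ToolProvider;

// Runs javadoc programmatically in ant style (-sourcepath + package name)
// and returns the generated package-summary.html, which should contain
// the package.html body text. Paths and marker text are demo values.
public class PackageHtmlDemo {
    public static String runDemo() throws Exception {
        Path src = Files.createTempDirectory("src");
        Path pkg = Files.createDirectories(src.resolve("demo"));
        Files.writeString(pkg.resolve("A.java"),
            "package demo; public class A {}");
        Files.writeString(pkg.resolve("package.html"),
            "<html><body>MARKER-SUMMARY-TEXT</body></html>");

        Path out = Files.createTempDirectory("javadoc-out");
        DocumentationTool javadoc = ToolProvider.getSystemDocumentationTool();
        // ant-style invocation: source path plus a package name, not files
        int rc = javadoc.run(null, null, null,
            "-quiet", "-d", out.toString(),
            "-sourcepath", src.toString(), "demo");
        if (rc != 0) throw new IllegalStateException("javadoc failed: " + rc);
        return Files.readString(out.resolve("demo").resolve("package-summary.html"));
    }

    public static void main(String[] args) throws Exception {
        String summary = runDemo();
        System.out.println(summary.contains("MARKER-SUMMARY-TEXT")
            ? "package.html text present in package-summary.html"
            : "package.html ignored");
    }
}
```

Swapping the last argument for individual .java file paths (the gradle-style invocation) is what loses the package.html association.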
[jira] [Commented] (LUCENE-9149) Increase data dimension limit in BKD
[ https://issues.apache.org/jira/browse/LUCENE-9149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032748#comment-17032748 ]

ASF subversion and git services commented on LUCENE-9149:

Commit 0bd2496205a1319c34df2b8a236fb87f329bb3f4 in lucene-solr's branch refs/heads/branch_8x from Nicholas Knize
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=0bd2496 ]

LUCENE-9149: Increase data dimension limit in BKD

> Increase data dimension limit in BKD
> ------------------------------------
>
>                 Key: LUCENE-9149
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9149
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Nick Knize
>            Priority: Major
>         Attachments: LUCENE-9149.patch
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> LUCENE-8496 added selective indexing: the ability to designate the first K <= N dimensions for driving the construction of the BKD internal nodes. Follow-on work stored the "data dimensions" for only the leaf nodes, and only the "index dimensions" are stored for the internal nodes. While {{maxPointsInLeafNode}} is still important for managing the BKD heap memory footprint (so we don't want it to get too large), I'd like to propose increasing the {{MAX_DIMENSIONS}} limit (to something not too crazy, like 16; effectively doubling the index dimension limit) while maintaining {{MAX_INDEX_DIMENSIONS}} at 8.
> Doing this will enable us to encode higher-dimension data within a lower-dimension index (e.g., 3D tessellated triangles as a 10-dimension point using only the first 6 dimensions for index construction).
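The idea behind selective indexing, splitting the tree on only the first K of N dimensions while every point carries all N values down to the leaves, can be sketched with a toy recursive partitioner. This is an illustration of the concept only, not Lucene's actual BKD writer:

```java
import java.util.Arrays;
import java.util.Comparator;

// Toy sketch of BKD-style selective indexing: the tree structure is built
// by splitting on only the first indexDims dimensions, cycling through
// them level by level, but each point keeps all of its data dimensions,
// which stay available at the leaves. Illustrative only.
public class SelectiveIndexSketch {
    // Reorders points[lo, hi) in place so that each level's split uses
    // dimension (level % indexDims); dimensions >= indexDims never
    // influence the structure.
    static void build(double[][] points, int lo, int hi, int level, int indexDims) {
        if (hi - lo <= 1) return;                 // leaf: full data dims retained
        int dim = level % indexDims;              // cycle through index dims only
        Arrays.sort(points, lo, hi, Comparator.comparingDouble(p -> p[dim]));
        int mid = (lo + hi) / 2;
        build(points, lo, mid, level + 1, indexDims);
        build(points, mid, hi, level + 1, indexDims);
    }

    public static void main(String[] args) {
        // 3 data dimensions per point, but only the first 2 drive the tree.
        double[][] pts = { {3, 1, 99}, {1, 2, 42}, {2, 0, 7}, {0, 3, 13} };
        build(pts, 0, pts.length, 0, 2);
        for (double[] p : pts) System.out.println(Arrays.toString(p));
    }
}
```

Raising MAX_DIMENSIONS while holding MAX_INDEX_DIMENSIONS at 8 corresponds to widening the per-point payload here without touching which dimensions drive the splits.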
[jira] [Updated] (SOLR-14250) Solr tries to read request body after error response is sent
[ https://issues.apache.org/jira/browse/SOLR-14250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-14250:
-------------------------------
    Description:

If a client sends an HTTP POST request with the header {{Expect: 100-continue}}, the normal flow is for Solr (Jetty) to first respond with an {{HTTP 100 Continue}} response; the client then sends the body, which is processed, and a final response is sent by Solr.

However, if such a request leads to an error (e.g. 404 or 401), then Solr will skip the 100 response and instead send the error response directly. The very last action of {{SolrDispatchFilter#doFilter}} is to call {{consumeInputFully()}}. However, this should not be done when an error response has already been sent, else you'll provoke an exception in Jetty's HTTP lib:

{noformat}
2020-02-07 23:13:26.459 INFO (qtp403547747-24) [ ] o.a.s.s.SolrDispatchFilter Could not consume full client request => java.io.IOException: Committed before 100 Continues
java.io.IOException: Committed before 100 Continues
	at org.eclipse.jetty.http2.server.HttpChannelOverHTTP2.continue100(HttpChannelOverHTTP2.java:362) ~[http2-server-9.4.19.v20190610.jar:9.4.19.v20190610]
	at org.eclipse.jetty.server.Request.getInputStream(Request.java:872) ~[jetty-server-9.4.19.v20190610.jar:9.4.19.v20190610]
	at javax.servlet.ServletRequestWrapper.getInputStream(ServletRequestWrapper.java:185) ~[javax.servlet-api-3.1.0.jar:3.1.0]
	at org.apache.solr.servlet.SolrDispatchFilter$1.getInputStream(SolrDispatchFilter.java:612) ~[solr-core-8.4.1.jar:8.4.1 832bf13dd9187095831caf69783179d41059d013 - ishan - 2020-01-10 13:40:28]
	at org.apache.solr.servlet.SolrDispatchFilter.consumeInputFully(SolrDispatchFilter.java:454) ~[solr-core-8.4.1.jar:8.4.1 832bf13dd9187095831caf69783179d41059d013 - ishan - 2020-01-10 13:40:28]
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:445) ~[solr-core-8.4.1.jar:8.4.1 832bf13dd9187095831caf69783179d41059d013 - ishan - 2020-01-10 13:40:28]
{noformat}

was: (the same description, with the stack trace still carrying its "solr1_1 |" docker-compose log prefixes)

> Solr tries to read request body after error response is sent
> ------------------------------------------------------------
>
>                 Key: SOLR-14250
>                 URL: https://issues.apache.org/jira/browse/SOLR-14250
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public (Default Security Level. Issues are Public)
>            Reporter: Jan Høydahl
>
[jira] [Created] (SOLR-14250) Solr tries to read request body after error response is sent
Jan Høydahl created SOLR-14250:
----------------------------------

             Summary: Solr tries to read request body after error response is sent
                 Key: SOLR-14250
                 URL: https://issues.apache.org/jira/browse/SOLR-14250
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
            Reporter: Jan Høydahl

If a client sends an HTTP POST request with the header {{Expect: 100-continue}}, the normal flow is for Solr (Jetty) to first respond with an {{HTTP 100 Continue}} response; the client then sends the body, which is processed, and a final response is sent by Solr.

However, if such a request leads to an error (e.g. 404 or 401), then Solr will skip the 100 response and instead send the error response directly. The very last action of {{SolrDispatchFilter#doFilter}} is to call {{consumeInputFully()}}. However, this should not be done when an error response has already been sent, else you'll provoke an exception in Jetty's HTTP lib:

{noformat}
solr1_1 | 2020-02-07 23:13:26.459 INFO (qtp403547747-24) [ ] o.a.s.s.SolrDispatchFilter Could not consume full client request => java.io.IOException: Committed before 100 Continues
solr1_1 |     at org.eclipse.jetty.http2.server.HttpChannelOverHTTP2.continue100(HttpChannelOverHTTP2.java:362)
solr1_1 | java.io.IOException: Committed before 100 Continues
solr1_1 |     at org.eclipse.jetty.http2.server.HttpChannelOverHTTP2.continue100(HttpChannelOverHTTP2.java:362) ~[http2-server-9.4.19.v20190610.jar:9.4.19.v20190610]
solr1_1 |     at org.eclipse.jetty.server.Request.getInputStream(Request.java:872) ~[jetty-server-9.4.19.v20190610.jar:9.4.19.v20190610]
solr1_1 |     at javax.servlet.ServletRequestWrapper.getInputStream(ServletRequestWrapper.java:185) ~[javax.servlet-api-3.1.0.jar:3.1.0]
solr1_1 |     at org.apache.solr.servlet.SolrDispatchFilter$1.getInputStream(SolrDispatchFilter.java:612) ~[solr-core-8.4.1.jar:8.4.1 832bf13dd9187095831caf69783179d41059d013 - ishan - 2020-01-10 13:40:28]
solr1_1 |     at org.apache.solr.servlet.SolrDispatchFilter.consumeInputFully(SolrDispatchFilter.java:454) ~[solr-core-8.4.1.jar:8.4.1 832bf13dd9187095831caf69783179d41059d013 - ishan - 2020-01-10 13:40:28]
solr1_1 |     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:445) ~[solr-core-8.4.1.jar:8.4.1 832bf13dd9187095831caf69783179d41059d013 - ishan - 2020-01-10 13:40:28]
{noformat}
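The Expect: 100-continue handshake described in this issue can be exercised with the JDK's built-in HTTP client, which sends the header when expectContinue(true) is set. In this sketch a toy server rejects the request with 401 before ever reading the body, mimicking the error path discussed above; the context path and body text are arbitrary demo values, and this is not Solr/Jetty code:

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Demo of the Expect: 100-continue flow: the client announces a POST body
// but waits for the server's verdict. A server that rejects the request
// (here with 401, mimicking an auth failure) answers before the body is
// consumed. Path and body are demo values.
public class ExpectContinueDemo {
    public static int post() throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/solr", exchange -> {
            // Reject immediately without reading the request body,
            // like Solr sending the error response before consuming input.
            exchange.sendResponseHeaders(401, -1);
            exchange.close();
        });
        server.start();
        try {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:"
                        + server.getAddress().getPort() + "/solr"))
                .expectContinue(true) // ask before sending the body
                .POST(HttpRequest.BodyPublishers.ofString("big-update-body"))
                .build();
            HttpResponse<Void> resp =
                client.send(request, HttpResponse.BodyHandlers.discarding());
            return resp.statusCode();
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("status = " + post());
    }
}
```

The bug in this issue is the server-side mirror of this flow: after the error response is committed, attempting to read the never-sent body triggers Jetty's "Committed before 100 Continues" IOException.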
[jira] [Commented] (LUCENE-9146) Switch GitHub PR test from ant precommit to gradle
[ https://issues.apache.org/jira/browse/LUCENE-9146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032732#comment-17032732 ]

ASF subversion and git services commented on LUCENE-9146:

Commit 7c20f6b8c5ec46cdd3f8f32a2fedcb5b0406ba3b in lucene-solr's branch refs/heads/master from Anshum Gupta
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=7c20f6b ]

LUCENE-9146: Create gradle precommit action (#1245)

> Switch GitHub PR test from ant precommit to gradle
> --------------------------------------------------
>
>                 Key: LUCENE-9146
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9146
>             Project: Lucene - Core
>          Issue Type: Sub-task
>            Reporter: Mike Drob
>            Assignee: Anshum Gupta
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
[jira] [Commented] (LUCENE-9146) Switch GitHub PR test from ant precommit to gradle
[ https://issues.apache.org/jira/browse/LUCENE-9146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032734#comment-17032734 ]

Anshum Gupta commented on LUCENE-9146:

Merged into master.
[GitHub] [lucene-solr] anshumg merged pull request #1245: LUCENE-9146: Create gradle precommit action
anshumg merged pull request #1245: LUCENE-9146: Create gradle precommit action
URL: https://github.com/apache/lucene-solr/pull/1245
[jira] [Comment Edited] (SOLR-14249) Krb5HttpClientBuilder should not buffer requests
[ https://issues.apache.org/jira/browse/SOLR-14249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032723#comment-17032723 ]

Kevin Risden edited comment on SOLR-14249 at 2/7/20 10:56 PM:

So I haven't personally looked at Krb5HttpClientBuilder recently, other than the completely unrelated SOLR-13726. Part of the reason that a lot of clients buffer is how Kerberos SPNEGO authentication works. There are typically two parts:
* a request without authentication, where the server returns a 401 with a negotiate response
* a request with authentication in response to the negotiate, which the server can verify

If you don't put any optimizations in place, every request becomes two. A lot of the time a cookie is used here to limit the number of HTTP requests.

The 401-plus-second-request is an issue when the request is non-repeatable, like a POST body. The client sends the body, gets a 401, then realizes it needs to send the body again and can't, because the body is non-repeatable. So a very common, simple workaround is to buffer the request, do the 401 check dance, and then proceed. This is a way to make a non-repeatable request semi-repeatable. This buffering has issues, though, as you found: the buffer must be limited in size, which then limits the usefulness of the technique.

There are a few alternatives to buffering:
* Authenticate up front with, say, an OPTIONS request, which will get the cookie. The next request, say a POST, won't have any issue and won't do the 401 dance.
* "Preemptively" do SPNEGO authorization: if you know the SPN needed, create the right Authorization header up front. This also skips the 401, and the server can check the header.
* Use the "Expect: 100-continue" header, which asks the server whether it can handle the request without the body and only then sends the body. This actually holds the data back from being sent in the first place, where possible.
** Curl automatically activates "Expect: 100-continue" under a few conditions: https://gms.tf/when-curl-sends-100-continue.html
** Apache HttpClient does NOT do any special handling of "Expect: 100-continue"
** Not sure if Jetty HttpClient does anything with "Expect: 100-continue"

So long story short: yes, buffering is a problem.

> Krb5HttpClientBuilder should not buffer requests
> ------------------------------------------------
>
>                 Key: SOLR-14249
>                 URL: https://issues.apache.org/jira/browse/SOLR-14249
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public (Default Security Level. Issues are Public)
>          Components: Authentication, SolrJ
>    Affects Versions: 7.4, master (9.0), 8.4.1
>            Reporter: Jason Gerlowski
>            Priority: Major
>         Attachments: SOLR-14249-reproduction.patch
>
> When SolrJ clients enable Kerberos authentication, a request interceptor is set up which wraps the actual HttpEntity in a BufferedHttpEntity. This BufferedHttpEntity, well, buffers the request body in a {{byte[]}} so it can be repeated if needed. This works fine for small requests, but when requests get large, storing the entire request in memory causes contention or OutOfMemoryErrors.
> The easiest way for this to manifest is to
[jira] [Commented] (SOLR-14249) Krb5HttpClientBuilder should not buffer requests
[ https://issues.apache.org/jira/browse/SOLR-14249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032723#comment-17032723 ] Kevin Risden commented on SOLR-14249: - So I haven't personally looked at Krb5HttpClientBuilder recently, other than the completely unrelated SOLR-13726. Part of the reason that a lot of clients buffer is due to how Kerberos SPNEGO authentication works. There are typically two parts:
* a request without authentication, to which the server returns a 401 with a negotiate challenge
* a request carrying authentication in response to that challenge, which the server can verify

If you don't put any optimizations in place, every request becomes two. A cookie is often used here to limit the number of HTTP requests. The 401 and second request become a problem when the request is non-repeatable, like a streamed POST body. So the super simple workaround is to buffer the request, do the 401 dance, and then proceed. This is a way to make a non-repeatable request semi-repeatable. This buffering has issues, though, as you found: the buffer should be limited in size, which then limits the usefulness of the technique. There are a few alternatives to buffering:
* Authenticate up front with, say, an OPTIONS request, which will get the cookie; the next request, say a POST, won't have any issue and won't do the 401 dance
* Use the "Expect: 100-continue" header, which asks the server whether it can handle the request before the body is sent. This holds the data back from being sent in the first place if possible.
** Curl automatically activates "Expect: 100-continue" under a few conditions: https://gms.tf/when-curl-sends-100-continue.html
** Apache HttpClient does NOT do any special handling of "Expect: 100-continue"
** I'm not sure whether Jetty HttpClient does anything with "Expect: 100-continue"

So long story short: yes, buffering is a problem.
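The size problem with blanket buffering can be made concrete with a small sketch: read the body into memory only up to a configurable limit, and give up (leaving the request non-repeatable, so the caller must authenticate up front instead) once the limit is exceeded. This is illustrative only; the class, method, and limit below are hypothetical and are not part of SolrJ or Apache HttpClient.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical sketch: make a non-repeatable request body repeatable only when it is small. */
public class BoundedBufferSketch {

    /**
     * Reads the stream fully into memory if it fits within maxBytes; returns null otherwise.
     * A null result means the caller must fall back to a non-repeatable request
     * (e.g. by authenticating up front so the 401 dance never happens mid-body).
     */
    public static byte[] bufferUpTo(InputStream body, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = body.read(chunk)) != -1) {
            if (out.size() + n > maxBytes) {
                return null; // too large to buffer safely; do not risk an OOM
            }
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] small = bufferUpTo(new ByteArrayInputStream(new byte[100]), 1024);
        byte[] big = bufferUpTo(new ByteArrayInputStream(new byte[2048]), 1024);
        System.out.println("small buffered: " + (small != null) + ", big buffered: " + (big != null));
    }
}
```

The current interceptor effectively uses an unbounded version of this, which is why a single huge ConcurrentUpdateSolrClient request can consume the whole heap.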
[jira] [Commented] (LUCENE-9146) Switch GitHub PR test from ant precommit to gradle
[ https://issues.apache.org/jira/browse/LUCENE-9146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032717#comment-17032717 ] Anshum Gupta commented on LUCENE-9146: -- https://github.com/apache/lucene-solr/pull/1245 > Switch GitHub PR test from ant precommit to gradle > -- > > Key: LUCENE-9146 > URL: https://issues.apache.org/jira/browse/LUCENE-9146 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Mike Drob >Assignee: Anshum Gupta >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Assigned] (LUCENE-9146) Switch GitHub PR test from ant precommit to gradle
[ https://issues.apache.org/jira/browse/LUCENE-9146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anshum Gupta reassigned LUCENE-9146: Assignee: Anshum Gupta
[jira] [Updated] (LUCENE-9213) fix documentation-lint on recent java
[ https://issues.apache.org/jira/browse/LUCENE-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-9213: Fix Version/s: master (9.0) > fix documentation-lint on recent java > - > > Key: LUCENE-9213 > URL: https://issues.apache.org/jira/browse/LUCENE-9213 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > Fix For: master (9.0) > > Attachments: LUCENE-9213.patch, LUCENE-9213.patch > > > Currently this is disabled unless you use java 11. It works with java 12. For > java 13, the python checker needs some slight tweaks. > Javadocs are formatted differently in each release but the changes between 12 > and 13 were enough to anger the checker. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-9213) fix documentation-lint on recent java
[ https://issues.apache.org/jira/browse/LUCENE-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved LUCENE-9213. - Resolution: Fixed
[jira] [Commented] (LUCENE-9213) fix documentation-lint on recent java
[ https://issues.apache.org/jira/browse/LUCENE-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032706#comment-17032706 ] ASF subversion and git services commented on LUCENE-9213: - Commit 69f26d099ec36adec251cbf36594ea375d7fc620 in lucene-solr's branch refs/heads/master from Robert Muir [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=69f26d0 ] LUCENE-9213: fix documentation-lint (and finally precommit) to work on java 12 and 13. The "missing javadocs" checker needed tweaks to work with the format changes of java 13. As a follow-up we may investigate javadoc (maybe the new doclet api). It has its own missing-doc checks now too, but they are black and white (either fully documented or not checked), whereas this python tool allows us to "improve", e.g. enforce that all classes have docs even if all methods do not yet.
[jira] [Commented] (LUCENE-9213) fix documentation-lint on recent java
[ https://issues.apache.org/jira/browse/LUCENE-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032702#comment-17032702 ] Robert Muir commented on LUCENE-9213: - I want to follow up and investigate the new doclet api; I am concerned that the html format changes too fast across jdk releases. Maybe it can do the checks we need easily. But for now this gets {{ant precommit}} working with java 12 and 13 (the original issue I wanted to solve).
[jira] [Commented] (LUCENE-9149) Increase data dimension limit in BKD
[ https://issues.apache.org/jira/browse/LUCENE-9149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032700#comment-17032700 ] ASF subversion and git services commented on LUCENE-9149: - Commit 206a70e7b79050db0d351135e406cfb997cbeee1 in lucene-solr's branch refs/heads/master from Nicholas Knize [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=206a70e ] LUCENE-9149: Increase data dimension limit in BKD > Increase data dimension limit in BKD > > > Key: LUCENE-9149 > URL: https://issues.apache.org/jira/browse/LUCENE-9149 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Nick Knize >Priority: Major > Attachments: LUCENE-9149.patch > > Time Spent: 10m > Remaining Estimate: 0h > > LUCENE-8496 added selective indexing: the ability to designate the first K <= N dimensions to drive the construction of the BKD internal nodes. Follow-on work stores the "data dimensions" only in the leaf nodes, while only the "index dimensions" are stored for the internal nodes. While {{maxPointsInLeafNode}} is still important for managing the BKD heap memory footprint (so we don't want it to get too large), I'd like to propose increasing the {{MAX_DIMENSIONS}} limit to something not too crazy, like 16 (effectively doubling the data dimension limit), while keeping {{MAX_INDEX_DIMENSIONS}} at 8. > Doing this will enable us to encode higher-dimension data within a lower-dimension index (e.g., 3D tessellated triangles as a 10-dimension point using only the first 6 dimensions for index construction). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
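The limits discussed above can be sketched as a small validation routine: a point may store up to 16 data dimensions, but at most the first 8 (and never more than the data dimension count) may drive index construction. The class and method names below are hypothetical illustrations, not Lucene's actual points API; the constants mirror the values proposed in the issue.

```java
/**
 * Illustrative sketch (not Lucene code) of the proposed BKD dimension limits:
 * up to 16 data dimensions stored per point, with at most the first 8
 * used to build the internal BKD index nodes.
 */
public class BkdDimsSketch {
    static final int MAX_DIMENSIONS = 16;      // data dimensions stored in leaf nodes
    static final int MAX_INDEX_DIMENSIONS = 8; // dimensions driving internal-node construction

    /** Returns true if a point field with this configuration would be allowed. */
    public static boolean isValid(int numDataDims, int numIndexDims) {
        return numDataDims >= 1 && numDataDims <= MAX_DIMENSIONS
            && numIndexDims >= 1 && numIndexDims <= Math.min(numDataDims, MAX_INDEX_DIMENSIONS);
    }

    public static void main(String[] args) {
        // e.g. a tessellated triangle: 10 data dimensions, first 6 used for the index
        System.out.println(isValid(10, 6));
        // 17 data dimensions would exceed the proposed limit
        System.out.println(isValid(17, 6));
    }
}
```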
[GitHub] [lucene-solr] asfgit merged pull request #1182: LUCENE-9149: Increase data dimension limit in BKD
asfgit merged pull request #1182: LUCENE-9149: Increase data dimension limit in BKD URL: https://github.com/apache/lucene-solr/pull/1182 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] anshumg opened a new pull request #1245: Create gradle precommit action
anshumg opened a new pull request #1245: Create gradle precommit action URL: https://github.com/apache/lucene-solr/pull/1245 This adds a gradle precommit action w/ Java11 for all branches.
[jira] [Commented] (LUCENE-9213) fix documentation-lint on recent java
[ https://issues.apache.org/jira/browse/LUCENE-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032680#comment-17032680 ] Robert Muir commented on LUCENE-9213: - I tested and got documentation-lint BUILD SUCCESSFUL for lucene and solr with java 11, 12, and 13.
[jira] [Commented] (LUCENE-9213) fix documentation-lint on recent java
[ https://issues.apache.org/jira/browse/LUCENE-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032664#comment-17032664 ] Robert Muir commented on LUCENE-9213: - I had to tweak slightly for that case (generics). Now everything passes on Java 13:
{noformat}
-documentation-lint:
    [echo] Checking for broken links...
    [exec]
    [exec] Crawl/parse...
    [exec]
    [exec] Verify...
    [echo] Checking for missing docs...

BUILD SUCCESSFUL
{noformat}
[jira] [Updated] (LUCENE-9213) fix documentation-lint on recent java
[ https://issues.apache.org/jira/browse/LUCENE-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-9213: Attachment: LUCENE-9213.patch
[jira] [Commented] (LUCENE-9213) fix documentation-lint on recent java
[ https://issues.apache.org/jira/browse/LUCENE-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032651#comment-17032651 ] Robert Muir commented on LUCENE-9213: - Seems we still have at least one bug left. I hope it does not involve generics...
{noformat}
    [exec] Verify...
    [echo] Checking for missing docs...
    [exec]
    [exec] build/docs/core/org/apache/lucene/analysis/CharArrayMap.html
    [exec]   missing Methods: put(java.lang.Object,V)
    [exec]
    [exec] Missing javadocs were found!
{noformat}
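The failure above comes from comparing a javadoc-rendered signature like put(java.lang.Object,V) against the source signature put(K,V). One way a checker can cope with generics is to erase single-letter type variables before comparing. The normalize method below is a hypothetical sketch of that idea, not the actual logic in the project's python checker.

```java
/** Hypothetical sketch of generic-signature erasure for a missing-javadocs checker. */
public class SignatureErasureSketch {

    /** Erases lone single-letter generic type variables (K, V, T, ...) to java.lang.Object. */
    public static String normalize(String signature) {
        return signature.replaceAll("\\b[A-Z]\\b", "java.lang.Object");
    }

    public static void main(String[] args) {
        String fromJavadoc = normalize("put(java.lang.Object,V)");
        String fromSource = normalize("put(K,V)");
        // both erase to put(java.lang.Object,java.lang.Object), so they compare equal
        System.out.println(fromJavadoc.equals(fromSource));
    }
}
```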
[jira] [Commented] (LUCENE-9213) fix documentation-lint on recent java
[ https://issues.apache.org/jira/browse/LUCENE-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032648#comment-17032648 ] Robert Muir commented on LUCENE-9213: - Attached is the current patch I am testing now. cc [~mikemccand]
[jira] [Updated] (LUCENE-9213) fix documentation-lint on recent java
[ https://issues.apache.org/jira/browse/LUCENE-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-9213: Attachment: LUCENE-9213.patch
[jira] [Created] (LUCENE-9213) fix documentation-lint on recent java
Robert Muir created LUCENE-9213: --- Summary: fix documentation-lint on recent java Key: LUCENE-9213 URL: https://issues.apache.org/jira/browse/LUCENE-9213 Project: Lucene - Core Issue Type: Task Reporter: Robert Muir Currently this is disabled unless you use java 11. It works with java 12. For java 13, the python checker needs some slight tweaks. Javadocs are formatted differently in each release but the changes between 12 and 13 were enough to anger the checker. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (SOLR-14249) Krb5HttpClientBuilder should not buffer requests
[ https://issues.apache.org/jira/browse/SOLR-14249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Gerlowski updated SOLR-14249: --- Attachment: SOLR-14249-reproduction.patch
[jira] [Created] (SOLR-14249) Krb5HttpClientBuilder should not buffer requests
Jason Gerlowski created SOLR-14249: -- Summary: Krb5HttpClientBuilder should not buffer requests Key: SOLR-14249 URL: https://issues.apache.org/jira/browse/SOLR-14249 Project: Solr Issue Type: Improvement Security Level: Public (Default Security Level. Issues are Public) Components: Authentication, SolrJ Affects Versions: 8.4.1, 7.4, master (9.0) Reporter: Jason Gerlowski When SolrJ clients enable Kerberos authentication, a request interceptor is set up which wraps the actual HttpEntity in a BufferedHttpEntity. This BufferedHttpEntity, well, buffers the request body in a {{byte[]}} so it can be repeated if needed. This works fine for small requests, but when requests get large storing the entire request in memory causes contention or OutOfMemoryErrors. The easiest way for this to manifest is to use ConcurrentUpdateSolrClient, which opens a connection to Solr and streams documents out in an ever increasing request entity until the doc queue held by the client is emptied. I ran into this while troubleshooting a DIH run that would reproducibly load a few hundred thousand documents before progress stalled out. 
Solr never crashed and the DIH thread was still alive, but the ConcurrentUpdateSolrClient used by DIH had its "Runner" thread disappear around the time of the stall, and an OOM like the one below could be seen in solr-8983-console.log:
{code}
WARNING: Uncaught exception in thread: Thread[concurrentUpdateScheduler-28-thread-1,5,TGRP-TestKerberosClientBuffering]
java.lang.OutOfMemoryError: Java heap space
  at __randomizedtesting.SeedInfo.seed([371A00FBA76D31DF]:0)
  at java.base/java.util.Arrays.copyOf(Arrays.java:3745)
  at java.base/java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:120)
  at java.base/java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:95)
  at java.base/java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:156)
  at org.apache.solr.common.util.FastOutputStream.flush(FastOutputStream.java:213)
  at org.apache.solr.common.util.FastOutputStream.write(FastOutputStream.java:94)
  at org.apache.solr.common.util.ByteUtils.writeUTF16toUTF8(ByteUtils.java:145)
  at org.apache.solr.common.util.JavaBinCodec.writeStr(JavaBinCodec.java:848)
  at org.apache.solr.common.util.JavaBinCodec.writePrimitive(JavaBinCodec.java:932)
  at org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:328)
  at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:228)
  at org.apache.solr.common.util.JavaBinCodec.writeSolrInputDocument(JavaBinCodec.java:616)
  at org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:355)
  at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:228)
  at org.apache.solr.common.util.JavaBinCodec.writeMapEntry(JavaBinCodec.java:764)
  at org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:383)
  at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:228)
  at org.apache.solr.common.util.JavaBinCodec.writeIterator(JavaBinCodec.java:705)
  at org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:367)
  at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:228)
  at org.apache.solr.common.util.JavaBinCodec.writeNamedList(JavaBinCodec.java:223)
  at org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:330)
  at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:228)
  at org.apache.solr.common.util.JavaBinCodec.marshal(JavaBinCodec.java:155)
  at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.marshal(JavaBinUpdateRequestCodec.java:91)
  at org.apache.solr.client.solrj.impl.BinaryRequestWriter.write(BinaryRequestWriter.java:83)
  at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner$1.writeTo(ConcurrentUpdateSolrClient.java:264)
  at org.apache.http.entity.EntityTemplate.writeTo(EntityTemplate.java:73)
  at org.apache.http.entity.BufferedHttpEntity.<init>(BufferedHttpEntity.java:62)
  at org.apache.solr.client.solrj.impl.Krb5HttpClientBuilder.lambda$new$3(Krb5HttpClientBuilder.java:155)
  at org.apache.solr.client.solrj.impl.Krb5HttpClientBuilder$$Lambda$459/0x000800623840.process(Unknown Source)
  at org.apache.solr.client.solrj.impl.HttpClientUtil$DynamicInterceptor$1.accept(HttpClientUtil.java:177)
{code}
We took heap dumps and were able to confirm that the entire 8gb heap was taken up with a single massive CUSC request body that was being buffered! (As an aside, I had no idea that OutOfMemoryErrors could happen without killing the entire JVM. But apparently they can. CUSC.Runner propagates the OOM as it should and the OOM kills the Runner thread. Since that thread is the gc-root for the massive BufferedHttpEntity though, a garbage collection frees
[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032615#comment-17032615 ] Robert Muir commented on LUCENE-9201: - OK, thanks for working on the PR. At a glance it looks good to me. But we may get better feedback from Dawid Weiss when he is back online in a few days. I will try to investigate more of the problems that you uncovered... > Port documentation-lint task to Gradle build > > > Key: LUCENE-9201 > URL: https://issues.apache.org/jira/browse/LUCENE-9201 > Project: Lucene - Core > Issue Type: Sub-task >Affects Versions: master (9.0) >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > Attachments: javadocGRADLE.png, javadocHTML4.png, javadocHTML5.png > > Time Spent: 10m > Remaining Estimate: 0h > > Ant build's "documentation-lint" target consists of those two sub targets. > * "-ecj-javadoc-lint" (Javadoc linting by ECJ) > * "-documentation-lint"(Missing javadocs / broken links check by python > scripts) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-5146) Figure out what it would take for lazily-loaded cores to play nice with SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032610#comment-17032610 ] Ilan Ginzburg commented on SOLR-5146: - Thanks [~erickerickson] for the wider context overview. If we solve the leader issue, ensuring the index is up to date (and making it so if it's not) is likely a lot easier with SHARED collections and replicas, i.e. index files written to a Blob storage that becomes the "source of truth" (https://github.com/apache/lucene-solr/tree/jira/SOLR-13101). My understanding [~dsmiley] is that a replica being unloaded totally, i.e. files are on disk but nothing in memory, would require changes to the current strategy of always having replica-specific Zookeeper connections/state for the leader election process. > Figure out what it would take for lazily-loaded cores to play nice with > SolrCloud > - > > Key: SOLR-5146 > URL: https://issues.apache.org/jira/browse/SOLR-5146 > Project: Solr > Issue Type: Improvement > Components: SolrCloud >Affects Versions: 4.5, 6.0 >Reporter: Erick Erickson >Assignee: David Smiley >Priority: Major > > The whole lazy-load core thing was implemented with non-SolrCloud use-cases > in mind. There are several user-list threads that ask about using lazy cores > with SolrCloud, especially in multi-tenant use-cases. > This is a marker JIRA to investigate what it would take to make lazy-load > cores play nice with SolrCloud. It's especially interesting how this all > works with shards, replicas, leader election, recovery, etc. > NOTE: This is pretty much totally unexplored territory. It may be that a few > trivial modifications are all that's needed. OTOH, It may be that we'd have > to rip apart SolrCloud to handle this case. Until someone dives into the > code, we don't know.
[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032601#comment-17032601 ] Tomoko Uchida commented on LUCENE-9201: --- Thank you [~rcmuir] for your work and comments. I updated the PR (refactored the Gradle tasks and ported the Ant build details as much as I could). I hope it is a good starting point, if not perfect. Some of the Ant scripts' hacks are still not ported, especially the "ecj-macro" stuff, which I cannot figure out how to carry over to Gradle.
[jira] [Resolved] (LUCENE-9194) Simplify XYShapeXQuery API
[ https://issues.apache.org/jira/browse/LUCENE-9194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ignacio Vera resolved LUCENE-9194. -- Fix Version/s: 8.5 Assignee: Ignacio Vera Resolution: Fixed master: 73dbf6d06108e9f18423521e339230bda37f8524 branch 8.x: 5c1f2ca22a756b16f0e35aa5dde221578fe1ce76 > Simplify XYShapeXQuery API > --- > > Key: LUCENE-9194 > URL: https://issues.apache.org/jira/browse/LUCENE-9194 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Ignacio Vera >Assignee: Ignacio Vera >Priority: Minor > Fix For: 8.5 > > Time Spent: 10m > Remaining Estimate: 0h > > Similar to what was done in LUCENE-9141 simplify XYShape queries. > > This change will allow as well to make most of the internal geo classes > package private.
[GitHub] [lucene-solr] markharwood edited a comment on issue #1234: Add compression for Binary doc value fields
markharwood edited a comment on issue #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#issuecomment-583539216 >Strange that Mark would measure 4x slowdown from decoding the lengths... Perhaps the random bytes are not totally incompressible, just barely compressible? I may have been too hasty in that reply - I've not been able to reproduce that and the raw vs compressed timings are very similar in the additional tests I've done so echo what @jpountz expects. My first (faster) run had random bytes selected in the range 0-20 and not the 0-127 range where I'm seeing parity This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] iverase merged pull request #1224: LUCENE-9194: Simplify XYShapeQuery API
iverase merged pull request #1224: LUCENE-9194: Simplify XYShapeQuery API URL: https://github.com/apache/lucene-solr/pull/1224
[GitHub] [lucene-solr] msokolov commented on issue #1234: Add compression for Binary doc value fields
msokolov commented on issue #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#issuecomment-583538389 Strange that Mark would measure 4x slowdown from decoding the lengths... Perhaps the random bytes are not totally incompressible, just barely compressible?
[GitHub] [lucene-solr] jpountz commented on issue #1234: Add compression for Binary doc value fields
jpountz commented on issue #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#issuecomment-583536606 @msokolov FWIW LZ4 only removes duplicate strings from a stream: when it finds one it inserts a reference to a previous sequence of bytes. In the special case that the content is incompressible, the LZ4 compressed data just consists of the number of bytes followed by the bytes, so the only overhead compared to reading the bytes directly is the decoding of the number of bytes, which should be rather low. I don't have a preference regarding whether we should have an explicit "not-compressed" case, but I understand how not having one helps keep things simpler.
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields
jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r376529195 ## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java ## @@ -742,6 +755,131 @@ public BytesRef binaryValue() throws IOException { }; } } + } + + // Decompresses blocks of binary values to retrieve content + class BinaryDecoder { + +private final LongValues addresses; +private final IndexInput compressedData; +// Cache of last uncompressed block +private long lastBlockId = -1; +private int []uncompressedDocEnds = new int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; +private int uncompressedBlockLength = 0; +private int numDocsInBlock = 0; +private final byte[] uncompressedBlock; +private final BytesRef uncompressedBytesRef; + +public BinaryDecoder(LongValues addresses, IndexInput compressedData, int biggestUncompressedBlockSize) { + super(); + this.addresses = addresses; + this.compressedData = compressedData; + // pre-allocate a byte array large enough for the biggest uncompressed block needed. + this.uncompressedBlock = new byte[biggestUncompressedBlockSize]; + uncompressedBytesRef = new BytesRef(uncompressedBlock); + +} + +BytesRef decode(int docNumber) throws IOException { + int blockId = docNumber >> Lucene80DocValuesFormat.BINARY_BLOCK_SHIFT; + int docInBlockId = docNumber % Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK; + assert docInBlockId < Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK; + + + // already read and uncompressed? 
+ if (blockId != lastBlockId) { +lastBlockId = blockId; +long blockStartOffset = addresses.get(blockId); +compressedData.seek(blockStartOffset); + +numDocsInBlock = compressedData.readVInt(); +assert numDocsInBlock <= Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK; +uncompressedDocEnds = new int[numDocsInBlock]; +uncompressedBlockLength = 0; + +int onlyLength = -1; +for (int i = 0; i < numDocsInBlock; i++) { + if (i == 0) { +// The first length value is special. It is shifted and has a bit to denote if +// all other values are the same length +int lengthPlusSameInd = compressedData.readVInt(); +int sameIndicator = lengthPlusSameInd & 1; +int firstValLength = lengthPlusSameInd >>1; Review comment: Since you are stealing a bit, we should do an unsigned shift (`>>>`) instead. This would never be a problem in practice, but imagine that the length was a 31-bit integer. Shifting by one bit to the left at index time would make this number negative. So here we need an unsigned shift rather than a signed shift that preserves the sign.
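The review point about the stolen bit can be illustrated concretely (hypothetical values, not the actual codec code): with a 31-bit length, the shifted-and-flagged encoding goes negative, and only the unsigned shift recovers the length:

```java
public class ShiftDemo {
    public static void main(String[] args) {
        int length = 1 << 30;              // a large 31-bit length
        int encoded = (length << 1) | 1;   // steal the low bit as a flag
        System.out.println("encoded        = " + encoded);          // negative!
        System.out.println("signed   >> 1  = " + (encoded >> 1));   // wrong: sign-extended
        System.out.println("unsigned >>> 1 = " + (encoded >>> 1));  // right: original length
        System.out.println("flag           = " + (encoded & 1));
    }
}
```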
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields
jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r376527753 ## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java ## @@ -742,6 +755,131 @@ public BytesRef binaryValue() throws IOException { }; } } + } + + // Decompresses blocks of binary values to retrieve content + class BinaryDecoder { + +private final LongValues addresses; +private final IndexInput compressedData; +// Cache of last uncompressed block +private long lastBlockId = -1; +private int []uncompressedDocEnds = new int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; +private int uncompressedBlockLength = 0; +private int numDocsInBlock = 0; +private final byte[] uncompressedBlock; +private final BytesRef uncompressedBytesRef; + +public BinaryDecoder(LongValues addresses, IndexInput compressedData, int biggestUncompressedBlockSize) { + super(); + this.addresses = addresses; + this.compressedData = compressedData; + // pre-allocate a byte array large enough for the biggest uncompressed block needed. + this.uncompressedBlock = new byte[biggestUncompressedBlockSize]; + uncompressedBytesRef = new BytesRef(uncompressedBlock); + +} + +BytesRef decode(int docNumber) throws IOException { + int blockId = docNumber >> Lucene80DocValuesFormat.BINARY_BLOCK_SHIFT; + int docInBlockId = docNumber % Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK; + assert docInBlockId < Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK; + + + // already read and uncompressed? 
+ if (blockId != lastBlockId) { +lastBlockId = blockId; +long blockStartOffset = addresses.get(blockId); +compressedData.seek(blockStartOffset); + +numDocsInBlock = compressedData.readVInt(); +assert numDocsInBlock <= Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK; +uncompressedDocEnds = new int[numDocsInBlock]; +uncompressedBlockLength = 0; + +int onlyLength = -1; +for (int i = 0; i < numDocsInBlock; i++) { + if (i == 0) { +// The first length value is special. It is shifted and has a bit to denote if +// all other values are the same length +int lengthPlusSameInd = compressedData.readVInt(); +int sameIndicator = lengthPlusSameInd & 1; +int firstValLength = lengthPlusSameInd >>1; +if (sameIndicator == 1) { + onlyLength = firstValLength; +} +uncompressedBlockLength += firstValLength; + } else { +if (onlyLength == -1) { + // Various lengths are stored - read each from disk + uncompressedBlockLength += compressedData.readVInt(); +} else { + // Only one length + uncompressedBlockLength += onlyLength; +} + } + uncompressedDocEnds[i] = uncompressedBlockLength; Review comment: maybe we could call it `uncompressedDocStarts` and set the index at `i+1`, which would then help remove the else block of the `docInBlockId > 0` condition below?
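The `uncompressedDocStarts` variant suggested in the review can be sketched like this (hypothetical lengths, not the actual producer code); storing the start at `i + 1` gives every doc, including the first, the same `[start, end)` lookup with no special case:

```java
public class DocStartsDemo {
    public static void main(String[] args) {
        int[] lengths = {3, 5, 2};                  // hypothetical per-doc value lengths
        int[] docStarts = new int[lengths.length + 1];
        for (int i = 0; i < lengths.length; i++) {
            docStarts[i + 1] = docStarts[i] + lengths[i];  // prefix sum of lengths
        }
        // doc d occupies [docStarts[d], docStarts[d + 1]) — no branch for d == 0
        for (int d = 0; d < lengths.length; d++) {
            System.out.println("doc " + d + ": start=" + docStarts[d]
                    + " length=" + (docStarts[d + 1] - docStarts[d]));
        }
    }
}
```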
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields
jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r376532189 ## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java ## @@ -742,6 +755,131 @@ public BytesRef binaryValue() throws IOException { }; } } + } + + // Decompresses blocks of binary values to retrieve content + class BinaryDecoder { + +private final LongValues addresses; +private final IndexInput compressedData; +// Cache of last uncompressed block +private long lastBlockId = -1; +private int []uncompressedDocEnds = new int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; +private int uncompressedBlockLength = 0; +private int numDocsInBlock = 0; +private final byte[] uncompressedBlock; +private final BytesRef uncompressedBytesRef; + +public BinaryDecoder(LongValues addresses, IndexInput compressedData, int biggestUncompressedBlockSize) { + super(); + this.addresses = addresses; + this.compressedData = compressedData; + // pre-allocate a byte array large enough for the biggest uncompressed block needed. + this.uncompressedBlock = new byte[biggestUncompressedBlockSize]; + uncompressedBytesRef = new BytesRef(uncompressedBlock); + +} + +BytesRef decode(int docNumber) throws IOException { + int blockId = docNumber >> Lucene80DocValuesFormat.BINARY_BLOCK_SHIFT; + int docInBlockId = docNumber % Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK; + assert docInBlockId < Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK; + + + // already read and uncompressed? + if (blockId != lastBlockId) { +lastBlockId = blockId; +long blockStartOffset = addresses.get(blockId); +compressedData.seek(blockStartOffset); + +numDocsInBlock = compressedData.readVInt(); +assert numDocsInBlock <= Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK; +uncompressedDocEnds = new int[numDocsInBlock]; Review comment: can we reuse the same array across blocks? 
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields
jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r376531952 ## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java ## @@ -742,6 +755,131 @@ public BytesRef binaryValue() throws IOException { }; } } + } + + // Decompresses blocks of binary values to retrieve content + class BinaryDecoder { + +private final LongValues addresses; +private final IndexInput compressedData; +// Cache of last uncompressed block +private long lastBlockId = -1; +private int []uncompressedDocEnds = new int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; Review comment: in the past we've put these constants in the meta file and BinaryEntry so that it's easier to change values over time
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields
jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r376528169 ## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java ## @@ -742,6 +755,131 @@ public BytesRef binaryValue() throws IOException { }; } } + } + + // Decompresses blocks of binary values to retrieve content + class BinaryDecoder { + +private final LongValues addresses; +private final IndexInput compressedData; +// Cache of last uncompressed block +private long lastBlockId = -1; +private int []uncompressedDocEnds = new int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; +private int uncompressedBlockLength = 0; +private int numDocsInBlock = 0; +private final byte[] uncompressedBlock; +private final BytesRef uncompressedBytesRef; + +public BinaryDecoder(LongValues addresses, IndexInput compressedData, int biggestUncompressedBlockSize) { + super(); + this.addresses = addresses; + this.compressedData = compressedData; + // pre-allocate a byte array large enough for the biggest uncompressed block needed. + this.uncompressedBlock = new byte[biggestUncompressedBlockSize]; + uncompressedBytesRef = new BytesRef(uncompressedBlock); + +} + +BytesRef decode(int docNumber) throws IOException { + int blockId = docNumber >> Lucene80DocValuesFormat.BINARY_BLOCK_SHIFT; + int docInBlockId = docNumber % Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK; + assert docInBlockId < Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK; + + + // already read and uncompressed? + if (blockId != lastBlockId) { +lastBlockId = blockId; +long blockStartOffset = addresses.get(blockId); +compressedData.seek(blockStartOffset); + +numDocsInBlock = compressedData.readVInt(); Review comment: do we really need to record the number of documents in the block? It should be 32 for all blocks except for the last one? 
Maybe at index-time we could append dummy values to the last block to make sure it has 32 values too, and we wouldn't need this vInt anymore?
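The suggestion to drop the stored count can also be sketched as deriving it from totals the reader already knows (assumed constants, not the actual format change):

```java
public class DocsInBlockDemo {
    static int docsInBlock(int blockId, int totalDocs, int docsPerBlock) {
        // every block is full except possibly the last one
        return Math.min(docsPerBlock, totalDocs - blockId * docsPerBlock);
    }

    public static void main(String[] args) {
        int totalDocs = 100, docsPerBlock = 32;     // hypothetical totals
        int numBlocks = (totalDocs + docsPerBlock - 1) / docsPerBlock;
        for (int b = 0; b < numBlocks; b++) {
            System.out.println("block " + b + ": " + docsInBlock(b, totalDocs, docsPerBlock));
        }
    }
}
```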
[GitHub] [lucene-solr] rmuir commented on issue #1236: Add back assertions removed by LUCENE-9187.
rmuir commented on issue #1236: Add back assertions removed by LUCENE-9187. URL: https://github.com/apache/lucene-solr/pull/1236#issuecomment-583534489 +1, thanks
[GitHub] [lucene-solr] markharwood commented on issue #1234: Add compression for Binary doc value fields
markharwood commented on issue #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#issuecomment-583529462 >Did you also test read performance in this incompressible case? Just tried it and it does look 4x faster reading raw random bytes vs compressed random bytes
[GitHub] [lucene-solr] jpountz commented on issue #1234: Add compression for Binary doc value fields
jpountz commented on issue #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#issuecomment-583529199 In the case of content that can't be compressed, the compressed data will consist of the number of bytes, followed by the bytes. So decompressing consists of decoding the length and then reading the bytes. The only overhead compared to reading bytes directly is the decoding of the number of bytes, so I would believe that the overhead is rather small. I don't have a strong preference regarding whether this case should be handled explicitly or not. It's true that not having a special "not-compressed" case helps keep the logic simpler.
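The incompressible case described here corresponds to an LZ4 block that is a single literal-only sequence: a token byte whose high nibble holds the literal length (15 meaning extra length bytes follow), then the raw bytes. A minimal sketch of just that case (not a full LZ4 codec, and not Lucene's implementation):

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

public class Lz4LiteralOnly {
    // encode: token (high nibble = literal length), optional extra length bytes, literals
    static byte[] encode(byte[] literals) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int len = literals.length;
        if (len < 15) {
            out.write(len << 4);
        } else {
            out.write(15 << 4);                       // 15 signals extended length
            int rest = len - 15;
            while (rest >= 255) { out.write(255); rest -= 255; }
            out.write(rest);
        }
        out.write(literals, 0, len);
        return out.toByteArray();
    }

    static byte[] decode(byte[] block) {
        int pos = 0;
        int litLen = (block[pos++] & 0xFF) >>> 4;
        if (litLen == 15) {                           // add bytes until one is < 255
            int b;
            do { b = block[pos++] & 0xFF; litLen += b; } while (b == 255);
        }
        return Arrays.copyOfRange(block, pos, pos + litLen);
    }

    public static void main(String[] args) {
        byte[] data = new byte[20];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        byte[] encoded = encode(data);
        // the entire per-value cost of "compressing" incompressible data is the length header
        System.out.println("overhead bytes = " + (encoded.length - data.length));
        System.out.println("roundtrip ok = " + Arrays.equals(data, decode(encoded)));
    }
}
```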
[jira] [Commented] (LUCENE-9194) Simplify XYShapeXQuery API
[ https://issues.apache.org/jira/browse/LUCENE-9194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032560#comment-17032560 ] Ignacio Vera commented on LUCENE-9194: -- PR related to this change: https://github.com/apache/lucene-solr/pull/1224 > Simplify XYShapeXQuery API > --- > > Key: LUCENE-9194 > URL: https://issues.apache.org/jira/browse/LUCENE-9194 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Ignacio Vera >Priority: Minor > > Similar to what was done in LUCENE-9141 simplify XYShape queries. > > This change will allow as well to make most of the internal geo classes > package private.
[GitHub] [lucene-solr] msokolov edited a comment on issue #1234: Add compression for Binary doc value fields
msokolov edited a comment on issue #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#issuecomment-583519622 > The LZ4 compressed versions of this content were only marginally bigger than their raw counterparts Did you also test read performance in this incompressible case?
[GitHub] [lucene-solr] madrob opened a new pull request #1244: SOLR-14247 Remove unneeded sleeps
madrob opened a new pull request #1244: SOLR-14247 Remove unneeded sleeps URL: https://github.com/apache/lucene-solr/pull/1244 This test is slow because it sleeps a lot. Removing the sleeps, it still passes consistently on my machine, but I would like other folks to confirm this on their different hardware as well.

# Checklist

Please review the following and check all that apply:

- [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms to the standards described there to the best of my ability.
- [x] I have created a Jira issue and added the issue ID to my pull request title.
- [x] I have given Solr maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [x] I have developed this patch against the `master` branch.
- [x] I have run `ant precommit` and the appropriate test suite.
- [ ] ~I have added tests for my changes.~
- [ ] ~I have added documentation for the [Ref Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) (for Solr changes only).~
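The usual replacement for fixed sleeps in tests, in the spirit of this PR, is a bounded polling helper — a generic sketch, not Solr's actual test utilities:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

public class WaitForDemo {
    // poll a condition with a timeout instead of sleeping a fixed, worst-case duration
    static boolean waitFor(BooleanSupplier condition, long timeoutMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            Thread.sleep(10);  // short poll interval
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicBoolean done = new AtomicBoolean(false);
        new Thread(() -> {
            try { Thread.sleep(50); } catch (InterruptedException ignored) {}
            done.set(true);
        }).start();
        // returns as soon as the condition holds, not after a blind multi-second sleep
        System.out.println("condition met = " + waitFor(done::get, 2000));
    }
}
```

The test finishes in roughly the time the condition actually takes, which is why removing blind sleeps tends to be safe on faster hardware and only the timeout bound matters on slower hardware.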
[GitHub] [lucene-solr] alessandrobenedetti commented on issue #357: [SOLR-12238] Synonym Queries boost by payload
alessandrobenedetti commented on issue #357: [SOLR-12238] Synonym Queries boost by payload URL: https://github.com/apache/lucene-solr/pull/357#issuecomment-583518344 I have applied the changes to address the feedback points and consequently added tests to cover some missing scenarios. We should be almost ready to go :)
[GitHub] [lucene-solr] alessandrobenedetti commented on a change in pull request #357: [SOLR-12238] Synonym Queries boost by payload
alessandrobenedetti commented on a change in pull request #357: [SOLR-12238] Synonym Queries boost by payload
URL: https://github.com/apache/lucene-solr/pull/357#discussion_r376513280

## File path: lucene/core/src/java/org/apache/lucene/util/QueryBuilder.java
## @@ -450,9 +485,13 @@ protected Query analyzePhrase(String field, TokenStream stream, int slop) throws

         position += 1;
       }
       builder.add(new Term(field, termAtt.getBytesRef()), position);
+      phraseBoost = boostAtt.getBoost();

Review comment: I implemented a simple multiplicative boost. It is backward compatible with the designed use case (multi-term synonym -> single concept -> single boost, e.g. panthera onca => jaguar|0.95, big cat|0.85, black panther|0.65). But it is also compatible with non-synonym cases, if the user needs a boost per token in phrase and span queries. It's in the upcoming commit; let me know if you believe something different is necessary.
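The multiplicative combination described in this comment can be illustrated with a small stdlib-only sketch (hypothetical helper, not the PR's actual QueryBuilder code): each token's boost is folded into one phrase boost, with 1.0f as the identity, so unboosted terms leave the phrase boost unchanged.

```java
public class PhraseBoostSketch {
    // Fold per-term boosts into a single phrase boost multiplicatively;
    // a term without an explicit boost contributes the identity 1.0f.
    static float combineBoosts(float[] termBoosts) {
        float phraseBoost = 1.0f;
        for (float b : termBoosts) {
            phraseBoost *= b;
        }
        return phraseBoost;
    }

    public static void main(String[] args) {
        // e.g. the synonym phrase "big cat" carrying boosts 0.85 and 1.0
        System.out.println(combineBoosts(new float[] {0.85f, 1.0f}));
    }
}
```

This also addresses the earlier review concern that taking only the last term's boost loses information: multiplication commutes, so every term in the phrase contributes.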
[jira] [Commented] (SOLR-14245) Validate Replica / ReplicaInfo on creation
[ https://issues.apache.org/jira/browse/SOLR-14245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032531#comment-17032531 ] ASF subversion and git services commented on SOLR-14245: Commit f8163439ffbb36876f236551f8322a5e5851ba87 in lucene-solr's branch refs/heads/branch_8x from Andrzej Bialecki [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=f816343 ] SOLR-14245: Validate Replica / ReplicaInfo on creation. > Validate Replica / ReplicaInfo on creation > -- > > Key: SOLR-14245 > URL: https://issues.apache.org/jira/browse/SOLR-14245 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki >Priority: Minor > Fix For: 8.5 > > > Replica / ReplicaInfo should be immutable and their fields should be > validated on creation. > Some users reported that very rarely during a failed collection CREATE or > DELETE, or when the Overseer task queue becomes corrupted, Solr may write to > ZK incomplete replica infos (eg. node_name = null). > This problem is difficult to reproduce but we should add safeguards anyway to > prevent writing such corrupted replica info to ZK. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] alessandrobenedetti commented on a change in pull request #357: [SOLR-12238] Synonym Queries boost by payload
alessandrobenedetti commented on a change in pull request #357: [SOLR-12238] Synonym Queries boost by payload
URL: https://github.com/apache/lucene-solr/pull/357#discussion_r376503587

## File path: lucene/core/src/java/org/apache/lucene/util/QueryBuilder.java
## @@ -509,33 +549,40 @@ protected Query analyzeGraphBoolean(String field, TokenStream source, BooleanCla

       end = articulationPoints[i];
     }
     lastState = end;
-    final Query queryPos;
+    final Query positionalQuery;
     if (graph.hasSidePath(start)) {
-      final Iterator<TokenStream> it = graph.getFiniteStrings(start, end);
+      final Iterator<TokenStream> sidePathsIterator = graph.getFiniteStrings(start, end);
       Iterator<Query> queries = new Iterator<Query>() {
         @Override
         public boolean hasNext() {
-          return it.hasNext();
+          return sidePathsIterator.hasNext();
         }
         @Override
         public Query next() {
-          TokenStream ts = it.next();
-          return createFieldQuery(ts, BooleanClause.Occur.MUST, field, getAutoGenerateMultiTermSynonymsPhraseQuery(), 0);
+          TokenStream sidePath = sidePathsIterator.next();
+          return createFieldQuery(sidePath, BooleanClause.Occur.MUST, field, getAutoGenerateMultiTermSynonymsPhraseQuery(), 0);
         }
       };
-      queryPos = newGraphSynonymQuery(queries);
+      positionalQuery = newGraphSynonymQuery(queries);
     } else {
-      Term[] terms = graph.getTerms(field, start);
+      List<AttributeSource> attributes = graph.getTerms(start);

Review comment: a tentative change is coming in the next commit; I also added a few tests to cover that else branch
[jira] [Commented] (SOLR-14245) Validate Replica / ReplicaInfo on creation
[ https://issues.apache.org/jira/browse/SOLR-14245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032528#comment-17032528 ] ASF subversion and git services commented on SOLR-14245: Commit 9a190935869a5fba8c4935f85988fe712066c465 in lucene-solr's branch refs/heads/master from Andrzej Bialecki [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=9a19093 ] SOLR-14245: Validate Replica / ReplicaInfo on creation.

> Validate Replica / ReplicaInfo on creation > -- > > Key: SOLR-14245 > URL: https://issues.apache.org/jira/browse/SOLR-14245 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki >Priority: Minor > Fix For: 8.5 > > > Replica / ReplicaInfo should be immutable and their fields should be > validated on creation. > Some users reported that very rarely during a failed collection CREATE or > DELETE, or when the Overseer task queue becomes corrupted, Solr may write to > ZK incomplete replica infos (eg. node_name = null). > This problem is difficult to reproduce but we should add safeguards anyway to > prevent writing such corrupted replica info to ZK.
[GitHub] [lucene-solr] alessandrobenedetti commented on a change in pull request #357: [SOLR-12238] Synonym Queries boost by payload
alessandrobenedetti commented on a change in pull request #357: [SOLR-12238] Synonym Queries boost by payload
URL: https://github.com/apache/lucene-solr/pull/357#discussion_r376478661

## File path: solr/core/src/test-files/solr/collection1/conf/schema12.xml
## @@ -238,6 +227,18 @@

+

Review comment: Fixed in the upcoming commit!
[GitHub] [lucene-solr] alessandrobenedetti commented on a change in pull request #357: [SOLR-12238] Synonym Queries boost by payload
alessandrobenedetti commented on a change in pull request #357: [SOLR-12238] Synonym Queries boost by payload
URL: https://github.com/apache/lucene-solr/pull/357#discussion_r376476976

## File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/boost/DelimitedBoostTokenFilter.java
## @@ -0,0 +1,63 @@

+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.boost;
+
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+import org.apache.lucene.search.BoostAttribute;
+
+import java.io.IOException;
+
+/**
+ * Characters before the delimiter are the "token", those after are the boost.
+ *
+ * For example, if the delimiter is '|', then for the string "foo|0.7", foo is the token
+ * and 0.7 is the boost.
+ *
+ * Note make sure your Tokenizer doesn't split on the delimiter, or this won't work
+ */
+public final class DelimitedBoostTokenFilter extends TokenFilter {
+  private final char delimiter;
+  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
+  private final BoostAttribute boostAtt = addAttribute(BoostAttribute.class);
+
+  public DelimitedBoostTokenFilter(TokenStream input, char delimiter) {
+    super(input);
+    this.delimiter = delimiter;
+  }
+
+  @Override
+  public boolean incrementToken() throws IOException {
+    if (input.incrementToken()) {
+      final char[] buffer = termAtt.buffer();
+      final int length = termAtt.length();
+      for (int i = 0; i < length; i++) {
+        if (buffer[i] == delimiter) {
+          float boost = Float.parseFloat(new String(buffer, i + 1, (length - (i + 1))));
+          boostAtt.setBoost(boost);
+          termAtt.setLength(i);
+          return true;
+        }
+      }
+      // we have not seen the delimiter
+      boostAtt.setBoost(1.0f);

Review comment: Fixed in the upcoming commit
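The filter's delimiter-splitting logic can be exercised in isolation with a stdlib-only sketch (a hypothetical helper with no Lucene dependency): everything before the first delimiter is the token text, everything after it is parsed as a float boost, and a token without the delimiter keeps the default boost of 1.0.

```java
public class DelimitedBoostSketch {
    // Mirror of the filter's loop: split at the first delimiter;
    // no delimiter means the token keeps the default boost of 1.0f.
    static Object[] splitTokenAndBoost(String tokenText, char delimiter) {
        int i = tokenText.indexOf(delimiter);
        if (i < 0) {
            return new Object[] {tokenText, 1.0f};
        }
        return new Object[] {tokenText.substring(0, i),
                             Float.parseFloat(tokenText.substring(i + 1))};
    }

    public static void main(String[] args) {
        Object[] boosted = splitTokenAndBoost("foo|0.7", '|');
        System.out.println(boosted[0] + " " + boosted[1]);
        Object[] plain = splitTokenAndBoost("bar", '|');
        System.out.println(plain[0] + " " + plain[1]);
    }
}
```

As the javadoc above warns, this only works if the tokenizer does not split on the delimiter character itself.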
[GitHub] [lucene-solr] alessandrobenedetti commented on a change in pull request #357: [SOLR-12238] Synonym Queries boost by payload
alessandrobenedetti commented on a change in pull request #357: [SOLR-12238] Synonym Queries boost by payload
URL: https://github.com/apache/lucene-solr/pull/357#discussion_r376476198

## File path: solr/core/src/java/org/apache/solr/schema/TextField.java
## @@ -43,6 +43,7 @@ public class TextField extends FieldType {

   protected boolean autoGeneratePhraseQueries;
   protected boolean enableGraphQueries;
+  protected boolean synonymBoostByPayload;

Review comment: agreed and fixed!
[GitHub] [lucene-solr] romseygeek commented on a change in pull request #357: [SOLR-12238] Synonym Queries boost by payload
romseygeek commented on a change in pull request #357: [SOLR-12238] Synonym Queries boost by payload
URL: https://github.com/apache/lucene-solr/pull/357#discussion_r376473778

## File path: lucene/core/src/java/org/apache/lucene/util/QueryBuilder.java
## @@ -450,9 +485,13 @@ protected Query analyzePhrase(String field, TokenStream stream, int slop) throws

         position += 1;
       }
       builder.add(new Term(field, termAtt.getBytesRef()), position);
+      phraseBoost = boostAtt.getBoost();

Review comment: I think this isn't quite right, because we need to combine boosts together somehow; currently your phrase boost is just the boost of the last term in the phrase.
[GitHub] [lucene-solr] romseygeek commented on a change in pull request #357: [SOLR-12238] Synonym Queries boost by payload
romseygeek commented on a change in pull request #357: [SOLR-12238] Synonym Queries boost by payload
URL: https://github.com/apache/lucene-solr/pull/357#discussion_r376474333

## File path: lucene/core/src/java/org/apache/lucene/util/QueryBuilder.java
## @@ -509,33 +549,40 @@ protected Query analyzeGraphBoolean(String field, TokenStream source, BooleanCla

       end = articulationPoints[i];
     }
     lastState = end;
-    final Query queryPos;
+    final Query positionalQuery;
     if (graph.hasSidePath(start)) {
-      final Iterator<TokenStream> it = graph.getFiniteStrings(start, end);
+      final Iterator<TokenStream> sidePathsIterator = graph.getFiniteStrings(start, end);
       Iterator<Query> queries = new Iterator<Query>() {
         @Override
         public boolean hasNext() {
-          return it.hasNext();
+          return sidePathsIterator.hasNext();
        }
         @Override
         public Query next() {
-          TokenStream ts = it.next();
-          return createFieldQuery(ts, BooleanClause.Occur.MUST, field, getAutoGenerateMultiTermSynonymsPhraseQuery(), 0);
+          TokenStream sidePath = sidePathsIterator.next();
+          return createFieldQuery(sidePath, BooleanClause.Occur.MUST, field, getAutoGenerateMultiTermSynonymsPhraseQuery(), 0);
         }
       };
-      queryPos = newGraphSynonymQuery(queries);
+      positionalQuery = newGraphSynonymQuery(queries);
     } else {
-      Term[] terms = graph.getTerms(field, start);
+      List<AttributeSource> attributes = graph.getTerms(start);

Review comment: This is what GraphTokenStreamFiniteStrings returns currently, for multiple tokens at the same position. Maybe `TermAndBoost[]` would make more sense though.
[GitHub] [lucene-solr] alessandrobenedetti commented on issue #357: [SOLR-12238] Synonym Queries boost by payload
alessandrobenedetti commented on issue #357: [SOLR-12238] Synonym Queries boost by payload
URL: https://github.com/apache/lucene-solr/pull/357#issuecomment-583474019

hi @romseygeek, @dsmiley, first of all, thank you again for your patience and very useful insights. I have incorporated Alan's changes and cleaned everything up. My unresolved questions:
- BoostAttribute doesn't use BytesRef but a float directly; is that a concern? We expect to use it at query time, so we might actually see a minimal query-time benefit from not encoding/decoding.
- Alan expressed concerns over SpanBoostQuery, mentioning it is sort of broken; what should we do in that regard? Right now the created span query seems to work as expected with boosted synonyms (see the related test). I suspect that if SpanBoostQuery is broken, it should be resolved in another ticket?
- from an original comment in the test code org.apache.solr.search.TestSolrQueryParser#testSynonymQueryStyle: "confirm autoGeneratePhraseQueries always builds OR queries" - I changed that; was there any reason for that behaviour?
[jira] [Commented] (SOLR-12238) Synonym Query Style Boost By Payload
[ https://issues.apache.org/jira/browse/SOLR-12238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032489#comment-17032489 ] Alessandro Benedetti commented on SOLR-12238: - hi [~dsmiley], [~romseygeek], first of all, thank you again for your patience and very useful insights. The child Lucene issue and pull request have been updated incorporating Alan's suggestions.

> Synonym Query Style Boost By Payload > > > Key: SOLR-12238 > URL: https://issues.apache.org/jira/browse/SOLR-12238 > Project: Solr > Issue Type: Improvement > Components: query parsers >Affects Versions: 7.2 >Reporter: Alessandro Benedetti >Priority: Major > Attachments: SOLR-12238.patch, SOLR-12238.patch, SOLR-12238.patch, > SOLR-12238.patch > > Time Spent: 2h 50m > Remaining Estimate: 0h > > This improvement is built on top of the Synonym Query Style feature and > brings the possibility of boosting synonym queries using the associated payload. > It introduces two new modalities for the Synonym Query Style: > PICK_BEST_BOOST_BY_PAYLOAD -> build a disjunction query with the clauses > boosted by payload > AS_DISTINCT_TERMS_BOOST_BY_PAYLOAD -> build a boolean query with the clauses > boosted by payload > These new synonym query styles assume payloads are available, so they must > be used in conjunction with a token filter able to produce payloads. > A synonyms.txt example could be: > # Synonyms used by Payload Boost > tiger => tiger|1.0, Big_Cat|0.8, Shere_Khan|0.9 > leopard => leopard, Big_Cat|0.8, Bagheera|0.9 > lion => lion|1.0, panthera leo|0.99, Simba|0.8 > snow_leopard => panthera uncia|0.99, snow leopard|1.0 > A simple token filter to populate the payloads from such a synonyms.txt is: > <filter class="solr.DelimitedPayloadTokenFilterFactory" delimiter="|"/>
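The scoring difference between the two styles described above can be modeled with a toy stdlib-only sketch (illustrative only; the real feature builds Lucene queries, not scores directly): PICK_BEST_BOOST_BY_PAYLOAD behaves like a disjunction-max, keeping the best boosted clause, while AS_DISTINCT_TERMS_BOOST_BY_PAYLOAD behaves like a boolean SHOULD, summing the boosted clauses.

```java
public class SynonymBoostStylesSketch {
    // PICK_BEST: like a disjunction-max, keep only the best boosted clause score
    static double pickBest(double[] clauseScores, double[] boosts) {
        double best = 0.0;
        for (int i = 0; i < clauseScores.length; i++) {
            best = Math.max(best, clauseScores[i] * boosts[i]);
        }
        return best;
    }

    // AS_DISTINCT_TERMS: like a boolean SHOULD, sum the boosted clause scores
    static double asDistinctTerms(double[] clauseScores, double[] boosts) {
        double sum = 0.0;
        for (int i = 0; i < clauseScores.length; i++) {
            sum += clauseScores[i] * boosts[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        // e.g. tiger|1.0 and Big_Cat|0.8, each clause matching with score 1.0
        double[] scores = {1.0, 1.0};
        double[] boosts = {1.0, 0.8};
        System.out.println(pickBest(scores, boosts) + " " + asDistinctTerms(scores, boosts));
    }
}
```

With both clauses matching, PICK_BEST keeps only the highest boosted score while AS_DISTINCT_TERMS rewards documents matching several synonym variants.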
[GitHub] [lucene-solr] dsmiley commented on a change in pull request #357: [SOLR-12238] Synonym Queries boost by payload
dsmiley commented on a change in pull request #357: [SOLR-12238] Synonym Queries boost by payload
URL: https://github.com/apache/lucene-solr/pull/357#discussion_r376460611

## File path: solr/core/src/test-files/solr/collection1/conf/schema12.xml
## @@ -238,6 +227,18 @@

+

Review comment: You can remove "payload" everywhere from this PR now; no?
[GitHub] [lucene-solr] dsmiley commented on a change in pull request #357: [SOLR-12238] Synonym Queries boost by payload
dsmiley commented on a change in pull request #357: [SOLR-12238] Synonym Queries boost by payload
URL: https://github.com/apache/lucene-solr/pull/357#discussion_r376450137

## File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/boost/DelimitedBoostTokenFilter.java
## @@ -0,0 +1,63 @@

+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.boost;
+
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+import org.apache.lucene.search.BoostAttribute;
+
+import java.io.IOException;
+
+/**
+ * Characters before the delimiter are the "token", those after are the boost.
+ *
+ * For example, if the delimiter is '|', then for the string "foo|0.7", foo is the token
+ * and 0.7 is the boost.
+ *
+ * Note make sure your Tokenizer doesn't split on the delimiter, or this won't work
+ */
+public final class DelimitedBoostTokenFilter extends TokenFilter {
+  private final char delimiter;
+  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
+  private final BoostAttribute boostAtt = addAttribute(BoostAttribute.class);
+
+  public DelimitedBoostTokenFilter(TokenStream input, char delimiter) {
+    super(input);
+    this.delimiter = delimiter;
+  }
+
+  @Override
+  public boolean incrementToken() throws IOException {
+    if (input.incrementToken()) {
+      final char[] buffer = termAtt.buffer();
+      final int length = termAtt.length();
+      for (int i = 0; i < length; i++) {
+        if (buffer[i] == delimiter) {
+          float boost = Float.parseFloat(new String(buffer, i + 1, (length - (i + 1))));
+          boostAtt.setBoost(boost);
+          termAtt.setLength(i);
+          return true;
+        }
+      }
+      // we have not seen the delimiter
+      boostAtt.setBoost(1.0f);

Review comment: Shouldn't be needed; leave the boost be -- it defaults to 1.0 anyway.
[GitHub] [lucene-solr] dsmiley commented on a change in pull request #357: [SOLR-12238] Synonym Queries boost by payload
dsmiley commented on a change in pull request #357: [SOLR-12238] Synonym Queries boost by payload
URL: https://github.com/apache/lucene-solr/pull/357#discussion_r376455962

## File path: lucene/core/src/java/org/apache/lucene/util/QueryBuilder.java
## @@ -509,33 +549,40 @@ protected Query analyzeGraphBoolean(String field, TokenStream source, BooleanCla

       end = articulationPoints[i];
     }
     lastState = end;
-    final Query queryPos;
+    final Query positionalQuery;
     if (graph.hasSidePath(start)) {
-      final Iterator<TokenStream> it = graph.getFiniteStrings(start, end);
+      final Iterator<TokenStream> sidePathsIterator = graph.getFiniteStrings(start, end);
       Iterator<Query> queries = new Iterator<Query>() {
         @Override
         public boolean hasNext() {
-          return it.hasNext();
+          return sidePathsIterator.hasNext();
        }
         @Override
         public Query next() {
-          TokenStream ts = it.next();
-          return createFieldQuery(ts, BooleanClause.Occur.MUST, field, getAutoGenerateMultiTermSynonymsPhraseQuery(), 0);
+          TokenStream sidePath = sidePathsIterator.next();
+          return createFieldQuery(sidePath, BooleanClause.Occur.MUST, field, getAutoGenerateMultiTermSynonymsPhraseQuery(), 0);
         }
       };
-      queryPos = newGraphSynonymQuery(queries);
+      positionalQuery = newGraphSynonymQuery(queries);
     } else {
-      Term[] terms = graph.getTerms(field, start);
+      List<AttributeSource> attributes = graph.getTerms(start);

Review comment: I think I mentioned a List of AttributeSource is weird (I've never seen this) and it's heavyweight. Why not a TokenStream or TermAndBoost[]?
[GitHub] [lucene-solr] dsmiley commented on a change in pull request #357: [SOLR-12238] Synonym Queries boost by payload
dsmiley commented on a change in pull request #357: [SOLR-12238] Synonym Queries boost by payload
URL: https://github.com/apache/lucene-solr/pull/357#discussion_r376459427

## File path: solr/core/src/java/org/apache/solr/schema/TextField.java
## @@ -43,6 +43,7 @@ public class TextField extends FieldType {

   protected boolean autoGeneratePhraseQueries;
   protected boolean enableGraphQueries;
+  protected boolean synonymBoostByPayload;

Review comment: I thought we switched the approach from a payload to a boost attribute? Besides, it's not clear we need this toggle at all, since the user could arrange for this behavior simply by having the new DelimitedBoost filter in the chain.
[jira] [Commented] (LUCENE-9171) Synonyms Boost by Payload
[ https://issues.apache.org/jira/browse/LUCENE-9171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032462#comment-17032462 ] Alessandro Benedetti commented on LUCENE-9171: -- hi [~romseygeek], first of all, thank you again for your patience and very useful insights. I have incorporated your changes and cleaned everything up. You will find the original PR updated. My unresolved questions:
- BoostAttribute doesn't use BytesRef but a float directly; is that a concern? We expect to use it at query time, so we might actually see a minimal query-time benefit from not encoding/decoding.
- you expressed concerns over SpanBoostQuery, mentioning it is sort of broken; what should we do in that regard? Right now the created span query seems to work as expected with boosted synonyms (see the related test). I suspect that if SpanBoostQuery is broken, it should be resolved in another ticket?
- from an original comment in the test code org.apache.solr.search.TestSolrQueryParser#testSynonymQueryStyle: "confirm autoGeneratePhraseQueries always builds OR queries" - I changed that; was there any reason for it?

> Synonyms Boost by Payload > - > > Key: LUCENE-9171 > URL: https://issues.apache.org/jira/browse/LUCENE-9171 > Project: Lucene - Core > Issue Type: New Feature > Components: core/queryparser >Reporter: Alessandro Benedetti >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > I have been working on the additional capability of boosting queries by term > payloads through a parameter that enables it in the Lucene query builder. > This has been done targeting the synonym query. > It is parametric, so it is meant to show no difference unless the feature is > enabled. > Solr has its bits to comply through its SynonymsQueryStyles
[GitHub] [lucene-solr] markharwood commented on issue #1234: Add compression for Binary doc value fields
markharwood commented on issue #1234: Add compression for Binary doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#issuecomment-583449275

There was a suggestion from @jimczi that we fall back to writing raw data if content doesn't compress well. I'm not sure this logic is worth developing, for the reasons outlined below:

I wrote a [compression buffer](https://gist.github.com/markharwood/91cc8d96d6611ad97df11f244b1b1d0f) to see what the compression algo outputs before deciding whether to write the compressed or raw data to disk. I tested with the most incompressible content I could imagine:

    public static void fillRandom(byte[] buffer, int length) {
        for (int i = 0; i < length; i++) {
            buffer[i] = (byte) (Math.random() * Byte.MAX_VALUE);
        }
    }

The LZ4 compressed versions of this content were only marginally bigger than their raw counterparts (adding 0.4% overhead to the original content, e.g. 96,921 compressed vs 96,541 raw bytes). On that basis I'm not sure it's worth doubling the memory costs of the indexing logic (we would require a temporary output buffer at least the same size as the raw data being compressed) and the additional byte shuffling.
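The compress-then-compare approach being weighed above can be sketched with the stdlib's java.util.zip.Deflater standing in for Lucene's LZ4 codec (an assumption for illustration; the PR uses LZ4): compress into a scratch buffer, and only keep the compressed bytes when they are actually smaller than the raw input.

```java
import java.util.zip.Deflater;

public class CompressOrRawSketch {
    // Compress into a scratch buffer first, then keep the compressed bytes only
    // when they beat the raw size; otherwise fall back to storing raw bytes.
    // The scratch buffer is the extra memory cost discussed in the comment above.
    static byte[] compressIfSmaller(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        byte[] scratch = new byte[raw.length + 64]; // room for worst-case overhead
        int written = deflater.deflate(scratch);
        boolean complete = deflater.finished();
        deflater.end();
        if (complete && written < raw.length) {
            byte[] compressed = new byte[written];
            System.arraycopy(scratch, 0, compressed, 0, written);
            return compressed;
        }
        return raw; // incompressible content: store raw
    }

    public static void main(String[] args) {
        byte[] zeros = new byte[10_000]; // highly compressible input
        System.out.println(compressIfSmaller(zeros).length < zeros.length);
    }
}
```

For incompressible input the fallback saves only the small per-block overhead, which is the basis of the argument above that the extra buffer is not worth it.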
[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032436#comment-17032436 ] Robert Muir commented on LUCENE-9201: - now that overview.html works, i tried investigating package.html problems. i can reproduce it and the problem is specific to gradle. switching to package-info.java is definitely a solution, but i can't stand unexplained mysteries.

> Port documentation-lint task to Gradle build > > > Key: LUCENE-9201 > URL: https://issues.apache.org/jira/browse/LUCENE-9201 > Project: Lucene - Core > Issue Type: Sub-task >Affects Versions: master (9.0) >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > Attachments: javadocGRADLE.png, javadocHTML4.png, javadocHTML5.png > > Time Spent: 10m > Remaining Estimate: 0h > > Ant build's "documentation-lint" target consists of those two sub targets. > * "-ecj-javadoc-lint" (Javadoc linting by ECJ) > * "-documentation-lint"(Missing javadocs / broken links check by python > scripts)
[jira] [Commented] (SOLR-14248) Improve ClusterStateMockUtil and make its methods public
[ https://issues.apache.org/jira/browse/SOLR-14248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032431#comment-17032431 ] Shalin Shekhar Mangar commented on SOLR-14248: -- This patch fixes all the problems except for #5. The way it fixes #3 is a hack but that's the best I could do without creating a builder class for DocCollection. I've left a todo comment in there to describe the hack and eventual fix.

> Improve ClusterStateMockUtil and make its methods public > > > Key: SOLR-14248 > URL: https://issues.apache.org/jira/browse/SOLR-14248 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: Tests >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar >Priority: Minor > Fix For: master (9.0), 8.5 > > Attachments: SOLR-14248.patch > > > While working on SOLR-13996, I had the need to mock the cluster state for > various configurations and I used ClusterStateMockUtil. > However, I ran into a few issues that needed to be fixed: > 1. The methods in this class are protected making it useful only within the > same package > 2. A null router was set for DocCollection objects > 3. The DocCollection object is created before the slices so the > DocCollection.getActiveSlices method returns empty list because the active > slices map is created inside the DocCollection constructor > 4. It did not set core name for the replicas it created > 5. It has no support for replica types so it only creates nrt replicas > I will use this Jira to fix these problems and make the methods in that class > public (but marked as experimental)
[jira] [Updated] (SOLR-14248) Improve ClusterStateMockUtil and make its methods public
[ https://issues.apache.org/jira/browse/SOLR-14248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar updated SOLR-14248: - Attachment: SOLR-14248.patch
[jira] [Created] (SOLR-14248) Improve ClusterStateMockUtil and make its methods public
Shalin Shekhar Mangar created SOLR-14248: Summary: Improve ClusterStateMockUtil and make its methods public Key: SOLR-14248 URL: https://issues.apache.org/jira/browse/SOLR-14248 Project: Solr Issue Type: Improvement Security Level: Public (Default Security Level. Issues are Public) Components: Tests Reporter: Shalin Shekhar Mangar Assignee: Shalin Shekhar Mangar Fix For: master (9.0), 8.5
[GitHub] [lucene-solr] romseygeek opened a new pull request #1243: LUCENE-9212: Intervals.multiterm() should take CompiledAutomaton
romseygeek opened a new pull request #1243: LUCENE-9212: Intervals.multiterm() should take CompiledAutomaton URL: https://github.com/apache/lucene-solr/pull/1243 Currently it takes `Automaton` and then compiles it internally, but we need to do things like check for binary-vs-unicode status; it should just take `CompiledAutomaton` instead, and put responsibility for determinization, binaryness, etc. on the caller. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[jira] [Created] (LUCENE-9212) Intervals.multiterm() should take a CompiledAutomaton
Alan Woodward created LUCENE-9212: - Summary: Intervals.multiterm() should take a CompiledAutomaton Key: LUCENE-9212 URL: https://issues.apache.org/jira/browse/LUCENE-9212 Project: Lucene - Core Issue Type: Improvement Reporter: Alan Woodward Assignee: Alan Woodward LUCENE-9028 added a `multiterm` factory method for intervals that accepts an arbitrary Automaton and converts it internally into a CompiledAutomaton. This isn't necessarily correct behaviour, however, because automata can be defined in both binary and unicode space, and there's no way of telling which it is when it comes time to compile them. In particular, automata produced by FuzzyTermsEnum need to be converted to unicode before compilation. The `multiterm` factory should just take `CompiledAutomaton` directly, and we should deprecate the methods that take `Automaton` and remove them in master.
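The API change is about moving compilation, and the decisions baked into it, from the factory to the caller. As a hedged analogue using only the JDK (java.util.regex standing in for Lucene's automaton classes, since the point is the API shape rather than the automaton details): an API that compiles internally cannot honor compilation options the caller cares about, while one that accepts the pre-compiled form can:

```java
import java.util.regex.Pattern;

// JDK-only analogue of the LUCENE-9212 change (not Lucene code): accepting
// the pre-compiled object instead of raw input puts compilation decisions --
// here regex flags, in Lucene binary-vs-unicode handling -- on the caller.
public class CompiledArgumentDemo {
    // "Before": compiles internally; the caller cannot influence how.
    static boolean matchesRaw(String pattern, String input) {
        return Pattern.compile(pattern).matcher(input).matches();
    }

    // "After": takes the compiled form; the caller already decided the flags.
    static boolean matchesCompiled(Pattern compiled, String input) {
        return compiled.matcher(input).matches();
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("foo.*", Pattern.CASE_INSENSITIVE);
        System.out.println(matchesRaw("foo.*", "FOOBAR"));  // false: flags lost
        System.out.println(matchesCompiled(p, "FOOBAR"));   // true
    }
}
```

The same reasoning motivates deprecating the `Automaton`-taking overloads: once the factory compiles internally, there is no hook for caller-side concerns like unicode conversion of FuzzyTermsEnum automata.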
[jira] [Updated] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search
[ https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin-Chun Zhang updated LUCENE-9136: --- Description: Representation learning (RL) has been an established discipline in the machine learning space for decades, but it has drawn tremendous attention lately with the emergence of deep learning. The central problem of RL is to determine an optimal representation of the input data. By embedding the data into a high-dimensional vector, the vector retrieval (VR) method is then applied to search for the relevant items. With the rapid development of RL over the past few years, the technique has been used extensively in industry, from online advertising to computer vision and speech recognition. There exist many open-source implementations of VR algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various choices for potential users. However, the aforementioned implementations are all written in C++, with no plan for supporting a Java interface, making them hard to integrate into Java projects or to use for those who are not familiar with C/C++ [https://github.com/facebookresearch/faiss/issues/105]. The algorithms for vector retrieval can be roughly classified into four categories: # Tree-based algorithms, such as KD-tree; # Hashing methods, such as LSH (Locality-Sensitive Hashing); # Product-quantization-based algorithms, such as IVFFlat; # Graph-based algorithms, such as HNSW, SSG, NSG; where IVFFlat and HNSW are the most popular among all the VR algorithms. Recently, the implementation of HNSW (Hierarchical Navigable Small World, LUCENE-9004) for Lucene has made great progress. The issue draws the attention of those who are interested in Lucene or hope to use HNSW with Solr/Lucene. As an alternative for solving ANN similarity search problems, IVFFlat is also very popular, with many users and supporters. 
Compared with HNSW, IVFFlat has a smaller index size but requires k-means clustering, while HNSW is faster at query time (no training required) but requires extra storage for saving graphs [indexing 1M vectors|https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors]. The recall ratio of IVFFlat can be gradually increased by adjusting the query parameter (nprobe), while it is hard for HNSW to improve its accuracy. In theory, IVFFlat can achieve a 100% recall ratio. Another advantage is that IVFFlat can be faster and more accurate when GPU parallel computing is enabled (currently not supported in Java). Both algorithms have their merits and demerits. Since HNSW is now under development, it may be better to provide both implementations (HNSW and IVFFlat) for potential users who face very different scenarios and want more choices.
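The nprobe/recall trade-off described in the updated description can be sketched in miniature. This is an illustration only, not Lucene or FAISS code: a toy one-dimensional IVFFlat with two hand-picked centroids (a real implementation learns them with k-means), where probing a single bucket can miss the true nearest neighbor near a cluster boundary, and raising nprobe recovers it at the cost of scanning more vectors:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy 1-D sketch of the IVFFlat idea (illustration only): vectors are
// bucketed under their nearest centroid, and a query scans only the
// `nprobe` closest buckets. Higher nprobe -> higher recall, more scanning.
public class IvfFlatSketch {
    static final double[] CENTROIDS = {0.0, 10.0}; // hand-picked, not k-means
    static final List<List<Double>> BUCKETS =
            Arrays.asList(new ArrayList<>(), new ArrayList<>());

    static int nearestCentroid(double v) {
        return Math.abs(v - CENTROIDS[0]) <= Math.abs(v - CENTROIDS[1]) ? 0 : 1;
    }

    static void index(double v) { BUCKETS.get(nearestCentroid(v)).add(v); }

    // Scan the nprobe buckets whose centroids are closest to the query.
    static double search(double q, int nprobe) {
        Integer[] order = {0, 1};
        Arrays.sort(order, (a, b) -> Double.compare(
                Math.abs(q - CENTROIDS[a]), Math.abs(q - CENTROIDS[b])));
        double best = Double.NaN, bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < nprobe; i++) {
            for (double v : BUCKETS.get(order[i])) {
                double d = Math.abs(q - v);
                if (d < bestDist) { bestDist = d; best = v; }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        for (double v : new double[] {1.0, 4.9, 5.6, 9.0}) index(v);
        double q = 5.2; // true nearest neighbor 4.9 lives in centroid 0's bucket
        System.out.println(search(q, 1)); // probes only centroid 10's bucket
        System.out.println(search(q, 2)); // probes both buckets, exact answer
    }
}
```

With nprobe equal to the number of buckets the search degenerates to an exhaustive scan, which is why recall can in theory reach 100%.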
[jira] [Created] (LUCENE-9211) Adding compression to BinaryDocValues storage
Mark Harwood created LUCENE-9211: Summary: Adding compression to BinaryDocValues storage Key: LUCENE-9211 URL: https://issues.apache.org/jira/browse/LUCENE-9211 Project: Lucene - Core Issue Type: Improvement Components: core/codecs Reporter: Mark Harwood Assignee: Mark Harwood While SortedSetDocValues can be used today to store identical values in a compact form, this is not effective for data with many unique values. The proposal is that BinaryDocValues should be stored in LZ4-compressed blocks, which can dramatically reduce disk storage costs in many cases. Blocks of a number of documents are stored as a single compressed blob, along with metadata that records the offsets where the original document values can be found in the uncompressed content. There's a trade-off here between efficient compression (more docs per block = better compression) and fast retrieval times (fewer docs per block = faster read access for single values). A fixed block size of 32 docs seems like a reasonable compromise for most scenarios. A PR is up for review here: https://github.com/apache/lucene-solr/pull/1234
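The block layout described in the proposal can be sketched as follows. This is an illustration, not the actual codec: java.util.zip's Deflater stands in for LZ4 (which the JDK does not ship), and the block size is shrunk to 4 docs for the demo. Values in a block are concatenated and compressed as one blob, and a per-value offset table lets the reader slice a single value back out after decompressing only its block:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Sketch of block-compressed binary values (Deflater stands in for LZ4;
// layout is illustrative, not the real codec format): one compressed blob
// per block plus an offset table for per-document retrieval.
public class BlockCompressedValues {
    static byte[] compress(byte[] data) throws Exception {
        Deflater d = new Deflater();
        d.setInput(data);
        d.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!d.finished()) out.write(buf, 0, d.deflate(buf));
        return out.toByteArray();
    }

    static byte[] decompress(byte[] data, int rawLen) throws Exception {
        Inflater inf = new Inflater();
        inf.setInput(data);
        byte[] raw = new byte[rawLen];
        int off = 0;
        while (off < rawLen) off += inf.inflate(raw, off, rawLen - off);
        return raw;
    }

    public static void main(String[] args) throws Exception {
        // Writer side: one block of 4 docs (the issue proposes 32 per block).
        String[] values = {"alpha", "beta", "gamma", "delta"};
        ByteArrayOutputStream blockRaw = new ByteArrayOutputStream();
        List<Integer> offsets = new ArrayList<>();
        for (String v : values) {
            offsets.add(blockRaw.size()); // metadata: where each value starts
            blockRaw.write(v.getBytes(StandardCharsets.UTF_8));
        }
        offsets.add(blockRaw.size());     // end sentinel
        byte[] compressed = compress(blockRaw.toByteArray());

        // Reader side: decompress the whole block, slice out doc #2's value.
        byte[] raw = decompress(compressed, blockRaw.size());
        int doc = 2;
        String value = new String(raw, offsets.get(doc),
                offsets.get(doc + 1) - offsets.get(doc), StandardCharsets.UTF_8);
        System.out.println(value);
    }
}
```

Retrieval cost here is dominated by decompressing one whole block, which is the trade-off named above: fewer docs per block reads faster, more docs per block compresses better.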
[GitHub] [lucene-solr] markharwood commented on issue #1234: Add compression for Binary doc value fields
markharwood commented on issue #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#issuecomment-583313015 I've reclaimed my Jira log-in and opened https://issues.apache.org/jira/browse/LUCENE-9211
[jira] [Commented] (SOLR-12930) Add developer documentation to source repo
[ https://issues.apache.org/jira/browse/SOLR-12930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032251#comment-17032251 ] ASF subversion and git services commented on SOLR-12930: Commit c0d1f302360ef97b5cfdcbdf82365f8ec1d6c2ed in lucene-solr's branch refs/heads/master from Adrien Grand [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c0d1f30 ] SOLR-12930: Exclude dev-docs from binary archive. > Add developer documentation to source repo > -- > > Key: SOLR-12930 > URL: https://issues.apache.org/jira/browse/SOLR-12930 > Project: Solr > Issue Type: Improvement > Components: Tests > Reporter: Mark Miller > Priority: Major > Attachments: solr-dev-docs.zip > > Time Spent: 1h 20m > Remaining Estimate: 0h >
[jira] [Commented] (SOLR-12930) Add developer documentation to source repo
[ https://issues.apache.org/jira/browse/SOLR-12930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032250#comment-17032250 ] ASF subversion and git services commented on SOLR-12930: Commit d62f63076585769f757dcaf9919d2f07fab113d3 in lucene-solr's branch refs/heads/branch_8x from Adrien Grand [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=d62f630 ] SOLR-12930: Exclude dev-docs from binary archive.