[jira] [Created] (LUCENE-9207) Don't build SpanQuery in QueryBuilder

2020-02-05 Thread Alan Woodward (Jira)
Alan Woodward created LUCENE-9207:
-

 Summary: Don't build SpanQuery in QueryBuilder
 Key: LUCENE-9207
 URL: https://issues.apache.org/jira/browse/LUCENE-9207
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Alan Woodward
Assignee: Alan Woodward


Subtask of LUCENE-9204.  QueryBuilder currently has special logic for graph 
phrase queries with no slop, constructing a SpanQuery that attempts to follow 
all paths using a combination of OR and NEAR queries.  Given the known bugs in 
this type of query (LUCENE-7398), and that we would like to move span queries 
out of core in any case, we should remove this logic and just build a 
disjunction of phrase queries, one phrase per path.
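
A minimal sketch of the proposed query shape, assuming the term sequences for each 
path through the token graph have already been enumerated (the helper class and 
method below are illustrative only, not the actual QueryBuilder change):

{code:java}
import java.util.List;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;

class GraphPhraseSketch {
  // Build one exact (slop=0) PhraseQuery per path and OR them together.
  static Query disjunctionOfPhrases(String field, List<String[]> paths) {
    BooleanQuery.Builder disjunction = new BooleanQuery.Builder();
    for (String[] path : paths) {
      PhraseQuery.Builder phrase = new PhraseQuery.Builder();
      int position = 0;
      for (String term : path) {
        phrase.add(new Term(field, term), position++);
      }
      disjunction.add(phrase.build(), BooleanClause.Occur.SHOULD);
    }
    return disjunction.build();
  }
}
{code}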






[GitHub] [lucene-solr] romseygeek opened a new pull request #1239: LUCENE-9207: Don't build span queries in QueryBuilder

2020-02-05 Thread GitBox
romseygeek opened a new pull request #1239: LUCENE-9207: Don't build span 
queries in QueryBuilder
URL: https://github.com/apache/lucene-solr/pull/1239
 
 
   QueryBuilder currently has special logic for graph phrase queries with no 
slop, 
   constructing a SpanQuery that attempts to follow all paths using a 
combination of 
   OR and NEAR queries.  Given the known bugs in this type of query 
(LUCENE-7398) 
   and that we would like to move span queries out of core in any case, we 
should 
   remove this logic and just build a disjunction of phrase queries, one phrase 
per path.





[jira] [Created] (SOLR-14242) Implement HdfsDirectory#createTempOutput

2020-02-05 Thread Adrien Grand (Jira)
Adrien Grand created SOLR-14242:
---

 Summary: Implement HdfsDirectory#createTempOutput
 Key: SOLR-14242
 URL: https://issues.apache.org/jira/browse/SOLR-14242
 Project: Solr
  Issue Type: New Feature
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Adrien Grand


The HdfsDirectory doesn't implement createTempOutput, which means it can't index 
geo points, ranges, or shapes. We should implement this method.
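
A hedged sketch of what such an implementation could look like (not the actual 
patch; it simply mirrors the counter-based temp-name approach FSDirectory uses 
and delegates to the directory's existing createOutput):

{code:java}
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.lucene.index.IndexFileNames;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexOutput;

class TempOutputSketch {
  private static final AtomicLong TEMP_FILE_COUNTER = new AtomicLong();

  // Generate a unique "*.tmp" name and create it through the wrapped directory.
  static IndexOutput createTempOutput(Directory dir, String prefix, String suffix,
                                      IOContext context) throws IOException {
    String name = IndexFileNames.segmentFileName(
        prefix,
        suffix + "_" + Long.toString(TEMP_FILE_COUNTER.getAndIncrement(), Character.MAX_RADIX),
        "tmp");
    return dir.createOutput(name, context);
  }
}
{code}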






[jira] [Commented] (LUCENE-9204) Move span queries to the queries module

2020-02-05 Thread Alan Woodward (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030530#comment-17030530
 ] 

Alan Woodward commented on LUCENE-9204:
---

Thanks David.  I opened LUCENE-9207 to separate out the QueryBuilder 
refactoring, as the changeset to move Spans entirely is fairly unwieldy.

> Move span queries to the queries module
> ---
>
> Key: LUCENE-9204
> URL: https://issues.apache.org/jira/browse/LUCENE-9204
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
>
> We have a slightly odd situation currently, with two parallel query 
> structures for building complex positional queries: the long-standing span 
> queries, in core; and interval queries, in the queries module.  Given that 
> interval queries solve at least some of the problems we've had with Spans, I 
> think we should be pushing users more towards these implementations.  It's 
> counter-intuitive to do that when Spans are in core though.  I've opened this 
> issue to discuss moving the spans package as a whole to the queries module.






[GitHub] [lucene-solr] jpountz opened a new pull request #1240: SOLR-14242: HdfsDirectory#createTempOutput.

2020-02-05 Thread GitBox
jpountz opened a new pull request #1240: SOLR-14242: 
HdfsDirectory#createTempOutput.
URL: https://github.com/apache/lucene-solr/pull/1240
 
 
   JIRA: https://issues.apache.org/jira/browse/SOLR-14242





[jira] [Commented] (LUCENE-9077) Gradle build

2020-02-05 Thread Alan Woodward (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030561#comment-17030561
 ] 

Alan Woodward commented on LUCENE-9077:
---

Thanks for everybody's hard work on this - the gradle build is so much nicer to 
work with than ant!

I have a question regarding precommit - it doesn't seem to be catching unused 
imports yet; I'm getting PRs failing precommit checks on unused imports which 
have passed locally when I run `./gradlew precommit`.  Is this a known issue?

> Gradle build
> 
>
> Key: LUCENE-9077
> URL: https://issues.apache.org/jira/browse/LUCENE-9077
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Major
> Fix For: master (9.0)
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> This task focuses on providing gradle-based build equivalent for Lucene and 
> Solr (on master branch). See notes below on why this respin is needed.
> The code lives on the *gradle-master* branch. It is kept in sync with *master*. 
> Try running the following to see an overview of helper guides concerning 
> typical workflow, testing and ant-migration helpers:
> gradlew :help
> A list of items that need to be added or require work. If you'd like to 
> work on any of these, please add your name to the list. Once you have a 
> patch/ pull request let me (dweiss) know - I'll try to coordinate the merges.
>  * (/) Apply forbiddenAPIs
>  * (/) Generate hardware-aware gradle defaults for parallelism (count of 
> workers and test JVMs).
>  * (/) Fail the build if --tests filter is applied and no tests execute 
> during the entire build (this allows for an empty set of filtered tests at 
> single project level).
>  * (/) Port other settings and randomizations from common-build.xml
>  * (/) Configure security policy/ sandboxing for tests.
>  * (/) test's console output on -Ptests.verbose=true
>  * (/) add a :helpDeps explanation to how the dependency system works 
> (palantir plugin, lockfile) and how to retrieve structured information about 
> current dependencies of a given module (in a tree-like output).
>  * (/) jar checksums, jar checksum computation and validation. This should be 
> done without intermediate folders (directly on dependency sets).
>  * (/) verify min. JVM version and exact gradle version on build startup to 
> minimize odd build side-effects
>  * (/) Repro-line for failed tests/ runs.
>  * (/) add a top-level README note about building with gradle (and the 
> required JVM).
>  * (/) add an equivalent of 'validate-source-patterns' 
> (check-source-patterns.groovy) to precommit.
>  * (/) add an equivalent of 'rat-sources' to precommit.
>  * (/) add an equivalent of 'check-example-lucene-match-version' (solr only) 
> to precommit.
> * (/) javadoc compilation
> Hard-to-implement stuff already investigated:
>  * (/) (done)  -*Printing console output of failed tests.* There doesn't seem 
> to be any way to do this in a reasonably efficient way. There are onOutput 
> listeners but they're slow to operate and solr tests emit *tons* of output so 
> it's an overkill.-
>  * (!) (LUCENE-9120) *Tests working with security-debug logs or other 
> JVM-early log output*. Gradle's test runner works by redirecting Java's 
> stdout/ syserr so this just won't work. Perhaps we can spin the ant-based 
> test runner for such corner-cases.
> Of lesser importance:
>  * Add an equivalent of 'documentation-lint" to precommit.
>  * (/) Do not require files to be committed before running precommit. (staged 
> files are fine).
>  * (/) add rendering of javadocs (gradlew javadoc)
>  * Attach javadocs to maven publications.
>  * Add test 'beasting' (rerunning the same suite multiple times). I'm afraid 
> it'll be difficult to run it sensibly because gradle doesn't offer cwd 
> separation for the forked test runners.
>  * if you diff solr packaged distribution against ant-created distribution 
> there are minor differences in library versions and some JARs are excluded/ 
> moved around. I didn't try to force these as everything seems to work (tests, 
> etc.) – perhaps these differences should  be fixed in the ant build instead.
>  * [EOE] identify and port various "regenerate" tasks from ant builds 
> (javacc, precompiled automata, etc.)
>  * Fill in POM details in gradle/defaults-maven.gradle so that they reflect 
> the previous content better (dependencies aside).
>  * Add any IDE integration layers that should be added (I use IntelliJ and it 
> imports the project out of the box, without the need for any special tuning).
>  * Add Solr packaging for docs/* (see TODO in packaging/build.gradle; 
> currently XSLT...)
>  * I didn't bother adding Solr dist/test-framework to packaging (who'd use it 
> from a binary distribution? 
>  

[jira] [Created] (SOLR-14243) ant clean-jars should not delete gradle-wrapper.jar

2020-02-05 Thread Andras Salamon (Jira)
Andras Salamon created SOLR-14243:
-

 Summary: ant clean-jars should not delete gradle-wrapper.jar
 Key: SOLR-14243
 URL: https://issues.apache.org/jira/browse/SOLR-14243
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Andras Salamon


Right now ant clean-jars deletes {{gradle/wrapper/gradle-wrapper.jar}}, so if I 
execute the following command to recreate the checksums it shows up as a 
deleted file in git:
{noformat}
$ ant clean-jars jar-checksums 
...
$ git status -s
 D gradle/wrapper/gradle-wrapper.jar{noformat}

I don't think we should delete the gradle-wrapper.jar here







[jira] [Updated] (SOLR-14243) ant clean-jars should not delete gradle-wrapper.jar

2020-02-05 Thread Andras Salamon (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Salamon updated SOLR-14243:
--
Status: Patch Available  (was: Open)

> ant clean-jars should not delete gradle-wrapper.jar
> ---
>
> Key: SOLR-14243
> URL: https://issues.apache.org/jira/browse/SOLR-14243
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Andras Salamon
>Priority: Major
> Attachments: SOLR-14243-01.patch
>
>
> Right now ant clean-jars deletes {{gradle/wrapper/gradle-wrapper.jar}}, so if 
> I execute the following command to recreate the checksums it shows up as a 
> deleted file in git:
> {noformat}
> $ ant clean-jars jar-checksums 
> ...
> $ git status -s
>  D gradle/wrapper/gradle-wrapper.jar{noformat}
> I don't think we should delete the gradle-wrapper.jar here






[jira] [Updated] (SOLR-14243) ant clean-jars should not delete gradle-wrapper.jar

2020-02-05 Thread Andras Salamon (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Salamon updated SOLR-14243:
--
Attachment: SOLR-14243-01.patch
Status: Open  (was: Open)

> ant clean-jars should not delete gradle-wrapper.jar
> ---
>
> Key: SOLR-14243
> URL: https://issues.apache.org/jira/browse/SOLR-14243
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Andras Salamon
>Priority: Major
> Attachments: SOLR-14243-01.patch
>
>
> Right now ant clean-jars deletes {{gradle/wrapper/gradle-wrapper.jar}}, so if 
> I execute the following command to recreate the checksums it shows up as a 
> deleted file in git:
> {noformat}
> $ ant clean-jars jar-checksums 
> ...
> $ git status -s
>  D gradle/wrapper/gradle-wrapper.jar{noformat}
> I don't think we should delete the gradle-wrapper.jar here






[jira] [Commented] (SOLR-14243) ant clean-jars should not delete gradle-wrapper.jar

2020-02-05 Thread Lucene/Solr QA (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030580#comment-17030580
 ] 

Lucene/Solr QA commented on SOLR-14243:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m  
0s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m  
0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m  
0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Release audit (RAT) {color} | 
{color:green}  0m  0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Validate source patterns {color} | 
{color:green}  0m  0s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:black}{color} | {color:black} {color} | {color:black}  0m 31s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | SOLR-14243 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12992681/SOLR-14243-01.patch |
| Optional Tests |  compile  javac  unit  ratsources  validatesourcepatterns  |
| uname | Linux lucene1-us-west 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 
10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-SOLR-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
 |
| git revision | master / 2d8428ec2e8 |
| ant | version: Apache Ant(TM) version 1.10.5 compiled on March 28 2019 |
| Default Java | LTS |
|  Test Results | 
https://builds.apache.org/job/PreCommit-SOLR-Build/679/testReport/ |
| modules | C: . U: . |
| Console output | 
https://builds.apache.org/job/PreCommit-SOLR-Build/679/console |
| Powered by | Apache Yetus 0.7.0   http://yetus.apache.org |


This message was automatically generated.



> ant clean-jars should not delete gradle-wrapper.jar
> ---
>
> Key: SOLR-14243
> URL: https://issues.apache.org/jira/browse/SOLR-14243
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Andras Salamon
>Priority: Major
> Attachments: SOLR-14243-01.patch
>
>
> Right now ant clean-jars deletes {{gradle/wrapper/gradle-wrapper.jar}}, so if 
> I execute the following command to recreate the checksums it shows up as a 
> deleted file in git:
> {noformat}
> $ ant clean-jars jar-checksums 
> ...
> $ git status -s
>  D gradle/wrapper/gradle-wrapper.jar{noformat}
> I don't think we should delete the gradle-wrapper.jar here






[GitHub] [lucene-solr] markharwood commented on issue #1234: Add compression for Binary doc value fields

2020-02-05 Thread GitBox
markharwood commented on issue #1234: Add compression for Binary doc value 
fields
URL: https://github.com/apache/lucene-solr/pull/1234#issuecomment-582367395
 
 
   Thanks for looking at this, Mike.
   
   >LOL, that's crazy -- you should go introduce yourself to the other markh ;)
   
   I already reached out and we're working out the divorce proceedings :)
   
   >@markharwood how can we reproduce these benchmarks? What were the log data 
documents storing as BINARY doc values fields?
   
   These were elasticsearch log file entries - so each value was a string which 
could be something short like  `[instance-48] users file 
[/app/config/users] changed. updating users... )` or an error with a whole 
stack trace.
   My test rig is 
[here](https://gist.github.com/markharwood/724009754c89e7f245625120e71f60d7) if 
you want to try with some other data files
   
   >And how can indexing and searching get so much faster when 
compress/decompress is in the path!
   
   This was a test on my macbook with SSD and encrypted FS so perhaps not the 
best benchmarking setup. Maybe just writing more bytes = more overhead with the 
OS-level encryption?
   
   >I think our testing of BINARY doc values may not be great ... maybe add a 
randomized test that sometimes stores very compressible and very 
incompressible, large, BINARY doc values?
   
   Will do. @jimczi has suggested adding support for storing without 
compression when the content doesn't compress well. I guess that can be a 
combination of:
   1) A fast heuristic - e.g. if the max value length for each of the docs in a 
block is <= 2 then store without compression, and 
   2) "Try it and see" compression - buffer the compression output to a byte 
array and only write the compressed form to disk if its size is less than the 
uncompressed input (a rough sketch of this idea follows below).
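
   For illustration, a rough sketch of the "try it and see" idea using 
java.util.zip (the actual change may well use a different codec; this is just 
the shape of the size check):

{code:java}
import java.util.zip.Deflater;

class MaybeCompressSketch {
  // Compress into a scratch buffer; keep the compressed bytes only if they are
  // strictly smaller than the input, otherwise fall back to storing it raw.
  // The caller would need to record a per-block flag saying which form was stored.
  static byte[] maybeCompress(byte[] value) {
    Deflater deflater = new Deflater(Deflater.BEST_SPEED);
    try {
      deflater.setInput(value);
      deflater.finish();
      byte[] scratch = new byte[value.length];
      int compressedLen = deflater.deflate(scratch);
      if (deflater.finished() && compressedLen < value.length) {
        byte[] compressed = new byte[compressedLen];
        System.arraycopy(scratch, 0, compressed, 0, compressedLen);
        return compressed;
      }
      return value; // incompressible: store as-is
    } finally {
      deflater.end();
    }
  }
}
{code}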
   
   





[jira] [Created] (SOLR-14244) Remove ReplicaInfo

2020-02-05 Thread Andrzej Bialecki (Jira)
Andrzej Bialecki created SOLR-14244:
---

 Summary: Remove ReplicaInfo
 Key: SOLR-14244
 URL: https://issues.apache.org/jira/browse/SOLR-14244
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
  Components: SolrCloud
Reporter: Andrzej Bialecki


SolrCloud uses {{Replica}} and {{ReplicaInfo}} beans more or less 
interchangeably and rather inconsistently across the code base. They seem to 
mean exactly the same thing.

We should get rid of one or the other.






[jira] [Created] (SOLR-14245) Validate Replica / ReplicaInfo on creation

2020-02-05 Thread Andrzej Bialecki (Jira)
Andrzej Bialecki created SOLR-14245:
---

 Summary: Validate Replica / ReplicaInfo on creation
 Key: SOLR-14245
 URL: https://issues.apache.org/jira/browse/SOLR-14245
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
  Components: SolrCloud
Reporter: Andrzej Bialecki
Assignee: Andrzej Bialecki
 Fix For: 8.5


Replica / ReplicaInfo should be immutable and their fields should be validated 
on creation.

Some users reported that very rarely during a failed collection CREATE or 
DELETE, or when the Overseer task queue becomes corrupted, Solr may write to ZK 
incomplete replica infos (e.g. node_name = null).

This problem is difficult to reproduce but we should add safeguards anyway to 
prevent writing such corrupted replica info to ZK.
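
A minimal sketch of the kind of guard being proposed (the field names are 
illustrative, not the actual Replica constructor):

{code:java}
import java.util.Objects;

final class ReplicaInfoSketch {
  private final String name;
  private final String coreName;
  private final String nodeName;

  // Fail fast at construction time instead of letting a null slip into ZK state.
  ReplicaInfoSketch(String name, String coreName, String nodeName) {
    this.name = Objects.requireNonNull(name, "replica name must not be null");
    this.coreName = Objects.requireNonNull(coreName, "core name must not be null");
    this.nodeName = Objects.requireNonNull(nodeName, "node_name must not be null");
  }
}
{code}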






[jira] [Commented] (LUCENE-9154) Remove encodeCeil() to encode bounding box queries

2020-02-05 Thread Ignacio Vera (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030626#comment-17030626
 ] 

Ignacio Vera commented on LUCENE-9154:
--

In a comment above, it was said that the value in the index is a lat/lon value. 
That is not accurate: the value in the index is represented as a two-dimensional 
integer. Each integer represents a *range* of lat/lon values (all the values 
that are encoded to that integer), and it can be decoded back to a single value 
using GeoEncodingUtils. The value it decodes to is not the middle of the range 
(which I would expect to be the logical point to represent that range) but the 
lower value of the range.

I understand now that probably one of the reasons that value was chosen is to 
make this logic happy. If I change the decoded value to the middle of the range, 
all this logic fails, as the implementation relies on how we decode the values 
from the index.
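
A small illustration of the decoding behaviour described above (the latitude 
value is arbitrary; per the comment, the decoded result is the lower edge of the 
quantization range rather than the original value or the middle of the range):

{code:java}
import org.apache.lucene.geo.GeoEncodingUtils;

public class EncodeDecodeDemo {
  public static void main(String[] args) {
    double lat = 52.3712345;
    int encoded = GeoEncodingUtils.encodeLatitude(lat);
    double decoded = GeoEncodingUtils.decodeLatitude(encoded);
    // decoded <= lat: the quantized integer maps back to the low end of its range
    System.out.println(lat + " -> " + encoded + " -> " + decoded);
  }
}
{code}
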
> Remove encodeCeil()  to encode bounding box queries
> ---
>
> Key: LUCENE-9154
> URL: https://issues.apache.org/jira/browse/LUCENE-9154
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Ignacio Vera
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We currently have the following logic in LatLonPoint#newBoxquery():
> {code:java}
>  // exact double values of lat=90.0D and lon=180.0D must be treated special 
> as they are not represented in the encoding
> // and should not drag in extra bogus junk! TODO: should encodeCeil just 
> throw ArithmeticException to be less trappy here?
> if (minLatitude == 90.0) {
>   // range cannot match as 90.0 can never exist
>   return new MatchNoDocsQuery("LatLonPoint.newBoxQuery with 
> minLatitude=90.0");
> }
> if (minLongitude == 180.0) {
>   if (maxLongitude == 180.0) {
> // range cannot match as 180.0 can never exist
> return new MatchNoDocsQuery("LatLonPoint.newBoxQuery with 
> minLongitude=maxLongitude=180.0");
>   } else if (maxLongitude < minLongitude) {
> // encodeCeil() with dateline wrapping!
> minLongitude = -180.0;
>   }
> }
> byte[] lower = encodeCeil(minLatitude, minLongitude);
> byte[] upper = encode(maxLatitude, maxLongitude);
> {code}
>  
> IMO this is confusing and can lead to strange results. For example a 
> query with {{minLatitude = maxLatitude = 90}} does not match points with 
> {{latitude = 90}}. On the other hand a query with {{minLatitude = 
> maxLatitude = 89.9996}} will match points at latitude = 90.
> I don't really understand the statement that says: {{90.0 can never exist}}, 
> as this is just as true for values > 89.9995809048, which is the maximum 
> quantized value. By this argument, it would be true for all values between 
> quantized coordinates, as they do not exist in the index either, so why is 
> 90D so special? I guess because it cannot be ceiled up without overflowing 
> the encoding.
> Another argument to remove this function is that it opens the door to 
> false negatives in the result of the query. If a query has minLon = 
> 89.99957, it won't match points with longitude = 89.99957 as it is 
> rounded up to 89.9995809048.
> The only merit I can see in the current approach is that if you only index 
> points that are already quantized, then all queries would be exact. But does 
> it make sense for someone to only index quantized values and then query by 
> non-quantized bounding boxes?
>  
> I hope I am missing something, but my proposal is to remove encodeCeil 
> altogether and remove all the special handling at the positive pole and 
> positive dateline.
>  






[jira] [Commented] (LUCENE-9077) Gradle build

2020-02-05 Thread Erick Erickson (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030647#comment-17030647
 ] 

Erick Erickson commented on LUCENE-9077:


[~romseygeek] Not that I know of.  Go ahead and add it to the list in this Jira 
just to be sure. Thanks for reporting!

> Gradle build
> 
>
> Key: LUCENE-9077
> URL: https://issues.apache.org/jira/browse/LUCENE-9077
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Major
> Fix For: master (9.0)
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> This task focuses on providing gradle-based build equivalent for Lucene and 
> Solr (on master branch). See notes below on why this respin is needed.
> The code lives on the *gradle-master* branch. It is kept in sync with *master*. 
> Try running the following to see an overview of helper guides concerning 
> typical workflow, testing and ant-migration helpers:
> gradlew :help
> A list of items that need to be added or require work. If you'd like to 
> work on any of these, please add your name to the list. Once you have a 
> patch/ pull request let me (dweiss) know - I'll try to coordinate the merges.
>  * (/) Apply forbiddenAPIs
>  * (/) Generate hardware-aware gradle defaults for parallelism (count of 
> workers and test JVMs).
>  * (/) Fail the build if --tests filter is applied and no tests execute 
> during the entire build (this allows for an empty set of filtered tests at 
> single project level).
>  * (/) Port other settings and randomizations from common-build.xml
>  * (/) Configure security policy/ sandboxing for tests.
>  * (/) test's console output on -Ptests.verbose=true
>  * (/) add a :helpDeps explanation to how the dependency system works 
> (palantir plugin, lockfile) and how to retrieve structured information about 
> current dependencies of a given module (in a tree-like output).
>  * (/) jar checksums, jar checksum computation and validation. This should be 
> done without intermediate folders (directly on dependency sets).
>  * (/) verify min. JVM version and exact gradle version on build startup to 
> minimize odd build side-effects
>  * (/) Repro-line for failed tests/ runs.
>  * (/) add a top-level README note about building with gradle (and the 
> required JVM).
>  * (/) add an equivalent of 'validate-source-patterns' 
> (check-source-patterns.groovy) to precommit.
>  * (/) add an equivalent of 'rat-sources' to precommit.
>  * (/) add an equivalent of 'check-example-lucene-match-version' (solr only) 
> to precommit.
> * (/) javadoc compilation
> Hard-to-implement stuff already investigated:
>  * (/) (done)  -*Printing console output of failed tests.* There doesn't seem 
> to be any way to do this in a reasonably efficient way. There are onOutput 
> listeners but they're slow to operate and solr tests emit *tons* of output so 
> it's an overkill.-
>  * (!) (LUCENE-9120) *Tests working with security-debug logs or other 
> JVM-early log output*. Gradle's test runner works by redirecting Java's 
> stdout/ syserr so this just won't work. Perhaps we can spin the ant-based 
> test runner for such corner-cases.
> Of lesser importance:
>  * Add an equivalent of 'documentation-lint" to precommit.
>  * (/) Do not require files to be committed before running precommit. (staged 
> files are fine).
>  * (/) add rendering of javadocs (gradlew javadoc)
>  * Attach javadocs to maven publications.
>  * Add test 'beasting' (rerunning the same suite multiple times). I'm afraid 
> it'll be difficult to run it sensibly because gradle doesn't offer cwd 
> separation for the forked test runners.
>  * if you diff solr packaged distribution against ant-created distribution 
> there are minor differences in library versions and some JARs are excluded/ 
> moved around. I didn't try to force these as everything seems to work (tests, 
> etc.) – perhaps these differences should  be fixed in the ant build instead.
>  * [EOE] identify and port various "regenerate" tasks from ant builds 
> (javacc, precompiled automata, etc.)
>  * Fill in POM details in gradle/defaults-maven.gradle so that they reflect 
> the previous content better (dependencies aside).
>  * Add any IDE integration layers that should be added (I use IntelliJ and it 
> imports the project out of the box, without the need for any special tuning).
>  * Add Solr packaging for docs/* (see TODO in packaging/build.gradle; 
> currently XSLT...)
>  * I didn't bother adding Solr dist/test-framework to packaging (who'd use it 
> from a binary distribution? 
>  
> *{color:#ff}Note:{color}* this builds on the work done by Mark Miller and 
> Cao Mạnh Đạt but also applies lessons learned from those two efforts:
>  * *Do not try to do too many things at once*. If we deviate too far from 
> master, the branch will be hard to merge.

[jira] [Commented] (LUCENE-9206) improve IndexMergeTool

2020-02-05 Thread Erick Erickson (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030654#comment-17030654
 ] 

Erick Erickson commented on LUCENE-9206:


+1, in the default case this should be near instantaneous?

FWIW, IndexUpgraderTool also forcemerges to 1 segment, but the fix there would 
be more complicated 'cause at least TMP won't rewrite segments that are already 
max sized. And that's a whole 'nother Jira anyway.

> improve IndexMergeTool
> --
>
> Key: LUCENE-9206
> URL: https://issues.apache.org/jira/browse/LUCENE-9206
> Project: Lucene - Core
>  Issue Type: Task
>  Components: general/tools
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-9206.patch
>
>
> This tool can have performance problems since it will only force merge the 
> index down to one segment. Let's give it some better options and default 
> behavior.






[jira] [Commented] (LUCENE-9206) improve IndexMergeTool

2020-02-05 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030658#comment-17030658
 ] 

Robert Muir commented on LUCENE-9206:
-

It will do some merges, maybe some big long ones: addIndexes() calls 
maybeMerge() at the end. But it will not be pathological and merge down to just 
one segment unless you ask.
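
A simplified sketch of that behaviour (not the tool's actual code): addIndexes() 
lets the configured MergePolicy schedule ordinary merges, and the single-segment 
forceMerge only happens if explicitly requested.

{code:java}
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;

class MergeSketch {
  static void merge(Directory target, Directory[] sources, boolean forceMergeToOneSegment)
      throws Exception {
    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
    try (IndexWriter writer = new IndexWriter(target, config)) {
      writer.addIndexes(sources);   // triggers maybeMerge() under the normal MergePolicy
      if (forceMergeToOneSegment) {
        writer.forceMerge(1);       // pathological single-segment merge, only on request
      }
      writer.commit();
    }
  }
}
{code}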

> improve IndexMergeTool
> --
>
> Key: LUCENE-9206
> URL: https://issues.apache.org/jira/browse/LUCENE-9206
> Project: Lucene - Core
>  Issue Type: Task
>  Components: general/tools
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-9206.patch
>
>
> This tool can have performance problems since it will only force merge the 
> index down to one segment. Let's give it some better options and default 
> behavior.






[jira] [Commented] (LUCENE-9206) improve IndexMergeTool

2020-02-05 Thread Erick Erickson (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030664#comment-17030664
 ] 

Erick Erickson commented on LUCENE-9206:


You know how it is when you hit "send" and immediately realize you've missed 
the obvious? I do that a lot ;).

Hmmm. I suppose you could specify NoMergePolicy if you really cared, but then 
you might leave a lot of segments laying around or maybe create a backlog of 
merges that you'd hit as soon as you started indexing again.

Regardless, this patch gives a lot more control over the process where there 
was none before, way cool.

> improve IndexMergeTool
> --
>
> Key: LUCENE-9206
> URL: https://issues.apache.org/jira/browse/LUCENE-9206
> Project: Lucene - Core
>  Issue Type: Task
>  Components: general/tools
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-9206.patch
>
>
> This tool can have performance problems since it will only force merge the 
> index down to one segment. Let's give it some better options and default 
> behavior.






[jira] [Created] (LUCENE-9208) Boolean Query with MatchNoDocs in MUST/FILTER could be rewritten

2020-02-05 Thread Nirmal Chidambaram (Jira)
Nirmal Chidambaram  created LUCENE-9208:
---

 Summary: Boolean Query with MatchNoDocs in MUST/FILTER could be 
rewritten
 Key: LUCENE-9208
 URL: https://issues.apache.org/jira/browse/LUCENE-9208
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Nirmal Chidambaram 


Currently BooleanQuery rewrites to a MatchNoDocsQuery if a MUST_NOT clause 
contains MatchAllDocsQuery. The same approach could be applied if FILTER/MUST 
clauses contain MatchNoDocsQuery.
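
A hedged sketch of the proposed rewrite rule (not BooleanQuery's actual rewrite 
code): if any required (MUST or FILTER) clause can match no documents, the whole 
conjunction can be rewritten to a MatchNoDocsQuery.

{code:java}
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.MatchNoDocsQuery;
import org.apache.lucene.search.Query;

class RewriteSketch {
  static Query simplify(BooleanQuery query) {
    for (BooleanClause clause : query.clauses()) {
      boolean required = clause.getOccur() == BooleanClause.Occur.MUST
          || clause.getOccur() == BooleanClause.Occur.FILTER;
      if (required && clause.getQuery() instanceof MatchNoDocsQuery) {
        // A required clause that matches nothing makes the whole query match nothing.
        return new MatchNoDocsQuery("required clause matches no docs");
      }
    }
    return query;
  }
}
{code}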






[GitHub] [lucene-solr] ErickErickson opened a new pull request #1241: Gradle util

2020-02-05 Thread GitBox
ErickErickson opened a new pull request #1241: Gradle util
URL: https://github.com/apache/lucene-solr/pull/1241
 
 
   This adds the generation targets for util/packed and util/automaton.
   
   Interestingly, for whatever reason my local Python doesn't do anything weird 
like it did when regenerating the HTML entities; the generated code is 
identical.
   
   One thing I'd like to draw attention to is that I had to change 
createLevAutomata.py to point to the new place moman is downloaded to.
   
   I'll merge upstream sometime over the weekend, probably, barring objections.
   
   I think this finishes off the regeneration work, so I'll close LUCENE-9134 
after merging.





[jira] [Commented] (LUCENE-9207) Don't build SpanQuery in QueryBuilder

2020-02-05 Thread Michael Gibney (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030680#comment-17030680
 ] 

Michael Gibney commented on LUCENE-9207:


I think the special logic building SpanQueries for the slop=0 case was left in 
place by LUCENE-8531 because the resulting behavior is functionally identical 
to the MultiPhraseQuery approach, and SpanQueries for slop=0 are more efficient 
(potentially _vastly_ more efficient) than the exponential expansion that can 
result from MultiPhraseQuery over graph TokenStreams (e.g., for bigrams, 
synonyms, wdgf, etc.).

[~romseygeek], do you think the code simplification is worth the potential 
performance hit for the {{slop=0}} case? [~jim.ferenczi], [~sarowe], 
[~uschindler], I'm curious for your perspectives (having been involved in the 
discussion around LUCENE-8531). For heavily branching token streams (e.g., 
bigrams, certain tYpEs 0f 1nPuT to common WGDF configurations), the performance 
impact is substantial. I know of (and in fact personally know) many people who 
have been bitten by this in the form of SOLR-13336; but the underlying 
performance issue is not Solr-specific and is not directly addressed by the fix 
for SOLR-13336, which simply restores Lucene's maxBooleanClauses threshold for 
short-circuiting individual queries.

FWIW, I think LUCENE-7398 is a bit of a red herring here; I'm shooting from the 
hip a bit, but I'm 90% confident that the LUCENE-7398 issues don't affect the 
slop=0 case for _query_-time graph TokenStreams; and to the extent that they 
affect _index_-time graph TokenStreams, they affect SpanQueries and 
MultiPhraseQuery equally (that's a whole separate question!).

> Don't build SpanQuery in QueryBuilder
> -
>
> Key: LUCENE-9207
> URL: https://issues.apache.org/jira/browse/LUCENE-9207
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Subtask of LUCENE-9204.  QueryBuilder currently has special logic for graph 
> phrase queries with no slop, constructing a SpanQuery that attempts to follow 
> all paths using a combination of OR and NEAR queries.  Given the known bugs 
> in this type of query (LUCENE-7398) and that we would like to move span 
> queries out of core in any case, we should remove this logic and just build a 
> disjunction of phrase queries, one phrase per path.






[jira] [Comment Edited] (LUCENE-9077) Gradle build

2020-02-05 Thread Erick Erickson (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030647#comment-17030647
 ] 

Erick Erickson edited comment on LUCENE-9077 at 2/5/20 2:08 PM:


[~romseygeek] Not that I know of.  NM I was editing the list and added that 
precommit doesn't catch unused imports.


was (Author: erickerickson):
[~romseygeek] Not that I know of.  Go ahead and add it to the list in this Jira 
just to be sure. Thanks for reporting!

> Gradle build
> 
>
> Key: LUCENE-9077
> URL: https://issues.apache.org/jira/browse/LUCENE-9077
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Major
> Fix For: master (9.0)
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> This task focuses on providing gradle-based build equivalent for Lucene and 
> Solr (on master branch). See notes below on why this respin is needed.
> The code lives on the *gradle-master* branch. It is kept in sync with *master*. 
> Try running the following to see an overview of helper guides concerning 
> typical workflow, testing and ant-migration helpers:
> gradlew :help
> A list of items that need to be added or require work. If you'd like to 
> work on any of these, please add your name to the list. Once you have a 
> patch/ pull request let me (dweiss) know - I'll try to coordinate the merges.
>  * (/) Apply forbiddenAPIs
>  * (/) Generate hardware-aware gradle defaults for parallelism (count of 
> workers and test JVMs).
>  * (/) Fail the build if --tests filter is applied and no tests execute 
> during the entire build (this allows for an empty set of filtered tests at 
> single project level).
>  * (/) Port other settings and randomizations from common-build.xml
>  * (/) Configure security policy/ sandboxing for tests.
>  * (/) test's console output on -Ptests.verbose=true
>  * (/) add a :helpDeps explanation to how the dependency system works 
> (palantir plugin, lockfile) and how to retrieve structured information about 
> current dependencies of a given module (in a tree-like output).
>  * (/) jar checksums, jar checksum computation and validation. This should be 
> done without intermediate folders (directly on dependency sets).
>  * (/) verify min. JVM version and exact gradle version on build startup to 
> minimize odd build side-effects
>  * (/) Repro-line for failed tests/ runs.
>  * (/) add a top-level README note about building with gradle (and the 
> required JVM).
>  * (/) add an equivalent of 'validate-source-patterns' 
> (check-source-patterns.groovy) to precommit.
>  * (/) add an equivalent of 'rat-sources' to precommit.
>  * (/) add an equivalent of 'check-example-lucene-match-version' (solr only) 
> to precommit.
>  * (/) javadoc compilation
> Hard-to-implement stuff already investigated:
>  * (/) (done)  -*Printing console output of failed tests.* There doesn't seem 
> to be any way to do this in a reasonably efficient way. There are onOutput 
> listeners but they're slow to operate and solr tests emit *tons* of output so 
> it's an overkill.-
>  * (!) (LUCENE-9120) *Tests working with security-debug logs or other 
> JVM-early log output*. Gradle's test runner works by redirecting Java's 
> stdout/ syserr so this just won't work. Perhaps we can spin the ant-based 
> test runner for such corner-cases.
> Of lesser importance:
>  * Add an equivalent of 'documentation-lint" to precommit.
>  * (/) Do not require files to be committed before running precommit. (staged 
> files are fine).
>  * (/) add rendering of javadocs (gradlew javadoc)
>  * Attach javadocs to maven publications.
>  * Add test 'beasting' (rerunning the same suite multiple times). I'm afraid 
> it'll be difficult to run it sensibly because gradle doesn't offer cwd 
> separation for the forked test runners.
>  * if you diff solr packaged distribution against ant-created distribution 
> there are minor differences in library versions and some JARs are excluded/ 
> moved around. I didn't try to force these as everything seems to work (tests, 
> etc.) – perhaps these differences should  be fixed in the ant build instead.
>  * [EOE] identify and port various "regenerate" tasks from ant builds 
> (javacc, precompiled automata, etc.)
>  * Fill in POM details in gradle/defaults-maven.gradle so that they reflect 
> the previous content better (dependencies aside).
>  * Add any IDE integration layers that should be added (I use IntelliJ and it 
> imports the project out of the box, without the need for any special tuning).
>  * Add Solr packaging for docs/* (see TODO in packaging/build.gradle; 
> currently XSLT...)
>  * I didn't bother adding Solr dist/test-framework to packaging (who'd use it 
> from a binary distribution? 
>  * There is some python execution in check-broken-links and 
> check-missing-javadocs, not sure if it's been ported

[jira] [Updated] (LUCENE-9077) Gradle build

2020-02-05 Thread Erick Erickson (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erick Erickson updated LUCENE-9077:
---
Description: 
This task focuses on providing gradle-based build equivalent for Lucene and 
Solr (on master branch). See notes below on why this respin is needed.

The code lives on the *gradle-master* branch. It is kept in sync with *master*. 
Try running the following to see an overview of helper guides concerning 
typical workflow, testing and ant-migration helpers:

gradlew :help

A list of items that need to be added or require work. If you'd like to work 
on any of these, please add your name to the list. Once you have a patch/ pull 
request let me (dweiss) know - I'll try to coordinate the merges.
 * (/) Apply forbiddenAPIs
 * (/) Generate hardware-aware gradle defaults for parallelism (count of 
workers and test JVMs).
 * (/) Fail the build if --tests filter is applied and no tests execute during 
the entire build (this allows for an empty set of filtered tests at single 
project level).
 * (/) Port other settings and randomizations from common-build.xml
 * (/) Configure security policy/ sandboxing for tests.
 * (/) test's console output on -Ptests.verbose=true
 * (/) add a :helpDeps explanation to how the dependency system works (palantir 
plugin, lockfile) and how to retrieve structured information about current 
dependencies of a given module (in a tree-like output).
 * (/) jar checksums, jar checksum computation and validation. This should be 
done without intermediate folders (directly on dependency sets).
 * (/) verify min. JVM version and exact gradle version on build startup to 
minimize odd build side-effects
 * (/) Repro-line for failed tests/ runs.
 * (/) add a top-level README note about building with gradle (and the required 
JVM).
 * (/) add an equivalent of 'validate-source-patterns' 
(check-source-patterns.groovy) to precommit.
 * (/) add an equivalent of 'rat-sources' to precommit.
 * (/) add an equivalent of 'check-example-lucene-match-version' (solr only) to 
precommit.
 * (/) javadoc compilation

Hard-to-implement stuff already investigated:
 * (/) (done)  -*Printing console output of failed tests.* There doesn't seem 
to be any way to do this in a reasonably efficient way. There are onOutput 
listeners but they're slow to operate and solr tests emit *tons* of output so 
it's an overkill.-
 * (!) (LUCENE-9120) *Tests working with security-debug logs or other JVM-early 
log output*. Gradle's test runner works by redirecting Java's stdout/ syserr so 
this just won't work. Perhaps we can spin the ant-based test runner for such 
corner-cases.

Of lesser importance:
 * Add an equivalent of 'documentation-lint" to precommit.
 * (/) Do not require files to be committed before running precommit. (staged 
files are fine).
 * (/) add rendering of javadocs (gradlew javadoc)
 * Attach javadocs to maven publications.
 * Add test 'beasting' (rerunning the same suite multiple times). I'm afraid 
it'll be difficult to run it sensibly because gradle doesn't offer cwd 
separation for the forked test runners.
 * if you diff solr packaged distribution against ant-created distribution 
there are minor differences in library versions and some JARs are excluded/ 
moved around. I didn't try to force these as everything seems to work (tests, 
etc.) – perhaps these differences should  be fixed in the ant build instead.
 * [EOE] identify and port various "regenerate" tasks from ant builds (javacc, 
precompiled automata, etc.)
 * Fill in POM details in gradle/defaults-maven.gradle so that they reflect the 
previous content better (dependencies aside).
 * Add any IDE integration layers that should be added (I use IntelliJ and it 
imports the project out of the box, without the need for any special tuning).
 * Add Solr packaging for docs/* (see TODO in packaging/build.gradle; currently 
XSLT...)
 * I didn't bother adding Solr dist/test-framework to packaging (who'd use it 
from a binary distribution? 
 * There is some python execution in check-broken-links and 
check-missing-javadocs, not sure if it's been ported
 * Nightly-smoke also have some python execution, not sure of the status.
 * Precommit doesn't catch unused imports

 

*{color:#ff}Note:{color}* this builds on the work done by Mark Miller and 
Cao Mạnh Đạt but also applies lessons learned from those two efforts:
 * *Do not try to do too many things at once*. If we deviate too far from 
master, the branch will be hard to merge.
 * *Do everything in baby-steps* and add small, independent build fragments 
replacing the old ant infrastructure.
 * *Try to engage people to run, test and contribute early*. It can't be a 
one-man effort. The more people understand and can contribute to the build, the 
more healthy it will be.

 

  was:
This task focuses on providing gradle-based build equivalent for Lucene and 
Solr (on master branch). See notes below on why this respin is needed.

[jira] [Updated] (LUCENE-9208) Boolean Query with MatchNoDocs in MUST/FILTER could be rewritten

2020-02-05 Thread Nirmal Chidambaram (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nirmal Chidambaram  updated LUCENE-9208:

Attachment: LUCENE-9208.patch
Status: Open  (was: Open)

> Boolean Query with MatchNoDocs in MUST/FILTER could be rewritten
> 
>
> Key: LUCENE-9208
> URL: https://issues.apache.org/jira/browse/LUCENE-9208
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Nirmal Chidambaram 
>Priority: Trivial
>  Labels: patch
> Attachments: LUCENE-9208.patch
>
>
> Currently BooleanQuery rewrites to a MatchNoDocsQuery if a MUST_NOT clause 
> contains MatchAllDocsQuery. The same approach could be applied if FILTER/MUST 
> clauses contain MatchNoDocsQuery.






[jira] [Updated] (LUCENE-9208) Boolean Query with MatchNoDocs in MUST/FILTER could be rewritten

2020-02-05 Thread Nirmal Chidambaram (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nirmal Chidambaram  updated LUCENE-9208:

Status: Patch Available  (was: Open)

> Boolean Query with MatchNoDocs in MUST/FILTER could be rewritten
> 
>
> Key: LUCENE-9208
> URL: https://issues.apache.org/jira/browse/LUCENE-9208
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Nirmal Chidambaram 
>Priority: Trivial
>  Labels: patch
> Attachments: LUCENE-9208.patch
>
>
> Currently BooleanQuery rewrites to a MatchNoDocsQuery if a MUST_NOT clause 
> contains MatchAllDocsQuery. The same approach could be applied if FILTER/MUST 
> clauses contain MatchNoDocsQuery.






[jira] [Commented] (LUCENE-9200) TieredMergePolicy's test fails with OB1 error after "toning down" (randomizing)

2020-02-05 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030709#comment-17030709
 ] 

ASF subversion and git services commented on LUCENE-9200:
-

Commit 47386f8cca9ebebb65b276547f4b1a19e4ba71df in lucene-solr's branch 
refs/heads/master from Michael McCandless
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=47386f8 ]

LUCENE-9200: consistently use double (not float) math for TieredMergePolicy's 
decisions, to fix a corner-case bug uncovered by randomized tests
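
A tiny illustration of the class of bug (not TieredMergePolicy's actual code): a 
value slightly above an integer survives in double math but collapses to the 
integer in float math, so a ceiling-style limit can end up off by one.

{code:java}
public class FloatVsDouble {
  public static void main(String[] args) {
    double allowedDouble = 56.000000001;          // hypothetical allowed segment count
    float allowedFloat = (float) allowedDouble;   // rounds to exactly 56.0f
    System.out.println(Math.ceil(allowedDouble)); // 57.0
    System.out.println(Math.ceil(allowedFloat));  // 56.0
  }
}
{code}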


> TieredMergePolicy's test fails with OB1 error after "toning down" 
> (randomizing)
> ---
>
> Key: LUCENE-9200
> URL: https://issues.apache.org/jira/browse/LUCENE-9200
> Project: Lucene - Core
>  Issue Type: Task
>  Components: general/test
>Reporter: Robert Muir
>Assignee: Michael McCandless
>Priority: Major
> Attachments: LUCENE-9200.patch, LUCENE-9200.patch
>
>
> I tried to reduce the overhead of MergePolicy simulation tests. Especially 
> TieredMergePolicy's testSimulateUpdates is one of the slowest lucene tests. 
> As a workaround it is NIGHTLY but we should fix that. It should "behave" on a 
> developer machine.
> As part of trying to improve this, the fixed number of documents 
> exercised by the test was changed from 10 million to use "atLeast" so that it 
> would scale bigger in jenkins but be fast on your local machine.
> As well, in the base class, the randomization is "tweaked" so that it 
> generally runs efficiently, but still exercises corner cases.
> Unfortunately TieredMP hates these changes and will randomly (under beasting) 
> fail with an OB1 error: 
> {noformat}
> org.apache.lucene.index.TestTieredMergePolicy > testSimulateUpdates FAILED
> java.lang.AssertionError: numSegments=57, allowed=56.0
> at 
> __randomizedtesting.SeedInfo.seed([E79E5C317D63A1E9:73780B8AD33B297D]:0)
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.assertTrue(Assert.java:41)
> at 
> org.apache.lucene.index.TestTieredMergePolicy.assertSegmentInfos(TestTieredMergePolicy.java:88)
> at 
> org.apache.lucene.index.BaseMergePolicyTestCase.doTestSimulateUpdates(BaseMergePolicyTestCase.java:430)
> at 
> org.apache.lucene.index.TestTieredMergePolicy.testSimulateUpdates(TestTieredMergePolicy.java:719)
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.base/java.lang.reflect.Method.invoke(Method.java:567)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992)
> at 
> org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
> at 
> org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
> at 
> org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
> at 
> org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
> at 
> org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
> at 
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:370)
> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:819)
> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:470)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:951)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:836)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:887)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:898)
> at 
> org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
> at 
> com.carrotsearch.randomizedtest

[jira] [Resolved] (LUCENE-9200) TieredMergePolicy's test fails with OB1 error after "toning down" (randomizing)

2020-02-05 Thread Michael McCandless (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-9200.

Fix Version/s: 8.5
   master (9.0)
   Resolution: Fixed

Thanks [~rcmuir]!

> TieredMergePolicy's test fails with OB1 error after "toning down" 
> (randomizing)
> ---
>
> Key: LUCENE-9200
> URL: https://issues.apache.org/jira/browse/LUCENE-9200
> Project: Lucene - Core
>  Issue Type: Task
>  Components: general/test
>Reporter: Robert Muir
>Assignee: Michael McCandless
>Priority: Major
> Fix For: master (9.0), 8.5
>
> Attachments: LUCENE-9200.patch, LUCENE-9200.patch
>
>
> I tried to reduce the overhead of MergePolicy simulation tests. Especially 
> TieredMergePolicy's testSimulateUpdates is one of the slowest lucene tests. 
> As a workaround it is NIGHTLY but we should fix that. It should "behave" on a 
> developer machine.
> As part of trying to improve this, the fixed number of documents 
> exercised by the test was changed from 10 million to use "atLeast" so that it 
> would scale bigger in jenkins but be fast on your local machine.
> As well, in the base class, the randomization is "tweaked" so that it 
> generally runs efficiently, but still exercises corner cases.
> Unfortunately TieredMP hates these changes and will randomly (under beasting) 
> fail with an OB1 error: 
> {noformat}
> org.apache.lucene.index.TestTieredMergePolicy > testSimulateUpdates FAILED
> java.lang.AssertionError: numSegments=57, allowed=56.0
> at 
> __randomizedtesting.SeedInfo.seed([E79E5C317D63A1E9:73780B8AD33B297D]:0)
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.assertTrue(Assert.java:41)
> at 
> org.apache.lucene.index.TestTieredMergePolicy.assertSegmentInfos(TestTieredMergePolicy.java:88)
> at 
> org.apache.lucene.index.BaseMergePolicyTestCase.doTestSimulateUpdates(BaseMergePolicyTestCase.java:430)
> at 
> org.apache.lucene.index.TestTieredMergePolicy.testSimulateUpdates(TestTieredMergePolicy.java:719)
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.base/java.lang.reflect.Method.invoke(Method.java:567)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992)
> at 
> org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
> at 
> org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
> at 
> org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
> at 
> org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
> at 
> org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
> at 
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:370)
> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:819)
> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:470)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:951)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:836)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:887)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:898)
> at 
> org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
> at 
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at 
> org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
> at 
> com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate

[jira] [Commented] (LUCENE-9200) TieredMergePolicy's test fails with OB1 error after "toning down" (randomizing)

2020-02-05 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030720#comment-17030720
 ] 

ASF subversion and git services commented on LUCENE-9200:
-

Commit 3e63cd38ef0e5c70c2644322935e61e46b22263f in lucene-solr's branch 
refs/heads/branch_8x from Michael McCandless
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=3e63cd3 ]

LUCENE-9200: consistently use double (not float) math for TieredMergePolicy's 
decisions, to fix a corner-case bug uncovered by randomized tests


> TieredMergePolicy's test fails with OB1 error after "toning down" 
> (randomizing)
> ---
>
> Key: LUCENE-9200
> URL: https://issues.apache.org/jira/browse/LUCENE-9200
> Project: Lucene - Core
>  Issue Type: Task
>  Components: general/test
>Reporter: Robert Muir
>Assignee: Michael McCandless
>Priority: Major
> Attachments: LUCENE-9200.patch, LUCENE-9200.patch
>
>
> I tried to reduce the overhead of MergePolicy simulation tests. Especially 
> TieredMergePolicy's testSimulateUpdates is one of the slowest lucene tests. 
> As a workaround it is NIGHTLY but we should fix that. It should "behave" on a 
> developer machine.
> As part of trying to improve this, the fixed number of documents 
> exercised by the test was changed from 10 million to use "atLeast" so that it 
> would scale bigger in jenkins but be fast on your local machine.
> As well in the base class, the randomization is "tweaked" so that it 
> generally runs efficiently, but still exercises corner cases.
> Unfortunately TieredMP hates these changes and will randomly (under beasting) 
> fail with an OB1 error: 
> {noformat}
> org.apache.lucene.index.TestTieredMergePolicy > testSimulateUpdates FAILED
> java.lang.AssertionError: numSegments=57, allowed=56.0
> at 
> __randomizedtesting.SeedInfo.seed([E79E5C317D63A1E9:73780B8AD33B297D]:0)
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.assertTrue(Assert.java:41)
> at 
> org.apache.lucene.index.TestTieredMergePolicy.assertSegmentInfos(TestTieredMergePolicy.java:88)
> at 
> org.apache.lucene.index.BaseMergePolicyTestCase.doTestSimulateUpdates(BaseMergePolicyTestCase.java:430)
> at 
> org.apache.lucene.index.TestTieredMergePolicy.testSimulateUpdates(TestTieredMergePolicy.java:719)
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.base/java.lang.reflect.Method.invoke(Method.java:567)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992)
> at 
> org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
> at 
> org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
> at 
> org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
> at 
> org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
> at 
> org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
> at 
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:370)
> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:819)
> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:470)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:951)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:836)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:887)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:898)
> at 
> org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
> at 
> com.carrotsearch.randomizedt

[jira] [Commented] (LUCENE-9208) Boolean Query with MatchNoDocs in MUST/FILTER could be rewritten

2020-02-05 Thread Lucene/Solr QA (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030732#comment-17030732
 ] 

Lucene/Solr QA commented on LUCENE-9208:


| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
21s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
23s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
23s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Release audit (RAT) {color} | 
{color:green}  0m 23s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Check forbidden APIs {color} | 
{color:green}  0m 23s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Validate source patterns {color} | 
{color:green}  0m 23s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m  
7s{color} | {color:green} core in the patch passed. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}  5m 20s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | LUCENE-9208 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12992694/LUCENE-9208.patch |
| Optional Tests |  compile  javac  unit  ratsources  checkforbiddenapis  
validatesourcepatterns  |
| uname | Linux lucene1-us-west 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 
10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
 |
| git revision | master / 47386f8cca9 |
| ant | version: Apache Ant(TM) version 1.10.5 compiled on March 28 2019 |
| Default Java | LTS |
|  Test Results | 
https://builds.apache.org/job/PreCommit-LUCENE-Build/252/testReport/ |
| modules | C: lucene/core U: lucene/core |
| Console output | 
https://builds.apache.org/job/PreCommit-LUCENE-Build/252/console |
| Powered by | Apache Yetus 0.7.0   http://yetus.apache.org |


This message was automatically generated.



> Boolean Query with MatchNoDocs in MUST/FILTER could be rewritten
> 
>
> Key: LUCENE-9208
> URL: https://issues.apache.org/jira/browse/LUCENE-9208
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Nirmal Chidambaram 
>Priority: Trivial
>  Labels: patch
> Attachments: LUCENE-9208.patch
>
>
> Currently BooleanQuery rewrites to MatchNoDocs query if MUST_NOT clause 
> contains MatchAllDocs . Same approach could be applied if FILTER/MUST clauses 
> contains MatchNoDocs 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] jpountz merged pull request #1240: SOLR-14242: HdfsDirectory#createTempOutput.

2020-02-05 Thread GitBox
jpountz merged pull request #1240: SOLR-14242: HdfsDirectory#createTempOutput.
URL: https://github.com/apache/lucene-solr/pull/1240
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14242) Implement HdfsDirectory#createTempOutput

2020-02-05 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030746#comment-17030746
 ] 

ASF subversion and git services commented on SOLR-14242:


Commit fe349ddcf2975fb7afe464574ff6120dd3f88b80 in lucene-solr's branch 
refs/heads/master from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=fe349dd ]

SOLR-14242: HdfsDirectory#createTempOutput. (#1240)



> Implement HdfsDirectory#createTempOutput
> 
>
> Key: SOLR-14242
> URL: https://issues.apache.org/jira/browse/SOLR-14242
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The HdfsDirectory doesn't implement createTempOutput, meaning it can't index 
> geo points, ranges, shapes. We should implement this method.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (SOLR-14242) Implement HdfsDirectory#createTempOutput

2020-02-05 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved SOLR-14242.
-
Fix Version/s: 8.5
   Resolution: Fixed

> Implement HdfsDirectory#createTempOutput
> 
>
> Key: SOLR-14242
> URL: https://issues.apache.org/jira/browse/SOLR-14242
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 8.5
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The HdfsDirectory doesn't implement createTempOutput, meaning it can't index 
> geo points, ranges, shapes. We should implement this method.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14242) Implement HdfsDirectory#createTempOutput

2020-02-05 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030750#comment-17030750
 ] 

ASF subversion and git services commented on SOLR-14242:


Commit d007470bda2f70ba4e1c407ac624e21288947128 in lucene-solr's branch 
refs/heads/branch_8x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=d007470 ]

SOLR-14242: HdfsDirectory#createTempOutput. (#1240)



> Implement HdfsDirectory#createTempOutput
> 
>
> Key: SOLR-14242
> URL: https://issues.apache.org/jira/browse/SOLR-14242
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 8.5
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The HdfsDirectory doesn't implement createTempOutput, meaning it can't index 
> geo points, ranges, shapes. We should implement this method.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9207) Don't build SpanQuery in QueryBuilder

2020-02-05 Thread Alan Woodward (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030773#comment-17030773
 ] 

Alan Woodward commented on LUCENE-9207:
---

LUCENE-7398 is triggered by precisely this situation, no?  You have a synonym 
mapping of 'gene -> genome sequence', and a search for `"human genome sequence 
reader"` won't find documents containing that exact phrase because of span 
minimization.

In terms of exponential expansion, we are at least guarded here by the fact 
that we build a boolean query to hold all the possible paths, and so there is 
the usual maxBooleanClauses protection.  If you have a heavily branching token 
stream then it's going to produce an unwieldy query whatever we do, really...
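
For illustration, here is a minimal sketch of the kind of graph phrase query being discussed: a hypothetical analyzer that maps 'gene -> genome sequence' via SynonymGraphFilter, fed to QueryBuilder#createPhraseQuery. The class name, field name and query text are illustrative assumptions, not code from this change:

{code:java}
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.CharsRef;
import org.apache.lucene.util.CharsRefBuilder;
import org.apache.lucene.util.QueryBuilder;

public class GraphPhraseSketch {
  public static void main(String[] args) throws IOException {
    // Multi-word synonym: "gene" also produces the two-token path "genome sequence",
    // so the token stream for the query text branches into a graph.
    SynonymMap.Builder b = new SynonymMap.Builder(true);
    b.add(new CharsRef("gene"),
        SynonymMap.Builder.join(new String[] {"genome", "sequence"}, new CharsRefBuilder()),
        true);
    SynonymMap synonyms = b.build();

    Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();
        TokenStream sink = new SynonymGraphFilter(source, synonyms, true);
        return new TokenStreamComponents(source, sink);
      }
    };

    // With slop=0, QueryBuilder used to emit a SpanNear/SpanOr combination over the graph;
    // with this change it should instead emit a disjunction of PhraseQuery, one phrase per path.
    Query q = new QueryBuilder(analyzer).createPhraseQuery("body", "human gene reader");
    System.out.println(q);
  }
}
{code}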

> Don't build SpanQuery in QueryBuilder
> -
>
> Key: LUCENE-9207
> URL: https://issues.apache.org/jira/browse/LUCENE-9207
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Subtask of LUCENE-9204.  QueryBuilder currently has special logic for graph 
> phrase queries with no slop, constructing a spanquery that attempts to follow 
> all paths using a combination of OR and NEAR queries.  Given the known bugs 
> in this type of query (LUCENE-7398) and that we would like to move span 
> queries out of core in any case, we should remove this logic and just build a 
> disjunction of phrase queries, one phrase per path.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9154) Remove encodeCeil() to encode bounding box queries

2020-02-05 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030789#comment-17030789
 ] 

Robert Muir commented on LUCENE-9154:
-

{quote}
But then I do not really understand why we are trying to match our custom 
numerical representation against full doubles. 
{quote}

Easy: the java language only supports double and float. It has casts and 
conversion rules around that so programmers don't hit surprises.

If we changed this field to simply encode a float, and used float data type, 
lucene wouldn't be creating any inaccuracy anywhere. The user's compiler would 
guide them and it would be intuitive. The tradeoff is loss of more precision 
(in exchange for expanded range which is not useful).

On the other hand, if a user wants to try that out, they can index 2D 
FloatPoint today and issue a 2D bounding box query against it very easily. 

Today the user passes double, but gets precision that is between a float and a 
double, using only the space of a float: that's how this field was designed, to 
specialize for a specific use-case. 

Such precision loss only needs to happen at index time, that is when it is 
stored. And it is transparent to the user (or developer debugging tests) 
because they can look at the docvalues field to see what the value became. 
There is no need to arbitrarily introduce more inaccuracy at query-time, in 
fact it is necessary NOT TO: tests can be exact and not have "fudge factors" 
and so on.
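
To make the index-time quantization concrete, a small sketch using GeoEncodingUtils from lucene core (the chosen latitude value is arbitrary):

{code:java}
import org.apache.lucene.geo.GeoEncodingUtils;

public class QuantizationSketch {
  public static void main(String[] args) {
    double lat = 89.999999;  // what the user passes in at index time

    // Index time: the double is quantized into a 32-bit integer representation.
    int encoded = GeoEncodingUtils.encodeLatitude(lat);
    // What the value "became" -- the same value you would see via the docvalues field.
    double stored = GeoEncodingUtils.decodeLatitude(encoded);
    System.out.println(lat + " is stored as " + stored);

    // encodeLatitudeCeil rounds toward the next representable value instead; it is what
    // newBoxQuery currently applies to the lower corner of a box, the behaviour under discussion.
    int ceiled = GeoEncodingUtils.encodeLatitudeCeil(lat);
    System.out.println("ceil-encoded: " + GeoEncodingUtils.decodeLatitude(ceiled));
  }
}
{code}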

> Remove encodeCeil()  to encode bounding box queries
> ---
>
> Key: LUCENE-9154
> URL: https://issues.apache.org/jira/browse/LUCENE-9154
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Ignacio Vera
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We currently have the following logic in LatLonPoint#newBoxquery():
> {code:java}
>  // exact double values of lat=90.0D and lon=180.0D must be treated special 
> as they are not represented in the encoding
> // and should not drag in extra bogus junk! TODO: should encodeCeil just 
> throw ArithmeticException to be less trappy here?
> if (minLatitude == 90.0) {
>   // range cannot match as 90.0 can never exist
>   return new MatchNoDocsQuery("LatLonPoint.newBoxQuery with 
> minLatitude=90.0");
> }
> if (minLongitude == 180.0) {
>   if (maxLongitude == 180.0) {
> // range cannot match as 180.0 can never exist
> return new MatchNoDocsQuery("LatLonPoint.newBoxQuery with 
> minLongitude=maxLongitude=180.0");
>   } else if (maxLongitude < minLongitude) {
> // encodeCeil() with dateline wrapping!
> minLongitude = -180.0;
>   }
> }
> byte[] lower = encodeCeil(minLatitude, minLongitude);
> byte[] upper = encode(maxLatitude, maxLongitude);
> {code}
>  
> IMO this is confusing and can lead to strange results. For example a 
> query with {{minLatitude = maxLatitude = 90}} does not match points with 
> {{latitude = 90}}. On the other hand a query with {{minLatitude = 
> maxLatitude = 89.9996}} will match points at latitude = 90.
> I don't really understand the statement that {{90.0 can never exist}}, as this 
> is also true for values > 89.9995809048, which is the maximum quantized 
> value. By this argument, it holds for all values between quantized coordinates, 
> as they do not exist in the index either, so why is 90D so special? I guess 
> because it cannot be rounded up (ceil'ed) without overflowing the encoding.
> Another argument to remove this function is that it opens the door to 
> false negatives in the query result. If a query has minLon = 
> 89.99957, it won't match points with longitude = 89.99957 as it is 
> rounded up to 89.9995809048.
> The only merit I can see in the current approach is that if you only index 
> points that are already quantized, then all queries would be exact. But does 
> it make sense for someone to only index quantized values and then query by 
> non-quantized bounding boxes?
>  
> I hope I am missing something, but my proposal is to remove encodeCeil 
> altogether and remove all the special handling at the positive pole and 
> positive dateline.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-12930) Add developer documentation to source repo

2020-02-05 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-12930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030799#comment-17030799
 ] 

Adrien Grand commented on SOLR-12930:
-

The smoketester has been failing for some time because of this change, see e.g. 
https://builds.apache.org/job/Lucene-Solr-SmokeRelease-master/1584. Am I 
correct that we're not expecting these guides to be included in the binary 
artifacts? If so we should fix the build to not include them, and otherwise fix 
the smoketester.

> Add developer documentation to source repo
> --
>
> Key: SOLR-12930
> URL: https://issues.apache.org/jira/browse/SOLR-12930
> Project: Solr
>  Issue Type: Improvement
>  Components: Tests
>Reporter: Mark Miller
>Priority: Major
> Attachments: solr-dev-docs.zip
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] jpountz merged pull request #1179: LUCENE-9147: Move the stored fields index off-heap.

2020-02-05 Thread GitBox
jpountz merged pull request #1179: LUCENE-9147: Move the stored fields index 
off-heap.
URL: https://github.com/apache/lucene-solr/pull/1179
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9147) Move the stored fields index off-heap

2020-02-05 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030849#comment-17030849
 ] 

ASF subversion and git services commented on LUCENE-9147:
-

Commit 136dcbdbbced7c2d32b4d244ca99ace2c59baee8 in lucene-solr's branch 
refs/heads/master from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=136dcbd ]

LUCENE-9147: Move the stored fields index off-heap. (#1179)

This replaces the index of stored fields and term vectors with two
`DirectMonotonic` arrays. `DirectMonotonicWriter` requires knowing the number
of values to write up-front, so incoming doc IDs and file pointers are buffered
on disk using temporary files that never get fsynced, but have index headers
and footers to make sure any corruption in these files wouldn't propagate to the
index.

`DirectMonotonicReader` gets a specialized `binarySearch` implementation that
leverages the metadata in order to avoid going to the IndexInput as often as
possible. Actually in the common case, it would only go to a single
sub `DirectReader` which, combined with the size of blocks of 1k values, helps
bound the number of page faults to 2.
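
For readers unfamiliar with these classes, a minimal round-trip sketch of the write/read pattern the commit message describes (file names, values and block shift here are arbitrary; this is a sketch, not the stored-fields code itself):

{code:java}
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;
import org.apache.lucene.store.RandomAccessInput;
import org.apache.lucene.util.LongValues;
import org.apache.lucene.util.packed.DirectMonotonicReader;
import org.apache.lucene.util.packed.DirectMonotonicWriter;

public class DirectMonotonicSketch {
  public static void main(String[] args) throws Exception {
    long[] startPointers = {0, 130, 255, 1024, 2048};  // monotonically increasing, e.g. block start file pointers
    int blockShift = 10;                               // 2^10 = 1024 values per block, as in the change above

    try (Directory dir = new ByteBuffersDirectory()) {
      // Write side: the number of values must be known up-front.
      try (IndexOutput meta = dir.createOutput("meta", IOContext.DEFAULT);
           IndexOutput data = dir.createOutput("data", IOContext.DEFAULT)) {
        DirectMonotonicWriter writer =
            DirectMonotonicWriter.getInstance(meta, data, startPointers.length, blockShift);
        for (long v : startPointers) {
          writer.add(v);
        }
        writer.finish();
      }

      // Read side: load the small metadata, then random-access the data without heap-resident arrays.
      try (IndexInput meta = dir.openInput("meta", IOContext.DEFAULT);
           IndexInput data = dir.openInput("data", IOContext.DEFAULT)) {
        DirectMonotonicReader.Meta loadedMeta =
            DirectMonotonicReader.loadMeta(meta, startPointers.length, blockShift);
        RandomAccessInput slice = data.randomAccessSlice(0, data.length());
        LongValues values = DirectMonotonicReader.getInstance(loadedMeta, slice);
        System.out.println(values.get(3));  // prints 1024
      }
    }
  }
}
{code}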


> Move the stored fields index off-heap
> -
>
> Key: LUCENE-9147
> URL: https://issues.apache.org/jira/browse/LUCENE-9147
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Now that the terms index is off-heap by default, it's almost embarrassing 
> that many indices spend most of their memory usage on the stored fields index 
> or the term vectors index, which are much less performance-sensitive than the 
> terms index. We should move them off-heap too?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14241) Streaming Expression for deleting documents by IDs (from tuples)

2020-02-05 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030868#comment-17030868
 ] 

ASF subversion and git services commented on SOLR-14241:


Commit c5d0391df9c821dc842287d8c769c6f73275a423 in lucene-solr's branch 
refs/heads/master from Chris M. Hostetter
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c5d0391 ]

SOLR-14241: New delete() Stream Decorator


> Streaming Expression for deleting documents by IDs (from tuples)
> 
>
> Key: SOLR-14241
> URL: https://issues.apache.org/jira/browse/SOLR-14241
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: streaming expressions
>Reporter: Chris M. Hostetter
>Assignee: Chris M. Hostetter
>Priority: Major
> Attachments: SOLR-14241.patch
>
>
> Streaming expressions currently supports an {{update(...)}} decorator 
> function for wrapping another stream and treating each Tuple from the inner 
> stream as a document to be added to an index.
> I've implemented an analogous subclass of the {{UpdateStream}} called 
> {{DeleteStream}} that uses the tuples from the inner stream to identify the 
> uniqueKeys of documents that should be deleted.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-14058) AIOOBE in PeerSync

2020-02-05 Thread Yonik Seeley (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley updated SOLR-14058:

   Attachment: SOLR-14058.patch
Affects Version/s: master (9.0)
 Assignee: Yonik Seeley
   Status: Open  (was: Open)

We don't have a test that tickles this bug, but after reviewing the code, the 
fix (attached) is relatively straightforward.  

otherUpdatesIndex is initialized to be less than otherVersions.size(), and it 
is only ever decremented (and otherVersions is not modified), hence the correct 
check is otherUpdatesIndex >= 0.
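
To make the reasoning concrete, a hypothetical sketch of the loop shape described above (the variable names follow the comment; this is illustrative only, not the actual PeerSync source):

{code:java}
// Hypothetical illustration only -- not the actual PeerSync code.
// otherUpdatesIndex starts below otherVersions.size() and is only ever decremented,
// and otherVersions is never modified, so the correct lower-bound check is >= 0.
static int walkOtherVersions(java.util.List<Long> otherVersions) {
  int walked = 0;
  int otherUpdatesIndex = otherVersions.size() - 1;
  while (otherUpdatesIndex >= 0) {   // the fix: stop at index 0 rather than letting the index reach -1
    long version = otherVersions.get(otherUpdatesIndex);  // always within [0, size), so no IOOBE
    // ... compare 'version' against our own recent updates here ...
    otherUpdatesIndex--;             // only ever decremented
    walked++;
  }
  return walked;
}
{code}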


> AIOOBE in PeerSync
> --
>
> Key: SOLR-14058
> URL: https://issues.apache.org/jira/browse/SOLR-14058
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 8.3, master (9.0)
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
>Priority: Major
> Attachments: SOLR-14058.patch
>
>
> We hit an exception with 8.3 that someone else also hit on stackoverflow:
> https://stackoverflow.com/questions/58891563/problem-in-syncing-replicas-with-solr-8-3-with-zookeeper-3-5-6
> {quote}
> I recently converted a solr 7.x + zookeeper 3.4.14 to solr 8.3 + zk 3.5.6, 
> and depending on how I start the solr nodes I'm getting a sync exception.
> My setup uses 3 zk nodes and 2 solr nodes (let's call it A and B). The 
> collection that has this problem has 1 shard and 2 replicas. I've noticed 2 
> situations: (1) which works fine and (2) which does not work.
> 1) This works: I start solr node A, and wait until its replica is elected 
> leader ("green" in the Solr interface 'Cloud'->'Graph') - which takes about 2 
> min; and only then start solr node B. Both replicas are active and the one in 
> A is the leader.
> 2) This does NOT work: I start solr node A, and a few secs after I start solr 
> node B (that is, before the 'A' replica is elected leader - still "Down" in 
> the solr interface). In this case I get the following exception:
> ERROR (coreZkRegister-1-thread-2-processing-n:192.168.15.20:8986_solr 
> x:alldata_shard1_replica_n1 c:alldata s:shard1 r:core_node3) [c:alldata 
> s:shard1 r:core_node3 x:alldata_shard1_replica_n1] o.a.s.c.SyncStrategy Sync 
> Failed:java.lang.IndexOutOfBoundsException: Index -1 out of bounds for length 
> 99
> It seems that if both solr node are started soon after each other, then ZK 
> cannot elect one as leader. This error only appears in the solr.log of node 
> A, even if I invert the order of starting nodes.
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-14058) IOOBE in PeerSync

2020-02-05 Thread Yonik Seeley (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley updated SOLR-14058:

Summary: IOOBE in PeerSync  (was: AIOOBE in PeerSync)

> IOOBE in PeerSync
> -
>
> Key: SOLR-14058
> URL: https://issues.apache.org/jira/browse/SOLR-14058
> Project: Solr
>  Issue Type: Bug
>Affects Versions: master (9.0), 8.3
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
>Priority: Major
> Attachments: SOLR-14058.patch
>
>
> We hit an exception with 8.3 that someone else also hit on stackoverflow:
> https://stackoverflow.com/questions/58891563/problem-in-syncing-replicas-with-solr-8-3-with-zookeeper-3-5-6
> {quote}
> I recently converted a solr 7.x + zookeeper 3.4.14 to solr 8.3 + zk 3.5.6, 
> and depending on how I start the solr nodes I'm getting a sync exception.
> My setup uses 3 zk nodes and 2 solr nodes (let's call it A and B). The 
> collection that has this problem has 1 shard and 2 replicas. I've noticed 2 
> situations: (1) which works fine and (2) which does not work.
> 1) This works: I start solr node A, and wait until its replica is elected 
> leader ("green" in the Solr interface 'Cloud'->'Graph') - which takes about 2 
> min; and only then start solr node B. Both replicas are active and the one in 
> A is the leader.
> 2) This does NOT work: I start solr node A, and a few secs after I start solr 
> node B (that is, before the 'A' replica is elected leader - still "Down" in 
> the solr interface). In this case I get the following exception:
> ERROR (coreZkRegister-1-thread-2-processing-n:192.168.15.20:8986_solr 
> x:alldata_shard1_replica_n1 c:alldata s:shard1 r:core_node3) [c:alldata 
> s:shard1 r:core_node3 x:alldata_shard1_replica_n1] o.a.s.c.SyncStrategy Sync 
> Failed:java.lang.IndexOutOfBoundsException: Index -1 out of bounds for length 
> 99
> It seems that if both solr node are started soon after each other, then ZK 
> cannot elect one as leader. This error only appears in the solr.log of node 
> A, even if I invert the order of starting nodes.
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9207) Don't build SpanQuery in QueryBuilder

2020-02-05 Thread Michael Gibney (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030899#comment-17030899
 ] 

Michael Gibney commented on LUCENE-9207:


Thanks, [~romseygeek]. You're right, I was wrong about LUCENE-7398; in its 
current state (and any state that doesn't implement backtracking) 
SpanNearQuery/SpanOrQuery can potentially miss matches in cases like the one 
you describe, even when slop=0. So that's surely still a bug, which would 
indeed be addressed by this change (so to be explicit, that makes me +1 to this 
change).

Yes, maxBooleanClauses is a good failsafe, but I think it's worth specifically 
calling attention to the possibility that, for some analyzer configurations and 
inputs, queries will now fail differently (and more consistently and 
transparently) than they failed before, when they silently missed matches under 
certain conditions.

bq.If you have a heavily branching token stream then it's going to produce an 
unwieldy query whatever we do, really...

True in some ways; but the characteristics of the implementations do vary 
fundamentally, so it's not really 6-of-1, half-dozen-of-another. A complete 
nested SpanQuery (as proposed for LUCENE-7398, or analogous Intervals) 
implementation has the potential to be significantly more efficient than 
MultiPhraseQuery or its analogous Intervals impl (which expand all possible 
variants up front).

> Don't build SpanQuery in QueryBuilder
> -
>
> Key: LUCENE-9207
> URL: https://issues.apache.org/jira/browse/LUCENE-9207
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Subtask of LUCENE-9204.  QueryBuilder currently has special logic for graph 
> phrase queries with no slop, constructing a spanquery that attempts to follow 
> all paths using a combination of OR and NEAR queries.  Given the known bugs 
> in this type of query (LUCENE-7398) and that we would like to move span 
> queries out of core in any case, we should remove this logic and just build a 
> disjunction of phrase queries, one phrase per path.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] andyvuong commented on a change in pull request #1188: SOLR-14044: Support collection and shard deletion in shared storage

2020-02-05 Thread GitBox
andyvuong commented on a change in pull request #1188: SOLR-14044: Support 
collection and shard deletion in shared storage
URL: https://github.com/apache/lucene-solr/pull/1188#discussion_r375441526
 
 

 ##
 File path: 
solr/core/src/java/org/apache/solr/cloud/api/collections/DeleteCollectionCmd.java
 ##
 @@ -142,6 +148,34 @@ public void call(ClusterState state, ZkNodeProps message, 
NamedList results) thr
   break;
 }
   }
+  
+  // Delete the collection files from shared store. We want to delete all 
of the files before we delete
+  // the collection state from ZooKeeper.
+  DocCollection docCollection = 
zkStateReader.getClusterState().getCollectionOrNull(collection);
+  if (docCollection != null && docCollection.getSharedIndex()) {
+SharedStoreManager sharedStoreManager = 
ocmh.overseer.getCoreContainer().getSharedStoreManager();
+BlobDeleteManager deleteManager = 
sharedStoreManager.getBlobDeleteManager();
+BlobDeleteProcessor deleteProcessor = 
deleteManager.getOverseerDeleteProcessor();
+// deletes all files belonging to this collection
+CompletableFuture deleteFuture = 
+deleteProcessor.deleteCollection(collection, false);
+
+try {
+  // TODO: Find a reasonable timeout value
+  BlobDeleterTaskResult result = deleteFuture.get(60, 
TimeUnit.SECONDS);
+  if (!result.isSuccess()) {
+log.warn("Deleting all files belonging to shared collection " + 
collection + 
+" was not successful! Files belonging to this collection may 
be orphaned.");
+  }
+} catch (TimeoutException tex) {
+  // We can orphan files here if we don't delete everything in time 
but what matters for potentially
+  // reusing the collection name is that the zookeeper state of the 
collection gets deleted which 
+  // will happen in the finally block
+  throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, "Could 
not complete deleting collection" + 
 
 Review comment:
   I throw an exception in both cases so the calling client is aware that the 
command failed and files may be orphaned. In DeleteCollection the files are "truly 
orphaned" because, even if we error out here, the collection will always be 
deleted from ZooKeeper in the finally block and is effectively gone from 
Solr's perspective. DeleteShard, on the other hand, fails the whole command without 
deleting the shard state from ZooKeeper, so a subsequent delete shard command can 
be called to try again; that isn't the case in the former (unless the same 
collection name is created again).


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] andyvuong commented on a change in pull request #1188: SOLR-14044: Support collection and shard deletion in shared storage

2020-02-05 Thread GitBox
andyvuong commented on a change in pull request #1188: SOLR-14044: Support 
collection and shard deletion in shared storage
URL: https://github.com/apache/lucene-solr/pull/1188#discussion_r375442318
 
 

 ##
 File path: 
solr/core/src/test/org/apache/solr/store/blob/process/BlobDeleteProcessorTest.java
 ##
 @@ -0,0 +1,472 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.solr.store.blob.process;
+
+import java.nio.file.Path;
+import java.util.Collection;
+import java.util.HashSet;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Set;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.CountDownLatch;
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.solr.SolrTestCaseJ4;
+import org.apache.solr.store.blob.client.BlobException;
+import org.apache.solr.store.blob.client.CoreStorageClient;
+import org.apache.solr.store.blob.client.LocalStorageClient;
+import 
org.apache.solr.store.blob.process.BlobDeleterTask.BlobDeleterTaskResult;
+import org.apache.solr.store.blob.process.BlobDeleterTask.BlobFileDeletionTask;
+import 
org.apache.solr.store.blob.process.BlobDeleterTask.BlobPrefixedFileDeletionTask;
+import org.junit.Before;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+/**
+ * Unit tests for {@link BlobDeleteProcessor}
+ */
+public class BlobDeleteProcessorTest extends SolrTestCaseJ4 {
+  
+  private static String DEFAULT_PROCESSOR_NAME = "DeleterForTest";
+  private static Path sharedStoreRootPath;
+  private static CoreStorageClient blobClient;
+  
+  private static List enqueuedTasks;
+
+  @BeforeClass
+  public static void setupTestClass() throws Exception {
+sharedStoreRootPath = createTempDir("tempDir");
+
System.setProperty(LocalStorageClient.BLOB_STORE_LOCAL_FS_ROOT_DIR_PROPERTY, 
sharedStoreRootPath.resolve("LocalBlobStore/").toString());
+blobClient = new LocalStorageClient() {
+   
+  // no ops for BlobFileDeletionTask and BlobPrefixedFileDeletionTask to 
execute successfully
+  @Override
+  public void deleteBlobs(Collection paths) throws BlobException {
+return;
+  }
+
+  // no ops for BlobFileDeletionTask and BlobPrefixedFileDeletionTask to 
execute successfully
+  @Override
+  public List listCoreBlobFiles(String prefix) throws 
BlobException {
+return new LinkedList<>();
+  }
+};
+  }
+  
+  @Before
+  public void setup() {
+enqueuedTasks = new LinkedList();
+  }
+  
+  /**
+   * Verify we enqueue a {@link BlobFileDeletionTask} with the correct 
parameters.
+   * Note we're not testing the functionality of the deletion task here only 
that the processor successfully
+   * handles the task. End to end blob deletion tests can be found {@link 
SharedStoreDeletionProcessTest} 
+   */
+  @Test
+  public void testDeleteFilesEnqueueTask() throws Exception {
+int maxQueueSize = 3;
+int numThreads = 1;
+int defaultMaxAttempts = 5;
+int retryDelay = 500; 
+String name = "testName";
+
+BlobDeleteProcessor processor = 
buildBlobDeleteProcessorForTest(enqueuedTasks, blobClient,
+maxQueueSize, numThreads, defaultMaxAttempts, retryDelay);
+Set names = new HashSet<>();
+names.add("test1");
+names.add("test2");
+// uses the specified defaultMaxAttempts at the processor (not task) level 
+CompletableFuture cf = processor.deleteFiles(name, 
names, true);
+// wait for this task and all its potential retries to finish
+BlobDeleterTaskResult res = cf.get(5000, TimeUnit.MILLISECONDS);
+assertEquals(1, enqueuedTasks.size());
+
+assertEquals(1, enqueuedTasks.size());
+assertNotNull(res);
+assertEquals(1, res.getTask().getAttempts());
+assertEquals(true, res.isSuccess());
+assertEquals(false, res.shouldRetry());
+
+processor.shutdown();
+  }
+  
+  /**
+   * Verify we enqueue a {@link BlobPrefixedFileDeletionTask} with the correct 
parameters.
+   * Note we're not testing the functionality of the deletion task here only 
that the processor successfully
+   * handles the task. End to end blob deletion tests can be found {@link 
SharedStoreDe

[jira] [Commented] (LUCENE-9147) Move the stored fields index off-heap

2020-02-05 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030916#comment-17030916
 ] 

ASF subversion and git services commented on LUCENE-9147:
-

Commit 597141df6b6a017fced16ec27b8fd180e9a6fcc2 in lucene-solr's branch 
refs/heads/branch_8x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=597141d ]

LUCENE-9147: Move the stored fields index off-heap. (#1179)

This replaces the index of stored fields and term vectors with two
`DirectMonotonic` arrays. `DirectMonotonicWriter` requires knowing the number
of values to write up-front, so incoming doc IDs and file pointers are buffered
on disk using temporary files that never get fsynced, but have index headers
and footers to make sure any corruption in these files wouldn't propagate to the
index.

`DirectMonotonicReader` gets a specialized `binarySearch` implementation that
leverages the metadata in order to avoid going to the IndexInput as often as
possible. Actually in the common case, it would only go to a single
sub `DirectReader` which, combined with the size of blocks of 1k values, helps
bound the number of page faults to 2.


> Move the stored fields index off-heap
> -
>
> Key: LUCENE-9147
> URL: https://issues.apache.org/jira/browse/LUCENE-9147
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Now that the terms index is off-heap by default, it's almost embarrassing 
> that many indices spend most of their memory usage on the stored fields index 
> or the term vectors index, which are much less performance-sensitive than the 
> terms index. We should move them off-heap too?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-5146) Figure out what it would take for lazily-loaded cores to play nice with SolrCloud

2020-02-05 Thread Ilan Ginzburg (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-5146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030919#comment-17030919
 ] 

Ilan Ginzburg commented on SOLR-5146:
-

Isn't a fundamental difference between SolrCloud and standalone Solr that, if a 
given slice (shard) is not loaded anywhere and a node receives a request for it, 
the node can load/open its local copy of that core just fine (let's assume it 
can, since this works in standalone), but it then can't immediately get the 
shard leader election done, since the other nodes are not currently 
participating for that slice?

> Figure out what it would take for lazily-loaded cores to play nice with 
> SolrCloud
> -
>
> Key: SOLR-5146
> URL: https://issues.apache.org/jira/browse/SOLR-5146
> Project: Solr
>  Issue Type: Improvement
>  Components: SolrCloud
>Affects Versions: 4.5, 6.0
>Reporter: Erick Erickson
>Assignee: David Smiley
>Priority: Major
>
> The whole lazy-load core thing was implemented with non-SolrCloud use-cases 
> in mind. There are several user-list threads that ask about using lazy cores 
> with SolrCloud, especially in multi-tenant use-cases.
> This is a marker JIRA to investigate what it would take to make lazy-load 
> cores play nice with SolrCloud. It's especially interesting how this all 
> works with shards, replicas, leader election, recovery, etc.
> NOTE: This is pretty much totally unexplored territory. It may be that a few 
> trivial modifications are all that's needed. OTOH, it may be that we'd have 
> to rip apart SolrCloud to handle this case. Until someone dives into the 
> code, we don't know.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13887) socketTimeout of 0 causing timeouts in the Http2SolrClient

2020-02-05 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030936#comment-17030936
 ] 

ASF subversion and git services commented on SOLR-13887:


Commit 80ed8c281b354884561b8f1edbadb5e369de3d52 in lucene-solr's branch 
refs/heads/master from Houston Putman
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=80ed8c2 ]

SOLR-13887: Use the default idleTimeout instead of 0 for HTTP2 (#991)



> socketTimeout of 0 causing timeouts in the Http2SolrClient
> --
>
> Key: SOLR-13887
> URL: https://issues.apache.org/jira/browse/SOLR-13887
> Project: Solr
>  Issue Type: Bug
>  Components: http2
>Affects Versions: 8.0, master (9.0)
>Reporter: Houston Putman
>Assignee: Houston Putman
>Priority: Minor
> Fix For: master (9.0), 8.5
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In Solr 7 and previous versions, both the *socketTimeout* and 
> *connTimeout* defaults in _solr.xml_ have accepted 0 as values. This is even 
> [documented in the ref 
> guide|https://lucene.apache.org/solr/guide/8_2/format-of-solr-xml.html#defining-solr-xml].
>  Using these same defaults with Solr 8 results in timeouts when trying to 
> manually create replicas. The major change here seems to be that the 
> Http2SolrClient is being used instead of the HttpSolrClient used in Solr 7 
> and previous versions.
> After some digging, I think that the issue lies in the Http2SolrClient, 
> [specifically 
> here|https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/client/solrj/impl/Http2SolrClient.java#L399].
>  Since the idleTimeout is set to 0, which is what Solr pulls from the 
> solr.xml, the listener immediately responds with a timeout.
> The fix here is pretty simple, just set a default if 0 is provided. Basically 
> treat an idleTimeout (or socketTimeout) of 0 the same as null. The ref-guide 
> should also likely be updated with the same defaults as used in the solr.xml 
> packaged in Solr.
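
A minimal sketch of the "treat 0 the same as null/unset" fix described above (the method name and default value are illustrative assumptions, not the actual Http2SolrClient code):

{code:java}
// Illustrative only -- not the actual Http2SolrClient code.
static int effectiveIdleTimeout(int configuredIdleTimeoutMillis) {
  final int defaultIdleTimeoutMillis = 120_000;  // some sane non-zero default
  // A configured value of 0 (as the older solr.xml defaults allowed) is treated like "unset".
  return configuredIdleTimeoutMillis <= 0 ? defaultIdleTimeoutMillis : configuredIdleTimeoutMillis;
}
{code}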



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] HoustonPutman merged pull request #991: SOLR-13887: Use default instead of idleTimeouts of 0 for HTTP2 requests

2020-02-05 Thread GitBox
HoustonPutman merged pull request #991: SOLR-13887: Use default instead of 
idleTimeouts of 0 for HTTP2 requests
URL: https://github.com/apache/lucene-solr/pull/991
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] mbwaheed commented on a change in pull request #1188: SOLR-14044: Support collection and shard deletion in shared storage

2020-02-05 Thread GitBox
mbwaheed commented on a change in pull request #1188: SOLR-14044: Support 
collection and shard deletion in shared storage
URL: https://github.com/apache/lucene-solr/pull/1188#discussion_r375459115
 
 

 ##
 File path: 
solr/core/src/java/org/apache/solr/cloud/api/collections/DeleteCollectionCmd.java
 ##
 @@ -142,6 +148,34 @@ public void call(ClusterState state, ZkNodeProps message, 
NamedList results) thr
   break;
 }
   }
+  
+  // Delete the collection files from shared store. We want to delete all 
of the files before we delete
+  // the collection state from ZooKeeper.
+  DocCollection docCollection = 
zkStateReader.getClusterState().getCollectionOrNull(collection);
+  if (docCollection != null && docCollection.getSharedIndex()) {
+SharedStoreManager sharedStoreManager = 
ocmh.overseer.getCoreContainer().getSharedStoreManager();
+BlobDeleteManager deleteManager = 
sharedStoreManager.getBlobDeleteManager();
+BlobDeleteProcessor deleteProcessor = 
deleteManager.getOverseerDeleteProcessor();
+// deletes all files belonging to this collection
+CompletableFuture deleteFuture = 
+deleteProcessor.deleteCollection(collection, false);
+
+try {
+  // TODO: Find a reasonable timeout value
+  BlobDeleterTaskResult result = deleteFuture.get(60, 
TimeUnit.SECONDS);
+  if (!result.isSuccess()) {
+log.warn("Deleting all files belonging to shared collection " + 
collection + 
+" was not successful! Files belonging to this collection may 
be orphaned.");
+  }
+} catch (TimeoutException tex) {
+  // We can orphan files here if we don't delete everything in time 
but what matters for potentially
+  // reusing the collection name is that the zookeeper state of the 
collection gets deleted which 
+  // will happen in the finally block
+  throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, "Could 
not complete deleting collection" + 
 
 Review comment:
   There is an inconsistency: in case of a timeout we throw, but in case of other 
failures we log a warning. I am fine with throwing an exception, but it needs to 
be consistent for all failures. Is there a reason for this inconsistency?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] mbwaheed commented on a change in pull request #1188: SOLR-14044: Support collection and shard deletion in shared storage

2020-02-05 Thread GitBox
mbwaheed commented on a change in pull request #1188: SOLR-14044: Support 
collection and shard deletion in shared storage
URL: https://github.com/apache/lucene-solr/pull/1188#discussion_r375460635
 
 

 ##
 File path: 
solr/core/src/test/org/apache/solr/store/blob/process/BlobDeleteProcessorTest.java
 ##
 @@ -0,0 +1,472 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.solr.store.blob.process;
+
+import java.nio.file.Path;
+import java.util.Collection;
+import java.util.HashSet;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Set;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.CountDownLatch;
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.solr.SolrTestCaseJ4;
+import org.apache.solr.store.blob.client.BlobException;
+import org.apache.solr.store.blob.client.CoreStorageClient;
+import org.apache.solr.store.blob.client.LocalStorageClient;
+import 
org.apache.solr.store.blob.process.BlobDeleterTask.BlobDeleterTaskResult;
+import org.apache.solr.store.blob.process.BlobDeleterTask.BlobFileDeletionTask;
+import 
org.apache.solr.store.blob.process.BlobDeleterTask.BlobPrefixedFileDeletionTask;
+import org.junit.Before;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+/**
+ * Unit tests for {@link BlobDeleteProcessor}
+ */
+public class BlobDeleteProcessorTest extends SolrTestCaseJ4 {
+  
+  private static String DEFAULT_PROCESSOR_NAME = "DeleterForTest";
+  private static Path sharedStoreRootPath;
+  private static CoreStorageClient blobClient;
+  
+  private static List enqueuedTasks;
+
+  @BeforeClass
+  public static void setupTestClass() throws Exception {
+sharedStoreRootPath = createTempDir("tempDir");
+
System.setProperty(LocalStorageClient.BLOB_STORE_LOCAL_FS_ROOT_DIR_PROPERTY, 
sharedStoreRootPath.resolve("LocalBlobStore/").toString());
+blobClient = new LocalStorageClient() {
+   
+  // no ops for BlobFileDeletionTask and BlobPrefixedFileDeletionTask to 
execute successfully
+  @Override
+  public void deleteBlobs(Collection paths) throws BlobException {
+return;
+  }
+
+  // no ops for BlobFileDeletionTask and BlobPrefixedFileDeletionTask to 
execute successfully
+  @Override
+  public List listCoreBlobFiles(String prefix) throws 
BlobException {
+return new LinkedList<>();
+  }
+};
+  }
+  
+  @Before
+  public void setup() {
+enqueuedTasks = new LinkedList();
+  }
+  
+  /**
+   * Verify we enqueue a {@link BlobFileDeletionTask} with the correct 
parameters.
+   * Note we're not testing the functionality of the deletion task here only 
that the processor successfully
+   * handles the task. End to end blob deletion tests can be found {@link 
SharedStoreDeletionProcessTest} 
+   */
+  @Test
+  public void testDeleteFilesEnqueueTask() throws Exception {
+int maxQueueSize = 3;
+int numThreads = 1;
+int defaultMaxAttempts = 5;
+int retryDelay = 500; 
+String name = "testName";
+
+BlobDeleteProcessor processor = 
buildBlobDeleteProcessorForTest(enqueuedTasks, blobClient,
+maxQueueSize, numThreads, defaultMaxAttempts, retryDelay);
+    Set<String> names = new HashSet<>();
+names.add("test1");
+names.add("test2");
+// uses the specified defaultMaxAttempts at the processor (not task) level 
+    CompletableFuture<BlobDeleterTaskResult> cf =
+        processor.deleteFiles(name, names, true);
+// wait for this task and all its potential retries to finish
+BlobDeleterTaskResult res = cf.get(5000, TimeUnit.MILLISECONDS);
+assertEquals(1, enqueuedTasks.size());
+
+assertEquals(1, enqueuedTasks.size());
+assertNotNull(res);
+assertEquals(1, res.getTask().getAttempts());
+assertEquals(true, res.isSuccess());
+assertEquals(false, res.shouldRetry());
+
+processor.shutdown();
+  }
+  
+  /**
+   * Verify we enqueue a {@link BlobPrefixedFileDeletionTask} with the correct 
parameters.
+   * Note we're not testing the functionality of the deletion task here only 
that the processor successfully
+   * handles the task. End to end blob deletion tests can be found {@link 
SharedStoreDel

[jira] [Commented] (SOLR-5146) Figure out what it would take for lazily-loaded cores to play nice with SolrCloud

2020-02-05 Thread Erick Erickson (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-5146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030952#comment-17030952
 ] 

Erick Erickson commented on SOLR-5146:
--

[~murblanc] That's certainly one issue. Even if efficiently getting a leader 
for a completely unloaded shard is solved, the question of how to keep the core 
in sync is a sticky wicket. Say even one replica of a shard is unloaded and it 
gets loaded. How is the core synched before doing anything? If replicas are 
coming and going all the time, do we wind up doing full synchronizations 
(assuming the leader problem is solved)? In the case of, say, 200G indexes for 
a given replica, that's very expensive.

Core loading from a cold start is a very heavyweight operation. It may be that 
we need some intermediate state where we can free up lots of resources but keep 
the core kind of loaded, mostly so it could be woken up nearly instantly, say 
the equivalent of opening a new searcher.

Leader election is really all about ensuring that the index is up to date. So 
I've wondered about a state for a replica that's "index only" rather than 
unloaded, the idea is that that way it's always up to date and can (almost) 
instantly assume leadership, but doesn't consume the heavier-weight resources. 
Then it could be brought online without having to sync from the leader. And 
then "somehow" combine it with autoscaling-like functionality, when they query 
rate exceeded X, bring another replica from index-only to serving searchers. 
That'd take untangling what's necessary for indexing and what's necessary for 
searching so they were relatively independent.

But I'll leave that for David to struggle with...

> Figure out what it would take for lazily-loaded cores to play nice with 
> SolrCloud
> -
>
> Key: SOLR-5146
> URL: https://issues.apache.org/jira/browse/SOLR-5146
> Project: Solr
>  Issue Type: Improvement
>  Components: SolrCloud
>Affects Versions: 4.5, 6.0
>Reporter: Erick Erickson
>Assignee: David Smiley
>Priority: Major
>
> The whole lazy-load core thing was implemented with non-SolrCloud use-cases 
> in mind. There are several user-list threads that ask about using lazy cores 
> with SolrCloud, especially in multi-tenant use-cases.
> This is a marker JIRA to investigate what it would take to make lazy-load 
> cores play nice with SolrCloud. It's especially interesting how this all 
> works with shards, replicas, leader election, recovery, etc.
> NOTE: This is pretty much totally unexplored territory. It may be that a few 
> trivial modifications are all that's needed. OTOH, It may be that we'd have 
> to rip apart SolrCloud to handle this case. Until someone dives into the 
> code, we don't know.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] msokolov commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves

2020-02-05 Thread GitBox
msokolov commented on a change in pull request #1235: LUCENE-8929: parallel 
early termination sharing counts and max scores across leaves
URL: https://github.com/apache/lucene-solr/pull/1235#discussion_r375474191
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/search/MaxScoreTerminator.java
 ##
 @@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * MaxScoreTerminator is notified periodically by leaf collectors calling {@link #update}
+ * with their worst (ie maximum) score and how many hits they have collected.  When enough
+ * hits are collected, MaxScoreTerminator notifies noncompetitive leaf collectors that they
+ * can stop (early terminate) by returning true from its {@link #update} method.
+ * 
+ * At any moment, N leaves have reported their counts of documents 
collected; documents are
+ * collected in score order, so these counts represent the best for each leaf. 
And we also track
+ * the scores of the lowest-scoring (most recently collected) document in each 
leaf.
+ *
+ * Once the total number of documents collected reaches the requested total 
(numHits),
+ * the worst-scoring leaf can no longer contribute any documents to the 
results, so it can be
+ * terminated, and any leaves whose scores rise above that worst score are no longer competitive and
+ * can also be terminated. If we kept a global priority queue we could update 
the global maximum
+ * competitive score, and use that as a termination threshold, but assuming 
this to be too costly
+ * due to thread contention, we seek to more cheaply update an upper bound on 
the worst score.
+ * Specifically, when a leaf is terminated, if the remaining leaves together 
have collected >= numHits,
+ * then we can update the maximum to the max of *their* max scores, excluding 
the terminated leaf's max
+ * from consideration.
+ *
+ *  In practice this leads to a good bound on the number of documents 
collected, which tend to
+ * exceed numHits by a small factor.  When the documents are evenly 
distributed among N segments,
+ * we expect to collect approximately ((N+1)/N) * numHits documents. In a worst 
case, where *all*
+ * the best documents are in a single segment, we expect to collect something 
O(log N) ie
+ * (1/N + 1/(N-1) + ... + 1) * numHits documents, which is still much better than
+ * the N * numHits we would collect with a naive strategy.
+ * 
+ */
+class MaxScoreTerminator {
+
+  // we use 2^5-1 to check the remainder with a bitwise operation
+  // private static final int DEFAULT_INTERVAL_BITS = 10;
+  private static final int DEFAULT_INTERVAL_BITS = 2;
+  private int interval;
+  int intervalMask;
+
+  /** The worst score for each leaf */
+  private final List leafStates;
+
+  /** The total number of docs to collect: from the Collector's numHits */
+  final int totalToCollect;
+
+  /** An upper bound on the number of docs "excluded" from max-score 
accounting due to early termination. */
+  private int numExcludedBound;
+
+  /** A lower bound on the total hits collected by all leaves */
+  int totalCollected;
+
+  /** the worst hit over all */
+  LeafState leafState;
+
+  MaxScoreTerminator(int totalToCollect) {
+leafStates = new ArrayList<>();
+this.totalToCollect = totalToCollect;
+setIntervalBits(DEFAULT_INTERVAL_BITS);
+  }
+
+  synchronized LeafState add() {
+LeafState newLeafState = new LeafState();
+leafStates.add(newLeafState);
+if (leafState == null) {
+  leafState = newLeafState;
+}
+return newLeafState;
+  }
+
+  // for testing
+  void setIntervalBits(int bitCount) {
+interval = 1 << bitCount;
+intervalMask = interval - 1;
+  }
+
+  /**
+   * Called by leaf collectors periodically to update their progress.
+   * @param newLeafState the leaf collector's current lowest score
+   * @return whether the collector should terminate
+   */
+  synchronized boolean update(LeafState newLeafState) {
+totalCollected += interval;
+//System.out.println(" scoreboard totalCollected = " + totalCollected + 
"/" + totalToCollect + " "
+//  + newLeafState + " ? " + lea

[jira] [Resolved] (SOLR-13552) Add recNum Stream Evaluator

2020-02-05 Thread Joel Bernstein (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Bernstein resolved SOLR-13552.
---
Resolution: Resolved

> Add recNum Stream Evaluator
> ---
>
> Key: SOLR-13552
> URL: https://issues.apache.org/jira/browse/SOLR-13552
> Project: Solr
>  Issue Type: New Feature
>  Components: streaming expressions
>Reporter: Joel Bernstein
>Assignee: Joel Bernstein
>Priority: Minor
> Fix For: 8.2
>
> Attachments: SOLR-13552.patch
>
>
> The *recNum* Stream Evaluator will return the index of the tuple in the 
> stream. It is designed to be used with the *select* expression to append the 
> index of the tuple to tuples in the stream.
> Syntax:
> {code:java}
> having(select(search(testapp),
>   id,
>   recNum() as recNum),
>and(gt(recNum, 3), lt(recNum, 6))){code}
> Returns:
> {code:java}
> { "result-set": { "docs": [ 
> { "recNum": 4, "id": "2e65eac8-7051-4a11-8409-b89e910bfedc" }, 
> { "recNum": 5, "id": "70edbf04-7dec-4e7f-86bb-71b97e737457" }, 
> { "EOF": true, "RESPONSE_TIME": 12 } ] } }{code}
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] rmuir commented on issue #1141: SOLR-14147 change the Security manager to default to true.

2020-02-05 Thread GitBox
rmuir commented on issue #1141: SOLR-14147 change the Security manager to 
default to true.
URL: https://github.com/apache/lucene-solr/pull/1141#issuecomment-582593853
 
 
   Changes look great. Thank you @MarcusSorealheis !


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] MarcusSorealheis commented on issue #1141: SOLR-14147 change the Security manager to default to true.

2020-02-05 Thread GitBox
MarcusSorealheis commented on issue #1141: SOLR-14147 change the Security 
manager to default to true.
URL: https://github.com/apache/lucene-solr/pull/1141#issuecomment-582594351
 
 
   thank you for guidance
   
   On Wed, Feb 5, 2020 at 12:20 PM Robert Muir 
   wrote:
   
   > Changes look great. Thank you @MarcusSorealheis
   >  !
   >
   
   
   -- 
   Marcus Eagan
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14241) Streaming Expression for deleting documents by IDs (from tuples)

2020-02-05 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031006#comment-17031006
 ] 

ASF subversion and git services commented on SOLR-14241:


Commit bbdfce944bf2c7b8a06b162156e278b3f3986c6e in lucene-solr's branch 
refs/heads/branch_8x from Chris M. Hostetter
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=bbdfce9 ]

SOLR-14241: New delete() Stream Decorator

(cherry picked from commit c5d0391df9c821dc842287d8c769c6f73275a423)


> Streaming Expression for deleting documents by IDs (from tuples)
> 
>
> Key: SOLR-14241
> URL: https://issues.apache.org/jira/browse/SOLR-14241
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: streaming expressions
>Reporter: Chris M. Hostetter
>Assignee: Chris M. Hostetter
>Priority: Major
> Attachments: SOLR-14241.patch
>
>
> Streaming expressions currently supports an {{update(...)}} decorator 
> function for wrapping another stream and treating each Tuple from the inner 
> stream as a document to be added to an index.
> I've implemented an analogous subclass of the {{UpdateStream}} called 
> {{DeleteStream}} that uses the tuples from the inner stream to identify the 
> uniqueKeys of documents that should be deleted.
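
A hypothetical invocation, mirroring the existing update() decorator's shape (the collection name, query, and batchSize below are placeholders; the exact delete() parameters are documented in the ref-guide page added with this change):

{code:java}
delete(gettingstarted,
       batchSize=500,
       search(gettingstarted,
              q="category_s:discontinued",
              fl="id",
              sort="id asc"))
{code}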



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9206) improve IndexMergeTool

2020-02-05 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031009#comment-17031009
 ] 

Michael McCandless commented on LUCENE-9206:


+1 to the patch!  Better options and better defaults.  Thanks for showing this 
tool some love [~rcmuir] :)

> improve IndexMergeTool
> --
>
> Key: LUCENE-9206
> URL: https://issues.apache.org/jira/browse/LUCENE-9206
> Project: Lucene - Core
>  Issue Type: Task
>  Components: general/tools
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-9206.patch
>
>
> This tool can have performance problems since it will only force merge the 
> index down to one segment. Let's give it some better options and default 
> behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-14241) Streaming Expression for deleting documents by IDs (from tuples)

2020-02-05 Thread Chris M. Hostetter (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris M. Hostetter updated SOLR-14241:
--
Fix Version/s: 8.5
   master (9.0)
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> Streaming Expression for deleting documents by IDs (from tuples)
> 
>
> Key: SOLR-14241
> URL: https://issues.apache.org/jira/browse/SOLR-14241
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: streaming expressions
>Reporter: Chris M. Hostetter
>Assignee: Chris M. Hostetter
>Priority: Major
> Fix For: master (9.0), 8.5
>
> Attachments: SOLR-14241.patch
>
>
> Streaming expressions currently supports an {{update(...)}} decorator 
> function for wrapping another stream and treating each Tuple from the inner 
> stream as a document to be added to an index.
> I've implemented an analogous subclass of the {{UpdateStream}} called 
> {{DeleteStream}} that uses the tuples from the inner stream to identify the 
> uniqueKeys of documents that should be deleted.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-11725) json.facet's stddev() function should be changed to use the "Corrected sample stddev" formula

2020-02-05 Thread Chris M. Hostetter (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031016#comment-17031016
 ] 

Chris M. Hostetter commented on SOLR-11725:
---

{quote}Planning to commit this to master if there is no objection
{quote}
Patch seems clean ... it sounds like you mean commit to master for release in 
9.0 but no backport for 8x, so there's no backcompat change until the next major 
version?

that seems good ... just make sure there's a note in the 9.0 upgrade backcompat 
section

> json.facet's stddev() function should be changed to use the "Corrected sample 
> stddev" formula
> -
>
> Key: SOLR-11725
> URL: https://issues.apache.org/jira/browse/SOLR-11725
> Project: Solr
>  Issue Type: Sub-task
>  Components: Facet Module
>Reporter: Chris M. Hostetter
>Priority: Major
> Attachments: SOLR-11725.patch, SOLR-11725.patch
>
>
> While working on some equivalence tests/demonstrations for 
> {{facet.pivot+stats.field}} vs {{json.facet}} I noticed that the {{stddev}} 
> calculations done between the two code paths can be measurably different, and 
> realized this is due to them using very different code...
> * {{json.facet=foo:stddev(foo)}}
> ** {{StddevAgg.java}}
> ** {{Math.sqrt((sumSq/count)-Math.pow(sum/count, 2))}}
> * {{stats.field=\{!stddev=true\}foo}}
> ** {{StatsValuesFactory.java}}
> ** {{Math.sqrt(((count * sumOfSquares) - (sum * sum)) / (count * (count - 
> 1.0D)))}}
> Since I"m not really a math guy, I consulting with a bunch of smart math/stat 
> nerds I know online to help me sanity check if these equations (some how) 
> reduced to eachother (In which case the discrepancies I was seeing in my 
> results might have just been due to the order of intermediate operation 
> execution & floating point rounding differences).
> They confirmed that the two bits of code are _not_ equivalent to each other, 
> and explained that the code JSON Faceting is using is equivalent to the 
> "Uncorrected sample stddev" formula, while StatsComponent's code is 
> equivalent to the "Corrected sample stddev" formula...
> https://en.wikipedia.org/wiki/Standard_deviation#Uncorrected_sample_standard_deviation
> When I told them that stuff like this is why no one likes mathematicians and 
> pressed them to explain which one was the "most canonical" (or "most 
> generally applicable" or "best") definition of stddev, I was told that:
> # This is something statisticians frequently disagree on
> # Practically speaking the diff between the calculations doesn't tend to 
> differ significantly when count is "very large"
> # _"Corrected sample stddev" is more appropriate when comparing two 
> distributions_
> Given that:
> * the primary usage of computing the stddev of a field/function against a 
> Solr result set (or against a sub-set of results defined by a facet 
> constraint) is probably to compare that distribution to a different Solr 
> result set (or to compare N sub-sets of results defined by N facet 
> constraints)
> * the size of the sets of documents (values) can be relatively small when 
> computing stats over facet constraint sub-sets
> ...it seems like {{StddevAgg.java}} should be updated to use the "Corrected 
> sample stddev" equation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9206) improve IndexMergeTool

2020-02-05 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031017#comment-17031017
 ] 

Adrien Grand commented on LUCENE-9206:
--

+1

> improve IndexMergeTool
> --
>
> Key: LUCENE-9206
> URL: https://issues.apache.org/jira/browse/LUCENE-9206
> Project: Lucene - Core
>  Issue Type: Task
>  Components: general/tools
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-9206.patch
>
>
> This tool can have performance problems since it will only force merge the 
> index down to one segment. Let's give it some better options and default 
> behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-11725) json.facet's stddev() function should be changed to use the "Corrected sample stddev" formula

2020-02-05 Thread Yonik Seeley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031026#comment-17031026
 ] 

Yonik Seeley commented on SOLR-11725:
-

bq. No changes have been done with regards to sample size of 1 or 0

Wait... the conversation from 2017 wasn't resolved?  What do we want to do about 
stddev of singleton sets?  Solr currently returns 0.0, and Hoss seemed to think 
this was the right behavior.  But the patch here would seem to change the 
behavior to return NaN (but I didn't test it...). After a quick glance, it 
doesn't look like existing tests cover this case either?
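
To make the singleton-set question concrete, here is a small self-contained sketch (plain Java, not Solr code) of the two formulas quoted in the description: with n = 1 the corrected variant divides by zero and yields NaN, while the uncorrected one returns 0.0.

{code:java}
public class StddevCompare {
  // Uncorrected sample stddev: sqrt(sumSq/n - (sum/n)^2)
  static double uncorrected(double[] v) {
    double sum = 0, sumSq = 0;
    for (double x : v) { sum += x; sumSq += x * x; }
    int n = v.length;
    return Math.sqrt(sumSq / n - Math.pow(sum / n, 2));
  }

  // Corrected sample stddev: sqrt((n*sumSq - sum^2) / (n*(n-1)))
  static double corrected(double[] v) {
    double sum = 0, sumSq = 0;
    for (double x : v) { sum += x; sumSq += x * x; }
    int n = v.length;
    return Math.sqrt((n * sumSq - sum * sum) / (n * (n - 1.0)));
  }

  public static void main(String[] args) {
    double[] small = {2.0, 4.0, 6.0};
    System.out.println(uncorrected(small));     // ~1.633 (divides by n)
    System.out.println(corrected(small));       // 2.0    (divides by n-1)
    double[] singleton = {5.0};
    System.out.println(uncorrected(singleton)); // 0.0
    System.out.println(corrected(singleton));   // NaN (0.0/0.0 when n == 1)
  }
}
{code}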


> json.facet's stddev() function should be changed to use the "Corrected sample 
> stddev" formula
> -
>
> Key: SOLR-11725
> URL: https://issues.apache.org/jira/browse/SOLR-11725
> Project: Solr
>  Issue Type: Sub-task
>  Components: Facet Module
>Reporter: Chris M. Hostetter
>Priority: Major
> Attachments: SOLR-11725.patch, SOLR-11725.patch
>
>
> While working on some equivalence tests/demonstrations for 
> {{facet.pivot+stats.field}} vs {{json.facet}} I noticed that the {{stddev}} 
> calculations done between the two code paths can be measurably different, and 
> realized this is due to them using very different code...
> * {{json.facet=foo:stddev(foo)}}
> ** {{StddevAgg.java}}
> ** {{Math.sqrt((sumSq/count)-Math.pow(sum/count, 2))}}
> * {{stats.field=\{!stddev=true\}foo}}
> ** {{StatsValuesFactory.java}}
> ** {{Math.sqrt(((count * sumOfSquares) - (sum * sum)) / (count * (count - 
> 1.0D)))}}
> Since I"m not really a math guy, I consulting with a bunch of smart math/stat 
> nerds I know online to help me sanity check if these equations (some how) 
> reduced to eachother (In which case the discrepancies I was seeing in my 
> results might have just been due to the order of intermediate operation 
> execution & floating point rounding differences).
> They confirmed that the two bits of code are _not_ equivalent to each other, 
> and explained that the code JSON Faceting is using is equivalent to the 
> "Uncorrected sample stddev" formula, while StatsComponent's code is 
> equivalent to the "Corrected sample stddev" formula...
> https://en.wikipedia.org/wiki/Standard_deviation#Uncorrected_sample_standard_deviation
> When I told them that stuff like this is why no one likes mathematicians and 
> pressed them to explain which one was the "most canonical" (or "most 
> generally applicable" or "best") definition of stddev, I was told that:
> # This is something statisticians frequently disagree on
> # Practically speaking the diff between the calculations doesn't tend to 
> differ significantly when count is "very large"
> # _"Corrected sample stddev" is more appropriate when comparing two 
> distributions_
> Given that:
> * the primary usage of computing the stddev of a field/function against a 
> Solr result set (or against a sub-set of results defined by a facet 
> constraint) is probably to compare that distribution to a different Solr 
> result set (or to compare N sub-sets of results defined by N facet 
> constraints)
> * the size of the sets of documents (values) can be relatively small when 
> computing stats over facet constraint sub-sets
> ...it seems like {{StddevAgg.java}} should be updated to use the "Corrected 
> sample stddev" equation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-5146) Figure out what it would take for lazily-loaded cores to play nice with SolrCloud

2020-02-05 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-5146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031050#comment-17031050
 ] 

David Smiley commented on SOLR-5146:


{quote}...but then it's not immediately possible for it to get the shard leader 
election done since other nodes are not currently participating for that slice.
{quote}
I don't get what you are saying here.  The "trick" with transient cores with 
SolrCloud will be that SolrCloud needn't know about the loaded status.  Maybe 
there will be an exception but it'll be a secret inside the node (other 
nodes/replicas won't know).  The core is _present_, and thus its leader status 
is whatever it is to SolrCloud.  It might be awoken to participate in 
leadership elections (I hope not) but if so I'll look to fix that so an 
unloaded core can stay that way during this.  If there is data to sync then the 
core will be awoken to do so (/update and /replication and all request handlers 
require the core be loaded).

> Figure out what it would take for lazily-loaded cores to play nice with 
> SolrCloud
> -
>
> Key: SOLR-5146
> URL: https://issues.apache.org/jira/browse/SOLR-5146
> Project: Solr
>  Issue Type: Improvement
>  Components: SolrCloud
>Affects Versions: 4.5, 6.0
>Reporter: Erick Erickson
>Assignee: David Smiley
>Priority: Major
>
> The whole lazy-load core thing was implemented with non-SolrCloud use-cases 
> in mind. There are several user-list threads that ask about using lazy cores 
> with SolrCloud, especially in multi-tenant use-cases.
> This is a marker JIRA to investigate what it would take to make lazy-load 
> cores play nice with SolrCloud. It's especially interesting how this all 
> works with shards, replicas, leader election, recovery, etc.
> NOTE: This is pretty much totally unexplored territory. It may be that a few 
> trivial modifications are all that's needed. OTOH, It may be that we'd have 
> to rip apart SolrCloud to handle this case. Until someone dives into the 
> code, we don't know.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13887) socketTimeout of 0 causing timeouts in the Http2SolrClient

2020-02-05 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031051#comment-17031051
 ] 

ASF subversion and git services commented on SOLR-13887:


Commit e0d35f964169a0d1efb686f2e2586c896194c02c in lucene-solr's branch 
refs/heads/branch_8x from Houston Putman
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e0d35f9 ]

SOLR-13887: Use the default idleTimeout instead of 0 for HTTP2 (#991)



> socketTimeout of 0 causing timeouts in the Http2SolrClient
> --
>
> Key: SOLR-13887
> URL: https://issues.apache.org/jira/browse/SOLR-13887
> Project: Solr
>  Issue Type: Bug
>  Components: http2
>Affects Versions: 8.0, master (9.0)
>Reporter: Houston Putman
>Assignee: Houston Putman
>Priority: Minor
> Fix For: master (9.0), 8.5
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In Solr 7, and previous versions, both the *socketTimeout* and 
> *connTimeout* defaults in _solr.xml_ have accepted 0 as values. This is even 
> [documented in the ref 
> guide|https://lucene.apache.org/solr/guide/8_2/format-of-solr-xml.html#defining-solr-xml].
>  Using these same defaults with Solr 8 results in timeouts when trying to 
> manually create replicas. The major change here seems to be that the 
> Http2SolrClient is being used instead of the HttpSolrClient used in Solr 7 
> and previous versions.
> After some digging, I think that the issue lies in the Http2SolrClient, 
> [specifically 
> here|https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/client/solrj/impl/Http2SolrClient.java#L399].
>  Since the idleTimeout is set to 0, since that is what solr pulls from the 
> solr.xml, the listener immediately responds with a timeout.
> The fix here is pretty simple, just set a default if 0 is provided. Basically 
> treat an idleTimeout (or socketTimeout) of 0 the same as null. The ref-guide 
> should also likely be updated with the same defaults as used in the solr.xml 
> packaged in Solr.
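
As a rough illustration of the "treat 0 the same as null" fix (names below are placeholders, not the actual Http2SolrClient fields), the guard could look something like this:

{code:java}
// Hypothetical helper: fall back to a default when the configured timeout is
// missing (null) or explicitly 0, instead of handing 0 to the HTTP/2 client,
// where it ends up behaving like "time out immediately".
static int resolveIdleTimeout(Integer configuredMillis, int defaultMillis) {
  if (configuredMillis == null || configuredMillis == 0) {
    return defaultMillis;
  }
  return configuredMillis;
}
{code}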



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9206) improve IndexMergeTool

2020-02-05 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031055#comment-17031055
 ] 

Robert Muir commented on LUCENE-9206:
-

For this command-line tool I will push to master only, with verbiage in 
CHANGES.txt/MIGRATE.txt to pass {{--max-segments 1}} if you rely upon the old 
behavior.

> improve IndexMergeTool
> --
>
> Key: LUCENE-9206
> URL: https://issues.apache.org/jira/browse/LUCENE-9206
> Project: Lucene - Core
>  Issue Type: Task
>  Components: general/tools
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-9206.patch
>
>
> This tool can have performance problems since it will only force merge the 
> index down to one segment. Let's give it some better options and default 
> behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (SOLR-13887) socketTimeout of 0 causing timeouts in the Http2SolrClient

2020-02-05 Thread Houston Putman (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Houston Putman resolved SOLR-13887.
---
Resolution: Fixed

> socketTimeout of 0 causing timeouts in the Http2SolrClient
> --
>
> Key: SOLR-13887
> URL: https://issues.apache.org/jira/browse/SOLR-13887
> Project: Solr
>  Issue Type: Bug
>  Components: http2
>Affects Versions: 8.0, master (9.0)
>Reporter: Houston Putman
>Assignee: Houston Putman
>Priority: Minor
> Fix For: master (9.0), 8.5
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In Solr 7, and previous versions, both the *socketTimeout* and 
> *connTimeout* defaults in _solr.xml_ have accepted 0 as values. This is even 
> [documented in the ref 
> guide|https://lucene.apache.org/solr/guide/8_2/format-of-solr-xml.html#defining-solr-xml].
>  Using these same defaults with Solr 8 results in timeouts when trying to 
> manually create replicas. The major change here seems to be that the 
> Http2SolrClient is being used instead of the HttpSolrClient used in Solr 7 
> and previous versions.
> After some digging, I think that the issue lies in the Http2SolrClient, 
> [specifically 
> here|https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/client/solrj/impl/Http2SolrClient.java#L399].
>  Since the idleTimeout is set to 0, since that is what solr pulls from the 
> solr.xml, the listener immediately responds with a timeout.
> The fix here is pretty simple, just set a default if 0 is provided. Basically 
> treat an idleTimeout (or socketTimeout) of 0 the same as null. The ref-guide 
> should also likely be updated with the same defaults as used in the solr.xml 
> packaged in Solr.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9206) improve IndexMergeTool

2020-02-05 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031060#comment-17031060
 ] 

ASF subversion and git services commented on LUCENE-9206:
-

Commit 93b83f635dffc782dc70174e5ea377ceab6a8174 in lucene-solr's branch 
refs/heads/master from Robert Muir
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=93b83f6 ]

LUCENE-9206: Improve IndexMergeTool defaults and options

IndexMergeTool previously had no options and always forceMerge(1)
the resulting index. This can result in wasted work and confusing
performance (unbalancing the index).

Instead the default is to not do anything, except merges from the
merge policy.
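
In plain IndexWriter terms the change amounts to roughly the sketch below (illustrative only, not the tool's actual code); the old behavior always called forceMerge(1), while the new default just adds the source indexes and lets the configured MergePolicy decide what to merge:

{code:java}
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.Directory;

public class MergeSketch {
  // dirs are the already-opened source indexes, out is the destination index.
  static void merge(Directory out, Directory[] dirs, int maxSegments) throws Exception {
    IndexWriterConfig cfg = new IndexWriterConfig(null).setOpenMode(OpenMode.CREATE);
    try (IndexWriter writer = new IndexWriter(out, cfg)) {
      writer.addIndexes(dirs);           // copy segments from the source indexes
      if (maxSegments > 0) {             // only when explicitly requested, e.g. --max-segments 1
        writer.forceMerge(maxSegments);  // the old tool behavior was effectively forceMerge(1), always
      }
      writer.commit();
    }
  }
}
{code}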


> improve IndexMergeTool
> --
>
> Key: LUCENE-9206
> URL: https://issues.apache.org/jira/browse/LUCENE-9206
> Project: Lucene - Core
>  Issue Type: Task
>  Components: general/tools
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-9206.patch
>
>
> This tool can have performance problems since it will only force merge the 
> index down to one segment. Let's give it some better options and default 
> behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-9206) improve IndexMergeTool

2020-02-05 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved LUCENE-9206.
-
Fix Version/s: master (9.0)
   Resolution: Fixed

> improve IndexMergeTool
> --
>
> Key: LUCENE-9206
> URL: https://issues.apache.org/jira/browse/LUCENE-9206
> Project: Lucene - Core
>  Issue Type: Task
>  Components: general/tools
>Reporter: Robert Muir
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-9206.patch
>
>
> This tool can have performance problems since it will only force merge the 
> index down to one segment. Let's give it some better options and default 
> behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9203) Make DocValuesIterator public

2020-02-05 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031062#comment-17031062
 ] 

Michael Sokolov commented on LUCENE-9203:
-

It looks to me as if the discussion in LUCENE-9081 was about whether to 
eliminate {{DocValuesIterator}} altogether since it doesn't actually do 
anything - is really just a marker interface. But I think it's helpful for when 
you want to deal with all the DocValues types in a generic manner to have some 
superclass (or could be an interface maybe?) to which to refer. Otherwise you 
end up with ugly switch code that must reference all the concrete types. Given 
that, I'd be +1 to making it public, and if for some reason we decide not to, then 
I think we should remove it, since this keeps coming up.
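
To illustrate the point about generic handling (a sketch only, not an API proposal): DocValuesIterator is where advanceExact() is declared, but because it is package-private, code outside org.apache.lucene.index has to repeat itself per concrete type.

{code:java}
import java.io.IOException;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.index.SortedNumericDocValues;

final class DocValuesUtil {
  // Three overloads that do exactly the same thing, one per concrete type.
  static boolean hasValue(NumericDocValues dv, int doc) throws IOException {
    return dv.advanceExact(doc);
  }
  static boolean hasValue(BinaryDocValues dv, int doc) throws IOException {
    return dv.advanceExact(doc);
  }
  static boolean hasValue(SortedNumericDocValues dv, int doc) throws IOException {
    return dv.advanceExact(doc);
  }
  // With a public DocValuesIterator these would collapse into a single method:
  // static boolean hasValue(DocValuesIterator dv, int doc) throws IOException {
  //   return dv.advanceExact(doc);
  // }
}
{code}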

> Make DocValuesIterator public
> -
>
> Key: LUCENE-9203
> URL: https://issues.apache.org/jira/browse/LUCENE-9203
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 8.4
>Reporter: juan camilo rodriguez duran
>Priority: Trivial
>  Labels: docValues
>
> By doing this, we improve extensibility for new formats. Additionally this 
> will improve coherence with the public method already existent in the class.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] andyvuong commented on issue #1188: SOLR-14044: Support collection and shard deletion in shared storage

2020-02-05 Thread GitBox
andyvuong commented on issue #1188: SOLR-14044: Support collection and shard 
deletion in shared storage
URL: https://github.com/apache/lucene-solr/pull/1188#issuecomment-582637345
 
 
   Updated to close processors in tests even with failure and make error 
handling consistent @mbwaheed 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] rmuir merged pull request #1141: SOLR-14147 change the Security manager to default to true.

2020-02-05 Thread GitBox
rmuir merged pull request #1141: SOLR-14147 change the Security manager to 
default to true.
URL: https://github.com/apache/lucene-solr/pull/1141
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14147) enable security manager by default

2020-02-05 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031142#comment-17031142
 ] 

ASF subversion and git services commented on SOLR-14147:


Commit bc5f837344a32b7795bd1d727251e639b33056c0 in lucene-solr's branch 
refs/heads/master from Marcus
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=bc5f837 ]

SOLR-14147 change the Security manager to default to true. (#1141)

* change the Security manager to default.
* update the ref-guide.
* uncomment init scripts update changes.
* changed the ref guide and re-commented file.
* remove added comment.
* modified shell script.
* removed comment in windows file.

Signed-off-by: marcussorealheis 

* bashism and fix windows
* remove space

Signed-off-by: marcussorealheis 


> enable security manager by default
> --
>
> Key: SOLR-14147
> URL: https://issues.apache.org/jira/browse/SOLR-14147
> Project: Solr
>  Issue Type: Improvement
>Reporter: Robert Muir
>Priority: Major
> Fix For: master (9.0)
>
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> For 9.0, set SOLR_SECURITY_MANAGER_ENABLED=true by default. Remove the step 
> from securing solr page as it will be done by default (defaults become safe). 
> Users can disable if they are running hadoop or doing other crazy stuff.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (SOLR-14147) enable security manager by default

2020-02-05 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved SOLR-14147.

Resolution: Fixed

Thanks [~marcussorealheis]!

> enable security manager by default
> --
>
> Key: SOLR-14147
> URL: https://issues.apache.org/jira/browse/SOLR-14147
> Project: Solr
>  Issue Type: Improvement
>Reporter: Robert Muir
>Priority: Major
> Fix For: master (9.0)
>
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> For 9.0, set SOLR_SECURITY_MANAGER_ENABLED=true by default. Remove the step 
> from securing solr page as it will be done by default (defaults become safe). 
> Users can disable if they are running hadoop or doing other crazy stuff.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9206) improve IndexMergeTool

2020-02-05 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031175#comment-17031175
 ] 

Robert Muir commented on LUCENE-9206:
-

forbidden-apis got me, I'm gonna fix it. The separate options class is not 
annotated with SuppressForbidden and it is doing commandline-tool stuff.

> improve IndexMergeTool
> --
>
> Key: LUCENE-9206
> URL: https://issues.apache.org/jira/browse/LUCENE-9206
> Project: Lucene - Core
>  Issue Type: Task
>  Components: general/tools
>Reporter: Robert Muir
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-9206.patch
>
>
> This tool can have performance problems since it will only force merge the 
> index down to one segment. Let's give it some better options and default 
> behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9206) improve IndexMergeTool

2020-02-05 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031176#comment-17031176
 ] 

ASF subversion and git services commented on LUCENE-9206:
-

Commit 196ec5f4a879eb3e1fc1818a4c2dd70a215882f1 in lucene-solr's branch 
refs/heads/master from Robert Muir
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=196ec5f ]

LUCENE-9206: add forbidden api exclusion to new class


> improve IndexMergeTool
> --
>
> Key: LUCENE-9206
> URL: https://issues.apache.org/jira/browse/LUCENE-9206
> Project: Lucene - Core
>  Issue Type: Task
>  Components: general/tools
>Reporter: Robert Muir
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-9206.patch
>
>
> This tool can have performance problems since it will only force merge the 
> index down to one segment. Let's give it some better options and default 
> behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] yonik merged pull request #1188: SOLR-14044: Support collection and shard deletion in shared storage

2020-02-05 Thread GitBox
yonik merged pull request #1188: SOLR-14044: Support collection and shard 
deletion in shared storage
URL: https://github.com/apache/lucene-solr/pull/1188
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14044) Support shard/collection deletion in shared storage

2020-02-05 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031204#comment-17031204
 ] 

ASF subversion and git services commented on SOLR-14044:


Commit 3e8ca67f6f5d87b103148824e2bf158d5d5330da in lucene-solr's branch 
refs/heads/jira/SOLR-13101 from Andy Vuong
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=3e8ca67 ]

SOLR-14044: Support collection and shard deletion in shared storage (#1188)

* Support collection and shard deletion in shared storage

* Add end to end collection api delete tests and fix local client test

* Fix timestamps

* Remove debug log line and fix timestamps

* Address review comments and fix test

* Close resource and throw exception on failure


> Support shard/collection deletion in shared storage
> ---
>
> Key: SOLR-14044
> URL: https://issues.apache.org/jira/browse/SOLR-14044
> Project: Solr
>  Issue Type: Sub-task
>  Components: SolrCloud
>Reporter: Andy Vuong
>Priority: Major
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> The Solr Cloud deletion APIs for collections and shards are not currently 
> supported by shared storage but are an essential functionality required by 
> the shared storage design. Deletion of objects from shared storage currently 
> only happens in the indexing path (on pushes) and after the index file 
> listings between the local solr process and external store have been resolved.
>  
> This task is to track supporting the delete shard/collection API commands and 
> its scope does not include cleaning up so called “orphaned” index files from 
> blob (i.e. files that are no longer referenced by any core.metadata file on 
> the external store). This will be designed/covered in another subtask.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14118) default embedded zookeeper port to localhost

2020-02-05 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031208#comment-17031208
 ] 

ASF subversion and git services commented on SOLR-14118:


Commit 63be99bf12ddf32d30ba60e59dcc01f653cd2e0e in lucene-solr's branch 
refs/heads/master from Robert Muir
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=63be99b ]

SOLR-14118: default embedded zookeeper port to localhost


> default embedded zookeeper port to localhost
> 
>
> Key: SOLR-14118
> URL: https://issues.apache.org/jira/browse/SOLR-14118
> Project: Solr
>  Issue Type: Improvement
>Reporter: Robert Muir
>Priority: Major
> Attachments: SOLR-14118.patch
>
>
> Relates: SOLR-13985
> If someone runs {{bin/solr start -c}}:
> {noformat}
> tcp46  0  0  *.8983 *.*LISTEN 
> tcp46  0  0  *.9983 *.*LISTEN
> {noformat}
> In addition to the jetty port, the embedded zookeeper port should not bind to 
> all interfaces. Nothing should by default!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (SOLR-14118) default embedded zookeeper port to localhost

2020-02-05 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved SOLR-14118.

Fix Version/s: master (9.0)
   Resolution: Fixed

> default embedded zookeeper port to localhost
> 
>
> Key: SOLR-14118
> URL: https://issues.apache.org/jira/browse/SOLR-14118
> Project: Solr
>  Issue Type: Improvement
>Reporter: Robert Muir
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: SOLR-14118.patch
>
>
> Relates: SOLR-13985
> If someone runs {{bin/solr start -c}}:
> {noformat}
> tcp46  0  0  *.8983 *.*LISTEN 
> tcp46  0  0  *.9983 *.*LISTEN
> {noformat}
> In addition to the jetty port, the embedded zookeeper port should not bind to 
> all interfaces. Nothing should by default!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9004) Approximate nearest vector search

2020-02-05 Thread Xin-Chun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031223#comment-17031223
 ] 

Xin-Chun Zhang commented on LUCENE-9004:


??The default heap size that is given to Java processes depends on platforms, 
but for most commodity PCs it wouldn't be so large so you will see OOM if you 
are not set the -Xmx JVM arg.??

[~tomoko] I did set JVM option to "-Xmx8192m", but OOM error always throws. I 
guess there may be a memory leak in the static member "cache" of 
HNSWGraphReader. The key of static "cache" is composed of field name and 
context identity, where the context identity may vary from query to query. When 
I execute the query multiple times, the static cache grows rapidly (its size 
equals the number of queries), resulting in OOM. 
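
The failure mode being described is, in outline, the pattern below (purely illustrative; the class and field names come from the POC branch and the details here are assumptions, not the actual HNSWGraphReader code):

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class GraphCacheSketch {
  // Static, never-evicted cache keyed by (field, reader-context identity).
  private static final Map<String, Object> CACHE = new ConcurrentHashMap<>();

  static Object getOrLoad(String field, Object leafContext) {
    // The identity of the context changes whenever a new reader/context is
    // created, so repeated queries keep adding fresh entries while the old
    // graphs are never released: unbounded growth and, eventually, OOM.
    String key = field + "@" + System.identityHashCode(leafContext);
    return CACHE.computeIfAbsent(key, k -> loadGraph(field, leafContext));
  }

  private static Object loadGraph(String field, Object ctx) {
    return new long[1 << 20]; // stand-in for a large in-heap graph
  }
}
{code}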

> Approximate nearest vector search
> -
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
> Attachments: hnsw_layered_graph.png
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found an approach based on navigating a graph that partially encodes the 
> nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost. This 
> issue will explore implementing HNSW (hierarchical navigable small-world) 
> graphs for the purpose of approximate nearest vector search (often referred 
> to as KNN or k-nearest-neighbor search).
> At a high level the way this algorithm works is this. First assume you have a 
> graph that has a partial encoding of the nearest neighbor relation, with some 
> short and some long-distance links. If this graph is built in the right way 
> (has the hierarchical navigable small world property), then you can 
> efficiently traverse it to find nearest neighbors (approximately) in log N 
> time where N is the number of nodes in the graph. I believe this idea was 
> pioneered in  [1]. The great insight in that paper is that if you use the 
> graph search algorithm to find the K nearest neighbors of a new document 
> while indexing, and then link those neighbors (undirectedly, ie both ways) to 
> the new document, then the graph that emerges will have the desired 
> properties.
> The implementation I propose for Lucene is as follows. We need two new data 
> structures to encode the vectors and the graph. We can encode vectors using a 
> light wrapper around {{BinaryDocValues}} (we also want to encode the vector 
> dimension and have efficient conversion from bytes to floats). For the graph 
> we can use {{SortedNumericDocValues}} where the values we encode are the 
> docids of the related documents. Encoding the interdocument relations using 
> docids directly will make it relatively fast to traverse the graph since we 
> won't need to lookup through an id-field indirection. This choice limits us 
> to building a graph-per-segment since it would be impractical to maintain a 
> global graph for the whole index in the face of segment merges. However 
> graph-per-segment is very natural at search time - we can traverse each 
> segments' graph independently and merge results as we do today for term-based 
> search.
> At index time, however, merging graphs is somewhat challenging. While 
> indexing we build a graph incrementally, performing searches to construct 
> links among neighbors. When merging segments we must construct a new graph 
> containing elements of all the merged segments. Ideally we would somehow 
> preserve the work done when building the initial graphs, but at least as a 
> start I'd propose we construct a new graph from scratch when merging. The 
> process is going to be  limited, at least initially, to graphs that can fit 
> in RAM since we require random access to the entire graph while constructing 
> it: In order to add links bidirectionally we must continually update existing 
> documents.
> I think we want to express this API to users as a single joint 
> {{KnnGraphField}} abstraction that joins together the vectors and the graph 
> as a single joint field type. Mostly it just looks like a vector-valued 
> field, but has this graph attached to it.
> I'll push a branch with my POC and would love to hear comments. It has many 
> nocommits, basic design is not really set, there is no Query impl
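
As a reading aid for the traversal sketched in the description, here is a minimal, Lucene-independent illustration of greedy search over a pre-built neighbor graph (single layer, squared Euclidean distance); HNSW layers this idea into a hierarchy and keeps a small candidate beam rather than a single current node:

{code:java}
// Toy single-layer nearest-neighbor graph search; not Lucene code.
final class GreedyGraphSearch {
  // vectors[i] is node i's vector, neighbors[i] is its adjacency list.
  static int search(float[][] vectors, int[][] neighbors, float[] query, int entryPoint) {
    int current = entryPoint;
    float best = distance(vectors[current], query);
    boolean improved = true;
    while (improved) {
      improved = false;
      for (int candidate : neighbors[current]) {  // greedily follow the best outgoing link
        float d = distance(vectors[candidate], query);
        if (d < best) {
          best = d;
          current = candidate;
          improved = true;
        }
      }
    }
    return current; // approximate nearest neighbor
  }

  static float distance(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      float diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }
}
{code}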

[jira] [Comment Edited] (LUCENE-9004) Approximate nearest vector search

2020-02-05 Thread Xin-Chun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031223#comment-17031223
 ] 

Xin-Chun Zhang edited comment on LUCENE-9004 at 2/6/20 3:05 AM:


??The default heap size that is given to Java processes depends on platforms, 
but for most commodity PCs it wouldn't be so large so you will see OOM if you 
are not set the -Xmx JVM arg.??

[~tomoko] I did set JVM option to "-Xmx8192m", but OOM error always appears. I 
guess there may be a memory leak in the static member "cache" of 
HNSWGraphReader. The key of static "cache" is composed of field name and 
context identity, where the context identity may vary from query to query. When 
I execute query multiple times, the static cache size will increase rapidly 
(cache size equals to query times), result in OOM. 


was (Author: irvingzhang):
??The default heap size that is given to Java processes depends on platforms, 
but for most commodity PCs it wouldn't be so large so you will see OOM if you 
are not set the -Xmx JVM arg.??

[~tomoko] I did set JVM option to "-Xmx8192m", but OOM error always throws. I 
guess there may be a memory leak in the static member "cache" of 
HNSWGraphReader. The key of static "cache" is composed of field name and 
context identity, where the context identity may vary from query to query. When 
I execute query multiple times, the static cache will increase rapidly (cache 
size equals to query times), result in OOM. 

> Approximate nearest vector search
> -
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
> Attachments: hnsw_layered_graph.png
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found an approach based on navigating a graph that partially encodes the 
> nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost. This 
> issue will explore implementing HNSW (hierarchical navigable small-world) 
> graphs for the purpose of approximate nearest vector search (often referred 
> to as KNN or k-nearest-neighbor search).
> At a high level the way this algorithm works is this. First assume you have a 
> graph that has a partial encoding of the nearest neighbor relation, with some 
> short and some long-distance links. If this graph is built in the right way 
> (has the hierarchical navigable small world property), then you can 
> efficiently traverse it to find nearest neighbors (approximately) in log N 
> time where N is the number of nodes in the graph. I believe this idea was 
> pioneered in  [1]. The great insight in that paper is that if you use the 
> graph search algorithm to find the K nearest neighbors of a new document 
> while indexing, and then link those neighbors (undirectedly, ie both ways) to 
> the new document, then the graph that emerges will have the desired 
> properties.
> The implementation I propose for Lucene is as follows. We need two new data 
> structures to encode the vectors and the graph. We can encode vectors using a 
> light wrapper around {{BinaryDocValues}} (we also want to encode the vector 
> dimension and have efficient conversion from bytes to floats). For the graph 
> we can use {{SortedNumericDocValues}} where the values we encode are the 
> docids of the related documents. Encoding the interdocument relations using 
> docids directly will make it relatively fast to traverse the graph since we 
> won't need to look up through an id-field indirection. This choice limits us 
> to building a graph-per-segment since it would be impractical to maintain a 
> global graph for the whole index in the face of segment merges. However, 
> graph-per-segment is very natural at search time - we can traverse each 
> segments' graph independently and merge results as we do today for term-based 
> search.
> At index time, however, merging graphs is somewhat challenging. While 
> indexing we build a graph incrementally, performing searches to construct 
> links among neighbors. When merging segments we must construct a new graph 
> containing elements of all the merged segments. Ideally we would somehow 
> preserve the work done when building the initial graphs, but at least as a 
> start I'd propose we construct a new graph from scratch when merging.

[jira] [Comment Edited] (LUCENE-9004) Approximate nearest vector search

2020-02-05 Thread Xin-Chun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031223#comment-17031223
 ] 

Xin-Chun Zhang edited comment on LUCENE-9004 at 2/6/20 3:38 AM:


??The default heap size that is given to Java processes depends on platforms, 
but for most commodity PCs it wouldn't be so large so you will see OOM if you 
are not set the -Xmx JVM arg.??

[~tomoko] I did set the JVM option "-Xmx8192m", but the OOM error still appears. 
I suspect there may be a memory leak in the static member "cache" of 
HNSWGraphReader. The key of the "cache" is composed of the field name and the 
context identity, and the context identity may vary from query to query. When I 
execute a query multiple times, the static cache size increases rapidly (the 
cache size equals the number of queries executed), resulting in OOM. 


was (Author: irvingzhang):
??The default heap size that is given to Java processes depends on platforms, 
but for most commodity PCs it wouldn't be so large so you will see OOM if you 
are not set the -Xmx JVM arg.??

[~tomoko] I did set the JVM option "-Xmx8192m", but the OOM error still appears. 
I suspect there may be a memory leak in the static member "cache" of 
HNSWGraphReader. The key of the static "cache" is composed of the field name and 
the context identity, and the context identity may vary from query to query. 
When I execute a query multiple times, the static cache size increases rapidly 
(the cache size equals the number of queries executed), resulting in OOM. 

> Approximate nearest vector search
> -
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
> Attachments: hnsw_layered_graph.png
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found that an approach based on navigating a graph that partially encodes 
> the nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost. This 
> issue will explore implementing HNSW (hierarchical navigable small-world) 
> graphs for the purpose of approximate nearest vector search (often referred 
> to as KNN or k-nearest-neighbor search).
> At a high level the way this algorithm works is this. First assume you have a 
> graph that has a partial encoding of the nearest neighbor relation, with some 
> short and some long-distance links. If this graph is built in the right way 
> (has the hierarchical navigable small world property), then you can 
> efficiently traverse it to find nearest neighbors (approximately) in log N 
> time where N is the number of nodes in the graph. I believe this idea was 
> pioneered in  [1]. The great insight in that paper is that if you use the 
> graph search algorithm to find the K nearest neighbors of a new document 
> while indexing, and then link those neighbors (undirectedly, ie both ways) to 
> the new document, then the graph that emerges will have the desired 
> properties.
> The implementation I propose for Lucene is as follows. We need two new data 
> structures to encode the vectors and the graph. We can encode vectors using a 
> light wrapper around {{BinaryDocValues}} (we also want to encode the vector 
> dimension and have efficient conversion from bytes to floats). For the graph 
> we can use {{SortedNumericDocValues}} where the values we encode are the 
> docids of the related documents. Encoding the interdocument relations using 
> docids directly will make it relatively fast to traverse the graph since we 
> won't need to look up through an id-field indirection. This choice limits us 
> to building a graph-per-segment since it would be impractical to maintain a 
> global graph for the whole index in the face of segment merges. However, 
> graph-per-segment is very natural at search time - we can traverse each 
> segments' graph independently and merge results as we do today for term-based 
> search.
> At index time, however, merging graphs is somewhat challenging. While 
> indexing we build a graph incrementally, performing searches to construct 
> links among neighbors. When merging segments we must construct a new graph 
> containing elements of all the merged segments. Ideally we would somehow 
> preserve the work done when building the initial graphs, but at least as a 
> start I'd propose we construct a new graph from scratch when merging.

[jira] [Commented] (LUCENE-9004) Approximate nearest vector search

2020-02-05 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031275#comment-17031275
 ] 

Tomoko Uchida commented on LUCENE-9004:
---

The context is created on a per-reader basis, not per query. You haven't shared 
your test code, but I suspect you open a new IndexReader every time you issue a 
query? I think if you reuse one index reader (index searcher) throughout the 
test, the memory usage stays stable between 2 and 4 GB. 
Anyway, yes, the static cache (for the graph structure) isn't a good 
implementation; that is one reason why I said the HNSW branch is still at a 
pretty early stage... 
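For example, here is a minimal sketch of what I mean by reusing one reader, 
using only the standard Lucene API (the index path and the query inside the 
loop are placeholders; the KNN query class from the branch is intentionally 
omitted):

{code:java}
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ReuseReaderSketch {
  public static void main(String[] args) throws Exception {
    try (Directory dir = FSDirectory.open(Paths.get("/path/to/index"));
         DirectoryReader reader = DirectoryReader.open(dir)) {
      // Open the reader once and reuse the same searcher for every query.
      // A per-reader-context cache then holds at most one entry per segment,
      // instead of one entry per query.
      IndexSearcher searcher = new IndexSearcher(reader);
      for (int i = 0; i < 10_000; i++) {
        // searcher.search(query, 10);  // issue the KNN query against the same reader
      }
    }
  }
}
{code}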

> Approximate nearest vector search
> -
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
> Attachments: hnsw_layered_graph.png
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found that an approach based on navigating a graph that partially encodes 
> the nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost. This 
> issue will explore implementing HNSW (hierarchical navigable small-world) 
> graphs for the purpose of approximate nearest vector search (often referred 
> to as KNN or k-nearest-neighbor search).
> At a high level the way this algorithm works is this. First assume you have a 
> graph that has a partial encoding of the nearest neighbor relation, with some 
> short and some long-distance links. If this graph is built in the right way 
> (has the hierarchical navigable small world property), then you can 
> efficiently traverse it to find nearest neighbors (approximately) in log N 
> time where N is the number of nodes in the graph. I believe this idea was 
> pioneered in  [1]. The great insight in that paper is that if you use the 
> graph search algorithm to find the K nearest neighbors of a new document 
> while indexing, and then link those neighbors (undirectedly, ie both ways) to 
> the new document, then the graph that emerges will have the desired 
> properties.
> The implementation I propose for Lucene is as follows. We need two new data 
> structures to encode the vectors and the graph. We can encode vectors using a 
> light wrapper around {{BinaryDocValues}} (we also want to encode the vector 
> dimension and have efficient conversion from bytes to floats). For the graph 
> we can use {{SortedNumericDocValues}} where the values we encode are the 
> docids of the related documents. Encoding the interdocument relations using 
> docids directly will make it relatively fast to traverse the graph since we 
> won't need to look up through an id-field indirection. This choice limits us 
> to building a graph-per-segment since it would be impractical to maintain a 
> global graph for the whole index in the face of segment merges. However, 
> graph-per-segment is very natural at search time - we can traverse each 
> segments' graph independently and merge results as we do today for term-based 
> search.
> At index time, however, merging graphs is somewhat challenging. While 
> indexing we build a graph incrementally, performing searches to construct 
> links among neighbors. When merging segments we must construct a new graph 
> containing elements of all the merged segments. Ideally we would somehow 
> preserve the work done when building the initial graphs, but at least as a 
> start I'd propose we construct a new graph from scratch when merging. The 
> process is going to be limited, at least initially, to graphs that can fit 
> in RAM since we require random access to the entire graph while constructing 
> it: In order to add links bidirectionally we must continually update existing 
> documents.
> I think we want to express this API to users as a single joint 
> {{KnnGraphField}} abstraction that joins together the vectors and the graph 
> as a single joint field type. Mostly it just looks like a vector-valued 
> field, but has this graph attached to it.
> I'll push a branch with my POC and would love to hear comments. It has many 
> nocommits, basic design is not really set, there is no Query implementation 
> and no integration with IndexSearcher, but it does work by some measure using 
> a standalone test class. I've tested with uniform random vectors and 
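As a side note on the vector encoding sketched in the description above, a float 
vector could be packed into and read back from the bytes of a {{BinaryDocValues}} 
field roughly as follows; this is only an illustration under that assumption, 
and the field name and helper class are made up rather than taken from the POC 
branch:

{code:java}
import java.nio.ByteBuffer;

import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.util.BytesRef;

class VectorEncodingSketch {

  // Pack a float[] into the bytes of a BinaryDocValues field ("vector" is a made-up field name).
  static void addVector(Document doc, float[] vector) {
    ByteBuffer buffer = ByteBuffer.allocate(vector.length * Float.BYTES);
    buffer.asFloatBuffer().put(vector);
    doc.add(new BinaryDocValuesField("vector", new BytesRef(buffer.array())));
  }

  // Decode the stored bytes back into a float[]; the dimension is the byte length / 4.
  static float[] decodeVector(BytesRef bytes) {
    float[] vector = new float[bytes.length / Float.BYTES];
    ByteBuffer.wrap(bytes.bytes, bytes.offset, bytes.length).asFloatBuffer().get(vector);
    return vector;
  }
}
{code}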

[jira] [Commented] (LUCENE-9147) Move the stored fields index off-heap

2020-02-05 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031333#comment-17031333
 ] 

ASF subversion and git services commented on LUCENE-9147:
-

Commit 1b882246d70e1b67c2c438092ea627f7baff3249 in lucene-solr's branch 
refs/heads/master from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=1b88224 ]

LUCENE-9147: Avoid reusing file names with FileSwitchDirectory or 
NRTCachingDirectory and IOContext randomization.


> Move the stored fields index off-heap
> -
>
> Key: LUCENE-9147
> URL: https://issues.apache.org/jira/browse/LUCENE-9147
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Now that the terms index is off-heap by default, it's almost embarrassing 
> that many indices spend most of their memory usage on the stored fields index 
> or the term vectors index, which are much less performance-sensitive than the 
> terms index. We should move them off-heap too?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org