[jira] [Commented] (LUCENE-10477) SpanBoostQuery.rewrite was incomplete for boost==1 factor

2022-03-22 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17510705#comment-17510705
 ] 

ASF subversion and git services commented on LUCENE-10477:
--

Commit ffb3168d6bd1ca70b2c32b0d78d5169000f34523 in lucene's branch 
refs/heads/branch_9x from Christine Poerschke
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=ffb3168 ]

LUCENE-10477: mention 'call multiple times' in Query.rewrite javadoc (#758)

(cherry picked from commit 779c332a8c76f5de171b5d0239e5123ff8b5a10d)


> SpanBoostQuery.rewrite was incomplete for boost==1 factor
> -
>
> Key: LUCENE-10477
> URL: https://issues.apache.org/jira/browse/LUCENE-10477
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 8.11.1
>Reporter: Christine Poerschke
>Assignee: Christine Poerschke
>Priority: Minor
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> _(This bug report concerns pre-9.0 code only, but it's so subtle that I think it 
> warrants sharing, and maybe fixing if there were to be an 8.11.2 release 
> in the future.)_
> Some existing code, e.g. 
> [https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.1/lucene/queryparser/src/java/org/apache/lucene/queryparser/xml/builders/SpanNearBuilder.java#L54]
>  adds a {{SpanBoostQuery}} even if there is no boost or the boost factor is 
> {{1.0}}, i.e. the wrapping is technically unnecessary.
> Query rewriting should counteract this somewhat, except that it might not: note 
> at 
> [https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.1/lucene/core/src/java/org/apache/lucene/search/spans/SpanBoostQuery.java#L81-L83]
>  how the rewrite is a no-op, i.e. {{this.query.rewrite}} is not called!
> This can then manifest in strange ways, e.g. during highlighting:
> {code:java}
> ...
> java.lang.IllegalArgumentException: Rewrite first!
>   at org.apache.lucene.search.spans.SpanMultiTermQueryWrapper.createWeight(SpanMultiTermQueryWrapper.java:99)
>   at org.apache.lucene.search.spans.SpanNearQuery.createWeight(SpanNearQuery.java:183)
>   at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extractWeightedSpanTerms(WeightedSpanTermExtractor.java:295)
>   ...
> {code}
> This stack trace is not from 8.11.1 code, but the general logic is that at line 
> 293 rewrite was called (except it didn't do a full rewrite because of the 
> {{SpanBoostQuery}} wrapping around the {{SpanNearQuery}}) and so at 
> line 295 the {{IllegalArgumentException("Rewrite first!")}} arises: 
> [https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.1/lucene/core/src/java/org/apache/lucene/search/spans/SpanMultiTermQueryWrapper.java#L101]
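
To illustrate the failure mode described above, here is a minimal toy sketch in plain Java. It is not Lucene code: ToyQuery, ToyTerm, ToyMultiTerm, ToyBoost and rewriteFully are all hypothetical names, and the exact shape of the boost == 1 shortcut is an assumption based on the report's remark that the inner rewrite is not called. The point is only that a single rewrite pass can leave a nested query unrewritten, while calling rewrite repeatedly until nothing changes does not.

```java
final class RewriteSketch {
  // Toy model only: every type here is a hypothetical stand-in, not a Lucene class.
  interface ToyQuery {
    // Returns a simpler equivalent query, or `this` when nothing is left to do.
    ToyQuery rewrite();

    // True when the query can be executed without further rewriting.
    boolean isPrimitive();
  }

  // Leaf query that is already in its final form.
  static final class ToyTerm implements ToyQuery {
    @Override public ToyQuery rewrite() { return this; }
    @Override public boolean isPrimitive() { return true; }
  }

  // Stand-in for a query that must be rewritten before it can be used.
  static final class ToyMultiTerm implements ToyQuery {
    @Override public ToyQuery rewrite() { return new ToyTerm(); }
    @Override public boolean isPrimitive() { return false; }
  }

  // Boost wrapper with the shortcut described in the report (assumed shape).
  static final class ToyBoost implements ToyQuery {
    final ToyQuery inner;
    final float boost;

    ToyBoost(ToyQuery inner, float boost) {
      this.inner = inner;
      this.boost = boost;
    }

    @Override public ToyQuery rewrite() {
      if (boost == 1f) {
        return inner; // shortcut: inner.rewrite() is never called here
      }
      ToyQuery rewritten = inner.rewrite();
      return rewritten == inner ? this : new ToyBoost(rewritten, boost);
    }

    @Override public boolean isPrimitive() { return inner.isPrimitive(); }
  }

  // Calls rewrite() repeatedly until the query stops changing.
  static ToyQuery rewriteFully(ToyQuery query) {
    ToyQuery current = query;
    for (ToyQuery next = current.rewrite(); next != current; next = current.rewrite()) {
      current = next;
    }
    return current;
  }

  public static void main(String[] args) {
    ToyQuery wrapped = new ToyBoost(new ToyMultiTerm(), 1f);
    System.out.println("one pass primitive: " + wrapped.rewrite().isPrimitive());
    System.out.println("fixed point primitive: " + rewriteFully(wrapped).isPrimitive());
  }
}
```

In this toy, one call to rewrite() on the boost == 1 wrapper hands back the inner query untouched, which is the situation that would later trip a "Rewrite first!" style check. The loop in rewriteFully is one way a caller can protect itself against that.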



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10477) SpanBoostQuery.rewrite was incomplete for boost==1 factor

2022-03-22 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17510702#comment-17510702
 ] 

ASF subversion and git services commented on LUCENE-10477:
--

Commit 779c332a8c76f5de171b5d0239e5123ff8b5a10d in lucene's branch 
refs/heads/main from Christine Poerschke
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=779c332 ]

LUCENE-10477: mention 'call multiple times' in Query.rewrite javadoc (#758)



> SpanBoostQuery.rewrite was incomplete for boost==1 factor
> -
>
> Key: LUCENE-10477
> URL: https://issues.apache.org/jira/browse/LUCENE-10477
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 8.11.1
>Reporter: Christine Poerschke
>Assignee: Christine Poerschke
>Priority: Minor
>  Time Spent: 50m
>  Remaining Estimate: 0h






[jira] [Commented] (LUCENE-10477) SpanBoostQuery.rewrite was incomplete for boost==1 factor

2022-03-22 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17510547#comment-17510547
 ] 

ASF subversion and git services commented on LUCENE-10477:
--

Commit e7367f3047b7db2d6d54293b07ab121868a8de71 in lucene's branch 
refs/heads/branch_9x from Christine Poerschke
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e7367f3 ]

LUCENE-10464, LUCENE-10477: WeightedSpanTermExtractor.extractWeightedSpanTerms 
to rewrite sufficiently (#737)

(cherry picked from commit ca252d6621277cad8ca34361f8920c07482b0a16)


> SpanBoostQuery.rewrite was incomplete for boost==1 factor
> -
>
> Key: LUCENE-10477
> URL: https://issues.apache.org/jira/browse/LUCENE-10477
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 8.11.1
>Reporter: Christine Poerschke
>Assignee: Christine Poerschke
>Priority: Minor
>  Time Spent: 0.5h
>  Remaining Estimate: 0h






[jira] [Commented] (LUCENE-10477) SpanBoostQuery.rewrite was incomplete for boost==1 factor

2022-03-22 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17510541#comment-17510541
 ] 

ASF subversion and git services commented on LUCENE-10477:
--

Commit ca252d6621277cad8ca34361f8920c07482b0a16 in lucene's branch 
refs/heads/main from Christine Poerschke
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=ca252d6 ]

LUCENE-10464, LUCENE-10477: WeightedSpanTermExtractor.extractWeightedSpanTerms 
to rewrite sufficiently (#737)



> SpanBoostQuery.rewrite was incomplete for boost==1 factor
> -
>
> Key: LUCENE-10477
> URL: https://issues.apache.org/jira/browse/LUCENE-10477
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 8.11.1
>Reporter: Christine Poerschke
>Assignee: Christine Poerschke
>Priority: Minor
>  Time Spent: 0.5h
>  Remaining Estimate: 0h






[jira] [Commented] (LUCENE-10464) unnecessary for-loop in WeightedSpanTermExtractor.extractWeightedSpanTerms

2022-03-22 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17510540#comment-17510540
 ] 

ASF subversion and git services commented on LUCENE-10464:
--

Commit ca252d6621277cad8ca34361f8920c07482b0a16 in lucene's branch 
refs/heads/main from Christine Poerschke
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=ca252d6 ]

LUCENE-10464, LUCENE-10477: WeightedSpanTermExtractor.extractWeightedSpanTerms 
to rewrite sufficiently (#737)



> unnecessary for-loop in WeightedSpanTermExtractor.extractWeightedSpanTerms 
> ---
>
> Key: LUCENE-10464
> URL: https://issues.apache.org/jira/browse/LUCENE-10464
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Christine Poerschke
>Assignee: Christine Poerschke
>Priority: Minor
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> The 
> https://github.com/apache/lucene/commit/81c7ba4601a9aaf16e2255fe493ee582abe72a90
>  change in LUCENE-4728 included
> {code}
> - final SpanQuery rewrittenQuery = (SpanQuery) spanQuery.rewrite(getLeafContextForField(field).reader());
> + final SpanQuery rewrittenQuery = (SpanQuery) spanQuery.rewrite(getLeafContext().reader());
> {code}
> i.e. previously more needed to happen in the loop but now the query rewrite 
> and term collecting need not happen in the loop.
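
The observation above is essentially loop-invariant hoisting. A minimal sketch in plain Java follows; it is not the actual WeightedSpanTermExtractor code. HoistSketch, rewrite, insideLoop and hoisted are hypothetical names, and the string manipulation merely stands in for a rewrite that no longer depends on the field.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class HoistSketch {
  // Counts how often the (pretend) expensive step runs.
  static int rewriteCalls = 0;

  // Hypothetical stand-in for a rewrite that does not depend on the field.
  static String rewrite(String query) {
    rewriteCalls++;
    return query.toLowerCase(Locale.ROOT);
  }

  // Before: the loop-invariant rewrite is repeated for every field.
  static List<String> insideLoop(List<String> fields, String query) {
    List<String> out = new ArrayList<>();
    for (String field : fields) {
      String rewritten = rewrite(query);
      out.add(field + ":" + rewritten);
    }
    return out;
  }

  // After: the rewrite is hoisted out of the loop and runs once.
  static List<String> hoisted(List<String> fields, String query) {
    String rewritten = rewrite(query);
    List<String> out = new ArrayList<>();
    for (String field : fields) {
      out.add(field + ":" + rewritten);
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> fields = List.of("title", "body", "tags");
    rewriteCalls = 0;
    List<String> a = insideLoop(fields, "Foo*");
    int callsInside = rewriteCalls;
    rewriteCalls = 0;
    List<String> b = hoisted(fields, "Foo*");
    System.out.println("same result: " + a.equals(b));
    System.out.println("calls inside loop: " + callsInside + ", hoisted: " + rewriteCalls);
  }
}
```

Both variants produce the same list; only the number of rewrite calls differs, which is why the work can move out of the loop once it stops depending on the loop variable.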






[jira] [Commented] (LUCENE-10422) Monitor instantiation configurabilty improvements

2022-03-22 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17510362#comment-17510362
 ] 

ASF subversion and git services commented on LUCENE-10422:
--

Commit 28afaadfb81e7bc494f1a570cd87a178953382e0 in lucene's branch 
refs/heads/branch_9x from Alan Woodward
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=28afaad ]

LUCENE-10422: Make errorprone happy


> Monitor instantiation configurabilty improvements
> -
>
> Key: LUCENE-10422
> URL: https://issues.apache.org/jira/browse/LUCENE-10422
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Niko Usai
>Priority: Minor
> Fix For: 9.2
>
>  Time Spent: 8h 10m
>  Remaining Estimate: 0h
>
> I'm working on a project where I use the Lucene Monitor package very heavily, but 
> I miss some simple things in how {{Monitor}} manages its Directory, 
> IndexWriter and IndexReader. What I want to do is extend 
> {{MonitorConfiguration}} to make mainly these two things possible:
>  * use a custom {{Directory}} implementation.
>  * use a read-only {{QueryIndex}} in order to have more Monitor instances on 
> different servers reading from the same index (currently the index reader is created 
> from the index writer, so it is impossible to make a read-only {{Monitor}})






[jira] [Commented] (LUCENE-10422) Monitor instantiation configurabilty improvements

2022-03-22 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17510359#comment-17510359
 ] 

ASF subversion and git services commented on LUCENE-10422:
--

Commit 42bf77229ec2882ac9a8a004b98a103417d4ce2f in lucene's branch 
refs/heads/main from Alan Woodward
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=42bf772 ]

LUCENE-10422: Make errorprone happy


> Monitor instantiation configurabilty improvements
> -
>
> Key: LUCENE-10422
> URL: https://issues.apache.org/jira/browse/LUCENE-10422
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Niko Usai
>Priority: Minor
> Fix For: 9.2
>
>  Time Spent: 8h 10m
>  Remaining Estimate: 0h






[jira] [Commented] (LUCENE-10478) Mark Test4GBStoredFields as @Monster (it consumes a lot of disk)

2022-03-22 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17510345#comment-17510345
 ] 

ASF subversion and git services commented on LUCENE-10478:
--

Commit c608d9660a7f7153bb0eccbb5d6cd8139969efb3 in lucene's branch 
refs/heads/branch_9x from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=c608d96 ]

LUCENE-10478: mark Test4GBStoredFields as @Monster (#757)



> Mark Test4GBStoredFields as @Monster (it consumes a lot of disk)
> 
>
> Key: LUCENE-10478
> URL: https://issues.apache.org/jira/browse/LUCENE-10478
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Tomoko Uchida
>Priority: Trivial
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> `Test4GBStoredFields` creates very large index files (7GiB+) and can cause a 
> "disk full" error when running the smoke tester if sufficient free space is 
> not available in tmpfs.
> See [https://github.com/apache/lucene/pull/755] for details.






[jira] [Commented] (LUCENE-10478) Mark Test4GBStoredFields as @Monster (it consumes a lot of disk)

2022-03-22 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17510340#comment-17510340
 ] 

ASF subversion and git services commented on LUCENE-10478:
--

Commit fa61953afdd5b988adc25e11c559b3cb23820203 in lucene's branch 
refs/heads/main from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=fa61953 ]

LUCENE-10478: mark Test4GBStoredFields as @Monster (#757)



> Mark Test4GBStoredFields as @Monster (it consumes a lot of disk)
> 
>
> Key: LUCENE-10478
> URL: https://issues.apache.org/jira/browse/LUCENE-10478
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Tomoko Uchida
>Priority: Trivial
>  Time Spent: 40m
>  Remaining Estimate: 0h






[jira] [Commented] (LUCENE-10422) Monitor instantiation configurabilty improvements

2022-03-21 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17510009#comment-17510009
 ] 

ASF subversion and git services commented on LUCENE-10422:
--

Commit 72cf36a2b8d8811536b74e742c75bf1dc923 in lucene's branch 
refs/heads/branch_9x from mogui
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=72cf36a ]

LUCENE-10422: Read-only monitor implementation (#679)

This commit adds a read-only monitor implementation that can
search the QueryIndex of another monitor without supporting adding
new queries.

> Monitor instantiation configurabilty improvements
> -
>
> Key: LUCENE-10422
> URL: https://issues.apache.org/jira/browse/LUCENE-10422
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Niko Usai
>Priority: Minor
>  Time Spent: 8h 10m
>  Remaining Estimate: 0h






[jira] [Commented] (LUCENE-10422) Monitor instantiation configurabilty improvements

2022-03-21 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17510003#comment-17510003
 ] 

ASF subversion and git services commented on LUCENE-10422:
--

Commit be9917895645f99573833859e0c5c0d1cfc5b6d8 in lucene's branch 
refs/heads/main from mogui
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=be99178 ]

LUCENE-10422: Read-only monitor implementation (#679)

This commit adds a read-only monitor implementation that can
search the QueryIndex of another monitor without supporting adding
new queries.

> Monitor instantiation configurabilty improvements
> -
>
> Key: LUCENE-10422
> URL: https://issues.apache.org/jira/browse/LUCENE-10422
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Niko Usai
>Priority: Minor
>  Time Spent: 8h
>  Remaining Estimate: 0h






[jira] [Commented] (LUCENE-10473) Address slow testRandomBig runs

2022-03-21 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17509720#comment-17509720
 ] 

ASF subversion and git services commented on LUCENE-10473:
--

Commit f8d7073a857d0e0ae40a596145572c30cd09ce24 in lucene's branch 
refs/heads/branch_9x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=f8d7073 ]

LUCENE-10473: Make tests a bit faster when running nightly. (#754)



> Address slow testRandomBig runs
> ---
>
> Key: LUCENE-10473
> URL: https://issues.apache.org/jira/browse/LUCENE-10473
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Julie Tibshirani
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> While working on the 9.1 release, we noticed the smoke tester sometimes 
> taking several hours. It looks like some tests can take a really long time, 
> especially when "tests.multiplier" is set higher than the default of 1:
> {code:java}
> The slowest tests (exceeding 500 ms) during this run:
>   3298.44s TestDoubleRangeFieldQueries.testRandomBig (:lucene:core)
>   2869.82s Test2BPostings.test (:lucene:core)
>   1951.74s TestLatLonDocValuesQueries.testRandomBig (:lucene:core)
>   1628.04s TestLatLonPointQueries.testRandomBig (:lucene:core)
>   1492.32s TestGeo3DPoint.testRandomBig (:lucene:spatial3d)
>   1481.19s TestXYDocValuesQueries.testRandomBig (:lucene:core)
>   1351.95s TestXYPointQueries.testRandomBig (:lucene:core)
>   940.30s TestLongRangeFieldQueries.testRandomBig (:lucene:core)
>   871.50s Test4GBStoredFields.test (:lucene:core)
>   743.00s TestFloatRangeFieldQueries.testRandomBig (:lucene:core)
> {code}
> -The main offender looks like {{BaseSpatialTestCase#testRandomBig}}; we 
> should look into making this run faster.-
> Maybe relates to LUCENE-8643?






[jira] [Commented] (LUCENE-10473) Address slow testRandomBig runs

2022-03-21 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17509721#comment-17509721
 ] 

ASF subversion and git services commented on LUCENE-10473:
--

Commit 1b890ab5f9d7e6455a5d0a8ad4c0b0bbd93ccb35 in lucene's branch 
refs/heads/branch_9_1 from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=1b890ab ]

LUCENE-10473: Make tests a bit faster when running nightly. (#754)



> Address slow testRandomBig runs
> ---
>
> Key: LUCENE-10473
> URL: https://issues.apache.org/jira/browse/LUCENE-10473
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Julie Tibshirani
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h






[jira] [Commented] (LUCENE-10473) Address slow testRandomBig runs

2022-03-21 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17509718#comment-17509718
 ] 

ASF subversion and git services commented on LUCENE-10473:
--

Commit f239c0e03c47996ad666357a0775a35f6211d7ca in lucene's branch 
refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=f239c0e ]

LUCENE-10473: Make tests a bit faster when running nightly. (#754)



> Address slow testRandomBig runs
> ---
>
> Key: LUCENE-10473
> URL: https://issues.apache.org/jira/browse/LUCENE-10473
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Julie Tibshirani
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h






[jira] [Commented] (LUCENE-9905) Revise approach to specifying NN algorithm

2022-03-19 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17509373#comment-17509373
 ] 

ASF subversion and git services commented on LUCENE-9905:
-

Commit fcacd22a80565758155a6cb6973a5ca92918d957 in lucene's branch 
refs/heads/branch_9_1 from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=fcacd22 ]

LUCENE-9905: Fix check in TestPerFieldKnnVectorsFormat#testMergeUsesNewFormat

Before the assertion checked if two sets were equal, which resulted in rare
failures. Now we use 'contains' from hamcrest matchers.


> Revise approach to specifying NN algorithm
> --
>
> Key: LUCENE-9905
> URL: https://issues.apache.org/jira/browse/LUCENE-9905
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Julie Tibshirani
>Priority: Blocker
> Fix For: 9.0
>
>  Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
> In LUCENE-9322 we decided that the new vectors API shouldn’t assume a 
> particular nearest-neighbor search data structure and algorithm. This 
> flexibility is important since NN search is a developing area and we'd like 
> to be able to experiment and evolve the algorithm. Right now we only have one 
> algorithm (HNSW), but we want to maintain the ability to use another.
> Currently the algorithm to use is specified through {{SearchStrategy}}, for 
> example {{SearchStrategy.EUCLIDEAN_HNSW}}. So a single format implementation 
> is expected to handle multiple algorithms. Instead we could have one format 
> implementation per algorithm. Our current implementation would be 
> HNSW-specific like {{HnswVectorFormat}}, and to experiment with another 
> algorithm you could create a new implementation like {{ClusterVectorFormat}}. 
> This would be better aligned with the codec framework, and help avoid 
> exposing algorithm details in the API.
> A concrete proposal (note many of these names will change when LUCENE-9855 is 
> addressed):
> # Rename {{Lucene90VectorFormat}} to {{Lucene90HnswVectorFormat}}. Also add 
> HNSW to name of {{Lucene90VectorWriter}} and {{Lucene90VectorReader}}.
> # Remove references to HNSW in {{SearchStrategy}}, so there is just 
> {{SearchStrategy.EUCLIDEAN}}, etc. Rename {{SearchStrategy}} to something 
> like {{SimilarityFunction}}.
> # Remove {{FieldType}} attributes related to HNSW parameters (maxConn and 
> beamWidth). Instead make these arguments to {{Lucene90HnswVectorFormat}}.
> # Introduce {{PerFieldVectorFormat}} to allow a different NN approach or 
> parameters to be configured per-field (?)
> One note: the current HNSW-based format includes logic for storing a numeric 
> vector per document, as well as constructing and storing an HNSW graph. When 
> adding another implementation, it’d be nice to be able to reuse logic for 
> reading/writing numeric vectors. I don’t think we need to design for this 
> right now, but we can keep it in mind for the future?
> This issue is based on a thread [~jpountz] started: 
> [https://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOuQv5y2Vw39%3DXdOuqXGtDbM4qXx5-pmYiB1X4jPEdiFQ%40mail.gmail.com%3E]
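
A rough sketch of the shape the proposal above describes, in plain Java with no Lucene dependency. The type names echo the ones floated in the issue, but every class, method, and parameter value here is hypothetical and is not the API that was eventually committed; DOT_PRODUCT is added only to illustrate the "etc." in the proposal.

```java
import java.util.HashMap;
import java.util.Map;

class PerAlgorithmFormatSketch {
  /** Similarity only; no algorithm name is baked in (compare SearchStrategy.EUCLIDEAN_HNSW). */
  enum SimilarityFunction { EUCLIDEAN, DOT_PRODUCT }

  /** Hypothetical stand-in for a codec-level vector format. */
  interface VectorFormat {
    String describe();
  }

  /** HNSW parameters become constructor arguments of the format, not FieldType attributes. */
  static final class HnswVectorFormat implements VectorFormat {
    final int maxConn;
    final int beamWidth;

    HnswVectorFormat(int maxConn, int beamWidth) {
      this.maxConn = maxConn;
      this.beamWidth = beamWidth;
    }

    @Override
    public String describe() {
      return "hnsw(maxConn=" + maxConn + ", beamWidth=" + beamWidth + ")";
    }
  }

  /** Picks a format per field and falls back to a default for unregistered fields. */
  static final class PerFieldVectorFormat {
    private final Map<String, VectorFormat> byField = new HashMap<>();
    private final VectorFormat fallback;

    PerFieldVectorFormat(VectorFormat fallback) {
      this.fallback = fallback;
    }

    PerFieldVectorFormat register(String field, VectorFormat format) {
      byField.put(field, format);
      return this;
    }

    VectorFormat formatFor(String field) {
      return byField.getOrDefault(field, fallback);
    }
  }

  public static void main(String[] args) {
    // The field names and parameter values are made up for the example.
    PerFieldVectorFormat perField =
        new PerFieldVectorFormat(new HnswVectorFormat(16, 100))
            .register("title_vector", new HnswVectorFormat(32, 200));
    System.out.println(perField.formatFor("title_vector").describe());
    System.out.println(perField.formatFor("body_vector").describe());
  }
}
```

A different algorithm would then be another VectorFormat implementation (the issue mentions something like ClusterVectorFormat) rather than another enum constant.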



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9905) Revise approach to specifying NN algorithm

2022-03-19 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17509371#comment-17509371
 ] 

ASF subversion and git services commented on LUCENE-9905:
-

Commit f09b08563074379da2b6f24054ffe7243ba365a1 in lucene's branch 
refs/heads/branch_9x from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=f09b085 ]

LUCENE-9905: Fix check in TestPerFieldKnnVectorsFormat#testMergeUsesNewFormat

Before the assertion checked if two sets were equal, which resulted in rare
failures. Now we use 'contains' from hamcrest matchers.


> Revise approach to specifying NN algorithm
> --
>
> Key: LUCENE-9905
> URL: https://issues.apache.org/jira/browse/LUCENE-9905
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Julie Tibshirani
>Priority: Blocker
> Fix For: 9.0
>
>  Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
> In LUCENE-9322 we decided that the new vectors API shouldn’t assume a 
> particular nearest-neighbor search data structure and algorithm. This 
> flexibility is important since NN search is a developing area and we'd like 
> to be able to experiment and evolve the algorithm. Right now we only have one 
> algorithm (HNSW), but we want to maintain the ability to use another.
> Currently the algorithm to use is specified through {{SearchStrategy}}, for 
> example {{SearchStrategy.EUCLIDEAN_HNSW}}. So a single format implementation 
> is expected to handle multiple algorithms. Instead we could have one format 
> implementation per algorithm. Our current implementation would be 
> HNSW-specific like {{HnswVectorFormat}}, and to experiment with another 
> algorithm you could create a new implementation like {{ClusterVectorFormat}}. 
> This would be better aligned with the codec framework, and help avoid 
> exposing algorithm details in the API.
> A concrete proposal (note many of these names will change when LUCENE-9855 is 
> addressed):
> # Rename {{Lucene90VectorFormat}} to {{Lucene90HnswVectorFormat}}. Also add 
> HNSW to name of {{Lucene90VectorWriter}} and {{Lucene90VectorReader}}.
> # Remove references to HNSW in {{SearchStrategy}}, so there is just 
> {{SearchStrategy.EUCLIDEAN}}, etc. Rename {{SearchStrategy}} to something 
> like {{SimilarityFunction}}.
> # Remove {{FieldType}} attributes related to HNSW parameters (maxConn and 
> beamWidth). Instead make these arguments to {{Lucene90HnswVectorFormat}}.
> # Introduce {{PerFieldVectorFormat}} to allow a different NN approach or 
> parameters to be configured per-field (?)
> One note: the current HNSW-based format includes logic for storing a numeric 
> vector per document, as well as constructing and storing an HNSW graph. When 
> adding another implementation, it’d be nice to be able to reuse logic for 
> reading/writing numeric vectors. I don’t think we need to design for this 
> right now, but we can keep it in mind for the future?
> This issue is based on a thread [~jpountz] started: 
> [https://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOuQv5y2Vw39%3DXdOuqXGtDbM4qXx5-pmYiB1X4jPEdiFQ%40mail.gmail.com%3E]






[jira] [Commented] (LUCENE-9905) Revise approach to specifying NN algorithm

2022-03-19 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17509368#comment-17509368
 ] 

ASF subversion and git services commented on LUCENE-9905:
-

Commit a4b30b4cf4c0c9b4de0c27893ea2350af498b1c0 in lucene's branch 
refs/heads/main from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=a4b30b4 ]

LUCENE-9905: Fix check in TestPerFieldKnnVectorsFormat#testMergeUsesNewFormat

Before the assertion checked if two sets were equal, which resulted in rare
failures. Now we use 'contains' from hamcrest matchers.


> Revise approach to specifying NN algorithm
> --
>
> Key: LUCENE-9905
> URL: https://issues.apache.org/jira/browse/LUCENE-9905
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Julie Tibshirani
>Priority: Blocker
> Fix For: 9.0
>
>  Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
> In LUCENE-9322 we decided that the new vectors API shouldn’t assume a 
> particular nearest-neighbor search data structure and algorithm. This 
> flexibility is important since NN search is a developing area and we'd like 
> to be able to experiment and evolve the algorithm. Right now we only have one 
> algorithm (HNSW), but we want to maintain the ability to use another.
> Currently the algorithm to use is specified through {{SearchStrategy}}, for 
> example {{SearchStrategy.EUCLIDEAN_HNSW}}. So a single format implementation 
> is expected to handle multiple algorithms. Instead we could have one format 
> implementation per algorithm. Our current implementation would be 
> HNSW-specific like {{HnswVectorFormat}}, and to experiment with another 
> algorithm you could create a new implementation like {{ClusterVectorFormat}}. 
> This would be better aligned with the codec framework, and help avoid 
> exposing algorithm details in the API.
> A concrete proposal (note many of these names will change when LUCENE-9855 is 
> addressed):
> # Rename {{Lucene90VectorFormat}} to {{Lucene90HnswVectorFormat}}. Also add 
> HNSW to name of {{Lucene90VectorWriter}} and {{Lucene90VectorReader}}.
> # Remove references to HNSW in {{SearchStrategy}}, so there is just 
> {{SearchStrategy.EUCLIDEAN}}, etc. Rename {{SearchStrategy}} to something 
> like {{SimilarityFunction}}.
> # Remove {{FieldType}} attributes related to HNSW parameters (maxConn and 
> beamWidth). Instead make these arguments to {{Lucene90HnswVectorFormat}}.
> # Introduce {{PerFieldVectorFormat}} to allow a different NN approach or 
> parameters to be configured per-field (?)
> One note: the current HNSW-based format includes logic for storing a numeric 
> vector per document, as well as constructing and storing an HNSW graph. When 
> adding another implementation, it’d be nice to be able to reuse logic for 
> reading/writing numeric vectors. I don’t think we need to design for this 
> right now, but we can keep it in mind for the future?
> This issue is based on a thread [~jpountz] started: 
> [https://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOuQv5y2Vw39%3DXdOuqXGtDbM4qXx5-pmYiB1X4jPEdiFQ%40mail.gmail.com%3E]






[jira] [Commented] (LUCENE-9614) Implement KNN Query

2022-03-18 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17509097#comment-17509097
 ] 

ASF subversion and git services commented on LUCENE-9614:
-

Commit 22a9e45f096f87bf3f3dbaf866c34b472e1f8da3 in lucene's branch 
refs/heads/branch_9_1 from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=22a9e45 ]

LUCENE-9614: Fix rare TestKnnVectorQuery failures

Some of our checks relied on doc IDs corresponding to the order in which docs
were passed to IndexWriter. This is fragile and sometimes resulted in failures.
Now we check against an "id" field instead.


> Implement KNN Query
> ---
>
> Key: LUCENE-9614
> URL: https://issues.apache.org/jira/browse/LUCENE-9614
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Now we have a vector index format, and one vector indexing/KNN search 
> implementation, but the interface is low-level: you can search across a 
> single segment only. We would like to expose a Query implementation. 
> Initially, we want to support a usage where the KnnVectorQuery selects the 
> k-nearest neighbors without regard to any other constraints, and these can 
> then be filtered as part of an enclosing Boolean or other query.
> Later we will want to explore some kind of filtering *while* performing 
> vector search, or a re-entrant search process that can yield further results. 
> Because of the nature of knn search (all documents having any vector value 
> match), it is more like a ranking than a filtering operation, and it doesn't 
> really make sense to provide an iterator interface that can be merged in the 
> usual way, in docid order, skipping ahead. It's not yet clear how to satisfy 
> a query that is "k nearest neighbors satisfying some arbitrary Query", at 
> least not without realizing a complete bitset for the Query. But this is for 
> a later issue; *this* issue is just about performing the knn search in 
> isolation, computing a set of (some given) K nearest neighbors, and providing 
> an iterator over those.
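
To make the "select the k nearest first, then filter" usage concrete, here is a brute-force sketch in plain Java. It does not use Lucene or HNSW; the class and method names are invented for illustration, and the filter stands in for whatever an enclosing Boolean query would enforce.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.IntPredicate;

class KnnThenFilterSketch {
  /** Squared Euclidean distance between two vectors of equal length. */
  static float squaredDistance(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /** Brute-force k nearest neighbors; returns doc indices, nearest first. */
  static List<Integer> nearest(float[][] vectors, float[] target, int k) {
    List<Integer> docs = new ArrayList<>();
    for (int doc = 0; doc < vectors.length; doc++) {
      docs.add(doc);
    }
    docs.sort(Comparator.comparingDouble(doc -> squaredDistance(vectors[doc], target)));
    return new ArrayList<>(docs.subList(0, Math.min(k, docs.size())));
  }

  /** Applies the filter only after the k nearest have already been chosen. */
  static List<Integer> nearestThenFilter(
      float[][] vectors, float[] target, int k, IntPredicate accept) {
    List<Integer> kept = new ArrayList<>();
    for (int doc : nearest(vectors, target, k)) {
      if (accept.test(doc)) {
        kept.add(doc);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    float[][] vectors = {{0f, 0f}, {1f, 0f}, {2f, 0f}, {3f, 0f}};
    float[] target = {0f, 0f};
    // The two nearest docs are 0 and 1; the filter (even doc index) keeps only doc 0.
    System.out.println(nearest(vectors, target, 2));
    System.out.println(nearestThenFilter(vectors, target, 2, doc -> doc % 2 == 0));
  }
}
```

Filtering after the fact can return fewer than k hits even though doc 2 would also have passed the filter, which is why the issue leaves filtering during the vector search for later work.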






[jira] [Commented] (LUCENE-9614) Implement KNN Query

2022-03-18 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17509096#comment-17509096
 ] 

ASF subversion and git services commented on LUCENE-9614:
-

Commit e924d48b6a87ca0e52e66d49cade46371393972a in lucene's branch 
refs/heads/branch_9x from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e924d48 ]

LUCENE-9614: Fix rare TestKnnVectorQuery failures

Some of our checks relied on doc IDs corresponding to the order in which docs
were passed to IndexWriter. This is fragile and sometimes resulted in failures.
Now we check against an "id" field instead.


> Implement KNN Query
> ---
>
> Key: LUCENE-9614
> URL: https://issues.apache.org/jira/browse/LUCENE-9614
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Now we have a vector index format, and one vector indexing/KNN search 
> implementation, but the interface is low-level: you can search across a 
> single segment only. We would like to expose a Query implementation. 
> Initially, we want to support a usage where the KnnVectorQuery selects the 
> k-nearest neighbors without regard to any other constraints, and these can 
> then be filtered as part of an enclosing Boolean or other query.
> Later we will want to explore some kind of filtering *while* performing 
> vector search, or a re-entrant search process that can yield further results. 
> Because of the nature of knn search (all documents having any vector value 
> match), it is more like a ranking than a filtering operation, and it doesn't 
> really make sense to provide an iterator interface that can be merged in the 
> usual way, in docid order, skipping ahead. It's not yet clear how to satisfy 
> a query that is "k nearest neighbors satisfying some arbitrary Query", at 
> least not without realizing a complete bitset for the Query. But this is for 
> a later issue; *this* issue is just about performing the knn search in 
> isolation, computing a set of (some given) K nearest neighbors, and providing 
> an iterator over those.






[jira] [Commented] (LUCENE-9614) Implement KNN Query

2022-03-18 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17509091#comment-17509091
 ] 

ASF subversion and git services commented on LUCENE-9614:
-

Commit 18f9d31608654f88b2f048c4fd6fbdfedb3b6f95 in lucene's branch 
refs/heads/main from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=18f9d31 ]

LUCENE-9614: Fix rare TestKnnVectorQuery failures

Some of our checks relied on doc IDs corresponding to the order in which docs
were passed to IndexWriter. This is fragile and sometimes resulted in failures.
Now we check against an "id" field instead.


> Implement KNN Query
> ---
>
> Key: LUCENE-9614
> URL: https://issues.apache.org/jira/browse/LUCENE-9614
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Now we have a vector index format, and one vector indexing/KNN search 
> implementation, but the interface is low-level: you can search across a 
> single segment only. We would like to expose a Query implementation. 
> Initially, we want to support a usage where the KnnVectorQuery selects the 
> k-nearest neighbors without regard to any other constraints, and these can 
> then be filtered as part of an enclosing Boolean or other query.
> Later we will want to explore some kind of filtering *while* performing 
> vector search, or a re-entrant search process that can yield further results. 
> Because of the nature of knn search (all documents having any vector value 
> match), it is more like a ranking than a filtering operation, and it doesn't 
> really make sense to provide an iterator interface that can be merged in the 
> usual way, in docid order, skipping ahead. It's not yet clear how to satisfy 
> a query that is "k nearest neighbors satisfying some arbitrary Query", at 
> least not without realizing a complete bitset for the Query. But this is for 
> a later issue; *this* issue is just about performing the knn search in 
> isolation, computing a set of (some given) K nearest neighbors, and providing 
> an iterator over those.






[jira] [Commented] (LUCENE-10472) TestMatchAllDocsQuery#testEarlyTermination fails total hits assertion

2022-03-18 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508983#comment-17508983
 ] 

ASF subversion and git services commented on LUCENE-10472:
--

Commit ee14e46fc70cb77db074a579af3d446698124088 in lucene's branch 
refs/heads/branch_9x from Luca Cavanna
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=ee14e46 ]

LUCENE-10472: Fix TestMatchAllDocsQuery#testEarlyTermination (#753)

As part of #716 I moved the test to use a collector manager, but I forgot to 
update one of the assertions.
We can't rely on totalHits being accurate when the search is executed by 
multiple threads and terminated early.

> TestMatchAllDocsQuery#testEarlyTermination fails total hits assertion
> -
>
> Key: LUCENE-10472
> URL: https://issues.apache.org/jira/browse/LUCENE-10472
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Julie Tibshirani
>Priority: Minor
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Very rarely TestMatchAllDocsQuery can fail with an assertion error:
> {code}
> java.lang.AssertionError: expected:<201> but was:<241>
>     at __randomizedtesting.SeedInfo.seed([B4F2350E137110E3:7B6574512B1D80E0]:0)
>     at org.junit.Assert.fail(Assert.java:89)
>     at org.junit.Assert.failNotEquals(Assert.java:835)
>     at org.junit.Assert.assertEquals(Assert.java:647)
>     at org.junit.Assert.assertEquals(Assert.java:633)
>     at org.apache.lucene.search.TestMatchAllDocsQuery.testEarlyTermination(TestMatchAllDocsQuery.java:124)
> {code}
> We are expecting the totalHits.value to be exactly totalHitsThreshold + 1, 
> but it is sometimes more:
> {code}
> assertEquals(totalHitsThreshold + 1, topDocs.totalHits.value);
> assertEquals(TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO, 
> topDocs.totalHits.relation);
> {code}
> The failures only seem to happen when we choose to use multiple threads in 
> IndexSearcher. I was able to reproduce this regularly by forcing 
> IndexSearcher to use multiple threads and running the test 100 times. This 
> started failing after https://github.com/apache/lucene/pull/716, where we 
> updated tests to use a collector manager instead of a collector, which 
> introduced the possibility of multiple threads.
> I am not sure this actually indicates a bug, or if the test just needs to be 
> tweaked. It seems okay to return more than totalHitsThreshold + 1 sometimes, 
> that doesn't seem to violate any contract?
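
A toy model of why the reported total can exceed totalHitsThreshold + 1 once the index is searched in several slices. This is plain Java and deliberately not Lucene's collector code: it assumes each slice counts its own hits and stops only when its local count passes the threshold, and it walks the slices sequentially instead of using real threads. The names and numbers are made up for illustration.

```java
class SlicedHitCountSketch {
  /** Counts hits in one slice, stopping once the local count exceeds the threshold. */
  static int countSlice(int docsInSlice, int totalHitsThreshold) {
    int hits = 0;
    for (int doc = 0; doc < docsInSlice; doc++) {
      hits++;
      if (hits > totalHitsThreshold) {
        break; // early termination for this slice only
      }
    }
    return hits;
  }

  /** Sums the per-slice counts, as a reduce step over per-slice results would. */
  static int countAll(int[] sliceSizes, int totalHitsThreshold) {
    int total = 0;
    for (int size : sliceSizes) {
      total += countSlice(size, totalHitsThreshold);
    }
    return total;
  }

  public static void main(String[] args) {
    int threshold = 200;
    // One slice: the count stops at exactly threshold + 1.
    System.out.println(countAll(new int[] {1000}, threshold)); // 201
    // Two slices: the small slice is counted fully, the large one stops at 201.
    System.out.println(countAll(new int[] {50, 950}, threshold)); // 251
  }
}
```

Under this model a test can only assert that the count is at least totalHitsThreshold + 1 (together with the GREATER_THAN_OR_EQUAL_TO relation), not that it equals it.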






[jira] [Commented] (LUCENE-10472) TestMatchAllDocsQuery#testEarlyTermination fails total hits assertion

2022-03-18 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508984#comment-17508984
 ] 

ASF subversion and git services commented on LUCENE-10472:
--

Commit 9b4003236f2b9e9f35ba484a08611952794cf6ad in lucene's branch 
refs/heads/branch_9_1 from Luca Cavanna
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=9b40032 ]

LUCENE-10472: Fix TestMatchAllDocsQuery#testEarlyTermination (#753)

As part of #716 I moved the test to use a collector manager, but I forgot to 
update one of the assertions.
We can't rely on totalHits being accurate when the search is executed by 
multiple threads and terminated early.

> TestMatchAllDocsQuery#testEarlyTermination fails total hits assertion
> -
>
> Key: LUCENE-10472
> URL: https://issues.apache.org/jira/browse/LUCENE-10472
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Julie Tibshirani
>Priority: Minor
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Very rarely TestMatchAllDocsQuery can fail with an assertion error:
> {code}
> java.lang.AssertionError: expected:<201> but was:<241>
>     at __randomizedtesting.SeedInfo.seed([B4F2350E137110E3:7B6574512B1D80E0]:0)
>     at org.junit.Assert.fail(Assert.java:89)
>     at org.junit.Assert.failNotEquals(Assert.java:835)
>     at org.junit.Assert.assertEquals(Assert.java:647)
>     at org.junit.Assert.assertEquals(Assert.java:633)
>     at org.apache.lucene.search.TestMatchAllDocsQuery.testEarlyTermination(TestMatchAllDocsQuery.java:124)
> {code}
> We are expecting the totalHits.value to be exactly totalHitsThreshold + 1, 
> but it is sometimes more:
> {code}
> assertEquals(totalHitsThreshold + 1, topDocs.totalHits.value);
> assertEquals(TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO, 
> topDocs.totalHits.relation);
> {code}
> The failures only seem to happen when we choose to use multiple threads in 
> IndexSearcher. I was able to reproduce this regularly by forcing 
> IndexSearcher to use multiple threads and running the test 100 times. This 
> started failing after https://github.com/apache/lucene/pull/716, where we 
> updated tests to use a collector manager instead of a collector, which 
> introduced the possibility of multiple threads.
> I am not sure this actually indicates a bug, or if the test just needs to be 
> tweaked. It seems okay to return more than totalHitsThreshold + 1 sometimes, 
> that doesn't seem to violate any contract?






[jira] [Commented] (LUCENE-10472) TestMatchAllDocsQuery#testEarlyTermination fails total hits assertion

2022-03-18 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508982#comment-17508982
 ] 

ASF subversion and git services commented on LUCENE-10472:
--

Commit bb7568d865ca5f6932c705f1ae3b5adb159ae8ec in lucene's branch 
refs/heads/main from Luca Cavanna
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=bb7568d ]

LUCENE-10472: Fix TestMatchAllDocsQuery#testEarlyTermination (#753)

As part of #716 I moved the test to use a collector manager, but I forgot to 
update one of the assertions.
We can't rely on totalHits being accurate when the search is executed by 
multiple threads and terminated early.

> TestMatchAllDocsQuery#testEarlyTermination fails total hits assertion
> -
>
> Key: LUCENE-10472
> URL: https://issues.apache.org/jira/browse/LUCENE-10472
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Julie Tibshirani
>Priority: Minor
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Very rarely TestMatchAllDocsQuery can fail with an assertion error:
> {code}
> java.lang.AssertionError: expected:<201> but was:<241>
>     at __randomizedtesting.SeedInfo.seed([B4F2350E137110E3:7B6574512B1D80E0]:0)
>     at org.junit.Assert.fail(Assert.java:89)
>     at org.junit.Assert.failNotEquals(Assert.java:835)
>     at org.junit.Assert.assertEquals(Assert.java:647)
>     at org.junit.Assert.assertEquals(Assert.java:633)
>     at org.apache.lucene.search.TestMatchAllDocsQuery.testEarlyTermination(TestMatchAllDocsQuery.java:124)
> {code}
> We are expecting the totalHits.value to be exactly totalHitsThreshold + 1, 
> but it is sometimes more:
> {code}
> assertEquals(totalHitsThreshold + 1, topDocs.totalHits.value);
> assertEquals(TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO, 
> topDocs.totalHits.relation);
> {code}
> The failures only seem to happen when we choose to use multiple threads in 
> IndexSearcher. I was able to reproduce this regularly by forcing 
> IndexSearcher to use multiple threads and running the test 100 times. This 
> started failing after https://github.com/apache/lucene/pull/716, where we 
> updated tests to use a collector manager instead of a collector, which 
> introduced the possibility of multiple threads.
> I am not sure this actually indicates a bug, or if the test just needs to be 
> tweaked. It seems okay to return more than totalHitsThreshold + 1 sometimes, 
> that doesn't seem to violate any contract?






[jira] [Commented] (LUCENE-10418) Improve Query rewriting for non-scoring clauses

2022-03-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508252#comment-17508252
 ] 

ASF subversion and git services commented on LUCENE-10418:
--

Commit 1dcb64b492b33f2adc3735458eefbc80e8feb1ef in lucene's branch 
refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=1dcb64b ]

LUCENE-10418: Move CHANGES to the correct section.


> Improve Query rewriting for non-scoring clauses
> ---
>
> Key: LUCENE-10418
> URL: https://issues.apache.org/jira/browse/LUCENE-10418
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Query rewriting is occasionally important for performance, e.g. it may allow 
> using an optimized bulk scorer instead of the default bulk scorer like in the 
> example from LUCENE-10412.
> One case when we could simplify queries is in the non-scoring case. All 
> layers of query wrappers that only affect scoring like BoostQuery and 
> ConstantScoreQuery can be removed, which might help identify new 
> opportunities for rewriting. For instance, we have several rewrite rules that 
> optimize for MatchAllDocsQuery and would fail to recognize it if it is behind 
> a ConstantScoreQuery or a BoostQuery. Boolean queries can also simplify 
> themselves in the non-scoring case, by changing MUST clauses to FILTER 
> clauses, or removing fully optional SHOULD clauses.
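
The wrapper-stripping idea can be sketched with a toy query tree in plain Java. The nested classes below are hypothetical stand-ins, not Lucene's Query classes, and the helper only illustrates the non-scoring case described above; it is not the rewrite logic from the actual commit.

```java
class NonScoringRewriteSketch {
  interface Query {}

  /** Matches every document. */
  static final class MatchAll implements Query {}

  /** Multiplies the inner query's score; irrelevant when scores are not needed. */
  static final class Boost implements Query {
    final Query inner;
    final float boost;

    Boost(Query inner, float boost) {
      this.inner = inner;
      this.boost = boost;
    }
  }

  /** Replaces the inner query's score with a constant; also scoring-only. */
  static final class ConstantScore implements Query {
    final Query inner;

    ConstantScore(Query inner) {
      this.inner = inner;
    }
  }

  /** Peels off wrappers that only affect scoring, for use when scores are not needed. */
  static Query stripScoringWrappers(Query query) {
    while (true) {
      if (query instanceof Boost) {
        query = ((Boost) query).inner;
      } else if (query instanceof ConstantScore) {
        query = ((ConstantScore) query).inner;
      } else {
        return query;
      }
    }
  }

  public static void main(String[] args) {
    Query wrapped = new ConstantScore(new Boost(new MatchAll(), 2.0f));
    // After stripping, a rule that special-cases MatchAll can recognize it again.
    System.out.println(stripScoringWrappers(wrapped) instanceof MatchAll);
  }
}
```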






[jira] [Commented] (LUCENE-10418) Improve Query rewriting for non-scoring clauses

2022-03-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508253#comment-17508253
 ] 

ASF subversion and git services commented on LUCENE-10418:
--

Commit 1c30cd7671471b007522c0324e56d6790c255c1f in lucene's branch 
refs/heads/branch_9x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=1c30cd7 ]

LUCENE-10418: Optimize `Query#rewrite` in the non-scoring case. (#672)


> Improve Query rewriting for non-scoring clauses
> ---
>
> Key: LUCENE-10418
> URL: https://issues.apache.org/jira/browse/LUCENE-10418
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Query rewriting is occasionally important for performance, e.g. it may allow 
> using an optimized bulk scorer instead of the default bulk scorer like in the 
> example from LUCENE-10412.
> One case when we could simplify queries is in the non-scoring case. All 
> layers of query wrappers that only affect scoring like BoostQuery and 
> ConstantScoreQuery can be removed, which might help identify new 
> opportunities for rewriting. For instance, we have several rewrite rules that 
> optimize for MatchAllDocsQuery and would fail to recognize it if it is behind 
> a ConstantScoreQuery or a BoostQuery. Boolean queries can also simplify 
> themselves in the non-scoring case, by changing MUST clauses to FILTER 
> clauses, or removing fully optional SHOULD clauses.






[jira] [Commented] (LUCENE-10418) Improve Query rewriting for non-scoring clauses

2022-03-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508250#comment-17508250
 ] 

ASF subversion and git services commented on LUCENE-10418:
--

Commit 8fb6543280ca5b1c4abfb9cd758765cda29e316a in lucene's branch 
refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=8fb6543 ]

LUCENE-10418: Optimize `Query#rewrite` in the non-scoring case. (#672)



> Improve Query rewriting for non-scoring clauses
> ---
>
> Key: LUCENE-10418
> URL: https://issues.apache.org/jira/browse/LUCENE-10418
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Query rewriting is occasionally important for performance, e.g. it may allow 
> using an optimized bulk scorer instead of the default bulk scorer like in the 
> example from LUCENE-10412.
> One case when we could simplify queries is in the non-scoring case. All 
> layers of query wrappers that only affect scoring like BoostQuery and 
> ConstantScoreQuery can be removed, which might help identify new 
> opportunities for rewriting. For instance, we have several rewrite rules that 
> optimize for MatchAllDocsQuery and would fail to recognize it if it is behind 
> a ConstantScoreQuery or a BoostQuery. Boolean queries can also simplify 
> themselves in the non-scoring case, by changing MUST clauses to FILTER 
> clauses, or removing fully optional SHOULD clauses.






[jira] [Commented] (LUCENE-10469) ConstantScoreQuery doesn't propagate its score mode correctly

2022-03-16 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17507556#comment-17507556
 ] 

ASF subversion and git services commented on LUCENE-10469:
--

Commit 31ddf6a6cd42aaf24d768a815700448c169d2efb in lucene's branch 
refs/heads/branch_9x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=31ddf6a ]

LUCENE-10469: Fix score mode propagation in ConstantScoreQuery. (#750)


> ConstantScoreQuery doesn't propagate its score mode correctly
> -
>
> Key: LUCENE-10469
> URL: https://issues.apache.org/jira/browse/LUCENE-10469
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 9.1
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We just noticed a performance bug in Elasticsearch: if you issue a search 
> request sorted by field and the query is a MatchAllDocsQuery, everything 
> works as expected.
> But if you change the query to be a MatchAllDocsQuery within a 
> ConstantScoreQuery then the query suddenly visits all matching documents. 
> This is due to the fact that ConstantScoreQuery always passes 
> COMPLETE_NO_SCORES to the inner weight, and that the MatchAllDocsQuery's 
> optimized bulk scorer performs a brute-force for-loop over the documents to 
> score unless the score mode is not exhaustive.
> The fix consists of making ConstantScoreQuery propagate a non-exhaustive 
> score mode to the inner weight when the score mode that is passed to the 
> ConstantScoreQuery is not exhaustive itself.
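
The before/after behaviour can be sketched with a toy enum in plain Java. The Mode enum mirrors only the exhaustive versus non-exhaustive distinction from the description above; it is not Lucene's ScoreMode, and the choice of TOP_DOCS as the propagated non-exhaustive mode is an assumption made for this example.

```java
class ScoreModePropagationSketch {
  /** Toy score modes; only the exhaustive flag matters for this sketch. */
  enum Mode {
    COMPLETE(true),
    COMPLETE_NO_SCORES(true),
    TOP_SCORES(false),
    TOP_DOCS(false);

    final boolean exhaustive;

    Mode(boolean exhaustive) {
      this.exhaustive = exhaustive;
    }
  }

  /** Behaviour described as the bug: the inner weight always sees an exhaustive mode. */
  static Mode innerModeBefore(Mode outer) {
    return Mode.COMPLETE_NO_SCORES;
  }

  /** Behaviour described as the fix: a non-exhaustive outer mode stays non-exhaustive. */
  static Mode innerModeAfter(Mode outer) {
    return outer.exhaustive ? Mode.COMPLETE_NO_SCORES : Mode.TOP_DOCS;
  }

  public static void main(String[] args) {
    // A sorted top-hits search does not need to visit every match.
    Mode outer = Mode.TOP_DOCS;
    System.out.println(innerModeBefore(outer).exhaustive); // true: inner query visits everything
    System.out.println(innerModeAfter(outer).exhaustive); // false: inner query may skip documents
  }
}
```

In both versions the inner query never computes scores itself, since the wrapper supplies the constant score; only the permission to skip documents changes.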






[jira] [Commented] (LUCENE-10469) ConstantScoreQuery doesn't propagate its score mode correctly

2022-03-16 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17507555#comment-17507555
 ] 

ASF subversion and git services commented on LUCENE-10469:
--

Commit 5b522487ba8e0f1002b50a136817ca037aec9686 in lucene's branch 
refs/heads/branch_9_1 from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=5b52248 ]

LUCENE-10469: Fix score mode propagation in ConstantScoreQuery. (#750)


> ConstantScoreQuery doesn't propagate its score mode correctly
> -
>
> Key: LUCENE-10469
> URL: https://issues.apache.org/jira/browse/LUCENE-10469
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 9.1
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We just noticed a performance bug on Elasticsearch that if you issue a search 
> request sorted by field and the query is a MatchAllDocsQuery then everything 
> works as expected.
> But if you change the query to be a MatchAllDocsQuery within a 
> ConstantScoreQuery then the query suddenly visits all matching documents. 
> This is due to the fact that ConstantScoreQuery always passes 
> COMPLETE_NO_SCORES to the inner weight, and that MatchAllDocsQuery's 
> optimized bulk scorer performs a brute-force for-loop over the documents to 
> score unless the score mode is non-exhaustive.
> The fix consists of making ConstantScoreQuery propagate a non-exhaustive 
> score mode to the inner weight when the score mode that is passed to the 
> ConstantScoreQuery is not exhaustive itself.
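
The propagation rule described in the issue can be sketched roughly as follows. This is a simplified model, not Lucene's real `ScoreMode` or `Weight` classes: the names `ScoreModeSketch`, `Mode`, `innerModeBeforeFix` and `innerModeAfterFix` are hypothetical, and mapping a non-exhaustive outer mode to `TOP_DOCS` is an assumption about one reasonable choice rather than a statement of what the commit does.

```java
/** Simplified model of score-mode propagation; not Lucene's real classes. */
final class ScoreModeSketch {

  /** Hypothetical stand-in for a score mode: does the collector need every hit? */
  enum Mode {
    COMPLETE(true),
    COMPLETE_NO_SCORES(true),
    TOP_DOCS(false);

    private final boolean exhaustive;

    Mode(boolean exhaustive) {
      this.exhaustive = exhaustive;
    }

    boolean isExhaustive() {
      return exhaustive;
    }
  }

  /** Behaviour described as the bug: the inner query is always asked for all matches. */
  static Mode innerModeBeforeFix(Mode outer) {
    return Mode.COMPLETE_NO_SCORES;
  }

  /** Behaviour described as the fix: a non-exhaustive outer mode stays non-exhaustive. */
  static Mode innerModeAfterFix(Mode outer) {
    // Scores are constant, so the inner query never has to compute them;
    // only the "may skip documents" property needs to be preserved.
    return outer.isExhaustive() ? Mode.COMPLETE_NO_SCORES : Mode.TOP_DOCS;
  }

  public static void main(String[] args) {
    for (Mode outer : Mode.values()) {
      System.out.println(outer + " -> before: " + innerModeBeforeFix(outer)
          + ", after: " + innerModeAfterFix(outer));
    }
  }
}
```

With the "after" rule, a sorted search that only needs the top hits keeps a non-exhaustive mode all the way down, so the inner scorer is free to skip documents.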






[jira] [Commented] (LUCENE-10469) ConstantScoreQuery doesn't propagate its score mode correctly

2022-03-16 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17507549#comment-17507549
 ] 

ASF subversion and git services commented on LUCENE-10469:
--

Commit 86bd921fcedce6be69ff2aff70c99dcb78bd8ce5 in lucene's branch 
refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=86bd921f ]

LUCENE-10469: Fix score mode propagation in ConstantScoreQuery. (#750)



> ConstantScoreQuery doesn't propagate its score mode correctly
> -
>
> Key: LUCENE-10469
> URL: https://issues.apache.org/jira/browse/LUCENE-10469
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We just noticed a performance bug on Elasticsearch that if you issue a search 
> request sorted by field and the query is a MatchAllDocsQuery then everything 
> works as expected.
> But if you change the query to be a MatchAllDocsQuery within a 
> ConstantScoreQuery then the query suddenly visits all matching documents. 
> This is due to the fact that ConstantScoreQuery always passes 
> COMPLETE_NO_SCORES to the inner weight, and that MatchAllDocsQuery's 
> optimized bulk scorer performs a brute-force for-loop over the documents to 
> score unless the score mode is non-exhaustive.
> The fix consists of making ConstantScoreQuery propagate a non-exhaustive 
> score mode to the inner weight when the score mode that is passed to the 
> ConstantScoreQuery is not exhaustive itself.






[jira] [Commented] (LUCENE-10452) Hunspell: call checkCanceled less frequently to reduce the overhead

2022-03-16 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17507417#comment-17507417
 ] 

ASF subversion and git services commented on LUCENE-10452:
--

Commit 5c273159c8ea8e847630415aa8d62a2d9b7aa55c in lucene's branch 
refs/heads/branch_9x from Peter Gromov
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=5c27315 ]

LUCENE-10452, LUCENE-10451: mention hunspell changes in CHANGES.txt


> Hunspell: call checkCanceled less frequently to reduce the overhead
> ---
>
> Key: LUCENE-10452
> URL: https://issues.apache.org/jira/browse/LUCENE-10452
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Peter Gromov
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>







[jira] [Commented] (LUCENE-10451) Hunspell: don't perform potentially expensive spellchecking after timeout

2022-03-16 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17507418#comment-17507418
 ] 

ASF subversion and git services commented on LUCENE-10451:
--

Commit 5c273159c8ea8e847630415aa8d62a2d9b7aa55c in lucene's branch 
refs/heads/branch_9x from Peter Gromov
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=5c27315 ]

LUCENE-10452, LUCENE-10451: mention hunspell changes in CHANGES.txt


> Hunspell: don't perform potentially expensive spellchecking after timeout
> -
>
> Key: LUCENE-10451
> URL: https://issues.apache.org/jira/browse/LUCENE-10451
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Peter Gromov
>Priority: Major
> Fix For: 9.2
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently, to return a partial result after a timeout, additional processing 
> with case adjustment and `spell` calls is performed, which can take time and 
> also result in superfluous `checkCanceled` invocations.






[jira] [Commented] (LUCENE-10451) Hunspell: don't perform potentially expensive spellchecking after timeout

2022-03-16 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17507419#comment-17507419
 ] 

ASF subversion and git services commented on LUCENE-10451:
--

Commit 385fd560fa2b3b70121b8dcb9acc03001b7fe9a8 in lucene's branch 
refs/heads/branch_9x from Peter Gromov
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=385fd56 ]

LUCENE-10451 Hunspell: don't perform potentially expensive spellchecking after 
timeout (#721)

move all expensive operations closer to the suggestion creation, encapsulate 
case and output conversion into a new Suggestion class


> Hunspell: don't perform potentially expensive spellchecking after timeout
> -
>
> Key: LUCENE-10451
> URL: https://issues.apache.org/jira/browse/LUCENE-10451
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Peter Gromov
>Priority: Major
> Fix For: 9.2
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently, to return a partial result after a timeout, additional processing 
> with case adjustment and `spell` calls is performed, which can take time and 
> also result in superfluous `checkCanceled` invocations.






[jira] [Commented] (LUCENE-10452) Hunspell: call checkCanceled less frequently to reduce the overhead

2022-03-16 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17507416#comment-17507416
 ] 

ASF subversion and git services commented on LUCENE-10452:
--

Commit d91f51f01e6269b55c1dab27db3131cae935cdc9 in lucene's branch 
refs/heads/branch_9x from Peter Gromov
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d91f51f ]

LUCENE-10452: Hunspell: call checkCanceled less frequently to reduce the 
overhead (#723)



> Hunspell: call checkCanceled less frequently to reduce the overhead
> ---
>
> Key: LUCENE-10452
> URL: https://issues.apache.org/jira/browse/LUCENE-10452
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Peter Gromov
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>







[jira] [Commented] (LUCENE-10452) Hunspell: call checkCanceled less frequently to reduce the overhead

2022-03-16 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17507414#comment-17507414
 ] 

ASF subversion and git services commented on LUCENE-10452:
--

Commit 0e3c315b7662799658f331ca2c9985457d0cd161 in lucene's branch 
refs/heads/main from Peter Gromov
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=0e3c315 ]

LUCENE-10452, LUCENE-10451: mention hunspell changes in CHANGES.txt


> Hunspell: call checkCanceled less frequently to reduce the overhead
> ---
>
> Key: LUCENE-10452
> URL: https://issues.apache.org/jira/browse/LUCENE-10452
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Peter Gromov
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>







[jira] [Commented] (LUCENE-10451) Hunspell: don't perform potentially expensive spellchecking after timeout

2022-03-16 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17507415#comment-17507415
 ] 

ASF subversion and git services commented on LUCENE-10451:
--

Commit 0e3c315b7662799658f331ca2c9985457d0cd161 in lucene's branch 
refs/heads/main from Peter Gromov
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=0e3c315 ]

LUCENE-10452, LUCENE-10451: mention hunspell changes in CHANGES.txt


> Hunspell: don't perform potentially expensive spellchecking after timeout
> -
>
> Key: LUCENE-10451
> URL: https://issues.apache.org/jira/browse/LUCENE-10451
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Peter Gromov
>Priority: Major
> Fix For: 9.2
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently, to return a partial result after a timeout, additional processing 
> with case adjustment and `spell` calls is performed, which can take time and 
> also result in superfluous `checkCanceled` invocations.






[jira] [Commented] (LUCENE-10452) Hunspell: call checkCanceled less frequently to reduce the overhead

2022-03-16 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17507404#comment-17507404
 ] 

ASF subversion and git services commented on LUCENE-10452:
--

Commit af97c5ef379f794b77a2bc201257d991f7bf11e8 in lucene's branch 
refs/heads/main from Peter Gromov
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=af97c5e ]

LUCENE-10452: Hunspell: call checkCanceled less frequently to reduce the 
overhead (#723)



> Hunspell: call checkCanceled less frequently to reduce the overhead
> ---
>
> Key: LUCENE-10452
> URL: https://issues.apache.org/jira/browse/LUCENE-10452
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Peter Gromov
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>







[jira] [Commented] (LUCENE-10311) Should DocIdSetBuilder have different implementations for point and terms?

2022-03-15 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17507177#comment-17507177
 ] 

ASF subversion and git services commented on LUCENE-10311:
--

Commit ea989fe8f305040701527fe3cf9c68ed99dc9c42 in lucene's branch 
refs/heads/branch_9_1 from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=ea989fe ]

LUCENE-10311: avoid division by zero on small sets.


> Should DocIdSetBuilder have different implementations for point and terms?
> --
>
> Key: LUCENE-10311
> URL: https://issues.apache.org/jira/browse/LUCENE-10311
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Ignacio Vera
>Priority: Major
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> DocIdSetBuilder has two API implementations, one for terms queries and one 
> for point values queries. In each case they are used in a totally different 
> way.
> For terms the API looks like:
>  
> {code:java}
> /**
>  * Add the content of the provided {@link DocIdSetIterator} to this builder. NOTE: if you need to
>  * build a {@link DocIdSet} out of a single {@link DocIdSetIterator}, you should rather use {@link
>  * RoaringDocIdSet.Builder}.
>  */
> void add(DocIdSetIterator iter) throws IOException;
> /** Build a {@link DocIdSet} from the accumulated doc IDs. */
> DocIdSet build()
> {code}
>  
> For Point Values it looks like:
>  
> {code:java}
> /**
>  * Utility class to efficiently add many docs in one go.
>  *
>  * @see DocIdSetBuilder#grow
>  */
> public abstract static class BulkAdder {
>   public abstract void add(int doc);
>   public void add(DocIdSetIterator iterator) throws IOException {
>     int docID;
>     while ((docID = iterator.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
>       add(docID);
>     }
>   }
> }
> /**
>  * Reserve space and return a {@link BulkAdder} object that can be used to add up to {@code
>  * numDocs} documents.
>  */
> public BulkAdder grow(int numDocs)
> /** Build a {@link DocIdSet} from the accumulated doc IDs. */
> DocIdSet build()
> {code}
>  
>  
> This is becoming trappy for new developments in the PointValues API.
> 1) When we call #grow() from the PointValues API, we are not telling the 
> builder how many docs we are going to add (as we don't really know it) but 
> the number of points we are about to visit. This number can be bigger than 
> Integer.MAX_VALUE. Until now, we have worked around this issue by making sure 
> we don't call this API when we need to add more than Integer.MAX_VALUE points. 
> In that case we navigate down the tree until the number of points is 
> small enough to fit in an int.
> This has worked well until now because we are calling grow from inside the BKD 
> reader, and the BKD writer/reader makes sure that the number of points in a 
> leaf can fit in an int. In LUCENE-, we are moving to a cursor-like API which 
> does not enforce that the number of points in a leaf needs to fit in an int. 
> This causes friction and inconsistency in the API.
>  
> 2) This is a secondary issue that I found when thinking about this issue. In 
> Lucene- we added the possibility to add a `DocIdSetIterator` from the 
> PointValues API. Therefore there are two ways to add this kind of object 
> to a DocIdSetBuilder, which can produce different results:
>  
> {code:java}
> {
>   // Terms API
>   docIdSetBuilder.add(docIdSetIterator); 
> }
> {
>   // Point values API
>   docIdSetBuilder.grow(doc).add(docIdSetIterator);
> }
> {code}
>  
> I wonder if we need to rethink this API: should we have different 
> implementations for terms and point values?
>  
>  
>  
>  
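
The first problem in the description, a point count that may exceed `Integer.MAX_VALUE` being handed to a method that takes an `int`, can be illustrated with a small sketch. The class `GrowCountSketch` and its helpers are hypothetical and do not exist in Lucene; clamping is shown only as one possible workaround under the assumption that a reservation larger than `Integer.MAX_VALUE` is never useful, not as the approach the issue settles on.

```java
/** Illustrates the long-vs-int mismatch when a point count is passed to a grow(int)-style method. */
final class GrowCountSketch {

  /** True if the point count can be passed to an int parameter without loss. */
  static boolean fitsInInt(long numPoints) {
    return numPoints >= 0 && numPoints <= Integer.MAX_VALUE;
  }

  /** Clamp a non-negative point count to the int range (assumed workaround, see the text above). */
  static int clampToInt(long numPoints) {
    if (numPoints < 0) {
      throw new IllegalArgumentException("numPoints must be >= 0, got " + numPoints);
    }
    return (int) Math.min((long) Integer.MAX_VALUE, numPoints);
  }

  public static void main(String[] args) {
    long numPoints = 3_000_000_000L; // more points than an int can represent
    System.out.println("fits in int: " + fitsInInt(numPoints));
    System.out.println("naive cast:  " + (int) numPoints); // silently wraps to a negative number
    System.out.println("clamped:     " + clampToInt(numPoints));
  }
}
```

A caller that cannot clamp would instead have to check `fitsInInt` first and descend to smaller subtrees when it returns false, which is the workaround the description says the BKD reader relies on today.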






[jira] [Commented] (LUCENE-10451) Hunspell: don't perform potentially expensive spellchecking after timeout

2022-03-15 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17507119#comment-17507119
 ] 

ASF subversion and git services commented on LUCENE-10451:
--

Commit 92a20c24e93915a79ef53391a70cf2fb35cb1e93 in lucene's branch 
refs/heads/main from Peter Gromov
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=92a20c2 ]

LUCENE-10451 Hunspell: don't perform potentially expensive spellchecking after 
timeout (#721)

move all expensive operations closer to the suggestion creation, encapsulate 
case and output conversion into a new Suggestion class

> Hunspell: don't perform potentially expensive spellchecking after timeout
> -
>
> Key: LUCENE-10451
> URL: https://issues.apache.org/jira/browse/LUCENE-10451
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Peter Gromov
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, to return a partial result after a timeout, additional processing 
> with case adjustment and `spell` calls is performed, which can take time and 
> also result in superfluous `checkCanceled` invocations.






[jira] [Commented] (LUCENE-10463) Make smoke tester script work on main branch (java 17)

2022-03-15 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17506849#comment-17506849
 ] 

ASF subversion and git services commented on LUCENE-10463:
--

Commit b6c1024f550c2d2408e3a86aabbc682142e2b9c1 in lucene's branch 
refs/heads/main from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=b6c1024 ]

LUCENE-10463: increment java version to 17 in smoke tester (#748)



> Make smoke tester script work on main branch (java 17)
> --
>
> Key: LUCENE-10463
> URL: https://issues.apache.org/jira/browse/LUCENE-10463
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Tomoko Uchida
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The smoke tester script has been out of date on main since the upgrade to 
> Java 17. To enable nightly smoke tests on Jenkins for main, its target Java 
> version should be bumped to 17.
> In addition to bumping the Java version, it looks like the script should be 
> refactored so that it does not hard-code the target Java version. I feel it 
> would be better to make it coordinate with the Gradle distribution task.






[jira] [Commented] (LUCENE-10458) BoundedDocSetIdIterator may supply error count in Weigth#count(LeafReaderContext) when missingValue enables

2022-03-14 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17506532#comment-17506532
 ] 

ASF subversion and git services commented on LUCENE-10458:
--

Commit a6114b532a273e370528675d551d3ddfa02f4679 in lucene's branch 
refs/heads/branch_9_1 from Luca Cavanna
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=a6114b5 ]

Revert "LUCENE-10385: Implement Weight#count on IndexSortSortedNumeri… (#745)

In LUCENE-10458 we identified a bug in the logic. We're reverting on the 9.1
branch to avoid holding up the release.


> BoundedDocSetIdIterator may supply error count in 
> Weigth#count(LeafReaderContext) when missingValue enables
> ---
>
> Key: LUCENE-10458
> URL: https://issues.apache.org/jira/browse/LUCENE-10458
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Lu Xugang
>Priority: Major
> Fix For: 9.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When IndexSortSortedNumericDocValuesRangeQuery can take advantage of the index 
> sort, Weight#count uses BoundedDocSetIdIterator's lastDoc and firstDoc to 
> calculate the count, but if missingValue is enabled, documents that do not 
> contain DocValues may be included in the count.
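
The miscount can be modeled with a small sketch. It assumes that documents without a value are ordered in the index as if they carried the configured missing value, so they sit next to documents that really have that value. The class and method names are hypothetical and this is not the Lucene implementation.

```java
/** Toy model of counting by first/last position when some documents have no value. */
final class MissingValueCountSketch {

  /** Count via the positions of the first and last sort key inside [lower, upper]. */
  static int countByBounds(long[] sortKeys, long lower, long upper) {
    int first = 0;
    while (first < sortKeys.length && sortKeys[first] < lower) {
      first++;
    }
    int last = first;
    while (last < sortKeys.length && sortKeys[last] <= upper) {
      last++;
    }
    return last - first;
  }

  /** Count only documents that really carry a value inside [lower, upper]. */
  static int countExact(long[] sortKeys, boolean[] hasValue, long lower, long upper) {
    int count = 0;
    for (int doc = 0; doc < sortKeys.length; doc++) {
      if (hasValue[doc] && sortKeys[doc] >= lower && sortKeys[doc] <= upper) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    // Doc 0 has no value; it is sorted as if it had the missing value 0.
    long[] sortKeys = {0, 0, 1, 5, 9};
    boolean[] hasValue = {false, true, true, true, true};
    System.out.println("by bounds: " + countByBounds(sortKeys, 0, 5));
    System.out.println("exact:     " + countExact(sortKeys, hasValue, 0, 5));
  }
}
```

In this model the position-based count includes the document that has no value whenever the missing value falls inside the queried range, which is the kind of discrepancy the issue reports.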






[jira] [Commented] (LUCENE-10385) Implement Weight#count on IndexSortSortedNumericDocValuesRangeQuery.

2022-03-14 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17506531#comment-17506531
 ] 

ASF subversion and git services commented on LUCENE-10385:
--

Commit a6114b532a273e370528675d551d3ddfa02f4679 in lucene's branch 
refs/heads/branch_9_1 from Luca Cavanna
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=a6114b5 ]

Revert "LUCENE-10385: Implement Weight#count on IndexSortSortedNumeri… (#745)

In LUCENE-10458 we identified a bug in the logic. We're reverting on the 9.1
branch to avoid holding up the release.


> Implement Weight#count on IndexSortSortedNumericDocValuesRangeQuery.
> 
>
> Key: LUCENE-10385
> URL: https://issues.apache.org/jira/browse/LUCENE-10385
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 9.1
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> This query can count matches by computing the first and last matching doc IDs 
> using binary search.
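
As a rough illustration of the idea, the sketch below counts the entries of a sorted array that fall inside a closed range by locating both ends with binary search. It works on a plain `long[]` instead of Lucene's doc values, and `SortedRangeCountSketch`, `lowerBound`, `upperBound` and `count` are hypothetical names.

```java
/** Counts values in [lower, upper] in a sorted array using two binary searches. */
final class SortedRangeCountSketch {

  /** Index of the first element >= target, or values.length if there is none. */
  static int lowerBound(long[] values, long target) {
    int lo = 0;
    int hi = values.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (values[mid] < target) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  /** Index of the first element > target, or values.length if there is none. */
  static int upperBound(long[] values, long target) {
    int lo = 0;
    int hi = values.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (values[mid] <= target) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  /** Number of elements in the closed range [lower, upper]; values must be sorted ascending. */
  static int count(long[] values, long lower, long upper) {
    if (lower > upper) {
      return 0;
    }
    return upperBound(values, upper) - lowerBound(values, lower);
  }

  public static void main(String[] args) {
    long[] values = {1, 3, 3, 7, 10};
    System.out.println(count(values, 3, 7));
  }
}
```

The count costs two searches of logarithmic cost instead of a visit to every matching document, which is where the saving comes from when the index is sorted on the queried field.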






[jira] [Commented] (LUCENE-10461) Luke: Windows launch script passes integration tests but fails to run

2022-03-12 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17505246#comment-17505246
 ] 

ASF subversion and git services commented on LUCENE-10461:
--

Commit a796e08b1f9769209643547c535cb2701b2d2cab in lucene's branch 
refs/heads/branch_9_1 from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=a796e08 ]

LUCENE-10461: fix windows launch script for luke so that it works with 
integration tests AND actual command line. Cmd escaping rules and start command 
line is absolutely insane. (#743)



> Luke: Windows launch script passes integration tests but fails to run
> -
>
> Key: LUCENE-10461
> URL: https://issues.apache.org/jira/browse/LUCENE-10461
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> PR at https://github.com/apache/lucene/pull/743






[jira] [Commented] (LUCENE-10461) Luke: Windows launch script passes integration tests but fails to run

2022-03-12 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17505245#comment-17505245
 ] 

ASF subversion and git services commented on LUCENE-10461:
--

Commit a34072c816d9992e7c04849f83f36b7a3231dcc3 in lucene's branch 
refs/heads/branch_9x from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=a34072c ]

LUCENE-10461: fix windows launch script for luke so that it works with 
integration tests AND actual command line. Cmd escaping rules and start command 
line is absolutely insane. (#743)



> Luke: Windows launch script passes integration tests but fails to run
> -
>
> Key: LUCENE-10461
> URL: https://issues.apache.org/jira/browse/LUCENE-10461
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> PR at https://github.com/apache/lucene/pull/743






[jira] [Commented] (LUCENE-10461) Luke: Windows launch script passes integration tests but fails to run

2022-03-12 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17505242#comment-17505242
 ] 

ASF subversion and git services commented on LUCENE-10461:
--

Commit 25c4310bd56fee8d7474f041f2eebb54c6133c42 in lucene's branch 
refs/heads/main from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=25c4310 ]

LUCENE-10461: fix windows launch script for luke so that it works with 
integration tests AND actual command line. Cmd escaping rules and start command 
line is absolutely insane. (#743)



> Luke: Windows launch script passes integration tests but fails to run
> -
>
> Key: LUCENE-10461
> URL: https://issues.apache.org/jira/browse/LUCENE-10461
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> PR at https://github.com/apache/lucene/pull/743






[jira] [Commented] (LUCENE-10459) Update smoke tester for 9.1

2022-03-11 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17505069#comment-17505069
 ] 

ASF subversion and git services commented on LUCENE-10459:
--

Commit a3a058de6d7e8208cd988d3394b6562ee22928c6 in lucene's branch 
refs/heads/branch_9_1 from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=a3a058d ]

LUCENE-10459: Update smoke tester for 9.1 (#744)

Add demo dependencies to third party modules. Add an IT that checks whether
demo classes are loadable.

Co-authored-by: Tomoko Uchida 
Co-authored-by: Julie Tibshirani 

> Update smoke tester for 9.1
> ---
>
> Key: LUCENE-10459
> URL: https://issues.apache.org/jira/browse/LUCENE-10459
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 9.1
>Reporter: Julie Tibshirani
>Priority: Major
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> While working on the 9.1 release, I ran into several failures in the smoke 
> tester that seem related to our move to the module system. At a high level, 
> they include:
> * Including test directories in the binary distribution
> * Missing dependencies for the demo
> I opened this PR to show the details of the issues: 
> https://github.com/apache/lucene/pull/739.






[jira] [Commented] (LUCENE-10459) Update smoke tester for 9.1

2022-03-11 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17505067#comment-17505067
 ] 

ASF subversion and git services commented on LUCENE-10459:
--

Commit 4b828f27f365b48647cf5aa9d5825263caf8dc03 in lucene's branch 
refs/heads/branch_9x from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=4b828f2 ]

LUCENE-10459: Update smoke tester for 9.1 (#744)

Add demo dependencies to third party modules. Add an IT that checks whether
demo classes are loadable.

Co-authored-by: Tomoko Uchida 
Co-authored-by: Julie Tibshirani 

> Update smoke tester for 9.1
> ---
>
> Key: LUCENE-10459
> URL: https://issues.apache.org/jira/browse/LUCENE-10459
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 9.1
>Reporter: Julie Tibshirani
>Priority: Major
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> While working on the 9.1 release, I ran into several failures in the smoke 
> tester that seem related to our move to the module system. At a high level, 
> they include:
> * Including test directories in the binary distribution
> * Missing dependencies for the demo
> I opened this PR to show the details of the issues: 
> https://github.com/apache/lucene/pull/739.






[jira] [Commented] (LUCENE-10459) Update smoke tester for 9.1

2022-03-11 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17505054#comment-17505054
 ] 

ASF subversion and git services commented on LUCENE-10459:
--

Commit 9e9c457f8034d88052f1ed2a3746ab2b3b2940f1 in lucene's branch 
refs/heads/main from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=9e9c457 ]

LUCENE-10459: Update smoke tester for 9.1 (#744)

Add demo dependencies to third party modules. Add an IT that checks whether
demo classes are loadable.

Co-authored-by: Tomoko Uchida 
Co-authored-by: Julie Tibshirani 

> Update smoke tester for 9.1
> ---
>
> Key: LUCENE-10459
> URL: https://issues.apache.org/jira/browse/LUCENE-10459
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 9.1
>Reporter: Julie Tibshirani
>Priority: Major
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> While working on the 9.1 release, I ran into several failures in the smoke 
> tester that seem related to our move to the module system. At a high level, 
> they include:
> * Including test directories in the binary distribution
> * Missing dependencies for the demo
> I opened this PR to show the details of the issues: 
> https://github.com/apache/lucene/pull/739.






[jira] [Commented] (LUCENE-10408) Better dense encoding of doc Ids in Lucene91HnswVectorsFormat

2022-03-09 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17503547#comment-17503547
 ] 

ASF subversion and git services commented on LUCENE-10408:
--

Commit 8f399572c99786b859123bca8ff50e99692d4ae3 in lucene's branch 
refs/heads/branch_9_1 from Mayya Sharipova
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=8f39957 ]

LUCENE-10408 Test correction checksum (#734)

Use double instead of float to test vector values checksum

> Better dense encoding of doc Ids in Lucene91HnswVectorsFormat
> -
>
> Key: LUCENE-10408
> URL: https://issues.apache.org/jira/browse/LUCENE-10408
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mayya Sharipova
>Assignee: Mayya Sharipova
>Priority: Minor
> Fix For: 9.1
>
>  Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> Currently we write doc Ids of all documents that have vectors as is.  We 
> should improve their encoding either using delta encoding or bitset.
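
The two options named in the description can be sketched generically as follows. This is not the encoding used by Lucene91HnswVectorsFormat; the class `DocIdEncodingSketch` and its methods are hypothetical, and the input is assumed to be doc IDs in strictly increasing order.

```java
import java.util.BitSet;

/** Two generic ways to represent a sorted list of doc IDs more compactly. */
final class DocIdEncodingSketch {

  /** Delta encoding: store each doc ID as the gap to the previous one. */
  static int[] deltaEncode(int[] sortedDocIds) {
    int[] deltas = new int[sortedDocIds.length];
    int previous = 0;
    for (int i = 0; i < sortedDocIds.length; i++) {
      deltas[i] = sortedDocIds[i] - previous;
      previous = sortedDocIds[i];
    }
    return deltas;
  }

  /** Inverse of deltaEncode: a running sum restores the original doc IDs. */
  static int[] deltaDecode(int[] deltas) {
    int[] docIds = new int[deltas.length];
    int current = 0;
    for (int i = 0; i < deltas.length; i++) {
      current += deltas[i];
      docIds[i] = current;
    }
    return docIds;
  }

  /** Bitset representation: one bit per possible doc ID, set when the doc has a vector. */
  static BitSet toBitSet(int[] docIds) {
    BitSet bits = new BitSet();
    for (int docId : docIds) {
      bits.set(docId);
    }
    return bits;
  }

  public static void main(String[] args) {
    int[] docIds = {3, 4, 10};
    System.out.println(java.util.Arrays.toString(deltaEncode(docIds)));
    System.out.println(toBitSet(docIds));
  }
}
```

Small gaps need fewer bits than full doc IDs once they are packed, which favors delta encoding for sparse sets, while a bitset costs one bit per document regardless and tends to win when most documents have a vector.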






[jira] [Commented] (LUCENE-10408) Better dense encoding of doc Ids in Lucene91HnswVectorsFormat

2022-03-09 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17503543#comment-17503543
 ] 

ASF subversion and git services commented on LUCENE-10408:
--

Commit 1f497819e6b60db7908657056512e4c65fef420a in lucene's branch 
refs/heads/branch_9x from Mayya Sharipova
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=1f49781 ]

LUCENE-10408 Test correction checksum (#734)

Use double instead of float to test vector values checksum

> Better dense encoding of doc Ids in Lucene91HnswVectorsFormat
> -
>
> Key: LUCENE-10408
> URL: https://issues.apache.org/jira/browse/LUCENE-10408
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mayya Sharipova
>Assignee: Mayya Sharipova
>Priority: Minor
> Fix For: 9.1
>
>  Time Spent: 6.5h
>  Remaining Estimate: 0h
>



[jira] [Commented] (LUCENE-10311) Should DocIdSetBuilder have different implementations for point and terms?

2022-03-09 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17503484#comment-17503484
 ] 

ASF subversion and git services commented on LUCENE-10311:
--

Commit 38b4bbf74e25a5e578486ba434751a3f361912f5 in lucene's branch 
refs/heads/branch_9x from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=38b4bbf ]

LUCENE-10311: avoid division by zero on small sets.


> Should DocIdSetBuilder have different implementations for point and terms?
> --
>
> Key: LUCENE-10311
> URL: https://issues.apache.org/jira/browse/LUCENE-10311
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Ignacio Vera
>Priority: Major
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> DocIdSetBuilder has two API implementations, one for terms queries and one 
> for point values queries. In each case they are used in a totally different 
> way.
> For terms the API looks like:
>  
> {code:java}
> /**
>  * Add the content of the provided {@link DocIdSetIterator} to this builder. NOTE: if you need to
>  * build a {@link DocIdSet} out of a single {@link DocIdSetIterator}, you should rather use {@link
>  * RoaringDocIdSet.Builder}.
>  */
> void add(DocIdSetIterator iter) throws IOException;
>
> /** Build a {@link DocIdSet} from the accumulated doc IDs. */
> DocIdSet build();
> {code}
>  
> For Point Values it looks like:
>  
> {code:java}
> /**
>  * Utility class to efficiently add many docs in one go.
>  *
>  * @see DocIdSetBuilder#grow
>  */
> public abstract static class BulkAdder {
>   public abstract void add(int doc);
>
>   public void add(DocIdSetIterator iterator) throws IOException {
>     int docID;
>     while ((docID = iterator.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
>       add(docID);
>     }
>   }
> }
>
> /**
>  * Reserve space and return a {@link BulkAdder} object that can be used to add up to {@code
>  * numDocs} documents.
>  */
> public BulkAdder grow(int numDocs);
>
> /** Build a {@link DocIdSet} from the accumulated doc IDs. */
> DocIdSet build();
> {code}
>  
>  
> This is becoming trappy for new developments in the PointValues API.
> 1) When we call #grow() from the PointValues API, we are not telling the 
> builder how many docs we are going to add (as we don't really know it) but 
> the number of points we are about to visit. This number can be bigger than 
> Integer.MAX_VALUE. Until now, we have worked around this issue by making sure 
> we don't call this API when we need to add more than Integer.MAX_VALUE points. 
> In that case we navigate down the tree until the number of points is 
> reduced and fits in an int.
> This has worked well until now because we are calling grow from inside the BKD 
> reader, and the BKD writer/reader makes sure that the number of points in a 
> leaf can fit in an int. In LUCENE-, we are moving to a cursor-like API which 
> does not enforce that the number of points in a leaf needs to fit in an int. 
> This causes friction and inconsistency in the API.
>  
> 2) This is a secondary issue that I found while thinking about this issue. In 
> Lucene- we added the possibility to add a `DocIdSetIterator` from the 
> PointValues API.  Therefore there are two ways to add those kinds of objects 
> to a DocIdSetBuilder, which can end up with different results:
>  
> {code:java}
> {
>   // Terms API
>   docIdSetBuilder.add(docIdSetIterator);
> }
> {
>   // Point values API
>   docIdSetBuilder.grow(doc).add(docIdSetIterator);
> }{code}
>  
> I wonder if we need to rethink this API: should we have different 
> implementations for Terms and Point values?
>  
>  
>  
>  
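
To make point 1 of the description concrete, the sketch below shows what happens when a point count larger than Integer.MAX_VALUE is narrowed to the int that grow(int) expects. The clamping helper is only one hypothetical way a caller could guard against this; it is an assumption for illustration and not the change made for this issue.

```java
final class GrowSizeSketch {
  /** A plain narrowing cast keeps only the low 32 bits, so large counts wrap around. */
  static int narrowingCast(long pointCount) {
    return (int) pointCount;
  }

  /** Hypothetical guard: cap the requested size at Integer.MAX_VALUE instead of wrapping. */
  static int clampToInt(long pointCount) {
    if (pointCount < 0) {
      throw new IllegalArgumentException("pointCount must be >= 0");
    }
    return (int) Math.min(pointCount, (long) Integer.MAX_VALUE);
  }
}
```

A count of 3,000,000,000 points becomes a negative number under the plain cast, which is the kind of silent failure the description calls trappy.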






[jira] [Commented] (LUCENE-10311) Should DocIdSetBuilder have different implementations for point and terms?

2022-03-09 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17503480#comment-17503480
 ] 

ASF subversion and git services commented on LUCENE-10311:
--

Commit e999056c19d98b5dbd6434f6986e19c69cdb28ab in lucene's branch 
refs/heads/main from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e999056 ]

LUCENE-10311: avoid division by zero on small sets.


> Should DocIdSetBuilder have different implementations for point and terms?
> --
>
> Key: LUCENE-10311
> URL: https://issues.apache.org/jira/browse/LUCENE-10311
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Ignacio Vera
>Priority: Major
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>



[jira] [Commented] (LUCENE-10408) Better dense encoding of doc Ids in Lucene91HnswVectorsFormat

2022-03-09 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17503366#comment-17503366
 ] 

ASF subversion and git services commented on LUCENE-10408:
--

Commit e5717cddfda68dace6e45357f5e33d81c368db31 in lucene's branch 
refs/heads/main from Mayya Sharipova
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e5717cd ]

LUCENE-10408 Test correction checksum (#734)

Use double instead of float to test vector values checksum

> Better dense encoding of doc Ids in Lucene91HnswVectorsFormat
> -
>
> Key: LUCENE-10408
> URL: https://issues.apache.org/jira/browse/LUCENE-10408
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mayya Sharipova
>Assignee: Mayya Sharipova
>Priority: Minor
> Fix For: 9.1
>
>  Time Spent: 6.5h
>  Remaining Estimate: 0h
>



[jira] [Commented] (LUCENE-10171) Caching issue on dictionary-based OpenNLPLemmatizerFilterFactory

2022-03-08 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17503097#comment-17503097
 ] 

ASF subversion and git services commented on LUCENE-10171:
--

Commit a03306724664be1a5f4047d845736d85b03ddc41 in lucene's branch 
refs/heads/branch_9x from Spyros Kapnissis
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=a033067 ]

LUCENE-10171: OpenNLPOpsFactory should directly cache DictionaryLemmatizer 
objects (#380)

Instead of caching dictionary strings and building multiple redundant 
DictionaryLemmatizer objects.

Co-authored-by: Michael Gibney 

> Caching issue on dictionary-based OpenNLPLemmatizerFilterFactory
> 
>
> Key: LUCENE-10171
> URL: https://issues.apache.org/jira/browse/LUCENE-10171
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 9.0, 7.7.3, 8.10
>Reporter: Spyros Kapnissis
>Priority: Major
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> When providing a lemmas.txt dictionary file, OpenNLPLemmatizerFilterFactory 
> internally caches only the string form of the dictionary, and not the 
> DictionaryLemmatizer object. This results in parsing the dictionary and 
> creating a new DictionaryLemmatizer object every time 
> OpenNLPLemmatizerFilterFactory.create() is called.
> In our case, with a large lemmas.txt file (5MB) and the 
> OpenNLPLemmatizerFilter used in many fields across our setup and in multiple 
> collections (we use Solr), we had several random OOM issues and generally 
> high server load due to GC activity. After heap dump analysis we noticed a few 
> thousand DictionaryLemmatizer instances of around 80MB each.
> By switching the caching to the DictionaryLemmatizer instead of the String, 
> we were able to resolve these issues. I will be attaching a PR for review; 
> please let me know of any comments.
> Thanks!
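
The idea of caching the parsed object rather than the raw string can be sketched as follows. ParsedDictionary is a hypothetical stand-in for an expensive-to-build lemmatizer, and its two-column format is a simplification; none of these names come from OpenNLPOpsFactory, and the real fix may be structured differently.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class LemmatizerCacheSketch {
  /** Hypothetical stand-in for a lemmatizer that is costly to parse and large in memory. */
  static final class ParsedDictionary {
    static final AtomicInteger BUILDS = new AtomicInteger();
    final Map<String, String> lemmas = new HashMap<>();

    ParsedDictionary(String rawContent) {
      BUILDS.incrementAndGet(); // counts how often the expensive parse runs
      for (String line : rawContent.split("\n")) {
        String[] cols = line.split("\t");
        if (cols.length == 2) {
          lemmas.put(cols[0], cols[1]);
        }
      }
    }
  }

  // Cache keyed by dictionary name; the value is the parsed object, not the raw string.
  private static final Map<String, ParsedDictionary> CACHE = new ConcurrentHashMap<>();

  /** Parses the dictionary once per name and returns the shared instance afterwards. */
  static ParsedDictionary get(String name, String rawContent) {
    return CACHE.computeIfAbsent(name, key -> new ParsedDictionary(rawContent));
  }
}
```

Every caller that asks for the same dictionary name then shares one parsed instance, so the number of live copies no longer grows with the number of fields or collections.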






[jira] [Commented] (LUCENE-10171) Caching issue on dictionary-based OpenNLPLemmatizerFilterFactory

2022-03-08 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17503082#comment-17503082
 ] 

ASF subversion and git services commented on LUCENE-10171:
--

Commit 8afec33e747ec81c2301a4b099bd26b4195a556e in lucene's branch 
refs/heads/main from Spyros Kapnissis
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=8afec33 ]

LUCENE-10171: OpenNLPOpsFactory should directly cache DictionaryLemmatizer 
objects (#380)

Instead of caching dictionary strings and building multiple redundant 
DictionaryLemmatizer objects.

Co-authored-by: Michael Gibney 

> Caching issue on dictionary-based OpenNLPLemmatizerFilterFactory
> 
>
> Key: LUCENE-10171
> URL: https://issues.apache.org/jira/browse/LUCENE-10171
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 9.0, 7.7.3, 8.10
>Reporter: Spyros Kapnissis
>Priority: Major
>  Time Spent: 4h
>  Remaining Estimate: 0h
>



[jira] [Commented] (LUCENE-10311) Should DocIdSetBuilder have different implementations for point and terms?

2022-03-05 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17501785#comment-17501785
 ] 

ASF subversion and git services commented on LUCENE-10311:
--

Commit b5fe307c6ff877e1c7479ef7ee628fc8d99c883c in lucene's branch 
refs/heads/branch_9x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=b5fe307 ]

LUCENE-10311: Remove pop_XXX helpers from `BitUtil`. (#724)

As @rmuir noted, it would be as simple and create less cognitive overhead to
use `Long#bitCount` directly.
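
For readers unfamiliar with the replacement, a bitset cardinality loop written directly against `Long#bitCount` might look like the sketch below. The class and method names are hypothetical; only `Long.bitCount` is the JDK method the commit message refers to.

```java
final class PopCountSketch {
  /** Counts the set bits across all words of a long[]-backed bitset. */
  static long cardinality(long[] words) {
    long count = 0;
    for (long word : words) {
      count += Long.bitCount(word);
    }
    return count;
  }
}
```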

> Should DocIdSetBuilder have different implementations for point and terms?
> --
>
> Key: LUCENE-10311
> URL: https://issues.apache.org/jira/browse/LUCENE-10311
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Ignacio Vera
>Priority: Major
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>



[jira] [Commented] (LUCENE-10311) Should DocIdSetBuilder have different implementations for point and terms?

2022-03-05 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17501781#comment-17501781
 ] 

ASF subversion and git services commented on LUCENE-10311:
--

Commit ae16917c1defd0db6334a4424727157fe12169ab in lucene's branch 
refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=ae16917 ]

LUCENE-10311: Remove pop_XXX helpers from `BitUtil`. (#724)

As @rmuir noted, it would be as simple and create less cognitive overhead to
use `Long#bitCount` directly.

> Should DocIdSetBuilder have different implementations for point and terms?
> --
>
> Key: LUCENE-10311
> URL: https://issues.apache.org/jira/browse/LUCENE-10311
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Ignacio Vera
>Priority: Major
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>



[jira] [Commented] (LUCENE-10455) IndexSortSortedNumericDocValuesRangeQuery should implement Weight#scorerSupplier(LeafReaderContext)

2022-03-05 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17501777#comment-17501777
 ] 

ASF subversion and git services commented on LUCENE-10455:
--

Commit ffdb246702f42a3b95193993f79064056fedd095 in lucene's branch 
refs/heads/branch_9x from Chris Lu
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=ffdb246 ]

LUCENE-10455: IndexSortSortedNumericDocValuesRangeQuery should implement 
Weight#scorerSupplier(LeafReaderContext) (#729)



> IndexSortSortedNumericDocValuesRangeQuery should implement 
> Weight#scorerSupplier(LeafReaderContext)
> ---
>
> Key: LUCENE-10455
> URL: https://issues.apache.org/jira/browse/LUCENE-10455
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> IndexOrDocValuesQuery was used as the fallbackQuery of 
> IndexSortSortedNumericDocValuesRangeQuery in Elasticsearch, but when 
> IndexSortSortedNumericDocValuesRangeQuery can't take advantage of the index 
> sort, the fallbackQuery (IndexOrDocValuesQuery) always supplies its Scorer 
> from the indexQuery only, because IndexSortSortedNumericDocValuesRangeQuery 
> did not implement Weight#scorerSupplier(LeafReaderContext).






[jira] [Commented] (LUCENE-10453) Speed up VectorUtil#squareDistance

2022-03-05 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17501779#comment-17501779
 ] 

ASF subversion and git services commented on LUCENE-10453:
--

Commit 1818ae9de35501193e07dd15672f28f956671e97 in lucene's branch 
refs/heads/branch_9x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=1818ae9 ]

LUCENE-10453: Speed up euclidean distances. (#725)


> Speed up VectorUtil#squareDistance
> --
>
> Key: LUCENE-10453
> URL: https://issues.apache.org/jira/browse/LUCENE-10453
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {{VectorUtil#squareDistance}} is used in conjunction with 
> {{VectorSimilarityFunction#EUCLIDEAN}}.
> It didn't get as much love as dot products (LUCENE-9837) yet there seems to 
> be room for improvement. I wrote a quick JMH benchmark to run some 
> comparisons: https://github.com/jpountz/vector-similarity-benchmarks.
> While it's not as fast as using the vector API (which makes squareDistance 
> computations more than 2x faster), we can get a ~25% speedup by unrolling the 
> loop in a similar way to what dot product does.
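
The loop-unrolling idea can be sketched as below. This is an illustrative version with an unroll factor of 4 and independent accumulators, chosen as an assumption for brevity; it is not the code committed to VectorUtil, and any speedup would need to be measured with a benchmark such as the one linked above.

```java
final class SquareDistanceSketch {
  /** Straightforward loop: one accumulator, one element per iteration. */
  static float squareDistanceSimple(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /** Unrolled by 4 with separate accumulators, so the additions do not form one long dependency chain. */
  static float squareDistanceUnrolled(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vector dimensions differ");
    }
    float acc0 = 0f, acc1 = 0f, acc2 = 0f, acc3 = 0f;
    int i = 0;
    int upper = a.length & ~3; // largest multiple of 4 that fits
    for (; i < upper; i += 4) {
      float d0 = a[i] - b[i];
      float d1 = a[i + 1] - b[i + 1];
      float d2 = a[i + 2] - b[i + 2];
      float d3 = a[i + 3] - b[i + 3];
      acc0 += d0 * d0;
      acc1 += d1 * d1;
      acc2 += d2 * d2;
      acc3 += d3 * d3;
    }
    float sum = acc0 + acc1 + acc2 + acc3;
    for (; i < a.length; i++) { // handle the remaining 0 to 3 elements
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }
}
```

Because floating-point addition is not associative, the unrolled version can differ from the simple loop in the last bits for general inputs, which is worth keeping in mind when comparing results in tests.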






[jira] [Commented] (LUCENE-10455) IndexSortSortedNumericDocValuesRangeQuery should implement Weight#scorerSupplier(LeafReaderContext)

2022-03-05 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17501778#comment-17501778
 ] 

ASF subversion and git services commented on LUCENE-10455:
--

Commit 29282fa315b723b32f8e788ba29e2311f7ece83f in lucene's branch 
refs/heads/branch_9x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=29282fa ]

LUCENE-10455: CHANGES entry.


> IndexSortSortedNumericDocValuesRangeQuery should implement 
> Weight#scorerSupplier(LeafReaderContext)
> ---
>
> Key: LUCENE-10455
> URL: https://issues.apache.org/jira/browse/LUCENE-10455
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>



[jira] [Commented] (LUCENE-10455) IndexSortSortedNumericDocValuesRangeQuery should implement Weight#scorerSupplier(LeafReaderContext)

2022-03-05 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17501776#comment-17501776
 ] 

ASF subversion and git services commented on LUCENE-10455:
--

Commit 8086ef9f451fc59a6c9a204db3530229edc56c74 in lucene's branch 
refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=8086ef9 ]

LUCENE-10455: CHANGES entry.


> IndexSortSortedNumericDocValuesRangeQuery should implement 
> Weight#scorerSupplier(LeafReaderContext)
> ---
>
> Key: LUCENE-10455
> URL: https://issues.apache.org/jira/browse/LUCENE-10455
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>



[jira] [Commented] (LUCENE-10453) Speed up VectorUtil#squareDistance

2022-03-05 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17501775#comment-17501775
 ] 

ASF subversion and git services commented on LUCENE-10453:
--

Commit 9d732380aefd41fdbe152eab47eca83d8de4f2af in lucene's branch 
refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=9d73238 ]

LUCENE-10453: Speed up euclidean distances. (#725)



> Speed up VectorUtil#squareDistance
> --
>
> Key: LUCENE-10453
> URL: https://issues.apache.org/jira/browse/LUCENE-10453
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>



[jira] [Commented] (LUCENE-10455) IndexSortSortedNumericDocValuesRangeQuery should implement Weight#scorerSupplier(LeafReaderContext)

2022-03-05 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17501771#comment-17501771
 ] 

ASF subversion and git services commented on LUCENE-10455:
--

Commit 2700c6b525e652ef1539bc603c18554211d4cc6e in lucene's branch 
refs/heads/main from Chris Lu
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=2700c6b ]

LUCENE-10455: IndexSortSortedNumericDocValuesRangeQuery should implement 
Weight#scorerSupplier(LeafReaderContext) (#729)



> IndexSortSortedNumericDocValuesRangeQuery should implement 
> Weight#scorerSupplier(LeafReaderContext)
> ---
>
> Key: LUCENE-10455
> URL: https://issues.apache.org/jira/browse/LUCENE-10455
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>



[jira] [Commented] (LUCENE-10431) AssertionError in BooleanQuery.hashCode()

2022-03-04 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17501292#comment-17501292
 ] 

ASF subversion and git services commented on LUCENE-10431:
--

Commit 5e539bc50d92c0003e4310d9034aec5af353f1f8 in lucene's branch 
refs/heads/branch_9x from Alan Woodward
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=5e539bc ]

LUCENE-10431: Don't include rewriteMethod in MTQ hash calculation (#727)

BooleanQuery assumes that its children's hashcodes are stable, and has some
assertions to this effect. This did not apply to MultiTermQuery, which has a
mutable RewriteMethod member variable that was included in its hash calculation.
Changing the rewrite method would change the hash, leading to assertion failures
being tripped. This commit removes rewriteMethod from the hash calculation,
meaning that the hashcode will be stable even under mutability.
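
The failure mode described in this commit message can be reproduced in miniature. The sketch below is an illustration under stated assumptions, not Lucene code: MutableChild and CachingParent are hypothetical stand-ins for a query with a mutable member and a wrapper that caches its hash.

```java
import java.util.Objects;

final class CachedHashSketch {
  /** Hypothetical child whose mutable member takes part in hashCode(). */
  static final class MutableChild {
    final String field;
    String rewriteMethod; // mutable, and (problematically) part of the hash

    MutableChild(String field, String rewriteMethod) {
      this.field = field;
      this.rewriteMethod = rewriteMethod;
    }

    @Override
    public int hashCode() {
      return Objects.hash(field, rewriteMethod);
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof MutableChild)) {
        return false;
      }
      MutableChild other = (MutableChild) o;
      return field.equals(other.field) && Objects.equals(rewriteMethod, other.rewriteMethod);
    }
  }

  /** Hypothetical parent that computes its hash lazily and caches the result. */
  static final class CachingParent {
    final MutableChild child;
    private int cachedHash;
    private boolean hashComputed;

    CachingParent(MutableChild child) {
      this.child = child;
    }

    int computeHashCode() {
      return 31 + child.hashCode();
    }

    @Override
    public int hashCode() {
      if (!hashComputed) {
        cachedHash = computeHashCode();
        hashComputed = true;
      }
      return cachedHash;
    }

    /** The kind of consistency check an assertion could perform. */
    boolean cachedHashStillValid() {
      return hashCode() == computeHashCode();
    }
  }

  public static void main(String[] args) {
    MutableChild child = new MutableChild("title", "A");
    CachingParent parent = new CachingParent(child);
    parent.hashCode(); // caches the hash
    System.out.println(parent.cachedHashStillValid());
    child.rewriteMethod = "B"; // mutation after the hash was cached
    System.out.println(parent.cachedHashStillValid());
  }
}
```

Leaving the mutable member out of the child's hashCode(), as the commit does for rewriteMethod, keeps the cached value valid regardless of later mutation.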

> AssertionError in BooleanQuery.hashCode()
> -
>
> Key: LUCENE-10431
> URL: https://issues.apache.org/jira/browse/LUCENE-10431
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 8.11.1
>Reporter: Michael Bien
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Hello devs,
> the constructor of BooleanQuery can under some circumstances trigger a hash 
> code computation before "clauseSets" is fully filled. Since BooleanClause is 
> using its query field for the hash code too, it can happen that the "wrong" 
> hash code is stored, since adding the clause to the set triggers its 
> hashCode().
> If assertions are enabled the check in BooleanQuery, which recomputes the 
> hash code, will notice it and throw an error.
> exception:
> {code:java}
> java.lang.AssertionError
>     at org.apache.lucene.search.BooleanQuery.hashCode(BooleanQuery.java:614)
>     at java.base/java.util.Objects.hashCode(Objects.java:103)
>     at java.base/java.util.HashMap$Node.hashCode(HashMap.java:298)
>     at java.base/java.util.AbstractMap.hashCode(AbstractMap.java:527)
>     at org.apache.lucene.search.Multiset.hashCode(Multiset.java:119)
>     at java.base/java.util.EnumMap.entryHashCode(EnumMap.java:717)
>     at java.base/java.util.EnumMap.hashCode(EnumMap.java:709)
>     at java.base/java.util.Arrays.hashCode(Arrays.java:4498)
>     at java.base/java.util.Objects.hash(Objects.java:133)
>     at 
> org.apache.lucene.search.BooleanQuery.computeHashCode(BooleanQuery.java:597)
>     at org.apache.lucene.search.BooleanQuery.hashCode(BooleanQuery.java:611)
>     at java.base/java.util.HashMap.hash(HashMap.java:340)
>     at java.base/java.util.HashMap.put(HashMap.java:612)
>     at org.apache.lucene.search.Multiset.add(Multiset.java:82)
>     at org.apache.lucene.search.BooleanQuery.(BooleanQuery.java:154)
>     at org.apache.lucene.search.BooleanQuery.(BooleanQuery.java:42)
>     at 
> org.apache.lucene.search.BooleanQuery$Builder.build(BooleanQuery.java:133)
> {code}
> I noticed this while trying to upgrade the NetBeans maven indexer modules 
> from lucene 5.x to 8.x https://github.com/apache/netbeans/pull/3558






[jira] [Commented] (LUCENE-10431) AssertionError in BooleanQuery.hashCode()

2022-03-04 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17501291#comment-17501291
 ] 

ASF subversion and git services commented on LUCENE-10431:
--

Commit e049e426dd182bf273e8acde466ab145afb6b190 in lucene's branch 
refs/heads/main from Alan Woodward
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e049e42 ]

LUCENE-10431: Remove MultiTermQuery.setRewriteMethod() (#726)



> AssertionError in BooleanQuery.hashCode()
> -
>
> Key: LUCENE-10431
> URL: https://issues.apache.org/jira/browse/LUCENE-10431
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 8.11.1
>Reporter: Michael Bien
>Priority: Major
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Hello devs,
> the constructor of BooleanQuery can under some circumstances trigger a hash 
> code computation before "clauseSets" is fully filled. Since BooleanClause is 
> using its query field for the hash code too, it can happen that the "wrong" 
> hash code is stored, since adding the clause to the set triggers its 
> hashCode().
> If assertions are enabled the check in BooleanQuery, which recomputes the 
> hash code, will notice it and throw an error.
> exception:
> {code:java}
> java.lang.AssertionError
>     at org.apache.lucene.search.BooleanQuery.hashCode(BooleanQuery.java:614)
>     at java.base/java.util.Objects.hashCode(Objects.java:103)
>     at java.base/java.util.HashMap$Node.hashCode(HashMap.java:298)
>     at java.base/java.util.AbstractMap.hashCode(AbstractMap.java:527)
>     at org.apache.lucene.search.Multiset.hashCode(Multiset.java:119)
>     at java.base/java.util.EnumMap.entryHashCode(EnumMap.java:717)
>     at java.base/java.util.EnumMap.hashCode(EnumMap.java:709)
>     at java.base/java.util.Arrays.hashCode(Arrays.java:4498)
>     at java.base/java.util.Objects.hash(Objects.java:133)
>     at 
> org.apache.lucene.search.BooleanQuery.computeHashCode(BooleanQuery.java:597)
>     at org.apache.lucene.search.BooleanQuery.hashCode(BooleanQuery.java:611)
>     at java.base/java.util.HashMap.hash(HashMap.java:340)
>     at java.base/java.util.HashMap.put(HashMap.java:612)
>     at org.apache.lucene.search.Multiset.add(Multiset.java:82)
>     at org.apache.lucene.search.BooleanQuery.(BooleanQuery.java:154)
>     at org.apache.lucene.search.BooleanQuery.(BooleanQuery.java:42)
>     at 
> org.apache.lucene.search.BooleanQuery$Builder.build(BooleanQuery.java:133)
> {code}
> I noticed this while trying to upgrade the NetBeans maven indexer modules 
> from lucene 5.x to 8.x https://github.com/apache/netbeans/pull/3558






[jira] [Commented] (LUCENE-10447) Charset issue in TestScripts#testLukeCanBeLaunched()

2022-03-03 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17501013#comment-17501013
 ] 

ASF subversion and git services commented on LUCENE-10447:
--

Commit 8f92ec157f1a01e7903186da8607e3d1003b1829 in lucene's branch 
refs/heads/branch_9x from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=8f92ec1 ]

LUCENE-10447: always use utf8 for forked process encoding. Use the sa… (#717)



> Charset issue in TestScripts#testLukeCanBeLaunched()
> 
>
> Key: LUCENE-10447
> URL: https://issues.apache.org/jira/browse/LUCENE-10447
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: luke
>Reporter: Lu Xugang
>Assignee: Dawid Weiss
>Priority: Minor
> Fix For: 9.1
>
> Attachments: 1.png, 2.png, process-10536545874299101128.out
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> When running TestScripts#testLukeCanBeLaunched(), a temp file is created at 
> lucene/distribution.tests/build/tmp/tests-tmp/process-*.out. Depending on the 
> operating system language, this process-*.out file may contain content that 
> is not valid StandardCharsets.US_ASCII, and an exception is then thrown 
> because the test later reads this temp file with StandardCharsets.US_ASCII.
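
The decoding failure described above can be shown without a forked process. This is a minimal sketch, assuming the process output was written as UTF-8 and contains localized text; the class name, helper names and sample message are hypothetical. A freshly created CharsetDecoder reports malformed input instead of replacing it, which is what turns non-ASCII bytes into an exception.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class StrictDecodeSketch {
  /** Hypothetical localized process output, encoded as UTF-8. */
  static byte[] sampleOutput() {
    return "Le chemin d'acc\u00e8s sp\u00e9cifi\u00e9 est introuvable."
        .getBytes(StandardCharsets.UTF_8);
  }

  /** Returns true if the bytes decode without error under the decoder's default (reporting) mode. */
  static boolean decodesCleanly(byte[] bytes, Charset charset) {
    try {
      charset.newDecoder().decode(ByteBuffer.wrap(bytes));
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    byte[] bytes = sampleOutput();
    System.out.println("US_ASCII: " + decodesCleanly(bytes, StandardCharsets.US_ASCII));
    System.out.println("UTF_8:    " + decodesCleanly(bytes, StandardCharsets.UTF_8));
  }
}
```

Writing and reading the forked process output with the same charset, as the commit title suggests, avoids the mismatch.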






[jira] [Commented] (LUCENE-10447) Charset issue in TestScripts#testLukeCanBeLaunched()

2022-03-03 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17501009#comment-17501009
 ] 

ASF subversion and git services commented on LUCENE-10447:
--

Commit 81ab1e598f4e2a6f16de312614823c9eccb7abe2 in lucene's branch 
refs/heads/main from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=81ab1e5 ]

LUCENE-10447: always use utf8 for forked process encoding. Use the sa… (#717)



> Charset issue in TestScripts#testLukeCanBeLaunched()
> 
>
> Key: LUCENE-10447
> URL: https://issues.apache.org/jira/browse/LUCENE-10447
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: luke
>Reporter: Lu Xugang
>Assignee: Dawid Weiss
>Priority: Minor
> Attachments: 1.png, 2.png, process-10536545874299101128.out
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> When running TestScripts#testLukeCanBeLaunched(), a temp file is created at 
> lucene/distribution.tests/build/tmp/tests-tmp/process-*.out. Depending on the 
> operating system language, this process-*.out file may contain content that 
> is not valid StandardCharsets.US_ASCII, and an exception is then thrown 
> because the test later reads this temp file with StandardCharsets.US_ASCII.






[jira] [Commented] (LUCENE-10431) AssertionError in BooleanQuery.hashCode()

2022-03-03 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500678#comment-17500678
 ] 

ASF subversion and git services commented on LUCENE-10431:
--

Commit 63454b83ad3cea3bae7c70f4b6276fce60d81672 in lucene's branch 
refs/heads/branch_9x from Alan Woodward
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=63454b8 ]

LUCENE-10431: Deprecate MultiTermQuery.setRewriteMethod() (#722)

Allowing users to mutate MultiTermQuery can give rise to odd bugs, for example
in wrapper queries such as BooleanQuery which lazily calculate their hashcodes
and then cache the result. This commit deprecates the setRewriteMethod()
method on MultiTermQuery, in preparation for removing it entirely, and adds
constructor parameters to the various MTQ implementations as a preferred
way to set the rewrite method.


> AssertionError in BooleanQuery.hashCode()
> -
>
> Key: LUCENE-10431
> URL: https://issues.apache.org/jira/browse/LUCENE-10431
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 8.11.1
>Reporter: Michael Bien
>Priority: Major
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Hello devs,
> the constructor of BooleanQuery can under some circumstances trigger a hash 
> code computation before "clauseSets" is fully filled. Since BooleanClause is 
> using its query field for the hash code too, it can happen that the "wrong" 
> hash code is stored, since adding the clause to the set triggers its 
> hashCode().
> If assertions are enabled the check in BooleanQuery, which recomputes the 
> hash code, will notice it and throw an error.
> exception:
> {code:java}
> java.lang.AssertionError
>     at org.apache.lucene.search.BooleanQuery.hashCode(BooleanQuery.java:614)
>     at java.base/java.util.Objects.hashCode(Objects.java:103)
>     at java.base/java.util.HashMap$Node.hashCode(HashMap.java:298)
>     at java.base/java.util.AbstractMap.hashCode(AbstractMap.java:527)
>     at org.apache.lucene.search.Multiset.hashCode(Multiset.java:119)
>     at java.base/java.util.EnumMap.entryHashCode(EnumMap.java:717)
>     at java.base/java.util.EnumMap.hashCode(EnumMap.java:709)
>     at java.base/java.util.Arrays.hashCode(Arrays.java:4498)
>     at java.base/java.util.Objects.hash(Objects.java:133)
>     at 
> org.apache.lucene.search.BooleanQuery.computeHashCode(BooleanQuery.java:597)
>     at org.apache.lucene.search.BooleanQuery.hashCode(BooleanQuery.java:611)
>     at java.base/java.util.HashMap.hash(HashMap.java:340)
>     at java.base/java.util.HashMap.put(HashMap.java:612)
>     at org.apache.lucene.search.Multiset.add(Multiset.java:82)
>     at org.apache.lucene.search.BooleanQuery.(BooleanQuery.java:154)
>     at org.apache.lucene.search.BooleanQuery.(BooleanQuery.java:42)
>     at 
> org.apache.lucene.search.BooleanQuery$Builder.build(BooleanQuery.java:133)
> {code}
> I noticed this while trying to upgrade the NetBeans maven indexer modules 
> from lucene 5.x to 8.x https://github.com/apache/netbeans/pull/3558






[jira] [Commented] (LUCENE-10431) AssertionError in BooleanQuery.hashCode()

2022-03-03 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500662#comment-17500662
 ] 

ASF subversion and git services commented on LUCENE-10431:
--

Commit 3f994dec53a1f45c27be9f577a01f20516461b3e in lucene's branch 
refs/heads/main from Alan Woodward
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=3f994de ]

LUCENE-10431: Deprecate MultiTermQuery.setRewriteMethod() (#722)

Allowing users to mutate MultiTermQuery can give rise to odd bugs, for example
in wrapper queries such as BooleanQuery which lazily calculate their hashcodes
and then cache the result. This commit deprecates the setRewriteMethod()
method on MultiTermQuery, in preparation for removing it entirely, and adds
constructor parameters to the various MTQ implementations as a preferred
way to set the rewrite method.

> AssertionError in BooleanQuery.hashCode()
> -
>
> Key: LUCENE-10431
> URL: https://issues.apache.org/jira/browse/LUCENE-10431
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 8.11.1
>Reporter: Michael Bien
>Priority: Major
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Hello devs,
> the constructor of BooleanQuery can under some circumstances trigger a hash 
> code computation before "clauseSets" is fully filled. Since BooleanClause is 
> using its query field for the hash code too, it can happen that the "wrong" 
> hash code is stored, since adding the clause to the set triggers its 
> hashCode().
> If assertions are enabled the check in BooleanQuery, which recomputes the 
> hash code, will notice it and throw an error.
> exception:
> {code:java}
> java.lang.AssertionError
>     at org.apache.lucene.search.BooleanQuery.hashCode(BooleanQuery.java:614)
>     at java.base/java.util.Objects.hashCode(Objects.java:103)
>     at java.base/java.util.HashMap$Node.hashCode(HashMap.java:298)
>     at java.base/java.util.AbstractMap.hashCode(AbstractMap.java:527)
>     at org.apache.lucene.search.Multiset.hashCode(Multiset.java:119)
>     at java.base/java.util.EnumMap.entryHashCode(EnumMap.java:717)
>     at java.base/java.util.EnumMap.hashCode(EnumMap.java:709)
>     at java.base/java.util.Arrays.hashCode(Arrays.java:4498)
>     at java.base/java.util.Objects.hash(Objects.java:133)
>     at 
> org.apache.lucene.search.BooleanQuery.computeHashCode(BooleanQuery.java:597)
>     at org.apache.lucene.search.BooleanQuery.hashCode(BooleanQuery.java:611)
>     at java.base/java.util.HashMap.hash(HashMap.java:340)
>     at java.base/java.util.HashMap.put(HashMap.java:612)
>     at org.apache.lucene.search.Multiset.add(Multiset.java:82)
>     at org.apache.lucene.search.BooleanQuery.(BooleanQuery.java:154)
>     at org.apache.lucene.search.BooleanQuery.(BooleanQuery.java:42)
>     at 
> org.apache.lucene.search.BooleanQuery$Builder.build(BooleanQuery.java:133)
> {code}
> I noticed this while trying to upgrade the NetBeans maven indexer modules 
> from lucene 5.x to 8.x https://github.com/apache/netbeans/pull/3558






[jira] [Commented] (LUCENE-10002) Remove IndexSearcher#search(Query,Collector) in favor of IndexSearcher#search(Query,CollectorManager)

2022-03-03 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500606#comment-17500606
 ] 

ASF subversion and git services commented on LUCENE-10002:
--

Commit 2a6b2ca1435ddb719bf0834d035ec38b7401c931 in lucene's branch 
refs/heads/branch_9x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=2a6b2ca ]

LUCENE-10002: Fix test failure.

When IndexSearcher is created with a threadpool it becomes impossible to assert
on the number of evaluated hits overall.


> Remove IndexSearcher#search(Query,Collector) in favor of 
> IndexSearcher#search(Query,CollectorManager)
> -
>
> Key: LUCENE-10002
> URL: https://issues.apache.org/jira/browse/LUCENE-10002
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 13.5h
>  Remaining Estimate: 0h
>
> It's a bit trappy that you can create an IndexSearcher with an executor, but 
> that it would always search on the caller thread when calling 
> {{IndexSearcher#search(Query,Collector)}}.
>  Let's remove {{IndexSearcher#search(Query,Collector)}}, point our users to 
> {{IndexSearcher#search(Query,CollectorManager)}} instead, and change factory 
> methods of our main collectors (e.g. {{TopScoreDocCollector#create}}) to 
> return a {{CollectorManager}} instead of a {{Collector}}?
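
The reason a manager is needed once an executor is involved can be sketched with plain JDK types. Everything below is a toy under stated assumptions: SliceCollectorManager, HitCounter and HitCountManager are hypothetical names, doc IDs are plain ints, and "even doc ID" stands in for "matches the query". The point is only the shape: one collector per slice, each touched by a single task, merged at the end.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class CollectorManagerSketch {
  /** Hypothetical stand-in for the manager idea: create per-slice collectors, then merge them. */
  interface SliceCollectorManager<C, T> {
    C newCollector();

    T reduce(Collection<C> collectors);
  }

  /** Deliberately not thread-safe; each instance is used by exactly one task. */
  static final class HitCounter {
    int hits;

    void collect(int doc) {
      hits++;
    }
  }

  static final class HitCountManager implements SliceCollectorManager<HitCounter, Integer> {
    @Override
    public HitCounter newCollector() {
      return new HitCounter();
    }

    @Override
    public Integer reduce(Collection<HitCounter> collectors) {
      int total = 0;
      for (HitCounter c : collectors) {
        total += c.hits;
      }
      return total;
    }
  }

  /** Runs one task per slice on the executor and merges the per-slice collectors. */
  static <T> T search(
      List<int[]> slices, SliceCollectorManager<HitCounter, T> manager, ExecutorService executor) {
    List<Future<HitCounter>> futures = new ArrayList<>();
    for (int[] slice : slices) {
      HitCounter collector = manager.newCollector();
      futures.add(
          executor.submit(
              () -> {
                for (int doc : slice) {
                  if (doc % 2 == 0) { // toy match condition
                    collector.collect(doc);
                  }
                }
                return collector;
              }));
    }
    List<HitCounter> finished = new ArrayList<>();
    try {
      for (Future<HitCounter> future : futures) {
        finished.add(future.get());
      }
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    }
    return manager.reduce(finished);
  }

  static int runDemo() {
    ExecutorService executor = Executors.newFixedThreadPool(2);
    try {
      return search(
          List.of(new int[] {0, 1, 2, 3}, new int[] {4, 5, 6}), new HitCountManager(), executor);
    } finally {
      executor.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println(runDemo());
  }
}
```

A single shared collector passed to a search method has no such per-slice split, which is consistent with the description's point that the Collector-based overload runs on the caller thread.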






[jira] [Commented] (LUCENE-10002) Remove IndexSearcher#search(Query,Collector) in favor of IndexSearcher#search(Query,CollectorManager)

2022-03-03 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500605#comment-17500605
 ] 

ASF subversion and git services commented on LUCENE-10002:
--

Commit bff4246476d860942e1b20dae2540b5caae2eda9 in lucene's branch 
refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=bff4246 ]

LUCENE-10002: Fix test failure.

When IndexSearcher is created with a threadpool it becomes impossible to assert
on the number of evaluated hits overall.


> Remove IndexSearcher#search(Query,Collector) in favor of 
> IndexSearcher#search(Query,CollectorManager)
> -
>
> Key: LUCENE-10002
> URL: https://issues.apache.org/jira/browse/LUCENE-10002
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 13.5h
>  Remaining Estimate: 0h
>
> It's a bit trappy that you can create an IndexSearcher with an executor, but 
> that it would always search on the caller thread when calling 
> {{IndexSearcher#search(Query,Collector)}}.
>  Let's remove {{IndexSearcher#search(Query,Collector)}}, point our users to 
> {{IndexSearcher#search(Query,CollectorManager)}} instead, and change factory 
> methods of our main collectors (e.g. {{TopScoreDocCollector#create}}) to 
> return a {{CollectorManager}} instead of a {{Collector}}?






[jira] [Commented] (LUCENE-10428) getMinCompetitiveScore method in MaxScoreSumPropagator fails to converge leading to busy threads in infinite loop

2022-03-03 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500594#comment-17500594
 ] 

ASF subversion and git services commented on LUCENE-10428:
--

Commit 0d35e38b93d4c394aee691f308092cb9cfa792a2 in lucene's branch 
refs/heads/branch_9x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=0d35e38 ]

LUCENE-10428: Avoid infinite loop under error conditions. (#711)

Co-authored-by: dblock 

> getMinCompetitiveScore method in MaxScoreSumPropagator fails to converge 
> leading to busy threads in infinite loop
> -
>
> Key: LUCENE-10428
> URL: https://issues.apache.org/jira/browse/LUCENE-10428
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/query/scoring, core/search
>Reporter: Ankit Jain
>Priority: Major
> Attachments: Flame_graph.png
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Customers complained about high CPU usage on an Elasticsearch cluster in 
> production. We noticed that a few search requests had been stuck for a long time:
> {code:java}
> % curl -s localhost:9200/_cat/tasks?v   
> indices:data/read/search[phase/query] AmMLzDQ4RrOJievRDeGFZw:569205  
> AmMLzDQ4RrOJievRDeGFZw:569204  direct1645195007282 14:36:47  6.2h
> indices:data/read/search[phase/query] emjWc5bUTG6lgnCGLulq-Q:502075  
> emjWc5bUTG6lgnCGLulq-Q:502074  direct1645195037259 14:37:17  6.2h
> indices:data/read/search[phase/query] emjWc5bUTG6lgnCGLulq-Q:583270  
> emjWc5bUTG6lgnCGLulq-Q:583269  direct1645201316981 16:21:56  4.5h
> {code}
> Flame graphs indicated that CPU time was mostly going into the 
> *getMinCompetitiveScore method in MaxScoreSumPropagator*. Live JVM debugging 
> showed that the 
> org.apache.lucene.search.MaxScoreSumPropagator.scoreSumUpperBound method had 
> around 4 million invocations every second.
> We figured out the values of some parameters from live debugging:
> {code:java}
> minScoreSum = 3.5541441
> minScore + sumOfOtherMaxScores (params[0] scoreSumUpperBound) = 
> 3.554144322872162
> returnObj scoreSumUpperBound = 3.5541444
> Math.ulp(minScoreSum) = 2.3841858E-7
> {code}
> Example code snippet:
> {code:java}
> double sumOfOtherMaxScores = 3.554144322872162;
> double minScoreSum = 3.5541441;
> float minScore = (float) (minScoreSum - sumOfOtherMaxScores);
> while (scoreSumUpperBound(minScore + sumOfOtherMaxScores) > minScoreSum) {
>   minScore -= Math.ulp(minScoreSum);
>   System.out.printf("%.20f, %.100f\n", minScore, Math.ulp(minScoreSum));
> }
> {code}
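
One way a loop of this shape can stop making progress is a granularity mismatch between float and double. The sketch below is only an illustration of that arithmetic, not a diagnosis of the Lucene issue: the class and method names are hypothetical, and it assumes the declarations exactly as quoted, where minScoreSum is a double while minScore is a float. The ulp of a double near 3.55 is far smaller than the spacing between neighbouring floats near minScore, so the subtraction rounds back to the same float.

```java
final class UlpStepSketch {
  /** Mirrors "minScore -= Math.ulp(minScoreSum)" when minScoreSum is a double. */
  static float stepByDoubleUlp(float minScore, double minScoreSum) {
    // The subtraction happens in double precision and is then narrowed back to float.
    minScore -= Math.ulp(minScoreSum);
    return minScore;
  }

  /** The same step when minScoreSum is a float, so the ulp is a float ulp. */
  static float stepByFloatUlp(float minScore, float minScoreSum) {
    minScore -= Math.ulp(minScoreSum);
    return minScore;
  }

  public static void main(String[] args) {
    double sumOfOtherMaxScores = 3.554144322872162;
    double minScoreSum = 3.5541441;
    float minScore = (float) (minScoreSum - sumOfOtherMaxScores);

    System.out.println("double ulp = " + Math.ulp(minScoreSum));
    System.out.println("float ulp  = " + Math.ulp((float) minScoreSum));
    System.out.println(
        "unchanged after double-ulp step: "
            + (stepByDoubleUlp(minScore, minScoreSum) == minScore));
    System.out.println(
        "unchanged after float-ulp step:  "
            + (stepByFloatUlp(minScore, (float) minScoreSum) == minScore));
  }
}
```

With a step that leaves minScore unchanged, the loop condition is evaluated on the same inputs every iteration, so it can never become false once it is true.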






[jira] [Commented] (LUCENE-10428) getMinCompetitiveScore method in MaxScoreSumPropagator fails to converge leading to busy threads in infinite loop

2022-03-03 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500591#comment-17500591
 ] 

ASF subversion and git services commented on LUCENE-10428:
--

Commit 44a2a82319c9d375f3399a4b36abf2c3c7e229d6 in lucene's branch 
refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=44a2a82 ]

LUCENE-10428: Avoid infinite loop under error conditions. (#711)

Co-authored-by: dblock 

> getMinCompetitiveScore method in MaxScoreSumPropagator fails to converge 
> leading to busy threads in infinite loop
> -
>
> Key: LUCENE-10428
> URL: https://issues.apache.org/jira/browse/LUCENE-10428
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/query/scoring, core/search
>Reporter: Ankit Jain
>Priority: Major
> Attachments: Flame_graph.png
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Customers complained about high CPU usage on an Elasticsearch cluster in 
> production. We noticed that a few search requests had been stuck for a long time:
> {code:java}
> % curl -s localhost:9200/_cat/tasks?v   
> indices:data/read/search[phase/query] AmMLzDQ4RrOJievRDeGFZw:569205  
> AmMLzDQ4RrOJievRDeGFZw:569204  direct1645195007282 14:36:47  6.2h
> indices:data/read/search[phase/query] emjWc5bUTG6lgnCGLulq-Q:502075  
> emjWc5bUTG6lgnCGLulq-Q:502074  direct1645195037259 14:37:17  6.2h
> indices:data/read/search[phase/query] emjWc5bUTG6lgnCGLulq-Q:583270  
> emjWc5bUTG6lgnCGLulq-Q:583269  direct1645201316981 16:21:56  4.5h
> {code}
> Flame graphs indicated that CPU time was mostly going into the 
> *getMinCompetitiveScore method in MaxScoreSumPropagator*. Live JVM debugging 
> showed that the 
> org.apache.lucene.search.MaxScoreSumPropagator.scoreSumUpperBound method had 
> around 4 million invocations every second.
> We figured out the values of some parameters from live debugging:
> {code:java}
> minScoreSum = 3.5541441
> minScore + sumOfOtherMaxScores (params[0] scoreSumUpperBound) = 
> 3.554144322872162
> returnObj scoreSumUpperBound = 3.5541444
> Math.ulp(minScoreSum) = 2.3841858E-7
> {code}
> Example code snippet:
> {code:java}
> double sumOfOtherMaxScores = 3.554144322872162;
> double minScoreSum = 3.5541441;
> float minScore = (float) (minScoreSum - sumOfOtherMaxScores);
> while (scoreSumUpperBound(minScore + sumOfOtherMaxScores) > minScoreSum) {
>   minScore -= Math.ulp(minScoreSum);
>   System.out.printf("%.20f, %.100f\n", minScore, Math.ulp(minScoreSum));
> }
> {code}






[jira] [Commented] (LUCENE-10311) Should DocIdSetBuilder have different implementations for point and terms?

2022-03-02 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500568#comment-17500568
 ] 

ASF subversion and git services commented on LUCENE-10311:
--

Commit bb10e62dff886e4fcb3af58b37f2f28959aef3f7 in lucene's branch 
refs/heads/branch_9x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=bb10e62 ]

LUCENE-10311: Make FixedBitSet#approximateCardinality faster (and actually 
approximate). (#710)

This computes a pop count on a sample of the longs that back the bitset.

Quick benchmarks suggest that this runs 5x-10x faster than
`FixedBitSet#cardinality` depending on the length of the bitset.
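
The sampling idea can be sketched as follows. This is an illustration only, not the code from the commit: the class name, the fixed-stride sampling scheme and the scaling formula are all assumptions made for the example.

```java
final class SampledPopCountSketch {
  /** Exact cardinality: pop count over every backing word. */
  static long exactCardinality(long[] words) {
    long bits = 0;
    for (long word : words) {
      bits += Long.bitCount(word);
    }
    return bits;
  }

  /**
   * Approximate cardinality: pop count over every stride-th word, scaled up to the full length.
   * The stride is a hypothetical tuning knob; larger strides are faster but less accurate.
   */
  static long approximateCardinality(long[] words, int stride) {
    if (stride < 1) {
      throw new IllegalArgumentException("stride must be >= 1");
    }
    long sampledBits = 0;
    int sampledWords = 0;
    for (int i = 0; i < words.length; i += stride) {
      sampledBits += Long.bitCount(words[i]);
      sampledWords++;
    }
    if (sampledWords == 0) {
      return 0; // empty bitset
    }
    return sampledBits * words.length / sampledWords;
  }

  public static void main(String[] args) {
    long[] words = new long[1000];
    java.util.Arrays.fill(words, 0xFFFFL); // 16 bits set in every word
    System.out.println(exactCardinality(words));
    System.out.println(approximateCardinality(words, 8));
  }
}
```

The estimate is only as good as the sample: for uniformly distributed bits it tracks the exact count closely, while clustered bits can make it noticeably off, which is the trade-off implied by "actually approximate".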

> Should DocIdSetBuilder have different implementations for point and terms?
> --
>
> Key: LUCENE-10311
> URL: https://issues.apache.org/jira/browse/LUCENE-10311
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Ignacio Vera
>Priority: Major
>  Time Spent: 8h
>  Remaining Estimate: 0h
>
> DocIdSetBuilder has two API implementations, one for terms queries and one 
> for point values queries. In each case they are used in a totally different 
> way.
> For terms the API looks like:
>  
> {code:java}
> /**
>  * Add the content of the provided {@link DocIdSetIterator} to this builder. 
> NOTE: if you need to
>  * build a {@link DocIdSet} out of a single {@link DocIdSetIterator}, you 
> should rather use {@link
>  * RoaringDocIdSet.Builder}.
>  */
> void add(DocIdSetIterator iter) throws IOException;
> /** Build a {@link DocIdSet} from the accumulated doc IDs. */
> DocIdSet build() 
> {code}
>  
> For Point Values it looks like:
>  
> {code:java}
> /**
>  * Utility class to efficiently add many docs in one go.
>  *
>  * @see DocIdSetBuilder#grow
>  */
> public abstract static class BulkAdder {
>   public abstract void add(int doc);
>   public void add(DocIdSetIterator iterator) throws IOException {
>     int docID;
>     while ((docID = iterator.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
>       add(docID);
>     }
>   }
> }
> /**
>  * Reserve space and return a {@link BulkAdder} object that can be used to 
>  * add up to {@code numDocs} documents.
>  */
> public BulkAdder grow(int numDocs)
> /** Build a {@link DocIdSet} from the accumulated doc IDs. */
> DocIdSet build()
> {code}
>  
>  
> This is becoming trappy for new developments in the PointValues API.
> 1) When we call #grow() from the PointValues API, we are not telling the 
> builder how many docs we are going to add (as we don't really know it) but 
> the number of points we are about to visit. This number can be bigger than 
> Integer.MAX_VALUE. Until now, we have worked around this issue by making sure 
> we don't call this API when we need to add more than Integer.MAX_VALUE points. 
> In that case we navigate down the tree until the number of points is 
> reduced and fits in an int.
> This has worked well until now because we are calling grow from inside the BKD 
> reader, and the BKD writer/reader makes sure that the number of points in a 
> leaf fits in an int. In LUCENE-, we are moving to a cursor-like API which 
> does not enforce that the number of points in a leaf needs to fit in an int.  
> This causes friction and inconsistency in the API.
>  
> 2) This is a secondary issue that I found while thinking about this one. In 
> Lucene- we added the possibility to add a `DocIdSetIterator` from the 
> PointValues API.  Therefore there are two ways to add this kind of object 
> to a DocIdSetBuilder, which can end up producing different results:
>  
> {code:java}
> {
>   // Terms API
>   docIdSetBuilder.add(docIdSetIterator); 
> }
> {
>   // Point values API
>   docIdSetBuilder.grow(doc).add(docIdSetIterator)
> }{code}
>  
> I wonder if we need to rethink this API: should we have different 
> implementations for Terms and Point values?
>  
>  
>  
>  
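
The int-versus-long concern in item 1 of the description can be made concrete with a few lines of plain Java. This is a hedged sketch: clampToInt is a hypothetical helper shown as one possible way to handle an oversized count, not what DocIdSetBuilder does or what the issue settled on.

```java
final class PointCountSketch {
  /** Hypothetical helper: clamps a long point count to the range an int parameter can carry. */
  static int clampToInt(long pointCount) {
    if (pointCount < 0) {
      throw new IllegalArgumentException("pointCount must be >= 0");
    }
    return (int) Math.min(pointCount, Integer.MAX_VALUE);
  }

  public static void main(String[] args) {
    long pointCount = 3_000_000_000L; // more points than fit in an int

    // A plain narrowing cast silently wraps around to a negative number.
    System.out.println("cast:  " + (int) pointCount);

    // Math.toIntExact refuses instead of wrapping.
    try {
      Math.toIntExact(pointCount);
    } catch (ArithmeticException e) {
      System.out.println("toIntExact: " + e.getMessage());
    }

    // Clamping keeps the value usable as an upper bound.
    System.out.println("clamp: " + clampToInt(pointCount));
  }
}
```

Clamping is safe only when the int is treated as an upper bound on work rather than as an exact count, which is one reason the description questions whether the two use cases should share an API.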






[jira] [Commented] (LUCENE-10311) Should DocIdSetBuilder have different implementations for point and terms?

2022-03-02 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500563#comment-17500563
 ] 

ASF subversion and git services commented on LUCENE-10311:
--

Commit ca73ed1c2842b10c338f1d27ec54cead69ac090e in lucene's branch 
refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=ca73ed1 ]

LUCENE-10311: Make FixedBitSet#approximateCardinality faster (and actually 
approximate). (#710)

This computes a pop count on a sample of the longs that back the bitset.

Quick benchmarks suggest that this runs 5x-10x faster than
`FixedBitSet#cardinality` depending on the length of the bitset.

> Should DocIdSetBuilder have different implementations for point and terms?
> --
>
> Key: LUCENE-10311
> URL: https://issues.apache.org/jira/browse/LUCENE-10311
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Ignacio Vera
>Priority: Major
>  Time Spent: 8h
>  Remaining Estimate: 0h
>
> DocIdSetBuilder has two API implementations, one for terms queries and one 
> for point values queries. In each case they are used in a totally different 
> way.
> For terms the API looks like:
>  
> {code:java}
> /**
>  * Add the content of the provided {@link DocIdSetIterator} to this builder. 
> NOTE: if you need to
>  * build a {@link DocIdSet} out of a single {@link DocIdSetIterator}, you 
> should rather use {@link
>  * RoaringDocIdSet.Builder}.
>  */
> void add(DocIdSetIterator iter) throws IOException;
> /** Build a {@link DocIdSet} from the accumulated doc IDs. */
> DocIdSet build() 
> {code}
>  
> For Point Values it looks like:
>  
> {code:java}
> /**
>  * Utility class to efficiently add many docs in one go.
>  *
>  * @see DocIdSetBuilder#grow
>  */
> public abstract static class BulkAdder {
>   public abstract void add(int doc);
>   public void add(DocIdSetIterator iterator) throws IOException {
>     int docID;
>     while ((docID = iterator.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
>       add(docID);
>     }
>   }
> }
> /**
>  * Reserve space and return a {@link BulkAdder} object that can be used to 
>  * add up to {@code numDocs} documents.
>  */
> public BulkAdder grow(int numDocs)
> /** Build a {@link DocIdSet} from the accumulated doc IDs. */
> DocIdSet build()
> {code}
>  
>  
> This is becoming trappy for new developments in the PointValues API.
> 1) When we call #grow() from the PointValues API, we are not telling the 
> builder how many docs we are going to add (as we don't really know it) but 
> the number of points we are about to visit. This number can be bigger than 
> Integer.MAX_VALUE. Until now, we have worked around this issue by making sure 
> we don't call this API when we need to add more than Integer.MAX_VALUE points. 
> In that case we navigate down the tree until the number of points is 
> reduced and fits in an int.
> This has worked well until now because we are calling grow from inside the BKD 
> reader, and the BKD writer/reader makes sure that the number of points in a 
> leaf fits in an int. In LUCENE-, we are moving to a cursor-like API which 
> does not enforce that the number of points in a leaf needs to fit in an int.  
> This causes friction and inconsistency in the API.
>  
> 2) This is a secondary issue that I found when thinking about this issue. In 
> Lucene- we added the possibility to add a `DocIdSetIterator` from the 
> PointValues API.  Therefore there are two ways to add this kind of object 
> to a DocIdSetBuilder, which can end up with different results:
>  
> {code:java}
> {
>   // Terms API
>   docIdSetBuilder.add(docIdSetIterator); 
> }
> {
>   // Point values API
>   docIdSetBuilder.grow(doc).add(docIdSetIterator)
> }{code}
>  
> I wonder if we need to rethink this API: should we have different 
> implementations for Terms and Point values?
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10450) IndexSortSortedNumericDocValuesRangeQuery could be rewrite to MatchAllDocsQuery

2022-03-02 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17499982#comment-17499982
 ] 

ASF subversion and git services commented on LUCENE-10450:
--

Commit 19679428614109b1afe6e7c4ffadd35936b924c4 in lucene's branch 
refs/heads/branch_9x from Lu Xugang
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=1967942 ]

LUCENE-10450: IndexSortSortedNumericDocValuesRangeQuery could be rewrite to 
MatchAllDocsQuery (#720)



> IndexSortSortedNumericDocValuesRangeQuery could be rewrite to 
> MatchAllDocsQuery
> ---
>
> Key: LUCENE-10450
> URL: https://issues.apache.org/jira/browse/LUCENE-10450
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Priority: Minor
> Fix For: 9.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I see IndexOrDocValuesQuery used as the fallbackQuery of 
> IndexSortSortedNumericDocValuesRangeQuery in Elasticsearch, so after 
> [LUCENE-10442|https://issues.apache.org/jira/browse/LUCENE-10442], should we 
> rewrite IndexSortSortedNumericDocValuesRangeQuery to MatchAllDocsQuery if 
> its fallbackQuery is a MatchAllDocsQuery?






[jira] [Commented] (LUCENE-10237) Add merge on flush merge policy to Lucene Sandbox

2022-03-02 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17499977#comment-17499977
 ] 

ASF subversion and git services commented on LUCENE-10237:
--

Commit 46f9a25216ad2e88898e98e458dafae86adba4db in lucene's branch 
refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=46f9a25 ]

LUCENE-10237: Move CHANGES entry to 9.1.


> Add merge on flush merge policy to Lucene Sandbox
> -
>
> Key: LUCENE-10237
> URL: https://issues.apache.org/jira/browse/LUCENE-10237
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Anand Kotriwal
>Priority: Minor
> Fix For: 9.1
>
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>
> Add MergeOnCommitTieredMergePolicy to the Lucene sandbox. We use it within 
> Amazon product search, where it 
> 1. enables easy-to-configure merge-on-commit merges 
> 2. ensures no merged segment is accidentally too big a percentage of the 
> total index, which would harm effective within-query concurrency and 
> long-pole query latencies






[jira] [Commented] (LUCENE-10450) IndexSortSortedNumericDocValuesRangeQuery could be rewrite to MatchAllDocsQuery

2022-03-02 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17499973#comment-17499973
 ] 

ASF subversion and git services commented on LUCENE-10450:
--

Commit e996f1d8e722077a0fef05f922a7e97fb2e8d52e in lucene's branch 
refs/heads/main from Lu Xugang
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e996f1d ]

LUCENE-10450: IndexSortSortedNumericDocValuesRangeQuery could be rewrite to 
MatchAllDocsQuery (#720)



> IndexSortSortedNumericDocValuesRangeQuery could be rewrite to 
> MatchAllDocsQuery
> ---
>
> Key: LUCENE-10450
> URL: https://issues.apache.org/jira/browse/LUCENE-10450
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I see IndexOrDocValuesQuery used as the fallbackQuery of 
> IndexSortSortedNumericDocValuesRangeQuery in Elasticsearch, so after 
> [LUCENE-10442|https://issues.apache.org/jira/browse/LUCENE-10442], should we 
> rewrite IndexSortSortedNumericDocValuesRangeQuery to MatchAllDocsQuery if 
> its fallbackQuery is a MatchAllDocsQuery?






[jira] [Commented] (LUCENE-10439) Support multi-valued and multiple dimensions for count query in PointRangeQuery

2022-03-02 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17499972#comment-17499972
 ] 

ASF subversion and git services commented on LUCENE-10439:
--

Commit c0d5022d5aa0b88f53d805844a05fe9abcce8f2b in lucene's branch 
refs/heads/branch_9x from Lu Xugang
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=c0d5022 ]

LUCENE-10439: update CHANGES.txt (#714)



> Support multi-valued and multiple dimensions for count query in 
> PointRangeQuery
> ---
>
> Key: LUCENE-10439
> URL: https://issues.apache.org/jira/browse/LUCENE-10439
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Priority: Trivial
> Fix For: 9.1
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Follow-up of LUCENE-10424, it also works with fields that have multiple 
> dimensions and/or that are multi-valued.






[jira] [Commented] (LUCENE-10002) Remove IndexSearcher#search(Query,Collector) in favor of IndexSearcher#search(Query,CollectorManager)

2022-03-02 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17499970#comment-17499970
 ] 

ASF subversion and git services commented on LUCENE-10002:
--

Commit bfe7096565eb2ecaa8c8c2419bdcc18c9e36d49e in lucene's branch 
refs/heads/branch_9x from Luca Cavanna
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=bfe7096 ]

LUCENE-10002: Replace test usages of TopScoreDocCollector with a corresponding 
collector manager (#716)

In the effort of replacing usages of IndexSearcher#search(Query, Collector) 
with IndexSearcher#search(Query, CollectorManager), this commit replaces many 
test usages of TopScoreDocCollector with its corresponding CollectorManager 
created by calling TopScoreDocCollector#createSharedManager.

> Remove IndexSearcher#search(Query,Collector) in favor of 
> IndexSearcher#search(Query,CollectorManager)
> -
>
> Key: LUCENE-10002
> URL: https://issues.apache.org/jira/browse/LUCENE-10002
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 13.5h
>  Remaining Estimate: 0h
>
> It's a bit trappy that you can create an IndexSearcher with an executor, but 
> that it would always search on the caller thread when calling 
> {{IndexSearcher#search(Query,Collector)}}.
>  Let's remove {{IndexSearcher#search(Query,Collector)}}, point our users to 
> {{IndexSearcher#search(Query,CollectorManager)}} instead, and change factory 
> methods of our main collectors (e.g. {{TopScoreDocCollector#create}}) to 
> return a {{CollectorManager}} instead of a {{Collector}}?






[jira] [Commented] (LUCENE-10237) Add merge on commit merge policy to Lucene Sandbox

2022-03-02 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17499971#comment-17499971
 ] 

ASF subversion and git services commented on LUCENE-10237:
--

Commit 11e2fb8e0b6c03cf20e6e71fd2473109a01c4ddb in lucene's branch 
refs/heads/branch_9x from Anand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=11e2fb8 ]

LUCENE-10237 : Add MergeOnCommitTieredMergePolicy to sandbox (#446)


> Add merge on commit merge policy to Lucene Sandbox
> --
>
> Key: LUCENE-10237
> URL: https://issues.apache.org/jira/browse/LUCENE-10237
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Anand Kotriwal
>Priority: Minor
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>
> Add MergeOnCommitTieredMergePolicy to the Lucene sandbox. We use it within 
> Amazon product search, where it 
> 1. enables easy-to-configure merge-on-commit merges 
> 2. ensures no merged segment is accidentally too big a percentage of the 
> total index, which would harm effective within-query concurrency and 
> long-pole query latencies






[jira] [Commented] (LUCENE-10439) Support multi-valued and multiple dimensions for count query in PointRangeQuery

2022-03-02 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17499968#comment-17499968
 ] 

ASF subversion and git services commented on LUCENE-10439:
--

Commit e8e522a52b25e1302b0675600f11f95db49e8952 in lucene's branch 
refs/heads/main from Lu Xugang
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e8e522a ]

LUCENE-10439: update CHANGES.txt (#714)



> Support multi-valued and multiple dimensions for count query in 
> PointRangeQuery
> ---
>
> Key: LUCENE-10439
> URL: https://issues.apache.org/jira/browse/LUCENE-10439
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Priority: Trivial
> Fix For: 9.1
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Follow-up of LUCENE-10424, it also works with fields that have multiple 
> dimensions and/or that are multi-valued.






[jira] [Commented] (LUCENE-10237) Add merge on commit merge policy to Lucene Sandbox

2022-03-02 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17499962#comment-17499962
 ] 

ASF subversion and git services commented on LUCENE-10237:
--

Commit 14726dec5168ccaf14065e14e7d7f3ee22f186f0 in lucene's branch 
refs/heads/main from Anand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=14726de ]

LUCENE-10237 : Add MergeOnCommitTieredMergePolicy to sandbox (#446)



> Add merge on commit merge policy to Lucene Sandbox
> --
>
> Key: LUCENE-10237
> URL: https://issues.apache.org/jira/browse/LUCENE-10237
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Anand Kotriwal
>Priority: Minor
>  Time Spent: 6h
>  Remaining Estimate: 0h
>
> Add MergeOnCommitTieredMergePolicy to the Lucene sandbox. We use it within 
> Amazon product search, where it 
> 1. enables easy-to-configure merge-on-commit merges 
> 2. ensures no merged segment is accidentally too big a percentage of the 
> total index, which would harm effective within-query concurrency and 
> long-pole query latencies






[jira] [Commented] (LUCENE-10002) Remove IndexSearcher#search(Query,Collector) in favor of IndexSearcher#search(Query,CollectorManager)

2022-03-02 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17499957#comment-17499957
 ] 

ASF subversion and git services commented on LUCENE-10002:
--

Commit 1b083ea03936469aeb47f1c76294ec2b706c3638 in lucene's branch 
refs/heads/main from Luca Cavanna
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=1b083ea ]

LUCENE-10002: Replace test usages of TopScoreDocCollector with a corresponding 
collector manager (#716)

In the effort of replacing usages of IndexSearcher#search(Query, Collector) 
with IndexSearcher#search(Query, CollectorManager), this commit replaces many 
test usages of TopScoreDocCollector with its corresponding CollectorManager 
created by calling TopScoreDocCollector#createSharedManager.

> Remove IndexSearcher#search(Query,Collector) in favor of 
> IndexSearcher#search(Query,CollectorManager)
> -
>
> Key: LUCENE-10002
> URL: https://issues.apache.org/jira/browse/LUCENE-10002
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 13h 20m
>  Remaining Estimate: 0h
>
> It's a bit trappy that you can create an IndexSearcher with an executor, but 
> that it would always search on the caller thread when calling 
> {{IndexSearcher#search(Query,Collector)}}.
>  Let's remove {{IndexSearcher#search(Query,Collector)}}, point our users to 
> {{IndexSearcher#search(Query,CollectorManager)}} instead, and change factory 
> methods of our main collectors (e.g. {{TopScoreDocCollector#create}}) to 
> return a {{CollectorManager}} instead of a {{Collector}}?






[jira] [Commented] (LUCENE-10440) Reduce visibility of TaxonomyFacets and FloatTaxonomyFacets

2022-03-01 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17499551#comment-17499551
 ] 

ASF subversion and git services commented on LUCENE-10440:
--

Commit 51797dc7f1bee39904c3181b9391973669b4afda in lucene's branch 
refs/heads/main from Greg Miller
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=51797dc ]

LUCENE-10440: Reduce visibility of TaxonomyFacets and FloatTaxonomyFacets (#712)



> Reduce visibility of TaxonomyFacets and FloatTaxonomyFacets
> ---
>
> Key: LUCENE-10440
> URL: https://issues.apache.org/jira/browse/LUCENE-10440
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Greg Miller
>Assignee: Greg Miller
>Priority: Minor
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Similar to what we did in LUCENE-10379, let's reduce the {{public}} 
> visibility of {{TaxonomyFacets}} and {{FloatTaxonomyFacets}} to pkg-private 
> since they're really implementation details housing common logic and not 
> really intended as extension points for user faceting.






[jira] [Commented] (LUCENE-10440) Reduce visibility of TaxonomyFacets and FloatTaxonomyFacets

2022-03-01 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17499549#comment-17499549
 ] 

ASF subversion and git services commented on LUCENE-10440:
--

Commit 4c9c1c074686aad5e0816fd26471ec9cf279f8d6 in lucene's branch 
refs/heads/branch_9x from Greg Miller
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=4c9c1c0 ]

LUCENE-10440: Mark TaxonomyFacets and FloatTaxonomyFacets as deprecated (#713)



> Reduce visibility of TaxonomyFacets and FloatTaxonomyFacets
> ---
>
> Key: LUCENE-10440
> URL: https://issues.apache.org/jira/browse/LUCENE-10440
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Greg Miller
>Assignee: Greg Miller
>Priority: Minor
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Similar to what we did in LUCENE-10379, let's reduce the {{public}} 
> visibility of {{TaxonomyFacets}} and {{FloatTaxonomyFacets}} to pkg-private 
> since they're really implementation details housing common logic and not 
> really intended as extension points for user faceting.






[jira] [Commented] (LUCENE-10442) When indexQuery or/and dvQuery be a MatchAllDocsQuery then IndexOrDocValuesQuery should be rewrite to MatchAllDocsQuery

2022-02-28 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17499087#comment-17499087
 ] 

ASF subversion and git services commented on LUCENE-10442:
--

Commit 9497524cc2d1eea24c5dd3da10e46eda991a7df7 in lucene's branch 
refs/heads/branch_9x from Lu Xugang
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=9497524 ]

LUCENE-10442: When indexQuery or/and dvQuery be a MatchAllDocsQuery  then 
IndexOrDocValuesQuery should be rewrite to MatchAllDocsQuery (#715)



> When indexQuery or/and dvQuery be a MatchAllDocsQuery  then 
> IndexOrDocValuesQuery should be rewrite to MatchAllDocsQuery 
> -
>
> Key: LUCENE-10442
> URL: https://issues.apache.org/jira/browse/LUCENE-10442
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Priority: Trivial
> Fix For: 9.1
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> IndexOrDocValuesQuery is typically useful for range queries. When indexQuery 
> is rewritten to MatchAllDocsQuery and IndexOrDocValuesQuery is not the lead 
> iterator, it is most likely that dvQuery, not indexQuery, will supply the Scorer.






[jira] [Commented] (LUCENE-10442) When indexQuery or/and dvQuery be a MatchAllDocsQuery then IndexOrDocValuesQuery should be rewrite to MatchAllDocsQuery

2022-02-28 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17499077#comment-17499077
 ] 

ASF subversion and git services commented on LUCENE-10442:
--

Commit 6224d0b157f9339f9048f33bd65436b2ebf5d9b8 in lucene's branch 
refs/heads/main from Lu Xugang
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=6224d0b ]

LUCENE-10442: When indexQuery or/and dvQuery be a MatchAllDocsQuery  then 
IndexOrDocValuesQuery should be rewrite to MatchAllDocsQuery (#715)



> When indexQuery or/and dvQuery be a MatchAllDocsQuery  then 
> IndexOrDocValuesQuery should be rewrite to MatchAllDocsQuery 
> -
>
> Key: LUCENE-10442
> URL: https://issues.apache.org/jira/browse/LUCENE-10442
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Priority: Trivial
> Fix For: 9.1
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> IndexOrDocValuesQuery is typically useful for range queries. When indexQuery 
> is rewritten to MatchAllDocsQuery and IndexOrDocValuesQuery is not the lead 
> iterator, it is most likely that dvQuery, not indexQuery, will supply the Scorer.






[jira] [Commented] (LUCENE-10421) Non-deterministic results from KnnVectorQuery?

2022-02-24 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497932#comment-17497932
 ] 

ASF subversion and git services commented on LUCENE-10421:
--

Commit 5972b495ba6f5145492077dfb2a5d28717f71533 in lucene's branch 
refs/heads/branch_9x from Robert Muir
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=5972b49 ]

LUCENE-10421: use Constant instead of relying upon timestamp (#686)



> Non-deterministic results from KnnVectorQuery?
> --
>
> Key: LUCENE-10421
> URL: https://issues.apache.org/jira/browse/LUCENE-10421
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> [Nightly benchmarks|https://home.apache.org/~mikemccand/lucenebench/] have 
> been upset for the past ~1.5 weeks because it looks like {{KnnVectorQuery}} 
> is giving slightly different results on every run, even on an identical 
> (deterministically constructed – single thread indexing, flush by doc count, 
> {{SerialMergeSchedule}}, {{LogDocCountMergePolicy}}, etc.) index each 
> night.  It produces failures like this, which then abort the benchmark to 
> help us catch any recent accidental bug that alters our precise top N search 
> hits and scores:
> {noformat}
>  Traceback (most recent call last):
>  File “/l/util.nightly/src/python/nightlyBench.py”, line 2177, in 
>   run()
>  File “/l/util.nightly/src/python/nightlyBench.py”, line 1225, in run
>   raise RuntimeError(‘search result differences: %s’ % str(errors))
> RuntimeError: search result differences: 
> [“query=KnnVectorQuery:vector[-0.07267512,...][10] filter=None sort=None 
> groupField=None hitCount=10: hit 4 has wrong field/score value ([20844660], 
> ‘0.92060816’) vs ([254438\
> 06], ‘0.920046’)“, “query=KnnVectorQuery:vector[-0.12073054,...][10] 
> filter=None sort=None groupField=None hitCount=10: hit 7 has wrong 
> field/score value ([25501982], ‘0.99630797’) vs ([13688085], ‘0.9961489’)“, 
> “qu\
> ery=KnnVectorQuery:vector[0.02227773,...][10] filter=None sort=None 
> groupField=None hitCount=10: hit 0 has wrong field/score value ([4741915], 
> ‘0.9481132’) vs ([14220828], ‘0.9579846’)“, “query=KnnVectorQuery:vector\
> [0.024077624,...][10] filter=None sort=None groupField=None hitCount=10: hit 
> 0 has wrong field/score value ([7472373], ‘0.8460249’) vs ([12577825], 
> ‘0.8378446’)“]{noformat}
> At first I thought this might be expected because of the recent (awesome!!) 
> improvements to HNSW, so I tried to simply "regold".  But the regold did not 
> "take", so it indeed looks like there is some non-determinism here.
> I pinged [~msoko...@gmail.com] and he found this random seeding that is most 
> likely the cause?
> {noformat}
> public final class HnswGraphBuilder {
>   /** Default random seed for level generation * */
>   private static final long DEFAULT_RAND_SEED = System.currentTimeMillis(); 
> {noformat}
> Can we somehow make this deterministic instead?  Or maybe the nightly 
> benchmarks could somehow pass something in to make results deterministic for 
> benchmarking?  Or ... we could also relax the benchmarks to accept 
> non-determinism for {{KnnVectorQuery}} task?
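
A minimal sketch of why a constant seed makes the results reproducible, under stated assumptions: this is not how HnswGraphBuilder assigns levels. The class SeededLevels, the constant FIXED_SEED and the uniform level draw are hypothetical and only illustrate that a java.util.Random built from a fixed seed yields the same sequence on every run, while one seeded from System.currentTimeMillis() generally does not.

```java
import java.util.Random;

final class SeededLevels {
  /** Hypothetical fixed seed; the value is arbitrary and chosen only for this sketch. */
  static final long FIXED_SEED = 42L;

  /** Draws one pseudo-random level in [0, maxLevel] per node from a generator with the given seed. */
  static int[] levels(long seed, int nodes, int maxLevel) {
    Random random = new Random(seed);
    int[] result = new int[nodes];
    for (int i = 0; i < nodes; i++) {
      result[i] = random.nextInt(maxLevel + 1);
    }
    return result;
  }
}
```

Calling levels(FIXED_SEED, n, m) twice returns identical arrays, which is the property a benchmark comparing exact hits and scores across nightly runs depends on.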






[jira] [Commented] (LUCENE-10421) Non-deterministic results from KnnVectorQuery?

2022-02-24 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497920#comment-17497920
 ] 

ASF subversion and git services commented on LUCENE-10421:
--

Commit 466278e14921572ceb54a0a52a8a262476ee24b7 in lucene's branch 
refs/heads/main from Robert Muir
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=466278e ]

LUCENE-10421: use Constant instead of relying upon timestamp (#686)



> Non-deterministic results from KnnVectorQuery?
> --
>
> Key: LUCENE-10421
> URL: https://issues.apache.org/jira/browse/LUCENE-10421
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> [Nightly benchmarks|https://home.apache.org/~mikemccand/lucenebench/] have 
> been upset for the past ~1.5 weeks because it looks like {{KnnVectorQuery}} 
> is giving slightly different results on every run, even on an identical 
> (deterministically constructed – single thread indexing, flush by doc count, 
> {{SerialMergeSchedule}}, {{LogDocCountMergePolicy}}, etc.) index each 
> night.  It produces failures like this, which then abort the benchmark to 
> help us catch any recent accidental bug that alters our precise top N search 
> hits and scores:
> {noformat}
>  Traceback (most recent call last):
>  File “/l/util.nightly/src/python/nightlyBench.py”, line 2177, in 
>   run()
>  File “/l/util.nightly/src/python/nightlyBench.py”, line 1225, in run
>   raise RuntimeError(‘search result differences: %s’ % str(errors))
> RuntimeError: search result differences: 
> [“query=KnnVectorQuery:vector[-0.07267512,...][10] filter=None sort=None 
> groupField=None hitCount=10: hit 4 has wrong field/score value ([20844660], 
> ‘0.92060816’) vs ([254438\
> 06], ‘0.920046’)“, “query=KnnVectorQuery:vector[-0.12073054,...][10] 
> filter=None sort=None groupField=None hitCount=10: hit 7 has wrong 
> field/score value ([25501982], ‘0.99630797’) vs ([13688085], ‘0.9961489’)“, 
> “qu\
> ery=KnnVectorQuery:vector[0.02227773,...][10] filter=None sort=None 
> groupField=None hitCount=10: hit 0 has wrong field/score value ([4741915], 
> ‘0.9481132’) vs ([14220828], ‘0.9579846’)“, “query=KnnVectorQuery:vector\
> [0.024077624,...][10] filter=None sort=None groupField=None hitCount=10: hit 
> 0 has wrong field/score value ([7472373], ‘0.8460249’) vs ([12577825], 
> ‘0.8378446’)“]{noformat}
> At first I thought this might be expected because of the recent (awesome!!) 
> improvements to HNSW, so I tried to simply "regold".  But the regold did not 
> "take", so it indeed looks like there is some non-determinism here.
> I pinged [~msoko...@gmail.com] and he found this random seeding that is most 
> likely the cause?
> {noformat}
> public final class HnswGraphBuilder {
>   /** Default random seed for level generation * */
>   private static final long DEFAULT_RAND_SEED = System.currentTimeMillis(); 
> {noformat}
> Can we somehow make this deterministic instead?  Or maybe the nightly 
> benchmarks could somehow pass something in to make results deterministic for 
> benchmarking?  Or ... we could also relax the benchmarks to accept 
> non-determinism for {{KnnVectorQuery}} task?






[jira] [Commented] (LUCENE-9952) FacetResult#value can be inaccurate in SortedSetDocValueFacetCounts

2022-02-24 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497757#comment-17497757
 ] 

ASF subversion and git services commented on LUCENE-9952:
-

Commit 81ab1d6ab6fa3aee69153b256a01a4b984f88b59 in lucene's branch 
refs/heads/branch_9x from Greg Miller
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=81ab1d6 ]

Remove TODO for LUCENE-9952 since that issue was fixed


> FacetResult#value can be inaccurate in SortedSetDocValueFacetCounts
> ---
>
> Key: LUCENE-9952
> URL: https://issues.apache.org/jira/browse/LUCENE-9952
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/facet
>Affects Versions: 9.0
>Reporter: Greg Miller
>Assignee: Greg Miller
>Priority: Minor
> Fix For: 9.1
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> As described in a dev@ list 
> [thread|http://mail-archives.apache.org/mod_mbox/lucene-dev/202105.mbox/%3CCANJ0CDo-9zt0U_pxWNOBkfiJpaAXZGGwOEJPnENAP6JzWz_t9Q%40mail.gmail.com%3E],
>  the value of {{FacetResult#value}} can be incorrect in SSDV faceting when 
> docs are multi-valued (affects both {{SortedSetDocValueFacetCounts}} and 
> {{ConcurrentSortedSetDocValueFacetCounts}}). If a doc has multiple values in 
> the same dimension, it will be counted multiple times when populating the 
> counts of {{FacetResult#value}}.
> We should either provide an accurate count, or provide {{-1}} if we don't 
> have an accurate count (like we do in taxonomy faceting). I _think_ this 
> change will be a bit involved though as SSDV facet counting likely needs to 
> be made aware of {{FacetConfig}}.
> NOTE: I've updated this description to describe only the SSDV case after 
> spinning off LUCENE-9953 to track the LongValueFacetCounts case.
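
The over-counting described above can be sketched without any Lucene classes. This is an illustration under assumptions, not the SSDV faceting code: each inner list stands for the values one document has in a single dimension, and the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.List;

final class DimensionCounts {
  /** Sums per-value counts: a doc contributes once per distinct value, so multi-valued docs count repeatedly. */
  static int sumOfValueCounts(List<List<String>> docs) {
    int total = 0;
    for (List<String> values : docs) {
      total += new HashSet<>(values).size();
    }
    return total;
  }

  /** Counts each doc at most once, however many values it has in the dimension. */
  static int distinctDocCount(List<List<String>> docs) {
    int total = 0;
    for (List<String> values : docs) {
      if (!values.isEmpty()) {
        total++;
      }
    }
    return total;
  }
}
```

For three docs with values [red, blue], [red] and [], the first method reports 3 while only 2 docs have any value in the dimension. That gap is the inaccuracy the issue describes.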






[jira] [Commented] (LUCENE-9952) FacetResult#value can be inaccurate in SortedSetDocValueFacetCounts

2022-02-24 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497749#comment-17497749
 ] 

ASF subversion and git services commented on LUCENE-9952:
-

Commit 4af516a1491e55022ca81a909b7c78d54d8272c0 in lucene's branch 
refs/heads/main from Greg Miller
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=4af516a ]

Remove TODO for LUCENE-9952 since that issue was fixed


> FacetResult#value can be inaccurate in SortedSetDocValueFacetCounts
> ---
>
> Key: LUCENE-9952
> URL: https://issues.apache.org/jira/browse/LUCENE-9952
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/facet
>Affects Versions: 9.0
>Reporter: Greg Miller
>Assignee: Greg Miller
>Priority: Minor
> Fix For: 9.1
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> As described in a dev@ list 
> [thread|http://mail-archives.apache.org/mod_mbox/lucene-dev/202105.mbox/%3CCANJ0CDo-9zt0U_pxWNOBkfiJpaAXZGGwOEJPnENAP6JzWz_t9Q%40mail.gmail.com%3E],
>  the value of {{FacetResult#value}} can be incorrect in SSDV faceting when 
> docs are multi-valued (affects both {{SortedSetDocValueFacetCounts}} and 
> {{ConcurrentSortedSetDocValueFacetCounts}}). If a doc has multiple values in 
> the same dimension, it will be counted multiple times when populating the 
> counts of {{FacetResult#value}}.
> We should either provide an accurate count, or provide {{-1}} if we don't 
> have an accurate count (like we do in taxonomy faceting). I _think_ this 
> change will be a bit involved though as SSDV facet counting likely needs to 
> be made aware of {{FacetConfig}}.
> NOTE: I've updated this description to describe only the SSDV case after 
> spinning off LUCENE-9953 to track the LongValueFacetCounts case.
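
The over-counting described in the issue can be illustrated with a minimal plain-Java sketch. It does not use Lucene's facet API: the class `FacetCountSketch` and its methods are hypothetical and only model a single dimension in which a document may carry several labels. Summing the per-label counts counts a multi-valued doc once per label, whereas an accurate dimension value counts each doc once.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical model, not Lucene code: each inner list holds one doc's labels in a dimension. */
final class FacetCountSketch {

  /** Per-label counts: a doc contributes once to every label it carries. */
  static Map<String, Integer> labelCounts(List<List<String>> docs) {
    Map<String, Integer> counts = new TreeMap<>();
    for (List<String> labels : docs) {
      for (String label : labels) {
        counts.merge(label, 1, Integer::sum);
      }
    }
    return counts;
  }

  /** Naive dimension value: the sum of label counts, which counts multi-valued docs repeatedly. */
  static int summedValue(List<List<String>> docs) {
    int sum = 0;
    for (int count : labelCounts(docs).values()) {
      sum += count;
    }
    return sum;
  }

  /** Accurate dimension value: the number of docs that have at least one label. */
  static int distinctDocValue(List<List<String>> docs) {
    int matching = 0;
    for (List<String> labels : docs) {
      if (!labels.isEmpty()) {
        matching++;
      }
    }
    return matching;
  }

  public static void main(String[] args) {
    // Doc 0 has two labels, doc 1 has one, doc 2 has none.
    List<List<String>> docs = List.of(List.of("red", "blue"), List.of("red"), List.of());
    System.out.println("summed=" + summedValue(docs) + " distinct=" + distinctDocValue(docs));
  }
}
```

In this model the two numbers only agree when every doc is single-valued in the dimension, which is presumably why the counting code would need to know from configuration whether a dimension is multi-valued.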






[jira] [Commented] (LUCENE-10408) Better dense encoding of doc Ids in Lucene91HnswVectorsFormat

2022-02-24 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497372#comment-17497372
 ] 

ASF subversion and git services commented on LUCENE-10408:
--

Commit d4cb6d0a307be42b8d3498d4363a68eec5947f15 in lucene's branch 
refs/heads/branch_9x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d4cb6d0 ]

LUCENE-10408: Write doc IDs of KNN vectors as ints rather than vints. (#708)

Since doc IDs with a vector are loaded as an int[] in memory, this changes the
on-disk format of vectors to align with the in-memory representation by using
ints instead of vints to represent doc IDs. This might make vectors a bit
larger on disk, but also a bit faster to open.

I made the same change to how we encode nodes on levels for the same reason.
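
As a rough illustration of the size trade-off mentioned above, the following plain-Java sketch encodes values in the common variable-length layout (7 data bits per byte, high bit set when more bytes follow) and compares the result with a fixed 4-byte int. The class `VIntSizeSketch` is hypothetical and is not the codec's actual writer; it only shows why small doc IDs take fewer bytes as vints but must be decoded one by one, while fixed-width ints match an in-memory `int[]` directly.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical sketch of variable-length int encoding; not the Lucene implementation. */
final class VIntSizeSketch {

  /** Encodes a non-negative int using 7 data bits per byte; the high bit marks continuation. */
  static byte[] encodeVInt(int value) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    while ((value & ~0x7F) != 0) {
      out.write((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    out.write(value);
    return out.toByteArray();
  }

  /** Decodes a single value produced by encodeVInt. */
  static int decodeVInt(byte[] bytes) {
    int value = 0;
    int shift = 0;
    for (byte b : bytes) {
      value |= (b & 0x7F) << shift;
      shift += 7;
    }
    return value;
  }

  public static void main(String[] args) {
    int[] docIds = {5, 300, 1_000_000};
    for (int docId : docIds) {
      System.out.println(docId + ": vint bytes=" + encodeVInt(docId).length
          + ", fixed int bytes=" + Integer.BYTES);
    }
  }
}
```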

> Better dense encoding of doc Ids in Lucene91HnswVectorsFormat
> -
>
> Key: LUCENE-10408
> URL: https://issues.apache.org/jira/browse/LUCENE-10408
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mayya Sharipova
>Assignee: Mayya Sharipova
>Priority: Minor
> Fix For: 9.1
>
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> Currently we write doc Ids of all documents that have vectors as is.  We 
> should improve their encoding either using delta encoding or bitset.






[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-02-24 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497373#comment-17497373
 ] 

ASF subversion and git services commented on LUCENE-10382:
--

Commit d952b3a58114ce5a929211bca7a9b0e822658f35 in lucene's branch 
refs/heads/branch_9x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d952b3a ]

LUCENE-10382: Use `IndexReaderContext#id` to check reader identity. (#702)

`KnnVectorQuery` currently uses the index reader's hashcode to make sure that
the query it builds runs on the right reader. We had added
`IndexReaderContext#id` a while back for a similar purpose with `TermStates`,
let's reuse it?
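
The difference between the two checks can be sketched in plain Java. The `Context` class below is a hypothetical stand-in, not Lucene's reader context; the assumption is only that an `id` is an opaque token created once per instance. Two distinct instances can share a hash code, so a hash comparison may accept the wrong reader, while a reference comparison of identity tokens cannot.

```java
/** Hypothetical sketch comparing hash-based and identity-based reader checks. */
final class ReaderIdentitySketch {

  /** Stand-in for a reader context; the hash code is injectable to force a collision. */
  static final class Context {
    private final Object id = new Object(); // opaque token, unique per instance
    private final int hash;

    Context(int hash) {
      this.hash = hash;
    }

    Object id() {
      return id;
    }

    @Override
    public int hashCode() {
      return hash;
    }
  }

  /** Weak check: equal hash codes do not imply the same instance. */
  static boolean sameByHash(Context built, Context searched) {
    return built.hashCode() == searched.hashCode();
  }

  /** Strong check: identity tokens are compared by reference. */
  static boolean sameById(Context built, Context searched) {
    return built.id() == searched.id();
  }

  public static void main(String[] args) {
    Context a = new Context(42);
    Context b = new Context(42); // different instance, colliding hash
    System.out.println("byHash=" + sameByHash(a, b) + " byId=" + sameById(a, b));
  }
}
```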

> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
>  Time Spent: 7h 50m
>  Remaining Estimate: 0h
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.






[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-02-24 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497369#comment-17497369
 ] 

ASF subversion and git services commented on LUCENE-10382:
--

Commit d47ff38d703c6b5da1ef9c774ccda201fd682b8d in lucene's branch 
refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d47ff38 ]

LUCENE-10382: Use `IndexReaderContext#id` to check reader identity. (#702)

`KnnVectorQuery` currently uses the index reader's hashcode to make sure that
the query it builds runs on the right reader. We had added
`IndexReaderContext#id` a while back for a similar purpose with `TermStates`,
let's reuse it?

> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
>  Time Spent: 7h 40m
>  Remaining Estimate: 0h
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.






[jira] [Commented] (LUCENE-10408) Better dense encoding of doc Ids in Lucene91HnswVectorsFormat

2022-02-24 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497367#comment-17497367
 ] 

ASF subversion and git services commented on LUCENE-10408:
--

Commit 44d7d962ae42cfca7070a8e2c84ab059fec21e10 in lucene's branch 
refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=44d7d96 ]

LUCENE-10408: Write doc IDs of KNN vectors as ints rather than vints. (#708)

Since doc IDs with a vector are loaded as an int[] in memory, this changes the
on-disk format of vectors to align with the in-memory representation by using
ints instead of vints to represent doc IDs. This might make vectors a bit
larger on disk, but also a bit faster to open.

I made the same change to how we encode nodes on levels for the same reason.

> Better dense encoding of doc Ids in Lucene91HnswVectorsFormat
> -
>
> Key: LUCENE-10408
> URL: https://issues.apache.org/jira/browse/LUCENE-10408
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mayya Sharipova
>Assignee: Mayya Sharipova
>Priority: Minor
> Fix For: 9.1
>
>  Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> Currently we write doc Ids of all documents that have vectors as is.  We 
> should improve their encoding either using delta encoding or bitset.






[jira] [Commented] (LUCENE-10439) Support multi-valued and multiple dimensions for count query in PointRangeQuery

2022-02-24 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497273#comment-17497273
 ] 

ASF subversion and git services commented on LUCENE-10439:
--

Commit 6acf16a2e3427179614f99e159dec16f63b4dfc4 in lucene's branch 
refs/heads/branch_9x from Lu Xugang
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=6acf16a ]

LUCENE-10439: Support multi-valued and multiple dimensions for count query in 
PointRangeQuery (#705)



> Support multi-valued and multiple dimensions for count query in 
> PointRangeQuery
> ---
>
> Key: LUCENE-10439
> URL: https://issues.apache.org/jira/browse/LUCENE-10439
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Priority: Trivial
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Follow-up of LUCENE-10424, it also works with fields that have multiple 
> dimensions and/or that are multi-valued.






[jira] [Commented] (LUCENE-10439) Support multi-valued and multiple dimensions for count query in PointRangeQuery

2022-02-24 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497272#comment-17497272
 ] 

ASF subversion and git services commented on LUCENE-10439:
--

Commit 550d1305db71b33f988484fe58de1f754283562d in lucene's branch 
refs/heads/main from Lu Xugang
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=550d130 ]

LUCENE-10439: Support multi-valued and multiple dimensions for count query in 
PointRangeQuery (#705)



> Support multi-valued and multiple dimensions for count query in 
> PointRangeQuery
> ---
>
> Key: LUCENE-10439
> URL: https://issues.apache.org/jira/browse/LUCENE-10439
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Priority: Trivial
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Follow-up of LUCENE-10424, it also works with fields that have multiple 
> dimensions and/or that are multi-valued.






[jira] [Commented] (LUCENE-10417) IntNRQ task performance decreased in nightly benchmark

2022-02-24 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497267#comment-17497267
 ] 

ASF subversion and git services commented on LUCENE-10417:
--

Commit ad48203b557d3250f6975d097a5af331db0ee3cd in lucene's branch 
refs/heads/branch_9x from gf2121
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=ad48203 ]

LUCENE-10417: Revert "LUCENE-10315" (#706) (#707)



> IntNRQ task performance decreased in nightly benchmark
> --
>
> Key: LUCENE-10417
> URL: https://issues.apache.org/jira/browse/LUCENE-10417
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/codecs
>Reporter: Feng Guo
>Assignee: Feng Guo
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Link: https://home.apache.org/~mikemccand/lucenebench/2022.02.07.18.02.48.html
> Probably related to LUCENE-10315,  I'll dig.






[jira] [Commented] (LUCENE-10315) Speed up BKD leaf block ids codec by a 512 ints ForUtil

2022-02-24 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497268#comment-17497268
 ] 

ASF subversion and git services commented on LUCENE-10315:
--

Commit ad48203b557d3250f6975d097a5af331db0ee3cd in lucene's branch 
refs/heads/branch_9x from gf2121
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=ad48203 ]

LUCENE-10417: Revert "LUCENE-10315" (#706) (#707)



> Speed up BKD leaf block ids codec by a 512 ints ForUtil
> ---
>
> Key: LUCENE-10315
> URL: https://issues.apache.org/jira/browse/LUCENE-10315
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Feng Guo
>Assignee: Feng Guo
>Priority: Major
> Fix For: 9.1
>
> Attachments: addall.svg
>
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> Elasticsearch (which is based on Lucene) can automatically infer types for 
> users with its dynamic mapping feature. When users index low-cardinality 
> fields such as gender, age, or status, they often use numbers to 
> represent the values, so ES infers these fields as {{long}} and 
> uses BKD as the index for {{long}} fields. As the data volume grows, 
> building the result set of low-cardinality fields makes CPU usage and 
> load very high.
> This is a flame graph we obtained from the production environment:
> [^addall.svg]
> It can be seen that almost all CPU time is spent in addAll. When we reindexed 
> {{long}} to {{keyword}}, the cluster load and search latency were greatly 
> reduced (we spent weeks reindexing all indices). I know that the ES 
> documentation recommends {{keyword}} for term/terms queries and {{long}} for 
> range queries, but there are always some users who don't realize 
> this and keep their SQL database habits, or dynamic mapping 
> automatically selects the type for them. All in all, users won't realize that 
> there is such a big performance difference between {{long}} and 
> {{keyword}} for low-cardinality fields. So from my point of view it 
> makes sense to make BKD work better for low/medium 
> cardinality fields.
> As far as I can see, for low-cardinality fields, {{keyword}} has two 
> advantages over {{long}}:
> 1. {{ForUtil}}, used in {{keyword}} postings, is much more efficient than BKD's 
> delta VInt because of its batch reading (readLongs) and SIMD decoding.
> 2. When the query term count is less than 16, {{TermsInSetQuery}} can lazily 
> materialize its result set, and when another clause with a small result set 
> intersects with this low-cardinality condition, the low-cardinality field can 
> avoid reading all docIds into memory.
> This issue targets the first point. The basic idea is to 
> use a 512-int {{ForUtil}} for the BKD ids codec. I benchmarked this optimization 
> by mocking some random {{LongPoint}} values and querying them with 
> {{PointInSetQuery}}.
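
The general idea behind fixed-width block packing can be sketched as follows. This is a simplified, hypothetical `BitPackSketch`, not Lucene's `ForUtil` (whose layout and SIMD-friendly decoding are more involved): every value in a block is stored with the same number of bits, so a decoder can read whole longs in bulk and extract values with shifts and masks, instead of branching on a continuation bit per byte as a VInt decoder does. The sketch assumes non-negative values that fit in `bitsPerValue` bits, with `bitsPerValue` between 1 and 31.

```java
/** Hypothetical fixed-width bit packing sketch; not the Lucene ForUtil implementation. */
final class BitPackSketch {

  /** Packs each value into bitsPerValue bits, laid out back to back in a long[]. */
  static long[] pack(int[] values, int bitsPerValue) {
    long[] packed = new long[(values.length * bitsPerValue + 63) / 64];
    for (int i = 0; i < values.length; i++) {
      long bitPos = (long) i * bitsPerValue;
      int word = (int) (bitPos >>> 6);
      int offset = (int) (bitPos & 63);
      packed[word] |= ((long) values[i]) << offset;
      if (offset + bitsPerValue > 64) {
        // The value straddles two longs: store its high bits in the next word.
        packed[word + 1] |= ((long) values[i]) >>> (64 - offset);
      }
    }
    return packed;
  }

  /** Reverses pack: extracts count values of bitsPerValue bits each. */
  static int[] unpack(long[] packed, int count, int bitsPerValue) {
    long mask = (1L << bitsPerValue) - 1;
    int[] values = new int[count];
    for (int i = 0; i < count; i++) {
      long bitPos = (long) i * bitsPerValue;
      int word = (int) (bitPos >>> 6);
      int offset = (int) (bitPos & 63);
      long v = packed[word] >>> offset;
      if (offset + bitsPerValue > 64) {
        v |= packed[word + 1] << (64 - offset);
      }
      values[i] = (int) (v & mask);
    }
    return values;
  }

  public static void main(String[] args) {
    int[] docIds = {3, 500, 0, 1023, 77, 512, 1, 999};
    long[] packed = pack(docIds, 10); // 8 values * 10 bits = 80 bits, stored in 2 longs
    System.out.println(java.util.Arrays.toString(unpack(packed, docIds.length, 10)));
  }
}
```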
> *Benchmark Result*
> |doc count|field cardinality|query point|baseline QPS|candidate QPS|diff percentage|
> |1|32|1|51.44|148.26|188.22%|
> |1|32|2|26.8|101.88|280.15%|
> |1|32|4|14.04|53.52|281.20%|
> |1|32|8|7.04|28.54|305.40%|
> |1|32|16|3.54|14.61|312.71%|
> |1|128|1|110.56|350.26|216.81%|
> |1|128|8|16.6|89.81|441.02%|
> |1|128|16|8.45|48.07|468.88%|
> |1|128|32|4.2|25.35|503.57%|
> |1|128|64|2.13|13.02|511.27%|
> |1|1024|1|536.19|843.88|57.38%|
> |1|1024|8|109.71|251.89|129.60%|
> |1|1024|32|33.24|104.11|213.21%|
> |1|1024|128|8.87|30.47|243.52%|
> |1|1024|512|2.24|8.3|270.54%|
> |1|8192|1|.33|5000|50.00%|
> |1|8192|32|139.47|214.59|53.86%|
> |1|8192|128|54.59|109.23|100.09%|
> |1|8192|512|15.61|36.15|131.58%|
> |1|8192|2048|4.11|11.14|171.05%|
> |1|1048576|1|2597.4|3030.3|16.67%|
> |1|1048576|32|314.96|371.75|18.03%|
> |1|1048576|128|99.7|116.28|16.63%|
> |1|1048576|512|30.5|37.15|21.80%|
> |1|1048576|2048|10.38|12.3|18.50%|
> |1|8388608|1|2564.1|3174.6|23.81%|
> |1|8388608|32|196.27|238.95|21.75%|
> |1|8388608|128|55.36|68.03|22.89%|
> |1|8388608|512|15.58|19.24|23.49%|
> |1|8388608|2048|4.56|5.71|25.22%|
> The indices size is reduced for low cardinality fields and flat for high 
> cardinality fields.
> {code:java}
> 113M index_1_doc_32_cardinality_baseline
> 114M index_1_doc_32_cardinality_candidate
> 140M index_1_doc_128_cardinality_baseline
> 133M index_1_doc_128_cardinality_candidate
> 193M index_1_doc_1024_cardinality_baseline
> 174M index_1_doc_1024_cardinality_candidate
> 241M

[jira] [Commented] (LUCENE-10417) IntNRQ task performance decreased in nightly benchmark

2022-02-24 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497259#comment-17497259
 ] 

ASF subversion and git services commented on LUCENE-10417:
--

Commit b0ca227862950a1869b535f31881cdfc2e859176 in lucene's branch 
refs/heads/main from gf2121
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=b0ca227 ]

LUCENE-10417: Revert "LUCENE-10315" (#706)



> IntNRQ task performance decreased in nightly benchmark
> --
>
> Key: LUCENE-10417
> URL: https://issues.apache.org/jira/browse/LUCENE-10417
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/codecs
>Reporter: Feng Guo
>Assignee: Feng Guo
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Link: https://home.apache.org/~mikemccand/lucenebench/2022.02.07.18.02.48.html
> Probably related to LUCENE-10315,  I'll dig.





