date:20220715

[GitHub] [lucene] zacharymorn commented on a diff in pull request #1018: LUCENE-10480: Use BulkScorer to limit BMMScorer to only top-level disjunctions

2022-07-15 Thread GitBox



zacharymorn commented on code in PR #1018:
URL: https://github.com/apache/lucene/pull/1018#discussion_r922641944


##
lucene/core/src/java/org/apache/lucene/search/BooleanWeight.java:
##
@@ -191,6 +191,69 @@ public long cost() {
   // or null if it is not applicable
   // pkg-private for forcing use of BooleanScorer in tests
   BulkScorer optionalBulkScorer(LeafReaderContext context) throws IOException {
+    if (scoreMode == ScoreMode.TOP_SCORES) {
+      if (query.getMinimumNumberShouldMatch() > 1 || weightedClauses.size() > 2) {
+        return null;
+      }
+
+      List<ScorerSupplier> optional = new ArrayList<>();
+      for (WeightedBooleanClause wc : weightedClauses) {
+        Weight w = wc.weight;
+        BooleanClause c = wc.clause;
+        if (c.getOccur() != Occur.SHOULD) {
+          continue;
+        }
+        ScorerSupplier scorer = w.scorerSupplier(context);
+        if (scorer != null) {
+          optional.add(scorer);
+        }
+      }
+
+      if (optional.size() <= 1) {
+        return null;
+      }
+
+      List<Scorer> optionalScorers = new ArrayList<>();
+      for (ScorerSupplier ss : optional) {
+        optionalScorers.add(ss.get(Long.MAX_VALUE));
+      }
+
+      return new BulkScorer() {

Review Comment:
   Thanks for the suggestion! I gave that a try and it did work, but it would 
reduce the performance boost for OrHighMed from around 110+% to 70+%, most 
likely due to the extra logic inside `DefaultBulkScorer`. I guess my preference 
would be to use the anonymous bulk scorer to maintain the performance 
advantage, but I'm also good with using `DefaultBulkScorer` if reducing 
potentially duplicated code and keeping things consistent are preferred?
   
   ```
   Task                      QPS baseline   StdDev  QPS my_modified_version   StdDev Pct diff               p-value
   TermBGroup1M1P                   55.89   (7.1%)                    54.01   (6.1%)    -3.4% (-15% -  10%)   0.108
   TermDateFacets                   34.46   (5.8%)                    33.58   (5.0%)    -2.5% (-12% -   8%)   0.138
   AndHighOrMedMed                  90.90   (5.6%)                    88.59   (4.6%)    -2.5% (-12% -   8%)   0.115
   BrowseDayOfYearSSDVFacets        28.63  (12.5%)                    28.01  (14.5%)    -2.2% (-25% -  28%)   0.612
   MedTermDayTaxoFacets             79.74   (5.1%)                    78.04   (4.3%)    -2.1% (-10% -   7%)   0.150
   TermGroup100                     36.28   (3.5%)                    35.54   (3.1%)    -2.1% ( -8% -   4%)   0.050
   TermBGroup1M                     30.37   (3.7%)                    29.87   (3.8%)    -1.6% ( -8% -   6%)   0.165
   TermGroup10K                     41.33   (3.2%)                    40.66   (3.3%)    -1.6% ( -7% -   5%)   0.117
   PKLookup                        330.84   (5.1%)                   326.20   (4.3%)    -1.4% (-10% -   8%)   0.349
   SloppyPhrase                     13.56   (2.8%)                    13.39   (2.5%)    -1.2% ( -6% -   4%)   0.139
   TermGroup1M                      39.76   (3.2%)                    39.32   (3.2%)    -1.1% ( -7% -   5%)   0.272
   AndMedOrHighHigh                 88.13   (5.5%)                    87.22   (4.4%)    -1.0% (-10% -   9%)   0.511
   BrowseDateSSDVFacets              4.17  (29.0%)                     4.13  (29.0%)    -0.8% (-45% -  80%)   0.933
   SpanNear                        169.70   (2.6%)                   168.59   (2.0%)    -0.7% ( -5% -   4%)   0.369
   Fuzzy2                           83.59   (2.4%)                    83.12   (2.2%)    -0.6% ( -5% -   4%)   0.442
   Respell                          96.22   (3.1%)                    95.85   (2.6%)    -0.4% ( -5% -   5%)   0.672
   IntervalsOrdered                 23.02   (4.3%)                    22.94   (4.2%)    -0.3% ( -8% -   8%)   0.799
   Wildcard                        231.51   (4.5%)                   230.74   (5.0%)    -0.3% ( -9% -   9%)   0.827
   AndHighMed                      143.73   (5.8%)                   143.48   (4.0%)    -0.2% ( -9% -  10%)   0.914
   AndHighHighDayTaxoFacets         54.80   (1.3%)                    54.71   (1.5%)    -0.2% ( -2% -   2%)   0.717
   Fuzzy1                          154.41   (2.8%)                   154.44   (1.9%)     0.0% ( -4% -   4%)   0.981
   BrowseMonthSSDVFacets            28.06  (10.4%)                    28.06  (13.3%)     0.0% (-21% -  26%)   0.995
   OrHighMedDayTaxoFacets            7.38   (5.7%)                     7.39   (5.4%)     0.0% (-10% -  11%)   0.981
   AndHighMedDayTaxoFacets         134.39   (2.0%)                   134.59   (1.9%)     0.2% ( -3% -   4%)   0.809
   Phrase                           38.80   (1.8%)                    38.86   (2.5%)     0.2% ( -4% -   4%)   0.823
   TermMonthSort                   357.58   (5.2%)                   359.31   (6.7%)     0.5% (-10% -  13%)   0.801
   TermTitleSort                   274.32   (5.1%)                   275.77   (6.8%)     0.5% (-10% -  13%)   0.781
   TermDayOfYearSort               259

[GitHub] [lucene] gsmiller commented on a diff in pull request #1021: LUCENE-10603: Remove SSDV#NO_MORE_ORDS definition

2022-07-15 Thread GitBox



gsmiller commented on code in PR #1021:
URL: https://github.com/apache/lucene/pull/1021#discussion_r922620035


##
lucene/core/src/java/org/apache/lucene/index/SortedSetDocValuesWriter.java:
##
@@ -391,11 +386,7 @@ public boolean advanceExact(int target) throws IOException {
 
     @Override
     public long nextOrd() {
-      if (limit == ordUpto) {
-        return NO_MORE_ORDS;
-      } else {
-        return ords.ords.get(ordUpto++);
-      }
+      return ords.ords.get(ordUpto++);

Review Comment:
   @LuXugang yeah, that's exactly right. The caller is now responsible for 
using `docValueCount()` to determine how many values the positioned doc has, 
and shouldn't call `nextOrd()` more than that many times. You're right that we 
could remove `limit` from `SortingSortedNumericDocValues` as well. I just 
missed it. Would you like to follow up with a change to remove it? I can get to 
it next week too. Thanks for pointing this out!
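
The calling convention described above can be sketched in plain Java. This is only an illustration under stated assumptions: `OrdSource`, `OrdIterationSketch`, `collectOrds` and `fromArray` are hypothetical names, not Lucene classes, and the interface merely mimics the two methods mentioned in the comment (`docValueCount()` and `nextOrd()`) for a document that is already positioned.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a positioned sorted-set doc-values iterator (not a Lucene type). */
interface OrdSource {
  int docValueCount(); // number of ordinals available for the current document

  long nextOrd(); // behavior is undefined if called more than docValueCount() times
}

class OrdIterationSketch {
  /** Reads exactly docValueCount() ordinals and never waits for a sentinel value. */
  static List<Long> collectOrds(OrdSource values) {
    int count = values.docValueCount();
    List<Long> ords = new ArrayList<>(count);
    for (int i = 0; i < count; i++) {
      ords.add(values.nextOrd());
    }
    return ords;
  }

  /** Array-backed source that, like the patched method, performs no bounds check of its own. */
  static OrdSource fromArray(long[] ords) {
    return new OrdSource() {
      private int upto = 0;

      @Override
      public int docValueCount() {
        return ords.length;
      }

      @Override
      public long nextOrd() {
        return ords[upto++];
      }
    };
  }

  public static void main(String[] args) {
    System.out.println(collectOrds(fromArray(new long[] {3L, 7L, 42L})));
  }
}
```

The loop is bounded by the count rather than by a terminating value, so the source no longer needs a `limit` field to know when to stop.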



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene-jira-archive] mocobeta closed issue #15: Make a script to add comments to all Jira issues to indicate that "this was moved to GitHub"

2022-07-15 Thread GitBox



mocobeta closed issue #15: Make a script to add comments to all Jira issues to 
indicate that "this was moved to GitHub"
URL: https://github.com/apache/lucene-jira-archive/issues/15



[GitHub] [lucene-jira-archive] mocobeta commented on issue #15: Make a script to add comments to all Jira issues to indicate that "this was moved to GitHub"

2022-07-15 Thread GitBox



mocobeta commented on issue #15:
URL: 
https://github.com/apache/lucene-jira-archive/issues/15#issuecomment-1186068253

   Merged #15
   
   Note that this sends a notification to the issues@ mailing list for every 
POST API call. I think it's fine to let users know about the migration, though 
an announcement for devs would be needed beforehand.
   
   This would take about 2.5 hours. I think it'd be okay to include this 
in the migration process (the whole process takes days).



[GitHub] [lucene-jira-archive] mocobeta merged pull request #47: Add a script to add 'moved to' comments to Jira issues

2022-07-15 Thread GitBox



mocobeta merged PR #47:
URL: https://github.com/apache/lucene-jira-archive/pull/47



[GitHub] [lucene-jira-archive] mocobeta commented on pull request #47: Add a script to add 'moved to' comments to Jira issues

2022-07-15 Thread GitBox



mocobeta commented on PR #47:
URL: 
https://github.com/apache/lucene-jira-archive/pull/47#issuecomment-1186062124

   This appears to work fine.



[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-07-15 Thread Tomoko Uchida (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567435#comment-17567435
 ] 

Tomoko Uchida commented on LUCENE-10557:


[TEST] This was moved to GitHub issue: 
https://github.com/mocobeta/migration-test-3/issues/196.

> Migrate to GitHub issue from Jira
> -
>
> Key: LUCENE-10557
> URL: https://issues.apache.org/jira/browse/LUCENE-10557
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: Screen Shot 2022-06-29 at 11.02.35 AM.png, 
> image-2022-06-29-13-36-57-365.png, screenshot-1.png
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> A few (not the majority) Apache projects already use the GitHub issue instead 
> of Jira. For example,
> Airflow: [https://github.com/apache/airflow/issues]
> BookKeeper: [https://github.com/apache/bookkeeper/issues]
> So I think it'd be technically possible that we move to GitHub issue. I have 
> little knowledge of how to proceed with it, I'd like to discuss whether we 
> should migrate to it, and if so, how to smoothly handle the migration.
> The major tasks would be:
>  * (/) Get a consensus about the migration among committers
>  * (/) Choose issues that should be moved to GitHub - We'll migrate all 
> issues towards an atomic switch to GitHub if no major technical obstacles 
> show up.
>  ** Discussion thread 
> [https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12]
>  ** -Conclusion for now: We don't migrate any issues. Only new issues should 
> be opened on GitHub.-
>  ** Write a prototype migration script - the decision could be made on that. 
> Things to consider:
>  *** version numbers - labels or milestones?
>  *** add a comment/ prepend a link to the source Jira issue on github side,
>  *** add a comment/ prepend a link on the jira side to the new issue on 
> github side (for people who access jira from blogs, mailing list archives and 
> other sources that will have stale links),
>  *** convert cross-issue automatic links in comments/ descriptions (as 
> suggested by Robert),
>  *** strategy to deal with sub-issues (hierarchies),
>  *** maybe prefix (or postfix) the issue title on github side with the 
> original LUCENE-XYZ key so that it is easier to search for a particular issue 
> there?
>  *** how to deal with user IDs (author, reporter, commenters)? Do they have 
> to be github users? Will information about people not registered on github be 
> lost?
>  *** create an extra mapping file of old-issue-new-issue URLs for any 
> potential future uses.
>  *** what to do with issue numbers in git/svn commits? These could be 
> rewritten but it'd change the entire git history tree - I don't think this is 
> practical, while doable.
> * Prepare a complete migration tool
> ** See https://github.com/apache/lucene-jira-archive/issues/5 
> * Build the convention for issue label/milestone management
>  ** See [https://github.com/apache/lucene-jira-archive/issues/6]
>  ** Do some experiments on a sandbox repository 
> [https://github.com/mocobeta/sandbox-lucene-10557]
>  ** Make documentation for metadata (label/milestone) management 
>  * (/) Enable Github issue on the lucene's repository
>  ** Raise an issue on INFRA
>  ** (Create an issue-only private repository for sensitive issues if it's 
> needed and allowed)
>  ** Set a mail hook to 
> [issues@lucene.apache.org|mailto:issues@lucene.apache.org] (many thanks to 
> the general mail group name)
>  * Set a schedule for migration
>  ** See [https://github.com/apache/lucene-jira-archive/issues/7]
>  ** Give some time to committers to play around with issues/labels/milestones 
> before the actual migration
>  ** Make an announcement on the mail lists
>  ** Show some text messages when opening a new Jira issue (in issue template?)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (LUCENE-10622) Prepare complete migration script to GitHub issue from Jira (best effort)

2022-07-15 Thread Tomoko Uchida (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567434#comment-17567434
 ] 

Tomoko Uchida commented on LUCENE-10622:


[TEST] This was moved to GitHub issue: 
https://github.com/mocobeta/migration-test-3/issues/61.

> Prepare complete migration script to GitHub issue from Jira (best effort)
> -
>
> Key: LUCENE-10622
> URL: https://issues.apache.org/jira/browse/LUCENE-10622
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: test-1.txt, test.txt
>
>
> If we intend to move the history to GitHub, it should be perfect as far as 
> possible - significantly degraded copies of history are harmful, rather than 
> helpful for future contributors, I think.




[GitHub] [lucene-jira-archive] mocobeta opened a new pull request, #47: Add a script to add 'moved to' comments to Jira issues

2022-07-15 Thread GitBox



mocobeta opened a new pull request, #47:
URL: https://github.com/apache/lucene-jira-archive/pull/47

   With a [personal access 
token](https://confluence.atlassian.com/enterprise/using-personal-access-tokens-1026032365.html)
 (of a committer?), it is possible to add comments to ASF Jira issues.
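
A minimal sketch of what such a call could look like, assuming the Jira Server REST endpoint `POST /rest/api/2/issue/{issueKey}/comment` and a personal access token sent as a `Bearer` header. The class and method names (`JiraCommentSketch`, `buildCommentRequest`, `jsonEscape`) are hypothetical and are not taken from the script in this pull request; the token value is a placeholder. The sketch only builds the request and does not send it.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class JiraCommentSketch {
  /** Minimal JSON string escaping for backslash, double quote and newline only. */
  static String jsonEscape(String s) {
    return s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
  }

  /** Builds (but does not send) a request that would add one comment to one issue. */
  static HttpRequest buildCommentRequest(
      String baseUrl, String issueKey, String token, String comment) {
    String body = "{\"body\": \"" + jsonEscape(comment) + "\"}";
    return HttpRequest.newBuilder()
        .uri(URI.create(baseUrl + "/rest/api/2/issue/" + issueKey + "/comment"))
        .header("Authorization", "Bearer " + token) // personal access token (placeholder)
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();
  }

  public static void main(String[] args) {
    HttpRequest request =
        buildCommentRequest(
            "https://issues.apache.org/jira",
            "LUCENE-10622",
            "placeholder-token",
            "[TEST] This was moved to GitHub issue: "
                + "https://github.com/mocobeta/migration-test-3/issues/61.");
    System.out.println(request.method() + " " + request.uri());
  }
}
```

Sending the request would require `java.net.http.HttpClient` and a real token, and each successful call would add one comment to the target issue.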



[jira] [Commented] (LUCENE-10622) Prepare complete migration script to GitHub issue from Jira (best effort)

2022-07-15 Thread Tomoko Uchida (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567432#comment-17567432
 ] 

Tomoko Uchida commented on LUCENE-10622:


[TEST] This was moved to GitHub issue: 
https://github.com/mocobeta/migration-test-3/issues/61.

> Prepare complete migration script to GitHub issue from Jira (best effort)
> -
>
> Key: LUCENE-10622
> URL: https://issues.apache.org/jira/browse/LUCENE-10622
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: test-1.txt, test.txt
>
>
> If we intend to move the history to GitHub, it should be perfect as far as 
> possible - significantly degraded copies of history are harmful, rather than 
> helpful for future contributors, I think.




[jira] [Commented] (LUCENE-10622) Prepare complete migration script to GitHub issue from Jira (best effort)

2022-07-15 Thread Tomoko Uchida (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567431#comment-17567431
 ] 

Tomoko Uchida commented on LUCENE-10622:


[TEST] This was moved to GitHub issue: 
https://github.com/mocobeta/migration-test-3/issues/61.

> Prepare complete migration script to GitHub issue from Jira (best effort)
> -
>
> Key: LUCENE-10622
> URL: https://issues.apache.org/jira/browse/LUCENE-10622
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: test-1.txt, test.txt
>
>
> If we intend to move the history to GitHub, it should be perfect as far as 
> possible - significantly degraded copies of history are harmful, rather than 
> helpful for future contributors, I think.




[GitHub] [lucene-jira-archive] mikemccand commented on issue #15: Make a script to add comments to all Jira issues to indicate that "this was moved to GitHub"

2022-07-15 Thread GitBox



mikemccand commented on issue #15:
URL: 
https://github.com/apache/lucene-jira-archive/issues/15#issuecomment-1186048071

   It's possible this might still work even once we've made Jira "read-only" by 
disabling all workflows!  Then we can take our time after the migration (and 
Jira becoming read-only) to append these comments to all Jira issues.



[jira] [Commented] (LUCENE-10622) Prepare complete migration script to GitHub issue from Jira (best effort)

2022-07-15 Thread Tomoko Uchida (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567429#comment-17567429
 ] 

Tomoko Uchida commented on LUCENE-10622:


This is a test comment from API.

> Prepare complete migration script to GitHub issue from Jira (best effort)
> -
>
> Key: LUCENE-10622
> URL: https://issues.apache.org/jira/browse/LUCENE-10622
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: test-1.txt, test.txt
>
>
> If we intend to move the history to GitHub, it should be perfect as far as 
> possible - significantly degraded copies of history are harmful, rather than 
> helpful for future contributors, I think.




[GitHub] [lucene-jira-archive] mocobeta commented on issue #15: Make a script to add comments to all Jira issues to indicate that "this was moved to GitHub"

2022-07-15 Thread GitBox



mocobeta commented on issue #15:
URL: 
https://github.com/apache/lucene-jira-archive/issues/15#issuecomment-1186041151

   Adding a comment: examples
   
https://developer.atlassian.com/server/jira/platform/jira-rest-api-examples/#adding-a-comment--examples



[GitHub] [lucene-jira-archive] mocobeta closed issue #26: Make a script to set colors and descriptions for labels

2022-07-15 Thread GitBox



mocobeta closed issue #26: Make a script to set colors and descriptions for 
labels
URL: https://github.com/apache/lucene-jira-archive/issues/26



[GitHub] [lucene-jira-archive] mocobeta merged pull request #46: Add a script to update issue labels and descriptions

2022-07-15 Thread GitBox



mocobeta merged PR #46:
URL: https://github.com/apache/lucene-jira-archive/pull/46



[GitHub] [lucene-jira-archive] mocobeta opened a new pull request, #46: Add a script to update issue labels and descriptions

2022-07-15 Thread GitBox



mocobeta opened a new pull request, #46:
URL: https://github.com/apache/lucene-jira-archive/pull/46

   Close #26 
   
   



[jira] [Commented] (LUCENE-10655) can we optimize visited bitset usage in HNSW graph search/indexing?

2022-07-15 Thread Michael Sokolov (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567366#comment-17567366
 ] 

Michael Sokolov commented on LUCENE-10655:
--

meh, I tried a few things, but nothing really moved the needle.

> can we optimize visited bitset usage in HNSW graph search/indexing?
> ---
>
> Key: LUCENE-10655
> URL: https://issues.apache.org/jira/browse/LUCENE-10655
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/hnsw
>Reporter: Michael Sokolov
>Priority: Major
>
> When running {{luceneutil}}  I noticed that {{FixedBitSet.clear()}} dominates 
> the CPU profiler output. I had a few ideas:
>  # In upper graph layers, the occupied nodes are very sparse - maybe 
> {{SparseFixedBitSet}} would be a better fit for those
>  # We are caching these bitsets, but they are only used for a single search 
> (single document insert, during indexing). Should we cache across searches? 
> We would need to pool them though, and they would vary by field since fields 
> can have different numbers of vector nodes. This starts to get complex
>  # Are we sure that clearing a bitset is more efficient than allocating a new 
> one? Maybe the JDK maintains a pool of already-zeroed memory for us
> I think we could try specializing the bitset type by graph level, and then I 
> think we ought to measure the performance of allocation vs the limited reuse 
> that we currently have.
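
The two strategies weighed in the description (clearing one cached bitset before each search versus allocating a fresh one) can be sketched as follows. This is an illustration only: it uses `java.util.BitSet` as a stand-in for Lucene's bitset classes, and `VisitedSetSketch`, `acquireReused`, `acquireFresh` and `visit` are hypothetical names. It shows that both strategies behave identically from the caller's side; it makes no claim about which is faster.

```java
import java.util.BitSet;

class VisitedSetSketch {
  private final int numNodes;
  private final BitSet reused;

  VisitedSetSketch(int numNodes) {
    this.numNodes = numNodes;
    this.reused = new BitSet(numNodes);
  }

  /** Strategy 1: keep one bitset for the lifetime of this object and clear it per search. */
  BitSet acquireReused() {
    reused.clear();
    return reused;
  }

  /** Strategy 2: allocate a fresh, already-zeroed bitset for every search. */
  BitSet acquireFresh() {
    return new BitSet(numNodes);
  }

  /** Marks the given nodes as visited and returns how many were not visited before. */
  static int visit(BitSet visited, int[] nodes) {
    int newlyVisited = 0;
    for (int node : nodes) {
      if (!visited.get(node)) {
        visited.set(node);
        newlyVisited++;
      }
    }
    return newlyVisited;
  }

  public static void main(String[] args) {
    VisitedSetSketch sketch = new VisitedSetSketch(1_000);
    int[] candidates = {1, 2, 2, 5};
    // Both strategies start every search from an empty visited set.
    System.out.println(visit(sketch.acquireReused(), candidates));
    System.out.println(visit(sketch.acquireFresh(), candidates));
  }
}
```

Measuring the two against each other would need a proper benchmark harness, since allocation and clearing costs depend on the set size and on the JVM.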




[GitHub] [lucene] rmuir commented on pull request #947: LUCENE-10577: enable quantization of HNSW vectors to 8 bits

2022-07-15 Thread GitBox



rmuir commented on PR #947:
URL: https://github.com/apache/lucene/pull/947#issuecomment-1185833705

   I think the title of the PR is wrong? We shouldn't be quantizing anything. 
The user should be supplying a `byte[]` vector for 8-bit vectors. Floats should 
not be involved.



[jira] [Comment Edited] (LUCENE-10655) can we optimize visited bitset usage in HNSW graph search/indexing?

2022-07-15 Thread Michael Sokolov (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567358#comment-17567358
 ] 

Michael Sokolov edited comment on LUCENE-10655 at 7/15/22 6:41 PM:
---

OK I was confused, and in fact we already do use SparseFixedBitSet for every 
layer, and we re-use the same one for the lifetime of a HnswGraphBuilder, which 
processes all the vectors. And I tried allocating afresh rather than clear-ing, 
and it was a bit slower. FixedBitSet.clear() is a hot-spot but it's not really 
clear what's to be done about it.

Perhaps since we are re-using we could try using a fully-allocated FixedBitSet 
(not sparse) when indexing? My concern is that over the lifetime of indexing 
many vectors, the sparse bit set will eventually become dense, but 
inefficiently. Oh I see - in fact that *is* what we do. Okay, returning to this 
again, I think I will try using that one for the fully-populated level only


was (Author: sokolov):
OK I was confused, and in fact we already do use SparseFixedBitSet for every 
layer, and we re-use the same one for the lifetime of a HnswGraphBuilder, which 
processes all the vectors. And I tried allocating afresh rather than clear-ing, 
and it was a bit slower. FixedBitSet.clear() is a hot-spot but it's not really 
clear what's to be done about it.

Perhaps since we are re-using we could try using a fully-allocated FixedBitSet 
(not sparse) when indexing? My concern is that over the lifetime of indexing 
many vectors, the sparse bit set will eventually become dense, but 
inefficiently.

> can we optimize visited bitset usage in HNSW graph search/indexing?
> ---
>
> Key: LUCENE-10655
> URL: https://issues.apache.org/jira/browse/LUCENE-10655
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/hnsw
>Reporter: Michael Sokolov
>Priority: Major
>
> When running {{luceneutil}}  I noticed that {{FixedBitSet.clear()}} dominates 
> the CPU profiler output. I had a few ideas:
>  # In upper graph layers, the occupied nodes are very sparse - maybe 
> {{SparseFixedBitSet}} would be a better fit for those
>  # We are caching these bitsets, but they are only used for a single search 
> (single document insert, during indexing). Should we cache across searches? 
> We would need to pool them though, and they would vary by field since fields 
> can have different numbers of vector nodes. This starts to get complex
>  # Are we sure that clearing a bitset is more efficient than allocating a new 
> one? Maybe the JDK maintains a pool of already-zeroed memory for us
> I think we could try specializing the bitset type by graph level, and then I 
> think we ought to measure the performance of allocation vs the limited reuse 
> that we currently have.




[jira] [Comment Edited] (LUCENE-10655) can we optimize visited bitset usage in HNSW graph search/indexing?

2022-07-15 Thread Michael Sokolov (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567358#comment-17567358
 ] 

Michael Sokolov edited comment on LUCENE-10655 at 7/15/22 6:39 PM:
---

OK I was confused, and in fact we already do use SparseFixedBitSet for every 
layer, and we re-use the same one for the lifetime of a HnswGraphBuilder, which 
processes all the vectors. And I tried allocating afresh rather than clear-ing, 
and it was a bit slower. FixedBitSet.clear() is a hot-spot but it's not really 
clear what's to be done about it.

Perhaps since we are re-using we could try using a fully-allocated FixedBitSet 
(not sparse) when indexing? My concern is that over the lifetime of indexing 
many vectors, the sparse bit set will eventually become dense, but 
inefficiently.


was (Author: sokolov):
OK I was confused, and in fact we already do use SparseFixedBitSet for every 
layer. And I tried allocating afresh rather than clear-ing, and it was a bit 
slower. FixedBitSet.clear() is a hot-spot but it's not really clear what's to 
be done about it.

> can we optimize visited bitset usage in HNSW graph search/indexing?
> ---
>
> Key: LUCENE-10655
> URL: https://issues.apache.org/jira/browse/LUCENE-10655
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/hnsw
>Reporter: Michael Sokolov
>Priority: Major
>
> When running {{luceneutil}}  I noticed that {{FixedBitSet.clear()}} dominates 
> the CPU profiler output. I had a few ideas:
>  # In upper graph layers, the occupied nodes are very sparse - maybe 
> {{SparseFixedBitSet}} would be a better fit for those
>  # We are caching these bitsets, but they are only used for a single search 
> (single document insert, during indexing). Should we cache across searches? 
> We would need to pool them though, and they would vary by field since fields 
> can have different numbers of vector nodes. This starts to get complex
>  # Are we sure that clearing a bitset is more efficient than allocating a new 
> one? Maybe the JDK maintains a pool of already-zeroed memory for us
> I think we could try specializing the bitset type by graph level, and then I 
> think we ought to measure the performance of allocation vs the limited reuse 
> that we currently have.




[jira] [Commented] (LUCENE-10655) can we optimize visited bitset usage in HNSW graph search/indexing?

2022-07-15 Thread Michael Sokolov (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567358#comment-17567358
 ] 

Michael Sokolov commented on LUCENE-10655:
--

OK I was confused, and in fact we already do use SparseFixedBitSet for every 
layer. And I tried allocating afresh rather than clear-ing, and it was a bit 
slower. FixedBitSet.clear() is a hot-spot but it's not really clear what's to 
be done about it.

> can we optimize visited bitset usage in HNSW graph search/indexing?
> ---
>
> Key: LUCENE-10655
> URL: https://issues.apache.org/jira/browse/LUCENE-10655
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/hnsw
>Reporter: Michael Sokolov
>Priority: Major
>
> When running {{luceneutil}}  I noticed that {{FixedBitSet.clear()}} dominates 
> the CPU profiler output. I had a few ideas:
>  # In upper graph layers, the occupied nodes are very sparse - maybe 
> {{SparseFixedBitSet}} would be a better fit for those
>  # We are caching these bitsets, but they are only used for a single search 
> (single document insert, during indexing). Should we cache across searches? 
> We would need to pool them though, and they would vary by field since fields 
> can have different numbers of vector nodes. This starts to get complex
>  # Are we sure that clearing a bitset is more efficient than allocating a new 
> one? Maybe the JDK maintains a pool of already-zeroed memory for us
> I think we could try specializing the bitset type by graph level, and then I 
> think we ought to measure the performance of allocation vs the limited reuse 
> that we currently have.




[jira] [Commented] (LUCENE-10633) Dynamic pruning for queries sorted by SORTED(_SET) field

2022-07-15 Thread Michael Sokolov (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567334#comment-17567334
 ] 

Michael Sokolov commented on LUCENE-10633:
--

Adrien that's crazy !

> Dynamic pruning for queries sorted by SORTED(_SET) field
> 
>
> Key: LUCENE-10633
> URL: https://issues.apache.org/jira/browse/LUCENE-10633
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> LUCENE-9280 introduced the ability to dynamically prune non-competitive hits 
> when sorting by a numeric field, by leveraging the points index to skip 
> documents that do not compare better than the top of the priority queue 
> maintained by the field comparator.
> However queries sorted by a SORTED(_SET) field still look at all hits, which 
> is disappointing. Could we leverage the terms index to skip hits?




[jira] [Commented] (LUCENE-10151) Add timeout support to IndexSearcher

2022-07-15 Thread Michael Sokolov (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567330#comment-17567330
 ] 

Michael Sokolov commented on LUCENE-10151:
--

> Did you forget to push to branch_9x? I cannot see the change there.

Yes! Thanks - pushed now

> Add timeout support to IndexSearcher
> 
>
> Key: LUCENE-10151
> URL: https://issues.apache.org/jira/browse/LUCENE-10151
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Greg Miller
>Priority: Minor
> Fix For: 9.3
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> I'd like to explore adding optional "timeout" capabilities to 
> {{IndexSearcher}}. This would enable users to (optionally) specify a maximum 
> time budget for search execution. If the search "times out", partial results 
> would be available.
> This idea originated on the dev list (thanks [~jpountz] for the suggestion). 
> Thread for reference: 
> [http://mail-archives.apache.org/mod_mbox/lucene-dev/202110.mbox/%3CCAL8PwkZdNGmYJopPjeXYK%3DF7rvLkWon91UEXVxMM4MeeJ3UHxQ%40mail.gmail.com%3E]
>  
> A couple things to watch out for with this change:
>  # We want to make sure it's robust to a two-phase query evaluation scenario 
> where the "approximate" step matches a large number of candidates but the 
> "confirmation" step matches very few (or none). This is a particularly tricky 
> case.
>  # We want to make sure the {{TotalHits#Relation}} reported by {{TopDocs}} is 
> {{GREATER_THAN_OR_EQUAL_TO}} if the query times out
>  # We want to make sure it plays nice with the {{LRUCache}} since it iterates 
> the query to pre-populate a {{BitSet}} when caching. That step shouldn't be 
> allowed to overrun the timeout. The proper way to handle this probably needs 
> some thought.
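The first two caveats above can be illustrated with a minimal, self-contained sketch. It does not use Lucene's API: `Relation`, `Result`, and `countWithTimeout` are hypothetical names chosen for illustration, and the injected clock is an assumption made so the behaviour is deterministic. The deadline is checked for every candidate rather than for every confirmed match, so a query whose "confirmation" step rarely matches still stops on time, and a timed-out run reports its count as a lower bound.

```java
import java.util.function.IntPredicate;
import java.util.function.LongSupplier;

public class TimeoutSearchSketch {

  /** Hypothetical mirror of the two hit-count relations named in the issue. */
  public enum Relation { EQUAL_TO, GREATER_THAN_OR_EQUAL_TO }

  /** Partial or complete hit count together with its relation. */
  public static final class Result {
    public final int hits;
    public final Relation relation;

    Result(int hits, Relation relation) {
      this.hits = hits;
      this.relation = relation;
    }
  }

  /**
   * Counts candidates accepted by {@code confirm}, stopping once the time budget is spent.
   * The clock is consulted once per candidate, before the confirmation step runs.
   */
  public static Result countWithTimeout(
      int[] candidates, IntPredicate confirm, LongSupplier nanoClock, long budgetNanos) {
    long deadline = nanoClock.getAsLong() + budgetNanos;
    int hits = 0;
    for (int doc : candidates) {
      if (nanoClock.getAsLong() - deadline > 0) {
        // Timed out: the count so far is only a lower bound on the true total.
        return new Result(hits, Relation.GREATER_THAN_OR_EQUAL_TO);
      }
      if (confirm.test(doc)) {
        hits++;
      }
    }
    return new Result(hits, Relation.EQUAL_TO);
  }

  public static void main(String[] args) {
    int[] candidates = new int[1_000_000];
    for (int i = 0; i < candidates.length; i++) {
      candidates[i] = i;
    }
    // Confirmation matches very few candidates; the budget still bounds the work done.
    Result r = countWithTimeout(candidates, d -> d % 100_000 == 0, System::nanoTime, 1_000_000L);
    System.out.println(r.hits + " hits, relation " + r.relation);
  }
}
```

Reading the clock on every candidate is the simplest choice for a sketch; a real implementation would probably amortize the check over a batch of documents. The third caveat, cache pre-population, is not covered here.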




[jira] [Commented] (LUCENE-10151) Add timeout support to IndexSearcher

2022-07-15 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567328#comment-17567328
 ] 

ASF subversion and git services commented on LUCENE-10151:
--

Commit aa082b46f669f71cd0deb2e409c62be863f17091 in lucene's branch 
refs/heads/branch_9x from Deepika0510
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=aa082b46f66 ]

LUCENE-10151: Adding Timeout Support to IndexSearcher  (#927)

Authored-by: Deepika Sharma 

> Add timeout support to IndexSearcher
> 
>
> Key: LUCENE-10151
> URL: https://issues.apache.org/jira/browse/LUCENE-10151
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Greg Miller
>Priority: Minor
> Fix For: 9.3
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> I'd like to explore adding optional "timeout" capabilities to 
> {{IndexSearcher}}. This would enable users to (optionally) specify a maximum 
> time budget for search execution. If the search "times out", partial results 
> would be available.
> This idea originated on the dev list (thanks [~jpountz] for the suggestion). 
> Thread for reference: 
> [http://mail-archives.apache.org/mod_mbox/lucene-dev/202110.mbox/%3CCAL8PwkZdNGmYJopPjeXYK%3DF7rvLkWon91UEXVxMM4MeeJ3UHxQ%40mail.gmail.com%3E]
>  
> A couple things to watch out for with this change:
>  # We want to make sure it's robust to a two-phase query evaluation scenario 
> where the "approximate" step matches a large number of candidates but the 
> "confirmation" step matches very few (or none). This is a particularly tricky 
> case.
>  # We want to make sure the {{TotalHits#Relation}} reported by {{TopDocs}} is 
> {{GREATER_THAN_OR_EQUAL_TO}} if the query times out
>  # We want to make sure it plays nice with the {{LRUCache}} since it iterates 
> the query to pre-populate a {{BitSet}} when caching. That step shouldn't be 
> allowed to overrun the timeout. The proper way to handle this probably needs 
> some thought.




[jira] [Commented] (LUCENE-10151) Add timeout support to IndexSearcher

2022-07-15 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567329#comment-17567329
 ] 

ASF subversion and git services commented on LUCENE-10151:
--

Commit 5cd6eda8caba5a93eeaf60215885ec3171707449 in lucene's branch 
refs/heads/branch_9x from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=5cd6eda8cab ]

CHANGES entry for LUCENE-10151


> Add timeout support to IndexSearcher
> 
>
> Key: LUCENE-10151
> URL: https://issues.apache.org/jira/browse/LUCENE-10151
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Greg Miller
>Priority: Minor
> Fix For: 9.3
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> I'd like to explore adding optional "timeout" capabilities to 
> {{IndexSearcher}}. This would enable users to (optionally) specify a maximum 
> time budget for search execution. If the search "times out", partial results 
> would be available.
> This idea originated on the dev list (thanks [~jpountz] for the suggestion). 
> Thread for reference: 
> [http://mail-archives.apache.org/mod_mbox/lucene-dev/202110.mbox/%3CCAL8PwkZdNGmYJopPjeXYK%3DF7rvLkWon91UEXVxMM4MeeJ3UHxQ%40mail.gmail.com%3E]
>  
> A couple things to watch out for with this change:
>  # We want to make sure it's robust to a two-phase query evaluation scenario 
> where the "approximate" step matches a large number of candidates but the 
> "confirmation" step matches very few (or none). This is a particularly tricky 
> case.
>  # We want to make sure the {{TotalHits#Relation}} reported by {{TopDocs}} is 
> {{GREATER_THAN_OR_EQUAL_TO}} if the query times out
>  # We want to make sure it plays nice with the {{LRUCache}} since it iterates 
> the query to pre-populate a {{BitSet}} when caching. That step shouldn't be 
> allowed to overrun the timeout. The proper way to handle this probably needs 
> some thought.




[GitHub] [lucene] nknize commented on pull request #1017: LUCENE-10654: Add new ShapeDocValuesField for LatLonShape and XYShape

2022-07-15 Thread GitBox



nknize commented on PR #1017:
URL: https://github.com/apache/lucene/pull/1017#issuecomment-1185733826

   Hey @iverase; here's the PR related to the ShapeDocValuesField for the 9.3 
release. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-10649) Failure in TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField

2022-07-15 Thread Vigya Sharma (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567317#comment-17567317
 ] 

Vigya Sharma commented on LUCENE-10649:
---

Created https://github.com/apache/lucene/pull/1025

> Failure in TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField
> ---
>
> Key: LUCENE-10649
> URL: https://issues.apache.org/jira/browse/LUCENE-10649
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Vigya Sharma
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Failing Build Link: 
> [https://jenkins.thetaphi.de/job/Lucene-main-Linux/35617/testReport/junit/org.apache.lucene.index/TestDemoParallelLeafReader/testRandomMultipleSchemaGensSameField/]
> Repro:
> {code:java}
> gradlew test --tests 
> TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField 
> -Dtests.seed=A7496D7D3957981A -Dtests.multiplier=3 -Dtests.locale=sr-Latn-BA 
> -Dtests.timezone=Etc/GMT-7 -Dtests.asserts=true -Dtests.file.encoding=UTF-8 
> {code}
> Error:
> {code:java}
> java.lang.AssertionError: expected:<103> but was:<2147483647>
>     at 
> __randomizedtesting.SeedInfo.seed([A7496D7D3957981A:F71866BCCEA1C903]:0)
>     at org.junit.Assert.fail(Assert.java:89)
>     at org.junit.Assert.failNotEquals(Assert.java:835)
>     at org.junit.Assert.assertEquals(Assert.java:647)
>     at org.junit.Assert.assertEquals(Assert.java:633)
>     at 
> org.apache.lucene.index.TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField(TestDemoParallelLeafReader.java:1347)
>     at 
> java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
>  {code}




[GitHub] [lucene] vigyasharma opened a new pull request, #1025: LUCENE-10649: Fix failures in TestDemoParallelLeafReader

2022-07-15 Thread GitBox



vigyasharma opened a new pull request, #1025:
URL: https://github.com/apache/lucene/pull/1025

   With merge-on-refresh enabled, the `ReindexingMergePolicy` in this test 
needs to override `findFullFlushMerges()` to wrap the input reader, so that the 
merged segment gets fields from the parallel readers.
   
   # Testing:
   Ran the test on repeat on my dev box. Without the fix, it fails in a couple 
of runs.
   
   ```java
   % ./gradlew test --tests 
TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField 
-Dtests.seed=A7496D7D3957981A -Dtests.multiplier=3 -Dtests.locale=sr-Latn-BA 
-Dtests.timezone=Etc/GMT-7 -Dtests.asserts=true -Dtests.file.encoding=UTF-8 
-Dtests.iters=500 -Dtests.failfast=true
   ...
   :lucene:core:test (SUCCESS): 500 test(s)
   ...
   ```



[jira] [Created] (LUCENE-10655) can we optimize visited bitset usage in HNSW graph search/indexing?

2022-07-15 Thread Michael Sokolov (Jira)

Michael Sokolov created LUCENE-10655:


 Summary: can we optimize visited bitset usage in HNSW graph 
search/indexing?
 Key: LUCENE-10655
 URL: https://issues.apache.org/jira/browse/LUCENE-10655
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/hnsw
Reporter: Michael Sokolov


When running {{luceneutil}}  I noticed that {{FixedBitSet.clear()}} dominates 
the CPU profiler output. I had a few ideas:
 # In upper graph layers, the occupied nodes are very sparse - maybe 
{{SparseFixedBitSet}} would be a better fit for those
 # We are caching these bitsets, but they are only used for a single search 
(single document insert, during indexing). Should we cache across searches? We 
would need to pool them though, and they would vary by field since fields can 
have different numbers of vector nodes. This starts to get complex
 # Are we sure that clearing a bitset is more efficient than allocating a new 
one? Maybe the JDK maintains a pool of already-zeroed memory for us

I think we could try specializing the bitset type by graph level, and then I 
think we ought to measure the performance of allocation vs the limited reuse 
that we currently have.
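The clear-versus-allocate question in the third idea could be probed with a rough harness like the one below. This is only a sketch under stated assumptions: `SimpleBitSet` is a hypothetical stand-in backed by a `long[]`, not Lucene's `FixedBitSet`, and the node count, search count, and stride in `main` are arbitrary. Timings from `System.nanoTime()` loops like this are noisy, so a proper harness such as JMH or luceneutil would be needed before drawing conclusions.

```java
import java.util.Arrays;

public class VisitedSetSketch {

  /** Minimal fixed-size bitset over a long[]; a stand-in for illustration only. */
  static final class SimpleBitSet {
    final long[] words;

    SimpleBitSet(int numBits) {
      words = new long[(numBits + 63) >>> 6];
    }

    /** Sets bit i and returns whether it was already set. */
    boolean getAndSet(int i) {
      int w = i >>> 6;
      long mask = 1L << i; // shifts on long use only the low 6 bits of i
      boolean wasSet = (words[w] & mask) != 0;
      words[w] |= mask;
      return wasSet;
    }

    void clear() {
      Arrays.fill(words, 0L);
    }
  }

  /** Simulates one search that touches every stride-th node; returns newly visited nodes. */
  static int visit(SimpleBitSet visited, int numNodes, int stride) {
    int count = 0;
    for (int node = 0; node < numNodes; node += stride) {
      if (!visited.getAndSet(node)) {
        count++;
      }
    }
    return count;
  }

  /** Strategy A: one bitset reused across searches and cleared before each one. */
  public static long runWithClear(int numNodes, int searches, int stride) {
    SimpleBitSet visited = new SimpleBitSet(numNodes);
    long total = 0;
    for (int s = 0; s < searches; s++) {
      visited.clear();
      total += visit(visited, numNodes, stride);
    }
    return total;
  }

  /** Strategy B: a freshly allocated (already zeroed) bitset for every search. */
  public static long runWithAllocate(int numNodes, int searches, int stride) {
    long total = 0;
    for (int s = 0; s < searches; s++) {
      total += visit(new SimpleBitSet(numNodes), numNodes, stride);
    }
    return total;
  }

  public static void main(String[] args) {
    int numNodes = 1_000_000, searches = 2_000, stride = 1_000;
    for (int round = 0; round < 5; round++) { // repeat so the JIT has a chance to warm up
      long t0 = System.nanoTime();
      long a = runWithClear(numNodes, searches, stride);
      long t1 = System.nanoTime();
      long b = runWithAllocate(numNodes, searches, stride);
      long t2 = System.nanoTime();
      System.out.printf(
          "clear: %d ms, allocate: %d ms (visited %d / %d)%n",
          (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, a, b);
    }
  }
}
```

Both strategies must report the same visited count; only the cost of getting back to an all-zero bitset differs.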




[GitHub] [lucene] msokolov commented on pull request #947: LUCENE-10577: enable quantization of HNSW vectors to 8 bits

2022-07-15 Thread GitBox



msokolov commented on PR #947:
URL: https://github.com/apache/lucene/pull/947#issuecomment-1185649145

   I pushed an updated luceneutil PR adapting to these changes 
https://github.com/mikemccand/luceneutil/pull/181. Running that perf test I saw 
consistent gains (20-55% depending on the test case) as compared to the earlier 
test runs.
   
   I also noticed that the profiler shows the most expensive function during 
indexing is FixedBitSet.clear(), which makes me think we might want to use 
sparse bitsets for the "upper" layers of the graph, which have many fewer nodes.



[GitHub] [lucene] mocobeta commented on pull request #1024: LUCENE-10557: Add GitHub issue templates

2022-07-15 Thread GitBox



mocobeta commented on PR #1024:
URL: https://github.com/apache/lucene/pull/1024#issuecomment-1185631535

   This is also a proposal for issue management. Feedback is welcome - I'm 
going to merge this in a week to proceed with the migration (if there are no 
comments).



[GitHub] [lucene] mocobeta opened a new pull request, #1024: LUCENE-10557: Add GitHub issue templates

2022-07-15 Thread GitBox



mocobeta opened a new pull request, #1024:
URL: https://github.com/apache/lucene/pull/1024

   ### Description (or a Jira issue link if you have one)
   
   LUCENE-10557
   
   This adds GitHub issue templates and a draft how-to manual for organizing 
issues with labels/templates.



[GitHub] [lucene] jpountz opened a new pull request, #1023: LUCENE-10633: Dynamic pruning for sorting on SORTED(_SET) fields.

2022-07-15 Thread GitBox



jpountz opened a new pull request, #1023:
URL: https://github.com/apache/lucene/pull/1023

   This commit enables dynamic pruning for queries sorted on SORTED(_SET) fields
   by using postings to filter competitive documents.
   
   JIRA: [LUCENE-10633](https://issues.apache.org/jira/browse/LUCENE-10633)
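The idea of filtering by postings can be sketched on in-memory arrays. This is not the code in the PR: `SortedOrdPruningSketch`, `collect`, and the plain `int[][]` postings are hypothetical, each document is assumed to carry exactly one ordinal, the sort is assumed ascending, and ties are assumed to go to the earlier document. Once the top-k queue is full and the number of still-competitive ordinals drops to the threshold, the loop jumps straight to the next document that has one of those ordinals instead of visiting every document.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.PriorityQueue;

public class SortedOrdPruningSketch {

  /** Top-k ordinals in ascending order, plus how many documents were actually visited. */
  public static final class Result {
    public final int[] topOrds;
    public final int visited;

    Result(int[] topOrds, int visited) {
      this.topOrds = topOrds;
      this.visited = visited;
    }
  }

  /** Builds, for each ordinal, the sorted list of doc ids that carry it. */
  static int[][] buildPostings(int[] docOrd, int numOrds) {
    int[] counts = new int[numOrds];
    for (int ord : docOrd) {
      counts[ord]++;
    }
    int[][] postings = new int[numOrds][];
    for (int ord = 0; ord < numOrds; ord++) {
      postings[ord] = new int[counts[ord]];
    }
    int[] fill = new int[numOrds];
    for (int doc = 0; doc < docOrd.length; doc++) {
      postings[docOrd[doc]][fill[docOrd[doc]]++] = doc;
    }
    return postings;
  }

  /** Smallest doc id >= from whose ordinal is strictly below worstOrd, or MAX_VALUE. */
  static int nextCompetitiveDoc(int[][] postings, int worstOrd, int from) {
    int next = Integer.MAX_VALUE;
    for (int ord = 0; ord < worstOrd; ord++) {
      int idx = Arrays.binarySearch(postings[ord], from);
      if (idx < 0) {
        idx = -idx - 1; // insertion point: first posting greater than from
      }
      if (idx < postings[ord].length) {
        next = Math.min(next, postings[ord][idx]);
      }
    }
    return next;
  }

  /** Collects the k smallest ordinals; a negative threshold disables pruning. */
  public static Result collect(int[] docOrd, int numOrds, int k, int threshold) {
    int[][] postings = buildPostings(docOrd, numOrds);
    // Max-heap, so peek() is the worst (largest) ordinal currently in the top k.
    PriorityQueue<Integer> pq = new PriorityQueue<>(Comparator.reverseOrder());
    int visited = 0;
    int doc = 0;
    while (doc < docOrd.length) {
      visited++;
      int ord = docOrd[doc];
      if (pq.size() < k) {
        pq.add(ord);
      } else if (ord < pq.peek()) {
        pq.poll();
        pq.add(ord);
      }
      doc++;
      // Ordinals are dense, so exactly pq.peek() ordinals are still competitive.
      if (pq.size() == k && pq.peek() <= threshold) {
        doc = nextCompetitiveDoc(postings, pq.peek(), doc);
      }
    }
    int[] top = pq.stream().mapToInt(Integer::intValue).sorted().toArray();
    return new Result(top, visited);
  }
}
```

The threshold plays the role of a cutoff on the number of competitive ordinals; the right value for a real index would have to come from benchmarks rather than from this toy.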
   



[jira] [Commented] (LUCENE-10633) Dynamic pruning for queries sorted by SORTED(_SET) field

2022-07-15 Thread Adrien Grand (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567268#comment-17567268
 ] 

Adrien Grand commented on LUCENE-10633:
---

I played with a prototype that starts dynamically pruning matches as soon as 
128 or fewer competitive ordinals are left, by pulling postings to iterate 
over the remaining documents that have competitive values. I still need to 
think about simplifying the logic and improving tests, but the initial benchmarks 
on wikimedium10m are very encouraging (assuming I didn't get anything wrong):

{noformat}
                       Task  QPS baseline   StdDev  QPS my_modified_version   StdDev                Pct diff  p-value
                    Prefix3        248.74   (6.1%)                   242.61   (5.8%)   -2.5% ( -13% -   10%)    0.191
      BrowseMonthTaxoFacets         27.71  (10.1%)                    27.34  (10.6%)   -1.3% ( -20% -   21%)    0.682
       BrowseDateSSDVFacets          4.99  (10.3%)                     4.94   (8.4%)   -1.1% ( -17% -   19%)    0.707
       BrowseDateTaxoFacets         44.26  (12.2%)                    43.97  (13.1%)   -0.7% ( -23% -   28%)    0.870
                   Wildcard        137.61   (3.0%)                   136.97   (2.6%)   -0.5% (  -5% -    5%)    0.592
  BrowseDayOfYearTaxoFacets         45.53  (12.4%)                    45.44  (13.4%)   -0.2% ( -23% -   29%)    0.963
                     IntNRQ        198.27   (8.1%)                   197.94   (7.4%)   -0.2% ( -14% -   16%)    0.946
BrowseRandomLabelSSDVFacets         14.51   (2.2%)                    14.49   (2.4%)   -0.2% (  -4% -    4%)    0.835
   AndHighHighDayTaxoFacets          8.32   (5.1%)                     8.31   (5.7%)   -0.1% ( -10% -   11%)    0.956
                LowSpanNear         46.83   (1.6%)                    46.82   (2.0%)   -0.0% (  -3% -    3%)    0.990
BrowseRandomLabelTaxoFacets         36.18  (10.5%)                    36.18  (12.6%)    0.0% ( -20% -   25%)    0.998
       MedTermDayTaxoFacets         73.59   (4.8%)                    73.66   (5.7%)    0.1% (  -9% -   11%)    0.954
              OrNotHighHigh       1476.08   (5.3%)                  1477.58   (3.9%)    0.1% (  -8% -    9%)    0.945
                 TermDTSort        746.55   (2.4%)                   747.70   (1.7%)    0.2% (  -3% -    4%)    0.817
                     Fuzzy2         96.18   (1.3%)                    96.39   (1.4%)    0.2% (  -2% -    2%)    0.617
    AndHighMedDayTaxoFacets        154.89   (1.8%)                   155.29   (1.6%)    0.3% (  -3% -    3%)    0.629
                 AndHighMed        378.38   (3.7%)                   379.50   (4.4%)    0.3% (  -7% -    8%)    0.817
                   PKLookup        243.14   (1.9%)                   243.99   (1.9%)    0.4% (  -3% -    4%)    0.552
                 HighPhrase        279.13   (2.1%)                   280.21   (1.5%)    0.4% (  -3% -    4%)    0.510
                    Respell         71.59   (1.5%)                    71.87   (1.5%)    0.4% (  -2% -    3%)    0.406
                 OrHighHigh         66.95   (6.5%)                    67.21   (5.7%)    0.4% ( -11% -   13%)    0.837
                     Fuzzy1        101.53   (1.5%)                   101.95   (1.5%)    0.4% (  -2% -    3%)    0.382
                  LowPhrase        101.76   (2.3%)                   102.22   (2.6%)    0.5% (  -4% -    5%)    0.558
            LowSloppyPhrase         21.14   (3.1%)                    21.25   (4.1%)    0.5% (  -6% -    7%)    0.661
                  MedPhrase        173.45   (2.7%)                   174.55   (2.6%)    0.6% (  -4% -    6%)    0.443
                MedSpanNear         17.77   (4.5%)                    17.88   (4.8%)    0.6% (  -8% -   10%)    0.661
               OrHighNotLow       1396.26   (5.6%)                  1406.85   (6.4%)    0.8% ( -10% -   13%)    0.692
                  OrHighMed        162.41   (5.3%)                   163.69   (4.8%)    0.8% (  -8% -   11%)    0.625
      HighTermDayOfYearSort       1476.11   (2.7%)                  1488.26   (2.4%)    0.8% (  -4% -    6%)    0.312
        MedIntervalsOrdered        113.65   (4.2%)                   114.59   (7.0%)    0.8% (  -9% -   12%)    0.652
                  OrHighLow        828.13   (5.2%)                   835.45   (4.7%)    0.9% (  -8% -   11%)    0.574
                    MedTerm       2356.21   (4.7%)                  2377.47   (5.0%)    0.9% (  -8% -   11%)    0.554
            MedSloppyPhrase         62.13   (3.4%)                    62.72   (3.9%)    0.9% (  -6% -    8%)    0.420
       HighIntervalsOrdered         18.19   (5.7%)                    18.37   (8.6%)    1.0% ( -12% -   16%)    0.673
                AndHighHigh         54.46   (6.2%)                    55.01   (6.3%)    1.0% ( -10% -   14%)    0.615
                    LowTerm       2247.13   (4.7%)                  2270.19   (3.7%)    1.0% (  -7% -    9%)    0.446
               OrNotHighLow       1728.71   (4.3%)                  1748.19   (4.7%)    1.1% (  -7% -   10%)    0.427
       HighTermTitleBDVSort         14.31   (3.3%)                    14.47

[GitHub] [lucene] jpountz commented on pull request #1007: Small tweak to PointRangeQuery#visit logic

2022-07-15 Thread GitBox



jpountz commented on PR #1007:
URL: https://github.com/apache/lucene/pull/1007#issuecomment-1185610823

   This looks right to me, hopefully @iverase can confirm.



[GitHub] [lucene] jpountz commented on a diff in pull request #1018: LUCENE-10480: Use BulkScorer to limit BMMScorer to only top-level disjunctions

2022-07-15 Thread GitBox



jpountz commented on code in PR #1018:
URL: https://github.com/apache/lucene/pull/1018#discussion_r97906


##
lucene/core/src/java/org/apache/lucene/search/BooleanWeight.java:
##
@@ -191,6 +191,69 @@ public long cost() {
   // or null if it is not applicable
   // pkg-private for forcing use of BooleanScorer in tests
   BulkScorer optionalBulkScorer(LeafReaderContext context) throws IOException {
+    if (scoreMode == ScoreMode.TOP_SCORES) {
+      if (query.getMinimumNumberShouldMatch() > 1 || weightedClauses.size() > 2) {
+        return null;
+      }
+
+      List<ScorerSupplier> optional = new ArrayList<>();
+      for (WeightedBooleanClause wc : weightedClauses) {
+        Weight w = wc.weight;
+        BooleanClause c = wc.clause;
+        if (c.getOccur() != Occur.SHOULD) {
+          continue;
+        }
+        ScorerSupplier scorer = w.scorerSupplier(context);
+        if (scorer != null) {
+          optional.add(scorer);
+        }
+      }
+
+      if (optional.size() <= 1) {
+        return null;
+      }
+
+      List<Scorer> optionalScorers = new ArrayList<>();
+      for (ScorerSupplier ss : optional) {
+        optionalScorers.add(ss.get(Long.MAX_VALUE));
+      }
+
+      return new BulkScorer() {

Review Comment:
   I wonder if we could reuse `DefaultBulkScorer` instead of this anonymous 
bulk scorer?




[GitHub] [lucene-jira-archive] mocobeta commented on issue #6: Document issue label / template management policy

2022-07-15 Thread GitBox



mocobeta commented on issue #6:
URL: 
https://github.com/apache/lucene-jira-archive/issues/6#issuecomment-1185547448

   Issue templates proposal:
   - Bug Report - this is associated with `type:bug` label
   - Enhancement Request - this is associated with `type:enhancement` label
   - Test Failure Report - this is associated with `type:testFailure` label
   - Task - this is associated with `type:task` label
   - Documentation - this is associated with `type:documentation` label
   - Question - this creates no issues; is used to guide users to mailing lists



[GitHub] [lucene-jira-archive] mocobeta closed issue #45: Bug

2022-07-15 Thread GitBox



mocobeta closed issue #45: Bug
URL: https://github.com/apache/lucene-jira-archive/issues/45



[GitHub] [lucene-jira-archive] mocobeta commented on issue #45: Bug

2022-07-15 Thread GitBox



mocobeta commented on issue #45:
URL: 
https://github.com/apache/lucene-jira-archive/issues/45#issuecomment-1185540048

   Ok, the label was set as expected. (looks like it has to be created 
beforehand).



[GitHub] [lucene-jira-archive] mocobeta opened a new issue, #45: Bug

2022-07-15 Thread GitBox



mocobeta opened a new issue, #45:
URL: https://github.com/apache/lucene-jira-archive/issues/45

   ### Description
   
   This is a test issue (take 2).
   
   ### Version and Environments
   
   _No response_



[GitHub] [lucene-jira-archive] mocobeta commented on issue #44: Bug

2022-07-15 Thread GitBox



mocobeta commented on issue #44:
URL: 
https://github.com/apache/lucene-jira-archive/issues/44#issuecomment-1185537568

   The issue label was not set...



[GitHub] [lucene-jira-archive] mocobeta closed issue #44: Bug

2022-07-15 Thread GitBox



mocobeta closed issue #44: Bug
URL: https://github.com/apache/lucene-jira-archive/issues/44



[GitHub] [lucene-jira-archive] mocobeta opened a new issue, #44: Bug

2022-07-15 Thread GitBox



mocobeta opened a new issue, #44:
URL: https://github.com/apache/lucene-jira-archive/issues/44

   ### Description
   
   This is a test bug report.
   
   ### Version and Environments
   
   _No response_



[GitHub] [lucene-jira-archive] mocobeta commented on issue #4: Which GitHub accont we should/can use for migration?

2022-07-15 Thread GitBox



mocobeta commented on issue #4:
URL: 
https://github.com/apache/lucene-jira-archive/issues/4#issuecomment-1185506588

   There have been no additional comments/requests.
   I decided to use my account for the second pass (updating step after 
importing) since I don't think we should bother infra with asking to run a 
time-consuming job that can be done by ourselves.



[GitHub] [lucene-jira-archive] mocobeta commented on issue #3: Create mapping on Jira user id -> GitHub account

2022-07-15 Thread GitBox



mocobeta commented on issue #3:
URL: 
https://github.com/apache/lucene-jira-archive/issues/3#issuecomment-1185499857

   Tasks to be done:
   
   - [ ] regenerate a candidate mapping (on July 24th)
   - [ ] manually make a "verified" mapping and commit it to `main` (on July 
24th or 25th)
   - [ ] send a mail to the dev list to let others browse/check both 
"candidate" and "verified" mappings (on July 25th)
   - [ ] accept pull requests to add/edit the mapping
   - [ ] fix the final mapping (on August 7th)



[GitHub] [lucene-jira-archive] mocobeta merged pull request #34: Add a tool to generate account mapping

2022-07-15 Thread GitBox



mocobeta merged PR #34:
URL: https://github.com/apache/lucene-jira-archive/pull/34



[GitHub] [lucene-jira-archive] mocobeta commented on pull request #34: Add a tool to generate account mapping

2022-07-15 Thread GitBox



mocobeta commented on PR #34:
URL: 
https://github.com/apache/lucene-jira-archive/pull/34#issuecomment-1185491422

   Here are the regenerated candidate and verified mappings (verified with [the above 
criteria](https://github.com/apache/lucene-jira-archive/pull/34#issuecomment-1185313945)).
   
https://github.com/apache/lucene-jira-archive/pull/34/commits/b44bd73626fc9490b0da9437e54c156f4a361b32
   
   - 5792 candidate mappings
   - 163 verified mappings



[jira] [Commented] (LUCENE-7713) Optimize TopFieldDocCollector for the sorted case

2022-07-15 Thread Lu Xugang (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-7713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567202#comment-17567202
 ] 

Lu Xugang commented on LUCENE-7713:
---

Hi [~jpountz], it seems there is no need to compare or collect through the 
PriorityQueue when the search sort order is a prefix of the index sort order. 
This issue is still open; should we do this optimization?



> Optimize TopFieldDocCollector for the sorted case
> -
>
> Key: LUCENE-7713
> URL: https://issues.apache.org/jira/browse/LUCENE-7713
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> When the sort order is a prefix of the index sort order, 
> {{TopFieldDocCollector}} could skip reading doc values and comparing them 
> against the bottom value after {{numHits}} documents have been collected, and 
> just count matches.
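
A minimal sketch of the prefix condition described above, written in plain Java without any Lucene dependency. `SortPrefixCheck`, `SortKey`, `isPrefix`, and `docsToCollect` are hypothetical names used only for illustration; they are not Lucene classes or methods, and the per-segment counting is an assumption about how the optimization could work, not the actual implementation.

```java
import java.util.List;
import java.util.Objects;

/** Illustrative only: SortPrefixCheck and SortKey are hypothetical, not Lucene APIs. */
final class SortPrefixCheck {

  /** One sort criterion: a field name plus its direction. */
  static final class SortKey {
    final String field;
    final boolean reverse;

    SortKey(String field, boolean reverse) {
      this.field = Objects.requireNonNull(field);
      this.reverse = reverse;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof SortKey)) {
        return false;
      }
      SortKey other = (SortKey) o;
      return field.equals(other.field) && reverse == other.reverse;
    }

    @Override
    public int hashCode() {
      return Objects.hash(field, reverse);
    }
  }

  /** Returns true if searchSort is a non-empty leading prefix of indexSort. */
  static boolean isPrefix(List<SortKey> searchSort, List<SortKey> indexSort) {
    if (searchSort.isEmpty() || searchSort.size() > indexSort.size()) {
      return false;
    }
    return indexSort.subList(0, searchSort.size()).equals(searchSort);
  }

  /**
   * Assumed model: when the prefix condition holds, only the first numHits matches of a
   * segment need full collection and the remaining matches are merely counted.
   */
  static int docsToCollect(boolean sortIsPrefix, int numHits, int matchesInSegment) {
    return sortIsPrefix ? Math.min(numHits, matchesInSegment) : matchesInSegment;
  }
}
```

With an index sorted by (date, id), a search sorted by date alone passes `isPrefix`, while a search sorted by id alone, or by date in the opposite direction, does not.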



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [lucene-jira-archive] mikemccand commented on pull request #34: Add a tool to generate account mapping

2022-07-15 Thread GitBox



mikemccand commented on PR #34:
URL: 
https://github.com/apache/lucene-jira-archive/pull/34#issuecomment-1185412128

   Wow, the mapping file is massive!  5,793 developers.  We've had so many 
contributors over the years ;)  Inspiring.
   
   > I'd put priority on avoiding false positives.
   
   +1



[GitHub] [lucene-jira-archive] mocobeta commented on pull request #34: Add a tool to generate account mapping

2022-07-15 Thread GitBox



mocobeta commented on PR #34:
URL: 
https://github.com/apache/lucene-jira-archive/pull/34#issuecomment-1185313945

   For verification, I'll do the following:
   1. Check if the candidate GitHub account has push access on apache/lucene 
repo.
   2. Check if the candidate GitHub account has been logged as "author" at 
least once in the commit history.
   
   For accounts that do not satisfy the above criteria, I would just omit them. 
   
   There should be some false negatives (for example, Jira users who reported 
issues but were not logged in commit history are omitted). I'd put priority on 
avoiding false positives.
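
   The filtering step above could be sketched roughly as follows. This is a hedged illustration in plain Java, not the tool from the pull request: `MappingVerifier`, `verify`, and all parameter names are hypothetical, and the inputs (the set of accounts with push access and the set of commit authors) are assumed to have been gathered beforehand. Because the comment does not say whether both checks must pass or either one suffices, the sketch exposes that choice as a `requireBoth` flag.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Illustrative only: class, method, and parameter names are hypothetical. */
final class MappingVerifier {

  /**
   * Keeps only the candidate entries (Jira user id -> GitHub account) whose account
   * passes the checks; every other entry is omitted, which favors false negatives
   * over false positives.
   */
  static Map<String, String> verify(
      Map<String, String> candidates,
      Set<String> pushAccess,
      Set<String> commitAuthors,
      boolean requireBoth) {
    Map<String, String> verified = new LinkedHashMap<>();
    for (Map.Entry<String, String> entry : candidates.entrySet()) {
      String account = entry.getValue();
      boolean hasPush = pushAccess.contains(account);
      boolean isAuthor = commitAuthors.contains(account);
      // Assumption: the source does not state how the two criteria combine.
      boolean accepted = requireBoth ? (hasPush && isAuthor) : (hasPush || isAuthor);
      if (accepted) {
        verified.put(entry.getKey(), account);
      }
    }
    return verified;
  }
}
```

   Omitting unverified entries, instead of flagging them, matches the stated priority of avoiding false positives.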



[GitHub] [lucene] LuXugang commented on a diff in pull request #767: LUCENE-10436: Deprecate DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery with FieldExistsQuery

2022-07-15 Thread GitBox



LuXugang commented on code in PR #767:
URL: https://github.com/apache/lucene/pull/767#discussion_r845715708


##
lucene/core/src/java/org/apache/lucene/search/FieldExistsQuery.java:
##
@@ -0,0 +1,228 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import java.util.Objects;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.DocValuesType;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.FieldInfos;
+import org.apache.lucene.index.IndexOptions;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.LeafReader;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.PointValues;
+import org.apache.lucene.index.Terms;
+
+/**
+ * A {@link Query} that matches documents that contain either a {@link
+ * org.apache.lucene.document.KnnVectorField}, or a field that indexes norms or doc values.
+ */
+public class FieldExistsQuery extends Query {
+  private String field;
+
+  /** Create a query that will match that have a value for the given {@code field}. */
+  public FieldExistsQuery(String field) {
+this.field = Objects.requireNonNull(field);
+  }
+
+  public String getField() {
+return field;
+  }
+
+  @Override
+  public String toString(String field) {
+return "FieldExistsQuery [field=" + this.field + "]";
+  }
+
+  @Override
+  public void visit(QueryVisitor visitor) {
+if (visitor.acceptField(field)) {
+  visitor.visitLeaf(this);
+}
+  }
+
+  @Override
+  public boolean equals(Object other) {
+return sameClassAs(other) && field.equals(((FieldExistsQuery) other).field);
+  }
+
+  @Override
+  public int hashCode() {
+final int prime = 31;
+int hash = classHash();
+hash = prime * hash + field.hashCode();
+return hash;
+  }
+
+  @Override
+  public Query rewrite(IndexReader reader) throws IOException {
+boolean allReadersRewritable = true;
+
+for (LeafReaderContext context : reader.leaves()) {
+  LeafReader leaf = context.reader();
+  FieldInfos fieldInfos = leaf.getFieldInfos();
+  FieldInfo fieldInfo = fieldInfos.fieldInfo(field);
+
+  if (fieldInfo == null) {
+allReadersRewritable = false;
+break;
+  }
+
+  if (fieldInfo.hasNorms()) { // the field indexes norms
+if (reader.getDocCount(field) != reader.maxDoc()) {
+  allReadersRewritable = false;
+  break;
+}
+  } else if (fieldInfo.getVectorDimension() != 0) { // the field indexes vectors
+if (leaf.getVectorValues(field).size() != reader.maxDoc()) {
+  allReadersRewritable = false;
+  break;
+}
+  } else if (fieldInfo.getDocValuesType()
+  != DocValuesType.NONE) { // the field indexes doc values or points
+
+// This optimization is possible due to LUCENE-9334 enforcing a field to always uses the
+// same data structures (all or nothing). Since there's no index statistic to detect when
+// all documents have doc values for a specific field, FieldExistsQuery can only be
+// rewritten to MatchAllDocsQuery for doc values field, when that same field also indexes
+// terms or point values which do have index statistics, and those statistics confirm that
+// all documents in this segment have values terms or point values.
+
+Terms terms = leaf.terms(field);
+PointValues pointValues = leaf.getPointValues(field);
+
+if ((terms == null || terms.getDocCount() != leaf.maxDoc())
+&& (pointValues == null || pointValues.getDocCount() != leaf.maxDoc())) {
+  allReadersRewritable = false;
+  break;
+}
+  } else {
+throw new IllegalStateException(buildErrorMsg(fieldInfo));
+  }
+}
+if (allReadersRewritable) {
+  return new MatchAllDocsQuery();
+}
+return super.rewrite(reader);
+  }
+
+  @Override
+  public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) {
+return new ConstantScoreWeight(this, boost) {
+  @Override
+  public Scorer scorer(LeafReaderContext context) throws IOExcept
