[GitHub] [lucene] zacharymorn commented on a diff in pull request #1018: LUCENE-10480: Use BulkScorer to limit BMMScorer to only top-level disjunctions
zacharymorn commented on code in PR #1018:
URL: https://github.com/apache/lucene/pull/1018#discussion_r922641944

## lucene/core/src/java/org/apache/lucene/search/BooleanWeight.java: ##
@@ -191,6 +191,69 @@ public long cost() {
   // or null if it is not applicable
   // pkg-private for forcing use of BooleanScorer in tests
   BulkScorer optionalBulkScorer(LeafReaderContext context) throws IOException {
+    if (scoreMode == ScoreMode.TOP_SCORES) {
+      if (query.getMinimumNumberShouldMatch() > 1 || weightedClauses.size() > 2) {
+        return null;
+      }
+
+      List<ScorerSupplier> optional = new ArrayList<>();
+      for (WeightedBooleanClause wc : weightedClauses) {
+        Weight w = wc.weight;
+        BooleanClause c = wc.clause;
+        if (c.getOccur() != Occur.SHOULD) {
+          continue;
+        }
+        ScorerSupplier scorer = w.scorerSupplier(context);
+        if (scorer != null) {
+          optional.add(scorer);
+        }
+      }
+
+      if (optional.size() <= 1) {
+        return null;
+      }
+
+      List<Scorer> optionalScorers = new ArrayList<>();
+      for (ScorerSupplier ss : optional) {
+        optionalScorers.add(ss.get(Long.MAX_VALUE));
+      }
+
+      return new BulkScorer() {

Review Comment:
Thanks for the suggestion! I gave that a try and it did work, but it would reduce the performance boost for OrHighMed from around 110+% to 70+%, most likely due to the extra logic inside `DefaultBulkScorer`. My preference would be to use the anonymous bulk scorer to maintain the performance advantage, but I'm also good with using `DefaultBulkScorer` if reducing potentially duplicated code and keeping things consistent are preferred?
```
Task                       QPS baseline  StdDev    QPS my_modified_version  StdDev    Pct diff                  p-value
TermBGroup1M1P                    55.89  (7.1%)                      54.01  (6.1%)    -3.4% ( -15% -  10%)      0.108
TermDateFacets                    34.46  (5.8%)                      33.58  (5.0%)    -2.5% ( -12% -   8%)      0.138
AndHighOrMedMed                   90.90  (5.6%)                      88.59  (4.6%)    -2.5% ( -12% -   8%)      0.115
BrowseDayOfYearSSDVFacets         28.63  (12.5%)                     28.01  (14.5%)   -2.2% ( -25% -  28%)      0.612
MedTermDayTaxoFacets              79.74  (5.1%)                      78.04  (4.3%)    -2.1% ( -10% -   7%)      0.150
TermGroup100                      36.28  (3.5%)                      35.54  (3.1%)    -2.1% (  -8% -   4%)      0.050
TermBGroup1M                      30.37  (3.7%)                      29.87  (3.8%)    -1.6% (  -8% -   6%)      0.165
TermGroup10K                      41.33  (3.2%)                      40.66  (3.3%)    -1.6% (  -7% -   5%)      0.117
PKLookup                         330.84  (5.1%)                     326.20  (4.3%)    -1.4% ( -10% -   8%)      0.349
SloppyPhrase                      13.56  (2.8%)                      13.39  (2.5%)    -1.2% (  -6% -   4%)      0.139
TermGroup1M                       39.76  (3.2%)                      39.32  (3.2%)    -1.1% (  -7% -   5%)      0.272
AndMedOrHighHigh                  88.13  (5.5%)                      87.22  (4.4%)    -1.0% ( -10% -   9%)      0.511
BrowseDateSSDVFacets               4.17  (29.0%)                      4.13  (29.0%)   -0.8% ( -45% -  80%)      0.933
SpanNear                         169.70  (2.6%)                     168.59  (2.0%)    -0.7% (  -5% -   4%)      0.369
Fuzzy2                            83.59  (2.4%)                      83.12  (2.2%)    -0.6% (  -5% -   4%)      0.442
Respell                           96.22  (3.1%)                      95.85  (2.6%)    -0.4% (  -5% -   5%)      0.672
IntervalsOrdered                  23.02  (4.3%)                      22.94  (4.2%)    -0.3% (  -8% -   8%)      0.799
Wildcard                         231.51  (4.5%)                     230.74  (5.0%)    -0.3% (  -9% -   9%)      0.827
AndHighMed                       143.73  (5.8%)                     143.48  (4.0%)    -0.2% (  -9% -  10%)      0.914
AndHighHighDayTaxoFacets          54.80  (1.3%)                      54.71  (1.5%)    -0.2% (  -2% -   2%)      0.717
Fuzzy1                           154.41  (2.8%)                     154.44  (1.9%)     0.0% (  -4% -   4%)      0.981
BrowseMonthSSDVFacets             28.06  (10.4%)                     28.06  (13.3%)    0.0% ( -21% -  26%)      0.995
OrHighMedDayTaxoFacets             7.38  (5.7%)                       7.39  (5.4%)     0.0% ( -10% -  11%)      0.981
AndHighMedDayTaxoFacets          134.39  (2.0%)                     134.59  (1.9%)     0.2% (  -3% -   4%)      0.809
Phrase                            38.80  (1.8%)                      38.86  (2.5%)     0.2% (  -4% -   4%)      0.823
TermMonthSort                    357.58  (5.2%)                     359.31  (6.7%)     0.5% ( -10% -  13%)      0.801
TermTitleSort                    274.32  (5.1%)                     275.77  (6.8%)     0.5% ( -10% -  13%)      0.781
TermDayOfYearSort 259
```
[GitHub] [lucene] gsmiller commented on a diff in pull request #1021: LUCENE-10603: Remove SSDV#NO_MORE_ORDS definition
gsmiller commented on code in PR #1021:
URL: https://github.com/apache/lucene/pull/1021#discussion_r922620035

## lucene/core/src/java/org/apache/lucene/index/SortedSetDocValuesWriter.java: ##
@@ -391,11 +386,7 @@ public boolean advanceExact(int target) throws IOException {
     @Override
     public long nextOrd() {
-      if (limit == ordUpto) {
-        return NO_MORE_ORDS;
-      } else {
-        return ords.ords.get(ordUpto++);
-      }
+      return ords.ords.get(ordUpto++);

Review Comment:
@LuXugang yeah, that's exactly right. The caller is now responsible for using `docValueCount()` to determine how many values the positioned doc has, and shouldn't call `nextOrd()` more than that many times. You're right that we could remove `limit` from `SortingSortedNumericDocValues` as well. I just missed it. Would you like to follow up with a change to remove it? I can get to it next week too. Thanks for pointing this out!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
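The caller-side contract described in the review comment above can be sketched as follows. This is a minimal illustration, not Lucene code: the `Ords` interface and the `ordsOf` helper are hypothetical stand-ins for a positioned multi-valued doc-values iterator, and the only things taken from the comment are the two method names `docValueCount()` and `nextOrd()` and the rule that `nextOrd()` must not be called more than `docValueCount()` times.

```java
import java.util.ArrayList;
import java.util.List;

public class OrdIterationSketch {

  /** Hypothetical stand-in for an iterator already positioned on a document. */
  interface Ords {
    int docValueCount(); // number of values for the current document

    long nextOrd(); // next ordinal; no end-of-values sentinel is returned
  }

  /** Builds a toy iterator over fixed ordinals (illustration only). */
  static Ords ordsOf(long... values) {
    return new Ords() {
      private int upto = 0;

      @Override
      public int docValueCount() {
        return values.length;
      }

      @Override
      public long nextOrd() {
        // Like the simplified implementation in the diff, this does not guard
        // against being called too many times; the caller must count.
        return values[upto++];
      }
    };
  }

  /** Reads every ordinal by counting, instead of looping until a sentinel. */
  static List<Long> readAll(Ords ords) {
    int count = ords.docValueCount();
    List<Long> out = new ArrayList<>(count);
    for (int i = 0; i < count; i++) {
      out.add(ords.nextOrd());
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(readAll(ordsOf(3, 7, 42)));
  }
}
```

The loop is bounded by the count up front, so the iterator itself no longer needs a `limit` check on every call, which is the simplification the diff makes.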
[GitHub] [lucene-jira-archive] mocobeta closed issue #15: Make a script to add comments to all Jira issues to indicate that "this was moved to GitHub"
mocobeta closed issue #15: Make a script to add comments to all Jira issues to indicate that "this was moved to GitHub" URL: https://github.com/apache/lucene-jira-archive/issues/15
[GitHub] [lucene-jira-archive] mocobeta commented on issue #15: Make a script to add comments to all Jira issues to indicate that "this was moved to GitHub"
mocobeta commented on issue #15: URL: https://github.com/apache/lucene-jira-archive/issues/15#issuecomment-1186068253 Merged #15 Note that this sends notifications to issues@ mail list (per POST API call). I think it's fine to let users know about the migration - an announcement for devs would be needed beforehand though. This would take about 2.5 hours - I think it'd be okay that we include this in the migration process (the whole process takes days).
[GitHub] [lucene-jira-archive] mocobeta merged pull request #47: Add a script to add 'moved to' comments to Jira issues
mocobeta merged PR #47: URL: https://github.com/apache/lucene-jira-archive/pull/47
[GitHub] [lucene-jira-archive] mocobeta commented on pull request #47: Add a script to add 'moved to' comments to Jira issues
mocobeta commented on PR #47: URL: https://github.com/apache/lucene-jira-archive/pull/47#issuecomment-1186062124 Looks to work fine.
[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira
[ https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567435#comment-17567435 ]

Tomoko Uchida commented on LUCENE-10557:

[TEST] This was moved to GitHub issue: https://github.com/mocobeta/migration-test-3/issues/196.

> Migrate to GitHub issue from Jira
> ---------------------------------
>
> Key: LUCENE-10557
> URL: https://issues.apache.org/jira/browse/LUCENE-10557
> Project: Lucene - Core
> Issue Type: Sub-task
> Reporter: Tomoko Uchida
> Assignee: Tomoko Uchida
> Priority: Major
> Attachments: Screen Shot 2022-06-29 at 11.02.35 AM.png, image-2022-06-29-13-36-57-365.png, screenshot-1.png
> Time Spent: 40m
> Remaining Estimate: 0h
>
> A few (not the majority) Apache projects already use the GitHub issue instead of Jira. For example,
> Airflow: [https://github.com/apache/airflow/issues]
> BookKeeper: [https://github.com/apache/bookkeeper/issues]
> So I think it'd be technically possible that we move to GitHub issue. I have little knowledge of how to proceed with it, I'd like to discuss whether we should migrate to it, and if so, how to smoothly handle the migration.
> The major tasks would be:
> * (/) Get a consensus about the migration among committers
> * (/) Choose issues that should be moved to GitHub - We'll migrate all issues towards an atomic switch to GitHub if no major technical obstacles show up.
> ** Discussion thread [https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12]
> ** -Conclusion for now: We don't migrate any issues. Only new issues should be opened on GitHub.-
> ** Write a prototype migration script - the decision could be made on that. Things to consider:
> *** version numbers - labels or milestones?
> *** add a comment/ prepend a link to the source Jira issue on github side,
> *** add a comment/ prepend a link on the jira side to the new issue on github side (for people who access jira from blogs, mailing list archives and other sources that will have stale links),
> *** convert cross-issue automatic links in comments/ descriptions (as suggested by Robert),
> *** strategy to deal with sub-issues (hierarchies),
> *** maybe prefix (or postfix) the issue title on github side with the original LUCENE-XYZ key so that it is easier to search for a particular issue there?
> *** how to deal with user IDs (author, reporter, commenters)? Do they have to be github users? Will information about people not registered on github be lost?
> *** create an extra mapping file of old-issue-new-issue URLs for any potential future uses.
> *** what to do with issue numbers in git/svn commits? These could be rewritten but it'd change the entire git history tree - I don't think this is practical, while doable.
> * Prepare a complete migration tool
> ** See https://github.com/apache/lucene-jira-archive/issues/5
> * Build the convention for issue label/milestone management
> ** See [https://github.com/apache/lucene-jira-archive/issues/6]
> ** Do some experiments on a sandbox repository [https://github.com/mocobeta/sandbox-lucene-10557]
> ** Make documentation for metadata (label/milestone) management
> * (/) Enable Github issue on the lucene's repository
> ** Raise an issue on INFRA
> ** (Create an issue-only private repository for sensitive issues if it's needed and allowed)
> ** Set a mail hook to [issues@lucene.apache.org|mailto:issues@lucene.apache.org] (many thanks to the general mail group name)
> * Set a schedule for migration
> ** See [https://github.com/apache/lucene-jira-archive/issues/7]
> ** Give some time to committers to play around with issues/labels/milestones before the actual migration
> ** Make an announcement on the mail lists
> ** Show some text messages when opening a new Jira issue (in issue template?)

--
This message was sent by Atlassian Jira (v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
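Two of the items in the task list above, converting cross-issue links and keeping an old-issue-to-new-issue mapping, could be combined roughly as below. This is only a sketch under stated assumptions: the class name, the `rewrite` method, the output format `#N (LUCENE-XYZ)`, and the issue numbers in the usage example are all hypothetical, and the actual migration scripts in lucene-jira-archive may work quite differently.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IssueLinkRewriteSketch {

  // Matches Jira keys such as LUCENE-10557 as whole words.
  private static final Pattern JIRA_KEY = Pattern.compile("\\bLUCENE-(\\d+)\\b");

  /**
   * Replaces each Jira key that has an entry in the mapping with a GitHub-style
   * reference, keeping the original key in parentheses so it stays searchable.
   * Keys without a mapping entry are left untouched.
   */
  static String rewrite(String text, Map<String, Integer> jiraToGitHub) {
    Matcher m = JIRA_KEY.matcher(text);
    StringBuilder sb = new StringBuilder();
    while (m.find()) {
      String key = m.group();
      Integer gitHubNumber = jiraToGitHub.get(key);
      String replacement = gitHubNumber == null ? key : "#" + gitHubNumber + " (" + key + ")";
      m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
    }
    m.appendTail(sb);
    return sb.toString();
  }

  public static void main(String[] args) {
    // The GitHub number 123 is made up for the example.
    Map<String, Integer> mapping = Map.of("LUCENE-9280", 123);
    System.out.println(rewrite("See LUCENE-9280 and LUCENE-1.", mapping));
  }
}
```

Leaving unmapped keys as plain text means a partial mapping degrades gracefully rather than producing broken references.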
[jira] [Commented] (LUCENE-10622) Prepare complete migration script to GitHub issue from Jira (best effort)
[ https://issues.apache.org/jira/browse/LUCENE-10622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567434#comment-17567434 ]

Tomoko Uchida commented on LUCENE-10622:

[TEST] This was moved to GitHub issue: https://github.com/mocobeta/migration-test-3/issues/61.

> Prepare complete migration script to GitHub issue from Jira (best effort)
> -------------------------------------------------------------------------
>
> Key: LUCENE-10622
> URL: https://issues.apache.org/jira/browse/LUCENE-10622
> Project: Lucene - Core
> Issue Type: Sub-task
> Reporter: Tomoko Uchida
> Assignee: Tomoko Uchida
> Priority: Major
> Attachments: test-1.txt, test.txt
>
> If we intend to move the history to GitHub, it should be perfect as far as possible - significantly degraded copies of history are harmful, rather than helpful for future contributors, I think.
[GitHub] [lucene-jira-archive] mocobeta opened a new pull request, #47: Add a script to add 'moved to' comments to Jira issues
mocobeta opened a new pull request, #47: URL: https://github.com/apache/lucene-jira-archive/pull/47 With a [personal access token](https://confluence.atlassian.com/enterprise/using-personal-access-tokens-1026032365.html) (of a committer?), it is possible to add comments to ASF Jira issues.
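For readers unfamiliar with the mechanism, the request such a script would issue can be sketched as below. This is not the script from the pull request: the class and method names are hypothetical, the token is a placeholder, and the example only builds the request without sending it. It assumes the Jira Server REST endpoint `POST /rest/api/2/issue/{issueKey}/comment` with a JSON `body` field and a personal access token passed as a `Bearer` authorization header.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class JiraCommentRequestSketch {

  /** Minimal JSON string escaping, enough for plain one-line comment text. */
  static String jsonEscape(String s) {
    return s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
  }

  /** Builds the JSON payload for the add-comment call. */
  static String commentPayload(String text) {
    return "{\"body\": \"" + jsonEscape(text) + "\"}";
  }

  /** Builds (but does not send) the POST request that adds a comment to an issue. */
  static HttpRequest buildRequest(String baseUrl, String issueKey, String token, String text) {
    return HttpRequest.newBuilder()
        .uri(URI.create(baseUrl + "/rest/api/2/issue/" + issueKey + "/comment"))
        .header("Authorization", "Bearer " + token) // personal access token
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(commentPayload(text)))
        .build();
  }

  public static void main(String[] args) {
    HttpRequest request =
        buildRequest(
            "https://issues.apache.org/jira",
            "LUCENE-10622",
            "placeholder-token", // hypothetical; never hard-code a real token
            "[TEST] This was moved to GitHub issue: "
                + "https://github.com/mocobeta/migration-test-3/issues/61.");
    System.out.println(request.method() + " " + request.uri());
  }
}
```

Sending the request would take one more call on a `java.net.http.HttpClient`; it is left out here so the example has no side effects on a live Jira instance.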
[jira] [Commented] (LUCENE-10622) Prepare complete migration script to GitHub issue from Jira (best effort)
[ https://issues.apache.org/jira/browse/LUCENE-10622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567432#comment-17567432 ]

Tomoko Uchida commented on LUCENE-10622:

[TEST] This was moved to GitHub issue: https://github.com/mocobeta/migration-test-3/issues/61.
[jira] [Commented] (LUCENE-10622) Prepare complete migration script to GitHub issue from Jira (best effort)
[ https://issues.apache.org/jira/browse/LUCENE-10622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567431#comment-17567431 ]

Tomoko Uchida commented on LUCENE-10622:

[TEST] This was moved to GitHub issue: https://github.com/mocobeta/migration-test-3/issues/61.
[GitHub] [lucene-jira-archive] mikemccand commented on issue #15: Make a script to add comments to all Jira issues to indicate that "this was moved to GitHub"
mikemccand commented on issue #15: URL: https://github.com/apache/lucene-jira-archive/issues/15#issuecomment-1186048071 It's possible this might still work even once we've made Jira "read-only" by disabling all workflows! Then we can take our time after the migration (and Jira becoming read-only) to append these comments to all Jira issues.
[jira] [Commented] (LUCENE-10622) Prepare complete migration script to GitHub issue from Jira (best effort)
[ https://issues.apache.org/jira/browse/LUCENE-10622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567429#comment-17567429 ]

Tomoko Uchida commented on LUCENE-10622:

This is a test comment from API.
[GitHub] [lucene-jira-archive] mocobeta commented on issue #15: Make a script to add comments to all Jira issues to indicate that "this was moved to GitHub"
mocobeta commented on issue #15: URL: https://github.com/apache/lucene-jira-archive/issues/15#issuecomment-1186041151 Adding a comment: examples https://developer.atlassian.com/server/jira/platform/jira-rest-api-examples/#adding-a-comment--examples
[GitHub] [lucene-jira-archive] mocobeta closed issue #26: Make a script to set colors and descriptions for labels
mocobeta closed issue #26: Make a script to set colors and descriptions for labels URL: https://github.com/apache/lucene-jira-archive/issues/26
[GitHub] [lucene-jira-archive] mocobeta merged pull request #46: Add a script to update issue labels and descriptions
mocobeta merged PR #46: URL: https://github.com/apache/lucene-jira-archive/pull/46
[GitHub] [lucene-jira-archive] mocobeta opened a new pull request, #46: Add a script to update issue labels and descriptions
mocobeta opened a new pull request, #46: URL: https://github.com/apache/lucene-jira-archive/pull/46 Close #26
[jira] [Commented] (LUCENE-10655) can we optimize visited bitset usage in HNSW graph search/indexing?
[ https://issues.apache.org/jira/browse/LUCENE-10655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567366#comment-17567366 ]

Michael Sokolov commented on LUCENE-10655:

meh, I tried a few things, but nothing really moved the needle.

> can we optimize visited bitset usage in HNSW graph search/indexing?
> -------------------------------------------------------------------
>
> Key: LUCENE-10655
> URL: https://issues.apache.org/jira/browse/LUCENE-10655
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/hnsw
> Reporter: Michael Sokolov
> Priority: Major
>
> When running {{luceneutil}} I noticed that {{FixedBitSet.clear()}} dominates the CPU profiler output. I had a few ideas:
> # In upper graph layers, the occupied nodes are very sparse - maybe {{SparseFixedBitSet}} would be a better fit for those
> # We are caching these bitsets, but they are only used for a single search (single document insert, during indexing). Should we cache across searches? We would need to pool them though, and they would vary by field since fields can have different numbers of vector nodes. This starts to get complex
> # Are we sure that clearing a bitset is more efficient than allocating a new one? Maybe the JDK maintains a pool of already-zeroed memory for us
> I think we could try specializing the bitset type by graph level, and then I think we ought to measure the performance of allocation vs the limited reuse that we currently have.
[GitHub] [lucene] rmuir commented on pull request #947: LUCENE-10577: enable quantization of HNSW vectors to 8 bits
rmuir commented on PR #947: URL: https://github.com/apache/lucene/pull/947#issuecomment-1185833705 I think the title of the PR is wrong? We shouldn't be quantizing anything. The user should be supplying a `byte[]` vector for 8-bit vectors. Floats should not be involved.
[jira] [Comment Edited] (LUCENE-10655) can we optimize visited bitset usage in HNSW graph search/indexing?
[ https://issues.apache.org/jira/browse/LUCENE-10655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567358#comment-17567358 ]

Michael Sokolov edited comment on LUCENE-10655 at 7/15/22 6:41 PM:

OK I was confused, and in fact we already do use SparseFixedBitSet for every layer, and we re-use the same one for the lifetime of a HnswGraphBuilder, which processes all the vectors. And I tried allocating afresh rather than clear-ing, and it was a bit slower. FixedBitSet.clear() is a hot-spot but it's not really clear what's to be done about it. Perhaps since we are re-using we could try using a fully-allocated FixedBitSet (not sparse) when indexing? My concern is that over the lifetime of indexing many vectors, the sparse bit set will eventually become dense, but inefficiently.

Oh I see - in fact that *is* what we do. Okay, returning to this again, I think I will try using that one for the fully-populated level only

was (Author: sokolov):
OK I was confused, and in fact we already do use SparseFixedBitSet for every layer, and we re-use the same one for the lifetime of a HnswGraphBuilder, which processes all the vectors. And I tried allocating afresh rather than clear-ing, and it was a bit slower. FixedBitSet.clear() is a hot-spot but it's not really clear what's to be done about it. Perhaps since we are re-using we could try using a fully-allocated FixedBitSet (not sparse) when indexing? My concern is that over the lifetime of indexing many vectors, the sparse bit set will eventually become dense, but inefficiently.
[jira] [Comment Edited] (LUCENE-10655) can we optimize visited bitset usage in HNSW graph search/indexing?
[ https://issues.apache.org/jira/browse/LUCENE-10655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567358#comment-17567358 ]

Michael Sokolov edited comment on LUCENE-10655 at 7/15/22 6:39 PM:

OK I was confused, and in fact we already do use SparseFixedBitSet for every layer, and we re-use the same one for the lifetime of a HnswGraphBuilder, which processes all the vectors. And I tried allocating afresh rather than clear-ing, and it was a bit slower. FixedBitSet.clear() is a hot-spot but it's not really clear what's to be done about it. Perhaps since we are re-using we could try using a fully-allocated FixedBitSet (not sparse) when indexing? My concern is that over the lifetime of indexing many vectors, the sparse bit set will eventually become dense, but inefficiently.

was (Author: sokolov):
OK I was confused, and in fact we already do use SparseFixedBitSet for every layer. And I tried allocating afresh rather than clear-ing, and it was a bit slower. FixedBitSet.clear() is a hot-spot but it's not really clear what's to be done about it.
[jira] [Commented] (LUCENE-10655) can we optimize visited bitset usage in HNSW graph search/indexing?
[ https://issues.apache.org/jira/browse/LUCENE-10655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567358#comment-17567358 ]

Michael Sokolov commented on LUCENE-10655:

OK I was confused, and in fact we already do use SparseFixedBitSet for every layer. And I tried allocating afresh rather than clear-ing, and it was a bit slower. FixedBitSet.clear() is a hot-spot but it's not really clear what's to be done about it.
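The two strategies being compared in this thread, clearing and reusing one visited set versus allocating a fresh one per search, can be illustrated with a small sketch. It uses `java.util.BitSet` as a stand-in for Lucene's `FixedBitSet`/`SparseFixedBitSet`, and the class and method names are hypothetical. It shows only that both strategies compute the same result; it says nothing about which is faster, which per the comments above has to be measured.

```java
import java.util.Arrays;
import java.util.BitSet;

public class VisitedSetSketch {

  /** Marks each node as visited and returns how many distinct nodes were seen. */
  static int countDistinct(BitSet visited, int[] nodes) {
    int distinct = 0;
    for (int node : nodes) {
      if (!visited.get(node)) {
        visited.set(node);
        distinct++;
      }
    }
    return distinct;
  }

  /** Strategy A: one bitset for all searches, cleared before each search. */
  static int[] countWithReuse(int numNodes, int[][] searches) {
    BitSet visited = new BitSet(numNodes);
    int[] counts = new int[searches.length];
    for (int i = 0; i < searches.length; i++) {
      visited.clear(); // cost is proportional to the bitset size, not to bits set
      counts[i] = countDistinct(visited, searches[i]);
    }
    return counts;
  }

  /** Strategy B: a freshly allocated (already zeroed) bitset for each search. */
  static int[] countWithFreshAllocation(int numNodes, int[][] searches) {
    int[] counts = new int[searches.length];
    for (int i = 0; i < searches.length; i++) {
      counts[i] = countDistinct(new BitSet(numNodes), searches[i]);
    }
    return counts;
  }

  public static void main(String[] args) {
    int[][] searches = {{1, 2, 2, 5}, {5, 5}, {}};
    System.out.println(Arrays.toString(countWithReuse(8, searches)));
    System.out.println(Arrays.toString(countWithFreshAllocation(8, searches)));
  }
}
```

A benchmark harness would be needed to compare the two in a meaningful way; a sparse bitset changes the trade-off because clearing it can be cheaper when few bits are set.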
[jira] [Commented] (LUCENE-10633) Dynamic pruning for queries sorted by SORTED(_SET) field
[ https://issues.apache.org/jira/browse/LUCENE-10633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567334#comment-17567334 ]

Michael Sokolov commented on LUCENE-10633:

Adrien that's crazy !

> Dynamic pruning for queries sorted by SORTED(_SET) field
> --------------------------------------------------------
>
> Key: LUCENE-10633
> URL: https://issues.apache.org/jira/browse/LUCENE-10633
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Minor
> Time Spent: 10m
> Remaining Estimate: 0h
>
> LUCENE-9280 introduced the ability to dynamically prune non-competitive hits when sorting by a numeric field, by leveraging the points index to skip documents that do not compare better than the top of the priority queue maintained by the field comparator.
> However queries sorted by a SORTED(_SET) field still look at all hits, which is disappointing. Could we leverage the terms index to skip hits?
[jira] [Commented] (LUCENE-10151) Add timeout support to IndexSearcher
[ https://issues.apache.org/jira/browse/LUCENE-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567330#comment-17567330 ] Michael Sokolov commented on LUCENE-10151: -- > Did you forget to push to branch_9x? I cannot see the change there. Yes! Thanks - pushed now > Add timeout support to IndexSearcher > > > Key: LUCENE-10151 > URL: https://issues.apache.org/jira/browse/LUCENE-10151 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search >Reporter: Greg Miller >Priority: Minor > Fix For: 9.3 > > Time Spent: 3h 40m > Remaining Estimate: 0h > > I'd like to explore adding optional "timeout" capabilities to > {{IndexSearcher}}. This would enable users to (optionally) specify a maximum > time budget for search execution. If the search "times out", partial results > would be available. > This idea originated on the dev list (thanks [~jpountz] for the suggestion). > Thread for reference: > [http://mail-archives.apache.org/mod_mbox/lucene-dev/202110.mbox/%3CCAL8PwkZdNGmYJopPjeXYK%3DF7rvLkWon91UEXVxMM4MeeJ3UHxQ%40mail.gmail.com%3E] > > A couple things to watch out for with this change: > # We want to make sure it's robust to a two-phase query evaluation scenario > where the "approximate" step matches a large number of candidates but the > "confirmation" step matches very few (or none). This is a particularly tricky > case. > # We want to make sure the {{TotalHits#Relation}} reported by {{TopDocs}} is > {{GREATER_THAN_OR_EQUAL_TO}} if the query times out > # We want to make sure it plays nice with the {{LRUCache}} since it iterates > the query to pre-populate a {{BitSet}} when caching. That step shouldn't be > allowed to overrun the timeout. The proper way to handle this probably needs > some thought. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
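A minimal, self-contained sketch of the time-budget idea described in this issue; it is not the code merged for LUCENE-10151. All names here (`TimeBudgetedSearch`, `Result`, `search`, `shouldExit`) are hypothetical, and "matching" is reduced to an even/odd test so the example runs without Lucene. The timeout predicate is consulted for every candidate document rather than only for confirmed matches, which is one way to stay within budget in the two-phase case where many candidates are approximated but few are confirmed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch; none of these names come from Lucene. */
final class TimeBudgetedSearch {

  /** Hits gathered so far, plus whether the scan was cut short by the budget. */
  static final class Result {
    final List<Integer> hits;
    final boolean timedOut;

    Result(List<Integer> hits, boolean timedOut) {
      this.hits = hits;
      this.timedOut = timedOut;
    }
  }

  /**
   * Scans doc IDs in [0, maxDoc) and collects the even ones (a stand-in for "confirmed matches").
   * shouldExit is checked before every candidate, so a long run of non-matching candidates
   * cannot overrun the budget.
   */
  static Result search(int maxDoc, BooleanSupplier shouldExit) {
    List<Integer> hits = new ArrayList<>();
    for (int doc = 0; doc < maxDoc; doc++) {
      if (shouldExit.getAsBoolean()) {
        // Partial results: the caller should treat the hit count as a lower bound.
        return new Result(hits, true);
      }
      if (doc % 2 == 0) {
        hits.add(doc);
      }
    }
    return new Result(hits, false);
  }
}
```

A caller would typically pass a clock-based predicate such as `() -> System.nanoTime() - start > budgetNanos` as `shouldExit`. When `timedOut` is true the hit count is only a lower bound, which corresponds to the `GREATER_THAN_OR_EQUAL_TO` relation mentioned in the issue.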
[jira] [Commented] (LUCENE-10151) Add timeout support to IndexSearcher
[ https://issues.apache.org/jira/browse/LUCENE-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567328#comment-17567328 ] ASF subversion and git services commented on LUCENE-10151: -- Commit aa082b46f669f71cd0deb2e409c62be863f17091 in lucene's branch refs/heads/branch_9x from Deepika0510 [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=aa082b46f66 ] LUCENE-10151: Adding Timeout Support to IndexSearcher (#927) Authored-by: Deepika Sharma
[jira] [Commented] (LUCENE-10151) Add timeout support to IndexSearcher
[ https://issues.apache.org/jira/browse/LUCENE-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567329#comment-17567329 ] ASF subversion and git services commented on LUCENE-10151: -- Commit 5cd6eda8caba5a93eeaf60215885ec3171707449 in lucene's branch refs/heads/branch_9x from Michael Sokolov [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=5cd6eda8cab ] CHANGES entry for LUCENE-10151
[GitHub] [lucene] nknize commented on pull request #1017: LUCENE-10654: Add new ShapeDocValuesField for LatLonShape and XYShape
nknize commented on PR #1017: URL: https://github.com/apache/lucene/pull/1017#issuecomment-1185733826 Hey @iverase; here's the PR related to the ShapeDocValuesField for the 9.3 release. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10649) Failure in TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField
[ https://issues.apache.org/jira/browse/LUCENE-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567317#comment-17567317 ]
Vigya Sharma commented on LUCENE-10649:
---
Created https://github.com/apache/lucene/pull/1025
> Failure in TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField
> ---
>
> Key: LUCENE-10649
> URL: https://issues.apache.org/jira/browse/LUCENE-10649
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Vigya Sharma
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Failing Build Link:
> [https://jenkins.thetaphi.de/job/Lucene-main-Linux/35617/testReport/junit/org.apache.lucene.index/TestDemoParallelLeafReader/testRandomMultipleSchemaGensSameField/]
> Repro:
> {code:java}
> gradlew test --tests TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField -Dtests.seed=A7496D7D3957981A -Dtests.multiplier=3 -Dtests.locale=sr-Latn-BA -Dtests.timezone=Etc/GMT-7 -Dtests.asserts=true -Dtests.file.encoding=UTF-8
> {code}
> Error:
> {code:java}
> java.lang.AssertionError: expected:<103> but was:<2147483647>
>   at __randomizedtesting.SeedInfo.seed([A7496D7D3957981A:F71866BCCEA1C903]:0)
>   at org.junit.Assert.fail(Assert.java:89)
>   at org.junit.Assert.failNotEquals(Assert.java:835)
>   at org.junit.Assert.assertEquals(Assert.java:647)
>   at org.junit.Assert.assertEquals(Assert.java:633)
>   at org.apache.lucene.index.TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField(TestDemoParallelLeafReader.java:1347)
>   at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
> {code}
[GitHub] [lucene] vigyasharma opened a new pull request, #1025: LUCENE-10649: Fix failures in TestDemoParallelLeafReader
vigyasharma opened a new pull request, #1025: URL: https://github.com/apache/lucene/pull/1025
With merge-on-refresh enabled, the `ReindexingMergePolicy` in this test needs to override `findFullFlushMerges()` to wrap the input reader, so that the merged segment gets fields from the parallel readers.
# Testing:
Ran the test on repeat on my dev box. Without the fix, it fails in a couple of runs.
```java
% ./gradlew test --tests TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField -Dtests.seed=A7496D7D3957981A -Dtests.multiplier=3 -Dtests.locale=sr-Latn-BA -Dtests.timezone=Etc/GMT-7 -Dtests.asserts=true -Dtests.file.encoding=UTF-8 -Dtests.iters=500 -Dtests.failfast=true
...
:lucene:core:test (SUCCESS): 500 test(s)
...
```
[jira] [Created] (LUCENE-10655) can we optimize visited bitset usage in HNSW graph search/indexing?
Michael Sokolov created LUCENE-10655: Summary: can we optimize visited bitset usage in HNSW graph search/indexing? Key: LUCENE-10655 URL: https://issues.apache.org/jira/browse/LUCENE-10655 Project: Lucene - Core Issue Type: Improvement Components: core/hnsw Reporter: Michael Sokolov When running {{luceneutil}} I noticed that {{FixedBitSet.clear()}} dominates the CPU profiler output. I had a few ideas: # In upper graph layers, the occupied nodes are very sparse - maybe {{SparseFixedBitSet}} would be a better fit for those # We are caching these bitsets, but they are only used for a single search (single document insert, during indexing). Should we cache across searches? We would need to pool them though, and they would vary by field since fields can have different numbers of vector nodes. This starts to get complex # Are we sure that clearing a bitset is more efficient than allocating a new one? Maybe the JDK maintains a pool of already-zeroed memory for us I think we could try specializing the bitset type by graph level, and then I think we ought to measure the performance of allocation vs the limited reuse that we currently have. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
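The third idea above (clear and reuse versus allocate afresh) can be sketched in isolation as follows. This is only an illustration of the two strategies being compared: it uses `java.util.BitSet` as a stand-in for Lucene's `FixedBitSet`, and the class and method names (`VisitedSetReuse`, `countVisitedWithReuse`, `countVisitedWithFreshAllocation`) are hypothetical. It says nothing about which strategy is faster; that has to be measured on the real workload, as the issue proposes.

```java
import java.util.BitSet;

/** Hypothetical sketch; java.util.BitSet stands in for Lucene's FixedBitSet. */
final class VisitedSetReuse {

  /** Strategy A: keep one bitset across searches and clear it before each one. */
  static int countVisitedWithReuse(BitSet visited, int[] nodes) {
    visited.clear(); // cost is proportional to the bitset size, not to the number of set bits
    return markAll(visited, nodes);
  }

  /** Strategy B: allocate a new bitset per search; the JVM hands back zeroed memory. */
  static int countVisitedWithFreshAllocation(int numNodes, int[] nodes) {
    return markAll(new BitSet(numNodes), nodes);
  }

  /** Marks each node as visited and returns how many were seen for the first time. */
  private static int markAll(BitSet visited, int[] nodes) {
    int firstVisits = 0;
    for (int node : nodes) {
      if (!visited.get(node)) {
        visited.set(node);
        firstVisits++;
      }
    }
    return firstVisits;
  }
}
```

Both strategies must give the same answer for the same input; only their allocation and clearing costs differ, which is what a benchmark would need to isolate.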
[GitHub] [lucene] msokolov commented on pull request #947: LUCENE-10577: enable quantization of HNSW vectors to 8 bits
msokolov commented on PR #947: URL: https://github.com/apache/lucene/pull/947#issuecomment-1185649145 I pushed an updated luceneutil PR adapting to these changes https://github.com/mikemccand/luceneutil/pull/181. Running that perf test I saw consistent gains (20-55% depending on the test case) as compared to the earlier test runs. I also noticed that the profiler shows the most expensive function during indexing is FixedBitSet.clear(), which makes me think we might want to use sparse bitsets for the "upper" layers of the graph, which have many fewer nodes.
[GitHub] [lucene] mocobeta commented on pull request #1024: LUCENE-10557: Add GitHub issue templates
mocobeta commented on PR #1024: URL: https://github.com/apache/lucene/pull/1024#issuecomment-1185631535 This is also a proposal for issue management. Feedback is welcome - I'm going to merge this in a week to proceed with the migration (if there are no comments).
[GitHub] [lucene] mocobeta opened a new pull request, #1024: LUCENE-10557: Add GitHub issue templates
mocobeta opened a new pull request, #1024: URL: https://github.com/apache/lucene/pull/1024 ### Description (or a Jira issue link if you have one) LUCENE-10557 This adds GitHub issue templates and a draft how-to manual for organizing issues with labels/templates.
[GitHub] [lucene] jpountz opened a new pull request, #1023: LUCENE-10633: Dynamic pruning for sorting on SORTED(_SET) fields.
jpountz opened a new pull request, #1023: URL: https://github.com/apache/lucene/pull/1023 This commit enables dynamic pruning for queries sorted on SORTED(_SET) fields by using postings to filter competitive documents. JIRA: [LUCENE-10633](https://issues.apache.org/jira/browse/LUCENE-10633)
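A rough sketch of the pruning idea, assuming an ascending sort and ignoring ties; it is not the code in PR #1023. The class `CompetitiveOrdinalPruning`, the method `competitiveDocs`, and the list-of-arrays "postings" are all hypothetical stand-ins. The threshold of 128 is taken from the prototype described in the LUCENE-10633 comment in this thread. Once few enough ordinals can still beat the bottom of the priority queue, the sketch gathers only the documents that carry one of those ordinals instead of visiting every hit.

```java
import java.util.List;
import java.util.TreeSet;

/** Hypothetical sketch; postings are modeled as one int[] of doc IDs per ordinal. */
final class CompetitiveOrdinalPruning {

  /** Pruning starts once at most this many ordinals are still competitive. */
  static final int MAX_COMPETITIVE_ORDS = 128;

  /**
   * Returns the doc IDs that can still enter the top hits of an ascending sort, or null
   * when too many ordinals remain competitive and all hits should be visited as usual.
   *
   * postingsByOrd.get(ord) lists the docs whose value has ordinal ord. bottomOrd is the
   * ordinal of the current worst entry in the queue; in this simplified model only
   * ordinals strictly below it are treated as competitive.
   */
  static TreeSet<Integer> competitiveDocs(List<int[]> postingsByOrd, int bottomOrd) {
    if (bottomOrd > MAX_COMPETITIVE_ORDS) {
      return null;
    }
    int limit = Math.min(bottomOrd, postingsByOrd.size());
    TreeSet<Integer> docs = new TreeSet<>();
    for (int ord = 0; ord < limit; ord++) {
      for (int doc : postingsByOrd.get(ord)) {
        docs.add(doc);
      }
    }
    return docs;
  }
}
```

The sorted set is only for readability; a real implementation would more likely merge postings iterators lazily than materialize the doc IDs.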
[jira] [Commented] (LUCENE-10633) Dynamic pruning for queries sorted by SORTED(_SET) field
[ https://issues.apache.org/jira/browse/LUCENE-10633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567268#comment-17567268 ]
Adrien Grand commented on LUCENE-10633:
---
I played with a prototype that starts dynamically pruning matches as soon as there are 128 competitive ordinals left or less by pulling postings to iterate over the remaining documents that have competitive values. I still need to think of simplifying the logic and improving tests but the initial benchmarks on wikimedium10m are very encouraging (assuming I didn't get anything wrong):
{noformat}
Task                         QPS baseline   StdDev QPS my_modified_version   StdDev Pct diff                 p-value
Prefix3                            248.74   (6.1%)                  242.61   (5.8%)    -2.5% ( -13% -  10%)  0.191
BrowseMonthTaxoFacets               27.71  (10.1%)                   27.34  (10.6%)    -1.3% ( -20% -  21%)  0.682
BrowseDateSSDVFacets                 4.99  (10.3%)                    4.94   (8.4%)    -1.1% ( -17% -  19%)  0.707
BrowseDateTaxoFacets                44.26  (12.2%)                   43.97  (13.1%)    -0.7% ( -23% -  28%)  0.870
Wildcard                           137.61   (3.0%)                  136.97   (2.6%)    -0.5% (  -5% -   5%)  0.592
BrowseDayOfYearTaxoFacets           45.53  (12.4%)                   45.44  (13.4%)    -0.2% ( -23% -  29%)  0.963
IntNRQ                             198.27   (8.1%)                  197.94   (7.4%)    -0.2% ( -14% -  16%)  0.946
BrowseRandomLabelSSDVFacets         14.51   (2.2%)                   14.49   (2.4%)    -0.2% (  -4% -   4%)  0.835
AndHighHighDayTaxoFacets             8.32   (5.1%)                    8.31   (5.7%)    -0.1% ( -10% -  11%)  0.956
LowSpanNear                         46.83   (1.6%)                   46.82   (2.0%)    -0.0% (  -3% -   3%)  0.990
BrowseRandomLabelTaxoFacets         36.18  (10.5%)                   36.18  (12.6%)     0.0% ( -20% -  25%)  0.998
MedTermDayTaxoFacets                73.59   (4.8%)                   73.66   (5.7%)     0.1% (  -9% -  11%)  0.954
OrNotHighHigh                     1476.08   (5.3%)                 1477.58   (3.9%)     0.1% (  -8% -   9%)  0.945
TermDTSort                         746.55   (2.4%)                  747.70   (1.7%)     0.2% (  -3% -   4%)  0.817
Fuzzy2                              96.18   (1.3%)                   96.39   (1.4%)     0.2% (  -2% -   2%)  0.617
AndHighMedDayTaxoFacets            154.89   (1.8%)                  155.29   (1.6%)     0.3% (  -3% -   3%)  0.629
AndHighMed                         378.38   (3.7%)                  379.50   (4.4%)     0.3% (  -7% -   8%)  0.817
PKLookup                           243.14   (1.9%)                  243.99   (1.9%)     0.4% (  -3% -   4%)  0.552
HighPhrase                         279.13   (2.1%)                  280.21   (1.5%)     0.4% (  -3% -   4%)  0.510
Respell                             71.59   (1.5%)                   71.87   (1.5%)     0.4% (  -2% -   3%)  0.406
OrHighHigh                          66.95   (6.5%)                   67.21   (5.7%)     0.4% ( -11% -  13%)  0.837
Fuzzy1                             101.53   (1.5%)                  101.95   (1.5%)     0.4% (  -2% -   3%)  0.382
LowPhrase                          101.76   (2.3%)                  102.22   (2.6%)     0.5% (  -4% -   5%)  0.558
LowSloppyPhrase                     21.14   (3.1%)                   21.25   (4.1%)     0.5% (  -6% -   7%)  0.661
MedPhrase                          173.45   (2.7%)                  174.55   (2.6%)     0.6% (  -4% -   6%)  0.443
MedSpanNear                         17.77   (4.5%)                   17.88   (4.8%)     0.6% (  -8% -  10%)  0.661
OrHighNotLow                      1396.26   (5.6%)                 1406.85   (6.4%)     0.8% ( -10% -  13%)  0.692
OrHighMed                          162.41   (5.3%)                  163.69   (4.8%)     0.8% (  -8% -  11%)  0.625
HighTermDayOfYearSort             1476.11   (2.7%)                 1488.26   (2.4%)     0.8% (  -4% -   6%)  0.312
MedIntervalsOrdered                113.65   (4.2%)                  114.59   (7.0%)     0.8% (  -9% -  12%)  0.652
OrHighLow                          828.13   (5.2%)                  835.45   (4.7%)     0.9% (  -8% -  11%)  0.574
MedTerm                           2356.21   (4.7%)                 2377.47   (5.0%)     0.9% (  -8% -  11%)  0.554
MedSloppyPhrase                     62.13   (3.4%)                   62.72   (3.9%)     0.9% (  -6% -   8%)  0.420
HighIntervalsOrdered                18.19   (5.7%)                   18.37   (8.6%)     1.0% ( -12% -  16%)  0.673
AndHighHigh                         54.46   (6.2%)                   55.01   (6.3%)     1.0% ( -10% -  14%)  0.615
LowTerm                           2247.13   (4.7%)                 2270.19   (3.7%)     1.0% (  -7% -   9%)  0.446
OrNotHighLow                      1728.71   (4.3%)                 1748.19   (4.7%)     1.1% (  -7% -  10%)  0.427
HighTermTitleBDVSort                14.31   (3.3%)                   14.47
[GitHub] [lucene] jpountz commented on pull request #1007: Small tweak to PointRangeQuery#visit logic
jpountz commented on PR #1007: URL: https://github.com/apache/lucene/pull/1007#issuecomment-1185610823 This looks right to me, hopefully @iverase can confirm.
[GitHub] [lucene] jpountz commented on a diff in pull request #1018: LUCENE-10480: Use BulkScorer to limit BMMScorer to only top-level disjunctions
jpountz commented on code in PR #1018: URL: https://github.com/apache/lucene/pull/1018#discussion_r97906 ## lucene/core/src/java/org/apache/lucene/search/BooleanWeight.java: ## @@ -191,6 +191,69 @@ public long cost() { // or null if it is not applicable // pkg-private for forcing use of BooleanScorer in tests BulkScorer optionalBulkScorer(LeafReaderContext context) throws IOException { +if (scoreMode == ScoreMode.TOP_SCORES) { + if (query.getMinimumNumberShouldMatch() > 1 || weightedClauses.size() > 2) { +return null; + } + + List optional = new ArrayList<>(); + for (WeightedBooleanClause wc : weightedClauses) { +Weight w = wc.weight; +BooleanClause c = wc.clause; +if (c.getOccur() != Occur.SHOULD) { + continue; +} +ScorerSupplier scorer = w.scorerSupplier(context); +if (scorer != null) { + optional.add(scorer); +} + } + + if (optional.size() <= 1) { +return null; + } + + List optionalScorers = new ArrayList<>(); + for (ScorerSupplier ss : optional) { +optionalScorers.add(ss.get(Long.MAX_VALUE)); + } + + return new BulkScorer() { Review Comment: I wonder if we could reuse `DefaultBulkScorer` instead of this anonymous bulk scorer? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-jira-archive] mocobeta commented on issue #6: Document issue label / template management policy
mocobeta commented on issue #6: URL: https://github.com/apache/lucene-jira-archive/issues/6#issuecomment-1185547448
Issue templates proposal:
- Bug Report - this is associated with `type:bug` label
- Enhancement Request - this is associated with `type:enhancement` label
- Test Failure Report - this is associated with `type:testFailure` label
- Task - this is associated with `type:task` label
- Documentation - this is associated with `type:documentation` label
- Question - this creates no issues; is used to guide users to mailing lists
[GitHub] [lucene-jira-archive] mocobeta closed issue #45: Bug
mocobeta closed issue #45: Bug URL: https://github.com/apache/lucene-jira-archive/issues/45
[GitHub] [lucene-jira-archive] mocobeta commented on issue #45: Bug
mocobeta commented on issue #45: URL: https://github.com/apache/lucene-jira-archive/issues/45#issuecomment-1185540048 Ok, the label was set as expected. (looks like it has to be created beforehand).
[GitHub] [lucene-jira-archive] mocobeta opened a new issue, #45: Bug
mocobeta opened a new issue, #45: URL: https://github.com/apache/lucene-jira-archive/issues/45 ### Description This is a test issue (take 2). ### Version and Environments _No response_
[GitHub] [lucene-jira-archive] mocobeta commented on issue #44: Bug
mocobeta commented on issue #44: URL: https://github.com/apache/lucene-jira-archive/issues/44#issuecomment-1185537568 The issue label was not set...
[GitHub] [lucene-jira-archive] mocobeta closed issue #44: Bug
mocobeta closed issue #44: Bug URL: https://github.com/apache/lucene-jira-archive/issues/44
[GitHub] [lucene-jira-archive] mocobeta opened a new issue, #44: Bug
mocobeta opened a new issue, #44: URL: https://github.com/apache/lucene-jira-archive/issues/44 ### Description This is a test bug report. ### Version and Environments _No response_
[GitHub] [lucene-jira-archive] mocobeta commented on issue #4: Which GitHub account we should/can use for migration?
mocobeta commented on issue #4: URL: https://github.com/apache/lucene-jira-archive/issues/4#issuecomment-1185506588 There have been no additional comments/requests. I decided to use my account for the second pass (updating step after importing) since I don't think we should bother infra with asking to run a time-consuming job that can be done by ourselves.
[GitHub] [lucene-jira-archive] mocobeta commented on issue #3: Create mapping on Jira user id -> GitHub account
mocobeta commented on issue #3: URL: https://github.com/apache/lucene-jira-archive/issues/3#issuecomment-1185499857
Tasks to be done:
- [ ] regenerate a candidate mapping (on July 24th)
- [ ] manually make a "verified" mapping and commit it to `main` (on July 24th or 25th)
- [ ] send a mail to the dev list to let others browse/check both "candidate" and "verified" mappings (on July 25th)
- [ ] accept pull requests to add/edit the mapping
- [ ] fix the final mapping (on August 7th)
[GitHub] [lucene-jira-archive] mocobeta merged pull request #34: Add a tool to generate account mapping
mocobeta merged PR #34: URL: https://github.com/apache/lucene-jira-archive/pull/34
[GitHub] [lucene-jira-archive] mocobeta commented on pull request #34: Add a tool to generate account mapping
mocobeta commented on PR #34: URL: https://github.com/apache/lucene-jira-archive/pull/34#issuecomment-1185491422
Here's the re-taken candidate and verified (with [the above criteria](https://github.com/apache/lucene-jira-archive/pull/34#issuecomment-1185313945)) mapping. https://github.com/apache/lucene-jira-archive/pull/34/commits/b44bd73626fc9490b0da9437e54c156f4a361b32
- 5792 candidate mapping
- 163 verified mapping
[jira] [Commented] (LUCENE-7713) Optimize TopFieldDocCollector for the sorted case
[ https://issues.apache.org/jira/browse/LUCENE-7713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567202#comment-17567202 ] Lu Xugang commented on LUCENE-7713: --- Hi [~jpountz], it seems there is no need to compare hits or collect them through the PriorityQueue when the search sort order is a prefix of the index sort order. This issue is still open; should we implement this optimization? > Optimize TopFieldDocCollector for the sorted case > - > > Key: LUCENE-7713 > URL: https://issues.apache.org/jira/browse/LUCENE-7713 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Adrien Grand > Priority: Minor > > When the sort order is a prefix of the index sort order, > {{TopFieldDocCollector}} could skip reading doc values and comparing them > against the bottom value after {{numHits}} documents have been collected, and > just count matches.
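The optimization described in LUCENE-7713 can be sketched without Lucene as follows. This is a simplified illustration under one stated assumption: documents reach the collector already in the requested sort order (that is, the sort is a prefix of the index sort). The class `SortedIndexCollector` and its methods are hypothetical and are not `TopFieldDocCollector`'s actual API.

```java
import java.util.Arrays;

/** Hypothetical sketch; assumes docs are collected in index-sort order. */
final class SortedIndexCollector {
  private final int[] topDocs;
  private int collected;
  private int totalHits;

  SortedIndexCollector(int numHits) {
    this.topDocs = new int[numHits];
  }

  /** Called once per matching doc, in index order. */
  void collect(int doc) {
    totalHits++;
    if (collected < topDocs.length) {
      // The first numHits docs are the top hits by construction: no queue, no comparisons.
      topDocs[collected++] = doc;
    }
    // After numHits docs, nothing later can be competitive, so no doc values are read
    // and no comparison against a bottom value is made; the match is only counted.
  }

  int totalHits() {
    return totalHits;
  }

  int[] topDocs() {
    return Arrays.copyOf(topDocs, collected);
  }
}
```

If an exact total hit count is not required, collection could stop entirely after `numHits` documents instead of continuing to count.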
[GitHub] [lucene-jira-archive] mikemccand commented on pull request #34: Add a tool to generate account mapping
mikemccand commented on PR #34: URL: https://github.com/apache/lucene-jira-archive/pull/34#issuecomment-1185412128
Wow, the mapping file is massive! 5,793 developers. We've had so many contributors over the years ;) Inspiring.
> I'd put priority on avoiding false positives.
+1
[GitHub] [lucene-jira-archive] mocobeta commented on pull request #34: Add a tool to generate account mapping
mocobeta commented on PR #34: URL: https://github.com/apache/lucene-jira-archive/pull/34#issuecomment-1185313945
For verification, I'll do
1. Check if the candidate github account has push access on apache/lucene repo.
2. Check if the candidate github account has been logged as "author" at least once in the commit history.
For accounts that do not satisfy the above criteria, I would just omit them. There should be some false negatives (for example, Jira users who reported issues but were not logged in commit history are omitted). I'd put priority on avoiding false positives.
[GitHub] [lucene] LuXugang commented on a diff in pull request #767: LUCENE-10436: Deprecate DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery with FieldExistsQuery
LuXugang commented on code in PR #767:
URL: https://github.com/apache/lucene/pull/767#discussion_r845715708

## lucene/core/src/java/org/apache/lucene/search/FieldExistsQuery.java: ##
@@ -0,0 +1,228 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import java.util.Objects;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.DocValuesType;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.FieldInfos;
+import org.apache.lucene.index.IndexOptions;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.LeafReader;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.PointValues;
+import org.apache.lucene.index.Terms;
+
+/**
+ * A {@link Query} that matches documents that contain either a {@link
+ * org.apache.lucene.document.KnnVectorField}, or a field that indexes norms or doc values.
+ */
+public class FieldExistsQuery extends Query {
+  private String field;
+
+  /** Create a query that will match that have a value for the given {@code field}. */
+  public FieldExistsQuery(String field) {
+    this.field = Objects.requireNonNull(field);
+  }
+
+  public String getField() {
+    return field;
+  }
+
+  @Override
+  public String toString(String field) {
+    return "FieldExistsQuery [field=" + this.field + "]";
+  }
+
+  @Override
+  public void visit(QueryVisitor visitor) {
+    if (visitor.acceptField(field)) {
+      visitor.visitLeaf(this);
+    }
+  }
+
+  @Override
+  public boolean equals(Object other) {
+    return sameClassAs(other) && field.equals(((FieldExistsQuery) other).field);
+  }
+
+  @Override
+  public int hashCode() {
+    final int prime = 31;
+    int hash = classHash();
+    hash = prime * hash + field.hashCode();
+    return hash;
+  }
+
+  @Override
+  public Query rewrite(IndexReader reader) throws IOException {
+    boolean allReadersRewritable = true;
+
+    for (LeafReaderContext context : reader.leaves()) {
+      LeafReader leaf = context.reader();
+      FieldInfos fieldInfos = leaf.getFieldInfos();
+      FieldInfo fieldInfo = fieldInfos.fieldInfo(field);
+
+      if (fieldInfo == null) {
+        allReadersRewritable = false;
+        break;
+      }
+
+      if (fieldInfo.hasNorms()) { // the field indexes norms
+        if (reader.getDocCount(field) != reader.maxDoc()) {
+          allReadersRewritable = false;
+          break;
+        }
+      } else if (fieldInfo.getVectorDimension() != 0) { // the field indexes vectors
+        if (leaf.getVectorValues(field).size() != reader.maxDoc()) {
+          allReadersRewritable = false;
+          break;
+        }
+      } else if (fieldInfo.getDocValuesType()
+          != DocValuesType.NONE) { // the field indexes doc values or points
+
+        // This optimization is possible due to LUCENE-9334 enforcing a field to always uses the
+        // same data structures (all or nothing). Since there's no index statistic to detect when
+        // all documents have doc values for a specific field, FieldExistsQuery can only be
+        // rewritten to MatchAllDocsQuery for doc values field, when that same field also indexes
+        // terms or point values which do have index statistics, and those statistics confirm that
+        // all documents in this segment have values terms or point values.
+
+        Terms terms = leaf.terms(field);
+        PointValues pointValues = leaf.getPointValues(field);
+
+        if ((terms == null || terms.getDocCount() != leaf.maxDoc())
+            && (pointValues == null || pointValues.getDocCount() != leaf.maxDoc())) {
+          allReadersRewritable = false;
+          break;
+        }
+      } else {
+        throw new IllegalStateException(buildErrorMsg(fieldInfo));
+      }
+    }
+    if (allReadersRewritable) {
+      return new MatchAllDocsQuery();
+    }
+    return super.rewrite(reader);
+  }
+
+  @Override
+  public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) {
+    return new ConstantScoreWeight(this, boost) {
+      @Override
+      public Scorer scorer(LeafReaderContext context) throws IOExcept