[GitHub] [lucene-jira-archive] mocobeta opened a new issue, #15: Make a script to add comments to all Jira issues to indicate that "this was moved to GitHub"
mocobeta opened a new issue, #15: URL: https://github.com/apache/lucene-jira-archive/issues/15

After the migration, Jira issues should no longer be updated. To prevent further updates on Jira and to guide people who land on a Jira issue over to GitHub, a comment should be added to each Jira issue that indicates the corresponding GitHub issue URL. The issue mapping will be provided by a CSV file.

Mapping file format (example):
```
JiraKey,GitHubUrl,GitHubNumber
LUCENE-10605,https://github.com/mocobeta/migration-test-3/issues/37,37
```

The Jira comment could be:
```
This was moved to GitHub issue. See .
```

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
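The proposed script could look roughly like the following Python sketch. All names here are hypothetical (the real script lives in lucene-jira-archive and may differ), and the actual posting step — e.g. `POST /rest/api/2/issue/{key}/comment` against Jira's REST API — is abstracted behind a callback:

```python
# Hypothetical sketch of the CSV-driven commenting script described above.
import csv

# Template for the redirect comment; the real wording/template may differ.
JIRA_COMMENT_TEMPLATE = "This was moved to GitHub issue. See {url}."

def load_mapping(csv_path):
    """Parse the mapping CSV (JiraKey,GitHubUrl,GitHubNumber) into a dict."""
    with open(csv_path, newline="") as f:
        return {row["JiraKey"]: row["GitHubUrl"] for row in csv.DictReader(f)}

def build_comment(github_url):
    """Render the comment body that points readers at the GitHub issue."""
    return JIRA_COMMENT_TEMPLATE.format(url=github_url)

def add_comments(mapping, post_comment):
    """Post one redirect comment per Jira issue. post_comment(key, body)
    would wrap Jira's REST API (not shown here)."""
    for jira_key, github_url in mapping.items():
        post_comment(jira_key, build_comment(github_url))
```

Keeping the HTTP call behind `post_comment` makes the CSV parsing and comment rendering trivially testable without touching Jira.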
[GitHub] [lucene-jira-archive] mocobeta opened a new issue, #16: Break up issue updating script into small sub-steps
mocobeta opened a new issue, #16: URL: https://github.com/apache/lucene-jira-archive/issues/16

The current issue updating script (for the "second pass") does three things:
1. iterate over all GitHub issues/comments
2. create re-mapped cross-issue links
3. update issues/comments that include cross-issue links

This can be split up into smaller scripts:
- export script: iterate over issues/comments with their IDs from GitHub and save them to local files
- convert script: modify issues/comments to create cross-issue links
- update script: update issues/comments using the result of the convert script

This breakup makes the updating step an idempotent operation, in exchange for additional steps/time.
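The proposed three-step split can be sketched like this (illustrative Python with invented function names, not the real migration code). Because each step reads from and writes to local files, the final update step can be re-run safely if it fails partway through:

```python
# Hypothetical sketch of the export -> convert -> update pipeline.
import json
import re

def export_issues(fetch_all, dump_path):
    """Step 1: save all issue bodies fetched from GitHub to a local file."""
    with open(dump_path, "w") as f:
        json.dump(fetch_all(), f)

def convert_issues(dump_path, converted_path, jira_to_github):
    """Step 2: rewrite LUCENE-xxxx cross-issue links to GitHub issue refs."""
    with open(dump_path) as f:
        issues = json.load(f)
    for issue in issues:
        # Replace each Jira key with its mapped GitHub reference, if known.
        issue["body"] = re.sub(
            r"LUCENE-\d+",
            lambda m: jira_to_github.get(m.group(0), m.group(0)),
            issue["body"])
    with open(converted_path, "w") as f:
        json.dump(issues, f)

def update_issues(converted_path, push_issue):
    """Step 3: push converted bodies back to GitHub. Re-running this step
    just re-pushes the same final content, so it is idempotent."""
    with open(converted_path) as f:
        for issue in json.load(f):
            push_issue(issue["number"], issue["body"])
```

The idempotency claim in the issue falls out of the structure: step 3 derives everything from the converted file, so repeating it converges to the same state.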
[jira] [Commented] (LUCENE-10636) Could the partial score sum from essential list scores be cached?
[ https://issues.apache.org/jira/browse/LUCENE-10636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562453#comment-17562453 ]

ASF subversion and git services commented on LUCENE-10636:
----------------------------------------------------------

Commit 3dd9a5487c2c3994abdaf5ab0553a3d78ebe50ab in lucene's branch refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=3dd9a5487c2 ]

LUCENE-10636: Avoid computing the same scores multiple times. (#1005)

`BlockMaxMaxscoreScorer` would previously compute the score twice for essential scorers.

Co-authored-by: zacharymorn

> Could the partial score sum from essential list scores be cached?
> -----------------------------------------------------------------
>
>                 Key: LUCENE-10636
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10636
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Zach Chen
>            Priority: Minor
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> This is a follow-up issue from the discussion at
> [https://github.com/apache/lucene/pull/972#discussion_r909300200]. Currently,
> in the implementation of BlockMaxMaxscoreScorer, there is duplicated
> computation when summing up scores from essential list scorers. We would like
> to see whether this duplicated computation can be cached without introducing
> much overhead or a data structure that might outweigh the benefit of caching.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
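The fix the commit describes — not computing the sum of essential-clause scores twice for the same document — can be illustrated with a small sketch. This is simplified Python with invented names, not Lucene's actual Java `BlockMaxMaxscoreScorer`:

```python
# Sketch of per-document caching of a partial score sum: the sum computed
# for the competitiveness check is remembered so score() does not redo it.
class CachedSumScorer:
    def __init__(self, essential_scorers):
        self.essential_scorers = essential_scorers
        self._cached_doc = -1     # doc ID the cached sum belongs to
        self._cached_sum = 0.0

    def _sum_essential(self, doc):
        # Recompute only when we move to a new document.
        if doc != self._cached_doc:
            self._cached_sum = sum(s.score(doc) for s in self.essential_scorers)
            self._cached_doc = doc
        return self._cached_sum

    def score(self, doc):
        # Both the max-score pruning check and the final score can call
        # this; the underlying scorers are consulted at most once per doc.
        return self._sum_essential(doc)
```

The cache is a single (doc, sum) pair, so it adds no data structure that could outweigh the benefit — which is exactly the concern raised in the issue description.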
[jira] [Commented] (LUCENE-10636) Could the partial score sum from essential list scores be cached?
[ https://issues.apache.org/jira/browse/LUCENE-10636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562464#comment-17562464 ]

ASF subversion and git services commented on LUCENE-10636:
----------------------------------------------------------

Commit 2d05f5c623e06b8bafa1f7b1d6be813c14550690 in lucene's branch refs/heads/branch_9x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=2d05f5c623e ]

LUCENE-10636: Avoid computing the same scores multiple times. (#1005)

`BlockMaxMaxscoreScorer` would previously compute the score twice for essential scorers.

Co-authored-by: zacharymorn
[jira] [Resolved] (LUCENE-10636) Could the partial score sum from essential list scores be cached?
[ https://issues.apache.org/jira/browse/LUCENE-10636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand resolved LUCENE-10636.
-----------------------------------
    Fix Version/s: 9.3
       Resolution: Fixed
[GitHub] [lucene] jpountz merged pull request #1005: LUCENE-10636: Avoid computing the same scores multiple times.
jpountz merged PR #1005: URL: https://github.com/apache/lucene/pull/1005
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova commented on code in PR #992: URL: https://github.com/apache/lucene/pull/992#discussion_r913896000

## lucene/backward-codecs/src/test/org/apache/lucene/backward_codecs/lucene91/Lucene91HnswVectorsWriter.java:
```
@@ -58,7 +57,6 @@ public final class Lucene91HnswVectorsWriter extends KnnVectorsWriter {
     this.maxConn = maxConn;
     this.beamWidth = beamWidth;
-    assert state.fieldInfos.hasVectorValues();
```
Review Comment:
   In the new model, we initialize vectors' writers during indexing, where the `SegmentWriteState` object, along with fully filled `fieldInfos`, is not yet available (it becomes available during flush).
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562711#comment-17562711 ]

Adrien Grand commented on LUCENE-10480:
---------------------------------------

Nightly benchmarks picked up the change and top-level disjunctions are seeing massive speedups, see [OrHighHigh|http://people.apache.org/~mikemccand/lucenebench/OrHighHigh.html] or [OrHighMed|http://people.apache.org/~mikemccand/lucenebench/OrHighMed.html]. However, disjunctions within conjunctions got a slowdown, see [AndHighOrMedMed|http://people.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html] or [AndMedOrHighHigh|http://people.apache.org/~mikemccand/lucenebench/AndMedOrHighHigh.html].

> Specialize 2-clauses disjunctions
> ---------------------------------
>
>                 Key: LUCENE-10480
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10480
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> WANDScorer is nice, but it also has lots of overhead to maintain its
> invariants: one linked list for the current candidates, one priority queue of
> scorers that are behind, another one for scorers that are ahead. All this
> could be simplified in the 2-clauses case, which feels worth specializing for
> as it's very common that end users enter queries that only have two terms?
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562730#comment-17562730 ]

Adrien Grand commented on LUCENE-10480:
---------------------------------------

Looking at this new scorer from the perspective of disjunctions within conjunctions, maybe there are bits from advance() that we could move to matches() so that we would hand it over to the other clause before we start doing expensive operations like computing scores. What do you think [~zacharymorn]?
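Moving work from advance() to matches() is the essence of two-phase iteration: advance() only positions on a cheap candidate, and the expensive verification runs lazily, so in a conjunction the other clause can reject a doc before the expensive work ever happens. A simplified sketch (illustrative Python with invented names, not Lucene's `TwoPhaseIterator` API):

```python
# Sketch of deferring expensive per-doc work from advance() to matches().
class TwoPhaseDisjunction:
    def __init__(self, docs, expensive_check):
        self._docs = sorted(docs)
        self._expensive_check = expensive_check
        self.doc = -1
        self.expensive_calls = 0  # instrumentation for illustration

    def advance(self, target):
        # Cheap phase: just find the next candidate doc at or after target.
        self.doc = next((d for d in self._docs if d >= target), None)
        return self.doc

    def matches(self):
        # Expensive phase: run only when the caller actually needs
        # confirmation (e.g. after all conjunction clauses agree on a doc).
        self.expensive_calls += 1
        return self._expensive_check(self.doc)
```

A conjunction driver would first advance() all clauses to a common doc and only then call matches() on each, which is why deferring score computation into matches() can help the AndHighOrMedMed-style tasks that regressed.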
[jira] [Commented] (LUCENE-8806) WANDScorer should support two-phase iterator
[ https://issues.apache.org/jira/browse/LUCENE-8806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562758#comment-17562758 ]

Denilson Amorim commented on LUCENE-8806:
-----------------------------------------

Before reformulating my question, let me see if I understood this patch and the discussion correctly: WANDScorer doesn't support calling children's two-phase iterators. Therefore, in an attempt to improve performance, this patch adds calls to these two-phase iterators in WAND. However, it didn't perform well in phrase query benchmarks because its max score calculation wasn't per-block. [~jim.ferenczi] hacked together a solution during the discussion here to get per-block max scores in phrase scorers, with a positive outcome. After the discussion went idle, phrase scorers received support for per-block max scores through LUCENE-8311, but this patch hasn't moved. So I was wondering whether it makes sense to move this patch forward. Thanks in advance.

> WANDScorer should support two-phase iterator
> --------------------------------------------
>
>                 Key: LUCENE-8806
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8806
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>            Priority: Major
>         Attachments: LUCENE-8806.patch, LUCENE-8806.patch
>
> Following https://issues.apache.org/jira/browse/LUCENE-8770, the WANDScorer
> should leverage two-phase iterators in order to be faster when used in
> conjunctions.
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova commented on code in PR #992: URL: https://github.com/apache/lucene/pull/992#discussion_r914008509

## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java:
```
@@ -203,8 +204,11 @@ private NeighborQueue searchLevel(
     return results;
   }

-  private void clearScratchState() {
+  private void clearScratchState(int capacity) {
     candidates.clear();
+    if (visited.length() < capacity) {
+      visited = FixedBitSet.ensureCapacity((FixedBitSet) visited, capacity);
```
Review Comment:
   One thing to note is that we shouldn't create new objects too often, as over-allocation happens exponentially.
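The over-allocation pattern behind `FixedBitSet.ensureCapacity` can be sketched as follows. This is illustrative Python, not Lucene's Java code, and the growth factor here is an assumption (Lucene uses its own oversize policy): growing geometrically means a sequence of increasing capacity requests triggers only O(log n) reallocations instead of one per request.

```python
# Sketch of ensureCapacity-style geometric growth for a bit set.
def ensure_capacity(bits, num_bits, grow_factor=1.5):
    """Return `bits` unchanged if it is already large enough; otherwise
    return a grown copy whose size over-shoots the request geometrically."""
    if len(bits) >= num_bits:
        return bits  # reuse the existing object, no allocation
    # Over-allocate so the next few requests are absorbed for free.
    new_len = max(num_bits, int(len(bits) * grow_factor) + 1)
    return bits + [False] * (new_len - len(bits))
```

This is why the review comment cares about not reallocating too often: each reallocation copies the whole set, and the over-shoot only amortizes that cost if callers reuse the grown object.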
[jira] [Commented] (LUCENE-10639) WANDScorer performs better without two-phase
[ https://issues.apache.org/jira/browse/LUCENE-10639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562786#comment-17562786 ]

Greg Miller commented on LUCENE-10639:
--------------------------------------

As a quick update, I ran benchmarks with just [livedoc checking broken out|https://github.com/gsmiller/lucene/commit/f4e9614a299523b57c854a3bd3371253f0a7fb17] in {{DefaultBulkScorer}}. Surprisingly, I didn't see any difference, so maybe something else is going on here. Note that I ran this with {{wikimedium10m}} instead of {{all}} to get a datapoint a bit quicker:

{code:java}
Task                         QPS baseline (StdDev)   QPS candidate (StdDev)   Pct diff               p-value
Prefix3                      118.98 (10.2%)          114.60 (9.9%)            -3.7% ( -21% - 18%)    0.247
Wildcard                     40.69 (6.9%)            39.62 (7.2%)             -2.6% ( -15% - 12%)    0.236
TermDTSort                   17.76 (20.4%)           17.33 (14.2%)            -2.4% ( -30% - 40%)    0.663
OrNotHighHigh                881.01 (4.4%)           861.34 (3.9%)            -2.2% ( -10% - 6%)     0.089
AndHighHigh                  8.87 (5.0%)             8.70 (6.2%)              -1.8% ( -12% - 9%)     0.296
MedTerm                      1771.40 (4.2%)          1740.50 (4.4%)           -1.7% ( -9% - 7%)      0.198
AndHighMed                   30.59 (4.0%)            30.06 (5.6%)             -1.7% ( -10% - 8%)     0.267
OrHighNotLow                 782.90 (4.8%)           769.92 (5.1%)            -1.7% ( -11% - 8%)     0.291
HighPhrase                   392.18 (2.7%)           386.50 (2.7%)            -1.4% ( -6% - 4%)      0.087
OrHighNotHigh                830.76 (4.3%)           818.83 (4.3%)            -1.4% ( -9% - 7%)      0.295
OrNotHighMed                 585.86 (2.6%)           578.07 (3.1%)            -1.3% ( -6% - 4%)      0.146
OrHighNotMed                 966.75 (3.6%)           956.07 (3.9%)            -1.1% ( -8% - 6%)      0.352
LowPhrase                    546.02 (2.1%)           540.42 (2.4%)            -1.0% ( -5% - 3%)      0.148
MedPhrase                    24.65 (2.3%)            24.40 (3.0%)             -1.0% ( -6% - 4%)      0.225
AndHighLow                   508.37 (3.7%)           503.84 (4.7%)            -0.9% ( -8% - 7%)      0.506
OrNotHighLow                 672.15 (2.7%)           666.29 (2.8%)            -0.9% ( -6% - 4%)      0.313
BrowseMonthTaxoFacets        8.92 (14.5%)            8.84 (13.9%)             -0.9% ( -25% - 32%)    0.846
AndHighMedDayTaxoFacets      39.14 (2.2%)            38.82 (2.2%)             -0.8% ( -5% - 3%)      0.241
AndHighHighDayTaxoFacets     8.01 (2.8%)             7.96 (2.8%)              -0.7% ( -6% - 4%)      0.416
LowSloppyPhrase              5.83 (3.8%)             5.79 (3.8%)              -0.7% ( -8% - 7%)      0.556
OrHighLow                    128.01 (3.7%)           127.11 (3.8%)            -0.7% ( -7% - 7%)      0.554
HighTerm                     1190.03 (4.4%)          1183.10 (4.1%)           -0.6% ( -8% - 8%)      0.663
MedSloppyPhrase              11.67 (2.1%)            11.61 (2.6%)             -0.5% ( -5% - 4%)      0.480
MedTermDayTaxoFacets         14.09 (3.1%)            14.03 (4.1%)             -0.5% ( -7% - 6%)      0.686
IntNRQ                       110.15 (2.3%)           109.69 (2.1%)            -0.4% ( -4% - 4%)      0.546
HighSloppyPhrase             9.56 (4.5%)             9.53 (4.5%)              -0.4% ( -8% - 9%)      0.794
BrowseDateSSDVFacets         0.85 (10.4%)            0.85 (10.8%)             -0.3% ( -19% - 23%)    0.939
Respell                      33.65 (1.7%)            33.58 (1.7%)             -0.2% ( -3% - 3%)      0.684
Fuzzy2                       74.16 (1.9%)            74.02 (1.7%)             -0.2% ( -3% - 3%)      0.740
LowTerm                      1522.48 (2.9%)          1520.76 (3.3%)           -0.1% ( -6% - 6%)      0.909
LowIntervalsOrdered          12.75 (3.3%)            12.74 (3.3%)             -0.1% ( -6% - 6%)      0.915
HighIntervalsOrdered         6.30 (4.2%)             6.31 (4.0%)              0.1% ( -7% - 8%)       0.923
BrowseRandomLabelSSDVFacets  2.57 (4.9%)             2.57 (4.9%)              0.1% ( -9% - 10%)      0.927
Fuzzy1                       57.11 (1.9%)            57.26 (1.7%)             0.2% ( -3% - 3%)       0.666
BrowseRandomLabelTaxoFacets  6.32 (9.3%)             6.34 (10.3%)             0.3% ( -17% - 21%)     0.911
LowSpanNear                  15.95 (2.9%)            16.01 (2.7%)             0.4% ( -5% - 6%)       0.680
MedIntervalsOrdered          1.61 (5.8%)             1.62 (5.8%)              0.4% ( -10% - 12%)     0.834
HighSpanNear                 2.27 (4.2%)             2.28 (4.0%)              0.6% ( -7% -
[GitHub] [lucene] gsmiller commented on a diff in pull request #974: LUCENE-10614: Properly support getTopChildren in RangeFacetCounts
gsmiller commented on code in PR #974: URL: https://github.com/apache/lucene/pull/974#discussion_r914082183

## lucene/demo/src/java/org/apache/lucene/demo/facet/DistanceFacetsExample.java:
```
@@ -212,7 +212,26 @@ public static Query getBoundingBoxQuery(
   }

   /** User runs a query and counts facets. */
-  public FacetResult search() throws IOException {
+  public FacetResult searchAllChildren() throws IOException {
+
+    FacetsCollector fc = searcher.search(new MatchAllDocsQuery(), new FacetsCollectorManager());
+
+    Facets facets =
+        new DoubleRangeFacetCounts(
+            "field",
+            getDistanceValueSource(),
+            fc,
+            getBoundingBoxQuery(ORIGIN_LATITUDE, ORIGIN_LONGITUDE, 10.0),
+            ONE_KM,
+            TWO_KM,
+            FIVE_KM,
+            TEN_KM);
+
+    return facets.getAllChildren("field");
+  }
+
+  /** User runs a query and counts facets. */
+  public FacetResult searchTopChildren() throws IOException {
```
Review Comment:
   I see. OK, a couple of things confused me in your code. It looks like it's doing what I describe, but what threw me is that 1) you're updating a variable named `currentTime` to represent the end of each range, and 2) the label says "Past ...", which made me think it was a trailing window. As a couple of suggestions, maybe rename your `then` and `currentTime` variables to something like `startTime` and `endTime`? And maybe rename the labels to just something like "Hour x - y"?
[GitHub] [lucene] gsmiller commented on a diff in pull request #974: LUCENE-10614: Properly support getTopChildren in RangeFacetCounts
gsmiller commented on code in PR #974: URL: https://github.com/apache/lucene/pull/974#discussion_r914083105

## lucene/demo/src/java/org/apache/lucene/demo/facet/RangeFacetsExample.java:
```
@@ -73,6 +76,30 @@ public void index() throws IOException {
       indexWriter.addDocument(doc);
     }

+    // Add documents with a fake timestamp, 3600 sec (1 hour) before "now", 7200 sec (2
+    // hours) before "now", ...:
+    long currentTime = System.currentTimeMillis() / 1000L;
+    // Index error messages for the past week (24 * 7 = 168 hours)
+    for (int i = 0; i < 168; i++) {
+      long then = currentTime - (i + 1) * 3600;
+
+      // conditionally add different number of error messages to the past hour slot
+      for (int j = 0; j < i % 35; j++) {
```
Review Comment:
   Why mod by `35`? I'm not getting it. Is there a specific reason you're using that value? If so, could you add a comment?

## lucene/demo/src/java/org/apache/lucene/demo/facet/RangeFacetsExample.java:
```
@@ -73,6 +76,30 @@ public void index() throws IOException {
       indexWriter.addDocument(doc);
     }

+    // Add documents with a fake timestamp, 3600 sec (1 hour) before "now", 7200 sec (2
+    // hours) before "now", ...:
+    long currentTime = System.currentTimeMillis() / 1000L;
+    // Index error messages for the past week (24 * 7 = 168 hours)
+    for (int i = 0; i < 168; i++) {
```
Review Comment:
   minor: I find it a bit confusing that you index the data and create ranges in "reverse", starting from "now". Would it be easier for others to understand if you started "one week ago" and looped "forwards"? Since the demo code is here to help people understand the functionality, I want to make sure we don't create unnecessary confusion.
## lucene/demo/src/java/org/apache/lucene/demo/facet/RangeFacetsExample.java:
```
@@ -73,6 +76,30 @@ public void index() throws IOException {
       indexWriter.addDocument(doc);
     }

+    // Add documents with a fake timestamp, 3600 sec (1 hour) before "now", 7200 sec (2
+    // hours) before "now", ...:
+    long currentTime = System.currentTimeMillis() / 1000L;
+    // Index error messages for the past week (24 * 7 = 168 hours)
+    for (int i = 0; i < 168; i++) {
+      long then = currentTime - (i + 1) * 3600;
+
+      // conditionally add different number of error messages to the past hour slot
+      for (int j = 0; j < i % 35; j++) {
+        Document doc = new Document();
+        doc.add(new NumericDocValuesField("error log", then));
```
Review Comment:
   Could we add some "jitter" to what gets indexed so all the timestamps of the "fake error logs" don't fall right on hour boundaries? That would be a bit more realistic as an example.

## lucene/demo/src/java/org/apache/lucene/demo/facet/RangeFacetsExample.java:
```
@@ -73,6 +76,30 @@ public void index() throws IOException {
       indexWriter.addDocument(doc);
     }

+    // Add documents with a fake timestamp, 3600 sec (1 hour) before "now", 7200 sec (2
+    // hours) before "now", ...:
+    long currentTime = System.currentTimeMillis() / 1000L;
+    // Index error messages for the past week (24 * 7 = 168 hours)
+    for (int i = 0; i < 168; i++) {
+      long then = currentTime - (i + 1) * 3600;
+
+      // conditionally add different number of error messages to the past hour slot
+      for (int j = 0; j < i % 35; j++) {
+        Document doc = new Document();
+        doc.add(new NumericDocValuesField("error log", then));
+        doc.add(
+            new StringField(
+                "Error msg", "[Error] Server encountered error at " + currentTime, Field.Store.NO));
```
Review Comment:
   Shouldn't the "logged" timestamp here be the same as the one indexed in the previous line (i.e., `then` instead of `currentTime`)? I realize that doesn't impact the functionality of the example, but I'm just trying to avoid confusion.
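The reviewer's suggestions above (loop forwards from one week ago, add jitter inside each hour, vary the count per hour) can be sketched together. This is illustrative Python rather than the Java demo code, and `fake_error_timestamps` is an invented name:

```python
# Sketch of the demo's timestamp generation, applying the review suggestions.
import random

HOUR = 3600
WEEK_HOURS = 24 * 7  # 168

def fake_error_timestamps(now, rng=None):
    """Generate fake error-log timestamps for the past week, iterating
    forwards from one week ago, with random jitter inside each hour."""
    rng = rng or random.Random(42)  # fixed seed for a reproducible demo
    one_week_ago = now - WEEK_HOURS * HOUR
    timestamps = []
    for hour in range(WEEK_HOURS):
        start_time = one_week_ago + hour * HOUR  # startTime/endTime naming
        # vary the number of errors per hour, as in the demo's `i % 35`
        for _ in range(hour % 35):
            jitter = rng.randrange(HOUR)  # spread inside the hour
            timestamps.append(start_time + jitter)
    return timestamps
```

Looping forwards makes the "Hour x - y" labels line up naturally with the loop index, and the jitter keeps timestamps off exact hour boundaries as requested.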
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova commented on code in PR #992: URL: https://github.com/apache/lucene/pull/992#discussion_r914113679

## lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java:
```
@@ -24,28 +24,40 @@
 import org.apache.lucene.index.DocIDMerger;
 import org.apache.lucene.index.FieldInfo;
 import org.apache.lucene.index.MergeState;
+import org.apache.lucene.index.Sorter;
 import org.apache.lucene.index.VectorValues;
 import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.util.Accountable;
 import org.apache.lucene.util.Bits;
 import org.apache.lucene.util.BytesRef;

 /** Writes vectors to an index. */
-public abstract class KnnVectorsWriter implements Closeable {
+public abstract class KnnVectorsWriter implements Accountable, Closeable {

   /** Sole constructor */
   protected KnnVectorsWriter() {}

-  /** Write all values contained in the provided reader */
-  public abstract void writeField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader)
+  /** Add new field for indexing */
+  public abstract void addField(FieldInfo fieldInfo) throws IOException;
+
+  /** Add new docID with its vector value to the given field for indexing */
+  public abstract void addValue(FieldInfo fieldInfo, int docID, float[] vectorValue)
+      throws IOException;
+
+  /** Flush all buffered data on disk */
+  public abstract void flush(int maxDoc, Sorter.DocMap sortMap) throws IOException;
+
+  /** Write field for merging */
+  public abstract void writeFieldForMerging(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader)
```
Review Comment:
   @jpountz Yes, indeed this is the same method as `mergeXXXField` in `DocValuesConsumer` or `mergeOneField` in `PointsWriter`. I am not quite clear on what you meant by "make this method responsible for creating the merged view (instead of doing it on top)" — can you please clarify?
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova commented on code in PR #992: URL: https://github.com/apache/lucene/pull/992#discussion_r914119921

## lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java:
```
+  /** Flush all buffered data on disk */
+  public abstract void flush(int maxDoc, Sorter.DocMap sortMap) throws IOException;
```
Review Comment:
   We need a `finish()` method separate from `flush`, as it is also used by the `merge` method.
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova commented on code in PR #992: URL: https://github.com/apache/lucene/pull/992#discussion_r914126920

## lucene/core/src/java/org/apache/lucene/index/VectorValuesConsumer.java:
```
@@ -0,0 +1,93 @@
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.lucene.index;

import java.io.IOException;
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.KnnVectorsWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.util.Accountable;
import org.apache.lucene.util.IOUtils;
import org.apache.lucene.util.InfoStream;

/**
 * Streams vector values for indexing to the given codec's vectors writer. The codec's vectors
 * writer is responsible for buffering and processing vectors.
 */
class VectorValuesConsumer {
  private final Codec codec;
  private final Directory directory;
  private final SegmentInfo segmentInfo;
  private final InfoStream infoStream;

  private Accountable accountable = Accountable.NULL_ACCOUNTABLE;
  private KnnVectorsWriter writer;

  VectorValuesConsumer(
      Codec codec, Directory directory, SegmentInfo segmentInfo, InfoStream infoStream) {
    this.codec = codec;
    this.directory = directory;
    this.segmentInfo = segmentInfo;
    this.infoStream = infoStream;
  }

  private void initKnnVectorsWriter(String fieldName) throws IOException {
    if (writer == null) {
      KnnVectorsFormat fmt = codec.knnVectorsFormat();
      if (fmt == null) {
        throw new IllegalStateException(
            "field=\""
                + fieldName
                + "\" was indexed as vectors but codec does not support vectors");
      }
      SegmentWriteState initialWriteState =
          new SegmentWriteState(infoStream, directory, segmentInfo, null, null, IOContext.DEFAULT);
      writer = fmt.fieldsWriter(initialWriteState);
      accountable = writer;
    }
  }

  public void addField(FieldInfo fieldInfo) throws IOException {
    initKnnVectorsWriter(fieldInfo.name);
    writer.addField(fieldInfo);
  }

  public void addValue(FieldInfo fieldInfo, int docID, float[] vectorValue) throws IOException {
    writer.addValue(fieldInfo, docID, vectorValue);
  }

  void flush(SegmentWriteState state, Sorter.DocMap sortMap) throws IOException {
```
Review Comment:
   No, I don't think we need it: we pass the information about the segment's maxDoc to `writer.flush`. Also, stored fields writers need to be passed every doc even if a doc doesn't contain stored fields, while vectors' writers only need to be passed docs that contain vectors.
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
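The `VectorValuesConsumer` reviewed above creates the codec's `KnnVectorsWriter` lazily: `initKnnVectorsWriter` only constructs the writer on the first `addField` call. The self-contained sketch below illustrates that lazy-initialization pattern in isolation; `LazyConsumer` and `MockWriter` are hypothetical stand-ins for illustration, not Lucene classes.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the lazy-initialization pattern used by VectorValuesConsumer:
// the (potentially expensive) writer is created only when the first field is added.
public class LazyConsumer {
  static class MockWriter {
    final List<String> fields = new ArrayList<>();
    void addField(String name) { fields.add(name); }
  }

  private MockWriter writer; // stays null until first use, like initKnnVectorsWriter

  private void initWriter() {
    if (writer == null) {
      writer = new MockWriter(); // expensive setup runs at most once
    }
  }

  public boolean isInitialized() { return writer != null; }

  public void addField(String name) {
    initWriter();
    writer.addField(name);
  }

  public int fieldCount() { return writer == null ? 0 : writer.fields.size(); }

  public static void main(String[] args) {
    LazyConsumer c = new LazyConsumer();
    System.out.println(c.isInitialized()); // false: no writer created yet
    c.addField("vector_field");
    System.out.println(c.isInitialized()); // true: created on first addField
  }
}
```

A segment that never indexes a vector field never pays for writer construction, which is the point of the pattern.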
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova commented on code in PR #992: URL: https://github.com/apache/lucene/pull/992#discussion_r914132447

## lucene/core/src/java/org/apache/lucene/index/VectorValuesWriter.java:
## @@ -26,233 +26,153 @@
 import org.apache.lucene.codecs.KnnVectorsWriter;
 import org.apache.lucene.search.DocIdSetIterator;
 import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.util.Accountable;
 import org.apache.lucene.util.ArrayUtil;
 import org.apache.lucene.util.Bits;
 import org.apache.lucene.util.BytesRef;
-import org.apache.lucene.util.Counter;
 import org.apache.lucene.util.RamUsageEstimator;

 /**
- * Buffers up pending vector value(s) per doc, then flushes when segment flushes.
+ * Buffers up pending vector value(s) per doc, then flushes when segment flushes. Used for {@code
+ * SimpleTextKnnVectorsWriter} and for vectors writers before v 9.3.
  *
  * @lucene.experimental
  */
-class VectorValuesWriter {
-
-  private final FieldInfo fieldInfo;
-  private final Counter iwBytesUsed;
-  private final List vectors = new ArrayList<>();
-  private final DocsWithFieldSet docsWithField;
-
-  private int lastDocID = -1;
-
-  private long bytesUsed;
-
-  VectorValuesWriter(FieldInfo fieldInfo, Counter iwBytesUsed) {
-    this.fieldInfo = fieldInfo;
-    this.iwBytesUsed = iwBytesUsed;
-    this.docsWithField = new DocsWithFieldSet();
-    this.bytesUsed = docsWithField.ramBytesUsed();
-    if (iwBytesUsed != null) {
-      iwBytesUsed.addAndGet(bytesUsed);
+public abstract class VectorValuesWriter extends KnnVectorsWriter {

Review Comment:
+1 for renaming.

> I also wonder if we could update SimpleTextKnnVectorsWriter to use the new writer interface. Then we could move this class to the backwards-codecs package, because it would only be used in the old codec tests.

This would mean we need to copy all the code from `BufferingKnnVectorsWriter` to `SimpleTextKnnVectorsWriter`? Are we ok with this?
[jira] [Commented] (LUCENE-10626) Hunspell: add tools to aid dictionary editing: analysis introspection, stem expansion and stem/flag suggestion
[ https://issues.apache.org/jira/browse/LUCENE-10626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562821#comment-17562821 ]

ASF subversion and git services commented on LUCENE-10626:
----------------------------------------------------------

Commit d537013e70872015364c745e5f320727efc034b7 in lucene's branch refs/heads/main from Peter Gromov
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d537013e708 ]

LUCENE-10626: Hunspell: add tools to aid dictionary editing: analysis introspection, stem expansion and stem/flag suggestion (#975)

> Hunspell: add tools to aid dictionary editing: analysis introspection, stem
> expansion and stem/flag suggestion
> --------------------------------------------------------------------------
>
> Key: LUCENE-10626
> URL: https://issues.apache.org/jira/browse/LUCENE-10626
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Peter Gromov
> Priority: Major
> Time Spent: 2h 40m
> Remaining Estimate: 0h
>
> The following tools would be nice to have when editing and appending an
> existing dictionary:
> 1. See how Hunspell analyzes a given word, with all the involved affix flags:
> `Hunspell.analyzeSimpleWord`
> 2. See all forms that the given stem can produce with the given flags:
> `Hunspell.expandRoot`, `WordFormGenerator.expandRoot`
> 3. Given a number of word forms, suggest a stem and a set of flags that
> produce these word forms: `Hunspell.compress`, `WordFormGenerator.compress`.

-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on a diff in pull request #1004: LUCENE-10603: Stop using SortedSetDocValues.NO_MORE_ORDS in tests
gsmiller commented on code in PR #1004: URL: https://github.com/apache/lucene/pull/1004#discussion_r914131349 ## lucene/memory/src/test/org/apache/lucene/index/memory/TestMemoryIndex.java: ## @@ -298,10 +298,10 @@ public void testDocValues() throws Exception { assertEquals(3, sortedSetDocValues.getValueCount()); assertEquals(0, sortedSetDocValues.nextDoc()); assertEquals(3, sortedSetDocValues.docValueCount()); +assertEquals(3, sortedSetDocValues.docValueCount()); Review Comment: Looks like we're already asserting this on the previous line :) ## lucene/core/src/test/org/apache/lucene/index/TestSortedSetDocValues.java: ## @@ -1,26 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.lucene.index; - -import org.apache.lucene.tests.util.LuceneTestCase; - -public class TestSortedSetDocValues extends LuceneTestCase { - - public void testNoMoreOrdsConstant() { Review Comment: Let's keep this test as long as we still define `NO_MORE_ORDS`. Even though we've marked it as deprecated and are moving off of using it for iteration, our users are likely still relying on it for iteration so we should still keep this test. We can remove it at the same time we actually remove the constant definition. 
## lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseDocValuesFormatTestCase.java: ## @@ -1878,14 +1878,12 @@ public void testSortedSetTwoDocumentsMerged() throws IOException { assertEquals(0, dv.nextDoc()); assertEquals(0, dv.nextOrd()); -assertEquals(NO_MORE_ORDS, dv.nextOrd()); Review Comment: Should we assert the docValueCount is 1 here? ## lucene/test-framework/src/java/org/apache/lucene/tests/util/LuceneTestCase.java: ## @@ -2576,11 +2576,11 @@ public void assertDocValuesEquals(String info, IndexReader leftReader, IndexRead if (docID == NO_MORE_DOCS) { break; } -long ord; -while ((ord = leftValues.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) { +assertEquals(info, leftValues.docValueCount(), rightValues.docValueCount()); +for (int i = 0; i < leftValues.docValueCount(); i++) { + long ord = leftValues.nextOrd(); Review Comment: minor: I might just one-line the for-loop body to: `assertEquals(info, leftValues.nextOrd(), rightValues.nextOrd());` Alternatively, if you find it more readable to create the local variables, I'd create one for each: ``` long leftOrd = leftValues.nextOrd(); long rightOrd = rightValues.nextOrd(); assertEquals(info, leftOrd, rightOrd); ``` Just feels a little inconsistent as it currently is :) ## lucene/backward-codecs/src/test/org/apache/lucene/backward_codecs/lucene80/BaseLucene80DocValuesFormatTestCase.java: ## @@ -480,15 +476,14 @@ public void testSortedSetAroundBlockSize() throws IOException { for (int i = 0; i < maxDoc; ++i) { assertEquals(i, values.nextDoc()); final int numValues = in.readVInt(); +assertEquals(numValues, values.docValueCount()); for (int j = 0; j < numValues; ++j) { b.setLength(in.readVInt()); b.grow(b.length()); in.readBytes(b.bytes(), 0, b.length()); assertEquals(b.get(), values.lookupOrd(values.nextOrd())); } - -assertEquals(SortedSetDocValues.NO_MORE_ORDS, values.nextOrd()); Review Comment: I think we ought to keep this for now until we actually remove the contract that `nextOrd()` returns 
`NO_MORE_ORDS` when exhausted (assuming we plan to back-port this to 9.x, which I think we should). Since 9.x will need to continue to return `NO_MORE_ORDS` as part of the API contract, it would be good to have tests for that behavior. When we go to actually remove `NO_MORE_ORDS`, which we should do only to `main` and under a separate Jira issue, we can remove this check. ## lucene/backward-codecs/src/test/org/apache/lucene/backward_index/TestBackwardsCompatibility.java: ## @@ -1205,8 +1205,8 @@ public void searchIndex( assertEquals(id, dvShort.longValue()); assertEquals(i,
[GitHub] [lucene] donnerpeter merged pull request #975: LUCENE-10626 Hunspell: add tools to aid dictionary editing
donnerpeter merged PR #975: URL: https://github.com/apache/lucene/pull/975
[GitHub] [lucene] shahrs87 commented on pull request #907: LUCENE-10357 Ghost fields and postings/points
shahrs87 commented on PR #907: URL: https://github.com/apache/lucene/pull/907#issuecomment-1175446230

@jpountz There are no null or terms.EMPTY checks in CheckIndex class anymore.
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova commented on code in PR #992: URL: https://github.com/apache/lucene/pull/992#discussion_r914171256

## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java:
## @@ -203,8 +204,11 @@ private NeighborQueue searchLevel(
     return results;
   }

-  private void clearScratchState() {
+  private void clearScratchState(int capacity) {

Review Comment:
Addressed in 2f58350081902bfc13cb02424343ab805c02b0a5
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova commented on code in PR #992: URL: https://github.com/apache/lucene/pull/992#discussion_r914132447

## lucene/core/src/java/org/apache/lucene/index/VectorValuesWriter.java:
## @@ -26,233 +26,153 @@
 import org.apache.lucene.codecs.KnnVectorsWriter;
 import org.apache.lucene.search.DocIdSetIterator;
 import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.util.Accountable;
 import org.apache.lucene.util.ArrayUtil;
 import org.apache.lucene.util.Bits;
 import org.apache.lucene.util.BytesRef;
-import org.apache.lucene.util.Counter;
 import org.apache.lucene.util.RamUsageEstimator;

 /**
- * Buffers up pending vector value(s) per doc, then flushes when segment flushes.
+ * Buffers up pending vector value(s) per doc, then flushes when segment flushes. Used for {@code
+ * SimpleTextKnnVectorsWriter} and for vectors writers before v 9.3.
  *
  * @lucene.experimental
  */
-class VectorValuesWriter {
-
-  private final FieldInfo fieldInfo;
-  private final Counter iwBytesUsed;
-  private final List vectors = new ArrayList<>();
-  private final DocsWithFieldSet docsWithField;
-
-  private int lastDocID = -1;
-
-  private long bytesUsed;
-
-  VectorValuesWriter(FieldInfo fieldInfo, Counter iwBytesUsed) {
-    this.fieldInfo = fieldInfo;
-    this.iwBytesUsed = iwBytesUsed;
-    this.docsWithField = new DocsWithFieldSet();
-    this.bytesUsed = docsWithField.ramBytesUsed();
-    if (iwBytesUsed != null) {
-      iwBytesUsed.addAndGet(bytesUsed);
+public abstract class VectorValuesWriter extends KnnVectorsWriter {

Review Comment:
+1 for renaming. Addressed in 2f58350081902bfc13cb02424343ab805c02b0a5

> I also wonder if we could update SimpleTextKnnVectorsWriter to use the new writer interface. Then we could move this class to the backwards-codecs package, because it would only be used in the old codec tests.

This would mean we need to copy all the code from `BufferingKnnVectorsWriter` to `SimpleTextKnnVectorsWriter`? Are we ok with this?
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova commented on code in PR #992: URL: https://github.com/apache/lucene/pull/992#discussion_r914172262

## lucene/core/src/java/org/apache/lucene/codecs/lucene93/Lucene93HnswVectorsWriter.java:
## @@ -116,7 +119,193 @@ public final class Lucene93HnswVectorsWriter extends KnnVectorsWriter {
   }

   @Override
-  public void writeField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader)
+  public void addField(FieldInfo fieldInfo) throws IOException {
+    if (fields == null) {
+      fields = new FieldData[1];
+    } else {
+      FieldData[] newFields = new FieldData[fields.length + 1];
+      System.arraycopy(fields, 0, newFields, 0, fields.length);
+      fields = newFields;
+    }
+    fields[fields.length - 1] =
+        new FieldData(fieldInfo, M, beamWidth, segmentWriteState.infoStream);
+  }
+
+  @Override
+  public void addValue(FieldInfo fieldInfo, int docID, float[] vectorValue) throws IOException {
+    for (FieldData field : fields) {

Review Comment:
No longer relevant, as in 2f58350081902bfc13cb02424343ab805c02b0a5 we use `addValue` for `KnnFieldVectorsWriter`.
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova commented on code in PR #992: URL: https://github.com/apache/lucene/pull/992#discussion_r914172811

## lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java:
## @@ -24,28 +24,40 @@
 import org.apache.lucene.index.DocIDMerger;
 import org.apache.lucene.index.FieldInfo;
 import org.apache.lucene.index.MergeState;
+import org.apache.lucene.index.Sorter;
 import org.apache.lucene.index.VectorValues;
 import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.util.Accountable;
 import org.apache.lucene.util.Bits;
 import org.apache.lucene.util.BytesRef;

 /** Writes vectors to an index. */
-public abstract class KnnVectorsWriter implements Closeable {
+public abstract class KnnVectorsWriter implements Accountable, Closeable {

   /** Sole constructor */
   protected KnnVectorsWriter() {}

-  /** Write all values contained in the provided reader */
-  public abstract void writeField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader)
+  /** Add new field for indexing */
+  public abstract void addField(FieldInfo fieldInfo) throws IOException;
+
+  /** Add new docID with its vector value to the given field for indexing */
+  public abstract void addValue(FieldInfo fieldInfo, int docID, float[] vectorValue)
+      throws IOException;

Review Comment:
Great feedback! Addressed in 2f58350081902bfc13cb02424343ab805c02b0a5
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova commented on code in PR #992: URL: https://github.com/apache/lucene/pull/992#discussion_r914172979

## lucene/core/src/java/org/apache/lucene/codecs/perfield/PerFieldKnnVectorsFormat.java:
## @@ -94,17 +95,61 @@ public KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException

   private class FieldsWriter extends KnnVectorsWriter {
     private final Map formats;
     private final Map suffixes = new HashMap<>();
+    private final Map> writersForFields =
+        new IdentityHashMap<>();
     private final SegmentWriteState segmentWriteState;
+    // if there is a single writer, cache it for faster indexing
+    private KnnVectorsWriter singleWriter;

Review Comment:
Addressed in 2f58350081902bfc13cb02424343ab805c02b0a5
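The `PerFieldKnnVectorsFormat.FieldsWriter` change discussed above keeps a `writersForFields` map so that each field's values are routed to the writer of that field's own format. The following self-contained sketch illustrates that per-field dispatch in isolation; the class and interface names below are simplified hypothetical stand-ins, not the real Lucene types.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for PerFieldKnnVectorsFormat.FieldsWriter's per-field
// routing: one writer per field, created on addField and looked up on addValue.
public class PerFieldDispatch {
  interface FieldWriter {
    void addValue(float[] vector);
    int valueCount();
  }

  static class CountingWriter implements FieldWriter {
    private int count;
    public void addValue(float[] vector) { count++; }
    public int valueCount() { return count; }
  }

  private final Map<String, FieldWriter> writersForFields = new HashMap<>();

  public void addField(String field) {
    // each field gets its own writer, created at most once
    writersForFields.putIfAbsent(field, new CountingWriter());
  }

  public void addValue(String field, float[] vector) {
    FieldWriter w = writersForFields.get(field);
    if (w == null) {
      throw new IllegalStateException("addField was not called for \"" + field + "\"");
    }
    w.addValue(vector);
  }

  public int valueCount(String field) {
    FieldWriter w = writersForFields.get(field);
    return w == null ? 0 : w.valueCount();
  }

  public static void main(String[] args) {
    PerFieldDispatch d = new PerFieldDispatch();
    d.addField("a");
    d.addField("b");
    d.addValue("a", new float[] {1f, 2f});
    d.addValue("a", new float[] {3f, 4f});
    d.addValue("b", new float[] {5f, 6f});
    System.out.println(d.valueCount("a")); // 2
    System.out.println(d.valueCount("b")); // 1
  }
}
```

The map-based lookup is what makes the `singleWriter` cache in the reviewed diff attractive: when only one format is in play, the per-value map lookup can be skipped entirely.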
[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues
[ https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562832#comment-17562832 ]

Greg Miller commented on LUCENE-10603:
--------------------------------------

Thanks [~stefanvodita] for jumping in as well to help! I left a little feedback on the PR. Thanks again!

> Improve iteration of ords for SortedSetDocValues
> ------------------------------------------------
>
> Key: LUCENE-10603
> URL: https://issues.apache.org/jira/browse/LUCENE-10603
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Lu Xugang
> Assignee: Lu Xugang
> Priority: Trivial
> Time Spent: 4h
> Remaining Estimate: 0h
>
> After SortedSetDocValues#docValueCount was added in Lucene 9.2, should we
> refactor the implementation of ords iteration to use docValueCount instead of
> NO_MORE_ORDS, similar to how SortedNumericDocValues does it?
> From
> {code:java}
> for (long ord = values.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = values.nextOrd()) {
> }{code}
> to
> {code:java}
> for (int i = 0; i < values.docValueCount(); i++) {
>   long ord = values.nextOrd();
> }{code}
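The refactoring proposed in LUCENE-10603 replaces the sentinel-terminated loop with a counted loop; both visit exactly the same ords for a document. The self-contained sketch below demonstrates the two equivalent styles. `FakeSortedSetValues` is an illustrative stand-in for `SortedSetDocValues` (modeling one document's ords), not the real Lucene class.

```java
// Demonstrates the two iteration styles discussed in LUCENE-10603.
public class OrdIterationDemo {
  static final long NO_MORE_ORDS = -1; // sentinel marking the end of a doc's ords

  static class FakeSortedSetValues {
    private final long[] ords;
    private int pos = 0;
    FakeSortedSetValues(long... ords) { this.ords = ords; }
    int docValueCount() { return ords.length; }
    long nextOrd() { return pos < ords.length ? ords[pos++] : NO_MORE_ORDS; }
  }

  // Old style: iterate until the NO_MORE_ORDS sentinel is returned.
  static long sumOldStyle(FakeSortedSetValues v) {
    long sum = 0;
    for (long ord = v.nextOrd(); ord != NO_MORE_ORDS; ord = v.nextOrd()) {
      sum += ord;
    }
    return sum;
  }

  // New style: read exactly docValueCount() ords; no sentinel check needed.
  static long sumNewStyle(FakeSortedSetValues v) {
    long sum = 0;
    int count = v.docValueCount();
    for (int i = 0; i < count; i++) {
      sum += v.nextOrd();
    }
    return sum;
  }

  public static void main(String[] args) {
    System.out.println(sumOldStyle(new FakeSortedSetValues(1, 3, 7))); // 11
    System.out.println(sumNewStyle(new FakeSortedSetValues(1, 3, 7))); // 11
  }
}
```

Note that hoisting `docValueCount()` out of the loop condition, as above, avoids re-calling it on every iteration; the one-line form shown in the issue is equivalent.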
[GitHub] [lucene] stefanvodita commented on pull request #1004: LUCENE-10603: Stop using SortedSetDocValues.NO_MORE_ORDS in tests
stefanvodita commented on PR #1004: URL: https://github.com/apache/lucene/pull/1004#issuecomment-1175589343

Thanks @gsmiller for patiently checking through all those changes! I’ve reverted the ones you pointed out.
[GitHub] [lucene] stefanvodita commented on a diff in pull request #1004: LUCENE-10603: Stop using SortedSetDocValues.NO_MORE_ORDS in tests
stefanvodita commented on code in PR #1004: URL: https://github.com/apache/lucene/pull/1004#discussion_r914274231

## lucene/memory/src/test/org/apache/lucene/index/memory/TestMemoryIndex.java:
## @@ -298,10 +298,10 @@ public void testDocValues() throws Exception {
     assertEquals(3, sortedSetDocValues.getValueCount());
     assertEquals(0, sortedSetDocValues.nextDoc());
     assertEquals(3, sortedSetDocValues.docValueCount());
+    assertEquals(3, sortedSetDocValues.docValueCount());

Review Comment:
Oops! Fixed! :))
[GitHub] [lucene] stefanvodita commented on a diff in pull request #1004: LUCENE-10603: Stop using SortedSetDocValues.NO_MORE_ORDS in tests
stefanvodita commented on code in PR #1004: URL: https://github.com/apache/lucene/pull/1004#discussion_r914274530

## lucene/test-framework/src/java/org/apache/lucene/tests/util/LuceneTestCase.java:
## @@ -2576,11 +2576,11 @@ public void assertDocValuesEquals(String info, IndexReader leftReader, IndexRead
         if (docID == NO_MORE_DOCS) {
           break;
         }
-        long ord;
-        while ((ord = leftValues.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) {
+        assertEquals(info, leftValues.docValueCount(), rightValues.docValueCount());
+        for (int i = 0; i < leftValues.docValueCount(); i++) {
+          long ord = leftValues.nextOrd();

Review Comment:
I went with the one-liner here.
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562919#comment-17562919 ]

Zach Chen commented on LUCENE-10480:
------------------------------------

{quote}Nightly benchmarks picked up the change and top-level disjunctions are seeing massive speedups, see [OrHighHigh|http://people.apache.org/~mikemccand/lucenebench/OrHighHigh.html] or [OrHighMed|http://people.apache.org/~mikemccand/lucenebench/OrHighMed.html]. However disjunctions within conjunctions got a slowdown, see [AndHighOrMedMed|http://people.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html] or [AndMedOrHighHigh|http://people.apache.org/~mikemccand/lucenebench/AndMedOrHighHigh.html].
{quote}

The results look encouraging and interesting! I copied and pasted the boolean queries from *wikinightly.tasks* into *wikimedium.10M.nostopwords.tasks* and ran the benchmark, and was able to re-produce the slow-down:

{code:java}
Task                QPS baseline   StdDev   QPS my_modified_version   StdDev   Pct diff           p-value
AndHighOrMedMed     108.16  (6.5%)         100.44  (5.4%)             -7.1% ( -17% -    5%)   0.000
AndMedOrHighHigh     68.37  (4.5%)          63.92  (5.0%)             -6.5% ( -15% -    3%)   0.000
AndHighHigh         122.90  (5.5%)         122.77  (5.5%)             -0.1% ( -10% -   11%)   0.952
AndHighMed          113.27  (6.4%)         114.63  (6.2%)              1.2% ( -10% -   14%)   0.546
PKLookup            228.08 (14.4%)         232.90 (14.7%)              2.1% ( -23% -   36%)   0.646
OrHighHigh           26.89  (5.7%)          48.62 (12.2%)             80.8% (  59% -  104%)   0.000
OrHighMed            81.18  (5.9%)         187.05 (12.2%)            130.4% ( 105% -  157%)   0.000
{code}

{code:java}
Task                QPS baseline   StdDev   QPS my_modified_version   StdDev   Pct diff           p-value
AndMedOrHighHigh     85.67  (5.3%)          73.23  (5.7%)            -14.5% ( -24% -   -3%)   0.000
PKLookup            260.08 (13.4%)         253.74 (14.9%)             -2.4% ( -27% -   29%)   0.586
AndHighHigh          73.68  (4.7%)          72.70  (4.1%)             -1.3% (  -9% -    7%)   0.339
AndHighMed           89.52  (5.1%)          88.55  (4.4%)             -1.1% ( -10% -    8%)   0.470
AndHighOrMedMed      63.27  (6.5%)          70.48  (5.7%)             11.4% (   0% -   25%)   0.000
OrHighHigh           19.60  (5.3%)          25.62  (7.6%)             30.8% (  16% -   46%)   0.000
OrHighMed           121.08  (5.7%)         236.34 (10.2%)             95.2% (  74% -  117%)   0.000
{code}

{code:java}
Task                QPS baseline   StdDev   QPS my_modified_version   StdDev   Pct diff           p-value
AndMedOrHighHigh     86.88  (3.4%)          76.60  (3.1%)            -11.8% ( -17% -   -5%)   0.000
AndHighHigh          30.49  (3.5%)          30.36  (3.5%)             -0.4% (  -7% -    6%)   0.697
AndHighMed          192.76  (3.4%)         193.72  (3.9%)              0.5% (  -6% -    8%)   0.671
PKLookup            262.59  (5.5%)         264.52  (7.9%)              0.7% ( -11% -   14%)   0.731
AndHighOrMedMed      65.47  (3.8%)          73.43  (3.0%)             12.2% (   5% -   19%)   0.000
OrHighHigh           21.47  (4.1%)          36.94  (8.3%)             72.1% (  57% -   88%)   0.000
OrHighMed            99.91  (4.3%)         292.05 (12.9%)            192.3% ( 167% -  218%)   0.000
{code}

However, when I reduced the type of tasks further to just conjunction + disjunction (and with the default number of search threads), the results actually turned positive and were similar to what I saw earlier in [https://github.com/apache/lucene/pull/972#issuecomment-1166188875]

{code:java}
Task                QPS baseline   StdDev   QPS my_modified_version   StdDev   Pct diff           p-value
AndHighOrMedMed      58.65 (37.3%)          71.63 (28.9%)             22.1% ( -32% -  140%)   0.036
AndMedOrHighHigh     36.43 (39.3%)          44.61 (30.7%)             22.4% ( -34% -  152%)   0.044
PKLookup            163.58 (34.4%)         211.88 (32.7%)             29.5% ( -27% -  147%)   0.005
{code}

{code:java}
Task                QPS baseline   StdDev   QPS my_modified_version   StdDev   Pct diff           p-value
PKLookup            146.51 (22.0%)         188.92 (30.1%)             28.9% ( -18% -  103%)   0.001
AndMedOrHighHigh     35.59 (27.1%)          49.99 (37.5%)             40.4% ( -18% -  144%)   0.000
AndHighOrMedMed      44.47 (26.6%)          63.
[jira] [Comment Edited] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562919#comment-17562919 ] Zach Chen edited comment on LUCENE-10480 at 7/6/22 2:12 AM: {quote}Nightly benchmarks picked up the change and top-level disjunctions are seeing massive speedups, see [OrHighHigh|http://people.apache.org/~mikemccand/lucenebench/OrHighHigh.html] or [OrHighMed|http://people.apache.org/~mikemccand/lucenebench/OrHighMed.html]. However disjunctions within conjunctions got a slowdown, see [AndHighOrMedMed|http://people.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html] or [AndMedOrHighHigh|http://people.apache.org/~mikemccand/lucenebench/AndMedOrHighHigh.html]. {quote} The results look encouraging and interesting! I copied and pasted the boolean queries from *wikinightly.tasks* into *wikimedium.10M.nostopwords.tasks* and ran the benchmark, and was able to re-produce the slow-down: {code:java} TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value AndHighOrMedMed 108.16 (6.5%) 100.44 (5.4%) -7.1% ( -17% - 5%) 0.000 AndMedOrHighHigh 68.37 (4.5%) 63.92 (5.0%) -6.5% ( -15% - 3%) 0.000 AndHighHigh 122.90 (5.5%) 122.77 (5.5%) -0.1% ( -10% - 11%) 0.952 AndHighMed 113.27 (6.4%) 114.63 (6.2%) 1.2% ( -10% - 14%) 0.546 PKLookup 228.08 (14.4%) 232.90 (14.7%) 2.1% ( -23% - 36%) 0.646 OrHighHigh 26.89 (5.7%) 48.62 (12.2%) 80.8% ( 59% - 104%) 0.000 OrHighMed 81.18 (5.9%) 187.05 (12.2%) 130.4% ( 105% - 157%) 0.000 {code} {code:java} TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value AndMedOrHighHigh 85.67 (5.3%) 73.23 (5.7%) -14.5% ( -24% - -3%) 0.000 PKLookup 260.08 (13.4%) 253.74 (14.9%) -2.4% ( -27% - 29%) 0.586 AndHighHigh 73.68 (4.7%) 72.70 (4.1%) -1.3% ( -9% - 7%) 0.339 AndHighMed 89.52 (5.1%) 88.55 (4.4%) -1.1% ( -10% - 8%) 0.470 AndHighOrMedMed 63.27 (6.5%) 70.48 (5.7%) 11.4% ( 0% - 25%) 0.000 OrHighHigh 19.60 (5.3%) 25.62 (7.6%) 30.8% ( 16% - 46%) 0.000 OrHighMed 121.08 
(5.7%) 236.34 (10.2%) 95.2% ( 74% - 117%) 0.000 {code} {code:java} TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value AndMedOrHighHigh 86.88 (3.4%) 76.60 (3.1%) -11.8% ( -17% - -5%) 0.000 AndHighHigh 30.49 (3.5%) 30.36 (3.5%) -0.4% ( -7% - 6%) 0.697 AndHighMed 192.76 (3.4%) 193.72 (3.9%) 0.5% ( -6% - 8%) 0.671 PKLookup 262.59 (5.5%) 264.52 (7.9%) 0.7% ( -11% - 14%) 0.731 AndHighOrMedMed 65.47 (3.8%) 73.43 (3.0%) 12.2% ( 5% - 19%) 0.000 OrHighHigh 21.47 (4.1%) 36.94 (8.3%) 72.1% ( 57% - 88%) 0.000 OrHighMed 99.91 (4.3%) 292.05 (12.9%) 192.3% ( 167% - 218%) 0.000 {code} However, when I reduced the type of tasks further into just conjunction + disjunction (and with default number of search threads), the results actually turned positive and were similar to what I saw earlier in [https://github.com/apache/lucene/pull/972#issuecomment-1166188875] {code:java} TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value AndHighOrMedMed 58.65 (37.3%) 71.63 (28.9%) 22.1% ( -32% - 140%) 0.036 AndMedOrHighHigh 36.43 (39.3%) 44.61 (30.7%) 22.4% ( -34% - 152%) 0.044 PKLookup 163.58 (34.4%) 211.88 (32.7%) 29.5% ( -27% - 147%) 0.005 {code} {code:java} TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value PKLookup 146.51 (22.0%) 188.92 (30.1%) 28.9% ( -18% - 103%) 0.001 AndMedOrHighHigh 35.59 (27.1%) 49.99 (37.5%) 40.4% ( -18% - 144%) 0.000 AndHig
[jira] [Comment Edited] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562919#comment-17562919 ] Zach Chen edited comment on LUCENE-10480 at 7/6/22 2:13 AM:

{quote}Nightly benchmarks picked up the change and top-level disjunctions are seeing massive speedups, see [OrHighHigh|http://people.apache.org/~mikemccand/lucenebench/OrHighHigh.html] or [OrHighMed|http://people.apache.org/~mikemccand/lucenebench/OrHighMed.html]. However disjunctions within conjunctions got a slowdown, see [AndHighOrMedMed|http://people.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html] or [AndMedOrHighHigh|http://people.apache.org/~mikemccand/lucenebench/AndMedOrHighHigh.html].{quote}

The results look encouraging and interesting! I copied and pasted the boolean queries from *wikinightly.tasks* into *wikimedium.10M.nostopwords.tasks* and ran the benchmark, and was able to reproduce the slowdown:

{code:java}
Task               QPS baseline   StdDev   QPS my_modified_version   StdDev   Pct diff               p-value
AndHighOrMedMed          108.16   (6.5%)                    100.44   (5.4%)    -7.1% ( -17% -   5%)    0.000
AndMedOrHighHigh          68.37   (4.5%)                     63.92   (5.0%)    -6.5% ( -15% -   3%)    0.000
AndHighHigh              122.90   (5.5%)                    122.77   (5.5%)    -0.1% ( -10% -  11%)    0.952
AndHighMed               113.27   (6.4%)                    114.63   (6.2%)     1.2% ( -10% -  14%)    0.546
PKLookup                 228.08  (14.4%)                    232.90  (14.7%)     2.1% ( -23% -  36%)    0.646
OrHighHigh                26.89   (5.7%)                     48.62  (12.2%)    80.8% (  59% - 104%)    0.000
OrHighMed                 81.18   (5.9%)                    187.05  (12.2%)   130.4% ( 105% - 157%)    0.000
{code}

{code:java}
Task               QPS baseline   StdDev   QPS my_modified_version   StdDev   Pct diff               p-value
AndMedOrHighHigh          85.67   (5.3%)                     73.23   (5.7%)   -14.5% ( -24% -  -3%)    0.000
PKLookup                 260.08  (13.4%)                    253.74  (14.9%)    -2.4% ( -27% -  29%)    0.586
AndHighHigh               73.68   (4.7%)                     72.70   (4.1%)    -1.3% (  -9% -   7%)    0.339
AndHighMed                89.52   (5.1%)                     88.55   (4.4%)    -1.1% ( -10% -   8%)    0.470
AndHighOrMedMed           63.27   (6.5%)                     70.48   (5.7%)    11.4% (   0% -  25%)    0.000
OrHighHigh                19.60   (5.3%)                     25.62   (7.6%)    30.8% (  16% -  46%)    0.000
OrHighMed                121.08   (5.7%)                    236.34  (10.2%)    95.2% (  74% - 117%)    0.000
{code}

{code:java}
Task               QPS baseline   StdDev   QPS my_modified_version   StdDev   Pct diff               p-value
AndMedOrHighHigh          86.88   (3.4%)                     76.60   (3.1%)   -11.8% ( -17% -  -5%)    0.000
AndHighHigh               30.49   (3.5%)                     30.36   (3.5%)    -0.4% (  -7% -   6%)    0.697
AndHighMed               192.76   (3.4%)                    193.72   (3.9%)     0.5% (  -6% -   8%)    0.671
PKLookup                 262.59   (5.5%)                    264.52   (7.9%)     0.7% ( -11% -  14%)    0.731
AndHighOrMedMed           65.47   (3.8%)                     73.43   (3.0%)    12.2% (   5% -  19%)    0.000
OrHighHigh                21.47   (4.1%)                     36.94   (8.3%)    72.1% (  57% -  88%)    0.000
OrHighMed                 99.91   (4.3%)                    292.05  (12.9%)   192.3% ( 167% - 218%)    0.000
{code}

However, when I reduced the type of tasks further into just conjunction + disjunction (and with default number of search threads), the results actually turned positive and were similar to what I saw earlier in [https://github.com/apache/lucene/pull/972#issuecomment-1166188875]

{code:java}
Task               QPS baseline   StdDev   QPS my_modified_version   StdDev   Pct diff               p-value
AndHighOrMedMed           58.65  (37.3%)                     71.63  (28.9%)    22.1% ( -32% - 140%)    0.036
AndMedOrHighHigh          36.43  (39.3%)                     44.61  (30.7%)    22.4% ( -34% - 152%)    0.044
PKLookup                 163.58  (34.4%)                    211.88  (32.7%)    29.5% ( -27% - 147%)    0.005
{code}

{code:java}
Task               QPS baseline   StdDev   QPS my_modified_version   StdDev   Pct diff               p-value
PKLookup                 146.51  (22.0%)                    188.92  (30.1%)    28.9% ( -18% - 103%)    0.001
AndMedOrHighHigh          35.59  (27.1%)                     49.99  (37.5%)    40.4% ( -18% - 144%)    0.000
AndHighOrMedMed
{code}
[GitHub] [lucene] zacharymorn opened a new pull request, #1006: LUCENE-10480: Move scoring from advance to TwoPhaseIterator#matches to improve disjunction within conjunction
zacharymorn opened a new pull request, #1006: URL: https://github.com/apache/lucene/pull/1006

### Description (or a Jira issue link if you have one)

Follow-up changes for https://issues.apache.org/jira/browse/LUCENE-10480 to improve performance for disjunction within conjunction queries. Benchmark results with `wikinightly.tasks` boolean queries below:

```
AndHighHigh: +be +up # freq=2115632 freq=824628
AndHighHigh: +cite +had # freq=1367577 freq=1223103
AndHighHigh: +is +he # freq=4214104 freq=1663980
AndHighHigh: +no +4 # freq=1060681 freq=944177
AndHighHigh: +title +see # freq=2077102 freq=1100862
AndHighMed: +2010 +16 # freq=933686 freq=531050
AndHighMed: +5 +power # freq=849829 freq=257919
AndHighMed: +only +particularly # freq=895806 freq=100045
AndHighMed: +united +1983 # freq=1185528 freq=150075
AndHighMed: +who +ed # freq=1201585 freq=127497
OrHighHigh: are last # freq=1921211 freq=830278
OrHighHigh: at united # freq=2834104 freq=1185528
OrHighHigh: but year # freq=1484398 freq=1098425
OrHighHigh: name its # freq=2577591 freq=1160703
OrHighHigh: to but # freq=6105155 freq=1484398
OrHighMed: at mostly # freq=2834104 freq=89401
OrHighMed: his interview # freq=1771920 freq=94736
OrHighMed: http 9 # freq=3289683 freq=541405
OrHighMed: they hard # freq=1031516 freq=92045
OrHighMed: title bay # freq=2077102 freq=117167
AndHighOrMedMed: +be +(mostly interview) # freq=2115632 freq=89401 freq=94736
AndHighOrMedMed: +cite +(9 hard) # freq=1367577 freq=541405 freq=92045
AndHighOrMedMed: +is +(bay 16) # freq=4214104 freq=117167 freq=531050
AndHighOrMedMed: +no +(power particularly) # freq=1060681 freq=257919 freq=100045
AndHighOrMedMed: +title +(1983 ed) # freq=2077102 freq=150075 freq=127497
AndMedOrHighHigh: +mostly +(are last) # freq=89401 freq=1921211 freq=830278
AndMedOrHighHigh: +interview +(at united) # freq=94736 freq=2834104 freq=1185528
AndMedOrHighHigh: +hard +(but year) # freq=92045 freq=1484398 freq=1098425
AndMedOrHighHigh: +9 +(name its) # freq=541405 freq=2577591 freq=1160703
AndMedOrHighHigh: +bay +(to but) # freq=117167 freq=6105155 freq=1484398
```

```
Task               QPS baseline   StdDev   QPS my_modified_version   StdDev   Pct diff               p-value
AndHighHigh               40.93   (2.8%)                     40.72   (4.2%)    -0.5% (  -7% -   6%)    0.659
AndHighMed               150.71   (3.4%)                    152.22   (3.7%)     1.0% (  -5% -   8%)    0.371
PKLookup                 250.85   (8.7%)                    257.51   (8.9%)     2.7% ( -13% -  22%)    0.340
AndHighOrMedMed           66.87   (4.0%)                     68.70   (2.7%)     2.7% (  -3% -   9%)    0.012
AndMedOrHighHigh          89.04   (2.6%)                     93.28   (3.1%)     4.8% (   0% -  10%)    0.000
OrHighHigh                21.71   (6.0%)                     34.50   (6.8%)    58.9% (  43% -  76%)    0.000
OrHighMed                 85.11   (5.0%)                    189.37   (8.0%)   122.5% ( 104% - 142%)    0.000
```

```
Task               QPS baseline   StdDev   QPS my_modified_version   StdDev   Pct diff               p-value
AndMedOrHighHigh          68.90   (4.5%)                     67.15   (4.3%)    -2.5% ( -10% -   6%)    0.074
AndHighHigh               73.07   (3.0%)                     72.11   (3.5%)    -1.3% (  -7% -   5%)    0.212
AndHighMed               146.94   (4.7%)                    145.56   (4.9%)    -0.9% ( -10% -   9%)    0.550
PKLookup                 252.01   (9.3%)                    249.71  (13.2%)    -0.9% ( -21% -  23%)    0.806
AndHighOrMedMed           65.49   (5.8%)                     66.09   (4.9%)     0.9% (  -9% -  12%)    0.600
OrHighHigh                21.34   (6.7%)                     29.63   (6.7%)    38.8% (  23% -  55%)    0.000
OrHighMed                122.61   (8.2%)                    227.04   (9.0%)    85.2% (  62% - 111%)    0.000
```

```
Task               QPS baseline   StdDev   QPS my_modified_version   StdDev   Pct diff               p-value
AndHighMed               113.58   (2.8%)                    113.98   (4.8%)     0.3% (  -7% -   8%)    0.779
AndHighHigh               51.37   (3.2%)                     51.58   (5.2%)     0.4% (  -7% -   9%)    0.759
PKLookup                 272.05   (8.9%)                    276.89  (12.6%)     1.8% ( -18% -  25%)    0.605
AndHighOrMedMed          102.86   (5.1%)                    107.47   (5.4%)     4.5% (  -5% -  15%)    0.007
AndMedOrHighHigh          91.55   (3.8%)                     96.43   (5.2%)     5.3% (  -3% -  14%)    0.000
OrHighHigh                27.08   (6.5%)                     47.16  (11.3%)    74.2% (  52% -  98%)    0.000
OrHighMed                 78.78
```
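The PR's core idea, moving expensive per-doc work out of advance() so that the other conjunction clause can reject a document first, can be sketched outside Lucene. The class below is a minimal illustration with invented names (it is not the actual TwoPhaseIterator API): the eager variant scores every candidate the disjunction visits, while the lazy, matches()-style variant scores only documents that the required clause has already confirmed, so the number of score computations drops from the size of the union to the size of the intersection.

```java
import java.util.*;

public class LazyScoringSketch {
    // Sorted doc ids: two optional clauses forming a disjunction, plus one
    // required clause that the disjunction is AND-ed with.
    static final int[] OR_A = {1, 3, 5, 7, 9, 11};
    static final int[] OR_B = {2, 3, 6, 9, 12};
    static final Set<Integer> REQUIRED = new TreeSet<>(Arrays.asList(3, 9, 20));

    static int scoreCalls;                     // how often the expensive path ran

    static float expensiveScore(int doc) {     // stand-in for real per-doc score math
        scoreCalls++;
        return 1.0f / (1 + doc);
    }

    // The disjunction's cheap "approximation": union of the clauses' doc ids.
    static int[] approximation() {
        TreeSet<Integer> s = new TreeSet<>();
        for (int d : OR_A) s.add(d);
        for (int d : OR_B) s.add(d);
        return s.stream().mapToInt(Integer::intValue).toArray();
    }

    // Eager: score every candidate before the other clause is consulted.
    static List<Integer> eager() {
        scoreCalls = 0;
        List<Integer> hits = new ArrayList<>();
        for (int doc : approximation()) {
            expensiveScore(doc);               // scored even if the doc is rejected next
            if (REQUIRED.contains(doc)) hits.add(doc);
        }
        return hits;
    }

    // Lazy: a matches()-style check defers scoring until the required clause agrees.
    static List<Integer> lazy() {
        scoreCalls = 0;
        List<Integer> hits = new ArrayList<>();
        for (int doc : approximation()) {
            if (REQUIRED.contains(doc)) {      // cheap confirmation first...
                expensiveScore(doc);           // ...expensive scoring only on survivors
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Integer> e = eager();
        System.out.println(e + " eager scores=" + scoreCalls);  // [3, 9] eager scores=9
        List<Integer> l = lazy();
        System.out.println(l + " lazy scores=" + scoreCalls);   // [3, 9] lazy scores=2
    }
}
```

Both variants return the same hits; only the amount of wasted score computation differs, which matches the PR's goal of helping AndHighOrMedMed/AndMedOrHighHigh without changing results.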
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562944#comment-17562944 ] Zach Chen commented on LUCENE-10480:

{quote}maybe there are bits from advance() that we could move to matches() so that we would hand it over to the other clause before we start doing expensive operations like computing scores.{quote}

This approach does help stabilize performance for disjunction within conjunction queries (and also provides some small gains)! I have opened a PR for it: https://github.com/apache/lucene/pull/1006

> Specialize 2-clauses disjunctions
> ---------------------------------
>
>                 Key: LUCENE-10480
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10480
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
> WANDScorer is nice, but it also has lots of overhead to maintain its
> invariants: one linked list for the current candidates, one priority queue of
> scorers that are behind, another one for scorers that are ahead. All this
> could be simplified in the 2-clauses case, which feels worth specializing for
> as it's very common that end users enter queries that only have two terms?

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
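The simplification the issue proposes can be sketched with plain arrays: for exactly two clauses there is no need for a linked list or priority queues; two cursors and a min() comparison are enough to enumerate the union in doc-id order. This is an illustrative sketch with invented names, not Lucene's actual DocIdSetIterator API.

```java
// Two-clause disjunction as a simple two-pointer merge over sorted doc ids.
public class TwoClauseDisjunction {
    private final int[] a, b;   // sorted doc ids of the two clauses
    private int ia, ib;         // cursor into each clause

    public TwoClauseDisjunction(int[] a, int[] b) { this.a = a; this.b = b; }

    private int docA() { return ia < a.length ? a[ia] : Integer.MAX_VALUE; }
    private int docB() { return ib < b.length ? b[ib] : Integer.MAX_VALUE; }

    /** Next matching doc id in order, or Integer.MAX_VALUE when exhausted. */
    public int nextDoc() {
        int da = docA(), db = docB();
        int doc = Math.min(da, db);
        if (doc == Integer.MAX_VALUE) return doc;
        if (da == doc) ia++;    // advance every clause positioned on this doc,
        if (db == doc) ib++;    // so a doc in both lists is emitted only once
        return doc;
    }

    /** Convenience: collect all matching doc ids. */
    public static java.util.List<Integer> drain(int[] a, int[] b) {
        TwoClauseDisjunction d = new TwoClauseDisjunction(a, b);
        java.util.List<Integer> out = new java.util.ArrayList<>();
        for (int doc = d.nextDoc(); doc != Integer.MAX_VALUE; doc = d.nextDoc()) {
            out.add(doc);
        }
        return out;
    }

    public static void main(String[] args) {
        // Union of {1,4,7} and {2,4,9}, with 4 present in both clauses.
        System.out.println(drain(new int[]{1, 4, 7}, new int[]{2, 4, 9}));  // [1, 2, 4, 7, 9]
    }
}
```

The general WANDScorer must also track score upper bounds for skipping; this sketch deliberately shows only the iteration structure that the 2-clause case makes trivial.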
[jira] [Resolved] (LUCENE-10626) Hunspell: add tools to aid dictionary editing: analysis introspection, stem expansion and stem/flag suggestion
[ https://issues.apache.org/jira/browse/LUCENE-10626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Gromov resolved LUCENE-10626.
-----------------------------------
    Resolution: Fixed

> Hunspell: add tools to aid dictionary editing: analysis introspection, stem
> expansion and stem/flag suggestion
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-10626
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10626
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Peter Gromov
>            Priority: Major
>          Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> The following tools would be nice to have when editing and extending an
> existing dictionary:
> 1. See how Hunspell analyzes a given word, with all the involved affix flags:
> `Hunspell.analyzeSimpleWord`
> 2. See all forms that the given stem can produce with the given flags:
> `Hunspell.expandRoot`, `WordFormGenerator.expandRoot`
> 3. Given a number of word forms, suggest a stem and a set of flags that
> produce these word forms: `Hunspell.compress`, `WordFormGenerator.compress`.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Closed] (LUCENE-10626) Hunspell: add tools to aid dictionary editing: analysis introspection, stem expansion and stem/flag suggestion
[ https://issues.apache.org/jira/browse/LUCENE-10626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Gromov closed LUCENE-10626.
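As a rough illustration of what the third tool (`Hunspell.compress` / `WordFormGenerator.compress`) is asked to do conceptually: given several word forms, suggest a shared stem plus the pieces that regenerate the forms. The sketch below is a toy longest-common-prefix version with invented names; the real implementation works against the dictionary's actual affix rules and flags, not raw suffix strings.

```java
import java.util.*;

public class StemSuggestionSketch {
    /** Longest common prefix of the given forms, used as the suggested stem. */
    static String suggestStem(List<String> forms) {
        String stem = forms.get(0);
        for (String f : forms) {
            int i = 0;
            while (i < Math.min(stem.length(), f.length()) && stem.charAt(i) == f.charAt(i)) {
                i++;
            }
            stem = stem.substring(0, i);   // shrink to the part shared with every form
        }
        return stem;
    }

    /** The suffixes (playing the role of affix flags) that expand the stem back to each form. */
    static List<String> suffixes(List<String> forms) {
        String stem = suggestStem(forms);
        List<String> out = new ArrayList<>();
        for (String f : forms) out.add(f.substring(stem.length()));
        return out;
    }

    public static void main(String[] args) {
        List<String> forms = Arrays.asList("walk", "walks", "walked", "walking");
        System.out.println(suggestStem(forms));  // walk
        System.out.println(suffixes(forms));     // [, s, ed, ing]
    }
}
```

Applying the suffixes back to the stem reproduces exactly the input forms, which mirrors the round trip between `compress` and `expandRoot` described in the issue.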
[GitHub] [lucene-jira-archive] mocobeta opened a new pull request, #17: Split up updating script
mocobeta opened a new pull request, #17: URL: https://github.com/apache/lucene-jira-archive/pull/17

Closes #16

Add
- src/remap_cross_issue_links.py
- src/update_issues.py

Deprecate
- src/update_issue_links.py

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

For additional commands, e-mail: issues-h...@lucene.apache.org