[GitHub] [lucene-jira-archive] mocobeta opened a new issue, #15: Make a script to add comments to all Jira issues to indicate that "this was moved to GitHub"
mocobeta opened a new issue, #15: URL: https://github.com/apache/lucene-jira-archive/issues/15

After the migration, Jira issues should no longer be updated. To prevent further updates on Jira and to guide people who land on a Jira issue over to GitHub, a comment should be added to each Jira issue that indicates the corresponding GitHub issue URL. The issue mapping will be provided by a CSV file.

Mapping file format (example):
```
JiraKey,GitHubUrl,GitHubNumber
LUCENE-10605,https://github.com/mocobeta/migration-test-3/issues/37,37
```

The Jira comment could be:
```
This was moved to GitHub issue. See .
```

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
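The proposed script could look roughly like the following Python sketch. All names here are hypothetical (the real script lives in lucene-jira-archive and may differ), and the actual posting step — e.g. `POST /rest/api/2/issue/{key}/comment` against Jira's REST API — is abstracted behind a callback:

```python
# Hypothetical sketch of the CSV-driven commenting script described above.
import csv

# Template for the redirect comment; the real wording/template may differ.
JIRA_COMMENT_TEMPLATE = "This was moved to GitHub issue. See {url}."

def load_mapping(csv_path):
    """Parse the mapping CSV (JiraKey,GitHubUrl,GitHubNumber) into a dict."""
    with open(csv_path, newline="") as f:
        return {row["JiraKey"]: row["GitHubUrl"] for row in csv.DictReader(f)}

def build_comment(github_url):
    """Render the comment body that points readers at the GitHub issue."""
    return JIRA_COMMENT_TEMPLATE.format(url=github_url)

def add_comments(mapping, post_comment):
    """Post one redirect comment per Jira issue. post_comment(key, body)
    would wrap Jira's REST API (not shown here)."""
    for jira_key, github_url in mapping.items():
        post_comment(jira_key, build_comment(github_url))
```

Keeping the HTTP call behind `post_comment` makes the CSV parsing and comment rendering trivially testable without touching Jira.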
[GitHub] [lucene-jira-archive] mocobeta opened a new issue, #16: Break up issue updating script into small sub-steps
mocobeta opened a new issue, #16: URL: https://github.com/apache/lucene-jira-archive/issues/16

The current issue updating script (for the "second pass") does three things:
1. iterate over all GitHub issues/comments
2. create re-mapped cross-issue links
3. update issues/comments that include cross-issue links

This can be split up into smaller scripts:
- export script: iterate over issues/comments with their IDs from GitHub and save them to local files
- convert script: modify issues/comments to create cross-issue links
- update script: update issues/comments using the result of the convert script

This breakup makes the updating step an idempotent operation, in exchange for additional steps/time.
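The proposed three-step split can be sketched like this (illustrative Python with invented function names, not the real migration code). Because each step reads from and writes to local files, the final update step can be re-run safely if it fails partway through:

```python
# Hypothetical sketch of the export -> convert -> update pipeline.
import json
import re

def export_issues(fetch_all, dump_path):
    """Step 1: save all issue bodies fetched from GitHub to a local file."""
    with open(dump_path, "w") as f:
        json.dump(fetch_all(), f)

def convert_issues(dump_path, converted_path, jira_to_github):
    """Step 2: rewrite LUCENE-xxxx cross-issue links to GitHub issue refs."""
    with open(dump_path) as f:
        issues = json.load(f)
    for issue in issues:
        # Replace each Jira key with its mapped GitHub reference, if known.
        issue["body"] = re.sub(
            r"LUCENE-\d+",
            lambda m: jira_to_github.get(m.group(0), m.group(0)),
            issue["body"])
    with open(converted_path, "w") as f:
        json.dump(issues, f)

def update_issues(converted_path, push_issue):
    """Step 3: push converted bodies back to GitHub. Re-running this step
    just re-pushes the same final content, so it is idempotent."""
    with open(converted_path) as f:
        for issue in json.load(f):
            push_issue(issue["number"], issue["body"])
```

The idempotency claim in the issue falls out of the structure: step 3 derives everything from the converted file, so repeating it converges to the same state.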
[jira] [Commented] (LUCENE-10636) Could the partial score sum from essential list scores be cached?
[ https://issues.apache.org/jira/browse/LUCENE-10636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562453#comment-17562453 ]

ASF subversion and git services commented on LUCENE-10636:
----------------------------------------------------------

Commit 3dd9a5487c2c3994abdaf5ab0553a3d78ebe50ab in lucene's branch refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=3dd9a5487c2 ]

LUCENE-10636: Avoid computing the same scores multiple times. (#1005)

`BlockMaxMaxscoreScorer` would previously compute the score twice for essential scorers.

Co-authored-by: zacharymorn

> Could the partial score sum from essential list scores be cached?
> -----------------------------------------------------------------
>
>                 Key: LUCENE-10636
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10636
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Zach Chen
>            Priority: Minor
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> This is a follow-up issue from the discussion at
> [https://github.com/apache/lucene/pull/972#discussion_r909300200]. Currently,
> in the implementation of BlockMaxMaxscoreScorer, there is duplicated
> computation when summing up scores from essential list scorers. We would like
> to see whether this duplicated computation can be cached without introducing
> much overhead or a data structure that might outweigh the benefit of caching.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
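The fix the commit describes — not computing the sum of essential-clause scores twice for the same document — can be illustrated with a small sketch. This is simplified Python with invented names, not Lucene's actual Java `BlockMaxMaxscoreScorer`:

```python
# Sketch of per-document caching of a partial score sum: the sum computed
# for the competitiveness check is remembered so score() does not redo it.
class CachedSumScorer:
    def __init__(self, essential_scorers):
        self.essential_scorers = essential_scorers
        self._cached_doc = -1     # doc ID the cached sum belongs to
        self._cached_sum = 0.0

    def _sum_essential(self, doc):
        # Recompute only when we move to a new document.
        if doc != self._cached_doc:
            self._cached_sum = sum(s.score(doc) for s in self.essential_scorers)
            self._cached_doc = doc
        return self._cached_sum

    def score(self, doc):
        # Both the max-score pruning check and the final score can call
        # this; the underlying scorers are consulted at most once per doc.
        return self._sum_essential(doc)
```

The cache is a single (doc, sum) pair, so it adds no data structure that could outweigh the benefit — which is exactly the concern raised in the issue description.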
[jira] [Commented] (LUCENE-10636) Could the partial score sum from essential list scores be cached?
[ https://issues.apache.org/jira/browse/LUCENE-10636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562464#comment-17562464 ]

ASF subversion and git services commented on LUCENE-10636:
----------------------------------------------------------

Commit 2d05f5c623e06b8bafa1f7b1d6be813c14550690 in lucene's branch refs/heads/branch_9x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=2d05f5c623e ]

LUCENE-10636: Avoid computing the same scores multiple times. (#1005)

`BlockMaxMaxscoreScorer` would previously compute the score twice for essential scorers.

Co-authored-by: zacharymorn
[jira] [Resolved] (LUCENE-10636) Could the partial score sum from essential list scores be cached?
[ https://issues.apache.org/jira/browse/LUCENE-10636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand resolved LUCENE-10636.
-----------------------------------
    Fix Version/s: 9.3
       Resolution: Fixed
[GitHub] [lucene] jpountz merged pull request #1005: LUCENE-10636: Avoid computing the same scores multiple times.
jpountz merged PR #1005: URL: https://github.com/apache/lucene/pull/1005
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova commented on code in PR #992: URL: https://github.com/apache/lucene/pull/992#discussion_r913896000

## lucene/backward-codecs/src/test/org/apache/lucene/backward_codecs/lucene91/Lucene91HnswVectorsWriter.java:
```
@@ -58,7 +57,6 @@ public final class Lucene91HnswVectorsWriter extends KnnVectorsWriter {
     this.maxConn = maxConn;
     this.beamWidth = beamWidth;
-    assert state.fieldInfos.hasVectorValues();
```
Review Comment:
   In the new model, we initialize vectors' writers during indexing, where the `SegmentWriteState` object, along with fully filled `fieldInfos`, is not yet available (it becomes available during flush).
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562711#comment-17562711 ]

Adrien Grand commented on LUCENE-10480:
---------------------------------------

Nightly benchmarks picked up the change and top-level disjunctions are seeing massive speedups, see [OrHighHigh|http://people.apache.org/~mikemccand/lucenebench/OrHighHigh.html] or [OrHighMed|http://people.apache.org/~mikemccand/lucenebench/OrHighMed.html]. However, disjunctions within conjunctions got a slowdown, see [AndHighOrMedMed|http://people.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html] or [AndMedOrHighHigh|http://people.apache.org/~mikemccand/lucenebench/AndMedOrHighHigh.html].

> Specialize 2-clauses disjunctions
> ---------------------------------
>
>                 Key: LUCENE-10480
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10480
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> WANDScorer is nice, but it also has lots of overhead to maintain its
> invariants: one linked list for the current candidates, one priority queue of
> scorers that are behind, another one for scorers that are ahead. All this
> could be simplified in the 2-clauses case, which feels worth specializing for
> as it's very common that end users enter queries that only have two terms?
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562730#comment-17562730 ]

Adrien Grand commented on LUCENE-10480:
---------------------------------------

Looking at this new scorer from the perspective of disjunctions within conjunctions, maybe there are bits from advance() that we could move to matches() so that we would hand it over to the other clause before we start doing expensive operations like computing scores. What do you think [~zacharymorn]?
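Moving work from advance() to matches() is the essence of two-phase iteration: advance() only positions on a cheap candidate, and the expensive verification runs lazily, so in a conjunction the other clause can reject a doc before the expensive work ever happens. A simplified sketch (illustrative Python with invented names, not Lucene's `TwoPhaseIterator` API):

```python
# Sketch of deferring expensive per-doc work from advance() to matches().
class TwoPhaseDisjunction:
    def __init__(self, docs, expensive_check):
        self._docs = sorted(docs)
        self._expensive_check = expensive_check
        self.doc = -1
        self.expensive_calls = 0  # instrumentation for illustration

    def advance(self, target):
        # Cheap phase: just find the next candidate doc at or after target.
        self.doc = next((d for d in self._docs if d >= target), None)
        return self.doc

    def matches(self):
        # Expensive phase: run only when the caller actually needs
        # confirmation (e.g. after all conjunction clauses agree on a doc).
        self.expensive_calls += 1
        return self._expensive_check(self.doc)
```

A conjunction driver would first advance() all clauses to a common doc and only then call matches() on each, which is why deferring score computation into matches() can help the AndHighOrMedMed-style tasks that regressed.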
[jira] [Commented] (LUCENE-8806) WANDScorer should support two-phase iterator
[ https://issues.apache.org/jira/browse/LUCENE-8806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562758#comment-17562758 ]

Denilson Amorim commented on LUCENE-8806:
-----------------------------------------

Before reformulating my question, let me see if I understood this patch and the discussion correctly: WANDScorer doesn't support calling children's two-phase iterators. Therefore, in an attempt to improve performance, this patch adds calls to these two-phase iterators in WAND. However, it didn't perform well in phrase query benchmarks because its max score calculation wasn't per-block. [~jim.ferenczi] hacked together a solution during the discussion here to get per-block max scores in phrase scorers, with a positive outcome. After the discussion went idle, phrase scorers received support for per-block max scores through LUCENE-8311, but this patch hasn't moved. So I was wondering whether it makes sense to move this patch forward. Thanks in advance.

> WANDScorer should support two-phase iterator
> --------------------------------------------
>
>                 Key: LUCENE-8806
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8806
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>            Priority: Major
>         Attachments: LUCENE-8806.patch, LUCENE-8806.patch
>
> Following https://issues.apache.org/jira/browse/LUCENE-8770, the WANDScorer
> should leverage two-phase iterators in order to be faster when used in
> conjunctions.
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova commented on code in PR #992: URL: https://github.com/apache/lucene/pull/992#discussion_r914008509

## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java:
```
@@ -203,8 +204,11 @@ private NeighborQueue searchLevel(
     return results;
   }

-  private void clearScratchState() {
+  private void clearScratchState(int capacity) {
     candidates.clear();
+    if (visited.length() < capacity) {
+      visited = FixedBitSet.ensureCapacity((FixedBitSet) visited, capacity);
```
Review Comment:
   One thing to note is that we shouldn't create new objects too often, as over-allocation happens exponentially.
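The over-allocation pattern behind `FixedBitSet.ensureCapacity` can be sketched as follows. This is illustrative Python, not Lucene's Java code, and the growth factor here is an assumption (Lucene uses its own oversize policy): growing geometrically means a sequence of increasing capacity requests triggers only O(log n) reallocations instead of one per request.

```python
# Sketch of ensureCapacity-style geometric growth for a bit set.
def ensure_capacity(bits, num_bits, grow_factor=1.5):
    """Return `bits` unchanged if it is already large enough; otherwise
    return a grown copy whose size over-shoots the request geometrically."""
    if len(bits) >= num_bits:
        return bits  # reuse the existing object, no allocation
    # Over-allocate so the next few requests are absorbed for free.
    new_len = max(num_bits, int(len(bits) * grow_factor) + 1)
    return bits + [False] * (new_len - len(bits))
```

This is why the review comment cares about not reallocating too often: each reallocation copies the whole set, and the over-shoot only amortizes that cost if callers reuse the grown object.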
[jira] [Commented] (LUCENE-10639) WANDScorer performs better without two-phase
[ https://issues.apache.org/jira/browse/LUCENE-10639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562786#comment-17562786 ]

Greg Miller commented on LUCENE-10639:
--------------------------------------

As a quick update, I ran benchmarks with just [livedoc checking broken out|https://github.com/gsmiller/lucene/commit/f4e9614a299523b57c854a3bd3371253f0a7fb17] in {{DefaultBulkScorer}}. Surprisingly, I didn't see any difference, so maybe something else is going on here. Note that I ran this with {{wikimedium10m}} instead of {{all}} to get a datapoint a bit quicker:

{code:java}
Task                         QPS baseline (StdDev)   QPS candidate (StdDev)   Pct diff               p-value
Prefix3                      118.98 (10.2%)          114.60 (9.9%)            -3.7% ( -21% - 18%)    0.247
Wildcard                     40.69 (6.9%)            39.62 (7.2%)             -2.6% ( -15% - 12%)    0.236
TermDTSort                   17.76 (20.4%)           17.33 (14.2%)            -2.4% ( -30% - 40%)    0.663
OrNotHighHigh                881.01 (4.4%)           861.34 (3.9%)            -2.2% ( -10% - 6%)     0.089
AndHighHigh                  8.87 (5.0%)             8.70 (6.2%)              -1.8% ( -12% - 9%)     0.296
MedTerm                      1771.40 (4.2%)          1740.50 (4.4%)           -1.7% ( -9% - 7%)      0.198
AndHighMed                   30.59 (4.0%)            30.06 (5.6%)             -1.7% ( -10% - 8%)     0.267
OrHighNotLow                 782.90 (4.8%)           769.92 (5.1%)            -1.7% ( -11% - 8%)     0.291
HighPhrase                   392.18 (2.7%)           386.50 (2.7%)            -1.4% ( -6% - 4%)      0.087
OrHighNotHigh                830.76 (4.3%)           818.83 (4.3%)            -1.4% ( -9% - 7%)      0.295
OrNotHighMed                 585.86 (2.6%)           578.07 (3.1%)            -1.3% ( -6% - 4%)      0.146
OrHighNotMed                 966.75 (3.6%)           956.07 (3.9%)            -1.1% ( -8% - 6%)      0.352
LowPhrase                    546.02 (2.1%)           540.42 (2.4%)            -1.0% ( -5% - 3%)      0.148
MedPhrase                    24.65 (2.3%)            24.40 (3.0%)             -1.0% ( -6% - 4%)      0.225
AndHighLow                   508.37 (3.7%)           503.84 (4.7%)            -0.9% ( -8% - 7%)      0.506
OrNotHighLow                 672.15 (2.7%)           666.29 (2.8%)            -0.9% ( -6% - 4%)      0.313
BrowseMonthTaxoFacets        8.92 (14.5%)            8.84 (13.9%)             -0.9% ( -25% - 32%)    0.846
AndHighMedDayTaxoFacets      39.14 (2.2%)            38.82 (2.2%)             -0.8% ( -5% - 3%)      0.241
AndHighHighDayTaxoFacets     8.01 (2.8%)             7.96 (2.8%)              -0.7% ( -6% - 4%)      0.416
LowSloppyPhrase              5.83 (3.8%)             5.79 (3.8%)              -0.7% ( -8% - 7%)      0.556
OrHighLow                    128.01 (3.7%)           127.11 (3.8%)            -0.7% ( -7% - 7%)      0.554
HighTerm                     1190.03 (4.4%)          1183.10 (4.1%)           -0.6% ( -8% - 8%)      0.663
MedSloppyPhrase              11.67 (2.1%)            11.61 (2.6%)             -0.5% ( -5% - 4%)      0.480
MedTermDayTaxoFacets         14.09 (3.1%)            14.03 (4.1%)             -0.5% ( -7% - 6%)      0.686
IntNRQ                       110.15 (2.3%)           109.69 (2.1%)            -0.4% ( -4% - 4%)      0.546
HighSloppyPhrase             9.56 (4.5%)             9.53 (4.5%)              -0.4% ( -8% - 9%)      0.794
BrowseDateSSDVFacets         0.85 (10.4%)            0.85 (10.8%)             -0.3% ( -19% - 23%)    0.939
Respell                      33.65 (1.7%)            33.58 (1.7%)             -0.2% ( -3% - 3%)      0.684
Fuzzy2                       74.16 (1.9%)            74.02 (1.7%)             -0.2% ( -3% - 3%)      0.740
LowTerm                      1522.48 (2.9%)          1520.76 (3.3%)           -0.1% ( -6% - 6%)      0.909
LowIntervalsOrdered          12.75 (3.3%)            12.74 (3.3%)             -0.1% ( -6% - 6%)      0.915
HighIntervalsOrdered         6.30 (4.2%)             6.31 (4.0%)              0.1% ( -7% - 8%)       0.923
BrowseRandomLabelSSDVFacets  2.57 (4.9%)             2.57 (4.9%)              0.1% ( -9% - 10%)      0.927
Fuzzy1                       57.11 (1.9%)            57.26 (1.7%)             0.2% ( -3% - 3%)       0.666
BrowseRandomLabelTaxoFacets  6.32 (9.3%)             6.34 (10.3%)             0.3% ( -17% - 21%)     0.911
LowSpanNear                  15.95 (2.9%)            16.01 (2.7%)             0.4% ( -5% - 6%)       0.680
MedIntervalsOrdered          1.61 (5.8%)             1.62 (5.8%)              0.4% ( -10% - 12%)     0.834
HighSpanNear                 2.27 (4.2%)             2.28 (4.0%)              0.6% ( -7% -
[GitHub] [lucene] gsmiller commented on a diff in pull request #974: LUCENE-10614: Properly support getTopChildren in RangeFacetCounts
gsmiller commented on code in PR #974: URL: https://github.com/apache/lucene/pull/974#discussion_r914082183

## lucene/demo/src/java/org/apache/lucene/demo/facet/DistanceFacetsExample.java:
```
@@ -212,7 +212,26 @@ public static Query getBoundingBoxQuery(
   }

   /** User runs a query and counts facets. */
-  public FacetResult search() throws IOException {
+  public FacetResult searchAllChildren() throws IOException {
+
+    FacetsCollector fc = searcher.search(new MatchAllDocsQuery(), new FacetsCollectorManager());
+
+    Facets facets =
+        new DoubleRangeFacetCounts(
+            "field",
+            getDistanceValueSource(),
+            fc,
+            getBoundingBoxQuery(ORIGIN_LATITUDE, ORIGIN_LONGITUDE, 10.0),
+            ONE_KM,
+            TWO_KM,
+            FIVE_KM,
+            TEN_KM);
+
+    return facets.getAllChildren("field");
+  }
+
+  /** User runs a query and counts facets. */
+  public FacetResult searchTopChildren() throws IOException {
```
Review Comment:
   I see. OK, a couple of things confused me in your code. It looks like it's doing what I describe, but what threw me is that 1) you're updating a variable named `currentTime` to represent the end of each range, and 2) the label says "Past ...", which made me think it was a trailing window. As a couple of suggestions, maybe rename your `then` and `currentTime` variables to something like `startTime` and `endTime`? And maybe rename the labels to just something like "Hour x - y"?
[GitHub] [lucene] gsmiller commented on a diff in pull request #974: LUCENE-10614: Properly support getTopChildren in RangeFacetCounts
gsmiller commented on code in PR #974: URL: https://github.com/apache/lucene/pull/974#discussion_r914083105

## lucene/demo/src/java/org/apache/lucene/demo/facet/RangeFacetsExample.java:
```
@@ -73,6 +76,30 @@ public void index() throws IOException {
       indexWriter.addDocument(doc);
     }

+    // Add documents with a fake timestamp, 3600 sec (1 hour) before "now", 7200 sec (2
+    // hours) before "now", ...:
+    long currentTime = System.currentTimeMillis() / 1000L;
+    // Index error messages for the past week (24 * 7 = 168 hours)
+    for (int i = 0; i < 168; i++) {
+      long then = currentTime - (i + 1) * 3600;
+
+      // conditionally add different number of error messages to the past hour slot
+      for (int j = 0; j < i % 35; j++) {
```
Review Comment:
   Why mod by `35`? I'm not getting it. Is there a specific reason you're using that value? If so, could you add a comment?

## lucene/demo/src/java/org/apache/lucene/demo/facet/RangeFacetsExample.java:
```
@@ -73,6 +76,30 @@ public void index() throws IOException {
       indexWriter.addDocument(doc);
     }

+    // Add documents with a fake timestamp, 3600 sec (1 hour) before "now", 7200 sec (2
+    // hours) before "now", ...:
+    long currentTime = System.currentTimeMillis() / 1000L;
+    // Index error messages for the past week (24 * 7 = 168 hours)
+    for (int i = 0; i < 168; i++) {
```
Review Comment:
   minor: I find it a bit confusing that you index the data and create ranges in "reverse", starting from "now". Would it be easier for others to understand if you started "one week ago" and looped "forwards"? Since the demo code is here to help people understand the functionality, I want to make sure we don't create unnecessary confusion.
## lucene/demo/src/java/org/apache/lucene/demo/facet/RangeFacetsExample.java:
```
@@ -73,6 +76,30 @@ public void index() throws IOException {
       indexWriter.addDocument(doc);
     }

+    // Add documents with a fake timestamp, 3600 sec (1 hour) before "now", 7200 sec (2
+    // hours) before "now", ...:
+    long currentTime = System.currentTimeMillis() / 1000L;
+    // Index error messages for the past week (24 * 7 = 168 hours)
+    for (int i = 0; i < 168; i++) {
+      long then = currentTime - (i + 1) * 3600;
+
+      // conditionally add different number of error messages to the past hour slot
+      for (int j = 0; j < i % 35; j++) {
+        Document doc = new Document();
+        doc.add(new NumericDocValuesField("error log", then));
```
Review Comment:
   Could we add some "jitter" to what gets indexed so all the timestamps of the "fake error logs" don't fall right on hour boundaries? That would be a bit more realistic as an example.

## lucene/demo/src/java/org/apache/lucene/demo/facet/RangeFacetsExample.java:
```
@@ -73,6 +76,30 @@ public void index() throws IOException {
       indexWriter.addDocument(doc);
     }

+    // Add documents with a fake timestamp, 3600 sec (1 hour) before "now", 7200 sec (2
+    // hours) before "now", ...:
+    long currentTime = System.currentTimeMillis() / 1000L;
+    // Index error messages for the past week (24 * 7 = 168 hours)
+    for (int i = 0; i < 168; i++) {
+      long then = currentTime - (i + 1) * 3600;
+
+      // conditionally add different number of error messages to the past hour slot
+      for (int j = 0; j < i % 35; j++) {
+        Document doc = new Document();
+        doc.add(new NumericDocValuesField("error log", then));
+        doc.add(
+            new StringField(
+                "Error msg", "[Error] Server encountered error at " + currentTime, Field.Store.NO));
```
Review Comment:
   Shouldn't the "logged" timestamp here be the same as the one indexed in the previous line (i.e., `then` instead of `currentTime`)? I realize that doesn't impact the functionality of the example, but I'm just trying to avoid confusion.
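The reviewer's suggestions above (loop forwards from one week ago, add jitter inside each hour, vary the count per hour) can be sketched together. This is illustrative Python rather than the Java demo code, and `fake_error_timestamps` is an invented name:

```python
# Sketch of the demo's timestamp generation, applying the review suggestions.
import random

HOUR = 3600
WEEK_HOURS = 24 * 7  # 168

def fake_error_timestamps(now, rng=None):
    """Generate fake error-log timestamps for the past week, iterating
    forwards from one week ago, with random jitter inside each hour."""
    rng = rng or random.Random(42)  # fixed seed for a reproducible demo
    one_week_ago = now - WEEK_HOURS * HOUR
    timestamps = []
    for hour in range(WEEK_HOURS):
        start_time = one_week_ago + hour * HOUR  # startTime/endTime naming
        # vary the number of errors per hour, as in the demo's `i % 35`
        for _ in range(hour % 35):
            jitter = rng.randrange(HOUR)  # spread inside the hour
            timestamps.append(start_time + jitter)
    return timestamps
```

Looping forwards makes the "Hour x - y" labels line up naturally with the loop index, and the jitter keeps timestamps off exact hour boundaries as requested.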
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova commented on code in PR #992: URL: https://github.com/apache/lucene/pull/992#discussion_r914113679

## lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java:
```
@@ -24,28 +24,40 @@
 import org.apache.lucene.index.DocIDMerger;
 import org.apache.lucene.index.FieldInfo;
 import org.apache.lucene.index.MergeState;
+import org.apache.lucene.index.Sorter;
 import org.apache.lucene.index.VectorValues;
 import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.util.Accountable;
 import org.apache.lucene.util.Bits;
 import org.apache.lucene.util.BytesRef;

 /** Writes vectors to an index. */
-public abstract class KnnVectorsWriter implements Closeable {
+public abstract class KnnVectorsWriter implements Accountable, Closeable {

   /** Sole constructor */
   protected KnnVectorsWriter() {}

-  /** Write all values contained in the provided reader */
-  public abstract void writeField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader)
+  /** Add new field for indexing */
+  public abstract void addField(FieldInfo fieldInfo) throws IOException;
+
+  /** Add new docID with its vector value to the given field for indexing */
+  public abstract void addValue(FieldInfo fieldInfo, int docID, float[] vectorValue)
+      throws IOException;
+
+  /** Flush all buffered data on disk */
+  public abstract void flush(int maxDoc, Sorter.DocMap sortMap) throws IOException;
+
+  /** Write field for merging */
+  public abstract void writeFieldForMerging(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader)
```
Review Comment:
   @jpountz Yes, indeed this is the same method as `mergeXXXField` in `DocValuesConsumer` or `mergeOneField` in `PointsWriter`. I am not quite clear on what you meant by "make this method responsible for creating the merged view (instead of doing it on top)" — can you please clarify?
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova commented on code in PR #992: URL: https://github.com/apache/lucene/pull/992#discussion_r914119921

## lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java:
```
+  /** Flush all buffered data on disk */
+  public abstract void flush(int maxDoc, Sorter.DocMap sortMap) throws IOException;
```
Review Comment:
   We need a `finish()` method separate from `flush`, as it is also used by the `merge` method.
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova commented on code in PR #992: URL: https://github.com/apache/lucene/pull/992#discussion_r914126920

## lucene/core/src/java/org/apache/lucene/index/VectorValuesConsumer.java:
```
@@ -0,0 +1,93 @@
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.lucene.index;

import java.io.IOException;
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.KnnVectorsWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.util.Accountable;
import org.apache.lucene.util.IOUtils;
import org.apache.lucene.util.InfoStream;

/**
 * Streams vector values for indexing to the given codec's vectors writer. The codec's vectors
 * writer is responsible for buffering and processing vectors.
 */
class VectorValuesConsumer {
  private final Codec codec;
  private final Directory directory;
  private final SegmentInfo segmentInfo;
  private final InfoStream infoStream;

  private Accountable accountable = Accountable.NULL_ACCOUNTABLE;
  private KnnVectorsWriter writer;

  VectorValuesConsumer(
      Codec codec, Directory directory, SegmentInfo segmentInfo, InfoStream infoStream) {
    this.codec = codec;
    this.directory = directory;
    this.segmentInfo = segmentInfo;
    this.infoStream = infoStream;
  }

  private void initKnnVectorsWriter(String fieldName) throws IOException {
    if (writer == null) {
      KnnVectorsFormat fmt = codec.knnVectorsFormat();
      if (fmt == null) {
        throw new IllegalStateException(
            "field=\""
                + fieldName
                + "\" was indexed as vectors but codec does not support vectors");
      }
      SegmentWriteState initialWriteState =
          new SegmentWriteState(infoStream, directory, segmentInfo, null, null, IOContext.DEFAULT);
      writer = fmt.fieldsWriter(initialWriteState);
      accountable = writer;
    }
  }

  public void addField(FieldInfo fieldInfo) throws IOException {
    initKnnVectorsWriter(fieldInfo.name);
    writer.addField(fieldInfo);
  }

  public void addValue(FieldInfo fieldInfo, int docID, float[] vectorValue) throws IOException {
    writer.addValue(fieldInfo, docID, vectorValue);
  }

  void flush(SegmentWriteState state, Sorter.DocMap sortMap) throws IOException {
```
Review Comment:
   No, I don't think we need it: we pass the information about the segment's maxDoc to `writer.flush`. Also, stored fields writers need to be passed every doc even if a doc doesn't contain stored fields, while vectors' writers only need to be passed docs that contain vectors.
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
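The `VectorValuesConsumer` reviewed above creates the codec's `KnnVectorsWriter` lazily: `initKnnVectorsWriter` only constructs the writer on the first `addField` call. The self-contained sketch below illustrates that lazy-initialization pattern in isolation; `LazyConsumer` and `MockWriter` are hypothetical stand-ins for illustration, not Lucene classes.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the lazy-initialization pattern used by VectorValuesConsumer:
// the (potentially expensive) writer is created only when the first field is added.
public class LazyConsumer {
  static class MockWriter {
    final List<String> fields = new ArrayList<>();
    void addField(String name) { fields.add(name); }
  }

  private MockWriter writer; // stays null until first use, like initKnnVectorsWriter

  private void initWriter() {
    if (writer == null) {
      writer = new MockWriter(); // expensive setup runs at most once
    }
  }

  public boolean isInitialized() { return writer != null; }

  public void addField(String name) {
    initWriter();
    writer.addField(name);
  }

  public int fieldCount() { return writer == null ? 0 : writer.fields.size(); }

  public static void main(String[] args) {
    LazyConsumer c = new LazyConsumer();
    System.out.println(c.isInitialized()); // false: no writer created yet
    c.addField("vector_field");
    System.out.println(c.isInitialized()); // true: created on first addField
  }
}
```

A segment that never indexes a vector field never pays for writer construction, which is the point of the pattern.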
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova commented on code in PR #992: URL: https://github.com/apache/lucene/pull/992#discussion_r914132447

## lucene/core/src/java/org/apache/lucene/index/VectorValuesWriter.java:
## @@ -26,233 +26,153 @@
 import org.apache.lucene.codecs.KnnVectorsWriter;
 import org.apache.lucene.search.DocIdSetIterator;
 import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.util.Accountable;
 import org.apache.lucene.util.ArrayUtil;
 import org.apache.lucene.util.Bits;
 import org.apache.lucene.util.BytesRef;
-import org.apache.lucene.util.Counter;
 import org.apache.lucene.util.RamUsageEstimator;

 /**
- * Buffers up pending vector value(s) per doc, then flushes when segment flushes.
+ * Buffers up pending vector value(s) per doc, then flushes when segment flushes. Used for {@code
+ * SimpleTextKnnVectorsWriter} and for vectors writers before v 9.3.
  *
  * @lucene.experimental
  */
-class VectorValuesWriter {
-
-  private final FieldInfo fieldInfo;
-  private final Counter iwBytesUsed;
-  private final List vectors = new ArrayList<>();
-  private final DocsWithFieldSet docsWithField;
-
-  private int lastDocID = -1;
-
-  private long bytesUsed;
-
-  VectorValuesWriter(FieldInfo fieldInfo, Counter iwBytesUsed) {
-    this.fieldInfo = fieldInfo;
-    this.iwBytesUsed = iwBytesUsed;
-    this.docsWithField = new DocsWithFieldSet();
-    this.bytesUsed = docsWithField.ramBytesUsed();
-    if (iwBytesUsed != null) {
-      iwBytesUsed.addAndGet(bytesUsed);
+public abstract class VectorValuesWriter extends KnnVectorsWriter {

Review Comment:
+1 for renaming.

> I also wonder if we could update SimpleTextKnnVectorsWriter to use the new writer interface. Then we could move this class to the backwards-codecs package, because it would only be used in the old codec tests.

This would mean we need to copy all the code from `BufferingKnnVectorsWriter` to `SimpleTextKnnVectorsWriter`? Are we ok with this?
[jira] [Commented] (LUCENE-10626) Hunspell: add tools to aid dictionary editing: analysis introspection, stem expansion and stem/flag suggestion
[ https://issues.apache.org/jira/browse/LUCENE-10626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562821#comment-17562821 ]

ASF subversion and git services commented on LUCENE-10626:
----------------------------------------------------------

Commit d537013e70872015364c745e5f320727efc034b7 in lucene's branch refs/heads/main from Peter Gromov
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d537013e708 ]

LUCENE-10626: Hunspell: add tools to aid dictionary editing: analysis introspection, stem expansion and stem/flag suggestion (#975)

> Hunspell: add tools to aid dictionary editing: analysis introspection, stem
> expansion and stem/flag suggestion
> --------------------------------------------------------------------------
>
> Key: LUCENE-10626
> URL: https://issues.apache.org/jira/browse/LUCENE-10626
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Peter Gromov
> Priority: Major
> Time Spent: 2h 40m
> Remaining Estimate: 0h
>
> The following tools would be nice to have when editing and appending an
> existing dictionary:
> 1. See how Hunspell analyzes a given word, with all the involved affix flags:
> `Hunspell.analyzeSimpleWord`
> 2. See all forms that the given stem can produce with the given flags:
> `Hunspell.expandRoot`, `WordFormGenerator.expandRoot`
> 3. Given a number of word forms, suggest a stem and a set of flags that
> produce these word forms: `Hunspell.compress`, `WordFormGenerator.compress`.

-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on a diff in pull request #1004: LUCENE-10603: Stop using SortedSetDocValues.NO_MORE_ORDS in tests
gsmiller commented on code in PR #1004: URL: https://github.com/apache/lucene/pull/1004#discussion_r914131349 ## lucene/memory/src/test/org/apache/lucene/index/memory/TestMemoryIndex.java: ## @@ -298,10 +298,10 @@ public void testDocValues() throws Exception { assertEquals(3, sortedSetDocValues.getValueCount()); assertEquals(0, sortedSetDocValues.nextDoc()); assertEquals(3, sortedSetDocValues.docValueCount()); +assertEquals(3, sortedSetDocValues.docValueCount()); Review Comment: Looks like we're already asserting this on the previous line :) ## lucene/core/src/test/org/apache/lucene/index/TestSortedSetDocValues.java: ## @@ -1,26 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.lucene.index; - -import org.apache.lucene.tests.util.LuceneTestCase; - -public class TestSortedSetDocValues extends LuceneTestCase { - - public void testNoMoreOrdsConstant() { Review Comment: Let's keep this test as long as we still define `NO_MORE_ORDS`. Even though we've marked it as deprecated and are moving off of using it for iteration, our users are likely still relying on it for iteration so we should still keep this test. We can remove it at the same time we actually remove the constant definition. 
## lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseDocValuesFormatTestCase.java: ## @@ -1878,14 +1878,12 @@ public void testSortedSetTwoDocumentsMerged() throws IOException { assertEquals(0, dv.nextDoc()); assertEquals(0, dv.nextOrd()); -assertEquals(NO_MORE_ORDS, dv.nextOrd()); Review Comment: Should we assert the docValueCount is 1 here? ## lucene/test-framework/src/java/org/apache/lucene/tests/util/LuceneTestCase.java: ## @@ -2576,11 +2576,11 @@ public void assertDocValuesEquals(String info, IndexReader leftReader, IndexRead if (docID == NO_MORE_DOCS) { break; } -long ord; -while ((ord = leftValues.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) { +assertEquals(info, leftValues.docValueCount(), rightValues.docValueCount()); +for (int i = 0; i < leftValues.docValueCount(); i++) { + long ord = leftValues.nextOrd(); Review Comment: minor: I might just one-line the for-loop body to: `assertEquals(info, leftValues.nextOrd(), rightValues.nextOrd());` Alternatively, if you find it more readable to create the local variables, I'd create one for each: ``` long leftOrd = leftValues.nextOrd(); long rightOrd = rightValues.nextOrd(); assertEquals(info, leftOrd, rightOrd); ``` Just feels a little inconsistent as it currently is :) ## lucene/backward-codecs/src/test/org/apache/lucene/backward_codecs/lucene80/BaseLucene80DocValuesFormatTestCase.java: ## @@ -480,15 +476,14 @@ public void testSortedSetAroundBlockSize() throws IOException { for (int i = 0; i < maxDoc; ++i) { assertEquals(i, values.nextDoc()); final int numValues = in.readVInt(); +assertEquals(numValues, values.docValueCount()); for (int j = 0; j < numValues; ++j) { b.setLength(in.readVInt()); b.grow(b.length()); in.readBytes(b.bytes(), 0, b.length()); assertEquals(b.get(), values.lookupOrd(values.nextOrd())); } - -assertEquals(SortedSetDocValues.NO_MORE_ORDS, values.nextOrd()); Review Comment: I think we ought to keep this for now until we actually remove the contract that `nextOrd()` returns 
`NO_MORE_ORDS` when exhausted (assuming we plan to back-port this to 9.x, which I think we should). Since 9.x will need to continue to return `NO_MORE_ORDS` as part of the API contract, it would be good to have tests for that behavior. When we go to actually remove `NO_MORE_ORDS`, which we should do only to `main` and under a separate Jira issue, we can remove this check. ## lucene/backward-codecs/src/test/org/apache/lucene/backward_index/TestBackwardsCompatibility.java: ## @@ -1205,8 +1205,8 @@ public void searchIndex( assertEquals(id, dvShort.longValue()); assertEquals(i,
[GitHub] [lucene] donnerpeter merged pull request #975: LUCENE-10626 Hunspell: add tools to aid dictionary editing
donnerpeter merged PR #975: URL: https://github.com/apache/lucene/pull/975
[GitHub] [lucene] shahrs87 commented on pull request #907: LUCENE-10357 Ghost fields and postings/points
shahrs87 commented on PR #907: URL: https://github.com/apache/lucene/pull/907#issuecomment-1175446230

@jpountz There are no null or terms.EMPTY checks in CheckIndex class anymore.
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova commented on code in PR #992: URL: https://github.com/apache/lucene/pull/992#discussion_r914171256

## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java:
## @@ -203,8 +204,11 @@ private NeighborQueue searchLevel(
     return results;
   }

-  private void clearScratchState() {
+  private void clearScratchState(int capacity) {

Review Comment:
Addressed in 2f58350081902bfc13cb02424343ab805c02b0a5
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova commented on code in PR #992: URL: https://github.com/apache/lucene/pull/992#discussion_r914132447

## lucene/core/src/java/org/apache/lucene/index/VectorValuesWriter.java:
## @@ -26,233 +26,153 @@
 import org.apache.lucene.codecs.KnnVectorsWriter;
 import org.apache.lucene.search.DocIdSetIterator;
 import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.util.Accountable;
 import org.apache.lucene.util.ArrayUtil;
 import org.apache.lucene.util.Bits;
 import org.apache.lucene.util.BytesRef;
-import org.apache.lucene.util.Counter;
 import org.apache.lucene.util.RamUsageEstimator;

 /**
- * Buffers up pending vector value(s) per doc, then flushes when segment flushes.
+ * Buffers up pending vector value(s) per doc, then flushes when segment flushes. Used for {@code
+ * SimpleTextKnnVectorsWriter} and for vectors writers before v 9.3.
  *
  * @lucene.experimental
  */
-class VectorValuesWriter {
-
-  private final FieldInfo fieldInfo;
-  private final Counter iwBytesUsed;
-  private final List vectors = new ArrayList<>();
-  private final DocsWithFieldSet docsWithField;
-
-  private int lastDocID = -1;
-
-  private long bytesUsed;
-
-  VectorValuesWriter(FieldInfo fieldInfo, Counter iwBytesUsed) {
-    this.fieldInfo = fieldInfo;
-    this.iwBytesUsed = iwBytesUsed;
-    this.docsWithField = new DocsWithFieldSet();
-    this.bytesUsed = docsWithField.ramBytesUsed();
-    if (iwBytesUsed != null) {
-      iwBytesUsed.addAndGet(bytesUsed);
+public abstract class VectorValuesWriter extends KnnVectorsWriter {

Review Comment:
+1 for renaming. Addressed in 2f58350081902bfc13cb02424343ab805c02b0a5

> I also wonder if we could update SimpleTextKnnVectorsWriter to use the new writer interface. Then we could move this class to the backwards-codecs package, because it would only be used in the old codec tests.

This would mean we need to copy all the code from `BufferingKnnVectorsWriter` to `SimpleTextKnnVectorsWriter`? Are we ok with this?
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova commented on code in PR #992: URL: https://github.com/apache/lucene/pull/992#discussion_r914172262

## lucene/core/src/java/org/apache/lucene/codecs/lucene93/Lucene93HnswVectorsWriter.java:
## @@ -116,7 +119,193 @@ public final class Lucene93HnswVectorsWriter extends KnnVectorsWriter {
   }

   @Override
-  public void writeField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader)
+  public void addField(FieldInfo fieldInfo) throws IOException {
+    if (fields == null) {
+      fields = new FieldData[1];
+    } else {
+      FieldData[] newFields = new FieldData[fields.length + 1];
+      System.arraycopy(fields, 0, newFields, 0, fields.length);
+      fields = newFields;
+    }
+    fields[fields.length - 1] =
+        new FieldData(fieldInfo, M, beamWidth, segmentWriteState.infoStream);
+  }
+
+  @Override
+  public void addValue(FieldInfo fieldInfo, int docID, float[] vectorValue) throws IOException {
+    for (FieldData field : fields) {

Review Comment:
No longer relevant, as in 2f58350081902bfc13cb02424343ab805c02b0a5 we use `addValue` for `KnnFieldVectorsWriter`.
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova commented on code in PR #992: URL: https://github.com/apache/lucene/pull/992#discussion_r914172811

## lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java:
## @@ -24,28 +24,40 @@
 import org.apache.lucene.index.DocIDMerger;
 import org.apache.lucene.index.FieldInfo;
 import org.apache.lucene.index.MergeState;
+import org.apache.lucene.index.Sorter;
 import org.apache.lucene.index.VectorValues;
 import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.util.Accountable;
 import org.apache.lucene.util.Bits;
 import org.apache.lucene.util.BytesRef;

 /** Writes vectors to an index. */
-public abstract class KnnVectorsWriter implements Closeable {
+public abstract class KnnVectorsWriter implements Accountable, Closeable {

   /** Sole constructor */
   protected KnnVectorsWriter() {}

-  /** Write all values contained in the provided reader */
-  public abstract void writeField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader)
+  /** Add new field for indexing */
+  public abstract void addField(FieldInfo fieldInfo) throws IOException;
+
+  /** Add new docID with its vector value to the given field for indexing */
+  public abstract void addValue(FieldInfo fieldInfo, int docID, float[] vectorValue)
+      throws IOException;

Review Comment:
Great feedback! Addressed in 2f58350081902bfc13cb02424343ab805c02b0a5
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova commented on code in PR #992: URL: https://github.com/apache/lucene/pull/992#discussion_r914172979

## lucene/core/src/java/org/apache/lucene/codecs/perfield/PerFieldKnnVectorsFormat.java:
## @@ -94,17 +95,61 @@ public KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException

   private class FieldsWriter extends KnnVectorsWriter {
     private final Map formats;
     private final Map suffixes = new HashMap<>();
+    private final Map> writersForFields =
+        new IdentityHashMap<>();
     private final SegmentWriteState segmentWriteState;
+    // if there is a single writer, cache it for faster indexing
+    private KnnVectorsWriter singleWriter;

Review Comment:
Addressed in 2f58350081902bfc13cb02424343ab805c02b0a5
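The `PerFieldKnnVectorsFormat.FieldsWriter` change discussed above keeps a `writersForFields` map so that each field's values are routed to the writer of that field's own format. The following self-contained sketch illustrates that per-field dispatch in isolation; the class and interface names below are simplified hypothetical stand-ins, not the real Lucene types.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for PerFieldKnnVectorsFormat.FieldsWriter's per-field
// routing: one writer per field, created on addField and looked up on addValue.
public class PerFieldDispatch {
  interface FieldWriter {
    void addValue(float[] vector);
    int valueCount();
  }

  static class CountingWriter implements FieldWriter {
    private int count;
    public void addValue(float[] vector) { count++; }
    public int valueCount() { return count; }
  }

  private final Map<String, FieldWriter> writersForFields = new HashMap<>();

  public void addField(String field) {
    // each field gets its own writer, created at most once
    writersForFields.putIfAbsent(field, new CountingWriter());
  }

  public void addValue(String field, float[] vector) {
    FieldWriter w = writersForFields.get(field);
    if (w == null) {
      throw new IllegalStateException("addField was not called for \"" + field + "\"");
    }
    w.addValue(vector);
  }

  public int valueCount(String field) {
    FieldWriter w = writersForFields.get(field);
    return w == null ? 0 : w.valueCount();
  }

  public static void main(String[] args) {
    PerFieldDispatch d = new PerFieldDispatch();
    d.addField("a");
    d.addField("b");
    d.addValue("a", new float[] {1f, 2f});
    d.addValue("a", new float[] {3f, 4f});
    d.addValue("b", new float[] {5f, 6f});
    System.out.println(d.valueCount("a")); // 2
    System.out.println(d.valueCount("b")); // 1
  }
}
```

The map-based lookup is what makes the `singleWriter` cache in the reviewed diff attractive: when only one format is in play, the per-value map lookup can be skipped entirely.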
[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues
[ https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562832#comment-17562832 ]

Greg Miller commented on LUCENE-10603:
--------------------------------------

Thanks [~stefanvodita] for jumping in as well to help! I left a little feedback on the PR. Thanks again!

> Improve iteration of ords for SortedSetDocValues
> ------------------------------------------------
>
> Key: LUCENE-10603
> URL: https://issues.apache.org/jira/browse/LUCENE-10603
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Lu Xugang
> Assignee: Lu Xugang
> Priority: Trivial
> Time Spent: 4h
> Remaining Estimate: 0h
>
> After SortedSetDocValues#docValueCount was added in Lucene 9.2, should we
> refactor the implementation of ords iteration to use docValueCount instead of
> NO_MORE_ORDS, similar to how SortedNumericDocValues does it?
> From
> {code:java}
> for (long ord = values.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = values.nextOrd()) {
> }{code}
> to
> {code:java}
> for (int i = 0; i < values.docValueCount(); i++) {
>   long ord = values.nextOrd();
> }{code}
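The refactoring proposed in LUCENE-10603 replaces the sentinel-terminated loop with a counted loop; both visit exactly the same ords for a document. The self-contained sketch below demonstrates the two equivalent styles. `FakeSortedSetValues` is an illustrative stand-in for `SortedSetDocValues` (modeling one document's ords), not the real Lucene class.

```java
// Demonstrates the two iteration styles discussed in LUCENE-10603.
public class OrdIterationDemo {
  static final long NO_MORE_ORDS = -1; // sentinel marking the end of a doc's ords

  static class FakeSortedSetValues {
    private final long[] ords;
    private int pos = 0;
    FakeSortedSetValues(long... ords) { this.ords = ords; }
    int docValueCount() { return ords.length; }
    long nextOrd() { return pos < ords.length ? ords[pos++] : NO_MORE_ORDS; }
  }

  // Old style: iterate until the NO_MORE_ORDS sentinel is returned.
  static long sumOldStyle(FakeSortedSetValues v) {
    long sum = 0;
    for (long ord = v.nextOrd(); ord != NO_MORE_ORDS; ord = v.nextOrd()) {
      sum += ord;
    }
    return sum;
  }

  // New style: read exactly docValueCount() ords; no sentinel check needed.
  static long sumNewStyle(FakeSortedSetValues v) {
    long sum = 0;
    int count = v.docValueCount();
    for (int i = 0; i < count; i++) {
      sum += v.nextOrd();
    }
    return sum;
  }

  public static void main(String[] args) {
    System.out.println(sumOldStyle(new FakeSortedSetValues(1, 3, 7))); // 11
    System.out.println(sumNewStyle(new FakeSortedSetValues(1, 3, 7))); // 11
  }
}
```

Note that hoisting `docValueCount()` out of the loop condition, as above, avoids re-calling it on every iteration; the one-line form shown in the issue is equivalent.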
[GitHub] [lucene] stefanvodita commented on pull request #1004: LUCENE-10603: Stop using SortedSetDocValues.NO_MORE_ORDS in tests
stefanvodita commented on PR #1004: URL: https://github.com/apache/lucene/pull/1004#issuecomment-1175589343

Thanks @gsmiller for patiently checking through all those changes! I’ve reverted the ones you pointed out.
[GitHub] [lucene] stefanvodita commented on a diff in pull request #1004: LUCENE-10603: Stop using SortedSetDocValues.NO_MORE_ORDS in tests
stefanvodita commented on code in PR #1004: URL: https://github.com/apache/lucene/pull/1004#discussion_r914274231

## lucene/memory/src/test/org/apache/lucene/index/memory/TestMemoryIndex.java:
## @@ -298,10 +298,10 @@ public void testDocValues() throws Exception {
     assertEquals(3, sortedSetDocValues.getValueCount());
     assertEquals(0, sortedSetDocValues.nextDoc());
     assertEquals(3, sortedSetDocValues.docValueCount());
+    assertEquals(3, sortedSetDocValues.docValueCount());

Review Comment:
Oops! Fixed! :))
[GitHub] [lucene] stefanvodita commented on a diff in pull request #1004: LUCENE-10603: Stop using SortedSetDocValues.NO_MORE_ORDS in tests
stefanvodita commented on code in PR #1004: URL: https://github.com/apache/lucene/pull/1004#discussion_r914274530

## lucene/test-framework/src/java/org/apache/lucene/tests/util/LuceneTestCase.java:
## @@ -2576,11 +2576,11 @@ public void assertDocValuesEquals(String info, IndexReader leftReader, IndexRead
         if (docID == NO_MORE_DOCS) {
           break;
         }
-        long ord;
-        while ((ord = leftValues.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) {
+        assertEquals(info, leftValues.docValueCount(), rightValues.docValueCount());
+        for (int i = 0; i < leftValues.docValueCount(); i++) {
+          long ord = leftValues.nextOrd();

Review Comment:
I went with the one-liner here.
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562919#comment-17562919 ]

Zach Chen commented on LUCENE-10480:
------------------------------------

{quote}Nightly benchmarks picked up the change and top-level disjunctions are seeing massive speedups, see [OrHighHigh|http://people.apache.org/~mikemccand/lucenebench/OrHighHigh.html] or [OrHighMed|http://people.apache.org/~mikemccand/lucenebench/OrHighMed.html]. However disjunctions within conjunctions got a slowdown, see [AndHighOrMedMed|http://people.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html] or [AndMedOrHighHigh|http://people.apache.org/~mikemccand/lucenebench/AndMedOrHighHigh.html].
{quote}

The results look encouraging and interesting! I copied and pasted the boolean queries from *wikinightly.tasks* into *wikimedium.10M.nostopwords.tasks* and ran the benchmark, and was able to re-produce the slow-down:

{code:java}
Task                QPS baseline   StdDev   QPS my_modified_version   StdDev   Pct diff           p-value
AndHighOrMedMed     108.16  (6.5%)         100.44  (5.4%)             -7.1% ( -17% -    5%)   0.000
AndMedOrHighHigh     68.37  (4.5%)          63.92  (5.0%)             -6.5% ( -15% -    3%)   0.000
AndHighHigh         122.90  (5.5%)         122.77  (5.5%)             -0.1% ( -10% -   11%)   0.952
AndHighMed          113.27  (6.4%)         114.63  (6.2%)              1.2% ( -10% -   14%)   0.546
PKLookup            228.08 (14.4%)         232.90 (14.7%)              2.1% ( -23% -   36%)   0.646
OrHighHigh           26.89  (5.7%)          48.62 (12.2%)             80.8% (  59% -  104%)   0.000
OrHighMed            81.18  (5.9%)         187.05 (12.2%)            130.4% ( 105% -  157%)   0.000
{code}

{code:java}
Task                QPS baseline   StdDev   QPS my_modified_version   StdDev   Pct diff           p-value
AndMedOrHighHigh     85.67  (5.3%)          73.23  (5.7%)            -14.5% ( -24% -   -3%)   0.000
PKLookup            260.08 (13.4%)         253.74 (14.9%)             -2.4% ( -27% -   29%)   0.586
AndHighHigh          73.68  (4.7%)          72.70  (4.1%)             -1.3% (  -9% -    7%)   0.339
AndHighMed           89.52  (5.1%)          88.55  (4.4%)             -1.1% ( -10% -    8%)   0.470
AndHighOrMedMed      63.27  (6.5%)          70.48  (5.7%)             11.4% (   0% -   25%)   0.000
OrHighHigh           19.60  (5.3%)          25.62  (7.6%)             30.8% (  16% -   46%)   0.000
OrHighMed           121.08  (5.7%)         236.34 (10.2%)             95.2% (  74% -  117%)   0.000
{code}

{code:java}
Task                QPS baseline   StdDev   QPS my_modified_version   StdDev   Pct diff           p-value
AndMedOrHighHigh     86.88  (3.4%)          76.60  (3.1%)            -11.8% ( -17% -   -5%)   0.000
AndHighHigh          30.49  (3.5%)          30.36  (3.5%)             -0.4% (  -7% -    6%)   0.697
AndHighMed          192.76  (3.4%)         193.72  (3.9%)              0.5% (  -6% -    8%)   0.671
PKLookup            262.59  (5.5%)         264.52  (7.9%)              0.7% ( -11% -   14%)   0.731
AndHighOrMedMed      65.47  (3.8%)          73.43  (3.0%)             12.2% (   5% -   19%)   0.000
OrHighHigh           21.47  (4.1%)          36.94  (8.3%)             72.1% (  57% -   88%)   0.000
OrHighMed            99.91  (4.3%)         292.05 (12.9%)            192.3% ( 167% -  218%)   0.000
{code}

However, when I reduced the type of tasks further to just conjunction + disjunction (and with the default number of search threads), the results actually turned positive and were similar to what I saw earlier in [https://github.com/apache/lucene/pull/972#issuecomment-1166188875]

{code:java}
Task                QPS baseline   StdDev   QPS my_modified_version   StdDev   Pct diff           p-value
AndHighOrMedMed      58.65 (37.3%)          71.63 (28.9%)             22.1% ( -32% -  140%)   0.036
AndMedOrHighHigh     36.43 (39.3%)          44.61 (30.7%)             22.4% ( -34% -  152%)   0.044
PKLookup            163.58 (34.4%)         211.88 (32.7%)             29.5% ( -27% -  147%)   0.005
{code}

{code:java}
Task                QPS baseline   StdDev   QPS my_modified_version   StdDev   Pct diff           p-value
PKLookup            146.51 (22.0%)         188.92 (30.1%)             28.9% ( -18% -  103%)   0.001
AndMedOrHighHigh     35.59 (27.1%)          49.99 (37.5%)             40.4% ( -18% -  144%)   0.000
AndHighOrMedMed      44.47 (26.6%)          63.
[jira] [Comment Edited] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562919#comment-17562919 ] Zach Chen edited comment on LUCENE-10480 at 7/6/22 2:12 AM: {quote}Nightly benchmarks picked up the change and top-level disjunctions are seeing massive speedups, see [OrHighHigh|http://people.apache.org/~mikemccand/lucenebench/OrHighHigh.html] or [OrHighMed|http://people.apache.org/~mikemccand/lucenebench/OrHighMed.html]. However disjunctions within conjunctions got a slowdown, see [AndHighOrMedMed|http://people.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html] or [AndMedOrHighHigh|http://people.apache.org/~mikemccand/lucenebench/AndMedOrHighHigh.html]. {quote} The results look encouraging and interesting! I copied and pasted the boolean queries from *wikinightly.tasks* into *wikimedium.10M.nostopwords.tasks* and ran the benchmark, and was able to re-produce the slow-down: {code:java} TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value AndHighOrMedMed 108.16 (6.5%) 100.44 (5.4%) -7.1% ( -17% - 5%) 0.000 AndMedOrHighHigh 68.37 (4.5%) 63.92 (5.0%) -6.5% ( -15% - 3%) 0.000 AndHighHigh 122.90 (5.5%) 122.77 (5.5%) -0.1% ( -10% - 11%) 0.952 AndHighMed 113.27 (6.4%) 114.63 (6.2%) 1.2% ( -10% - 14%) 0.546 PKLookup 228.08 (14.4%) 232.90 (14.7%) 2.1% ( -23% - 36%) 0.646 OrHighHigh 26.89 (5.7%) 48.62 (12.2%) 80.8% ( 59% - 104%) 0.000 OrHighMed 81.18 (5.9%) 187.05 (12.2%) 130.4% ( 105% - 157%) 0.000 {code} {code:java} TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value AndMedOrHighHigh 85.67 (5.3%) 73.23 (5.7%) -14.5% ( -24% - -3%) 0.000 PKLookup 260.08 (13.4%) 253.74 (14.9%) -2.4% ( -27% - 29%) 0.586 AndHighHigh 73.68 (4.7%) 72.70 (4.1%) -1.3% ( -9% - 7%) 0.339 AndHighMed 89.52 (5.1%) 88.55 (4.4%) -1.1% ( -10% - 8%) 0.470 AndHighOrMedMed 63.27 (6.5%) 70.48 (5.7%) 11.4% ( 0% - 25%) 0.000 OrHighHigh 19.60 (5.3%) 25.62 (7.6%) 30.8% ( 16% - 46%) 0.000 OrHighMed 121.08 
(5.7%) 236.34 (10.2%) 95.2% ( 74% - 117%) 0.000 {code} {code:java} TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value AndMedOrHighHigh 86.88 (3.4%) 76.60 (3.1%) -11.8% ( -17% - -5%) 0.000 AndHighHigh 30.49 (3.5%) 30.36 (3.5%) -0.4% ( -7% - 6%) 0.697 AndHighMed 192.76 (3.4%) 193.72 (3.9%) 0.5% ( -6% - 8%) 0.671 PKLookup 262.59 (5.5%) 264.52 (7.9%) 0.7% ( -11% - 14%) 0.731 AndHighOrMedMed 65.47 (3.8%) 73.43 (3.0%) 12.2% ( 5% - 19%) 0.000 OrHighHigh 21.47 (4.1%) 36.94 (8.3%) 72.1% ( 57% - 88%) 0.000 OrHighMed 99.91 (4.3%) 292.05 (12.9%) 192.3% ( 167% - 218%) 0.000 {code} However, when I reduced the type of tasks further into just conjunction + disjunction (and with default number of search threads), the results actually turned positive and were similar to what I saw earlier in [https://github.com/apache/lucene/pull/972#issuecomment-1166188875] {code:java} TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value AndHighOrMedMed 58.65 (37.3%) 71.63 (28.9%) 22.1% ( -32% - 140%) 0.036 AndMedOrHighHigh 36.43 (39.3%) 44.61 (30.7%) 22.4% ( -34% - 152%) 0.044 PKLookup 163.58 (34.4%) 211.88 (32.7%) 29.5% ( -27% - 147%) 0.005 {code} {code:java} TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value PKLookup 146.51 (22.0%) 188.92 (30.1%) 28.9% ( -18% - 103%) 0.001 AndMedOrHighHigh 35.59 (27.1%) 49.99 (37.5%) 40.4% ( -18% - 144%) 0.000 AndHig
[jira] [Comment Edited] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562919#comment-17562919 ] Zach Chen edited comment on LUCENE-10480 at 7/6/22 2:13 AM:

{quote}Nightly benchmarks picked up the change and top-level disjunctions are seeing massive speedups, see [OrHighHigh|http://people.apache.org/~mikemccand/lucenebench/OrHighHigh.html] or [OrHighMed|http://people.apache.org/~mikemccand/lucenebench/OrHighMed.html]. However disjunctions within conjunctions got a slowdown, see [AndHighOrMedMed|http://people.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html] or [AndMedOrHighHigh|http://people.apache.org/~mikemccand/lucenebench/AndMedOrHighHigh.html].{quote}

The results look encouraging and interesting! I copied and pasted the boolean queries from *wikinightly.tasks* into *wikimedium.10M.nostopwords.tasks* and ran the benchmark, and was able to reproduce the slowdown:

{code:java}
Task               QPS baseline   StdDev   QPS my_modified_version   StdDev   Pct diff               p-value
AndHighOrMedMed          108.16   (6.5%)                    100.44   (5.4%)    -7.1% ( -17% -   5%)    0.000
AndMedOrHighHigh          68.37   (4.5%)                     63.92   (5.0%)    -6.5% ( -15% -   3%)    0.000
AndHighHigh              122.90   (5.5%)                    122.77   (5.5%)    -0.1% ( -10% -  11%)    0.952
AndHighMed               113.27   (6.4%)                    114.63   (6.2%)     1.2% ( -10% -  14%)    0.546
PKLookup                 228.08  (14.4%)                    232.90  (14.7%)     2.1% ( -23% -  36%)    0.646
OrHighHigh                26.89   (5.7%)                     48.62  (12.2%)    80.8% (  59% - 104%)    0.000
OrHighMed                 81.18   (5.9%)                    187.05  (12.2%)   130.4% ( 105% - 157%)    0.000
{code}

{code:java}
Task               QPS baseline   StdDev   QPS my_modified_version   StdDev   Pct diff               p-value
AndMedOrHighHigh          85.67   (5.3%)                     73.23   (5.7%)   -14.5% ( -24% -  -3%)    0.000
PKLookup                 260.08  (13.4%)                    253.74  (14.9%)    -2.4% ( -27% -  29%)    0.586
AndHighHigh               73.68   (4.7%)                     72.70   (4.1%)    -1.3% (  -9% -   7%)    0.339
AndHighMed                89.52   (5.1%)                     88.55   (4.4%)    -1.1% ( -10% -   8%)    0.470
AndHighOrMedMed           63.27   (6.5%)                     70.48   (5.7%)    11.4% (   0% -  25%)    0.000
OrHighHigh                19.60   (5.3%)                     25.62   (7.6%)    30.8% (  16% -  46%)    0.000
OrHighMed                121.08   (5.7%)                    236.34  (10.2%)    95.2% (  74% - 117%)    0.000
{code}

{code:java}
Task               QPS baseline   StdDev   QPS my_modified_version   StdDev   Pct diff               p-value
AndMedOrHighHigh          86.88   (3.4%)                     76.60   (3.1%)   -11.8% ( -17% -  -5%)    0.000
AndHighHigh               30.49   (3.5%)                     30.36   (3.5%)    -0.4% (  -7% -   6%)    0.697
AndHighMed               192.76   (3.4%)                    193.72   (3.9%)     0.5% (  -6% -   8%)    0.671
PKLookup                 262.59   (5.5%)                    264.52   (7.9%)     0.7% ( -11% -  14%)    0.731
AndHighOrMedMed           65.47   (3.8%)                     73.43   (3.0%)    12.2% (   5% -  19%)    0.000
OrHighHigh                21.47   (4.1%)                     36.94   (8.3%)    72.1% (  57% -  88%)    0.000
OrHighMed                 99.91   (4.3%)                    292.05  (12.9%)   192.3% ( 167% - 218%)    0.000
{code}

However, when I reduced the type of tasks further into just conjunction + disjunction (and with default number of search threads), the results actually turned positive and were similar to what I saw earlier in [https://github.com/apache/lucene/pull/972#issuecomment-1166188875]

{code:java}
Task               QPS baseline   StdDev   QPS my_modified_version   StdDev   Pct diff               p-value
AndHighOrMedMed           58.65  (37.3%)                     71.63  (28.9%)    22.1% ( -32% - 140%)    0.036
AndMedOrHighHigh          36.43  (39.3%)                     44.61  (30.7%)    22.4% ( -34% - 152%)    0.044
PKLookup                 163.58  (34.4%)                    211.88  (32.7%)    29.5% ( -27% - 147%)    0.005
{code}

{code:java}
Task               QPS baseline   StdDev   QPS my_modified_version   StdDev   Pct diff               p-value
PKLookup                 146.51  (22.0%)                    188.92  (30.1%)    28.9% ( -18% - 103%)    0.001
AndMedOrHighHigh          35.59  (27.1%)                     49.99  (37.5%)    40.4% ( -18% - 144%)    0.000
AndHighOrMedMed
{code}
[GitHub] [lucene] zacharymorn opened a new pull request, #1006: LUCENE-10480: Move scoring from advance to TwoPhaseIterator#matches to improve disjunction within conjunction
zacharymorn opened a new pull request, #1006: URL: https://github.com/apache/lucene/pull/1006

### Description (or a Jira issue link if you have one)

Follow-up changes for https://issues.apache.org/jira/browse/LUCENE-10480 to improve performance for disjunction within conjunction queries. Benchmark results with `wikinightly.tasks` boolean queries below:

```
AndHighHigh: +be +up # freq=2115632 freq=824628
AndHighHigh: +cite +had # freq=1367577 freq=1223103
AndHighHigh: +is +he # freq=4214104 freq=1663980
AndHighHigh: +no +4 # freq=1060681 freq=944177
AndHighHigh: +title +see # freq=2077102 freq=1100862
AndHighMed: +2010 +16 # freq=933686 freq=531050
AndHighMed: +5 +power # freq=849829 freq=257919
AndHighMed: +only +particularly # freq=895806 freq=100045
AndHighMed: +united +1983 # freq=1185528 freq=150075
AndHighMed: +who +ed # freq=1201585 freq=127497
OrHighHigh: are last # freq=1921211 freq=830278
OrHighHigh: at united # freq=2834104 freq=1185528
OrHighHigh: but year # freq=1484398 freq=1098425
OrHighHigh: name its # freq=2577591 freq=1160703
OrHighHigh: to but # freq=6105155 freq=1484398
OrHighMed: at mostly # freq=2834104 freq=89401
OrHighMed: his interview # freq=1771920 freq=94736
OrHighMed: http 9 # freq=3289683 freq=541405
OrHighMed: they hard # freq=1031516 freq=92045
OrHighMed: title bay # freq=2077102 freq=117167
AndHighOrMedMed: +be +(mostly interview) # freq=2115632 freq=89401 freq=94736
AndHighOrMedMed: +cite +(9 hard) # freq=1367577 freq=541405 freq=92045
AndHighOrMedMed: +is +(bay 16) # freq=4214104 freq=117167 freq=531050
AndHighOrMedMed: +no +(power particularly) # freq=1060681 freq=257919 freq=100045
AndHighOrMedMed: +title +(1983 ed) # freq=2077102 freq=150075 freq=127497
AndMedOrHighHigh: +mostly +(are last) # freq=89401 freq=1921211 freq=830278
AndMedOrHighHigh: +interview +(at united) # freq=94736 freq=2834104 freq=1185528
AndMedOrHighHigh: +hard +(but year) # freq=92045 freq=1484398 freq=1098425
AndMedOrHighHigh: +9 +(name its) # freq=541405 freq=2577591 freq=1160703
AndMedOrHighHigh: +bay +(to but) # freq=117167 freq=6105155 freq=1484398
```

```
Task               QPS baseline   StdDev   QPS my_modified_version   StdDev   Pct diff               p-value
AndHighHigh               40.93   (2.8%)                     40.72   (4.2%)    -0.5% (  -7% -   6%)    0.659
AndHighMed               150.71   (3.4%)                    152.22   (3.7%)     1.0% (  -5% -   8%)    0.371
PKLookup                 250.85   (8.7%)                    257.51   (8.9%)     2.7% ( -13% -  22%)    0.340
AndHighOrMedMed           66.87   (4.0%)                     68.70   (2.7%)     2.7% (  -3% -   9%)    0.012
AndMedOrHighHigh          89.04   (2.6%)                     93.28   (3.1%)     4.8% (   0% -  10%)    0.000
OrHighHigh                21.71   (6.0%)                     34.50   (6.8%)    58.9% (  43% -  76%)    0.000
OrHighMed                 85.11   (5.0%)                    189.37   (8.0%)   122.5% ( 104% - 142%)    0.000
```

```
Task               QPS baseline   StdDev   QPS my_modified_version   StdDev   Pct diff               p-value
AndMedOrHighHigh          68.90   (4.5%)                     67.15   (4.3%)    -2.5% ( -10% -   6%)    0.074
AndHighHigh               73.07   (3.0%)                     72.11   (3.5%)    -1.3% (  -7% -   5%)    0.212
AndHighMed               146.94   (4.7%)                    145.56   (4.9%)    -0.9% ( -10% -   9%)    0.550
PKLookup                 252.01   (9.3%)                    249.71  (13.2%)    -0.9% ( -21% -  23%)    0.806
AndHighOrMedMed           65.49   (5.8%)                     66.09   (4.9%)     0.9% (  -9% -  12%)    0.600
OrHighHigh                21.34   (6.7%)                     29.63   (6.7%)    38.8% (  23% -  55%)    0.000
OrHighMed                122.61   (8.2%)                    227.04   (9.0%)    85.2% (  62% - 111%)    0.000
```

```
Task               QPS baseline   StdDev   QPS my_modified_version   StdDev   Pct diff               p-value
AndHighMed               113.58   (2.8%)                    113.98   (4.8%)     0.3% (  -7% -   8%)    0.779
AndHighHigh               51.37   (3.2%)                     51.58   (5.2%)     0.4% (  -7% -   9%)    0.759
PKLookup                 272.05   (8.9%)                    276.89  (12.6%)     1.8% ( -18% -  25%)    0.605
AndHighOrMedMed          102.86   (5.1%)                    107.47   (5.4%)     4.5% (  -5% -  15%)    0.007
AndMedOrHighHigh          91.55   (3.8%)                     96.43   (5.2%)     5.3% (  -3% -  14%)    0.000
OrHighHigh                27.08   (6.5%)                     47.16  (11.3%)    74.2% (  52% -  98%)    0.000
OrHighMed                 78.78
```
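The PR's core idea, moving expensive per-doc work out of advance() so that the other conjunction clause can reject a document first, can be sketched outside Lucene. The class below is a minimal illustration with invented names (it is not the actual TwoPhaseIterator API): the eager variant scores every candidate the disjunction visits, while the lazy, matches()-style variant scores only documents that the required clause has already confirmed, so the number of score computations drops from the size of the union to the size of the intersection.

```java
import java.util.*;

public class LazyScoringSketch {
    // Sorted doc ids: two optional clauses forming a disjunction, plus one
    // required clause that the disjunction is AND-ed with.
    static final int[] OR_A = {1, 3, 5, 7, 9, 11};
    static final int[] OR_B = {2, 3, 6, 9, 12};
    static final Set<Integer> REQUIRED = new TreeSet<>(Arrays.asList(3, 9, 20));

    static int scoreCalls;                     // how often the expensive path ran

    static float expensiveScore(int doc) {     // stand-in for real per-doc score math
        scoreCalls++;
        return 1.0f / (1 + doc);
    }

    // The disjunction's cheap "approximation": union of the clauses' doc ids.
    static int[] approximation() {
        TreeSet<Integer> s = new TreeSet<>();
        for (int d : OR_A) s.add(d);
        for (int d : OR_B) s.add(d);
        return s.stream().mapToInt(Integer::intValue).toArray();
    }

    // Eager: score every candidate before the other clause is consulted.
    static List<Integer> eager() {
        scoreCalls = 0;
        List<Integer> hits = new ArrayList<>();
        for (int doc : approximation()) {
            expensiveScore(doc);               // scored even if the doc is rejected next
            if (REQUIRED.contains(doc)) hits.add(doc);
        }
        return hits;
    }

    // Lazy: a matches()-style check defers scoring until the required clause agrees.
    static List<Integer> lazy() {
        scoreCalls = 0;
        List<Integer> hits = new ArrayList<>();
        for (int doc : approximation()) {
            if (REQUIRED.contains(doc)) {      // cheap confirmation first...
                expensiveScore(doc);           // ...expensive scoring only on survivors
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Integer> e = eager();
        System.out.println(e + " eager scores=" + scoreCalls);  // [3, 9] eager scores=9
        List<Integer> l = lazy();
        System.out.println(l + " lazy scores=" + scoreCalls);   // [3, 9] lazy scores=2
    }
}
```

Both variants return the same hits; only the amount of wasted score computation differs, which matches the PR's goal of helping AndHighOrMedMed/AndMedOrHighHigh without changing results.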
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562944#comment-17562944 ] Zach Chen commented on LUCENE-10480:

{quote}maybe there are bits from advance() that we could move to matches() so that we would hand it over to the other clause before we start doing expensive operations like computing scores.{quote}

This approach does help stabilize performance for disjunction within conjunction queries (and also provides some small gains)! I have opened a PR for it: https://github.com/apache/lucene/pull/1006

> Specialize 2-clauses disjunctions
> ---------------------------------
>
>                 Key: LUCENE-10480
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10480
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
> WANDScorer is nice, but it also has lots of overhead to maintain its
> invariants: one linked list for the current candidates, one priority queue of
> scorers that are behind, another one for scorers that are ahead. All this
> could be simplified in the 2-clauses case, which feels worth specializing for
> as it's very common that end users enter queries that only have two terms?

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
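The simplification the issue proposes can be sketched with plain arrays: for exactly two clauses there is no need for a linked list or priority queues; two cursors and a min() comparison are enough to enumerate the union in doc-id order. This is an illustrative sketch with invented names, not Lucene's actual DocIdSetIterator API.

```java
// Two-clause disjunction as a simple two-pointer merge over sorted doc ids.
public class TwoClauseDisjunction {
    private final int[] a, b;   // sorted doc ids of the two clauses
    private int ia, ib;         // cursor into each clause

    public TwoClauseDisjunction(int[] a, int[] b) { this.a = a; this.b = b; }

    private int docA() { return ia < a.length ? a[ia] : Integer.MAX_VALUE; }
    private int docB() { return ib < b.length ? b[ib] : Integer.MAX_VALUE; }

    /** Next matching doc id in order, or Integer.MAX_VALUE when exhausted. */
    public int nextDoc() {
        int da = docA(), db = docB();
        int doc = Math.min(da, db);
        if (doc == Integer.MAX_VALUE) return doc;
        if (da == doc) ia++;    // advance every clause positioned on this doc,
        if (db == doc) ib++;    // so a doc in both lists is emitted only once
        return doc;
    }

    /** Convenience: collect all matching doc ids. */
    public static java.util.List<Integer> drain(int[] a, int[] b) {
        TwoClauseDisjunction d = new TwoClauseDisjunction(a, b);
        java.util.List<Integer> out = new java.util.ArrayList<>();
        for (int doc = d.nextDoc(); doc != Integer.MAX_VALUE; doc = d.nextDoc()) {
            out.add(doc);
        }
        return out;
    }

    public static void main(String[] args) {
        // Union of {1,4,7} and {2,4,9}, with 4 present in both clauses.
        System.out.println(drain(new int[]{1, 4, 7}, new int[]{2, 4, 9}));  // [1, 2, 4, 7, 9]
    }
}
```

The general WANDScorer must also track score upper bounds for skipping; this sketch deliberately shows only the iteration structure that the 2-clause case makes trivial.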
[jira] [Resolved] (LUCENE-10626) Hunspell: add tools to aid dictionary editing: analysis introspection, stem expansion and stem/flag suggestion
[ https://issues.apache.org/jira/browse/LUCENE-10626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Gromov resolved LUCENE-10626.
-----------------------------------
    Resolution: Fixed

> Hunspell: add tools to aid dictionary editing: analysis introspection, stem
> expansion and stem/flag suggestion
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-10626
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10626
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Peter Gromov
>            Priority: Major
>          Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> The following tools would be nice to have when editing and extending an
> existing dictionary:
> 1. See how Hunspell analyzes a given word, with all the involved affix flags:
> `Hunspell.analyzeSimpleWord`
> 2. See all forms that the given stem can produce with the given flags:
> `Hunspell.expandRoot`, `WordFormGenerator.expandRoot`
> 3. Given a number of word forms, suggest a stem and a set of flags that
> produce these word forms: `Hunspell.compress`, `WordFormGenerator.compress`.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Closed] (LUCENE-10626) Hunspell: add tools to aid dictionary editing: analysis introspection, stem expansion and stem/flag suggestion
[ https://issues.apache.org/jira/browse/LUCENE-10626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Gromov closed LUCENE-10626.
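As a rough illustration of what the third tool (`Hunspell.compress` / `WordFormGenerator.compress`) is asked to do conceptually: given several word forms, suggest a shared stem plus the pieces that regenerate the forms. The sketch below is a toy longest-common-prefix version with invented names; the real implementation works against the dictionary's actual affix rules and flags, not raw suffix strings.

```java
import java.util.*;

public class StemSuggestionSketch {
    /** Longest common prefix of the given forms, used as the suggested stem. */
    static String suggestStem(List<String> forms) {
        String stem = forms.get(0);
        for (String f : forms) {
            int i = 0;
            while (i < Math.min(stem.length(), f.length()) && stem.charAt(i) == f.charAt(i)) {
                i++;
            }
            stem = stem.substring(0, i);   // shrink to the part shared with every form
        }
        return stem;
    }

    /** The suffixes (playing the role of affix flags) that expand the stem back to each form. */
    static List<String> suffixes(List<String> forms) {
        String stem = suggestStem(forms);
        List<String> out = new ArrayList<>();
        for (String f : forms) out.add(f.substring(stem.length()));
        return out;
    }

    public static void main(String[] args) {
        List<String> forms = Arrays.asList("walk", "walks", "walked", "walking");
        System.out.println(suggestStem(forms));  // walk
        System.out.println(suffixes(forms));     // [, s, ed, ing]
    }
}
```

Applying the suffixes back to the stem reproduces exactly the input forms, which mirrors the round trip between `compress` and `expandRoot` described in the issue.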
[GitHub] [lucene-jira-archive] mocobeta opened a new pull request, #17: Split up updating script
mocobeta opened a new pull request, #17: URL: https://github.com/apache/lucene-jira-archive/pull/17

Closes #16

Add
- src/remap_cross_issue_links.py
- src/update_issues.py

Deprecate
- src/update_issue_links.py

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

For additional commands, e-mail: issues-h...@lucene.apache.org