[GitHub] [lucene-jira-archive] mocobeta commented on issue #1: Fix markup conversion error

2022-06-30 Thread GitBox


mocobeta commented on issue #1:
URL: https://github.com/apache/lucene-jira-archive/issues/1#issuecomment-1171959478

   I think pre- or post-processing to tweak the converter's result is risky, since any text processing requires context (which block element the character sequence resides in; even LFs and spaces have special meanings in both Jira markup and Markdown).
   
   **For your information**: I don't know how much effort we should/can invest in this, but we may need a customized version of the converter tool if we want to safely fix conversion errors.
   https://github.com/catcombo/jira2markdown
   I haven't looked at it closely, but it's built upon [pyparsing](https://github.com/pyparsing/pyparsing/) (a well-known Python parser generator library) and it already supports all Jira syntax; it'd be a good starting point.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] zacharymorn commented on a diff in pull request #972: LUCENE-10480: Use BMM scorer for 2 clauses disjunction

2022-06-30 Thread GitBox


zacharymorn commented on code in PR #972:
URL: https://github.com/apache/lucene/pull/972#discussion_r911590359


##
lucene/core/src/java/org/apache/lucene/search/BlockMaxMaxscoreScorer.java:
##
@@ -0,0 +1,322 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collection;
+import java.util.Comparator;
+import java.util.LinkedList;
+import java.util.List;
+
+/** Scorer implementing the Block-Max Maxscore algorithm */
+public class BlockMaxMaxscoreScorer extends Scorer {
+  // current doc ID of the leads
+  private int doc;
+
+  // doc ID boundary up to which all scorers' maxScore values are valid
+  private int upTo = -1;
+
+  // heap of scorers ordered by doc ID
+  private final DisiPriorityQueue essentialsScorers;
+  // list of scorers ordered by maxScore
+  private final LinkedList<DisiWrapper> maxScoreSortedEssentialScorers;
+
+  private final DisiWrapper[] allScorers;
+
+  // sum of max scores of scorers in nonEssentialScorers list
+  private float nonEssentialMaxScoreSum;
+
+  private long cost;
+
+  private final MaxScoreSumPropagator maxScoreSumPropagator;
+
+  // scaled min competitive score
+  private float minCompetitiveScore = 0;
+
+  private int cachedScoredDoc = -1;
+  private float cachedScore = 0;
+
+  /**
+   * Constructs a Scorer that scores docs based on the Block-Max-Maxscore (BMM) algorithm
+   * http://engineering.nyu.edu/~suel/papers/bmm.pdf . This algorithm has lower overhead compared to
+   * WANDScorer, and could be used for simple disjunction queries.
+   *
+   * @param weight The weight to be used.
+   * @param scorers The sub scorers this Scorer should iterate on for optional clauses
+   */
+  public BlockMaxMaxscoreScorer(Weight weight, List<Scorer> scorers) throws IOException {
+    super(weight);
+
+    this.doc = -1;
+    this.allScorers = new DisiWrapper[scorers.size()];
+    this.essentialsScorers = new DisiPriorityQueue(scorers.size());
+    this.maxScoreSortedEssentialScorers = new LinkedList<>();
+
+    long cost = 0;
+    for (int i = 0; i < scorers.size(); i++) {
+      DisiWrapper w = new DisiWrapper(scorers.get(i));
+      cost += w.cost;
+      allScorers[i] = w;
+    }
+
+    this.cost = cost;
+    maxScoreSumPropagator = new MaxScoreSumPropagator(scorers);
+  }
+
+  @Override
+  public DocIdSetIterator iterator() {
+    // twoPhaseIterator needed to honor scorer.setMinCompetitiveScore guarantee
+    return TwoPhaseIterator.asDocIdSetIterator(twoPhaseIterator());
+  }
+
+  @Override
+  public TwoPhaseIterator twoPhaseIterator() {
+    DocIdSetIterator approximation =
+        new DocIdSetIterator() {
+
+          @Override
+          public int docID() {
+            return doc;
+          }
+
+          @Override
+          public int nextDoc() throws IOException {
+            return advance(doc + 1);
+          }
+
+          @Override
+          public int advance(int target) throws IOException {
+            while (true) {
+
+              if (target > upTo) {
+                updateMaxScoresAndLists(target);
+              } else {
+                // minCompetitiveScore might have increased,
+                // move potentially no-longer-competitive scorers from essential to non-essential
+                // list
+                movePotentiallyNonCompetitiveScorers();
+              }
+
+              assert target <= upTo;
+
+              DisiWrapper top = essentialsScorers.top();
+
+              if (top == null) {
+                // all scorers in non-essential list, skip to next boundary or return no_more_docs
+                if (upTo == NO_MORE_DOCS) {
+                  return doc = NO_MORE_DOCS;
+                } else {
+                  target = upTo + 1;
+                }
+              } else {
+                // position all scorers in essential list to on or after target
+                while (top.doc < target) {
+                  top.doc = top.iterator.advance(target);
+                  top = essentialsScorers.updateTop();
+                }
+
+                if (top.doc == NO_MORE_DOCS) {
+                  return doc = NO_MORE_DOCS;
+                } else if

[jira] [Created] (LUCENE-10636) Could the partial score sum from essential list scores be cached?

2022-06-30 Thread Zach Chen (Jira)
Zach Chen created LUCENE-10636:
--

 Summary: Could the partial score sum from essential list scores be 
cached?
 Key: LUCENE-10636
 URL: https://issues.apache.org/jira/browse/LUCENE-10636
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Zach Chen


This is a follow-up issue from discussion 
[https://github.com/apache/lucene/pull/972#discussion_r909300200] . Currently, 
the implementation of BlockMaxMaxscoreScorer duplicates the computation that 
sums up scores from essential list scorers. We would like to see if this 
duplicated computation can be cached without introducing much overhead, or a 
data structure whose cost might outweigh the benefit of caching.
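A small per-doc memo, in the spirit of the `cachedScoredDoc` / `cachedScore` fields already present in BlockMaxMaxscoreScorer, is one way this could look. The sketch below is hypothetical and not Lucene API; the `ScoreCache` class and the `IntToDoubleFunction` stand-in for the essential-list score sum are illustrative assumptions:

```java
import java.util.function.IntToDoubleFunction;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: memoize the essential-list score sum per doc ID so
// repeated score() calls on the same doc do not recompute the sum.
class ScoreCache {
  private int cachedDoc = -1; // doc ID the cached score belongs to
  private double cachedScore;
  private final IntToDoubleFunction essentialScoreSum;

  ScoreCache(IntToDoubleFunction essentialScoreSum) {
    this.essentialScoreSum = essentialScoreSum;
  }

  double score(int doc) {
    if (doc != cachedDoc) { // only recompute when the doc changes
      cachedScore = essentialScoreSum.applyAsDouble(doc);
      cachedDoc = doc;
    }
    return cachedScore;
  }
}
```

The open question from the discussion is whether even this single branch and pair of fields pays for itself in the hot scoring loop.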



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] zacharymorn commented on pull request #972: LUCENE-10480: Use BMM scorer for 2 clauses disjunction

2022-06-30 Thread GitBox


zacharymorn commented on PR #972:
URL: https://github.com/apache/lucene/pull/972#issuecomment-1171893517

   > I like the idea of creating WANDScorer more explicitly in tests. It 
doesn't look easy though and this change is already great so I wonder if we 
should keep it for a follow-up.
   
   Sounds good. I've created this follow-up issue: 
https://issues.apache.org/jira/browse/LUCENE-10635
   
   > I reviewed the change and left some very minor comments but it looks great 
to me overall. Let's get it in.
   
   Awesome, thanks for all the review and feedback @jpountz, I really 
appreciate it! Iterating on the solution and seeing it improve each time has 
been a lot of fun!





[jira] [Created] (LUCENE-10635) Ensure test coverage for WANDScorer after additional scorers get added

2022-06-30 Thread Zach Chen (Jira)
Zach Chen created LUCENE-10635:
--

 Summary: Ensure test coverage for WANDScorer after additional 
scorers get added
 Key: LUCENE-10635
 URL: https://issues.apache.org/jira/browse/LUCENE-10635
 Project: Lucene - Core
  Issue Type: Test
Reporter: Zach Chen


This is a follow-up issue from discussions 
[https://github.com/apache/lucene/pull/972#issuecomment-1170684358] & 
[https://github.com/apache/lucene/pull/972#pullrequestreview-1024377641] .

As additional scorers such as BlockMaxMaxscoreScorer get added, some tests in 
TestWANDScorer that used to test WANDScorer now test BlockMaxMaxscoreScorer 
instead, reducing test coverage for WANDScorer. We would like to see how we can 
ensure TestWANDScorer reliably tests WANDScorer, perhaps by instantiating the 
scorer directly inside the tests.






[GitHub] [lucene] zacharymorn commented on a diff in pull request #972: LUCENE-10480: Use BMM scorer for 2 clauses disjunction

2022-06-30 Thread GitBox


zacharymorn commented on code in PR #972:
URL: https://github.com/apache/lucene/pull/972#discussion_r911579468


##
lucene/core/src/java/org/apache/lucene/search/DisiWrapper.java:
##
@@ -39,6 +39,9 @@ public class DisiWrapper {
   // For WANDScorer
   long maxScore;
 
+  // For BlockMaxMaxscoreScorer
+  float maxScoreFloat;

Review Comment:
   Ah I like this idea. Updated.



##
lucene/core/src/java/org/apache/lucene/search/Boolean2ScorerSupplier.java:
##
@@ -118,6 +118,21 @@ private Scorer getInternal(long leadCost) throws 
IOException {
   leadCost);
 }
 
+// pure two terms disjunction
+if (scoreMode == ScoreMode.TOP_SCORES
+&& minShouldMatch == 0

Review Comment:
   Updated.






[GitHub] [lucene] zacharymorn commented on a diff in pull request #972: LUCENE-10480: Use BMM scorer for 2 clauses disjunction

2022-06-30 Thread GitBox


zacharymorn commented on code in PR #972:
URL: https://github.com/apache/lucene/pull/972#discussion_r911579245


##
lucene/core/src/java/org/apache/lucene/search/BlockMaxMaxscoreScorer.java:
##
@@ -0,0 +1,332 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collection;
+import java.util.Comparator;
+import java.util.LinkedList;
+import java.util.List;
+
+/** Scorer implementing Block-Max Maxscore algorithm */
+public class BlockMaxMaxscoreScorer extends Scorer {

Review Comment:
   Updated.



##
lucene/core/src/java/org/apache/lucene/search/BlockMaxMaxscoreScorer.java:
##
@@ -0,0 +1,332 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collection;
+import java.util.Comparator;
+import java.util.LinkedList;
+import java.util.List;
+
+/** Scorer implementing Block-Max Maxscore algorithm */
+public class BlockMaxMaxscoreScorer extends Scorer {
+  // current doc ID of the leads
+  private int doc;
+
+  // doc ID boundary up to which all scorers' maxScore values are valid
+  private int upTo;
+
+  // heap of scorers ordered by doc ID
+  private final DisiPriorityQueue essentialsScorers;
+
+  // list of scorers ordered by maxScore
+  private final LinkedList<DisiWrapper> maxScoreSortedEssentialScorers;
+
+  private final DisiWrapper[] allScorers;
+
+  // sum of max scores of scorers in nonEssentialScorers list
+  private double nonEssentialMaxScoreSum;
+
+  private long cost;

Review Comment:
   Updated.






[jira] [Comment Edited] (LUCENE-10246) Support getting counts from "association" facets

2022-06-30 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561222#comment-17561222
 ] 

Greg Miller edited comment on LUCENE-10246 at 7/1/22 12:27 AM:
---

[~shahrs87] I'd start by becoming familiar with the existing "association 
facet" implementations ({{TaxonomyFacetIntAssociations}} and 
{{TaxonomyFacetFloatAssociations}}), as well as looking at some demo code like 
{{AssociationsFacetsExample}}. The API contract they implement represents 
results with {{FacetResult}}, which contains a list of {{LabelAndValue}} 
instances. {{LabelAndValue}} only models a single label along with a single 
numeric value. The value "usually" represents a total faceting count for a 
label in "non-association" facets, but with association faceting, the value 
takes on an aggregated weight "associated" with the label.

The idea with this Jira is to be able to convey _both_ an aggregated weight and 
the count associated with a label. The best way to do that without creating a 
weird API for non-association cases will probably take a little thought. Should 
we just put another "count" field in {{LabelAndValue}} and have both value and 
count be populated with a count for non-association cases? That sounds weird.

So beyond understanding what's currently there, I think the next step is to 
think about the right way to evolve the API in a manner that doesn't create a 
weird interaction for non-association faceting, especially since those are more 
commonly used.

Please reach out here as you have questions and I'll do my best to answer in a 
timely fashion. Thanks for having a look at this!


was (Author: gsmiller):
[~shahrs87] I'd start by becoming familiar with the existing "association 
facet" implementations ({{TaxonomyFacetIntAssociations}} and 
{{TaxonomyFacetFloatAssociations}} as well as looking at some demo code like 
{{AssociationsFacetsExample}}). The API contract they implement represent 
results with {{FacetResult}}, which contains a list of {{LabelAndValue}} 
instances. {{LabelAndValue}} only models a single label along with a single 
numeric value. The value "usually" represents a total faceting count for a 
label in "non-association" facets, but with association faceting, value takes 
on an aggregated weight "associated" with the label.

The idea with this Jira is to be able to convey _both_ an aggregated weight and 
the count associated with a label. The best way to do that without creating a 
weird API for non-association cases is something that will probably take a 
little thought. Should we just put another "count" field in {{LabelAndValue}} 
and have both value and count be populated with a count for non-assocation 
cases? That sounds weird.

So beyond understanding what's currently there, I think the next step is to 
think about the right way to evolve the API that doesn't create a weird 
interaction for non-association faceting, especially since those are more 
commonly used.

> Support getting counts from "association" facets
> 
>
> Key: LUCENE-10246
> URL: https://issues.apache.org/jira/browse/LUCENE-10246
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Greg Miller
>Priority: Minor
>
> We have these nice "association" facet implementations today that aggregate 
> "weights" from the docs that facet over, but they don't keep track of counts. 
> So the user can get "top-n" values for a dim by aggregated weight (great!), 
> but can't know how many docs matched each value. It would be nice to support 
> this so users could show the top-n values but _also_ show counts associated 
> with each.






[jira] [Commented] (LUCENE-10246) Support getting counts from "association" facets

2022-06-30 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561222#comment-17561222
 ] 

Greg Miller commented on LUCENE-10246:
--

[~shahrs87] I'd start by becoming familiar with the existing "association 
facet" implementations ({{TaxonomyFacetIntAssociations}} and 
{{TaxonomyFacetFloatAssociations}}), as well as looking at some demo code like 
{{AssociationsFacetsExample}}. The API contract they implement represents 
results with {{FacetResult}}, which contains a list of {{LabelAndValue}} 
instances. {{LabelAndValue}} only models a single label along with a single 
numeric value. The value "usually" represents a total faceting count for a 
label in "non-association" facets, but with association faceting, the value 
takes on an aggregated weight "associated" with the label.

The idea with this Jira is to be able to convey _both_ an aggregated weight and 
the count associated with a label. The best way to do that without creating a 
weird API for non-association cases will probably take a little thought. Should 
we just put another "count" field in {{LabelAndValue}} and have both value and 
count be populated with a count for non-association cases? That sounds weird.

So beyond understanding what's currently there, I think the next step is to 
think about the right way to evolve the API in a manner that doesn't create a 
weird interaction for non-association faceting, especially since those are more 
commonly used.
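To make the shape of the problem concrete, here is one hypothetical way to aggregate an association weight and a doc count per label. The `LabelValueAndCount` record and `WeightAndCountAggregator` class below are illustrative assumptions, not the Lucene facets API:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: aggregate an association weight AND a doc count per
// label, which LabelAndValue (a single label + single value) cannot express.
record LabelValueAndCount(String label, double value, int count) {}

class WeightAndCountAggregator {
  // per-label running sums: [0] = weight sum, [1] = doc count
  private final Map<String, double[]> sums = new LinkedHashMap<>();

  void collect(String label, double weight) {
    double[] s = sums.computeIfAbsent(label, k -> new double[2]);
    s[0] += weight; // aggregated association weight
    s[1] += 1;      // doc count for the label
  }

  List<LabelValueAndCount> results() {
    List<LabelValueAndCount> out = new ArrayList<>();
    sums.forEach((label, s) -> out.add(new LabelValueAndCount(label, s[0], (int) s[1])));
    return out;
  }
}
```

Whether such a three-field result type should replace {{LabelAndValue}} everywhere, or only for association facets, is exactly the API question raised above.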

> Support getting counts from "association" facets
> 
>
> Key: LUCENE-10246
> URL: https://issues.apache.org/jira/browse/LUCENE-10246
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Greg Miller
>Priority: Minor
>
> We have these nice "association" facet implementations today that aggregate 
> "weights" from the docs that facet over, but they don't keep track of counts. 
> So the user can get "top-n" values for a dim by aggregated weight (great!), 
> but can't know how many docs matched each value. It would be nice to support 
> this so users could show the top-n values but _also_ show counts associated 
> with each.






[jira] [Commented] (LUCENE-10246) Support getting counts from "association" facets

2022-06-30 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561207#comment-17561207
 ] 

Rushabh Shah commented on LUCENE-10246:
---

[~gsmiller] [~sokolov]
I am pretty new to the Lucene project and want to contribute to small Jiras to 
improve my Lucene knowledge. Could you please help me scope the work required 
for this patch and point me to some relevant classes in the source code? Thank 
you.

> Support getting counts from "association" facets
> 
>
> Key: LUCENE-10246
> URL: https://issues.apache.org/jira/browse/LUCENE-10246
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Greg Miller
>Priority: Minor
>
> We have these nice "association" facet implementations today that aggregate 
> "weights" from the docs that facet over, but they don't keep track of counts. 
> So the user can get "top-n" values for a dim by aggregated weight (great!), 
> but can't know how many docs matched each value. It would be nice to support 
> this so users could show the top-n values but _also_ show counts associated 
> with each.






[jira] [Commented] (LUCENE-10345) remove non-NRT replication support

2022-06-30 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561205#comment-17561205
 ] 

Rushabh Shah commented on LUCENE-10345:
---

Hi [~rcmuir],
Could you please help me scope the changes required in this Jira? I can try to 
put up a patch. Thank you.

> remove non-NRT replication support
> --
>
> Key: LUCENE-10345
> URL: https://issues.apache.org/jira/browse/LUCENE-10345
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Lucene's {{replicator/}} module has really two replication APIs: NRT and the 
> older non-NRT.
> The NRT replication is nice, it is actually JUST an API, hence there's no 
> network support or anything like that. 
> The non-NRT replication has some issues:
> * Uses HTTP but in a non-standard way: binary blobs etc. Responses can't be 
> cached or leverage CDNs or anything.
> * legacy plaintext HTTP 1.x only, no support for HTTPS, HTTP/2, etc
> * Giant security hole: uses java (de)serialization
> * legacy web apis (servlet). it is 2021
> * drags in third party http client, doesn't use the one in the standard 
> library
> * drags in third party logging jars, doesn't use the one in the standard 
> library
> I'd like to deprecate the non-NRT support in 9.x and remove in 10.x. I 
> suspect anyone using this module is using the newer NRT mode? If anyone is 
> still using the legacy non-NRT mode, please let me know on this issue and 
> give me your IP address, so I can try to pop a shell.






[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues

2022-06-30 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561203#comment-17561203
 ] 

Greg Miller commented on LUCENE-10603:
--

I pushed another commit that takes care of the remaining "production" code 
iteration. I think the next step is to knock out all remaining iteration 
patterns, which should only exist in "test" related code. When I get some more 
free time I'll take a pass at it, but might be a week or so. Happy to have 
someone beat me to it :)

> Improve iteration of ords for SortedSetDocValues
> 
>
> Key: LUCENE-10603
> URL: https://issues.apache.org/jira/browse/LUCENE-10603
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Assignee: Lu Xugang
>Priority: Trivial
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> After SortedSetDocValues#docValueCount was added in Lucene 9.2, should we 
> refactor the implementation of ords iteration to use docValueCount instead of 
> NO_MORE_ORDS?
> Similar to how SortedNumericDocValues did.
> From 
> {code:java}
> for (long ord = values.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = values.nextOrd()) {
> }{code}
> to
> {code:java}
> for (int i = 0; i < values.docValueCount(); i++) {
>   long ord = values.nextOrd();
> }{code}
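The count-bounded loop from the issue description can be exercised against a minimal stand-in for the doc-values interface. The `FakeSortedSetOrds` class below is a test stub written for this sketch, not a Lucene class; it only mimics `nextOrd()` / `docValueCount()` for a single positioned document:

```java
// Test stub standing in for SortedSetDocValues positioned on one document:
// nextOrd() returns each ord once; docValueCount() reports how many there are.
class FakeSortedSetOrds {
  static final long NO_MORE_ORDS = -1;

  private final long[] ords;
  private int upto = 0;

  FakeSortedSetOrds(long... ords) {
    this.ords = ords;
  }

  long nextOrd() {
    return upto < ords.length ? ords[upto++] : NO_MORE_ORDS;
  }

  int docValueCount() {
    return ords.length;
  }

  // New-style iteration: bounded by docValueCount(), no sentinel check needed.
  java.util.List<Long> collectByCount() {
    java.util.List<Long> out = new java.util.ArrayList<>();
    for (int i = 0; i < docValueCount(); i++) {
      out.add(nextOrd());
    }
    return out;
  }
}
```

The point of the refactor is that the loop bound comes from `docValueCount()` up front, so the body never has to compare each ord against the `NO_MORE_ORDS` sentinel.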






[GitHub] [lucene] shahrs87 commented on pull request #907: LUCENE-10357 Ghost fields and postings/points

2022-06-30 Thread GitBox


shahrs87 commented on PR #907:
URL: https://github.com/apache/lucene/pull/907#issuecomment-1171746326

   > could you now try to remove all instances of if (terms == Terms.EMPTY)?
   
@jpountz, I tried to remove all the instances of `if (terms == Terms.EMPTY)`, 
but couldn't remove the remaining ones in the patch without causing test 
failures.





[jira] [Commented] (LUCENE-10628) Enable MatchingFacetSetCounts to use space partitioning data structures

2022-06-30 Thread Marc D'Mello (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561200#comment-17561200
 ] 

Marc D'Mello commented on LUCENE-10628:
---

I am planning on working on this.

> Enable MatchingFacetSetCounts to use space partitioning data structures
> ---
>
> Key: LUCENE-10628
> URL: https://issues.apache.org/jira/browse/LUCENE-10628
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Marc D'Mello
>Priority: Minor
>
> Currently, {{MatchingFacetSetCounts}} iterates linearly over the 
> {{FacetSetMatcher}} instances passed into it. While this is fine in some 
> cases, if we have a large number of {{FacetSetMatcher}}s, this can be 
> inefficient. We should give users the option to enable space partitioning 
> data structures (namely R-trees and k-d trees) so we can potentially scan 
> over these {{FacetSetMatcher}}s in sub-linear time.
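As a toy one-dimensional illustration of the idea (not the facets API; real {{FacetSetMatcher}}s are multi-dimensional, which is where R-trees and k-d trees come in): keeping exact-match matchers in a sorted structure lets a probe run in O(log n) instead of scanning every matcher. The `ExactMatchers` class below is entirely hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy 1-D illustration: each "matcher" accepts a single exact long value.
// A linear scan touches every matcher; a sorted map finds the match in O(log n).
class ExactMatchers {
  private final List<Long> matchers = new ArrayList<>();        // linear layout
  private final TreeMap<Long, Integer> index = new TreeMap<>(); // partitioned layout

  void add(long value) {
    index.put(value, matchers.size());
    matchers.add(value);
  }

  // O(n): scan all matchers, analogous to the current linear iteration
  int matchLinear(long value) {
    for (int i = 0; i < matchers.size(); i++) {
      if (matchers.get(i) == value) {
        return i;
      }
    }
    return -1;
  }

  // O(log n): space-partitioned lookup of the matching matcher's index
  int matchIndexed(long value) {
    Integer i = index.get(value);
    return i == null ? -1 : i;
  }
}
```

Generalizing from a sorted map to R-trees or k-d trees is what would let range and multi-dimensional matchers benefit from the same sub-linear probing.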






[jira] [Commented] (LUCENE-10546) Update Faceting user guide

2022-06-30 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561199#comment-17561199
 ] 

Greg Miller commented on LUCENE-10546:
--

Great, thanks [~epotiom]! I'm not aware of anyone else working on this.

> Update Faceting user guide
> --
>
> Key: LUCENE-10546
> URL: https://issues.apache.org/jira/browse/LUCENE-10546
> Project: Lucene - Core
>  Issue Type: Wish
>  Components: modules/facet
>Reporter: Greg Miller
>Priority: Minor
>
> The  [facet user 
> guide|https://lucene.apache.org/core/4_1_0/facet/org/apache/lucene/facet/doc-files/userguide.html]
>  was written based on 4.1. Since there's been a fair amount of active 
> facet-related development over the last year+, it would be nice to review the 
> guide and see what updates make sense.






[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues

2022-06-30 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561189#comment-17561189
 ] 

ASF subversion and git services commented on LUCENE-10603:
--

Commit 3e268805024cf98abb11f6de45b32403b088eb5b in lucene's branch 
refs/heads/branch_9x from Greg Miller
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=3e268805024 ]

LUCENE-10603: Migrate remaining SSDV iteration to use docValueCount in 
production code (#1000)



> Improve iteration of ords for SortedSetDocValues
> 
>
> Key: LUCENE-10603
> URL: https://issues.apache.org/jira/browse/LUCENE-10603
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Assignee: Lu Xugang
>Priority: Trivial
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> After SortedSetDocValues#docValueCount was added in Lucene 9.2, should we 
> refactor the implementation of ords iteration to use docValueCount instead of 
> NO_MORE_ORDS?
> Similar to how SortedNumericDocValues did.
> From 
> {code:java}
> for (long ord = values.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = values.nextOrd()) {
> }{code}
> to
> {code:java}
> for (int i = 0; i < values.docValueCount(); i++) {
>   long ord = values.nextOrd();
> }{code}






[GitHub] [lucene] gsmiller merged pull request #1000: LUCENE-10603: Migrate remaining SSDV iteration to use docValueCount in production code

2022-06-30 Thread GitBox


gsmiller merged PR #1000:
URL: https://github.com/apache/lucene/pull/1000





[GitHub] [lucene] gsmiller opened a new pull request, #1000: LUCENE-10603: Migrate remaining SSDV iteration to use docValueCount in production code

2022-06-30 Thread GitBox


gsmiller opened a new pull request, #1000:
URL: https://github.com/apache/lucene/pull/1000

   PR only for backport. No review requested.





[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues

2022-06-30 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561187#comment-17561187
 ] 

ASF subversion and git services commented on LUCENE-10603:
--

Commit 5f2a4998a079278ada89ce7bfa3992673a91c5c9 in lucene's branch 
refs/heads/main from Greg Miller
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=5f2a4998a07 ]

LUCENE-10603: Migrate remaining SSDV iteration to use docValueCount in 
production code (#995)



> Improve iteration of ords for SortedSetDocValues
> 
>
> Key: LUCENE-10603
> URL: https://issues.apache.org/jira/browse/LUCENE-10603
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Assignee: Lu Xugang
>Priority: Trivial
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> Now that SortedSetDocValues#docValueCount has been added in Lucene 9.2, should we 
> refactor the implementation of ords iteration to use docValueCount instead of 
> NO_MORE_ORDS?
> Similar to how SortedNumericDocValues does it
> From 
> {code:java}
> for (long ord = values.nextOrd();ord != SortedSetDocValues.NO_MORE_ORDS; ord 
> = values.nextOrd()) {
> }{code}
> to
> {code:java}
> for (int i = 0; i < values.docValueCount(); i++) {
>   long ord = values.nextOrd();
> }{code}






[GitHub] [lucene] gsmiller merged pull request #995: LUCENE-10603: Migrate remaining SSDV iteration to use docValueCount in production code

2022-06-30 Thread GitBox


gsmiller merged PR #995:
URL: https://github.com/apache/lucene/pull/995





[GitHub] [lucene] gsmiller commented on pull request #995: LUCENE-10603: Migrate remaining SSDV iteration to use docValueCount in production code

2022-06-30 Thread GitBox


gsmiller commented on PR #995:
URL: https://github.com/apache/lucene/pull/995#issuecomment-1171671380

   Merging, as I've addressed the outstanding feedback and the change is 
otherwise straightforward. Thanks @jpountz for the suggestions and review!





[jira] [Comment Edited] (LUCENE-10624) Binary Search for Sparse IndexedDISI advanceWithinBlock & advanceExactWithinBlock

2022-06-30 Thread Weiming Wu (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561166#comment-17561166
 ] 

Weiming Wu edited comment on LUCENE-10624 at 6/30/22 7:55 PM:
--

I started a new AWS EC2 host and reran the test. The performance of the 
candidate vs. the baseline is very close. Therefore, my original benchmark data 
points are invalid; something probably got mixed up during my previous test. I 
have crossed out my original benchmark data.

We noticed a performance improvement in our system's use case because we use 
parent-child doc blocks to index data and run some customized queries similar to 
BlockJoinQuery. We retrieve a lot of DocValues during the query, but we match 
only one child doc and one parent doc per doc block, so the DocValues to 
retrieve are very sparse.

For example,
||Lucene Doc ID||Parent ID||Parent Field A||Child ID||Child Field B||
|0|1|Fruit| | |
|1| | |100|Apple|
|2| | |101|Orange|
|3|10001|Beverage| | |
|4| | |201|Coke|
|5| | |202|Water|

I think the next steps could be:

A) Find (or create, if none exists) a benchmark dataset that can show the 
performance improvement for sparse DocValues;

B) Regarding [~jpountz]'s concern, it makes sense to me. We need some 
benchmarks to know whether binary or exponential search can cause a notable 
performance regression for use cases where "relatively dense fields that get 
advanced by small increments"

 


was (Author: JIRAUSER290435):
I started a new AWS EC2 host and reran the test. The performance of the 
candidate vs. the baseline is very close. Therefore, my original benchmark data 
points are invalid; something probably got mixed up during my previous test. I 
have crossed out my original benchmark data.

We noticed a performance improvement in our system's use case because we use 
parent-child doc blocks to index data and run some customized queries similar to 
BlockJoinQuery. We retrieve a lot of DocValues during the query, but we match 
only one child doc and one parent doc per doc block, so the DocValues to 
retrieve are very sparse.

For example,
||Lucene Doc ID||Parent ID||Parent Field A||Child ID||Child Field B||
|0|1|Fruit| | |
|1| | |100|Apple|
|2| | |101|Orange|
|3|10001|Beverage| | |
|4| | |201|Coke|
|5| | |202|Water|

I think the next steps could be:

A) Find (or create, if none exists) a benchmark dataset that can show the 
performance improvement for sparse DocValues;

B) Regarding [~jpountz]'s concern, it makes sense to me. We need some 
benchmarks to know whether binary or exponential search can cause a performance 
regression for use cases where "relatively dense fields that get advanced by 
small increments"

 

> Binary Search for Sparse IndexedDISI advanceWithinBlock & 
> advanceExactWithinBlock
> -
>
> Key: LUCENE-10624
> URL: https://issues.apache.org/jira/browse/LUCENE-10624
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 9.0, 9.1, 9.2
>Reporter: Weiming Wu
>Priority: Major
> Attachments: baseline_sparseTaxis_searchsparse-sorted.0.log, 
> candiate-exponential-searchsparse-sorted.0.log, 
> candidate_sparseTaxis_searchsparse-sorted.0.log
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> h3. Problem Statement
> We noticed a DocValue read performance regression with the iterative API when 
> upgrading from Lucene 5 to Lucene 9. Our latency increased by 50%. The 
> degradation is similar to what's described in 
> https://issues.apache.org/jira/browse/SOLR-9599 
> By analyzing profiling data, we found that the methods "advanceWithinBlock" and 
> "advanceExactWithinBlock" for sparse IndexedDISI are slow in Lucene 9 due to 
> their O(N) doc lookup algorithm.
> h3. Changes
> Used a binary search algorithm to replace the current O(N) lookup algorithm in 
> sparse IndexedDISI "advanceWithinBlock" and "advanceExactWithinBlock" because 
> docs are in ascending order.
> h3. Test
> {code:java}
> ./gradlew tidy
> ./gradlew check {code}
> h3. Benchmark
> 06/30/2022 Update: The benchmark data points below are invalid. I started a 
> new AWS EC2 instance and reran the test. The performance of the candidate and 
> baseline is very close.
>  
> -Ran sparseTaxis test cases from {color:#1d1c1d}luceneutil. Attached the 
> reports of baseline and candidates in attachments section.{color}-
> -{color:#1d1c1d}1. Most cases have 5-10% search latency reduction.{color}-
> -{color:#1d1c1d}2. Some highlights (>20%):{color}-
>  * -*{color:#1d1c1d}T0 green_pickup_latitude:[40.75 TO 40.9] 
> yellow_pickup_latitude:[40.75 TO 40.9] sort=null{color}*-
>  ** -{color:#1d1c1d}*Baseline:*  10973978+ hits hits in *726.81967 
> msec*{color}-
>  ** -{color:#1d1c1d}*Candidate:* 10973978+ hits hits in *484.544594 
> msec*{color}-
>  * -*{color:#1d1c1d}T0 cab_color:y cab_color:g 

[GitHub] [lucene] wuwm commented on pull request #968: [LUCENE-10624] Binary Search for Sparse IndexedDISI advanceWithinBloc…

2022-06-30 Thread GitBox


wuwm commented on PR #968:
URL: https://github.com/apache/lucene/pull/968#issuecomment-1171617717

   @yuzhoujianxia There is some discussion about whether binary or exponential 
search can cause performance regressions in some use cases. We need to address 
those concerns before merging. https://issues.apache.org/jira/browse/LUCENE-10624





[jira] [Commented] (LUCENE-10624) Binary Search for Sparse IndexedDISI advanceWithinBlock & advanceExactWithinBlock

2022-06-30 Thread Weiming Wu (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561166#comment-17561166
 ] 

Weiming Wu commented on LUCENE-10624:
-

I started a new AWS EC2 host and reran the test. The performance of the 
candidate vs. the baseline is very close. Therefore, my original benchmark data 
points are invalid; something probably got mixed up during my previous test. I 
have crossed out my original benchmark data.

We noticed a performance improvement in our system's use case because we use 
parent-child doc blocks to index data and run some customized queries similar to 
BlockJoinQuery. We retrieve a lot of DocValues during the query, but we match 
only one child doc and one parent doc per doc block, so the DocValues to 
retrieve are very sparse.

For example,
||Lucene Doc ID||Parent ID||Parent Field A||Child ID||Child Field B||
|0|1|Fruit| | |
|1| | |100|Apple|
|2| | |101|Orange|
|3|10001|Beverage| | |
|4| | |201|Coke|
|5| | |202|Water|

I think the next steps could be:

A) Find (or create, if none exists) a benchmark dataset that can show the 
performance improvement for sparse DocValues;

B) Regarding [~jpountz]'s concern, it makes sense to me. We need some 
benchmarks to know whether binary or exponential search can cause a performance 
regression for use cases where "relatively dense fields that get advanced by 
small increments"
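For context on option B, exponential (galloping) search is the usual compromise for the "advanced by small increments" case: it stays cheap when the target is only a few slots ahead, yet bounds the worst case logarithmically. A self-contained sketch over one block's ascending doc ids (illustrative only, not the code in the PR):

```java
public class ExponentialSearchSketch {
  // Returns the index of the first doc id >= target, or docs.length if none.
  // Gallops forward from `from`, then binary-searches the bracketed window;
  // a small advance costs only a couple of probes.
  static int advance(int[] docs, int from, int target) {
    if (from >= docs.length || docs[from] >= target) {
      return from; // already at or past the target
    }
    int bound = 1;
    while (from + bound < docs.length && docs[from + bound] < target) {
      bound <<= 1; // double the step: 1, 2, 4, 8, ...
    }
    int lo = from + (bound >> 1); // last probe known to be < target
    int hi = Math.min(docs.length - 1, from + bound);
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (docs[mid] < target) lo = mid + 1;
      else hi = mid - 1;
    }
    return lo;
  }

  public static void main(String[] args) {
    int[] docs = {1, 3, 5, 9, 12};
    System.out.println(advance(docs, 0, 5)); // index of doc 5
  }
}
```

A benchmark comparing this against plain binary search on dense blocks advanced by small increments would answer the concern directly.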

 

> Binary Search for Sparse IndexedDISI advanceWithinBlock & 
> advanceExactWithinBlock
> -
>
> Key: LUCENE-10624
> URL: https://issues.apache.org/jira/browse/LUCENE-10624
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 9.0, 9.1, 9.2
>Reporter: Weiming Wu
>Priority: Major
> Attachments: baseline_sparseTaxis_searchsparse-sorted.0.log, 
> candiate-exponential-searchsparse-sorted.0.log, 
> candidate_sparseTaxis_searchsparse-sorted.0.log
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> h3. Problem Statement
> We noticed a DocValue read performance regression with the iterative API when 
> upgrading from Lucene 5 to Lucene 9. Our latency increased by 50%. The 
> degradation is similar to what's described in 
> https://issues.apache.org/jira/browse/SOLR-9599 
> By analyzing profiling data, we found that the methods "advanceWithinBlock" and 
> "advanceExactWithinBlock" for sparse IndexedDISI are slow in Lucene 9 due to 
> their O(N) doc lookup algorithm.
> h3. Changes
> Used a binary search algorithm to replace the current O(N) lookup algorithm in 
> sparse IndexedDISI "advanceWithinBlock" and "advanceExactWithinBlock" because 
> docs are in ascending order.
> h3. Test
> {code:java}
> ./gradlew tidy
> ./gradlew check {code}
> h3. Benchmark
> 06/30/2022 Update: The benchmark data points below are invalid. I started a 
> new AWS EC2 instance and reran the test. The performance of the candidate and 
> baseline is very close.
>  
> -Ran sparseTaxis test cases from {color:#1d1c1d}luceneutil. Attached the 
> reports of baseline and candidates in attachments section.{color}-
> -{color:#1d1c1d}1. Most cases have 5-10% search latency reduction.{color}-
> -{color:#1d1c1d}2. Some highlights (>20%):{color}-
>  * -*{color:#1d1c1d}T0 green_pickup_latitude:[40.75 TO 40.9] 
> yellow_pickup_latitude:[40.75 TO 40.9] sort=null{color}*-
>  ** -{color:#1d1c1d}*Baseline:*  10973978+ hits hits in *726.81967 
> msec*{color}-
>  ** -{color:#1d1c1d}*Candidate:* 10973978+ hits hits in *484.544594 
> msec*{color}-
>  * -*{color:#1d1c1d}T0 cab_color:y cab_color:g sort=null{color}*-
>  ** -{color:#1d1c1d}*Baseline:* 2300174+ hits hits in *95.698324 msec*{color}-
>  ** -{color:#1d1c1d}*Candidate:* 2300174+ hits hits in *78.336193 
> msec*{color}-
>  * -{color:#1d1c1d}*T1 cab_color:y cab_color:g sort=null*{color}-
>  ** -{color:#1d1c1d}*Baseline:* 2300174+ hits hits in *391.565239 
> msec*{color}-
>  ** -{color:#1d1c1d}*Candidate:* 300174+ hits hits in *227.592885 
> msec*{color}{*}{{*}}-
>  * -{color:#1d1c1d}*...*{color}-






[jira] [Updated] (LUCENE-10624) Binary Search for Sparse IndexedDISI advanceWithinBlock & advanceExactWithinBlock

2022-06-30 Thread Weiming Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiming Wu updated LUCENE-10624:

Description: 
h3. Problem Statement

We noticed a DocValue read performance regression with the iterative API when 
upgrading from Lucene 5 to Lucene 9. Our latency increased by 50%. The 
degradation is similar to what's described in 
https://issues.apache.org/jira/browse/SOLR-9599 

By analyzing profiling data, we found that the methods "advanceWithinBlock" and 
"advanceExactWithinBlock" for sparse IndexedDISI are slow in Lucene 9 due to 
their O(N) doc lookup algorithm.
h3. Changes

Used a binary search algorithm to replace the current O(N) lookup algorithm in 
sparse IndexedDISI "advanceWithinBlock" and "advanceExactWithinBlock" because 
docs are in ascending order.
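The core of the change can be sketched as a lower-bound binary search over one block's sorted doc ids. This is a hedged, self-contained illustration of the technique, not the actual IndexedDISI implementation:

```java
public class BlockBinarySearchSketch {
  // Doc ids inside one DISI block are stored in ascending order, so the
  // O(N) forward scan can become an O(log N) lower-bound search. Returns
  // the index of the first doc id >= target, or blockDocs.length if none.
  static int firstAtOrAfter(int[] blockDocs, int target) {
    int lo = 0, hi = blockDocs.length - 1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (blockDocs[mid] < target) lo = mid + 1;
      else hi = mid - 1;
    }
    return lo;
  }

  public static void main(String[] args) {
    int[] block = {2, 4, 8, 40, 41};
    System.out.println(firstAtOrAfter(block, 5)); // index 2 (doc id 8)
  }
}
```

advanceWithinBlock would use the returned index to position the iterator, and advanceExactWithinBlock would additionally check whether the doc id at that index equals the target.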
h3. Test
{code:java}
./gradlew tidy
./gradlew check {code}
h3. Benchmark

06/30/2022 Update: The below benchmark data points are invalid. I started a new 
AWS EC2 instance and run the test. The performance of candidate and baseline 
are very close.

 

-Ran sparseTaxis test cases from {color:#1d1c1d}luceneutil. Attached the 
reports of baseline and candidates in attachments section.{color}-

-{color:#1d1c1d}1. Most cases have 5-10% search latency reduction.{color}-

-{color:#1d1c1d}2. Some highlights (>20%):{color}-
 * -*{color:#1d1c1d}T0 green_pickup_latitude:[40.75 TO 40.9] 
yellow_pickup_latitude:[40.75 TO 40.9] sort=null{color}*-
 ** -{color:#1d1c1d}*Baseline:*  10973978+ hits hits in *726.81967 msec*{color}-
 ** -{color:#1d1c1d}*Candidate:* 10973978+ hits hits in *484.544594 
msec*{color}-
 * -*{color:#1d1c1d}T0 cab_color:y cab_color:g sort=null{color}*-
 ** -{color:#1d1c1d}*Baseline:* 2300174+ hits hits in *95.698324 msec*{color}-
 ** -{color:#1d1c1d}*Candidate:* 2300174+ hits hits in *78.336193 msec*{color}-
 * -{color:#1d1c1d}*T1 cab_color:y cab_color:g sort=null*{color}-
 ** -{color:#1d1c1d}*Baseline:* 2300174+ hits hits in *391.565239 msec*{color}-
 ** -{color:#1d1c1d}*Candidate:* 300174+ hits hits in *227.592885 
msec*{color}{*}{{*}}-
 * -{color:#1d1c1d}*...*{color}-

  was:
h3. Problem Statement

We noticed a DocValue read performance regression with the iterative API when 
upgrading from Lucene 5 to Lucene 9. Our latency increased by 50%. The 
degradation is similar to what's described in 
https://issues.apache.org/jira/browse/SOLR-9599 

By analyzing profiling data, we found that the methods "advanceWithinBlock" and 
"advanceExactWithinBlock" for sparse IndexedDISI are slow in Lucene 9 due to 
their O(N) doc lookup algorithm.
h3. Changes

Used a binary search algorithm to replace the current O(N) lookup algorithm in 
sparse IndexedDISI "advanceWithinBlock" and "advanceExactWithinBlock" because 
docs are in ascending order.
h3. Test
{code:java}
./gradlew tidy
./gradlew check {code}
h3. Benchmark

Ran sparseTaxis test cases from {color:#1d1c1d}luceneutil. Attached the reports 
of baseline and candidates in attachments section.{color}

{color:#1d1c1d}1. Most cases have 5-10% search latency reduction.{color}

{color:#1d1c1d}2. Some highlights (>20%):{color}
 * *{color:#1d1c1d}T0 green_pickup_latitude:[40.75 TO 40.9] 
yellow_pickup_latitude:[40.75 TO 40.9] sort=null{color}*
 ** {color:#1d1c1d}*Baseline:*  10973978+ hits hits in *726.81967 msec*{color}
 ** {color:#1d1c1d}*Candidate:* 10973978+ hits hits in *484.544594 msec*{color}
 * *{color:#1d1c1d}T0 cab_color:y cab_color:g sort=null{color}*
 ** {color:#1d1c1d}*Baseline:* 2300174+ hits hits in *95.698324 msec*{color}
 ** {color:#1d1c1d}*Candidate:* 2300174+ hits hits in *78.336193 msec*{color}
 * {color:#1d1c1d}*T1 cab_color:y cab_color:g sort=null*{color}
 ** {color:#1d1c1d}*Baseline:* 2300174+ hits hits in *391.565239 msec*{color}
 ** {color:#1d1c1d}*Candidate:* 300174+ hits hits in *227.592885 
msec*{color}{*}{*}
 * {color:#1d1c1d}*...*{color}


> Binary Search for Sparse IndexedDISI advanceWithinBlock & 
> advanceExactWithinBlock
> -
>
> Key: LUCENE-10624
> URL: https://issues.apache.org/jira/browse/LUCENE-10624
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 9.0, 9.1, 9.2
>Reporter: Weiming Wu
>Priority: Major
> Attachments: baseline_sparseTaxis_searchsparse-sorted.0.log, 
> candiate-exponential-searchsparse-sorted.0.log, 
> candidate_sparseTaxis_searchsparse-sorted.0.log
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> h3. Problem Statement
> We noticed a DocValue read performance regression with the iterative API when 
> upgrading from Lucene 5 to Lucene 9. Our latency increased by 50%. The 
> degradation is similar to what's described in 
> https://issues.apache.org/jira/browse/SOLR-9599 
> By analyzing profiling data, we found that the methods "advanceWithinBlock" and 
> 

[GitHub] [lucene] yuzhoujianxia commented on pull request #968: [LUCENE-10624] Binary Search for Sparse IndexedDISI advanceWithinBloc…

2022-06-30 Thread GitBox


yuzhoujianxia commented on PR #968:
URL: https://github.com/apache/lucene/pull/968#issuecomment-1171575344

   Can we get this merged?





[GitHub] [lucene-jira-archive] mocobeta commented on issue #1: Fix markup conversion error

2022-06-30 Thread GitBox


mocobeta commented on issue #1:
URL: 
https://github.com/apache/lucene-jira-archive/issues/1#issuecomment-1171540949

   Unfortunately, it was not so trivial. I forgot about code blocks. In code 
blocks, spaces and line feed characters in the original text should be 
preserved, and my solution breaks them.
   I tried to handle it with look-ahead and look-behind regexes, but that 
didn't help. I don't think this is solvable with regular expressions.





[GitHub] [lucene-jira-archive] mikemccand commented on issue #1: Fix markup conversion error

2022-06-30 Thread GitBox


mikemccand commented on issue #1:
URL: 
https://github.com/apache/lucene-jira-archive/issues/1#issuecomment-1171490096

   > Looks like the converter library does not support Carriage Return `\r` and 
succeeding spaces after Line Feed `\n`
   
   Sigh, will our species ever get past the different EOL characters/problems!! 
 Thanks for tracking this down @mocobeta.





[GitHub] [lucene-jira-archive] mocobeta commented on issue #1: Fix markup conversion error

2022-06-30 Thread GitBox


mocobeta commented on issue #1:
URL: 
https://github.com/apache/lucene-jira-archive/issues/1#issuecomment-1171467919

   The conversion tool seems to erase consecutive LFs (`\n\n`); this causes 
indentation errors in Markdown.
   
   The removed LFs can be recovered with this regex (hack):
   
   ```
   text = re.sub(r"\n\s*(?!\s*-)", "\n\n", text)
   ``` 
   
   ![Screenshot from 2022-07-01 
01-45-47](https://user-images.githubusercontent.com/1825333/176733562-1835e865-a597-4b59-89c5-2b51d8e7baed.png)
   





[jira] [Commented] (LUCENE-10627) Using CompositeByteBuf to Reduce Memory Copy

2022-06-30 Thread LuYunCheng (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561107#comment-17561107
 ] 

LuYunCheng commented on LUCENE-10627:
-

It is a nice suggestion; I will try to incorporate it.

> Using CompositeByteBuf to Reduce Memory Copy
> 
>
> Key: LUCENE-10627
> URL: https://issues.apache.org/jira/browse/LUCENE-10627
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs, core/store
>Reporter: LuYunCheng
>Priority: Major
>
> Code: [https://github.com/apache/lucene/pull/987]
> I see that when Lucene flushes and merges stored fields, it needs many memory copies:
> {code:java}
> Lucene Merge Thread #25940]" #906546 daemon prio=5 os_prio=0 cpu=20503.95ms 
> elapsed=68.76s tid=0x7ee990002c50 nid=0x3aac54 runnable  
> [0x7f17718db000]
>    java.lang.Thread.State: RUNNABLE
>     at 
> org.apache.lucene.store.ByteBuffersDataOutput.toArrayCopy(ByteBuffersDataOutput.java:271)
>     at 
> org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.flush(CompressingStoredFieldsWriter.java:239)
>     at 
> org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.finishDocument(CompressingStoredFieldsWriter.java:169)
>     at 
> org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.merge(CompressingStoredFieldsWriter.java:654)
>     at 
> org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:228)
>     at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:105)
>     at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4760)
>     at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4364)
>     at 
> org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:5923)
>     at 
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624)
>     at 
> org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:100)
>     at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:682)
>  {code}
> When Lucene's *CompressingStoredFieldsWriter* flushes documents, it needs many 
> memory copies:
> With Lucene90 using {*}LZ4WithPresetDictCompressionMode{*}:
>  # bufferedDocs.toArrayCopy copies blocks into one continuous buffer for chunk 
> compression
>  # the compressor copies dict and data into one block buffer
>  # do the compression
>  # copy the compressed data out
> With Lucene90 using {*}DeflateWithPresetDictCompressionMode{*}:
>  # bufferedDocs.toArrayCopy copies blocks into one continuous buffer for chunk 
> compression
>  # do the compression
>  # copy the compressed data out
>  
> I think we can use CompositeByteBuf to reduce temporary memory copies:
>  # we do not have to call *bufferedDocs.toArrayCopy* when we just need a 
> continuous view of the content for chunk compression
>  
> I wrote a simple mini benchmark in test code ([link 
> |https://github.com/apache/lucene/blob/5a406a5c483c7fadaf0e8a5f06732c79ad174d11/lucene/core/src/test/org/apache/lucene/codecs/lucene90/compressing/TestCompressingStoredFieldsFormat.java#L353]):
> *LZ4WithPresetDict run* Capacity:41943040(bytes) , iter 10times: Origin 
> elapse:5391ms , New elapse:5297ms
> *DeflateWithPresetDict run* Capacity:41943040(bytes), iter 10times: Origin 
> elapse:{*}115ms{*}, New elapse:{*}12ms{*}
>  
> And I ran runStoredFieldsBenchmark with doc_limit=-1; it shows:
> ||Msec to index||BEST_SPEED ||BEST_COMPRESSION||
> |Baseline|318877.00|606288.00|
> |Candidate|314442.00|604719.00|






[GitHub] [lucene] jpountz commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing

2022-06-30 Thread GitBox


jpountz commented on code in PR #992:
URL: https://github.com/apache/lucene/pull/992#discussion_r911190808


##
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java:
##
@@ -24,28 +24,40 @@
 import org.apache.lucene.index.DocIDMerger;
 import org.apache.lucene.index.FieldInfo;
 import org.apache.lucene.index.MergeState;
+import org.apache.lucene.index.Sorter;
 import org.apache.lucene.index.VectorValues;
 import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.util.Accountable;
 import org.apache.lucene.util.Bits;
 import org.apache.lucene.util.BytesRef;
 
 /** Writes vectors to an index. */
-public abstract class KnnVectorsWriter implements Closeable {
+public abstract class KnnVectorsWriter implements Accountable, Closeable {
 
   /** Sole constructor */
   protected KnnVectorsWriter() {}
 
-  /** Write all values contained in the provided reader */
-  public abstract void writeField(FieldInfo fieldInfo, KnnVectorsReader 
knnVectorsReader)
+  /** Add new field for indexing */
+  public abstract void addField(FieldInfo fieldInfo) throws IOException;
+
+  /** Add new docID with its vector value to the given field for indexing */
+  public abstract void addValue(FieldInfo fieldInfo, int docID, float[] 
vectorValue)
+  throws IOException;

Review Comment:
   It's not great to need to pass the field on every value and require 
implementations to look up the right data structure on every doc. Should we add 
one more layer to the API to look more like this:
   
   ```
   KnnFieldVectorsWriter {
 addValue(int docID, float[] vectorValue);
   }
   
   KnnVectorsWriter {
 KnnFieldVectorsWriter addField(FieldInfo info);
 flush(int maxDoc);
 // merge(), etc.
   }
   ```
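The two-layer API suggested above can be fleshed out with a minimal, self-contained sketch. Names mirror the comment; everything else (the buffering, the `String` field key) is an assumption for illustration, not the eventual Lucene API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PerFieldWriterSketch {
  // Per-field handle: the hot addValue path touches only this object,
  // with no per-value field lookup.
  static class KnnFieldVectorsWriter {
    final List<float[]> buffered = new ArrayList<>();
    void addValue(int docID, float[] vectorValue) {
      buffered.add(vectorValue); // docID bookkeeping elided in this sketch
    }
  }

  static class KnnVectorsWriter {
    final Map<String, KnnFieldVectorsWriter> fields = new HashMap<>();
    KnnFieldVectorsWriter addField(String fieldName) {
      KnnFieldVectorsWriter writer = new KnnFieldVectorsWriter();
      fields.put(fieldName, writer);
      return writer; // the caller keeps this handle for the whole segment
    }
  }

  public static void main(String[] args) {
    KnnVectorsWriter writer = new KnnVectorsWriter();
    KnnFieldVectorsWriter field = writer.addField("vector");
    field.addValue(0, new float[] {1f, 2f});
    field.addValue(1, new float[] {3f, 4f});
    System.out.println(field.buffered.size()); // 2
  }
}
```

Because addField hands back the per-field writer, the indexing chain resolves the field once and then streams values, which also removes the need for a cached single-writer fast path in PerFieldKnnVectorsFormat.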



##
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java:
##
@@ -24,28 +24,40 @@
 import org.apache.lucene.index.DocIDMerger;
 import org.apache.lucene.index.FieldInfo;
 import org.apache.lucene.index.MergeState;
+import org.apache.lucene.index.Sorter;
 import org.apache.lucene.index.VectorValues;
 import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.util.Accountable;
 import org.apache.lucene.util.Bits;
 import org.apache.lucene.util.BytesRef;
 
 /** Writes vectors to an index. */
-public abstract class KnnVectorsWriter implements Closeable {
+public abstract class KnnVectorsWriter implements Accountable, Closeable {
 
   /** Sole constructor */
   protected KnnVectorsWriter() {}
 
-  /** Write all values contained in the provided reader */
-  public abstract void writeField(FieldInfo fieldInfo, KnnVectorsReader 
knnVectorsReader)
+  /** Add new field for indexing */
+  public abstract void addField(FieldInfo fieldInfo) throws IOException;
+
+  /** Add new docID with its vector value to the given field for indexing */
+  public abstract void addValue(FieldInfo fieldInfo, int docID, float[] 
vectorValue)
+  throws IOException;
+
+  /** Flush all buffered data on disk * */
+  public abstract void flush(int maxDoc, Sorter.DocMap sortMap) throws 
IOException;
+
+  /** Write field for merging */
+  public abstract void writeFieldForMerging(FieldInfo fieldInfo, 
KnnVectorsReader knnVectorsReader)

Review Comment:
   Is it the same as `mergeXXXField` in `DocValuesConsumer` or `mergeOneField` 
in `PointsWriter`? Maybe we should rename to `mergeOneField` and make this 
method responsible for creating the merged view (instead of doing it on top)?



##
lucene/core/src/java/org/apache/lucene/codecs/perfield/PerFieldKnnVectorsFormat.java:
##
@@ -94,17 +95,61 @@ public KnnVectorsReader fieldsReader(SegmentReadState 
state) throws IOException
   private class FieldsWriter extends KnnVectorsWriter {
 private final Map formats;
 private final Map suffixes = new HashMap<>();
+private final Map> writersForFields =
+new IdentityHashMap<>();
 private final SegmentWriteState segmentWriteState;
 
+// if there is a single writer, cache it for faster indexing
+private KnnVectorsWriter singleWriter;

Review Comment:
   We should design the API in such a way that such tricks are not needed; I 
left a comment on `KnnVectorsWriter`.






[jira] [Updated] (LUCENE-10627) Using CompositeByteBuf to Reduce Memory Copy

2022-06-30 Thread LuYunCheng (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LuYunCheng updated LUCENE-10627:

Description: 
Code: [https://github.com/apache/lucene/pull/987]

I see that when Lucene flushes and merges stored fields, it needs many memory copies:
{code:java}
Lucene Merge Thread #25940]" #906546 daemon prio=5 os_prio=0 cpu=20503.95ms 
elapsed=68.76s tid=0x7ee990002c50 nid=0x3aac54 runnable  
[0x7f17718db000]
   java.lang.Thread.State: RUNNABLE
    at 
org.apache.lucene.store.ByteBuffersDataOutput.toArrayCopy(ByteBuffersDataOutput.java:271)
    at 
org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.flush(CompressingStoredFieldsWriter.java:239)
    at 
org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.finishDocument(CompressingStoredFieldsWriter.java:169)
    at 
org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.merge(CompressingStoredFieldsWriter.java:654)
    at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:228)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:105)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4760)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4364)
    at 
org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:5923)
    at 
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624)
    at 
org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:100)
    at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:682)
 {code}
When Lucene's *CompressingStoredFieldsWriter* flushes documents, it needs many 
memory copies:

With Lucene90 using {*}LZ4WithPresetDictCompressionMode{*}:
 # bufferedDocs.toArrayCopy copies the blocks into one contiguous buffer for 
chunk compression
 # the compressor copies the dict and the data into one block buffer
 # do the compression
 # copy the compressed data out

With Lucene90 using {*}DeflateWithPresetDictCompressionMode{*}:
 # bufferedDocs.toArrayCopy copies the blocks into one contiguous buffer for 
chunk compression
 # do the compression
 # copy the compressed data out

 

I think we can use a CompositeByteBuf to reduce temporary memory copies:
 # we do not have to call *bufferedDocs.toArrayCopy* when we just need a 
contiguous view of the content for chunk compression

 

I wrote a simple mini benchmark in the test code ([link 
|https://github.com/apache/lucene/blob/5a406a5c483c7fadaf0e8a5f06732c79ad174d11/lucene/core/src/test/org/apache/lucene/codecs/lucene90/compressing/TestCompressingStoredFieldsFormat.java#L353]):
*LZ4WithPresetDict run* Capacity:41943040(bytes) , iter 10times: Origin 
elapse:5391ms , New elapse:5297ms
*DeflateWithPresetDict run* Capacity:41943040(bytes), iter 10times: Origin 
elapse:{*}115ms{*}, New elapse:{*}12ms{*}
 
And I ran runStoredFieldsBenchmark with doc_limit=-1, which shows:
||Msec to index||BEST_SPEED ||BEST_COMPRESSION||
|Baseline|318877.00|606288.00|
|Candidate|314442.00|604719.00|
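As a rough illustration of the idea (not the actual Lucene or Netty API — the class and buffer names below are made up), a composite view lets a consumer walk several blocks in order without first copying them into one contiguous array:

```java
import java.nio.ByteBuffer;
import java.util.List;

public class CompositeViewSketch {
    public static void main(String[] args) {
        // Two separately allocated blocks, as bufferedDocs would hold them.
        ByteBuffer block1 = ByteBuffer.wrap("hello ".getBytes());
        ByteBuffer block2 = ByteBuffer.wrap("world".getBytes());

        // Instead of toArrayCopy-ing both blocks into one contiguous array
        // before compression, feed the blocks to the consumer one by one.
        StringBuilder consumer = new StringBuilder();
        for (ByteBuffer block : List.of(block1, block2)) {
            while (block.hasRemaining()) {
                consumer.append((char) block.get());
            }
        }
        System.out.println(consumer); // prints "hello world"
    }
}
```

The saving is exactly the skipped intermediate allocation and copy; the consumer (here a StringBuilder, in the real code a compressor) sees the same byte sequence either way.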

  was:
Code: [https://github.com/apache/lucene/pull/987]

I see When Lucene Do flush and merge store fields, need many memory copies:
{code:java}
Lucene Merge Thread #25940]" #906546 daemon prio=5 os_prio=0 cpu=20503.95ms 
elapsed=68.76s tid=0x7ee990002c50 nid=0x3aac54 runnable  
[0x7f17718db000]
   java.lang.Thread.State: RUNNABLE
    at 
org.apache.lucene.store.ByteBuffersDataOutput.toArrayCopy(ByteBuffersDataOutput.java:271)
    at 
org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.flush(CompressingStoredFieldsWriter.java:239)
    at 
org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.finishDocument(CompressingStoredFieldsWriter.java:169)
    at 
org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.merge(CompressingStoredFieldsWriter.java:654)
    at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:228)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:105)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4760)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4364)
    at 
org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:5923)
    at 
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624)
    at 
org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:100)
    at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:682)
 {code}
When Lucene *CompressingStoredFieldsWriter* do flush documents, it needs many 
memory copies:

With Lucene90 using {*}LZ4WithPresetDictCompressionMode{*}:
 # bufferedDocs.toArrayCopy copy blocks into one continue content for chunk 
compress
 # compressor copy dict and data into one block buffer
 # do 

[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-30 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561089#comment-17561089
 ] 

Tomoko Uchida commented on LUCENE-10557:


I'm sorry for the noise - Jira's special emojis should be converted to 
corresponding Unicode emojis. This is a test post to make sure the mapping is 
correct.
(y) (n) (i) (/) (x) (!) (+) (-) (?) (on) (off) (*) (*r) (*g) (*b) (*) (flag) 
(flagoff)

> Migrate to GitHub issue from Jira
> -
>
> Key: LUCENE-10557
> URL: https://issues.apache.org/jira/browse/LUCENE-10557
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: Screen Shot 2022-06-29 at 11.02.35 AM.png, 
> image-2022-06-29-13-36-57-365.png, screenshot-1.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> A few (not the majority) Apache projects already use the GitHub issue instead 
> of Jira. For example,
> Airflow: [https://github.com/apache/airflow/issues]
> BookKeeper: [https://github.com/apache/bookkeeper/issues]
> So I think it'd be technically possible that we move to GitHub issue. I have 
> little knowledge of how to proceed with it, I'd like to discuss whether we 
> should migrate to it, and if so, how to smoothly handle the migration.
> The major tasks would be:
>  * (/) Get a consensus about the migration among committers
>  * (/) Choose issues that should be moved to GitHub - We'll migrate all 
> issues towards an atomic switch to GitHub if no major technical obstacles 
> show up.
>  ** Discussion thread 
> [https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12]
>  ** -Conclusion for now: We don't migrate any issues. Only new issues should 
> be opened on GitHub.-
>  ** Write a prototype migration script - the decision could be made on that. 
> Things to consider:
>  *** version numbers - labels or milestones?
>  *** add a comment/ prepend a link to the source Jira issue on github side,
>  *** add a comment/ prepend a link on the jira side to the new issue on 
> github side (for people who access jira from blogs, mailing list archives and 
> other sources that will have stale links),
>  *** convert cross-issue automatic links in comments/ descriptions (as 
> suggested by Robert),
>  *** strategy to deal with sub-issues (hierarchies),
>  *** maybe prefix (or postfix) the issue title on github side with the 
> original LUCENE-XYZ key so that it is easier to search for a particular issue 
> there?
>  *** how to deal with user IDs (author, reporter, commenters)? Do they have 
> to be github users? Will information about people not registered on github be 
> lost?
>  *** create an extra mapping file of old-issue-new-issue URLs for any 
> potential future uses.
>  *** what to do with issue numbers in git/svn commits? These could be 
> rewritten but it'd change the entire git history tree - I don't think this is 
> practical, while doable.
> * Prepare a complete migration tool
> ** See https://github.com/apache/lucene-jira-archive/issues/5 
> * Build the convention for issue label/milestone management
>  ** See [https://github.com/apache/lucene-jira-archive/issues/6]
>  ** Do some experiments on a sandbox repository 
> [https://github.com/mocobeta/sandbox-lucene-10557]
>  ** Make documentation for metadata (label/milestone) management 
>  * (/) Enable Github issue on the lucene's repository
>  ** Raise an issue on INFRA
>  ** (Create an issue-only private repository for sensitive issues if it's 
> needed and allowed)
>  ** Set a mail hook to 
> [issues@lucene.apache.org|mailto:issues@lucene.apache.org] (many thanks to 
> the general mail group name)
>  * Set a schedule for migration
>  ** See [https://github.com/apache/lucene-jira-archive/issues/7]
>  ** Give some time to committers to play around with issues/labels/milestones 
> before the actual migration
>  ** Make an announcement on the mail lists
>  ** Show some text messages when opening a new Jira issue (in issue template?)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[GitHub] [lucene-jira-archive] mocobeta commented on issue #1: Fix markup conversion error

2022-06-30 Thread GitBox


mocobeta commented on issue #1:
URL: 
https://github.com/apache/lucene-jira-archive/issues/1#issuecomment-1171325947

   ![Screenshot from 2022-06-30 
23-54-46](https://user-images.githubusercontent.com/1825333/176709310-dbe249df-5f86-439d-95ec-cbe932905d16.png)
   
   Indents are still not preserved - this should be another problem.





[GitHub] [lucene-jira-archive] mocobeta commented on issue #1: Fix markup conversion error

2022-06-30 Thread GitBox


mocobeta commented on issue #1:
URL: 
https://github.com/apache/lucene-jira-archive/issues/1#issuecomment-1171316604

   Looks like the converter library does not support Carriage Return `\r` or 
the spaces following a Line Feed `\n`, and that causes the conversion errors. 
This quick fix in pre-processing may resolve many of them.
   ```
   text = re.sub(r"\r\n\s*", "\n", text)
   ```
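For reference, the same normalization expressed in Java (`String.replaceAll` with the same pattern), exercised on a made-up sample:

```java
public class CrlfNormalize {
    public static void main(String[] args) {
        String text = "h1. Title\r\n   some indented Jira text\r\nnext line";
        // Collapse CR+LF plus any following whitespace into a single LF,
        // mirroring the Python fix: re.sub(r"\r\n\s*", "\n", text)
        String cleaned = text.replaceAll("\r\n\\s*", "\n");
        System.out.println(
            cleaned.equals("h1. Title\nsome indented Jira text\nnext line")); // prints "true"
    }
}
```

This also shows the trade-off of the quick fix: the `\s*` eats the leading indentation of the following line, which is significant inside Jira code blocks.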





[GitHub] [lucene] mayya-sharipova commented on pull request #992: LUCENE-10592 Build HNSW Graph on indexing

2022-06-30 Thread GitBox


mayya-sharipova commented on PR #992:
URL: https://github.com/apache/lucene/pull/992#issuecomment-1171306876

   > I am a bit surprised about the benchmark results. In 
[LUCENE-10375](https://issues.apache.org/jira/browse/LUCENE-10375), we found 
that writing all vectors to disk before building the graph sped up indexing 
(not just merging). This change goes back to the strategy of using on-heap 
vectors to build the graph, so I'd expect a slowdown. Here are the benchmark 
results from that issue:
   
   @jtibshirani Thanks for the initial review. "write vectors" here means the 
time for the whole flush operation. For the main branch, as we build a graph 
during flush, it takes a lot of time (840392 msec); while for this PR the flush 
operation is fast (1017 msec).
   
   





[jira] [Resolved] (LUCENE-10581) Optimize stored fields merges on the first segment

2022-06-30 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-10581.
---
Resolution: Won't Fix

> Optimize stored fields merges on the first segment
> --
>
> Key: LUCENE-10581
> URL: https://issues.apache.org/jira/browse/LUCENE-10581
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> This is mostly repurposing LUCENE-10573. Even though our merge policies no 
> longer perform quadratic merging, it's still possible to configure them with 
> low merge factors (e.g. 2) or they might decide to create unbalanced merges 
> where the biggest segment of the merge accounts for a large part of the 
> merge. In such cases, copying compressed data directly still yields 
> significant benefits.






[GitHub] [lucene] jpountz commented on pull request #892: LUCENE-10581: Optimize stored fields bulk merges on the first segment

2022-06-30 Thread GitBox


jpountz commented on PR #892:
URL: https://github.com/apache/lucene/pull/892#issuecomment-1171282676

   Thinking more about it, I'm thinking of not merging this change. In the 
normal case when merges are balanced, it doesn't help because the first segment 
would generally have a dirty block pretty early. I tried to reason through 
whether other use-cases would benefit from this change, but I don't think that 
any would benefit significantly.





[GitHub] [lucene] jpountz closed pull request #892: LUCENE-10581: Optimize stored fields bulk merges on the first segment

2022-06-30 Thread GitBox


jpountz closed pull request #892: LUCENE-10581: Optimize stored fields bulk 
merges on the first segment
URL: https://github.com/apache/lucene/pull/892





[jira] [Updated] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-30 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-10557:
---
Description: 
A few (not the majority) Apache projects already use the GitHub issue instead 
of Jira. For example,

Airflow: [https://github.com/apache/airflow/issues]

BookKeeper: [https://github.com/apache/bookkeeper/issues]

So I think it'd be technically possible that we move to GitHub issues. I have 
little knowledge of how to proceed with it, but I'd like to discuss whether we 
should migrate to it and, if so, how to smoothly handle the migration.

The major tasks would be:
 * (/) Get a consensus about the migration among committers
 * (/) Choose issues that should be moved to GitHub - We'll migrate all issues 
towards an atomic switch to GitHub if no major technical obstacles show up.
 ** Discussion thread 
[https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12]
 ** -Conclusion for now: We don't migrate any issues. Only new issues should be 
opened on GitHub.-
 ** Write a prototype migration script - the decision could be made on that. 
Things to consider:
 *** version numbers - labels or milestones?
 *** add a comment/ prepend a link to the source Jira issue on github side,
 *** add a comment/ prepend a link on the jira side to the new issue on github 
side (for people who access jira from blogs, mailing list archives and other 
sources that will have stale links),
 *** convert cross-issue automatic links in comments/ descriptions (as 
suggested by Robert),
 *** strategy to deal with sub-issues (hierarchies),
 *** maybe prefix (or postfix) the issue title on github side with the original 
LUCENE-XYZ key so that it is easier to search for a particular issue there?
 *** how to deal with user IDs (author, reporter, commenters)? Do they have to 
be github users? Will information about people not registered on github be lost?
 *** create an extra mapping file of old-issue-new-issue URLs for any potential 
future uses.
 *** what to do with issue numbers in git/svn commits? These could be rewritten 
but it'd change the entire git history tree - I don't think this is practical, 
while doable.
* Prepare a complete migration tool
** See https://github.com/apache/lucene-jira-archive/issues/5 
* Build the convention for issue label/milestone management
 ** See [https://github.com/apache/lucene-jira-archive/issues/6]
 ** Do some experiments on a sandbox repository 
[https://github.com/mocobeta/sandbox-lucene-10557]
 ** Make documentation for metadata (label/milestone) management 
 * (/) Enable Github issue on the lucene's repository
 ** Raise an issue on INFRA
 ** (Create an issue-only private repository for sensitive issues if it's 
needed and allowed)
 ** Set a mail hook to 
[issues@lucene.apache.org|mailto:issues@lucene.apache.org] (many thanks to the 
general mail group name)
 * Set a schedule for migration
 ** See [https://github.com/apache/lucene-jira-archive/issues/7]
 ** Give some time to committers to play around with issues/labels/milestones 
before the actual migration
 ** Make an announcement on the mail lists
 ** Show some text messages when opening a new Jira issue (in issue template?)

  was:
A few (not the majority) Apache projects already use the GitHub issue instead 
of Jira. For example,

Airflow: [https://github.com/apache/airflow/issues]

BookKeeper: [https://github.com/apache/bookkeeper/issues]

So I think it'd be technically possible that we move to GitHub issue. I have 
little knowledge of how to proceed with it, I'd like to discuss whether we 
should migrate to it, and if so, how to smoothly handle the migration.

The major tasks would be:
 * (/) Get a consensus about the migration among committers
 * (/) Choose issues that should be moved to GitHub - We'll migrate all issues 
towards an atomic switch to GitHub if no major technical obstacles show up.
 ** Discussion thread 
[https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12]
 ** -Conclusion for now: We don't migrate any issues. Only new issues should be 
opened on GitHub.-
 ** Write a prototype migration script - the decision could be made on that. 
Things to consider:
 *** version numbers - labels or milestones?
 *** add a comment/ prepend a link to the source Jira issue on github side,
 *** add a comment/ prepend a link on the jira side to the new issue on github 
side (for people who access jira from blogs, mailing list archives and other 
sources that will have stale links),
 *** convert cross-issue automatic links in comments/ descriptions (as 
suggested by Robert),
 *** strategy to deal with sub-issues (hierarchies),
 *** maybe prefix (or postfix) the issue title on github side with the original 
LUCENE-XYZ key so that it is easier to search for a particular issue there?
 *** how to deal with user IDs (author, reporter, commenters)? Do they have to 
be github users? Will information about people not 

[jira] [Updated] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-30 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-10557:
---
Description: 
A few (not the majority) Apache projects already use the GitHub issue instead 
of Jira. For example,

Airflow: [https://github.com/apache/airflow/issues]

BookKeeper: [https://github.com/apache/bookkeeper/issues]

So I think it'd be technically possible that we move to GitHub issues. I have 
little knowledge of how to proceed with it, but I'd like to discuss whether we 
should migrate to it and, if so, how to smoothly handle the migration.

The major tasks would be:
 * (/) Get a consensus about the migration among committers
 * (/) Choose issues that should be moved to GitHub - We'll migrate all issues 
towards an atomic switch to GitHub if no major technical obstacles show up.
 ** Discussion thread 
[https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12]
 ** -Conclusion for now: We don't migrate any issues. Only new issues should be 
opened on GitHub.-
 ** Write a prototype migration script - the decision could be made on that. 
Things to consider:
 *** version numbers - labels or milestones?
 *** add a comment/ prepend a link to the source Jira issue on github side,
 *** add a comment/ prepend a link on the jira side to the new issue on github 
side (for people who access jira from blogs, mailing list archives and other 
sources that will have stale links),
 *** convert cross-issue automatic links in comments/ descriptions (as 
suggested by Robert),
 *** strategy to deal with sub-issues (hierarchies),
 *** maybe prefix (or postfix) the issue title on github side with the original 
LUCENE-XYZ key so that it is easier to search for a particular issue there?
 *** how to deal with user IDs (author, reporter, commenters)? Do they have to 
be github users? Will information about people not registered on github be lost?
 *** create an extra mapping file of old-issue-new-issue URLs for any potential 
future uses.
 *** what to do with issue numbers in git/svn commits? These could be rewritten 
but it'd change the entire git history tree - I don't think this is practical, 
while doable.
 * Build the convention for issue label/milestone management
 ** See [https://github.com/apache/lucene-jira-archive/issues/6]
 ** Do some experiments on a sandbox repository 
[https://github.com/mocobeta/sandbox-lucene-10557]
 ** Make documentation for metadata (label/milestone) management 
 * (/) Enable Github issue on the lucene's repository
 ** Raise an issue on INFRA
 ** (Create an issue-only private repository for sensitive issues if it's 
needed and allowed)
 ** Set a mail hook to 
[issues@lucene.apache.org|mailto:issues@lucene.apache.org] (many thanks to the 
general mail group name)
 * Set a schedule for migration
 ** See [https://github.com/apache/lucene-jira-archive/issues/7]
 ** Give some time to committers to play around with issues/labels/milestones 
before the actual migration
 ** Make an announcement on the mail lists
 ** Show some text messages when opening a new Jira issue (in issue template?)

  was:
A few (not the majority) Apache projects already use the GitHub issue instead 
of Jira. For example,

Airflow: [https://github.com/apache/airflow/issues]

BookKeeper: [https://github.com/apache/bookkeeper/issues]

So I think it'd be technically possible that we move to GitHub issue. I have 
little knowledge of how to proceed with it, I'd like to discuss whether we 
should migrate to it, and if so, how to smoothly handle the migration.

The major tasks would be:
 * (/) Get a consensus about the migration among committers
 * Choose issues that should be moved to GitHub
 ** Discussion thread 
[https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12]
 ** -Conclusion for now: We don't migrate any issues. Only new issues should be 
opened on GitHub.-
 ** Write a prototype migration script - the decision could be made on that. 
Things to consider:
 *** version numbers - labels or milestones?
 *** add a comment/ prepend a link to the source Jira issue on github side,
 *** add a comment/ prepend a link on the jira side to the new issue on github 
side (for people who access jira from blogs, mailing list archives and other 
sources that will have stale links),
 *** convert cross-issue automatic links in comments/ descriptions (as 
suggested by Robert),
 *** strategy to deal with sub-issues (hierarchies),
 *** maybe prefix (or postfix) the issue title on github side with the original 
LUCENE-XYZ key so that it is easier to search for a particular issue there?
 *** how to deal with user IDs (author, reporter, commenters)? Do they have to 
be github users? Will information about people not registered on github be lost?
 *** create an extra mapping file of old-issue-new-issue URLs for any potential 
future uses. 
 *** what to do with issue numbers in git/svn commits? These could be rewritten 
but 

[GitHub] [lucene-jira-archive] mocobeta opened a new issue, #7: Make a detailed migration plan

2022-06-30 Thread GitBox


mocobeta opened a new issue, #7:
URL: https://github.com/apache/lucene-jira-archive/issues/7

   It will take at least a few days, and there will be some moratorium time 
where GitHub issues are not yet enabled but a Jira snapshot has already been 
taken. We need a detailed migration plan to avoid possible conflicts/confusion.
   
   A draft plan would be:
   
   1. Announce on the mailing list that the migration has started, just before 
taking the Jira snapshot.
  - Issues/comments created after that should be manually migrated 
afterward.
   2. Run the download script to take a snapshot of the whole Lucene Jira.
  - This would take 4 hours~ (needs intervals between Jira API calls).
   3. Commit all attachments to `lucene-jira-archive` (this repository).
   4. Run the conversion script that generates GitHub importable data from the 
Jira dump.
  - This would take one or two hours depending on the speed of conversion.
   5. [First pass] Run the import script to initialize all issues and comments.
  - This would take 15 hours~ (needs intervals between GitHub API calls)
   6. [Second pass] Run the update script to create re-mapped cross-issues 
links.
  - This would take 24 hours~ (needs intervals between GitHub API calls)
   7. Manually recover migration errors if possible.
   8. Announce on the mailing list that the migration is finished.
   - GitHub issues are available at this point.
   - Issues should not be raised in Jira, and existing Jira issues should not 
be updated after that.
   9. Show some text that says "Jira is deprecated" when opening Jira issues.
   10. Add comments to each Jira issue that say "Moved to GitHub ".
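The interval-limited steps (2, 5, and 6 above) boil down to pacing loops like the sketch below. This is purely illustrative structure, not code from this repository; the interval and the call are placeholders:

```java
public class PacedApiCalls {
    public static void main(String[] args) throws InterruptedException {
        // Each remote call is followed by a fixed pause to stay under the
        // Jira/GitHub API rate limits; the values here are illustrative only.
        long intervalMillis = 10; // the real scripts would wait seconds
        for (int issue = 1; issue <= 3; issue++) {
            // Placeholder for a Jira/GitHub API request for one issue.
            System.out.println("fetched issue " + issue);
            Thread.sleep(intervalMillis);
        }
    }
}
```

The long wall-clock estimates in the plan come directly from multiplying such intervals by the number of issues and comments to migrate.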





[GitHub] [lucene] jpountz opened a new pull request, #999: LUCENE-10634: Speed up WANDScorer.

2022-06-30 Thread GitBox


jpountz opened a new pull request, #999:
URL: https://github.com/apache/lucene/pull/999

   This speeds up WANDScorer by computing scores of docs that are positioned on 
the next candidate competitive document, in order to potentially detect that no 
further match is possible before advancing scorers that are still located in 
the tail.
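A minimal numeric sketch of that check (all names and values below are hypothetical; the real logic lives in `WANDScorer`):

```java
public class WandScoreFirstSketch {
    public static void main(String[] args) {
        double leadScore = 1.2;           // actual score of the scorers positioned on the candidate doc
        double tailMaxScore = 0.5;        // sum of max scores of the still-unadvanced tail scorers
        double minCompetitiveScore = 2.0; // score at the top of the priority queue
        // If even the best possible total cannot be competitive, the tail
        // scorers never need to be advanced to this candidate at all.
        boolean skipTailAdvance = leadScore + tailMaxScore < minCompetitiveScore;
        System.out.println(skipTailAdvance); // prints "true"
    }
}
```

The win is that scoring the already-positioned scorers is cheaper than advancing tail scorers that turn out to be unnecessary.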





[jira] [Created] (LUCENE-10634) Speed up WANDScorer by computing scores before advancing tail scorers

2022-06-30 Thread Adrien Grand (Jira)
Adrien Grand created LUCENE-10634:
-

 Summary: Speed up WANDScorer by computing scores before advancing 
tail scorers
 Key: LUCENE-10634
 URL: https://issues.apache.org/jira/browse/LUCENE-10634
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand


While looking at performance numbers on LUCENE-10480, I noticed that it is 
often faster to compute a score in order to get a finer-grained estimate of the 
best score that the current document can possibly get, before advancing a tail 
scorer.

Making this change to WANDScorer yielded a small but reproducible speedup:

{noformat}
Task                      QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff              p-value
IntNRQ                          186.50 (11.8%)                   175.34 (19.1%)    -6.0% ( -33% -   28%)  0.234
HighTermTitleBDVSort            167.27 (20.6%)                   161.85 (17.2%)    -3.2% ( -34% -   43%)  0.591
MedSloppyPhrase                 194.77  (5.5%)                   190.45  (7.8%)    -2.2% ( -14% -   11%)  0.299
HighTermDayOfYearSort           229.61  (7.7%)                   225.74  (7.1%)    -1.7% ( -15% -   14%)  0.471
LowSloppyPhrase                  20.22  (4.3%)                    19.95  (4.8%)    -1.3% ( -10% -    8%)  0.366
TermDTSort                      319.62  (7.7%)                   316.78  (7.5%)    -0.9% ( -14% -   15%)  0.712
OrHighNotLow                   1856.44  (5.6%)                  1842.88  (5.7%)    -0.7% ( -11% -   11%)  0.682
AndMedOrHighHigh                 73.87  (3.8%)                    73.51  (3.6%)    -0.5% (  -7% -    7%)  0.677
OrHighNotHigh                  2000.56  (5.6%)                  1991.65  (6.9%)    -0.4% ( -12% -   12%)  0.823
LowPhrase                       106.90  (2.4%)                   106.61  (2.9%)    -0.3% (  -5% -    5%)  0.750
AndHighLow                     1661.80  (3.5%)                  1658.56  (3.7%)    -0.2% (  -7% -    7%)  0.865
Fuzzy2                          110.64  (1.8%)                   110.43  (1.9%)    -0.2% (  -3% -    3%)  0.752
HighTermMonthSort                73.74 (17.5%)                    73.68 (20.8%)    -0.1% ( -32% -   46%)  0.989
PKLookup                        242.86  (1.8%)                   242.75  (1.8%)    -0.0% (  -3% -    3%)  0.934
OrHighNotMed                   1454.98  (5.3%)                  1456.26  (5.8%)     0.1% ( -10% -   11%)  0.960
HighPhrase                      523.22  (2.9%)                   524.01  (2.6%)     0.2% (  -5% -    5%)  0.862
MedPhrase                       140.65  (2.7%)                   140.87  (2.9%)     0.2% (  -5% -    5%)  0.862
HighSloppyPhrase                  8.74  (4.6%)                     8.75  (5.5%)     0.2% (  -9% -   10%)  0.914
LowSpanNear                      28.05  (3.6%)                    28.14  (3.0%)     0.3% (  -6% -    7%)  0.777
MedSpanNear                       7.59  (3.5%)                     7.61  (3.4%)     0.3% (  -6% -    7%)  0.778
Respell                          67.62  (1.9%)                    67.82  (1.8%)     0.3% (  -3% -    4%)  0.595
OrAndHigMedAndHighMed           127.87  (3.1%)                   128.27  (4.0%)     0.3% (  -6% -    7%)  0.780
OrNotHighLow                   1513.24  (2.1%)                  1520.33  (2.6%)     0.5% (  -4% -    5%)  0.528
OrHighPhraseHighPhrase           25.26  (3.0%)                    25.38  (3.0%)     0.5% (  -5% -    6%)  0.616
OrNotHighMed                   1544.04  (4.5%)                  1552.26  (4.2%)     0.5% (  -7% -    9%)  0.697
AndHighHigh                      92.24  (4.8%)                    92.79  (6.6%)     0.6% ( -10% -   12%)  0.744
AndHighMed                      420.42  (3.1%)                   423.19  (5.2%)     0.7% (  -7% -    9%)  0.624
Fuzzy1                          117.42  (1.9%)                   118.19  (2.2%)     0.7% (  -3% -    4%)  0.307
MedTerm                        2209.36  (4.6%)                  2224.54  (5.3%)     0.7% (  -8% -   11%)  0.661
MedIntervalsOrdered             124.18  (8.1%)                   125.12  (8.0%)     0.8% ( -14% -   18%)  0.767
OrNotHighHigh                  1239.43  (4.6%)                  1249.63  (4.8%)     0.8% (  -8% -   10%)  0.580
AndHighOrMedMed                  95.02  (4.3%)                    95.82  (3.8%)     0.8% (  -6% -    9%)  0.515
Wildcard                        315.22 (23.3%)                   317.98 (22.5%)     0.9% ( -36% -   60%)  0.904
LowTerm                        2775.81  (4.0%)                  2808.32  (5.2%)     1.2% (  -7% -   10%)  0.425
HighIntervalsOrdered             14.24  (8.0%)                    14.41  (8.4%)     1.2% ( -14% -   19%)  0.646
LowIntervalsOrdered             120.62  (5.8%)                   122.09  (6.6%)     1.2% ( -10% -   14%)  0.534
HighSpanNear                     39.04  (6.7%)                    39.71  (4.3%)     1.7% (  -8% -   13%)  0.332
{noformat}

[jira] [Created] (LUCENE-10633) Dynamic pruning for queries sorted by SORTED(_SET) field

2022-06-30 Thread Adrien Grand (Jira)
Adrien Grand created LUCENE-10633:
-

 Summary: Dynamic pruning for queries sorted by SORTED(_SET) field
 Key: LUCENE-10633
 URL: https://issues.apache.org/jira/browse/LUCENE-10633
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand


LUCENE-9280 introduced the ability to dynamically prune non-competitive hits 
when sorting by a numeric field, by leveraging the points index to skip 
documents that do not compare better than the top of the priority queue 
maintained by the field comparator.

However queries sorted by a SORTED(_SET) field still look at all hits, which is 
disappointing. Could we leverage the terms index to skip hits?
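A toy sketch of the ordinal comparison such skipping would rely on (hypothetical names; nothing below is an existing Lucene API):

```java
public class OrdinalPruneSketch {
    public static void main(String[] args) {
        // For an ascending sort on a SORTED field, values compare by term ordinal.
        int topOrd = 5;       // ordinal at the top (worst entry) of the full priority queue
        int candidateOrd = 7; // ordinal of the candidate document's sorted value
        // A candidate whose ordinal is not below the queue top can never be
        // competitive, so the terms index could be used to skip such docs.
        boolean prunable = candidateOrd >= topOrd;
        System.out.println(prunable); // prints "true"
    }
}
```

This mirrors what LUCENE-9280 does with the points index for numeric sorts, with term ordinals playing the role of numeric values.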






[GitHub] [lucene-jira-archive] mocobeta opened a new issue, #6: Document issue label / template management policy

2022-06-30 Thread GitBox


mocobeta opened a new issue, #6:
URL: https://github.com/apache/lucene-jira-archive/issues/6

   - Explicitly define label families (e.g., `type:xxx`, `fixVersion:x.x.x`)
   - Clarify the mapping between labels and issue templates
   - Write documentation and make it accessible to developers (e.g., place it 
under `dev-docs` in the lucene repo)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jtibshirani commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing

2022-06-30 Thread GitBox


jtibshirani commented on code in PR #992:
URL: https://github.com/apache/lucene/pull/992#discussion_r910836731


##
lucene/core/src/java/org/apache/lucene/index/VectorValuesWriter.java:
##
@@ -26,233 +26,153 @@
 import org.apache.lucene.codecs.KnnVectorsWriter;
 import org.apache.lucene.search.DocIdSetIterator;
 import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.util.Accountable;
 import org.apache.lucene.util.ArrayUtil;
 import org.apache.lucene.util.Bits;
 import org.apache.lucene.util.BytesRef;
-import org.apache.lucene.util.Counter;
 import org.apache.lucene.util.RamUsageEstimator;
 
 /**
- * Buffers up pending vector value(s) per doc, then flushes when segment 
flushes.
+ * Buffers up pending vector value(s) per doc, then flushes when segment 
flushes. Used for {@code
+ * SimpleTextKnnVectorsWriter} and for vectors writers before v9.3.
  *
  * @lucene.experimental
  */
-class VectorValuesWriter {
-
-  private final FieldInfo fieldInfo;
-  private final Counter iwBytesUsed;
-  private final List vectors = new ArrayList<>();
-  private final DocsWithFieldSet docsWithField;
-
-  private int lastDocID = -1;
-
-  private long bytesUsed;
-
-  VectorValuesWriter(FieldInfo fieldInfo, Counter iwBytesUsed) {
-this.fieldInfo = fieldInfo;
-this.iwBytesUsed = iwBytesUsed;
-this.docsWithField = new DocsWithFieldSet();
-this.bytesUsed = docsWithField.ramBytesUsed();
-if (iwBytesUsed != null) {
-  iwBytesUsed.addAndGet(bytesUsed);
+public abstract class VectorValuesWriter extends KnnVectorsWriter {

Review Comment:
   Would renaming this to `BufferingKnnVectorsWriter` be clearer? I assumed it 
did something different because of the very general name `VectorValuesWriter`.
   
   I also wonder if we could update `SimpleTextKnnVectorsWriter` to use the new 
writer interface. Then we could move this class to the backwards-codecs 
package, because it would only be used in the old codec tests.



##
lucene/core/src/java/org/apache/lucene/index/VectorValuesConsumer.java:
##
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import org.apache.lucene.codecs.Codec;
+import org.apache.lucene.codecs.KnnVectorsFormat;
+import org.apache.lucene.codecs.KnnVectorsWriter;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.IOContext;
+import org.apache.lucene.util.Accountable;
+import org.apache.lucene.util.IOUtils;
+import org.apache.lucene.util.InfoStream;
+
+/**
+ * Streams vector values for indexing to the given codec's vectors writer. The 
codec's vectors
+ * writer is responsible for buffering and processing vectors.
+ */
+class VectorValuesConsumer {
+  private final Codec codec;
+  private final Directory directory;
+  private final SegmentInfo segmentInfo;
+  private final InfoStream infoStream;
+
+  private Accountable accountable = Accountable.NULL_ACCOUNTABLE;
+  private KnnVectorsWriter writer;
+
+  VectorValuesConsumer(
+  Codec codec, Directory directory, SegmentInfo segmentInfo, InfoStream 
infoStream) {
+this.codec = codec;
+this.directory = directory;
+this.segmentInfo = segmentInfo;
+this.infoStream = infoStream;
+  }
+
+  private void initKnnVectorsWriter(String fieldName) throws IOException {
+if (writer == null) {
+  KnnVectorsFormat fmt = codec.knnVectorsFormat();
+  if (fmt == null) {
+throw new IllegalStateException(
+"field=\""
++ fieldName
++ "\" was indexed as vectors but codec does not support 
vectors");
+  }
+  SegmentWriteState initialWriteState =
+  new SegmentWriteState(infoStream, directory, segmentInfo, null, 
null, IOContext.DEFAULT);
+  writer = fmt.fieldsWriter(initialWriteState);
+  accountable = writer;
+}
+  }
+
+  public void addField(FieldInfo fieldInfo) throws IOException {
+initKnnVectorsWriter(fieldInfo.name);
+writer.addField(fieldInfo);
+  }
+
+  public void addValue(FieldInfo fieldInfo, int docID, float[] vectorValue) 
throws IOException {
+writer.addValue(fieldInfo, docID, vectorValue);
+  }
+
+  void flush(SegmentWriteState 

[GitHub] [lucene] jtibshirani opened a new pull request, #998: LUCENE-10577: Add vectors format unit test and fix toString

2022-06-30 Thread GitBox


jtibshirani opened a new pull request, #998:
URL: https://github.com/apache/lucene/pull/998

   We forgot to add this unit test when introducing the new 9.3 vectors format.
   This commit adds the test and fixes issues it uncovered in toString.





[jira] [Commented] (LUCENE-10546) Update Faceting user guide

2022-06-30 Thread Egor Potemkin (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17560917#comment-17560917
 ] 

Egor Potemkin commented on LUCENE-10546:


I will work on this if no one else is already doing it. 

> Update Faceting user guide
> --
>
> Key: LUCENE-10546
> URL: https://issues.apache.org/jira/browse/LUCENE-10546
> Project: Lucene - Core
>  Issue Type: Wish
>  Components: modules/facet
>Reporter: Greg Miller
>Priority: Minor
>
> The  [facet user 
> guide|https://lucene.apache.org/core/4_1_0/facet/org/apache/lucene/facet/doc-files/userguide.html]
>  was written based on 4.1. Since there's been a fair amount of active 
> facet-related development over the last year+, it would be nice to review the 
> guide and see what updates make sense.






[GitHub] [lucene] jpountz commented on a diff in pull request #972: LUCENE-10480: Use BMM scorer for 2 clauses disjunction

2022-06-30 Thread GitBox


jpountz commented on code in PR #972:
URL: https://github.com/apache/lucene/pull/972#discussion_r910683270


##
lucene/core/src/java/org/apache/lucene/search/BlockMaxMaxscoreScorer.java:
##
@@ -0,0 +1,332 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collection;
+import java.util.Comparator;
+import java.util.LinkedList;
+import java.util.List;
+
+/** Scorer implementing Block-Max Maxscore algorithm */
+public class BlockMaxMaxscoreScorer extends Scorer {
+  // current doc ID of the leads
+  private int doc;
+
+  // doc id boundary that all scorers maxScore are valid
+  private int upTo;
+
+  // heap of scorers ordered by doc ID
+  private final DisiPriorityQueue essentialsScorers;
+
+  // list of scorers ordered by maxScore
+  private final LinkedList maxScoreSortedEssentialScorers;
+
+  private final DisiWrapper[] allScorers;
+
+  // sum of max scores of scorers in nonEssentialScorers list
+  private double nonEssentialMaxScoreSum;
+
+  private long cost;

Review Comment:
   nit: let's make it final



##
lucene/core/src/java/org/apache/lucene/search/BlockMaxMaxscoreScorer.java:
##
@@ -0,0 +1,322 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collection;
+import java.util.Comparator;
+import java.util.LinkedList;
+import java.util.List;
+
+/** Scorer implementing Block-Max Maxscore algorithm */
+public class BlockMaxMaxscoreScorer extends Scorer {
+  // current doc ID of the leads
+  private int doc;
+
+  // doc id boundary that all scorers maxScore are valid
+  private int upTo = -1;
+
+  // heap of scorers ordered by doc ID
+  private final DisiPriorityQueue essentialsScorers;
+  // list of scorers ordered by maxScore
+  private final LinkedList maxScoreSortedEssentialScorers;
+
+  private final DisiWrapper[] allScorers;
+
+  // sum of max scores of scorers in nonEssentialScorers list
+  private float nonEssentialMaxScoreSum;
+
+  private long cost;
+
+  private final MaxScoreSumPropagator maxScoreSumPropagator;
+
+  // scaled min competitive score
+  private float minCompetitiveScore = 0;
+
+  private int cachedScoredDoc = -1;
+  private float cachedScore = 0;
+
+  /**
+   * Constructs a Scorer that scores doc based on Block-Max-Maxscore (BMM) 
algorithm
+   * http://engineering.nyu.edu/~suel/papers/bmm.pdf . This algorithm has 
lower overhead compared to
+   * WANDScorer, and could be used for simple disjunction queries.
+   *
+   * @param weight The weight to be used.
+   * @param scorers The sub scorers this Scorer should iterate on for optional 
clauses
+   */
+  public BlockMaxMaxscoreScorer(Weight weight, List scorers) throws 
IOException {
+super(weight);
+
+this.doc = -1;
+this.allScorers = new DisiWrapper[scorers.size()];
+this.essentialsScorers = new DisiPriorityQueue(scorers.size());
+this.maxScoreSortedEssentialScorers = new LinkedList<>();
+
+long cost = 0;
+for (int i = 0; i < scorers.size(); i++) {
+  DisiWrapper w = new DisiWrapper(scorers.get(i));
+  cost += w.cost;
+  allScorers[i] = w;
+}
+
+this.cost = cost;
+maxScoreSumPropagator = new MaxScoreSumPropagator(scorers);
+  }
+
+  @Override
+  public 
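For readers unfamiliar with the essential/non-essential split that the fields above maintain, here is a toy sketch of the core MAXSCORE invariant (hypothetical names, not the code in this PR): with clauses sorted by maxScore, the longest prefix whose summed maxScore stays below the minimum competitive score can never produce a competitive hit on its own, so those clauses never need to lead iteration.

```java
/** Toy sketch of MAXSCORE's essential/non-essential clause partition. */
public class MaxScorePartition {
  /**
   * Returns how many leading clauses are non-essential, given per-clause
   * max scores sorted in ascending order: a doc matching only those clauses
   * can score at most their sum, which stays below minCompetitiveScore.
   */
  public static int nonEssentialCount(float[] maxScoresAsc, float minCompetitiveScore) {
    double sum = 0;
    int count = 0;
    for (float s : maxScoresAsc) {
      if (sum + s < minCompetitiveScore) {
        sum += s; // still below the bar: clause stays non-essential
        count++;
      } else {
        break; // this and all remaining clauses are essential
      }
    }
    return count;
  }

  public static void main(String[] args) {
    float[] maxScores = {0.5f, 1.0f, 2.0f};
    // with minCompetitiveScore = 1.6, the 0.5 and 1.0 clauses together max out at 1.5
    System.out.println(nonEssentialCount(maxScores, 1.6f)); // prints 2
  }
}
```

As minCompetitiveScore rises during collection, more clauses move into the non-essential prefix, which is why the scorer re-partitions whenever the score bar changes.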

[jira] [Created] (LUCENE-10632) Change getAllChildren to return all children regardless of the count

2022-06-30 Thread Yuting Gan (Jira)
Yuting Gan created LUCENE-10632:
---

 Summary: Change getAllChildren to return all children regardless 
of the count
 Key: LUCENE-10632
 URL: https://issues.apache.org/jira/browse/LUCENE-10632
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Yuting Gan


Currently, the getAllChildren functionality is implemented in a way that is 
similar to getTopChildren, where they only return children with count that is 
greater than zero.

However, the original getTopChildren in RangeFacetCounts returned all children 
whether or not the count was zero. This actually has good use cases and we 
should continue supporting the feature in getAllChildren, so that we will not 
lose it after properly supporting getTopChildren in RangeFacetCounts.

As discussed with [~gsmiller] in the [LUCENE-10614 
pr|https://github.com/apache/lucene/pull/974], allowing getAllChildren to 
behave differently from getTopChildren can actually be more helpful for users. 
If users want to get children with only positive counts, getTopChildren already 
supports this behavior. Therefore, the getAllChildren API should provide all 
children in all of the implementations, whether or not the count is zero.
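The proposed contrast between the two APIs can be sketched as follows (toy code with hypothetical names, not the actual facet module): getAllChildren reports every child, zero counts included, while getTopChildren filters to positive counts.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Toy illustration of the proposed getAllChildren vs getTopChildren contract. */
public class ChildCounts {
  /** All children, including those with a zero count (e.g., empty ranges). */
  public static List<String> allChildren(Map<String, Integer> counts) {
    return List.copyOf(counts.keySet());
  }

  /** Only children with a positive count (a real impl would also sort by count). */
  public static List<String> topChildren(Map<String, Integer> counts) {
    return counts.entrySet().stream()
        .filter(e -> e.getValue() > 0)
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    counts.put("0-10", 3);
    counts.put("10-20", 0); // empty range, still reported by getAllChildren
    counts.put("20-30", 1);
    System.out.println(allChildren(counts)); // prints [0-10, 10-20, 20-30]
    System.out.println(topChildren(counts)); // prints [0-10, 20-30]
  }
}
```

Keeping zero-count ranges visible is exactly the behavior the original RangeFacetCounts users relied on, e.g. to render every bucket of a histogram.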


