[GitHub] [lucene-jira-archive] mocobeta commented on issue #1: Fix markup conversion error
mocobeta commented on issue #1: URL: https://github.com/apache/lucene-jira-archive/issues/1#issuecomment-1171959478 I think pre- or post-processing to tweak the converter's result is risky, since any such text processing requires context (which block element the character sequence resides in; even LFs and spaces have special meanings in both Jira markup and Markdown). I don't know how much effort we should/can invest in this, but maybe we would need a customized version of the converter tool if we want to safely fix conversion errors. **For your information**: https://github.com/catcombo/jira2markdown I haven't closely looked at it, but it's built upon [pyparsing](https://github.com/pyparsing/pyparsing/) (a well-known Python parser generator library) and it already supports all Jira syntax; it'd be a good starting point. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
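To illustrate the risk mocobeta describes, here is a toy Python sketch (the helper and sample strings are invented for illustration and belong to neither tool) of how a context-free post-processing pass corrupts text that a context-aware parser like jira2markdown would handle correctly:

```python
import re

# Naive, context-free post-processing: rewrite Jira {{monospace}} to Markdown `code`.
def naive_convert(text):
    return re.sub(r"\{\{(.+?)\}\}", r"`\1`", text)

# Works on plain inline text...
inline = "Use {{docValueCount}} instead of {{NO_MORE_ORDS}}."
print(naive_convert(inline))

# ...but the same rewrite fires inside a {code} block too, because the regex
# has no idea which block element the character sequence resides in.
block = "{code}literal braces: {{not markup}}{code}"
print(naive_convert(block))  # the {{...}} inside the code block is wrongly rewritten
```

A parser-based converter tracks which block element it is inside and can leave the `{code}` body untouched, which is why building on pyparsing looks safer than regex post-processing.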
[GitHub] [lucene] zacharymorn commented on a diff in pull request #972: LUCENE-10480: Use BMM scorer for 2 clauses disjunction
zacharymorn commented on code in PR #972: URL: https://github.com/apache/lucene/pull/972#discussion_r911590359 ## lucene/core/src/java/org/apache/lucene/search/BlockMaxMaxscoreScorer.java: ## @@ -0,0 +1,322 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.search; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Collection; +import java.util.Comparator; +import java.util.LinkedList; +import java.util.List; + +/** Scorer implementing Block-Max Maxscore algorithm */ +public class BlockMaxMaxscoreScorer extends Scorer { + // current doc ID of the leads + private int doc; + + // doc id boundary that all scorers maxScore are valid + private int upTo = -1; + + // heap of scorers ordered by doc ID + private final DisiPriorityQueue essentialsScorers; + // list of scorers ordered by maxScore + private final LinkedList maxScoreSortedEssentialScorers; + + private final DisiWrapper[] allScorers; + + // sum of max scores of scorers in nonEssentialScorers list + private float nonEssentialMaxScoreSum; + + private long cost; + + private final MaxScoreSumPropagator maxScoreSumPropagator; + + // scaled min competitive score + private float minCompetitiveScore = 0; + + private int cachedScoredDoc = -1; + private float cachedScore = 0; + + /** + * Constructs a Scorer that scores doc based on Block-Max-Maxscore (BMM) algorithm + * http://engineering.nyu.edu/~suel/papers/bmm.pdf . This algorithm has lower overhead compared to + * WANDScorer, and could be used for simple disjunction queries. + * + * @param weight The weight to be used. 
+ * @param scorers The sub scorers this Scorer should iterate on for optional clauses + */ + public BlockMaxMaxscoreScorer(Weight weight, List scorers) throws IOException { +super(weight); + +this.doc = -1; +this.allScorers = new DisiWrapper[scorers.size()]; +this.essentialsScorers = new DisiPriorityQueue(scorers.size()); +this.maxScoreSortedEssentialScorers = new LinkedList<>(); + +long cost = 0; +for (int i = 0; i < scorers.size(); i++) { + DisiWrapper w = new DisiWrapper(scorers.get(i)); + cost += w.cost; + allScorers[i] = w; +} + +this.cost = cost; +maxScoreSumPropagator = new MaxScoreSumPropagator(scorers); + } + + @Override + public DocIdSetIterator iterator() { +// twoPhaseIterator needed to honor scorer.setMinCompetitiveScore guarantee +return TwoPhaseIterator.asDocIdSetIterator(twoPhaseIterator()); + } + + @Override + public TwoPhaseIterator twoPhaseIterator() { +DocIdSetIterator approximation = +new DocIdSetIterator() { + + @Override + public int docID() { +return doc; + } + + @Override + public int nextDoc() throws IOException { +return advance(doc + 1); + } + + @Override + public int advance(int target) throws IOException { +while (true) { + + if (target > upTo) { +updateMaxScoresAndLists(target); + } else { +// minCompetitiveScore might have increased, +// move potentially no-longer-competitive scorers from essential to non-essential +// list +movePotentiallyNonCompetitiveScorers(); + } + + assert target <= upTo; + + DisiWrapper top = essentialsScorers.top(); + + if (top == null) { +// all scorers in non-essential list, skip to next boundary or return no_more_docs +if (upTo == NO_MORE_DOCS) { + return doc = NO_MORE_DOCS; +} else { + target = upTo + 1; +} + } else { +// position all scorers in essential list to on or after target +while (top.doc < target) { + top.doc = top.iterator.advance(target); + top = essentialsScorers.updateTop(); +} + +if (top.doc == NO_MORE_DOCS) { + return doc = NO_MORE_DOCS; +} else if
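As background for the (truncated) diff above: the core of BMM is splitting the sub scorers into an essential and a non-essential list by maxScore, relative to the minimum competitive score. A rough Python sketch of that partitioning idea (an illustration with invented names, not Lucene's implementation) could look like:

```python
def partition_scorers(max_scores, min_competitive_score):
    """Split scorer indices into essential/non-essential lists by maxScore.

    A scorer is non-essential if, even combined with all lower-scoring
    scorers, its maxScore sum stays below the minimum competitive score:
    a doc matching only non-essential scorers can never be competitive.
    """
    order = sorted(range(len(max_scores)), key=lambda i: max_scores[i])
    non_essential, max_score_sum = [], 0.0
    for i in order:
        if max_score_sum + max_scores[i] < min_competitive_score:
            max_score_sum += max_scores[i]
            non_essential.append(i)
        else:
            break
    essential = order[len(non_essential):]
    return essential, non_essential, max_score_sum
```

In the real scorer this split is recomputed as `minCompetitiveScore` rises, which is what the `movePotentiallyNonCompetitiveScorers` step in the diff handles.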
[jira] [Created] (LUCENE-10636) Could the partial score sum from essential list scores be cached?
Zach Chen created LUCENE-10636: -- Summary: Could the partial score sum from essential list scores be cached? Key: LUCENE-10636 URL: https://issues.apache.org/jira/browse/LUCENE-10636 Project: Lucene - Core Issue Type: Improvement Reporter: Zach Chen This is a follow-up issue from discussion [https://github.com/apache/lucene/pull/972#discussion_r909300200] . Currently in the implementation of BlockMaxMaxscoreScorer, there's duplicated computation of summing up scores from essential list scorers. We would like to see if this duplicated computation can be cached without introducing much overhead or a data structure that might outweigh the benefit of caching. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
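A minimal sketch of the caching idea in question (illustrative Python with invented names, not Lucene's API): cache the essential-list score sum keyed by the current doc, so repeated score requests for the same doc don't recompute it.

```python
class CachedEssentialSum:
    """Cache the sum of essential-list scores for the current doc.

    `scorers` are stand-ins for sub scorers: callables mapping doc -> score.
    """

    def __init__(self, scorers):
        self.scorers = scorers
        self._cached_doc = -1      # mirrors a cachedScoredDoc-style field
        self._cached_sum = 0.0
        self.computations = 0      # instrumentation to show the cache working

    def essential_score_sum(self, doc):
        if doc != self._cached_doc:
            self._cached_sum = sum(score(doc) for score in self.scorers)
            self._cached_doc = doc
            self.computations += 1
        return self._cached_sum
```

Calling `essential_score_sum` twice for the same doc performs the summation once; the open question in the issue is whether this style of caching pays for itself in the scorer's hot loop.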
[GitHub] [lucene] zacharymorn commented on pull request #972: LUCENE-10480: Use BMM scorer for 2 clauses disjunction
zacharymorn commented on PR #972: URL: https://github.com/apache/lucene/pull/972#issuecomment-1171893517 > I like the idea of creating WANDScorer more explicitly in tests. It doesn't look easy though and this change is already great so I wonder if we should keep it for a follow-up. Sounds good. I've created this follow-up issue https://issues.apache.org/jira/browse/LUCENE-10635 . > I reviewed the change and left some very minor comments but it looks great to me overall. Let's get it in. Awesome, thanks for all the review and feedback @jpountz, I really appreciate it! Iterating on the solution and seeing it improved each time is a lot of fun and I enjoy this process a lot!
[jira] [Created] (LUCENE-10635) Ensure test coverage for WANDScorer after additional scorers get added
Zach Chen created LUCENE-10635: -- Summary: Ensure test coverage for WANDScorer after additional scorers get added Key: LUCENE-10635 URL: https://issues.apache.org/jira/browse/LUCENE-10635 Project: Lucene - Core Issue Type: Test Reporter: Zach Chen This is a follow-up issue from discussions [https://github.com/apache/lucene/pull/972#issuecomment-1170684358] & [https://github.com/apache/lucene/pull/972#pullrequestreview-1024377641] . As additional scorers such as BlockMaxMaxscoreScorer get added, some tests in TestWANDScorer that used to test WANDScorer now test BlockMaxMaxscoreScorer instead, reducing test coverage for WANDScorer. We would like to see how we can ensure TestWANDScorer reliably tests WANDScorer, perhaps by instantiating the scorer directly inside the tests?
[GitHub] [lucene] zacharymorn commented on a diff in pull request #972: LUCENE-10480: Use BMM scorer for 2 clauses disjunction
zacharymorn commented on code in PR #972: URL: https://github.com/apache/lucene/pull/972#discussion_r911579468 ## lucene/core/src/java/org/apache/lucene/search/DisiWrapper.java: ## @@ -39,6 +39,9 @@ public class DisiWrapper { // For WANDScorer long maxScore; + // For BlockMaxMaxscoreScorer + float maxScoreFloat; Review Comment: Ah I like this idea. Updated. ## lucene/core/src/java/org/apache/lucene/search/Boolean2ScorerSupplier.java: ## @@ -118,6 +118,21 @@ private Scorer getInternal(long leadCost) throws IOException { leadCost); } +// pure two terms disjunction +if (scoreMode == ScoreMode.TOP_SCORES +&& minShouldMatch == 0 Review Comment: Updated.
[GitHub] [lucene] zacharymorn commented on a diff in pull request #972: LUCENE-10480: Use BMM scorer for 2 clauses disjunction
zacharymorn commented on code in PR #972: URL: https://github.com/apache/lucene/pull/972#discussion_r911579245 ## lucene/core/src/java/org/apache/lucene/search/BlockMaxMaxscoreScorer.java: ## @@ -0,0 +1,332 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.search; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Collection; +import java.util.Comparator; +import java.util.LinkedList; +import java.util.List; + +/** Scorer implementing Block-Max Maxscore algorithm */ +public class BlockMaxMaxscoreScorer extends Scorer { Review Comment: Updated. ## lucene/core/src/java/org/apache/lucene/search/BlockMaxMaxscoreScorer.java: ## @@ -0,0 +1,332 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.search; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Collection; +import java.util.Comparator; +import java.util.LinkedList; +import java.util.List; + +/** Scorer implementing Block-Max Maxscore algorithm */ +public class BlockMaxMaxscoreScorer extends Scorer { + // current doc ID of the leads + private int doc; + + // doc id boundary that all scorers maxScore are valid + private int upTo; + + // heap of scorers ordered by doc ID + private final DisiPriorityQueue essentialsScorers; + + // list of scorers ordered by maxScore + private final LinkedList maxScoreSortedEssentialScorers; + + private final DisiWrapper[] allScorers; + + // sum of max scores of scorers in nonEssentialScorers list + private double nonEssentialMaxScoreSum; + + private long cost; Review Comment: Updated.
[jira] [Comment Edited] (LUCENE-10246) Support getting counts from "association" facets
[ https://issues.apache.org/jira/browse/LUCENE-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561222#comment-17561222 ] Greg Miller edited comment on LUCENE-10246 at 7/1/22 12:27 AM: --- [~shahrs87] I'd start by becoming familiar with the existing "association facet" implementations ({{TaxonomyFacetIntAssociations}} and {{TaxonomyFacetFloatAssociations}} as well as looking at some demo code like {{AssociationsFacetsExample}}). The API contract they implement represents results with {{FacetResult}}, which contains a list of {{LabelAndValue}} instances. {{LabelAndValue}} only models a single label along with a single numeric value. The value "usually" represents a total faceting count for a label in "non-association" facets, but with association faceting, the value takes on an aggregated weight "associated" with the label. The idea with this Jira is to be able to convey _both_ an aggregated weight and the count associated with a label. The best way to do that without creating a weird API for non-association cases is something that will probably take a little thought. Should we just put another "count" field in {{LabelAndValue}} and have both value and count be populated with a count for non-association cases? That sounds weird. So beyond understanding what's currently there, I think the next step is to think about the right way to evolve the API that doesn't create a weird interaction for non-association faceting, especially since those are more commonly used. Please reach out here as you have questions and I'll do my best to answer in a timely fashion. Thanks for having a look at this! > Support getting counts from "association" facets > > > Key: LUCENE-10246 > URL: https://issues.apache.org/jira/browse/LUCENE-10246 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Greg Miller >Priority: Minor > > We have these nice "association" facet implementations today that aggregate > "weights" from the docs that facet over, but they don't keep track of counts. > So the user can get "top-n" values for a dim by aggregated weight (great!), > but can't know how many docs matched each value. It would be nice to support > this so users could show the top-n values but _also_ show counts associated > with each.
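The API dilemma Greg describes can be made concrete with a hypothetical sketch (Python, invented field names; Lucene's actual LabelAndValue is a Java class carrying only a label and a value):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class LabelAndValue:
    # Hypothetical shape for the API question raised above; not Lucene's class.
    label: str
    value: float                  # aggregated weight (association facets) or count
    count: Optional[int] = None   # doc count, populated only for association facets

def describe(lv: LabelAndValue) -> str:
    """Render a facet result, with or without a separate doc count."""
    if lv.count is None:
        return f"{lv.label}: {lv.value:g}"
    return f"{lv.label}: weight={lv.value:g} across {lv.count} docs"
```

The awkwardness the comment points at shows up immediately: for non-association facets, `value` already is a count, so an extra `count` field would either be redundant or `None`, which is the "weird interaction" the API evolution needs to avoid.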
[jira] [Commented] (LUCENE-10246) Support getting counts from "association" facets
[ https://issues.apache.org/jira/browse/LUCENE-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561222#comment-17561222 ] Greg Miller commented on LUCENE-10246: -- [~shahrs87] I'd start by becoming familiar with the existing "association facet" implementations ({{TaxonomyFacetIntAssociations}} and {{TaxonomyFacetFloatAssociations}} as well as looking at some demo code like {{AssociationsFacetsExample}}). The API contract they implement represents results with {{FacetResult}}, which contains a list of {{LabelAndValue}} instances. {{LabelAndValue}} only models a single label along with a single numeric value. The value "usually" represents a total faceting count for a label in "non-association" facets, but with association faceting, the value takes on an aggregated weight "associated" with the label. The idea with this Jira is to be able to convey _both_ an aggregated weight and the count associated with a label. The best way to do that without creating a weird API for non-association cases is something that will probably take a little thought. Should we just put another "count" field in {{LabelAndValue}} and have both value and count be populated with a count for non-association cases? That sounds weird. So beyond understanding what's currently there, I think the next step is to think about the right way to evolve the API that doesn't create a weird interaction for non-association faceting, especially since those are more commonly used. > Support getting counts from "association" facets > > > Key: LUCENE-10246 > URL: https://issues.apache.org/jira/browse/LUCENE-10246 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Greg Miller >Priority: Minor > > We have these nice "association" facet implementations today that aggregate > "weights" from the docs that facet over, but they don't keep track of counts.
> So the user can get "top-n" values for a dim by aggregated weight (great!), > but can't know how many docs matched each value. It would be nice to support > this so users could show the top-n values but _also_ show counts associated > with each.
[jira] [Commented] (LUCENE-10246) Support getting counts from "association" facets
[ https://issues.apache.org/jira/browse/LUCENE-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561207#comment-17561207 ] Rushabh Shah commented on LUCENE-10246: --- [~gsmiller] [~sokolov] I am pretty new to the LUCENE project and want to contribute to small jiras to improve my lucene knowledge. Can you please help me scope the work required for this patch and point me to some relevant classes in the source code. Thank you. > Support getting counts from "association" facets > > > Key: LUCENE-10246 > URL: https://issues.apache.org/jira/browse/LUCENE-10246 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Greg Miller >Priority: Minor > > We have these nice "association" facet implementations today that aggregate > "weights" from the docs that facet over, but they don't keep track of counts. > So the user can get "top-n" values for a dim by aggregated weight (great!), > but can't know how many docs matched each value. It would be nice to support > this so users could show the top-n values but _also_ show counts associated > with each.
[jira] [Commented] (LUCENE-10345) remove non-NRT replication support
[ https://issues.apache.org/jira/browse/LUCENE-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561205#comment-17561205 ] Rushabh Shah commented on LUCENE-10345: --- Hi [~rcmuir] Can you please help me scope the changes required in this jira. I can try to put a patch. Thank you. > remove non-NRT replication support > -- > > Key: LUCENE-10345 > URL: https://issues.apache.org/jira/browse/LUCENE-10345 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > Lucene's {{replicator/}} module has really two replication APIs: NRT and the > older non-NRT. > The NRT replication is nice, it is actually JUST an API, hence there's no > network support or anything like that. > The non-NRT replication has some issues: > * Uses HTTP but in a non-standard way: binary blobs etc. Responses can't be > cached or leverage CDNs or anything. > * legacy plaintext HTTP 1.x only, No support for HTTPS, HTTP/2, etc > * Giant security hole: uses java (de)serialization > * legacy web apis (servlet). it is 2021 > * drags in third party http client, doesn't use the one in the standard > library > * drags in third party logging jars, doesn't use the one in the standard > library > I'd like to deprecate the non-NRT support in 9.x and remove in 10.x. I > suspect anyone using this module is using the newer NRT mode? If anyone is > still using the legacy non-NRT mode, please let me know on this issue and > give me your IP address, so I can try to pop a shell.
[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues
[ https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561203#comment-17561203 ] Greg Miller commented on LUCENE-10603: -- I pushed another commit that takes care of the remaining "production" code iteration. I think the next step is to knock out all remaining iteration patterns, which should only exist in "test" related code. When I get some more free time I'll take a pass at it, but might be a week or so. Happy to have someone beat me to it :) > Improve iteration of ords for SortedSetDocValues > > > Key: LUCENE-10603 > URL: https://issues.apache.org/jira/browse/LUCENE-10603 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Lu Xugang >Assignee: Lu Xugang >Priority: Trivial > Time Spent: 3h 40m > Remaining Estimate: 0h > > After SortedSetDocValues#docValueCount added since Lucene 9.2, should we > refactor the implementation of ords iterations using docValueCount instead of > NO_MORE_ORDS? > Similar how SortedNumericDocValues did > From > {code:java} > for (long ord = values.nextOrd();ord != SortedSetDocValues.NO_MORE_ORDS; ord > = values.nextOrd()) { > }{code} > to > {code:java} > for (int i = 0; i < values.docValueCount(); i++) { > long ord = values.nextOrd(); > }{code}
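The refactor the issue proposes can be mimicked with a toy Python stand-in (not Lucene's SortedSetDocValues, just the two iteration shapes side by side):

```python
NO_MORE_ORDS = -1  # stand-in sentinel; Lucene's constant lives on SortedSetDocValues

class FakeSortedSetDocValues:
    """Toy stand-in exposing the two APIs contrasted in the issue."""
    def __init__(self, ords):
        self._ords = ords
        self._i = 0
    def doc_value_count(self):
        return len(self._ords)
    def next_ord(self):
        if self._i >= len(self._ords):
            return NO_MORE_ORDS
        value = self._ords[self._i]
        self._i += 1
        return value

# Old style: loop until the NO_MORE_ORDS sentinel comes back.
values = FakeSortedSetDocValues([3, 7, 9])
seen_sentinel = []
ord_ = values.next_ord()
while ord_ != NO_MORE_ORDS:
    seen_sentinel.append(ord_)
    ord_ = values.next_ord()

# New style: ask for the count up front, as LUCENE-10603 migrates to.
values = FakeSortedSetDocValues([3, 7, 9])
seen_count = [values.next_ord() for _ in range(values.doc_value_count())]

assert seen_sentinel == seen_count == [3, 7, 9]
```

Both loops visit the same ords; the count-based form avoids the extra trailing call that only exists to observe the sentinel.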
[GitHub] [lucene] shahrs87 commented on pull request #907: LUCENE-10357 Ghost fields and postings/points
shahrs87 commented on PR #907: URL: https://github.com/apache/lucene/pull/907#issuecomment-1171746326 > could you now try to remove all instances of if (terms == Terms.EMPTY)? @jpountz, I tried to remove all the instances of `if (terms == Terms.EMPTY)` but couldn't remove the remaining ones in the patch. Otherwise it will cause test failures.
[jira] [Commented] (LUCENE-10628) Enable MatchingFacetSetCounts to use space partitioning data structures
[ https://issues.apache.org/jira/browse/LUCENE-10628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561200#comment-17561200 ] Marc D'Mello commented on LUCENE-10628: --- I am planning on working on this. > Enable MatchingFacetSetCounts to use space partitioning data structures > --- > > Key: LUCENE-10628 > URL: https://issues.apache.org/jira/browse/LUCENE-10628 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Marc D'Mello >Priority: Minor > > Currently, {{MatchingFacetSetCounts}} iterates over {{FacetSetMatcher}} > instances passed into it linearly. While this is fine in some cases, if we > have a large amount of {{FacetSetMatcher}}'s, this can be inefficient. We > should provide the option to users to enable the use of space partitioning > data structures (namely R trees and KD trees) so we can potentially scan over > these {{FacetSetMatcher}}'s in sub-linear time.
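As a one-dimensional illustration of why a space-partitioning index can beat the current linear scan (toy Python with an invented API, not Lucene's FacetSetMatcher): sorting range matchers by their lower bound lets a lookup prune every matcher that cannot possibly contain the point, which KD trees and R trees generalize to multiple dimensions.

```python
import bisect

class RangeMatchers1D:
    """Toy 1-D analogue of a set of range matchers.

    Sorting matchers by lower bound lets a point lookup skip every matcher
    whose lower bound already exceeds the point, instead of scanning all of
    them; KD trees / R trees extend this pruning to multiple dimensions.
    """
    def __init__(self, ranges):
        self.ranges = sorted(ranges)              # sorted by lower bound
        self.lower_bounds = [lo for lo, _ in self.ranges]

    def matching(self, point):
        # Only matchers with lo <= point can possibly match.
        end = bisect.bisect_right(self.lower_bounds, point)
        return [(lo, hi) for lo, hi in self.ranges[:end] if point <= hi]
```

This is only a sketch of the pruning idea; the actual trade-off the issue raises is whether the tree-building overhead is worth it for realistic matcher counts.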
[jira] [Commented] (LUCENE-10546) Update Faceting user guide
[ https://issues.apache.org/jira/browse/LUCENE-10546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561199#comment-17561199 ] Greg Miller commented on LUCENE-10546: -- Great, thanks [~epotiom]! I'm not aware of anyone else working on this. > Update Faceting user guide > -- > > Key: LUCENE-10546 > URL: https://issues.apache.org/jira/browse/LUCENE-10546 > Project: Lucene - Core > Issue Type: Wish > Components: modules/facet >Reporter: Greg Miller >Priority: Minor > > The [facet user > guide|https://lucene.apache.org/core/4_1_0/facet/org/apache/lucene/facet/doc-files/userguide.html] > was written based on 4.1. Since there's been a fair amount of active > facet-related development over the last year+, it would be nice to review the > guide and see what updates make sense.
[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues
[ https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561189#comment-17561189 ] ASF subversion and git services commented on LUCENE-10603: -- Commit 3e268805024cf98abb11f6de45b32403b088eb5b in lucene's branch refs/heads/branch_9x from Greg Miller [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=3e268805024 ] LUCENE-10603: Migrate remaining SSDV iteration to use docValueCount in production code (#1000) > Improve iteration of ords for SortedSetDocValues > > > Key: LUCENE-10603 > URL: https://issues.apache.org/jira/browse/LUCENE-10603 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Lu Xugang >Assignee: Lu Xugang >Priority: Trivial > Time Spent: 3h 40m > Remaining Estimate: 0h > > After SortedSetDocValues#docValueCount added since Lucene 9.2, should we > refactor the implementation of ords iterations using docValueCount instead of > NO_MORE_ORDS? > Similar how SortedNumericDocValues did > From > {code:java} > for (long ord = values.nextOrd();ord != SortedSetDocValues.NO_MORE_ORDS; ord > = values.nextOrd()) { > }{code} > to > {code:java} > for (int i = 0; i < values.docValueCount(); i++) { > long ord = values.nextOrd(); > }{code}
[GitHub] [lucene] gsmiller merged pull request #1000: LUCENE-10603: Migrate remaining SSDV iteration to use docValueCount in production code
gsmiller merged PR #1000: URL: https://github.com/apache/lucene/pull/1000
[GitHub] [lucene] gsmiller opened a new pull request, #1000: LUCENE-10603: Migrate remaining SSDV iteration to use docValueCount in production code
gsmiller opened a new pull request, #1000: URL: https://github.com/apache/lucene/pull/1000 PR only for backport. No review requested.
[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues
[ https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561187#comment-17561187 ] ASF subversion and git services commented on LUCENE-10603: -- Commit 5f2a4998a079278ada89ce7bfa3992673a91c5c9 in lucene's branch refs/heads/main from Greg Miller [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=5f2a4998a07 ] LUCENE-10603: Migrate remaining SSDV iteration to use docValueCount in production code (#995) > Improve iteration of ords for SortedSetDocValues > > > Key: LUCENE-10603 > URL: https://issues.apache.org/jira/browse/LUCENE-10603 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Lu Xugang >Assignee: Lu Xugang >Priority: Trivial > Time Spent: 3h 20m > Remaining Estimate: 0h > > After SortedSetDocValues#docValueCount added since Lucene 9.2, should we > refactor the implementation of ords iterations using docValueCount instead of > NO_MORE_ORDS? > Similar how SortedNumericDocValues did > From > {code:java} > for (long ord = values.nextOrd();ord != SortedSetDocValues.NO_MORE_ORDS; ord > = values.nextOrd()) { > }{code} > to > {code:java} > for (int i = 0; i < values.docValueCount(); i++) { > long ord = values.nextOrd(); > }{code}
[GitHub] [lucene] gsmiller merged pull request #995: LUCENE-10603: Migrate remaining SSDV iteration to use docValueCount in production code
gsmiller merged PR #995: URL: https://github.com/apache/lucene/pull/995
[GitHub] [lucene] gsmiller commented on pull request #995: LUCENE-10603: Migrate remaining SSDV iteration to use docValueCount in production code
gsmiller commented on PR #995: URL: https://github.com/apache/lucene/pull/995#issuecomment-1171671380 Merging as I've addressed the outstanding feedback and the change is otherwise straightforward. Thanks @jpountz for the suggestions and review!
[jira] [Comment Edited] (LUCENE-10624) Binary Search for Sparse IndexedDISI advanceWithinBlock & advanceExactWithinBlock
[ https://issues.apache.org/jira/browse/LUCENE-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561166#comment-17561166 ] Weiming Wu edited comment on LUCENE-10624 at 6/30/22 7:55 PM: -- I started a new AWS EC2 host and reran the test. The candidate vs. baseline performance is very close; therefore, my original benchmark data points are invalid. Maybe something got messed up during my previous test. I have crossed out my original benchmark data. We noticed a performance improvement in our system's use case because we're using parent-child doc blocks to index data and run some customized queries similar to BlockJoinQuery. We retrieve a lot of DocValues during the query. We only match one child doc and one parent doc per doc block, so the DocValues to retrieve are very sparse. For example, ||Lucene Doc ID||Parent ID||Parent Field A||Child ID||Child Field B|| |0|1|Fruit| | | |1| | |100|Apple| |2| | |101|Orange| |3|10001|Beverage| | | |4| | |201|Coke| |5| | |202|Water| I think the next steps could be: A) Find (or create, if none exists) a benchmark dataset that can show the performance improvement for sparse DocValues; B) Regarding [~jpountz]'s concern, it makes sense to me. We need some benchmark to know whether binary search or exponential search can cause a notable performance regression for the use case where "relatively dense fields that get advanced by small increments" was (Author: JIRAUSER290435): I started a new AWS EC2 host and reran the test. The performance candidate vs baseline is very close. Therefore, my original benchmark data points are invalid. Maybe there were some mess up during my previous test. I have crossed out my original benchmark data. We noticed performance improvement in our system's use case because we're using parent-child doc block to index data and run some customized queries similar to BlockJoinQuery. We retrieve a lot of DocValues during the query.
We only match one child doc and parent doc from one doc block so DocValues to retrieve are very sparse. For example, ||Lucene Doc ID||Parent ID||Parent Field A||Child ID||Child Field B|| |0|1|Fruit| | | |1| | |100|Apple| |2| | |101|Orange| |3|10001|Beverage| | | |4| | |201|Coke| |5| | |202|Water| I think the next step could be? A) Find (Create one if can't find) benchmark dataset that can show the performance improvement for sparse DocValues; B) For [~jpountz] 's concern, it makes sense to me. Need some benchmark to know whether binary search or exponential search can cause performance regression for use case where "relatively dense fields that get advanced by small increments" > Binary Search for Sparse IndexedDISI advanceWithinBlock & > advanceExactWithinBlock > - > > Key: LUCENE-10624 > URL: https://issues.apache.org/jira/browse/LUCENE-10624 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 9.0, 9.1, 9.2 >Reporter: Weiming Wu >Priority: Major > Attachments: baseline_sparseTaxis_searchsparse-sorted.0.log, > candiate-exponential-searchsparse-sorted.0.log, > candidate_sparseTaxis_searchsparse-sorted.0.log > > Time Spent: 1h > Remaining Estimate: 0h > > h3. Problem Statement > We noticed DocValue read performance regression with the iterative API when > upgrading from Lucene 5 to Lucene 9. Our latency is increased by 50%. The > degradation is similar to what's described in > https://issues.apache.org/jira/browse/SOLR-9599 > By analyzing profiling data, we found method "advanceWithinBlock" and > "advanceExactWithinBlock" for Sparse IndexedDISI is slow in Lucene 9 due to > their O(N) doc lookup algorithm. > h3. Changes > Used binary search algorithm to replace current O(N) lookup algorithm in > Sparse IndexedDISI "advanceWithinBlock" and "advanceExactWithinBlock" because > docs are in ascending order. > h3. Test > {code:java} > ./gradlew tidy > ./gradlew check {code} > h3. 
Benchmark > 06/30/2022 Update: The below benchmark data points are invalid. I started a > new AWS EC2 instance and run the test. The performance of candidate and > baseline are very close. > > -Ran sparseTaxis test cases from {color:#1d1c1d}luceneutil. Attached the > reports of baseline and candidates in attachments section.{color}- > -{color:#1d1c1d}1. Most cases have 5-10% search latency reduction.{color}- > -{color:#1d1c1d}2. Some highlights (>20%):{color}- > * -*{color:#1d1c1d}T0 green_pickup_latitude:[40.75 TO 40.9] > yellow_pickup_latitude:[40.75 TO 40.9] sort=null{color}*- > ** -{color:#1d1c1d}*Baseline:* 10973978+ hits hits in *726.81967 > msec*{color}- > ** -{color:#1d1c1d}*Candidate:* 10973978+ hits hits in *484.544594 > msec*{color}- > * -*{color:#1d1c1d}T0 cab_color:y cab_color:g
[GitHub] [lucene] wuwm commented on pull request #968: [LUCENE-10624] Binary Search for Sparse IndexedDISI advanceWithinBloc…
wuwm commented on PR #968: URL: https://github.com/apache/lucene/pull/968#issuecomment-1171617717 @yuzhoujianxia There is some discussion about whether binary or exponential search can cause performance regressions in some use cases. We need to address those concerns before merging. https://issues.apache.org/jira/browse/LUCENE-10624
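The change under discussion replaces the O(N) linear walk inside `advanceWithinBlock`/`advanceExactWithinBlock` with binary search over the block's ascending doc IDs. A minimal sketch of that idea (a hypothetical Python helper assuming the sparse block is an in-memory sorted list — the real IndexedDISI reads packed shorts from disk and keeps cursor state):

```python
import bisect

def advance_within_block(docs, index, target):
    """Advance to the first doc >= target within a sparse block.

    `docs` models the block's ascending doc-ID array and `index` the current
    cursor; binary search replaces the one-by-one scan, which is the change
    proposed for advanceWithinBlock. Returns (new_index, doc), or
    (len(docs), None) when the block has no doc >= target.
    Illustrative only, not the Lucene implementation.
    """
    i = bisect.bisect_left(docs, target, lo=index)
    if i >= len(docs):
        return len(docs), None
    return i, docs[i]
```

The jpountz concern quoted in the Jira thread still applies to this sketch: for dense blocks advanced by small increments, a plain binary search does more work than a short linear scan, which is why exponential (galloping) search from the current cursor is also on the table.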
[jira] [Updated] (LUCENE-10624) Binary Search for Sparse IndexedDISI advanceWithinBlock & advanceExactWithinBlock
[ https://issues.apache.org/jira/browse/LUCENE-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiming Wu updated LUCENE-10624: Description: h3. Problem Statement We noticed DocValue read performance regression with the iterative API when upgrading from Lucene 5 to Lucene 9. Our latency is increased by 50%. The degradation is similar to what's described in https://issues.apache.org/jira/browse/SOLR-9599 By analyzing profiling data, we found method "advanceWithinBlock" and "advanceExactWithinBlock" for Sparse IndexedDISI is slow in Lucene 9 due to their O(N) doc lookup algorithm. h3. Changes Used binary search algorithm to replace current O(N) lookup algorithm in Sparse IndexedDISI "advanceWithinBlock" and "advanceExactWithinBlock" because docs are in ascending order. h3. Test {code:java} ./gradlew tidy ./gradlew check {code} h3. Benchmark 06/30/2022 Update: The below benchmark data points are invalid. I started a new AWS EC2 instance and run the test. The performance of candidate and baseline are very close. -Ran sparseTaxis test cases from {color:#1d1c1d}luceneutil. Attached the reports of baseline and candidates in attachments section.{color}- -{color:#1d1c1d}1. Most cases have 5-10% search latency reduction.{color}- -{color:#1d1c1d}2. 
Some highlights (>20%):{color}- * -*{color:#1d1c1d}T0 green_pickup_latitude:[40.75 TO 40.9] yellow_pickup_latitude:[40.75 TO 40.9] sort=null{color}*- ** -{color:#1d1c1d}*Baseline:* 10973978+ hits hits in *726.81967 msec*{color}- ** -{color:#1d1c1d}*Candidate:* 10973978+ hits hits in *484.544594 msec*{color}- * -*{color:#1d1c1d}T0 cab_color:y cab_color:g sort=null{color}*- ** -{color:#1d1c1d}*Baseline:* 2300174+ hits hits in *95.698324 msec*{color}- ** -{color:#1d1c1d}*Candidate:* 2300174+ hits hits in *78.336193 msec*{color}- * -{color:#1d1c1d}*T1 cab_color:y cab_color:g sort=null*{color}- ** -{color:#1d1c1d}*Baseline:* 2300174+ hits hits in *391.565239 msec*{color}- ** -{color:#1d1c1d}*Candidate:* 300174+ hits hits in *227.592885 msec*{color}{*}{{*}}- * -{color:#1d1c1d}*...*{color}- was: h3. Problem Statement We noticed DocValue read performance regression with the iterative API when upgrading from Lucene 5 to Lucene 9. Our latency is increased by 50%. The degradation is similar to what's described in https://issues.apache.org/jira/browse/SOLR-9599 By analyzing profiling data, we found method "advanceWithinBlock" and "advanceExactWithinBlock" for Sparse IndexedDISI is slow in Lucene 9 due to their O(N) doc lookup algorithm. h3. Changes Used binary search algorithm to replace current O(N) lookup algorithm in Sparse IndexedDISI "advanceWithinBlock" and "advanceExactWithinBlock" because docs are in ascending order. h3. Test {code:java} ./gradlew tidy ./gradlew check {code} h3. Benchmark Ran sparseTaxis test cases from {color:#1d1c1d}luceneutil. Attached the reports of baseline and candidates in attachments section.{color} {color:#1d1c1d}1. Most cases have 5-10% search latency reduction.{color} {color:#1d1c1d}2. 
Some highlights (>20%):{color} * *{color:#1d1c1d}T0 green_pickup_latitude:[40.75 TO 40.9] yellow_pickup_latitude:[40.75 TO 40.9] sort=null{color}* ** {color:#1d1c1d}*Baseline:* 10973978+ hits hits in *726.81967 msec*{color} ** {color:#1d1c1d}*Candidate:* 10973978+ hits hits in *484.544594 msec*{color} * *{color:#1d1c1d}T0 cab_color:y cab_color:g sort=null{color}* ** {color:#1d1c1d}*Baseline:* 2300174+ hits hits in *95.698324 msec*{color} ** {color:#1d1c1d}*Candidate:* 2300174+ hits hits in *78.336193 msec*{color} * {color:#1d1c1d}*T1 cab_color:y cab_color:g sort=null*{color} ** {color:#1d1c1d}*Baseline:* 2300174+ hits hits in *391.565239 msec*{color} ** {color:#1d1c1d}*Candidate:* 300174+ hits hits in *227.592885 msec*{color}{*}{*} * {color:#1d1c1d}*...*{color} > Binary Search for Sparse IndexedDISI advanceWithinBlock & > advanceExactWithinBlock > - > > Key: LUCENE-10624 > URL: https://issues.apache.org/jira/browse/LUCENE-10624 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 9.0, 9.1, 9.2 >Reporter: Weiming Wu >Priority: Major > Attachments: baseline_sparseTaxis_searchsparse-sorted.0.log, > candiate-exponential-searchsparse-sorted.0.log, > candidate_sparseTaxis_searchsparse-sorted.0.log > > Time Spent: 50m > Remaining Estimate: 0h > > h3. Problem Statement > We noticed DocValue read performance regression with the iterative API when > upgrading from Lucene 5 to Lucene 9. Our latency is increased by 50%. The > degradation is similar to what's described in > https://issues.apache.org/jira/browse/SOLR-9599 > By analyzing profiling data, we found method "advanceWithinBlock" and >
[GitHub] [lucene] yuzhoujianxia commented on pull request #968: [LUCENE-10624] Binary Search for Sparse IndexedDISI advanceWithinBloc…
yuzhoujianxia commented on PR #968: URL: https://github.com/apache/lucene/pull/968#issuecomment-1171575344 Can we get this merged?
[GitHub] [lucene-jira-archive] mocobeta commented on issue #1: Fix markup conversion error
mocobeta commented on issue #1: URL: https://github.com/apache/lucene-jira-archive/issues/1#issuecomment-1171540949 Unfortunately, it was not so trivial: I forgot about code blocks. In code blocks, spaces and line-feed characters in the original text should be preserved, and my solution breaks them. I tried to deal with it with look-ahead and look-behind regexes, but it looks like that didn't help. I don't think it's solvable with regular expressions.
[GitHub] [lucene-jira-archive] mikemccand commented on issue #1: Fix markup conversion error
mikemccand commented on issue #1: URL: https://github.com/apache/lucene-jira-archive/issues/1#issuecomment-1171490096 > Looks like the converter library does not support Carriage Return `\r` and succeeding spaces after Line Feed `\n` Sigh, will our species ever get past the different EOL characters/problems!! Thanks for tracking this down @mocobeta.
[GitHub] [lucene-jira-archive] mocobeta commented on issue #1: Fix markup conversion error
mocobeta commented on issue #1: URL: https://github.com/apache/lucene-jira-archive/issues/1#issuecomment-1171467919 The conversion tool seems to erase consecutive LFs (`\n\n`); this causes indent errors in Markdown. The removed LFs can be recovered by this regex (hack). ``` text = re.sub(r"\n\s*(?!\s*-)", "\n\n", text) ``` ![Screenshot from 2022-07-01 01-45-47](https://user-images.githubusercontent.com/1825333/176733562-1835e865-a597-4b59-89c5-2b51d8e7baed.png)
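A quick demonstration of what that regex does in practice (standalone Python; the sample strings here are made up for illustration):

```python
import re

def recover_blank_lines(text):
    # The hack quoted above: re-expand a newline plus any trailing whitespace
    # into a blank line, unless a "-" list item follows (the negative
    # lookahead keeps list items attached to their list).
    return re.sub(r"\n\s*(?!\s*-)", "\n\n", text)
```

As the later comment in this thread notes, this is exactly the kind of context-free text processing that breaks inside code blocks, where the original spaces and line feeds must be preserved.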
[jira] [Commented] (LUCENE-10627) Using CompositeByteBuf to Reduce Memory Copy
[ https://issues.apache.org/jira/browse/LUCENE-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561107#comment-17561107 ] LuYunCheng commented on LUCENE-10627: - It is a nice suggestion; I'll try to incorporate it. > Using CompositeByteBuf to Reduce Memory Copy > > > Key: LUCENE-10627 > URL: https://issues.apache.org/jira/browse/LUCENE-10627 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs, core/store >Reporter: LuYunCheng >Priority: Major > > Code: [https://github.com/apache/lucene/pull/987] > I see When Lucene Do flush and merge store fields, need many memory copies: > {code:java} > Lucene Merge Thread #25940]" #906546 daemon prio=5 os_prio=0 cpu=20503.95ms > elapsed=68.76s tid=0x7ee990002c50 nid=0x3aac54 runnable > [0x7f17718db000] > java.lang.Thread.State: RUNNABLE > at > org.apache.lucene.store.ByteBuffersDataOutput.toArrayCopy(ByteBuffersDataOutput.java:271) > at > org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.flush(CompressingStoredFieldsWriter.java:239) > at > org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.finishDocument(CompressingStoredFieldsWriter.java:169) > at > org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.merge(CompressingStoredFieldsWriter.java:654) > at > org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:228) > at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:105) > at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4760) > at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4364) > at > org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:5923) > at > org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624) > at > org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:100) > at >
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:682) > {code} > When Lucene *CompressingStoredFieldsWriter* do flush documents, it needs many > memory copies: > With Lucene90 using {*}LZ4WithPresetDictCompressionMode{*}: > # bufferedDocs.toArrayCopy copy blocks into one continue content for chunk > compress > # compressor copy dict and data into one block buffer > # do compress > # copy compressed data out > With Lucene90 using {*}DeflateWithPresetDictCompressionMode{*}: > # bufferedDocs.toArrayCopy copy blocks into one continue content for chunk > compress > # do compress > # copy compressed data out > > I think we can use CompositeByteBuf to reduce temp memory copies: > # we do not have to *bufferedDocs.toArrayCopy* when just need continues > content for chunk compress > > I write a simple mini benchamrk in test code ([link > |https://github.com/apache/lucene/blob/5a406a5c483c7fadaf0e8a5f06732c79ad174d11/lucene/core/src/test/org/apache/lucene/codecs/lucene90/compressing/TestCompressingStoredFieldsFormat.java#L353]): > *LZ4WithPresetDict run* Capacity:41943040(bytes) , iter 10times: Origin > elapse:5391ms , New elapse:5297ms > *DeflateWithPresetDict run* Capacity:41943040(bytes), iter 10times: Origin > elapse:{*}115ms{*}, New elapse:{*}12ms{*} > > And I run runStoredFieldsBenchmark with doc_limit=-1: > shows: > ||Msec to index||BEST_SPEED ||BEST_COMPRESSION|| > |Baseline|318877.00|606288.00| > |Candidate|314442.00|604719.00| -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
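The copy the issue targets is the `bufferedDocs.toArrayCopy` step that glues the buffered blocks into one contiguous array before compression. The idea of compressing a list of blocks without first concatenating them can be sketched with zlib's streaming API — this is illustrative Python only, not the CompositeByteBuf change in the Java PR:

```python
import zlib

def compress_concat(chunks):
    # Copy-then-compress: glue all blocks into one contiguous buffer first.
    # This models the toArrayCopy pattern the issue wants to avoid.
    return zlib.compress(b"".join(chunks))

def compress_streaming(chunks):
    # Feed blocks to the compressor one at a time; no large intermediate copy.
    # This models compressing over a composite view of the buffered blocks.
    c = zlib.compressobj()
    out = [c.compress(chunk) for chunk in chunks]
    out.append(c.flush())
    return b"".join(out)
```

Both paths produce a stream that decompresses to the same bytes; the streaming path just never materializes the concatenated input, which is the saving the mini benchmark in the issue is measuring.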
[GitHub] [lucene] jpountz commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing
jpountz commented on code in PR #992: URL: https://github.com/apache/lucene/pull/992#discussion_r911190808 ## lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java: ## @@ -24,28 +24,40 @@ import org.apache.lucene.index.DocIDMerger; import org.apache.lucene.index.FieldInfo; import org.apache.lucene.index.MergeState; +import org.apache.lucene.index.Sorter; import org.apache.lucene.index.VectorValues; import org.apache.lucene.search.TopDocs; +import org.apache.lucene.util.Accountable; import org.apache.lucene.util.Bits; import org.apache.lucene.util.BytesRef; /** Writes vectors to an index. */ -public abstract class KnnVectorsWriter implements Closeable { +public abstract class KnnVectorsWriter implements Accountable, Closeable { /** Sole constructor */ protected KnnVectorsWriter() {} - /** Write all values contained in the provided reader */ - public abstract void writeField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader) + /** Add new field for indexing */ + public abstract void addField(FieldInfo fieldInfo) throws IOException; + + /** Add new docID with its vector value to the given field for indexing */ + public abstract void addValue(FieldInfo fieldInfo, int docID, float[] vectorValue) + throws IOException; Review Comment: It's not great to need to pass the field on every value and require implementations to look up the right data structure on every doc. Should we add one more layer to the API to look more like this: ``` KnnFieldVectorsWriter { addValue(int docID, float[] vectorValue); } KnnVectorsWriter { KnnFieldVectorsWriter addField(FieldInfo info); flush(int maxDoc); // merge(), etc. 
} ``` ## lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java: ## @@ -24,28 +24,40 @@ import org.apache.lucene.index.DocIDMerger; import org.apache.lucene.index.FieldInfo; import org.apache.lucene.index.MergeState; +import org.apache.lucene.index.Sorter; import org.apache.lucene.index.VectorValues; import org.apache.lucene.search.TopDocs; +import org.apache.lucene.util.Accountable; import org.apache.lucene.util.Bits; import org.apache.lucene.util.BytesRef; /** Writes vectors to an index. */ -public abstract class KnnVectorsWriter implements Closeable { +public abstract class KnnVectorsWriter implements Accountable, Closeable { /** Sole constructor */ protected KnnVectorsWriter() {} - /** Write all values contained in the provided reader */ - public abstract void writeField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader) + /** Add new field for indexing */ + public abstract void addField(FieldInfo fieldInfo) throws IOException; + + /** Add new docID with its vector value to the given field for indexing */ + public abstract void addValue(FieldInfo fieldInfo, int docID, float[] vectorValue) + throws IOException; + + /** Flush all buffered data on disk * */ + public abstract void flush(int maxDoc, Sorter.DocMap sortMap) throws IOException; + + /** Write field for merging */ + public abstract void writeFieldForMerging(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader) Review Comment: Is it the same as `mergeXXXField` in `DocValuesConsumer` or `mergeOneField` in `PointsWriter`? Maybe we should rename to `mergeOneField` and make this method responsible for creating the merged view (instead of doing it on top)? 
## lucene/core/src/java/org/apache/lucene/codecs/perfield/PerFieldKnnVectorsFormat.java: ## @@ -94,17 +95,61 @@ public KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException private class FieldsWriter extends KnnVectorsWriter { private final Map formats; private final Map suffixes = new HashMap<>(); +private final Map> writersForFields = +new IdentityHashMap<>(); private final SegmentWriteState segmentWriteState; +// if there is a single writer, cache it for faster indexing +private KnnVectorsWriter singleWriter; Review Comment: We should design the API in such a way that such tricks are not needed, I left a comment on `KnnVectorsWriter`.
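The two-layer shape suggested in the review — `addField` hands back a per-field writer so `addValue` no longer has to re-resolve the field's data structure on every document — can be sketched like this (a Python stand-in with invented names; the real interface would be Java, take a `FieldInfo`, and also cover flush/merge):

```python
class KnnFieldVectorsWriterSketch:
    """Per-field buffer: add_value() needs no per-call field lookup."""

    def __init__(self, field_name):
        self.field_name = field_name
        self.vectors = {}  # docID -> vector

    def add_value(self, doc_id, vector):
        self.vectors[doc_id] = vector


class KnnVectorsWriterSketch:
    """Top-level writer: add_field() returns the per-field writer once,
    so callers hold on to it instead of passing the field every time."""

    def __init__(self):
        self.fields = {}

    def add_field(self, field_name):
        w = KnnFieldVectorsWriterSketch(field_name)
        self.fields[field_name] = w
        return w
```

With this shape, tricks like caching a `singleWriter` become unnecessary: the indexing chain already holds the right per-field object.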
[jira] [Updated] (LUCENE-10627) Using CompositeByteBuf to Reduce Memory Copy
[ https://issues.apache.org/jira/browse/LUCENE-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LuYunCheng updated LUCENE-10627: Description: Code: [https://github.com/apache/lucene/pull/987] I see When Lucene Do flush and merge store fields, need many memory copies: {code:java} Lucene Merge Thread #25940]" #906546 daemon prio=5 os_prio=0 cpu=20503.95ms elapsed=68.76s tid=0x7ee990002c50 nid=0x3aac54 runnable [0x7f17718db000] java.lang.Thread.State: RUNNABLE at org.apache.lucene.store.ByteBuffersDataOutput.toArrayCopy(ByteBuffersDataOutput.java:271) at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.flush(CompressingStoredFieldsWriter.java:239) at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.finishDocument(CompressingStoredFieldsWriter.java:169) at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.merge(CompressingStoredFieldsWriter.java:654) at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:228) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:105) at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4760) at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4364) at org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:5923) at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624) at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:100) at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:682) {code} When Lucene *CompressingStoredFieldsWriter* do flush documents, it needs many memory copies: With Lucene90 using {*}LZ4WithPresetDictCompressionMode{*}: # bufferedDocs.toArrayCopy copy blocks into one continue content for chunk compress # compressor copy dict and data into one block buffer # do compress # copy compressed data out With Lucene90 
using {*}DeflateWithPresetDictCompressionMode{*}: # bufferedDocs.toArrayCopy copy blocks into one continue content for chunk compress # do compress # copy compressed data out I think we can use CompositeByteBuf to reduce temp memory copies: # we do not have to *bufferedDocs.toArrayCopy* when just need continues content for chunk compress I write a simple mini benchamrk in test code ([link |https://github.com/apache/lucene/blob/5a406a5c483c7fadaf0e8a5f06732c79ad174d11/lucene/core/src/test/org/apache/lucene/codecs/lucene90/compressing/TestCompressingStoredFieldsFormat.java#L353]): *LZ4WithPresetDict run* Capacity:41943040(bytes) , iter 10times: Origin elapse:5391ms , New elapse:5297ms *DeflateWithPresetDict run* Capacity:41943040(bytes), iter 10times: Origin elapse:{*}115ms{*}, New elapse:{*}12ms{*} And I run runStoredFieldsBenchmark with doc_limit=-1: shows: ||Msec to index||BEST_SPEED ||BEST_COMPRESSION|| |Baseline|318877.00|606288.00| |Candidate|314442.00|604719.00| was: Code: [https://github.com/apache/lucene/pull/987] I see When Lucene Do flush and merge store fields, need many memory copies: {code:java} Lucene Merge Thread #25940]" #906546 daemon prio=5 os_prio=0 cpu=20503.95ms elapsed=68.76s tid=0x7ee990002c50 nid=0x3aac54 runnable [0x7f17718db000] java.lang.Thread.State: RUNNABLE at org.apache.lucene.store.ByteBuffersDataOutput.toArrayCopy(ByteBuffersDataOutput.java:271) at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.flush(CompressingStoredFieldsWriter.java:239) at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.finishDocument(CompressingStoredFieldsWriter.java:169) at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.merge(CompressingStoredFieldsWriter.java:654) at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:228) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:105) at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4760) at 
org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4364) at org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:5923) at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624) at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:100) at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:682) {code} When Lucene *CompressingStoredFieldsWriter* do flush documents, it needs many memory copies: With Lucene90 using {*}LZ4WithPresetDictCompressionMode{*}: # bufferedDocs.toArrayCopy copy blocks into one continue content for chunk compress # compressor copy dict and data into one block buffer # do
[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira
[ https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561089#comment-17561089 ] Tomoko Uchida commented on LUCENE-10557: I'm sorry for the noise - Jira's special emojis should be converted to corresponding Unicode emojis. This is a test post to make sure the mapping is correct. (y) (n) (i) (/) (x) (!) (+) (-) (?) (on) (off) (*) (*r) (*g) (*b) (*) (flag) (flagoff) > Migrate to GitHub issue from Jira > - > > Key: LUCENE-10557 > URL: https://issues.apache.org/jira/browse/LUCENE-10557 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > Attachments: Screen Shot 2022-06-29 at 11.02.35 AM.png, > image-2022-06-29-13-36-57-365.png, screenshot-1.png > > Time Spent: 20m > Remaining Estimate: 0h > > A few (not the majority) Apache projects already use the GitHub issue instead > of Jira. For example, > Airflow: [https://github.com/apache/airflow/issues] > BookKeeper: [https://github.com/apache/bookkeeper/issues] > So I think it'd be technically possible that we move to GitHub issue. I have > little knowledge of how to proceed with it, I'd like to discuss whether we > should migrate to it, and if so, how to smoothly handle the migration. > The major tasks would be: > * (/) Get a consensus about the migration among committers > * (/) Choose issues that should be moved to GitHub - We'll migrate all > issues towards an atomic switch to GitHub if no major technical obstacles > show up. > ** Discussion thread > [https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12] > ** -Conclusion for now: We don't migrate any issues. Only new issues should > be opened on GitHub.- > ** Write a prototype migration script - the decision could be made on that. > Things to consider: > *** version numbers - labels or milestones? 
> *** add a comment/ prepend a link to the source Jira issue on github side, > *** add a comment/ prepend a link on the jira side to the new issue on > github side (for people who access jira from blogs, mailing list archives and > other sources that will have stale links), > *** convert cross-issue automatic links in comments/ descriptions (as > suggested by Robert), > *** strategy to deal with sub-issues (hierarchies), > *** maybe prefix (or postfix) the issue title on github side with the > original LUCENE-XYZ key so that it is easier to search for a particular issue > there? > *** how to deal with user IDs (author, reporter, commenters)? Do they have > to be github users? Will information about people not registered on github be > lost? > *** create an extra mapping file of old-issue-new-issue URLs for any > potential future uses. > *** what to do with issue numbers in git/svn commits? These could be > rewritten but it'd change the entire git history tree - I don't think this is > practical, while doable. 
> * Prepare a complete migration tool > ** See https://github.com/apache/lucene-jira-archive/issues/5 > * Build the convention for issue label/milestone management > ** See [https://github.com/apache/lucene-jira-archive/issues/6] > ** Do some experiments on a sandbox repository > [https://github.com/mocobeta/sandbox-lucene-10557] > ** Make documentation for metadata (label/milestone) management > * (/) Enable Github issue on the lucene's repository > ** Raise an issue on INFRA > ** (Create an issue-only private repository for sensitive issues if it's > needed and allowed) > ** Set a mail hook to > [issues@lucene.apache.org|mailto:issues@lucene.apache.org] (many thanks to > the general mail group name) > * Set a schedule for migration > ** See [https://github.com/apache/lucene-jira-archive/issues/7] > ** Give some time to committers to play around with issues/labels/milestones > before the actual migration > ** Make an announcement on the mail lists > ** Show some text messages when opening a new Jira issue (in issue template?) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
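The emoticon-to-Unicode mapping being verified in the test post above can be sketched as a small lookup table. The specific target emojis below are illustrative assumptions, not the migration script's actual table:

```python
import re

# Hypothetical Jira-emoticon -> Unicode mapping; the real table is
# maintained in the migration tooling and may differ.
JIRA_EMOJIS = {
    "(y)": "👍", "(n)": "👎", "(i)": "ℹ️", "(/)": "✅", "(x)": "❌",
    "(!)": "⚠️", "(+)": "➕", "(-)": "➖", "(?)": "❓",
    "(on)": "💡", "(off)": "🔅", "(*)": "⭐", "(flag)": "🚩",
}

# Try longer tokens first so "(on)" is matched before "(n)", etc.
_PATTERN = re.compile(
    "|".join(re.escape(k) for k in sorted(JIRA_EMOJIS, key=len, reverse=True))
)

def replace_jira_emojis(text: str) -> str:
    """Replace Jira emoticon markup with Unicode emojis."""
    return _PATTERN.sub(lambda m: JIRA_EMOJIS[m.group(0)], text)
```

A table-driven replacement like this is easy to verify exhaustively with a test post such as the one above.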
[GitHub] [lucene-jira-archive] mocobeta commented on issue #1: Fix markup conversion error
mocobeta commented on issue #1: URL: https://github.com/apache/lucene-jira-archive/issues/1#issuecomment-1171325947 ![Screenshot from 2022-06-30 23-54-46](https://user-images.githubusercontent.com/1825333/176709310-dbe249df-5f86-439d-95ec-cbe932905d16.png) Indents are still not preserved - this should be another problem. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-jira-archive] mocobeta commented on issue #1: Fix markup conversion error
mocobeta commented on issue #1: URL: https://github.com/apache/lucene-jira-archive/issues/1#issuecomment-1171316604

Looks like the converter library does not support Carriage Return `\r` or spaces succeeding Line Feed `\n`, and that causes the conversion errors. This quick fix in pre-processing may solve many conversion errors.

```
text = re.sub(r"\r\n\s*", "\n", text)
```
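For reference, the quick fix quoted above can be wrapped and exercised on its own. This is a sketch only; as the follow-up discussion notes, `\s*` also swallows indentation that is significant in Jira markup, so it is a stop-gap rather than a safe general fix:

```python
import re

def normalize_line_endings(text: str) -> str:
    # Collapse CRLF (plus any whitespace that follows it) into a single
    # LF, mirroring the quick fix above. Note this also drops leading
    # spaces after the newline -- lossy for markup where indentation
    # matters, which is why it is only a pre-processing stop-gap.
    return re.sub(r"\r\n\s*", "\n", text)
```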
[GitHub] [lucene] mayya-sharipova commented on pull request #992: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova commented on PR #992: URL: https://github.com/apache/lucene/pull/992#issuecomment-1171306876

> I am a bit surprised about the benchmark results. In [LUCENE-10375](https://issues.apache.org/jira/browse/LUCENE-10375), we found that writing all vectors to disk before building the graph sped up indexing (not just merging). This change goes back to the strategy of using on-heap vectors to build the graph, so I'd expect a slowdown. Here are the benchmark results from that issue:

@jtibshirani Thanks for the initial review. "write vectors" here means the time for the whole flush operation. For the main branch, since we build a graph during flush, it takes a lot of time (840392 msec), while for this PR the flush operation is fast (1017 msec).
[jira] [Resolved] (LUCENE-10581) Optimize stored fields merges on the first segment
[ https://issues.apache.org/jira/browse/LUCENE-10581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-10581. --- Resolution: Won't Fix > Optimize stored fields merges on the first segment > -- > > Key: LUCENE-10581 > URL: https://issues.apache.org/jira/browse/LUCENE-10581 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Time Spent: 1h 20m > Remaining Estimate: 0h > > This is mostly repurposing LUCENE-10573. Even though our merge policies no > longer perform quadratic merging, it's still possible to configure them with > low merge factors (e.g. 2) or they might decide to create unbalanced merges > where the biggest segment of the merge accounts for a large part of the > merge. In such cases, copying compressed data directly still yields > significant benefits.
[GitHub] [lucene] jpountz commented on pull request #892: LUCENE-10581: Optimize stored fields bulk merges on the first segment
jpountz commented on PR #892: URL: https://github.com/apache/lucene/pull/892#issuecomment-1171282676

Thinking more about it, I'm inclined not to merge this change. In the normal case when merges are balanced, it doesn't help, because the first segment would generally have a dirty block pretty early. I tried to reason through whether other use-cases would benefit from this change, but I don't think that any would benefit significantly.
[GitHub] [lucene] jpountz closed pull request #892: LUCENE-10581: Optimize stored fields bulk merges on the first segment
jpountz closed pull request #892: LUCENE-10581: Optimize stored fields bulk merges on the first segment URL: https://github.com/apache/lucene/pull/892
[jira] [Updated] (LUCENE-10557) Migrate to GitHub issue from Jira
[ https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomoko Uchida updated LUCENE-10557: --- Description: A few (not the majority) Apache projects already use the GitHub issue instead of Jira. For example, Airflow: [https://github.com/apache/airflow/issues] BookKeeper: [https://github.com/apache/bookkeeper/issues] So I think it'd be technically possible that we move to GitHub issue. I have little knowledge of how to proceed with it, I'd like to discuss whether we should migrate to it, and if so, how to smoothly handle the migration. The major tasks would be: * (/) Get a consensus about the migration among committers * (/) Choose issues that should be moved to GitHub - We'll migrate all issues towards an atomic switch to GitHub if no major technical obstacles show up. ** Discussion thread [https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12] ** -Conclusion for now: We don't migrate any issues. Only new issues should be opened on GitHub.- ** Write a prototype migration script - the decision could be made on that. Things to consider: *** version numbers - labels or milestones? *** add a comment/ prepend a link to the source Jira issue on github side, *** add a comment/ prepend a link on the jira side to the new issue on github side (for people who access jira from blogs, mailing list archives and other sources that will have stale links), *** convert cross-issue automatic links in comments/ descriptions (as suggested by Robert), *** strategy to deal with sub-issues (hierarchies), *** maybe prefix (or postfix) the issue title on github side with the original LUCENE-XYZ key so that it is easier to search for a particular issue there? *** how to deal with user IDs (author, reporter, commenters)? Do they have to be github users? Will information about people not registered on github be lost? *** create an extra mapping file of old-issue-new-issue URLs for any potential future uses. 
*** what to do with issue numbers in git/svn commits? These could be rewritten but it'd change the entire git history tree - I don't think this is practical, while doable. * Prepare a complete migration tool ** See https://github.com/apache/lucene-jira-archive/issues/5 * Build the convention for issue label/milestone management ** See [https://github.com/apache/lucene-jira-archive/issues/6] ** Do some experiments on a sandbox repository [https://github.com/mocobeta/sandbox-lucene-10557] ** Make documentation for metadata (label/milestone) management * (/) Enable Github issue on the lucene's repository ** Raise an issue on INFRA ** (Create an issue-only private repository for sensitive issues if it's needed and allowed) ** Set a mail hook to [issues@lucene.apache.org|mailto:issues@lucene.apache.org] (many thanks to the general mail group name) * Set a schedule for migration ** See [https://github.com/apache/lucene-jira-archive/issues/7] ** Give some time to committers to play around with issues/labels/milestones before the actual migration ** Make an announcement on the mail lists ** Show some text messages when opening a new Jira issue (in issue template?) was: A few (not the majority) Apache projects already use the GitHub issue instead of Jira. For example, Airflow: [https://github.com/apache/airflow/issues] BookKeeper: [https://github.com/apache/bookkeeper/issues] So I think it'd be technically possible that we move to GitHub issue. I have little knowledge of how to proceed with it, I'd like to discuss whether we should migrate to it, and if so, how to smoothly handle the migration. The major tasks would be: * (/) Get a consensus about the migration among committers * (/) Choose issues that should be moved to GitHub - We'll migrate all issues towards an atomic switch to GitHub if no major technical obstacles show up. ** Discussion thread [https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12] ** -Conclusion for now: We don't migrate any issues. 
Only new issues should be opened on GitHub.- ** Write a prototype migration script - the decision could be made on that. Things to consider: *** version numbers - labels or milestones? *** add a comment/ prepend a link to the source Jira issue on github side, *** add a comment/ prepend a link on the jira side to the new issue on github side (for people who access jira from blogs, mailing list archives and other sources that will have stale links), *** convert cross-issue automatic links in comments/ descriptions (as suggested by Robert), *** strategy to deal with sub-issues (hierarchies), *** maybe prefix (or postfix) the issue title on github side with the original LUCENE-XYZ key so that it is easier to search for a particular issue there? *** how to deal with user IDs (author, reporter, commenters)? Do they have to be github users? Will information about people not
[jira] [Updated] (LUCENE-10557) Migrate to GitHub issue from Jira
[ https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomoko Uchida updated LUCENE-10557: --- Description: A few (not the majority) Apache projects already use the GitHub issue instead of Jira. For example, Airflow: [https://github.com/apache/airflow/issues] BookKeeper: [https://github.com/apache/bookkeeper/issues] So I think it'd be technically possible that we move to GitHub issue. I have little knowledge of how to proceed with it, I'd like to discuss whether we should migrate to it, and if so, how to smoothly handle the migration. The major tasks would be: * (/) Get a consensus about the migration among committers * (/) Choose issues that should be moved to GitHub - We'll migrate all issues towards an atomic switch to GitHub if no major technical obstacles show up. ** Discussion thread [https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12] ** -Conclusion for now: We don't migrate any issues. Only new issues should be opened on GitHub.- ** Write a prototype migration script - the decision could be made on that. Things to consider: *** version numbers - labels or milestones? *** add a comment/ prepend a link to the source Jira issue on github side, *** add a comment/ prepend a link on the jira side to the new issue on github side (for people who access jira from blogs, mailing list archives and other sources that will have stale links), *** convert cross-issue automatic links in comments/ descriptions (as suggested by Robert), *** strategy to deal with sub-issues (hierarchies), *** maybe prefix (or postfix) the issue title on github side with the original LUCENE-XYZ key so that it is easier to search for a particular issue there? *** how to deal with user IDs (author, reporter, commenters)? Do they have to be github users? Will information about people not registered on github be lost? *** create an extra mapping file of old-issue-new-issue URLs for any potential future uses. 
*** what to do with issue numbers in git/svn commits? These could be rewritten but it'd change the entire git history tree - I don't think this is practical, while doable. * Build the convention for issue label/milestone management ** See [https://github.com/apache/lucene-jira-archive/issues/6] ** Do some experiments on a sandbox repository [https://github.com/mocobeta/sandbox-lucene-10557] ** Make documentation for metadata (label/milestone) management * (/) Enable Github issue on the lucene's repository ** Raise an issue on INFRA ** (Create an issue-only private repository for sensitive issues if it's needed and allowed) ** Set a mail hook to [issues@lucene.apache.org|mailto:issues@lucene.apache.org] (many thanks to the general mail group name) * Set a schedule for migration ** See [https://github.com/apache/lucene-jira-archive/issues/7] ** Give some time to committers to play around with issues/labels/milestones before the actual migration ** Make an announcement on the mail lists ** Show some text messages when opening a new Jira issue (in issue template?) was: A few (not the majority) Apache projects already use the GitHub issue instead of Jira. For example, Airflow: [https://github.com/apache/airflow/issues] BookKeeper: [https://github.com/apache/bookkeeper/issues] So I think it'd be technically possible that we move to GitHub issue. I have little knowledge of how to proceed with it, I'd like to discuss whether we should migrate to it, and if so, how to smoothly handle the migration. The major tasks would be: * (/) Get a consensus about the migration among committers * Choose issues that should be moved to GitHub ** Discussion thread [https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12] ** -Conclusion for now: We don't migrate any issues. Only new issues should be opened on GitHub.- ** Write a prototype migration script - the decision could be made on that. Things to consider: *** version numbers - labels or milestones? 
*** add a comment/ prepend a link to the source Jira issue on github side, *** add a comment/ prepend a link on the jira side to the new issue on github side (for people who access jira from blogs, mailing list archives and other sources that will have stale links), *** convert cross-issue automatic links in comments/ descriptions (as suggested by Robert), *** strategy to deal with sub-issues (hierarchies), *** maybe prefix (or postfix) the issue title on github side with the original LUCENE-XYZ key so that it is easier to search for a particular issue there? *** how to deal with user IDs (author, reporter, commenters)? Do they have to be github users? Will information about people not registered on github be lost? *** create an extra mapping file of old-issue-new-issue URLs for any potential future uses. *** what to do with issue numbers in git/svn commits? These could be rewritten but
[GitHub] [lucene-jira-archive] mocobeta opened a new issue, #7: Make a detailed migration plan
mocobeta opened a new issue, #7: URL: https://github.com/apache/lucene-jira-archive/issues/7

It will take at least a few days, and there will be some moratorium time where GitHub issues are not yet live but a Jira snapshot has already been taken. We need a detailed migration plan to avoid possible conflicts/confusion. A draft plan would be:
1. Announce on the mailing list that the migration has started, just before starting to take a Jira snapshot. - Issues/comments created after that should be manually migrated afterward.
2. Run the download script to take a snapshot of the whole Lucene Jira. - This would take 4 hours~ (needs intervals between Jira API calls).
3. Commit all attachments to `lucene-jira-archive` (this repository).
4. Run the conversion script that generates GitHub importable data from the Jira dump. - This would take one or two hours depending on the speed of conversion.
5. [First pass] Run the import script to initialize all issues and comments. - This would take 15 hours~ (needs intervals between GitHub API calls).
6. [Second pass] Run the update script to create re-mapped cross-issue links. - This would take 24 hours~ (needs intervals between GitHub API calls).
7. Manually recover migration errors if possible.
8. Announce on the mailing list that the migration is finished. - GitHub issues are available at this point. - Issues should not be raised in Jira, and existing Jira issues should not be updated after that.
9. Show some text that says "Jira is deprecated" when opening Jira issues.
10. Add comments to each Jira issue that say "Moved to GitHub ".
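Several steps in the plan above are dominated by mandatory pauses between API calls. A minimal sketch of that throttling pattern follows; the function names and the interval value are illustrative assumptions, not the actual migration scripts:

```python
import time

def import_with_throttle(items, post, interval_sec=2.0):
    """Send items one at a time, sleeping between calls.

    `post` stands in for a GitHub (or Jira) API call; `interval_sec`
    is whatever delay keeps the client under the service's rate limit.
    """
    results = []
    for i, item in enumerate(items):
        results.append(post(item))
        if i < len(items) - 1:  # no need to sleep after the last call
            time.sleep(interval_sec)
    return results
```

With tens of thousands of issues and a multi-second interval, this simple loop already accounts for the 15~24 hour estimates quoted in the plan.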
[GitHub] [lucene] jpountz opened a new pull request, #999: LUCENE-10634: Speed up WANDScorer.
jpountz opened a new pull request, #999: URL: https://github.com/apache/lucene/pull/999

This speeds up WANDScorer by computing the scores of documents that are positioned on the next candidate competitive document, in order to potentially detect that no further match is possible before advancing the scorers that are still positioned in the tail.
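The idea in the PR description can be illustrated with a toy upper-bound check: score the already-positioned (lead) scorers first, and only advance the tail scorers if the candidate could still be competitive. This is a sketch under assumed names, not WANDScorer's actual Java implementation:

```python
def can_match(lead_score, tail_max_scores, min_competitive_score):
    # After scoring the lead scorers on the candidate doc, the candidate
    # can only be competitive if that exact score plus the score upper
    # bounds of the not-yet-advanced tail scorers reaches the minimum
    # competitive score. When it cannot, the tail is never advanced,
    # which is where the speedup comes from.
    return lead_score + sum(tail_max_scores) >= min_competitive_score
```

Using the exact lead score (rather than its upper bound) makes the estimate finer-grained, so more candidates are rejected before the comparatively expensive tail advances.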
[jira] [Created] (LUCENE-10634) Speed up WANDScorer by computing scores before advancing tail scorers
Adrien Grand created LUCENE-10634: - Summary: Speed up WANDScorer by computing scores before advancing tail scorers Key: LUCENE-10634 URL: https://issues.apache.org/jira/browse/LUCENE-10634 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand

While looking at performance numbers on LUCENE-10480, I noticed that it is often faster to compute a score in order to get a finer-grained estimate of the best score that the current document can possibly get, before advancing a tail scorer. Making this change to WANDScorer yielded a small but reproducible speedup:

{noformat}
Task                    QPS baseline  StdDev   QPS my_modified_version  StdDev   Pct diff          p-value
IntNRQ                  186.50 (11.8%)  175.34 (19.1%)  -6.0% ( -33% - 28%)  0.234
HighTermTitleBDVSort    167.27 (20.6%)  161.85 (17.2%)  -3.2% ( -34% - 43%)  0.591
MedSloppyPhrase         194.77  (5.5%)  190.45  (7.8%)  -2.2% ( -14% - 11%)  0.299
HighTermDayOfYearSort   229.61  (7.7%)  225.74  (7.1%)  -1.7% ( -15% - 14%)  0.471
LowSloppyPhrase          20.22  (4.3%)   19.95  (4.8%)  -1.3% ( -10% -  8%)  0.366
TermDTSort              319.62  (7.7%)  316.78  (7.5%)  -0.9% ( -14% - 15%)  0.712
OrHighNotLow           1856.44  (5.6%) 1842.88  (5.7%)  -0.7% ( -11% - 11%)  0.682
AndMedOrHighHigh         73.87  (3.8%)   73.51  (3.6%)  -0.5% (  -7% -  7%)  0.677
OrHighNotHigh          2000.56  (5.6%) 1991.65  (6.9%)  -0.4% ( -12% - 12%)  0.823
LowPhrase               106.90  (2.4%)  106.61  (2.9%)  -0.3% (  -5% -  5%)  0.750
AndHighLow             1661.80  (3.5%) 1658.56  (3.7%)  -0.2% (  -7% -  7%)  0.865
Fuzzy2                  110.64  (1.8%)  110.43  (1.9%)  -0.2% (  -3% -  3%)  0.752
HighTermMonthSort        73.74 (17.5%)   73.68 (20.8%)  -0.1% ( -32% - 46%)  0.989
PKLookup                242.86  (1.8%)  242.75  (1.8%)  -0.0% (  -3% -  3%)  0.934
OrHighNotMed           1454.98  (5.3%) 1456.26  (5.8%)   0.1% ( -10% - 11%)  0.960
HighPhrase              523.22  (2.9%)  524.01  (2.6%)   0.2% (  -5% -  5%)  0.862
MedPhrase               140.65  (2.7%)  140.87  (2.9%)   0.2% (  -5% -  5%)  0.862
HighSloppyPhrase          8.74  (4.6%)    8.75  (5.5%)   0.2% (  -9% - 10%)  0.914
LowSpanNear              28.05  (3.6%)   28.14  (3.0%)   0.3% (  -6% -  7%)  0.777
MedSpanNear               7.59  (3.5%)    7.61  (3.4%)   0.3% (  -6% -  7%)  0.778
Respell                  67.62  (1.9%)   67.82  (1.8%)   0.3% (  -3% -  4%)  0.595
OrAndHigMedAndHighMed   127.87  (3.1%)  128.27  (4.0%)   0.3% (  -6% -  7%)  0.780
OrNotHighLow           1513.24  (2.1%) 1520.33  (2.6%)   0.5% (  -4% -  5%)  0.528
OrHighPhraseHighPhrase   25.26  (3.0%)   25.38  (3.0%)   0.5% (  -5% -  6%)  0.616
OrNotHighMed           1544.04  (4.5%) 1552.26  (4.2%)   0.5% (  -7% -  9%)  0.697
AndHighHigh              92.24  (4.8%)   92.79  (6.6%)   0.6% ( -10% - 12%)  0.744
AndHighMed              420.42  (3.1%)  423.19  (5.2%)   0.7% (  -7% -  9%)  0.624
Fuzzy1                  117.42  (1.9%)  118.19  (2.2%)   0.7% (  -3% -  4%)  0.307
MedTerm                2209.36  (4.6%) 2224.54  (5.3%)   0.7% (  -8% - 11%)  0.661
MedIntervalsOrdered     124.18  (8.1%)  125.12  (8.0%)   0.8% ( -14% - 18%)  0.767
OrNotHighHigh          1239.43  (4.6%) 1249.63  (4.8%)   0.8% (  -8% - 10%)  0.580
AndHighOrMedMed          95.02  (4.3%)   95.82  (3.8%)   0.8% (  -6% -  9%)  0.515
Wildcard                315.22 (23.3%)  317.98 (22.5%)   0.9% ( -36% - 60%)  0.904
LowTerm                2775.81  (4.0%) 2808.32  (5.2%)   1.2% (  -7% - 10%)  0.425
HighIntervalsOrdered     14.24  (8.0%)   14.41  (8.4%)   1.2% ( -14% - 19%)  0.646
LowIntervalsOrdered     120.62  (5.8%)  122.09  (6.6%)   1.2% ( -10% - 14%)  0.534
HighSpanNear             39.04  (6.7%)   39.71  (4.3%)   1.7% (  -8% - 13%)  0.332
{noformat}
[jira] [Created] (LUCENE-10633) Dynamic pruning for queries sorted by SORTED(_SET) field
Adrien Grand created LUCENE-10633: - Summary: Dynamic pruning for queries sorted by SORTED(_SET) field Key: LUCENE-10633 URL: https://issues.apache.org/jira/browse/LUCENE-10633 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand LUCENE-9280 introduced the ability to dynamically prune non-competitive hits when sorting by a numeric field, by leveraging the points index to skip documents that do not compare better than the top of the priority queue maintained by the field comparator. However queries sorted by a SORTED(_SET) field still look at all hits, which is disappointing. Could we leverage the terms index to skip hits?
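The kind of pruning this issue asks about can be sketched with plain integer ordinals standing in for SORTED doc values. This is illustrative only: a real implementation would use the terms index to skip ranges of documents rather than iterating every hit as this toy loop does.

```python
import heapq

def top_k_by_ordinal(doc_ords, k):
    # Collect the k docs with the smallest sort ordinals, skipping
    # ("pruning") any doc whose ordinal cannot beat the current worst
    # entry in the queue -- the same competitiveness test a field
    # comparator applies, here over (doc, ordinal) pairs.
    heap = []  # max-heap via negation; worst competitive ordinal on top
    for doc, ord_ in doc_ords:
        if len(heap) < k:
            heapq.heappush(heap, (-ord_, doc))
        elif ord_ < -heap[0][0]:
            heapq.heapreplace(heap, (-ord_, doc))
        # else: non-competitive -- a dynamic-pruning scorer would skip
        # ahead here instead of merely ignoring the doc
    return sorted((-o, d) for o, d in heap)
```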
[GitHub] [lucene-jira-archive] mocobeta opened a new issue, #6: Document issue label / template management policy
mocobeta opened a new issue, #6: URL: https://github.com/apache/lucene-jira-archive/issues/6

- Explicitly define label families (e.g., `type:xxx`, `fixVersion:x.x.x`)
- Clarify the mapping between labels and issue templates
- Write documentation and make it accessible to developers (e.g., place it under `dev-docs` in the lucene repo)
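One way to make such label families explicit is to encode them as patterns that both tooling and documentation can share. The family names and formats below are assumptions modeled on the examples in the issue (`type:xxx`, `fixVersion:x.x.x`), not an agreed convention:

```python
import re

# Assumed label families; the real convention is what issue #6 documents.
LABEL_FAMILIES = {
    "type": re.compile(r"type:[a-z][a-z-]*"),
    "fixVersion": re.compile(r"fixVersion:\d+\.\d+(\.\d+)?"),
}

def label_family(label: str):
    """Return the family a label belongs to, or None if it matches none."""
    for family, pattern in LABEL_FAMILIES.items():
        if pattern.fullmatch(label):
            return family
    return None
```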
[GitHub] [lucene] jtibshirani commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing
jtibshirani commented on code in PR #992: URL: https://github.com/apache/lucene/pull/992#discussion_r910836731 ## lucene/core/src/java/org/apache/lucene/index/VectorValuesWriter.java: ## @@ -26,233 +26,153 @@ import org.apache.lucene.codecs.KnnVectorsWriter; import org.apache.lucene.search.DocIdSetIterator; import org.apache.lucene.search.TopDocs; +import org.apache.lucene.util.Accountable; import org.apache.lucene.util.ArrayUtil; import org.apache.lucene.util.Bits; import org.apache.lucene.util.BytesRef; -import org.apache.lucene.util.Counter; import org.apache.lucene.util.RamUsageEstimator; /** - * Buffers up pending vector value(s) per doc, then flushes when segment flushes. + * Buffers up pending vector value(s) per doc, then flushes when segment flushes. Used for {@code + * SimpleTextKnnVectorsWriter} and for vectors writers before v 9.3 . * * @lucene.experimental */ -class VectorValuesWriter { - - private final FieldInfo fieldInfo; - private final Counter iwBytesUsed; - private final List vectors = new ArrayList<>(); - private final DocsWithFieldSet docsWithField; - - private int lastDocID = -1; - - private long bytesUsed; - - VectorValuesWriter(FieldInfo fieldInfo, Counter iwBytesUsed) { -this.fieldInfo = fieldInfo; -this.iwBytesUsed = iwBytesUsed; -this.docsWithField = new DocsWithFieldSet(); -this.bytesUsed = docsWithField.ramBytesUsed(); -if (iwBytesUsed != null) { - iwBytesUsed.addAndGet(bytesUsed); +public abstract class VectorValuesWriter extends KnnVectorsWriter { Review Comment: Would renaming this to `BufferingKnnVectorsWriter` be clearer? I assumed it did something different because of the very general name `VectorValuesWriter`. I also wonder if we could update `SimpleTextKnnVectorsWriter` to use the new writer interface. Then we could move this class to the backwards-codecs package, because it would only be used in the old codec tests. 
## lucene/core/src/java/org/apache/lucene/index/VectorValuesConsumer.java: ## @@ -0,0 +1,93 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.index; + +import java.io.IOException; +import org.apache.lucene.codecs.Codec; +import org.apache.lucene.codecs.KnnVectorsFormat; +import org.apache.lucene.codecs.KnnVectorsWriter; +import org.apache.lucene.store.Directory; +import org.apache.lucene.store.IOContext; +import org.apache.lucene.util.Accountable; +import org.apache.lucene.util.IOUtils; +import org.apache.lucene.util.InfoStream; + +/** + * Streams vector values for indexing to the given codec's vectors writer. The codec's vectors + * writer is responsible for buffering and processing vectors. 
+ */ +class VectorValuesConsumer { + private final Codec codec; + private final Directory directory; + private final SegmentInfo segmentInfo; + private final InfoStream infoStream; + + private Accountable accountable = Accountable.NULL_ACCOUNTABLE; + private KnnVectorsWriter writer; + + VectorValuesConsumer( + Codec codec, Directory directory, SegmentInfo segmentInfo, InfoStream infoStream) { +this.codec = codec; +this.directory = directory; +this.segmentInfo = segmentInfo; +this.infoStream = infoStream; + } + + private void initKnnVectorsWriter(String fieldName) throws IOException { +if (writer == null) { + KnnVectorsFormat fmt = codec.knnVectorsFormat(); + if (fmt == null) { +throw new IllegalStateException( +"field=\"" ++ fieldName ++ "\" was indexed as vectors but codec does not support vectors"); + } + SegmentWriteState initialWriteState = + new SegmentWriteState(infoStream, directory, segmentInfo, null, null, IOContext.DEFAULT); + writer = fmt.fieldsWriter(initialWriteState); + accountable = writer; +} + } + + public void addField(FieldInfo fieldInfo) throws IOException { +initKnnVectorsWriter(fieldInfo.name); +writer.addField(fieldInfo); + } + + public void addValue(FieldInfo fieldInfo, int docID, float[] vectorValue) throws IOException { +writer.addValue(fieldInfo, docID, vectorValue); + } + + void flush(SegmentWriteState
[GitHub] [lucene] jtibshirani opened a new pull request, #998: LUCENE-10577: Add vectors format unit test and fix toString
jtibshirani opened a new pull request, #998: URL: https://github.com/apache/lucene/pull/998

We forgot to add this unit test when introducing the new 9.3 vectors format. This commit adds the test and fixes issues it uncovered in toString.
[jira] [Commented] (LUCENE-10546) Update Faceting user guide
[ https://issues.apache.org/jira/browse/LUCENE-10546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17560917#comment-17560917 ]

Egor Potemkin commented on LUCENE-10546:
----------------------------------------

I will work on this if no one else is already doing it.

> Update Faceting user guide
> --------------------------
>
>                 Key: LUCENE-10546
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10546
>             Project: Lucene - Core
>          Issue Type: Wish
>          Components: modules/facet
>            Reporter: Greg Miller
>            Priority: Minor
>
> The [facet user guide|https://lucene.apache.org/core/4_1_0/facet/org/apache/lucene/facet/doc-files/userguide.html]
> was written based on 4.1. Since there's been a fair amount of active
> facet-related development over the last year+, it would be nice to review the
> guide and see what updates make sense.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[GitHub] [lucene] jpountz commented on a diff in pull request #972: LUCENE-10480: Use BMM scorer for 2 clauses disjunction
jpountz commented on code in PR #972: URL: https://github.com/apache/lucene/pull/972#discussion_r910683270

## lucene/core/src/java/org/apache/lucene/search/BlockMaxMaxscoreScorer.java: ##
@@ -0,0 +1,332 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collection;
+import java.util.Comparator;
+import java.util.LinkedList;
+import java.util.List;
+
+/** Scorer implementing Block-Max Maxscore algorithm */
+public class BlockMaxMaxscoreScorer extends Scorer {
+  // current doc ID of the leads
+  private int doc;
+
+  // doc id boundary that all scorers maxScore are valid
+  private int upTo;
+
+  // heap of scorers ordered by doc ID
+  private final DisiPriorityQueue essentialsScorers;
+
+  // list of scorers ordered by maxScore
+  private final LinkedList<DisiWrapper> maxScoreSortedEssentialScorers;
+
+  private final DisiWrapper[] allScorers;
+
+  // sum of max scores of scorers in nonEssentialScorers list
+  private double nonEssentialMaxScoreSum;
+
+  private long cost;

Review Comment:
   nit: let's make it final

## lucene/core/src/java/org/apache/lucene/search/BlockMaxMaxscoreScorer.java: ##
@@ -0,0 +1,322 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collection;
+import java.util.Comparator;
+import java.util.LinkedList;
+import java.util.List;
+
+/** Scorer implementing Block-Max Maxscore algorithm */
+public class BlockMaxMaxscoreScorer extends Scorer {
+  // current doc ID of the leads
+  private int doc;
+
+  // doc id boundary that all scorers maxScore are valid
+  private int upTo = -1;
+
+  // heap of scorers ordered by doc ID
+  private final DisiPriorityQueue essentialsScorers;
+  // list of scorers ordered by maxScore
+  private final LinkedList<DisiWrapper> maxScoreSortedEssentialScorers;
+
+  private final DisiWrapper[] allScorers;
+
+  // sum of max scores of scorers in nonEssentialScorers list
+  private float nonEssentialMaxScoreSum;
+
+  private long cost;
+
+  private final MaxScoreSumPropagator maxScoreSumPropagator;
+
+  // scaled min competitive score
+  private float minCompetitiveScore = 0;
+
+  private int cachedScoredDoc = -1;
+  private float cachedScore = 0;
+
+  /**
+   * Constructs a Scorer that scores doc based on Block-Max-Maxscore (BMM) algorithm
+   * http://engineering.nyu.edu/~suel/papers/bmm.pdf . This algorithm has lower overhead compared to
+   * WANDScorer, and could be used for simple disjunction queries.
+   *
+   * @param weight The weight to be used.
+   * @param scorers The sub scorers this Scorer should iterate on for optional clauses
+   */
+  public BlockMaxMaxscoreScorer(Weight weight, List<Scorer> scorers) throws IOException {
+    super(weight);
+
+    this.doc = -1;
+    this.allScorers = new DisiWrapper[scorers.size()];
+    this.essentialsScorers = new DisiPriorityQueue(scorers.size());
+    this.maxScoreSortedEssentialScorers = new LinkedList<>();
+
+    long cost = 0;
+    for (int i = 0; i < scorers.size(); i++) {
+      DisiWrapper w = new DisiWrapper(scorers.get(i));
+      cost += w.cost;
+      allScorers[i] = w;
+    }
+
+    this.cost = cost;
+    maxScoreSumPropagator = new MaxScoreSumPropagator(scorers);
+  }
+
+  @Override
+  public
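The central idea of the MAXSCORE-style split maintained by this scorer (scorers ordered by max score, with a non-essential prefix whose summed max scores cannot reach the minimum competitive score) can be illustrated with a small standalone sketch. Everything here is hypothetical simplification, not Lucene's API: `ScorerStub` stands in for a real sub-scorer, and the max-score values are made up.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simplified illustration of the BMM essential/non-essential partition.
// ScorerStub and its maxScore values are hypothetical stand-ins, not Lucene's API.
class BmmPartitionDemo {
  record ScorerStub(String term, float maxScore) {}

  // Sorts scorers by max score and returns the index of the first "essential"
  // scorer: the longest prefix whose summed max scores stay below
  // minCompetitiveScore can never produce a competitive hit on its own, so
  // those scorers need not be advanced eagerly.
  static int firstEssentialIndex(List<ScorerStub> scorers, float minCompetitiveScore) {
    scorers.sort(Comparator.comparingDouble(ScorerStub::maxScore));
    double sum = 0;
    for (int i = 0; i < scorers.size(); i++) {
      sum += scorers.get(i).maxScore();
      if (sum >= minCompetitiveScore) {
        return i; // scorers[i..] are essential
      }
    }
    return scorers.size(); // no scorer is essential; no doc can be competitive
  }

  public static void main(String[] args) {
    List<ScorerStub> scorers = new ArrayList<>(List.of(
        new ScorerStub("the", 0.3f),
        new ScorerStub("quick", 1.2f),
        new ScorerStub("fox", 2.5f)));
    // With minCompetitiveScore = 1.0: 0.3 < 1.0 but 0.3 + 1.2 >= 1.0,
    // so only "the" is non-essential.
    System.out.println(firstEssentialIndex(scorers, 1.0f)); // prints 1
  }
}
```

As the minimum competitive score rises during collection, more scorers fall into the non-essential prefix, which is what gives the algorithm its lower per-document overhead on simple disjunctions.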
[jira] [Created] (LUCENE-10632) Change getAllChildren to return all children regardless of the count
Yuting Gan created LUCENE-10632:
-----------------------------------

             Summary: Change getAllChildren to return all children regardless of the count
                 Key: LUCENE-10632
                 URL: https://issues.apache.org/jira/browse/LUCENE-10632
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Yuting Gan

Currently, the getAllChildren functionality is implemented in a way that is similar to getTopChildren, where they only return children with a count greater than zero. However, the original getTopChildren in RangeFacetCounts returned all children whether or not the count was zero. This behavior has good use cases, and we should continue supporting it in getAllChildren so that we do not lose it after properly supporting getTopChildren in RangeFacetCounts.

As discussed with [~gsmiller] in the [LUCENE-10614 pr|https://github.com/apache/lucene/pull/974], allowing getAllChildren to behave differently from getTopChildren can actually be more helpful for users. If users want to get only children with positive counts, getTopChildren already supports this behavior. Therefore, the getAllChildren API should provide all children in all of the implementations, whether or not the count is zero.
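The proposed contract can be sketched with a small hypothetical example (this is not Lucene's real Facets API; the class, range labels, and counts are made up): getTopChildren filters out zero-count children, while getAllChildren reports every child, zero counts included.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the proposed contract; not Lucene's actual Facets API.
class FacetCountsDemo {
  // child label -> count, in range order, including a zero-count range
  private final Map<String, Integer> counts = new LinkedHashMap<>();

  FacetCountsDemo() {
    counts.put("0-10", 3);
    counts.put("10-20", 0);
    counts.put("20-30", 7);
  }

  // getTopChildren keeps only children with a positive count.
  List<String> getTopChildren() {
    return counts.entrySet().stream()
        .filter(e -> e.getValue() > 0)
        .map(Map.Entry::getKey)
        .toList();
  }

  // getAllChildren returns every child, whether or not its count is zero.
  List<String> getAllChildren() {
    return List.copyOf(counts.keySet());
  }

  public static void main(String[] args) {
    FacetCountsDemo demo = new FacetCountsDemo();
    System.out.println(demo.getTopChildren()); // [0-10, 20-30]
    System.out.println(demo.getAllChildren()); // [0-10, 10-20, 20-30]
  }
}
```

Under this contract a caller who needs the full range histogram, gaps included, uses getAllChildren, while getTopChildren stays a ranking-oriented view.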