[GitHub] [lucene] zacharymorn commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query
zacharymorn commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-841606431

> > in the jira ticket you had suggested to use BMM for top-level (flat?) boolean query only. Do you think this will need to be fixed?
>
> I opened this JIRA ticket because it felt like we could do better for top-level disjunctions, but if BMM appears to work better most of the time, we could just move to it all the time.

> > The one result that does show negative impact to AndMedOrHighHigh also shows impact to OrHighMed, so it's a bit strange and may need further looking into to see the cause.
>
> Yeah, I suspect there will always be cases when BMW will perform better than BMM or vice-versa, sometimes for subtle reasons.

Makes sense! I'll not attempt to fix it for now then.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #133: LUCENE-9950: New facet counting implementation for general string doc value fields
gsmiller commented on pull request #133:
URL: https://github.com/apache/lucene/pull/133#issuecomment-841561720

I went ahead and added a sparse counting approach since it wasn't complicated to do. I borrowed heuristics and some logic from `IntTaxonomyFacets` in doing so.
[jira] [Commented] (LUCENE-9956) Make getBaseQuery API from DrillDownQuery public
[ https://issues.apache.org/jira/browse/LUCENE-9956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344925#comment-17344925 ] Greg Miller commented on LUCENE-9956: - Ah, thanks [~gworah]. Sounds like a good use-case for needing to expose the base query then. What I'm hearing is that there are optimizations that should _only_ apply to the base query, so {{rewriting}} the query isn't helpful here since that will produce a {{BooleanQuery}} that contains all of the drill down dims applied. Thanks for clarifying! +1 to adding public access to the base query. I left a comment on your PR with regards to the approach of making drill down dims public as well. I like making everything public for consistency, but if that work kind of balloons, I don't mind tackling just the base query for now. That's my opinion at least. > Make getBaseQuery API from DrillDownQuery public > - > > Key: LUCENE-9956 > URL: https://issues.apache.org/jira/browse/LUCENE-9956 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Affects Versions: main (9.0) >Reporter: Gautam Worah >Priority: Trivial > Time Spent: 10m > Remaining Estimate: 0h > > It would be great if users could access the baseQuery of a DrillDownQuery. I > think this can be useful for folks who want to access/test the clauses of a > BooleanQuery (for example) after they've already wrapped it into a > DrillDownQuery. > > Currently the {{Query getBaseQuery()}} method is package private by default. > If this proposed change does not make sense, or if this change breaks the > semantic of the class, I am happy to explore other ways of doing this! > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9956) Make getBaseQuery API from DrillDownQuery public
[ https://issues.apache.org/jira/browse/LUCENE-9956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344924#comment-17344924 ]

Gautam Worah commented on LUCENE-9956:
--
Here is why I need just the baseQuery and the drill-down queries from a {{DrillDownQuery}} object. I have an initial {{DrillDownQuery}} that I construct by parsing the user's {{BooleanQuery}}, and I add custom {{subQuery}}s to it using the {{add(String dim, Query subQuery)}} API. Later on, I try to optimize the base {{BooleanQuery}} (remove some terms), create a new {{DrillDownQuery}} object, and add the original {{subQuery}}s back.

Can we make rewrite() do this? Yes.

Pros: Does not expose the {{baseQuery}} and limits the access.

Cons:
1. {{rewrite}} returns a combined form of the {{DrillDownQuery}}. The user would have to parse the first BQ clause again to get the {{baseQuery}}, then parse all the remaining clauses to get the {{subQuery}}s.
2. Using the {{rewrite}} function to get the {{baseQuery}} is a bit unintuitive.
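The workflow described in this comment (read the base query and the drill-down queries back, trim the base query, then rebuild) can be sketched with a toy model. This is only an illustration under stated assumptions: `ToyDrillDownQuery`, its string-based "queries", and `withoutTerm` are hypothetical names invented here, and none of this is Lucene's `DrillDownQuery` API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for a drill-down query: a base query (a list of terms) plus one
// sub-query per dimension. Queries are plain strings; this is NOT Lucene's class.
class ToyDrillDownQuery {
  private final List<String> baseTerms;
  private final Map<String, String> dimQueries = new LinkedHashMap<>();

  ToyDrillDownQuery(List<String> baseTerms) {
    this.baseTerms = new ArrayList<>(baseTerms);
  }

  void add(String dim, String subQuery) {
    dimQueries.put(dim, subQuery);
  }

  // Public getters return copies so callers cannot mutate internal state.
  List<String> getBaseQuery() {
    return new ArrayList<>(baseTerms);
  }

  Map<String, String> getDrillDownQueries() {
    return new LinkedHashMap<>(dimQueries);
  }

  // Hypothetical "optimization": drop one term from the base query and
  // re-attach the original per-dimension sub-queries unchanged.
  static ToyDrillDownQuery withoutTerm(ToyDrillDownQuery original, String term) {
    List<String> optimized = original.getBaseQuery();
    optimized.remove(term);
    ToyDrillDownQuery rebuilt = new ToyDrillDownQuery(optimized);
    original.getDrillDownQueries().forEach(rebuilt::add);
    return rebuilt;
  }
}
```

With both getters available, the rebuild is a few lines; without them, the caller would have to take apart a combined rewritten form to recover the same pieces.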
[GitHub] [lucene] gsmiller commented on a change in pull request #133: LUCENE-9950: New facet counting implementation for general string doc value fields
gsmiller commented on a change in pull request #133:
URL: https://github.com/apache/lucene/pull/133#discussion_r632839564

## File path: lucene/facet/src/java/org/apache/lucene/facet/StringValueFacetCounts.java ##
@@ -0,0 +1,379 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.MultiDocValues;
+import org.apache.lucene.index.OrdinalMap;
+import org.apache.lucene.index.ReaderUtil;
+import org.apache.lucene.index.SortedSetDocValues;
+import org.apache.lucene.search.ConjunctionDISI;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.search.MatchAllDocsQuery;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.LongValues;
+
+/**
+ * Compute facet counts from a previously indexed {@link SortedSetDocValues} or {@link
+ * org.apache.lucene.index.SortedDocValues} field. This approach will execute facet counting against
+ * the string values found in the specified field, with no assumptions on their format. Unlike
+ * {@link org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts}, no assumption is made
+ * about a "dimension" path component being indexed. Because of this, the field itself is
+ * effectively treated as the "dimension", and counts for all unique string values are produced.
+ * This approach is meant to complement {@link LongValueFacetCounts} in that they both provide facet
+ * counting on a doc value field with no assumptions of content.
+ *
+ * This implementation is useful if you want to dynamically count against any string doc value
+ * field without relying on {@link FacetField} and {@link FacetsConfig}. The disadvantage is that a
+ * separate field is required for each "dimension". If you want to pack multiple dimensions into the
+ * same doc values field, you probably want one of {@link
+ * org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts} or {@link
+ * org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts}.
+ *
+ * Note that there is an added cost on every {@link IndexReader} open to create a new {@link
+ * StringDocValuesReaderState}. Also note that this class should be instantiated and used from a
+ * single thread, because it holds a thread-private instance of {@link SortedSetDocValues}.
+ *
+ * Also note that counting does not use a sparse data structure, so heap memory cost scales with
+ * the number of unique ordinals for the field being counted. For high-cardinality fields, this
+ * could be costly.
+ *
+ * @lucene.experimental
+ */
+// TODO: Add a concurrent version much like ConcurrentSortedSetDocValuesFacetCounts?
+public class StringValueFacetCounts extends Facets {
+
+  private final IndexReader reader;
+  private final String field;
+  private final OrdinalMap ordinalMap;
+  private final SortedSetDocValues docValues;
+
+  // TODO: There's an optimization opportunity here to use a sparse counting structure in some
+  // cases,
+  // much like what IntTaxonomyFacetCounts does.
+  /** Dense counting array indexed by ordinal. */
+  private final int[] counts;
+
+  private int totalDocCount;
+
+  /**
+   * Returns all facet counts for the field, same result as searching on {@link MatchAllDocsQuery}
+   * but faster.
+   */
+  public StringValueFacetCounts(StringDocValuesReaderState state) throws IOException {
+    this(state, null);
+  }
+
+  /** Counts facets across the provided hits. */
+  public StringValueFacetCounts(StringDocValuesReaderState state, FacetsCollector facetsCollector)
+      throws IOException {
+    reader = state.reader;
+    field = state.field;
+    ordinalMap = state.ordinalMap;
+    docValues = getDocValues();
+
+    // Since we accumulate counts in an array, we need to ensure the number of unique ordinals
+    // doesn't overflow an integer:
+    if (docValues.getValueCount() > Integer.MAX_VALUE) {
+      throw new IllegalArgumentException(
+          "can only handle
[GitHub] [lucene] gsmiller commented on a change in pull request #138: LUCENE-9956: Make getBaseQuery, getDrillDownQueries API from DrillDownQuery public
gsmiller commented on a change in pull request #138:
URL: https://github.com/apache/lucene/pull/138#discussion_r632800907

## File path: lucene/facet/src/java/org/apache/lucene/facet/DrillDownQuery.java ##
@@ -170,11 +170,22 @@ private BooleanQuery getBooleanQuery() {
     return bq.build();
   }
 
-  Query getBaseQuery() {
+  /**
+   * Returns the internal baseQuery of the DrillDownQuery
+   *
+   * @return The baseQuery used on initialization of DrillDownQuery
+   */
+  public Query getBaseQuery() {
     return baseQuery;
   }
 
-  Query[] getDrillDownQueries() {
+  /**
+   * Returns the dimension queries added either via {@link #add(String, Query)} or {@link
+   * #add(String, String...)}
+   *
+   * @return The array of dimQueries
+   */
+  public Query[] getDrillDownQueries() {
     Query[] dimQueries = new Query[this.dimQueries.size()];

Review comment:
I wonder if it would make more sense to build each dim query when they're added and change dimQueries to `List`. From a quick glance at the code, I don't see a need to store these as boolean clauses and then only build them when needed. If we expose this as public, we could wind up building these queries multiple times, which is wasteful.
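The suggestion in this review comment, building each dimension query once when it is added instead of on every getter call, can be sketched as follows. This is a simplified illustration and not the code in the PR: `EagerDimQueries`, the string representation of a query, and the `buildCount` counter are all hypothetical. It also ignores the case where further clauses are later added to an already-present dimension, which a real implementation would have to handle.

```java
import java.util.ArrayList;
import java.util.List;

// Toy holder that builds a per-dimension "query" (just a string here) at add
// time, so repeated calls to the public getter do no rebuilding work.
class EagerDimQueries {
  private final List<String> built = new ArrayList<>();
  private int buildCount = 0; // instrumentation only, to make the saving visible

  void add(String dim, String... values) {
    built.add(build(dim, values));
  }

  private String build(String dim, String[] values) {
    buildCount++;
    return dim + ":(" + String.join(" OR ", values) + ")";
  }

  // Returns a fresh array each time so callers cannot modify internal state,
  // but the elements themselves were built exactly once.
  String[] getDrillDownQueries() {
    return built.toArray(new String[0]);
  }

  int buildCount() {
    return buildCount;
  }
}
```

Calling `getDrillDownQueries()` any number of times leaves `buildCount()` equal to the number of `add` calls, which is the saving the comment is after once the getter becomes public.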
[GitHub] [lucene] gsmiller commented on pull request #133: LUCENE-9950: New facet counting implementation for general string doc value fields
gsmiller commented on pull request #133:
URL: https://github.com/apache/lucene/pull/133#issuecomment-841450992

@mikemccand yeah, this works for both single- and multi-valued fields. In `getDocValues()` I'm relying on `DocValues.getSortedSet()` which will first try to load stored values as `SortedSetDocValues` but will fall back to trying `SortedDocValues`. Pretty handy helper functionality. I cover this case in `testBasicSingleValuedUsingSortedDoc` to confirm.
[GitHub] [lucene] gsmiller commented on a change in pull request #133: LUCENE-9950: New facet counting implementation for general string doc value fields
gsmiller commented on a change in pull request #133:
URL: https://github.com/apache/lucene/pull/133#discussion_r632722791

## File path: lucene/facet/src/java/org/apache/lucene/facet/StringDocValuesReaderState.java ##
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet;
+
+import java.io.IOException;
+import java.util.List;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.OrdinalMap;
+import org.apache.lucene.index.SortedSetDocValues;
+import org.apache.lucene.util.packed.PackedInts;
+
+/**
+ * Stores an {@link OrdinalMap} created for a specific {@link IndexReader} ({@code reader}) + {@code
+ * field}. Enables re-use of the {@code ordinalMap} once created since creation is costly.
+ *
+ * Note: It's important that callers confirm the ordinal map is still valid for their cases.
+ * Specifically, callers should confirm that the reader used to create the map ({@code reader})
+ * matches their use-case.
+ */
+class StringDocValuesReaderState {

Review comment:
Ah, yes, it absolutely should, along with its ctor. Good catch! I've got my IDE set up to generate all new classes as package-private by default so I have to have a good reason to make something public. This one slipped through the cracks.
[GitHub] [lucene] gsmiller commented on a change in pull request #133: LUCENE-9950: New facet counting implementation for general string doc value fields
gsmiller commented on a change in pull request #133:
URL: https://github.com/apache/lucene/pull/133#discussion_r632720493

## File path: lucene/facet/src/java/org/apache/lucene/facet/StringValueFacetCounts.java ##
@@ -0,0 +1,371 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.MultiDocValues;
+import org.apache.lucene.index.OrdinalMap;
+import org.apache.lucene.index.ReaderUtil;
+import org.apache.lucene.index.SortedSetDocValues;
+import org.apache.lucene.search.ConjunctionDISI;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.search.MatchAllDocsQuery;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.LongValues;
+
+/**
+ * Compute facet counts from a previously indexed {@link SortedSetDocValues} or {@link
+ * org.apache.lucene.index.SortedDocValues} field. This approach will execute facet counting against
+ * the string values found in the specified field, with no assumptions on their format. Unlike
+ * {@link org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts}, no assumption is made
+ * about a "dimension" path component being indexed. Because of this, the field itself is
+ * effectively treated as the "dimension", and counts for all unique string values are produced.
+ * This approach is meant to complement {@link LongValueFacetCounts} in that they both provide facet
+ * counting on a doc value field with no assumptions of content.
+ *
+ * This implementation is useful if you want to dynamically count against any string doc value
+ * field without relying on {@link FacetField} and {@link FacetsConfig}. The disadvantage is that a
+ * separate field is required for each "dimension". If you want to pack multiple dimensions into the
+ * same doc values field, you probably want one of {@link
+ * org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts} or {@link
+ * org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts}.
+ *
+ * Note that there is an added cost on every {@link IndexReader} open to create a new {@link
+ * StringDocValuesReaderState}. Also note that this class should be instantiated and used from a
+ * single thread, because it holds a thread-private instance of {@link SortedSetDocValues}.
+ *
+ * @lucene.experimental
+ */
+// TODO: Add a concurrent version much like ConcurrentSortedSetDocValuesFacetCounts?
+public class StringValueFacetCounts extends Facets {
+
+  private final IndexReader reader;
+  private final String field;
+  private final OrdinalMap ordinalMap;
+  private final SortedSetDocValues docValues;
+
+  private final int[] counts;

Review comment:
That's correct. I'll add some documentation. I considered having both sparse and dense approaches triggered by different thresholds, similar to what `IntTaxonomyFacetCounts` does, but opted not to for now. There should at least be some fairly common cases where this counting is pretty dense, assuming most unique values end up being seen at least once for a given field on any given match set. For very restrictive queries though, this could certainly get sparse. Anyway, maybe the most relevant reason I took this approach for now is that it's the existing approach used by `SortedSetDocValuesFacetCounts`, so it seemed like a reasonable starting place. But yes, optimization opportunities exist :)
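The dense-versus-sparse trade-off discussed in this comment can be sketched as a counter that picks its backing structure up front. This is a minimal sketch under stated assumptions: the class name `OrdinalCounter` and the "fewer than one tenth of the cardinality" threshold are hypothetical choices made for illustration, not the heuristic used by any Lucene class.

```java
import java.util.HashMap;
import java.util.Map;

// Counts hits per ordinal using either a dense int[] (one slot per ordinal) or
// a sparse hash map (one entry per ordinal actually seen).
class OrdinalCounter {
  private final int[] dense;                  // non-null in dense mode
  private final Map<Integer, Integer> sparse; // non-null in sparse mode

  OrdinalCounter(int cardinality, int expectedHits) {
    // Assumed heuristic: go sparse when few hits are expected relative to the
    // number of unique ordinals, since most array slots would stay at zero.
    if (expectedHits < cardinality / 10) {
      sparse = new HashMap<>();
      dense = null;
    } else {
      dense = new int[cardinality];
      sparse = null;
    }
  }

  void increment(int ord) {
    if (dense != null) {
      dense[ord]++;
    } else {
      sparse.merge(ord, 1, Integer::sum);
    }
  }

  int count(int ord) {
    return dense != null ? dense[ord] : sparse.getOrDefault(ord, 0);
  }

  boolean isSparse() {
    return sparse != null;
  }
}
```

The dense array costs memory proportional to the field's cardinality regardless of the match set, while the map costs memory proportional to the distinct ordinals seen, at the price of boxing and hashing on every increment.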
[GitHub] [lucene] gsmiller commented on a change in pull request #133: LUCENE-9950: New facet counting implementation for general string doc value fields
gsmiller commented on a change in pull request #133:
URL: https://github.com/apache/lucene/pull/133#discussion_r632717530

## File path: lucene/facet/src/java/org/apache/lucene/facet/StringValueFacetCounts.java ##
@@ -0,0 +1,371 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.MultiDocValues;
+import org.apache.lucene.index.OrdinalMap;
+import org.apache.lucene.index.ReaderUtil;
+import org.apache.lucene.index.SortedSetDocValues;
+import org.apache.lucene.search.ConjunctionDISI;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.search.MatchAllDocsQuery;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.LongValues;
+
+/**
+ * Compute facet counts from a previously indexed {@link SortedSetDocValues} or {@link
+ * org.apache.lucene.index.SortedDocValues} field. This approach will execute facet counting against
+ * the string values found in the specified field, with no assumptions on their format. Unlike
+ * {@link org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts}, no assumption is made
+ * about a "dimension" path component being indexed. Because of this, the field itself is
+ * effectively treated as the "dimension", and counts for all unique string values are produced.
+ * This approach is meant to complement {@link LongValueFacetCounts} in that they both provide facet
+ * counting on a doc value field with no assumptions of content.
+ *
+ * This implementation is useful if you want to dynamically count against any string doc value
+ * field without relying on {@link FacetField} and {@link FacetsConfig}. The disadvantage is that a
+ * separate field is required for each "dimension". If you want to pack multiple dimensions into the
+ * same doc values field, you probably want one of {@link
+ * org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts} or {@link
+ * org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts}.
+ *
+ * Note that there is an added cost on every {@link IndexReader} open to create a new {@link
+ * StringDocValuesReaderState}. Also note that this class should be instantiated and used from a
+ * single thread, because it holds a thread-private instance of {@link SortedSetDocValues}.
+ *
+ * @lucene.experimental
+ */
+// TODO: Add a concurrent version much like ConcurrentSortedSetDocValuesFacetCounts?
+public class StringValueFacetCounts extends Facets {
+
+  private final IndexReader reader;
+  private final String field;
+  private final OrdinalMap ordinalMap;
+  private final SortedSetDocValues docValues;
+
+  private final int[] counts;
+
+  private int totalDocCount = 0;

Review comment:
Good point; will remove.
[jira] [Commented] (LUCENE-9956) Make getBaseQuery API from DrillDownQuery public
[ https://issues.apache.org/jira/browse/LUCENE-9956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344803#comment-17344803 ]

Greg Miller commented on LUCENE-9956:
-
{quote}I agree it seems unreasonable now to not be able to {{get}} the things you had {{set}} / passed to {{ctor}} {quote}

Yeah that's fair. It's a little nuanced though I think. DDQ supports a few different ways to create the base query and the drill down dims, some of which are not as simple as having the user pass something in. For example, there's a ctor that allows the user to pass in an existing DDQ and an additional Query filter. The base query from the reference DDQ + filter becomes the new base query. Does it really make sense to expose that directly to the user? Also, the "standard" approach to adding drill down dimensions is to specify a dim + path, and DDQ constructs the appropriate Query using the user-provided {{FacetsConfig}}. Again, in these cases should we be exposing the Queries created under the hood?

I don't think there's any harm really in exposing them, but it does feel like a bit of an "advanced" feature. My intention isn't really to push back on adding this functionality, more to say "let's think about it a little bit and if {{rewrite}} is all that's really needed in this case, maybe we don't add this stuff yet".
[jira] [Commented] (LUCENE-9956) Make getBaseQuery API from DrillDownQuery public
[ https://issues.apache.org/jira/browse/LUCENE-9956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344801#comment-17344801 ]

Gautam Worah commented on LUCENE-9956:
--
Here is a PR that I opened yesterday: https://github.com/apache/lucene/pull/138
[jira] [Comment Edited] (LUCENE-9957) Use DirectMonotonicWriter to store sorted Values in NumericDocValues/SortedNumericDocValues
[ https://issues.apache.org/jira/browse/LUCENE-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344796#comment-17344796 ]

Lu Xugang edited comment on LUCENE-9957 at 5/14/21, 6:05 PM:
-
benchmark: python src/python/localrun.py -source wikimedium5m
!image-2021-05-15-02-04-43-405.png|width=591,height=503!

was (Author: chrislu):
benchmark: python src/python/localrun.py -source wikimedium5m
!image-2021-05-15-02-03-06-167.png!

> Use DirectMonotonicWriter to store sorted Values in NumericDocValues/SortedNumericDocValues
> ---
>
> Key: LUCENE-9957
> URL: https://issues.apache.org/jira/browse/LUCENE-9957
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/codecs
> Affects Versions: 8.8.2
> Reporter: Lu Xugang
> Priority: Major
> Attachments: image-2021-05-15-02-03-06-167.png, image-2021-05-15-02-04-09-085.png, image-2021-05-15-02-04-43-405.png
> Time Spent: 10m
> Remaining Estimate: 0h
>
> When all values were sorted, using DirectMonotonicWriter to store them can get relatively impressive compression
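The compression claim in the issue description can be illustrated with a rough calculation of why sorted values pack tightly. This is a sketch of the general idea behind monotonic encoding (store each value as a small deviation from a straight line) and not the actual on-disk format or code of `DirectMonotonicWriter`; the class `MonotonicBits` and its methods are hypothetical names.

```java
// Compares the bits per value needed to store sorted longs as plain offsets
// from the minimum versus as deviations from a linear approximation.
class MonotonicBits {
  // Bits needed to represent any value in [0, maxValue]; at least 1.
  static int bitsRequired(long maxValue) {
    return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
  }

  // Plain approach: store value - min for every entry.
  static int rawBits(long[] sorted) {
    return bitsRequired(sorted[sorted.length - 1] - sorted[0]);
  }

  // Monotonic approach: store how far each value deviates from the straight
  // line through the first and last values, shifted to be non-negative.
  static int monotonicBits(long[] sorted) {
    int n = sorted.length;
    double slope = n > 1 ? (double) (sorted[n - 1] - sorted[0]) / (n - 1) : 0.0;
    long minDev = Long.MAX_VALUE;
    long maxDev = Long.MIN_VALUE;
    for (int i = 0; i < n; i++) {
      long expected = sorted[0] + (long) (slope * i);
      long dev = sorted[i] - expected;
      minDev = Math.min(minDev, dev);
      maxDev = Math.max(maxDev, dev);
    }
    return bitsRequired(maxDev - minDev);
  }
}
```

For the input `{1000, 2001, 2999, 4000, 5002}`, the plain offsets span 4002 and need 12 bits each, while the deviations from the line span only 3 and need 2 bits each. The gain depends on how close the data is to evenly spaced; sorted but very irregular values would benefit less.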
[jira] [Comment Edited] (LUCENE-9957) Use DirectMonotonicWriter to store sorted Values in NumericDocValues/SortedNumericDocValues
[ https://issues.apache.org/jira/browse/LUCENE-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344796#comment-17344796 ]

Lu Xugang edited comment on LUCENE-9957 at 5/14/21, 6:03 PM:
-
benchmark: python src/python/localrun.py -source wikimedium5m
!image-2021-05-15-02-03-06-167.png!

was (Author: chrislu):
benchmark: python src/python/localrun.py -source wikimedium5m
{code:java}
Task                       QPS baseline   StdDev  QPS my_modified_version   StdDev  Pct diff             p-value
PKLookup                         190.11   (5.6%)                   190.46   (5.7%)   0.2% ( -10% - 12%)  0.917
BrowseDayOfYearTaxoFacets          5.11   (4.6%)                     5.17   (3.7%)   1.1% ( -6% - 9%)    0.425
BrowseDayOfYearSSDVFacets         29.52   (4.5%)                    29.90   (2.6%)   1.3% ( -5% - 8%)    0.273
HighTermMonthSort                225.07  (16.4%)                   228.27  (14.7%)   1.4% ( -25% - 38%)  0.772
BrowseMonthTaxoFacets              5.47   (4.6%)                     5.55   (3.9%)   1.5% ( -6% - 10%)   0.273
BrowseDateTaxoFacets               5.12   (4.4%)                     5.19   (3.6%)   1.5% ( -6% - 9%)    0.229
HighSloppyPhrase                  20.55   (5.7%)                    20.89   (4.7%)   1.6% ( -8% - 12%)   0.326
HighIntervalsOrdered              38.28   (5.8%)                    38.95   (3.1%)   1.7% ( -6% - 11%)   0.244
TermDTSort                       372.47   (8.6%)                   379.17   (6.7%)   1.8% ( -12% - 18%)  0.459
MedSloppyPhrase                   79.43   (7.6%)                    81.11   (5.9%)   2.1% ( -10% - 16%)  0.328
MedSpanNear                      157.72   (4.3%)                   161.24   (3.2%)   2.2% ( -5% - 10%)   0.063
AndHighHigh                       94.43   (6.0%)                    96.66   (4.5%)   2.4% ( -7% - 13%)   0.157
HighTermTitleBDVSort             286.92  (16.9%)                   293.86  (15.5%)   2.4% ( -25% - 41%)  0.637
HighTermDayOfYearSort            222.69  (10.9%)                   228.38  (11.6%)   2.6% ( -17% - 28%)  0.473
HighSpanNear                      23.28   (6.9%)                    23.90   (3.3%)   2.7% ( -6% - 13%)   0.118
LowSpanNear                       42.75   (6.1%)                    43.93   (3.7%)   2.8% ( -6% - 13%)   0.081
OrHighHigh                        59.14   (5.1%)                    60.90   (4.2%)   3.0% ( -6% - 12%)   0.044
LowSloppyPhrase                   58.41   (6.1%)                    60.22   (4.1%)   3.1% ( -6% - 14%)   0.059
Respell                           85.25  (10.6%)                    87.89   (8.8%)   3.1% ( -14% - 25%)  0.312
BrowseMonthSSDVFacets             35.63   (6.5%)                    36.78   (2.7%)   3.2% ( -5% - 13%)   0.041
Wildcard                         157.33   (6.6%)                   162.55   (3.9%)   3.3% ( -6% - 14%)   0.051
Fuzzy2                            55.44  (19.3%)                    57.31  (20.0%)   3.4% ( -30% - 52%)  0.587
AndHighLow                       873.15   (7.7%)                   904.41   (5.1%)   3.6% ( -8% - 17%)   0.082
LowTerm                         1357.94   (8.0%)                  1409.83   (6.8%)   3.8% ( -10% - 20%)  0.103
MedPhrase                        168.51   (5.9%)                   175.10   (5.6%)   3.9% ( -7% - 16%)   0.032
OrNotHighLow                     887.60   (8.6%)                   923.27   (6.4%)   4.0% ( -10% - 20%)  0.092
OrHighMed                        132.13   (9.7%)                   137.52   (6.6%)   4.1% ( -11% - 22%)  0.120
LowPhrase                        223.17   (7.5%)                   232.56   (4.9%)   4.2% ( -7% - 18%)   0.036
HighPhrase                       129.12   (5.9%)                   134.66   (3.8%)   4.3% ( -5% - 14%)   0.006
MedTerm                         1302.32   (8.1%)                  1358.78   (7.7%)   4.3% ( -10% - 21%)  0.083
AndHighMed                       198.67   (6.5%)                   207.93   (5.9%)   4.7% ( -7% - 18%)   0.017
Prefix3                          198.58  (10.4%)                   208.86   (5.8%)   5.2% ( -9% - 23%)   0.051
HighTerm                        1324.37   (7.1%)                  1408.12   (7.6%)   6.3% ( -7% - 22%)   0.006
IntNRQ                           182.65   (9.6%)                   194.50   (6.3%)   6.5% ( -8% - 24%)   0.012
OrHighLow                        447.06  (13.3%)                   476.33  (12.2%)   6.5% ( -16% - 36%)  0.105
OrNotHighHigh                    655.34   (8.9%)                   701.03   (6.5%)   7.0% ( -7% - 24%)   0.004
OrHighNotMed                     702.92  (11.3%)                   757.64   (8.1%)   7.8% ( -10% - 30%)  0.012
OrHighNotHigh                    515.44   (9.0%)                   556.38   (6.7%)   7.9% ( -7% - 25%)   0.001
Fuzzy1                            61.94  (11.8%)                    66.98  (15.4%)   8.1% ( -17% - 40%)  0.061
OrHighNotLow                     795.55   (8.0%)                   865.52   (7.8%)   8.8% ( -6% - 26%)   0.000
OrNotHighMed                     553.58   (9.4%)                   602.57   (6.5%)   8.8% ( -6% - 27%)   0.001
{code}

> Use DirectMonotonicWriter to store sorted Values in NumericDocValues/SortedNumericDocValues
[jira] [Commented] (LUCENE-9957) Use DirectMonotonicWriter to store sorted Values in NumericDocValues/SortedNumericDocValues
[ https://issues.apache.org/jira/browse/LUCENE-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344796#comment-17344796 ]

Lu Xugang commented on LUCENE-9957:
---
benchmark: python src/python/localrun.py -source wikimedium5m

result:
{code:java}
// code placeholder
{code}
Task                       QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
PKLookup                    190.11  (5.6%)   190.46  (5.7%)  0.2% ( -10% - 12%)  0.917
BrowseDayOfYearTaxoFacets     5.11  (4.6%)     5.17  (3.7%)  1.1% ( -6% - 9%)    0.425
BrowseDayOfYearSSDVFacets    29.52  (4.5%)    29.90  (2.6%)  1.3% ( -5% - 8%)    0.273
HighTermMonthSort           225.07 (16.4%)   228.27 (14.7%)  1.4% ( -25% - 38%)  0.772
BrowseMonthTaxoFacets         5.47  (4.6%)     5.55  (3.9%)  1.5% ( -6% - 10%)   0.273
BrowseDateTaxoFacets          5.12  (4.4%)     5.19  (3.6%)  1.5% ( -6% - 9%)    0.229
HighSloppyPhrase             20.55  (5.7%)    20.89  (4.7%)  1.6% ( -8% - 12%)   0.326
HighIntervalsOrdered         38.28  (5.8%)    38.95  (3.1%)  1.7% ( -6% - 11%)   0.244
TermDTSort                  372.47  (8.6%)   379.17  (6.7%)  1.8% ( -12% - 18%)  0.459
MedSloppyPhrase              79.43  (7.6%)    81.11  (5.9%)  2.1% ( -10% - 16%)  0.328
MedSpanNear                 157.72  (4.3%)   161.24  (3.2%)  2.2% ( -5% - 10%)   0.063
AndHighHigh                  94.43  (6.0%)    96.66  (4.5%)  2.4% ( -7% - 13%)   0.157
HighTermTitleBDVSort        286.92 (16.9%)   293.86 (15.5%)  2.4% ( -25% - 41%)  0.637
HighTermDayOfYearSort       222.69 (10.9%)   228.38 (11.6%)  2.6% ( -17% - 28%)  0.473
HighSpanNear                 23.28  (6.9%)    23.90  (3.3%)  2.7% ( -6% - 13%)   0.118
LowSpanNear                  42.75  (6.1%)    43.93  (3.7%)  2.8% ( -6% - 13%)   0.081
OrHighHigh                   59.14  (5.1%)    60.90  (4.2%)  3.0% ( -6% - 12%)   0.044
LowSloppyPhrase              58.41  (6.1%)    60.22  (4.1%)  3.1% ( -6% - 14%)   0.059
Respell                      85.25 (10.6%)    87.89  (8.8%)  3.1% ( -14% - 25%)  0.312
BrowseMonthSSDVFacets        35.63  (6.5%)    36.78  (2.7%)  3.2% ( -5% - 13%)   0.041
Wildcard                    157.33  (6.6%)   162.55  (3.9%)  3.3% ( -6% - 14%)   0.051
Fuzzy2                       55.44 (19.3%)    57.31 (20.0%)  3.4% ( -30% - 52%)  0.587
AndHighLow                  873.15  (7.7%)   904.41  (5.1%)  3.6% ( -8% - 17%)   0.082
LowTerm                    1357.94  (8.0%)  1409.83  (6.8%)  3.8% ( -10% - 20%)  0.103
MedPhrase                   168.51  (5.9%)   175.10  (5.6%)  3.9% ( -7% - 16%)   0.032
OrNotHighLow                887.60  (8.6%)   923.27  (6.4%)  4.0% ( -10% - 20%)  0.092
OrHighMed                   132.13  (9.7%)   137.52  (6.6%)  4.1% ( -11% - 22%)  0.120
LowPhrase                   223.17  (7.5%)   232.56  (4.9%)  4.2% ( -7% - 18%)   0.036
HighPhrase                  129.12  (5.9%)   134.66  (3.8%)  4.3% ( -5% - 14%)   0.006
MedTerm                    1302.32  (8.1%)  1358.78  (7.7%)  4.3% ( -10% - 21%)  0.083
AndHighMed                  198.67  (6.5%)   207.93  (5.9%)  4.7% ( -7% - 18%)   0.017
Prefix3                     198.58 (10.4%)   208.86  (5.8%)  5.2% ( -9% - 23%)   0.051
HighTerm                   1324.37  (7.1%)  1408.12  (7.6%)  6.3% ( -7% - 22%)   0.006
IntNRQ                      182.65  (9.6%)   194.50  (6.3%)  6.5% ( -8% - 24%)   0.012
OrHighLow                   447.06 (13.3%)   476.33 (12.2%)  6.5% ( -16% - 36%)  0.105
OrNotHighHigh               655.34  (8.9%)   701.03  (6.5%)  7.0% ( -7% - 24%)   0.004
OrHighNotMed                702.92 (11.3%)   757.64  (8.1%)  7.8% ( -10% - 30%)  0.012
OrHighNotHigh               515.44  (9.0%)   556.38  (6.7%)  7.9% ( -7% - 25%)   0.001
Fuzzy1                       61.94 (11.8%)    66.98 (15.4%)  8.1% ( -17% - 40%)  0.061
OrHighNotLow                795.55  (8.0%)   865.52  (7.8%)  8.8% ( -6% - 26%)   0.000
OrNotHighMed                553.58  (9.4%)   602.57  (6.5%)  8.8% ( -6% - 27%)   0.001

> Use DirectMonotonicWriter to store sorted Values in NumericDocValues/SortedNumericDocValues
> ---
>
> Key: LUCENE-9957
> URL: https://issues.apache.org/jira/browse/LUCENE-9957
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/codecs
> Affects Versions: 8.8.2
> Reporter: Lu Xugang
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> When all values are sorted, storing them with DirectMonotonicWriter can
> yield relatively impressive compression

-- This message was sent by Atlassian Jira (v8.3.4#803005)
- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mikemccand merged pull request #71: LUCENE-9651: Make benchmarks run again, correct javadocs
mikemccand merged pull request #71: URL: https://github.com/apache/lucene/pull/71 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mikemccand commented on pull request #128: LUCENE-9662: [WIP] CheckIndex should be concurrent
mikemccand commented on pull request #128: URL: https://github.com/apache/lucene/pull/128#issuecomment-841300744 I am excited to see what happens to [`CheckIndex` time in Lucene's nightly benchmarks](https://home.apache.org/~mikemccand/lucenebench/checkIndexTime.html) after we push this! But I agree we must also not crush the more common case of machines that don't have tons of cores ...
[GitHub] [lucene] mikemccand commented on a change in pull request #133: LUCENE-9950: New facet counting implementation for general string doc value fields
mikemccand commented on a change in pull request #133:
URL: https://github.com/apache/lucene/pull/133#discussion_r632584798

## File path: lucene/facet/src/java/org/apache/lucene/facet/StringValueFacetCounts.java ##
@@ -0,0 +1,371 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.MultiDocValues;
+import org.apache.lucene.index.OrdinalMap;
+import org.apache.lucene.index.ReaderUtil;
+import org.apache.lucene.index.SortedSetDocValues;
+import org.apache.lucene.search.ConjunctionDISI;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.search.MatchAllDocsQuery;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.LongValues;
+
+/**
+ * Compute facet counts from a previously indexed {@link SortedSetDocValues} or {@link
+ * org.apache.lucene.index.SortedDocValues} field. This approach will execute facet counting against
+ * the string values found in the specified field, with no assumptions on their format. Unlike
+ * {@link org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts}, no assumption is made
+ * about a "dimension" path component being indexed. Because of this, the field itself is
+ * effectively treated as the "dimension", and counts for all unique string values are produced.
+ * This approach is meant to compliment {@link LongValueFacetCounts} in that they both provide facet
+ * counting on a doc value field with no assumptions of content.
+ *
+ * This implementation is useful if you want to dynamically count against any string doc value
+ * field without relying on {@link FacetField} and {@link FacetsConfig}. The disadvantage is that a
+ * separate field is required for each "dimension". If you want to pack multiple dimensions into the
+ * same doc values field, you probably want one of {@link
+ * org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts} or {@link
+ * org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts}.
+ *
+ * Note that there is an added cost on every {@link IndexReader} open to create a new {@link
+ * StringDocValuesReaderState}. Also note that this class should be instantiated and used from a
+ * single thread, because it holds a thread-private instance of {@link SortedSetDocValues}.
+ *
+ * @lucene.experimental
+ */
+// TODO: Add a concurrent version much like ConcurrentSortedSetDocValuesFacetCounts?
+public class StringValueFacetCounts extends Facets {
+
+  private final IndexReader reader;
+  private final String field;
+  private final OrdinalMap ordinalMap;
+  private final SortedSetDocValues docValues;
+
+  private final int[] counts;
+
+  private int totalDocCount = 0;

Review comment:
You don't need the `= 0` -- it's java's default already.
[jira] [Commented] (LUCENE-9956) Make getBaseQuery API from DrillDownQuery public
[ https://issues.apache.org/jira/browse/LUCENE-9956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344634#comment-17344634 ] Michael McCandless commented on LUCENE-9956: Maybe we could do both? Make these APIs public (I agree it seems unreasonable now to not be able to {{get}} the things you had {{set}} / passed to {{ctor}}) and also (better) test the actually-used-for-searching {{rewrite}}? > Make getBaseQuery API from DrillDownQuery public > - > > Key: LUCENE-9956 > URL: https://issues.apache.org/jira/browse/LUCENE-9956 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Affects Versions: main (9.0) >Reporter: Gautam Worah >Priority: Trivial > > It would be great if users could access the baseQuery of a DrillDownQuery. I > think this can be useful for folks who want to access/test the clauses of a > BooleanQuery (for example) after they've already wrapped it into a > DrillDownQuery. > > Currently the {{Query getBaseQuery()}} method is package private by default. > If this proposed change does not make sense, or if this change breaks the > semantic of the class, I am happy to explore other ways of doing this! > >
[GitHub] [lucene] dnhatn opened a new pull request #140: LUCENE-9935: Enable bulk-merge for term vectors with index sort
dnhatn opened a new pull request #140: URL: https://github.com/apache/lucene/pull/140 This change enables bulk-merge for term vectors with index sort. The algorithm used here is similar to the one that is used to merge stored fields. Relates #134
[jira] [Commented] (LUCENE-9958) Performance regression when a minimum number of matching SHOULD clauses is required
[ https://issues.apache.org/jira/browse/LUCENE-9958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344599#comment-17344599 ] Matt Weber commented on LUCENE-9958: [~jpountz] Wow that was quick! Thank you! > Performance regression when a minimum number of matching SHOULD clauses is > required > --- > > Key: LUCENE-9958 > URL: https://issues.apache.org/jira/browse/LUCENE-9958 > Project: Lucene - Core > Issue Type: Bug >Reporter: Adrien Grand >Priority: Minor > Fix For: 8.9 > > > Opening this issue on behalf of [~mattweber], who reported this at > https://discuss.elastic.co/t/es-7-7-1-es-7-12-0-wand-performance-issue/272854. > It looks like the fact that we introduced dynamic pruning for queries that > already have a minimum number of SHOULD clauses configured makes things > _slower_, at least in some cases.
[jira] [Resolved] (LUCENE-9958) Performance regression when a minimum number of matching SHOULD clauses is required
[ https://issues.apache.org/jira/browse/LUCENE-9958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-9958. -- Fix Version/s: 8.9 Resolution: Fixed
[jira] [Commented] (LUCENE-9958) Performance regression when a minimum number of matching SHOULD clauses is required
[ https://issues.apache.org/jira/browse/LUCENE-9958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344549#comment-17344549 ] ASF subversion and git services commented on LUCENE-9958: - Commit d50d5dec62b612b8d603d82d33044cfc97c02d91 in lucene-solr's branch refs/heads/branch_8x from Adrien Grand [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=d50d5de ] LUCENE-9958: Fixed performance regression for boolean queries that configure a minimum number of matching clauses.
[jira] [Commented] (LUCENE-9932) Performance improvement for BKD index building
[ https://issues.apache.org/jira/browse/LUCENE-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344548#comment-17344548 ]

ASF subversion and git services commented on LUCENE-9932:
-
Commit 8045a170f42f3562f8d13616e2c4426bbcacf3a1 in lucene's branch refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=8045a17 ]

LUCENE-9932: Spotless.

> Performance improvement for BKD index building
> --
>
> Key: LUCENE-9932
> URL: https://issues.apache.org/jira/browse/LUCENE-9932
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index
> Affects Versions: 8.8.2
> Reporter: neoremind
> Priority: Critical
> Fix For: 8.9
>
> Attachments: benchmark_data.png, flame-graph.png, refined-code-benchmark.png, refined-code-benchmark2.png
>
> Time Spent: 12.5h
> Remaining Estimate: 0h
>
> In BKD index building, the input bytes must be sorted before calling the BKD
> writer APIs. The sort uses an MSB radix sort, and the comparison takes into
> account both the bytes themselves and the DocId. In real cases, however,
> DocIds are usually monotonically increasing, which opens the door to a
> performance improvement. I found this while digging into a performance issue
> in our system, and then researched a possible solution.
> DocId usually increases by one when the index is built in a thread-safe way.
> Under that assumption, the comparison can drop the unnecessary DocId input
> and compare only the bytes. For this to work, the MSB radix sort and its
> fallback sort must be *stable*, so that equal elements keep the order in
> which they were added, which keeps DocIds monotonically increasing. Making
> the MSB radix sort stable needs a trivial update; making the fallback sort
> stable means using merge sort instead of quick sort. There should also be a
> switch to turn the stable option on or off.
> To validate how much performance could be gained, I wrote a benchmark that
> records only the time elapsed in the _MutablePointsReaderUtils.sort_ stage.
> *Test environment:*
> MacBook Pro (Retina, 15-inch, Mid 2015), 2.2 GHz Intel Core i7, 16 GB 1600 MHz DDR3
> *Java version:*
> java version "1.8.0_161"
> Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
> Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)
> *Testcase:*
> bytesPerDim = [1, 2, 3, 4, 8, 16, 32]
> dim = 1
> doc num = 2,000,000
> warm up 5 times, run 10 times to calculate the average time used.
> *Result:*
>
> ||bytesPerDim\scenario||disable sort doc id (PR branch)||enable sort doc id (master branch)||
> |1|30989.594 us|1151149.9 us|
> |2|313469.47 us|1115595.1 us|
> |3|844617.8 us|1465465.1 us|
> |4|1350946.8 us|1465465.1 us|
> |8|1344814.6 us|1458115.5 us|
> |16|1344516.6 us|1459849.6 us|
> |32|1386847.8 us|1583097.5 us|
> !benchmark_data.png|width=580,height=283!
> The results show that, with DocId sorting disabled, sorting runs 1.73x to 37x
> faster when there are many duplicate byte values (bytesPerDim = 1, 2 or 3).
> When data cardinality is high (bytesPerDim >= 4, where the test generates
> random bytes that are more scattered and unlikely to repeat), performance
> does not regress and is still slightly better.
> In conclusion, in the end-to-end process of building a BKD index, which
> relies on BKDWriter for some data types, performance can be improved by
> ignoring DocIds when they are already monotonically increasing.
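The stability argument in the LUCENE-9932 description can be illustrated with a small, self-contained sketch. It does not touch Lucene's MSB radix sort; it swaps in the JDK's `Arrays.sort` on object arrays, which is documented to be stable, and assumes Java 9+ for `Arrays.compareUnsigned`. The `Point` class and method names are hypothetical. The idea shown: if points arrive in increasing DocId order and the sort is stable, comparing on the bytes alone yields the same order as comparing on (bytes, DocId).

```java
import java.util.Arrays;

/** Illustration only: hypothetical names, JDK stable sort instead of Lucene's radix sort. */
class StableSortDocIds {

  /** One (value, docId) pair, standing in for a single point fed to the BKD writer. */
  static final class Point {
    final byte[] value;
    final int docId;

    Point(byte[] value, int docId) {
      this.value = value;
      this.docId = docId;
    }
  }

  /** Sorts on the value bytes only; the stable sort keeps ties in arrival (docId) order. */
  static void sortByValueOnly(Point[] points) {
    Arrays.sort(points, (a, b) -> Arrays.compareUnsigned(a.value, b.value));
  }

  /** Checks the order that a (value, docId) comparator would have produced. */
  static boolean isSortedByValueThenDocId(Point[] points) {
    for (int i = 1; i < points.length; i++) {
      int cmp = Arrays.compareUnsigned(points[i - 1].value, points[i].value);
      if (cmp > 0 || (cmp == 0 && points[i - 1].docId > points[i].docId)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    byte[] values = {3, 1, 3, 1, 2, 1};
    Point[] points = new Point[values.length];
    for (int docId = 0; docId < values.length; docId++) {
      points[docId] = new Point(new byte[] {values[docId]}, docId); // docIds arrive in increasing order
    }
    sortByValueOnly(points);
    System.out.println("sorted by (value, docId): " + isSortedByValueThenDocId(points));
  }
}
```

With an unstable sort the same check could fail for duplicate values, which is why the description insists on a stable radix sort and a merge-sort fallback.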
[jira] [Commented] (LUCENE-9958) Performance regression when a minimum number of matching SHOULD clauses is required
[ https://issues.apache.org/jira/browse/LUCENE-9958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344547#comment-17344547 ] ASF subversion and git services commented on LUCENE-9958: - Commit 2c04ab58353eb56d254b09ba075ff33e20e9d329 in lucene's branch refs/heads/main from Adrien Grand [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=2c04ab5 ] LUCENE-9958: Fixed performance regression for boolean queries that configure a minimum number of matching clauses.
[jira] [Commented] (LUCENE-9958) Performance regression when a minimum number of matching SHOULD clauses is required
[ https://issues.apache.org/jira/browse/LUCENE-9958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344544#comment-17344544 ]

Adrien Grand commented on LUCENE-9958:

The fix is embarrassingly simple. In short, WANDScorer would only agree to leave scorers behind if the sum of their scores could not be competitive. However, it is also OK to leave {{minShouldMatch-1}} scorers behind regardless of their score, since there cannot be a hit without at least {{minShouldMatch}} matching scorers.

{code:java}
diff --git a/lucene/core/src/java/org/apache/lucene/search/WANDScorer.java b/lucene/core/src/java/org/apache/lucene/search/WANDScorer.java
index f33af6b8ee8..f5bab49fb71 100644
--- a/lucene/core/src/java/org/apache/lucene/search/WANDScorer.java
+++ b/lucene/core/src/java/org/apache/lucene/search/WANDScorer.java
@@ -548,7 +548,7 @@ final class WANDScorer extends Scorer {
   /** Insert an entry in 'tail' and evict the least-costly scorer if full. */
   private DisiWrapper insertTailWithOverFlow(DisiWrapper s) {
-    if (tailMaxScore + s.maxScore < minCompetitiveScore) {
+    if (tailMaxScore + s.maxScore < minCompetitiveScore || tailSize + 1 < minShouldMatch) {
       // we have free room for this new entry
       addTail(s);
       tailMaxScore += s.maxScore;
{code}

Here are updated results from luceneutil where the baseline is origin/main and the patch is the above 1-line change:

{noformat}
Task       QPS baseline  StdDev   QPS patch  StdDev     Pct diff                   p-value
PKLookup         248.11  (4.1%)      235.92  (4.0%)        -4.9% ( -12% -    3%)     0.000
MSM7             203.98  (3.0%)      199.78  (4.2%)        -2.1% (  -8% -    5%)     0.075
MSM3              20.09  (3.0%)       20.34  (3.2%)         1.2% (  -4% -    7%)     0.212
MSM1              20.15  (2.9%)       20.44  (3.5%)         1.4% (  -4% -    8%)     0.162
MSM2              20.14  (3.0%)       20.44  (3.4%)         1.5% (  -4% -    8%)     0.141
MSM4              18.93  (3.0%)       20.41  (3.7%)         7.8% (   1% -   14%)     0.000
MSM5               5.11  (4.7%)       23.01 (17.2%)       350.1% ( 313% -  390%)     0.000
MSM6               2.32  (5.2%)       50.64 (92.0%)      2086.0% (1889% - 2304%)     0.000
{noformat}

As we would usually expect, QPS now goes up as the minimum number of required clauses increases.

> Performance regression when a minimum number of matching SHOULD clauses is required
> ---
>
> Key: LUCENE-9958
> URL: https://issues.apache.org/jira/browse/LUCENE-9958
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Adrien Grand
> Priority: Minor
>
> Opening this issue on behalf of [~mattweber], who reported this at https://discuss.elastic.co/t/es-7-7-1-es-7-12-0-wand-performance-issue/272854.
> It looks like the fact that we introduced dynamic pruning for queries that already have a minimum number of SHOULD clauses configured makes things _slower_, at least in some cases.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
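The condition changed by the patch quoted above can be isolated as a small predicate. The following is only a sketch, not Lucene code: the class `TailAdmission` and the method `hasFreeRoom` are hypothetical names, and plain `long` parameters stand in for the scorer's internal state. Only the boolean expression itself is taken from the diff.

```java
/** Toy model of the admission test in the patched insertTailWithOverFlow; not Lucene code. */
final class TailAdmission {

  /** Returns true when one more scorer may be left behind in the tail without evicting anything. */
  static boolean hasFreeRoom(
      long tailMaxScore,
      int tailSize,
      long candidateMaxScore,
      long minCompetitiveScore,
      int minShouldMatch) {
    // Original rule: even with the candidate, the tail cannot produce a competitive score.
    boolean notCompetitive = tailMaxScore + candidateMaxScore < minCompetitiveScore;
    // Added rule: with fewer than minShouldMatch scorers, the tail alone can never produce a hit.
    boolean tooFewToMatch = tailSize + 1 < minShouldMatch;
    return notCompetitive || tooFewToMatch;
  }

  public static void main(String[] args) {
    // Scores are competitive (4 + 3 >= 5), but 3 scorers cannot satisfy minShouldMatch = 5.
    System.out.println(hasFreeRoom(4, 2, 3, 5, 5)); // true
    // Same scores with minShouldMatch = 1: neither rule applies.
    System.out.println(hasFreeRoom(4, 2, 3, 5, 1)); // false
    // Scores are not competitive (1 + 3 < 5): the original rule applies.
    System.out.println(hasFreeRoom(1, 2, 3, 5, 1)); // true
  }
}
```

The second clause is what lets queries with a high minShouldMatch skip more work: the larger the required number of matching clauses, the more scorers can be left behind without looking at their scores.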
[GitHub] [lucene] LuXugang opened a new pull request #139: [LUCENE-9957: Use DirectMonotonicWriter to store sorted Values in NumericDocValues/SortedNumericDocValues
LuXugang opened a new pull request #139: URL: https://github.com/apache/lucene/pull/139 The method Lucene90DocValuesConsumer#writeValues(FieldInfo field, DocValuesProducer valuesProducer) already visits all values, so we can check at the same time whether all values are sorted. If they are, then after the docIds have been written we use DirectMonotonicWriter to write all values and return. This can give relatively impressive compression.
[jira] [Comment Edited] (LUCENE-9958) Performance regression when a minimum number of matching SHOULD clauses is required
[ https://issues.apache.org/jira/browse/LUCENE-9958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344521#comment-17344521 ]

Adrien Grand edited comment on LUCENE-9958 at 5/14/21, 11:20 AM:

Good news is that it's easy to reproduce. Using the following tasks file

{noformat}
MSM1: ref http from mostly interview 9 hard
MSM2: ref http from mostly interview 9 hard +minShouldMatch=2
MSM3: ref http from mostly interview 9 hard +minShouldMatch=3
MSM4: ref http from mostly interview 9 hard +minShouldMatch=4
MSM5: ref http from mostly interview 9 hard +minShouldMatch=5
MSM6: ref http from mostly interview 9 hard +minShouldMatch=6
MSM7: ref http from mostly interview 9 hard +minShouldMatch=7
{noformat}

I got the following results on wikimedium10m where baseline is origin/main and the patch reverts LUCENE-9346:

{noformat}
Task       QPS baseline  StdDev   QPS patch  StdDev      Pct diff                   p-value
MSM2              20.22  (3.7%)        1.94   (0.2%)       -90.4% ( -90% -  -89%)     0.000
MSM3              20.14  (3.7%)        3.00   (0.7%)       -85.1% ( -86% -  -83%)     0.000
MSM4              18.95  (3.6%)        8.81   (2.5%)       -53.5% ( -57% -  -49%)     0.000
PKLookup         250.33  (3.5%)      230.62   (3.7%)        -7.9% ( -14% -    0%)     0.000
MSM7             202.13  (4.2%)      199.17   (3.3%)        -1.5% (  -8% -    6%)     0.216
MSM1              20.24  (3.7%)       20.81   (3.3%)         2.9% (  -4% -   10%)     0.010
MSM5               5.04  (5.5%)       29.43  (33.8%)       483.5% ( 420% -  553%)     0.000
MSM6               2.28  (6.1%)       90.03 (298.1%)      3852.9% (3343% - 4428%)     0.000
{noformat}
[jira] [Commented] (LUCENE-9958) Performance regression when a minimum number of matching SHOULD clauses is required
[ https://issues.apache.org/jira/browse/LUCENE-9958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344521#comment-17344521 ]

Adrien Grand commented on LUCENE-9958:

Good news is that it's easy to reproduce. Using the following tasks file

{noformat}
MSM1: ref http from mostly interview 9 hard
MSM2: ref http from mostly interview 9 hard +minShouldMatch=2
MSM3: ref http from mostly interview 9 hard +minShouldMatch=3
MSM4: ref http from mostly interview 9 hard +minShouldMatch=4
MSM5: ref http from mostly interview 9 hard +minShouldMatch=5
MSM6: ref http from mostly interview 9 hard +minShouldMatch=6
MSM7: ref http from mostly interview 9 hard +minShouldMatch=7
{noformat}

I got the following results on wikimedium10m where baseline is origin/main and the patch reverts LUCENE-9346:

{noformat}
Task       QPS baseline  StdDev   QPS patch  StdDev      Pct diff                   p-value
PKLookup         248.06  (3.6%)      231.47   (4.3%)        -6.7% ( -14% -    1%)     0.000
MSM7             182.44  (3.8%)      181.65   (3.4%)        -0.4% (  -7% -    7%)     0.704
MSM1              19.52  (4.4%)       20.31   (3.8%)         4.1% (  -4% -   12%)     0.002
MSM2               3.27  (3.4%)        4.20   (2.9%)        28.4% (  21% -   35%)     0.000
MSM3               3.09  (4.6%)        6.95   (4.9%)       125.0% ( 110% -  141%)     0.000
MSM4               2.29  (5.7%)        9.85  (15.2%)       329.9% ( 292% -  371%)     0.000
MSM5               2.20  (5.8%)       29.48  (56.8%)      1240.2% (1113% - 1382%)     0.000
MSM6               2.21  (5.8%)       88.95 (223.7%)      3929.4% (3497% - 4414%)     0.000
{noformat}
[jira] [Updated] (LUCENE-9957) Use DirectMonotonicWriter to store sorted Values in NumericDocValues/SortedNumericDocValues
[ https://issues.apache.org/jira/browse/LUCENE-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lu Xugang updated LUCENE-9957: -- Description: When all values were sorted, using DirectMonotonicWriter to store them can get relatively impressive compression (was: When all values were sorted, use DirectMonotonicWriter to store them can get relatively impressive compression) > Use DirectMonotonicWriter to store sorted Values in > NumericDocValues/SortedNumericDocValues > --- > > Key: LUCENE-9957 > URL: https://issues.apache.org/jira/browse/LUCENE-9957 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.8.2 >Reporter: Lu Xugang >Priority: Major > > When all values were sorted, using DirectMonotonicWriter to store them can > get relatively impressive compression
[jira] [Comment Edited] (LUCENE-9957) Use DirectMonotonicWriter to store sorted Values in NumericDocValues/SortedNumericDocValues
[ https://issues.apache.org/jira/browse/LUCENE-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344510#comment-17344510 ]

Lu Xugang edited comment on LUCENE-9957 at 5/14/21, 10:24 AM:

The method Lucene90DocValuesConsumer#writeValues(FieldInfo field, DocValuesProducer valuesProducer) already visits all values, so we can check at the same time whether all values are sorted. If they are, then after the docIds have been written we use DirectMonotonicWriter to write all values and return.

Two conditions have to be met:
# all values are monotonically increasing
# numDocsWithValue == numValue

numDocsWithValue == numValue means the DocValues is either a NumericDocValues or a SortedNumericDocValues in which every document has only one value.

I did some simple tests: indexing 10 million documents into one segment, then measuring only the length of the *.dvd file.

UniqueValues >= 256:
||Loop||Branch(Main)||Branch(PR)||Storage||UniqueValues||
|0|20014826B|14129376B|-29.405%|6321130|
|1|20011970B|13768928B|-31.196%|6322006|
|2|20014826B|14145670B|-29.324%|6321066|
|3|20014826B|14031072B|-29.896%|6319892|
|4|20014826B|14276230B|-28.671%|632|
|5|20014826B|13998304B|-30.060%|6320938|
|6|20014826B|13932768B|-30.387%|6320997|
|7|20014826B|13801696B|-31.042%|6321756|
|8|20014826B|13768928B|-31.206%|6322336|
|9|20014826B|14260448B|-28.750%|6321014|

UniqueValues < 256:
||Loop||Branch(Main)||Branch(PR)||Storage||UniqueValues||
|0|2500076B|66064B|-97.35%|2|
|1|2500076B|66064B|-97.35%|2|
|2|5000076B|82454B|-98.35%|4|
|3|5000076B|82454B|-98.35%|4|
|4|5000076B|115234B|-97.69%|8|
|5|5000076B|115234B|-97.69%|8|
|6|10000076B|180794B|-98.19%|16|
|7|10000076B|180794B|-98.19%|16|
|8|10000076B|311914B|-96.88%|32|
|9|10000076B|311914B|-96.88%|32|
|10|10000076B|574154B|-94.25%|64|
|11|10000076B|574154B|-94.25%|64|
|12|10000076B|1098634B|-89.01%|128|
|13|10000076B|1098634B|-89.01%|128|
|14|10000076B|1303509B|-86.96%|255|
|15|10000076B|1303509B|-86.96%|255|
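The two conditions from the edited comment can be checked in a single pass over the values. The sketch below is an illustration only: `SortedValuesCheck` and `canUseMonotonicEncoding` are hypothetical names, a plain `long[][]` stands in for the doc values being written, and "monotone increased" is read here as non-decreasing, which is an assumption.

```java
/** Hypothetical helper illustrating the two conditions; not part of Lucene. */
final class SortedValuesCheck {

  /**
   * @param valuesPerDoc values of each document in doc ID order; an empty array means the
   *     document has no value for the field
   * @return true when all values are non-decreasing and every document with a value has exactly one
   */
  static boolean canUseMonotonicEncoding(long[][] valuesPerDoc) {
    long numDocsWithValue = 0;
    long numValues = 0;
    long previous = Long.MIN_VALUE;
    boolean sorted = true;
    for (long[] docValues : valuesPerDoc) {
      if (docValues.length == 0) {
        continue; // documents without a value are not counted
      }
      numDocsWithValue++;
      for (long value : docValues) {
        numValues++;
        if (value < previous) {
          sorted = false; // condition 1 is violated
        }
        previous = value;
      }
    }
    // Condition 2: one value per document that has a value.
    return sorted && numDocsWithValue == numValues;
  }

  public static void main(String[] args) {
    System.out.println(canUseMonotonicEncoding(new long[][] {{1}, {3}, {3}, {7}})); // true
    System.out.println(canUseMonotonicEncoding(new long[][] {{1}, {5}, {2}}));      // false
    System.out.println(canUseMonotonicEncoding(new long[][] {{1, 2}, {3}}));        // false
  }
}
```

Because the check piggybacks on a pass that already visits every value, it adds only a comparison and two counters per value.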
[jira] [Commented] (LUCENE-9957) Use DirectMonotonicWriter to store sorted Values in NumericDocValues/SortedNumericDocValues
[ https://issues.apache.org/jira/browse/LUCENE-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344510#comment-17344510 ]

Lu Xugang commented on LUCENE-9957:

I did some simple tests: indexing 10 million documents into one segment.

UniqueValues >= 256:
||Loop||Branch(Main)||Branch(PR)||Storage||UniqueValues||
|0|20014826B|14129376B|-29.405%|6321130|
|1|20011970B|13768928B|-31.196%|6322006|
|2|20014826B|14145670B|-29.324%|6321066|
|3|20014826B|14031072B|-29.896%|6319892|
|4|20014826B|14276230B|-28.671%|632|
|5|20014826B|13998304B|-30.060%|6320938|
|6|20014826B|13932768B|-30.387%|6320997|
|7|20014826B|13801696B|-31.042%|6321756|
|8|20014826B|13768928B|-31.206%|6322336|
|9|20014826B|14260448B|-28.750%|6321014|

UniqueValues < 256:
||Loop||Branch(Main)||Branch(PR)||Storage||UniqueValues||
|0|2500076B|66064B|-97.35%|2|
|1|2500076B|66064B|-97.35%|2|
|2|5000076B|82454B|-98.35%|4|
|3|5000076B|82454B|-98.35%|4|
|4|5000076B|115234B|-97.69%|8|
|5|5000076B|115234B|-97.69%|8|
|6|10000076B|180794B|-98.19%|16|
|7|10000076B|180794B|-98.19%|16|
|8|10000076B|311914B|-96.88%|32|
|9|10000076B|311914B|-96.88%|32|
|10|10000076B|574154B|-94.25%|64|
|11|10000076B|574154B|-94.25%|64|
|12|10000076B|1098634B|-89.01%|128|
|13|10000076B|1098634B|-89.01%|128|
|14|10000076B|1303509B|-86.96%|255|
|15|10000076B|1303509B|-86.96%|255|
[jira] [Created] (LUCENE-9959) Can we remove threadlocals of stored fields and term vectors
Adrien Grand created LUCENE-9959: Summary: Can we remove threadlocals of stored fields and term vectors Key: LUCENE-9959 URL: https://issues.apache.org/jira/browse/LUCENE-9959 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand [~rmuir] suggested removing these threadlocals at https://github.com/apache/lucene/pull/137#issuecomment-840111367. These threadlocals are trappy if you manage many segments and threads within the same JVM, or worse: non-fixed threadpools. The challenge is to keep the API easy to use. We could take advantage of 9.0 to change the stored fields API?
[GitHub] [lucene] jpountz commented on pull request #137: LUCENE-9955: Reduced state of stored fields readers.
jpountz commented on pull request #137: URL: https://github.com/apache/lucene/pull/137#issuecomment-841140765 Agreed we should look into this! I opened https://issues.apache.org/jira/browse/LUCENE-9959.
[jira] [Created] (LUCENE-9957) Use DirectMonotonicWriter to store sortedValues in NumericDocValues/SortedNumericDocValues
Lu Xugang created LUCENE-9957: - Summary: Use DirectMonotonicWriter to store sortedValues in NumericDocValues/SortedNumericDocValues Key: LUCENE-9957 URL: https://issues.apache.org/jira/browse/LUCENE-9957 Project: Lucene - Core Issue Type: Improvement Components: core/codecs Affects Versions: 8.8.2 Reporter: Lu Xugang When all values were sorted, use DirectMonotonicWriter to store them can get relatively impressive compression
[jira] [Created] (LUCENE-9958) Performance regression when a minimum number of matching SHOULD clauses is required
Adrien Grand created LUCENE-9958: Summary: Performance regression when a minimum number of matching SHOULD clauses is required Key: LUCENE-9958 URL: https://issues.apache.org/jira/browse/LUCENE-9958 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Opening this issue on behalf of [~mattweber], who reported this at https://discuss.elastic.co/t/es-7-7-1-es-7-12-0-wand-performance-issue/272854. It looks like the fact that we introduced dynamic pruning for queries that already have a minimum number of SHOULD clauses configured makes things _slower_, at least in some cases.
[jira] [Updated] (LUCENE-9957) Use DirectMonotonicWriter to store sorted Values in NumericDocValues/SortedNumericDocValues
[ https://issues.apache.org/jira/browse/LUCENE-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lu Xugang updated LUCENE-9957: -- Summary: Use DirectMonotonicWriter to store sorted Values in NumericDocValues/SortedNumericDocValues (was: Use DirectMonotonicWriter to store sortedValues in NumericDocValues/SortedNumericDocValues)
[GitHub] [lucene] jpountz commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query
jpountz commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-841124271

> in the jira ticket you had suggested to use BMM for top-level (flat?) boolean query only. Do you think this will need to be fixed?

I opened this JIRA ticket because it felt like we could do better for top-level disjunctions, but if BMM appears to work better most of the time, we could just move to it all the time.

> The one result that does show negative impact to AndMedOrHighHigh also shows impact to OrHighMed, so it’s a bit strange and may need further looking into to see the cause.

Yeah, I suspect there will always be cases when BMW will perform better than BMM or vice-versa, sometimes for subtle reasons.
[jira] [Resolved] (LUCENE-9932) Performance improvement for BKD index building
[ https://issues.apache.org/jira/browse/LUCENE-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand resolved LUCENE-9932.

Fix Version/s: 8.9
Resolution: Fixed

> Performance improvement for BKD index building
> ---
>
> Key: LUCENE-9932
> URL: https://issues.apache.org/jira/browse/LUCENE-9932
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index
> Affects Versions: 8.8.2
> Reporter: neoremind
> Priority: Critical
> Fix For: 8.9
> Attachments: benchmark_data.png, flame-graph.png, refined-code-benchmark.png, refined-code-benchmark2.png
> Time Spent: 12.5h
> Remaining Estimate: 0h
>
> In BKD index building, the input bytes must be sorted before calling the BKD writer related API. The sort uses the MSB radix sort algorithm, and the comparison takes both the bytes themselves and the DocId into account, but in real cases DocIds are usually monotonically increasing. This offers a possible performance improvement. I found it while digging into a performance issue in our system and then researched a possible solution.
>
> DocId is usually increased by one when building the index in a thread-safe way. Assuming this holds, the comparison can drop the unnecessary input, DocId, and compare only the bytes. For this to work, MSB radix sort and its fallback sort must be *stable*, so that equal elements keep the order in which they were added, which keeps DocIds monotonically increasing. Making MSB radix sort stable needs a trivial update; making the fallback sort stable means using merge sort instead of quicksort. There should also be a switch to turn the stable option on or off.
>
> To validate how much performance could be gained, I made a benchmark that records only the time elapsed in the _MutablePointsReaderUtils.sort_ stage.
>
> *Test environment:*
> MacBook Pro (Retina, 15-inch, Mid 2015), 2.2 GHz Intel Core i7, 16 GB 1600 MHz DDR3
>
> *Java version:*
> java version "1.8.0_161"
> Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
> Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)
>
> *Testcase:*
> bytesPerDim = [1, 2, 3, 4, 8, 16, 32]
> dim = 1
> doc num = 2,000,000
> warm up 5 times, run 10 times to calculate the average time used.
>
> *Result:*
>
> ||bytesPerDim\scenario||disable sort doc id (PR branch)||enable sort doc id (master branch)||
> |1|30989.594 us|1151149.9 us|
> |2|313469.47 us|1115595.1 us|
> |3|844617.8 us|1465465.1 us|
> |4|1350946.8 us|1465465.1 us|
> |8|1344814.6 us|1458115.5 us|
> |16|1344516.6 us|1459849.6 us|
> |32|1386847.8 us|1583097.5 us|
>
> !benchmark_data.png|width=580,height=283!
>
> The result shows that, by disabling the DocId sort, sorting runs 1.73x to 37x faster when there are many duplicate bytes (bytesPerDim = 1, 2 or 3). When data cardinality is high (bytesPerDim >= 4; the test cases generate random bytes that are more scattered and unlikely to be duplicates), performance does not regress and is still slightly better.
>
> In conclusion, in the end-to-end process of building a BKD index, which relies on BKDWriter for some data types, performance could be better by ignoring DocIds if they are already monotonically increasing.
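The role of stability in the description above can be shown with a small sketch. It is not the Lucene implementation: it swaps in the JDK's `Arrays.sort(T[], Comparator)`, which is documented as stable, in place of MSB radix sort, and the names `StableSortDemo`, `Point` and `pointsInDocIdOrder` are hypothetical. Assuming points arrive in increasing doc ID order, sorting by value alone gives the same order as sorting by value and then doc ID.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical demo; uses the JDK's stable object sort rather than Lucene's MSB radix sort. */
final class StableSortDemo {

  static final class Point {
    final int value;
    final int docId;

    Point(int value, int docId) {
      this.value = value;
      this.docId = docId;
    }
  }

  /** Builds points whose doc IDs increase in insertion order: 0, 1, 2, ... */
  static Point[] pointsInDocIdOrder(int... values) {
    Point[] points = new Point[values.length];
    for (int docId = 0; docId < values.length; docId++) {
      points[docId] = new Point(values[docId], docId);
    }
    return points;
  }

  /** Stable sort that compares values only; ties keep their insertion (doc ID) order. */
  static Point[] sortByValueOnly(Point[] points) {
    Point[] copy = points.clone();
    Arrays.sort(copy, Comparator.comparingInt((Point p) -> p.value));
    return copy;
  }

  /** Reference ordering that also compares doc IDs explicitly. */
  static Point[] sortByValueThenDocId(Point[] points) {
    Point[] copy = points.clone();
    Arrays.sort(copy, Comparator.comparingInt((Point p) -> p.value).thenComparingInt(p -> p.docId));
    return copy;
  }

  /** True when both arrays hold the same Point instances in the same positions. */
  static boolean sameOrder(Point[] a, Point[] b) {
    if (a.length != b.length) {
      return false;
    }
    for (int i = 0; i < a.length; i++) {
      if (a[i] != b[i]) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    Point[] input = pointsInDocIdOrder(2, 1, 2, 1, 0, 2);
    System.out.println(sameOrder(sortByValueOnly(input), sortByValueThenDocId(input))); // true
  }
}
```

With an unstable sort the first method could reorder equal values, which is why the description requires both the radix sort and its fallback to be stable before the doc ID comparison can be dropped.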
[jira] [Commented] (LUCENE-9725) Allow BM25FQuery to use other similarities
[ https://issues.apache.org/jira/browse/LUCENE-9725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344393#comment-17344393 ] ASF subversion and git services commented on LUCENE-9725: - Commit e4d4438e047eb68110e6d7c6242c11c3fcd121e2 in lucene-solr's branch refs/heads/branch_8x from Adrien Grand [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e4d4438 ] LUCENE-9725: Remove unused imports. > Allow BM25FQuery to use other similarities > -- > > Key: LUCENE-9725 > URL: https://issues.apache.org/jira/browse/LUCENE-9725 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Julie Tibshirani >Assignee: Julie Tibshirani >Priority: Major > Fix For: 8.9 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > From a high level, BM25FQuery works as follows: > # Given a list of fields and weights, it pretends there's a synthetic > combined field where all terms have been indexed. It computes new term and > collection statistics for this combined field. > # It uses a disjunction iterator and BM25Similarity to score the documents. > The steps are (1) compute statistics that represent the combined field > content, and (2) pass these to a similarity function. There is nothing really > specific to BM25Similarity in this approach. In step 2, we could use another > similarity, for example BooleanSimilarity or those based on language models > like LMDirichletSimilarity. The main restriction is that norms have to be > additive (the norm of the combined field must be the sum of the field norms). > Maybe we could unhardcode BM25Similarity in BM25FQuery and instead use the > one configured on IndexSearcher. We could think of this as providing a > sensible default approach to cross-field scoring for many similarities. It's > an incremental step towards LUCENE-8711, which would give similarities more > fine-grained control over how stats/ scores are combined across fields. 
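The two steps in the LUCENE-9725 description above can be sketched as follows. This is a simplified illustration under stated assumptions, not the BM25FQuery implementation: `CombinedFieldStats`, `ToySimilarity` and `combine` are hypothetical names, the combined statistics are taken to be weighted sums across fields, and the constants in `main` are made up for the demo.

```java
/** Hypothetical illustration of combined-field statistics; not the BM25FQuery implementation. */
final class CombinedFieldStats {

  /** Similarity stand-in: scores a document from its combined term frequency and length. */
  interface ToySimilarity {
    double score(double freq, double length);
  }

  /** Step 1: weighted sum of one statistic across fields (term frequency or field length). */
  static double combine(double[] perField, double[] weights) {
    if (perField.length != weights.length) {
      throw new IllegalArgumentException("one weight per field is required");
    }
    double sum = 0;
    for (int i = 0; i < perField.length; i++) {
      sum += weights[i] * perField[i];
    }
    return sum;
  }

  public static void main(String[] args) {
    double[] weights = {2, 1}; // assumed weights: title counts double, body once
    double freq = combine(new double[] {1, 3}, weights);     // 2*1 + 1*3
    double length = combine(new double[] {4, 100}, weights); // 2*4 + 1*100

    // Step 2: any similarity that accepts the combined statistics can be plugged in.
    ToySimilarity booleanLike = (f, len) -> f > 0 ? 1.0 : 0.0;
    double k1 = 1.2, b = 0.75, avgLength = 120; // assumed constants for the demo
    ToySimilarity bm25Like = (f, len) -> f / (f + k1 * (1 - b + b * len / avgLength));

    System.out.println(freq + " " + length);
    System.out.println(booleanLike.score(freq, length));
    System.out.println(bm25Like.score(freq, length));
  }
}
```

Nothing in step 1 depends on which similarity is used in step 2, which is the observation behind the proposal; the additive combination of lengths mirrors the restriction that norms have to be additive.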
[jira] [Commented] (LUCENE-9932) Performance improvement for BKD index building
[ https://issues.apache.org/jira/browse/LUCENE-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344394#comment-17344394 ] ASF subversion and git services commented on LUCENE-9932: - Commit 86b6d35f7229010757924da3312ffb2fe72b17f4 in lucene-solr's branch refs/heads/branch_8x from neoReMinD [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=86b6d35 ] LUCENE-9932: Performance improvement for BKD index building
[GitHub] [lucene] jpountz commented on pull request #91: LUCENE-9932: Performance improvement for BKD index building
jpountz commented on pull request #91: URL: https://github.com/apache/lucene/pull/91#issuecomment-841080232 @neoremind I enjoyed it too. Thanks for identifying this opportunity for speedup and going through the many feedback iterations.
[jira] [Commented] (LUCENE-9932) Performance improvement for BKD index building
[ https://issues.apache.org/jira/browse/LUCENE-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344387#comment-17344387 ]

ASF subversion and git services commented on LUCENE-9932:
---------------------------------------------------------

Commit 8e94a591d8d7287844ae999e22f9290e113197ff in lucene's branch refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=8e94a59 ]

LUCENE-9932: Fix test bug.

> Performance improvement for BKD index building
> ----------------------------------------------
>
> Key: LUCENE-9932
> URL: https://issues.apache.org/jira/browse/LUCENE-9932
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index
> Affects Versions: 8.8.2
> Reporter: neoremind
> Priority: Critical
> Attachments: benchmark_data.png, flame-graph.png, refined-code-benchmark.png, refined-code-benchmark2.png
> Time Spent: 12h 20m
> Remaining Estimate: 0h
>
> In BKD index building, the input bytes must be sorted before calling the BKD writer related API. The sorting method uses the MSB radix sort algorithm, and the comparison takes both the bytes themselves and the DocId into account. In real cases, however, DocIds are usually monotonically increasing, which opens up a possible performance improvement. I found this while digging into a performance issue in our system, and then researched a possible solution.
>
> DocId usually increases by one when the index is built in a thread-safe way. Assuming that condition holds, the comparison can drop the unnecessary DocId input and compare only the bytes. For this to work, MSB radix sort and its fallback sort must be *stable*, so that equal elements keep the order in which they were added, which leaves DocIds monotonically increasing. Making MSB radix sort stable needs only a trivial update; to make the fallback sort stable, use merge sort instead of quick sort. There should also be a switch that turns the stable option on or off.
>
> To validate how much performance could be gained, I wrote a benchmark that records only the time elapsed in the _MutablePointsReaderUtils.sort_ stage.
>
> *Test environment:*
> MacBook Pro (Retina, 15-inch, Mid 2015), 2.2 GHz Intel Core i7, 16 GB 1600 MHz DDR3
>
> *Java version:*
> java version "1.8.0_161"
> Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
> Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)
>
> *Testcase:*
> bytesPerDim = [1, 2, 3, 4, 8, 16, 32]
> dim = 1
> doc num = 2,000,000
> Warm up 5 times, then run 10 times and take the average time.
>
> *Result:*
>
> ||bytesPerDim\scenario||disable sort doc id (PR branch)||enable sort doc id (master branch)||
> |1|30989.594 us|1151149.9 us|
> |2|313469.47 us|1115595.1 us|
> |3|844617.8 us|1465465.1 us|
> |4|1350946.8 us|1465465.1 us|
> |8|1344814.6 us|1458115.5 us|
> |16|1344516.6 us|1459849.6 us|
> |32|1386847.8 us|1583097.5 us|
>
> !benchmark_data.png|width=580,height=283!
>
> The results show that, with DocId sorting disabled, sorting runs 1.73x to 37x faster when there are many duplicate bytes (bytesPerDim = 1, 2 or 3). When data cardinality is high (bytesPerDim >= 4, where the test cases generate random bytes that are more scattered and unlikely to be duplicates), performance does not regress and is still slightly better.
>
> In conclusion, in the end-to-end process of building a BKD index, which relies on BKDWriter for some data types, performance could be improved by ignoring DocIds when they are already monotonically increasing.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
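The stability argument in the LUCENE-9932 description can be illustrated with a small, self-contained Java sketch. It is not the code from the pull request: `StableSortSketch`, `Point` and the helper methods are hypothetical names, an `int` stands in for the point bytes, and `List.sort` (documented as a stable sort) stands in for the stable MSB radix sort and merge sort fallback. Under the assumption that DocIds arrive in increasing order, sorting on the value alone with a stable sort should give the same order as sorting on value and then DocId.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class StableSortSketch {

    /** Hypothetical stand-in for one indexed point; not a Lucene class. */
    public static final class Point {
        final int value; // stands in for the packed bytes of the point
        final int docId;

        Point(int value, int docId) {
            this.value = value;
            this.docId = docId;
        }
    }

    /** Points with many duplicate values and docIds that increase by one. */
    public static List<Point> samplePoints(int count) {
        List<Point> points = new ArrayList<>();
        for (int docId = 0; docId < count; docId++) {
            points.add(new Point((docId * 31) % 5, docId));
        }
        return points;
    }

    /** Baseline: compare the value first, then break ties on docId. */
    public static List<Point> sortByValueThenDocId(List<Point> input) {
        List<Point> sorted = new ArrayList<>(input);
        sorted.sort(Comparator.comparingInt((Point p) -> p.value)
                .thenComparingInt(p -> p.docId));
        return sorted;
    }

    /** Compare the value only; List.sort is stable, so ties keep insertion order. */
    public static List<Point> stableSortByValueOnly(List<Point> input) {
        List<Point> sorted = new ArrayList<>(input);
        sorted.sort(Comparator.comparingInt((Point p) -> p.value));
        return sorted;
    }

    /** True when both lists hold the same (value, docId) pairs in the same order. */
    public static boolean sameOrder(List<Point> a, List<Point> b) {
        if (a.size() != b.size()) {
            return false;
        }
        for (int i = 0; i < a.size(); i++) {
            if (a.get(i).value != b.get(i).value || a.get(i).docId != b.get(i).docId) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Point> points = samplePoints(1000);
        // Both strategies agree because the input docIds are already increasing.
        System.out.println(sameOrder(sortByValueThenDocId(points), stableSortByValueOnly(points)));
    }
}
```

The sketch only checks that the two orderings agree; it says nothing about speed. The equivalence depends on the input DocIds being in increasing order, which is presumably why the description asks for a switch to turn the stable option on or off.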
[GitHub] [lucene] neoremind commented on pull request #91: LUCENE-9932: Performance improvement for BKD index building
neoremind commented on pull request #91: URL: https://github.com/apache/lucene/pull/91#issuecomment-841074895 @jpountz It's great to work with you on this optimization :smile: Thanks for taking so much time to help me.
[jira] [Commented] (LUCENE-9932) Performance improvement for BKD index building
[ https://issues.apache.org/jira/browse/LUCENE-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344379#comment-17344379 ]

ASF subversion and git services commented on LUCENE-9932:
---------------------------------------------------------

Commit fd4b3c81d517e0cf9d804211b0785721cdfd1a6c in lucene's branch refs/heads/main from neoReMinD
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=fd4b3c8 ]

LUCENE-9932: Performance improvement for BKD index building (#91)
[GitHub] [lucene] jpountz merged pull request #91: LUCENE-9932: Performance improvement for BKD index building
jpountz merged pull request #91: URL: https://github.com/apache/lucene/pull/91