date:20210514

[GitHub] [lucene] zacharymorn commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query

2021-05-14 Thread GitBox



zacharymorn commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-841606431


   > > in the jira ticket you had suggested to use BMM for top-level (flat?) boolean query only. Do you think this will need to be fixed?
   > 
   > I opened this JIRA ticket because it felt like we could do better for top-level disjunctions, but if BMM appears to work better most of the time, we could just move to it all the time.
   > 
   > > The one result that does show negative impact to AndMedOrHighHigh also shows impact to OrHighMed, so it's a bit strange and may need further looking into to see the cause.
   > 
   > Yeah, I suspect there will always be cases when BMW will perform better than BMM or vice-versa, sometimes for subtle reasons.
   
   Makes sense! I'll not attempt to fix it for now then.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] gsmiller commented on pull request #133: LUCENE-9950: New facet counting implementation for general string doc value fields

2021-05-14 Thread GitBox



gsmiller commented on pull request #133:
URL: https://github.com/apache/lucene/pull/133#issuecomment-841561720


   I went ahead and added a sparse counting approach since it wasn't 
complicated to do. I borrowed heuristics and some logic from 
`IntTaxonomyFacets` in doing so.
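
For readers following along, the dense-versus-sparse choice can be sketched roughly as below. This is only an illustration of the general idea, not the code in the PR: the class name `OrdinalCounter` and both thresholds are assumptions made up for the example, and a plain `HashMap` stands in for whatever primitive map the real code uses.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical counter that picks a dense or sparse layout up front; not Lucene code. */
final class OrdinalCounter {
  // Assumed thresholds, chosen only for illustration.
  private static final int MIN_ORDINALS_FOR_SPARSE = 1024;
  private static final int SPARSE_HIT_FRACTION = 10;

  private final int[] dense; // one slot per ordinal, or null when counting sparsely
  private final Map<Integer, Integer> sparse; // ordinal -> count, or null when counting densely

  OrdinalCounter(int ordinalCount, int hitCount, int maxDoc) {
    // Go sparse only when the ordinal space is large and few documents matched.
    boolean useSparse =
        ordinalCount >= MIN_ORDINALS_FOR_SPARSE && hitCount < maxDoc / SPARSE_HIT_FRACTION;
    dense = useSparse ? null : new int[ordinalCount];
    sparse = useSparse ? new HashMap<>() : null;
  }

  boolean isSparse() {
    return sparse != null;
  }

  void increment(int ord) {
    if (sparse != null) {
      sparse.merge(ord, 1, Integer::sum);
    } else {
      dense[ord]++;
    }
  }

  int get(int ord) {
    return sparse != null ? sparse.getOrDefault(ord, 0) : dense[ord];
  }
}
```

With these assumed thresholds, a restrictive query over a high-cardinality field allocates memory proportional to the ordinals actually seen, while a broad query keeps the cheaper array lookups.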



[jira] [Commented] (LUCENE-9956) Make getBaseQuery API from DrillDownQuery public

2021-05-14 Thread Greg Miller (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344925#comment-17344925
 ] 

Greg Miller commented on LUCENE-9956:
-

Ah, thanks [~gworah]. Sounds like a good use-case for needing to expose the 
base query then. What I'm hearing is that there are optimizations that should 
_only_ apply to the base query, so {{rewriting}} the query isn't helpful here 
since that will produce a {{BooleanQuery}} that contains all of the drill down 
dims applied. Thanks for clarifying!

+1 to adding public access to the base query. I left a comment on your PR with 
regards to the approach of making drill down dims public as well. I like making 
everything public for consistency, but if that work kind of balloons, I don't 
mind tackling just the base query for now. That's my opinion at least.

> Make getBaseQuery API from DrillDownQuery public 
> -
>
> Key: LUCENE-9956
> URL: https://issues.apache.org/jira/browse/LUCENE-9956
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: main (9.0)
>Reporter: Gautam Worah
>Priority: Trivial
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> It would be great if users could access the baseQuery of a DrillDownQuery. I 
> think this can be useful for folks who want to access/test the clauses of a 
> BooleanQuery (for example) after they've already wrapped it into a 
> DrillDownQuery.
>  
>  Currently the {{Query getBaseQuery()}} method is package private by default.
> If this proposed change does not make sense, or if this change breaks the 
> semantic of the class, I am happy to explore other ways of doing this!
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (LUCENE-9956) Make getBaseQuery API from DrillDownQuery public

2021-05-14 Thread Gautam Worah (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344924#comment-17344924
 ] 

Gautam Worah commented on LUCENE-9956:
--

Here is why I need just the baseQuery and drill down queries from a 
{{DrillDownQuery}} object.
I have an initial {{DrillDownQuery}} that I construct by parsing the user's 
{{BooleanQuery}}, and I add custom {{subQuery}}s to it using the 
{{add(String dim, Query subQuery)}} API.

Later on, I try to optimize the base {{BooleanQuery}} (remove some terms), 
create a new {{DrillDownQuery}} object, and add the original {{subQuery}}s 
back.

Can we make {{rewrite()}} do this? Yes.

Pros: Does not expose the {{baseQuery}} and limits access to it.
Cons: 
 1. {{rewrite}} returns a combined form of the {{DrillDownQuery}}, so to speak. 
The user has to parse the first BQ clause again to get the {{baseQuery}}, 
then parse all the remaining clauses to get the {{subQuery}}s. 
 2. Using the {{rewrite}} function to get the {{baseQuery}} is a bit 
unintuitive.
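
The workflow described above can be sketched with a small stand-in. This is not Lucene's {{DrillDownQuery}}: the class {{DrillDownSketch}}, its string-based queries, and the helper {{withOptimizedBase}} are all hypothetical, and only illustrate why direct getters are more convenient than picking apart the combined form.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Stand-in for a drill-down query: base terms plus per-dimension sub-queries (not Lucene's class). */
final class DrillDownSketch {
  private final List<String> baseTerms;
  private final Map<String, String> dimQueries = new LinkedHashMap<>();

  DrillDownSketch(List<String> baseTerms) {
    this.baseTerms = new ArrayList<>(baseTerms);
  }

  void add(String dim, String subQuery) {
    dimQueries.put(dim, subQuery);
  }

  /** Returns a copy of the base terms, so callers cannot mutate this query. */
  List<String> getBaseQuery() {
    return new ArrayList<>(baseTerms);
  }

  /** Returns a copy of the dimension -> sub-query mapping in insertion order. */
  Map<String, String> getDrillDownQueries() {
    return new LinkedHashMap<>(dimQueries);
  }

  /** Combined form: the base and all dimension queries flattened into one clause list. */
  List<String> rewrite() {
    List<String> clauses = new ArrayList<>();
    clauses.add(String.join(" ", baseTerms));
    clauses.addAll(dimQueries.values());
    return clauses;
  }

  /** Drops one term from the base and re-attaches the original sub-queries unchanged. */
  static DrillDownSketch withOptimizedBase(DrillDownSketch original, String termToDrop) {
    List<String> base = original.getBaseQuery();
    base.remove(termToDrop);
    DrillDownSketch rebuilt = new DrillDownSketch(base);
    original.getDrillDownQueries().forEach(rebuilt::add);
    return rebuilt;
  }
}
```

With the getters, the rebuild is a few lines; going through the combined form would mean assuming the first clause is the base and the rest are dimensions.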


[GitHub] [lucene] gsmiller commented on a change in pull request #133: LUCENE-9950: New facet counting implementation for general string doc value fields

2021-05-14 Thread GitBox



gsmiller commented on a change in pull request #133:
URL: https://github.com/apache/lucene/pull/133#discussion_r632839564



##
File path: lucene/facet/src/java/org/apache/lucene/facet/StringValueFacetCounts.java
##
@@ -0,0 +1,379 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.MultiDocValues;
+import org.apache.lucene.index.OrdinalMap;
+import org.apache.lucene.index.ReaderUtil;
+import org.apache.lucene.index.SortedSetDocValues;
+import org.apache.lucene.search.ConjunctionDISI;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.search.MatchAllDocsQuery;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.LongValues;
+
+/**
+ * Compute facet counts from a previously indexed {@link SortedSetDocValues} or {@link
+ * org.apache.lucene.index.SortedDocValues} field. This approach will execute facet counting against
+ * the string values found in the specified field, with no assumptions on their format. Unlike
+ * {@link org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts}, no assumption is made
+ * about a "dimension" path component being indexed. Because of this, the field itself is
+ * effectively treated as the "dimension", and counts for all unique string values are produced.
+ * This approach is meant to complement {@link LongValueFacetCounts} in that they both provide facet
+ * counting on a doc value field with no assumptions of content.
+ *
+ * This implementation is useful if you want to dynamically count against any string doc value
+ * field without relying on {@link FacetField} and {@link FacetsConfig}. The disadvantage is that a
+ * separate field is required for each "dimension". If you want to pack multiple dimensions into the
+ * same doc values field, you probably want one of {@link
+ * org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts} or {@link
+ * org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts}.
+ *
+ * Note that there is an added cost on every {@link IndexReader} open to create a new {@link
+ * StringDocValuesReaderState}. Also note that this class should be instantiated and used from a
+ * single thread, because it holds a thread-private instance of {@link SortedSetDocValues}.
+ *
+ * Also note that counting does not use a sparse data structure, so heap memory cost scales with
+ * the number of unique ordinals for the field being counted. For high-cardinality fields, this
+ * could be costly.
+ *
+ * @lucene.experimental
+ */
+// TODO: Add a concurrent version much like ConcurrentSortedSetDocValuesFacetCounts?
+public class StringValueFacetCounts extends Facets {
+
+  private final IndexReader reader;
+  private final String field;
+  private final OrdinalMap ordinalMap;
+  private final SortedSetDocValues docValues;
+
+  // TODO: There's an optimization opportunity here to use a sparse counting structure in some
+  // cases,
+  // much like what IntTaxonomyFacetCounts does.
+  /** Dense counting array indexed by ordinal. */
+  private final int[] counts;
+
+  private int totalDocCount;
+
+  /**
+   * Returns all facet counts for the field, same result as searching on {@link MatchAllDocsQuery}
+   * but faster.
+   */
+  public StringValueFacetCounts(StringDocValuesReaderState state) throws IOException {
+    this(state, null);
+  }
+
+  /** Counts facets across the provided hits. */
+  public StringValueFacetCounts(StringDocValuesReaderState state, FacetsCollector facetsCollector)
+      throws IOException {
+    reader = state.reader;
+    field = state.field;
+    ordinalMap = state.ordinalMap;
+    docValues = getDocValues();
+
+    // Since we accumulate counts in an array, we need to ensure the number of unique ordinals
+    // doesn't overflow an integer:
+    if (docValues.getValueCount() > Integer.MAX_VALUE) {
+      throw new IllegalArgumentException(
+  "can only handle

[GitHub] [lucene] gsmiller commented on a change in pull request #138: LUCENE-9956: Make getBaseQuery, getDrillDownQueries API from DrillDownQuery public

2021-05-14 Thread GitBox



gsmiller commented on a change in pull request #138:
URL: https://github.com/apache/lucene/pull/138#discussion_r632800907



##
File path: lucene/facet/src/java/org/apache/lucene/facet/DrillDownQuery.java
##
@@ -170,11 +170,22 @@ private BooleanQuery getBooleanQuery() {
     return bq.build();
   }
 
-  Query getBaseQuery() {
+  /**
+   * Returns the internal baseQuery of the DrillDownQuery
+   *
+   * @return The baseQuery used on initialization of DrillDownQuery
+   */
+  public Query getBaseQuery() {
     return baseQuery;
   }
 
-  Query[] getDrillDownQueries() {
+  /**
+   * Returns the dimension queries added either via {@link #add(String, Query)} or {@link
+   * #add(String, String...)}
+   *
+   * @return The array of dimQueries
+   */
+  public Query[] getDrillDownQueries() {
     Query[] dimQueries = new Query[this.dimQueries.size()];

Review comment:
   I wonder if it would make more sense to build each dim query when 
they're added and change dimQueries to `List`. From a quick 
glance at the code, I don't see a need to store these as boolean clauses and 
then only build them when needed. If we expose this as public, we could wind up 
building these queries multiple times, which is wasteful.
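
To make the trade-off in this comment concrete, here is a rough sketch contrasting the two strategies. Everything in it is a hypothetical stand-in: `DimQueries`, `BuildOnRead`, `BuildOnAdd`, and the string "queries" are invented for illustration, and the sketch assumes each dimension is added exactly once, which may not hold for the real class.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-ins contrasting build-on-read with build-on-add; all names are hypothetical. */
final class DimQueries {
  /** Keeps raw clauses and builds a query string on every read. */
  static final class BuildOnRead {
    private final List<List<String>> clausesPerDim = new ArrayList<>();
    int builds = 0; // number of times a query was built, for comparison

    void add(List<String> clauses) {
      clausesPerDim.add(new ArrayList<>(clauses));
    }

    List<String> getDrillDownQueries() {
      List<String> result = new ArrayList<>();
      for (List<String> clauses : clausesPerDim) {
        builds++;
        result.add(build(clauses));
      }
      return result;
    }
  }

  /** Builds each query once, when it is added, and hands out copies of the list. */
  static final class BuildOnAdd {
    private final List<String> built = new ArrayList<>();
    int builds = 0; // number of times a query was built, for comparison

    void add(List<String> clauses) {
      builds++;
      built.add(build(clauses));
    }

    List<String> getDrillDownQueries() {
      return new ArrayList<>(built);
    }
  }

  /** Joins the clauses of one dimension into a single disjunction string. */
  static String build(List<String> clauses) {
    return "(" + String.join(" OR ", clauses) + ")";
  }
}
```

In the build-on-read variant the work grows with the number of getter calls, which is the waste a public getter would expose to callers.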





[GitHub] [lucene] gsmiller commented on pull request #133: LUCENE-9950: New facet counting implementation for general string doc value fields

2021-05-14 Thread GitBox



gsmiller commented on pull request #133:
URL: https://github.com/apache/lucene/pull/133#issuecomment-841450992


   @mikemccand yeah, this works for both single- and multi-valued fields. In 
`getDocValues()` I'm relying on `DocValues.getSortedSet()` which will first try 
to load stored values as `SortedSetDocValues` but will fall back to trying 
`SortedDocValues`. Pretty handy helper functionality. I cover this case in 
`testBasicSingleValuedUsingSortedDoc` to confirm.



[GitHub] [lucene] gsmiller commented on a change in pull request #133: LUCENE-9950: New facet counting implementation for general string doc value fields

2021-05-14 Thread GitBox



gsmiller commented on a change in pull request #133:
URL: https://github.com/apache/lucene/pull/133#discussion_r632722791



##
File path: lucene/facet/src/java/org/apache/lucene/facet/StringDocValuesReaderState.java
##
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet;
+
+import java.io.IOException;
+import java.util.List;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.OrdinalMap;
+import org.apache.lucene.index.SortedSetDocValues;
+import org.apache.lucene.util.packed.PackedInts;
+
+/**
+ * Stores an {@link OrdinalMap} created for a specific {@link IndexReader} ({@code reader}) + {@code
+ * field}. Enables re-use of the {@code ordinalMap} once created since creation is costly.
+ *
+ * Note: It's important that callers confirm the ordinal map is still valid for their cases.
+ * Specifically, callers should confirm that the reader used to create the map ({@code reader})
+ * matches their use-case.
+ */
+class StringDocValuesReaderState {

Review comment:
   Ah, yes-- it absolutely should along with its ctor. Good catch! I've got 
my IDE setup to generate all new classes as package-private by default so I 
have to have a good reason to make something public. This one slipped through 
the cracks.





[GitHub] [lucene] gsmiller commented on a change in pull request #133: LUCENE-9950: New facet counting implementation for general string doc value fields

2021-05-14 Thread GitBox



gsmiller commented on a change in pull request #133:
URL: https://github.com/apache/lucene/pull/133#discussion_r632720493



##
File path: lucene/facet/src/java/org/apache/lucene/facet/StringValueFacetCounts.java
##
@@ -0,0 +1,371 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.MultiDocValues;
+import org.apache.lucene.index.OrdinalMap;
+import org.apache.lucene.index.ReaderUtil;
+import org.apache.lucene.index.SortedSetDocValues;
+import org.apache.lucene.search.ConjunctionDISI;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.search.MatchAllDocsQuery;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.LongValues;
+
+/**
+ * Compute facet counts from a previously indexed {@link SortedSetDocValues} or {@link
+ * org.apache.lucene.index.SortedDocValues} field. This approach will execute facet counting against
+ * the string values found in the specified field, with no assumptions on their format. Unlike
+ * {@link org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts}, no assumption is made
+ * about a "dimension" path component being indexed. Because of this, the field itself is
+ * effectively treated as the "dimension", and counts for all unique string values are produced.
+ * This approach is meant to complement {@link LongValueFacetCounts} in that they both provide facet
+ * counting on a doc value field with no assumptions of content.
+ *
+ * This implementation is useful if you want to dynamically count against any string doc value
+ * field without relying on {@link FacetField} and {@link FacetsConfig}. The disadvantage is that a
+ * separate field is required for each "dimension". If you want to pack multiple dimensions into the
+ * same doc values field, you probably want one of {@link
+ * org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts} or {@link
+ * org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts}.
+ *
+ * Note that there is an added cost on every {@link IndexReader} open to create a new {@link
+ * StringDocValuesReaderState}. Also note that this class should be instantiated and used from a
+ * single thread, because it holds a thread-private instance of {@link SortedSetDocValues}.
+ *
+ * @lucene.experimental
+ */
+// TODO: Add a concurrent version much like ConcurrentSortedSetDocValuesFacetCounts?
+public class StringValueFacetCounts extends Facets {
+
+  private final IndexReader reader;
+  private final String field;
+  private final OrdinalMap ordinalMap;
+  private final SortedSetDocValues docValues;
+
+  private final int[] counts;

Review comment:
   That's correct. I'll add some documentation. I considered having both 
sparse and dense approaches triggered by different thresholds, similar to what 
`IntTaxonomyFacets` does, but opted not to for now. There should at least 
be some fairly common cases where this counting is pretty dense, assuming most 
unique values end up being seen at least once for a given field on any given 
match set. For very restrictive queries though, this could certainly get sparse.
   
   Anyway, maybe the most relevant reason I took this approach for now is that 
it's the existing approach used by `SortedSetDocValuesFacetCounts`, so it seemed 
like a reasonable starting place. But yes, optimization opportunities exist :)





[GitHub] [lucene] gsmiller commented on a change in pull request #133: LUCENE-9950: New facet counting implementation for general string doc value fields

2021-05-14 Thread GitBox



gsmiller commented on a change in pull request #133:
URL: https://github.com/apache/lucene/pull/133#discussion_r632717530



##
File path: lucene/facet/src/java/org/apache/lucene/facet/StringValueFacetCounts.java
##
@@ -0,0 +1,371 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.MultiDocValues;
+import org.apache.lucene.index.OrdinalMap;
+import org.apache.lucene.index.ReaderUtil;
+import org.apache.lucene.index.SortedSetDocValues;
+import org.apache.lucene.search.ConjunctionDISI;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.search.MatchAllDocsQuery;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.LongValues;
+
+/**
+ * Compute facet counts from a previously indexed {@link SortedSetDocValues} or {@link
+ * org.apache.lucene.index.SortedDocValues} field. This approach will execute facet counting against
+ * the string values found in the specified field, with no assumptions on their format. Unlike
+ * {@link org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts}, no assumption is made
+ * about a "dimension" path component being indexed. Because of this, the field itself is
+ * effectively treated as the "dimension", and counts for all unique string values are produced.
+ * This approach is meant to complement {@link LongValueFacetCounts} in that they both provide facet
+ * counting on a doc value field with no assumptions of content.
+ *
+ * This implementation is useful if you want to dynamically count against any string doc value
+ * field without relying on {@link FacetField} and {@link FacetsConfig}. The disadvantage is that a
+ * separate field is required for each "dimension". If you want to pack multiple dimensions into the
+ * same doc values field, you probably want one of {@link
+ * org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts} or {@link
+ * org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts}.
+ *
+ * Note that there is an added cost on every {@link IndexReader} open to create a new {@link
+ * StringDocValuesReaderState}. Also note that this class should be instantiated and used from a
+ * single thread, because it holds a thread-private instance of {@link SortedSetDocValues}.
+ *
+ * @lucene.experimental
+ */
+// TODO: Add a concurrent version much like ConcurrentSortedSetDocValuesFacetCounts?
+public class StringValueFacetCounts extends Facets {
+
+  private final IndexReader reader;
+  private final String field;
+  private final OrdinalMap ordinalMap;
+  private final SortedSetDocValues docValues;
+
+  private final int[] counts;
+
+  private int totalDocCount = 0;

Review comment:
   Good point; will remove.





[jira] [Commented] (LUCENE-9956) Make getBaseQuery API from DrillDownQuery public

2021-05-14 Thread Greg Miller (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344803#comment-17344803
 ] 

Greg Miller commented on LUCENE-9956:
-

{quote}I agree it seems unreasonable now to not be able to {{get}} the things 
you had {{set}} / passed to {{ctor}}
{quote}
Yeah that's fair. It's a little nuanced though I think. DDQ supports a few 
different ways to create the base query and the drill down dims, some of which 
are not as simple as having the user pass something in. For example, there's a 
ctor that allows the user to pass in an existing DDQ and an additional Query 
filter. The base query from the reference DDQ + filter becomes the new base 
query. Does it really make sense to expose that directly to the user? Also, the 
"standard" approach to adding drill down dimensions is to specify a dim + path, 
and DDQ constructs the appropriate Query using the user-provided 
{{FacetsConfig}}. Again, in these cases should we be exposing the Queries 
created under the hood?

I don't think there's any harm really in exposing them, but it does feel like a 
bit of an "advanced" feature. My intention isn't really to push back on adding 
this functionality, more to say "let's think about it a little bit and if 
{{rewrite}} is all that's really needed in this case, maybe we don't add this 
stuff yet".


[jira] [Commented] (LUCENE-9956) Make getBaseQuery API from DrillDownQuery public

2021-05-14 Thread Gautam Worah (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344801#comment-17344801
 ] 

Gautam Worah commented on LUCENE-9956:
--

Here is a PR that I opened yesterday: https://github.com/apache/lucene/pull/138


[jira] [Comment Edited] (LUCENE-9957) Use DirectMonotonicWriter to store sorted Values in NumericDocValues/SortedNumericDocValues

2021-05-14 Thread Lu Xugang (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344796#comment-17344796
 ] 

Lu Xugang edited comment on LUCENE-9957 at 5/14/21, 6:05 PM:
-

benchmark: python src/python/localrun.py -source wikimedium5m

!image-2021-05-15-02-04-43-405.png|width=591,height=503!


was (Author: chrislu):
benchmark: python src/python/localrun.py -source wikimedium5m

  !image-2021-05-15-02-03-06-167.png!

> Use DirectMonotonicWriter to store sorted Values in 
> NumericDocValues/SortedNumericDocValues
> ---
>
> Key: LUCENE-9957
> URL: https://issues.apache.org/jira/browse/LUCENE-9957
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.8.2
>Reporter: Lu Xugang
>Priority: Major
> Attachments: image-2021-05-15-02-03-06-167.png, 
> image-2021-05-15-02-04-09-085.png, image-2021-05-15-02-04-43-405.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When all values are sorted, storing them with DirectMonotonicWriter can 
> achieve relatively impressive compression
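
As a rough illustration of why sorted values compress well under this kind of scheme, the sketch below compares the bits per value needed when storing offsets from the minimum against storing deviations from a straight line through the first and last value. It is a simplified model of the general idea only, not DirectMonotonicWriter's actual on-disk format; the class and method names are hypothetical.

```java
/** Simplified model of monotonic encoding; not DirectMonotonicWriter's real format. */
final class MonotonicDeltaSketch {
  /** Bits per value when each value is stored as an offset from the minimum. */
  static int bitsFromMin(long[] values) {
    if (values.length == 0) {
      return 0;
    }
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    for (long v : values) {
      min = Math.min(min, v);
      max = Math.max(max, v);
    }
    return bitsRequired(max - min);
  }

  /** Bits per value when each value is stored as its deviation from a first-to-last line. */
  static int bitsFromLine(long[] sorted) {
    int n = sorted.length;
    if (n == 0) {
      return 0;
    }
    // Average increment between consecutive values.
    double slope = n > 1 ? (double) (sorted[n - 1] - sorted[0]) / (n - 1) : 0d;
    long minDev = Long.MAX_VALUE;
    long maxDev = Long.MIN_VALUE;
    for (int i = 0; i < n; i++) {
      long expected = sorted[0] + (long) (slope * i);
      long dev = sorted[i] - expected;
      minDev = Math.min(minDev, dev);
      maxDev = Math.max(maxDev, dev);
    }
    // Deviations are stored relative to the smallest deviation, so only the range matters.
    return bitsRequired(maxDev - minDev);
  }

  /** Number of bits needed to represent a non-negative range. */
  static int bitsRequired(long range) {
    return range == 0 ? 0 : 64 - Long.numberOfLeadingZeros(range);
  }
}
```

For values that grow at a nearly constant rate, the deviations stay tiny even when the values themselves span a wide range, which is where the savings come from.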




[jira] [Comment Edited] (LUCENE-9957) Use DirectMonotonicWriter to store sorted Values in NumericDocValues/SortedNumericDocValues

2021-05-14 Thread Lu Xugang (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344796#comment-17344796
 ] 

Lu Xugang edited comment on LUCENE-9957 at 5/14/21, 6:03 PM:
-

benchmark: python src/python/localrun.py -source wikimedium5m

  !image-2021-05-15-02-03-06-167.png!


was (Author: chrislu):
benchmark: python src/python/localrun.py -source wikimedium5m

 
{code:java}
                     Task  QPS baseline   StdDev  QPS my_modified_version   StdDev              Pct diff  p-value
                 PKLookup        190.11   (5.6%)                   190.46   (5.7%)  0.2% ( -10% -   12%)    0.917
BrowseDayOfYearTaxoFacets          5.11   (4.6%)                     5.17   (3.7%)  1.1% (  -6% -    9%)    0.425
BrowseDayOfYearSSDVFacets         29.52   (4.5%)                    29.90   (2.6%)  1.3% (  -5% -    8%)    0.273
        HighTermMonthSort        225.07  (16.4%)                   228.27  (14.7%)  1.4% ( -25% -   38%)    0.772
    BrowseMonthTaxoFacets          5.47   (4.6%)                     5.55   (3.9%)  1.5% (  -6% -   10%)    0.273
     BrowseDateTaxoFacets          5.12   (4.4%)                     5.19   (3.6%)  1.5% (  -6% -    9%)    0.229
         HighSloppyPhrase         20.55   (5.7%)                    20.89   (4.7%)  1.6% (  -8% -   12%)    0.326
     HighIntervalsOrdered         38.28   (5.8%)                    38.95   (3.1%)  1.7% (  -6% -   11%)    0.244
               TermDTSort        372.47   (8.6%)                   379.17   (6.7%)  1.8% ( -12% -   18%)    0.459
          MedSloppyPhrase         79.43   (7.6%)                    81.11   (5.9%)  2.1% ( -10% -   16%)    0.328
              MedSpanNear        157.72   (4.3%)                   161.24   (3.2%)  2.2% (  -5% -   10%)    0.063
              AndHighHigh         94.43   (6.0%)                    96.66   (4.5%)  2.4% (  -7% -   13%)    0.157
     HighTermTitleBDVSort        286.92  (16.9%)                   293.86  (15.5%)  2.4% ( -25% -   41%)    0.637
    HighTermDayOfYearSort        222.69  (10.9%)                   228.38  (11.6%)  2.6% ( -17% -   28%)    0.473
             HighSpanNear         23.28   (6.9%)                    23.90   (3.3%)  2.7% (  -6% -   13%)    0.118
              LowSpanNear         42.75   (6.1%)                    43.93   (3.7%)  2.8% (  -6% -   13%)    0.081
               OrHighHigh         59.14   (5.1%)                    60.90   (4.2%)  3.0% (  -6% -   12%)    0.044
          LowSloppyPhrase         58.41   (6.1%)                    60.22   (4.1%)  3.1% (  -6% -   14%)    0.059
                  Respell         85.25  (10.6%)                    87.89   (8.8%)  3.1% ( -14% -   25%)    0.312
    BrowseMonthSSDVFacets         35.63   (6.5%)                    36.78   (2.7%)  3.2% (  -5% -   13%)    0.041
                 Wildcard        157.33   (6.6%)                   162.55   (3.9%)  3.3% (  -6% -   14%)    0.051
                   Fuzzy2         55.44  (19.3%)                    57.31  (20.0%)  3.4% ( -30% -   52%)    0.587
               AndHighLow        873.15   (7.7%)                   904.41   (5.1%)  3.6% (  -8% -   17%)    0.082
                  LowTerm       1357.94   (8.0%)                  1409.83   (6.8%)  3.8% ( -10% -   20%)    0.103
                MedPhrase        168.51   (5.9%)                   175.10   (5.6%)  3.9% (  -7% -   16%)    0.032
             OrNotHighLow        887.60   (8.6%)                   923.27   (6.4%)  4.0% ( -10% -   20%)    0.092
                OrHighMed        132.13   (9.7%)                   137.52   (6.6%)  4.1% ( -11% -   22%)    0.120
                LowPhrase        223.17   (7.5%)                   232.56   (4.9%)  4.2% (  -7% -   18%)    0.036
               HighPhrase        129.12   (5.9%)                   134.66   (3.8%)  4.3% (  -5% -   14%)    0.006
                  MedTerm       1302.32   (8.1%)                  1358.78   (7.7%)  4.3% ( -10% -   21%)    0.083
               AndHighMed        198.67   (6.5%)                   207.93   (5.9%)  4.7% (  -7% -   18%)    0.017
                  Prefix3        198.58  (10.4%)                   208.86   (5.8%)  5.2% (  -9% -   23%)    0.051
                 HighTerm       1324.37   (7.1%)                  1408.12   (7.6%)  6.3% (  -7% -   22%)    0.006
                   IntNRQ        182.65   (9.6%)                   194.50   (6.3%)  6.5% (  -8% -   24%)    0.012
                OrHighLow        447.06  (13.3%)                   476.33  (12.2%)  6.5% ( -16% -   36%)    0.105
            OrNotHighHigh        655.34   (8.9%)                   701.03   (6.5%)  7.0% (  -7% -   24%)    0.004
             OrHighNotMed        702.92  (11.3%)                   757.64   (8.1%)  7.8% ( -10% -   30%)    0.012
            OrHighNotHigh        515.44   (9.0%)                   556.38   (6.7%)  7.9% (  -7% -   25%)    0.001
                   Fuzzy1         61.94  (11.8%)                    66.98  (15.4%)  8.1% ( -17% -   40%)    0.061
             OrHighNotLow        795.55   (8.0%)                   865.52   (7.8%)  8.8% (  -6% -   26%)    0.000
             OrNotHighMed        553.58   (9.4%)                   602.57   (6.5%)  8.8% (  -6% -   27%)    0.001

{code}
 

> Use DirectMonotonicWriter to store sorted Values in 
> NumericDocValues/SortedNumericDocValues
>

[jira] [Comment Edited] (LUCENE-9957) Use DirectMonotonicWriter to store sorted Values in NumericDocValues/SortedNumericDocValues

2021-05-14 Thread Lu Xugang (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344796#comment-17344796
 ] 

Lu Xugang edited comment on LUCENE-9957 at 5/14/21, 6:01 PM:
-

benchmark: python src/python/localrun.py -source wikimedium5m

 
{code:java}
                     Task  QPS baseline   StdDev  QPS my_modified_version   StdDev              Pct diff  p-value
                 PKLookup        190.11   (5.6%)                   190.46   (5.7%)  0.2% ( -10% -   12%)    0.917
BrowseDayOfYearTaxoFacets          5.11   (4.6%)                     5.17   (3.7%)  1.1% (  -6% -    9%)    0.425
BrowseDayOfYearSSDVFacets         29.52   (4.5%)                    29.90   (2.6%)  1.3% (  -5% -    8%)    0.273
        HighTermMonthSort        225.07  (16.4%)                   228.27  (14.7%)  1.4% ( -25% -   38%)    0.772
    BrowseMonthTaxoFacets          5.47   (4.6%)                     5.55   (3.9%)  1.5% (  -6% -   10%)    0.273
     BrowseDateTaxoFacets          5.12   (4.4%)                     5.19   (3.6%)  1.5% (  -6% -    9%)    0.229
         HighSloppyPhrase         20.55   (5.7%)                    20.89   (4.7%)  1.6% (  -8% -   12%)    0.326
     HighIntervalsOrdered         38.28   (5.8%)                    38.95   (3.1%)  1.7% (  -6% -   11%)    0.244
               TermDTSort        372.47   (8.6%)                   379.17   (6.7%)  1.8% ( -12% -   18%)    0.459
          MedSloppyPhrase         79.43   (7.6%)                    81.11   (5.9%)  2.1% ( -10% -   16%)    0.328
              MedSpanNear        157.72   (4.3%)                   161.24   (3.2%)  2.2% (  -5% -   10%)    0.063
              AndHighHigh         94.43   (6.0%)                    96.66   (4.5%)  2.4% (  -7% -   13%)    0.157
     HighTermTitleBDVSort        286.92  (16.9%)                   293.86  (15.5%)  2.4% ( -25% -   41%)    0.637
    HighTermDayOfYearSort        222.69  (10.9%)                   228.38  (11.6%)  2.6% ( -17% -   28%)    0.473
             HighSpanNear         23.28   (6.9%)                    23.90   (3.3%)  2.7% (  -6% -   13%)    0.118
              LowSpanNear         42.75   (6.1%)                    43.93   (3.7%)  2.8% (  -6% -   13%)    0.081
               OrHighHigh         59.14   (5.1%)                    60.90   (4.2%)  3.0% (  -6% -   12%)    0.044
          LowSloppyPhrase         58.41   (6.1%)                    60.22   (4.1%)  3.1% (  -6% -   14%)    0.059
                  Respell         85.25  (10.6%)                    87.89   (8.8%)  3.1% ( -14% -   25%)    0.312
    BrowseMonthSSDVFacets         35.63   (6.5%)                    36.78   (2.7%)  3.2% (  -5% -   13%)    0.041
                 Wildcard        157.33   (6.6%)                   162.55   (3.9%)  3.3% (  -6% -   14%)    0.051
                   Fuzzy2         55.44  (19.3%)                    57.31  (20.0%)  3.4% ( -30% -   52%)    0.587
               AndHighLow        873.15   (7.7%)                   904.41   (5.1%)  3.6% (  -8% -   17%)    0.082
                  LowTerm       1357.94   (8.0%)                  1409.83   (6.8%)  3.8% ( -10% -   20%)    0.103
                MedPhrase        168.51   (5.9%)                   175.10   (5.6%)  3.9% (  -7% -   16%)    0.032
             OrNotHighLow        887.60   (8.6%)                   923.27   (6.4%)  4.0% ( -10% -   20%)    0.092
                OrHighMed        132.13   (9.7%)                   137.52   (6.6%)  4.1% ( -11% -   22%)    0.120
                LowPhrase        223.17   (7.5%)                   232.56   (4.9%)  4.2% (  -7% -   18%)    0.036
               HighPhrase        129.12   (5.9%)                   134.66   (3.8%)  4.3% (  -5% -   14%)    0.006
                  MedTerm       1302.32   (8.1%)                  1358.78   (7.7%)  4.3% ( -10% -   21%)    0.083
               AndHighMed        198.67   (6.5%)                   207.93   (5.9%)  4.7% (  -7% -   18%)    0.017
                  Prefix3        198.58  (10.4%)                   208.86   (5.8%)  5.2% (  -9% -   23%)    0.051
                 HighTerm       1324.37   (7.1%)                  1408.12   (7.6%)  6.3% (  -7% -   22%)    0.006
                   IntNRQ        182.65   (9.6%)                   194.50   (6.3%)  6.5% (  -8% -   24%)    0.012
                OrHighLow        447.06  (13.3%)                   476.33  (12.2%)  6.5% ( -16% -   36%)    0.105
            OrNotHighHigh        655.34   (8.9%)                   701.03   (6.5%)  7.0% (  -7% -   24%)    0.004
             OrHighNotMed        702.92  (11.3%)                   757.64   (8.1%)  7.8% ( -10% -   30%)    0.012
            OrHighNotHigh        515.44   (9.0%)                   556.38   (6.7%)  7.9% (  -7% -   25%)    0.001
                   Fuzzy1         61.94  (11.8%)                    66.98  (15.4%)  8.1% ( -17% -   40%)    0.061
             OrHighNotLow        795.55   (8.0%)                   865.52   (7.8%)  8.8% (  -6% -   26%)    0.000
             OrNotHighMed        553.58   (9.4%)                   602.57   (6.5%)  8.8% (  -6% -   27%)    0.001

{code}
 


was (Author: chrislu):
benchmark: python src/python/localrun.py -source wikimedium5m

> Use DirectMonotonicWriter to store sorted Values in 
> NumericDocValues/SortedNumericDocValues
>

[jira] [Comment Edited] (LUCENE-9957) Use DirectMonotonicWriter to store sorted Values in NumericDocValues/SortedNumericDocValues

2021-05-14 Thread Lu Xugang (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344796#comment-17344796
 ] 

Lu Xugang edited comment on LUCENE-9957 at 5/14/21, 6:00 PM:
-

benchmark: python src/python/localrun.py -source wikimedium5m


was (Author: chrislu):
benchmark: python src/python/localrun.py -source wikimedium5m

result:
{code:java}
// code placeholder
{code}
                     Task  QPS baseline   StdDev  QPS my_modified_version   StdDev              Pct diff  p-value
                 PKLookup        190.11   (5.6%)                   190.46   (5.7%)  0.2% ( -10% -   12%)    0.917
BrowseDayOfYearTaxoFacets          5.11   (4.6%)                     5.17   (3.7%)  1.1% (  -6% -    9%)    0.425
BrowseDayOfYearSSDVFacets         29.52   (4.5%)                    29.90   (2.6%)  1.3% (  -5% -    8%)    0.273
        HighTermMonthSort        225.07  (16.4%)                   228.27  (14.7%)  1.4% ( -25% -   38%)    0.772
    BrowseMonthTaxoFacets          5.47   (4.6%)                     5.55   (3.9%)  1.5% (  -6% -   10%)    0.273
     BrowseDateTaxoFacets          5.12   (4.4%)                     5.19   (3.6%)  1.5% (  -6% -    9%)    0.229
         HighSloppyPhrase         20.55   (5.7%)                    20.89   (4.7%)  1.6% (  -8% -   12%)    0.326
     HighIntervalsOrdered         38.28   (5.8%)                    38.95   (3.1%)  1.7% (  -6% -   11%)    0.244
               TermDTSort        372.47   (8.6%)                   379.17   (6.7%)  1.8% ( -12% -   18%)    0.459
          MedSloppyPhrase         79.43   (7.6%)                    81.11   (5.9%)  2.1% ( -10% -   16%)    0.328
              MedSpanNear        157.72   (4.3%)                   161.24   (3.2%)  2.2% (  -5% -   10%)    0.063
              AndHighHigh         94.43   (6.0%)                    96.66   (4.5%)  2.4% (  -7% -   13%)    0.157
     HighTermTitleBDVSort        286.92  (16.9%)                   293.86  (15.5%)  2.4% ( -25% -   41%)    0.637
    HighTermDayOfYearSort        222.69  (10.9%)                   228.38  (11.6%)  2.6% ( -17% -   28%)    0.473
             HighSpanNear         23.28   (6.9%)                    23.90   (3.3%)  2.7% (  -6% -   13%)    0.118
              LowSpanNear         42.75   (6.1%)                    43.93   (3.7%)  2.8% (  -6% -   13%)    0.081
               OrHighHigh         59.14   (5.1%)                    60.90   (4.2%)  3.0% (  -6% -   12%)    0.044
          LowSloppyPhrase         58.41   (6.1%)                    60.22   (4.1%)  3.1% (  -6% -   14%)    0.059
                  Respell         85.25  (10.6%)                    87.89   (8.8%)  3.1% ( -14% -   25%)    0.312
    BrowseMonthSSDVFacets         35.63   (6.5%)                    36.78   (2.7%)  3.2% (  -5% -   13%)    0.041
                 Wildcard        157.33   (6.6%)                   162.55   (3.9%)  3.3% (  -6% -   14%)    0.051
                   Fuzzy2         55.44  (19.3%)                    57.31  (20.0%)  3.4% ( -30% -   52%)    0.587
               AndHighLow        873.15   (7.7%)                   904.41   (5.1%)  3.6% (  -8% -   17%)    0.082
                  LowTerm       1357.94   (8.0%)                  1409.83   (6.8%)  3.8% ( -10% -   20%)    0.103
                MedPhrase        168.51   (5.9%)                   175.10   (5.6%)  3.9% (  -7% -   16%)    0.032
             OrNotHighLow        887.60   (8.6%)                   923.27   (6.4%)  4.0% ( -10% -   20%)    0.092
                OrHighMed        132.13   (9.7%)                   137.52   (6.6%)  4.1% ( -11% -   22%)    0.120
                LowPhrase        223.17   (7.5%)                   232.56   (4.9%)  4.2% (  -7% -   18%)    0.036
               HighPhrase        129.12   (5.9%)                   134.66   (3.8%)  4.3% (  -5% -   14%)    0.006
                  MedTerm       1302.32   (8.1%)                  1358.78   (7.7%)  4.3% ( -10% -   21%)    0.083
               AndHighMed        198.67   (6.5%)                   207.93   (5.9%)  4.7% (  -7% -   18%)    0.017
                  Prefix3        198.58  (10.4%)                   208.86   (5.8%)  5.2% (  -9% -   23%)    0.051
                 HighTerm       1324.37   (7.1%)                  1408.12   (7.6%)  6.3% (  -7% -   22%)    0.006
                   IntNRQ        182.65   (9.6%)                   194.50   (6.3%)  6.5% (  -8% -   24%)    0.012
                OrHighLow        447.06  (13.3%)                   476.33  (12.2%)  6.5% ( -16% -   36%)    0.105
            OrNotHighHigh        655.34   (8.9%)                   701.03   (6.5%)  7.0% (  -7% -   24%)    0.004
             OrHighNotMed        702.92  (11.3%)                   757.64   (8.1%)  7.8% ( -10% -   30%)    0.012
            OrHighNotHigh        515.44   (9.0%)                   556.38   (6.7%)  7.9% (  -7% -   25%)    0.001
                   Fuzzy1         61.94  (11.8%)                    66.98  (15.4%)  8.1% ( -17% -   40%)    0.061
             OrHighNotLow        795.55   (8.0%)                   865.52   (7.8%)  8.8% (  -6% -   26%)    0.000
             OrNotHighMed        553.58   (9.4%)                   602.57   (6.5%)  8.8% (  -6% -   27%)    0.001

> Use DirectMonotonicWriter to store sorted Values in 
> NumericDocValues/SortedNumericDocValues
> ---
>
> Key: LUCENE-9957
> URL: https://issues.apache.org/jira/browse/LUCENE-9957
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.8.2
>Reporter: Lu Xugang
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When all values are sorted, storing them with DirectMonotonicWriter can 
> achieve relatively impressive compression




[jira] [Commented] (LUCENE-9957) Use DirectMonotonicWriter to store sorted Values in NumericDocValues/SortedNumericDocValues

2021-05-14 Thread Lu Xugang (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344796#comment-17344796
 ] 

Lu Xugang commented on LUCENE-9957:
---

benchmark: python src/python/localrun.py -source wikimedium5m

result:
{code:java}
// code placeholder
{code}
                     Task  QPS baseline   StdDev  QPS my_modified_version   StdDev              Pct diff  p-value
                 PKLookup        190.11   (5.6%)                   190.46   (5.7%)  0.2% ( -10% -   12%)    0.917
BrowseDayOfYearTaxoFacets          5.11   (4.6%)                     5.17   (3.7%)  1.1% (  -6% -    9%)    0.425
BrowseDayOfYearSSDVFacets         29.52   (4.5%)                    29.90   (2.6%)  1.3% (  -5% -    8%)    0.273
        HighTermMonthSort        225.07  (16.4%)                   228.27  (14.7%)  1.4% ( -25% -   38%)    0.772
    BrowseMonthTaxoFacets          5.47   (4.6%)                     5.55   (3.9%)  1.5% (  -6% -   10%)    0.273
     BrowseDateTaxoFacets          5.12   (4.4%)                     5.19   (3.6%)  1.5% (  -6% -    9%)    0.229
         HighSloppyPhrase         20.55   (5.7%)                    20.89   (4.7%)  1.6% (  -8% -   12%)    0.326
     HighIntervalsOrdered         38.28   (5.8%)                    38.95   (3.1%)  1.7% (  -6% -   11%)    0.244
               TermDTSort        372.47   (8.6%)                   379.17   (6.7%)  1.8% ( -12% -   18%)    0.459
          MedSloppyPhrase         79.43   (7.6%)                    81.11   (5.9%)  2.1% ( -10% -   16%)    0.328
              MedSpanNear        157.72   (4.3%)                   161.24   (3.2%)  2.2% (  -5% -   10%)    0.063
              AndHighHigh         94.43   (6.0%)                    96.66   (4.5%)  2.4% (  -7% -   13%)    0.157
     HighTermTitleBDVSort        286.92  (16.9%)                   293.86  (15.5%)  2.4% ( -25% -   41%)    0.637
    HighTermDayOfYearSort        222.69  (10.9%)                   228.38  (11.6%)  2.6% ( -17% -   28%)    0.473
             HighSpanNear         23.28   (6.9%)                    23.90   (3.3%)  2.7% (  -6% -   13%)    0.118
              LowSpanNear         42.75   (6.1%)                    43.93   (3.7%)  2.8% (  -6% -   13%)    0.081
               OrHighHigh         59.14   (5.1%)                    60.90   (4.2%)  3.0% (  -6% -   12%)    0.044
          LowSloppyPhrase         58.41   (6.1%)                    60.22   (4.1%)  3.1% (  -6% -   14%)    0.059
                  Respell         85.25  (10.6%)                    87.89   (8.8%)  3.1% ( -14% -   25%)    0.312
    BrowseMonthSSDVFacets         35.63   (6.5%)                    36.78   (2.7%)  3.2% (  -5% -   13%)    0.041
                 Wildcard        157.33   (6.6%)                   162.55   (3.9%)  3.3% (  -6% -   14%)    0.051
                   Fuzzy2         55.44  (19.3%)                    57.31  (20.0%)  3.4% ( -30% -   52%)    0.587
               AndHighLow        873.15   (7.7%)                   904.41   (5.1%)  3.6% (  -8% -   17%)    0.082
                  LowTerm       1357.94   (8.0%)                  1409.83   (6.8%)  3.8% ( -10% -   20%)    0.103
                MedPhrase        168.51   (5.9%)                   175.10   (5.6%)  3.9% (  -7% -   16%)    0.032
             OrNotHighLow        887.60   (8.6%)                   923.27   (6.4%)  4.0% ( -10% -   20%)    0.092
                OrHighMed        132.13   (9.7%)                   137.52   (6.6%)  4.1% ( -11% -   22%)    0.120
                LowPhrase        223.17   (7.5%)                   232.56   (4.9%)  4.2% (  -7% -   18%)    0.036
               HighPhrase        129.12   (5.9%)                   134.66   (3.8%)  4.3% (  -5% -   14%)    0.006
                  MedTerm       1302.32   (8.1%)                  1358.78   (7.7%)  4.3% ( -10% -   21%)    0.083
               AndHighMed        198.67   (6.5%)                   207.93   (5.9%)  4.7% (  -7% -   18%)    0.017
                  Prefix3        198.58  (10.4%)                   208.86   (5.8%)  5.2% (  -9% -   23%)    0.051
                 HighTerm       1324.37   (7.1%)                  1408.12   (7.6%)  6.3% (  -7% -   22%)    0.006
                   IntNRQ        182.65   (9.6%)                   194.50   (6.3%)  6.5% (  -8% -   24%)    0.012
                OrHighLow        447.06  (13.3%)                   476.33  (12.2%)  6.5% ( -16% -   36%)    0.105
            OrNotHighHigh        655.34   (8.9%)                   701.03   (6.5%)  7.0% (  -7% -   24%)    0.004
             OrHighNotMed        702.92  (11.3%)                   757.64   (8.1%)  7.8% ( -10% -   30%)    0.012
            OrHighNotHigh        515.44   (9.0%)                   556.38   (6.7%)  7.9% (  -7% -   25%)    0.001
                   Fuzzy1         61.94  (11.8%)                    66.98  (15.4%)  8.1% ( -17% -   40%)    0.061
             OrHighNotLow        795.55   (8.0%)                   865.52   (7.8%)  8.8% (  -6% -   26%)    0.000
             OrNotHighMed        553.58   (9.4%)                   602.57   (6.5%)  8.8% (  -6% -   27%)    0.001

> Use DirectMonotonicWriter to store sorted Values in 
> NumericDocValues/SortedNumericDocValues
> ---
>
> Key: LUCENE-9957
> URL: https://issues.apache.org/jira/browse/LUCENE-9957
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.8.2
>Reporter: Lu Xugang
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When all values are sorted, storing them with DirectMonotonicWriter can 
> achieve relatively impressive compression




[GitHub] [lucene] mikemccand merged pull request #71: LUCENE-9651: Make benchmarks run again, correct javadocs

2021-05-14 Thread GitBox



mikemccand merged pull request #71:
URL: https://github.com/apache/lucene/pull/71


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] mikemccand commented on pull request #128: LUCENE-9662: [WIP] CheckIndex should be concurrent

2021-05-14 Thread GitBox



mikemccand commented on pull request #128:
URL: https://github.com/apache/lucene/pull/128#issuecomment-841300744


   I am excited to see what happens to [`CheckIndex` time in Lucene's nightly benchmarks](https://home.apache.org/~mikemccand/lucenebench/checkIndexTime.html) after we push this!  But I agree we must also not crush the more common case of machines that don't have tons of cores ...



[GitHub] [lucene] mikemccand commented on a change in pull request #133: LUCENE-9950: New facet counting implementation for general string doc value fields

2021-05-14 Thread GitBox



mikemccand commented on a change in pull request #133:
URL: https://github.com/apache/lucene/pull/133#discussion_r632584798



##
File path: 
lucene/facet/src/java/org/apache/lucene/facet/StringValueFacetCounts.java
##
@@ -0,0 +1,371 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.MultiDocValues;
+import org.apache.lucene.index.OrdinalMap;
+import org.apache.lucene.index.ReaderUtil;
+import org.apache.lucene.index.SortedSetDocValues;
+import org.apache.lucene.search.ConjunctionDISI;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.search.MatchAllDocsQuery;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.LongValues;
+
+/**
+ * Compute facet counts from a previously indexed {@link SortedSetDocValues} or {@link
+ * org.apache.lucene.index.SortedDocValues} field. This approach will execute facet counting against
+ * the string values found in the specified field, with no assumptions on their format. Unlike
+ * {@link org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts}, no assumption is made
+ * about a "dimension" path component being indexed. Because of this, the field itself is
+ * effectively treated as the "dimension", and counts for all unique string values are produced.
+ * This approach is meant to complement {@link LongValueFacetCounts} in that they both provide facet
+ * counting on a doc value field with no assumptions of content.
+ *
+ * This implementation is useful if you want to dynamically count against any string doc value
+ * field without relying on {@link FacetField} and {@link FacetsConfig}. The disadvantage is that a
+ * separate field is required for each "dimension". If you want to pack multiple dimensions into the
+ * same doc values field, you probably want one of {@link
+ * org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts} or {@link
+ * org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts}.
+ *
+ * Note that there is an added cost on every {@link IndexReader} open to create a new {@link
+ * StringDocValuesReaderState}. Also note that this class should be instantiated and used from a
+ * single thread, because it holds a thread-private instance of {@link SortedSetDocValues}.
+ *
+ * @lucene.experimental
+ */
+// TODO: Add a concurrent version much like ConcurrentSortedSetDocValuesFacetCounts?
+public class StringValueFacetCounts extends Facets {
+
+  private final IndexReader reader;
+  private final String field;
+  private final OrdinalMap ordinalMap;
+  private final SortedSetDocValues docValues;
+
+  private final int[] counts;
+
+  private int totalDocCount = 0;

Review comment:
   You don't need the `= 0` -- it's java's default already.
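
   As a minimal sketch of the point above (the `Counter` class is hypothetical and not taken from the PR): Java initializes instance fields to their default values, so an explicit `= 0` on an `int` field is redundant. This applies to fields only; local variables must still be assigned before use.

```java
public class DefaultFieldValueSketch {

  /** Hypothetical class for illustration only. */
  public static class Counter {
    private int totalDocCount; // implicitly initialized to 0
    private int[] counts; // implicitly initialized to null

    public int totalDocCount() {
      return totalDocCount;
    }

    public boolean hasCounts() {
      return counts != null;
    }
  }

  public static void main(String[] args) {
    Counter c = new Counter();
    System.out.println(c.totalDocCount());
    System.out.println(c.hasCounts());
  }
}
```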

##
File path: 
lucene/facet/src/java/org/apache/lucene/facet/StringValueFacetCounts.java
##
@@ -0,0 +1,371 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.IndexReader;

[jira] [Commented] (LUCENE-9956) Make getBaseQuery API from DrillDownQuery public

2021-05-14 Thread Michael McCandless (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344634#comment-17344634
 ] 

Michael McCandless commented on LUCENE-9956:


Maybe we could do both?  Make these APIs public (I agree it seems unreasonable 
now to not be able to {{get}} the things you had {{set}} / passed to {{ctor}}) 
and also (better) test the actually-used-for-searching {{rewrite}}?

> Make getBaseQuery API from DrillDownQuery public 
> -
>
> Key: LUCENE-9956
> URL: https://issues.apache.org/jira/browse/LUCENE-9956
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: main (9.0)
>Reporter: Gautam Worah
>Priority: Trivial
>
> It would be great if users could access the baseQuery of a DrillDownQuery. I 
> think this can be useful for folks who want to access/test the clauses of a 
> BooleanQuery (for example) after they've already wrapped it into a 
> DrillDownQuery.
>  
>  Currently the {{Query getBaseQuery()}} method is package private by default.
> If this proposed change does not make sense, or if this change breaks the 
> semantic of the class, I am happy to explore other ways of doing this!
>  
>  




[GitHub] [lucene] dnhatn opened a new pull request #140: LUCENE-9935: Enable bulk-merge for term vectors with index sort

2021-05-14 Thread GitBox



dnhatn opened a new pull request #140:
URL: https://github.com/apache/lucene/pull/140


   This change enables bulk-merge for term vectors with index sort. The 
algorithm used here is similar to the one that is used to merge stored fields.
   
   Relates #134



[jira] [Commented] (LUCENE-9958) Performance regression when a minimum number of matching SHOULD clauses is required

2021-05-14 Thread Matt Weber (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344599#comment-17344599
 ] 

Matt Weber commented on LUCENE-9958:


[~jpountz]  Wow that was quick!  Thank you!

> Performance regression when a minimum number of matching SHOULD clauses is 
> required
> ---
>
> Key: LUCENE-9958
> URL: https://issues.apache.org/jira/browse/LUCENE-9958
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 8.9
>
>
> Opening this issue on behalf of [~mattweber], who reported this at 
> https://discuss.elastic.co/t/es-7-7-1-es-7-12-0-wand-performance-issue/272854.
> It looks like the fact that we introduced dynamic pruning for queries that 
> already have a minimum number of SHOULD clauses configured makes things 
> _slower_, at least in some cases.




[jira] [Resolved] (LUCENE-9958) Performance regression when a minimum number of matching SHOULD clauses is required

2021-05-14 Thread Adrien Grand (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-9958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-9958.
--
Fix Version/s: 8.9
   Resolution: Fixed

> Performance regression when a minimum number of matching SHOULD clauses is 
> required
> ---
>
> Key: LUCENE-9958
> URL: https://issues.apache.org/jira/browse/LUCENE-9958
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 8.9
>
>
> Opening this issue on behalf of [~mattweber], who reported this at 
> https://discuss.elastic.co/t/es-7-7-1-es-7-12-0-wand-performance-issue/272854.
> It looks like the fact that we introduced dynamic pruning for queries that 
> already have a minimum number of SHOULD clauses configured makes things 
> _slower_, at least in some cases.




[jira] [Commented] (LUCENE-9958) Performance regression when a minimum number of matching SHOULD clauses is required

2021-05-14 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344549#comment-17344549
 ] 

ASF subversion and git services commented on LUCENE-9958:
-

Commit d50d5dec62b612b8d603d82d33044cfc97c02d91 in lucene-solr's branch 
refs/heads/branch_8x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=d50d5de ]

LUCENE-9958: Fixed performance regression for boolean queries that configure a 
minimum number of matching clauses.


> Performance regression when a minimum number of matching SHOULD clauses is 
> required
> ---
>
> Key: LUCENE-9958
> URL: https://issues.apache.org/jira/browse/LUCENE-9958
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 8.9
>
>
> Opening this issue on behalf of [~mattweber], who reported this at 
> https://discuss.elastic.co/t/es-7-7-1-es-7-12-0-wand-performance-issue/272854.
> It looks like the fact that we introduced dynamic pruning for queries that 
> already have a minimum number of SHOULD clauses configured makes things 
> _slower_, at least in some cases.




[jira] [Commented] (LUCENE-9932) Performance improvement for BKD index building

2021-05-14 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344548#comment-17344548
 ] 

ASF subversion and git services commented on LUCENE-9932:
-

Commit 8045a170f42f3562f8d13616e2c4426bbcacf3a1 in lucene's branch 
refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=8045a17 ]

LUCENE-9932: Spotless.


> Performance improvement for BKD index building
> --
>
> Key: LUCENE-9932
> URL: https://issues.apache.org/jira/browse/LUCENE-9932
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 8.8.2
>Reporter: neoremind
>Priority: Critical
> Fix For: 8.9
>
> Attachments: benchmark_data.png, flame-graph.png, 
> refined-code-benchmark.png, refined-code-benchmark2.png
>
>  Time Spent: 12.5h
>  Remaining Estimate: 0h
>
> In BKD index building, the input bytes must be sorted before calling the BKD 
> writer APIs. The sort uses an MSB radix sort, and the comparison takes both 
> the bytes themselves and the DocId into account. In real cases, however, 
> DocIds are usually monotonically increasing, which opens up a possible 
> performance improvement. I found this while digging into a performance issue 
> in our system, and then researched a possible solution.
> The DocId usually increases by one when an index is built in a thread-safe 
> way. Under that assumption, the comparison can drop the unnecessary DocId 
> input and compare only the bytes. For this to work, the MSB radix sort and 
> its fallback sort must be *stable*, so that equal elements keep the order in 
> which they were added and the DocIds remain monotonically increasing. Making 
> the MSB radix sort stable needs only a trivial update; to make the fallback 
> sort stable, use merge sort instead of quicksort. There should also be a 
> switch to turn the stable option on or off.
> To validate how much performance could be gained, I wrote a benchmark that 
> records only the time spent in the _MutablePointsReaderUtils.sort_ stage.
> *Test environment:* 
>  MacBook Pro (Retina, 15-inch, Mid 2015), 2.2 GHz Intel Core i7, 16 GB 1600 
> MHz DDR3
> *Java version:*
>  java version "1.8.0_161"
>  Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)
> *Testcase:*
>  bytesPerDim = [1, 2, 3, 4, 8, 16, 32]
>  dim = 1
>  doc num = 2,000,000
>  warm up 5 times, then run 10 times to calculate the average time.
> *Result:*
>  
> ||bytesPerDim\scenario||disable sort doc id (PR branch)||enable sort doc id (master branch)||
> |1|30989.594 us|1151149.9 us|
> |2|313469.47 us|1115595.1 us|
> |3|844617.8 us|1465465.1 us|
> |4|1350946.8 us|1465465.1 us|
> |8|1344814.6 us|1458115.5 us|
> |16|1344516.6 us|1459849.6 us|
> |32|1386847.8 us|1583097.5 us|
> !benchmark_data.png|width=580,height=283!
> The results show that, with DocId sorting disabled, sorting runs 1.73x to 37x 
> faster when there are many duplicate bytes (bytesPerDim = 1, 2, or 3). When 
> data cardinality is high (bytesPerDim >= 4, where the test cases generate 
> random bytes that are more scattered and unlikely to be duplicates), 
> performance does not regress and is still slightly better.
> In conclusion, in the end-to-end process of building a BKD index, which 
> relies on BKDWriter for some data types, performance could improve by 
> ignoring DocIds when they are already monotonically increasing.
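
The stability argument in the description can be checked with a small sketch. It uses `Arrays.sort` on objects, which the JDK documents as stable, rather than Lucene's MSB radix sort, and the `Point` class and method names are hypothetical. When doc ids are assigned in increasing order, a stable sort on the value alone should produce the same order as a sort on (value, doc id).

```java
import java.util.Arrays;
import java.util.Comparator;

public class StableSortSketch {

  /** A (value, docId) pair; a hypothetical stand-in for one indexed point. */
  public static final class Point {
    public final int value;
    public final int docId;

    public Point(int value, int docId) {
      this.value = value;
      this.docId = docId;
    }
  }

  /** Builds points whose doc ids increase in insertion order. */
  public static Point[] withIncreasingDocIds(int[] values) {
    Point[] points = new Point[values.length];
    for (int docId = 0; docId < values.length; docId++) {
      points[docId] = new Point(values[docId], docId);
    }
    return points;
  }

  /** Sorts by value only, relying on the stability of Arrays.sort for objects. */
  public static Point[] sortByValueOnly(Point[] points) {
    Point[] copy = points.clone();
    Arrays.sort(copy, Comparator.comparingInt((Point p) -> p.value));
    return copy;
  }

  /** Sorts by value and breaks ties on docId explicitly. */
  public static Point[] sortByValueThenDocId(Point[] points) {
    Point[] copy = points.clone();
    Arrays.sort(
        copy, Comparator.comparingInt((Point p) -> p.value).thenComparingInt(p -> p.docId));
    return copy;
  }

  /** Returns true if both arrays hold the same (value, docId) sequence. */
  public static boolean sameOrder(Point[] a, Point[] b) {
    if (a.length != b.length) {
      return false;
    }
    for (int i = 0; i < a.length; i++) {
      if (a[i].value != b[i].value || a[i].docId != b[i].docId) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // Many duplicate values, as in the low-cardinality benchmark cases.
    Point[] points = withIncreasingDocIds(new int[] {3, 1, 3, 2, 1, 3, 2, 1});
    System.out.println(sameOrder(sortByValueOnly(points), sortByValueThenDocId(points)));
  }
}
```

The saving described in the issue comes from the cheaper comparison: the tie-break on doc id is never evaluated, yet the output order is unchanged as long as the input doc ids were already increasing.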




[jira] [Commented] (LUCENE-9958) Performance regression when a minimum number of matching SHOULD clauses is required

2021-05-14 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344547#comment-17344547
 ] 

ASF subversion and git services commented on LUCENE-9958:
-

Commit 2c04ab58353eb56d254b09ba075ff33e20e9d329 in lucene's branch 
refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=2c04ab5 ]

LUCENE-9958: Fixed performance regression for boolean queries that configure a 
minimum number of matching clauses.


> Performance regression when a minimum number of matching SHOULD clauses is 
> required
> ---
>
> Key: LUCENE-9958
> URL: https://issues.apache.org/jira/browse/LUCENE-9958
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
>
> Opening this issue on behalf of [~mattweber], who reported this at 
> https://discuss.elastic.co/t/es-7-7-1-es-7-12-0-wand-performance-issue/272854.
> It looks like the fact that we introduced dynamic pruning for queries that 
> already have a minimum number of SHOULD clauses configured makes things 
> _slower_, at least in some cases.




[jira] [Commented] (LUCENE-9958) Performance regression when a minimum number of matching SHOULD clauses is required

2021-05-14 Thread Adrien Grand (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344544#comment-17344544
 ] 

Adrien Grand commented on LUCENE-9958:
--

The fix is embarrassingly simple. In short, WANDScorer would only agree to 
leave scorers behind if the sum of their scores could not be competitive. 
However, it is also OK to leave {{minShouldMatch-1}} scorers behind regardless 
of their scores, since there cannot be a hit without at least {{minShouldMatch}} 
matching scorers.

{code:java}
diff --git a/lucene/core/src/java/org/apache/lucene/search/WANDScorer.java b/lucene/core/src/java/org/apache/lucene/search/WANDScorer.java
index f33af6b8ee8..f5bab49fb71 100644
--- a/lucene/core/src/java/org/apache/lucene/search/WANDScorer.java
+++ b/lucene/core/src/java/org/apache/lucene/search/WANDScorer.java
@@ -548,7 +548,7 @@ final class WANDScorer extends Scorer {
 
   /** Insert an entry in 'tail' and evict the least-costly scorer if full. */
   private DisiWrapper insertTailWithOverFlow(DisiWrapper s) {
-    if (tailMaxScore + s.maxScore < minCompetitiveScore) {
+    if (tailMaxScore + s.maxScore < minCompetitiveScore || tailSize + 1 < minShouldMatch) {
       // we have free room for this new entry
       addTail(s);
       tailMaxScore += s.maxScore;
 {code}
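
To illustrate the condition outside of Lucene, here is a minimal stand-alone sketch. The class name TailAdmissionSketch, the method canLeaveBehind and the use of double scores are assumptions made for illustration only; just the boolean expression mirrors the patched line, and this is not the actual WANDScorer code.

```java
/** Hypothetical stand-alone model of the patched condition; not Lucene's WANDScorer. */
final class TailAdmissionSketch {

  /**
   * Returns true if a scorer with the given maximum score may be left behind in the
   * tail, given the tail's current state.
   */
  static boolean canLeaveBehind(
      double tailMaxScore,
      int tailSize,
      double scorerMaxScore,
      double minCompetitiveScore,
      int minShouldMatch) {
    // Original rule: even with this scorer, the tail cannot reach a competitive score.
    boolean notCompetitive = tailMaxScore + scorerMaxScore < minCompetitiveScore;
    // Added rule: with fewer than minShouldMatch scorers, the tail alone cannot
    // produce a hit, whatever the scores are.
    boolean tooFewToMatch = tailSize + 1 < minShouldMatch;
    return notCompetitive || tooFewToMatch;
  }

  public static void main(String[] args) {
    // Tail holds 2 scorers and minShouldMatch is 5: a third scorer fits regardless of score.
    System.out.println(canLeaveBehind(3.0, 2, 10.0, 1.0, 5));
    // Without a minShouldMatch constraint, only the score rule applies.
    System.out.println(canLeaveBehind(3.0, 2, 10.0, 1.0, 0));
  }
}
```

With the second clause, high-scoring clauses can stay in the tail as long as the tail is too small to satisfy minShouldMatch on its own, which is consistent with the larger gains seen for higher minShouldMatch values.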

Here are updated results from luceneutil where baseline is origin/main and the 
patch is the above 1-line change:

{noformat}
Task      QPS baseline    StdDev  QPS patch    StdDev                  Pct diff p-value
PKLookup        248.11    (4.1%)     235.92    (4.0%)     -4.9% ( -12% -    3%) 0.000
MSM7            203.98    (3.0%)     199.78    (4.2%)     -2.1% (  -8% -    5%) 0.075
MSM3             20.09    (3.0%)      20.34    (3.2%)      1.2% (  -4% -    7%) 0.212
MSM1             20.15    (2.9%)      20.44    (3.5%)      1.4% (  -4% -    8%) 0.162
MSM2             20.14    (3.0%)      20.44    (3.4%)      1.5% (  -4% -    8%) 0.141
MSM4             18.93    (3.0%)      20.41    (3.7%)      7.8% (   1% -   14%) 0.000
MSM5              5.11    (4.7%)      23.01   (17.2%)    350.1% ( 313% -  390%) 0.000
MSM6              2.32    (5.2%)      50.64   (92.0%)   2086.0% (1889% - 2304%) 0.000
{noformat}

As we would usually expect, QPS now goes up as the minimum number of required 
clauses increases.

> Performance regression when a minimum number of matching SHOULD clauses is 
> required
> ---
>
> Key: LUCENE-9958
> URL: https://issues.apache.org/jira/browse/LUCENE-9958
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
>
> Opening this issue on behalf of [~mattweber], who reported this at 
> https://discuss.elastic.co/t/es-7-7-1-es-7-12-0-wand-performance-issue/272854.
> It looks like the fact that we introduced dynamic pruning for queries that 
> already have a minimum number of SHOULD clauses configured makes things 
> _slower_, at least in some cases.




[GitHub] [lucene] LuXugang opened a new pull request #139: [LUCENE-9957: Use DirectMonotonicWriter to store sorted Values in NumericDocValues/SortedNumericDocValues

2021-05-14 Thread GitBox



LuXugang opened a new pull request #139:
URL: https://github.com/apache/lucene/pull/139


   Since Lucene90DocValuesConsumer#writeValues(FieldInfo field, 
DocValuesProducer valuesProducer) visits all values anyway, we can check at 
the same time whether all values are sorted. If so, after the docIds have been 
written, we write all values with DirectMonotonicWriter and return. This can 
achieve relatively impressive compression.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Comment Edited] (LUCENE-9958) Performance regression when a minimum number of matching SHOULD clauses is required

2021-05-14 Thread Adrien Grand (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344521#comment-17344521
 ] 

Adrien Grand edited comment on LUCENE-9958 at 5/14/21, 11:20 AM:
-

Good news is that it's easy to reproduce. Using the following tasks file

{noformat}
MSM1: ref http from mostly interview 9 hard
MSM2: ref http from mostly interview 9 hard +minShouldMatch=2
MSM3: ref http from mostly interview 9 hard +minShouldMatch=3
MSM4: ref http from mostly interview 9 hard +minShouldMatch=4
MSM5: ref http from mostly interview 9 hard +minShouldMatch=5
MSM6: ref http from mostly interview 9 hard +minShouldMatch=6
MSM7: ref http from mostly interview 9 hard +minShouldMatch=7
{noformat}

I got the following results on wikimedium10m where baseline is origin/main and 
the patch reverts LUCENE-9346:

{noformat}
Task      QPS baseline    StdDev  QPS patch    StdDev                  Pct diff p-value
MSM2             20.22    (3.7%)       1.94    (0.2%)    -90.4% ( -90% -  -89%) 0.000
MSM3             20.14    (3.7%)       3.00    (0.7%)    -85.1% ( -86% -  -83%) 0.000
MSM4             18.95    (3.6%)       8.81    (2.5%)    -53.5% ( -57% -  -49%) 0.000
PKLookup        250.33    (3.5%)     230.62    (3.7%)     -7.9% ( -14% -    0%) 0.000
MSM7            202.13    (4.2%)     199.17    (3.3%)     -1.5% (  -8% -    6%) 0.216
MSM1             20.24    (3.7%)      20.81    (3.3%)      2.9% (  -4% -   10%) 0.010
MSM5              5.04    (5.5%)      29.43   (33.8%)    483.5% ( 420% -  553%) 0.000
MSM6              2.28    (6.1%)      90.03  (298.1%)   3852.9% (3343% - 4428%) 0.000
{noformat}


was (Author: jpountz):
Good news is that it's easy to reproduce. Using the following tasks file

{noformat}
MSM1: ref http from mostly interview 9 hard
MSM2: ref http from mostly interview 9 hard +minShouldMatch=2
MSM3: ref http from mostly interview 9 hard +minShouldMatch=3
MSM4: ref http from mostly interview 9 hard +minShouldMatch=4
MSM5: ref http from mostly interview 9 hard +minShouldMatch=5
MSM6: ref http from mostly interview 9 hard +minShouldMatch=6
MSM7: ref http from mostly interview 9 hard +minShouldMatch=7
{noformat}

I got the following results on wikimedium10m where baseline is origin/main and 
the patch reverts LUCENE-9346:

{noformat}
Task      QPS baseline    StdDev  QPS patch    StdDev                  Pct diff p-value
PKLookup        248.06    (3.6%)     231.47    (4.3%)     -6.7% ( -14% -    1%) 0.000
MSM7            182.44    (3.8%)     181.65    (3.4%)     -0.4% (  -7% -    7%) 0.704
MSM1             19.52    (4.4%)      20.31    (3.8%)      4.1% (  -4% -   12%) 0.002
MSM2              3.27    (3.4%)       4.20    (2.9%)     28.4% (  21% -   35%) 0.000
MSM3              3.09    (4.6%)       6.95    (4.9%)    125.0% ( 110% -  141%) 0.000
MSM4              2.29    (5.7%)       9.85   (15.2%)    329.9% ( 292% -  371%) 0.000
MSM5              2.20    (5.8%)      29.48   (56.8%)   1240.2% (1113% - 1382%) 0.000
MSM6              2.21    (5.8%)      88.95  (223.7%)   3929.4% (3497% - 4414%) 0.000

{noformat}

> Performance regression when a minimum number of matching SHOULD clauses is 
> required
> ---
>
> Key: LUCENE-9958
> URL: https://issues.apache.org/jira/browse/LUCENE-9958
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
>
> Opening this issue on behalf of [~mattweber], who reported this at 
> https://discuss.elastic.co/t/es-7-7-1-es-7-12-0-wand-performance-issue/272854.
> It looks like the fact that we introduced dynamic pruning for queries that 
> already have a minimum number of SHOULD clauses configured makes things 
> _slower_, at least in some cases.




[jira] [Commented] (LUCENE-9958) Performance regression when a minimum number of matching SHOULD clauses is required

2021-05-14 Thread Adrien Grand (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344521#comment-17344521
 ] 

Adrien Grand commented on LUCENE-9958:
--

Good news is that it's easy to reproduce. Using the following tasks file

{noformat}
MSM1: ref http from mostly interview 9 hard
MSM2: ref http from mostly interview 9 hard +minShouldMatch=2
MSM3: ref http from mostly interview 9 hard +minShouldMatch=3
MSM4: ref http from mostly interview 9 hard +minShouldMatch=4
MSM5: ref http from mostly interview 9 hard +minShouldMatch=5
MSM6: ref http from mostly interview 9 hard +minShouldMatch=6
MSM7: ref http from mostly interview 9 hard +minShouldMatch=7
{noformat}

I got the following results on wikimedium10m where baseline is origin/main and 
the patch reverts LUCENE-9346:

{noformat}
Task      QPS baseline    StdDev  QPS patch    StdDev                  Pct diff p-value
PKLookup        248.06    (3.6%)     231.47    (4.3%)     -6.7% ( -14% -    1%) 0.000
MSM7            182.44    (3.8%)     181.65    (3.4%)     -0.4% (  -7% -    7%) 0.704
MSM1             19.52    (4.4%)      20.31    (3.8%)      4.1% (  -4% -   12%) 0.002
MSM2              3.27    (3.4%)       4.20    (2.9%)     28.4% (  21% -   35%) 0.000
MSM3              3.09    (4.6%)       6.95    (4.9%)    125.0% ( 110% -  141%) 0.000
MSM4              2.29    (5.7%)       9.85   (15.2%)    329.9% ( 292% -  371%) 0.000
MSM5              2.20    (5.8%)      29.48   (56.8%)   1240.2% (1113% - 1382%) 0.000
MSM6              2.21    (5.8%)      88.95  (223.7%)   3929.4% (3497% - 4414%) 0.000

{noformat}

> Performance regression when a minimum number of matching SHOULD clauses is 
> required
> ---
>
> Key: LUCENE-9958
> URL: https://issues.apache.org/jira/browse/LUCENE-9958
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
>
> Opening this issue on behalf of [~mattweber], who reported this at 
> https://discuss.elastic.co/t/es-7-7-1-es-7-12-0-wand-performance-issue/272854.
> It looks like the fact that we introduced dynamic pruning for queries that 
> already have a minimum number of SHOULD clauses configured makes things 
> _slower_, at least in some cases.




[jira] [Updated] (LUCENE-9957) Use DirectMonotonicWriter to store sorted Values in NumericDocValues/SortedNumericDocValues

2021-05-14 Thread Lu Xugang (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Xugang updated LUCENE-9957:
--
Description: When all values were sorted, using DirectMonotonicWriter to 
store them can get relatively impressive compression  (was: When all values 
were sorted, use DirectMonotonicWriter to store them can get relatively 
impressive compression)

> Use DirectMonotonicWriter to store sorted Values in 
> NumericDocValues/SortedNumericDocValues
> ---
>
> Key: LUCENE-9957
> URL: https://issues.apache.org/jira/browse/LUCENE-9957
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.8.2
>Reporter: Lu Xugang
>Priority: Major
>
> When all values were sorted, using DirectMonotonicWriter to store them can 
> get relatively impressive compression




[jira] [Comment Edited] (LUCENE-9957) Use DirectMonotonicWriter to store sorted Values in NumericDocValues/SortedNumericDocValues

2021-05-14 Thread Lu Xugang (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344510#comment-17344510
 ] 

Lu Xugang edited comment on LUCENE-9957 at 5/14/21, 10:24 AM:
--

Since the method Lucene90DocValuesConsumer#writeValues(FieldInfo field, 
DocValuesProducer valuesProducer) visits all values anyway, we can check at 
the same time whether all values are sorted. If so, after the docIds have been 
written, we write all values with DirectMonotonicWriter and return.

Two conditions have to be met:
 # all values are monotonically increasing
 # numDocsWithValue == numValue

numDocsWithValue == numValue means the DocValues is either NumericDocValues or 
a SortedNumericDocValues that has only one value per document.
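
The two conditions can be sketched as a single pass over the values. This is only an illustration under stated assumptions: the class MonotonicCheckSketch and the method canUseMonotonicEncoding are hypothetical names, the values are held in a plain long[] rather than read from a DocValuesProducer, and "monotonically increasing" is interpreted as non-decreasing (duplicates allowed).

```java
/** Hypothetical sketch of the eligibility check; not the code from the pull request. */
final class MonotonicCheckSketch {

  /**
   * @param values all values in document order
   * @param numDocsWithValue number of documents that have at least one value
   * @return true if both conditions hold, so a monotonic encoding could be used
   */
  static boolean canUseMonotonicEncoding(long[] values, long numDocsWithValue) {
    // Condition 2: exactly one value per document that has a value.
    if (values.length != numDocsWithValue) {
      return false;
    }
    // Condition 1: values never decrease in document order (assumed interpretation).
    for (int i = 1; i < values.length; i++) {
      if (values[i] < values[i - 1]) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(canUseMonotonicEncoding(new long[] {1, 1, 2, 5, 9}, 5));
    System.out.println(canUseMonotonicEncoding(new long[] {1, 3, 2}, 3));
    System.out.println(canUseMonotonicEncoding(new long[] {1, 2, 3}, 2));
  }
}
```

In the real consumer the check would be folded into the existing loop over the values, so it adds no extra pass.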

I did some simple tests: indexing 10 million documents into one segment, then 
measuring only the length of the *.dvd file.

UniqueValues >= 256：
||Loop||Branch(Main)||Branch(PR)||Storage||UniqueValues|| ||
|0|20014826B|14129376B|-29.405%|6321130| |
|1|20011970B|13768928B|-31.196%|6322006| |
|2|20014826B|14145670B|-29.324%|6321066| |
|3|20014826B|14031072B|-29.896%|6319892| |
|4|20014826B|14276230B|-28.671%|632| |
|5|20014826B|13998304B|-30.060%|6320938| |
|6|20014826B|13932768B|-30.387%|6320997| |
|7|20014826B|13801696B|-31.042%|6321756| |
|8|20014826B|13768928B|-31.206%|6322336| |
|9|20014826B|14260448B|-28.750%|6321014| |
| | | | | | |

 
 UniqueValues < 256：
||Loop||Branch(Main)||Branch(PR)||Storage||UniqueValues|| ||
|0|2500076B|66064B|-97.35%|2| |
|1|2500076B|66064B|-97.35%|2| |
|2|576B|82454B|-98.35%|4| |
|3|576B|82454B|-98.35%|4| |
|4|576B|115234B|-97.69%|8| |
|5|576B|115234B|-97.69%|8| |
|6|1076B|180794B|-98.19%|16| |
|7|1076B|180794B|-98.19%|16| |
|8|1076B|311914B|-96.88%|32| |
|9|1076B|311914B|-96.88%|32| |
|10|1076B|574154B|-94.25%|64| |
|11|1076B|574154B|-94.25%|64| |
|12|1076B|1098634B|-89.01%|128| |
|13|1076B|1098634B|-89.01%|128| |
|14|1076B|1303509B|-86.96%|255| |
|15|1076B|1303509B|-86.96%|255|


was (Author: chrislu):
I did some simple tests: indexing 10million documents into one Segment。

UniqueValues >= 256：
||Loop||Branch(Main)||Branch(PR)||Storage||UniqueValues|| ||
|0|20014826B|14129376B|-29.405%|6321130| |
|1|20011970B|13768928B|-31.196%|6322006| |
|2|20014826B|14145670B|-29.324%|6321066| |
|3|20014826B|14031072B|-29.896%|6319892| |
|4|20014826B|14276230B|-28.671%|632| |
|5|20014826B|13998304B|-30.060%|6320938| |
|6|20014826B|13932768B|-30.387%|6320997| |
|7|20014826B|13801696B|-31.042%|6321756| |
|8|20014826B|13768928B|-31.206%|6322336| |
|9|20014826B|14260448B|-28.750%|6321014| |
| | | | | | |
 
UniqueValues < 256：
||Loop||Branch(Main)||Branch(PR)||Storage||UniqueValues|| ||
|0|2500076B|66064B|-97.35%|2| |
|1|2500076B|66064B|-97.35%|2| |
|2|576B|82454B|-98.35%|4| |
|3|576B|82454B|-98.35%|4| |
|4|576B|115234B|-97.69%|8| |
|5|576B|115234B|-97.69%|8| |
|6|1076B|180794B|-98.19%|16| |
|7|1076B|180794B|-98.19%|16| |
|8|1076B|311914B|-96.88%|32| |
|9|1076B|311914B|-96.88%|32| |
|10|1076B|574154B|-94.25%|64| |
|11|1076B|574154B|-94.25%|64| |
|12|1076B|1098634B|-89.01%|128| |
|13|1076B|1098634B|-89.01%|128| |
|14|1076B|1303509B|-86.96%|255| |
|15|1076B|1303509B|-86.96%|255|

> Use DirectMonotonicWriter to store sorted Values in 
> NumericDocValues/SortedNumericDocValues
> ---
>
> Key: LUCENE-9957
> URL: https://issues.apache.org/jira/browse/LUCENE-9957
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.8.2
>Reporter: Lu Xugang
>Priority: Major
>
> When all values were sorted, use DirectMonotonicWriter to store them can get 
> relatively impressive compression




[jira] [Commented] (LUCENE-9957) Use DirectMonotonicWriter to store sorted Values in NumericDocValues/SortedNumericDocValues

2021-05-14 Thread Lu Xugang (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344510#comment-17344510
 ] 

Lu Xugang commented on LUCENE-9957:
---

I did some simple tests: indexing 10 million documents into one segment.

UniqueValues >= 256：
||Loop||Branch(Main)||Branch(PR)||Storage||UniqueValues|| ||
|0|20014826B|14129376B|-29.405%|6321130| |
|1|20011970B|13768928B|-31.196%|6322006| |
|2|20014826B|14145670B|-29.324%|6321066| |
|3|20014826B|14031072B|-29.896%|6319892| |
|4|20014826B|14276230B|-28.671%|632| |
|5|20014826B|13998304B|-30.060%|6320938| |
|6|20014826B|13932768B|-30.387%|6320997| |
|7|20014826B|13801696B|-31.042%|6321756| |
|8|20014826B|13768928B|-31.206%|6322336| |
|9|20014826B|14260448B|-28.750%|6321014| |
| | | | | | |
 
UniqueValues < 256：
||Loop||Branch(Main)||Branch(PR)||Storage||UniqueValues|| ||
|0|2500076B|66064B|-97.35%|2| |
|1|2500076B|66064B|-97.35%|2| |
|2|576B|82454B|-98.35%|4| |
|3|576B|82454B|-98.35%|4| |
|4|576B|115234B|-97.69%|8| |
|5|576B|115234B|-97.69%|8| |
|6|1076B|180794B|-98.19%|16| |
|7|1076B|180794B|-98.19%|16| |
|8|1076B|311914B|-96.88%|32| |
|9|1076B|311914B|-96.88%|32| |
|10|1076B|574154B|-94.25%|64| |
|11|1076B|574154B|-94.25%|64| |
|12|1076B|1098634B|-89.01%|128| |
|13|1076B|1098634B|-89.01%|128| |
|14|1076B|1303509B|-86.96%|255| |
|15|1076B|1303509B|-86.96%|255|

> Use DirectMonotonicWriter to store sorted Values in 
> NumericDocValues/SortedNumericDocValues
> ---
>
> Key: LUCENE-9957
> URL: https://issues.apache.org/jira/browse/LUCENE-9957
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.8.2
>Reporter: Lu Xugang
>Priority: Major
>
> When all values were sorted, use DirectMonotonicWriter to store them can get 
> relatively impressive compression




[jira] [Created] (LUCENE-9959) Can we remove threadlocals of stored fields and term vectors

2021-05-14 Thread Adrien Grand (Jira)

Adrien Grand created LUCENE-9959:


 Summary: Can we remove threadlocals of stored fields and term 
vectors
 Key: LUCENE-9959
 URL: https://issues.apache.org/jira/browse/LUCENE-9959
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand


[~rmuir] suggested removing these threadlocals at 
https://github.com/apache/lucene/pull/137#issuecomment-840111367.

These threadlocals are trappy if you manage many segments and threads within 
the same JVM, or worse: non-fixed threadpools. The challenge is to keep the API 
easy to use.

We could take advantage of 9.0 to change the stored fields API?




[GitHub] [lucene] jpountz commented on pull request #137: LUCENE-9955: Reduced state of stored fields readers.

2021-05-14 Thread GitBox



jpountz commented on pull request #137:
URL: https://github.com/apache/lucene/pull/137#issuecomment-841140765


   Agreed we should look into this! I opened 
https://issues.apache.org/jira/browse/LUCENE-9959.



[jira] [Created] (LUCENE-9957) Use DirectMonotonicWriter to store sortedValues in NumericDocValues/SortedNumericDocValues

2021-05-14 Thread Lu Xugang (Jira)

Lu Xugang created LUCENE-9957:
-

 Summary: Use DirectMonotonicWriter to store sortedValues in 
NumericDocValues/SortedNumericDocValues
 Key: LUCENE-9957
 URL: https://issues.apache.org/jira/browse/LUCENE-9957
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/codecs
Affects Versions: 8.8.2
Reporter: Lu Xugang


When all values were sorted, use DirectMonotonicWriter to store them can get 
relatively impressive compression




[jira] [Created] (LUCENE-9958) Performance regression when a minimum number of matching SHOULD clauses is required

2021-05-14 Thread Adrien Grand (Jira)

Adrien Grand created LUCENE-9958:


 Summary: Performance regression when a minimum number of matching 
SHOULD clauses is required
 Key: LUCENE-9958
 URL: https://issues.apache.org/jira/browse/LUCENE-9958
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Adrien Grand


Opening this issue on behalf of [~mattweber], who reported this at 
https://discuss.elastic.co/t/es-7-7-1-es-7-12-0-wand-performance-issue/272854.

It looks like the fact that we introduced dynamic pruning for queries that 
already have a minimum number of SHOULD clauses configured makes things 
_slower_, at least in some cases.




[jira] [Updated] (LUCENE-9957) Use DirectMonotonicWriter to store sorted Values in NumericDocValues/SortedNumericDocValues

2021-05-14 Thread Lu Xugang (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Xugang updated LUCENE-9957:
--
Summary: Use DirectMonotonicWriter to store sorted Values in 
NumericDocValues/SortedNumericDocValues  (was: Use DirectMonotonicWriter to 
store sortedValues in NumericDocValues/SortedNumericDocValues)

> Use DirectMonotonicWriter to store sorted Values in 
> NumericDocValues/SortedNumericDocValues
> ---
>
> Key: LUCENE-9957
> URL: https://issues.apache.org/jira/browse/LUCENE-9957
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.8.2
>Reporter: Lu Xugang
>Priority: Major
>
> When all values were sorted, use DirectMonotonicWriter to store them can get 
> relatively impressive compression




[GitHub] [lucene] jpountz commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query

2021-05-14 Thread GitBox



jpountz commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-841124271


   > in the jira ticket you had suggested to use BMM for top-level (flat?) 
boolean query only. Do you think this will need to be fixed?
   
   I opened this JIRA ticket because it felt like we could do better for 
top-level disjunctions, but if BMM appears to work better most of the time, we 
could just move to it all the time.
   
   > The one result that does show negative impact to AndMedOrHighHigh also 
shows impact to OrHighMed, so it’s a bit strange and may need further looking 
into to see the cause.
   
   Yeah, I suspect there will always be cases when BMW will perform better than 
BMM or vice-versa, sometimes for subtle reasons.



[jira] [Resolved] (LUCENE-9932) Performance improvement for BKD index building

2021-05-14 Thread Adrien Grand (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-9932.
--
Fix Version/s: 8.9
   Resolution: Fixed

> Performance improvement for BKD index building
> --
>
> Key: LUCENE-9932
> URL: https://issues.apache.org/jira/browse/LUCENE-9932
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 8.8.2
>Reporter: neoremind
>Priority: Critical
> Fix For: 8.9
>
> Attachments: benchmark_data.png, flame-graph.png, 
> refined-code-benchmark.png, refined-code-benchmark2.png
>
>  Time Spent: 12.5h
>  Remaining Estimate: 0h
>
> In BKD index building, the input bytes must be sorted before calling the BKD 
> writer APIs. The sorting method uses the MSB radix sort algorithm, and the 
> comparison takes both the bytes themselves and the DocId into account. In 
> real cases, however, DocIds are usually monotonically increasing, which 
> opens up a possible performance improvement. I found it while digging into 
> a performance issue in our system, and then researched a possible solution.
> DocId usually increases by one when an index is built in a thread-safe way. 
> Assuming this condition holds, the comparison can drop the unnecessary 
> DocId input and compare only the bytes themselves. 
> To do so, MSB radix sort and its fallback sorting method must be 
> *stable*, so that equal elements keep the order in which they were added, 
> which keeps DocIds monotonically increasing. Making MSB radix sort stable 
> needs only a trivial update; to make the fallback sort stable, use merge 
> sort instead of quick sort. Meanwhile, a switch should be introduced to turn 
> the stable option on or off.
> To validate how much performance could be gained, I wrote a benchmark that 
> measures only the time spent in the _MutablePointsReaderUtils.sort_ stage.
> *Test environment:* 
>  MacBook Pro (Retina, 15-inch, Mid 2015), 2.2 GHz Intel Core i7, 16 GB 1600 
> MHz DDR3
> *Java version:*
>  java version "1.8.0_161"
>  Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)
> *Testcase:*
>  bytesPerDim = [1, 2, 3, 4, 8, 16, 32]
>  dim = 1
>  doc num = 2,000,000
>  warm up 5 times, run 10 times to calculate the average time used.
> *Result:*
>  
> ||bytesPerDim\scenario||disable sort doc id (PR branch)||enable sort doc id (master branch)||
> |1|30989.594 us|1151149.9 us|
> |2|313469.47 us|1115595.1 us|
> |3|844617.8 us|1465465.1 us|
> |4|1350946.8 us|1465465.1 us|
> |8|1344814.6 us|1458115.5 us|
> |16|1344516.6 us|1459849.6 us|
> |32|1386847.8 us|1583097.5 us|
> !benchmark_data.png|width=580,height=283!
> The results show that, by not sorting on DocId, sorting runs 1.73x to 37x 
> faster when there are many duplicate bytes (bytesPerDim = 1, 2 or 3). When 
> data cardinality is high (bytesPerDim >= 4, where the test cases generate 
> random bytes that are more scattered and unlikely to be duplicates), 
> performance does not regress and is still slightly better.
> In conclusion, in the end-to-end process of building a BKD index, which relies 
> on BKDWriter for some data types, performance can be improved by ignoring 
> DocIds if they are already monotonically increasing.
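
The stability argument in the description above can be checked with a small stand-alone sketch. It is not the MutablePointsReaderUtils code: the class StableSortSketch, the Point holder and the helper names are hypothetical, and the JDK's stable List.sort stands in for a stable radix/merge sort. It shows that when points are added in increasing DocId order, a stable sort on the bytes alone yields the same order as a sort on (bytes, DocId).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration of why a stable sort makes the DocId comparison redundant. */
final class StableSortSketch {

  /** A point value together with the document it belongs to. */
  static final class Point {
    final byte[] value;
    final int docId;

    Point(byte[] value, int docId) {
      this.value = value;
      this.docId = docId;
    }
  }

  /** Lexicographic comparison of unsigned bytes. */
  static int compareUnsigned(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
      if (cmp != 0) {
        return cmp;
      }
    }
    return Integer.compare(a.length, b.length);
  }

  static final Comparator<Point> BY_VALUE = (p, q) -> compareUnsigned(p.value, q.value);
  static final Comparator<Point> BY_VALUE_THEN_DOC = BY_VALUE.thenComparingInt(p -> p.docId);

  /** Returns true if sorting by value only gives the same order as sorting by (value, docId). */
  static boolean sameOrder(List<Point> points) {
    List<Point> valueOnly = new ArrayList<>(points);
    valueOnly.sort(BY_VALUE); // List.sort is stable: ties keep insertion order
    List<Point> valueAndDoc = new ArrayList<>(points);
    valueAndDoc.sort(BY_VALUE_THEN_DOC);
    for (int i = 0; i < points.size(); i++) {
      if (valueOnly.get(i) != valueAndDoc.get(i)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    byte[][] values = {{2}, {1}, {2}, {1}, {0}};
    List<Point> points = new ArrayList<>();
    for (int docId = 0; docId < values.length; docId++) {
      points.add(new Point(values[docId], docId)); // DocIds increase as points are added
    }
    System.out.println(sameOrder(points));
  }
}
```

If the DocIds were not increasing at insertion time, the two orders could differ, which is why the description asks for a switch to turn the stable option on or off.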




[jira] [Commented] (LUCENE-9725) Allow BM25FQuery to use other similarities

2021-05-14 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344393#comment-17344393
 ] 

ASF subversion and git services commented on LUCENE-9725:
-

Commit e4d4438e047eb68110e6d7c6242c11c3fcd121e2 in lucene-solr's branch 
refs/heads/branch_8x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e4d4438 ]

LUCENE-9725: Remove unused imports.


> Allow BM25FQuery to use other similarities
> --
>
> Key: LUCENE-9725
> URL: https://issues.apache.org/jira/browse/LUCENE-9725
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Assignee: Julie Tibshirani
>Priority: Major
> Fix For: 8.9
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> From a high level, BM25FQuery works as follows:
> # Given a list of fields and weights, it pretends there's a synthetic 
> combined field where all terms have been indexed. It computes new term and 
> collection statistics for this combined field.
> # It uses a disjunction iterator and BM25Similarity to score the documents.
> The steps are (1) compute statistics that represent the combined field 
> content, and (2) pass these to a similarity function. There is nothing really 
> specific to BM25Similarity in this approach. In step 2, we could use another 
> similarity, for example BooleanSimilarity or those based on language models 
> like LMDirichletSimilarity. The main restriction is that norms have to be 
> additive (the norm of the combined field must be the sum of the field norms).
> Maybe we could unhardcode BM25Similarity in BM25FQuery and instead use the 
> one configured on IndexSearcher. We could think of this as providing a 
> sensible default approach to cross-field scoring for many similarities. It's 
> an incremental step towards LUCENE-8711, which would give similarities more 
> fine-grained control over how stats/ scores are combined across fields.
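
As a rough illustration of the two steps in the description above, the sketch below combines per-field statistics into one synthetic field and hands them to a pluggable scoring function. Everything here is an assumption for illustration: CombinedFieldSketch, weightedSum and score are hypothetical names, the statistics are reduced to a term frequency and a field length per field, and the toy similarity is not BM25 or any Lucene Similarity.

```java
import java.util.function.DoubleBinaryOperator;

/** Hypothetical sketch of cross-field scoring with additive statistics; not BM25FQuery. */
final class CombinedFieldSketch {

  /** Weighted sum over fields, used for both term frequencies and field lengths. */
  static double weightedSum(double[] perField, double[] weights) {
    double sum = 0;
    for (int i = 0; i < perField.length; i++) {
      sum += weights[i] * perField[i];
    }
    return sum;
  }

  /**
   * Step 1: build the combined-field frequency and length (additive norms).
   * Step 2: pass them to any similarity function of (frequency, length).
   */
  static double score(
      double[] freqs, double[] lengths, double[] weights, DoubleBinaryOperator similarity) {
    double combinedFreq = weightedSum(freqs, weights);
    double combinedLength = weightedSum(lengths, weights);
    return similarity.applyAsDouble(combinedFreq, combinedLength);
  }

  public static void main(String[] args) {
    // Toy length-normalised similarity, chosen only to keep the example short.
    DoubleBinaryOperator toy = (freq, length) -> freq / (freq + length);
    double[] freqs = {2, 1}; // e.g. title, body
    double[] lengths = {10, 20};
    double[] weights = {3, 1};
    System.out.println(score(freqs, lengths, weights, toy));
  }
}
```

Because the similarity only sees the combined statistics, swapping in a different function requires no change to the combination step, provided the length statistic stays additive across fields.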




[jira] [Commented] (LUCENE-9932) Performance improvement for BKD index building

2021-05-14 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344394#comment-17344394
 ] 

ASF subversion and git services commented on LUCENE-9932:
-

Commit 86b6d35f7229010757924da3312ffb2fe72b17f4 in lucene-solr's branch 
refs/heads/branch_8x from neoReMinD
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=86b6d35 ]

LUCENE-9932: Performance improvement for BKD index building


> Performance improvement for BKD index building
> --
>
> Key: LUCENE-9932
> URL: https://issues.apache.org/jira/browse/LUCENE-9932
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 8.8.2
>Reporter: neoremind
>Priority: Critical
> Attachments: benchmark_data.png, flame-graph.png, 
> refined-code-benchmark.png, refined-code-benchmark2.png
>
>  Time Spent: 12.5h
>  Remaining Estimate: 0h
>
> In BKD index building, the input bytes must be sorted before calling the BKD 
> writer APIs. The sorting method uses the MSB radix sort algorithm, and the 
> comparison takes both the bytes themselves and the DocId into account. In 
> real cases, however, DocIds are usually monotonically increasing, which 
> opens up a possible performance improvement. I found it while digging into 
> a performance issue in our system, and then researched a possible solution.
> DocId usually increases by one when an index is built in a thread-safe way. 
> Assuming this condition holds, the comparison can drop the unnecessary 
> DocId input and compare only the bytes themselves. 
> To do so, MSB radix sort and its fallback sorting method must be 
> *stable*, so that equal elements keep the order in which they were added, 
> which keeps DocIds monotonically increasing. Making MSB radix sort stable 
> needs only a trivial update; to make the fallback sort stable, use merge 
> sort instead of quick sort. Meanwhile, a switch should be introduced to turn 
> the stable option on or off.
> To validate how much performance could be gained, I wrote a benchmark that 
> measures only the time spent in the _MutablePointsReaderUtils.sort_ stage.
> *Test environment:* 
>  MacBook Pro (Retina, 15-inch, Mid 2015), 2.2 GHz Intel Core i7, 16 GB 1600 
> MHz DDR3
> *Java version:*
>  java version "1.8.0_161"
>  Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)
> *Testcase:*
>  bytesPerDim = [1, 2, 3, 4, 8, 16, 32]
>  dim = 1
>  doc num = 2,000,000
>  warm up 5 times, then run 10 times and take the average elapsed time.
> *Result:*
>  
> ||bytesPerDim\scenario||disable sort doc id (PR branch)||enable sort doc id (master branch)||
> |1|30989.594 us|1151149.9 us|
> |2|313469.47 us|1115595.1 us|
> |3|844617.8 us|1465465.1 us|
> |4|1350946.8 us|1465465.1 us|
> |8|1344814.6 us|1458115.5 us|
> |16|1344516.6 us|1459849.6 us|
> |32|1386847.8 us|1583097.5 us|
> !benchmark_data.png|width=580,height=283!
> The results show that, with DocId sorting disabled, sorting runs 1.73x to 37x 
> faster when there are many duplicate values (bytesPerDim = 1, 2 or 3). When 
> data cardinality is high (bytesPerDim >= 4, where the test cases generate 
> random bytes that are more scattered and unlikely to repeat), performance 
> does not regress and is still slightly better.
> In conclusion, in the end-to-end process of building a BKD index, which 
> relies on BKDWriter for some data types, performance can improve by ignoring 
> DocIds when they are already monotonically increasing.
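
The reasoning in the description above can be checked with a small standalone
sketch. This is not the code from the PR and it does not use Lucene's radix
sorter or MutablePointsReaderUtils: the class name, the `Point` class and the
`randomPoints` helper are hypothetical, and the JDK's `List.sort` (a stable
merge sort for objects) stands in for the stable sort. Assuming points arrive
in increasing DocId order, it shows that a stable sort on the bytes alone
yields the same order as a sort on (bytes, DocId).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

public class StableSortSketch {
    /** Hypothetical stand-in for one indexed point: a packed value plus its DocId. */
    static final class Point {
        final byte[] value;
        final int docId;

        Point(byte[] value, int docId) {
            this.value = value;
            this.docId = docId;
        }
    }

    /** Compares the bytes only, treating each byte as unsigned. */
    static final Comparator<Point> BY_VALUE = (a, b) -> compareUnsigned(a.value, b.value);

    /** Compares the bytes first and breaks ties on DocId. */
    static final Comparator<Point> BY_VALUE_THEN_DOC = BY_VALUE.thenComparingInt(p -> p.docId);

    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    /** Generates points whose DocIds increase by one, in insertion order. */
    static List<Point> randomPoints(int count, int bytesPerDim, long seed) {
        Random random = new Random(seed);
        List<Point> points = new ArrayList<>(count);
        for (int docId = 0; docId < count; docId++) {
            byte[] value = new byte[bytesPerDim];
            random.nextBytes(value);
            points.add(new Point(value, docId));
        }
        return points;
    }

    /** True if a stable bytes-only sort gives the same order as a (bytes, DocId) sort. */
    static boolean stableSortMatchesFullSort(int count, int bytesPerDim, long seed) {
        List<Point> input = randomPoints(count, bytesPerDim, seed);

        List<Point> bytesOnly = new ArrayList<>(input);
        bytesOnly.sort(BY_VALUE); // stable: equal values keep their insertion order

        List<Point> bytesAndDoc = new ArrayList<>(input);
        bytesAndDoc.sort(BY_VALUE_THEN_DOC);

        for (int i = 0; i < count; i++) {
            if (bytesOnly.get(i).docId != bytesAndDoc.get(i).docId) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Low bytesPerDim means many duplicate values, where tie-breaking matters most.
        for (int bytesPerDim : new int[] {1, 2, 4}) {
            System.out.println("bytesPerDim=" + bytesPerDim + " same order: "
                + stableSortMatchesFullSort(2000, bytesPerDim, 42L));
        }
    }
}
```

The equivalence only holds while the input is already in DocId order and the
sort is stable, which is why the description asks for a switch: an unstable
fallback such as quick sort could reorder equal values and would still need
the DocId tie-break.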




[GitHub] [lucene] jpountz commented on pull request #91: LUCENE-9932: Performance improvement for BKD index building

2021-05-14 Thread GitBox



jpountz commented on pull request #91:
URL: https://github.com/apache/lucene/pull/91#issuecomment-841080232


   @neoremind I enjoyed it too. Thanks for identifying this opportunity for 
speedup and going through the many feedback iterations.



[jira] [Commented] (LUCENE-9932) Performance improvement for BKD index building

2021-05-14 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344387#comment-17344387
 ] 

ASF subversion and git services commented on LUCENE-9932:
-

Commit 8e94a591d8d7287844ae999e22f9290e113197ff in lucene's branch 
refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=8e94a59 ]

LUCENE-9932: Fix test bug.


> Performance improvement for BKD index building
> --
>
> Key: LUCENE-9932
> URL: https://issues.apache.org/jira/browse/LUCENE-9932
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 8.8.2
>Reporter: neoremind
>Priority: Critical
> Attachments: benchmark_data.png, flame-graph.png, 
> refined-code-benchmark.png, refined-code-benchmark2.png
>
>  Time Spent: 12h 20m
>  Remaining Estimate: 0h




[GitHub] [lucene] neoremind commented on pull request #91: LUCENE-9932: Performance improvement for BKD index building

2021-05-14 Thread GitBox



neoremind commented on pull request #91:
URL: https://github.com/apache/lucene/pull/91#issuecomment-841074895


   @jpountz It's great to work with you on this optimization :smile: Thanks for 
taking so much time to help me.



[jira] [Commented] (LUCENE-9932) Performance improvement for BKD index building

2021-05-14 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344379#comment-17344379
 ] 

ASF subversion and git services commented on LUCENE-9932:
-

Commit fd4b3c81d517e0cf9d804211b0785721cdfd1a6c in lucene's branch 
refs/heads/main from neoReMinD
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=fd4b3c8 ]

LUCENE-9932: Performance improvement for BKD index building (#91)



> Performance improvement for BKD index building
> --
>
> Key: LUCENE-9932
> URL: https://issues.apache.org/jira/browse/LUCENE-9932
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 8.8.2
>Reporter: neoremind
>Priority: Critical
> Attachments: benchmark_data.png, flame-graph.png, 
> refined-code-benchmark.png, refined-code-benchmark2.png
>
>  Time Spent: 12h
>  Remaining Estimate: 0h




[GitHub] [lucene] jpountz merged pull request #91: LUCENE-9932: Performance improvement for BKD index building

2021-05-14 Thread GitBox



jpountz merged pull request #91:
URL: https://github.com/apache/lucene/pull/91


   

