[GitHub] [lucene] gsmiller commented on a change in pull request #133: LUCENE-9950: New facet counting implementation for general string doc value fields

GitBox Fri, 14 May 2021 11:35:05 -0700


gsmiller commented on a change in pull request #133:
URL: https://github.com/apache/lucene/pull/133#discussion_r632720493




##########
File path: lucene/facet/src/java/org/apache/lucene/facet/StringValueFacetCounts.java
##########
@@ -0,0 +1,371 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.MultiDocValues;
+import org.apache.lucene.index.OrdinalMap;
+import org.apache.lucene.index.ReaderUtil;
+import org.apache.lucene.index.SortedSetDocValues;
+import org.apache.lucene.search.ConjunctionDISI;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.search.MatchAllDocsQuery;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.LongValues;
+
+/**
+ * Compute facet counts from a previously indexed {@link SortedSetDocValues} or {@link
+ * org.apache.lucene.index.SortedDocValues} field. This approach will execute facet counting against
+ * the string values found in the specified field, with no assumptions on their format. Unlike
+ * {@link org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts}, no assumption is made
+ * about a "dimension" path component being indexed. Because of this, the field itself is
+ * effectively treated as the "dimension", and counts for all unique string values are produced.
+ * This approach is meant to compliment {@link LongValueFacetCounts} in that they both provide facet
+ * counting on a doc value field with no assumptions of content.
+ *
+ * <p>This implementation is useful if you want to dynamically count against any string doc value
+ * field without relying on {@link FacetField} and {@link FacetsConfig}. The disadvantage is that a
+ * separate field is required for each "dimension". If you want to pack multiple dimensions into the
+ * same doc values field, you probably want one of {@link
+ * org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts} or {@link
+ * org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts}.
+ *
+ * <p>Note that there is an added cost on every {@link IndexReader} open to create a new {@link
+ * StringDocValuesReaderState}. Also note that this class should be instantiated and used from a
+ * single thread, because it holds a thread-private instance of {@link SortedSetDocValues}.
+ *
+ * @lucene.experimental
+ */
+// TODO: Add a concurrent version much like ConcurrentSortedSetDocValuesFacetCounts?
+public class StringValueFacetCounts extends Facets {
+
+  private final IndexReader reader;
+  private final String field;
+  private final OrdinalMap ordinalMap;
+  private final SortedSetDocValues docValues;
+
+  private final int[] counts;

Review comment:
       That's correct. I'll add some documentation. I considered having both 
sparse and dense approaches, selected by a threshold similar to what 
`IntTaxonomyFacetCounts` does, but opted not to for now. There should be at 
least some fairly common cases where this counting is pretty dense, assuming 
most unique values for a given field are seen at least once in any given match 
set. For very restrictive queries, though, this could certainly get sparse.
   
   Anyway, maybe the most relevant reason I took this approach for now is that 
it's the existing approach used by `SortedSetDocValuesFacetCounts`, so it seemed 
like a reasonable starting place. But yes, optimization opportunities exist :)
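
   To make the dense vs. sparse trade-off concrete, here is a rough, 
standalone sketch. It is not code from this PR or from Lucene: the class and 
method names are hypothetical, and a plain `HashMap` stands in for whatever 
primitive hash table a real sparse implementation would use.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; not part of the PR or of Lucene. */
public class DenseVsSparseCounts {

  /** Dense: one int slot per unique ordinal, whether or not it is ever hit. */
  public static int[] countDense(int[] matchedOrds, int ordCount) {
    int[] counts = new int[ordCount];
    for (int ord : matchedOrds) {
      counts[ord]++;
    }
    return counts;
  }

  /** Sparse: memory grows with the number of distinct ordinals actually seen. */
  public static Map<Integer, Integer> countSparse(int[] matchedOrds) {
    Map<Integer, Integer> counts = new HashMap<>();
    for (int ord : matchedOrds) {
      counts.merge(ord, 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    // A restrictive query: only two of 1000 unique values appear in the match set.
    int[] matchedOrds = {3, 3, 7};
    int[] dense = countDense(matchedOrds, 1000);
    Map<Integer, Integer> sparse = countSparse(matchedOrds);
    System.out.println("dense slots allocated: " + dense.length);
    System.out.println("sparse entries stored: " + sparse.size());
  }
}
```

   With a match set that touches most ordinals, the array is the cheaper 
option (no hashing, no boxing); with a restrictive query like the one in 
`main`, most of the array is wasted.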




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
