[GitHub] [lucene] mdmarshmallow commented on a change in pull request #509: LUCENE-10250: Add support for arbitrary length hierarchical SSDV facets

GitBox Mon, 03 Jan 2022 22:32:40 -0800


mdmarshmallow commented on a change in pull request #509:
URL: https://github.com/apache/lucene/pull/509#discussion_r776104278




##########
File path: lucene/facet/src/java/org/apache/lucene/facet/FacetsConfig.java
##########
@@ -471,19 +471,24 @@ private void indexDrillDownTerms(
 
   private void processSSDVFacetFields(
       Map<String, List<SortedSetDocValuesFacetField>> byField, Document doc) {
+    Set<String> addedPaths = new HashSet<>();
+
     for (Map.Entry<String, List<SortedSetDocValuesFacetField>> ent : 
byField.entrySet()) {
 
       String indexFieldName = ent.getKey();
 
       for (SortedSetDocValuesFacetField facetField : ent.getValue()) {
-        FacetLabel facetLabel = new FacetLabel(facetField.dim, 
facetField.label);
-        String fullPath = pathToString(facetLabel.components, 
facetLabel.length);
-
-        // For facet counts:
-        doc.add(new SortedSetDocValuesField(indexFieldName, new 
BytesRef(fullPath)));
-
-        // For drill-down:
-        indexDrillDownTerms(doc, indexFieldName, getDimConfig(facetField.dim), 
facetLabel);
+        FacetLabel facetLabel = new FacetLabel(facetField.dim, 
facetField.path);
+        for (int i = 0; i < facetLabel.length; i++) {

Review comment:
       Yeah I turned everything configurable. If there is at least one dim that 
is configured as hierarchical, then everything will be stored and counted in 
the "new" way, if not then SSDV facets will be stored and counted the same way 
they are currently.

##########
File path: 
lucene/facet/src/java/org/apache/lucene/facet/sortedset/DefaultSortedSetDocValuesReaderState.java
##########
@@ -79,37 +86,78 @@ public DefaultSortedSetDocValuesReaderState(IndexReader 
reader, String field) th
     }
     valueCount = (int) dv.getValueCount();
 
-    // TODO: we can make this more efficient if eg we can be
-    // "involved" when OrdinalMap is being created?  Ie see
-    // each term/ord it's assigning as it goes...
-    String lastDim = null;
-    int startOrd = -1;
+    // keeps track of last path with length i where the path has an unfilled 
sibling
+    Map<Integer, OrdAndComponent> pathsWithUnfilledSiblings = new HashMap<>();
+
+    BytesRef nextTerm = null;
+    String[] nextComponents = null;
 
-    // TODO: this approach can work for full hierarchy?;
-    // TaxoReader can't do this since ords are not in
-    // "sorted order" ... but we should generalize this to
-    // support arbitrary hierarchy:
     for (int ord = 0; ord < valueCount; ord++) {
-      final BytesRef term = dv.lookupOrd(ord);
-      String[] components = FacetsConfig.stringToPath(term.utf8ToString());
-      if (components.length != 2) {
-        throw new IllegalArgumentException(
-            "this class can only handle 2 level hierarchy (dim/value); got: "
-                + Arrays.toString(components)
-                + " "
-                + term.utf8ToString());
+      if (children == null) {
+        children = new FixedBitSet(valueCount);
+        siblings = new int[valueCount];
+      }
+
+      String[] components;
+
+      if (nextTerm == null) {
+        BytesRef term = dv.lookupOrd(ord);
+        components = FacetsConfig.stringToPath(term.utf8ToString());
+      } else {
+        components = nextComponents;
+      }
+
+      if (components.length == 1) {
+        // current ord is a dim
+        if (dims == null) {
+          dims = new ArrayList<>();
+        }
+        dims.add(new DimAndOrd(components[0], ord));
       }
-      if (!components[0].equals(lastDim)) {
-        if (lastDim != null) {
-          prefixToOrdRange.put(lastDim, new OrdRange(startOrd, ord - 1));
+
+      if (pathsWithUnfilledSiblings.containsKey(components.length - 1)) {

Review comment:
       I agree that stacks are probably better suited to this problem. I 
changed it to use stacks instead. I think the resulting code is actually a bit 
cleaner.

##########
File path: 
lucene/facet/src/java/org/apache/lucene/facet/sortedset/DefaultSortedSetDocValuesReaderState.java
##########
@@ -79,37 +86,78 @@ public DefaultSortedSetDocValuesReaderState(IndexReader 
reader, String field) th
     }
     valueCount = (int) dv.getValueCount();
 
-    // TODO: we can make this more efficient if eg we can be
-    // "involved" when OrdinalMap is being created?  Ie see
-    // each term/ord it's assigning as it goes...
-    String lastDim = null;
-    int startOrd = -1;
+    // keeps track of last path with length i where the path has an unfilled 
sibling
+    Map<Integer, OrdAndComponent> pathsWithUnfilledSiblings = new HashMap<>();
+
+    BytesRef nextTerm = null;
+    String[] nextComponents = null;
 
-    // TODO: this approach can work for full hierarchy?;
-    // TaxoReader can't do this since ords are not in
-    // "sorted order" ... but we should generalize this to
-    // support arbitrary hierarchy:
     for (int ord = 0; ord < valueCount; ord++) {
-      final BytesRef term = dv.lookupOrd(ord);
-      String[] components = FacetsConfig.stringToPath(term.utf8ToString());
-      if (components.length != 2) {
-        throw new IllegalArgumentException(
-            "this class can only handle 2 level hierarchy (dim/value); got: "
-                + Arrays.toString(components)
-                + " "
-                + term.utf8ToString());
+      if (children == null) {
+        children = new FixedBitSet(valueCount);
+        siblings = new int[valueCount];
+      }
+
+      String[] components;
+
+      if (nextTerm == null) {
+        BytesRef term = dv.lookupOrd(ord);
+        components = FacetsConfig.stringToPath(term.utf8ToString());
+      } else {
+        components = nextComponents;
+      }
+
+      if (components.length == 1) {
+        // current ord is a dim
+        if (dims == null) {
+          dims = new ArrayList<>();
+        }
+        dims.add(new DimAndOrd(components[0], ord));
       }
-      if (!components[0].equals(lastDim)) {
-        if (lastDim != null) {
-          prefixToOrdRange.put(lastDim, new OrdRange(startOrd, ord - 1));
+
+      if (pathsWithUnfilledSiblings.containsKey(components.length - 1)) {
+        OrdAndComponent possibleSibling = 
pathsWithUnfilledSiblings.get(components.length - 1);
+        boolean isSibling = true;
+        for (int i = 0; i < possibleSibling.component.length - 1; i++) {
+          if (!possibleSibling.component[i].equals(components[i])) {
+            isSibling = false;
+            break;
+          }
+        }
+        if (isSibling) {
+          siblings[possibleSibling.ord] = ord;
+        } else {
+          siblings[possibleSibling.ord] = INVALID_ORDINAL;
         }
-        startOrd = ord;
-        lastDim = components[0];
+        pathsWithUnfilledSiblings.remove(components.length - 1);
+      }
+
+      if (ord + 1 == valueCount) {
+        // last value, cannot have children or links to more siblings
+        siblings[ord] = INVALID_ORDINAL;
+        break;
+      }
+
+      nextTerm = dv.lookupOrd(ord + 1);
+      nextComponents = FacetsConfig.stringToPath(nextTerm.utf8ToString());
+
+      if (components.length < nextComponents.length) {
+        // next ord must be a direct child of current ord

Review comment:
       Mentioned it in this comment as well as the comment in the next `else 
if` statement, since that is also only true due to indexing all ancestral paths.

##########
File path: 
lucene/facet/src/java/org/apache/lucene/facet/sortedset/DefaultSortedSetDocValuesReaderState.java
##########
@@ -46,13 +48,18 @@
   private final String field;
   private final int valueCount;
 
+  // denotes whether next ord is child or not
+  private FixedBitSet children;

Review comment:
       Yes your interpretation is correct. I'll change the name to 
`hasChildren` and add a more descriptive comment.

##########
File path: 
lucene/facet/src/java/org/apache/lucene/facet/sortedset/DefaultSortedSetDocValuesReaderState.java
##########
@@ -222,4 +258,59 @@ public IndexReader getReader() {
   public int getSize() {
     return valueCount;
   }
+
+  @Override
+  public Iterable<Integer> childOrds(int pathOrd) {
+    return () ->
+        new Iterator<>() {
+
+          boolean atStart = true;
+          int currentOrd = pathOrd;
+
+          @Override
+          public boolean hasNext() {
+            if (atStart) {
+              if (currentOrd < 0 || currentOrd >= children.length()) {
+                return false;
+              }
+              return children.get(currentOrd);
+            } else {
+              return siblings[currentOrd] != INVALID_ORDINAL;
+            }
+          }
+
+          @Override
+          public Integer next() {
+            if (atStart) {
+              if (currentOrd < 0 || currentOrd >= children.length()) {
+                return INVALID_ORDINAL;
+              }
+              atStart = false;
+              if (children.get(currentOrd)) {
+                return ++currentOrd;

Review comment:
       Changed it.

##########
File path: 
lucene/facet/src/java/org/apache/lucene/facet/sortedset/DefaultSortedSetDocValuesReaderState.java
##########
@@ -79,37 +86,78 @@ public DefaultSortedSetDocValuesReaderState(IndexReader 
reader, String field) th
     }
     valueCount = (int) dv.getValueCount();
 
-    // TODO: we can make this more efficient if eg we can be
-    // "involved" when OrdinalMap is being created?  Ie see
-    // each term/ord it's assigning as it goes...
-    String lastDim = null;
-    int startOrd = -1;
+    // keeps track of last path with length i where the path has an unfilled 
sibling
+    Map<Integer, OrdAndComponent> pathsWithUnfilledSiblings = new HashMap<>();
+
+    BytesRef nextTerm = null;
+    String[] nextComponents = null;
 
-    // TODO: this approach can work for full hierarchy?;
-    // TaxoReader can't do this since ords are not in
-    // "sorted order" ... but we should generalize this to
-    // support arbitrary hierarchy:
     for (int ord = 0; ord < valueCount; ord++) {
-      final BytesRef term = dv.lookupOrd(ord);
-      String[] components = FacetsConfig.stringToPath(term.utf8ToString());
-      if (components.length != 2) {
-        throw new IllegalArgumentException(
-            "this class can only handle 2 level hierarchy (dim/value); got: "
-                + Arrays.toString(components)
-                + " "
-                + term.utf8ToString());
+      if (children == null) {
+        children = new FixedBitSet(valueCount);
+        siblings = new int[valueCount];
+      }
+
+      String[] components;
+
+      if (nextTerm == null) {
+        BytesRef term = dv.lookupOrd(ord);
+        components = FacetsConfig.stringToPath(term.utf8ToString());
+      } else {
+        components = nextComponents;
+      }
+
+      if (components.length == 1) {
+        // current ord is a dim
+        if (dims == null) {
+          dims = new ArrayList<>();
+        }
+        dims.add(new DimAndOrd(components[0], ord));
       }
-      if (!components[0].equals(lastDim)) {
-        if (lastDim != null) {
-          prefixToOrdRange.put(lastDim, new OrdRange(startOrd, ord - 1));
+
+      if (pathsWithUnfilledSiblings.containsKey(components.length - 1)) {
+        OrdAndComponent possibleSibling = 
pathsWithUnfilledSiblings.get(components.length - 1);
+        boolean isSibling = true;
+        for (int i = 0; i < possibleSibling.component.length - 1; i++) {
+          if (!possibleSibling.component[i].equals(components[i])) {

Review comment:
       Replaced all `!` with `== false`.

##########
File path: 
lucene/facet/src/java/org/apache/lucene/facet/sortedset/DefaultSortedSetDocValuesReaderState.java
##########
@@ -222,4 +258,59 @@ public IndexReader getReader() {
   public int getSize() {
     return valueCount;
   }
+
+  @Override
+  public Iterable<Integer> childOrds(int pathOrd) {

Review comment:
       Changed this to a primitive iterator instead. Didn't know that was a 
thing :)

##########
File path: 
lucene/facet/src/java/org/apache/lucene/facet/sortedset/DefaultSortedSetDocValuesReaderState.java
##########
@@ -79,37 +86,78 @@ public DefaultSortedSetDocValuesReaderState(IndexReader 
reader, String field) th
     }
     valueCount = (int) dv.getValueCount();
 
-    // TODO: we can make this more efficient if eg we can be
-    // "involved" when OrdinalMap is being created?  Ie see
-    // each term/ord it's assigning as it goes...
-    String lastDim = null;
-    int startOrd = -1;
+    // keeps track of last path with length i where the path has an unfilled 
sibling
+    Map<Integer, OrdAndComponent> pathsWithUnfilledSiblings = new HashMap<>();
+
+    BytesRef nextTerm = null;
+    String[] nextComponents = null;
 
-    // TODO: this approach can work for full hierarchy?;
-    // TaxoReader can't do this since ords are not in
-    // "sorted order" ... but we should generalize this to
-    // support arbitrary hierarchy:
     for (int ord = 0; ord < valueCount; ord++) {
-      final BytesRef term = dv.lookupOrd(ord);
-      String[] components = FacetsConfig.stringToPath(term.utf8ToString());
-      if (components.length != 2) {
-        throw new IllegalArgumentException(
-            "this class can only handle 2 level hierarchy (dim/value); got: "
-                + Arrays.toString(components)
-                + " "
-                + term.utf8ToString());
+      if (children == null) {
+        children = new FixedBitSet(valueCount);
+        siblings = new int[valueCount];
+      }
+
+      String[] components;
+
+      if (nextTerm == null) {
+        BytesRef term = dv.lookupOrd(ord);
+        components = FacetsConfig.stringToPath(term.utf8ToString());
+      } else {
+        components = nextComponents;
+      }
+
+      if (components.length == 1) {
+        // current ord is a dim
+        if (dims == null) {
+          dims = new ArrayList<>();
+        }
+        dims.add(new DimAndOrd(components[0], ord));
       }
-      if (!components[0].equals(lastDim)) {
-        if (lastDim != null) {
-          prefixToOrdRange.put(lastDim, new OrdRange(startOrd, ord - 1));
+
+      if (pathsWithUnfilledSiblings.containsKey(components.length - 1)) {
+        OrdAndComponent possibleSibling = 
pathsWithUnfilledSiblings.get(components.length - 1);
+        boolean isSibling = true;
+        for (int i = 0; i < possibleSibling.component.length - 1; i++) {
+          if (!possibleSibling.component[i].equals(components[i])) {
+            isSibling = false;
+            break;
+          }
+        }
+        if (isSibling) {
+          siblings[possibleSibling.ord] = ord;
+        } else {
+          siblings[possibleSibling.ord] = INVALID_ORDINAL;
         }
-        startOrd = ord;
-        lastDim = components[0];
+        pathsWithUnfilledSiblings.remove(components.length - 1);
+      }
+
+      if (ord + 1 == valueCount) {
+        // last value, cannot have children or links to more siblings
+        siblings[ord] = INVALID_ORDINAL;
+        break;
+      }
+
+      nextTerm = dv.lookupOrd(ord + 1);
+      nextComponents = FacetsConfig.stringToPath(nextTerm.utf8ToString());
+
+      if (components.length < nextComponents.length) {
+        // next ord must be a direct child of current ord
+        children.set(ord);

Review comment:
       Thanks :)

##########
File path: 
lucene/facet/src/java/org/apache/lucene/facet/sortedset/DefaultSortedSetDocValuesReaderState.java
##########
@@ -46,13 +48,18 @@
   private final String field;
   private final int valueCount;
 
+  // denotes whether next ord is child or not
+  private FixedBitSet children;

Review comment:
       Removed lazy inits and changed to `final`

##########
File path: 
lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesReaderState.java
##########
@@ -36,23 +35,21 @@
  */
 public abstract class SortedSetDocValuesReaderState implements Accountable {
 
-  /**
-   * Holds start/end range of ords, which maps to one dimension (someday we 
may generalize it to map
-   * to hierarchies within one dimension).
-   */
-  public static final class OrdRange {
-    /** Start of range, inclusive: */
-    public final int start;
-    /** End of range, inclusive: */
-    public final int end;
+  /** Holder class for a dimension along with it's corresponding ordinal */
+  public static class DimAndOrd {
+    String dim;
+    int ord;
 
-    /** Start and end are inclusive. */
-    public OrdRange(int start, int end) {
-      this.start = start;
-      this.end = end;
+    /** sole constructor */
+    public DimAndOrd(String dim, int ord) {
+      this.dim = dim;
+      this.ord = ord;
     }
   }
 
+  /** Invalid ordinal const */
+  public static int INVALID_ORDINAL = -1;

Review comment:
       `SortedSetDocValues#NO_MORE_ORDS` is a `long` so I think it's easier to 
just define a new constant.

##########
File path: 
lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesFacetCounts.java
##########
@@ -317,26 +317,23 @@ public Number getSpecificValue(String dim, String... 
path) throws IOException {
   public List<FacetResult> getAllDims(int topN) throws IOException {
 
     List<FacetResult> results = new ArrayList<>();
-    for (Map.Entry<String, OrdRange> ent : 
state.getPrefixToOrdRange().entrySet()) {
-      FacetResult fr = getDim(ent.getKey(), ent.getValue(), topN);
+    for (DimAndOrd dim : state.getDims()) {
+      Iterable<Integer> childOrds = state.childOrds(dim.ord);
+      FacetResult fr = getDim(dim.dim, new String[0], dim.ord, childOrds, 
topN);
       if (fr != null) {
         results.add(fr);
       }
     }
 
     // Sort by highest count:
-    Collections.sort(
-        results,
-        new Comparator<FacetResult>() {
-          @Override
-          public int compare(FacetResult a, FacetResult b) {
-            if (a.value.intValue() > b.value.intValue()) {
-              return -1;
-            } else if (b.value.intValue() > a.value.intValue()) {
-              return 1;
-            } else {
-              return a.dim.compareTo(b.dim);
-            }
+    results.sort(
+        (a, b) -> {

Review comment:
       Hmm ok, I changed the lambdas back to anonymous classes for now.

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/FacetsConfig.java
##########
@@ -471,19 +471,24 @@ private void indexDrillDownTerms(
 
   private void processSSDVFacetFields(
       Map<String, List<SortedSetDocValuesFacetField>> byField, Document doc) {
+    Set<String> addedPaths = new HashSet<>();

Review comment:
       Wasn't sure if this was already done, will remove this.

##########
File path: 
lucene/facet/src/java/org/apache/lucene/facet/sortedset/DefaultSortedSetDocValuesReaderState.java
##########
@@ -46,13 +48,18 @@
   private final String field;
   private final int valueCount;
 
+  // denotes whether next ord is child or not
+  private FixedBitSet children;
+  // maps an ord to its first sibling
+  private int[] siblings;

Review comment:
       I added a TODO, but changed it so this array will only be initialized if 
the field is configured to be hierarchical.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] mdmarshmallow commented on a change in pull request #509: LUCENE-10250: Add support for arbitrary length hierarchical SSDV facets

Reply via email to