[GitHub] [lucene] mikemccand commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader

GitBox Fri, 20 Aug 2021 05:44:41 -0700


mikemccand commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r692233337




##########
File path: lucene/CHANGES.txt
##########
@@ -137,6 +137,9 @@ API Changes
 
 Improvements
 
+* LUCENE-9476: Add new getBulkPath API to DirectoryTaxonomyReader to more 
efficiently retrieve FacetLabels for multiple
+  facet ordinals at once (Gautam Worah, Mike McCandless)
+

Review comment:
       We usually append these entries at the bottom of the section, not 
inserting at the top :)

##########
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +348,140 @@ public FacetLabel getPath(int ordinal) throws 
IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      return categoryCache.get(ordinal);
+    }
+  }
+
+  private void checkOrdinalBounds(int ordinal) throws IllegalArgumentException 
{
+    if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+      throw new IllegalArgumentException(
+          "ordinal "
+              + ordinal
+              + " is out of the range of the indexReader "
+              + indexReader.toString()
+              + ". The maximum possible ordinal number is "
+              + (indexReader.maxDoc() - 1));
+    }
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * <p>This API is generally faster than iteratively calling {@link 
#getPath(int)} over an array of
+   * ordinals. It uses the {@link #getPath(int)} method iteratively when it 
detects that the index
+   * was created using StoredFields (with no performance gains) and uses 
DocValues based iteration
+   * when the index is based on BinaryDocValues. Lucene switched to 
BinaryDocValues in version 9.0
+   *
+   * @param ordinals Array of ordinals that are assigned to categories 
inserted into the taxonomy
+   *     index
+   */
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+    ensureOpen();
+
+    int ordinalsLength = ordinals.length;
+    FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+    // remember the original positions of ordinals before they are sorted
+    int[] originalPosition = new int[ordinalsLength];
+    Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      // check whether the ordinal is valid before accessing the cache
+      checkOrdinalBounds(ordinals[i]);
+      // check the cache before trying to find it in the index
+      FacetLabel ordinalPath = getPathFromCache(ordinals[i]);

Review comment:
       Can you change this so that we `sync` once on the `categoryCache`, do 
all the lookups, then release the lock?

##########
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +348,140 @@ public FacetLabel getPath(int ordinal) throws 
IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      return categoryCache.get(ordinal);
+    }
+  }
+
+  private void checkOrdinalBounds(int ordinal) throws IllegalArgumentException 
{
+    if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+      throw new IllegalArgumentException(
+          "ordinal "
+              + ordinal
+              + " is out of the range of the indexReader "
+              + indexReader.toString()
+              + ". The maximum possible ordinal number is "
+              + (indexReader.maxDoc() - 1));
+    }
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * <p>This API is generally faster than iteratively calling {@link 
#getPath(int)} over an array of
+   * ordinals. It uses the {@link #getPath(int)} method iteratively when it 
detects that the index
+   * was created using StoredFields (with no performance gains) and uses 
DocValues based iteration
+   * when the index is based on BinaryDocValues. Lucene switched to 
BinaryDocValues in version 9.0
+   *
+   * @param ordinals Array of ordinals that are assigned to categories 
inserted into the taxonomy
+   *     index
+   */
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+    ensureOpen();
+
+    int ordinalsLength = ordinals.length;
+    FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+    // remember the original positions of ordinals before they are sorted
+    int[] originalPosition = new int[ordinalsLength];
+    Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      // check whether the ordinal is valid before accessing the cache
+      checkOrdinalBounds(ordinals[i]);
+      // check the cache before trying to find it in the index
+      FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+    }
+
+    /* parallel sort the ordinals and originalPosition array based on the 
values in the ordinals array */
+    new InPlaceMergeSorter() {

Review comment:
       Awesome!  Such a useful sorting utility class...

##########
File path: 
lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
##########
@@ -567,4 +567,39 @@ public void testAccountable() throws Exception {
     taxoReader.close();
     dir.close();
   }
+
+  public void testCallingBulkPathReturnsCorrectResult() throws Exception {
+    float PROBABILITY_OF_COMMIT = 0.5f;

Review comment:
       How about randomizing this?  I.e. `float PROBABILITY_OF_COMMIT = (float) 
random().nextDouble();`?

##########
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +348,140 @@ public FacetLabel getPath(int ordinal) throws 
IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      return categoryCache.get(ordinal);
+    }
+  }
+
+  private void checkOrdinalBounds(int ordinal) throws IllegalArgumentException 
{
+    if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+      throw new IllegalArgumentException(
+          "ordinal "
+              + ordinal
+              + " is out of the range of the indexReader "
+              + indexReader.toString()
+              + ". The maximum possible ordinal number is "
+              + (indexReader.maxDoc() - 1));
+    }
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * <p>This API is generally faster than iteratively calling {@link 
#getPath(int)} over an array of
+   * ordinals. It uses the {@link #getPath(int)} method iteratively when it 
detects that the index
+   * was created using StoredFields (with no performance gains) and uses 
DocValues based iteration
+   * when the index is based on BinaryDocValues. Lucene switched to 
BinaryDocValues in version 9.0
+   *
+   * @param ordinals Array of ordinals that are assigned to categories 
inserted into the taxonomy
+   *     index
+   */
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+    ensureOpen();
+
+    int ordinalsLength = ordinals.length;
+    FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+    // remember the original positions of ordinals before they are sorted
+    int[] originalPosition = new int[ordinalsLength];
+    Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      // check whether the ordinal is valid before accessing the cache
+      checkOrdinalBounds(ordinals[i]);
+      // check the cache before trying to find it in the index
+      FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+    }
+
+    /* parallel sort the ordinals and originalPosition array based on the 
values in the ordinals array */
+    new InPlaceMergeSorter() {
+      @Override
+      protected void swap(int i, int j) {
+        int x = ordinals[i];
+        ordinals[i] = ordinals[j];
+        ordinals[j] = x;
+
+        x = originalPosition[i];
+        originalPosition[i] = originalPosition[j];
+        originalPosition[j] = x;
+      }
+
+      @Override
+      public int compare(int i, int j) {
+        return Integer.compare(ordinals[i], ordinals[j]);
+      }
+    }.sort(0, ordinalsLength);
+
+    int readerIndex;
+    int leafReaderMaxDoc = 0;
+    int leafReaderDocBase = 0;
+    LeafReader leafReader;
+    LeafReaderContext leafReaderContext;
+    BinaryDocValues values = null;
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      if (bulkPath[originalPosition[i]] == null) {
+        /*
+        If ordinals[i] >= leafReaderDocBase + leafReaderMaxDoc then we find 
the next leaf that contains our ordinal.
+        Remember: ordinals[i] operates in the global ordinal space and hence 
we add leafReaderDocBase to the leafReaderMaxDoc
+        (which is the maxDoc of the specific leaf)
+         */
+        if (values == null || ordinals[i] >= leafReaderDocBase + 
leafReaderMaxDoc) {

Review comment:
       Did you add a test case that exposed this bug (missing the `docBase` 
before)?

##########
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/FloatTaxonomyFacets.java
##########
@@ -146,10 +147,18 @@ public FacetResult getTopChildren(int topN, String dim, 
String... path) throws I
     }
 
     LabelAndValue[] labelValues = new LabelAndValue[q.size()];
+    int[] ordinals = new int[labelValues.length];
+    float[] values = new float[labelValues.length];
+
     for (int i = labelValues.length - 1; i >= 0; i--) {
       TopOrdAndFloatQueue.OrdAndValue ordAndValue = q.pop();
-      FacetLabel child = taxoReader.getPath(ordAndValue.ord);
-      labelValues[i] = new LabelAndValue(child.components[cp.length], 
ordAndValue.value);
+      ordinals[i] = ordAndValue.ord;
+      values[i] = ordAndValue.value;
+    }
+
+    FacetLabel[] bulkPath = ((DirectoryTaxonomyReader) 
taxoReader).getBulkPath(ordinals);

Review comment:
       Hmm OK but this is quite messy.  This shows the cost of abstractions 
actually :)  Because we have these two separated abstractions, even though 
there is only one example of the `abstract` base class today, it is dangerous 
to do a hard cast like this.  If a future alternative `TaxonomyReader` comes 
along, then this code will fail?
   
   How about implementing `getBulkPath` in `TaxonomyReader`, falling back to 
looking up the individual paths in the base class, but overriding to the more 
efficient implementation (the one you already wrote) in 
`DirectoryTaxonomyReader`?

##########
File path: 
lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
##########
@@ -567,4 +567,39 @@ public void testAccountable() throws Exception {
     taxoReader.close();
     dir.close();
   }
+
+  public void testCallingBulkPathReturnsCorrectResult() throws Exception {

Review comment:
       OK :)

##########
File path: 
lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
##########
@@ -567,4 +567,39 @@ public void testAccountable() throws Exception {
     taxoReader.close();
     dir.close();
   }
+
+  public void testCallingBulkPathReturnsCorrectResult() throws Exception {
+    float PROBABILITY_OF_COMMIT = 0.5f;
+    Directory src = newDirectory();
+    DirectoryTaxonomyWriter w = new DirectoryTaxonomyWriter(src);
+    String randomArray[] = new String[random().nextInt(1000)];
+    // adding a smaller bound on ints ensures that we will have some duplicate 
ordinals in random
+    // test cases
+    Arrays.setAll(randomArray, i -> Integer.toString(random().nextInt(500)));
+
+    FacetLabel allPaths[] = new FacetLabel[randomArray.length];
+    int allOrdinals[] = new int[randomArray.length];
+
+    for (int i = 0; i < randomArray.length; i++) {
+      allPaths[i] = new FacetLabel(randomArray[i]);
+      w.addCategory(allPaths[i]);
+      // add random commits to create multiple segments in the index
+      if (random().nextFloat() < PROBABILITY_OF_COMMIT) {
+        w.commit();
+      }
+    }
+    w.commit();
+    w.close();
+
+    DirectoryTaxonomyReader r1 = new DirectoryTaxonomyReader(src);
+
+    for (int i = 0; i < allPaths.length; i++) {

Review comment:
       Could you randomize what paths you look up, instead of always looking up 
all paths?  Randomly pick a (randomly sized) subset each time?
   
   Also, could you run multiple threads here, each running multiple iterations, 
given the sort of complex thread concurrency with the locking to find cache 
misses / to insert them after?

##########
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +349,139 @@ public FacetLabel getPath(int ordinal) throws 
IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?

Review comment:
       Well, the existing `LinkedHashMap` is horribly RAM-inefficient.  I.e. it 
has high overhead since we are caching little tiny objects.  `IntObjectHashMap` 
would be better.  One simple approach to make an approximate LRU from this HPPC 
class is a "double barrel" cache.  At all times there are two 
`IntObjectHashMap`s.  One is primary, the other is secondary.  On get we first 
check primary.  If it's not there, check secondary and if that was a hit, also 
store it in primary.  If it's a full cache miss, store in primary.  And then 
periodically, when there are too many items in both maps, drop secondary, move 
primary to secondary, and make a new empty map as primary.
   
   This is not a "true" LRU, only approximate, but I bet given the big RAM 
efficiency gains we'd see from `IntObjectHashMap`, probably worth it.
   
   But let's at least open an issue to get the pebble rolling?

##########
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +349,139 @@ public FacetLabel getPath(int ordinal) throws 
IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      return categoryCache.get(ordinal);
+    }
+  }
+
+  private void checkOrdinalBounds(int ordinal, int indexReaderMaxDoc)
+      throws IllegalArgumentException {
+    if (ordinal < 0 || ordinal >= indexReaderMaxDoc) {
+      throw new IllegalArgumentException(
+          "ordinal "
+              + ordinal
+              + " is out of the range of the indexReader "
+              + indexReader.toString());
+    }
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * <p>This API is generally faster than iteratively calling {@link 
#getPath(int)} over an array of
+   * ordinals. It uses the {@link #getPath(int)} method iteratively when it 
detects that the index
+   * was created using StoredFields (with no performance gains) and uses 
DocValues based iteration
+   * when the index is based on DocValues.
+   *
+   * @param ordinals Array of ordinals that are assigned to categories 
inserted into the taxonomy
+   *     index
+   */
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+    ensureOpen();
+
+    int ordinalsLength = ordinals.length;
+    FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+    // remember the original positions of ordinals before they are sorted
+    int[] originalPosition = new int[ordinalsLength];
+    Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+    int indexReaderMaxDoc = indexReader.maxDoc();
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      // check whether the ordinal is valid before accessing the cache
+      checkOrdinalBounds(ordinals[i], indexReaderMaxDoc);
+      // check the cache before trying to find it in the index
+      FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+    }
+
+    /* parallel sort the ordinals and originalPosition array based on the 
values in the ordinals array */
+    new InPlaceMergeSorter() {
+      @Override
+      protected void swap(int i, int j) {
+        int x = ordinals[i];
+        ordinals[i] = ordinals[j];
+        ordinals[j] = x;
+
+        x = originalPosition[i];
+        originalPosition[i] = originalPosition[j];
+        originalPosition[j] = x;
+      }
+      ;
+
+      @Override
+      public int compare(int i, int j) {
+        return Integer.compare(ordinals[i], ordinals[j]);
+      }
+    }.sort(0, ordinalsLength);
+
+    int readerIndex;
+    int leafReaderMaxDoc = 0;
+    int leafReaderDocBase = 0;
+    LeafReader leafReader;
+    LeafReaderContext leafReaderContext;
+    BinaryDocValues values = null;
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      if (bulkPath[originalPosition[i]] == null) {
+        /*
+        If ordinals[i] >= leafReaderMaxDoc then we find the next leaf that 
contains our ordinal
+         */
+        if (values == null || ordinals[i] >= leafReaderMaxDoc) {
+
+          readerIndex = ReaderUtil.subIndex(ordinals[i], indexReader.leaves());
+          leafReaderContext = indexReader.leaves().get(readerIndex);
+          leafReader = leafReaderContext.reader();
+          leafReaderMaxDoc = leafReader.maxDoc();
+          leafReaderDocBase = leafReaderContext.docBase;
+          values = leafReader.getBinaryDocValues(Consts.FULL);
+
+          /*
+          If the index is constructed with the older StoredFields it will not 
have any BinaryDocValues field and will return null
+           */
+          if (values == null) {
+            return getBulkPathForOlderIndexes(ordinals);
+          }
+        }
+        boolean success = values.advanceExact(ordinals[i] - leafReaderDocBase);
+        assert success;
+        bulkPath[originalPosition[i]] =
+            new 
FacetLabel(FacetsConfig.stringToPath(values.binaryValue().utf8ToString()));
+
+        // add the value to the categoryCache after computation
+        synchronized (categoryCache) {

Review comment:
       I think a better solution would be to accumulate, into a temporary 
array, the `int`/`FacetLabel` pairs that need to be inserted.
   
   Then, outside this `for` loop, if there were any of those, acquire the 
`sync` lock once, iterate over those cache misses and insert them.  This way we 
are 1) only acquiring/releasing `sync` lock once, and 2) only if there was at 
least one cache miss, and 3) only actually iterating on those cache misses.
   
   These sorts of very fine grained lock acquire/release can lead to serious 
contention with many threads/cores.  Let's take advantage of the "bulkness" of 
this new API to also increase lock granularity?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] mikemccand commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader

Reply via email to