Re: [PR] Use `IndexInput#prefetch` for terms dictionary lookups. [lucene]

via GitHub Mon, 03 Jun 2024 06:20:24 -0700


mikemccand commented on code in PR #13359:
URL: https://github.com/apache/lucene/pull/13359#discussion_r1624448703



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnum.java:
##########
@@ -307,6 +309,30 @@ private boolean setEOF() {
     return true;
   }
 
+  @Override
+  public void prepareSeekExact(BytesRef target) throws IOException {
+    if (fr.index == null) {
+      throw new IllegalStateException("terms index was not loaded");
+    }
+
+    if (fr.size() == 0 || target.compareTo(fr.getMin()) < 0 || target.compareTo(fr.getMax()) > 0) {
+      return;
+    }
+
+    // TODO: should we try to reuse the current state of this terms enum when applicable?
+    BytesRefFSTEnum<BytesRef> indexEnum = new BytesRefFSTEnum<>(fr.index);
+    InputOutput<BytesRef> output = indexEnum.seekFloor(target);
+    if (output != null) { // should never be null since we already checked against fr.getMin()?
+      final long code =
+          fr.readVLongOutput(
+              new ByteArrayDataInput(
+                  output.output.bytes, output.output.offset, output.output.length));
+      final long fpSeek = code >>> Lucene90BlockTreeTermsReader.OUTPUT_FLAGS_NUM_BITS;
+      initIndexInput();
+      in.prefetch(fpSeek, 1); // TODO: could we know the length of the block?

Review Comment:
   > > But for non-leaf blocks, first all leaf blocks under them are written (in order), and THEN the non-leaf block is written only when we are done with all those recursions and writing any straggler terms that live in the non-leaf block.
   > 
   > This means if we subtract the `fp` of a non-leaf block from the `fp` of the next one, we will get its sub-blocks' total length?
   
   It's tricky.  I think if you do that, you'll get the total length of the next block's sub-blocks instead, because each non-leaf block is written at the end of the recursive (depth-first) visit of all of its sub-blocks.
   
   I'm still not sure how to cleanly/efficiently get the total byte length of a leaf block by looking solely at the FST terms index.  So we should proceed with the hint as is (prefetch 1 byte from position X): "typically" the terms block will fit into a single IO page (512 or 4096 bytes), and the IO system may do further readahead beyond that.
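
   To make the write order concrete, here is a toy model of the depth-first layout described above. It is only a sketch: the `Block` class, the byte sizes, and the `write`/`layout` helpers are hypothetical and are not Lucene's block tree writer. It just shows which bytes end up between the `fp` of two adjacent non-leaf blocks when sub-blocks are written before their parent.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical names and sizes, not Lucene's actual writer.
final class BlockLayoutSketch {

  /** A block with a name, the size of its own bytes, and its sub-blocks. */
  static final class Block {
    final String name;
    final long ownBytes;
    final List<Block> children = new ArrayList<>();

    Block(String name, long ownBytes, Block... kids) {
      this.name = name;
      this.ownBytes = ownBytes;
      for (Block kid : kids) {
        children.add(kid);
      }
    }
  }

  /**
   * Writes all sub-blocks first (depth first), then the block itself.
   * Records each block's start fp and returns the next free file pointer.
   */
  static long write(Block block, long fp, Map<String, Long> startFp) {
    for (Block child : block.children) {
      fp = write(child, fp, startFp);
    }
    startFp.put(block.name, fp);
    return fp + block.ownBytes;
  }

  /** Lays out a small tree: root -> (A -> a1, a2), (B -> b1, b2). */
  static Map<String, Long> layout() {
    Block a = new Block("A", 10, new Block("a1", 100), new Block("a2", 120));
    Block b = new Block("B", 10, new Block("b1", 200), new Block("b2", 50));
    Block root = new Block("root", 5, a, b);
    Map<String, Long> startFp = new LinkedHashMap<>();
    write(root, 0L, startFp);
    return startFp;
  }

  public static void main(String[] args) {
    Map<String, Long> startFp = layout();
    startFp.forEach((name, fp) -> System.out.println(name + " starts at fp=" + fp));
    // In this model: A's own bytes (10) plus B's sub-blocks (200 + 50).
    System.out.println("fp(B) - fp(A) = " + (startFp.get("B") - startFp.get("A")));
  }
}
```

   In this model `fp(B) - fp(A)` comes out as A's own straggler bytes plus the total size of B's sub-blocks (b1 and b2), not the size of A's sub-blocks, which is why the subtraction does not directly give a length usable as a prefetch hint.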



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

