Re: [PR] Reduce FST block size for BlockTreeTermsWriter [lucene]

via GitHub Tue, 03 Oct 2023 12:45:18 -0700


mikemccand commented on code in PR #12604:
URL: https://github.com/apache/lucene/pull/12604#discussion_r1344628522



##########
lucene/CHANGES.txt:
##########
@@ -163,6 +163,9 @@ Optimizations
 * GITHUB#12382: Faster top-level conjunctions on term queries when sorting by
   descending score. (Adrien Grand)
 
+* GITHUB#12604: Estimate the block size of FST BytesStore in BlockTreeTermsWriter
+  to reducing GC load during indexing. (Guo Feng)

Review Comment:
   reducing -> reduce



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsWriter.java:
##########
@@ -490,10 +491,22 @@ public void compileIndex(
         }
       }
 
+      long estimateSize = prefix.length;

Review Comment:
   Too bad we don't have a writer that starts with a tiny block (like 8 bytes) 
but doubles the size for each new block (16 bytes, then 32 bytes, etc.).  Then 
we would naturally use about log(size) blocks without over-allocating.
   
   But then reading bytes is a bit tricky because we'd need to take the 
discrete log (base 2) of the address to find which block it lands in.  Maybe it 
wouldn't be so bad -- we could probably do this with 
`Long.numberOfLeadingZeros`?  But that's a bigger change ... we can do this 
separately/later.
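
   A minimal sketch of the doubling-block idea, to make the address arithmetic 
concrete.  `DoublingBlockStore` and its methods are hypothetical names for 
illustration, not existing Lucene classes, and the 8-byte first block is just 
the size floated above.  Block `k` holds `8 << k` bytes and starts at address 
`8 * (2^k - 1)`, so adding 8 to the address and taking the position of the 
highest set bit gives the block index:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of Lucene. */
final class DoublingBlockStore {
  // Assumption: the first block holds 2^3 = 8 bytes, each later block doubles.
  private static final int FIRST_BLOCK_BITS = 3;
  private static final long FIRST_BLOCK_SIZE = 1L << FIRST_BLOCK_BITS;

  private final List<byte[]> blocks = new ArrayList<>();
  private long size;

  /** Index of the block containing {@code address} (address must be >= 0). */
  static int blockIndex(long address) {
    long shifted = address + FIRST_BLOCK_SIZE;
    // 63 - numberOfLeadingZeros(x) is floor(log2(x)) for x > 0.
    return (63 - Long.numberOfLeadingZeros(shifted)) - FIRST_BLOCK_BITS;
  }

  /** Offset of {@code address} inside its block. */
  static int blockOffset(long address) {
    long shifted = address + FIRST_BLOCK_SIZE;
    // Clearing the highest set bit leaves the position within the block.
    return (int) (shifted - Long.highestOneBit(shifted));
  }

  void writeByte(byte b) {
    int index = blockIndex(size);
    if (index == blocks.size()) {
      // Allocate the next block lazily, at twice the previous block's size.
      // A real writer would need to cap this before the int shift overflows.
      blocks.add(new byte[1 << (index + FIRST_BLOCK_BITS)]);
    }
    blocks.get(index)[blockOffset(size)] = b;
    size++;
  }

  byte readByte(long address) {
    if (address < 0 || address >= size) {
      throw new IndexOutOfBoundsException("address=" + address + " size=" + size);
    }
    return blocks.get(blockIndex(address))[blockOffset(address)];
  }

  long size() {
    return size;
  }

  int blockCount() {
    return blocks.size();
  }
}
```

   Writing 100 bytes into this sketch allocates four blocks (8 + 16 + 32 + 64 
bytes), and each read costs one `numberOfLeadingZeros` plus one 
`highestOneBit` on top of the array lookup.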



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsWriter.java:
##########
@@ -490,10 +491,22 @@ public void compileIndex(
         }
       }
 
+      long estimateSize = prefix.length;
+      for (PendingBlock block : blocks) {
+        if (block.subIndices != null) {

Review Comment:
   We also should really explore the `TODO` above to write `vLong` in opposite 
byte order -- this might save quite a bit of storage in the FST since outputs 
would share more prefixes.  Again, separate issue 😀 
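
   A small sketch of why byte order matters for prefix sharing.  The helper 
names below are hypothetical and this is not Lucene's encoder; it only assumes 
the usual 7-bits-per-byte variable-length layout with the high bit as a 
continuation flag, and non-negative values.  Two nearby values differ in their 
low-order bits, so a low-order-first encoding diverges at the first byte while 
a high-order-first encoding keeps the common leading bytes:

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical illustration only; not Lucene's vLong implementation. */
final class VLongByteOrderDemo {

  /** Low-order 7-bit group first (assumes value >= 0). */
  static byte[] encodeLsbFirst(long value) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    while ((value & ~0x7FL) != 0) {
      out.write((int) ((value & 0x7FL) | 0x80L)); // continuation bit set
      value >>>= 7;
    }
    out.write((int) value); // last byte has no continuation bit
    return out.toByteArray();
  }

  /** High-order 7-bit group first (assumes value >= 0). */
  static byte[] encodeMsbFirst(long value) {
    int groups = 1;
    for (long v = value >>> 7; v != 0; v >>>= 7) {
      groups++;
    }
    byte[] out = new byte[groups];
    for (int i = 0; i < groups; i++) {
      int shift = 7 * (groups - 1 - i);
      int bits = (int) ((value >>> shift) & 0x7FL);
      // Every byte except the last carries the continuation bit.
      out[i] = (byte) (i < groups - 1 ? (bits | 0x80) : bits);
    }
    return out;
  }

  /** Number of leading bytes the two arrays have in common. */
  static int sharedPrefixLength(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    int i = 0;
    while (i < n && a[i] == b[i]) {
      i++;
    }
    return i;
  }
}
```

   For example, 1000000 and 1000050 share no leading bytes when encoded 
low-order first, but share two of their three bytes when encoded high-order 
first.  Whether that translates into real FST savings would still need to be 
measured.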



