mikemccand commented on code in PR #12604:
URL: https://github.com/apache/lucene/pull/12604#discussion_r1344628522
##########
lucene/CHANGES.txt:
##########
@@ -163,6 +163,9 @@ Optimizations
* GITHUB#12382: Faster top-level conjunctions on term queries when sorting by
descending score. (Adrien Grand)
+* GITHUB#12604: Estimate the block size of FST BytesStore in BlockTreeTermsWriter
+ to reducing GC load during indexing. (Guo Feng)
Review Comment:
reducing -> reduce
##########
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsWriter.java:
##########
@@ -490,10 +491,22 @@ public void compileIndex(
}
}
+ long estimateSize = prefix.length;
Review Comment:
Too bad we don't have a writer that starts with a tiny block (say 8 bytes) and
doubles the size for each new block (16 bytes, then 32 bytes, etc.). Then we
would naturally use about log(size) blocks without over-allocating.
But then reading bytes is a bit tricky because we'd need to take the discrete
log (base 2) of the address. Maybe it wouldn't be so bad -- we could probably
do this with `Long.numberOfLeadingZeros`? But that's a bigger change ... we can
do this separately/later.
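
A minimal sketch of the address math such a writer would need, assuming an 8-byte first block where block k holds `8 << k` bytes. The class and method names (`DoublingBlocks`, `blockIndex`, `offsetInBlock`, `blockSize`) are hypothetical and not existing Lucene code; only `Long.numberOfLeadingZeros` and `Long.highestOneBit` are real JDK calls.

```java
final class DoublingBlocks {
  // Assumption for illustration: the first block holds 2^3 = 8 bytes.
  static final int FIRST_BLOCK_BITS = 3;
  static final long FIRST_BLOCK_SIZE = 1L << FIRST_BLOCK_BITS;

  // Block k starts at absolute address 8 * (2^k - 1), so adding the first
  // block size maps each block onto the range [8 * 2^k, 8 * 2^(k+1)).
  // The block index is then floor(log2(address + 8)) - 3.
  static int blockIndex(long address) {
    long shifted = address + FIRST_BLOCK_SIZE;
    int log2 = 63 - Long.numberOfLeadingZeros(shifted);
    return log2 - FIRST_BLOCK_BITS;
  }

  // Offset inside the block: clear the highest set bit of the shifted address.
  static long offsetInBlock(long address) {
    long shifted = address + FIRST_BLOCK_SIZE;
    return shifted - Long.highestOneBit(shifted);
  }

  // Size in bytes of block k: 8, 16, 32, ...
  static long blockSize(int blockIndex) {
    return FIRST_BLOCK_SIZE << blockIndex;
  }

  public static void main(String[] args) {
    for (long address : new long[] {0, 7, 8, 23, 24, 1000}) {
      System.out.println(
          address + " -> block " + blockIndex(address)
              + ", offset " + offsetInBlock(address));
    }
  }
}
```

With this layout a read costs one `numberOfLeadingZeros`, one subtraction and an array lookup, instead of the shift and mask used with fixed-size blocks.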
##########
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsWriter.java:
##########
@@ -490,10 +491,22 @@ public void compileIndex(
}
}
+ long estimateSize = prefix.length;
+ for (PendingBlock block : blocks) {
+ if (block.subIndices != null) {
Review Comment:
We also should really explore the `TODO` above to write `vLong` in opposite
byte order -- this might save quite a bit of storage in the FST since outputs
would share more prefixes. Again, separate issue 😀
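
A rough sketch of why byte order matters here, assuming the usual variable-length encoding of 7 payload bits per byte with a continuation flag. The names (`VLongOrder`, `encodeLsbFirst`, `encodeMsbFirst`, `sharedPrefix`) are hypothetical, this is not the Lucene implementation, and it only handles non-negative values. It compares how many leading bytes two nearby values share under each order.

```java
import java.io.ByteArrayOutputStream;

final class VLongOrder {
  // Low-order 7-bit group first: nearby values differ in the first byte.
  static byte[] encodeLsbFirst(long value) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    while ((value & ~0x7FL) != 0) {
      out.write((int) ((value & 0x7F) | 0x80)); // continuation bit set
      value >>>= 7;
    }
    out.write((int) value);
    return out.toByteArray();
  }

  // High-order 7-bit group first: nearby values differ only in the last bytes.
  static byte[] encodeMsbFirst(long value) {
    int groups = 1;
    for (long v = value >>> 7; v != 0; v >>>= 7) {
      groups++;
    }
    byte[] out = new byte[groups];
    for (int i = 0; i < groups; i++) {
      int shift = 7 * (groups - 1 - i);
      int bits = (int) ((value >>> shift) & 0x7F);
      // Every byte except the last carries the continuation bit.
      out[i] = (byte) (i < groups - 1 ? bits | 0x80 : bits);
    }
    return out;
  }

  // Number of leading bytes the two encodings have in common.
  static int sharedPrefix(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    int i = 0;
    while (i < n && a[i] == b[i]) {
      i++;
    }
    return i;
  }

  public static void main(String[] args) {
    long x = 1_000_000L;
    long y = 1_000_001L;
    System.out.println(
        "LSB-first shared bytes: " + sharedPrefix(encodeLsbFirst(x), encodeLsbFirst(y)));
    System.out.println(
        "MSB-first shared bytes: " + sharedPrefix(encodeMsbFirst(x), encodeMsbFirst(y)));
  }
}
```

Whether this translates into real savings in the FST would still need to be measured on an actual index.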