Re: [PR] feat(append): dynamically cap log-block buffer at per-task memory [hudi]

via GitHub Tue, 26 May 2026 20:07:24 -0700


hudi-agent commented on code in PR #18860:
URL: https://github.com/apache/hudi/pull/18860#discussion_r3308172954



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java:
##########
@@ -691,6 +757,42 @@ protected void flushToDiskIfRequired(HoodieRecord record, 
boolean appendDeleteBl
     }
   }
 
+  @VisibleForTesting
+  void setSizeEstimator(SizeEstimator<HoodieRecord> sizeEstimator) {
+    this.sizeEstimator = sizeEstimator;
+  }
+
+  @VisibleForTesting

Review Comment:
   🤖 nit: every other `@VisibleForTesting` accessor in this block uses the 
`ForTest` suffix (`getNumberOfRecordsForTest`, `getEffectiveBlockSizeForTest`, 
etc.), but this one doesn't — could you rename it to 
`getAverageRecordSizeForTest()` for consistency? As-is it reads like a 
production-visible API.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java:
##########
@@ -691,6 +757,42 @@ protected void flushToDiskIfRequired(HoodieRecord record, 
boolean appendDeleteBl
     }
   }
 
+  @VisibleForTesting
+  void setSizeEstimator(SizeEstimator<HoodieRecord> sizeEstimator) {
+    this.sizeEstimator = sizeEstimator;
+  }
+
+  @VisibleForTesting
+  long getAverageRecordSize() {
+    return averageRecordSize;
+  }
+
+  @VisibleForTesting
+  long getNumberOfRecordsForTest() {
+    return numberOfRecords;
+  }
+
+  @VisibleForTesting
+  long getEstimatedBytesWrittenForTest() {
+    return estimatedNumberOfBytesWritten;
+  }
+
+  @VisibleForTesting
+  void simulateBufferedRecordForTest(HoodieRecord bufferedRecord) {
+    numberOfRecords++;
+    flushToDiskIfRequired(bufferedRecord, false);

Review Comment:
   🤖 Have you considered the downstream effect on `canWrite()`? It computes 
`parquetMaxFileSize >= estimatedNumberOfBytesWritten * 
logFileToParquetCompressionRatio`, and `estimatedNumberOfBytesWritten` is now 
accumulated using the larger post-`prepareRecord` `averageRecordSize` (on 
Spark, the in-memory Avro graph vs. the prior UnsafeRow). That likely makes 
`canWrite()` return false earlier than before, so log files may roll more 
aggressively. Is that the intended trade-off, or does 
`logFileToParquetCompressionRatio` need re-tuning given the new sizing basis?
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] feat(append): dynamically cap log-block buffer at per-task memory [hudi]

Reply via email to