wombatu-kun commented on code in PR #19002:
URL: https://github.com/apache/hudi/pull/19002#discussion_r3409463412
##########
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/common/table/log/TestLogReaderUtils.java:
##########
@@ -131,6 +132,48 @@ public void testGetAllLogFilesWithMaxCommit() throws Exception {
}
}
+ @Test
+ public void testLogFileWriteStatSizeMatchesOnDisk() throws Exception {
+ // HoodieAppendHandle records each log file's size by deriving it from AppendResult
+ // (logOffset + accumulated appended bytes) instead of a getPathInfo per file. Validate that the
+ // derived size in the write stat matches the actual on-disk log file length, across multiple
+ // appends to the same log file (two upsert commits with no compaction / small-file packing).
+ HoodieTableMetaClient metaClient = getHoodieMetaClient(HoodieTableType.MERGE_ON_READ, new Properties());
+
+ HoodieWriteConfig config = getConfigBuilder(true)
+ .withPath(basePath())
+ .withCompactionConfig(HoodieCompactionConfig.newBuilder()
+ .withInlineCompaction(false)
+ .compactionSmallFileSize(0)
+ .build())
+ .build();
+
+ HoodieTestDataGenerator dataGen = new HoodieTestDataGenerator();
+
+ try (SparkRDDWriteClient client = getHoodieWriteClient(config)) {
+ // First commit - insert data (base files)
+ String firstCommit = "001";
+ WriteClientTestUtils.startCommitWithTime(client, firstCommit);
+ JavaRDD<WriteStatus> insertRdd = client.insert(jsc().parallelize(dataGen.generateInserts(firstCommit, 100), 1), firstCommit);
+ assertNoWriteErrors(insertRdd.collect());
+ client.commit(firstCommit, insertRdd);
+
+ // Upsert across two commits so each log file accumulates multiple appends through the handle
Review Comment:
This comment (and the comment block at the top of the method) says the two
upsert commits make each log file "accumulate multiple appends through the
handle." Under the default write table version (NINE), however, each delta
commit writes a new instant-named log file starting at offset 0: in
HoodieWriteHandle.createLogWriter, the version >= EIGHT branch builds the
writer with withFileSize(0L). As a result, each log file gets a single append,
getLogOffset() is 0 for every write status here, and this assertion passes
identically with or without the `logOffset +` term. That term only matters
when appending to a pre-existing log file (the table version SIX branch). To
actually exercise the derived sum, drive a case with logOffset > 0 (e.g. build
the config with write table version SIX so the second commit appends to the
existing file); otherwise, reword the comments to say that only the offset-0
path is covered.
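
To illustrate why an offset-0 append cannot tell the two formulas apart, here
is a minimal standalone sketch. It does not use Hudi at all:
`DerivedLogSizeSketch`, `AppendOutcome`, and `append` are hypothetical names
that only model "offset before the write plus bytes written" against a plain
file on local disk, so treat it as an illustration of the arithmetic rather
than of the actual handle code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical model, not Hudi code: all names here are illustrative only.
class DerivedLogSizeSketch {

  // Stand-in for what an append reports: where it started and how much it wrote.
  static final class AppendOutcome {
    final long logOffset;
    final long bytesAppended;

    AppendOutcome(long logOffset, long bytesAppended) {
      this.logOffset = logOffset;
      this.bytesAppended = bytesAppended;
    }

    // File size derived with the offset term included.
    long derivedSize() {
      return logOffset + bytesAppended;
    }
  }

  // Appends a block and records the file length before the write as the offset.
  static AppendOutcome append(Path logFile, byte[] block) throws IOException {
    long offset = Files.exists(logFile) ? Files.size(logFile) : 0L;
    Files.write(logFile, block, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    return new AppendOutcome(offset, block.length);
  }

  public static void main(String[] args) throws IOException {
    Path log = Files.createTempDirectory("log-size-sketch").resolve("sample.log");

    // Fresh file: the offset is 0, so dropping the offset term changes nothing.
    AppendOutcome first = append(log, new byte[128]);
    long sizeAfterFirst = Files.size(log);
    System.out.println("fresh file, with offset term matches disk: "
        + (first.derivedSize() == sizeAfterFirst));
    System.out.println("fresh file, without offset term matches disk: "
        + (first.bytesAppended == sizeAfterFirst));

    // Pre-existing file: the offset is > 0, so only the full sum matches.
    AppendOutcome second = append(log, new byte[64]);
    long sizeAfterSecond = Files.size(log);
    System.out.println("existing file, with offset term matches disk: "
        + (second.derivedSize() == sizeAfterSecond));
    System.out.println("existing file, without offset term matches disk: "
        + (second.bytesAppended == sizeAfterSecond));
  }
}
```

Only the second append, which starts from a non-zero offset, separates the
two formulas. That is the case the test would need to reach (for instance via
the table version SIX path mentioned above) for the assertion to cover the
`logOffset +` term.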