yihua commented on code in PR #18412:
URL: https://github.com/apache/hudi/pull/18412#discussion_r3036403812
##########
hudi-common/src/main/java/org/apache/hudi/common/table/read/buffer/LogScanningRecordBufferLoader.java:
##########
@@ -54,7 +55,14 @@ protected <T> List<String>
scanLogFiles(HoodieReaderContext<T> readerContext, Ho
readStats.setTotalUpdatedRecordsCompacted(logRecordReader.getNumMergedRecordsInLog());
readStats.setTotalLogFilesCompacted(logRecordReader.getTotalLogFiles());
readStats.setTotalLogRecords(logRecordReader.getTotalLogRecords());
- readStats.setTotalLogBlocks(logRecordReader.getTotalLogBlocks());
+ readStats.setTotalLogBlocks(logRecordReader.getTotalValidLogBlocks());
Review Comment:
🤖 This silently changes the semantics of `totalLogBlocks` in commit
metadata. Previously it reported all scanned blocks (`getTotalLogBlocks()`),
now it reports only valid blocks after rollback/compaction filtering
(`getTotalValidLogBlocks()`). Since this value is persisted in
`HoodieWriteStat` and surfaced through
`HoodieCommitMetadata.getTotalLogBlocksCompacted()` and CLI commands, existing
consumers will see different numbers for the same workload. Could you keep
`setTotalLogBlocks` using `getTotalLogBlocks()` (total scanned) and use the new
`setTotalValidLogBlocks` for the valid-only count?
##########
hudi-common/src/main/java/org/apache/hudi/common/table/read/buffer/LogScanningRecordBufferLoader.java:
##########
@@ -54,7 +55,14 @@ protected <T> List<String>
scanLogFiles(HoodieReaderContext<T> readerContext, Ho
readStats.setTotalUpdatedRecordsCompacted(logRecordReader.getNumMergedRecordsInLog());
readStats.setTotalLogFilesCompacted(logRecordReader.getTotalLogFiles());
readStats.setTotalLogRecords(logRecordReader.getTotalLogRecords());
- readStats.setTotalLogBlocks(logRecordReader.getTotalLogBlocks());
+ readStats.setTotalLogBlocks(logRecordReader.getTotalValidLogBlocks());
+
readStats.setTotalValidLogBlocks(logRecordReader.getTotalValidLogBlocks());
+ readStats.setTotalLogBlocksSize(logRecordReader.getTotalLogBlocksSize());
+
readStats.setTotalLogBlocksScanTimeMs(logRecordReader.getBlocksScanDuration());
+
readStats.setTotalLogSizeCompacted(logRecordReader.getBlocksStats().stream()
Review Comment:
🤖 This computes `totalLogSizeCompacted` by summing only data block sizes
from `blocksStats`, but `blocksStats` is only populated for data blocks
(AVRO/HFILE/PARQUET) — delete blocks are excluded. The old code used
`CompactionStrategy.TOTAL_LOG_FILE_SIZE` which was the total log file size.
This will under-report compacted log size when delete blocks are present. Is
that intentional?
##########
hudi-common/src/main/java/org/apache/hudi/common/table/log/BaseHoodieLogRecordReader.java:
##########
@@ -277,6 +299,7 @@ &&
compareTimestamps(logBlock.getLogBlockHeader().get(INSTANT_TIME), GREATER_THA
}
if (instantRange.isPresent() &&
!instantRange.get().isInRange(instantTime)) {
// filter the log block by instant range
Review Comment:
🤖 Using WARN level for an instant that's outside the requested range seems
like normal filtering behavior rather than something concerning. Could this be
noisy in production? I'd suggest DEBUG level here — an instant being filtered
by range is expected, not a warning.
##########
hudi-common/src/main/java/org/apache/hudi/common/table/log/BaseHoodieLogRecordReader.java:
##########
@@ -306,8 +332,14 @@ &&
compareTimestamps(logBlock.getLogBlockHeader().get(INSTANT_TIME), GREATER_THA
logBlock.getLogBlockHeader().get(TARGET_INSTANT_TIME);
targetRollbackInstants.add(targetInstantForCommandBlock);
orderedInstantsList.remove(targetInstantForCommandBlock);
- instantToBlocksMap.remove(targetInstantForCommandBlock);
+ List<HoodieLogBlock> rolledBackBlocks =
instantToBlocksMap.remove(targetInstantForCommandBlock);
+ if (rolledBackBlocks != null) {
+ numBlocksRolledBack += rolledBackBlocks.size();
+ }
+ LOG.info("Reading a rollback block with instant {} and target
instant {}",
+ instantTime, targetInstantForCommandBlock);
} else {
+ LOG.info("Reading a command block with instant {} whose
operation is not supported", instantTime);
Review Comment:
🤖 Good point. The `LOG.error` before the throw is added at line 345 in the
new code (`LOG.error("Reading a command block with instant {} whose operation
is not supported", instantTime)`), so I think this is already addressed in the
current diff.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]