hudi-agent commented on code in PR #18826:
URL: https://github.com/apache/hudi/pull/18826#discussion_r3430817005
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -857,6 +883,32 @@ private int
estimateFileGroupCount(HoodieData<HoodieRecord> records) {
);
}
+ /**
+ * Sums row counts read from each base file's footer metadata, in parallel
via the engine context.
+ * Used in place of materializing and counting an RDD of records during RLI
bootstrap.
+ */
+ private long estimateRecordCountFromBaseFiles(List<Pair<String,
HoodieBaseFile>> partitionBaseFilePairs) {
+ if (partitionBaseFilePairs.isEmpty()) {
+ return 0L;
+ }
+ int parallelism = Math.min(partitionBaseFilePairs.size(),
+ dataWriteConfig.getMetadataConfig().getRecordIndexMaxParallelism());
+ StorageConfiguration<?> storageConfBroadcast = storageConf;
+ return engineContext.parallelize(partitionBaseFilePairs, parallelism)
+ .map(partitionAndBaseFile -> {
+ HoodieBaseFile baseFile = partitionAndBaseFile.getValue();
+ StoragePath path = baseFile.getStoragePath();
+ try {
+ HoodieStorage storage = HoodieStorageUtils.getStorage(path,
storageConfBroadcast);
+ return
HoodieIOFactory.getIOFactory(storage).getFileFormatUtils(path).getRowCount(storage,
path);
+ } catch (Exception e) {
+ LOG.warn("Failed to read row count from base file footer: {}",
path, e);
+ return 0L;
Review Comment:
🤖 Worth noting that `HFileUtils.getRowCount` already throws
`UnsupportedOperationException` today, so the current catch-all `Exception`
block masks that as a per-file 0 and would silently estimate the whole table at
0 records (→ `minFileGroupCount`) for an HFile-formatted base file. Letting
`UnsupportedOperationException` propagate (or probing `baseFileFormat` upfront)
and routing to the legacy supplier in that case seems like the safer behavior —
a one-time fall-through is much cheaper than under-provisioning RLI shards.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -859,6 +885,32 @@ private int
estimateFileGroupCount(HoodieData<HoodieRecord> records) {
);
}
+ /**
+ * Sums row counts read from each base file's footer metadata, in parallel
via the engine context.
+ * Used in place of materializing and counting an RDD of records during RLI
bootstrap.
+ */
+ private long estimateRecordCountFromBaseFiles(List<Pair<String,
HoodieBaseFile>> partitionBaseFilePairs) {
+ if (partitionBaseFilePairs.isEmpty()) {
+ return 0L;
+ }
+ int parallelism = Math.min(partitionBaseFilePairs.size(),
+ dataWriteConfig.getMetadataConfig().getRecordIndexMaxParallelism());
+ StorageConfiguration<?> storageConfBroadcast = storageConf;
Review Comment:
🤖 nit: `storageConfBroadcast` reads like a Spark `Broadcast<T>` variable,
which this isn't — it's a local capture to avoid serializing `this` in the
lambda. Could you rename it to something like `localStorageConf` (and maybe add
a brief inline comment explaining the lambda-capture idiom)?
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestHoodieBackedMetadata.java:
##########
@@ -4379,6 +4379,104 @@ private void changeTableVersion(HoodieTableVersion
version) throws IOException {
}
}
+ /**
+ * Validates that RLI initialization estimates file group count from base
file footer metadata
+ * (instead of materializing and counting records) when min != max file
group count.
+ */
+ @ParameterizedTest
+ @EnumSource(HoodieTableType.class)
+ public void testRecordIndexFileGroupEstimation(HoodieTableType tableType)
throws Exception {
Review Comment:
🤖 Confirming the math: 200 RLI records ×
`RECORD_INDEX_AVERAGE_RECORD_SIZE=48` ≈ 9.6 KB, which is well below the default
`record.index.max.filegroup.size=1 GB`, so the estimator always clamps to
`minFileGroupCount=1` regardless of what `estimateRecordCountFromBaseFiles`
returns. As-is the test would pass even if the new footer-reading path silently
returned 0, so it doesn't actually cover the regression it's meant to guard.
Setting `withRecordIndexMaxFileGroupSizeBytes` to a small value (e.g. 1–2 KB)
would make the row-count drive the result and exercise the new code path.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]