Re: [PR] feat(metadata): optimize RLI bootstrap by sizing file groups from base file footer row counts [hudi]

via GitHub Wed, 17 Jun 2026 12:13:34 -0700


hudi-agent commented on code in PR #18826:
URL: https://github.com/apache/hudi/pull/18826#discussion_r3430817005



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -857,6 +883,32 @@ private int 
estimateFileGroupCount(HoodieData<HoodieRecord> records) {
     );
   }
 
+  /**
+   * Sums row counts read from each base file's footer metadata, in parallel 
via the engine context.
+   * Used in place of materializing and counting an RDD of records during RLI 
bootstrap.
+   */
+  private long estimateRecordCountFromBaseFiles(List<Pair<String, 
HoodieBaseFile>> partitionBaseFilePairs) {
+    if (partitionBaseFilePairs.isEmpty()) {
+      return 0L;
+    }
+    int parallelism = Math.min(partitionBaseFilePairs.size(),
+        dataWriteConfig.getMetadataConfig().getRecordIndexMaxParallelism());
+    StorageConfiguration<?> storageConfBroadcast = storageConf;
+    return engineContext.parallelize(partitionBaseFilePairs, parallelism)
+        .map(partitionAndBaseFile -> {
+          HoodieBaseFile baseFile = partitionAndBaseFile.getValue();
+          StoragePath path = baseFile.getStoragePath();
+          try {
+            HoodieStorage storage = HoodieStorageUtils.getStorage(path, 
storageConfBroadcast);
+            return 
HoodieIOFactory.getIOFactory(storage).getFileFormatUtils(path).getRowCount(storage,
 path);
+          } catch (Exception e) {
+            LOG.warn("Failed to read row count from base file footer: {}", 
path, e);
+            return 0L;

Review Comment:
   🤖 Worth noting that `HFileUtils.getRowCount` already throws 
`UnsupportedOperationException` today, so the current catch-all `Exception` 
block masks that as a per-file 0 and would silently estimate the whole table at 
0 records (→ `minFileGroupCount`) for an HFile-formatted base file. Letting 
`UnsupportedOperationException` propagate (or probing `baseFileFormat` upfront) 
and routing to the legacy supplier in that case seems like the safer behavior — 
a one-time fall-through is much cheaper than under-provisioning RLI shards.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -859,6 +885,32 @@ private int 
estimateFileGroupCount(HoodieData<HoodieRecord> records) {
     );
   }
 
+  /**
+   * Sums row counts read from each base file's footer metadata, in parallel 
via the engine context.
+   * Used in place of materializing and counting an RDD of records during RLI 
bootstrap.
+   */
+  private long estimateRecordCountFromBaseFiles(List<Pair<String, 
HoodieBaseFile>> partitionBaseFilePairs) {
+    if (partitionBaseFilePairs.isEmpty()) {
+      return 0L;
+    }
+    int parallelism = Math.min(partitionBaseFilePairs.size(),
+        dataWriteConfig.getMetadataConfig().getRecordIndexMaxParallelism());
+    StorageConfiguration<?> storageConfBroadcast = storageConf;

Review Comment:
   🤖 nit: `storageConfBroadcast` reads like a Spark `Broadcast<T>` variable, 
which this isn't — it's a local capture to avoid serializing `this` in the 
lambda. Could you rename it to something like `localStorageConf` (and maybe add 
a brief inline comment explaining the lambda-capture idiom)?
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestHoodieBackedMetadata.java:
##########
@@ -4379,6 +4379,104 @@ private void changeTableVersion(HoodieTableVersion 
version) throws IOException {
     }
   }
 
+  /**
+   * Validates that RLI initialization estimates file group count from base 
file footer metadata
+   * (instead of materializing and counting records) when min != max file 
group count.
+   */
+  @ParameterizedTest
+  @EnumSource(HoodieTableType.class)
+  public void testRecordIndexFileGroupEstimation(HoodieTableType tableType) 
throws Exception {

Review Comment:
   🤖 Confirming the math: 200 RLI records × 
`RECORD_INDEX_AVERAGE_RECORD_SIZE=48` ≈ 9.6 KB, which is well below the default 
`record.index.max.filegroup.size=1 GB`, so the estimator always clamps to 
`minFileGroupCount=1` regardless of what `estimateRecordCountFromBaseFiles` 
returns. As-is the test would pass even if the new footer-reading path silently 
returned 0, so it doesn't actually cover the regression it's meant to guard. 
Setting `withRecordIndexMaxFileGroupSizeBytes` to a small value (e.g. 1–2 KB) 
would make the row-count drive the result and exercise the new code path.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] feat(metadata): optimize RLI bootstrap by sizing file groups from base file footer row counts [hudi]

Reply via email to