wombatu-kun commented on code in PR #18826:
URL: https://github.com/apache/hudi/pull/18826#discussion_r3411492237


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -857,6 +883,32 @@ private int estimateFileGroupCount(HoodieData<HoodieRecord> records) {
     );
   }
 
+  /**
+   * Sums row counts read from each base file's footer metadata, in parallel via the engine context.
+   * Used in place of materializing and counting an RDD of records during RLI bootstrap.
+   */
+  private long estimateRecordCountFromBaseFiles(List<Pair<String, HoodieBaseFile>> partitionBaseFilePairs) {
+    if (partitionBaseFilePairs.isEmpty()) {
+      return 0L;
+    }
+    int parallelism = Math.min(partitionBaseFilePairs.size(),
+        dataWriteConfig.getMetadataConfig().getRecordIndexMaxParallelism());
+    StorageConfiguration<?> storageConfBroadcast = storageConf;
+    return engineContext.parallelize(partitionBaseFilePairs, parallelism)
+        .map(partitionAndBaseFile -> {
+          HoodieBaseFile baseFile = partitionAndBaseFile.getValue();
+          StoragePath path = baseFile.getStoragePath();
+          try {
+            HoodieStorage storage = HoodieStorageUtils.getStorage(path, storageConfBroadcast);
+            return HoodieIOFactory.getIOFactory(storage).getFileFormatUtils(path).getRowCount(storage, path);
+          } catch (Exception e) {
+            LOG.warn("Failed to read row count from base file footer: {}", path, e);
+            return 0L;

Review Comment:
   Concrete failure mode for the catch-and-return-0L path:
   HFileUtils.getRowCount throws UnsupportedOperationException
   (hudi-common/src/main/java/org/apache/hudi/common/util/HFileUtils.java:109), so
   a data table with hoodie.base.file.format=HFILE hits the catch block for every
   base file. The per-file counts sum to 0 and RLI is sized at minFileGroupCount,
   with per-file WARNs as the only signal that anything went wrong. Parquet and
   ORC are unaffected because their row counts come from the file footer. The
   second option above is the safer one: let UnsupportedOperationException
   propagate instead of catching Exception, and route such tables to the legacy
   persist+count supplier, so that only unsupported formats pay the old cost
   while Parquet/ORC keep the fast path.
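
A minimal sketch of the suggested control flow, assuming nothing about the real Hudi classes: `RowCountFallbackSketch`, `estimateRecordCount`, `footerRowCount`, and `legacyPersistAndCount` are hypothetical stand-ins for the footer-based fast path and the legacy persist+count supplier. It runs sequentially on plain strings, whereas the PR parallelizes through the engine context. In a distributed engine the exception may reach the driver wrapped in an engine-specific exception, so a real change might prefer to check the base file format up front.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.LongSupplier;
import java.util.function.ToLongFunction;

// Hypothetical illustration only; none of these names exist in Hudi.
class RowCountFallbackSketch {

  /**
   * Sums per-file row counts via the fast path. If the format cannot report
   * a row count (UnsupportedOperationException), the whole estimate is
   * delegated to the legacy supplier instead of silently contributing 0.
   */
  static long estimateRecordCount(List<String> baseFilePaths,
                                  ToLongFunction<String> footerRowCount,
                                  LongSupplier legacyPersistAndCount) {
    try {
      long total = 0L;
      for (String path : baseFilePaths) {
        // Only UnsupportedOperationException is handled below; any other
        // failure propagates to the caller rather than being swallowed.
        total += footerRowCount.applyAsLong(path);
      }
      return total;
    } catch (UnsupportedOperationException e) {
      // Unsupported format: pay the old cost for this table only.
      return legacyPersistAndCount.getAsLong();
    }
  }

  public static void main(String[] args) {
    List<String> files = Arrays.asList("f1", "f2", "f3");

    // Format with footer row counts: fast path is used.
    long fast = estimateRecordCount(files, p -> 100L, () -> -1L);

    // Format without row-count support: legacy supplier is used.
    long legacy = estimateRecordCount(
        files,
        p -> { throw new UnsupportedOperationException("no row count"); },
        () -> 300L);

    System.out.println(fast + " " + legacy);
  }
}
```

The point of the narrow catch is that a format that cannot report row counts yields a correct, if slower, estimate, while genuine I/O failures surface instead of being counted as zero rows.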



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
