Re: [PR] [improvement](fe) Add external table metadata profile details [doris]

via GitHub Thu, 28 May 2026 18:12:47 -0700


github-actions[bot] commented on code in PR #63648:
URL: https://github.com/apache/doris/pull/63648#discussion_r3321554535



##########
fe/fe-core/src/main/java/org/apache/doris/datasource/iceberg/source/IcebergScanNode.java:
##########
@@ -914,21 +927,30 @@ private List<Split> doGetSplits(int numBackends) throws 
UserException {
         TableScan scan = createTableScan();
 
         try (CloseableIterable<FileScanTask> fileScanTasks = 
planFileScanTask(scan)) {
-            if (tableLevelPushDownCount) {
-                int needSplitCnt = countFromSnapshot < 
COUNT_WITH_PARALLEL_SPLITS
-                        ? 1 : 
sessionVariable.getParallelExecInstanceNum(scanContext.getClusterName()) * 
numBackends;
-                for (FileScanTask next : fileScanTasks) {
-                    splits.add(createIcebergSplit(next));
-                    if (splits.size() >= needSplitCnt) {
-                        break;
+            long startTime = System.currentTimeMillis();
+            try {

Review Comment:
   This timer now starts after `planFileScanTask(scan)`, so it excludes the 
eager Iceberg planning paths. For example, when manifest cache is enabled, 
`planFileScanTaskWithManifestCache()` does the manifest/delete-file planning 
before this line; when manifest cache is disabled and this is the non-batch 
`splitFiles()` path, lines 613-615 materialize `scan.planFiles()` into a list 
before returning the iterable. In both cases the expensive file-scan-task 
planning has already happened before `startTime`, so `External Table Get File 
Scan Tasks Time` can still be near zero or only include split conversion. 
Please time both the eager `planFileScanTask()` work and the later iterable 
consumption, without double-counting the lazy paths already fixed by the 
existing thread.



##########
fe/fe-core/src/main/java/org/apache/doris/datasource/hudi/source/HudiScanNode.java:
##########
@@ -416,6 +434,9 @@ private void getPartitionSplits(HivePartition partition, 
List<Split> splits) thr
                     .forEach(fileSlice -> splits.add(
                             generateHudiSplit(fileSlice, 
partition.getPartitionValues(), queryInstant)));
         }
+        if (getSummaryProfile() != null) {
+            
getSummaryProfile().addExternalTableGetFileScanTasksTime(System.currentTimeMillis()
 - startTime);
+        }

Review Comment:
   `getPartitionsSplits()` dispatches one task per partition on the external 
meta-cache file-listing executor, and each task records its own elapsed time 
here into the shared `External Table Get File Scan Tasks Time`. Because 
`SummaryProfile` presents the counter as a single elapsed time, parallel Hudi 
partition listing can report roughly the sum of worker durations rather than 
the wall-clock split planning time. This makes the new external meta total 
misleading for normal batch/full-table Hudi scans; please record one wall-clock 
interval around the parallel fan-out/wait or label/count it as accumulated 
worker time.



##########
fe/fe-core/src/main/java/org/apache/doris/datasource/hive/source/HiveScanNode.java:
##########
@@ -301,6 +312,9 @@ private void getFileSplitByPartitions(HiveExternalMetaCache 
cache, List<HivePart
             fileCaches = cache.getFilesByPartitions(partitions, withCache, 
partitions.size() > 1,
                     directoryLister, hmsTable);
         }
+        if (getSummaryProfile() != null) {
+            
getSummaryProfile().addExternalTableGetPartitionFilesTime(System.currentTimeMillis()
 - startTime);
+        }

Review Comment:
   In batch mode this method is invoked once per partition from concurrent 
tasks in `startSplit()` (`Collections.singletonList(partition)` is submitted to 
`scheduleExecutor`). Each worker adds its own elapsed time to the shared 
`External Table Get Partition Files Time`, and `SummaryProfile` then displays 
that accumulated value as `Get Partition Files Time`. With parallel partition 
listing, the reported time can be inflated by the worker concurrency and exceed 
the actual planning wall time. Please measure the batch partition-file phase 
once around the scheduling/wait path, or otherwise expose this explicitly as 
accumulated worker time instead of elapsed profile time.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [improvement](fe) Add external table metadata profile details [doris]

Reply via email to