morningman commented on code in PR #25175:
URL: https://github.com/apache/doris/pull/25175#discussion_r1352793796


##########
fe/fe-core/src/main/java/org/apache/doris/statistics/util/StatisticsUtil.java:
##########
@@ -574,31 +574,71 @@ public static long getRowCountFromFileList(HMSExternalTable table) {
         if (table.isView()) {
             return 0;
         }
+        HiveMetaStoreCache.HivePartitionValues partitionValues = getPartitionValuesForTable(table);
+        int totalPartitionSize = partitionValues == null ? 1 : partitionValues.getIdToPartitionItem().size();
+
+        // Get files for all partitions.
+        int samplePartitionSize = Config.hive_stats_partition_sample_size;
+        List<HiveMetaStoreCache.FileCacheValue> filesByPartitions
+                = getFilesForPartitions(table, partitionValues, samplePartitionSize);
+        long totalSize = 0;
+        // Calculate the total file size.
+        for (HiveMetaStoreCache.FileCacheValue files : filesByPartitions) {
+            for (HiveMetaStoreCache.HiveFileStatus file : files.getFiles()) {
+                totalSize += file.getLength();
+            }
+        }
+        // Estimate row count: totalSize/estimatedRowSize
+        long estimatedRowSize = 0;
+        for (Column column : table.getFullSchema()) {
+            estimatedRowSize += column.getDataType().getSlotSize();
+        }
+        if (estimatedRowSize == 0) {
+            return 1;
+        }
+        if (samplePartitionSize < totalPartitionSize) {
+            totalSize = totalSize * totalPartitionSize / samplePartitionSize;
+        }
+        return totalSize / estimatedRowSize;
+    }
+
+    public static HiveMetaStoreCache.HivePartitionValues getPartitionValuesForTable(HMSExternalTable table) {
+        if (table.isView()) {
+            return null;
+        }
         HiveMetaStoreCache cache = Env.getCurrentEnv().getExtMetaCacheMgr()
                 .getMetaStoreCache((HMSExternalCatalog) table.getCatalog());
         List<Type> partitionColumnTypes = table.getPartitionColumnTypes();
         HiveMetaStoreCache.HivePartitionValues partitionValues = null;
-        List<HivePartition> hivePartitions = Lists.newArrayList();
-        int samplePartitionSize = Config.hive_stats_partition_sample_size;
-        int totalPartitionSize = 1;
         // Get table partitions from cache.
         if (!partitionColumnTypes.isEmpty()) {
             // It is ok to get partition values from cache,
             // no need to worry that this call will invalid or refresh the cache.
             // because it has enough space to keep partition info of all tables in cache.
             partitionValues = cache.getPartitionValues(table.getDbName(), table.getName(), partitionColumnTypes);
         }
+        return partitionValues;
+    }
+
+    public static List<HiveMetaStoreCache.FileCacheValue> getFilesForPartitions(
+            HMSExternalTable table, HiveMetaStoreCache.HivePartitionValues partitionValues, int sampleSize) {
+        if (table.isView()) {
+            return null;

Review Comment:
   Better not to return null here. Callers such as getChunkSizes() iterate over the result without a null check.
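
   To illustrate the suggestion, here is a minimal sketch of the empty-list alternative. It is not the Doris code: `NullSafeFileListing`, `PartitionFiles`, `filesForPartitions` and `totalSize` are hypothetical stand-ins for the real classes and methods, kept small enough to compile on their own. The point is only that a caller can loop over the result without a null check when the view case returns an empty list.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT the Doris classes.
final class NullSafeFileListing {

    /** Hypothetical stand-in for a per-partition file listing. */
    static final class PartitionFiles {
        final List<Long> fileLengths;

        PartitionFiles(List<Long> fileLengths) {
            this.fileLengths = fileLengths;
        }
    }

    /**
     * Returns the file listings for a table.
     * For a view there are no files, so an empty list is returned instead of null.
     */
    static List<PartitionFiles> filesForPartitions(boolean isView, List<PartitionFiles> cached) {
        if (isView) {
            return Collections.emptyList();
        }
        return cached;
    }

    /** Sums file lengths; safe for views because the list is never null. */
    static long totalSize(List<PartitionFiles> filesByPartitions) {
        long total = 0;
        for (PartitionFiles files : filesByPartitions) {
            for (long length : files.fileLengths) {
                total += length;
            }
        }
        return total;
    }
}
```

   With this shape, `totalSize(filesForPartitions(true, null))` evaluates to 0 instead of throwing a NullPointerException.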



##########
fe/fe-core/src/main/java/org/apache/doris/catalog/external/HMSExternalTable.java:
##########
@@ -635,6 +636,30 @@ public void gsonPostProcess() throws IOException {
         super.gsonPostProcess();
         estimatedRowCount = -1;
     }
+
+    @Override
+    public List<Long> getChunkSizes() {
+        HiveMetaStoreCache.HivePartitionValues partitionValues = StatisticsUtil.getPartitionValuesForTable(this);
+        List<HiveMetaStoreCache.FileCacheValue> filesByPartitions
+                = StatisticsUtil.getFilesForPartitions(this, partitionValues, 0);
+        List<Long> result = Lists.newArrayList();
+        for (HiveMetaStoreCache.FileCacheValue files : filesByPartitions) {
+            for (HiveMetaStoreCache.HiveFileStatus file : files.getFiles()) {
+                result.add(file.getLength());
+            }
+        }
+        return result;
+    }
+
+    @Override
+    public long getDataSize(boolean singleReplica) {
+        List<Long> chunkSizes = getChunkSizes();

Review Comment:
   This looks like a very heavy operation: it needs to fetch all partitions and all files of a table just to get the table's total size.
   In most cases, can't the total size be read from the Hive table's properties instead?
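
   A rough sketch of what that could look like, under stated assumptions. It assumes the table parameters are available as a plain `Map<String, String>` and that the size is stored under the Hive basic-statistics key `totalSize`. `HiveTableSizeHint`, `sizeFromParameters` and `dataSize` are hypothetical names, and the expensive file scan is passed in as a callback so that it only runs when the property is missing or unusable. Treating a non-positive value as "missing" is also an assumption, not something the PR specifies.

```java
import java.util.Map;
import java.util.OptionalLong;
import java.util.function.LongSupplier;

// Hypothetical helper for illustration only; not part of Doris.
final class HiveTableSizeHint {

    // Hive basic-statistics key for the table's on-disk size in bytes.
    static final String TOTAL_SIZE_KEY = "totalSize";

    /** Reads the size from table parameters; empty if absent, malformed, or non-positive. */
    static OptionalLong sizeFromParameters(Map<String, String> parameters) {
        if (parameters == null) {
            return OptionalLong.empty();
        }
        String raw = parameters.get(TOTAL_SIZE_KEY);
        if (raw == null) {
            return OptionalLong.empty();
        }
        try {
            long size = Long.parseLong(raw.trim());
            // Assumption: a non-positive value means statistics were never collected.
            return size > 0 ? OptionalLong.of(size) : OptionalLong.empty();
        } catch (NumberFormatException e) {
            return OptionalLong.empty();
        }
    }

    /** Prefers the cheap property lookup and falls back to the expensive scan. */
    static long dataSize(Map<String, String> parameters, LongSupplier expensiveFileScan) {
        OptionalLong hint = sizeFromParameters(parameters);
        return hint.isPresent() ? hint.getAsLong() : expensiveFileScan.getAsLong();
    }
}
```

   The fallback keeps the current behavior for tables whose statistics were never written, while tables with a usable property skip the partition and file listing entirely.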



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

