Quanlong Huang created IMPALA-13122:
---------------------------------------

             Summary: Show file stats in table loading logs
                 Key: IMPALA-13122
                 URL: https://issues.apache.org/jira/browse/IMPALA-13122
             Project: IMPALA
          Issue Type: Improvement
          Components: Catalog
            Reporter: Quanlong Huang


Here is an example of the table loading logs for a table:
{noformat}
I0603 08:46:05.555567 24417 HdfsTable.java:1255] Loading metadata for table definition and all partition(s) of tpcds.store_sales (needed by coordinator)
I0603 08:46:05.642702 24417 HdfsTable.java:1896] Loaded 23 columns from HMS. Actual columns: 23
I0603 08:46:05.767457 24417 HdfsTable.java:3114] Load Valid Write Id List Done. Time taken: 26.699us
I0603 08:46:05.767549 24417 HdfsTable.java:1297] Fetching partition metadata from the Metastore: tpcds.store_sales
I0603 08:46:05.806337 24417 MetaStoreUtil.java:190] Fetching 1824 partitions for: tpcds.store_sales using partition batch size: 1000
I0603 08:46:07.336064 24417 MetaStoreUtil.java:208] Fetched 1000/1824 partitions for table tpcds.store_sales
I0603 08:46:07.915474 24417 MetaStoreUtil.java:208] Fetched 1824/1824 partitions for table tpcds.store_sales
I0603 08:46:07.915519 24417 HdfsTable.java:1304] Fetched partition metadata from the Metastore: tpcds.store_sales
I0603 08:46:08.840034 24417 ParallelFileMetadataLoader.java:224] Loading file and block metadata for 1824 paths for table tpcds.store_sales using a thread pool of size 5
I0603 08:46:09.383904 24417 HdfsTable.java:836] Loaded file and block metadata for tpcds.store_sales partitions: ss_sold_date_sk=2450816, ss_sold_date_sk=2450817, ss_sold_date_sk=2450818, and 1821 others. Time taken: 569.107ms
I0603 08:46:09.420702 24417 Table.java:1117] last refreshed event id for table: tpcds.store_sales set to: -1
I0603 08:46:09.420794 24417 TableLoader.java:177] Loaded metadata for: tpcds.store_sales (4026ms){noformat}
From the logs, we know the table has 23 columns and 1824 partitions. The time spent loading the table schema and the file metadata is also shown.

However, the logs do not show whether the partitions suffer from a small-files issue. The underlying storage could also be slow (e.g. S3), which makes loading file metadata take a long time. It'd be helpful to add these in the logs:
 * number of files loaded
 * min/avg/max of file sizes
 * total file size
 * number of files
 * number of blocks (HDFS only)
 * number of hosts, disks (HDFS/Ozone only)
 * Stats of accessTime and lastModifiedTime

These can be aggregated in FileMetadataLoader#loadInternal() and logged in ParallelFileMetadataLoader#load() or HdfsTable#loadFileMetadataForPartitions().
[https://github.com/apache/impala/blob/9011b81afa33ef7e4b0ec8a367b2713be8917213/fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java#L177]
[https://github.com/apache/impala/blob/9011b81afa33ef7e4b0ec8a367b2713be8917213/fe/src/main/java/org/apache/impala/catalog/ParallelFileMetadataLoader.java#L172]
[https://github.com/apache/impala/blob/ee21427d26620b40d38c706b4944d2831f84f6f5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java#L836]
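
As a rough illustration of the aggregation proposed above, the sketch below accumulates file count, total/min/avg/max size, block count, and the modification-time range for one loader, and merges per-loader results into a single summary that could be logged. The class FileStatsSketch and all of its method names are hypothetical and are not part of Impala; how it would be wired into FileMetadataLoader or ParallelFileMetadataLoader is left open. Access time, hosts, and disks could presumably be tracked the same way.

```java
import java.util.Locale;
import java.util.LongSummaryStatistics;

/** Hypothetical accumulator for per-table file stats; not an Impala class. */
class FileStatsSketch {
  // LongSummaryStatistics tracks count, sum, min, max and average in one pass.
  private final LongSummaryStatistics sizes = new LongSummaryStatistics();
  private final LongSummaryStatistics modTimes = new LongSummaryStatistics();
  private long blocks = 0;

  /** Records one file (assumed inputs: size in bytes, block count, mtime in ms). */
  void add(long sizeBytes, int numBlocks, long modificationTimeMs) {
    sizes.accept(sizeBytes);
    modTimes.accept(modificationTimeMs);
    blocks += numBlocks;
  }

  /** Folds the stats of another loader (e.g. another partition) into this one. */
  void merge(FileStatsSketch other) {
    sizes.combine(other.sizes);
    modTimes.combine(other.modTimes);
    blocks += other.blocks;
  }

  long fileCount() { return sizes.getCount(); }
  long totalBytes() { return sizes.getSum(); }
  long blockCount() { return blocks; }

  /** One-line summary suitable for appending to a log message. */
  @Override
  public String toString() {
    if (sizes.getCount() == 0) return "files=0";
    return String.format(Locale.ROOT,
        "files=%d, totalBytes=%d, minBytes=%d, avgBytes=%.1f, maxBytes=%d, "
            + "blocks=%d, modTimeMs=[%d, %d]",
        sizes.getCount(), sizes.getSum(), sizes.getMin(), sizes.getAverage(),
        sizes.getMax(), blocks, modTimes.getMin(), modTimes.getMax());
  }
}
```

One possible design, under these assumptions, is to keep one accumulator per partition loader so that no synchronization is needed while files are listed, and to call merge() on the results once the thread pool has finished, just before the summary line is logged.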