Quanlong Huang created IMPALA-13122:
---------------------------------------

             Summary: Show file stats in table loading logs
                 Key: IMPALA-13122
                 URL: https://issues.apache.org/jira/browse/IMPALA-13122
             Project: IMPALA
          Issue Type: Improvement
          Components: Catalog
            Reporter: Quanlong Huang


Here is an example for table loading logs on a table:
{noformat}
I0603 08:46:05.555567 24417 HdfsTable.java:1255] Loading metadata for table definition and all partition(s) of tpcds.store_sales (needed by coordinator)
I0603 08:46:05.642702 24417 HdfsTable.java:1896] Loaded 23 columns from HMS. Actual columns: 23
I0603 08:46:05.767457 24417 HdfsTable.java:3114] Load Valid Write Id List Done. Time taken: 26.699us
I0603 08:46:05.767549 24417 HdfsTable.java:1297] Fetching partition metadata from the Metastore: tpcds.store_sales
I0603 08:46:05.806337 24417 MetaStoreUtil.java:190] Fetching 1824 partitions for: tpcds.store_sales using partition batch size: 1000
I0603 08:46:07.336064 24417 MetaStoreUtil.java:208] Fetched 1000/1824 partitions for table tpcds.store_sales
I0603 08:46:07.915474 24417 MetaStoreUtil.java:208] Fetched 1824/1824 partitions for table tpcds.store_sales
I0603 08:46:07.915519 24417 HdfsTable.java:1304] Fetched partition metadata from the Metastore: tpcds.store_sales
I0603 08:46:08.840034 24417 ParallelFileMetadataLoader.java:224] Loading file and block metadata for 1824 paths for table tpcds.store_sales using a thread pool of size 5
I0603 08:46:09.383904 24417 HdfsTable.java:836] Loaded file and block metadata for tpcds.store_sales partitions: ss_sold_date_sk=2450816, ss_sold_date_sk=2450817, ss_sold_date_sk=2450818, and 1821 others. Time taken: 569.107ms
I0603 08:46:09.420702 24417 Table.java:1117] last refreshed event id for table: tpcds.store_sales set to: -1
I0603 08:46:09.420794 24417 TableLoader.java:177] Loaded metadata for: tpcds.store_sales (4026ms)
{noformat}
From the logs, we know the table has 23 columns and 1824 partitions. The time 
spent loading the table schema and file metadata is also shown.

However, the logs don't show whether the partitions have a small-files issue. 
The underlying storage could also be slow (e.g. S3), which results in a long 
time spent loading file metadata.

It'd be helpful to add these in the logs:
 * number of files loaded
 * min/avg/max of file sizes
 * total file size
 * number of files
 * number of blocks (HDFS only)
 * number of hosts, disks (HDFS/Ozone only)
 * Stats of accessTime and lastModifiedTime

These can be aggregated in FileMetadataLoader#loadInternal() and logged in 
ParallelFileMetadataLoader#load() or HdfsTable#loadFileMetadataForPartitions().

[https://github.com/apache/impala/blob/9011b81afa33ef7e4b0ec8a367b2713be8917213/fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java#L177]

[https://github.com/apache/impala/blob/9011b81afa33ef7e4b0ec8a367b2713be8917213/fe/src/main/java/org/apache/impala/catalog/ParallelFileMetadataLoader.java#L172]

[https://github.com/apache/impala/blob/ee21427d26620b40d38c706b4944d2831f84f6f5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java#L836]
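
As a rough illustration of the aggregation proposed above, here is a minimal sketch of an accumulator that could collect these stats per loader and be merged before logging. The class FileStatsSketch and all of its methods are hypothetical and not part of Impala. It assumes the file length and modification time are taken from each listed file (e.g. Hadoop's FileStatus#getLen() and FileStatus#getModificationTime()), and it leaves out accessTime, hosts and disks for brevity.

```java
import java.util.Locale;
import java.util.LongSummaryStatistics;

/** Hypothetical file stats accumulator; not an existing Impala class. */
final class FileStatsSketch {
  private final LongSummaryStatistics sizes = new LongSummaryStatistics();
  private final LongSummaryStatistics modTimes = new LongSummaryStatistics();
  private long numBlocks = 0;

  /** Records one file. The caller supplies length, mtime and block count. */
  void addFile(long lengthBytes, long modificationTimeMs, int blockCount) {
    sizes.accept(lengthBytes);
    modTimes.accept(modificationTimeMs);
    numBlocks += blockCount;
  }

  /** Folds another accumulator (e.g. one per partition) into this one. */
  void merge(FileStatsSketch other) {
    sizes.combine(other.sizes);
    modTimes.combine(other.modTimes);
    numBlocks += other.numBlocks;
  }

  long numFiles() { return sizes.getCount(); }
  long totalBytes() { return sizes.getSum(); }
  long minBytes() { return sizes.getMin(); }
  long maxBytes() { return sizes.getMax(); }
  double avgBytes() { return sizes.getAverage(); }
  long numBlocks() { return numBlocks; }

  /** Formats the stats as a single log-friendly string. */
  String toLogString() {
    // Min/max are meaningless for an empty accumulator, so short-circuit.
    if (numFiles() == 0) return "files=0";
    return String.format(Locale.ROOT,
        "files=%d, totalBytes=%d, min/avg/max bytes=%d/%.1f/%d, blocks=%d, "
            + "lastModified min/max=%d/%d",
        numFiles(), totalBytes(), minBytes(), avgBytes(), maxBytes(),
        numBlocks, modTimes.getMin(), modTimes.getMax());
  }

  public static void main(String[] args) {
    // Two per-partition accumulators merged into a table-level one.
    FileStatsSketch tableStats = new FileStatsSketch();
    FileStatsSketch p1 = new FileStatsSketch();
    p1.addFile(100L, 1000L, 1);
    p1.addFile(300L, 2000L, 2);
    FileStatsSketch p2 = new FileStatsSketch();
    p2.addFile(200L, 1500L, 1);
    tableStats.merge(p1);
    tableStats.merge(p2);
    System.out.println(tableStats.toLogString());
  }
}
```

Keeping one accumulator per loader and merging afterwards avoids shared mutable state across the loading thread pool. Whether the merged string is logged in ParallelFileMetadataLoader#load() or HdfsTable#loadFileMetadataForPartitions() is left open here, as in the description above.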



