[jira] [Comment Edited] (SPARK-26223) Scan: track metastore operation time

Yuanjian Li (JIRA) Mon, 17 Dec 2018 08:19:32 -0800


    [ 
https://issues.apache.org/jira/browse/SPARK-26223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16723111#comment-16723111
 ]


Yuanjian Li edited comment on SPARK-26223 at 12/17/18 4:10 PM:
---------------------------------------------------------------

The usage of `externalCatalog` in `SessionCatalog` and the interface of 
`ExternalCatalog` are clear clues for this issue. Most methods in 
`ExternalCatalog` are used only by DDL, so the metastore operations relevant 
to Scan are listed below:
 # getTable: called by the analysis rule ResolveRelation's lookupRelation.
 # listPartitions:
1. Called by HiveTableScanExec at execution time, when it fetches the raw 
partitions.
2. Called by the optimizer rule OptimizeMetadataOnlyQuery's 
replaceTableScanWithPartitionMetadata.
3. Called by HiveMetastoreCatalog.convertToLogicalRelation when lazy pruning is 
disabled; the entry point for this scenario is the analysis rule 
RelationConversions of the Hive analyzer.
 # listPartitionsByFilter:
1. Same as 2.1.
2. Same as 2.2.
3. Called by CatalogFileIndex. Currently this metastore operation time is 
counted as part of file listing ([discussion 
link|https://github.com/apache/spark/pull/23327#discussion_r242076144]); it 
will be split out in this PR.
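
As a rough illustration of timing the three calls listed above, here is a minimal, language-neutral sketch written in Python. It is not Spark code: `TimedCatalog`, `MetastoreCall`, `FakeCatalog` and the snake_case method names are hypothetical stand-ins for the `ExternalCatalog` methods, and the actual patch may measure these calls differently.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, TypeVar

T = TypeVar("T")


@dataclass
class MetastoreCall:
    """One timed metastore operation (hypothetical record type)."""
    name: str
    start_ns: int
    end_ns: int

    @property
    def duration_ns(self) -> int:
        return self.end_ns - self.start_ns


@dataclass
class TimedCatalog:
    """Wraps a catalog-like object and times the three scan-related calls."""
    delegate: object
    calls: List[MetastoreCall] = field(default_factory=list)

    def _timed(self, name: str, op: Callable[[], T]) -> T:
        start = time.monotonic_ns()
        try:
            return op()
        finally:
            # Record the call even if the metastore raises.
            self.calls.append(MetastoreCall(name, start, time.monotonic_ns()))

    def get_table(self, db, table):
        return self._timed("getTable", lambda: self.delegate.get_table(db, table))

    def list_partitions(self, db, table):
        return self._timed(
            "listPartitions", lambda: self.delegate.list_partitions(db, table)
        )

    def list_partitions_by_filter(self, db, table, predicate):
        return self._timed(
            "listPartitionsByFilter",
            lambda: self.delegate.list_partitions_by_filter(db, table, predicate),
        )


class FakeCatalog:
    """In-memory stand-in for a metastore, for illustration only."""

    def __init__(self):
        self.partitions = {("db", "t"): [{"dt": "2018-12-16"}, {"dt": "2018-12-17"}]}

    def get_table(self, db, table):
        return {"db": db, "table": table}

    def list_partitions(self, db, table):
        return list(self.partitions[(db, table)])

    def list_partitions_by_filter(self, db, table, predicate):
        return [p for p in self.partitions[(db, table)] if predicate(p)]
```

Wrapping at the catalog boundary means every caller in the list (analysis, optimization, execution) is timed the same way. The sketch leaves open how the recorded calls reach the scan node.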

All of these scenarios can be covered by appending each phase to a newly added 
array buffer in the `CatalogTable` parameter list and dumping the records to 
metrics in the scan node.
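
The buffer-and-dump idea could look roughly like the following Python sketch. Everything in it is an assumption made for illustration: `TableDescriptor` stands in for `CatalogTable`, and the `metastore_phases` field and the metric names are invented here, not taken from Spark.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# (phase name, start ms, end ms); timestamps are supplied by the caller.
Phase = Tuple[str, int, int]


@dataclass
class TableDescriptor:
    """Hypothetical stand-in for CatalogTable carrying an extra phase buffer."""
    name: str
    metastore_phases: List[Phase] = field(default_factory=list)

    def record_phase(self, phase: str, start_ms: int, end_ms: int) -> None:
        if end_ms < start_ms:
            raise ValueError("phase ends before it starts")
        self.metastore_phases.append((phase, start_ms, end_ms))


def scan_metrics(table: TableDescriptor) -> Dict[str, int]:
    """Summarise the recorded phases the way a scan node might expose them."""
    phases = table.metastore_phases
    if not phases:
        return {"metastoreTimeMs": 0}
    return {
        # Sum of individual durations; assumes phases do not overlap.
        "metastoreTimeMs": sum(end - start for _, start, end in phases),
        "metastoreStartMs": min(start for _, start, _ in phases),
        "metastoreEndMs": max(end for _, _, end in phases),
    }
```

Keeping the start and end of each phase, rather than only a running total, is what would let the scan node report a timeline as well as the total time requested in the issue description.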



> Scan: track metastore operation time
> ------------------------------------
>
>                 Key: SPARK-26223
>                 URL: https://issues.apache.org/jira/browse/SPARK-26223
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Reynold Xin
>            Priority: Major
>
> The Scan node should report how much time it spent in metastore operations. 
> Similar to file listing, would be great to also report start and end time for 
> constructing a timeline.
>  



