[ https://issues.apache.org/jira/browse/SPARK-26222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719110#comment-16719110 ]
Yuanjian Li commented on SPARK-26222:
-------------------------------------

Yes, I think I misunderstood your original intention. I'll submit a PR dealing with the non-physical phase first. Currently we do file listing in the following scenarios:

Physical phase:
* Operations that create/refresh/insert into/drop a table; file listing is triggered by InMemoryFileIndex.refresh.
* DataFrameWriter.runCommand; file listing is triggered by DataSource.resolveRelation. (Also included in the first PR.)
* DataSourceScanExec execution.

Non-physical phase:
* DataFrameReader.load; file listing is triggered by DataSource.resolveRelation.
* Analyzer rules such as FindDataSourceTable and RelationConversions in the Hive analyzer.
* Optimizer rules such as PruneFileSourcePartitions and OptimizeMetadataOnlyQuery.

All the heavy file listing against the FileSystem, without the cache, is done in InMemoryFileIndex.refresh0, so we can instrument this base function and avoid counting cache hits.

> Scan: track file listing time
> -----------------------------
>
> Key: SPARK-26222
> URL: https://issues.apache.org/jira/browse/SPARK-26222
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Reynold Xin
> Priority: Major
>
> We should track file listing time and add it to the scan node's SQL metric, so we have visibility into how much time is spent in file listing. It'd be useful to track not just duration, but also start and end time, so we can construct a timeline.
> This requires a little bit of design to define what file listing time means when we are reading from the cache vs. not.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
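The instrumentation idea discussed above (capturing start and end timestamps around the base listing call so a scan-node metric and a timeline can be built) can be sketched in plain Scala. This is a hedged illustration only: `ListingMetric` and `timed` are hypothetical stand-ins, not Spark's actual `SQLMetric` API or the real `InMemoryFileIndex.refresh0` signature.

```scala
// Sketch of timing a file-listing call, recording start time, end time, and
// duration, as the issue proposes. Names here are hypothetical, not Spark APIs.
object ListingMetricsSketch {
  // Hypothetical metric holder; Spark would report through SQLMetric instead.
  final case class ListingMetric(startMs: Long, endMs: Long) {
    def durationMs: Long = endMs - startMs
  }

  // Wrap any listing computation and capture its start/end timestamps.
  // Instrumenting at one base function (analogous to refresh0) means cache
  // hits that never reach the FileSystem are not counted.
  def timed[T](listFiles: => T): (T, ListingMetric) = {
    val start = System.currentTimeMillis()
    val result = listFiles // the heavy FileSystem listing happens here
    val end = System.currentTimeMillis()
    (result, ListingMetric(start, end))
  }

  def main(args: Array[String]): Unit = {
    // Stand-in for a heavy FileSystem listing.
    val (files, metric) = timed {
      Thread.sleep(10)
      Seq("part-00000", "part-00001")
    }
    println(s"listed ${files.size} files in ${metric.durationMs} ms")
  }
}
```

Because both timestamps are kept rather than only the duration, multiple listing events can later be laid out on a timeline.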