[ https://issues.apache.org/jira/browse/SPARK-26222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719110#comment-16719110 ]
Yuanjian Li commented on SPARK-26222:
-------------------------------------

Yes, I think I misunderstood your original intention. I'll submit a PR dealing with the non-physical phase first. Currently we do file listing in the following scenarios:

Physical phase:
* Operations that create/refresh/insert into/drop a table; file listing is triggered by InMemoryFileIndex.refresh.
* DataFrameWriter.runCommand; file listing is triggered by DataSource.resolveRelation. (Also included in the first PR.)
* DataSourceScanExec execution.

Non-physical phase:
* DataFrameReader.load; file listing is triggered by DataSource.resolveRelation.
* Analyzer rules such as FindDataSourceTable and RelationConversions in the Hive analyzer.
* Optimizer rules such as PruneFileSourcePartitions and OptimizeMetadataOnlyQuery.

All the heavy file listing against the FileSystem, without the cache, is done in InMemoryFileIndex.refresh0, so we can instrument this base function and avoid counting cache hits.

> Scan: track file listing time
> -----------------------------
>
> Key: SPARK-26222
> URL: https://issues.apache.org/jira/browse/SPARK-26222
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Reynold Xin
> Priority: Major
>
> We should track file listing time and add it to the scan node's SQL metric, so we have visibility into how much time is spent in file listing. It'd be useful to track not just duration, but also start and end time, so we can construct a timeline.
> This requires a little bit of design to define what file listing time means when we are reading from the cache vs. not.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
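The instrumentation idea discussed above (capturing start and end timestamps around the base listing call so a scan-node metric and a timeline can be built) can be sketched in plain Scala. This is a hedged illustration only: `ListingMetric` and `timed` are hypothetical stand-ins, not Spark's actual `SQLMetric` API or the real `InMemoryFileIndex.refresh0` signature.

```scala
// Sketch of timing a file-listing call, recording start time, end time, and
// duration, as the issue proposes. Names here are hypothetical, not Spark APIs.
object ListingMetricsSketch {
  // Hypothetical metric holder; Spark would report through SQLMetric instead.
  final case class ListingMetric(startMs: Long, endMs: Long) {
    def durationMs: Long = endMs - startMs
  }

  // Wrap any listing computation and capture its start/end timestamps.
  // Instrumenting at one base function (analogous to refresh0) means cache
  // hits that never reach the FileSystem are not counted.
  def timed[T](listFiles: => T): (T, ListingMetric) = {
    val start = System.currentTimeMillis()
    val result = listFiles // the heavy FileSystem listing happens here
    val end = System.currentTimeMillis()
    (result, ListingMetric(start, end))
  }

  def main(args: Array[String]): Unit = {
    // Stand-in for a heavy FileSystem listing.
    val (files, metric) = timed {
      Thread.sleep(10)
      Seq("part-00000", "part-00001")
    }
    println(s"listed ${files.size} files in ${metric.durationMs} ms")
  }
}
```

Because both timestamps are kept rather than only the duration, multiple listing events can later be laid out on a timeline.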