[ https://issues.apache.org/jira/browse/SPARK-26222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16720957#comment-16720957 ]
Reynold Xin commented on SPARK-26222:
-------------------------------------

So I spent some time looking at the code base to understand what's going on and how we should report this.

In short, I think we have two types of tables: (1) tables that require a full file listing in order to resolve the schema (including partition columns), and (2) tables that don't. This means there are three scenarios to think about:

(1) spark.read.parquet("/path/to/table").count() -> an InMemoryFileIndex containing all of the leaf files is created.

(2a) spark.read.table("abcd").count(), when partitions are not tracked in the catalog -> basically the same as (1).

(2b) spark.read.table("abcd").count(), when partitions are tracked in the catalog -> a CatalogFileIndex is created. We should measure the listing time in CatalogFileIndex.filterPartitions.

Also, instead of tracking these as phases, I'd associate the timing with the scan operator in SQL metrics, and I'd report the start and end time rather than just a single duration.

> Scan: track file listing time
> -----------------------------
>
>                 Key: SPARK-26222
>                 URL: https://issues.apache.org/jira/browse/SPARK-26222
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Reynold Xin
>            Priority: Major
>
> We should track file listing time and add it to the scan node's SQL metric,
> so we have visibility into how much time is spent in file listing. It'd be
> useful to track not just the duration, but also the start and end time, so
> we can construct a timeline.
> This requires a little bit of design to define what file listing time means
> when we are reading from the cache vs. not from the cache.
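To make the three listing scenarios in the comment above concrete, here is a minimal sketch of the reads being discussed. The path "/path/to/table" and the table name "abcd" are the placeholders from the comment; whether (2a) or (2b) applies depends on whether the table's partitions are tracked in the catalog (controlled by spark.sql.hive.manageFilesourcePartitions).

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch of the three scenarios from the comment above. "/path/to/table"
// and "abcd" are placeholders taken from the comment, not real datasets.
object ListingScenarios {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("file-listing-scenarios")
      .enableHiveSupport() // needed so partitions can be tracked in the catalog (2b)
      .getOrCreate()

    // (1) Path-based read: an InMemoryFileIndex listing all leaf files is
    //     built up front so the schema (including partition columns) can be
    //     resolved.
    spark.read.parquet("/path/to/table").count()

    // (2a)/(2b) Catalog table read: with partition tracking disabled this
    //     behaves like (1); with tracking enabled a CatalogFileIndex is used
    //     and listing happens in CatalogFileIndex.filterPartitions.
    spark.read.table("abcd").count()

    spark.stop()
  }
}
{code}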
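And here is a hypothetical sketch of what reporting listing start and end time as scan-operator SQL metrics could look like. The metric names and the timeListing helper are illustrative assumptions, not existing Spark code; SQLMetric/SQLMetrics are internal APIs in org.apache.spark.sql.execution.metric.

{code:scala}
import org.apache.spark.SparkContext
import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}

// Hypothetical sketch only: metric names and the timeListing helper are
// illustrative, not part of Spark. The idea is to attach start/end
// timestamps (rather than a single duration) to the scan operator's
// SQL metrics so a timeline can be reconstructed.
object FileListingMetrics {

  def createListingMetrics(sc: SparkContext): Map[String, SQLMetric] = Map(
    "fileListingStart" -> SQLMetrics.createMetric(sc, "file listing start (epoch ms)"),
    "fileListingEnd"   -> SQLMetrics.createMetric(sc, "file listing end (epoch ms)")
  )

  // Wrap a listing call, e.g. CatalogFileIndex.filterPartitions for (2b) or
  // the InMemoryFileIndex construction for (1)/(2a), and record when it
  // started and finished.
  def timeListing[T](metrics: Map[String, SQLMetric])(listFiles: => T): T = {
    val start = System.currentTimeMillis()
    try {
      listFiles
    } finally {
      metrics("fileListingStart").set(start)
      metrics("fileListingEnd").set(System.currentTimeMillis())
    }
  }
}
{code}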