[ https://issues.apache.org/jira/browse/SPARK-47017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17817786#comment-17817786 ]
Eric Yang edited comment on SPARK-47017 at 2/15/24 9:30 PM:
------------------------------------------------------------

Here is a simple example of this issue (based on the example code under the package 'org.apache.spark.examples.sql'): [^simple2.scala]

The listener event logs: [^eventLogs-local-1708032228180.zip]

In L265 of the example code we create a Dataset from an existing RDD, "resultsRDD", which produces a LogicalRDD. The LogicalRDD node is later converted to an RDDScanExec, and its internal RDD contains a filter (age > 20). The SQL metrics of this filter are not shown anywhere, so we have no idea what the internal RDD's execution looks like in this case (imagine that, instead of a simple filter, the RDD contains very complex logic with many physical nodes).

A possible solution is to follow what InMemoryRelation does: it keeps the original physical plan, so we still have a chance to show the DAG and the metric values somewhere.
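The pattern described above can be sketched as follows. This is a minimal, assumed reconstruction, not the actual attached simple2.scala: the `Person` case class, the `local[*]` master, and the surrounding object are placeholders; only `resultsRDD` and the `age > 20` filter come from the comment.

{code:java}
// Sketch only: assumed shapes, not the actual simple2.scala attachment.
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

object ScanExistingRddSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-scan-sketch")
      .master("local[*]")   // assumption: local run, as in the event logs
      .getOrCreate()
    import spark.implicits._

    // First query: its physical plan (including the filter) is reported with
    // SparkListenerSQLExecutionStart, so the metric *definitions* live here.
    val resultsRDD = spark.range(100)
      .map(i => Person(s"p$i", i.toInt))
      .filter(_.age > 20)   // this filter ends up inside the existing RDD
      .rdd

    // Second query: createDataset on an existing RDD produces a LogicalRDD,
    // which is planned as an RDDScanExec. The filter above executes inside
    // RDDScanExec.rdd, but its metrics are not shown in this query's SQL DAG.
    val ds = spark.createDataset(resultsRDD)
    ds.explain()  // shows a "Scan ExistingRDD" node with no internal plan
    ds.count()

    spark.stop()
  }
}
{code}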
> Show metrics of the physical plan of RDDScanExec's internal RDD in the
> history server
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-47017
>                 URL: https://issues.apache.org/jira/browse/SPARK-47017
>             Project: Spark
>          Issue Type: New Feature
>          Components: Web UI
>    Affects Versions: 3.4.0, 3.5.0
>            Reporter: Eric Yang
>            Priority: Major
>         Attachments: ScanExistingRDD.jpg, eventLogs-local-1708032228180.zip,
> simple2.scala
>
> The RDDScanExec wraps an internal RDD (see below). In our environment, we find
> that this RDD is usually produced by very large physical plans which contain
> quite a few physical nodes. Those nodes may have various metrics which are
> very useful for understanding what the execution looks like, where there is
> room for optimization, etc.
>
> {code:java}
> case class RDDScanExec(
>     output: Seq[Attribute],
>     rdd: RDD[InternalRow], // <-- this field
>     name: String, {code}
>
> However, the physical plan and the metrics are invisible in the SQL DAG in
> the Spark History Server. As it is an "existing RDD", the physical plan may
> be found in some previous SQL query, but the metrics are not visible from
> that previous SQL either. This is because the "definition" of these metrics
> is reported along with the SparkListenerSQLExecutionStart event of the
> "previous SQL" (where the physical plan of RDDScanExec.rdd resides), while
> the metric values are reported from the SparkListenerTaskEnd events of the
> tasks attached to the SQL with the RDDScanExec.
>
> !ScanExistingRDD.jpg|width=336,height=296!
>
> Could we consider showing the physical plan and metrics of RDDScanExec.rdd
> (the "Scan Existing RDD" node in the above DAG)? For example, it could be
> shown as a "leg" (similar to, but not the same as, a child) in the DAG, or
> in some other form that exposes the physical plan and metrics.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org