[ https://issues.apache.org/jira/browse/SPARK-47017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17817786#comment-17817786 ]
Eric Yang edited comment on SPARK-47017 at 2/15/24 9:30 PM:
------------------------------------------------------------

Here is a simple example of this issue (based on the example code under the package 'org.apache.spark.examples.sql'): [^simple2.scala]

The listener event logs: [^eventLogs-local-1708032228180.zip]

In L265 of the example code we create a Dataset from an existing RDD, "resultsRDD", which produces a LogicalRDD. The LogicalRDD node is later converted to an RDDScanExec, and its internal RDD contains a filter (age > 20). The SQL metrics of this filter are not shown anywhere, so we have no idea what the internal RDD's execution looks like in this case (imagine that, instead of a simple filter, the RDD contains very complex logic with many physical nodes).

A possible solution is to follow what InMemoryRelation does: it keeps the original physical plan, so we still have a chance to show the DAG and the metric values somewhere.
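The pattern described above can be sketched as follows. This is a minimal, assumed reconstruction, not the actual attached simple2.scala: the `Person` case class, the `local[*]` master, and the surrounding object are placeholders; only `resultsRDD` and the `age > 20` filter come from the comment.

{code:java}
// Sketch only: assumed shapes, not the actual simple2.scala attachment.
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

object ScanExistingRddSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-scan-sketch")
      .master("local[*]")   // assumption: local run, as in the event logs
      .getOrCreate()
    import spark.implicits._

    // First query: its physical plan (including the filter) is reported with
    // SparkListenerSQLExecutionStart, so the metric *definitions* live here.
    val resultsRDD = spark.range(100)
      .map(i => Person(s"p$i", i.toInt))
      .filter(_.age > 20)   // this filter ends up inside the existing RDD
      .rdd

    // Second query: createDataset on an existing RDD produces a LogicalRDD,
    // which is planned as an RDDScanExec. The filter above executes inside
    // RDDScanExec.rdd, but its metrics are not shown in this query's SQL DAG.
    val ds = spark.createDataset(resultsRDD)
    ds.explain()  // shows a "Scan ExistingRDD" node with no internal plan
    ds.count()

    spark.stop()
  }
}
{code}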
> Show metrics of the physical plan of RDDScanExec's internal RDD in the
> history server
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-47017
>                 URL: https://issues.apache.org/jira/browse/SPARK-47017
>             Project: Spark
>          Issue Type: New Feature
>          Components: Web UI
>    Affects Versions: 3.4.0, 3.5.0
>            Reporter: Eric Yang
>            Priority: Major
>         Attachments: ScanExistingRDD.jpg, eventLogs-local-1708032228180.zip,
> simple2.scala
>
> The RDDScanExec wraps an internal RDD (see below). In our environment, we find
> that this RDD is usually produced by very large physical plans which contain
> quite a few physical nodes. Those nodes may have various metrics which are
> very useful for understanding what the execution looks like, where there is
> room for optimization, etc.
>
> {code:java}
> case class RDDScanExec(
>     output: Seq[Attribute],
>     rdd: RDD[InternalRow], // <-- this field
>     name: String, {code}
>
> However, the physical plan and the metrics are invisible in the SQL DAG in
> the Spark History Server. As it is an "existing RDD", the physical plan may
> be found in some previous SQL query, but the metrics are not visible from
> that previous SQL either. This is because the "definition" of these metrics
> is reported along with the SparkListenerSQLExecutionStart event of the
> "previous SQL" (where the physical plan of RDDScanExec.rdd resides), while
> the metric values are reported from the SparkListenerTaskEnd events of the
> tasks attached to the SQL with the RDDScanExec.
>
> !ScanExistingRDD.jpg|width=336,height=296!
>
> Could we consider showing the physical plan and metrics of RDDScanExec.rdd
> (the "Scan Existing RDD" node in the above DAG)? For example, it could be
> shown as a "leg" (similar to, but not the same as, a child) in the DAG, or
> in some other form that exposes the physical plan and metrics.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org