[ 
https://issues.apache.org/jira/browse/SPARK-26723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Matveev updated SPARK-26723:
-------------------------------------
    Attachment: Screen Shot 2019-01-24 at 4.13.14 PM.png
                Screen Shot 2019-01-24 at 4.13.02 PM.png

> Spark web UI only shows parts of SQL query graphs for queries with persist 
> operations
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-26723
>                 URL: https://issues.apache.org/jira/browse/SPARK-26723
>             Project: Spark
>          Issue Type: Bug
>          Components: Web UI
>    Affects Versions: 2.3.2
>            Reporter: Vladimir Matveev
>            Priority: Major
>         Attachments: Screen Shot 2019-01-24 at 4.13.02 PM.png, Screen Shot 
> 2019-01-24 at 4.13.14 PM.png
>
>
> Currently the SQL view in the Spark UI appears to truncate the graph at 
> the nodes corresponding to persist operations on the dataframe, only showing 
> everything after "LocalTableScan". This is *very* inconvenient: in the 
> common case where you have a heavy computation and want to persist it before 
> writing to multiple outputs with some minor preprocessing, you lose almost 
> the entire graph, along with potentially very useful information in it.
> The query plans below the graph, however, show the full query, including all 
> computations before persists. Unfortunately, reading the textual plan is 
> impractical for complex queries, which is exactly where graph visualization 
> becomes a very helpful tool; with persist, it is apparently broken.
> You can verify it in Spark Shell with a very simple example:
> {code}
> import org.apache.spark.sql.{functions => f}
> import org.apache.spark.sql.expressions.Window
> val query = Vector(1, 2, 3).toDF()
>   .select(($"value".cast("long") * f.rand).as("value"))
>   .withColumn("valueAvg", f.avg($"value") over Window.orderBy("value"))
> query.show()
> query.persist().show()
> {code}
> Here the same query is executed first without persist, and then with it. If 
> you now navigate to the Spark web UI SQL page, you'll see two queries, but 
> their graphs will be radically different: the one without persist will 
> contain the whole transformation with exchange, sort and window steps, while 
> the one with persist will contain only a LocalTableScan step with some 
> intermediate transformations needed for `show`.
> After looking into the Spark code, I think the reason is that the 
> `org.apache.spark.sql.execution.SparkPlanInfo#fromSparkPlan` method 
> (which is used to serialize a plan before emitting the 
> SparkListenerSQLExecutionStart event) constructs the `SparkPlanInfo` object 
> from a `SparkPlan` object incorrectly. If you invoke the `toString` method 
> on `SparkPlan` you'll see the entire plan, but the `SparkPlanInfo` object 
> will only contain nodes corresponding to actions after `persist`. 
> However, my knowledge of Spark internals is not deep enough to understand how 
> to fix this, or how `SparkPlanInfo.fromSparkPlan` differs from what 
> `SparkPlan.toString` does.
> This can be observed on Spark 2.3.2, but given that the 2.4.0 code of 
> SparkPlanInfo does not seem to have changed much since 2.3.2, I'd expect 
> that it can be reproduced there too.
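
The discrepancy described above (a textual plan that shows everything, and a serialized graph that stops at the persist boundary) can be illustrated with a toy model. This is only a sketch of one way such a mismatch could arise, namely a conversion that follows `children` while part of the plan is held in a separate field. Whether Spark's `fromSparkPlan` actually behaves this way is an assumption that the report does not confirm. `PlanNode`, `PlanInfo`, `hidden`, `CachedScan`, `Limit` and all function names are hypothetical and are not Spark classes or APIs.

```scala
// Toy model only: none of these types or functions exist in Spark.
object PlanGraphSketch {
  // A plan node. `hidden` models a sub-plan that is referenced from a field
  // instead of being listed in `children` (hypothetical).
  final case class PlanNode(
      name: String,
      children: Seq[PlanNode] = Nil,
      hidden: Option[PlanNode] = None)

  // The simplified tree that a UI would draw.
  final case class PlanInfo(name: String, children: Seq[PlanInfo])

  // Text rendering that descends into both `children` and `hidden`.
  def render(p: PlanNode, indent: Int = 0): String = {
    val self = (" " * indent) + p.name
    val below = (p.children ++ p.hidden.toSeq).map(c => render(c, indent + 2))
    (self +: below).mkString("\n")
  }

  // Conversion that follows `children` only, so `hidden` sub-plans are lost.
  def toInfoChildrenOnly(p: PlanNode): PlanInfo =
    PlanInfo(p.name, p.children.map(toInfoChildrenOnly))

  // Conversion that also follows `hidden`, keeping the whole tree.
  def toInfoWithHidden(p: PlanNode): PlanInfo =
    PlanInfo(p.name, (p.children ++ p.hidden.toSeq).map(toInfoWithHidden))

  // Node names in pre-order, for easy comparison.
  def names(i: PlanInfo): Seq[String] = i.name +: i.children.flatMap(names)

  // The heavy computation from the example: window over sort over exchange.
  val heavy: PlanNode =
    PlanNode("Window", Seq(
      PlanNode("Sort", Seq(
        PlanNode("Exchange", Seq(PlanNode("LocalTableScan")))))))

  // A persisted plan: the scan node keeps the heavy plan in `hidden`.
  val plan: PlanNode =
    PlanNode("Limit", Seq(PlanNode("CachedScan", hidden = Some(heavy))))
}
```

In this model `render(plan)` lists all six nodes, `names(toInfoChildrenOnly(plan))` contains only `Limit` and `CachedScan`, and `names(toInfoWithHidden(plan))` contains the full chain down to `LocalTableScan`.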



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
