[ https://issues.apache.org/jira/browse/SPARK-26723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vladimir Matveev updated SPARK-26723:
-------------------------------------
    Attachment: Screen Shot 2019-01-24 at 4.13.14 PM.png
                Screen Shot 2019-01-24 at 4.13.02 PM.png

> Spark web UI only shows parts of SQL query graphs for queries with persist operations
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-26723
>                 URL: https://issues.apache.org/jira/browse/SPARK-26723
>             Project: Spark
>          Issue Type: Bug
>          Components: Web UI
>    Affects Versions: 2.3.2
>            Reporter: Vladimir Matveev
>            Priority: Major
>         Attachments: Screen Shot 2019-01-24 at 4.13.02 PM.png, Screen Shot 2019-01-24 at 4.13.14 PM.png
>
>
> Currently it looks like the SQL view in the Spark web UI truncates the graph at the nodes corresponding to persist operations on the dataframe, showing only what comes after the "LocalTableScan". This is *very* inconvenient, because in the common case where you have a heavy computation and want to persist its result before writing it to multiple outputs with some minor preprocessing, you lose almost the entire graph, along with the potentially very useful information in it.
> The query plans below the graph, however, show the full query, including all computations before the persist calls. Unfortunately, for complex queries reading the textual plan is infeasible, so the graph visualization becomes a very helpful tool; with persist, it is apparently broken.
> You can verify this in the Spark shell with a very simple example:
> {code}
> import org.apache.spark.sql.{functions => f}
> import org.apache.spark.sql.expressions.Window
> val query = Vector(1, 2, 3).toDF()
>   .select(($"value".cast("long") * f.rand()).as("value"))
>   .withColumn("valueAvg", f.avg($"value") over Window.orderBy("value"))
> query.show()
> query.persist().show()
> {code}
> Here the same query is executed first without persist, and then with it. If you now navigate to the SQL page of the Spark web UI, you'll see two queries, but their graphs will be radically different: the one without persist contains the whole transformation, with the exchange, sort and window steps, while the one with persist contains only a LocalTableScan step plus the few intermediate transformations needed for `show`.
> After looking into the Spark code, I think the reason is that the `org.apache.spark.sql.execution.SparkPlanInfo#fromSparkPlan` method (which is used to serialize a plan before emitting the SparkListenerSQLExecutionStart event) constructs the `SparkPlanInfo` object from a `SparkPlan` object incorrectly: if you invoke the `toString` method on the `SparkPlan` you'll see the entire plan, but the `SparkPlanInfo` object will contain only the nodes corresponding to operations after `persist`. However, my knowledge of Spark internals is not deep enough to understand how to fix this, or how `SparkPlanInfo.fromSparkPlan` differs from what `SparkPlan.toString` does.
> This can be observed on Spark 2.3.2, but since the 2.4.0 code of `SparkPlanInfo` does not seem to have changed much since 2.3.2, I'd expect it to be reproducible there too.
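> A minimal sketch of how to check this from the same Spark shell session, assuming the `query` value above has already been persisted and materialized (the `nodeNames` helper here is purely illustrative and not part of Spark):
> {code}
> import org.apache.spark.sql.execution.SparkPlan
>
> // Derive a fresh Dataset so the query is planned again and the cached
> // relation is substituted into the physical plan.
> val plan: SparkPlan = query.select($"*").queryExecution.executedPlan
>
> // The string rendering walks the whole tree, including the part of the
> // plan that was computed before persist:
> println(plan.treeString)
>
> // A recursion over `children` only stops at the cached table scan node,
> // which seems to be roughly what ends up in the serialized SparkPlanInfo:
> def nodeNames(p: SparkPlan): Seq[String] =
>   p.nodeName +: p.children.flatMap(nodeNames)
>
> nodeNames(plan).foreach(println)
> {code}
> If the hypothesis above is right, the node list printed at the end should correspond to the truncated graph in the UI, while the `treeString` output matches the full plans shown below the graph.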