Vladimir Matveev created SPARK-26723:
----------------------------------------

             Summary: Spark web UI only shows parts of SQL query graphs for 
queries with persist operations
                 Key: SPARK-26723
                 URL: https://issues.apache.org/jira/browse/SPARK-26723
             Project: Spark
          Issue Type: Bug
          Components: Web UI
    Affects Versions: 2.3.2
            Reporter: Vladimir Matveev


Currently it looks like the SQL tab in the Spark web UI truncates the query graph 
at nodes corresponding to persist operations on a DataFrame, showing only the 
part of the plan from "LocalTableScan" onwards. This is *very* inconvenient: in 
the common case where you have a heavy computation and want to persist it before 
writing it to multiple outputs with some minor preprocessing, you lose almost the 
entire graph, together with the potentially very useful information in it.

The query plans shown below the graph, however, do contain the full query, 
including all computations before the persist. Unfortunately, for complex queries 
reading through the textual plan is infeasible, and the graph visualization 
becomes a very helpful tool; with persist, it is apparently broken.

You can verify this in the Spark shell with a very simple example:
{code}
import org.apache.spark.sql.{functions => f}
import org.apache.spark.sql.expressions.Window

// The Spark shell pre-imports spark.implicits._, which provides toDF() and $"..."
val query = Vector(1, 2, 3).toDF()
  .select(($"value".cast("long") * f.rand()).as("value"))
  .withColumn("valueAvg", f.avg($"value") over Window.orderBy("value"))
query.show()
query.persist().show()
{code}
Here the same query is executed first without persist, and then with it. If you 
now navigate to the SQL page of the Spark web UI, you'll see two queries, but 
their graphs are radically different: the one without persist contains the whole 
transformation, with exchange, sort and window steps, while the one with persist 
contains only a LocalTableScan step plus some intermediate transformations 
needed for `show`.

After some digging into the Spark code, I think the reason is that the 
`org.apache.spark.sql.execution.SparkPlanInfo#fromSparkPlan` method (which is 
used to serialize a plan before emitting the SparkListenerSQLExecutionStart 
event) constructs the `SparkPlanInfo` object from a `SparkPlan` object 
incorrectly: invoking `toString` on the `SparkPlan` shows the entire plan, but 
the resulting `SparkPlanInfo` object only contains nodes corresponding to 
actions after `persist`. However, my knowledge of Spark internals is not deep 
enough to understand how to fix this, or how `SparkPlanInfo.fromSparkPlan` 
differs from what `SparkPlan.toString` does.
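My (unverified) hypothesis is that `fromSparkPlan` recurses only into a node's 
`children`, while a cached scan node (such as `InMemoryTableScanExec`, which is a 
leaf node) holds the cached subtree in a separate field that such a traversal 
never visits. The toy model below is self-contained and uses made-up names; it is 
not actual Spark code, just a sketch of how a children-only traversal would drop 
everything behind the cache boundary:
{code}
// Toy model: a node may reference a subtree through a field that is NOT in
// `children`, the way a cached scan holds the cached plan.
case class Node(name: String, children: Seq[Node], cached: Option[Node] = None)

// Traversal analogous to (my reading of) SparkPlanInfo.fromSparkPlan:
// recurse into `children` only, so the cached subtree is silently dropped.
def childrenOnly(n: Node): Seq[String] =
  n.name +: n.children.flatMap(childrenOnly)

// Traversal that also descends into the non-child reference,
// so the full plan stays visible (as SparkPlan.toString apparently does).
def full(n: Node): Seq[String] =
  n.name +: (n.children ++ n.cached).flatMap(full)

val heavy = Node("Window", Seq(Node("Sort", Seq(Node("Exchange", Nil)))))
val plan  = Node("Project", Seq(Node("InMemoryTableScan", Nil, Some(heavy))))

println(childrenOnly(plan).mkString(", "))
// Project, InMemoryTableScan
println(full(plan).mkString(", "))
// Project, InMemoryTableScan, Window, Sort, Exchange
{code}
If this is what happens, the truncated graph would be exactly the 
`childrenOnly`-style result: everything upstream of the cached scan disappears.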

This can be observed on Spark 2.3.2, but given that the 2.4.0 code of 
`SparkPlanInfo` does not seem to have changed much since 2.3.2, I'd expect it to 
be reproducible there too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
