[ https://issues.apache.org/jira/browse/SPARK-14679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14679:
------------------------------------

    Assignee: Apache Spark

> UI DAG visualization causes OOM generating data
> -----------------------------------------------
>
>                 Key: SPARK-14679
>                 URL: https://issues.apache.org/jira/browse/SPARK-14679
>             Project: Spark
>          Issue Type: Bug
>          Components: Web UI
>    Affects Versions: 1.6.1
>            Reporter: Ryan Blue
>            Assignee: Apache Spark
>
> The UI will hit an OutOfMemoryError when generating the DAG visualization 
> data for large Hive table scans. The problem is that data is duplicated in 
> the output once per RDD, as with cluster10 here:
> {code}
> digraph G {
>   subgraph clusterstage_1 {
>     label="Stage 1";
>     subgraph cluster7 {
>       label="TungstenAggregate";
>       9 [label="MapPartitionsRDD [9]\nrun at ThreadPoolExecutor.java:1142"];
>     }
>     subgraph cluster10 {
>       label="HiveTableScan";
>       7 [label="MapPartitionsRDD [7]\nrun at ThreadPoolExecutor.java:1142"];
>       6 [label="MapPartitionsRDD [6]\nrun at ThreadPoolExecutor.java:1142"];
>       5 [label="HadoopRDD [5]\nrun at ThreadPoolExecutor.java:1142"];
>     }
>     subgraph cluster10 {
>       label="HiveTableScan";
>       7 [label="MapPartitionsRDD [7]\nrun at ThreadPoolExecutor.java:1142"];
>       6 [label="MapPartitionsRDD [6]\nrun at ThreadPoolExecutor.java:1142"];
>       5 [label="HadoopRDD [5]\nrun at ThreadPoolExecutor.java:1142"];
>     }
>     subgraph cluster8 {
>       label="ConvertToUnsafe";
>       8 [label="MapPartitionsRDD [8]\nrun at ThreadPoolExecutor.java:1142"];
>     }
>     subgraph cluster10 {
>       label="HiveTableScan";
>       7 [label="MapPartitionsRDD [7]\nrun at ThreadPoolExecutor.java:1142"];
>       6 [label="MapPartitionsRDD [6]\nrun at ThreadPoolExecutor.java:1142"];
>       5 [label="HadoopRDD [5]\nrun at ThreadPoolExecutor.java:1142"];
>     }
>   }
>   8->9;
>   6->7;
>   5->6;
>   7->8;
> }
> {code}
> Hive has a large number of RDDs because it creates an RDD for each partition 
> in the scan returned by the metastore. Each of those RDDs results in another 
> copy of its cluster in the output. The data is built with a StringBuilder and 
> copied into a String, so the memory required quickly becomes huge.
> The cause is how the RDDOperationGraph is generated. For each RDD, a nested 
> chain of RDDOperationClusters is produced, and those chains are merged. But 
> there is no implementation of equals for RDDOperationCluster, so clusters are 
> always considered distinct and accumulated rather than 
> [deduped|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/scope/RDDOperationGraph.scala#L135].
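> A minimal sketch of the kind of fix this implies (the class below is a 
> simplified stand-in; the real RDDOperationCluster also carries child nodes 
> and other mutable state, and attachChildCluster is only a hypothetical name 
> for the merge step): defining equals/hashCode lets the merge dedupe repeated 
> clusters instead of accumulating them:
> {code}
> import scala.collection.mutable.ListBuffer
>
> // Simplified stand-in for org.apache.spark.ui.scope.RDDOperationCluster.
> class RDDOperationCluster(val id: String, val name: String) {
>   private val childClusters = ListBuffer[RDDOperationCluster]()
>
>   // Attach a child cluster only once. Without the equals override below,
>   // every partition's RDD re-derives a "distinct" HiveTableScan cluster
>   // and another copy is appended each time.
>   def attachChildCluster(child: RDDOperationCluster): Unit = {
>     if (!childClusters.contains(child)) {
>       childClusters += child
>     }
>   }
>
>   // Structural equality by id and name, so ListBuffer.contains (and any
>   // hash-based collection) can recognize duplicates.
>   override def equals(other: Any): Boolean = other match {
>     case c: RDDOperationCluster => id == c.id && name == c.name
>     case _ => false
>   }
>
>   override def hashCode(): Int = (id, name).##
> }
>
> object DedupeDemo extends App {
>   val stage = new RDDOperationCluster("clusterstage_1", "Stage 1")
>   // Each scanned partition re-derives the same HiveTableScan cluster:
>   stage.attachChildCluster(new RDDOperationCluster("cluster10", "HiveTableScan"))
>   stage.attachChildCluster(new RDDOperationCluster("cluster10", "HiveTableScan"))
>   // With equals defined, the second attach is a no-op, so the generated
>   // DOT output would contain a single cluster10 subgraph.
> }
> {code}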


