Ryan Blue created SPARK-14679:
---------------------------------

             Summary: UI DAG visualization causes OOM generating data
                 Key: SPARK-14679
                 URL: https://issues.apache.org/jira/browse/SPARK-14679
             Project: Spark
          Issue Type: Bug
          Components: Web UI
    Affects Versions: 1.6.1
            Reporter: Ryan Blue


The UI will hit an OutOfMemoryError when generating the DAG visualization 
data for large Hive table scans. The problem is that a cluster's data is 
duplicated in the output once for each RDD it contains, like cluster10 here:

{code}
digraph G {
  subgraph clusterstage_1 {
    label="Stage 1";
    subgraph cluster7 {
      label="TungstenAggregate";
      9 [label="MapPartitionsRDD [9]\nrun at ThreadPoolExecutor.java:1142"];
    }
    subgraph cluster10 {
      label="HiveTableScan";
      7 [label="MapPartitionsRDD [7]\nrun at ThreadPoolExecutor.java:1142"];
      6 [label="MapPartitionsRDD [6]\nrun at ThreadPoolExecutor.java:1142"];
      5 [label="HadoopRDD [5]\nrun at ThreadPoolExecutor.java:1142"];
    }
    subgraph cluster10 {
      label="HiveTableScan";
      7 [label="MapPartitionsRDD [7]\nrun at ThreadPoolExecutor.java:1142"];
      6 [label="MapPartitionsRDD [6]\nrun at ThreadPoolExecutor.java:1142"];
      5 [label="HadoopRDD [5]\nrun at ThreadPoolExecutor.java:1142"];
    }
    subgraph cluster8 {
      label="ConvertToUnsafe";
      8 [label="MapPartitionsRDD [8]\nrun at ThreadPoolExecutor.java:1142"];
    }
    subgraph cluster10 {
      label="HiveTableScan";
      7 [label="MapPartitionsRDD [7]\nrun at ThreadPoolExecutor.java:1142"];
      6 [label="MapPartitionsRDD [6]\nrun at ThreadPoolExecutor.java:1142"];
      5 [label="HadoopRDD [5]\nrun at ThreadPoolExecutor.java:1142"];
    }
  }
  8->9;
  6->7;
  5->6;
  7->8;
}
{code}

Hive has a large number of RDDs because it creates an RDD for each partition in 
the scan returned by the metastore. Each RDD results in another copy of its 
cluster in the output. The data is built with a StringBuilder and copied into a 
String, so the memory required gets huge quickly.

The cause is how the RDDOperationGraph gets generated. For each RDD, a nested 
chain of RDDOperationClusters is produced and those chains are merged. But there 
is no implementation of equals for RDDOperationCluster, so clusters are always 
distinct and accumulate rather than getting 
[deduped|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/scope/RDDOperationGraph.scala#L135].
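For illustration, here is a minimal sketch of the problem using hypothetical 
stand-in classes (not the real RDDOperationCluster): without equals/hashCode, 
collection operations fall back to reference equality, so a cluster rebuilt for 
the next RDD is never recognized as already present.

{code}
// Without equals, two clusters built from the same scope are "distinct":
class Cluster(val id: String, val name: String)

// Hypothetical fix: define equality by the cluster's identifying fields.
class DedupedCluster(val id: String, val name: String) {
  override def equals(other: Any): Boolean = other match {
    case that: DedupedCluster => id == that.id && name == that.name
    case _ => false
  }
  override def hashCode(): Int = Seq(id, name).hashCode()
}

object Demo extends App {
  val a = new Cluster("10", "HiveTableScan")
  val b = new Cluster("10", "HiveTableScan")
  println(Seq(a).contains(b))  // false -- b gets appended again, duplicating output

  val c = new DedupedCluster("10", "HiveTableScan")
  val d = new DedupedCluster("10", "HiveTableScan")
  println(Seq(c).contains(d))  // true -- the merge step could skip the duplicate
}
{code}

With equality defined this way, the merge step could recognize a cluster it has 
already added instead of appending another full copy per RDD.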


