[ https://issues.apache.org/jira/browse/SPARK-14679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-14679.
-------------------------------
    Resolution: Fixed
    Fix Version/s: 1.6.2
                   2.0.0

Issue resolved by pull request 12437
[https://github.com/apache/spark/pull/12437]

> UI DAG visualization causes OOM generating data
> -----------------------------------------------
>
>                 Key: SPARK-14679
>                 URL: https://issues.apache.org/jira/browse/SPARK-14679
>             Project: Spark
>          Issue Type: Bug
>          Components: Web UI
>    Affects Versions: 1.6.1
>            Reporter: Ryan Blue
>             Fix For: 2.0.0, 1.6.2
>
>
> The UI hits an OutOfMemoryError when generating the DAG visualization data
> for large Hive table scans. The problem is that the output duplicates the
> data for each RDD, as with cluster10 here:
> {code}
> digraph G {
>   subgraph clusterstage_1 {
>     label="Stage 1";
>     subgraph cluster7 {
>       label="TungstenAggregate";
>       9 [label="MapPartitionsRDD [9]\nrun at ThreadPoolExecutor.java:1142"];
>     }
>     subgraph cluster10 {
>       label="HiveTableScan";
>       7 [label="MapPartitionsRDD [7]\nrun at ThreadPoolExecutor.java:1142"];
>       6 [label="MapPartitionsRDD [6]\nrun at ThreadPoolExecutor.java:1142"];
>       5 [label="HadoopRDD [5]\nrun at ThreadPoolExecutor.java:1142"];
>     }
>     subgraph cluster10 {
>       label="HiveTableScan";
>       7 [label="MapPartitionsRDD [7]\nrun at ThreadPoolExecutor.java:1142"];
>       6 [label="MapPartitionsRDD [6]\nrun at ThreadPoolExecutor.java:1142"];
>       5 [label="HadoopRDD [5]\nrun at ThreadPoolExecutor.java:1142"];
>     }
>     subgraph cluster8 {
>       label="ConvertToUnsafe";
>       8 [label="MapPartitionsRDD [8]\nrun at ThreadPoolExecutor.java:1142"];
>     }
>     subgraph cluster10 {
>       label="HiveTableScan";
>       7 [label="MapPartitionsRDD [7]\nrun at ThreadPoolExecutor.java:1142"];
>       6 [label="MapPartitionsRDD [6]\nrun at ThreadPoolExecutor.java:1142"];
>       5 [label="HadoopRDD [5]\nrun at ThreadPoolExecutor.java:1142"];
>     }
>   }
>   8->9;
>   6->7;
>   5->6;
>   7->8;
> }
> {code}
> Hive has a large number of RDDs because it creates an RDD for each partition
> in the scan returned by the metastore. Each RDD adds another copy of its
> cluster data to the output. The data is built with a StringBuilder and copied
> into a String, so the memory required grows very quickly.
> The cause is how the RDDOperationGraph gets generated. For each RDD, a nested
> chain of RDDOperationCluster is produced and those chains are merged. But
> there is no implementation of equals for RDDOperationCluster, so the clusters
> are always treated as distinct and accumulated rather than
> [deduped|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/scope/RDDOperationGraph.scala#L135].
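> A minimal, self-contained sketch of the failure mode (the Cluster class and
> attachChildCluster helper below are illustrative stand-ins, not the actual
> RDDOperationCluster API): without structural equals/hashCode, a contains
> check during merging can never recognize a cluster that was rebuilt for
> another RDD, so duplicates accumulate; defining equality on the cluster's id
> and name is enough to dedup.
> {code}
> import scala.collection.mutable.ListBuffer
>
> // Illustrative stand-in for RDDOperationCluster; field and method names are
> // hypothetical, not the real Spark internals.
> class Cluster(val id: String, val name: String) {
>   val childClusters = new ListBuffer[Cluster]
>
>   // Only attach a child cluster we haven't seen yet. The contains check
>   // relies on equals, so it only dedups if equality is structural.
>   def attachChildCluster(child: Cluster): Unit = {
>     if (!childClusters.contains(child)) {
>       childClusters += child
>     }
>   }
>
>   // Structural equality, roughly what the fix adds to RDDOperationCluster.
>   override def equals(other: Any): Boolean = other match {
>     case that: Cluster => id == that.id && name == that.name
>     case _ => false
>   }
>   override def hashCode(): Int = (id, name).hashCode()
> }
>
> object DedupDemo extends App {
>   val stage = new Cluster("stage_1", "Stage 1")
>   // Each partition RDD of the Hive scan rebuilds the same logical cluster.
>   (1 to 3).foreach { _ =>
>     stage.attachChildCluster(new Cluster("cluster10", "HiveTableScan"))
>   }
>   // Prints 1 with the structural equals/hashCode above; with the default
>   // reference equality it would print 3, one copy per RDD.
>   println(stage.childClusters.size)
> }
> {code}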