[ https://issues.apache.org/jira/browse/SPARK-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-10339:
------------------------------------

    Assignee: Apache Spark  (was: Yin Huai)

> When scanning a partitioned table having thousands of partitions, Driver has
> a very high memory pressure because of SQL metrics
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-10339
>                 URL: https://issues.apache.org/jira/browse/SPARK-10339
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Yin Huai
>            Assignee: Apache Spark
>            Priority: Blocker
>
> I have a local dataset with 5000 partitions stored in {{/tmp/partitioned}}.
> When I run the following code, the free space in the driver's old gen
> gradually decreases until there is essentially none left. Eventually, all
> kinds of timeouts occur and the cluster dies.
> {code}
> val df = sqlContext.read.format("parquet").load("/tmp/partitioned")
> df.filter("a > -100").selectExpr("hash(a, b)").queryExecution.toRdd.foreach(_ => Unit)
> {code}
> As a quick test, I deleted the SQL metrics from the project and filter
> operators, and the job ran fine.
> The reason is that when we scan a partitioned table, the actual plan looks
> like
> {code}
>        other operators
>               |
>               |
>        /------|------\
>       /       |       \
>      /        |        \
>     /         |         \
> project    project ... project
>    |          |           |
> filter     filter  ... filter
>    |          |           |
>  part1      part2  ...  part n
> {code}
> We create SQL metrics for every filter and project operator, which causes
> extremely high memory pressure on the driver.
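> To make the blow-up concrete, here is a minimal, self-contained Scala
> sketch (not Spark's planner or metrics code; {{Metric}}, {{Op}}, and
> {{buildScanPlan}} are hypothetical names used only for illustration) of
> how this per-partition expansion multiplies the metric objects the driver
> has to hold:
> {code}
> // Toy model of the expanded plan: each partition scan is wrapped in its
> // own filter and project, and each filter/project carries its own
> // metric objects, all of which live on the driver.
> case class Metric(name: String, var value: Long = 0L)
> case class Op(name: String, metrics: Seq[Metric], children: Seq[Op])
>
> def buildScanPlan(numPartitions: Int, metricsPerOp: Int): Op = {
>   def newMetrics() = (1 to metricsPerOp).map(i => Metric(s"metric$i"))
>   val branches = (1 to numPartitions).map { p =>
>     val scan   = Op(s"part$p", Nil, Nil)
>     val filter = Op("filter", newMetrics(), Seq(scan))
>     Op("project", newMetrics(), Seq(filter))
>   }
>   Op("union", newMetrics(), branches)
> }
>
> def countMetrics(op: Op): Int =
>   op.metrics.size + op.children.map(countMetrics).sum
>
> // 5000 partitions x 2 metric-carrying operators x 3 metrics each
> // (plus 3 on the union) = 30,003 objects for a single table scan.
> println(countMetrics(buildScanPlan(5000, 3)))  // 30003
> {code}
> With a non-partitioned scan there is a single filter and a single project,
> so the number of metric objects does not grow with the partition count.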