[ https://issues.apache.org/jira/browse/SPARK-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yin Huai updated SPARK-10339: ----------------------------- Description: I have a local dataset having 5000 partitions stored in {{/tmp/partitioned}}. When I run the following code, the free memory space in driver's old gen gradually decreases and eventually there is pretty much no free space in driver's old gen. Finally, all kinds of timeouts happen and the cluster is died. {code} val df = sqlContext.read.format("parquet").load("/tmp/partitioned") df.filter("a > -100").selectExpr("hash(a, b)").queryExecution.toRdd.foreach(_ => Unit) {code} I did a quick test by deleting SQL metrics from project and filter operator, my job works fine. The reason is that for a partitioned table, when we scan it, the actual plan is like {code} other operators | | /--|------\ / | \ / | \ / | \ project project ... project | | | filter filter ... filter | | | part1 part2 ... part n {code} We create SQL metrics for every filter and project, which causing the extremely high memory pressure to the driver. was: I have a local dataset having 5000 partitions stored in {{/tmp/partitioned}}. When I run the following code, the free memory space in driver's old gen gradually decrease and eventually there is no free space in driver's old gen. {code} val df = sqlContext.read.format("parquet").load("/tmp/partitioned") df.filter("a > -100").selectExpr("hash(a, b)").queryExecution.toRdd.foreach(_ => Unit) {code} I did a quick test by deleting SQL metrics from project and filter operator, my job works fine. The reason is that for a partitioned table, when we scan it, the actual plan is like {code} other operators | | /--|------\ / | \ / | \ / | \ project project ... project | | | filter filter ... filter | | | part1 part2 ... part n {code} We create SQL metrics for every filter and project, which causing the extremely high memory pressure to the driver. > When scanning a partitioned table having thousands of partitions, Driver has > a very high memory pressure because of SQL metrics > ------------------------------------------------------------------------------------------------------------------------------- > > Key: SPARK-10339 > URL: https://issues.apache.org/jira/browse/SPARK-10339 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 1.5.0 > Reporter: Yin Huai > Priority: Blocker > > I have a local dataset having 5000 partitions stored in {{/tmp/partitioned}}. > When I run the following code, the free memory space in driver's old gen > gradually decreases and eventually there is pretty much no free space in > driver's old gen. Finally, all kinds of timeouts happen and the cluster is > died. > {code} > val df = sqlContext.read.format("parquet").load("/tmp/partitioned") > df.filter("a > -100").selectExpr("hash(a, b)").queryExecution.toRdd.foreach(_ > => Unit) > {code} > I did a quick test by deleting SQL metrics from project and filter operator, > my job works fine. > The reason is that for a partitioned table, when we scan it, the actual plan > is like > {code} > other operators > | > | > /--|------\ > / | \ > / | \ > / | \ > project project ... project > | | | > filter filter ... filter > | | | > part1 part2 ... part n > {code} > We create SQL metrics for every filter and project, which causing the > extremely high memory pressure to the driver. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org