[ https://issues.apache.org/jira/browse/SPARK-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated SPARK-10339:
-----------------------------
    Description: 
I have a local dataset with 5000 partitions stored in {{/tmp/partitioned}}. 
When I run the following code, the free space in the driver's old gen 
gradually decreases until there is essentially none left. Eventually, all 
kinds of timeouts occur and the cluster dies.
{code}
val df = sqlContext.read.format("parquet").load("/tmp/partitioned")
df.filter("a > -100")
  .selectExpr("hash(a, b)")
  .queryExecution.toRdd
  .foreach(_ => Unit)
{code}
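
A dataset like this can be generated along the following lines (the column 
names, types, and the use of {{partitionBy}} are assumptions inferred from the 
query above, not the exact commands used to build the original dataset):
{code}
// Writes a parquet table with 5000 partition directories under /tmp/partitioned.
// Partitioning by "a" with 5000 distinct values yields 5000 partitions.
val data = sqlContext.range(0, 5000).selectExpr("id as a", "id as b")
data.write.partitionBy("a").format("parquet").save("/tmp/partitioned")
{code}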

As a quick test, I deleted the SQL metrics from the project and filter 
operators, and the job ran fine.

The reason is that when we scan a partitioned table, the actual plan looks 
like
{code}
       other operators
           |
           |
        /--|------\
       /   |       \
      /    |        \
     /     |         \
project  project ... project
  |        |           |
filter   filter  ... filter
  |        |           |
part1    part2   ... part n
{code}

We create SQL metrics for every filter and project operator, which causes 
extremely high memory pressure on the driver.
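
To make the scale concrete, here is a toy model of what the per-partition plan 
expansion does to the driver ({{OperatorMetric}} and {{MetricRegistry}} are 
made-up names for illustration, not Spark classes): every partition contributes 
its own filter and project instance, and every instance registers driver-side 
metric objects that stay referenced for the lifetime of the query.
{code}
import scala.collection.mutable.ArrayBuffer

// Toy stand-in for a per-operator SQL metric kept on the driver.
case class OperatorMetric(name: String, var value: Long = 0L)

// Toy stand-in for the driver-side bookkeeping that keeps every metric alive.
object MetricRegistry {
  val allMetrics = ArrayBuffer.empty[OperatorMetric]
  def register(m: OperatorMetric): OperatorMetric = { allMetrics += m; m }
}

object PlanExpansion extends App {
  val numPartitions = 5000
  // The expanded plan has one Filter and one Project per partition,
  // and each of them registers its own metric(s) with the driver.
  for (p <- 1 to numPartitions; op <- Seq("Filter", "Project")) {
    MetricRegistry.register(OperatorMetric(s"$op-numRows-part$p"))
  }
  // 5000 partitions x 2 operators = 10,000 long-lived objects in the
  // driver's old gen, before counting the accumulator updates that
  // every task sends back for each of them.
  println(s"Driver-side metric objects: ${MetricRegistry.allMetrics.size}")
}
{code}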

> When scanning a partitioned table having thousands of partitions, Driver has a very high memory pressure because of SQL metrics
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-10339
>                 URL: https://issues.apache.org/jira/browse/SPARK-10339
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Yin Huai
>            Priority: Blocker
>


