[ https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15963510#comment-15963510 ]

Ruslan Dautkhanov edited comment on SPARK-12837 at 4/10/17 9:29 PM:
--------------------------------------------------------------------

This might be a bug in the broadcast join.

The following Spark 2 query fails with
{quote}Total size of serialized results of 128 tasks (1026.2 MB) is bigger than 
spark.driver.maxResultSize (1024.0 MB){quote}
when we set
{code}
sqlc.setConf("spark.sql.autoBroadcastJoinThreshold", 500 * 1024 * 1024)  # 500 MB
{code}

{noformat}
SELECT  . . . <skip>
        m.year, 
        m.quarter, 
        t.individ, 
        t.hh_key
FROM    mv_dev.mv_raw_all_20170314 m, 
        disc_dv.tsp_dv_02122017 t
where m.psn = t.person_seq_no 
limit 10
{noformat}

When we drop the threshold down to 400 MB,
{code}
sqlc.setConf("spark.sql.autoBroadcastJoinThreshold", 400 * 1024 * 1024)  # 400 MB
{code}
the error does not show up.
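For reference, the unit arithmetic behind these two settings can be checked directly. This is an illustrative Python sketch, not Spark code; the 1026.2 MB and 1024.0 MB figures are taken from the error message above, and both settings are plain byte counts:

```python
# Illustrative arithmetic only: relate the configured broadcast threshold
# to the driver's result-size limit (figures from the error message above).
broadcast_threshold = 500 * 1024 * 1024        # 500 MB autoBroadcastJoinThreshold
max_result_size = 1024 * 1024 * 1024           # 1024.0 MB spark.driver.maxResultSize
reported_results = int(1026.2 * 1024 * 1024)   # 1026.2 MB, from the error message

assert broadcast_threshold < max_result_size   # the broadcast table alone would fit
assert reported_results > max_result_size      # the summed task results do not
print(reported_results - max_result_size)      # overshoot is only ~2.2 MB
```

So the 500 MB broadcast table is itself under the limit; it is the total of the serialized per-task results that tips the driver just past the 1 GB cap, which is why a modestly smaller threshold makes the error disappear.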



> Spark driver requires large memory space for serialized results even when 
> no data is collected to the driver
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-12837
>                 URL: https://issues.apache.org/jira/browse/SPARK-12837
>             Project: Spark
>          Issue Type: Question
>          Components: SQL
>    Affects Versions: 1.5.2, 1.6.0
>            Reporter: Tien-Dung LE
>            Assignee: Wenchen Fan
>            Priority: Critical
>             Fix For: 2.0.0
>
>
> Executing a SQL statement with a large number of partitions requires a large 
> amount of driver memory even when there are no requests to collect data back 
> to the driver.
> Here are the steps to reproduce the issue.
> 1. Start spark shell with a spark.driver.maxResultSize setting
> {code:java}
> bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
> {code}
> 2. Execute the code 
> {code:java}
> case class Toto(a: Int, b: Int)
> val df = sc.parallelize(1 to 1e6.toInt).map(i => Toto(i, i)).toDF
> sqlContext.setConf("spark.sql.shuffle.partitions", "200")
> df.groupBy("a").count().saveAsParquetFile("toto1") // OK
> sqlContext.setConf("spark.sql.shuffle.partitions", 1e3.toInt.toString)
> df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile("toto2") // ERROR
> {code}
> The error message is 
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Total size of serialized results of 393 tasks (1025.9 KB) is bigger than 
> spark.driver.maxResultSize (1024.0 KB)
> {code}
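The quoted error implies a small but nonzero serialized result per task, which the driver sums against spark.driver.maxResultSize even though nothing is collected. A minimal Python illustration of that accounting, using only the figures from the quoted error message:

```python
# Sketch: each finished task ships a small serialized result/status payload
# to the driver, and maxResultSize caps the SUM across all tasks.
tasks = 393
total_kb = 1025.9               # from the quoted error message
per_task_kb = total_kb / tasks  # roughly 2.6 KB of serialized result per task
limit_kb = 1024.0               # spark.driver.maxResultSize=1m

assert per_task_kb < 3          # each task's result is tiny...
assert total_kb > limit_kb      # ...but 393 of them exceed the 1 MB cap
```

This is why the failure scales with the partition count: doubling the number of tasks roughly doubles the total serialized-result size, regardless of how much data the job actually returns.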



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
