[jira] [Comment Edited] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver

2020-09-01 Thread Nick Hryhoriev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188342#comment-17188342
 ] 

Nick Hryhoriev edited comment on SPARK-12837 at 9/1/20, 10:12 AM:
--

 I think I still have the same issue with dataset.count in Spark 2.4.3:
 ```
 User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 1317365 tasks (2.0 GB) is bigger than spark.driver.maxResultSize (2.0 GB)
 at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
 at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
 at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
 at scala.Option.foreach(Option.scala:257)
 at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
 at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
 at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
 at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
 at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
 at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
 at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
 at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
 at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
 at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
 at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:299)
 at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2830)
 at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2829)
 at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3364)
 at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
 at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
 at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
 at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3363)
 at org.apache.spark.sql.Dataset.count(Dataset.scala:2829)
 ```
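
A minimal sketch (not from this ticket) of the two mitigations usually discussed for this failure, assuming a generic Spark 2.4 batch job; the app name, input path, and the 4g / 1000 values below are illustrative only:

```scala
import org.apache.spark.sql.SparkSession

// Option 1: give the driver more headroom for accumulated per-task results.
// spark.driver.maxResultSize has to be in place when the SparkContext starts.
val spark = SparkSession.builder()
  .appName("count-large-input")                    // hypothetical app name
  .config("spark.driver.maxResultSize", "4g")      // illustrative value
  .getOrCreate()

val events = spark.read.parquet("/path/to/input")  // hypothetical input

// Option 2: reduce the number of tasks feeding the action, so fewer
// per-task result/accumulator payloads are sent back to the driver.
val total = events.coalesce(1000).count()
println(s"row count: $total")
```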


was (Author: hryhoriev.nick):
I have the same issue with the dataset.count. in 2.4.3
```
User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 1317365 tasks (2.0 GB) is bigger than spark.driver.maxResultSize (2.0 GB)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at 

[jira] [Comment Edited] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver

2017-04-10 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963705#comment-15963705
 ] 

Wenchen Fan edited comment on SPARK-12837 at 4/11/17 1:06 AM:
--

[~Tagar] I think this may be because the size estimation was not accurate. Can 
you try the master branch? It should have been fixed there.


was (Author: cloud_fan):
[~Tagar] I think this may be because the size estimation was not accurate, can 
you try master branch? It should be fixed.

> Spark driver requires large memory space for serialized results even there 
> are no data collected to the driver
> --
>
> Key: SPARK-12837
> URL: https://issues.apache.org/jira/browse/SPARK-12837
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Tien-Dung LE
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.0.0
>
>
> Executing a SQL statement with a large number of partitions requires a large 
> amount of driver memory, even though there are no requests to collect data 
> back to the driver.
> Here are the steps to reproduce the issue.
> 1. Start spark shell with a spark.driver.maxResultSize setting
> {code:java}
> bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
> {code}
> 2. Execute the code 
> {code:java}
> case class Toto( a: Int, b: Int)
> val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF
> sqlContext.setConf( "spark.sql.shuffle.partitions", "200" )
> df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK
> sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString )
> df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile( "toto2" ) // ERROR
> {code}
> The error message is 
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Total size of serialized results of 393 tasks (1025.9 KB) is bigger than 
> spark.driver.maxResultSize (1024.0 KB)
> {code}
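
For context, the numbers in that error message point to a small fixed per-task overhead rather than collected data; a rough back-of-the-envelope check (the figures come from the message above, and the ~2.6 KB per task is derived, not measured):

{code}
// Rough arithmetic only, using the figures from the error message above.
val tasks = 393
val totalKb = 1025.9                      // total serialized results reported
val perTaskKb = totalKb / tasks           // ~2.6 KB of result/status data per task
val limitKb = 1024.0                      // spark.driver.maxResultSize=1m
println(f"per task: $perTaskKb%.1f KB, total: $totalKb%.1f KB, limit: $limitKb%.0f KB")
{code}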



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver

2017-04-10 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963510#comment-15963510
 ] 

Ruslan Dautkhanov edited comment on SPARK-12837 at 4/10/17 9:29 PM:


It might be a bug in broadcast joins.

The following Spark 2 query fails with
{quote}Total size of serialized results of 128 tasks (1026.2 MB) is bigger than spark.driver.maxResultSize (1024.0 MB){quote}
when we set
{code}
sqlc.setConf("spark.sql.autoBroadcastJoinThreshold", 500 * 1024 * 1024)   # 500 MB
{code}

{noformat}
SELECT  . . . 
m.year, 
m.quarter, 
t.individ, 
t.hh_key
FROM    mv_dev.mv_raw_all_20170314 m, 
disc_dv.tsp_dv_02122017 t
where m.psn = t.person_seq_no 
limit 10
{noformat}

When we drop the threshold down to 400 MB
{code}
sqlc.setConf("spark.sql.autoBroadcastJoinThreshold", 400 * 1024 * 1024)   # 400 MB
{code}
the error does not show up.
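
For completeness, a minimal Scala sketch of the same comparison (assuming a Spark 2.x session named `spark` and the table and column names quoted above; the elided columns from the original query are omitted, and `-1` is mentioned only as a further diagnostic step that disables broadcast joins entirely):

{code}
// Sketch of the experiment described above: the 500 MB threshold triggers the
// maxResultSize failure in the reported setup, the 400 MB threshold does not.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (500L * 1024 * 1024).toString)    // fails
// spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (400L * 1024 * 1024).toString) // works
// spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")                          // no broadcast joins

val result = spark.sql(
  """SELECT m.year, m.quarter, t.individ, t.hh_key
    |FROM mv_dev.mv_raw_all_20170314 m, disc_dv.tsp_dv_02122017 t
    |WHERE m.psn = t.person_seq_no
    |LIMIT 10""".stripMargin)
result.show()
{code}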


was (Author: tagar):
It might be a bug in broadcast join.

Following Spark 2 query fails with 
Total size of serialized results of 128 tasks (1026.2 MB) is bigger than 
spark.driver.maxResultSize (1024.0 MB)
when we set
{code}
sqlc.setConf("spark.sql.autoBroadcastJoinThreshold", 500 *1024*1024)   # 500 mb
{code}

{noformat}
SELECT  . . . 
m.year, 
m.quarter, 
t.individ, 
t.hh_key
FROM    mv_dev.mv_raw_all_20170314 m, 
disc_dv.tsp_dv_02122017 t
where m.psn = t.person_seq_no 
limit 10
{noformat}

when we drop down to 400Mb
{code}
sqlc.setConf("spark.sql.autoBroadcastJoinThreshold", 400 *1024*1024)   # 400 mb
{code}
, this error does not show up.







[jira] [Comment Edited] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver

2017-04-10 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962731#comment-15962731
 ] 

Wenchen Fan edited comment on SPARK-12837 at 4/10/17 11:54 AM:
---

[~jbherman] I tried and debugged your example. The actual task result is around 
1 KB, so with 1k shuffle partitions the total result size comes to around 1 MB. 
I'm trying to reduce the overhead of accumulators, but you have to increase 
`spark.driver.maxResultSize` anyway; 1 MB is too small for this job.
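
The arithmetic in that reply, spelled out (the ~1 KB per-task figure is the estimate quoted above):

{code}
// ~1 KB of result/accumulator data per task, times 1000 shuffle partitions,
// is already about 1 MB, i.e. right at the 1m limit used in the reproduction.
val perTaskBytes = 1024L                 // ~1 KB, the figure quoted in the comment
val shufflePartitions = 1000
val totalMb = perTaskBytes * shufflePartitions / (1024.0 * 1024.0)
println(f"estimated total result size: $totalMb%.2f MB")   // ~0.98 MB
{code}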


was (Author: cloud_fan):
[~jbherman] I tried and debugged your example, the actual task result is around 
1kb, when the number of shuffle partitions is 1k, the total result size will be 
around 1mb. I'm trying to reduce the overhead of accumulator, but you have to 
increase the `spark.driver.maxResultSize` anyway.



