[ https://issues.apache.org/jira/browse/SPARK-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043730#comment-14043730 ]

Bharath Ravi Kumar edited comment on SPARK-1112 at 6/25/14 4:51 PM:
--------------------------------------------------------------------

Can a clear workaround be specified for this bug, please? For those unable to 
upgrade to 1.0.1 or 1.1.0 in production, general instructions on the workaround 
are required; otherwise this is a huge blocker for current production 
deployments (even on 1.0.0). For instance, running saveAsTextFile() on an RDD 
(~400MB) causes execution to freeze, with the last log statements seen on the 
driver being:

14/06/25 16:38:55 INFO spark.SparkContext: Starting job: saveAsTextFile at 
Test.java:99
14/06/25 16:38:55 INFO scheduler.DAGScheduler: Got job 6 (saveAsTextFile at 
Test.java:99) with 2 output partitions (allowLocal=false)
14/06/25 16:38:55 INFO scheduler.DAGScheduler: Final stage: Stage 
6(saveAsTextFile at Test.java:99)
14/06/25 16:38:55 INFO scheduler.DAGScheduler: Parents of final stage: List()
14/06/25 16:38:55 INFO scheduler.DAGScheduler: Missing parents: List()
14/06/25 16:38:55 INFO scheduler.DAGScheduler: Submitting Stage 6 
(MappedRDD[558] at saveAsTextFile at Test.java:99), which has no missing parents
14/06/25 16:38:55 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from 
Stage 6 (MappedRDD[558] at saveAsTextFile at Test.java:99)
14/06/25 16:38:55 INFO scheduler.TaskSchedulerImpl: Adding task set 6.0 with 2 
tasks
14/06/25 16:38:55 INFO scheduler.TaskSetManager: Starting task 6.0:0 as TID 5 
on executor 1: somehost.corp (PROCESS_LOCAL)
14/06/25 16:38:55 INFO scheduler.TaskSetManager: Serialized task 6.0:0 as 
351777 bytes in 36 ms
14/06/25 16:38:55 INFO scheduler.TaskSetManager: Starting task 6.0:1 as TID 6 
on executor 0: someotherhost.corp (PROCESS_LOCAL)
14/06/25 16:38:55 INFO scheduler.TaskSetManager: Serialized task 6.0:1 as 
186453 bytes in 16 ms

The test setup for reproducing this issue has two slaves (each with 24GB) 
running Spark standalone. The driver runs with -Xmx4g.
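
For reference, a minimal sketch of the kind of job that hits this (the real 
Test.java is not reproduced here; the class name, data volume, frameSize value, 
and output path below are placeholders, with frameSize simply set above 10):

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class Test {
    public static void main(String[] args) {
        // Any frameSize above the 10MiB default is enough to hit the bug.
        SparkConf conf = new SparkConf()
                .setAppName("spark-1112-repro")
                .set("spark.akka.frameSize", "100");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Roughly a few hundred MB of text in 2 partitions, matching the
        // "2 output partitions" seen in the driver log above.
        List<String> lines = new ArrayList<String>();
        for (int i = 0; i < 4000000; i++) {
            lines.add("some reasonably long padding line of text, number " + i);
        }
        JavaRDD<String> rdd = sc.parallelize(lines, 2);

        // Execution freezes here; the driver log stops after the two
        // "Serialized task" lines and never reports task completion.
        rdd.saveAsTextFile("hdfs:///tmp/spark-1112-out");

        sc.stop();
    }
}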

Thanks.


> When spark.akka.frameSize > 10, task results bigger than 10MiB block execution
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-1112
>                 URL: https://issues.apache.org/jira/browse/SPARK-1112
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 0.9.0, 1.0.0
>            Reporter: Guillaume Pitel
>            Assignee: Xiangrui Meng
>            Priority: Blocker
>             Fix For: 1.0.1, 1.1.0
>
>
> When I set spark.akka.frameSize to something over 10, the messages sent 
> from the executors to the driver completely block the execution if the 
> message is bigger than 10MiB and smaller than the frameSize (if it is above 
> the frameSize, it's OK).
> The workaround is to set spark.akka.frameSize to 10. In that case, since 
> 0.8.1, the BlockManager deals with the data to be sent. It seems slower than 
> Akka direct messages, though.
> The configuration seems to be correctly read (see actorSystemConfig.txt), so 
> I don't see where the 10MiB limit could come from.
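
The workaround described above is a one-line configuration change: keep 
spark.akka.frameSize at its 10MiB default so that, as the description notes, 
the data is shipped via the BlockManager rather than direct Akka messages. A 
minimal sketch, assuming the property is set programmatically on the SparkConf 
(it can equally be passed as a system property, -Dspark.akka.frameSize=10):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Workaround sketch: leave the frame size at its 10MiB default so that
// messages larger than the frame size fall back to the BlockManager path.
SparkConf conf = new SparkConf()
        .setAppName("my-app")                 // placeholder application name
        .set("spark.akka.frameSize", "10");
JavaSparkContext sc = new JavaSparkContext(conf);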


