Thanks Shixiong! Very strange that our tasks were retried on the same executor again and again. I'll check spark.scheduler.executorTaskBlacklistTime.
Jianshi

On Mon, Mar 16, 2015 at 6:02 PM, Shixiong Zhu <zsxw...@gmail.com> wrote:

> There are 2 cases for "No space left on device":
>
> 1. Some tasks which use large temp space cannot run on any node.
> 2. The free space of the datanodes is not balanced. Some tasks which use
> large temp space cannot run on several nodes, but they can run on other
> nodes successfully.
>
> Because most of our cases are the second one, we set
> "spark.scheduler.executorTaskBlacklistTime" to 30000 to solve such "No
> space left on device" errors. So if a task fails on some executor, it
> won't be scheduled to the same executor again for 30 seconds.
>
> Best Regards,
> Shixiong Zhu
>
> 2015-03-16 17:40 GMT+08:00 Jianshi Huang <jianshi.hu...@gmail.com>:
>
>> I created a JIRA: https://issues.apache.org/jira/browse/SPARK-6353
>>
>> On Mon, Mar 16, 2015 at 5:36 PM, Jianshi Huang <jianshi.hu...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> We're facing "No space left on device" errors from time to time lately.
>>> The job fails after retries. Obviously, in such a case, retrying won't
>>> help.
>>>
>>> Sure, it's a problem in the datanodes, but I'm wondering if the Spark
>>> driver can handle it and decommission the problematic datanode before
>>> retrying. And maybe dynamically allocate another datanode if dynamic
>>> allocation is enabled.
>>>
>>> I think there needs to be a class of fatal errors that can't be
>>> recovered with retries. And it would be best if Spark could handle it
>>> nicely.
>>>
>>> Thanks,
>>> --
>>> Jianshi Huang
>>>
>>> LinkedIn: jianshi
>>> Twitter: @jshuang
>>> Github & Blog: http://huangjs.github.com/
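
For anyone reading the archive later, here is a rough sketch of the behaviour Shixiong describes: after a task fails on an executor, that executor is skipped for that task until the blacklist time has elapsed. This is only a toy model in Python, not Spark's scheduler code; the class `TaskBlacklist`, its methods, and the executor names are all made up for illustration, and only the 30000 ms value comes from the thread.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class TaskBlacklist:
    """Hypothetical model: remembers when a task last failed on an executor."""

    blacklist_ms: int
    _failed_at: Dict[Tuple[int, str], int] = field(default_factory=dict)

    def record_failure(self, task_id: int, executor_id: str, now_ms: int) -> None:
        # Remember the time of the most recent failure for this (task, executor) pair.
        self._failed_at[(task_id, executor_id)] = now_ms

    def is_blacklisted(self, task_id: int, executor_id: str, now_ms: int) -> bool:
        # The pair is blacklisted while less than blacklist_ms has elapsed.
        failed = self._failed_at.get((task_id, executor_id))
        return failed is not None and now_ms - failed < self.blacklist_ms

    def pick_executor(
        self, task_id: int, executors: List[str], now_ms: int
    ) -> Optional[str]:
        # Return the first executor that is not blacklisted for this task,
        # or None if every candidate is still blacklisted.
        for executor_id in executors:
            if not self.is_blacklisted(task_id, executor_id, now_ms):
                return executor_id
        return None


# 30000 ms matches the value mentioned in the thread.
bl = TaskBlacklist(blacklist_ms=30000)
executors = ["exec-1", "exec-2"]

# Task 7 fails on exec-1 at t=0 (e.g. "No space left on device").
bl.record_failure(task_id=7, executor_id="exec-1", now_ms=0)

first_retry = bl.pick_executor(7, executors, now_ms=1000)    # skips exec-1
later_retry = bl.pick_executor(7, executors, now_ms=30000)   # exec-1 allowed again
only_bad_node = bl.pick_executor(7, ["exec-1"], now_ms=1000)  # nothing available
```

Note that this only helps in the second case above (unbalanced free space): if every node is short on temp space, the task has nowhere else to go and will still fail once the retries run out.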