I think this is the configuration you are looking for:

*spark.locality.wait*: Number of milliseconds to wait to launch a
data-local task before giving up and launching it on a less-local node. The
same wait is used to step through multiple locality levels (process-local,
node-local, rack-local and then any). It is also possible to customize the
wait time for each level by setting spark.locality.wait.node, etc. You
should increase this setting if your tasks are long and you see poor
locality, but the default usually works well.
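
For example, here is a minimal sketch of setting these on the SparkConf
(the app name and the values are only illustrative, not recommendations):

  import org.apache.spark.{SparkConf, SparkContext}

  // Illustrative values only; tune them for your own workload.
  val conf = new SparkConf()
    .setAppName("MyApp")
    .set("spark.locality.wait", "10000")       // overall locality wait, in milliseconds
    .set("spark.locality.wait.node", "30000")  // wait longer before dropping node-locality
  val sc = new SparkContext(conf)

You can also pass the same settings with spark-submit, e.g.
--conf spark.locality.wait=10000.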

Read more here:
https://spark.apache.org/docs/latest/configuration.html#scheduling



Thanks
Best Regards

On Fri, Apr 3, 2015 at 8:41 AM, Stephen Merity <step...@commoncrawl.org>
wrote:

> Hi there,
>
> I've been using Spark to process 33,000 gzipped files that contain
> billions of JSON records (the metadata [WAT] dataset from Common Crawl).
> I've hit a few issues and have not yet found answers in the
> documentation or through search. This may well just be me not finding the
> right pages, though I promise I've attempted to RTFM thoroughly!
>
> Is there any way to (a) ensure retry attempts are done on different nodes
> and/or (b) ensure there's a delay between retrying a failing task (similar
> to spark.shuffle.io.retryWait)?
>
> Optimally, when a task fails it should be given to a different executor*.
> That is not what I've seen: with maxFailures set to 16, the task is handed
> back to the same executor 16 times, even though there are 89 other nodes.
>
> The retry attempts are incredibly fast. The transient issue (DNS
> resolution to an Amazon bucket failing) clears up quickly, but the 16
> retry attempts take less than a second, all running on the same flaky node.
>
> For now I've set maxFailures to an absurdly high number and that has
> worked around the issue -- the DNS error disappears on the affected
> machine after ~22 seconds (~360 task attempts) -- but that's obviously
> suboptimal.
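>
> For reference, a rough sketch of that workaround (the app name and the
> exact value are just illustrative; the actual config key is
> spark.task.maxFailures):
>
>   import org.apache.spark.{SparkConf, SparkContext}
>
>   val conf = new SparkConf()
>     .setAppName("wat-processing")           // hypothetical app name
>     .set("spark.task.maxFailures", "1000")  // absurdly high, to outlast the transient DNS failures
>   val sc = new SparkContext(conf)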
>
> Additionally, are there other options for handling node failures? Other
> than maxFailures, I've only seen settings relating to shuffle failures. In
> the one instance where a node lost communication, it killed the job; I'd
> assumed the RDD would be reconstructed. For now I've tried to work around
> it by persisting to multiple machines (MEMORY_AND_DISK_SER_2).
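>
> The persistence bit looks roughly like this (the RDD name and the input
> path are only placeholders):
>
>   import org.apache.spark.storage.StorageLevel
>
>   // sc is the existing SparkContext; "records" stands in for the real RDD
>   // of WAT records.
>   val records = sc.textFile("s3n://commoncrawl/...")   // hypothetical path
>   records.persist(StorageLevel.MEMORY_AND_DISK_SER_2)  // replicate each partition to a second machine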
>
> Thanks! ^_^
>
> --
> Regards,
> Stephen Merity
> Data Scientist @ Common Crawl
>
