This sounds bad, and is probably related to shuffle file consolidation.
Turning off consolidation would probably get you working again, but I'd
really love to track down the bug. Do you know whether any tasks fail before
those errors start occurring? It's very possible that another exception is
causing a shuffle file never to be written -- I've seen latent
OutOfMemoryErrors (OOMEs) induce this behavior, for instance.
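
If you want to try that as a stopgap while we debug, the relevant setting is
the spark.shuffle.consolidateFiles property; on 0.8.x, Spark reads its
configuration from Java system properties set before the SparkContext is
created (SparkConf only arrives in 0.9.0). A minimal sketch -- the master URL
and application name below are placeholders, not anything from your setup:

    import org.apache.spark.SparkContext

    object ConsolidationOff {
      def main(args: Array[String]): Unit = {
        // Must be set before the SparkContext is constructed; on 0.8.x
        // Spark picks up its settings from Java system properties.
        System.setProperty("spark.shuffle.consolidateFiles", "false")

        // Placeholder master URL and app name -- substitute your own.
        val sc = new SparkContext("spark://master:7077", "consolidation-off")

        // ... run the iterative job as before ...

        sc.stop()
      }
    }

The same property can also be passed on the JVM command line with
-Dspark.shuffle.consolidateFiles=false.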


On Sun, Feb 9, 2014 at 8:28 AM, Guillaume Pitel
<guillaume.pi...@exensa.com> wrote:

>  Hi,
>
> I've got a strange problem with 0.8.1 (we're going to make the jump to
> 0.9.0 in a few days, but for now I'm working with a 0.8.1 cluster): after a
> few iterations of my method, one random node of my local cluster throws an
> exception like this:
>
> FileNotFoundException:
> /sparktmp/spark-local-20140209073949-29b1/37/merged_shuffle_24_23_1 (No
> such file or directory)
>
> Then either the job gets stuck for hours, or it fails right away.
>
> I've got the ulimit at 131k files, and consolidateFiles=true, so I don't
> think it's a problem related to the number of file descriptors.
>
> Guillaume
>
> --
>  *Guillaume PITEL, Président*
> +33(0)6 25 48 86 80 / +33(0)9 70 44 67 53
>
>  eXenSa S.A.S. <http://www.exensa.com/>
>  41, rue Périer - 92120 Montrouge - FRANCE
> Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
>
