Hi,

can you take a look at the logs and see what the first error you are
getting is?  It's possible that the file doesn't exist when that error is
produced but shows up later. I've seen similar things happen, but
only after there have already been some errors.  If you see that in
the very first error, though, then I'm not sure what the cause is.  It
would be helpful if you could send the logs.
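
If the logs are large, a small script can pull out the first error for
you. The sketch below is only an illustration: it assumes the log lines
use Spark's default log4j layout (timestamp, level, class, message), and
the function name first_error is made up for this example. On YARN you
could first collect the logs with
"yarn logs -applicationId application_1426230650260_1044".

```python
import re

# Assumption: lines follow Spark's default log4j pattern,
# "yy/MM/dd HH:mm:ss LEVEL Class: message". Adjust the regex
# if your log4j configuration uses a different layout.
ERROR_RE = re.compile(r"^\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} ERROR\b")


def first_error(lines):
    """Return (line_number, text) for the first ERROR line, or None.

    `lines` is any iterable of log lines, e.g. an open file.
    (Hypothetical helper, not part of Spark.)
    """
    for number, line in enumerate(lines, start=1):
        if ERROR_RE.match(line):
            return number, line.rstrip("\n")
    return None
```

You would call it as first_error(open(path)) on each executor log and
compare the timestamps to see which executor failed first.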

Imran

On Fri, May 15, 2015 at 10:07 AM, rok <rokros...@gmail.com> wrote:

> I am trying to sort a collection of key,value pairs (between several
> hundred million and a few billion) and have recently been getting lots of
> "FetchFailedException" errors that seem to originate when one of the
> executors cannot find a temporary shuffle file on disk. E.g.:
>
> org.apache.spark.shuffle.FetchFailedException:
>
> /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1044/blockmgr-453473e7-76c2-4a94-85d0-d0b75b515ad6/10/shuffle_0_264_0.index
> (No such file or directory)
>
> This file actually exists:
>
> > ls -l /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1044/blockmgr-453473e7-76c2-4a94-85d0-d0b75b515ad6/10/shuffle_0_264_0.index
>
> -rw-r--r-- 1 hadoop hadoop 11936 May 15 16:52 /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1044/blockmgr-453473e7-76c2-4a94-85d0-d0b75b515ad6/10/shuffle_0_264_0.index
>
> This error repeats on several executors and is followed by a number of
>
> org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
> location for shuffle 0
>
> This results in most tasks being lost and executors dying.
>
> There is plenty of space on all of the relevant filesystems, so none of
> the executors are running out of disk space. I am running this via YARN
> on approximately 100 nodes with 2 cores per node. Any thoughts on what
> might be causing these errors? Thanks!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/FetchFailedException-and-MetadataFetchFailedException-tp22901.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
