Hmm, sorry, I think that disproves my theory.  Nothing else is immediately
coming to mind.  It's possible there is more info in the logs from the
driver; it couldn't hurt to send those (though I don't have high hopes of
finding anything that way).  Any chance this could be from too many open
files or something?  Normally that gives a different error message, but I
figure it's worth asking anyway.
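
If you want to rule that out, here's a minimal sketch for checking it (just
an illustration, assuming a Linux node with Python available where the
executors run):

import os
import resource

# Soft and hard limits on open file descriptors for this process
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("RLIMIT_NOFILE: soft=%d, hard=%d" % (soft, hard))

# Rough count of file descriptors currently open by this process (Linux only)
print("open fds: %d" % len(os.listdir("/proc/self/fd")))

If the soft limit is low compared to the number of shuffle files being
written, it might be worth raising, though again I'd normally expect a "Too
many open files" message rather than "No such file or directory".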

The error you reported here is slightly different from the one in your
original post.  This error is from writing the shuffle map output, while the
original error you reported was a FetchFailedException, which comes from
reading the shuffle data on the "reduce" side in the next stage.  Does the
map stage actually finish, even though the tasks are throwing these errors
while writing the map output?  Or do you sometimes get failures on the
shuffle write side, and sometimes on the shuffle read side?  (Not that I
think you are doing anything wrong, but it may help narrow down the root
cause and possibly lead to filing a bug.)
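
Just to be explicit about which side is which, here's a rough PySpark sketch
of the kind of job I mean (hypothetical names, purely for illustration):

from pyspark import SparkContext

sc = SparkContext(appName="shuffle-boundary-sketch")  # hypothetical app name

pairs = sc.parallelize([(k % 100, k) for k in range(10000)])

# Map stage: each task writes its output as shuffle_*.data / shuffle_*.index
# files through the local block manager -- the FileNotFoundException in your
# yarn log happens on this side, while writing.
# Reduce stage: tasks in the next stage fetch those files -- a missing or
# unreadable file there surfaces as a FetchFailedException instead.
n = pairs.sortByKey().count()
print("rows: %d" % n)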

thanks


On Fri, May 22, 2015 at 4:40 AM, Rok Roskar <rokros...@gmail.com> wrote:

> On the worker/container that fails, the "file not found" is the first
> error -- the output below is from the YARN log. There were some Python
> worker crashes for another job/stage earlier (see the warning at 18:36), but
> I expect those to be unrelated to this file-not-found error.
>
>
> ==================================================================================
> LogType:stderr
> Log Upload Time:15-May-2015 18:50:05
> LogLength:5706
> Log Contents:
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in
> [jar:file:/hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/filecache/89/spark-assembly-1.3.1-hadoop2.6.0.jar!/org/slf4
> j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in
> [jar:file:/hadoop/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 15/05/15 18:33:09 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 15/05/15 18:36:37 WARN PythonRDD: Incomplete task interrupted: Attempting
> to kill Python Worker
> 15/05/15 18:50:03 ERROR Executor: Exception in task 319.0 in stage 12.0
> (TID 995)
> java.io.FileNotFoundException:
> /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1047/blockmgr-3c9000cf-11f3
> -44da-9410-99c872a89489/03/shuffle_4_319_0.data (No such file or directory)
>         at java.io.FileOutputStream.open(Native Method)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:212)
>         at
> org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:130)
>         at
> org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:201)
>         at
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$5$$anonfun$apply$2.apply(ExternalSorter.scala:759)
>         at
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$5$$anonfun$apply$2.apply(ExternalSorter.scala:758)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>         at
> org.apache.spark.util.collection.ExternalSorter$IteratorForPartition.foreach(ExternalSorter.scala:823)
>         at
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$5.apply(ExternalSorter.scala:758)
>         at
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$5.apply(ExternalSorter.scala:754)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>         at
> org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:754)
>         at
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:71)
>         at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>         at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>         at org.apache.spark.scheduler.Task.run(Task.scala:64)
>         at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>         at java.lang.Thread.run(Thread.java:722)
> 15/05/15 18:50:04 ERROR DiskBlockManager: Exception while deleting local
> spark dir:
> /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1047/blockmgr-3c9000cf-11f3-44da-9410-99c872a89489
> java.io.IOException: Failed to delete:
> /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1047/blockmgr-3c9000cf-11f3-44da-9410-99c872a89489
>         at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:933)
>         at
> org.apache.spark.storage.DiskBlockManager$$anonfun$org$apache$spark$storage$DiskBlockManager$$doStop$1.apply(DiskBlockManager.scala:165)
>         at
> org.apache.spark.storage.DiskBlockManager$$anonfun$org$apache$spark$storage$DiskBlockManager$$doStop$1.apply(DiskBlockManager.scala:162)
>         at
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>         at
> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>         at org.apache.spark.storage.DiskBlockManager.org
> $apache$spark$storage$DiskBlockManager$$doStop(DiskBlockManager.scala:162)
>         at
> org.apache.spark.storage.DiskBlockManager.stop(DiskBlockManager.scala:156)
>         at
> org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1208)
>         at org.apache.spark.SparkEnv.stop(SparkEnv.scala:88)
>         at org.apache.spark.executor.Executor.stop(Executor.scala:146)
>         at
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receiveWithLogging$1.applyOrElse(CoarseGrainedExecutorBackend.scala:105)
>         at
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>         at
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>         at
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>         at
> org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:53)
>         at
> org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
>         at
> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>         at
> org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
>         at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>         at
> org.apache.spark.executor.CoarseGrainedExecutorBackend.aroundReceive(CoarseGrainedExecutorBackend.scala:38)
>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>
> On Tue, May 19, 2015 at 3:38 AM, Imran Rashid <iras...@cloudera.com>
> wrote:
>
>> Hi,
>>
>> Can you take a look at the logs and see what the first error you are
>> getting is?  It's possible that the file doesn't exist when that error is
>> produced, but it shows up later -- I've seen similar things happen, but
>> only after there have already been some errors.  But if you see that in
>> the very first error, then I'm not sure what the cause is.  It would be
>> helpful for you to send the logs.
>>
>> Imran
>>
>> On Fri, May 15, 2015 at 10:07 AM, rok <rokros...@gmail.com> wrote:
>>
>>> I am trying to sort a collection of key,value pairs (between several
>>> hundred million and a few billion) and have recently been getting lots of
>>> "FetchFailedException" errors that seem to originate when one of the
>>> executors can't find a temporary shuffle file on disk. E.g.:
>>>
>>> org.apache.spark.shuffle.FetchFailedException:
>>>
>>> /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1044/blockmgr-453473e7-76c2-4a94-85d0-d0b75b515ad6/10/shuffle_0_264_0.index
>>> (No such file or directory)
>>>
>>> This file actually exists:
>>>
>>> > ls -l
>>> >
>>> /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1044/blockmgr-453473e7-76c2-4a94-85d0-d0b75b515ad6/10/shuffle_0_264_0.index
>>>
>>> -rw-r--r-- 1 hadoop hadoop 11936 May 15 16:52
>>>
>>> /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1044/blockmgr-453473e7-76c2-4a94-85d0-d0b75b515ad6/10/shuffle_0_264_0.index
>>>
>>> This error repeats on several executors and is followed by a number of
>>>
>>> org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
>>> location for shuffle 0
>>>
>>> This results in most tasks being lost and executors dying.
>>>
>>> There is plenty of space on all of the appropriate filesystems, so none of
>>> the executors are running out of disk space. I am running this via YARN on
>>> approximately 100 nodes with 2 cores per node. Any thoughts on what might
>>> be causing these errors? Thanks!
>>>
>>>
>>>
>>>
>>
>
