Yes, I've had errors with too many open files before, but that doesn't seem
to be the case here.
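
For what it's worth, this is roughly how I'd double-check the per-process
open-file limit on the executors (a rough sketch from the pyspark shell,
assuming the usual sc):

    def nofile_limit(_):
        # runs on the executor: report hostname plus soft/hard RLIMIT_NOFILE
        import resource, socket
        soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        return [(socket.gethostname(), soft, hard)]

    # enough small tasks that most (ideally all) executors get sampled
    limits = set(sc.parallelize(range(500), 500)
                   .mapPartitions(nofile_limit).collect())
    for host, soft, hard in sorted(limits):
        print("%s soft=%d hard=%d" % (host, soft, hard))

If the soft limit came back low (1024 is a common default), that would point
back at the open-files theory.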

Hmm, you're right that these errors are different from what I initially
reported -- I think I assumed that the failure to write caused the worker to
crash, which in turn resulted in the failed fetch. I'll see if I can make
more sense of it from the logs.

On Fri, May 22, 2015 at 9:29 PM, Imran Rashid <iras...@cloudera.com> wrote:

> hmm, sorry, I think that disproves my theory.  Nothing else is immediately
> coming to mind.  It's possible there is more info in the logs from the
> driver; it couldn't hurt to send those (though I don't have high hopes of
> finding anything that way).  Any chance this could be from too many open
> files or something?  Normally that produces a different error message, but
> I figure it's worth asking anyway.
>
> The error you reported here was slightly different from your original
> post.  This error is from writing the shuffle map output, while the
> original error you reported was a fetch failed, which is from reading the
> shuffle data on the "reduce" side in the next stage.  Does the map stage
> actually finish, even though the tasks are throwing these errors while
> writing the map output?  Or do you sometimes get failures on the shuffle
> write side, and sometimes on the shuffle read side?  (Not that I think you
> are doing anything wrong, but it may help narrow down the root cause and
> possibly file a bug.)
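>
> To make the distinction concrete, in a toy PySpark job like this (just an
> illustration, not your code, assuming the pyspark shell's sc):
>
>     pairs = sc.parallelize(range(1000)).map(lambda i: (i % 10, i))
>     counts = pairs.reduceByKey(lambda a, b: a + b).collect()
>
> the first stage writes the shuffle map output (the shuffle_*.data and
> shuffle_*.index files under the block manager directory) and the second
> stage fetches it, which is where a FetchFailedException would surface.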
>
> thanks
>
>
> On Fri, May 22, 2015 at 4:40 AM, Rok Roskar <rokros...@gmail.com> wrote:
>
>> On the worker/container that fails, the "file not found" is the first
>> error -- the output below is from the YARN log. There were some Python
>> worker crashes for another job/stage earlier (see the warning at 18:36),
>> but I expect those to be unrelated to this file-not-found error.
>>
>>
>> ==================================================================================
>> LogType:stderr
>> Log Upload Time:15-May-2015 18:50:05
>> LogLength:5706
>> Log Contents:
>> SLF4J: Class path contains multiple SLF4J bindings.
>> SLF4J: Found binding in
>> [jar:file:/hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/filecache/89/spark-assembly-1.3.1-hadoop2.6.0.jar!/org/slf4
>> j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in
>> [jar:file:/hadoop/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
>> explanation.
>> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>> 15/05/15 18:33:09 WARN NativeCodeLoader: Unable to load native-hadoop
>> library for your platform... using builtin-java classes where applicable
>> 15/05/15 18:36:37 WARN PythonRDD: Incomplete task interrupted: Attempting
>> to kill Python Worker
>> 15/05/15 18:50:03 ERROR Executor: Exception in task 319.0 in stage 12.0
>> (TID 995)
>> java.io.FileNotFoundException:
>> /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1047/blockmgr-3c9000cf-11f3
>> -44da-9410-99c872a89489/03/shuffle_4_319_0.data (No such file or
>> directory)
>>         at java.io.FileOutputStream.open(Native Method)
>>         at java.io.FileOutputStream.<init>(FileOutputStream.java:212)
>>         at
>> org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:130)
>>         at
>> org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:201)
>>         at
>> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$5$$anonfun$apply$2.apply(ExternalSorter.scala:759)
>>         at
>> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$5$$anonfun$apply$2.apply(ExternalSorter.scala:758)
>>         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>         at
>> org.apache.spark.util.collection.ExternalSorter$IteratorForPartition.foreach(ExternalSorter.scala:823)
>>         at
>> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$5.apply(ExternalSorter.scala:758)
>>         at
>> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$5.apply(ExternalSorter.scala:754)
>>         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>         at
>> org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:754)
>>         at
>> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:71)
>>         at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>>         at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>         at org.apache.spark.scheduler.Task.run(Task.scala:64)
>>         at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>>         at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>>         at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>         at java.lang.Thread.run(Thread.java:722)
>> 15/05/15 18:50:04 ERROR DiskBlockManager: Exception while deleting local
>> spark dir:
>> /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1047/blockmgr-3c9000cf-11f3-44da-9410-99c872a89489
>> java.io.IOException: Failed to delete:
>> /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1047/blockmgr-3c9000cf-11f3-44da-9410-99c872a89489
>>         at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:933)
>>         at
>> org.apache.spark.storage.DiskBlockManager$$anonfun$org$apache$spark$storage$DiskBlockManager$$doStop$1.apply(DiskBlockManager.scala:165)
>>         at
>> org.apache.spark.storage.DiskBlockManager$$anonfun$org$apache$spark$storage$DiskBlockManager$$doStop$1.apply(DiskBlockManager.scala:162)
>>         at
>> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>>         at
>> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>>         at org.apache.spark.storage.DiskBlockManager.org
>> $apache$spark$storage$DiskBlockManager$$doStop(DiskBlockManager.scala:162)
>>         at
>> org.apache.spark.storage.DiskBlockManager.stop(DiskBlockManager.scala:156)
>>         at
>> org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1208)
>>         at org.apache.spark.SparkEnv.stop(SparkEnv.scala:88)
>>         at org.apache.spark.executor.Executor.stop(Executor.scala:146)
>>         at
>> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receiveWithLogging$1.applyOrElse(CoarseGrainedExecutorBackend.scala:105)
>>         at
>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>>         at
>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>>         at
>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>>         at
>> org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:53)
>>         at
>> org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
>>         at
>> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>>         at
>> org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
>>         at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>>         at
>> org.apache.spark.executor.CoarseGrainedExecutorBackend.aroundReceive(CoarseGrainedExecutorBackend.scala:38)
>>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>>
>> On Tue, May 19, 2015 at 3:38 AM, Imran Rashid <iras...@cloudera.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Can you take a look at the logs and see what the first error you are
>>> getting is?  It's possible that the file doesn't exist at the moment that
>>> error is produced but shows up later -- I've seen similar things happen,
>>> but only after there have already been some other errors.  If that is the
>>> very first error, though, I'm not sure what the cause is.  It would be
>>> helpful if you could send the logs.
>>>
>>> Imran
>>>
>>> On Fri, May 15, 2015 at 10:07 AM, rok <rokros...@gmail.com> wrote:
>>>
>>>> I am trying to sort a collection of key,value pairs (between several
>>>> hundred million and a few billion) and have recently been getting lots of
>>>> "FetchFailedException" errors that seem to occur when one of the
>>>> executors can't find a temporary shuffle file on disk. E.g.:
>>>>
>>>> org.apache.spark.shuffle.FetchFailedException:
>>>>
>>>> /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1044/blockmgr-453473e7-76c2-4a94-85d0-d0b75b515ad6/10/shuffle_0_264_0.index
>>>> (No such file or directory)
>>>>
>>>> This file actually exists:
>>>>
>>>> > ls -l /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1044/blockmgr-453473e7-76c2-4a94-85d0-d0b75b515ad6/10/shuffle_0_264_0.index
>>>>
>>>> -rw-r--r-- 1 hadoop hadoop 11936 May 15 16:52 /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1044/blockmgr-453473e7-76c2-4a94-85d0-d0b75b515ad6/10/shuffle_0_264_0.index
>>>>
>>>> This error repeats on several executors and is followed by a number of
>>>>
>>>> org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
>>>> location for shuffle 0
>>>>
>>>> This results in most tasks being lost and executors dying.
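>>>>
>>>> The job itself is essentially sorting a pair RDD by key; a simplified,
>>>> self-contained sketch (not the actual code) looks something like:
>>>>
>>>>     pairs = sc.parallelize(range(1000000)).map(lambda i: (i % 1000, i))
>>>>     result = pairs.sortByKey()  # the map stage writes shuffle output; the next stage fetches it
>>>>     result.count()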
>>>>
>>>> There is plenty of space on all of the appropriate filesystems, so none
>>>> of the executors are running out of disk space. I am running this via
>>>> YARN on approximately 100 nodes with 2 cores per node. Any thoughts on
>>>> what might be causing these errors? Thanks!
>>>>
>>>>
>>>>
>>>
>>
>
