Do you guys have dynamic allocation turned on for YARN?
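
I ask because with dynamic allocation, idle Executors get torn down, and unless the external shuffle service is serving their output, their shuffle files disappear with them. Roughly the settings I have in mind (the executor counts below are placeholders, not a recommendation):

    import org.apache.spark.SparkConf

    // Sketch only: dynamic allocation on YARN, with the external shuffle
    // service so shuffle files outlive the Executors that wrote them.
    val conf = new SparkConf()
      .set("spark.shuffle.service.enabled", "true")
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "10")   // placeholder
      .set("spark.dynamicAllocation.maxExecutors", "200")  // placeholder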

Anders, was Task 450 in your job acting like a Reducer and fetching the Map
spill output data from a different node?

If a Reducer task can't fetch the remote map output it needs, that fetch
failure can fail the stage, and since a wide dependency is involved, Spark
sometimes has to re-compute the previous (map) stage as well to regenerate
the missing output.
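
To make that concrete, here's a toy sketch (the app name and paths are made up) where the reduceByKey step is the kind of wide dependency I mean; its reduce tasks fetch map output from other nodes, and if that output is gone, Spark has to resubmit the map stage that produced it:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._  // for reduceByKey on pair RDDs (needed on 1.2)

    object WideDependencySketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("wide-dep-sketch"))
        val pairs = sc.textFile("hdfs:///tmp/example-input")   // made-up path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1L))                             // narrow: stays within the map stage
        // Wide dependency: reduce tasks fetch the map-side spill output over
        // the network; a lost block surfaces as a fetch failure in this stage.
        val counts = pairs.reduceByKey(_ + _)
        counts.saveAsTextFile("hdfs:///tmp/example-output")    // made-up path
        sc.stop()
      }
    }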

But like Petar said, if you turn the external shuffle service on, the YARN
NodeManager process on the slave machines will serve the map spill data
instead of the Executor JVMs. (By default, with external shuffle off, each
Executor JVM serves its own shuffle data, which causes problems if that
Executor dies.)
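
Concretely, turning it on is one Spark setting plus registering the shuffle service as a NodeManager aux-service; this is my understanding of the YARN side, so double-check it against the docs for your build:

    import org.apache.spark.SparkConf

    // Spark side: hand shuffle file serving over to the external service.
    val conf = new SparkConf()
      .set("spark.shuffle.service.enabled", "true")

    // YARN side (yarn-site.xml on every NodeManager), roughly:
    //   yarn.nodemanager.aux-services                     -> add spark_shuffle
    //   yarn.nodemanager.aux-services.spark_shuffle.class -> org.apache.spark.network.yarn.YarnShuffleService
    // plus the spark-<version>-yarn-shuffle jar on the NodeManager classpath,
    // then restart the NodeManagers.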

Corey, how often are Executors crashing in your app? How many Executors do
you have in total, and how much memory does each one get? You can change what
fraction of the Executor heap is used for your user code vs. the shuffle vs.
RDD caching with the spark.storage.memoryFraction and
spark.shuffle.memoryFraction settings.
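
For reference, these are the knobs I mean (the defaults I remember are 0.6 for storage and 0.2 for shuffle; treat the numbers below as something to experiment with, not a recommendation):

    import org.apache.spark.SparkConf

    // Sketch: how the Executor heap gets split. Whatever the two fractions
    // don't claim is what's left over for your user code.
    val conf = new SparkConf()
      .set("spark.executor.memory", "8g")           // placeholder size
      .set("spark.storage.memoryFraction", "0.4")   // RDD cache share (default 0.6)
      .set("spark.shuffle.memoryFraction", "0.3")   // shuffle aggregation share (default 0.2)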

On Sat, Feb 21, 2015 at 2:58 PM, Petar Zecevic <petar.zece...@gmail.com>
wrote:

>
> Could you try to turn on the external shuffle service?
>
> spark.shuffle.service.enabled = true
>
>
> On 21.2.2015. 17:50, Corey Nolet wrote:
>
> I'm experiencing the same issue. Upon closer inspection, I'm noticing that
> executors are being lost as well. Thing is, I can't figure out how they are
> dying. I'm using MEMORY_AND_DISK_SER and I've got over 1.3 TB of memory
> allocated for the application. I was thinking it was possible that a single
> executor was getting one or a couple of large partitions, but shouldn't the
> disk persistence kick in at that point?
>
> On Sat, Feb 21, 2015 at 11:20 AM, Anders Arpteg <arp...@spotify.com>
> wrote:
>
>> For large jobs, the following error message is shown, which seems to
>> indicate that shuffle files are missing for some reason. It's a rather
>> large job with many partitions. If the data size is reduced, the problem
>> disappears. I'm running a build from Spark master post 1.2 (built on
>> 2015-01-16) and running on YARN 2.2. Any idea how to resolve this
>> problem?
>>
>>  User class threw exception: Job aborted due to stage failure: Task 450
>> in stage 450.1 failed 4 times, most recent failure: Lost task 450.3 in
>> stage 450.1 (TID 167370, lon4-hadoopslave-b77.lon4.spotify.net):
>> java.io.FileNotFoundException:
>> /disk/hd06/yarn/local/usercache/arpteg/appcache/application_1424333823218_21217/spark-local-20150221154811-998c/03/rdd_675_450
>> (No such file or directory)
>>  at java.io.FileOutputStream.open(Native Method)
>>  at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
>>  at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
>>  at org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:76)
>>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:786)
>>  at
>> org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:637)
>>  at
>> org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:149)
>>  at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:74)
>>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>  at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
>>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
>>  at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>>  at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>  at org.apache.spark.scheduler.Task.run(Task.scala:64)
>>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:192)
>>  at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>
>>  at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>
>>  at java.lang.Thread.run(Thread.java:745)
>>
>>  TIA,
>> Anders
>>
>>
>
>
