I'm experiencing the same issue. On closer inspection, I'm noticing that executors are being lost as well. Thing is, I can't figure out how they are dying. I'm using MEMORY_AND_DISK_SER and I've got over 1.3TB of memory allocated for the application. I was thinking that a single executor might be getting one or a couple of large partitions, but shouldn't the disk persistence kick in at that point?
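One way to test the large-partition theory is to look at how evenly records are spread across partitions. Below is a minimal sketch, not something I have run against this job: `SkewSummary` and `summarizePartitionSizes` are hypothetical helper names, and record counts are only a rough proxy for partition size in bytes.

```scala
// Hypothetical helper (not part of Spark): summarises per-partition record counts.
case class SkewSummary(partitions: Int, total: Long, max: Long, mean: Double) {
  // Ratio of the largest partition to the mean; 1.0 means perfectly even.
  def skewRatio: Double = if (mean == 0.0) 0.0 else max / mean
}

def summarizePartitionSizes(sizes: Seq[Long]): SkewSummary = {
  val total = sizes.sum
  val max = if (sizes.isEmpty) 0L else sizes.max
  val mean = if (sizes.isEmpty) 0.0 else total.toDouble / sizes.size
  SkewSummary(sizes.size, total, max, mean)
}

// With a live RDD named `rdd` (assumed here), the counts could be collected with:
//   val sizes = rdd.mapPartitions(it => Iterator(it.size.toLong)).collect().toSeq
//   println(summarizePartitionSizes(sizes))
```

A skew ratio far above 1 would suggest that a few partitions are much larger than the rest, which would be consistent with individual executors being overloaded.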
On Sat, Feb 21, 2015 at 11:20 AM, Anders Arpteg <arp...@spotify.com> wrote:

> For large jobs, the following error message is shown that seems to
> indicate that shuffle files for some reason are missing. It's a rather
> large job with many partitions. If the data size is reduced, the problem
> disappears. I'm running a build from Spark master post 1.2 (build at
> 2015-01-16) and running on Yarn 2.2. Any idea of how to resolve this
> problem?
>
> User class threw exception: Job aborted due to stage failure: Task 450 in
> stage 450.1 failed 4 times, most recent failure: Lost task 450.3 in stage
> 450.1 (TID 167370, lon4-hadoopslave-b77.lon4.spotify.net):
> java.io.FileNotFoundException: /disk/hd06/yarn/local/usercache/arpteg/appcache/application_1424333823218_21217/spark-local-20150221154811-998c/03/rdd_675_450 (No such file or directory)
>     at java.io.FileOutputStream.open(Native Method)
>     at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
>     at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
>     at org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:76)
>     at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:786)
>     at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:637)
>     at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:149)
>     at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:74)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>     at org.apache.spark.scheduler.Task.run(Task.scala:64)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:192)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
>
> TIA,
> Anders