Hi Joe, you might increase spark.yarn.executor.memoryOverhead to see if it fixes the problem. Please take a look at this report: https://issues.apache.org/jira/browse/SPARK-4996
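For reference, a minimal sketch of how that setting can be passed in (the 2048 MB value, class name and jar are placeholders, not recommendations):

```shell
# Sketch only: raise the off-heap overhead YARN reserves per executor.
# The value is in MB; 2048 here is just a placeholder to experiment with.
spark-submit \
  --master yarn-cluster \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  --class com.example.YourApp \
  your-app.jar

# Or persistently, in conf/spark-defaults.conf:
#   spark.yarn.executor.memoryOverhead   2048
```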
Hope this helps.

On Tue, Feb 24, 2015 at 2:05 PM, Yiannis Gkoufas <johngou...@gmail.com> wrote:

> No problem, Joe. There you go
> https://issues.apache.org/jira/browse/SPARK-5081
> And also there is this one
> https://issues.apache.org/jira/browse/SPARK-5715 which is marked as
> resolved
>
> On 24 February 2015 at 21:51, Joe Wass <jw...@crossref.org> wrote:
>
>> Thanks everyone.
>>
>> Yiannis, do you know if there's a bug report for this regression? For
>> some other (possibly connected) reason I upgraded from 1.1.1 to 1.2.1,
>> but I can't remember what the bug was.
>>
>> Joe
>>
>> On 24 February 2015 at 19:26, Yiannis Gkoufas <johngou...@gmail.com> wrote:
>>
>>> Hi there,
>>>
>>> I assume you are using Spark 1.2.1, right?
>>> I faced the exact same issue and switched to 1.1.1 with the same
>>> configuration, and it was solved.
>>>
>>> On 24 Feb 2015 19:22, "Ted Yu" <yuzhih...@gmail.com> wrote:
>>>
>>>> Here is a tool which may give you some clue:
>>>> http://file-leak-detector.kohsuke.org/
>>>>
>>>> Cheers
>>>>
>>>> On Tue, Feb 24, 2015 at 11:04 AM, Vladimir Rodionov <vrodio...@splicemachine.com> wrote:
>>>>
>>>>> Usually this happens on Linux when an application deletes a file
>>>>> without double-checking that there are no open FDs (a resource leak).
>>>>> In that case, Linux keeps the space allocated and does not release it
>>>>> until the application exits (crashes, in your case). You check the
>>>>> file system and everything looks normal: you have enough space, and
>>>>> you have no idea why the application reports "no space left on
>>>>> device".
>>>>>
>>>>> Just a guess.
>>>>>
>>>>> -Vladimir Rodionov
>>>>>
>>>>> On Tue, Feb 24, 2015 at 8:34 AM, Joe Wass <jw...@crossref.org> wrote:
>>>>>
>>>>>> I'm running a cluster of 3 Amazon EC2 machines (a small number
>>>>>> because it's expensive when experiments keep crashing after a day!).
>>>>>>
>>>>>> Today's crash looks like this (stack trace at end of message).
>>>>>> org.apache.spark.shuffle.MetadataFetchFailedException: Missing an
>>>>>> output location for shuffle 0
>>>>>>
>>>>>> On my three nodes, I have plenty of space and inodes:
>>>>>>
>>>>>> A $ df -i
>>>>>> Filesystem     Inodes IUsed     IFree IUse% Mounted on
>>>>>> /dev/xvda1     524288 97937    426351   19% /
>>>>>> tmpfs         1909200     1   1909199    1% /dev/shm
>>>>>> /dev/xvdb     2457600    54   2457546    1% /mnt
>>>>>> /dev/xvdc     2457600    24   2457576    1% /mnt2
>>>>>> /dev/xvds   831869296 23844 831845452    1% /vol0
>>>>>>
>>>>>> A $ df -h
>>>>>> Filesystem  Size Used Avail Use% Mounted on
>>>>>> /dev/xvda1  7.9G 3.4G  4.5G  44% /
>>>>>> tmpfs       7.3G    0  7.3G   0% /dev/shm
>>>>>> /dev/xvdb    37G 1.2G   34G   4% /mnt
>>>>>> /dev/xvdc    37G 177M   35G   1% /mnt2
>>>>>> /dev/xvds  1000G 802G  199G  81% /vol0
>>>>>>
>>>>>> B $ df -i
>>>>>> Filesystem     Inodes IUsed     IFree IUse% Mounted on
>>>>>> /dev/xvda1     524288 97947    426341   19% /
>>>>>> tmpfs         1906639     1   1906638    1% /dev/shm
>>>>>> /dev/xvdb     2457600    54   2457546    1% /mnt
>>>>>> /dev/xvdc     2457600    24   2457576    1% /mnt2
>>>>>> /dev/xvds   816200704 24223 816176481    1% /vol0
>>>>>>
>>>>>> B $ df -h
>>>>>> Filesystem  Size Used Avail Use% Mounted on
>>>>>> /dev/xvda1  7.9G 3.6G  4.3G  46% /
>>>>>> tmpfs       7.3G    0  7.3G   0% /dev/shm
>>>>>> /dev/xvdb    37G 1.2G   34G   4% /mnt
>>>>>> /dev/xvdc    37G 177M   35G   1% /mnt2
>>>>>> /dev/xvds  1000G 805G  195G  81% /vol0
>>>>>>
>>>>>> C $ df -i
>>>>>> Filesystem     Inodes IUsed     IFree IUse% Mounted on
>>>>>> /dev/xvda1     524288 97938    426350   19% /
>>>>>> tmpfs         1906897     1   1906896    1% /dev/shm
>>>>>> /dev/xvdb     2457600    54   2457546    1% /mnt
>>>>>> /dev/xvdc     2457600    24   2457576    1% /mnt2
>>>>>> /dev/xvds   755218352 24024 755194328    1% /vol0
>>>>>>
>>>>>> C $ df -h
>>>>>> Filesystem  Size Used Avail Use% Mounted on
>>>>>> /dev/xvda1  7.9G 3.4G  4.5G  44% /
>>>>>> tmpfs       7.3G    0  7.3G   0% /dev/shm
>>>>>> /dev/xvdb    37G 1.2G   34G   4% /mnt
>>>>>> /dev/xvdc    37G 177M   35G   1% /mnt2
>>>>>> /dev/xvds  1000G 820G  181G  82% /vol0
>>>>>>
>>>>>> The devices may be ~80% full but that still leaves ~200G free on
>>>>>> each. My spark-env.sh has
>>>>>>
>>>>>> export SPARK_LOCAL_DIRS="/vol0/spark"
>>>>>>
>>>>>> I have manually verified that on each slave the only temporary files
>>>>>> are stored on /vol0, all looking something like this:
>>>>>>
>>>>>> /vol0/spark/spark-f05d407c/spark-fca3e573/spark-78c06215/spark-4f0c4236/20/rdd_8_884
>>>>>>
>>>>>> So it looks like all the files are being stored on the large drives
>>>>>> (incidentally they're AWS EBS volumes, but that's the only way to get
>>>>>> enough storage). My process crashed before with a slightly different
>>>>>> exception under the same circumstances: kryo.KryoException:
>>>>>> java.io.IOException: No space left on device
>>>>>>
>>>>>> These both happen after several hours and several GB of temporary
>>>>>> files.
>>>>>>
>>>>>> Why does Spark think it's run out of space?
>>>>>>
>>>>>> TIA
>>>>>>
>>>>>> Joe
>>>>>>
>>>>>> Stack trace 1:
>>>>>>
>>>>>> org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
>>>>>>   at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:384)
>>>>>>   at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:381)
>>>>>>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>>>>>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>>>>>   at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>>>>>>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>>>>>>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>>>>>>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>>>>>>   at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:380)
>>>>>>   at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:176)
>>>>>>   at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
>>>>>>   at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
>>>>>>   at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
>>>>>>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>>>>>>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>>>>>>   at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:93)
>>>>>>   at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:92)
>>>>>>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>>>>>>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>>>>>   at org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:109)
>>>>>>   at org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1177)
>>>>>>   at org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:78)
>>>>>>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:787)
>>>>>>   at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638)
>>>>>>   at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:145)
>>>>>>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
>>>>>>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:245)
>>>>>>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>>>>>>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>>>>>>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>>>>>>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>>>>>>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>>>>>>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>>>>>>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>>>>>>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>>>>>>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>>>>>>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>>>>>>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>>>>>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>>>>>>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
>>>>>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>   at java.lang.Thread.run(Thread.java:745)
>>>>>>
>>>>>> Stack trace 2:
>>>>>>
>>>>>> 15/02/22 02:47:08 WARN scheduler.TaskSetManager: Lost task 282.0 in
>>>>>> stage 25.1 (TID 22644): com.esotericsoftware.kryo.KryoException: java.io.IOException: No space left on device
>>>>>>   at com.esotericsoftware.kryo.io.Output.flush(Output.java:157)
>>>>>>   at com.esotericsoftware.kryo.io.Output.require(Output.java:135)
>>>>>>   at com.esotericsoftware.kryo.io.Output.writeAscii_slow(Output.java:446)
>>>>>>   at com.esotericsoftware.kryo.io.Output.writeString(Output.java:306)
>>>>>>   at com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.write(DefaultSerializers.java:153)
>>>>>>   at com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.write(DefaultSerializers.java:146)
>>>>>>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
>>>>>>   at carbonite.serializer$print_collection.invoke(serializer.clj:41)
>>>>>>   at clojure.lang.Var.invoke(Var.java:423)
>>>>>>   at carbonite.ClojureCollSerializer.write(ClojureCollSerializer.java:19)
>>>>>>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
>>>>>>   at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:130)
>>>>>>   at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:195)
>>>>>>   at org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:303)
>>>>>>   at org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:254)
>>>>>>   at org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:83)
>>>>>>   at org.apache.spark.util.collection.Spillable$class.maybeSpill(Spillable.scala:87)
>>>>>>   at org.apache.spark.util.collection.ExternalSorter.maybeSpill(ExternalSorter.scala:83)
>>>>>>   at org.apache.spark.util.collection.ExternalSorter.maybeSpillCollection(ExternalSorter.scala:237)
>>>>>>   at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:206)
>>>>>>   at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:56)
>>>>>>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>>>>>>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>>>>>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>>>>>>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
>>>>>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>   at java.lang.Thread.run(Thread.java:745)
>>>>>> Caused by: java.io.IOException: No space left on device
>>>>>>   at java.io.FileOutputStream.writeBytes(Native Method)
>>>>>>   at java.io.FileOutputStream.write(FileOutputStream.java:345)
>>>>>>   at org.apache.spark.storage.DiskBlockObjectWriter$TimeTrackingOutputStream$$anonfun$write$3.apply$mcV$sp(BlockObjectWriter.scala:86)
>>>>>>   at org.apache.spark.storage.DiskBlockObjectWriter.org$apache$spark$storage$DiskBlockObjectWriter$$callWithTiming(BlockObjectWriter.scala:221)
>>>>>>   at org.apache.spark.storage.DiskBlockObjectWriter$TimeTrackingOutputStream.write(BlockObjectWriter.scala:86)
>>>>>>   at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
>>>>>>   at org.xerial.snappy.SnappyOutputStream.dumpOutput(SnappyOutputStream.java:300)
>>>>>>   at org.xerial.snappy.SnappyOutputStream.rawWrite(SnappyOutputStream.java:247)
>>>>>>   at org.xerial.snappy.SnappyOutputStream.write(SnappyOutputStream.java:107)
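
For anyone hitting this later: Vladimir's deleted-but-still-open-file scenario is easy to reproduce and to check for on a running node. A small sketch, Linux-only, using /proc and lsof; nothing here is Spark-specific:

```shell
# Reproduce the effect: space held by a file that was unlinked while still open.
tmp=$(mktemp)
exec 3>"$tmp"        # keep file descriptor 3 open on the file
echo "some data" >&3
rm "$tmp"            # unlink it while FD 3 is still open
# The kernel keeps the blocks allocated; /proc shows the FD as "(deleted)":
ls -l /proc/$$/fd/3
# To hunt for such files cluster-wide, lsof +L1 lists open files whose
# link count is below 1, i.e. deleted but still held open:
# lsof +L1
exec 3>&-            # only when the last FD closes is the space freed
```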