I'm getting a very similar stack trace and have no idea what could be causing it (see below). I've opened a Stack Overflow question: http://stackoverflow.com/questions/24038908/spark-fails-on-big-jobs-with-java-io-ioexception-filesystem-closed
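For anyone searching: the "Filesystem closed" IOException in the trace below is often attributed to Hadoop's shared FileSystem cache — one task closes the cached DFSClient while other tasks on the same executor are still reading through it. A workaround that is sometimes suggested (an assumption on my part, untested against this particular job; the class name and jar are placeholders) is to disable the FileSystem cache for hdfs:// so each client gets its own instance:

```shell
# Sketch only: com.example.MyJob and my-job.jar are placeholders.
# spark.hadoop.* properties are copied into the Hadoop Configuration;
# fs.hdfs.impl.disable.cache=true turns off FileSystem caching for hdfs://.
spark-submit \
  --conf spark.hadoop.fs.hdfs.impl.disable.cache=true \
  --class com.example.MyJob \
  my-job.jar
```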
14/06/02 20:44:04 INFO client.AppClient$ClientActor: Executor updated: app-20140602203435-0020/6 is now FAILED (Command exited with code 137)
14/06/02 20:44:05 INFO cluster.SparkDeploySchedulerBackend: Executor app-20140602203435-0020/6 removed: Command exited with code 137
14/06/02 20:44:05 INFO cluster.SparkDeploySchedulerBackend: Executor 6 disconnected, so removing it
14/06/02 20:44:05 ERROR scheduler.TaskSchedulerImpl: Lost executor 6 on ip-172-31-23-17.ec2.internal: Unknown executor exit code (137) (died from signal 9?)
14/06/02 20:44:05 INFO scheduler.TaskSetManager: Re-queueing tasks for 6 from TaskSet 2.0
14/06/02 20:44:05 WARN scheduler.TaskSetManager: Lost TID 358 (task 2.0:66)
...
14/06/02 21:08:11 INFO cluster.SparkDeploySchedulerBackend: Executor 16 disconnected, so removing it
14/06/02 21:08:11 ERROR scheduler.TaskSchedulerImpl: Lost executor 16 on ip-172-31-28-73.ec2.internal: remote Akka client disassociated
14/06/02 21:08:11 INFO scheduler.TaskSetManager: Re-queueing tasks for 16 from TaskSet 5.5
14/06/02 21:08:11 INFO scheduler.DAGScheduler: Executor lost: 16 (epoch 24)
14/06/02 21:08:11 INFO storage.BlockManagerMasterActor: Trying to remove executor 16 from BlockManagerMaster.
14/06/02 21:08:11 INFO storage.BlockManagerMaster: Removed 16 successfully in removeExecutor
14/06/02 21:08:11 INFO scheduler.Stage: Stage 5 is now unavailable on executor 16 (207/234, false)
14/06/02 21:08:11 INFO client.AppClient$ClientActor: Executor updated: app-20140602203435-0020/16 is now FAILED (Command exited with code 137)
14/06/02 21:08:11 INFO cluster.SparkDeploySchedulerBackend: Executor app-20140602203435-0020/16 removed: Command exited with code 137
14/06/02 21:08:11 ERROR client.AppClient$ClientActor: Master removed our application: FAILED; stopping client
14/06/02 21:08:11 WARN cluster.SparkDeploySchedulerBackend: Disconnected from Spark cluster! Waiting for reconnection...
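A side note on the exit code above: 137 decodes as 128 + 9, i.e. the executor JVM was killed with SIGKILL. On Linux that is very often the kernel OOM killer reclaiming memory. A quick diagnostic sketch (not specific to this cluster):

```shell
# Exit code 137 = 128 + signal number, so the process died from signal 9.
echo $((137 - 128))   # prints 9
kill -l 9             # prints KILL, the name of signal 9
# On the affected worker, the kernel log usually records an OOM kill:
#   dmesg | grep -i "killed process"
```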
14/06/02 21:08:12 INFO scheduler.TaskSchedulerImpl: Ignoring update with state FAILED from TID 1364 because its task set is gone
...
14/06/02 21:08:12 WARN scheduler.TaskSetManager: Loss was due to java.io.IOException
java.io.IOException: Filesystem closed
    at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
    at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:779)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:840)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:159)
    at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:143)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
    at java.io.InputStream.read(InputStream.java:101)
    at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
    at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:209)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:164)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:149)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:27)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:57)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
    at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471)
    at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
    at org.apache.spark.scheduler.Task.run(Task.scala:53)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

In a Spark worker log:

14/06/02 20:26:27 ERROR executor.CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@ip-172-31-22-58.ec2.internal:43224] -> [akka.tcp://spark@ip-172-31-23-17.ec2.internal:35581] disassociated! Shutting down.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-on-yarn-fail-with-IOException-tp450p6919.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.