Hi, I am calling a C++ library on Spark through JNI. Occasionally the C++ library crashes the JVM. The task terminates on the master, but the driver never returns, and I am not sure why. I also notice that after such an occurrence I lose some workers permanently. I have a few questions:
1) Why does the driver not terminate? Is this because some JVMs are still in a zombie or inconsistent state?

2) Can anything be done to prevent this? (A rough sketch of the one idea I have is at the end of this message.)

3) Is there a mode in Spark where I can ignore failures and still collect the results from the successful tasks? This would be a hugely useful feature, as I am using Spark to run regression tests on this native library; just collecting the successful results would be of huge benefit. (Again, a sketch of what I mean is at the end of this message.)

Deenar

I see the following messages in the driver:

1) Initial errors

14/04/11 18:13:21 INFO AppClient$ClientActor: Executor updated: app-20140411180619-0011/14 is now FAILED (Command exited with code 134)
14/04/11 18:13:21 INFO SparkDeploySchedulerBackend: Executor app-20140411180619-0011/14 removed: Command exited with code 134
14/04/11 18:13:21 INFO SparkDeploySchedulerBackend: Executor 14 disconnected, so removing it
14/04/11 18:13:21 ERROR TaskSchedulerImpl: Lost executor 14 on lonpldpuappu5.uk.db.com: Unknown executor exit code (134) (died from signal 6?)
14/04/11 18:13:21 INFO TaskSetManager: Re-queueing tasks for 14 from TaskSet 3.0
14/04/11 18:13:21 WARN TaskSetManager: Lost TID 320 (task 3.0:306)
14/04/11 18:13:21 INFO AppClient$ClientActor: Executor added: app-20140411180619-0011/55 on worker-20140409143755-lonpldpuappu5.uk.db.com-58926 (lonpldpuappu5.uk.db.com:58926) with 1 cores
14/04/11 18:13:21 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140411180619-0011/55 on hostPort lonpldpuappu5.uk.db.com:58926 with 1 cores, 12.0 GB RAM
14/04/11 18:13:21 INFO AppClient$ClientActor: Executor updated: app-20140411180619-0011/55 is now RUNNING
14/04/11 18:13:21 INFO TaskSetManager: Starting task 3.0:306 as TID 352 on executor 4: lonpldpuappu5.uk.db.com (NODE_LOCAL)

2) Application stopped

14/04/11 18:13:37 ERROR AppClient$ClientActor: Master removed our application: FAILED; stopping client
14/04/11 18:13:37 WARN SparkDeploySchedulerBackend: Disconnected from Spark cluster! Waiting for reconnection...
14/04/11 18:13:37 INFO TaskSetManager: Starting task 3.0:386 as TID 433 on executor 58: lonpldpuappu5.uk.db.com (NODE_LOCAL)
14/04/11 18:13:37 INFO TaskSetManager: Serialized task 3.0:386 as 18244 bytes in 0 ms
14/04/11 18:13:37 INFO TaskSetManager: Starting task 3.0:409 as TID 434 on executor 39: lonpldpuappu5.uk.db.com (NODE_LOCAL)
14/04/11 18:13:37 INFO TaskSetManager: Serialized task 3.0:409 as 18244 bytes in 0 ms
14/04/11 18:13:37 WARN TaskSetManager: Lost TID 425 (task 3.0:400)
14/04/11 18:13:37 WARN TaskSetManager: Loss was due to java.io.IOException
java.io.IOException: Filesystem closed
    at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:629)
    at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:735)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:793)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:601)
    at java.io.DataInputStream.readByte(DataInputStream.java:265)
    at org.apache.hadoop.io.SequenceFile$Reader.sync(SequenceFile.java:2624)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:54)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:64)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:156)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:33)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:33)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:33)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:33)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:32)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:33)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:32)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
    at org.apache.spark.scheduler.Task.run(Task.scala:53)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

3) After this, the driver keeps running with the following messages and never terminates

14/04/11 18:13:37 WARN TaskSetManager: Lost TID 399 (task 3.0:378)
14/04/11 18:13:37 INFO TaskSetManager: Loss was due to java.io.IOException: Filesystem closed [duplicate 9]
14/04/11 18:13:37 INFO SparkDeploySchedulerBackend: Executor 63 disconnected, so removing it
14/04/11 18:13:37 ERROR TaskSchedulerImpl: Lost executor 63 on lonpldpuappu5.uk.db.com: remote Akka client disassociated
14/04/11 18:13:37 INFO TaskSetManager: Re-queueing tasks for 63 from TaskSet 3.0
14/04/11 18:13:37 WARN TaskSetManager: Lost TID 431 (task 3.0:405)
14/04/11 18:13:37 INFO SparkDeploySchedulerBackend: Executor 62 disconnected, so removing it
14/04/11 18:13:37 INFO DAGScheduler: Executor lost: 63 (epoch 10)
14/04/11 18:13:37 ERROR TaskSchedulerImpl: Lost executor 62 on lonpldpuappu6.uk.db.com: remote Akka client disassociated
14/04/11 18:13:37 INFO TaskSetManager: Re-queueing tasks for 62 from TaskSet 3.0
14/04/11 18:13:37 INFO BlockManagerMasterActor: Trying to remove executor 63 from BlockManagerMaster.
14/04/11 18:13:37 INFO BlockManagerMaster: Removed 63 successfully in removeExecutor
14/04/11 18:13:37 WARN TaskSetManager: Lost TID 428 (task 3.0:402)
14/04/11 18:13:37 INFO DAGScheduler: Executor lost: 62 (epoch 11)
14/04/11 18:13:37 INFO SparkDeploySchedulerBackend: Executor 6 disconnected, so removing it
14/04/11 18:13:37 ERROR TaskSchedulerImpl: Lost executor 6 on lonpldpuappu11.uk.db.com: remote Akka client disassociated
14/04/11 18:13:37 INFO BlockManagerMasterActor: Trying to remove executor 62 from BlockManagerMaster.
14/04/11 18:13:37 INFO TaskSetManager: Re-queueing tasks for 6 from TaskSet 3.0
14/04/11 18:13:37 INFO BlockManagerMaster: Removed 62 successfully in removeExecutor
14/04/11 18:13:37 INFO DAGScheduler: Executor lost: 6 (epoch 12)
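For question 3, here is a rough sketch of the kind of partial-result collection I mean; NativeLib.run is a hypothetical stand-in for my JNI wrapper. Wrapping each record in scala.util.Try turns a Java-level exception into data I can filter on, although I realise it cannot catch a native abort: a SIGABRT (exit code 134 = 128 + 6, as in the logs above) kills the whole executor JVM before anything can be caught.

import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Try

object NativeLib {
  // Hypothetical stand-in for the JNI wrapper around the C++ library.
  def run(input: String): String = sys.error("stub")
}

object RegressionHarness {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("native-regression"))

    // Wrap every native call in Try so a thrown exception becomes data
    // rather than a task failure; tasks that crash the JVM still fail.
    val outcomes = sc.textFile("hdfs:///tests/inputs")
      .map(line => (line, Try(NativeLib.run(line))))

    // Keep only the inputs whose native call returned normally.
    val successes = outcomes
      .filter { case (_, result) => result.isSuccess }
      .map { case (line, result) => (line, result.get) }
      .collect()

    successes.foreach { case (in, out) => println(s"$in -> $out") }
    sc.stop()
  }
}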
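And for question 2, the only prevention I can think of is to isolate each native invocation in a child process, so an abort kills the child rather than the executor JVM. A minimal sketch, assuming a hypothetical native-runner binary that wraps the C++ library:

// Run the native library in a child process. If it aborts, only the
// child dies; the parent sees exit code 134 (128 + SIGABRT) and can
// record the input as a failure instead of losing the executor.
def runIsolated(input: String): Either[Int, String] = {
  val pb = new ProcessBuilder("native-runner", input)
  pb.redirectErrorStream(true) // merge stderr into stdout
  val child = pb.start()
  val output = scala.io.Source.fromInputStream(child.getInputStream).mkString
  val exitCode = child.waitFor()
  if (exitCode == 0) Right(output) else Left(exitCode)
}

Each map task would then call runIsolated instead of the JNI method directly, at the cost of a process launch per record (or per partition, if the wrapper can read many inputs at once). Is there a better-supported way to do this?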