I'm running into a problem with executors failing, and it's not clear what's causing it. Any suggestions on how to diagnose & fix it would be appreciated.
There are a variety of errors in the logs, and I don't see a consistent triggering error. I've tried varying the number of executors per machine (1/4/16 per 16-core/128GB machine w/200GB free disk) and it still fails.

The relevant code is:

    val reads = fastqAsText.mapPartitionsWithIndex(runner.mapReads(_, _, seqDictBcast.value))
    val result = reads.coalesce(numMachines * coresPerMachine * 4, true).persist(StorageLevel.DISK_ONLY_2)
    log.info("SNAP output DebugString:\n" + result.toDebugString)
    log.info("produced " + result.count + " reads")

The toDebugString output is:

    2014-08-07 18:50:43 INFO SnapInputStage:198 - SNAP output DebugString:
    MappedRDD[10] at coalesce at SnapInputStage.scala:197 (640 partitions)
    CoalescedRDD[9] at coalesce at SnapInputStage.scala:197 (640 partitions)
    ShuffledRDD[8] at coalesce at SnapInputStage.scala:197 (640 partitions)
    MapPartitionsRDD[7] at coalesce at SnapInputStage.scala:197 (10 partitions)
    MapPartitionsRDD[6] at mapPartitionsWithIndex at SnapInputStage.scala:195 (10 partitions)
    MappedRDD[4] at map at SnapInputStage.scala:188 (10 partitions)
    CoalescedRDD[3] at coalesce at SnapInputStage.scala:188 (10 partitions)
    NewHadoopRDD[2] at newAPIHadoopFile at SnapInputStage.scala:182 (3003 partitions)

The 10-partition stage works fine, takes about 1.4 hours, reads 40GB and writes 25GB per task. The next 640-partition stage is where the failures occur.
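For reference, here is a rough sizing of that shuffle, as a sketch only. It assumes numMachines is 10 (inferred from 640 = numMachines * 16 * 4, not stated anywhere) and that the 25GB written per task is shuffle output. The object and field names are made up for illustration.

```scala
// Back-of-the-envelope shuffle sizing from the numbers quoted above.
// Assumptions (not confirmed by the logs):
//   - numMachines = 10, inferred from 640 = numMachines * 16 * 4
//   - the 25GB written per map task is all shuffle output
object ShuffleSizing {
  val numMachines = 10            // assumed
  val coresPerMachine = 16        // from the machine description
  val mapTasks = 10               // partitions in the first stage
  val writtenPerMapTaskGB = 25.0  // from the stage statistics

  // Same expression as the coalesce() call.
  val reducePartitions: Int = numMachines * coresPerMachine * 4

  // Total data that has to cross the shuffle boundary.
  val totalShuffleGB: Double = mapTasks * writtenPerMapTaskGB

  // Average input per task in the 640-partition stage.
  val perReducePartitionGB: Double = totalShuffleGB / reducePartitions

  // With shuffle = true, each reduce task may fetch a block from every map task.
  val blockFetches: Int = mapTasks * reducePartitions

  def main(args: Array[String]): Unit = {
    println(s"reduce partitions:       $reducePartitions")
    println(s"total shuffle data (GB): $totalShuffleGB")
    println(s"per reduce task (GB):    $perReducePartitionGB")
    println(s"shuffle block fetches:   $blockFetches")
  }
}
```

Under those assumptions each of the 10 map outputs is served to all 640 reduce tasks, so a single slow or lost executor holding map output could plausibly affect fetches across the whole cluster.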
Here are the first few errors from a recent run (sorted by time):

    work/hpcraviplvm10/app-20140807185713-0000/14/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
    work/hpcraviplvm10/app-20140807185713-0000/27/stderr: 14/08/07 20:32:18 ERROR BlockFetcherIterator$BasicBlockFetcherIterator: Could not get block(s) from ConnectionManagerId(hpcraviplvm1,49545)
    work/hpcraviplvm1/app-20140807185713-0000/9/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
    work/hpcraviplvm2/app-20140807185713-0000/24/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
    work/hpcraviplvm2/app-20140807185713-0000/36/stderr: 14/08/07 20:32:18 ERROR SendingConnection: Exception while reading SendingConnection to ConnectionManagerId(hpcraviplvm1,49545)
    work/hpcraviplvma1/app-20140807185713-0000/26/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
    work/hpcraviplvma2/app-20140807185713-0000/15/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
    work/hpcraviplvma2/app-20140807185713-0000/18/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
    work/hpcraviplvma2/app-20140807185713-0000/23/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
    work/hpcraviplvma2/app-20140807185713-0000/33/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found

Thanks,
Ravi Pandya
Microsoft Research

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Lost-executors-tp11722.html