I'm running into a problem with executors failing, and it's not clear what's
causing it. Any suggestions on how to diagnose & fix it would be
appreciated.

There are a variety of errors in the logs, and I don't see a consistent
triggering error. I've tried varying the number of executors per machine
(1/4/16 per 16-core/128GB machine w/200GB free disk) and it still fails.

The relevant code is:
    val reads = fastqAsText.mapPartitionsWithIndex(runner.mapReads(_, _, seqDictBcast.value))
    val result = reads.coalesce(numMachines * coresPerMachine * 4, true).persist(StorageLevel.DISK_ONLY_2)
    log.info("SNAP output DebugString:\n" + result.toDebugString)
    log.info("produced " + result.count + " reads")

The toDebugString output is:
2014-08-07 18:50:43 INFO  SnapInputStage:198 - SNAP output DebugString:
MappedRDD[10] at coalesce at SnapInputStage.scala:197 (640 partitions)
  CoalescedRDD[9] at coalesce at SnapInputStage.scala:197 (640 partitions)
    ShuffledRDD[8] at coalesce at SnapInputStage.scala:197 (640 partitions)
      MapPartitionsRDD[7] at coalesce at SnapInputStage.scala:197 (10 partitions)
        MapPartitionsRDD[6] at mapPartitionsWithIndex at SnapInputStage.scala:195 (10 partitions)
          MappedRDD[4] at map at SnapInputStage.scala:188 (10 partitions)
            CoalescedRDD[3] at coalesce at SnapInputStage.scala:188 (10 partitions)
              NewHadoopRDD[2] at newAPIHadoopFile at SnapInputStage.scala:182 (3003 partitions)

The 10-partition stage works fine, takes about 1.4 hours, reads 40GB and
writes 25GB per task. The next 640-partition stage is where the failures
occur.
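
As a rough sizing check on the figures above, the following sketch works out how much data the 640-partition stage has to fetch. It is back-of-the-envelope arithmetic, not measured output, and it rests on two assumptions that the post does not state outright: that the 25GB written per task is shuffle output, and that coresPerMachine is 16 (matching the 16-core machines mentioned earlier). The object and value names are made up for illustration.

```scala
// Back-of-the-envelope shuffle sizing from the numbers quoted in the post.
object ShuffleSizing {
  val upstreamTasks = 10           // partitions in the stage before the shuffle
  val writtenPerTaskGB = 25.0      // ASSUMPTION: this is shuffle write per task
  val downstreamPartitions = 640   // partitions after coalesce(..., true)
  val coresPerMachine = 16         // ASSUMPTION: equals the 16 cores per machine

  // Total data that crosses the shuffle boundary.
  val totalShuffleGB: Double = upstreamTasks * writtenPerTaskGB

  // Average amount each of the 640 downstream tasks has to fetch.
  val perPartitionMB: Double = totalShuffleGB * 1024 / downstreamPartitions

  // numMachines implied by numMachines * coresPerMachine * 4 == 640.
  val impliedMachines: Int = downstreamPartitions / (coresPerMachine * 4)

  def main(args: Array[String]): Unit = {
    println(f"total shuffle data:       $totalShuffleGB%.0f GB")
    println(f"average per partition:    $perPartitionMB%.0f MB")
    println(s"implied number of machines: $impliedMachines")
  }
}
```

Under these assumptions, each downstream task pulls its share from only 10 upstream map outputs, so every fetch is large and concentrated on a few source executors.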

Here are the first few errors from a recent run (sorted by time):
work/hpcraviplvm10/app-20140807185713-0000/14/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
work/hpcraviplvm10/app-20140807185713-0000/27/stderr: 14/08/07 20:32:18 ERROR BlockFetcherIterator$BasicBlockFetcherIterator: Could not get block(s) from ConnectionManagerId(hpcraviplvm1,49545)
work/hpcraviplvm1/app-20140807185713-0000/9/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
work/hpcraviplvm2/app-20140807185713-0000/24/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
work/hpcraviplvm2/app-20140807185713-0000/36/stderr: 14/08/07 20:32:18 ERROR SendingConnection: Exception while reading SendingConnection to ConnectionManagerId(hpcraviplvm1,49545)
work/hpcraviplvma1/app-20140807185713-0000/26/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
work/hpcraviplvma2/app-20140807185713-0000/15/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
work/hpcraviplvma2/app-20140807185713-0000/18/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
work/hpcraviplvma2/app-20140807185713-0000/23/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
work/hpcraviplvma2/app-20140807185713-0000/33/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found

Thanks,

Ravi Pandya
Microsoft Research