Same here Ravi. See my post on a similar thread. Are you running on YARN client?

On Aug 7, 2014 2:56 PM, "rpandya" <r...@iecommerce.com> wrote:
> I'm running into a problem with executors failing, and it's not clear what's
> causing it. Any suggestions on how to diagnose & fix it would be appreciated.
>
> There are a variety of errors in the logs, and I don't see a consistent
> triggering error. I've tried varying the number of executors per machine
> (1/4/16 per 16-core/128GB machine w/200GB free disk) and it still fails.
>
> The relevant code is:
> val reads = fastqAsText.mapPartitionsWithIndex(runner.mapReads(_, _, seqDictBcast.value))
> val result = reads.coalesce(numMachines * coresPerMachine * 4, true).persist(StorageLevel.DISK_ONLY_2)
> log.info("SNAP output DebugString:\n" + result.toDebugString)
> log.info("produced " + result.count + " reads")
>
> The toDebugString output is:
> 2014-08-07 18:50:43 INFO SnapInputStage:198 - SNAP output DebugString:
> MappedRDD[10] at coalesce at SnapInputStage.scala:197 (640 partitions)
>   CoalescedRDD[9] at coalesce at SnapInputStage.scala:197 (640 partitions)
>     ShuffledRDD[8] at coalesce at SnapInputStage.scala:197 (640 partitions)
>       MapPartitionsRDD[7] at coalesce at SnapInputStage.scala:197 (10 partitions)
>         MapPartitionsRDD[6] at mapPartitionsWithIndex at SnapInputStage.scala:195 (10 partitions)
>           MappedRDD[4] at map at SnapInputStage.scala:188 (10 partitions)
>             CoalescedRDD[3] at coalesce at SnapInputStage.scala:188 (10 partitions)
>               NewHadoopRDD[2] at newAPIHadoopFile at SnapInputStage.scala:182 (3003 partitions)
>
> The 10-partition stage works fine, takes about 1.4 hours, reads 40GB and
> writes 25GB per task. The next 640-partition stage is where the failures
> occur.
>
> Here are the first few errors from a recent run (sorted by time):
> work/hpcraviplvm10/app-20140807185713-0000/14/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
> work/hpcraviplvm10/app-20140807185713-0000/27/stderr: 14/08/07 20:32:18 ERROR BlockFetcherIterator$BasicBlockFetcherIterator: Could not get block(s) from ConnectionManagerId(hpcraviplvm1,49545)
> work/hpcraviplvm1/app-20140807185713-0000/9/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
> work/hpcraviplvm2/app-20140807185713-0000/24/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
> work/hpcraviplvm2/app-20140807185713-0000/36/stderr: 14/08/07 20:32:18 ERROR SendingConnection: Exception while reading SendingConnection to ConnectionManagerId(hpcraviplvm1,49545)
> work/hpcraviplvma1/app-20140807185713-0000/26/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
> work/hpcraviplvma2/app-20140807185713-0000/15/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
> work/hpcraviplvma2/app-20140807185713-0000/18/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
> work/hpcraviplvma2/app-20140807185713-0000/23/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
> work/hpcraviplvma2/app-20140807185713-0000/33/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
>
> Thanks,
>
> Ravi Pandya
> Microsoft Research
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Lost-executors-tp11722.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.