See this thread: https://groups.google.com/forum/#!topic/akka-user/X3xzpTCbEFs
Here're the relevant config parameters in Spark:

    val akkaHeartBeatPauses = conf.getInt("spark.akka.heartbeat.pauses", 6000)
    val akkaHeartBeatInterval = conf.getInt("spark.akka.heartbeat.interval", 1000)

Cheers

On Wed, Mar 4, 2015 at 4:09 PM, Thomas Gerber <thomas.ger...@radius.com> wrote:

> Also,
>
> I was experiencing another problem which might be related:
> "Error communicating with MapOutputTracker" (see email in the ML today).
>
> I just thought I would mention it in case it is relevant.
>
> On Wed, Mar 4, 2015 at 4:07 PM, Thomas Gerber <thomas.ger...@radius.com>
> wrote:
>
>> 1.2.1
>>
>> Also, I was using the following parameters, which are 10 times the
>> default ones:
>> spark.akka.timeout 1000
>> spark.akka.heartbeat.pauses 60000
>> spark.akka.failure-detector.threshold 3000.0
>> spark.akka.heartbeat.interval 10000
>>
>> which should have helped *avoid* the problem if I understand correctly.
>>
>> Thanks,
>> Thomas
>>
>> On Wed, Mar 4, 2015 at 3:21 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> What release are you using ?
>>>
>>> SPARK-3923 went into 1.2.0 release.
>>>
>>> Cheers
>>>
>>> On Wed, Mar 4, 2015 at 1:39 PM, Thomas Gerber <thomas.ger...@radius.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> sometimes, in the *middle* of a job, the job stops (status is then
>>>> seen as FINISHED in the master).
>>>>
>>>> There isn't anything wrong in the shell/submit output.
>>>>
>>>> When looking at the executor logs, I see logs like this:
>>>>
>>>> 15/03/04 21:24:51 INFO MapOutputTrackerWorker: Doing the fetch; tracker
>>>> actor = Actor[akka.tcp://sparkDriver@ip-10-0-10-17.ec2.internal
>>>> :40019/user/MapOutputTracker#893807065]
>>>> 15/03/04 21:24:51 INFO MapOutputTrackerWorker: Don't have map outputs
>>>> for shuffle 38, fetching them
>>>> 15/03/04 21:24:55 ERROR CoarseGrainedExecutorBackend: Driver
>>>> Disassociated [akka.tcp://sparkExecutor@ip-10-0-11-9.ec2.internal:54766]
>>>> -> [akka.tcp://sparkDriver@ip-10-0-10-17.ec2.internal:40019]
>>>> disassociated! Shutting down.
>>>> 15/03/04 21:24:55 WARN ReliableDeliverySupervisor: Association with
>>>> remote system [akka.tcp://sparkDriver@ip-10-0-10-17.ec2.internal:40019]
>>>> has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
>>>>
>>>> How can I investigate further?
>>>> Thanks
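
P.S. In case it helps to check which heartbeat values the application actually ends up with, here is a minimal sketch. It assumes Spark 1.2.x with spark-core on the classpath; the object name HeartbeatConfSketch is made up for illustration, and the explicit values are simply the ones Thomas listed in the thread, not a recommendation.

```scala
import org.apache.spark.SparkConf

// Hypothetical helper object, for illustration only.
object HeartbeatConfSketch {
  def main(args: Array[String]): Unit = {
    // loadDefaults = false: start from an empty conf instead of reading
    // spark.* Java system properties, so only the values set here apply.
    val conf = new SparkConf(false)
      .set("spark.akka.heartbeat.pauses", "60000")   // value from the thread
      .set("spark.akka.heartbeat.interval", "10000") // value from the thread

    // Same lookups as the two lines quoted in the reply; the second
    // argument is the fallback used when the key is not set.
    val akkaHeartBeatPauses = conf.getInt("spark.akka.heartbeat.pauses", 6000)
    val akkaHeartBeatInterval = conf.getInt("spark.akka.heartbeat.interval", 1000)

    println(s"pauses=$akkaHeartBeatPauses interval=$akkaHeartBeatInterval")
  }
}
```

Printing the resolved values this way is only a sanity check that the overrides are being picked up by the conf; it does not by itself explain the disassociation.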