Also, I was experiencing another problem that might be related: "Error communicating with MapOutputTracker" (see the email on the mailing list today).
I just thought I would mention it in case it is relevant.

On Wed, Mar 4, 2015 at 4:07 PM, Thomas Gerber <thomas.ger...@radius.com> wrote:

> 1.2.1
>
> Also, I was using the following parameters, which are 10 times the default
> ones:
> spark.akka.timeout 1000
> spark.akka.heartbeat.pauses 60000
> spark.akka.failure-detector.threshold 3000.0
> spark.akka.heartbeat.interval 10000
>
> which should have helped *avoid* the problem if I understand correctly.
>
> Thanks,
> Thomas
>
> On Wed, Mar 4, 2015 at 3:21 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> What release are you using ?
>>
>> SPARK-3923 went into 1.2.0 release.
>>
>> Cheers
>>
>> On Wed, Mar 4, 2015 at 1:39 PM, Thomas Gerber <thomas.ger...@radius.com>
>> wrote:
>>
>>> Hello,
>>>
>>> sometimes, in the *middle* of a job, the job stops (status is then seen
>>> as FINISHED in the master).
>>>
>>> There isn't anything wrong in the shell/submit output.
>>>
>>> When looking at the executor logs, I see logs like this:
>>>
>>> 15/03/04 21:24:51 INFO MapOutputTrackerWorker: Doing the fetch; tracker
>>> actor = Actor[akka.tcp://sparkDriver@ip-10-0-10-17.ec2.internal
>>> :40019/user/MapOutputTracker#893807065]
>>> 15/03/04 21:24:51 INFO MapOutputTrackerWorker: Don't have map outputs
>>> for shuffle 38, fetching them
>>> 15/03/04 21:24:55 ERROR CoarseGrainedExecutorBackend: Driver
>>> Disassociated [akka.tcp://sparkExecutor@ip-10-0-11-9.ec2.internal:54766]
>>> -> [akka.tcp://sparkDriver@ip-10-0-10-17.ec2.internal:40019]
>>> disassociated! Shutting down.
>>> 15/03/04 21:24:55 WARN ReliableDeliverySupervisor: Association with
>>> remote system [akka.tcp://sparkDriver@ip-10-0-10-17.ec2.internal:40019]
>>> has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
>>>
>>> How can I investigate further?
>>> Thanks