Thanks. I was already setting those (and I checked they were in use through the environment tab in the UI).
They were set at 10 times their default values: 60000 and 10000 respectively.
I'll start poking at spark.shuffle.io.retryWait. Thanks!

On Wed, Mar 4, 2015 at 7:02 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> See this thread:
> https://groups.google.com/forum/#!topic/akka-user/X3xzpTCbEFs
>
> Here are the relevant config parameters in Spark:
>
>     val akkaHeartBeatPauses = conf.getInt("spark.akka.heartbeat.pauses", 6000)
>     val akkaHeartBeatInterval = conf.getInt("spark.akka.heartbeat.interval", 1000)
>
> Cheers
>
> On Wed, Mar 4, 2015 at 4:09 PM, Thomas Gerber <thomas.ger...@radius.com> wrote:
>
>> Also,
>>
>> I was experiencing another problem which might be related:
>> "Error communicating with MapOutputTracker" (see email in the ML today).
>>
>> I just thought I would mention it in case it is relevant.
>>
>> On Wed, Mar 4, 2015 at 4:07 PM, Thomas Gerber <thomas.ger...@radius.com> wrote:
>>
>>> 1.2.1
>>>
>>> Also, I was using the following parameters, which are 10 times the
>>> default ones:
>>>
>>>     spark.akka.timeout                     1000
>>>     spark.akka.heartbeat.pauses            60000
>>>     spark.akka.failure-detector.threshold  3000.0
>>>     spark.akka.heartbeat.interval          10000
>>>
>>> which should have helped *avoid* the problem, if I understand correctly.
>>>
>>> Thanks,
>>> Thomas
>>>
>>> On Wed, Mar 4, 2015 at 3:21 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> What release are you using?
>>>>
>>>> SPARK-3923 went into the 1.2.0 release.
>>>>
>>>> Cheers
>>>>
>>>> On Wed, Mar 4, 2015 at 1:39 PM, Thomas Gerber <thomas.ger...@radius.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> Sometimes, in the *middle* of a job, the job stops (its status is then
>>>>> seen as FINISHED in the master).
>>>>>
>>>>> There isn't anything wrong in the shell/submit output.
>>>>>
>>>>> When looking at the executor logs, I see logs like this:
>>>>>
>>>>>     15/03/04 21:24:51 INFO MapOutputTrackerWorker: Doing the fetch; tracker actor =
>>>>>       Actor[akka.tcp://sparkDriver@ip-10-0-10-17.ec2.internal:40019/user/MapOutputTracker#893807065]
>>>>>     15/03/04 21:24:51 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 38, fetching them
>>>>>     15/03/04 21:24:55 ERROR CoarseGrainedExecutorBackend: Driver Disassociated
>>>>>       [akka.tcp://sparkExecutor@ip-10-0-11-9.ec2.internal:54766] ->
>>>>>       [akka.tcp://sparkDriver@ip-10-0-10-17.ec2.internal:40019] disassociated! Shutting down.
>>>>>     15/03/04 21:24:55 WARN ReliableDeliverySupervisor: Association with remote system
>>>>>       [akka.tcp://sparkDriver@ip-10-0-10-17.ec2.internal:40019] has failed, address is now
>>>>>       gated for [5000] ms. Reason is: [Disassociated].
>>>>>
>>>>> How can I investigate further?
>>>>> Thanks
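For reference, the timeout settings discussed in this thread can be collected in one place, e.g. in conf/spark-defaults.conf. The values below are the ones quoted in the thread (10x the Spark 1.2 defaults); the spark.shuffle.io.retryWait value is only illustrative, since the thread mentions the parameter without settling on a number:

```
# 10x the Spark 1.2 defaults, as used in this thread
spark.akka.timeout                     1000
spark.akka.heartbeat.pauses            60000
spark.akka.failure-detector.threshold  3000.0
spark.akka.heartbeat.interval          10000

# Next knob to experiment with per the thread; the value here is
# illustrative, not a recommendation from the thread
spark.shuffle.io.retryWait             30s
```

These can also be passed per-job with `--conf key=value` on spark-submit instead of editing spark-defaults.conf.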