See this thread:
https://groups.google.com/forum/#!topic/akka-user/X3xzpTCbEFs

Here are the relevant config parameters in Spark, with their default values:
    val akkaHeartBeatPauses = conf.getInt("spark.akka.heartbeat.pauses", 6000)
    val akkaHeartBeatInterval = conf.getInt("spark.akka.heartbeat.interval", 1000)
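
In case it helps, here is a minimal sketch of how those two keys could be overridden on a SparkConf and read back the same way the lines above read them. The object name HeartbeatSettings is made up for illustration, and the override values are simply the ones Thomas quoted below; this assumes a Spark 1.2.x jar on the classpath and is not a recommendation for what the values should be.

```scala
import org.apache.spark.SparkConf

// Hypothetical helper object, not part of Spark.
object HeartbeatSettings {
  def main(args: Array[String]): Unit = {
    // loadDefaults = false: ignore spark.* system properties so that
    // only the values set explicitly here are visible.
    val conf = new SparkConf(false)
      .set("spark.akka.heartbeat.pauses", "60000")
      .set("spark.akka.heartbeat.interval", "10000")

    // Same lookups as in the snippet above; the second argument is the
    // fallback used when the key has not been set.
    val akkaHeartBeatPauses = conf.getInt("spark.akka.heartbeat.pauses", 6000)
    val akkaHeartBeatInterval = conf.getInt("spark.akka.heartbeat.interval", 1000)

    println(s"pauses=$akkaHeartBeatPauses interval=$akkaHeartBeatInterval")
  }
}
```

Printing the values on the driver is a quick way to confirm that the overrides were actually picked up before digging further into the disassociation.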

Cheers

On Wed, Mar 4, 2015 at 4:09 PM, Thomas Gerber <thomas.ger...@radius.com>
wrote:

> Also,
>
> I was experiencing another problem which might be related:
> "Error communicating with MapOutputTracker" (see email in the ML today).
>
> I just thought I would mention it in case it is relevant.
>
> On Wed, Mar 4, 2015 at 4:07 PM, Thomas Gerber <thomas.ger...@radius.com>
> wrote:
>
>> 1.2.1
>>
>> Also, I was using the following parameters, which are 10 times the
>> default ones:
>> spark.akka.timeout 1000
>> spark.akka.heartbeat.pauses 60000
>> spark.akka.failure-detector.threshold 3000.0
>> spark.akka.heartbeat.interval 10000
>>
>> which should have helped *avoid* the problem if I understand correctly.
>>
>> Thanks,
>> Thomas
>>
>> On Wed, Mar 4, 2015 at 3:21 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> What release are you using?
>>>
>>> SPARK-3923 went into 1.2.0 release.
>>>
>>> Cheers
>>>
>>> On Wed, Mar 4, 2015 at 1:39 PM, Thomas Gerber <thomas.ger...@radius.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> sometimes, in the *middle* of a job, the job stops (status is then
>>>> seen as FINISHED in the master).
>>>>
>>>> There isn't anything wrong in the shell/submit output.
>>>>
>>>> When looking at the executor logs, I see logs like this:
>>>>
>>>> 15/03/04 21:24:51 INFO MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://sparkDriver@ip-10-0-10-17.ec2.internal:40019/user/MapOutputTracker#893807065]
>>>> 15/03/04 21:24:51 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 38, fetching them
>>>> 15/03/04 21:24:55 ERROR CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@ip-10-0-11-9.ec2.internal:54766] -> [akka.tcp://sparkDriver@ip-10-0-10-17.ec2.internal:40019] disassociated! Shutting down.
>>>> 15/03/04 21:24:55 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@ip-10-0-10-17.ec2.internal:40019] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
>>>>
>>>> How can I investigate further?
>>>> Thanks
>>>>
>>>
>>>
>>
>
