Re: advice on maintaining a production spark cluster?

Mark Hamstra Wed, 21 May 2014 09:20:41 -0700

After the several fixes that we have made to exception handling in Spark
1.0.0, I expect that this behavior will be quite different from 0.9.1.
Executors should be far more likely to shut down cleanly in the event of
errors, allowing easier restarts. But I expect that there will be more
bugs to fix in the next couple of maintenance releases.
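
For 0.9.x clusters that still hit the stuck-worker state described in the
quoted message below, the manual cleanup (log in, find the worker and the
executor backend, kill -9 both) could be scripted roughly as follows. This
is a minimal sketch, not part of Spark. It assumes a process listing with
one "<pid> <main class>" pair per line, the shape shown in the quoted
report (for example from `jps -l`). The helper names `find_spark_pids` and
`kill_all` are hypothetical, and the third line of the sample listing is
invented for illustration.

```python
import os
import signal

# Main classes of the two JVMs named in the quoted report.
SPARK_CLASSES = (
    "org.apache.spark.deploy.worker.Worker",
    "org.apache.spark.executor.CoarseGrainedExecutorBackend",
)


def find_spark_pids(listing):
    """Return (pid, class_name) pairs for Spark worker/executor JVMs.

    `listing` is text with one "<pid> <main class>" pair per line
    (assumed format, e.g. the output of `jps -l`).
    """
    found = []
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0].isdigit() and parts[1] in SPARK_CLASSES:
            found.append((int(parts[0]), parts[1]))
    return found


def kill_all(pids, sig=signal.SIGKILL):
    """Send `sig` to each PID, skipping processes that already exited."""
    for pid in pids:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass  # process is already gone


# First two lines are from the quoted report; the last one is made up
# to show that unrelated JVMs are filtered out.
sample = """\
16280 org.apache.spark.deploy.worker.Worker
25989 org.apache.spark.executor.CoarseGrainedExecutorBackend
4242 sun.tools.jps.Jps
"""

matches = find_spark_pids(sample)
print(matches)
# kill_all is deliberately not called here; on the affected machine one
# would pass it [pid for pid, _ in matches] built from a real listing.
```

On the affected machine, the PIDs returned for a real listing would be
passed to `kill_all` before rerunning sbin/start-all.sh.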



On Wed, May 21, 2014 at 8:58 AM, Han JU <ju.han.fe...@gmail.com> wrote:

> I've also seen worker loss, and that's why I asked a question about worker
> re-spawn.
>
> My typical case is that some job hits an OOM exception. Then on the master
> UI some worker's state becomes DEAD.
> In the master's log, there's an error like:
>
> ```
> 14/05/21 15:38:02 ERROR remote.EndpointWriter: AssociationError
> [akka.tcp://sparkmas...@ec2-23-20-189-111.compute-1.amazonaws.com:7077]
> -> [akka.tcp://sparkWorker@ip-10-186-156-22.ec2.internal:38572]: Error
> [Association failed with
> [akka.tcp://sparkWorker@ip-10-186-156-22.ec2.internal:38572]] [
> akka.remote.EndpointAssociationException: Association failed with
> [akka.tcp://sparkWorker@ip-10-186-156-22.ec2.internal:38572]
> Caused by:
> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
> Connection refused: ip-10-186-156-22.ec2.internal/10.186.156.22:38572
> ]
> 14/05/21 15:38:02 INFO master.Master:
> akka.tcp://sparkWorker@ip-10-186-156-22.ec2.internal:38572 got
> disassociated, removing it.
> ```
>
> On the `DEAD` worker machine, there are two Spark processes, the worker and
> the executor backend:
>   16280 org.apache.spark.deploy.worker.Worker
>   25989 org.apache.spark.executor.CoarseGrainedExecutorBackend
>
> The bad thing is that in this case, running sbin/stop-all.sh and then
> sbin/start-all.sh cannot bring back the DEAD worker, since the worker
> process cannot be terminated (maybe due to the executor backend). I have to
> log in and kill -9 both the worker process and the executor backend.
>
> I'm on 0.9.1 and using ec2-script.
>
>
>
> 2014-05-21 11:42 GMT+02:00 sagi <zhpeng...@gmail.com>:
>
>> If you see an exception message like the one described in JIRA
>> https://issues.apache.org/jira/browse/SPARK-1886 in the worker's
>> log file, you are welcome to try
>> https://github.com/apache/spark/pull/827
>>
>>
>>
>>
>> On Wed, May 21, 2014 at 11:21 AM, Josh Marcus <jmar...@meetup.com> wrote:
>>
>>> Aaron:
>>>
>>> I see this in the Master's logs:
>>>
>>> 14/05/20 01:17:37 INFO Master: Attempted to re-register worker at same
>>> address: akka.tcp://sparkwor...@hdn3.int.meetup.com:50038
>>> 14/05/20 01:17:37 WARN Master: Got heartbeat from unregistered worker
>>> worker-20140520011737-hdn3.int.meetup.com-50038
>>>
>>> There was an executor that launched that did fail, such as:
>>> 14/05/20 01:16:05 INFO Master: Launching executor
>>> app-20140520011605-0001/2 on worker
>>> worker-20140519155427-hdn3.int.meetup.com-50038
>>> 14/05/20 01:17:37 INFO Master: Removing executor
>>> app-20140520011605-0001/2 because it is FAILED
>>>
>>> ... but other executors on other machines also failed without
>>> permanently disassociating.
>>>
>>> There are also these messages; I don't know whether they are related:
>>> 14/05/20 01:17:38 INFO LocalActorRef: Message
>>> [akka.remote.transport.AssociationHandle$Disassociated] from
>>> Actor[akka://sparkMaster/deadLetters] to
>>> Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.6.19%3A47252-18#1027788678]
>>> was not delivered. [3] dead letters encountered. This logging can be
>>> turned off or adjusted with configuration settings
>>> 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
>>> 14/05/20 01:17:38 INFO LocalActorRef: Message
>>> [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
>>> Actor[akka://sparkMaster/deadLetters] to
>>> Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.6.19%3A47252-18#1027788678]
>>> was not delivered. [4] dead letters encountered. This logging can be
>>> turned off or adjusted with configuration settings
>>> 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
>>>
>>>
>>>
>>>
>>> On Tue, May 20, 2014 at 10:13 PM, Aaron Davidson <ilike...@gmail.com> wrote:
>>>
>>>> Unfortunately, those errors are actually due to an Executor that
>>>> exited, such that the connection between the Worker and Executor failed.
>>>> This is not a fatal issue, unless there are analogous messages from the
>>>> Worker to the Master (which should be present, if they exist, at around the
>>>> same point in time).
>>>>
>>>> Do you happen to have the logs from the Master that indicate that the
>>>> Worker terminated? Is it just an Akka disassociation, or some exception?
>>>>
>>>>
>>>> On Tue, May 20, 2014 at 12:53 PM, Sean Owen <so...@cloudera.com> wrote:
>>>>
>>>>> This isn't helpful of me to say, but I see the same sorts of problems
>>>>> and messages semi-regularly on CDH5 + 0.9.0. I don't have any insight
>>>>> into when it happens, but it is usually after heavy use and after
>>>>> running for a long time. I had figured I'd see if the changes since
>>>>> 0.9.0 addressed it and revisit later.
>>>>>
>>>>> On Tue, May 20, 2014 at 8:37 PM, Josh Marcus <jmar...@meetup.com>
>>>>> wrote:
>>>>> > So, for example, I have two disassociated worker machines at the moment.
>>>>> > The last messages in the spark logs are akka association error messages,
>>>>> > like the following:
>>>>> >
>>>>> > 14/05/20 01:22:54 ERROR EndpointWriter: AssociationError
>>>>> > [akka.tcp://sparkwor...@hdn3.int.meetup.com:50038] ->
>>>>> > [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]: Error [Association
>>>>> > failed with [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]] [
>>>>> > akka.remote.EndpointAssociationException: Association failed with
>>>>> > [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]
>>>>> > Caused by:
>>>>> > akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>>>>> > Connection refused: hdn3.int.meetup.com/10.3.6.23:46288
>>>>> > ]
>>>>> >
>>>>> > On the master side, there are lots and lots of messages of the form:
>>>>> >
>>>>> > 14/05/20 15:36:58 WARN Master: Got heartbeat from unregistered worker
>>>>> > worker-20140520011737-hdn3.int.meetup.com-50038
>>>>> >
>>>>> > --j
>>>>> >
>>>>> >
>>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> ---------------------------------
>> Best Regards
>>
>
>
>
> --
> *JU Han*
>
> Data Engineer @ Botify.com
>
> +33 0619608888
>
