After several fixes that we have made to exception handling in Spark 1.0.0, I expect that this behavior will be quite different from 0.9.1. Executors should be far more likely to shut down cleanly in the event of errors, allowing easier restarts. But I expect that there will be more bugs to fix in the next couple of maintenance releases.

On Wed, May 21, 2014 at 8:58 AM, Han JU <[email protected]> wrote:

> I've seen also worker loss and that's way I asked a question about worker
> re-spawn.
>
> My typical case is there's some job got OOM exception. Then on the master
> UI some worker's state becomes DEAD.
> In the master's log, there's error like:
>
> ```
> 14/05/21 15:38:02 ERROR remote.EndpointWriter: AssociationError
> [akka.tcp://[email protected]:7077]
> -> [akka.tcp://[email protected]:38572]: Error
> [Association failed with
> [akka.tcp://[email protected]:38572]] [
> akka.remote.EndpointAssociationException: Association failed with
> [akka.tcp://[email protected]:38572]
> Caused by:
> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
> Connection refused: ip-10-186-156-22.ec2.internal/10.186.156.22:38572
> ]
> 14/05/21 15:38:02 INFO master.Master:
> akka.tcp://[email protected]:38572 got
> disassociated, removing it.
> ```
>
> On the `DEAD` worker machine, there's 2 spark processes, worker and
> executor backend:
> 16280 org.apache.spark.deploy.worker.Worker
> 25989 org.apache.spark.executor.CoarseGrainedExecutorBackend
>
> The bad thing is that in this case, a sbin/stop-all.sh and
> sbin/start-all.sh cannot bring back the DEAD worker since the worker
> process cannot be terminated (maybe due to the executor backend). I have to
> log in, kill -9 both worker process and the executor backend.
>
> I'm on 0.9.1 and using ec2-script.
>
> 2014-05-21 11:42 GMT+02:00 sagi <[email protected]>:
>
>> if you saw some exception message like the JIRA
>> https://issues.apache.org/jira/browse/SPARK-1886 mentioned in work's
>> log file, you are welcome to have a try
>> https://github.com/apache/spark/pull/827
>>
>> On Wed, May 21, 2014 at 11:21 AM, Josh Marcus <[email protected]> wrote:
>>
>>> Aaron:
>>>
>>> I see this in the Master's logs:
>>>
>>> 14/05/20 01:17:37 INFO Master: Attempted to re-register worker at same
>>> address: akka.tcp://[email protected]:50038
>>> 14/05/20 01:17:37 WARN Master: Got heartbeat from unregistered worker
>>> worker-20140520011737-hdn3.int.meetup.com-50038
>>>
>>> There was an executor that launched that did fail, such as:
>>> 14/05/20 01:16:05 INFO Master: Launching executor
>>> app-20140520011605-0001/2 on worker
>>> worker-20140519155427-hdn3.int.meetup.com-50038
>>> 14/05/20 01:17:37 INFO Master: Removing executor
>>> app-20140520011605-0001/2 because it is FAILED
>>>
>>> ... but other executors on other machines also failed without
>>> permanently disassociating.
>>>
>>> There are these messages which I don't know if they are related:
>>> 14/05/20 01:17:38 INFO LocalActorRef: Message
>>> [akka.remote.transport.AssociationHandle$Disassociated] from
>>> Actor[akka://sparkMaster/deadLetters] to
>>> Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.6.19%3A47252-18#1027788678]
>>> was not delivered. [3] dead letters encountered. This logging can be
>>> turned off or adjusted with configuration settings
>>> 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
>>> 14/05/20 01:17:38 INFO LocalActorRef: Message
>>> [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
>>> Actor[akka://sparkMaster/deadLetters] to
>>> Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.6.19%3A47252-18#1027788678]
>>> was not delivered. [4] dead letters encountered. This logging can be
>>> turned off or adjusted with configuration settings
>>> 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
>>>
>>> On Tue, May 20, 2014 at 10:13 PM, Aaron Davidson <[email protected]> wrote:
>>>
>>>> Unfortunately, those errors are actually due to an Executor that
>>>> exited, such that the connection between the Worker and Executor failed.
>>>> This is not a fatal issue, unless there are analogous messages from the
>>>> Worker to the Master (which should be present, if they exist, at around the
>>>> same point in time).
>>>>
>>>> Do you happen to have the logs from the Master that indicate that the
>>>> Worker terminated? Is it just an Akka disassociation, or some exception?
>>>>
>>>> On Tue, May 20, 2014 at 12:53 PM, Sean Owen <[email protected]> wrote:
>>>>
>>>>> This isn't helpful of me to say, but, I see the same sorts of problem
>>>>> and messages semi-regularly on CDH5 + 0.9.0. I don't have any insight
>>>>> into when it happens, but usually after heavy use and after running
>>>>> for a long time. I had figured I'd see if the changes since 0.9.0
>>>>> addressed it and revisit later.
>>>>>
>>>>> On Tue, May 20, 2014 at 8:37 PM, Josh Marcus <[email protected]>
>>>>> wrote:
>>>>> > So, for example, I have two disassociated worker machines at the moment.
>>>>> > The last messages in the spark logs are akka association error messages,
>>>>> > like the following:
>>>>> >
>>>>> > 14/05/20 01:22:54 ERROR EndpointWriter: AssociationError
>>>>> > [akka.tcp://[email protected]:50038] ->
>>>>> > [akka.tcp://[email protected]:46288]: Error [Association
>>>>> > failed with [akka.tcp://[email protected]:46288]] [
>>>>> > akka.remote.EndpointAssociationException: Association failed with
>>>>> > [akka.tcp://[email protected]:46288]
>>>>> > Caused by:
>>>>> > akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>>>>> > Connection refused: hdn3.int.meetup.com/10.3.6.23:46288
>>>>> > ]
>>>>> >
>>>>> > On the master side, there are lots and lots of messages of the form:
>>>>> >
>>>>> > 14/05/20 15:36:58 WARN Master: Got heartbeat from unregistered worker
>>>>> > worker-20140520011737-hdn3.int.meetup.com-50038
>>>>> >
>>>>> > --j
>>
>> --
>> ---------------------------------
>> Best Regards
>
> --
> *JU Han*
>
> Data Engineer @ Botify.com
>
> +33 0619608888
