If you see an exception message like the one described in SPARK-1886 (https://issues.apache.org/jira/browse/SPARK-1886) in the worker's log file, you are welcome to try https://github.com/apache/spark/pull/827
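To see which workers are affected on the Master side, one rough option is to count the "Got heartbeat from unregistered worker" warnings per worker ID. The sketch below is only an illustration: the function name is hypothetical, and the pattern is copied from the Master log lines quoted in this thread, so it assumes your log lines have the same shape.

```python
import re
from collections import Counter

# Pattern based on the Master log lines quoted in this thread (an assumption
# about the log format, not something guaranteed by Spark).
UNREGISTERED = re.compile(
    r"WARN Master: Got heartbeat from unregistered worker\s+(\S+)"
)


def count_unregistered_heartbeats(lines):
    """Map each worker ID to the number of unregistered-heartbeat warnings."""
    counts = Counter()
    for line in lines:
        match = UNREGISTERED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def report(path):
    """Print the warning count per worker for the Master log at `path`."""
    with open(path, errors="replace") as log:
        counts = count_unregistered_heartbeats(log)
    for worker, n in counts.most_common():
        print(f"{n:6d}  {worker}")
```

Calling report() with the path to your Master log prints one line per worker, most frequent first. A worker ID that keeps appearing long after the re-register message is a likely candidate for the permanent disassociation discussed below.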

On Wed, May 21, 2014 at 11:21 AM, Josh Marcus <[email protected]> wrote:

> Aaron:
>
> I see this in the Master's logs:
>
> 14/05/20 01:17:37 INFO Master: Attempted to re-register worker at same address: akka.tcp://[email protected]:50038
> 14/05/20 01:17:37 WARN Master: Got heartbeat from unregistered worker worker-20140520011737-hdn3.int.meetup.com-50038
>
> There was an executor that launched that did fail, such as:
> 14/05/20 01:16:05 INFO Master: Launching executor app-20140520011605-0001/2 on worker worker-20140519155427-hdn3.int.meetup.com-50038
> 14/05/20 01:17:37 INFO Master: Removing executor app-20140520011605-0001/2 because it is FAILED
>
> ... but other executors on other machines also failed without permanently
> disassociating.
>
> There are these messages which I don't know if they are related:
> 14/05/20 01:17:38 INFO LocalActorRef: Message [akka.remote.transport.AssociationHandle$Disassociated] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.6.19%3A47252-18#1027788678] was not delivered. [3] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
> 14/05/20 01:17:38 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.6.19%3A47252-18#1027788678] was not delivered. [4] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
>
> On Tue, May 20, 2014 at 10:13 PM, Aaron Davidson <[email protected]> wrote:
>
>> Unfortunately, those errors are actually due to an Executor that exited,
>> such that the connection between the Worker and Executor failed. This is
>> not a fatal issue, unless there are analogous messages from the Worker to
>> the Master (which should be present, if they exist, at around the same
>> point in time).
>>
>> Do you happen to have the logs from the Master that indicate that the
>> Worker terminated? Is it just an Akka disassociation, or some exception?
>>
>> On Tue, May 20, 2014 at 12:53 PM, Sean Owen <[email protected]> wrote:
>>
>>> This isn't helpful of me to say, but, I see the same sorts of problem
>>> and messages semi-regularly on CDH5 + 0.9.0. I don't have any insight
>>> into when it happens, but usually after heavy use and after running
>>> for a long time. I had figured I'd see if the changes since 0.9.0
>>> addressed it and revisit later.
>>>
>>> On Tue, May 20, 2014 at 8:37 PM, Josh Marcus <[email protected]> wrote:
>>> > So, for example, I have two disassociated worker machines at the moment.
>>> > The last messages in the spark logs are akka association error messages,
>>> > like the following:
>>> >
>>> > 14/05/20 01:22:54 ERROR EndpointWriter: AssociationError [akka.tcp://[email protected]:50038] -> [akka.tcp://[email protected]:46288]: Error [Association failed with [akka.tcp://[email protected]:46288]] [
>>> > akka.remote.EndpointAssociationException: Association failed with [akka.tcp://[email protected]:46288]
>>> > Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: hdn3.int.meetup.com/10.3.6.23:46288
>>> > ]
>>> >
>>> > On the master side, there are lots and lots of messages of the form:
>>> >
>>> > 14/05/20 15:36:58 WARN Master: Got heartbeat from unregistered worker worker-20140520011737-hdn3.int.meetup.com-50038
>>> >
>>> > --j

--
---------------------------------
Best Regards
