This isn't helpful of me to say, but I see the same sorts of problems and messages semi-regularly on CDH5 + 0.9.0. I don't have any insight into when it happens, other than that it is usually after heavy use and after running for a long time. I had figured I'd see whether the changes since 0.9.0 addressed it and revisit later.
On Tue, May 20, 2014 at 8:37 PM, Josh Marcus <jmar...@meetup.com> wrote:
> So, for example, I have two disassociated worker machines at the moment.
> The last messages in the spark logs are akka association error messages,
> like the following:
>
> 14/05/20 01:22:54 ERROR EndpointWriter: AssociationError
> [akka.tcp://sparkwor...@hdn3.int.meetup.com:50038] ->
> [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]: Error [Association
> failed with [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]] [
> akka.remote.EndpointAssociationException: Association failed with
> [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]
> Caused by:
> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
> Connection refused: hdn3.int.meetup.com/10.3.6.23:46288
> ]
>
> On the master side, there are lots and lots of messages of the form:
>
> 14/05/20 15:36:58 WARN Master: Got heartbeat from unregistered worker
> worker-20140520011737-hdn3.int.meetup.com-50038
>
> --j
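
To get a rough idea of which workers the master has dropped and how often they keep reporting in, the master-side warnings quoted above can be tallied per worker ID. This is only a minimal sketch: the regular expression is modeled on the single log line quoted in this thread, the function name `count_unregistered_heartbeats` is made up for illustration, and the log format may differ in other versions.

```python
import re
from collections import Counter

# Pattern modeled on the one master-side warning quoted in this thread;
# the exact layout in other versions or log configurations is an assumption.
UNREGISTERED = re.compile(
    r"WARN Master: Got heartbeat from unregistered worker\s+(\S+)"
)


def count_unregistered_heartbeats(lines):
    """Return a Counter mapping worker ID -> number of matching warnings."""
    counts = Counter()
    for line in lines:
        match = UNREGISTERED.search(line)
        if match:
            # group(1) is the worker ID that follows the warning text
            counts[match.group(1)] += 1
    return counts
```

Passing an open master log file (any iterable of lines works) and printing `most_common()` on the result lists the worker IDs that appear most often. A worker that keeps showing up is still sending heartbeats after the master stopped considering it registered.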