So, for example, I have two disassociated worker machines at the moment. The last messages in the Spark worker logs are Akka association errors like the following:

14/05/20 01:22:54 ERROR EndpointWriter: AssociationError [akka.tcp://[email protected]:50038] -> [akka.tcp://[email protected]:46288]: Error [Association failed with [akka.tcp://[email protected]:46288]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://[email protected]:46288]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: hdn3.int.meetup.com/10.3.6.23:46288
]

On the master side, there are lots and lots of messages of the form:

14/05/20 15:36:58 WARN Master: Got heartbeat from unregistered worker worker-20140520011737-hdn3.int.meetup.com-50038

--j

On Tue, May 20, 2014 at 3:28 PM, Josh Marcus <[email protected]> wrote:

> We're using spark 0.9.0, and we're using it "out of the box" -- not using
> Cloudera Manager or anything similar.
>
> There are warnings from the master that there continue to be heartbeats
> from the unregistered workers. I will see if there are particular
> telltale errors on the worker side.
>
> We've had occasional problems with running out of memory on the driver
> side (esp. with large broadcast variables) so that may be related.
>
> --j
>
>
> On Tuesday, May 20, 2014, Matei Zaharia <[email protected]> wrote:
>
>> Are you guys both using Cloudera Manager? Maybe there’s also an issue
>> with the integration with that.
>>
>> Matei
>>
>> On May 20, 2014, at 11:44 AM, Aaron Davidson <[email protected]> wrote:
>>
>> I'd just like to point out that, along with Matei, I have not seen
>> workers drop even under the most exotic job failures. We're running pretty
>> close to master, though; perhaps it is related to an uncaught exception in
>> the Worker from a prior version of Spark.
>>
>>
>> On Tue, May 20, 2014 at 11:36 AM, Arun Ahuja <[email protected]> wrote:
>>
>>> Hi Matei,
>>>
>>> Unfortunately, I don't have more detailed information, but we have seen
>>> the loss of workers in standalone mode as well.
>>> If a job is killed through CTRL-C, we will often see the number of
>>> workers and cores decrease on the Spark Master page. They are still alive
>>> and well in the Cloudera Manager page, but not visible on the Spark
>>> master. Simply restarting the workers usually resolves this, but we often
>>> see workers disappear after a failed or killed job.
>>>
>>> If we see this occur again, I'll try and provide some logs.
>>>
>>>
>>> On Mon, May 19, 2014 at 10:51 PM, Matei Zaharia <[email protected]> wrote:
>>>
>>>> Which version is this with? I haven’t seen standalone masters lose
>>>> workers. Is there other stuff on the machines that’s killing them, or
>>>> what errors do you see?
>>>>
>>>> Matei
>>>>
>>>> On May 16, 2014, at 9:53 AM, Josh Marcus <[email protected]> wrote:
>>>>
>>>> > Hey folks,
>>>> >
>>>> > I'm wondering what strategies other folks are using for maintaining
>>>> > and monitoring the stability of stand-alone spark clusters.
>>>> >
>>>> > Our master very regularly loses workers, and they (as expected) never
>>>> > rejoin the cluster. This is the same behavior I've seen using akka
>>>> > cluster (if that's what spark is using in stand-alone mode) -- are
>>>> > there configuration options we could be setting to make the cluster
>>>> > more robust?
>>>> >
>>>> > We have a custom script which monitors the number of workers (through
>>>> > the web interface) and restarts the cluster when necessary, as well as
>>>> > resolving other issues we face (like spark shells left open
>>>> > permanently claiming resources), and it works, but it's nowhere close
>>>> > to a great solution.
>>>> >
>>>> > What are other folks doing? Is this something that other folks
>>>> > observe as well? I suspect that the loss of workers is tied to jobs
>>>> > that run out of memory on the client side or our use of very large
>>>> > broadcast variables, but I don't have an isolated test case.
>>>> > I'm open to general answers here: for example, perhaps we should
>>>> > simply be using mesos or yarn instead of stand-alone mode.
>>>> >
>>>> > --j
>>>> >
>>>>
>>>
>>
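
P.S. Regarding the monitoring script mentioned in the thread: one way to spot workers that the master has dropped is to scan the master log for the "Got heartbeat from unregistered worker" warning quoted above. The following is only a minimal Python sketch, not our actual script. The function names are hypothetical, and the assumption that worker IDs look like worker-<timestamp>-<host>-<port> is inferred from the single ID shown in this thread.

```python
import re
from collections import Counter

# Matches master log lines of the form quoted in this thread:
#   14/05/20 15:36:58 WARN Master: Got heartbeat from unregistered worker worker-...
UNREGISTERED = re.compile(
    r"WARN Master: Got heartbeat from unregistered worker (?P<worker>\S+)"
)


def unregistered_workers(log_lines):
    """Count heartbeat warnings per unregistered worker ID (hypothetical helper)."""
    counts = Counter()
    for line in log_lines:
        match = UNREGISTERED.search(line)
        if match:
            counts[match.group("worker")] += 1
    return counts


def worker_host(worker_id):
    """Best-effort host extraction.

    Assumes IDs look like worker-<timestamp>-<host>-<port>; returns None
    when the ID does not fit that shape.
    """
    parts = worker_id.split("-", 2)  # ["worker", "<timestamp>", "<host>-<port>"]
    if len(parts) != 3 or parts[0] != "worker":
        return None
    host, _, _port = parts[2].rpartition("-")
    return host or None


if __name__ == "__main__":
    sample = [
        "14/05/20 15:36:58 WARN Master: Got heartbeat from unregistered worker "
        "worker-20140520011737-hdn3.int.meetup.com-50038",
    ]
    for worker, count in unregistered_workers(sample).items():
        print(worker_host(worker), count)
```

A script along these lines could restart only the affected workers instead of the whole cluster, since restarting the workers usually resolves the problem according to the reports above.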
