Are you guys both using Cloudera Manager? Maybe there's also an issue with that integration.
Matei

On May 20, 2014, at 11:44 AM, Aaron Davidson <ilike...@gmail.com> wrote:

> I'd just like to point out that, along with Matei, I have not seen workers
> drop even under the most exotic job failures. We're running pretty close to
> master, though; perhaps it is related to an uncaught exception in the Worker
> from a prior version of Spark.
>
>
> On Tue, May 20, 2014 at 11:36 AM, Arun Ahuja <aahuj...@gmail.com> wrote:
> Hi Matei,
>
> Unfortunately, I don't have more detailed information, but we have seen the
> loss of workers in standalone mode as well. If a job is killed through
> CTRL-C, we will often see in the Spark master page the number of workers
> and cores decrease. They are still alive and well in the Cloudera Manager
> page, but not visible on the Spark master. Simply restarting the workers
> usually resolves this, but we often see workers disappear after a failed or
> killed job.
>
> If we see this occur again, I'll try and provide some logs.
>
>
> On Mon, May 19, 2014 at 10:51 PM, Matei Zaharia <matei.zaha...@gmail.com>
> wrote:
> Which version is this with? I haven't seen standalone masters lose workers.
> Is there other stuff on the machines that's killing them, or what errors do
> you see?
>
> Matei
>
> On May 16, 2014, at 9:53 AM, Josh Marcus <jmar...@meetup.com> wrote:
>
> > Hey folks,
> >
> > I'm wondering what strategies other folks are using for maintaining and
> > monitoring the stability of stand-alone Spark clusters.
> >
> > Our master very regularly loses workers, and they (as expected) never
> > rejoin the cluster. This is the same behavior I've seen using Akka
> > cluster (if that's what Spark is using in stand-alone mode) -- are there
> > configuration options we could be setting to make the cluster more
> > robust?
> >
> > We have a custom script which monitors the number of workers (through
> > the web interface) and restarts the cluster when necessary, as well as
> > resolving other issues we face (like Spark shells left open permanently
> > claiming resources), and it works, but it's nowhere close to a great
> > solution.
> >
> > What are other folks doing? Is this something that other folks observe
> > as well? I suspect that the loss of workers is tied to jobs that run out
> > of memory on the client side or our use of very large broadcast
> > variables, but I don't have an isolated test case. I'm open to general
> > answers here: for example, perhaps we should simply be using Mesos or
> > YARN instead of stand-alone mode.
> >
> > --j
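
A rough sketch of the kind of watchdog Josh describes, assuming the standalone master's web UI (port 8080 by default) also serves cluster state as JSON at /json. The hostname, the expected worker count, and the restart hook are placeholders, not details from the thread:

    #!/usr/bin/env python3
    # Poll the master's JSON endpoint and flag missing workers.
    import json
    import time
    import urllib.request

    MASTER_JSON_URL = "http://spark-master:8080/json"  # hypothetical master host
    EXPECTED_WORKERS = 10                              # placeholder cluster size

    def alive_workers():
        """Count workers the master currently reports as ALIVE."""
        with urllib.request.urlopen(MASTER_JSON_URL) as resp:
            status = json.load(resp)
        # Each entry under "workers" carries a "state" field (e.g. ALIVE, DEAD).
        return sum(1 for w in status.get("workers", [])
                   if w.get("state") == "ALIVE")

    while True:
        n = alive_workers()
        if n < EXPECTED_WORKERS:
            print("only %d of %d workers alive" % (n, EXPECTED_WORKERS))
            # e.g. shell out to sbin/stop-all.sh / sbin/start-all.sh here
        time.sleep(60)

On Josh's configuration question, one knob worth checking before blaming Akka is spark.worker.timeout, the number of seconds the standalone master waits without a heartbeat before dropping a worker (60 by default); a timeout that is too aggressive for a heavily loaded cluster can look exactly like workers disappearing.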