Are you guys both using Cloudera Manager? Maybe there’s also an issue with 
that integration.

Matei

On May 20, 2014, at 11:44 AM, Aaron Davidson <ilike...@gmail.com> wrote:

> I'd just like to point out that, along with Matei, I have not seen workers 
> drop even under the most exotic job failures. We're running pretty close to 
> master, though; perhaps it is related to an uncaught exception in the Worker 
> from a prior version of Spark.
> 
> 
> On Tue, May 20, 2014 at 11:36 AM, Arun Ahuja <aahuj...@gmail.com> wrote:
> Hi Matei,
> 
> Unfortunately, I don't have more detailed information, but we have seen the 
> loss of workers in standalone mode as well. If a job is killed with CTRL-C, 
> we will often see the number of workers and cores decrease on the Spark 
> Master page. The workers are still alive and well in the Cloudera Manager 
> page, but not visible on the Spark master. Simply restarting the workers 
> usually resolves this, but we often see workers disappear after a failed or 
> killed job.
> 
> If we see this occur again, I'll try and provide some logs.
> 
> On Mon, May 19, 2014 at 10:51 PM, Matei Zaharia <matei.zaha...@gmail.com> 
> wrote:
> Which version is this with? I haven’t seen standalone masters lose workers. 
> Is there other stuff on the machines that’s killing them, or what errors do 
> you see?
> 
> Matei
> 
> On May 16, 2014, at 9:53 AM, Josh Marcus <jmar...@meetup.com> wrote:
> 
> > Hey folks,
> >
> > I'm wondering what strategies other folks are using for maintaining and 
> > monitoring the stability of standalone Spark clusters.
> >
> > Our master very regularly loses workers, and they (as expected) never 
> > rejoin the cluster.  This is the same behavior I've seen
> > using Akka Cluster (if that's what Spark is using in standalone mode) -- 
> > are there configuration options we could be setting
> > to make the cluster more robust?
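> >
> > For example, would bumping the master/Akka timeouts help? Something like 
> > the following in conf/spark-env.sh on the master -- the property names are 
> > from the standalone/configuration docs of roughly this vintage, and I 
> > haven't verified they're the right knobs or values:
> >
> >     # seconds without a heartbeat before the master marks a worker
> >     # DEAD (spark.worker.timeout defaults to 60 per the docs)
> >     SPARK_MASTER_OPTS="-Dspark.worker.timeout=120"
> >
> >     # Akka communication timeout between Spark nodes, in seconds
> >     SPARK_DAEMON_JAVA_OPTS="-Dspark.akka.timeout=300"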
> >
> > We have a custom script that monitors the number of workers (through the 
> > web interface), restarts the cluster when
> > necessary, and resolves other issues we face (like Spark shells 
> > left open, permanently claiming resources). It
> > works, but it's nowhere close to a great solution.
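> >
> > Roughly, the monitoring half is a sketch like this (not our actual script; 
> > the master host, expected worker count, and install path are placeholders, 
> > and it assumes the master's JSON status page at http://<master>:8080/json, 
> > which lists each worker's state):
> >
> >     #!/usr/bin/env python
> >     # Python 2 sketch: poll the standalone master and restart the
> >     # workers if too few of them report state ALIVE.
> >     import json
> >     import subprocess
> >     import urllib2
> >
> >     MASTER_JSON = "http://spark-master:8080/json"  # placeholder host
> >     EXPECTED_WORKERS = 10                          # placeholder count
> >     SPARK_HOME = "/opt/spark"                      # placeholder path
> >
> >     status = json.load(urllib2.urlopen(MASTER_JSON))
> >     alive = [w for w in status.get("workers", [])
> >              if w.get("state") == "ALIVE"]
> >
> >     if len(alive) < EXPECTED_WORKERS:
> >         print("only %d of %d workers alive; restarting workers"
> >               % (len(alive), EXPECTED_WORKERS))
> >         # the sbin scripts ssh to every host listed in conf/slaves
> >         subprocess.check_call([SPARK_HOME + "/sbin/stop-slaves.sh"])
> >         subprocess.check_call([SPARK_HOME + "/sbin/start-slaves.sh"])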
> >
> > What are other folks doing?  Is this something others observe as 
> > well?  I suspect that the loss of workers is tied to
> > jobs that run out of memory on the client side, or to our use of very large 
> > broadcast variables, but I don't have an isolated test case.
> > I'm open to general answers here: for example, perhaps we should simply be 
> > using Mesos or YARN instead of standalone mode.
> >
> > --j
> >
