I definitely saw a case where:

a. the only job running was a 256m shell
b. I started a 2g job
c. a little while later, the same user as in (a) started another 256m shell
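In CLI terms, that sequence looked roughly like the following (a minimal sketch, assuming Spark 1.0/1.1-era spark-shell/spark-submit flags against a fine-grained Mesos master; the ZooKeeper URL, class, and jar names are illustrative):

    # user a: lightweight prototyping shell, 256 MB executors (Mesos fine-grained mode)
    spark-shell --master mesos://zk://zk1:2181/mesos --executor-memory 256m

    # me: the actual job, 2 GB executors, launched while the shell above sits idle
    spark-submit --master mesos://zk://zk1:2181/mesos --executor-memory 2g \
      --class com.example.AnalyticsJob analytics-job.jar

    # user a again: a second 256 MB shell, a little while later
    spark-shell --master mesos://zk://zk1:2181/mesos --executor-memory 256m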
My job immediately stopped making progress. Once user a killed his shells, it started again. This is on nodes with ~15G of memory, on which we have successfully run 8G jobs.

On Mon, Aug 25, 2014 at 2:02 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> BTW, it seems to me that even without that patch, you should be getting tasks launched as long as you leave at least 32 MB of memory free on each machine (that is, the sum of the executor memory sizes is not exactly the same as the total size of the machine). Then Mesos will be able to re-offer that machine whenever CPUs free up.
>
> Matei
>
> On August 25, 2014 at 5:05:56 AM, Gary Malouf (malouf.g...@gmail.com) wrote:
>
> We have not tried the workaround because there are other bugs in there that affected our setup, though it seems it would help.
>
> On Mon, Aug 25, 2014 at 12:54 AM, Timothy Chen <tnac...@gmail.com> wrote:
>
> > +1 to have the workaround in.
> >
> > I'll be investigating from the Mesos side too.
> >
> > Tim
> >
> > On Sun, Aug 24, 2014 at 9:52 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> > > Yeah, Mesos in coarse-grained mode probably wouldn't work here. It's too bad that this happens in fine-grained mode -- it would be really good to fix. I'll see if we can get the workaround in https://github.com/apache/spark/pull/1860 into Spark 1.1. Incidentally, have you tried that?
> > >
> > > Matei
> > >
> > > On August 23, 2014 at 4:30:27 PM, Gary Malouf (malouf.g...@gmail.com) wrote:
> > >
> > > Hi Matei,
> > >
> > > We have an analytics team that uses the cluster on a daily basis. They use two types of 'run modes':
> > >
> > > 1) For running actual queries, they set spark.executor.memory to something between 4 and 8 GB of RAM per worker.
> > >
> > > 2) A shell that takes a minimal amount of memory on the workers (128 MB), for prototyping a larger query. This lets them avoid taking up RAM on the cluster when they do not really need it.
> > >
> > > We see the deadlocks when there are a few shells open, in either case. Given our usage patterns, coarse-grained mode would be a challenge, as we would have to constantly remind people to kill their shells as soon as their queries finish.
> > >
> > > Am I correct in viewing Mesos in coarse-grained mode as being similar to Spark Standalone's CPU allocation behavior?
> > >
> > > On Sat, Aug 23, 2014 at 7:16 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> > > Hey Gary, just as a workaround, note that you can use Mesos in coarse-grained mode by setting spark.mesos.coarse=true. Then it will hold onto CPUs for the duration of the job.
> > >
> > > Matei
> > >
> > > On August 23, 2014 at 7:57:30 AM, Gary Malouf (malouf.g...@gmail.com) wrote:
> > >
> > > I just wanted to bring up a significant Mesos/Spark issue that makes the combo difficult to use for teams larger than 4-5 people. It's covered in https://issues.apache.org/jira/browse/MESOS-1688. My understanding is that Spark's use of executors in fine-grained mode is very different behavior from many of the other common frameworks for Mesos.
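For reference, the coarse-grained workaround Matei describes upthread amounts to one property (a minimal sketch, assuming a spark-shell that forwards --conf to spark-submit as in Spark 1.1; on earlier versions the properties can go in conf/spark-defaults.conf instead, and the master URL and core/memory numbers are illustrative):

    # Coarse-grained mode: Spark holds its CPUs for the lifetime of the session,
    # which sidesteps the fine-grained starvation described above.
    # spark.cores.max caps the grab so a single shell doesn't take every core.
    spark-shell --master mesos://zk://zk1:2181/mesos \
      --executor-memory 4g \
      --conf spark.mesos.coarse=true \
      --conf spark.cores.max=8

The tradeoff is the one raised in the thread: a coarse-grained shell keeps its CPUs until it exits, so idle shells have to be killed promptly.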