Hi Matei,

I'm going to investigate from both the Mesos and Spark sides and will
hopefully have a good long-term solution. In the meantime, having a
workaround to start with is going to unblock folks.

Tim

On Mon, Aug 25, 2014 at 1:08 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> Anyway it would be good if someone from the Mesos side investigates this and
> proposes a solution. The 32 MB per task hack isn't completely foolproof
> either (e.g. people might allocate all the RAM to their executor and thus
> stop being able to launch tasks), so maybe we wait on a Mesos fix for this
> one.
>
> Matei
>
> On August 25, 2014 at 1:07:15 PM, Matei Zaharia (matei.zaha...@gmail.com)
> wrote:
>
> This is kind of weird then -- it seems perhaps unrelated to this issue (or
> at least to the way I understood it). Is the problem maybe that Mesos saw
> 0 MB being freed and didn't re-offer the machine *even though there was more
> than 32 MB free overall*?
>
> Matei
>
> On August 25, 2014 at 12:59:59 PM, Cody Koeninger (c...@koeninger.org)
> wrote:
>
> I definitely saw a case where
>
> a. the only job running was a 256m shell
> b. I started a 2g job
> c. a little while later, the same user as in (a) started another 256m shell
>
> My job immediately stopped making progress. Once user (a) killed his
> shells, it started again.
>
> This is on nodes with ~15G of memory, on which we have successfully run 8G
> jobs.
>
>
> On Mon, Aug 25, 2014 at 2:02 PM, Matei Zaharia <matei.zaha...@gmail.com>
> wrote:
>>
>> BTW it seems to me that even without that patch, you should be getting
>> tasks launched as long as you leave at least 32 MB of memory free on each
>> machine (that is, the sum of the executor memory sizes is not exactly the
>> same as the total size of the machine). Then Mesos will be able to re-offer
>> that machine whenever CPUs free up.
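>>
>> As a concrete sketch of that accounting (the numbers are illustrative,
>> borrowing Cody's ~15G nodes; the names here are just plain arithmetic, not
>> a real API):
>>
>>     // Fine-grained mode, per-node memory bookkeeping (hypothetical numbers)
>>     val nodeTotalMb = 15 * 1024                 // memory the Mesos slave advertises
>>     val executorsMb = Seq(8 * 1024, 256, 256)   // one 8g job + two 256m shells
>>     val freeMb      = nodeTotalMb - executorsMb.sum
>>     // freeMb = 6656 >= 32, so Mesos can keep re-offering this node as CPUs
>>     // free up. If the executors summed to exactly nodeTotalMb, freeMb would
>>     // be 0 and no new tasks could ever launch there.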
>>
>> Matei
>>
>> On August 25, 2014 at 5:05:56 AM, Gary Malouf (malouf.g...@gmail.com)
>> wrote:
>>
>> We have not tried the workaround because there are other bugs in there
>> that affected our setup, though it seems it would help.
>>
>>
>> On Mon, Aug 25, 2014 at 12:54 AM, Timothy Chen <tnac...@gmail.com> wrote:
>>
>> > +1 to having the workaround in.
>> >
>> > I'll be investigating from the Mesos side too.
>> >
>> > Tim
>> >
>> > On Sun, Aug 24, 2014 at 9:52 PM, Matei Zaharia <matei.zaha...@gmail.com>
>> > wrote:
>> > > Yeah, Mesos in coarse-grained mode probably wouldn't work here. It's
>> > > too bad that this happens in fine-grained mode -- would be really good
>> > > to fix. I'll see if we can get the workaround in
>> > > https://github.com/apache/spark/pull/1860 into Spark 1.1. Incidentally
>> > > have you tried that?
>> > >
>> > > Matei
>> > >
>> > > On August 23, 2014 at 4:30:27 PM, Gary Malouf (malouf.g...@gmail.com)
>> > > wrote:
>> > >
>> > > Hi Matei,
>> > >
>> > > We have an analytics team that uses the cluster on a daily basis. They
>> > > use two types of 'run modes':
>> > >
>> > > 1) For running actual queries, they set spark.executor.memory to
>> > > something between 4 and 8 GB of RAM per worker.
>> > >
>> > > 2) A shell that takes a minimal amount of memory on the workers (128 MB)
>> > > for prototyping out a larger query, as sketched below. This lets them
>> > > avoid taking up RAM on the cluster when they do not really need it.
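>> > >
>> > > The two modes are just different values of spark.executor.memory (a
>> > > sketch; the exact numbers and val names are illustrative):
>> > >
>> > >     import org.apache.spark.SparkConf
>> > >
>> > >     // Mode 1: real queries, 4-8 GB of executor memory per worker
>> > >     val queryConf = new SparkConf().set("spark.executor.memory", "8g")
>> > >
>> > >     // Mode 2: prototyping shell with a minimal memory footprint
>> > >     val shellConf = new SparkConf().set("spark.executor.memory", "128m")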
>> > >
>> > > We see the deadlocks when there are a few shells open in either case.
>> > > From the usage patterns we have, coarse-grained mode would be a
>> > > challenge, as we would have to constantly remind people to kill their
>> > > shells as soon as their queries finish.
>> > >
>> > > Am I correct in viewing Mesos in coarse-grained mode as being similar
>> > > to Spark Standalone's CPU allocation behavior?
>> > >
>> > >
>> > >
>> > >
>> > > On Sat, Aug 23, 2014 at 7:16 PM, Matei Zaharia <matei.zaha...@gmail.com>
>> > > wrote:
>> > > Hey Gary, just as a workaround, note that you can use Mesos in
>> > > coarse-grained mode by setting spark.mesos.coarse=true. Then it will
>> > > hold onto CPUs for the duration of the job.
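>> > >
>> > > A minimal sketch of that setup (the mesos://host:5050 master URL is a
>> > > placeholder for your cluster's actual master):
>> > >
>> > >     import org.apache.spark.{SparkConf, SparkContext}
>> > >
>> > >     val conf = new SparkConf()
>> > >       .setMaster("mesos://host:5050")     // placeholder Mesos master URL
>> > >       .set("spark.mesos.coarse", "true")  // hold CPUs for the whole job
>> > >     val sc = new SparkContext(conf)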
>> > >
>> > > Matei
>> > >
>> > > On August 23, 2014 at 7:57:30 AM, Gary Malouf (malouf.g...@gmail.com)
>> > > wrote:
>> > >
>> > > I just wanted to bring up a significant Mesos/Spark issue that makes
>> > > the combo difficult to use for teams larger than 4-5 people. It's
>> > > covered in https://issues.apache.org/jira/browse/MESOS-1688. My
>> > > understanding is that Spark's use of executors in fine-grained mode is
>> > > very different behavior from that of many of the other common Mesos
>> > > frameworks.
>> > >
>> >
>
>
