I definitely saw a case where:

a. the only job running was a 256m shell
b. I started a 2g job
c. a little while later, the same user as in (a) started another 256m shell
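In CLI terms, that sequence looked roughly like the following (a minimal sketch, assuming Spark 1.0/1.1-era spark-shell/spark-submit flags against a fine-grained Mesos master; the ZooKeeper URL, class, and jar names are illustrative):

    # user a: lightweight prototyping shell, 256 MB executors (Mesos fine-grained mode)
    spark-shell --master mesos://zk://zk1:2181/mesos --executor-memory 256m

    # me: the actual job, 2 GB executors, launched while the shell above sits idle
    spark-submit --master mesos://zk://zk1:2181/mesos --executor-memory 2g \
      --class com.example.AnalyticsJob analytics-job.jar

    # user a again: a second 256 MB shell, a little while later
    spark-shell --master mesos://zk://zk1:2181/mesos --executor-memory 256m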
My job immediately stopped making progress. Once user a killed his shells, it started again. This is on nodes with ~15G of memory, on which we have successfully run 8G jobs.

On Mon, Aug 25, 2014 at 2:02 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> BTW, it seems to me that even without that patch, you should be getting tasks launched as long as you leave at least 32 MB of memory free on each machine (that is, the sum of the executor memory sizes is not exactly the same as the total size of the machine). Then Mesos will be able to re-offer that machine whenever CPUs free up.
>
> Matei
>
> On August 25, 2014 at 5:05:56 AM, Gary Malouf (malouf.g...@gmail.com) wrote:
>
> We have not tried the workaround because there are other bugs in there that affected our setup, though it seems it would help.
>
> On Mon, Aug 25, 2014 at 12:54 AM, Timothy Chen <tnac...@gmail.com> wrote:
>
> > +1 to have the workaround in.
> >
> > I'll be investigating from the Mesos side too.
> >
> > Tim
> >
> > On Sun, Aug 24, 2014 at 9:52 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> > > Yeah, Mesos in coarse-grained mode probably wouldn't work here. It's too bad that this happens in fine-grained mode -- it would be really good to fix. I'll see if we can get the workaround in https://github.com/apache/spark/pull/1860 into Spark 1.1. Incidentally, have you tried that?
> > >
> > > Matei
> > >
> > > On August 23, 2014 at 4:30:27 PM, Gary Malouf (malouf.g...@gmail.com) wrote:
> > >
> > > Hi Matei,
> > >
> > > We have an analytics team that uses the cluster on a daily basis. They use two types of 'run modes':
> > >
> > > 1) For running actual queries, they set spark.executor.memory to something between 4 and 8 GB of RAM per worker.
> > >
> > > 2) A shell that takes a minimal amount of memory on the workers (128 MB), for prototyping a larger query. This lets them avoid taking up RAM on the cluster when they do not really need it.
> > >
> > > We see the deadlocks when there are a few shells open, in either case. Given our usage patterns, coarse-grained mode would be a challenge, as we would have to constantly remind people to kill their shells as soon as their queries finish.
> > >
> > > Am I correct in viewing Mesos in coarse-grained mode as being similar to Spark Standalone's CPU allocation behavior?
> > >
> > > On Sat, Aug 23, 2014 at 7:16 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> > > Hey Gary, just as a workaround, note that you can use Mesos in coarse-grained mode by setting spark.mesos.coarse=true. Then it will hold onto CPUs for the duration of the job.
> > >
> > > Matei
> > >
> > > On August 23, 2014 at 7:57:30 AM, Gary Malouf (malouf.g...@gmail.com) wrote:
> > >
> > > I just wanted to bring up a significant Mesos/Spark issue that makes the combo difficult to use for teams larger than 4-5 people. It's covered in https://issues.apache.org/jira/browse/MESOS-1688. My understanding is that Spark's use of executors in fine-grained mode is very different behavior from many of the other common frameworks for Mesos.
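For reference, the coarse-grained workaround Matei describes upthread amounts to one property (a minimal sketch, assuming a spark-shell that forwards --conf to spark-submit as in Spark 1.1; on earlier versions the properties can go in conf/spark-defaults.conf instead, and the master URL and core/memory numbers are illustrative):

    # Coarse-grained mode: Spark holds its CPUs for the lifetime of the session,
    # which sidesteps the fine-grained starvation described above.
    # spark.cores.max caps the grab so a single shell doesn't take every core.
    spark-shell --master mesos://zk://zk1:2181/mesos \
      --executor-memory 4g \
      --conf spark.mesos.coarse=true \
      --conf spark.cores.max=8

The tradeoff is the one raised in the thread: a coarse-grained shell keeps its CPUs until it exits, so idle shells have to be killed promptly.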