Hi Josh,

I see. I haven't heard of folks using a JVM heap size larger than the one you mentioned (30GB), but in your scenario what you're proposing does make sense.

I've created SPARK-5095 and we can continue our discussion there about how to address this.

Tim
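A minimal sketch of the coarse-grained Mesos configuration surface under discussion, in Scala. The first three properties are real Spark settings; the commented-out one is hypothetical, the kind of knob SPARK-5095 asks for, and the ZooKeeper address is a placeholder:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("mesos://zk://zkhost:2181/mesos") // placeholder ZK address
      .set("spark.mesos.coarse", "true")    // one long-lived executor per slave
      .set("spark.cores.max", "256")        // total cores for the job, cluster-wide
      .set("spark.executor.memory", "30g")  // heap per executor (the 30GB ceiling)
      // Hypothetical, NOT a real property -- the missing knob this thread is about:
      // .set("spark.mesos.executors.max", "16")

    val sc = new SparkContext(conf)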
On Mon, Jan 5, 2015 at 1:22 AM, Josh Devins <j...@soundcloud.com> wrote:

> Hey Tim, sorry for the delayed reply, been on vacation for a couple of weeks.
>
> The reason we want to control the number of executors is that running
> executors with JVM heaps over 30GB causes significant garbage
> collection problems. We have observed this through much trial and error
> for jobs with a dozen or so stages that run for more than ~20 minutes.
> For example, if we run 8 executors with a 60GB heap each (we have also
> tried other values larger than 30GB), even after much tuning of heap
> parameters (% for RDD cache, etc.) we run into GC problems. GC overhead
> effectively becomes so high that it takes over all compute time from
> the JVM. If we then halve the heap (30GB) but double the number of
> executors (16), all GC problems are relieved and we get to use the full
> memory resources of the cluster. We talked with some engineers from
> Databricks at Strata in Barcelona recently and received the same
> advice: do not run executors with heaps larger than 30GB. Since our
> machines have 64GB of memory and we typically run only one or two jobs
> at a time on the cluster (for now), we can only use half the cluster
> memory with the current configuration options available in Mesos.
>
> Happy to hear your thoughts. I'm actually very curious how others are
> running Spark on Mesos with large heaps (as a result of large-memory
> machines). Perhaps this is a non-issue once we have more multi-tenancy
> in the cluster, but for now, this is not the case.
>
> Thanks,
>
> Josh
>
>
> On 24 December 2014 at 06:22, Tim Chen <t...@mesosphere.io> wrote:
> >
> > Hi Josh,
> >
> > If you cap the amount of memory per executor in coarse-grained mode,
> > then yes, you only get 240GB of memory as you mentioned. What's the
> > reason you don't want to raise the amount of memory you use per
> > executor?
> >
> > In coarse-grained mode the Spark executor is long-lived and internally
> > receives tasks distributed by Spark's coarse-grained scheduler. I
> > think the assumption is that it has already allocated the maximum
> > available on that slave, so we don't really assume we need another one.
> >
> > I think it's worth considering a configuration for the number of cores
> > per executor, especially once Mesos has inverse offers and optimistic
> > offers: we could then choose to launch more executors when resources
> > become available, even in coarse-grained mode, and support giving the
> > executors back when higher-priority tasks arrive.
> >
> > In fine-grained mode, the Spark executors are started by Mesos
> > executors configured from the Mesos scheduler backend. I believe RDDs
> > stay cached as long as the Mesos executor is running, since the
> > BlockManager is created on executor registration.
> >
> > Let me know if you need any more info.
> >
> > Tim
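To make Tim's caching point concrete, a small sketch assuming `sc` is a SparkContext like the one above; the HDFS path and parsing are placeholders. A cached RDD lives in each executor's BlockManager, so it survives only as long as that executor does:

    val events = sc.textFile("hdfs:///data/events")  // placeholder path
      .map(_.split("\t"))
      .cache()             // partitions stored in each executor's BlockManager

    events.count()         // first action materializes the cache
    events.filter(_.length > 1).count()  // reuses cached partitions only if
                                         // the executors (and caches) survived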
> >
> >> ---------- Forwarded message ----------
> >> From: Josh Devins <j...@soundcloud.com>
> >> Date: 22 December 2014 at 17:23
> >> Subject: Mesos resource allocation
> >> To: user@spark.apache.org
> >>
> >>
> >> We are experimenting with running Spark on Mesos after running
> >> successfully in Standalone mode for a few months. With the Standalone
> >> resource manager (as well as YARN), you have the option to define the
> >> number of cores, the number of executors, and memory per executor. In
> >> Mesos, however, it appears that you cannot specify the number of
> >> executors, even in coarse-grained mode. If this is the case, how do
> >> you control the number of executors a job runs with?
> >>
> >> Here's an example of why this matters (to us). Let's say we have the
> >> following cluster:
> >>
> >> num nodes: 8
> >> num cores: 256 (32 per node)
> >> total memory: 512GB (64GB per node)
> >>
> >> If I set my job to require 256 cores and the per-executor memory to
> >> 30GB, then Mesos will schedule a single executor per machine (8
> >> executors total) and each executor will get 32 cores to work with.
> >> This means we have 8 executors * 30GB each, for a total of 240GB of
> >> cluster memory in use, less than half of what is available. If you
> >> actually want 16 executors in order to increase the amount of memory
> >> in use across the cluster, how can you do this with Mesos? It seems
> >> that a parameter is missing (or I haven't found it yet) which would
> >> let me tune this for Mesos:
> >> * number of executors per n cores, OR
> >> * total number of executors
> >>
> >> Furthermore, in fine-grained mode in Mesos, how are the executors
> >> started/allocated? That is, since Spark tasks map to Mesos tasks, when
> >> and how are executors started? If they are transient and an executor
> >> is created per task, does this mean we cannot have cached RDDs?
> >>
> >> Thanks for any advice or pointers,
> >>
> >> Josh
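For concreteness, the arithmetic behind Josh's example, with all numbers taken from the thread:

    val nodes        = 8
    val coresPerNode = 32
    val memPerNodeGb = 64
    val heapGb       = 30                   // GC-safe heap ceiling per executor

    val totalGb    = nodes * memPerNodeGb   // 512 GB in the cluster
    val usedWith8  = 8  * heapGb            // 240 GB -> ~47% of cluster memory
    val usedWith16 = 16 * heapGb            // 480 GB -> ~94%, but requires
                                            // two 30GB executors per node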