Yup, if you turn off YARN's CPU scheduling then you can run executors to take advantage of the extra memory on the larger boxes. But then some of the nodes will end up severely oversubscribed from a CPU perspective, so I would definitely recommend against that.
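To put rough numbers on that trade-off, here is a minimal Python sketch of how many same-sized executors fit on each kind of box, with and without CPU scheduling. The node specs are the ones Antony mentioned (36GB or 128GB, 10 cores everywhere). The 12GB / 3-vcore container size, the `Node` class and the function names are hypothetical, chosen only for illustration. It also assumes all of a host's RAM and cores are offered to YARN and that the container size already includes any executor memory overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    yarn_memory_gb: int  # memory offered to YARN on this host (assumed: all RAM)
    yarn_vcores: int     # vcores offered to YARN on this host (assumed: all cores)


def executors_per_node(node, container_gb, container_vcores, cpu_scheduling):
    """How many identical containers fit on one node.

    With CPU scheduling off, only memory is counted, so vcores are ignored.
    """
    by_memory = node.yarn_memory_gb // container_gb
    if not cpu_scheduling:
        return by_memory
    by_vcores = node.yarn_vcores // container_vcores
    return min(by_memory, by_vcores)


def cpu_oversubscription(node, container_gb, container_vcores, cpu_scheduling):
    """Ratio of cores requested by the placed executors to cores on the node."""
    n = executors_per_node(node, container_gb, container_vcores, cpu_scheduling)
    return n * container_vcores / node.yarn_vcores


# Specs from the thread; container size is a made-up example.
small = Node(yarn_memory_gb=36, yarn_vcores=10)
large = Node(yarn_memory_gb=128, yarn_vcores=10)
CONTAINER_GB = 12
CONTAINER_VCORES = 3

if __name__ == "__main__":
    for name, node in (("36GB node", small), ("128GB node", large)):
        for cpu_scheduling in (True, False):
            n = executors_per_node(node, CONTAINER_GB, CONTAINER_VCORES, cpu_scheduling)
            ratio = cpu_oversubscription(node, CONTAINER_GB, CONTAINER_VCORES, cpu_scheduling)
            print(f"{name}, CPU scheduling={cpu_scheduling}: "
                  f"{n} executors, {ratio:.1f}x cores requested")
```

Under these assumptions the 128GB box takes 3 executors with CPU scheduling on (most of its memory sits idle) and 10 with it off, at which point 30 cores' worth of executors share 10 physical cores.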
On Fri, Jan 30, 2015 at 3:31 AM, Michael Segel <msegel_had...@hotmail.com> wrote:

> Sorry, but I think there’s a disconnect.
>
> When you launch a job under YARN on any of the Hadoop clusters, the number
> of mappers/reducers is not set and is dependent on the amount of available
> resources.
> So under Ambari, CM, or MapR’s Admin, you should be able to specify the
> amount of resources available on any node which is to be allocated to
> YARN’s RM.
> So if your node has 32GB allocated, you can run N jobs concurrently based
> on the amount of resources you request when you submit your application.
>
> If you have 64GB allocated, you can run up to 2N jobs concurrently based
> on the same memory constraints.
>
> In terms of job scheduling, where and when a job can run is going to be
> based on available resources. So if you want to run a job that needs 16GB
> of resources, and all of your nodes are busy and only have 4GB per node
> available to YARN, your 16GB job will wait until there is at least that
> much available.
>
> To your point, if you say you need 4GB per task, then it must be the same
> per task for that job. The larger the cluster node, in this case memory,
> the more jobs you can run.
>
> This is of course assuming you could oversubscribe a node in terms of CPU
> cores if you have memory available.
>
> YMMV
>
> HTH
> -Mike
>
> On Jan 30, 2015, at 7:10 AM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>
> My answer was based off the specs that Antony mentioned: different amounts
> of memory, but 10 cores on all the boxes. In that case, a single Spark
> application's homogeneously sized executors won't be able to take advantage
> of the extra memory on the bigger boxes.
>
> Cloudera Manager can certainly configure YARN with different resource
> profiles for different nodes if that's what you're wondering.
>
> -Sandy
>
> On Thu, Jan 29, 2015 at 11:03 PM, Michael Segel <msegel_had...@hotmail.com> wrote:
>
>> @Sandy,
>>
>> There are two issues.
>> The Spark context (executor) and then the cluster under YARN.
>>
>> If you have a box where each YARN job needs 3GB, and your machine has
>> 36GB dedicated as a YARN resource, you can run 12 executors on the single
>> node.
>> If you have a box that has 72GB dedicated to YARN, you can run up to 24
>> contexts (executors) in parallel.
>>
>> Assuming that you’re not running any other jobs.
>>
>> The larger issue is whether your version of Hadoop will easily let you run
>> with multiple profiles or not. Ambari (1.6 and earlier) does not. It’s
>> supposed to be fixed in 1.7 but I haven’t evaluated it yet.
>> Cloudera? YMMV
>>
>> If I understood the question raised by the OP, it’s more about a
>> heterogeneous cluster than Spark.
>>
>> -Mike
>>
>> On Jan 26, 2015, at 5:02 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>>
>> Hi Antony,
>>
>> Unfortunately, all executors for any single Spark application must have
>> the same amount of memory. It's possible to configure YARN with different
>> amounts of memory for each host (using
>> yarn.nodemanager.resource.memory-mb), so other apps might be able to take
>> advantage of the extra memory.
>>
>> -Sandy
>>
>> On Mon, Jan 26, 2015 at 8:34 AM, Michael Segel <msegel_had...@hotmail.com> wrote:
>>
>>> If you’re running YARN, then you should be able to mix and match where
>>> YARN is managing the resources available on the node.
>>>
>>> Having said that… it depends on which version of Hadoop/YARN.
>>>
>>> If you’re running Hortonworks and Ambari, then setting up multiple
>>> profiles may not be straightforward. (I haven’t seen the latest version of
>>> Ambari)
>>>
>>> So in theory, one profile would be for your smaller 36GB of RAM, then
>>> one profile for your 128GB sized machines.
>>> Then as you request resources for your Spark job, it should schedule
>>> the jobs based on the cluster’s available resources.
>>> (At least in theory.
>>> I haven’t tried this so YMMV)
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>> On Jan 26, 2015, at 4:25 PM, Antony Mayi <antonym...@yahoo.com.INVALID> wrote:
>>>
>>> should have said I am running as yarn-client. all I can see is
>>> specifying the generic executor memory that is then to be used in all
>>> containers.
>>>
>>> On Monday, 26 January 2015, 16:48, Charles Feduke <charles.fed...@gmail.com> wrote:
>>>
>>> You should look at using Mesos. This should abstract away the individual
>>> hosts into a pool of resources and make the different physical
>>> specifications manageable.
>>>
>>> I haven't tried configuring Spark Standalone mode to have different
>>> specs on different machines but based on spark-env.sh.template:
>>>
>>> # - SPARK_WORKER_CORES, to set the number of cores to use on this machine
>>> # - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
>>> # - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
>>>
>>> it looks like you should be able to mix. (It's not clear to me whether
>>> SPARK_WORKER_MEMORY is uniform across the cluster or for the machine where
>>> the config file resides.)
>>>
>>> On Mon Jan 26 2015 at 8:07:51 AM Antony Mayi <antonym...@yahoo.com.invalid> wrote:
>>>
>>> Hi,
>>>
>>> is it possible to mix hosts with (significantly) different specs within
>>> a cluster (without wasting the extra resources)? for example having 10
>>> nodes with 36GB RAM/10CPUs now trying to add 3 hosts with 128GB/10CPUs - is
>>> there a way to utilize the extra memory by spark executors (as my
>>> understanding is all spark executors must have same memory).
>>>
>>> thanks,
>>> Antony.