I'm not an expert on the capacity scheduler, but the two properties above are not
queue-level configurations, so I don't think the changes take effect when you run
refreshQueues. You would need to restart the RM for the new values to take
effect.


On Thu, Jan 10, 2019 at 7:41 PM Or Raz <> wrote:

> I have googled more about it, and it seems that two parameters
> govern the "bin packing" behavior.
> According to
>   yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled is
> set to true by default, and with
> yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments
> set to -1 the scheduler can assign all the containers the NodeManager "said" it
> can host (which could explain the bin packing on the
> first NodeManager that answers with a heartbeat message).
> Following Apache's instructions, I added the following to my
> *capacity-scheduler.xml* in the hadoop/etc/hadoop folder:
>   <property>
>     <name>yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled</name>
>     <value>true</value>
>     <description>
>       Whether to allow multiple container assignments in one NodeManager
>       heartbeat. Defaults to true.
>     </description>
>   </property>
>   <property>
>     <name>yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments</name>
>     <value>2</value>
>     <description>
>       If multiple-assignments-enabled is true, the maximum amount of
>       containers that can be assigned in one NodeManager heartbeat. Defaults to
>       -1, which sets no limit.
>     </description>
>   </property>
> I have checked the configuration file, and I am using the capacity
> scheduler (I explicitly enabled
> yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled
> just to be sure).
> Furthermore, after running "yarn rmadmin -refreshQueues" I haven't seen
> any change in the allocation of either the mappers or the reducers.
> hadoop2@master:~$ yarn rmadmin -refreshQueues
> 19/01/10 16:06:33 INFO client.RMProxy: Connecting to ResourceManager at
> master/
> What am I missing over here?
> Or
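
To make the effect of the two properties above concrete, here is a rough toy model in Python. It is only a sketch and not the CapacityScheduler's actual code: the function name `simulate`, the strict round-robin heartbeat order, and the idea that every pending container fits on any node are all assumptions made for illustration. It uses the numbers from this thread (three nodes, 31 containers each, 83 mappers).

```python
def simulate(num_nodes, capacity, pending, max_assign):
    """Toy heartbeat loop (hypothetical model, not real YARN code).

    Nodes report in round-robin order; on each heartbeat the scheduler
    grants up to max_assign containers (-1 means no limit), bounded by
    the node's free capacity and the number of pending requests.
    """
    if max_assign == 0:
        raise ValueError("max_assign must be -1 (no limit) or positive")
    free = [capacity] * num_nodes
    assigned = [0] * num_nodes
    while pending > 0 and any(free):
        for node in range(num_nodes):
            limit = free[node] if max_assign < 0 else min(free[node], max_assign)
            grant = min(limit, pending)
            free[node] -= grant
            assigned[node] += grant
            pending -= grant
    return assigned


# No per-heartbeat limit: the first nodes to report fill up completely.
print(simulate(3, 31, 83, -1))  # [31, 31, 21]
# At most 2 containers per heartbeat: the spread is close to even.
print(simulate(3, 31, 83, 2))   # [28, 28, 27]
```

In this model the unlimited setting reproduces the "first 31 on one machine" pattern described further down the thread, while a small per-heartbeat cap evens things out. Whether a real cluster behaves this way depends on the new values actually being loaded by the ResourceManager.
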
> On Wed, Jan 9, 2019 at 23:57, Or Raz <> wrote:
>> Thanks for the tips!
>> Because I deliberately haven't set any scheduler for YARN, I am
>> using the default one (Capacity).
>> I have looked in yarn-site.xml and in the configuration tab (using the
>> JobHistory UI), and neither of the parameters that you mentioned was
>> there (so they haven't been set).
>> You said that I should look at "locality settings". Can you be more
>> specific about what to look at and where?
>> Also, it is worth mentioning that I am using three computers and the
>> replication factor (of HDFS) is three too. Thus, all data (including the input)
>> is on every computer, and the memory of each computer is the same
>> (two t2.xlarge and one m4.xlarge), while I am
>> using DefaultResourceCalculator.
>> Or
>> On Wed, Jan 9, 2019 at 23:28, Aaron Eng <> wrote:
>>> The settings are very relevant to having an equal number of containers
>>> running on each node if you have an idle cluster and want to distribute
>>> containers for a single job.  An application master submits requests for
>>> container allocations to the ResourceManager.  The MRAppMaster will request
>>> all the map containers at once, and the FairScheduler will find NodeManagers
>>> with capacity to fulfill the container requests.  If assign multiple is
>>> enabled then you generally won't get an even number of containers assigned
>>> to each node +/- 1 container.  Before you say it's not relevant, you should
>>> check if your environment uses the FairScheduler and whether multiple
>>> assignment is enabled.  If so, that's likely why there isn't an even
>>> assignment +/- 1 container.  If not using FairScheduler and/or multiple
>>> assign, then you should look at locality settings, which can cause
>>> containers to be preferentially run on a subset of nodes, resulting in an
>>> uneven container assignment per node.
>>> On Wed, Jan 9, 2019 at 2:19 PM Or Raz <> wrote:
>>>> As far as I know, the scheduler in YARN is only scheduling the jobs and
>>>> not the containers inside each job. Therefore, I don't believe it is
>>>> relevant.
>>>> Also, I haven't used or set those two parameters, and I haven't picked
>>>> or set any particular scheduler for my research (Fair, FIFO or Capacity).
>>>> Please correct me if I am wrong.
>>>> P.S. Currently I have no interest in running several jobs
>>>> concurrently. My case is much simpler: one job, for which I would like the
>>>> allocation of containers to be more balanced...
>>>> Or
>>>> On Wed, Jan 9, 2019 at 19:11, Aaron Eng <> wrote:
>>>>> Have you checked the yarn.scheduler.fair.assignmultiple
>>>>> and yarn.scheduler.fair.max.assign parameters for the ResourceManager
>>>>> configuration?
>>>>> On Wed, Jan 9, 2019 at 9:49 AM Or Raz <> wrote:
>>>>>> How can I change/suggest a different allocation of containers to
>>>>>> tasks in Hadoop? Regarding a native Hadoop (2.9.1) cluster on AWS.
>>>>>> I am running a native Hadoop cluster (2.9.1) on AWS (with EC2, not
>>>>>> EMR) and I want the scheduling/allocation of the containers
>>>>>> (mappers/reducers) to be more balanced than it is currently. It seems
>>>>>> that the RM assigns the mappers in a bin-packing way (where the data
>>>>>> resides), while for the reducers it looks more balanced. My setup includes
>>>>>> three machines with a replication factor of three (all the data is on every
>>>>>> machine), and I run my jobs with
>>>>>> mapreduce.job.reduce.slowstart.completedmaps=0 to start the shuffle as
>>>>>> early as possible (it is a hard requirement for me that all the containers
>>>>>> run concurrently). Also, given the EC2 instances
>>>>>> I have chosen and my YARN cluster settings, I can run at most 93
>>>>>> containers (31 per machine).
>>>>>> For example, if I want nine reducers, then 83 containers (93-9-1=83)
>>>>>> are left for the mappers, and one is for the AM. I have
>>>>>> played with the input split size
>>>>>> (mapreduce.input.fileinputformat.split.minsize,
>>>>>> mapreduce.input.fileinputformat.split.maxsize) to find the right balance
>>>>>> where all of the machines have the same "work" in the map phase. But it
>>>>>> seems that the first 31 mappers are allocated on one computer, the
>>>>>> next 31 on the second one and the last 31 on the last machine. Thus, I
>>>>>> can try to use 87 mappers, with 31 of them on Machine #1, another 31 on
>>>>>> Machine #2 and another 25 on Machine #3, leaving the rest for the reducers.
>>>>>> But as Machine #1 and Machine #2 are fully occupied, the reducers
>>>>>> have to be placed on Machine #3. This way I get an almost balanced
>>>>>> allocation of mappers at the expense of an unbalanced allocation of
>>>>>> reducers. And this is not what I want...
>>>>>> # of mappers = size_input / split size [Bytes]
>>>>>> split size = max(mapreduce.input.fileinputformat.split.minsize,
>>>>>>                  min(mapreduce.input.fileinputformat.split.maxsize, dfs.blocksize))
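
For what it's worth, the split-size formula quoted above can be checked with a few lines of Python. This is only a sketch of the arithmetic as stated in the message: the helper names `split_size` and `num_mappers` and the 10 GiB input size are made up for illustration, and the real FileInputFormat computes splits per file, so the actual mapper count can differ slightly.

```python
def split_size(min_size, max_size, block_size):
    # split size = max(minsize, min(maxsize, dfs.blocksize)), all in bytes
    return max(min_size, min(max_size, block_size))


def num_mappers(input_bytes, min_size, max_size, block_size):
    # Ceiling division: a partial trailing split still needs its own mapper.
    size = split_size(min_size, max_size, block_size)
    return -(-input_bytes // size)


GIB = 1024 ** 3
MIB = 1024 ** 2
input_bytes = 10 * GIB   # hypothetical input size
block = 128 * MIB        # assumed dfs.blocksize

# With minsize=1 and an effectively unlimited maxsize, the block size wins.
print(num_mappers(input_bytes, 1, 2 ** 63 - 1, block))  # 80

# To aim for 83 mappers, cap the split at ceil(input / 83) bytes.
target = 83
max_size = -(-input_bytes // target)
print(num_mappers(input_bytes, 1, max_size, block))     # 83
```

Lowering split.maxsize below the block size raises the mapper count, and raising split.minsize above the block size lowers it, which is the knob the original message describes tuning.
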
