Re: Allocation of containers to tasks in Hadoop

Or Raz Thu, 10 Jan 2019 06:11:34 -0800

I have googled more about it, and it seems that two parameters govern
the "bin packing" behaviour.
According to
https://hadoop.apache.org/docs/r2.9.1/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html#Other_Properties
yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled is
set to true by default, and with
yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments
set to -1 the scheduler can assign as many containers as the NodeManager
"said" it has capacity for (which could explain the bin packing on the
first NodeManager that answers with a heartbeat message).
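
To make the suspected effect concrete, here is a toy model in Python. It is only a sketch under strong assumptions and is not the Capacity Scheduler's actual algorithm: it assumes the nodes heartbeat in strict round-robin order and ignores locality, queue limits, the AM container and the reducers. The function name simulate and its parameters are hypothetical; the figures (31 containers per node, 83 mapper requests) are the ones from the original question quoted further down.

```python
def simulate(pending, capacity, max_per_heartbeat):
    """Toy model of per-heartbeat assignment (NOT the real scheduler).

    pending           -- number of outstanding container requests
    capacity          -- list with the container capacity of each node
    max_per_heartbeat -- cap per heartbeat; a negative value means no limit
    """
    assigned = [0] * len(capacity)
    # Keep cycling through the nodes while requests remain and a node has room.
    while pending > 0 and any(a < c for a, c in zip(assigned, capacity)):
        for node, cap in enumerate(capacity):
            free = cap - assigned[node]
            limit = free if max_per_heartbeat < 0 else min(free, max_per_heartbeat)
            grant = min(limit, pending)
            assigned[node] += grant
            pending -= grant
    return assigned


# No limit: the first nodes to heartbeat fill up completely.
print(simulate(83, [31, 31, 31], -1))  # [31, 31, 21]
# At most two containers per heartbeat: the load spreads out.
print(simulate(83, [31, 31, 31], 2))   # [28, 28, 27]
```

If this model is even roughly right, a small maximum-container-assignments value should move the mapper allocation from the first pattern towards the second.
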
Following Apache's instructions, I have added the following to my
*capacity-scheduler.xml* in the hadoop/etc/hadoop folder:


  <property>
    <name>yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled</name>
    <value>true</value>
    <description>
        Whether to allow multiple container assignments in one NodeManager
        heartbeat. Defaults to true.
    </description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments</name>
    <value>2</value>
    <description>
        If multiple-assignments-enabled is true, the maximum amount of
        containers that can be assigned in one NodeManager heartbeat. Defaults to
        -1, which sets no limit.
    </description>
  </property>
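
One way to double-check that both properties are really present in the file is to parse it. The following is a minimal Python sketch, assuming the two <property> elements sit inside the usual <configuration> root element of a Hadoop config file. The helper name heartbeat_properties and the SAMPLE string are hypothetical and used only for illustration; for the real file, xml.etree.ElementTree.parse() on its path would be the equivalent starting point.

```python
import xml.etree.ElementTree as ET

PREFIX = "yarn.scheduler.capacity.per-node-heartbeat."


def heartbeat_properties(xml_text):
    """Return the per-node-heartbeat properties found in a Hadoop config."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name", default="").strip()
        if name.startswith(PREFIX):
            props[name] = prop.findtext("value", default="").strip()
    return props


# Illustrative stand-in for capacity-scheduler.xml (assumed layout).
SAMPLE = """<configuration>
  <property>
    <name>yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments</name>
    <value>2</value>
  </property>
</configuration>"""

for name, value in heartbeat_properties(SAMPLE).items():
    print(name, "=", value)
```
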
I have checked the configuration file, and I am using the Capacity
Scheduler (I have set
yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled
explicitly just to be sure).
However, after running "yarn rmadmin -refreshQueues" I haven't seen
any change in the allocation of either the Mappers or the Reducers.
hadoop2@master:~$ yarn rmadmin -refreshQueues
19/01/10 16:06:33 INFO client.RMProxy: Connecting to ResourceManager at master/172.31.24.83:8033

What am I missing here?

Or


On Wed, Jan 9, 2019 at 23:57 Or Raz <r...@post.bgu.ac.il> wrote:

> Thanks for the tips!
> Because I haven't set any scheduler for YARN (on purpose), I am using
> the default one (Capacity).
> I have looked in yarn-site.xml and in the configuration tab (using the
> JobHistory UI), and neither of the parameters that you mentioned was
> there (so they haven't been set).
> You said that I should look at "locality settings"; can you be more
> specific about what to look for and where?
> Also, it is worth mentioning that I am using three computers and the
> HDFS replication factor is three too. Thus, all the data (even the input)
> is on every computer, and the memory of each computer is the same
> (two t2.xlarge and one m4.xlarge), while I am
> using DefaultResourceCalculator.
>
> Or
>
> On Wed, Jan 9, 2019 at 23:28 Aaron Eng <a...@mapr.com> wrote:
>
>> The settings are very relevant to having an equal number of containers
>> running on each node if you have an idle cluster and want to distribute
>> containers for a single job.  An application master submits requests for
>> container allocations to the ResourceManager.  The MRAppMaster will request
>> all the map containers at once, and the FairScheduler will find NodeManagers
>> with capacity to fulfill the container requests.  If assign multiple is
>> enabled then you generally won't get an even number of containers assigned
>> to each node +/- 1 container.  Before you say it's not relevant, you should
>> check if your environment uses the FairScheduler and whether multiple
>> assignment is enabled.  If so, that's likely why there isn't an even
>> assignment +/- 1 container.  If not using FairScheduler and/or multiple
>> assign, then you should look at locality settings, which can cause
>> containers to be preferentially run on a subset of nodes, resulting in an
>> uneven container assignment per node.
>>
>> On Wed, Jan 9, 2019 at 2:19 PM Or Raz <r...@post.bgu.ac.il> wrote:
>>
>>> As far as I know, the scheduler in YARN only schedules the jobs and
>>> not the containers inside each job. Therefore, I don't believe it is
>>> relevant.
>>> Also, I haven't used or set those two parameters, and I haven't picked
>>> or set any particular scheduler for my research (Fair, FIFO or Capacity).
>>> Please correct me if I am wrong.
>>> P.S. Currently I have no interest in a situation where I run a few jobs
>>> concurrently; my case is much simpler, with one job whose allocation of
>>> containers I would like to be more balanced...
>>> Or
>>>
>>>
>>> On Wed, Jan 9, 2019 at 19:11 Aaron Eng <a...@mapr.com> wrote:
>>>
>>>> Have you checked the yarn.scheduler.fair.assignmultiple
>>>> and yarn.scheduler.fair.max.assign parameters for the ResourceManager
>>>> configuration?
>>>>
>>>> On Wed, Jan 9, 2019 at 9:49 AM Or Raz <r...@post.bgu.ac.il> wrote:
>>>>
>>>>> How can I change/suggest a different allocation of containers to tasks
>>>>> in Hadoop? This concerns a native Hadoop (2.9.1) cluster on AWS.
>>>>>
>>>>> I am running a native Hadoop cluster (2.9.1) on AWS (with EC2, not
>>>>> EMR) and I want the scheduling/allocation of the containers
>>>>> (Mappers/Reducers) to be more balanced than it currently is. It seems
>>>>> like the RM is assigning the Mappers in a bin-packing way (where the data
>>>>> resides), while for the reducers it looks more balanced. My setup includes
>>>>> three machines with a replication factor of three (all the data is on
>>>>> every machine), and I run my jobs with
>>>>> mapreduce.job.reduce.slowstart.completedmaps=0 to start the shuffle as
>>>>> early as possible (it is vital for me that all the containers run
>>>>> concurrently; it is a hard requirement). Also, given the EC2 instances
>>>>> I have chosen and my YARN cluster settings, I can run at most 93
>>>>> containers (31 each).
>>>>>
>>>>> For example, if I want to have nine reducers, then 93-9-1=83
>>>>> containers are left for the mappers, since one is for the AM. I have
>>>>> played with the input split size
>>>>> (mapreduce.input.fileinputformat.split.minsize,
>>>>> mapreduce.input.fileinputformat.split.maxsize) to find the right balance
>>>>> where all of the machines have the same "work" for the map phase. But it
>>>>> seems like the first 31 mappers are allocated on one computer, the
>>>>> next 31 on the second one and the last 31 on the last machine. Thus, I can
>>>>> try to use 87 mappers, with 31 of them on Machine #1, another 31 on
>>>>> Machine #2 and another 25 on Machine #3, leaving the rest for the
>>>>> reducers; since Machine #1 and Machine #2 are fully occupied, the reducers
>>>>> would have to be placed on Machine #3. This way I get an almost balanced
>>>>> allocation of mappers at the expense of an unbalanced allocation of
>>>>> reducers. And this is not what I want...
>>>>>
>>>>> # of mappers = input size / split size [bytes]
>>>>>
>>>>> split size = max(mapreduce.input.fileinputformat.split.minsize,
>>>>>                  min(mapreduce.input.fileinputformat.split.maxsize,
>>>>>                      dfs.blocksize))
>>>>>
>>>>
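
The two formulas quoted at the end of the thread can be checked numerically. Below is a minimal Python sketch; the function names and the sample sizes (128 MiB block, 64 MiB maximum split) are hypothetical and chosen only for illustration. It treats the input as a single file and rounds up, so it is an approximation: the real FileInputFormat computes splits per file and may handle the final split slightly differently.

```python
MIB = 1024 * 1024


def split_size(min_size, max_size, block_size):
    """max(minsize, min(maxsize, blocksize)), as in the quoted formula."""
    return max(min_size, min(max_size, block_size))


def num_mappers(input_size, split):
    """Approximate mapper count for one input file: ceil(input / split)."""
    return -(-input_size // split)  # integer ceiling division


# Hypothetical settings: 128 MiB HDFS blocks, splits capped at 64 MiB.
split = split_size(min_size=1, max_size=64 * MIB, block_size=128 * MIB)
print(split // MIB)                        # 64
# An input of 83 * 64 MiB would then yield the 83 mappers from the example.
print(num_mappers(83 * 64 * MIB, split))   # 83
```
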
