Correction to my previous post: I completely missed https://issues.apache.org/jira/browse/MAPREDUCE-4520, which covers the MR config end already in 2.0.3. My bad :)
On Wed, Mar 20, 2013 at 5:34 AM, Harsh J <ha...@cloudera.com> wrote:
> You can leverage YARN's CPU core scheduling feature for this purpose.
> It was added to the 2.0.3 release via
> https://issues.apache.org/jira/browse/YARN-2 and seems to fit your
> need exactly. However, looking at that patch, it appears that
> param-config support for MR apps wasn't added by it, so it may
> require some work before you can easily leverage it in MRv2.
>
> On MRv1, you can achieve the per-node memory supply vs. requirement
> hack Rahul suggested by using the CapacityScheduler instead. It does
> not have CPU-core-based scheduling directly, though.
>
> On Wed, Mar 20, 2013 at 4:08 AM, jeremy p
> <athomewithagroove...@gmail.com> wrote:
>> The job we need to run executes some third-party code that utilizes
>> multiple cores. The only way the job will get done in a timely
>> fashion is if we give it all the cores available on the machine.
>> This is not a task that can be split up.
>>
>> Yes, I know, it's not ideal, but this is the situation I have to deal with.
>>
>> On Tue, Mar 19, 2013 at 3:15 PM, hari <harib...@gmail.com> wrote:
>>> This may not be what you were looking for, but I was curious when
>>> you mentioned that you would want to run only one map task because
>>> it is CPU-intensive. Map tasks are supposed to be CPU-intensive,
>>> aren't they? If the maximum map slots are 10, that would mean you
>>> have close to 10 cores available on each node. So if you run only
>>> one map task, no matter how CPU-intensive it is, it will only be
>>> able to max out one core, and the remaining 9 cores would go
>>> underutilized. You could still run 9 more map tasks on that machine.
>>>
>>> Or maybe your node's core count is well below 10, in which case you
>>> might be better off setting the mapper slots to a lower value anyway.
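[Editor's note: the CPU scheduling Harsh refers to (YARN-2, with the MR-side
config from MAPREDUCE-4520) is expressed through vcore properties in Hadoop
2.x. A sketch with illustrative values — the property names are from the
Hadoop 2.x defaults, but the vcore counts here are assumptions for a
16-core node:]

```xml
<!-- yarn-site.xml (per node): total vcores the NodeManager advertises.
     Illustrative value for a 16-core machine. -->
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>16</value>
</property>
```

```xml
<!-- Job-level setting for the CPU-hungry job: request all 16 vcores per
     map task, so only one mapper fits on a node at a time. -->
<property>
  <name>mapreduce.map.cpu.vcores</name>
  <value>16</value>
</property>
```

Note that vcore requests are only enforced when the scheduler actually
accounts for CPU (e.g. the CapacityScheduler configured with the
DominantResourceCalculator); otherwise only memory is considered.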
>>> On Tue, Mar 19, 2013 at 5:18 PM, jeremy p
>>> <athomewithagroove...@gmail.com> wrote:
>>>> Thank you for your help.
>>>>
>>>> We're using MRv1. I've tried setting
>>>> mapred.tasktracker.map.tasks.maximum and mapred.map.tasks, and
>>>> neither one helped me at all.
>>>>
>>>> Per-job control is definitely what I need. I need to be able to
>>>> say, "For Job A, only use one mapper per node, but for Job B, use
>>>> 16 mappers per node". I have not found any way to do this.
>>>>
>>>> I will definitely look into schedulers. Are there any examples you
>>>> can point me to where someone does what I'm needing to do?
>>>>
>>>> --Jeremy
>>>>
>>>> On Tue, Mar 19, 2013 at 2:08 PM, Rahul Jain <rja...@gmail.com> wrote:
>>>>> Which version of Hadoop are you using? MRv1 or MRv2 (YARN)?
>>>>>
>>>>> For MRv2 (YARN), you can pretty much achieve this using:
>>>>>
>>>>> yarn.nodemanager.resource.memory-mb (system-wide setting)
>>>>> and
>>>>> mapreduce.map.memory.mb (job-level setting)
>>>>>
>>>>> E.g. if yarn.nodemanager.resource.memory-mb=100
>>>>> and mapreduce.map.memory.mb=40,
>>>>> a maximum of two mappers can run on a node at any time.
>>>>>
>>>>> For MRv1, the equivalent is to control mapper slots on each
>>>>> machine via mapred.tasktracker.map.tasks.maximum; of course, this
>>>>> does not give you per-job control over mappers.
>>>>>
>>>>> In both cases, you can additionally use a scheduler with
>>>>> pools/queues capability to restrict the overall use of grid
>>>>> resources. Do read the fair scheduler and capacity scheduler
>>>>> documentation.
>>>>>
>>>>> -Rahul
>>>>>
>>>>> On Tue, Mar 19, 2013 at 1:55 PM, jeremy p
>>>>> <athomewithagroove...@gmail.com> wrote:
>>>>>> Short version: let's say you have 20 nodes, and each node has 10
>>>>>> mapper slots. You start a job with 20 very small input files.
>>>>>> How is the work distributed to the cluster?
>>>>>> Will it be even, with each node spawning one mapper task? Is
>>>>>> there any way of predicting or controlling how the work will be
>>>>>> distributed?
>>>>>>
>>>>>> Long version: my cluster is currently used for two different
>>>>>> jobs. The cluster is currently optimized for Job A, so each node
>>>>>> has a maximum of 18 mapper slots. However, I also need to run
>>>>>> Job B. Job B is VERY CPU-intensive, so we really only want one
>>>>>> mapper to run on a node at any given time. I've done a bunch of
>>>>>> research, and it doesn't seem like Hadoop gives you any way to
>>>>>> set the maximum number of mappers per node on a per-job basis.
>>>>>> I'm at my wit's end here, and considering some rather egregious
>>>>>> workarounds. If you can think of anything that can help me, I'd
>>>>>> very much appreciate it.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> --Jeremy

--
Harsh J
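[Editor's note: Rahul's memory-ratio trick, applied to the scenario in this
thread on MRv2, might look like the sketch below. The property names are
standard Hadoop 2.x configuration; the 18432 MB figure is an assumption
chosen so that Job A's 1024 MB mappers yield 18 per node (matching the 18
slots mentioned above) while Job B's request fills the whole node:]

```xml
<!-- yarn-site.xml (per node): total memory the NodeManager offers.
     Illustrative: 18 GB, i.e. 18432 MB. -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>18432</value>
</property>
```

```xml
<!-- Job A's configuration: 1024 MB per mapper ->
     floor(18432 / 1024) = 18 concurrent mappers per node. -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>1024</value>
</property>
```

```xml
<!-- Job B's configuration: request the whole node's memory ->
     floor(18432 / 18432) = 1 concurrent mapper per node. -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>18432</value>
</property>
```

Because mapreduce.map.memory.mb is a job-level setting, this gives the
per-job control asked for in the thread, at the cost of Job B reserving
memory it may not actually use.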