Correction to my previous post: I completely missed https://issues.apache.org/jira/browse/MAPREDUCE-4520, which covers the MR config end already in 2.0.3. My bad :)
On Wed, Mar 20, 2013 at 5:34 AM, Harsh J <ha...@cloudera.com> wrote:
> You can leverage YARN's CPU core scheduling feature for this purpose.
> It was added to the 2.0.3 release via
> https://issues.apache.org/jira/browse/YARN-2 and seems to fit your
> need exactly. However, looking at that patch, it appears that
> param-config support for MR apps wasn't added by it, so it may
> require some work before you can easily leverage it in MRv2.
>
> On MRv1, you can achieve the per-node memory supply vs. requirement
> hack Rahul suggested by using the CapacityScheduler instead. It does
> not have CPU-core-based scheduling directly, though.
>
> On Wed, Mar 20, 2013 at 4:08 AM, jeremy p
> <athomewithagroove...@gmail.com> wrote:
>> The job we need to run executes some third-party code that utilizes
>> multiple cores. The only way the job will get done in a timely
>> fashion is if we give it all the cores available on the machine.
>> This is not a task that can be split up.
>>
>> Yes, I know, it's not ideal, but this is the situation I have to deal with.
>>
>> On Tue, Mar 19, 2013 at 3:15 PM, hari <harib...@gmail.com> wrote:
>>> This may not be what you were looking for, but I was curious when
>>> you mentioned that you would want to run only one map task because
>>> it is CPU-intensive. Map tasks are supposed to be CPU-intensive,
>>> aren't they? If the maximum map slots are 10, that would mean you
>>> have close to 10 cores available on each node. So if you run only
>>> one map task, no matter how CPU-intensive it is, it will only be
>>> able to max out one core, and the remaining 9 cores would go
>>> underutilized. You could still run 9 more map tasks on that machine.
>>>
>>> Or maybe your node's core count is well below 10, in which case you
>>> might be better off setting the mapper slots to a lower value anyway.
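[Editor's note: the CPU scheduling Harsh refers to (YARN-2, with the MR-side
config from MAPREDUCE-4520) is expressed through vcore properties in Hadoop
2.x. A sketch with illustrative values — the property names are from the
Hadoop 2.x defaults, but the vcore counts here are assumptions for a
16-core node:]

```xml
<!-- yarn-site.xml (per node): total vcores the NodeManager advertises.
     Illustrative value for a 16-core machine. -->
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>16</value>
</property>
```

```xml
<!-- Job-level setting for the CPU-hungry job: request all 16 vcores per
     map task, so only one mapper fits on a node at a time. -->
<property>
  <name>mapreduce.map.cpu.vcores</name>
  <value>16</value>
</property>
```

Note that vcore requests are only enforced when the scheduler actually
accounts for CPU (e.g. the CapacityScheduler configured with the
DominantResourceCalculator); otherwise only memory is considered.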
>>> On Tue, Mar 19, 2013 at 5:18 PM, jeremy p
>>> <athomewithagroove...@gmail.com> wrote:
>>>> Thank you for your help.
>>>>
>>>> We're using MRv1. I've tried setting
>>>> mapred.tasktracker.map.tasks.maximum and mapred.map.tasks, and
>>>> neither one helped me at all.
>>>>
>>>> Per-job control is definitely what I need. I need to be able to
>>>> say, "For Job A, only use one mapper per node, but for Job B, use
>>>> 16 mappers per node". I have not found any way to do this.
>>>>
>>>> I will definitely look into schedulers. Are there any examples you
>>>> can point me to where someone does what I'm needing to do?
>>>>
>>>> --Jeremy
>>>>
>>>> On Tue, Mar 19, 2013 at 2:08 PM, Rahul Jain <rja...@gmail.com> wrote:
>>>>> Which version of Hadoop are you using? MRv1 or MRv2 (YARN)?
>>>>>
>>>>> For MRv2 (YARN), you can pretty much achieve this using:
>>>>>
>>>>> yarn.nodemanager.resource.memory-mb (system-wide setting)
>>>>> and
>>>>> mapreduce.map.memory.mb (job-level setting)
>>>>>
>>>>> E.g. if yarn.nodemanager.resource.memory-mb=100
>>>>> and mapreduce.map.memory.mb=40,
>>>>> a maximum of two mappers can run on a node at any time.
>>>>>
>>>>> For MRv1, the equivalent is to control mapper slots on each
>>>>> machine via mapred.tasktracker.map.tasks.maximum; of course, this
>>>>> does not give you per-job control over mappers.
>>>>>
>>>>> In both cases, you can additionally use a scheduler with
>>>>> pools/queues capability to restrict the overall use of grid
>>>>> resources. Do read the fair scheduler and capacity scheduler
>>>>> documentation.
>>>>>
>>>>> -Rahul
>>>>>
>>>>> On Tue, Mar 19, 2013 at 1:55 PM, jeremy p
>>>>> <athomewithagroove...@gmail.com> wrote:
>>>>>> Short version: let's say you have 20 nodes, and each node has 10
>>>>>> mapper slots. You start a job with 20 very small input files.
>>>>>> How is the work distributed to the cluster?
>>>>>> Will it be even, with each node spawning one mapper task? Is
>>>>>> there any way of predicting or controlling how the work will be
>>>>>> distributed?
>>>>>>
>>>>>> Long version: my cluster is currently used for two different
>>>>>> jobs. The cluster is currently optimized for Job A, so each node
>>>>>> has a maximum of 18 mapper slots. However, I also need to run
>>>>>> Job B. Job B is VERY CPU-intensive, so we really only want one
>>>>>> mapper to run on a node at any given time. I've done a bunch of
>>>>>> research, and it doesn't seem like Hadoop gives you any way to
>>>>>> set the maximum number of mappers per node on a per-job basis.
>>>>>> I'm at my wit's end here, and considering some rather egregious
>>>>>> workarounds. If you can think of anything that can help me, I'd
>>>>>> very much appreciate it.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> --Jeremy

--
Harsh J
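[Editor's note: Rahul's memory-ratio trick, applied to the scenario in this
thread on MRv2, might look like the sketch below. The property names are
standard Hadoop 2.x configuration; the 18432 MB figure is an assumption
chosen so that Job A's 1024 MB mappers yield 18 per node (matching the 18
slots mentioned above) while Job B's request fills the whole node:]

```xml
<!-- yarn-site.xml (per node): total memory the NodeManager offers.
     Illustrative: 18 GB, i.e. 18432 MB. -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>18432</value>
</property>
```

```xml
<!-- Job A's configuration: 1024 MB per mapper ->
     floor(18432 / 1024) = 18 concurrent mappers per node. -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>1024</value>
</property>
```

```xml
<!-- Job B's configuration: request the whole node's memory ->
     floor(18432 / 18432) = 1 concurrent mapper per node. -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>18432</value>
</property>
```

Because mapreduce.map.memory.mb is a job-level setting, this gives the
per-job control asked for in the thread, at the cost of Job B reserving
memory it may not actually use.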