Changing subject... On May 10, 2012, at 3:40 PM, Jeffrey Buell wrote:
> I have the right #slots to fill up memory across the cluster, and all those
> slots are filled with tasks. The problem I ran into was that the maps grabbed
> all the slots initially and the reduces had a hard time getting started. As
> maps finished, more maps were started and only rarely was a reduce started.
> I assume this behavior occurred because I had ~4000 map tasks in the queue,
> but only ~100 reduce tasks. If the scheduler lumps maps and reduces together,
> then whenever a slot opens up it will almost surely be taken by a map task.
> To get good performance I need all reduce tasks started early on, and have
> only map tasks compete for open slots. Other apps may need different
> priorities between maps and reduces. In any case, I don't understand how
> treating maps and reduces the same is workable.

Are you playing with YARN or MR1? In any case, you are getting hit by
'slowstart' for reduces, wherein reduces aren't scheduled until a sufficient
percentage of maps have completed.

Set mapred.reduce.slowstart.completed.maps to 0. (That should work for either
MR1 or MR2.)

Arun

> Jeff
>
> From: Arun C Murthy [mailto:a...@hortonworks.com]
> Sent: Thursday, May 10, 2012 1:27 PM
> To: mapreduce-user@hadoop.apache.org
> Subject: Re: max 1 mapper per node
>
> For terasort you want to fill up your entire cluster with maps/reduces as
> fast as you can to get the best performance.
>
> Just play with #slots.
>
> Arun
>
> On May 9, 2012, at 12:36 PM, Jeffrey Buell wrote:
>
> Not to speak for Radim, but what I'm trying to achieve is performance at
> least as good as 0.20 for all cases. That is, no regressions. For something
> as simple as terasort, I don't think that is possible without being able to
> specify the max number of mappers/reducers per node. As it is, I see
> slowdowns of as much as 2X. Hopefully I'm wrong and somebody will straighten
> me out. But if I'm not, adding such a feature won't lead to bad behavior of
> any kind, since the default could be set to unlimited and thus have no
> effect whatsoever.
>
> I should emphasize that I support the goal of greater automation, since
> Hadoop has way too many parameters and is so hard to tune. Just not at the
> expense of performance regressions.
>
> Jeff
>
> We've been against these 'features' since they lead to very bad behaviour
> across the cluster with multiple apps/users etc.
>
> What is your use-case, i.e. what are you trying to achieve with this?
>
> thanks,
> Arun
>
> On May 3, 2012, at 5:59 AM, Radim Kolar wrote:
>
> if a plugin system for the AM is overkill, something simpler could be done,
> like:
>
> maximum number of mappers per node
> maximum number of reducers per node
>
> maximum percentage of non-data-local tasks
> maximum percentage of rack-local tasks
>
> and set this in job properties.
>
> --
> Arun C. Murthy
> Hortonworks Inc.
> http://hortonworks.com/

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/
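
A minimal sketch of Arun's slowstart suggestion, assuming the old (MR1)
JobConf API; the class name, job name, and input/output handling are
placeholders for illustration, and only the slowstart property itself comes
from the thread:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class SlowstartExample {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SlowstartExample.class);
        conf.setJobName("slowstart-example");

        // Launch reduces as soon as reduce slots are free instead of waiting
        // for a fraction of maps to finish (the MR1 default is 0.05).
        // The MR2 name for the same knob is
        // mapreduce.job.reduce.slowstart.completedmaps.
        conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.0f);

        // Placeholder I/O; a real job would also set mapper/reducer classes.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
      }
    }

If the job is driven through ToolRunner (terasort is), the same property can
be passed on the command line as -Dmapred.reduce.slowstart.completed.maps=0.0.
The '#slots' Arun mentions are, in MR1, the per-TaskTracker settings
mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum
in mapred-site.xml.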