Changing subject... On May 10, 2012, at 3:40 PM, Jeffrey Buell wrote:
> I have the right #slots to fill up memory across the cluster, and all those
> slots are filled with tasks. The problem I ran into was that the maps grabbed
> all the slots initially and the reduces had a hard time getting started. As
> maps finished, more maps were started and only rarely was a reduce started.
> I assume this behavior occurred because I had ~4000 map tasks in the queue,
> but only ~100 reduce tasks. If the scheduler lumps maps and reduces together,
> then whenever a slot opens up it will almost surely be taken by a map task.
> To get good performance I need all reduce tasks started early on, and have
> only map tasks compete for open slots. Other apps may need different
> priorities between maps and reduces. In any case, I don't understand how
> treating maps and reduces the same is workable.

Are you playing with YARN or MR1? In any case, you are getting hit by
'slowstart' for reduces, wherein reduces aren't scheduled until a sufficient
percentage of maps have completed.

Set mapred.reduce.slowstart.completed.maps to 0. (That should work for either
MR1 or MR2.)

Arun

> Jeff
>
> From: Arun C Murthy [mailto:a...@hortonworks.com]
> Sent: Thursday, May 10, 2012 1:27 PM
> To: mapreduce-user@hadoop.apache.org
> Subject: Re: max 1 mapper per node
>
> For terasort you want to fill up your entire cluster with maps/reduces as
> fast as you can to get the best performance.
>
> Just play with #slots.
>
> Arun
>
> On May 9, 2012, at 12:36 PM, Jeffrey Buell wrote:
>
> Not to speak for Radim, but what I'm trying to achieve is performance at
> least as good as 0.20 for all cases. That is, no regressions. For something
> as simple as terasort, I don't think that is possible without being able to
> specify the max number of mappers/reducers per node. As it is, I see
> slowdowns of as much as 2X. Hopefully I'm wrong and somebody will straighten
> me out. But if I'm not, adding such a feature won't lead to bad behavior of
> any kind, since the default could be set to unlimited and thus have no
> effect whatsoever.
>
> I should emphasize that I support the goal of greater automation, since
> Hadoop has way too many parameters and is so hard to tune. Just not at the
> expense of performance regressions.
>
> Jeff
>
> We've been against these 'features' since they lead to very bad behaviour
> across the cluster with multiple apps/users etc.
>
> What is your use-case, i.e. what are you trying to achieve with this?
>
> thanks,
> Arun
>
> On May 3, 2012, at 5:59 AM, Radim Kolar wrote:
>
> if a plugin system for the AM is overkill, something simpler could be done,
> like:
>
> maximum number of mappers per node
> maximum number of reducers per node
>
> maximum percentage of non-data-local tasks
> maximum percentage of rack-local tasks
>
> and set this in job properties.
>
> --
> Arun C. Murthy
> Hortonworks Inc.
> http://hortonworks.com/

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/
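
A minimal sketch of Arun's slowstart suggestion, assuming the old (MR1)
JobConf API; the class name, job name, and input/output handling are
placeholders for illustration, and only the slowstart property itself comes
from the thread:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class SlowstartExample {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SlowstartExample.class);
        conf.setJobName("slowstart-example");

        // Launch reduces as soon as reduce slots are free instead of waiting
        // for a fraction of maps to finish (the MR1 default is 0.05).
        // The MR2 name for the same knob is
        // mapreduce.job.reduce.slowstart.completedmaps.
        conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.0f);

        // Placeholder I/O; a real job would also set mapper/reducer classes.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
      }
    }

If the job is driven through ToolRunner (terasort is), the same property can
be passed on the command line as -Dmapred.reduce.slowstart.completed.maps=0.0.
The '#slots' Arun mentions are, in MR1, the per-TaskTracker settings
mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum
in mapred-site.xml.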