RE: Questions regarding configuration parameters...

2008-02-22 Thread C G
Guys:
   
  Thanks for the information...I've gotten some pretty good results twiddling 
some parameters.  I've also reminded myself about the pitfalls of 
oversubscribing resources (like the number of reducers).  Here's what I learned, 
written up here to hopefully help somebody later...
   
  I set up one of my apps on a 4-node test grid.  Each grid member is a 4-way 
box.  The configuration had default values (2) for 
mapred.tasktracker.(map,reduce).tasks.maximum.  The values for mapred.map.tasks 
and mapred.reduce.tasks were 29 and 3 respectively (using the prime number 
recommendations in the docs).
   
  The initial run took 00:23:21...not so good.  I changed 
(map,reduce).tasks.maximum to 4 and the time fell to 19:40.  Then I tried 7 and 
it fell to 14:37.  So far so good.
   
  I then looked at my code and realized that I was specifying 32 for the number 
of reducers (damned hard-coded constants...I bop myself on the head and call 
myself a moron).  The large value was based on running on a much larger grid.  
   
  So I backed that value down to 3, and my execution time fell to 09:17.  Then 
I changed (map,reduce).tasks.maximum from 7 to 4 and ran again in 06:48.  w00t!
   
  Bottom line: carefully setting configuration parameters, and keeping the 
map/reduce task counts in line with the size of the grid, is VERY important 
for good performance.
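
  For reference, here is a minimal hadoop-site.xml sketch of the settings that 
produced the 06:48 run above (standard property names; the values are specific 
to my 4-node grid of 4-way boxes, not a general recommendation):

  <!-- site level: per-tasktracker task slots on each 4-way node -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
  </property>

  <!-- job level: total map/reduce tasks (I also set the reducer count in the job itself) -->
  <property>
    <name>mapred.map.tasks</name>
    <value>29</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>3</value>
  </property>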
   
  Thanks,
  C G

Joydeep Sen Sarma <[EMAIL PROTECTED]> wrote:
  
> The default values are 2 so you might only see 2 cores used by Hadoop per
> node/host.

That's 2 each for map and reduce, so theoretically one could fully utilize a 
4-core box with this setting. In practice, a little oversubscription (3 each 
on a 4-core box) seems to be working out well for us - maybe overlapping some 
compute and I/O, but mostly we are trading a higher number of concurrent jobs 
against per-job latency.

It's unlikely that these settings are causing slowness in processing small 
amounts of data. Send more details - what's slow (map/shuffle/reduce)? Check 
CPU consumption while a map task is running, etc.


-----Original Message-----
From: Andy Li [mailto:[EMAIL PROTECTED]
Sent: Thu 2/21/2008 2:36 PM
To: core-user@hadoop.apache.org
Subject: Re: Questions regarding configuration parameters...

Try these two parameters to utilize all the cores per node/host:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>7</value>
  <description>The maximum number of map tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>7</value>
  <description>The maximum number of reduce tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>

The default values are 2, so you might only see 2 cores used by Hadoop per
node/host. If each machine has 4 cores (two dual-core CPUs, say), then you can
change them to 3.

Hope this works for you.

-Andy


On Wed, Feb 20, 2008 at 9:30 AM, C G wrote:

> Hi All:
>
> The documentation for the configuration parameters mapred.map.tasks and
> mapred.reduce.tasks discusses these values in terms of "number of available
> hosts" in the grid. This description strikes me as a bit odd given that a
> "host" could be anything from a uniprocessor to an N-way box, where values
> for N could vary from 2..16 or more. The documentation is also vague about
> computing the actual value. For example, for mapred.map.tasks the doc
> says ".a prime number several times greater.". I'm curious about how people
> are interpreting the descriptions and what values people are using.
> Specifically, I'm wondering if I should be using "core count" instead of
> "host count" to set these values.
>
> In the specific case of my system, we have 24 hosts where each host is a
> 4-way system (i.e. 96 cores total). For mapred.map.tasks I chose the
> value 173, as that is a prime number which is near 7*24. For
> mapred.reduce.tasks I chose 23 since that is a prime number close to 24.
> Is this what was intended?
>
> Beyond curiosity, I'm concerned about setting these values and other
> configuration parameters correctly because I am pursuing some performance
> issues where it is taking a very long time to process small amounts of data.
> I am hoping that some amount of tuning will resolve the problems.
>
> Any thoughts and insights most appreciated.
>
> Thanks,
> C G
>
>
>
>



   

RE: Questions regarding configuration parameters...

2008-02-22 Thread Tim Wintle
I have had exactly the same problem using the command line to cat files -
they can take ages, although I don't know why. Network utilisation does not
seem to be the bottleneck, though.

(Running 0.15.3)

Is the slow part of the reduce while you are waiting for the map output to
copy over to the reducers? I believe there was a bug prior to 0.16.0 that
could leave you waiting for a long time if mappers had been too slow to
respond to previous requests (even if they were completely free now).


On Thu, 2008-02-21 at 21:51 -0800, C G wrote:
> My performance problems fall into 2 categories:
>
>   1.  Extremely slow reduce phases - our map phases march along at impressive 
> speed, but during reduce phases most nodes go idle...the active machines 
> mostly clunk along at 10-30% CPU.  Compare this to the map phase where I get 
> all grid nodes cranking away at > 100% CPU.  This is a vague explanation I 
> realize.
>
>   2.  Pregnant pauses during dfs -copyToLocal and -cat operations.  
> Frequently I'll be iterating over a list of HDFS files cat-ing them into one 
> file to bulk load into a database.  Many times I'll see one of the 
> copies/cats sit for anywhere from 2-5 minutes.  During that time no data is 
> transferred, all nodes are idle, and absolutely nothing is written to any of 
> the logs.  The file sizes being copied are relatively small...less than 1G 
> each in most cases.
>
>   Both of these issues persist in 0.16.0 and definitely have me puzzled.  I'm 
> sure that I'm doing something wrong/non-optimal w/r/t slow reduce phases, but 
> the long pauses during a dfs command-line operation seem like a bug to me.  
> Unfortunately I've not seen anybody else report this.
>
>   Any thoughts/ideas most welcome...
>
>   Thanks,
>   C G
>   
> 
> Joydeep Sen Sarma <[EMAIL PROTECTED]> wrote:
>   
> > The default values are 2 so you might only see 2 cores used by Hadoop per
> > node/host.
> 
> That's 2 each for map and reduce, so theoretically one could fully utilize a 
> 4-core box with this setting. In practice, a little oversubscription (3 each 
> on a 4-core box) seems to be working out well for us - maybe overlapping some 
> compute and I/O, but mostly we are trading a higher number of concurrent jobs 
> against per-job latency.
> 
> It's unlikely that these settings are causing slowness in processing small 
> amounts of data. Send more details - what's slow (map/shuffle/reduce)? Check 
> CPU consumption while a map task is running, etc.
> 
> 
> -----Original Message-----
> From: Andy Li [mailto:[EMAIL PROTECTED]
> Sent: Thu 2/21/2008 2:36 PM
> To: core-user@hadoop.apache.org
> Subject: Re: Questions regarding configuration parameters...
> 
> Try these two parameters to utilize all the cores per node/host:
> 
> <property>
>   <name>mapred.tasktracker.map.tasks.maximum</name>
>   <value>7</value>
>   <description>The maximum number of map tasks that will be run
>   simultaneously by a task tracker.
>   </description>
> </property>
> 
> <property>
>   <name>mapred.tasktracker.reduce.tasks.maximum</name>
>   <value>7</value>
>   <description>The maximum number of reduce tasks that will be run
>   simultaneously by a task tracker.
>   </description>
> </property>
> 
> The default values are 2, so you might only see 2 cores used by Hadoop per
> node/host. If each machine has 4 cores (two dual-core CPUs, say), then you
> can change them to 3.
> 
> Hope this works for you.
> 
> -Andy
> 
> 
> On Wed, Feb 20, 2008 at 9:30 AM, C G wrote:
> 
> > Hi All:
> >
> > The documentation for the configuration parameters mapred.map.tasks and
> > mapred.reduce.tasks discusses these values in terms of "number of available
> > hosts" in the grid. This description strikes me as a bit odd given that a
> > "host" could be anything from a uniprocessor to an N-way box, where values
> > for N could vary from 2..16 or more. The documentation is also vague about
> > computing the actual value. For example, for mapred.map.tasks the doc
> > says ".a prime number several times greater.". I'm curious about how people
> > are interpreting the descriptions and what values people are using.
> > Specifically, I'm wondering if I should be using "core count" instead of
> > "host count" to set these values.
> >
> > In the specific case of my system, we have 24 hosts where each host is a
> > 4-way system (i.e. 96 cores total). For mapred.map.tasks I chose the
> > value 173, as that is a prime number which is near 7*24. For
> > mapred.reduce.tasks I chose 23 since that is a prime number close to 24.
> > Is this what was intended?
> >
> > Beyond curiosity, I'm concerned about setting these values and oth

RE: Questions regarding configuration parameters...

2008-02-21 Thread C G
My performance problems fall into 2 categories:
   
  1.  Extremely slow reduce phases - our map phases march along at impressive 
speed, but during reduce phases most nodes go idle...the active machines mostly 
clunk along at 10-30% CPU.  Compare this to the map phase where I get all grid 
nodes cranking away at > 100% CPU.  This is a vague explanation I realize.
   
  2.  Pregnant pauses during dfs -copyToLocal and -cat operations.  Frequently 
I'll be iterating over a list of HDFS files cat-ing them into one file to bulk 
load into a database.  Many times I'll see one of the copies/cats sit for 
anywhere from 2-5 minutes.  During that time no data is transferred, all nodes 
are idle, and absolutely nothing is written to any of the logs.  The file sizes 
being copied are relatively small...less than 1G each in most cases.
   
  Both of these issues persist in 0.16.0 and definitely have me puzzled.  I'm 
sure that I'm doing something wrong/non-optimal w/r/t slow reduce phases, but 
the long pauses during a dfs command-line operation seem like a bug to me.  
Unfortunately I've not seen anybody else report this.
   
  Any thoughts/ideas most welcome...
   
  Thanks,
  C G
  

Joydeep Sen Sarma <[EMAIL PROTECTED]> wrote:
  
> The default values are 2 so you might only see 2 cores used by Hadoop per
> node/host.

That's 2 each for map and reduce, so theoretically one could fully utilize a 
4-core box with this setting. In practice, a little oversubscription (3 each 
on a 4-core box) seems to be working out well for us - maybe overlapping some 
compute and I/O, but mostly we are trading a higher number of concurrent jobs 
against per-job latency.

It's unlikely that these settings are causing slowness in processing small 
amounts of data. Send more details - what's slow (map/shuffle/reduce)? Check 
CPU consumption while a map task is running, etc.


-----Original Message-----
From: Andy Li [mailto:[EMAIL PROTECTED]
Sent: Thu 2/21/2008 2:36 PM
To: core-user@hadoop.apache.org
Subject: Re: Questions regarding configuration parameters...

Try these two parameters to utilize all the cores per node/host:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>7</value>
  <description>The maximum number of map tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>7</value>
  <description>The maximum number of reduce tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>

The default values are 2, so you might only see 2 cores used by Hadoop per
node/host. If each machine has 4 cores (two dual-core CPUs, say), then you can
change them to 3.

Hope this works for you.

-Andy


On Wed, Feb 20, 2008 at 9:30 AM, C G wrote:

> Hi All:
>
> The documentation for the configuration parameters mapred.map.tasks and
> mapred.reduce.tasks discusses these values in terms of "number of available
> hosts" in the grid. This description strikes me as a bit odd given that a
> "host" could be anything from a uniprocessor to an N-way box, where values
> for N could vary from 2..16 or more. The documentation is also vague about
> computing the actual value. For example, for mapred.map.tasks the doc
> says ".a prime number several times greater.". I'm curious about how people
> are interpreting the descriptions and what values people are using.
> Specifically, I'm wondering if I should be using "core count" instead of
> "host count" to set these values.
>
> In the specific case of my system, we have 24 hosts where each host is a
> 4-way system (i.e. 96 cores total). For mapred.map.tasks I chose the
> value 173, as that is a prime number which is near 7*24. For
> mapred.reduce.tasks I chose 23 since that is a prime number close to 24.
> Is this what was intended?
>
> Beyond curiosity, I'm concerned about setting these values and other
> configuration parameters correctly because I am pursuing some performance
> issues where it is taking a very long time to process small amounts of data.
> I am hoping that some amount of tuning will resolve the problems.
>
> Any thoughts and insights most appreciated.
>
> Thanks,
> C G
>
>
>
>



   

RE: Questions regarding configuration parameters...

2008-02-21 Thread Joydeep Sen Sarma

> The default values are 2 so you might only see 2 cores used by Hadoop per
> node/host.

That's 2 each for map and reduce, so theoretically one could fully utilize a 
4-core box with this setting. In practice, a little oversubscription (3 each 
on a 4-core box) seems to be working out well for us - maybe overlapping some 
compute and I/O, but mostly we are trading a higher number of concurrent jobs 
against per-job latency.
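
As a rough illustration only (tune for your own workload), that mild
oversubscription on a 4-core node would look something like this in
hadoop-site.xml:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>3</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>3</value>
</property>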

It's unlikely that these settings are causing slowness in processing small 
amounts of data. Send more details - what's slow (map/shuffle/reduce)? Check 
CPU consumption while a map task is running, etc.


-----Original Message-----
From: Andy Li [mailto:[EMAIL PROTECTED]
Sent: Thu 2/21/2008 2:36 PM
To: core-user@hadoop.apache.org
Subject: Re: Questions regarding configuration parameters...
 
Try these two parameters to utilize all the cores per node/host:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>7</value>
  <description>The maximum number of map tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>7</value>
  <description>The maximum number of reduce tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>

The default values are 2, so you might only see 2 cores used by Hadoop per
node/host. If each machine has 4 cores (two dual-core CPUs, say), then you can
change them to 3.

Hope this works for you.

-Andy


On Wed, Feb 20, 2008 at 9:30 AM, C G <[EMAIL PROTECTED]> wrote:

> Hi All:
>
>  The documentation for the configuration parameters mapred.map.tasks and
> mapred.reduce.tasks discusses these values in terms of "number of available
> hosts" in the grid.  This description strikes me as a bit odd given that a
> "host" could be anything from a uniprocessor to an N-way box, where values
> for N could vary from 2..16 or more.  The documentation is also vague about
> computing the actual value.  For example, for mapred.map.tasks the doc
> says ".a prime number several times greater.".  I'm curious about how people
> are interpreting the descriptions and what values people are using.
>  Specifically, I'm wondering if I should be using "core count" instead of
> "host count" to set these values.
>
>  In the specific case of my system, we have 24 hosts where each host is a
> 4-way system (i.e. 96 cores total).  For mapred.map.tasks I chose the
> value 173, as that is a prime number which is near 7*24.  For
> mapred.reduce.tasks I chose 23 since that is a prime number close to 24.
>  Is this what was intended?
>
>  Beyond curiosity, I'm concerned about setting these values and other
> configuration parameters correctly because I am pursuing some performance
> issues where it is taking a very long time to process small amounts of data.
>  I am hoping that some amount of tuning will resolve the problems.
>
>  Any thoughts and insights most appreciated.
>
>  Thanks,
>   C G
>
>
>
>



Re: Questions regarding configuration parameters...

2008-02-21 Thread Andy Li
Try these two parameters to utilize all the cores per node/host:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>7</value>
  <description>The maximum number of map tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>7</value>
  <description>The maximum number of reduce tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>

The default values are 2, so you might only see 2 cores used by Hadoop per
node/host. If each machine has 4 cores (two dual-core CPUs, say), then you can
change them to 3.

Hope this works for you.

-Andy


On Wed, Feb 20, 2008 at 9:30 AM, C G <[EMAIL PROTECTED]> wrote:

> Hi All:
>
>  The documentation for the configuration parameters mapred.map.tasks and
> mapred.reduce.tasks discusses these values in terms of "number of available
> hosts" in the grid.  This description strikes me as a bit odd given that a
> "host" could be anything from a uniprocessor to an N-way box, where values
> for N could vary from 2..16 or more.  The documentation is also vague about
> computing the actual value.  For example, for mapred.map.tasks the doc
> says "…a prime number several times greater…".  I'm curious about how people
> are interpreting the descriptions and what values people are using.
>  Specifically, I'm wondering if I should be using "core count" instead of
> "host count" to set these values.
>
>  In the specific case of my system, we have 24 hosts where each host is a
> 4-way system (i.e. 96 cores total).  For mapred.map.tasks I chose the
> value 173, as that is a prime number which is near 7*24.  For
> mapred.reduce.tasks I chose 23 since that is a prime number close to 24.
>  Is this what was intended?
>
>  Beyond curiosity, I'm concerned about setting these values and other
> configuration parameters correctly because I am pursuing some performance
> issues where it is taking a very long time to process small amounts of data.
>  I am hoping that some amount of tuning will resolve the problems.
>
>  Any thoughts and insights most appreciated.
>
>  Thanks,
>   C G
>
>
>
>