Cluster Tuning

2011-07-07 Thread Juan P.
Hi guys!

I'd like some help fine-tuning my cluster. I currently have 20 boxes exactly
alike: single-core machines with 600MB of RAM. No chance of upgrading the
hardware.

My cluster is made up of 1 NameNode/JobTracker box and 19
DataNode/TaskTracker boxes.

All my config is default except I've set the following in my mapred-site.xml
in an effort to keep from choking my boxes.
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>1</value>
  </property>

I'm running a MapReduce job which reads a proxy server log file (2GB), maps
each record to its host, and then in the reduce task accumulates the number
of bytes received from each host.

Currently it's producing about 65,000 keys.

The whole job takes forever to complete, especially the reduce part. I've
tried different tuning configs but I can't bring it down under 20 mins.

Any ideas?

Thanks for your help!
Pony


Re: Cluster Tuning

2011-07-07 Thread Joey Echeverria
Have you tried using a Combiner?

Here's an example of using one:

http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html#Example%3A+WordCount+v1.0
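
For a bytes-per-host job like yours, the reducer itself can usually double as
the combiner, since summing is associative and commutative. Here's a minimal
sketch against the old 0.20 mapred API (the class name and the
Text/LongWritable key-value types are assumptions, since I haven't seen your
code):

  import java.io.IOException;
  import java.util.Iterator;

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  // Sums the byte counts for one host. Because addition is associative and
  // commutative, the same class is safe to use as both combiner and reducer.
  public class ByteSumReducer extends MapReduceBase
      implements Reducer<Text, LongWritable, Text, LongWritable> {

    public void reduce(Text host, Iterator<LongWritable> values,
        OutputCollector<Text, LongWritable> output, Reporter reporter)
        throws IOException {
      long sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(host, new LongWritable(sum));
    }
  }

Then register it with conf.setCombinerClass(ByteSumReducer.class) alongside
your reducer. The combiner runs on each map's local output, so the shuffle
moves one partial sum per host per map instead of one record per log line.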

-Joey


-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434


Re: Cluster Tuning

2011-07-07 Thread Esteban Gutierrez
Hi Pony,

There is a good chance that your boxes are doing some heavy swapping, and
that is a killer for Hadoop. Have you tried
mapred.job.reuse.jvm.num.tasks=-1 and limiting the heap on those boxes as
much as possible?
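
Just as a sketch of what I mean in mapred-site.xml (the heap value is only an
example; you will have to find what actually fits in 600MB):

  <property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>-1</value>  <!-- reuse each task JVM indefinitely to save startup cost -->
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx128m</value>  <!-- example only: keep task heaps small -->
  </property>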

Cheers,
Esteban.

--
Get Hadoop!  http://www.cloudera.com/downloads/


Re: Cluster Tuning

2011-07-07 Thread Ceriasmex
Are you the Esteban I know?





Re: Cluster Tuning

2011-07-08 Thread Juan P.
Hey guys,
Thanks all of you for your help.

Joey,
I tweaked my MapReduce to serialize/deserialize only essential values and
added a combiner, and that helped a lot. Previously I had a domain object
being passed between Mapper and Reducer when I only needed a single value.
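
In case it helps someone else, the mapper now boils down to roughly this
(the field positions are illustrative, not my real log format):

  import java.io.IOException;

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  // Emits (host, bytes) per log line instead of a whole domain object, so far
  // less data gets serialized, shuffled, and deserialized.
  public class HostBytesMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {

    private final Text host = new Text();
    private final LongWritable bytes = new LongWritable();

    public void map(LongWritable offset, Text line,
        OutputCollector<Text, LongWritable> output, Reporter reporter)
        throws IOException {
      String[] fields = line.toString().split("\\s+");
      if (fields.length < 2) {
        return; // skip malformed lines
      }
      try {
        bytes.set(Long.parseLong(fields[1]));
      } catch (NumberFormatException e) {
        return; // skip lines with a non-numeric byte field
      }
      host.set(fields[0]);
      output.collect(host, bytes);
    }
  }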

Esteban,
I think you underestimate the constraints of my cluster. Running multiple
tasks per JVM really kills me in terms of memory. Not to mention that with a
single core there's not much to gain in terms of parallelism (other than
perhaps while a process is waiting on an I/O operation). Still, I gave it a
shot, but even though I kept changing the config I always ended up with a
Java heap space error.

Is it me, or is performance tuning mostly a per-job task? I mean it will, in
the end, depend on the data you are processing (structure, size, whether
it's in one file or many, etc.). If my jobs have different sets of data,
which are in different formats and organized in different file structures,
do you guys recommend moving some of the configuration to Java code?

Thanks!
Pony



Re: Cluster Tuning

2011-07-08 Thread Juan P.
Here's another thought. I realized that the reduce operation in my
map/reduce jobs finishes in a flash, but it goes really slowly until the
mappers end. Is there a way to configure the cluster to make the reduce wait
for the map operations to complete? Especially considering my hardware
constraints.

Thanks!
Pony



Re: Cluster Tuning

2011-07-08 Thread Robert Evans
I doubt it is going to make that much of a difference, even with the hardware
constraints. All that the reduce is doing during this period is downloading
the map output data, doing a merge sort on it, and possibly dumping parts of
it to disk. It may take up some RAM, and if you are swapping a lot then
keeping it from running might buy you a small speed bump, but only if you are
really on the edge of the amount of RAM available to the system. Looking at
how you can reduce the data you transfer and tuning the heap size for the
various JVMs will probably have a bigger impact.
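
On the "reduce the data you transfer" point, one cheap win, if you are not
already doing it, is compressing the map output so the shuffle moves less
data (a sketch; the codec choice is up to you):

  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>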

--Bobby Evans




Re: Cluster Tuning

2011-07-08 Thread Joey Echeverria
Set mapred.reduce.slowstart.completed.maps to a number close to 1.0.
1.0 means the maps have to completely finish before the reduce starts
copying any data. I often run jobs with this set to .90-.95.
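
In mapred-site.xml that would be (0.90 is just one point in that range):

  <property>
    <name>mapred.reduce.slowstart.completed.maps</name>
    <value>0.90</value>
  </property>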

-Joey




-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434


Re: Cluster Tuning

2011-07-08 Thread Bharath Mundlapudi
Slow start is an important parameter and definitely impacts job runtime. My
experience in the past has been that setting this parameter too low or too
high can cause issues with job latencies. If you always run the same job then
it's easy to set the right value, but if your cluster is multi-tenant then
getting this right requires benchmarking different workloads concurrently.

But your case is interesting: you are running on a single core (how many
disks per node?). So setting it toward the higher end of the spectrum, as
Joey suggested, makes sense.


-Bharath

Re: Cluster Tuning

2011-07-11 Thread Juan P.
Hi guys! Here's my mapred-site.xml.
I've tweaked a few properties but it's still taking about 8-10 mins to
process 4GB of data. Thought maybe you guys could find something you'd
comment on.
comment on.
Thanks!
Pony

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>name-node:54311</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>1</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>1</value>
  </property>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.GzipCodec</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx400m</value>
  </property>
  <property>
    <name>map.sort.class</name>
    <value>org.apache.hadoop.util.HeapSort</value>
  </property>
  <property>
    <name>mapred.reduce.slowstart.completed.maps</name>
    <value>0.85</value>
  </property>
  <property>
    <name>mapred.map.tasks.speculative.execution</name>
    <value>false</value>
  </property>
  <property>
    <name>mapred.reduce.tasks.speculative.execution</name>
    <value>false</value>
  </property>
</configuration>


Re: Cluster Tuning

2011-07-11 Thread Juan P.
BTW: Here's the Job Output

https://spreadsheets.google.com/spreadsheet/ccc?key=0Av5N1j_JvusDdDdaTG51OE1FOUptZHg5M1Zxc0FZbHc&hl=en_US


Re: Cluster Tuning

2011-07-11 Thread Allen Wittenauer

On Jul 11, 2011, at 9:28 AM, Juan P. wrote:
>
> <property>
>   <name>mapred.child.java.opts</name>
>   <value>-Xmx400m</value>
> </property>

"Single core machines with 600MB of RAM."

2 x 400m = 800m just for the heaps of the map and reduce tasks, not counting
the other memory the JVM will need. The io buffer sizes aren't adjusted
downward either, so you're likely looking at a swapping + spills = death
scenario. Slowstart set to 1 is going to be pretty much required.
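
Purely as a sketch of the direction (the exact numbers need testing against
your own jobs):

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx200m</value>  <!-- back down to a default-sized heap -->
  </property>
  <property>
    <name>io.sort.mb</name>
    <value>50</value>  <!-- down from the 100MB default; must fit in the child heap -->
  </property>
  <property>
    <name>mapred.reduce.slowstart.completed.maps</name>
    <value>1.0</value>  <!-- don't start reduces until every map has finished -->
  </property>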

Re: Cluster Tuning

2011-07-11 Thread Juan P.
Allen,
Say I were to bring the property back to the default of -Xmx200m, which
buffers do you think I should adjust? io.sort.mb? io.sort.factor? How would
you adjust them?

Thanks for your help!
Pony



Re: Cluster Tuning

2011-07-15 Thread Steve Loughran

On 08/07/2011 16:25, Juan P. wrote:

Here's another thought. I realized that the reduce operation in my
map/reduce jobs finishes in a flash, but it goes really slowly until the
mappers end. Is there a way to configure the cluster to make the reduce wait
for the map operations to complete? Especially considering my hardware
constraints.


Take a look to see if it's usually the same machine that's taking too long;
test your HDDs to see if there are any signs of problems in the SMART
messages. Then turn on speculation. It could be that the problem with a slow
mapper is caused by disk trouble or an overloaded server.
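
Turning speculation back on just means flipping the two properties you
disabled in your mapred-site.xml (a sketch):

  <property>
    <name>mapred.map.tasks.speculative.execution</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.reduce.tasks.speculative.execution</name>
    <value>true</value>
  </property>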