Re: Cluster Tuning

2011-07-11 Thread Juan P.
Hi guys! Here's my mapred-site.xml
I've tweaked a few properties but it's still taking about 8-10 minutes to
process 4GB of data. Thought maybe you guys could find something you'd
comment on.
Thanks!
Pony

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>name-node:54311</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>1</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>1</value>
  </property>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.GzipCodec</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx400m</value>
  </property>
  <property>
    <name>map.sort.class</name>
    <value>org.apache.hadoop.util.HeapSort</value>
  </property>
  <property>
    <name>mapred.reduce.slowstart.completed.maps</name>
    <value>0.85</value>
  </property>
  <property>
    <name>mapred.map.tasks.speculative.execution</name>
    <value>false</value>
  </property>
  <property>
    <name>mapred.reduce.tasks.speculative.execution</name>
    <value>false</value>
  </property>
</configuration>
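
A quick way to double-check which of these values the job client actually picks up at submission time (a minimal diagnostic sketch using the standard JobConf API; only the property names come from the mail above, the class name is a placeholder):

import org.apache.hadoop.mapred.JobConf;

public class DumpTuningProps {
    public static void main(String[] args) {
        // Loads the *-site.xml files from the classpath, the same way a job client would.
        JobConf conf = new JobConf(DumpTuningProps.class);
        String[] props = {
            "mapred.job.tracker",
            "mapred.tasktracker.map.tasks.maximum",
            "mapred.tasktracker.reduce.tasks.maximum",
            "mapred.compress.map.output",
            "mapred.map.output.compression.codec",
            "mapred.child.java.opts",
            "map.sort.class",
            "mapred.reduce.slowstart.completed.maps",
            "mapred.map.tasks.speculative.execution",
            "mapred.reduce.tasks.speculative.execution"
        };
        for (String p : props) {
            System.out.println(p + " = " + conf.get(p));
        }
    }
}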

On Fri, Jul 8, 2011 at 4:21 PM, Bharath Mundlapudi bharathw...@yahoo.com wrote:

 Slow start is an important parameter and definitely impacts job runtime. My
 experience has been that setting this parameter too low or too high can hurt
 job latencies. If you are always running the same job it's easy to pick the
 right value, but if your cluster is multi-tenant, getting it right requires
 benchmarking different workloads running concurrently.

 But your case is interesting: you are running on a single core (how many
 disks per node?), so setting it to the higher side of the spectrum, as
 suggested by Joey, makes sense.


 -Bharath





 
 From: Joey Echeverria j...@cloudera.com
 To: common-user@hadoop.apache.org
 Sent: Friday, July 8, 2011 9:14 AM
 Subject: Re: Cluster Tuning

 Set mapred.reduce.slowstart.completed.maps to a number close to 1.0.
 1.0 means the maps have to completely finish before the reduce starts
 copying any data. I often run jobs with this set to .90-.95.

 -Joey
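
For a per-job override instead of a cluster-wide one, a minimal driver sketch assuming the 0.20 mapred API used elsewhere in this thread; the class name, paths and the 0.95 figure are illustrative:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SlowstartJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SlowstartJob.class);
        conf.setJobName("slowstart-demo");

        // Don't launch reducers until ~95% of the maps have finished, so the
        // single reduce slot per node isn't copying while maps are still running.
        conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.95f);

        // Set the real mapper/combiner/reducer classes here.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}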

 On Fri, Jul 8, 2011 at 11:25 AM, Juan P. gordoslo...@gmail.com wrote:
  Here's another thought. I realized that the reduce operation in my
  map/reduce jobs finishes in a flash, but it goes really slowly until the
  mappers end. Is there a way to configure the cluster to make the reduce wait
  for the map operations to complete? Especially considering my hardware
  constraints.
 
  Thanks!
  Pony
 
  On Fri, Jul 8, 2011 at 11:41 AM, Juan P. gordoslo...@gmail.com wrote:
 
  Hey guys,
  Thanks all of you for your help.
 
  Joey,
  I tweaked my MapReduce to serialize/deserialize only essential values and
  added a combiner and that helped a lot. Previously I had a domain object
  which was being passed between Mapper and Reducer when I only needed a
  single value.
 
  Esteban,
  I think you underestimate the constraints of my cluster. Adding multiple
  jobs per JVM really kills me in terms of memory. Not to mention that with a
  single core there's not much to gain in terms of parallelism (other than
  perhaps while a process is waiting on an I/O operation). Still, I gave it a
  shot, but even though I kept changing the config I always ended up with a
  Java heap space error.
 
  Is it just me, or is performance tuning mostly a per-job task? I mean it
  will, in the end, depend on the data you are processing (structure, size,
  whether it's in one file or many, etc.). If my jobs have different sets of
  data, which are in different formats and organized in different file
  structures, do you guys recommend moving some of the configuration to Java
  code?
 
  Thanks!
  Pony
 
  On Thu, Jul 7, 2011 at 7:25 PM, Ceriasmex cerias...@gmail.com wrote:
 
  Are you the Esteban I know?



  On 07/07/2011, at 15:53, Esteban Gutierrez este...@cloudera.com
  wrote:
 
   Hi Pony,

   There is a good chance that your boxes are doing some heavy swapping and
   that is a killer for Hadoop. Have you tried
   mapred.job.reuse.jvm.num.tasks=-1 and limiting the heap on those boxes as
   much as possible?
  
   Cheers,
   Esteban.
  
   --
   Get Hadoop!  http://www.cloudera.com/downloads/
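
A minimal sketch of applying the two settings Esteban mentions from the job driver; the helper class and the -Xmx200m figure are illustrative, the property names are his:

import org.apache.hadoop.mapred.JobConf;

public final class JvmReuseSettings {
    private JvmReuseSettings() {}

    /** Reuse task JVMs and keep child heaps small, per the suggestion above. */
    public static void apply(JobConf conf) {
        // -1 means a JVM is reused for an unlimited number of tasks of the same
        // job, saving JVM startup cost on slow single-core nodes.
        conf.setNumTasksToExecutePerJvm(-1);   // sets mapred.job.reuse.jvm.num.tasks
        // Keep the per-task heap well under the 600MB of physical RAM so the
        // DataNode/TaskTracker daemons aren't pushed into swap.
        conf.set("mapred.child.java.opts", "-Xmx200m");
    }
}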
  
  
  
   On Thu, Jul 7, 2011 at 1:29 PM, Juan P. gordoslo...@gmail.com
 wrote:
  
   Hi guys!
  
   I'd like some help fine tuning my cluster. I currently have 20 boxes
   exactly
   alike. Single core machines with 600MB of RAM. No chance of
 upgrading
  the
   hardware.
  
   My cluster is made out of 1 NameNode/JobTracker box and 19
   DataNode/TaskTracker boxes.
  
    All my config is default except I've set the following in my
    mapred-site.xml in an effort to try and prevent choking my boxes.

Re: Cluster Tuning

2011-07-11 Thread Juan P.
BTW: Here's the Job Output

https://spreadsheets.google.com/spreadsheet/ccc?key=0Av5N1j_JvusDdDdaTG51OE1FOUptZHg5M1Zxc0FZbHc&hl=en_US


Re: Cluster Tuning

2011-07-11 Thread Juan P.
Allen,
Say I were to bring the property back to the default of -Xmx200m, which
buffers do you think I should adjust? io.sort.mb? io.sort.factor? How would
you adjust them?

Thanks for your help!
Pony

On Mon, Jul 11, 2011 at 4:41 PM, Allen Wittenauer a...@apache.org wrote:


 On Jul 11, 2011, at 9:28 AM, Juan P. wrote:
 
   <property>
     <name>mapred.child.java.opts</name>
     <value>-Xmx400m</value>
   </property>

 Single core machines with 600MB of RAM.

 2 x 400m = 800m just for the heaps of the map and reduce
 phases, not counting the other memory that the JVM will need. The I/O buffer
 sizes aren't adjusted downward either, so you're likely looking at a
 swapping + spills = death scenario. Slowstart set to 1 is going to be
 pretty much required.
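
A sketch of one way to act on this from the driver; the numbers are illustrative and assume the 0.20 defaults (io.sort.mb=100, io.sort.factor=10):

import org.apache.hadoop.mapred.JobConf;

public final class SmallBoxSettings {
    private SmallBoxSettings() {}

    /** Example settings for single-core, 600MB nodes, following Allen's comments. */
    public static void apply(JobConf conf) {
        // Back to the default child heap so map + reduce heaps stay well under 600MB.
        conf.set("mapred.child.java.opts", "-Xmx200m");
        // Shrink the map-side sort buffer so it fits comfortably inside a 200MB
        // heap; the 0.20 default of 100MB would be half the heap on its own.
        conf.setInt("io.sort.mb", 50);
        // io.sort.factor already defaults to 10, which is reasonable here;
        // raising it trades memory for fewer merge passes.
        // Don't start reducers until the maps are essentially done.
        conf.setFloat("mapred.reduce.slowstart.completed.maps", 1.0f);
    }
}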


Re: Cluster Tuning

2011-07-08 Thread Juan P.
Hey guys,
Thanks all of you for your help.

Joey,
I tweaked my MapReduce to serialize/deserialize only essential values and
added a combiner and that helped a lot. Previously I had a domain object
which was being passed between Mapper and Reducer when I only needed a
single value.

Esteban,
I think you underestimate the constraints of my cluster. Adding multiple
jobs per JVM really kills me in terms of memory. Not to mention that with a
single core there's not much to gain in terms of parallelism (other than
perhaps while a process is waiting on an I/O operation). Still, I gave it a
shot, but even though I kept changing the config I always ended up with a
Java heap space error.

Is it just me, or is performance tuning mostly a per-job task? I mean it will,
in the end, depend on the data you are processing (structure, size, whether
it's in one file or many, etc.). If my jobs have different sets of data,
which are in different formats and organized in different file structures,
do you guys recommend moving some of the configuration to Java code?

Thanks!
Pony
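
On the last question, one common middle ground is a driver that implements Tool, so per-job values can still be passed on the command line with -D instead of being hard-coded or pushed into the cluster-wide *-site.xml files. A minimal sketch; the class name and paths are placeholders:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class TunableJob extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        // getConf() already contains any -Dproperty=value options that
        // ToolRunner parsed off the command line.
        JobConf conf = new JobConf(getConf(), TunableJob.class);
        conf.setJobName("tunable-job");
        // Set the real mapper/combiner/reducer classes here.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // e.g.: hadoop jar job.jar TunableJob -Dmapred.reduce.slowstart.completed.maps=0.95 in out
        System.exit(ToolRunner.run(new TunableJob(), args));
    }
}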

On Thu, Jul 7, 2011 at 7:25 PM, Ceriasmex cerias...@gmail.com wrote:

 Are you the Esteban I know?



 On 07/07/2011, at 15:53, Esteban Gutierrez este...@cloudera.com
 wrote:

  Hi Pony,

  There is a good chance that your boxes are doing some heavy swapping and
  that is a killer for Hadoop. Have you tried
  mapred.job.reuse.jvm.num.tasks=-1 and limiting the heap on those boxes as
  much as possible?
 
  Cheers,
  Esteban.
 
  --
  Get Hadoop!  http://www.cloudera.com/downloads/
 
 
 
  On Thu, Jul 7, 2011 at 1:29 PM, Juan P. gordoslo...@gmail.com wrote:
 
  Hi guys!
 
  I'd like some help fine tuning my cluster. I currently have 20 boxes
  exactly
  alike. Single core machines with 600MB of RAM. No chance of upgrading
 the
  hardware.
 
  My cluster is made out of 1 NameNode/JobTracker box and 19
  DataNode/TaskTracker boxes.
 
  All my config is default except I've set the following in my
  mapred-site.xml
  in an effort to try and prevent choking my boxes.
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>1</value>
  </property>
 
  I'm running a MapReduce job which reads a Proxy Server log file (2GB),
 maps
  hosts to each record and then in the reduce task it accumulates the
 amount
  of bytes received from each host.
 
  Currently it's producing about 65000 keys
 
  The whole job takes forever to complete, especially the reduce part. I've
  tried different tuning configs but I can't bring it down under 20 minutes.
 
  Any ideas?
 
  Thanks for your help!
  Pony
 



Re: Cluster Tuning

2011-07-08 Thread Juan P.
Here's another thought. I realized that the reduce operation in my
map/reduce jobs finishes in a flash, but it goes really slowly until the
mappers end. Is there a way to configure the cluster to make the reduce wait
for the map operations to complete? Especially considering my hardware
constraints.

Thanks!
Pony





Cluster Tuning

2011-07-07 Thread Juan P.
Hi guys!

I'd like some help fine tuning my cluster. I currently have 20 boxes exactly
alike. Single core machines with 600MB of RAM. No chance of upgrading the
hardware.

My cluster is made out of 1 NameNode/JobTracker box and 19
DataNode/TaskTracker boxes.

All my config is default except I've set the following in my mapred-site.xml
in an effort to try and prevent choking my boxes.
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>1</value>
  </property>

I'm running a MapReduce job which reads a Proxy Server log file (2GB), maps
hosts to each record and then in the reduce task it accumulates the amount
of bytes received from each host.

Currently it's producing about 65000 keys
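
Since this job is a plain per-host sum of bytes, the reducer can double as the combiner mentioned earlier in the thread. A minimal sketch against the 0.20 mapred API; the field positions in the mapper are placeholders, since the actual proxy log format isn't shown here:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class BytesPerHost {

    /** Emits (host, bytes) for each log line; the column indexes are placeholders. */
    public static class ParseMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, LongWritable> {
        private final Text host = new Text();
        private final LongWritable bytes = new LongWritable();

        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, LongWritable> out, Reporter reporter)
                throws IOException {
            String[] fields = line.toString().split("\\s+");
            if (fields.length < 2) {
                return;                               // skip malformed lines
            }
            try {
                bytes.set(Long.parseLong(fields[1])); // placeholder: bytes column
            } catch (NumberFormatException e) {
                return;                               // skip non-numeric byte fields
            }
            host.set(fields[0]);                      // placeholder: host column
            out.collect(host, bytes);
        }
    }

    /** Sums byte counts; the same class works as both combiner and reducer. */
    public static class SumReducer extends MapReduceBase
            implements Reducer<Text, LongWritable, Text, LongWritable> {
        public void reduce(Text host, Iterator<LongWritable> values,
                           OutputCollector<Text, LongWritable> out, Reporter reporter)
                throws IOException {
            long total = 0;
            while (values.hasNext()) {
                total += values.next().get();
            }
            out.collect(host, new LongWritable(total));
        }
    }
}

In the driver this would be wired up with setMapperClass(ParseMapper.class), setCombinerClass(SumReducer.class) and setReducerClass(SumReducer.class).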

The whole job takes forever to complete, especially the reduce part. I've
tried different tuning configs but I can't bring it down under 20 minutes.

Any ideas?

Thanks for your help!
Pony


Setting names for nodes

2011-07-04 Thread Juan P.
Hi guys,
Is there a way to set human readable names for my nodes?

I've configured an Amazon cluster, and currently when browsing the NameNode
Web Console in the list of nodes I get part of the Amazon public DNS URL
which isn't very helpful when trying to figure out which node I'm looking
at. So I wanted to know if there was a way of telling Hadoop to generate
links using the public DNS but that it should display a specific name for
each node.

Thanks!
Pony


Re: Performance Tuning

2011-06-27 Thread Juan P.
Matt,
Thanks for your help!
I think I get it now, but this part is a bit confusing:

"so: tasktracker/datanode and 6 slots left. How you break it up from there
is your call but I would suggest either 4 mappers / 2 reducers or 5 mappers
/ 1 reducer."

If it's 2 processes per core, then it's: 4 Nodes * 4 Cores/Node * 2
Processes/Core = 32 Processes Total

So my mapred-site.xml configuration should include these properties:

<property>
  <name>mapred.map.tasks</name>
  <value>28</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>4</value>
</property>

Is that correct?
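
One caveat worth noting here (it isn't in Matt's mail): mapred.reduce.tasks is honored as written, but mapred.map.tasks is only a hint — the actual number of map tasks is decided by the input splits, roughly one per HDFS block of splittable input. A sketch of the driver-side equivalent of the two properties:

import org.apache.hadoop.mapred.JobConf;

public final class TaskCounts {
    private TaskCounts() {}

    public static void apply(JobConf conf) {
        // One reduce slot per node across four nodes, as discussed above.
        conf.setNumReduceTasks(4);
        // Only a hint: the InputFormat computes the real map count from the splits.
        conf.setNumMapTasks(28);
    }
}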

On Mon, Jun 27, 2011 at 4:59 PM, GOEKE, MATTHEW (AG/1000) 
matthew.go...@monsanto.com wrote:

 If you are running default configurations then you are only getting 2
 mappers and 1 reducer per node. The rule of thumb I have gone on (and backed
 up by the definitive guide) is 2 processes per core so: tasktracker/datanode
 and 6 slots left. How you break it up from there is your call but I would
 suggest either 4 mappers / 2 reducers or 5 mappers / 1 reducer.

 Check out the below configs for details on what you are *most likely*
 running currently:
 http://hadoop.apache.org/common/docs/r0.20.2/mapred-default.html
 http://hadoop.apache.org/common/docs/r0.20.2/hdfs-default.html
 http://hadoop.apache.org/common/docs/r0.20.2/core-default.html

 HTH,
 Matt

 -Original Message-
 From: Juan P. [mailto:gordoslo...@gmail.com]
 Sent: Monday, June 27, 2011 2:50 PM
 To: common-user@hadoop.apache.org
 Subject: Performance Tuning

 I'm trying to run a MapReduce task against a cluster of 4 DataNodes with 4
 cores each.
 My input data is 4GB in size and it's split into 100MB files. Current
 configuration is default so block size is 64MB.

 If I understand it correctly Hadoop should be running 64 Mappers to process
 the data.

 I'm running a simple data counting MapReduce and it's taking about 30mins
 to
 complete. This seems like way too much, doesn't it?
 Is there any tuning you guys would recommend to try and see an improvement
 in performance?

 Thanks,
 Pony




Re: Performance Tuning

2011-06-27 Thread Juan P.
Ok,
So I tried putting the following config in the mapred-site.xml of all of my
nodes

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>name-node:54311</value>
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>7</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>1</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>7</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>1</value>
  </property>
</configuration>

but when I start a new job it gets stuck at

11/06/28 03:04:47 INFO mapred.JobClient:  map 0% reduce 0%

Any thoughts?
Thanks for your help guys!



Re: Starting JobTracker Locally but binding to remote Address

2011-06-01 Thread Juan P.
Joey,
I just tried it and it worked great. I configured the entire cluster (added
a couple more DataNodes) and I was able to run a simple map/reduce job.

Thanks for your help!
Pony

On Tue, May 31, 2011 at 6:26 PM, gordoslocos gordoslo...@gmail.com wrote:

 :D i'll give that a try 1st thing in the morning! Thanks a lot joey!!

 Sent from my iPhone

 On 31/05/2011, at 18:18, Joey Echeverria j...@cloudera.com wrote:

  The problem is that start-all.sh isn't all that intelligent. The way
  that start-all.sh works is by running start-dfs.sh and
  start-mapred.sh. The start-mapred.sh script always starts a job
  tracker on the local host and a task tracker on all of the hosts
  listed in slaves (it uses SSH to do the remote execution). The
  start-dfs.sh script always starts a name node on the local host, a
  data node on all of the hosts listed in slaves, and a secondary name
  node on all of the hosts listed in masters.
 
  In your case, you'll want to run start-dfs.sh on slave3 and
  start-mapred.sh on slave2.
 
  -Joey
 
 
 
 
  --
  Joseph Echeverria
  Cloudera, Inc.
  443.305.9434



Starting JobTracker Locally but binding to remote Address

2011-05-31 Thread Juan P.
Hi Guys,
I recently configured my cluster to have 2 VMs. I configured 1
machine (slave3) to be the namenode and another to be the
jobtracker (slave2). They both work as datanode/tasktracker as well.

Both configs have the following contents in their masters and slaves file:
slave2
slave3

Both machines have the following contents on their mapred-site.xml file:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>slave2:9001</value>
  </property>
</configuration>

Both machines have the following contents on their core-site.xml file:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://slave3:9000</value>
  </property>
</configuration>

When I log into the namenode and I run the start-all.sh script, everything
but the jobtracker starts. In the log files I get the following exception:

/************************************************************
STARTUP_MSG: Starting JobTracker
STARTUP_MSG:   host = slave3/10.20.11.112
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
2011-05-31 13:54:06,940 INFO org.apache.hadoop.mapred.JobTracker: Scheduler configured with (memSizeForMapSlotOnJT, memSizeForReduceSlotOnJT, limitMaxMemForMapTasks, limitMaxMemForReduceTasks) (-1, -1, -1, -1)
2011-05-31 13:54:07,086 FATAL org.apache.hadoop.mapred.JobTracker: java.net.BindException: Problem binding to slave2/10.20.11.166:9001 : Cannot assign requested address
        at org.apache.hadoop.ipc.Server.bind(Server.java:190)
        at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:253)
        at org.apache.hadoop.ipc.Server.<init>(Server.java:1026)
        at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:488)
        at org.apache.hadoop.ipc.RPC.getServer(RPC.java:450)
        at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:1595)
        at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:183)
        at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:175)
        at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:3702)
Caused by: java.net.BindException: Cannot assign requested address
        at sun.nio.ch.Net.bind(Native Method)
        at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:126)
        at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59)
        at org.apache.hadoop.ipc.Server.bind(Server.java:188)
        ... 8 more

2011-05-31 13:54:07,096 INFO org.apache.hadoop.mapred.JobTracker: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down JobTracker at slave3/10.20.11.112
************************************************************/


As I see it, from the lines

STARTUP_MSG: Starting JobTracker
STARTUP_MSG:   host = slave3/10.20.11.112

the namenode (slave3) is trying to run the jobtracker locally, but when it
starts the jobtracker server it binds it to the slave2 address and of course
fails:

Problem binding to slave2/10.20.11.166:9001

What do you guys think could be going wrong?

Thanks!
Pony


Re: Comparing

2011-05-26 Thread Juan P.
Harsh,
Thanks for your response, it was very helpful.
There are still a couple of things which are not really clear to me though.
You say that keys have to be compared by the MR framework, but I'm
still not 100% sure why keys are sorted. I thought that during shuffling
Hadoop chose which keys went to which reducer, checking each key/value and
sending it to the correct node. If that were the case then a good equals
implementation could be enough. So why, instead of just *shuffling*, does the
MR framework *sort* the items?

Also, you were very clear about the use of RawComparator, thank you. Do you
know how RawComparable works though?

Again, thanks for your help!
Cheers,
Pony

On Thu, May 26, 2011 at 1:58 AM, Harsh J ha...@cloudera.com wrote:

 Pony,

 Keys have got to be compared by the MR framework somehow, and the way
 it does when you use Writables is by ensuring that your Key is of a
 Writable + Comparable type (WritableComparable).

 If you specify a specific comparator class, then that will be used;
 else the default WritableComparator will get asked if it can supply a
 comparator for use with your key type.

 AFAIK, the default WritableComparator wraps around RawComparator and
 does indeed deserialize the writables before applying the compare
 operation. The RawComparator's primary idea is to give you a pair of
 raw byte sequences to compare directly. Certain other serialization
 libraries (Apache Avro is one) provide ways to compare using bytes
 itself (Across different types), which can end up being faster when
 used in jobs.

 Hope this clears up your confusion.
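
A minimal sketch of the pattern Harsh describes: a WritableComparable key plus an optional registered comparator that works directly on the serialized bytes, so the sort can skip deserializing the keys. The key layout (a single long) is just an example:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class HostKey implements WritableComparable<HostKey> {
    private long hostId;

    public void set(long hostId) { this.hostId = hostId; }

    public void write(DataOutput out) throws IOException {
        out.writeLong(hostId);
    }

    public void readFields(DataInput in) throws IOException {
        hostId = in.readLong();
    }

    /** Object-level compare, used when no raw comparator is registered. */
    public int compareTo(HostKey other) {
        return hostId < other.hostId ? -1 : (hostId == other.hostId ? 0 : 1);
    }

    @Override
    public int hashCode() {
        return (int) (hostId ^ (hostId >>> 32));   // used by the default HashPartitioner
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof HostKey && ((HostKey) o).hostId == hostId;
    }

    /** Byte-level compare: sorts serialized keys without deserializing them. */
    public static class Comparator extends WritableComparator {
        public Comparator() {
            super(HostKey.class);
        }

        @Override
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            long id1 = readLong(b1, s1);
            long id2 = readLong(b2, s2);
            return id1 < id2 ? -1 : (id1 == id2 ? 0 : 1);
        }
    }

    static {
        // Register the raw comparator so the framework uses it for this key type.
        WritableComparator.define(HostKey.class, new Comparator());
    }
}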

 On Tue, May 24, 2011 at 2:06 AM, Juan P. gordoslo...@gmail.com wrote:
  Hi guys,
  I wanted to get your help with a couple of questions which came up while
  looking at the Hadoop Comparator/Comparable architecture.
 
  As I see it before each reducer operates on each key, a sorting algorithm
 is
  applied to them. *Why does Hadoop need to do that?*
 
  If I implement my own class and I intend to use it as a Key I must allow
 for
  instances of my class to be compared. So I have 2 choices: I can
 implement
  WritableComparable or I can register a WritableComparator for my
  class. Should I fail to do either, would the Job fail?
  If I register my WritableComparator which does not use the Comparable
  interface at all, does my Key need to implement WritableComparable?
  If I don't implement my Comparator and my Key implements
 WritableComparable,
  does it mean that Hadoop will deserialize my Keys twice? (once for
 sorting,
  and once for reducing)
  What is RawComparable used for?
 
  Thanks for your help!
  Pony
 



 --
 Harsh J



Re: Comparing

2011-05-25 Thread Juan P.
Hi guys!
Any thoughts on this? Should I have sent my queries to a different
distribution list?

Thanks!
Pony

On Mon, May 23, 2011 at 5:36 PM, Juan P. gordoslo...@gmail.com wrote:

 Hi guys,
 I wanted to get your help with a couple of questions which came up while
 looking at the Hadoop Comparator/Comparable architecture.

 As I see it before each reducer operates on each key, a sorting algorithm
 is applied to them. *Why does Hadoop need to do that?*

 If I implement my own class and I intend to use it as a Key I must allow
 for instances of my class to be compared. So I have 2 choices: I can
 implement WritableComparable or I can register a WritableComparator for my
 class. Should I fail to do either, would the Job fail?
 If I register my WritableComparator which does not use the Comparable
 interface at all, does my Key need to implement WritableComparable?
 If I don't implement my Comparator and my Key implements
 WritableComparable, does it mean that Hadoop will deserialize my Keys twice?
 (once for sorting, and once for reducing)
 What is RawComparable used for?

 Thanks for your help!
 Pony




Comparing

2011-05-23 Thread Juan P.
Hi guys,
I wanted to get your help with a couple of questions which came up while
looking at the Hadoop Comparator/Comparable architecture.

As I see it before each reducer operates on each key, a sorting algorithm is
applied to them. *Why does Hadoop need to do that?*

If I implement my own class and I intend to use it as a Key I must allow for
instances of my class to be compared. So I have 2 choices: I can implement
WritableComparable or I can register a WritableComparator for my
class. Should I fail to do either, would the Job fail?
If I register my WritableComparator which does not use the Comparable
interface at all, does my Key need to implement WritableComparable?
If I don't implement my Comparator and my Key implements WritableComparable,
does it mean that Hadoop will deserialize my Keys twice? (once for sorting,
and once for reducing)
What is RawComparable used for?

Thanks for your help!
Pony


Why is JUnit a compile scope dependency?

2011-04-29 Thread Juan P.
I was putting together a maven project and imported hadoop-core as a
dependency and noticed that among the jars it brought with it was JUnit 4.5.
Shouldn't it be a test scope dependency? It also happens with JUnit 3.8.1
for the commons-httpclient-3.0.1 dependency it pulls down from the repo.

Cheers,
Juan


Re: Why is JUnit a compile scope dependency?

2011-04-29 Thread Juan P.
Done! HADOOP-7252 https://issues.apache.org/jira/browse/HADOOP-7252

On Fri, Apr 29, 2011 at 1:44 PM, Konstantin Boudnik c...@apache.org wrote:

 Yes, this seems to be a dependency declaration bug. Not a big deal, but
 still.
 Do you care to open a JIRA under
 https://issues.apache.org/jira/browse/HADOOP

 Thanks,
   Cos

 On Fri, Apr 29, 2011 at 07:03, Juan P. gordoslo...@gmail.com wrote:
  I was putting together a maven project and imported hadoop-core as a
  dependency and noticed that among the jars it brought with it was JUnit
 4.5.
  Shouldn't it be a test scope dependency? It also happens with JUnit 3.8.1
  for the commons-httpclient-3.0.1 dependency it pulls down from the repo.
 
  Cheers,
  Juan
 



Should waitForCompletion throw so many exceptions?

2011-04-29 Thread Juan P.
Is it just me or is it weird that
org.apache.hadoop.mapreduce.Job#waitForCompletion(boolean verbose) throws
exceptions like ClassNotFoundException?

It seems like it's breaking encapsulation by throwing IOException,
ClassNotFoundException and InterruptedException. Has this been discussed?

Thanks,
Pony


Stable Release

2011-04-28 Thread Juan P.
Hi guys,
I wanted to know exactly which was the latest stable release of Hadoop. In
the site it says it's release 0.20.2, but 0.21.0 is also available and in
the repository there's already a branch for release 0.22.0.

Is it possible that the current development branch is 0.22, the stable is
0.21 and the site is just out of date?

Or is that 0.22 is dev, 0.21 is a non-stable release, and 0.20.2 is the
current stable release version?

Which release do you guys recommend I should start using (I'm brand new to
the technology)?

Thanks for your help,
Juan