Re: Cluster Tuning
Hi guys!

Here's my mapred-site.xml. I've tweaked a few properties, but it's still taking about 8-10 minutes to process 4GB of data. Thought maybe you guys could find something you'd comment on. Thanks!

Pony

  <?xml version="1.0"?>
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
  <configuration>
    <property>
      <name>mapred.job.tracker</name>
      <value>name-node:54311</value>
    </property>
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>1</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>1</value>
    </property>
    <property>
      <name>mapred.compress.map.output</name>
      <value>true</value>
    </property>
    <property>
      <name>mapred.map.output.compression.codec</name>
      <value>org.apache.hadoop.io.compress.GzipCodec</value>
    </property>
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx400m</value>
    </property>
    <property>
      <name>map.sort.class</name>
      <value>org.apache.hadoop.util.HeapSort</value>
    </property>
    <property>
      <name>mapred.reduce.slowstart.completed.maps</name>
      <value>0.85</value>
    </property>
    <property>
      <name>mapred.map.tasks.speculative.execution</name>
      <value>false</value>
    </property>
    <property>
      <name>mapred.reduce.tasks.speculative.execution</name>
      <value>false</value>
    </property>
  </configuration>

On Fri, Jul 8, 2011 at 4:21 PM, Bharath Mundlapudi bharathw...@yahoo.com wrote:

  Slow start is an important parameter; it definitely impacts job runtime. My experience has been that setting this parameter too low or too high can hurt job latencies. If you always run the same job it's easy to pick the right value, but if your cluster is multi-tenant, getting it right requires benchmarking different workloads concurrently. Your case is interesting, though: you are running on a single core (how many disks per node?), so setting it toward the higher end of the spectrum, as Joey suggested, makes sense.

  -Bharath

  From: Joey Echeverria j...@cloudera.com
  To: common-user@hadoop.apache.org
  Sent: Friday, July 8, 2011 9:14 AM
  Subject: Re: Cluster Tuning

  Set mapred.reduce.slowstart.completed.maps to a number close to 1.0. 1.0 means the maps have to completely finish before the reduce starts copying any data. I often run jobs with this set to .90-.95.

  -Joey
Re: Cluster Tuning
BTW: Here's the Job Output: https://spreadsheets.google.com/spreadsheet/ccc?key=0Av5N1j_JvusDdDdaTG51OE1FOUptZHg5M1Zxc0FZbHc&hl=en_US

On Mon, Jul 11, 2011 at 1:28 PM, Juan P. gordoslo...@gmail.com wrote:

  Hi guys! Here's my mapred-site.xml. I've tweaked a few properties, but it's still taking about 8-10 minutes to process 4GB of data. [...]
Re: Cluster Tuning
Allen,

Say I were to bring the property back to the default of -Xmx200m. Which buffers do you think I should adjust? io.sort.mb? io.sort.factor? How would you adjust them?

Thanks for your help!
Pony

On Mon, Jul 11, 2011 at 4:41 PM, Allen Wittenauer a...@apache.org wrote:

  On Jul 11, 2011, at 9:28 AM, Juan P. wrote:

    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx400m</value>
    </property>

    Single core machines with 600MB of RAM.

  2 x 400m = 800m just for the heaps of the map and reduce phases, not counting the other memory the JVM will need. The io buffer sizes aren't adjusted downward either, so you're likely looking at a swapping + spills = death scenario. Setting slowstart to 1 is going to be pretty much required.
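For what it's worth, a sketch of what scaled-down buffers might look like next to a 200m heap (the values are illustrative guesses, not recommendations from this thread):

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx200m</value>
  </property>
  <property>
    <name>io.sort.mb</name>
    <!-- default is 100; the sort buffer must fit comfortably inside the 200m task heap -->
    <value>50</value>
  </property>
  <property>
    <name>io.sort.factor</name>
    <!-- default; number of streams merged at once, raise only if memory allows -->
    <value>10</value>
  </property>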
Re: Cluster Tuning
Hey guys,

Thanks all of you for your help.

Joey, I tweaked my MapReduce job to serialize/deserialize only essential values and added a combiner, and that helped a lot. Previously I had a domain object being passed between Mapper and Reducer when I only needed a single value.

Esteban, I think you underestimate the constraints of my cluster. Running multiple tasks per JVM really kills me in terms of memory. Not to mention that with a single core there's not much to gain in terms of parallelism (other than perhaps while a process is waiting on an I/O operation). Still, I gave it a shot, but no matter how I changed the config I always ended up with a Java heap space error.

Is it me, or is performance tuning mostly a per-job task? I mean, in the end it depends on the data you are processing (structure, size, whether it's in one file or many, etc.). If my jobs have different sets of data, in different formats and organized in different file structures, do you guys recommend moving some of the configuration to Java code?

Thanks!
Pony

On Thu, Jul 7, 2011 at 7:25 PM, Ceriasmex cerias...@gmail.com wrote:

  Are you the Esteban I know?

  On 07/07/2011, at 15:53, Esteban Gutierrez este...@cloudera.com wrote:

    Hi Pony,

    There is a good chance that your boxes are doing some heavy swapping, and that is a killer for Hadoop. Have you tried mapred.job.reuse.jvm.num.tasks=-1 and limiting the heap on those boxes as much as possible?

    Cheers,
    Esteban.

    --
    Get Hadoop! http://www.cloudera.com/downloads/

    On Thu, Jul 7, 2011 at 1:29 PM, Juan P. gordoslo...@gmail.com wrote:

      Hi guys! I'd like some help fine tuning my cluster. [...]
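On the question of moving configuration into Java code: a minimal sketch of what per-job settings might look like in a 0.20 driver (the class name, job name, and values here are placeholders, not recommendations from this thread):

  // Hypothetical driver: per-job tuning set in code instead of mapred-site.xml.
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;

  public class TuningDriver {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Per-job overrides; cluster-wide settings stay in mapred-site.xml.
      conf.set("mapred.reduce.slowstart.completed.maps", "0.95");
      conf.setBoolean("mapred.compress.map.output", true);
      Job job = new Job(conf, "log-aggregation");
      // ... set mapper/reducer/input/output classes as usual ...
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

This keeps data-dependent tuning next to the job that needs it, while node-level limits (slots, heap) remain in the site files.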
Re: Cluster Tuning
Here's another thought. I realized that the reduce operation in my map/reduce jobs is a flash, but it goes really slow until the mappers end. Is there a way to configure the cluster to make the reduce wait for the map operations to complete? Especially considering my hardware constraints.

Thanks!
Pony

On Fri, Jul 8, 2011 at 11:41 AM, Juan P. gordoslo...@gmail.com wrote:

  Hey guys, Thanks all of you for your help. [...]
Cluster Tuning
Hi guys!

I'd like some help fine tuning my cluster. I currently have 20 boxes exactly alike: single core machines with 600MB of RAM. No chance of upgrading the hardware. My cluster is made up of 1 NameNode/JobTracker box and 19 DataNode/TaskTracker boxes. All my config is default, except I've set the following in my mapred-site.xml in an effort to prevent choking my boxes:

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>1</value>
  </property>

I'm running a MapReduce job which reads a proxy server log file (2GB), maps hosts to each record, and then in the reduce task accumulates the amount of bytes received from each host. Currently it's producing about 65000 keys. The whole job takes forever to complete, especially the reduce part. I've tried different tuning configs but I can't bring it down under 20 minutes. Any ideas?

Thanks for your help!
Pony
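For reference, a minimal sketch of the kind of job described above on the 0.20 API (class names and the log-line layout are assumptions, not Pony's actual code):

  // Hypothetical host -> total bytes aggregation, assuming a whitespace-separated
  // log line with the host in field 0 and the byte count in field 1.
  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  public class HostBytes {
    public static class HostMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
      @Override
      protected void map(LongWritable offset, Text line, Context ctx)
          throws IOException, InterruptedException {
        String[] fields = line.toString().split("\\s+");
        if (fields.length > 1) {
          // Emit host -> bytes for this record.
          ctx.write(new Text(fields[0]), new LongWritable(Long.parseLong(fields[1])));
        }
      }
    }

    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
      @Override
      protected void reduce(Text host, Iterable<LongWritable> values, Context ctx)
          throws IOException, InterruptedException {
        long total = 0;
        for (LongWritable v : values) total += v.get();  // accumulate bytes per host
        ctx.write(host, new LongWritable(total));
      }
    }
  }

Since the reducer only sums, the same class can also be registered as a combiner with job.setCombinerClass(SumReducer.class), which cuts down the map output that has to be shuffled — the change that helped Pony later in the thread.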
Setting names for nodes
Hi guys,

Is there a way to set human-readable names for my nodes? I've configured an Amazon cluster, and currently, when browsing the list of nodes in the NameNode web console, I get part of the Amazon public DNS URL, which isn't very helpful when trying to figure out which node I'm looking at. So I wanted to know if there is a way of telling Hadoop to generate links using the public DNS but display a specific name for each node.

Thanks!
Pony
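Not from this thread, but one sketch of a workaround: give the instances friendly names in /etc/hosts on each box (the addresses and names below are made up), so hostname lookups resolve to something readable. Whether the web console picks these up depends on how each datanode resolves and reports its own hostname, so treat it as an experiment:

  # /etc/hosts on every node (hypothetical private IPs and names)
  10.20.30.11  worker-01
  10.20.30.12  worker-02
  10.20.30.13  worker-03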
Re: Performance Tuning
Matt,

Thanks for your help! I think I get it now, but this part is a bit confusing:

  "so: tasktracker/datanode and 6 slots left. How you break it up from there is your call but I would suggest either 4 mappers / 2 reducers or 5 mappers / 1 reducer."

If it's 2 processes per core, then it's:

  4 nodes * 4 cores/node * 2 processes/core = 32 processes total

So my mapred-site.xml should include these props:

  <property>
    <name>mapred.map.tasks</name>
    <value>28</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>4</value>
  </property>

Is that correct?

On Mon, Jun 27, 2011 at 4:59 PM, GOEKE, MATTHEW (AG/1000) matthew.go...@monsanto.com wrote:

  If you are running default configurations then you are only getting 2 mappers and 1 reducer per node. The rule of thumb I have gone on (and backed up by the definitive guide) is 2 processes per core, so: tasktracker/datanode and 6 slots left. How you break it up from there is your call, but I would suggest either 4 mappers / 2 reducers or 5 mappers / 1 reducer.

  Check out the configs below for details on what you are *most likely* running currently:

  http://hadoop.apache.org/common/docs/r0.20.2/mapred-default.html
  http://hadoop.apache.org/common/docs/r0.20.2/hdfs-default.html
  http://hadoop.apache.org/common/docs/r0.20.2/core-default.html

  HTH,
  Matt

  -----Original Message-----
  From: Juan P. [mailto:gordoslo...@gmail.com]
  Sent: Monday, June 27, 2011 2:50 PM
  To: common-user@hadoop.apache.org
  Subject: Performance Tuning

  I'm trying to run a MapReduce task against a cluster of 4 DataNodes with 4 cores each. My input data is 4GB in size and it's split into 100MB files. Current configuration is default, so block size is 64MB. If I understand it correctly, Hadoop should be running 64 mappers to process the data. I'm running a simple data counting MapReduce and it's taking about 30 minutes to complete. This seems like way too much, doesn't it? Is there any tuning you guys would recommend to try and see an improvement in performance?

  Thanks,
  Pony
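One distinction worth keeping in mind here: mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum set concurrent slots per node, while mapred.map.tasks and mapred.reduce.tasks are per-job hints. A sketch of Matt's suggested 4/2 split expressed as per-node slot limits (values taken from his suggestion, not from a verified config):

  <!-- mapred-site.xml on each tasktracker: 4 map slots + 2 reduce slots per node -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>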
Re: Performance Tuning
Ok, so I tried putting the following config in the mapred-site.xml of all of my nodes:

  <configuration>
    <property>
      <name>mapred.job.tracker</name>
      <value>name-node:54311</value>
    </property>
    <property>
      <name>mapred.map.tasks</name>
      <value>7</value>
    </property>
    <property>
      <name>mapred.reduce.tasks</name>
      <value>1</value>
    </property>
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>7</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>1</value>
    </property>
  </configuration>

but when I start a new job it gets stuck at:

  11/06/28 03:04:47 INFO mapred.JobClient:  map 0% reduce 0%

Any thoughts? Thanks for your help guys!

On Mon, Jun 27, 2011 at 7:33 PM, Juan P. gordoslo...@gmail.com wrote:

  Matt, Thanks for your help! I think I get it now [...]
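Not from this thread, but when a job sits at map 0% reduce 0%, a generic first check is whether any tasktrackers actually registered with the jobtracker (the log path below is an example and will vary by install):

  # JobTracker web UI (port 50030 on 0.20) should list live tasktracker nodes:
  #   http://name-node:50030/
  # list running jobs from the command line
  hadoop job -list
  # look for connection errors in a tasktracker log
  tail -n 100 /var/log/hadoop/hadoop-*-tasktracker-*.log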
Re: Starting JobTracker Locally but binding to remote Address
Joey,

I just tried it and it worked great. I configured the entire cluster (added a couple more DataNodes) and I was able to run a simple map/reduce job.

Thanks for your help!
Pony

On Tue, May 31, 2011 at 6:26 PM, gordoslocos gordoslo...@gmail.com wrote:

  :D I'll give that a try 1st thing in the morning! Thanks a lot Joey!!

  Sent from my iPhone

  On 31/05/2011, at 18:18, Joey Echeverria j...@cloudera.com wrote:

    The problem is that start-all.sh isn't all that intelligent. The way start-all.sh works is by running start-dfs.sh and start-mapred.sh. The start-mapred.sh script always starts a job tracker on the local host and a task tracker on all of the hosts listed in slaves (it uses SSH to do the remote execution). The start-dfs.sh script always starts a name node on the local host, a data node on all of the hosts listed in slaves, and a secondary name node on all of the hosts listed in masters.

    In your case, you'll want to run start-dfs.sh on slave3 and start-mapred.sh on slave2.

    -Joey

    On Tue, May 31, 2011 at 5:07 PM, Juan P. gordoslo...@gmail.com wrote:

      Hi Guys, I recently configured my cluster to have 2 VMs. [...]

    --
    Joseph Echeverria
    Cloudera, Inc.
    443.305.9434
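A sketch of the sequence Joey describes (the install path is an assumption; adjust to your layout):

  # run on slave3 (the namenode): starts the namenode locally, datanodes
  # on the hosts in conf/slaves, and a secondary namenode on the hosts
  # in conf/masters
  /usr/local/hadoop/bin/start-dfs.sh

  # run on slave2 (the jobtracker): starts the jobtracker locally and
  # tasktrackers on the hosts in conf/slaves
  /usr/local/hadoop/bin/start-mapred.sh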
Starting JobTracker Locally but binding to remote Address
Hi Guys,

I recently configured my cluster to have 2 VMs. I configured 1 machine (slave3) to be the namenode and the other to be the jobtracker (slave2). They both work as datanode/tasktracker as well.

Both machines have the following contents in their masters and slaves files:

  slave2
  slave3

Both machines have the following contents in their mapred-site.xml file:

  <?xml version="1.0"?>
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
  <!-- Put site-specific property overrides in this file. -->
  <configuration>
    <property>
      <name>mapred.job.tracker</name>
      <value>slave2:9001</value>
    </property>
  </configuration>

Both machines have the following contents in their core-site.xml file:

  <?xml version="1.0"?>
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
  <!-- Put site-specific property overrides in this file. -->
  <configuration>
    <property>
      <name>fs.default.name</name>
      <value>hdfs://slave3:9000</value>
    </property>
  </configuration>

When I log into the namenode and run the start-all.sh script, everything but the jobtracker starts. In the log files I get the following exception:

  /************************************************************
  STARTUP_MSG: Starting JobTracker
  STARTUP_MSG:   host = slave3/10.20.11.112
  STARTUP_MSG:   args = []
  STARTUP_MSG:   version = 0.20.2
  STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
  ************************************************************/
  2011-05-31 13:54:06,940 INFO org.apache.hadoop.mapred.JobTracker: Scheduler configured with (memSizeForMapSlotOnJT, memSizeForReduceSlotOnJT, limitMaxMemForMapTasks, limitMaxMemForReduceTasks) (-1, -1, -1, -1)
  2011-05-31 13:54:07,086 FATAL org.apache.hadoop.mapred.JobTracker: java.net.BindException: Problem binding to slave2/10.20.11.166:9001 : Cannot assign requested address
          at org.apache.hadoop.ipc.Server.bind(Server.java:190)
          at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:253)
          at org.apache.hadoop.ipc.Server.<init>(Server.java:1026)
          at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:488)
          at org.apache.hadoop.ipc.RPC.getServer(RPC.java:450)
          at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:1595)
          at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:183)
          at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:175)
          at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:3702)
  Caused by: java.net.BindException: Cannot assign requested address
          at sun.nio.ch.Net.bind(Native Method)
          at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:126)
          at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59)
          at org.apache.hadoop.ipc.Server.bind(Server.java:188)
          ... 8 more
  2011-05-31 13:54:07,096 INFO org.apache.hadoop.mapred.JobTracker: SHUTDOWN_MSG:
  /************************************************************
  SHUTDOWN_MSG: Shutting down JobTracker at slave3/10.20.11.112
  ************************************************************/

As I see it, from the lines

  STARTUP_MSG: Starting JobTracker
  STARTUP_MSG:   host = slave3/10.20.11.112

the namenode (slave3) is trying to run the jobtracker locally, but when it starts the jobtracker server it binds it to the slave2 address and of course fails:

  Problem binding to slave2/10.20.11.166:9001

What do you guys think could be going wrong?

Thanks!
Pony
Re: Comparing
Harsh,

Thanks for your response, it was very helpful. There are still a couple of things which are not really clear to me, though.

You say that keys have got to be compared by the MR framework, but I'm still not 100% sure why keys are sorted. I thought what Hadoop did was: during shuffling it chose which keys went to which reducer, and then for each key/value it checked the key and sent it to the correct node. If that were the case, then a good equals implementation could be enough. So why, instead of just *shuffling*, does the MR framework *sort* the items?

Also, you were very clear about the use of RawComparator, thank you. Do you know how RawComparable works, though?

Again, thanks for your help!
Cheers,
Pony

On Thu, May 26, 2011 at 1:58 AM, Harsh J ha...@cloudera.com wrote:

  Pony,

  Keys have got to be compared by the MR framework somehow, and the way it does that when you use Writables is by ensuring that your key is of a Writable + Comparable type (WritableComparable). If you specify a specific comparator class, then that will be used; else the default WritableComparator will get asked if it can supply a comparator for use with your key type. AFAIK, the default WritableComparator wraps around RawComparator and does indeed deserialize the writables before applying the compare operation.

  The RawComparator's primary idea is to give you a pair of raw byte sequences to compare directly. Certain other serialization libraries (Apache Avro is one) provide ways to compare using bytes themselves (across different types), which can end up being faster when used in jobs.

  Hope this clears up your confusion.

  On Tue, May 24, 2011 at 2:06 AM, Juan P. gordoslo...@gmail.com wrote:

    Hi guys, I wanted to get your help with a couple of questions which came up while looking at the Hadoop Comparator/Comparable architecture. [...]

  --
  Harsh J
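A minimal sketch of the two options Harsh describes (the key class here is hypothetical, modeled on how Text and IntWritable register their comparators):

  // Hypothetical key: implements WritableComparable (option 1) and also
  // registers a raw, byte-level comparator (option 2).
  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import org.apache.hadoop.io.WritableComparable;
  import org.apache.hadoop.io.WritableComparator;

  public class HostKey implements WritableComparable<HostKey> {
    private long id;

    public void write(DataOutput out) throws IOException { out.writeLong(id); }
    public void readFields(DataInput in) throws IOException { id = in.readLong(); }

    // Option 1: deserialized comparison, used by the default WritableComparator.
    public int compareTo(HostKey other) {
      return id < other.id ? -1 : (id == other.id ? 0 : 1);
    }

    // Option 2: compare the serialized bytes directly, skipping deserialization.
    public static class Comparator extends WritableComparator {
      public Comparator() { super(HostKey.class); }
      @Override
      public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        long x = readLong(b1, s1);  // WritableComparator static helper
        long y = readLong(b2, s2);
        return x < y ? -1 : (x == y ? 0 : 1);
      }
    }

    static {
      // Register so the framework picks the raw comparator for this key type.
      WritableComparator.define(HostKey.class, new Comparator());
    }
  }

With the static registration in place, the sort runs on the serialized bytes instead of deserializing each pair of keys.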
Re: Comparing
Hi guys! Any thoughts on this? Should I have sent my queries to a different distribution list?

Thanks!
Pony

On Mon, May 23, 2011 at 5:36 PM, Juan P. gordoslo...@gmail.com wrote:

  Hi guys, I wanted to get your help with a couple of questions which came up while looking at the Hadoop Comparator/Comparable architecture. [...]
Comparing
Hi guys,

I wanted to get your help with a couple of questions which came up while looking at the Hadoop Comparator/Comparable architecture.

As I see it, before each reducer operates on its keys, a sorting algorithm is applied to them. *Why does Hadoop need to do that?*

If I implement my own class and I intend to use it as a key, I must allow for instances of my class to be compared. So I have 2 choices: I can implement WritableComparable or I can register a WritableComparator for my class. Should I fail to do either, would the job fail?

If I register a WritableComparator which does not use the Comparable interface at all, does my key need to implement WritableComparable?

If I don't implement my comparator and my key implements WritableComparable, does it mean that Hadoop will deserialize my keys twice (once for sorting, and once for reducing)?

What is RawComparable used for?

Thanks for your help!
Pony
Why is JUnit a compile scope dependency?
I was putting together a maven project and imported hadoop-core as a dependency and noticed that among the jars it brought with it was JUnit 4.5. Shouldn't it be a test scope dependency? It also happens with JUnit 3.8.1 for the commons-httpclient-3.0.1 dependency it pulls down from the repo. Cheers, Juan
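Until that's fixed upstream, one consumer-side workaround is to exclude the transitive jars (a sketch; the hadoop-core version shown is just an example):

  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>0.20.2</version>
    <exclusions>
      <!-- keep test-only jars off the compile classpath -->
      <exclusion>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
      </exclusion>
    </exclusions>
  </dependency>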
Re: Why is JUnit a compile scope dependency?
Done! HADOOP-7252 https://issues.apache.org/jira/browse/HADOOP-7252

On Fri, Apr 29, 2011 at 1:44 PM, Konstantin Boudnik c...@apache.org wrote:

  Yes, this seems to be a dependency declaration bug. Not a big deal, but still. Do you care to open a JIRA under https://issues.apache.org/jira/browse/HADOOP?

  Thanks,
  Cos

  On Fri, Apr 29, 2011 at 07:03, Juan P. gordoslo...@gmail.com wrote:

    I was putting together a maven project and imported hadoop-core as a dependency [...]
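For reference, the fix amounts to declaring the dependency with test scope in the project's pom (a sketch, not the actual patch from HADOOP-7252):

  <dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.5</version>
    <!-- test scope keeps JUnit off the compile classpath of downstream consumers -->
    <scope>test</scope>
  </dependency>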
Should waitForCompletion throw so many exceptions?
Is it just me or is it weird that org.apache.hadoop.mapreduce.Job#waitForCompletion(boolean verbose) throws exceptions like ClassNotFoundException? It seems like it's breaking encapsulation by throwing IOException, ClassNotFoundException and InterruptedException. Has this been discussed? Thanks, Pony
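For context, the 0.20 signature forces driver code along these lines (a minimal sketch; the class and job names are placeholders):

  // Job#waitForCompletion declares IOException, InterruptedException, and
  // ClassNotFoundException -- the caller must handle or rethrow all three.
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;

  public class Driver {
    public static void main(String[] args) {
      try {
        Job job = new Job(new Configuration(), "example");
        // ... job setup elided ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      } catch (java.io.IOException e) {
        e.printStackTrace();                 // I/O failure talking to the cluster
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();  // polling thread was interrupted
      } catch (ClassNotFoundException e) {
        e.printStackTrace();                 // job classes not found at submission
      }
      System.exit(1);
    }
  }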
Stable Release
Hi guys,

I wanted to know exactly which is the latest stable release of Hadoop. The site says it's release 0.20.2, but 0.21.0 is also available, and in the repository there's already a branch for release 0.22.0. Is it possible that the current development branch is 0.22, the stable one is 0.21, and the site is just out of date? Or is it that 0.22 is dev, 0.21 is a non-stable release, and 0.20.2 is the current stable release? Which release do you guys recommend I start using (I'm brand new to the technology)?

Thanks for your help,
Juan