Re: Reducer getting key-value pairs in wrong order
On 02/03/2011 07:02 PM, Harsh J wrote:
> For a ValueGrouping comparator to work, your Partitioner must act in tandem with it. I do not know if you have implemented a custom hashCode() method for your Key class, but your partitioner should look like:

Yes I did, and it works like this:

    return leftElement.hashCode() + rightElement;

> return (key.getLeftElement().hashCode() & Integer.MAX_VALUE) % numPartitions;

This was definitely a bug; the result is always the same though :(

> This will ensure that the to-be-grouped data is actually partitioned properly too. The actual sorting (which ought to occur for the full composite key, field by field, and is the only real 'sorter') would be handled by the compare() call of your Writable, if you are using a WritableComparable.

I am using a WritableComparable... here's PairOfStringInt: https://gist.github.com/810905

Thanks again
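For reference, a minimal sketch of the kind of partitioner being described above, assuming the PairOfStringInt key class from the gist with the getLeftElement() accessor used in the snippet; the class name and the generic value type are made up for illustration:

    import org.apache.hadoop.mapreduce.Partitioner;

    // Partition only on the left element of the composite key, so that all records
    // sharing a left element reach the same reducer; the value-grouping comparator
    // and the key's own comparison then control grouping and sort order.
    public class LeftElementPartitioner<V> extends Partitioner<PairOfStringInt, V> {
        @Override
        public int getPartition(PairOfStringInt key, V value, int numPartitions) {
            // Masking with Integer.MAX_VALUE keeps the index non-negative
            // even when hashCode() is negative.
            return (key.getLeftElement().hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

It would be wired in with job.setPartitionerClass(LeftElementPartitioner.class) alongside the grouping comparator.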
Hadoop 0.21 HDFS fails to connect
Dear All,

I have been trying to configure Hadoop on a cluster, but when I issue any command against HDFS, such as mkdir, it tries to connect to the server and then fails. I issued two commands, the format and the mkdir, but both fail. Please help and advise.

Regards,
Ahmed

ahmednagy@cannonau:~/HadoopStandalone/hadoop-0.21.0/bin$ ./hadoop namenode -format
DEPRECATED: Use of this script to execute hdfs command is deprecated. Instead use the hdfs command for it.
11/02/04 12:04:03 INFO namenode.NameNode: STARTUP_MSG:
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = cannonau.isti.cnr.it/146.48.82.190
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.21.0
STARTUP_MSG:   classpath = /home/ahmednagy/HadoopStandalone/hadoop-0.21.0/bin/../conf:/usr/lib/jvm/java-6-sun/lib/tools.jar:/home/ahmednagy/HadoopStandalone/hadoop-0.21.0/bin/..:/home/ahmednagy/HadoopStandalone/hadoop-0.21.0/bin/../hadoop-common-0.21.0.jar:/home/ahmednagy/HadoopStandalone/hadoop-0.21.0/bin/../hadoop-common-test-0.21.0.jar:/home/ahmednagy/HadoopStandalone/hadoop-0.21.0/bin/../hadoop-hdfs-0.21.0.jar:/home/ahmednagy/HadoopStandalone/hadoop-0.21.0/bin/../hadoop-hdfs-0.21.0-sources.jar:/home/ahmednagy/HadoopStandalone/hadoop-0.21.0/bin/../hadoop-hdfs-ant-0.21.0.jar:/home/ahmednagy/HadoopStandalone/hadoop-0.21.0/bin/../hadoop-hdfs-test-0. (a long list of paths)
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.21 -r 985326; compiled by 'tomwhite' on Tue Aug 17 01:02:28 EDT 2010
11/02/04 12:04:03 WARN common.Util: Path /tmp/mylocal should be specified as a URI in configuration files. Please update hdfs configuration.
11/02/04 12:04:03 WARN common.Util: Path /tmp/mylocal should be specified as a URI in configuration files. Please update hdfs configuration.
Re-format filesystem in /tmp/mylocal ? (Y or N) y
Format aborted in /tmp/mylocal
11/02/04 12:04:14 INFO namenode.NameNode: SHUTDOWN_MSG:
SHUTDOWN_MSG: Shutting down NameNode at cannonau.isti.cnr.it/146.48.82.190

ahmednagy@cannonau:~/HadoopStandalone/hadoop-0.21.0/bin$ /hadoop dfs -mkdir input
-bash: /hadoop: No such file or directory
ahmednagy@cannonau:~/HadoopStandalone/hadoop-0.21.0/bin$ ./hadoop dfs -mkdir input
DEPRECATED: Use of this script to execute hdfs command is deprecated. Instead use the hdfs command for it.
11/02/04 12:04:30 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=30
11/02/04 12:04:31 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
11/02/04 12:04:32 INFO ipc.Client: Retrying connect to server: cannonau.isti.cnr.it/146.48.82.190:8020. Already tried 0 time(s).
11/02/04 12:04:33 INFO ipc.Client: Retrying connect to server: cannonau.isti.cnr.it/146.48.82.190:8020. Already tried 1 time(s).
11/02/04 12:04:34 INFO ipc.Client: Retrying connect to server: cannonau.isti.cnr.it/146.48.82.190:8020. Already tried 2 time(s).
11/02/04 12:04:35 INFO ipc.Client: Retrying connect to server: cannonau.isti.cnr.it/146.48.82.190:8020. Already tried 3 time(s).
11/02/04 12:04:36 INFO ipc.Client: Retrying connect to server: cannonau.isti.cnr.it/146.48.82.190:8020. Already tried 4 time(s).
11/02/04 12:04:37 INFO ipc.Client: Retrying connect to server: cannonau.isti.cnr.it/146.48.82.190:8020. Already tried 5 time(s).
11/02/04 12:04:38 INFO ipc.Client: Retrying connect to server: cannonau.isti.cnr.it/146.48.82.190:8020. Already tried 6 time(s).
11/02/04 12:04:39 INFO ipc.Client: Retrying connect to server: cannonau.isti.cnr.it/146.48.82.190:8020. Already tried 7 time(s).
11/02/04 12:04:40 INFO ipc.Client: Retrying connect to server: cannonau.isti.cnr.it/146.48.82.190:8020. Already tried 8 time(s).
11/02/04 12:04:41 INFO ipc.Client: Retrying connect to server: cannonau.isti.cnr.it/146.48.82.190:8020. Already tried 9 time(s).
Bad connection to FS. command aborted.
ahmednagy@cannonau:~/HadoopStandalone/hadoop-0.21.0/bin$
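For what it's worth, the re-format prompt appears to be case-sensitive (see the reply further down in this digest about typing a capital Y), which would explain "Format aborted" after answering with a lowercase y, and an aborted format in turn explains why nothing is listening on port 8020. A sketch of a retry plus a couple of checks, reusing the paths from the session above:

    cd ~/HadoopStandalone/hadoop-0.21.0/bin
    ./hdfs namenode -format      # answer Y (capital) at the re-format prompt
    ./start-dfs.sh
    jps                          # a NameNode process should now be listed
    netstat -tln | grep 8020     # the NameNode RPC port the mkdir retries were aimed at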
Re: file:/// has no authority
Could you please elaborate on how you fixed that?

Thanks,
Ahmed Nagy

danoomistmatiste wrote:
> I managed to fix this issue. It had to do with permissions on the default directory.
>
> danoomistmatiste wrote:
>> Hi, I have set up a Hadoop cluster as per the instructions for CDH3. When I try to start the datanode on the slave, I get this error:
>>
>> org.apache.hadoop.hdfs.server.datanode.DataNode: java.lang.IllegalArgumentException: Invalid URI for NameNode address (check fs.defaultFS): file:/// has no authority.
>>
>> I have set up the right parameters in core-site.xml, where master is the IP address of the machine the namenode is running on:
>>
>> <configuration>
>>   <property>
>>     <name>fs.default.name</name>
>>     <value>hdfs://master:54310</value>
>>   </property>
>> </configuration>
Unable to access HDFS (Hadoop 0.21), please help
I have a cluster with a master and 7 nodes. When I try to start Hadoop it starts the MapReduce and HDFS processes on all the nodes. I formatted the HDFS, but it says the format is aborted and does not complete. When I then try to access HDFS, for example by making a new directory, it throws some errors and dies. I have attached the error messages I get below. Please advise; any ideas? Thanks in advance.

ahmednagy@cannonau:~/HadoopStandalone/hadoop-0.21.0/bin$ ./hadoop namenode -format
DEPRECATED: Use of this script to execute hdfs command is deprecated. Instead use the hdfs command for it.
11/02/04 16:36:59 INFO namenode.NameNode: STARTUP_MSG:
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = cannonau.isti.cnr.it/146.48.82.190
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.21.0
STARTUP_MSG:   classpath = (a lot of directories) ...r:/home/ahmednagy/HadoopStandalone/hadoop-0.21.0/hdfs/bin/../lib/*.jar:/home/ahmednagy/HadoopStandalone/hadoop-0.21.0/bin/../mapred/conf:/home/ahmednagy/HadoopStandalone/hadoop-0.21.0/bin/../mapred/hadoop-mapred-*.jar:/home/ahmednagy/HadoopStandalone/hadoop-0.21.0/bin/../mapred/lib/*.jar:/home/ahmednagy/HadoopStandalone/hadoop-0.21.0/hdfs/bin/../hadoop-hdfs-*.jar:/home/ahmednagy/HadoopStandalone/hadoop-0.21.0/hdfs/bin/../lib/*.jar
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.21 -r 985326; compiled by 'tomwhite' on Tue Aug 17 01:02:28 EDT 2010
11/02/04 16:37:00 WARN common.Util: Path /tmp/mylocal/ should be specified as a URI in configuration files. Please update hdfs configuration.
11/02/04 16:37:00 WARN common.Util: Path /tmp/mylocal/ should be specified as a URI in configuration files. Please update hdfs configuration.
Re-format filesystem in /tmp/mylocal ? (Y or N) y
Format aborted in /tmp/mylocal
11/02/04 16:38:00 INFO namenode.NameNode: SHUTDOWN_MSG:
SHUTDOWN_MSG: Shutting down NameNode at cannonau.isti.cnr.it/146.48.82.190
ahmednagy@cannonau:~/HadoopStandalone/hadoop-0.21.0/bin$

Log file message:

ahmednagy@cannonau:~/HadoopStandalone/hadoop-0.21.0/logs$ hadoop-ahmednagy-datanode-cannonau.isti.cnr.it.log
-bash: hadoop-ahmednagy-datanode-cannonau.isti.cnr.it.log: command not found
ahmednagy@cannonau:~/HadoopStandalone/hadoop-0.21.0/logs$ tail hadoop-ahmednagy-datanode-cannonau.isti.cnr.it.log
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:373)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:417)
        at org.apache.hadoop.ipc.Client$Connection.access$1900(Client.java:207)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1025)
        at org.apache.hadoop.ipc.Client.call(Client.java:885)
        ... 5 more

Message on the terminal:

2011-02-04 16:26:54,999 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried ...

ahmednagy@cannonau:~/HadoopStandalone/hadoop-0.21.0/bin$ ./hadoop dfs -mkdir in
DEPRECATED: Use of this script to execute hdfs command is deprecated. Instead use the hdfs command for it.
11/02/04 16:28:22 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=30
11/02/04 16:28:22 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
11/02/04 16:28:24 INFO ipc.Client: Retrying connect to server: cannonau.isti.cnr.it/146.48.82.190:8020. Already tried 0 time(s).
11/02/04 16:28:25 INFO ipc.Client: Retrying connect to server: cannonau.isti.cnr.it/146.48.82.190:8020. Already tried 1 time(s).
11/02/04 16:28:26 INFO ipc.Client: Retrying connect to server: cannonau.isti.cnr.it/146.48.82.190:8020. Already tried 2 time(s).
11/02/04 16:28:27 INFO ipc.Client: Retrying connect to server: cannonau.isti.cnr.it/146.48.82.190:8020. Already tried 3 time(s).
11/02/04 16:28:28 INFO ipc.Client: Retrying connect to server: cannonau.isti.cnr.it/146.48.82.190:8020. Already tried 4 time(s).
11/02/04 16:28:29 INFO ipc.Client: Retrying connect to server: cannonau.isti.cnr.it/146.48.82.190:8020. Already tried 5 time(s).
11/02/04 16:28:30 INFO ipc.Client: Retrying connect to server: cannonau.isti.cnr.it/146.48.82.190:8020. Already tried 6 time(s).
11/02/04 16:28:31 INFO ipc.Client: Retrying connect to server: cannonau.isti.cnr.it/146.48.82.190:8020. Already tried 7 time(s).
11/02/04 16:28:32 INFO ipc.Client: Retrying connect to server: cannonau.isti.cnr.it/146.48.82.190:8020. Already tried 8 time(s).
11/02/04 16:28:33 INFO ipc.Client: Retrying connect to server: cannonau.isti.cnr.it/146.48.82.190:8020. Already tried 9 time(s).
Bad connection to FS. command aborted.
Re: Multiple various streaming questions
That's weird. I thought I responded to this, but I don't see one on the list (and have a vague recollection at best of whether I actually did respond)... anyway...

On Feb 3, 2011, at 6:41 PM, Allen Wittenauer wrote:

> On Feb 1, 2011, at 11:40 PM, Keith Wiley wrote:
>
>> I would really appreciate any help people can offer on the following matters.
>>
>> When running a streaming job, -D, -files, -libjars, and -archives don't seem to work, but -jobconf, -file, -cacheFile, and -cacheArchive do. With the first four parameters anywhere in the command I always get a "Streaming Command Failed!" error. The last four work though. Note that some of those parameters (-files) do work when I run a Hadoop job in the normal framework, just not when I specify the streaming jar.
>
> There are some issues with how the streaming jar processes the command line, especially in 0.20, in that they need to be in the correct order. In general, the -D's need to be *before* the rest of the streaming params. This is what works for me:
>
> hadoop \
>   jar \
>   `ls $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar` \
>   -Dmapred.reduce.tasks.speculative.execution=false \
>   -Dmapred.map.tasks.speculative.execution=false \
>   -Dmapred.job.name="oh noes aw is doing perl again" \
>   -input ${ATTEMPTIN} \
>   -output ${ATTEMPTOUT} \
>   -mapper map.pl \
>   -reducer reduce.pl \
>   -file jobsvs-map1.pl \
>   -file jobsvs-reduce1.pl

I'll give that a shot today. Thanks. I hate deprecation warnings, they make me feel so guilty.

>> How do I force a single record (input file) to be processed by a single mapper to get maximum parallelism?
>
> I don't understand exactly what that means and how to go about doing it.
>
>> In the normal Hadoop framework I have achieved this goal by setting mapred.max.split.size small enough that only one input record fits (about 6MB), but I tried that with my streaming job via -jobconf mapred.max.split.size=X where X is a very low number, about as large as a single streaming input record (which in the streaming case is not 6MB, but merely ~100 bytes, just a filename referenced via -cacheFile), but it didn't work; it sent multiple records to each mapper anyway.
>
> What you actually want to do is set mapred.min.split.size to an extremely high value.

I agree, except that the method I described helps force parallelism. Setting mapred.max.split.size to a size slightly larger than a single record does a very good job of forcing 1-to-1 parallelism. Forcing it to just larger than two records forces 2-to-1, etc. It is very nice to be able to achieve perfect parallelism... but it didn't work with streaming. I have since discovered that in the case of streaming, mapred.map.tasks is a good way to achieve this goal. Ironically, if I recall correctly, this seemingly obvious method for setting the number of mappers did not work so well in my original nonstreaming case, which is why I resorted to the rather contrived method of calculating and setting mapred.max.split.size instead.

>> Achieving 1-to-1 parallelism between map tasks, nodes, and input records is very important because my map tasks take a very long time to run, upwards of an hour. I cannot have them queueing up on a small number of nodes while there are numerous unused nodes (task slots) available to be doing work.
>
> If all the task slots are in use, why would you care if they are queueing up? Also keep in mind that if a node fails, that work will need to get re-done anyway.

Because all slots are not in use. It's a very large cluster and it's excruciating that Hadoop partially serializes a job by piling multiple map tasks onto a single node in a queue even when the cluster is massively underutilized. This occurs when the input records are significantly smaller than the block size (6MB vs 64MB in my case, giving me about a 32x serialization cost!!!). To put it differently, if I let Hadoop do it its own stupid way, the job takes 32 times longer than it should take if it evenly distributed the map tasks across the nodes. Packing the input files into larger sequence files does not help with this problem. The input splits are calculated from the individual files and thus I still get this undesirable packing effect.

Thanks a lot. Lots of stuff to think about in your post. I appreciate it.

Cheers!

Keith Wiley     kwi...@keithwiley.com     keithwiley.com     music.keithwiley.com

"It's a fine line between meticulous and obsessive-compulsive and a
slippery rope between obsessive-compulsive and debilitatingly slow."
                                           --  Keith Wiley
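For reference, a sketch of how the knob mentioned above (mapred.map.tasks) would be passed to a streaming job, with the -D placed before the streaming-specific options as suggested earlier in the thread; the paths, script names, and the value 100 are placeholders rather than values from the thread, and mapred.map.tasks is only a hint whose effect still depends on the input format's split calculation:

    hadoop \
      jar `ls $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar` \
      -Dmapred.map.tasks=100 \
      -input /user/me/input \
      -output /user/me/output \
      -mapper map.pl \
      -reducer reduce.pl \
      -file map.pl \
      -file reduce.pl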
Re: file:/// has no authority
I had to specify the IP address of the master in core-site.xml and mapred-site.xml.

ahmednagy wrote:
> Could you please elaborate on how you fixed that?
> Thanks, Ahmed Nagy
>
> [snip]
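A sketch of the two settings this refers to, with 192.168.0.100 standing in for the master's IP address; the fs.default.name port (54310) matches the snippet quoted earlier in the thread, while the mapred.job.tracker port (54311) is just a commonly used choice, not something stated here:

    core-site.xml:
      <property>
        <name>fs.default.name</name>
        <value>hdfs://192.168.0.100:54310</value>
      </property>

    mapred-site.xml:
      <property>
        <name>mapred.job.tracker</name>
        <value>192.168.0.100:54311</value>
      </property>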
Re: Unable to access HDFS (Hadoop 0.21), please help
I think it wants you to type a capital Y, as silly as that may sound...

On Feb 4, 2011, at 7:38 AM, ahmednagy <ahmed_said_n...@hotmail.com> wrote:

> I have a cluster with a master and 7 nodes. When I try to start Hadoop it starts the MapReduce and HDFS processes on all the nodes. I formatted the HDFS, but it says the format is aborted and does not complete. When I then try to access HDFS, for example by making a new directory, it throws some errors and dies.
> [snip]
Re: Multiple various streaming questions
On Feb 4, 2011, at 07:46, Keith Wiley wrote:

> On Feb 3, 2011, at 6:41 PM, Allen Wittenauer wrote:
>
>> If all the task slots are in use, why would you care if they are queueing up? Also keep in mind that if a node fails, that work will need to get re-done anyway.
>
> Because all slots are not in use. It's a very large cluster and it's excruciating that Hadoop partially serializes a job by piling multiple map tasks onto a single node in a queue even when the cluster is massively underutilized. This occurs when the input records are significantly smaller than the block size (6MB vs 64MB in my case, giving me about a 32x serialization cost!!!). To put it differently, if I let Hadoop do it its own stupid way, the job takes 32 times longer than it should take if it evenly distributed the map tasks across the nodes. Packing the input files into larger sequence files does not help with this problem. The input splits are calculated from the individual files and thus I still get this undesirable packing effect.

Having reread my last paragraph, I am now reconsidering its tone. I apologize. I am entirely open to the possibility that there are smarter ways to achieve my desired goal of minimum job-turnaround time (maximum parallelism), perhaps via various configuration parameters which I have not learned how to use properly... and furthermore I am willing to admit that the seemingly frustrating and seemingly illogical partial serialism that I witnessed in my jobs using Hadoop's default configuration was not necessarily Hadoop's fault, but rather originated from some ineptitude on my part w.r.t. configuring, programming, and using Hadoop properly. In other words, I am perfectly willing to admit I might just not be using Hadoop correctly and that this problem is therefore basically my fault.

Sorry.

Keith Wiley     kwi...@keithwiley.com     www.keithwiley.com

"You can scratch an itch, but you can't itch a scratch. Furthermore, an itch can
itch but a scratch can't scratch. Finally, a scratch can itch, but an itch can't
scratch. All together this implies: He scratched the itch from the scratch that
itched but would never itch the scratch from the itch that scratched."
                                           --  Keith Wiley
Re: Writing Reducer output to database
Thanks - I switched to using the mapreduce version of DBOutputFormat and things look a little better, but I am getting a ClassCastException. Here is my writable class:

public class LogRecord implements Writable, DBWritable {

    private long timestamp;
    private String userId;
    private String action;

    public LogRecord() {
    }

    public LogRecord(long timestamp, String userId, String action, String pageType,
            String pageName, String attrPath, String attrName, String forEntity,
            String forEntityInfo, long rendTime) {
        this.timestamp = timestamp;
        this.userId = userId;
        this.action = action;
    }

    public void clearFields() {
        this.timestamp = 0;
        this.userId = "";
        this.action = "";
    }

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + ((action == null) ? 0 : action.hashCode());
        result = prime * result + (int) (timestamp ^ (timestamp >>> 32));
        result = prime * result + ((userId == null) ? 0 : userId.hashCode());
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        LogRecord other = (LogRecord) obj;
        if (action == null) {
            if (other.action != null)
                return false;
        } else if (!action.equals(other.action))
            return false;
        if (timestamp != other.timestamp)
            return false;
        if (userId == null) {
            if (other.userId != null)
                return false;
        } else if (!userId.equals(other.userId))
            return false;
        return true;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.timestamp = in.readLong();
        this.userId = Text.readString(in);
        this.action = Text.readString(in);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(this.timestamp);
        Text.writeString(out, this.userId);
        Text.writeString(out, this.action);
    }

    @Override
    public void readFields(ResultSet rs) throws SQLException {
        this.timestamp = rs.getLong(1);
        this.userId = rs.getString(2);
        this.action = rs.getString(3);
    }

    @Override
    public void write(PreparedStatement stmt) throws SQLException {
        stmt.setLong(1, this.timestamp);
        stmt.setString(2, this.userId);
        stmt.setString(3, this.action);
    }

    public void setTimestamp(long timestamp) {
        this.timestamp = timestamp;
    }

    public void setUserId(String userId) {
        this.userId = userId;
    }

    public void setAction(String action) {
        this.action = action;
    }
}

Here is my job runner/configuration code:

// configuration
Configuration conf = new Configuration();
Job job = new Job(conf, "Log Parser Job");

// configure database output
job.setOutputFormatClass(DBOutputFormat.class);
DBConfiguration.configureDB(conf, "com.microsoft.sqlserver.jdbc.SQLServerDriver",
        "jdbc:sqlserver://..", "...", "...");
String[] fields = {"timestamp", "userId", "action"};
DBOutputFormat.setOutput(job, "LogParser", fields);

// job properties
job.setJarByClass(Driver.class);
job.setMapperClass(LogParserMapper.class);
job.setReducerClass(LogParserReducer.class);
job.setMapOutputKeyClass(LogRecord.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(LogRecord.class);
job.setOutputValueClass(NullWritable.class);

Mapper code:

public class LogParserMapper extends Mapper<LongWritable, Text, LogRecord, IntWritable> {

    private LogRecord rec = new LogRecord();
    private final static IntWritable _val = new IntWritable(1);

    public void map(LongWritable key, Text value, Context context) {
        String line = value.toString();
        // parse the line into tokens
        ...
        rec.setUserId(userId);
        rec.setAction("test");
        rec.setTimestamp(0);
    }
}

Reducer:

public class LogParserReducer extends Reducer<LogRecord, IntWritable, LogRecord, NullWritable> {

    private NullWritable n = NullWritable.get();

    public void reduce(LogRecord key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        context.write(key, n);
    }
}

Finally, when I run it I am getting this error message:

11/02/04 13:47:55 INFO mapred.JobClient: Task Id : attempt_201101241250_0094_m_00_1, Status : FAILED
java.lang.ClassCastException: class logparser.model.LogRecord
        at java.lang.Class.asSubclass(Class.java:3018)
        at
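One thing worth noting (it is not stated in the post): the map output key class has to be usable for sorting, i.e. a WritableComparable (or have a raw comparator registered), and a ClassCastException thrown from Class.asSubclass() on LogRecord is consistent with the class implementing only Writable. A minimal sketch of the missing piece, written here as a separate hypothetical key class; the field names follow the post, but the ordering (timestamp, then userId, then action) is an assumption:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;

    // Sketch of the shape a map output key needs: WritableComparable, not just Writable.
    public class LogRecordKey implements WritableComparable<LogRecordKey> {
        private long timestamp;
        private String userId;
        private String action;

        public void readFields(DataInput in) throws IOException {
            timestamp = in.readLong();
            userId = Text.readString(in);
            action = Text.readString(in);
        }

        public void write(DataOutput out) throws IOException {
            out.writeLong(timestamp);
            Text.writeString(out, userId);
            Text.writeString(out, action);
        }

        public int compareTo(LogRecordKey other) {
            // Assumed ordering: timestamp, then userId, then action.
            if (timestamp != other.timestamp) {
                return timestamp < other.timestamp ? -1 : 1;
            }
            int c = userId.compareTo(other.userId);
            if (c != 0) {
                return c;
            }
            return action.compareTo(other.action);
        }
    }

The same effect could be had by making LogRecord itself implement WritableComparable<LogRecord> and adding an equivalent compareTo().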
Override automatic final streaming job status message?
Streaming tasks seem to get a final status message from Hadoop itself (not my code) that reports "R/W", presumably the number of records read and written (the number of lines of text, I assume). Problem is, I like to display a final status message for my tasks. That way I can quickly glance at a list of hundreds of mappers on the job tracker and see a broad survey of their final statuses. This works fine in an all-Java job, but in streaming this R/W status is automatically generated, thus overwriting my final status. Is there anything I can do about this?

Thank you.

Keith Wiley     kwi...@keithwiley.com     www.keithwiley.com

"And what if we picked the wrong religion? Every week, we're just making
God madder and madder!"
                                           --  Homer Simpson
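For context, a streaming task normally sets its own status by writing a line of the form "reporter:status:<message>" to stderr, which is presumably the mechanism behind the per-task status messages described above; a minimal sketch (the message text is made up):

    # inside the streaming mapper or reducer script
    echo "reporter:status:finished tile 42, all OK" >&2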
Start-all.sh does not start the mapred or the dfs
Dear All,

I have a 7-node cluster with a master. When I try to start it using start-all.sh I get the processes running as below; when I run the command again right after that, it says that the processes are already up. However, when I try to shut it down using stop-all.sh it says there is no datanode and no jobtracker to stop. It is clear to me that the processes die. I am not sure why, but I am attaching an error that I found on one of the slaves (n01). Even if I use start-mapred.sh or start-dfs.sh it does not work. Please advise; any ideas? Thanks in advance.

ahmednagy@cannonau:~/HadoopStandalone/hadoop-0.21.0/bin$ ./start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-mapred.sh
namenode running as process 28763. Stop it first.
n02: datanode running as process 21531. Stop it first.
n01: datanode running as process 21480. Stop it first.
n06: datanode running as process 21515. Stop it first.
n03: datanode running as process 21197. Stop it first.
n07: datanode running as process 21554. Stop it first.
n05: datanode running as process 20794. Stop it first.
n04: datanode running as process 21159. Stop it first.
localhost: secondarynamenode running as process 28959. Stop it first.
jobtracker running as process 29055. Stop it first.
n01: tasktracker running as process 21560. Stop it first.
n03: tasktracker running as process 21278. Stop it first.
n02: tasktracker running as process 21613. Stop it first.
n04: tasktracker running as process 21239. Stop it first.
n07: tasktracker running as process 21635. Stop it first.
n05: tasktracker running as process 20875. Stop it first.
n06: tasktracker running as process 21597. Stop it first.
ahmednagy@cannonau:~/HadoopStandalone/hadoop-0.21.0/bin$ ./stop-all.sh
This script is Deprecated. Instead use stop-dfs.sh and stop-mapred.sh
stopping namenode
n04: no datanode to stop
n01: no datanode to stop
n02: no datanode to stop
n03: no datanode to stop
n05: no datanode to stop
n06: no datanode to stop
n07: no datanode to stop
localhost: no secondarynamenode to stop
stopping jobtracker
n01: stopping tasktracker
n05: stopping tasktracker
n06: stopping tasktracker
n02: stopping tasktracker
n07: stopping tasktracker
n03: stopping tasktracker
n04: stopping tasktracker
ahmednagy@cannonau:~/HadoopStandalone/hadoop-0.21.0/bin$

Error from the datanode log on slave n01:

2011-02-05 03:04:33,465 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:
STARTUP_MSG: Starting DataNode
STARTUP_MSG:   host = n01/192.168.0.1
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.21.0
STARTUP_MSG:   classpath = /home/ahmednagy/HadoopStandalone/hadoop-0.21.0/bin/../conf:/usr/lib/jvm/java-6-sun/lib/tools.jar:/home/ahmednagy/HadoopStandalone$
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.21 -r 985326; compiled by 'tomwhite' on Tue Aug 17 01:02:28 EDT 2010
2011-02-05 03:04:33,655 WARN org.apache.hadoop.hdfs.server.common.Util: Path /tmp/mylocal/ should be specified as a URI in configuration files. Please updat$
2011-02-05 03:04:33,888 INFO org.apache.hadoop.security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=3000$
2011-02-05 03:04:34,394 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatible namespaceIDs in /tmp/mylocal: namenode name$
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:237)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:152)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:336)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.init(DataNode.java:260)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.init(DataNode.java:237)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1440)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1393)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1407)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1552)
2011-02-05 03:04:34,395 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
SHUTDOWN_MSG: Shutting down DataNode at n01/192.168.0.1

ahmednagy@cannonau:~/HadoopStandalone/hadoop-0.21.0/logs$ tail hadoop-ahmednagy-namenode-cannonau.isti.cnr.it.log
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1346)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
        at
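The "Incompatible namespaceIDs" error in the datanode log above usually means the datanodes' storage directories still hold data from an earlier format of the namenode, so the datanodes exit right after starting. One commonly used (and data-destroying) way out, assuming /tmp/mylocal is the dfs data directory on the slaves as the log suggests and that nothing currently in HDFS needs to be kept:

    # on each slave (n01..n07), with the cluster stopped:
    rm -rf /tmp/mylocal/*            # wipe that datanode's old block storage

    # then on the master:
    cd ~/HadoopStandalone/hadoop-0.21.0/bin
    ./hdfs namenode -format          # answer Y (capital) at the prompt
    ./start-dfs.sh
    ./start-mapred.sh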
Linear slowdown producing streaming output
I noticed that it takes much longer to write 55 MB to a streaming output than it takes to write 12 MB (much more than 4-5X longer), so I broke the output up, writing 1 MB at a time, and discovered a perfectly linear slowdown. Bottom line: the more data I have already written to stdout from a streaming task, the longer it takes to write the next block of data. I have no idea if this is intrinsic to producing stdout from any Unix process (I've never heard of such a thing) or if this is a Hadoop issue. Does anyone have any idea what's going on here?

From pos 0, wrote 1048576 bytes. Next pos will be 1048576. diffTimeWriteOneBlock: 0.31s
From pos 1048576, wrote 1048576 bytes. Next pos will be 2097152. diffTimeWriteOneBlock: 0.9s
From pos 2097152, wrote 1048576 bytes. Next pos will be 3145728. diffTimeWriteOneBlock: 1.46s
From pos 3145728, wrote 1048576 bytes. Next pos will be 4194304. diffTimeWriteOneBlock: 1.98s
From pos 4194304, wrote 1048576 bytes. Next pos will be 5242880. diffTimeWriteOneBlock: 2.47s
From pos 5242880, wrote 1048576 bytes. Next pos will be 6291456. diffTimeWriteOneBlock: 3.06s
From pos 6291456, wrote 1048576 bytes. Next pos will be 7340032. diffTimeWriteOneBlock: 3.53s
From pos 7340032, wrote 1048576 bytes. Next pos will be 8388608. diffTimeWriteOneBlock: 3.96s
From pos 8388608, wrote 1048576 bytes. Next pos will be 9437184. diffTimeWriteOneBlock: 4.24s
From pos 9437184, wrote 1048576 bytes. Next pos will be 10485760. diffTimeWriteOneBlock: 4.74s
From pos 10485760, wrote 1048576 bytes. Next pos will be 11534336. diffTimeWriteOneBlock: 5.24s
From pos 11534336, wrote 1048576 bytes. Next pos will be 12582912. diffTimeWriteOneBlock: 5.72s
From pos 12582912, wrote 1048576 bytes. Next pos will be 13631488. diffTimeWriteOneBlock: 6.25s
From pos 13631488, wrote 1048576 bytes. Next pos will be 14680064. diffTimeWriteOneBlock: 6.77s
From pos 14680064, wrote 1048576 bytes. Next pos will be 15728640. diffTimeWriteOneBlock: 7.37s
From pos 15728640, wrote 1048576 bytes. Next pos will be 16777216. diffTimeWriteOneBlock: 7.76s
From pos 16777216, wrote 1048576 bytes. Next pos will be 17825792. diffTimeWriteOneBlock: 8.74s
From pos 17825792, wrote 1048576 bytes. Next pos will be 18874368. diffTimeWriteOneBlock: 8.99s
From pos 18874368, wrote 1048576 bytes. Next pos will be 19922944. diffTimeWriteOneBlock: 9.35s
From pos 19922944, wrote 1048576 bytes. Next pos will be 20971520. diffTimeWriteOneBlock: 9.85s
From pos 20971520, wrote 1048576 bytes. Next pos will be 22020096. diffTimeWriteOneBlock: 10.43s
From pos 22020096, wrote 1048576 bytes. Next pos will be 23068672. diffTimeWriteOneBlock: 11.05s
From pos 23068672, wrote 1048576 bytes. Next pos will be 24117248. diffTimeWriteOneBlock: 11.52s
From pos 24117248, wrote 1048576 bytes. Next pos will be 25165824. diffTimeWriteOneBlock: 12.23s
From pos 25165824, wrote 1048576 bytes. Next pos will be 26214400. diffTimeWriteOneBlock: 12.49s
From pos 26214400, wrote 1048576 bytes. Next pos will be 27262976. diffTimeWriteOneBlock: 13.1s
From pos 27262976, wrote 1048576 bytes. Next pos will be 28311552. diffTimeWriteOneBlock: 13.83s
From pos 28311552, wrote 1048576 bytes. Next pos will be 29360128. diffTimeWriteOneBlock: 14.31s
From pos 29360128, wrote 1048576 bytes. Next pos will be 30408704. diffTimeWriteOneBlock: 14.65s
From pos 30408704, wrote 1048576 bytes. Next pos will be 31457280. diffTimeWriteOneBlock: 15.32s
From pos 31457280, wrote 1048576 bytes. Next pos will be 32505856. diffTimeWriteOneBlock: 15.88s
From pos 32505856, wrote 1048576 bytes. Next pos will be 33554432. diffTimeWriteOneBlock: 16.77s
From pos 33554432, wrote 1048576 bytes. Next pos will be 34603008. diffTimeWriteOneBlock: 16.9s
From pos 34603008, wrote 1048576 bytes. Next pos will be 35651584. diffTimeWriteOneBlock: 17.39s
From pos 35651584, wrote 1048576 bytes. Next pos will be 36700160. diffTimeWriteOneBlock: 18.12s
From pos 36700160, wrote 1048576 bytes. Next pos will be 37748736. diffTimeWriteOneBlock: 18.69s
From pos 37748736, wrote 1048576 bytes. Next pos will be 38797312. diffTimeWriteOneBlock: 19.09s
From pos 38797312, wrote 1048576 bytes. Next pos will be 39845888. diffTimeWriteOneBlock: 19.7s
From pos 39845888, wrote 1048576 bytes. Next pos will be 40894464. diffTimeWriteOneBlock: 20.65s
From pos 40894464, wrote 1048576 bytes. Next pos will be 41943040.
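Back-of-envelope, from the numbers above: writing block k (the k-th MB) takes roughly 0.5*k seconds, so writing N MB in total costs on the order of 0.25*N^2 seconds -- about 40 s for 12 MB versus roughly 770 s for 55 MB, which matches the "much more than 4-5X longer" observation and is exactly what you would expect if the cost of each write grows in proportion to the bytes already written.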