Re: Confusing NameNodeFailover page in Hadoop Wiki
Doug Cutting wrote: Konstantin Shvachko wrote: IMHO we either need to correct it or remove it. +1 Doug I added some pages there on the namenode, jobtracker, etc., linking to the failover doc, which I didn't compare against the SVN docs to check for correctness. Perhaps the failover page could be set up to say "you can do some things here" and point to the full docs in SVN or on the Hadoop site. -- Steve Loughran http://www.1060.org/blogxter/publish/5 Author: Ant in Action http://antbook.org/
Re: DFS. How to read from a specific datanode
Kevin wrote: Thank you for the suggestion. I looked at DFSClient. It appears that the chooseDataNode method decides which data node to connect to. Currently it chooses the first non-dead data node returned by the namenode, which has sorted the nodes by proximity to the client. However, chooseDataNode is private, so overriding it seems infeasible. Neither are the callers of chooseDataNode public or protected. I need this because I do not want to trust the namenode's ordering. For applications where network congestion is rare, we should let the client decide which data node to load from. Dangerous: what happens when network congestion arrives and the apps are already out there? Maybe it should be negotiated -- the namenode provides an ordered list and the client can choose from it based on its own measurements. If the namenode provides only one, that's the one you get to use.
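[Editor's sketch of the negotiated scheme discussed above: the namenode's ordered list stays the default, but the client may prefer a node it has measured to be faster. The helper name pickDataNode and the latency map are hypothetical; this is not the DFSClient API.]

    import java.util.Map;
    import org.apache.hadoop.dfs.DatanodeInfo;

    // Hypothetical client-side pick over the namenode-ordered list; not part
    // of DFSClient. Falls back to the namenode's first choice when the client
    // has no measurements of its own.
    static DatanodeInfo pickDataNode(DatanodeInfo[] orderedByNamenode,
                                     Map<DatanodeInfo, Long> latencyMillis) {
        DatanodeInfo best = orderedByNamenode[0];  // namenode's preference
        long bestLatency = Long.MAX_VALUE;
        for (DatanodeInfo node : orderedByNamenode) {
            Long measured = latencyMillis.get(node);
            if (measured != null && measured < bestLatency) {
                bestLatency = measured;
                best = node;
            }
        }
        return best;
    }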
Re: Configuration: I need help.
Allen Wittenauer wrote: On 8/6/08 11:52 AM, Otis Gospodnetic [EMAIL PROTECTED] wrote: You can put the same hadoop-site.xml on all machines. Yes, you do want a secondary NN -- a single NN is a SPOF. Browse the archives a few days back to find an email from Paul about DRBD (disk replication) to avoid this SPOF. Keep in mind that even with a secondary namenode, you still have a SPOF. If the NameNode process dies, so does your HDFS. There's always a SPOF; it just moves. Sometimes it moves out of your own infrastructure, and then you have big problems :)
Re: fuse-dfs
Thanks. After a lot of experimenting (and of course, right before you sent this reply) I figured it out. I also had to include the path to libhdfs.so in my ld.so.conf and update it before I was able to successfully compile fuse_dfs. However, when I try to mount the HDFS, it fails. I have tried both the wrapper script and the single binary. Both display the following error:

    fuse-dfs didn't recognize /mnt/hadoop,-2
    fuse-dfs ignoring option -d

regards, Sebastian

On Wed, Aug 6, 2008 at 5:29 PM, Pete Wyckoff [EMAIL PROTECTED] wrote: Sorry - I see the problem now. It should be: ant compile-contrib -Dlibhdfs=1 -Dfusedfs=1 compile-contrib depends on compile-libhdfs, which also requires the -Dlibhdfs=1 property to be set. pete

On 8/6/08 5:04 AM, Sebastian Vieira [EMAIL PROTECTED] wrote: Hi, I have installed Hadoop on 20 nodes (data storage) and one master (namenode) to which I want to add data. I have learned that this is possible through a Java API or via the Hadoop shell. However, I would like to mount the HDFS using FUSE, and I discovered that there's a contrib/fuse-dfs within the Hadoop tar.gz package. I read the README file and noticed that I was unable to compile using this command: ant compile-contrib -Dcompile.c++=1 -Dfusedfs=1 If I change the line to: ant compile-contrib -Dcompile.c++=1 -Dlibhdfs-fuse=1 it goes a little bit further. It will now start the configure script, but it still fails. I've tried a lot of different things but I'm unable to compile fuse-dfs. This is a piece of the error I get from ant:

    compile:
    [echo] contrib: fuse-dfs
    -snip-
    [exec] Making all in src
    [exec] make[1]: Entering directory `/usr/local/src/hadoop-core-trunk/src/contrib/fuse-dfs/src'
    [exec] gcc -Wall -O3 -L/usr/local/src/hadoop-core-trunk/build/libhdfs -lhdfs -L/usr/lib -lfuse -L/usr/java/jdk1.6.0_07/jre/lib/i386/server -ljvm -o fuse_dfs fuse_dfs.o
    [exec] /usr/bin/ld: cannot find -lhdfs
    [exec] collect2: ld returned 1 exit status
    [exec] make[1]: *** [fuse_dfs] Error 1
    [exec] make[1]: Leaving directory `/usr/local/src/hadoop-core-trunk/src/contrib/fuse-dfs/src'
    [exec] make: *** [all-recursive] Error 1

    BUILD FAILED
    /usr/local/src/hadoop-core-trunk/build.xml:413: The following error occurred while executing this line:
    /usr/local/src/hadoop-core-trunk/src/contrib/build.xml:30: The following error occurred while executing this line:
    /usr/local/src/hadoop-core-trunk/src/contrib/fuse-dfs/build.xml:40: exec returned: 2

Could somebody shed some light on this? thanks, Sebastian.
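[Editor's note: the ld.so.conf step mentioned above amounts to something like the following, run as root; the libhdfs path is taken from the linker flags in the log and will differ per tree.]

    echo /usr/local/src/hadoop-core-trunk/build/libhdfs >> /etc/ld.so.conf
    ldconfig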
Re: How to run hadoop without DNS server?
While configuring and using the Hadoop framework, it seems that a DNS server must be used for hostname resolution (even if I configure the IP address rather than the hostname in the conf/slaves and conf/masters files). Because we don't have a local DNS server on our local ethernet, I have to add the hostname-IP mappings to the /etc/hosts file. Yeah ... annoying, isn't it? :) I have two questions about the hostname configuration: 1) Can we do some configuration in hadoop to avoid hostname resolution and use IP addresses directly? We tried, failed and gave up. That said, that was quite some time ago (0.13?). I know some fixes went in, but... 2) If I add a new machine to the cluster, it seems that I have to add the new machine's hostname or IP address to each node's conf/slaves file. If the cluster is large, this could be impossible to maintain. Is there any simple way to add a node dynamically without modifying all the other cluster nodes? Good question! Would love to see somewhat more dynamic discovery as well. That said, for a big cluster you will probably have central configuration management anyway. So for us it's just changing one file and Puppet will roll it out to the nodes. cheers -- Torsten
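[Editor's note: the /etc/hosts workaround is the usual static mapping, identical on every node; the addresses and hostnames below are made up.]

    192.168.1.10  master
    192.168.1.11  slave01
    192.168.1.12  slave02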
Why is scaling HBase much simpler than scaling a relational db?
Hello, can someone please explain, or point me to some documentation or papers where I can read well-proven facts on, why scaling a relational db is so hard and scaling a document-oriented db isn't? If I got lots of requests to my relational db, I would duplicate it across several servers and partition the requests. So why doesn't this scale, and why can HBase, for instance, manage it? I'm really new to this topic and would like to dive in deeper. Thanks a lot
Re: Why is scaling HBase much simpler than scaling a relational db?
Mork0075 wrote: Hello, can someone please explain, or point me to some documentation or papers where I can read well-proven facts on, why scaling a relational db is so hard and scaling a document-oriented db isn't? http://labs.google.com/papers/bigtable.html Relational dbs are great for lots of structured data where you can run SELECT operations, do O/R mapping to make rows look like objects, etc. They're a single thing to back up, and you get transactions. They're bad places to store binary data or, say, billions and billions of rows of web server log data. By relaxing some of the expectations of a relational db, things like bigtable, hbase and others can scale well; but since they have relaxed the rules, they may not do everything you want. If I got lots of requests to my relational db, I would duplicate it across several servers and partition the requests. So why doesn't this scale, and why can HBase, for instance, manage it? That's called sharding/horizontal partitioning. It works well if you can partition all your data so that different users go to different places, though once you've done that, you can't JOIN data across multiple machines. The alternative option (which is apparently common in places like myspace and imdb) is to have one r/w master and a number of read-only slaves. All changes go into the master; the slaves pick up the changes later. I'm really new to this topic and would like to dive in deeper. Check out the articles at http://highscalability.com/ -steve
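[Editor's sketch to make the sharding idea concrete: hash-based routing of users to shards. All names here are hypothetical; real setups also have to handle resharding when servers are added.]

    // Route each user to a fixed database server by hashing the user id.
    // A toy sketch of sharding/horizontal partitioning, not a real library.
    public class ShardRouter {
        private final String[] shardHosts;

        public ShardRouter(String[] shardHosts) {
            this.shardHosts = shardHosts;
        }

        public String shardFor(String userId) {
            int h = userId.hashCode() & Integer.MAX_VALUE; // force non-negative
            return shardHosts[h % shardHosts.length];
        }
    }

    // Usage: new ShardRouter(new String[] {"db1", "db2", "db3"}).shardFor("alice")
    // always returns the same host for "alice", so her rows live in one place --
    // which is also why cross-user JOINs no longer work.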
Re: DFS. How to read from a specific datanode
Yes, I agree with you that it should be negotiated: the namenode provides an ordered list and the client can choose from it based on its own measurements. But I am afraid 0.17.1 does not provide an easy interface for this. -Kevin On Thu, Aug 7, 2008 at 3:40 AM, Steve Loughran [EMAIL PROTECTED] wrote: -snip- (quoted in full above)
Re: hdfs question
One way to get all Unix commands to work as-is is to mount HDFS as a normal Unix filesystem with either fuse-dfs (in contrib) or hdfs-fuse (on Google Code). Pete On 8/6/08 5:08 PM, Mori Bellamy [EMAIL PROTECTED] wrote: hey all, often I find it would be convenient to run conventional unix commands on hdfs, such as using the following to delete the contents of my HDFS: hadoop dfs -rm * or moving files from one folder to another: hadoop dfs -mv /path/one/* path/two/ Does anyone know of a way to do this?
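[Editor's note: with fuse-dfs, that looks roughly like the following; the wrapper invocation mirrors the one shown later in this digest, and host/port/paths are illustrative.]

    ./fuse_dfs_wrapper.sh dfs://master:9000 /mnt/hadoop
    rm /mnt/hadoop/some/dir/*                       # ordinary Unix tools now
    mv /mnt/hadoop/path/one/* /mnt/hadoop/path/two/ # operate on HDFS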
extracting input to a task from a (streaming) job?
I have a large Hadoop streaming job that generally works fine, but a few (2-4) of the ~3000 maps and reduces have problems. To make matters worse, the problems are system-dependent (we run on a cluster with machines of slightly different OS versions). I'd of course like to debug these problems, but they are embedded in a large job. Is there a way to extract the input given to a reducer from a job, given the task identity? (This would also be helpful for mappers.) This is clearly technically *possible*, since hadoop can rerun the tasks if they fail. But is there an external program that actually does it? Or are there instructions for poking around on the compute nodes' local disks to assemble it by hand? Or better suggestions? It would be a real boon for people developing map and reduce user code. Thanks for any pointers. -John Heidemann
Re: hadoop question
Can you also post your hadoop-site.xml and hadoop-default.xml? -k On Thu, Aug 7, 2008 at 3:52 AM, Mr.Thien [EMAIL PROTECTED] wrote: Hi everyone, I am trying to use hadoop. I set up my computer (thientd-desktop) as the master (jobtracker and namenode), and two other computers, trunght-desktop and quanglt-desktop, as slaves. When I execute the example below, the map operation seems to be OK (it always succeeds, and quickly). However, the reduce operation always fails at 16.66%. If I run hadoop on only one computer, it runs fine. Below is the screen output when the problem arises:

    [EMAIL PROTECTED]:~/projects/hadoop-0.17.1$ bin/hadoop jar hadoop-0.17.1-examples.jar wordcount gutenberg thien-out
    08/08/07 14:17:15 INFO mapred.FileInputFormat: Total input paths to process : 1
    08/08/07 14:17:15 INFO mapred.JobClient: Running job: job_200808071415_0001
    08/08/07 14:17:16 INFO mapred.JobClient: map 0% reduce 0%
    08/08/07 14:17:23 INFO mapred.JobClient: map 50% reduce 0%
    08/08/07 14:17:24 INFO mapred.JobClient: map 100% reduce 0%
    08/08/07 14:17:28 INFO mapred.JobClient: map 100% reduce 16%
    08/08/07 14:25:34 INFO mapred.JobClient: Task Id : task_200808071415_0001_m_01_0, Status : FAILED Too many fetch-failures
    08/08/07 14:25:34 WARN mapred.JobClient: Error reading task outputquanglt-desktop
    08/08/07 14:25:34 WARN mapred.JobClient: Error reading task outputquanglt-desktop

Could anyone tell me the possible reason for the error? Thanks in advance. thientd.
Re: extracting input to a task from a (streaming) job?
Hello John, On Thu, Aug 7, 2008 at 6:30 PM, John Heidemann [EMAIL PROTECTED] wrote: -snip- (quoted in full above) I believe you should set keep.failed.tasks.files to true -- this way, given a task id, you can see what input files it has in ~/taskTracker/${taskid}/work (source: http://hadoop.apache.org/core/docs/r0.17.0/mapred_tutorial.html#IsolationRunner ). On top of that, you can always use the debugging facilities: http://hadoop.apache.org/core/docs/r0.17.0/mapred_tutorial.html#Debugging When a map/reduce task fails, the user can run a script to do post-processing on the task logs, i.e. the task's stdout, stderr, syslog and jobconf. The stdout and stderr of the user-provided debug script are printed on the diagnostics. I hope this helps. Regards, Leon Mergen
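[Editor's sketch: for a Java job the flag can be set on the JobConf (streaming users would pass the same property on the command line). MyJob is a placeholder, and the property name follows the tutorial text quoted above.]

    import org.apache.hadoop.mapred.JobConf;

    // Keep a failed task's files on the tasktracker for post-mortem debugging.
    JobConf conf = new JobConf(MyJob.class);      // MyJob: your job's main class
    conf.set("keep.failed.tasks.files", "true");  // property name as quoted above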
Re: reduce job did not complete in a long time
On 28-Jul-08, at 6:33 PM, charles du wrote: Hi: I tried to run one of my map/reduce jobs on a cluster (hadoop 0.17.0). I used 10 reducers. 9 of them return quickly (in a few seconds), but one has been running for several hours with still no sign of completion. Do you know how I can debug it or find out what is going on with this reducer? You can log, and set the status message. If you're using streaming, I think you're limited to writing to stderr. The only way I've found to read the logs on a distributed run is by sshing to the actual task box and looking at the log directory. I've almost gotten frustrated enough to have my tasks send email, but not quite. Debugging is easier on a single pseudo-distributed box because all the logs and stderr are right there, so try that if you can.
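[Editor's sketch of the "set the status message" suggestion against the 0.17-era API; the reducer body is invented for illustration, and imports from java.util, java.io, org.apache.hadoop.io and org.apache.hadoop.mapred are assumed.]

    // Periodically report status so a slow key shows progress in the job UI
    // instead of looking hung.
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        long seen = 0;
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
            if (++seen % 100000 == 0) {
                reporter.setStatus(key + ": processed " + seen + " values");
            }
        }
        output.collect(key, new IntWritable(sum));
    }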
Re: reduce job did not complete in a long time
You should use the web UI -- each mapper/reducer can be inspected, and there is no need to ssh in. Miles 2008/8/7 Karl Anderson [EMAIL PROTECTED]: -snip- (quoted in full above)
Join example
Hadoop ships with a few example programs. One of these is join, which I believe demonstrates map-side joins. I'm finding its usage instructions a little impenetrable; could anyone send me instructions that are more like "type this, then type this, then type this"? Thanks in advance. Cheers, John
Re: fuse-dfs
On Thu, Aug 7, 2008 at 4:25 PM, Pete Wyckoff [EMAIL PROTECTED] wrote: Hi Sebastian, Those 2 things are just warnings and shouldn't cause any problems. What happens when you ls /mnt/hadoop?

    [EMAIL PROTECTED] fuse-dfs]# ls /mnt/hadoop
    ls: /mnt/hadoop: Transport endpoint is not connected

Also, this happens when I start fuse-dfs in one terminal and do a df -h in another:

    [EMAIL PROTECTED] fuse-dfs]# ./fuse_dfs_wrapper.sh dfs://master:9000 /mnt/hadoop -d port=9000,server=master
    fuse-dfs didn't recognize /mnt/hadoop,-2
    fuse-dfs ignoring option -d
    unique: 1, opcode: INIT (26), nodeid: 0, insize: 56
    INIT: 7.8 flags=0x0003 max_readahead=0x0002
    INIT: 7.8 flags=0x0001 max_readahead=0x0002 max_write=0x0010
    unique: 1, error: 0 (Success), outsize: 40
    unique: 2, opcode: STATFS (17), nodeid: 1, insize: 40

-- now I do a df -h in the other terminal --

    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
    Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
        at java.net.URLClassLoader$1.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at java.lang.ClassLoader.loadClassInternal(Unknown Source)

Then the output from df is:

    df: `/mnt/hadoop': Software caused connection abort

And also, what version of fuse-dfs are you using? The handling of options is different in trunk than in the last release.

    [EMAIL PROTECTED] fuse-dfs]# ./fuse_dfs --version
    ./fuse_dfs 0.1.0

I did a checkout of the latest svn and compiled using the command you gave in one of your previous mails. You can also look in /var/log/messages. Only one line:

    Aug 7 20:21:05 master fuse_dfs: mounting dfs://master:9000/

Thanks for your time, Sebastian
Re: fuse-dfs
This just means your classpath is not set properly, so when fuse-dfs uses libhdfs to try to connect to your server, it cannot instantiate hadoop objects. I have a JIRA open to improve the error messaging when this happens: https://issues.apache.org/jira/browse/HADOOP-3918 If you use fuse_dfs_wrapper.sh, you should be able to set HADOOP_HOME and it will create the classpath for you. In retrospect, fuse_dfs_wrapper.sh should probably complain and exit if HADOOP_HOME is not set. -- pete On 8/7/08 2:35 PM, Sebastian Vieira [EMAIL PROTECTED] wrote: -snip- (quoted in full above)
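[Editor's note: in other words, something along these lines before mounting; the HADOOP_HOME path is hypothetical.]

    export HADOOP_HOME=/usr/local/hadoop    # wherever your Hadoop tree lives
    ./fuse_dfs_wrapper.sh dfs://master:9000 /mnt/hadoop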
java.io.IOException: Could not get block locations. Aborting...
Hi there: We would like to know the most likely causes of this sort of error:

    Exception closing file /data1/hdfs/tmp/person_url_pipe_59984_3405334/_temporary/_task_200807311534_0055_m_22_0/part-00022
    java.io.IOException: Could not get block locations. Aborting...
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2080)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1300(DFSClient.java:1702)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1818)

Our map-reduce job does not fail completely, but over 50% of the map tasks fail with this same error. We recently migrated our cluster from 0.16.4 to 0.17.1; previously we didn't have this problem using the same input data in a similar map-reduce job. Thank you, Piotr
RE: Distributed Lucene - from hadoop contrib
Hey guys, I would appreciate any feedback on this. Deepika -Original Message- From: Deepika Khera [mailto:[EMAIL PROTECTED]] Sent: Wednesday, August 06, 2008 5:39 PM To: core-user@hadoop.apache.org Subject: Distributed Lucene - from hadoop contrib Hi, I am planning to use distributed lucene from hadoop.contrib.index for indexing. Has anyone used this or tested it? Any issues or comments? I see that the design described is different from HDFS (the Namenode is stateless, stores no information regarding blocks for files, etc.). Does anyone know how hard it will be to set up this kind of system, or is there something that can be reused? A reference link: http://wiki.apache.org/hadoop/DistributedLucene Thanks, Deepika
Re: Are lines broken in dfs and/or in InputSplit
Kevin wrote: Yes, I have looked at the block files and it matches what you said. I am just wondering if there is some property or flag that would turn this feature on, if it exists. No. If you required this then you'd need to pad your data, but I'm not sure why you'd ever require it. Running off the end of a block in mapreduce makes for a small amount of non-local i/o, but it's generally insignificant. Doug
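[Editor's sketch of the split-boundary rule Doug describes: a reader skips the partial record at the start of its split and finishes its last record even when the tail bytes sit in the next block. This is illustrative Java only, not Hadoop's actual LineRecordReader.]

    import java.util.ArrayList;
    import java.util.List;

    public class SplitReaderSketch {
        // Read the lines "owned" by the byte range [start, end] of file.
        static List<String> readSplit(byte[] file, int start, int end) {
            List<String> lines = new ArrayList<String>();
            int pos = start;
            // A split that doesn't begin the file skips its partial first
            // line; that line belongs to the previous split.
            if (start > 0) {
                while (pos < file.length && file[pos - 1] != '\n') pos++;
            }
            // Read whole lines; the last one may run past 'end' into the
            // next block -- the small non-local read mentioned above.
            while (pos <= end && pos < file.length) {
                int lineStart = pos;
                while (pos < file.length && file[pos] != '\n') pos++;
                lines.add(new String(file, lineStart, pos - lineStart));
                pos++; // step over the newline
            }
            return lines;
        }
    }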
Re: Distributed Lucene - from hadoop contrib
http://wiki.apache.org/hadoop/DistributedLucene and hadoop.contrib.index are two different things. For information on hadoop.contrib.index, see the README file in the package. I believe you can find code for http://wiki.apache.org/hadoop/DistributedLucene at http://katta.wiki.sourceforge.net/. Ning On 8/7/08, Deepika Khera [EMAIL PROTECTED] wrote: -snip- (quoted in full above)
Passing TupleWritable between map and reduce
Hi, I am a new hadoop developer and am struggling to understand why I cannot pass TupleWritable between a map and reduce function. I have modified the wordcount example to demonstrate the issue. Also, I am using hadoop 0.17.1.

    package wordcount;

    import java.io.IOException;
    import java.util.*;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.mapred.join.*;

    public class WordCount {
        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, TupleWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(LongWritable key, Text value,
                            OutputCollector<Text, TupleWritable> output,
                            Reporter reporter) throws IOException {
                String line = value.toString();
                StringTokenizer tokenizer = new StringTokenizer(line);
                TupleWritable tuple = new TupleWritable(new Writable[] { one });
                while (tokenizer.hasMoreTokens()) {
                    word.set(tokenizer.nextToken());
                    output.collect(word, tuple);
                }
            }
        }

        public static class Reduce extends MapReduceBase
                implements Reducer<Text, TupleWritable, Text, TupleWritable> {
            public void reduce(Text key, Iterator<TupleWritable> values,
                               OutputCollector<Text, TupleWritable> output,
                               Reporter reporter) throws IOException {
                IntWritable i = new IntWritable();
                int sum = 0;
                while (values.hasNext()) {
                    i = ((IntWritable) values.next().get(0));
                    sum += i.get();
                }
                TupleWritable tuple = new TupleWritable(
                    new Writable[] { new IntWritable(sum) });
                output.collect(key, tuple);
            }
        }

        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(TupleWritable.class);
            conf.setMapperClass(Map.class);
            conf.setReducerClass(Reduce.class);
            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
        }
    }

The output is always empty tuples ('[]'). Using the debugger, I have determined that the line:

    TupleWritable tuple = new TupleWritable(new Writable[] { one });

is properly constructing the desired tuple. I am not sure whether it is being output correctly by output.collect, as I cannot find the field in the OutputCollector data structure. When I check in the reduce method, the values are always empty tuples. I have a feeling it has something to do with this line in the JavaDoc: TupleWritable(Writable[] vals) -- Initialize tuple with storage; unknown whether any of them contain "written" values. Thanks in advance for any and all help, Michael
Re: Passing TupleWritable between map and reduce
Sorry about the massive code chunk; I am not used to this mail client, so I attached the file instead. On 8/7/08 4:18 PM, Michael Andrews [EMAIL PROTECTED] wrote: -snip- (quoted above)
Re: Passing TupleWritable between map and reduce
You need access to TupleWritable::setWritten(int). If you want to use TupleWritable outside the join package, then you need to make this (and probably related methods, like clearWritten(int)) public and recompile. Please file a JIRA if you think it should be more general. -C On Aug 7, 2008, at 4:18 PM, Michael Andrews wrote: -snip- (quoted in full above)
Re: extracting input to a task from a (streaming) job?
On Thu, 07 Aug 2008 19:42:05 +0200, Leon Mergen wrote: -snip- (quoted in full above) Thanks. It looks like IsolationRunner is what I'm asking for. I'll try it out. I was aware of the logs, but unfortunately I have problems where inputs hang or don't log meaningful information. Separately, I found the output from the map stage (in our config, in .../hadoop-hadoop/mapred/local/taskTracker/jobcache/job_200808051739_0005/attempt_200808051739_0005_r_09_0/output/ which is a bit different from taskTracker/${taskid}/work. There's a work dir parallel to output, but it's empty.) Hopefully IsolationRunner will deal with this layout. -John
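[Editor's note: the tutorial linked above describes re-running the failed task under IsolationRunner roughly like this; paths depend on your mapred.local.dir layout, so treat it as a sketch of the documented procedure rather than a tested recipe.]

    # on the node where the task ran, from the kept task working directory:
    cd <local>/taskTracker/${taskid}/work
    $HADOOP_HOME/bin/hadoop org.apache.hadoop.mapred.IsolationRunner ../job.xml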
Re: Passing TupleWritable between map and reduce
OK, thanks for the information. I guess it seems strange to want to use TupleWritable in this way, but it seemed like the right thing to do based on the API docs. Is it more idiomatic to inherit from Writable when processing structured data? Again, I am really new to the hadoop community, but I will try to file something in JIRA on this. I'm not really sure how to proceed with a patch; maybe I could just try to clarify the docs? On 8/7/08 4:38 PM, Chris Douglas [EMAIL PROTECTED] wrote: -snip- (quoted in full above)
mapred/map only at 2, always?
hadoop 0.16.4 Why are mapred.reduce.tasks and mapred.map.tasks always showing up as 2? I have the same config on all nodes. hadoop-site.xml contains the following parameters:

    <property>
      <name>mapred.map.tasks</name>
      <value>67</value>
      <description>The default number of map tasks per job. Typically set
      to a prime several times greater than number of available hosts.
      Ignored when mapred.job.tracker is "local".</description>
    </property>

    <property>
      <name>mapred.reduce.tasks</name>
      <value>23</value>
      <description>The default number of reduce tasks per job. Typically set
      to a prime close to the number of available hosts. Ignored when
      mapred.job.tracker is "local".</description>
    </property>

    <property>
      <name>mapred.job.tracker</name>
      <value>idx1-r70:50030</value> <!-- mapred.job.tracker -->
      <description>The host and port that the MapReduce job tracker runs at.
      If "local", then jobs are run in-process as a single map and reduce
      task.</description>
    </property>

-- James Graham (Greywolf) | 650.930.1138|925.768.4053 * [EMAIL PROTECTED] | Check out what people are saying about SearchMe! -- click below http://www.searchme.com/stack/109aa
Re: Passing TupleWritable between map and reduce
Particularly if you know which types to expect in your structured data, rolling your own Writable is strongly preferred to TupleWritable. The latter serializes to a comically verbose format and should only be used when the types and nesting depth are unknown. -C On Aug 7, 2008, at 5:45 PM, Michael Andrews wrote: -snip- (quoted in full above)
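[Editor's sketch of "rolling your own Writable" against the 0.17-era API; the class name CountWritable is made up.]

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    // A tiny custom value type: serializes exactly one int, nothing more.
    public class CountWritable implements Writable {
        private int count;

        public CountWritable() {}                 // required no-arg constructor
        public CountWritable(int count) { this.count = count; }

        public int get() { return count; }
        public void set(int count) { this.count = count; }

        public void write(DataOutput out) throws IOException {
            out.writeInt(count);
        }

        public void readFields(DataInput in) throws IOException {
            count = in.readInt();
        }

        public String toString() { return Integer.toString(count); }
    }

    // In the word count above, CountWritable would replace TupleWritable as
    // the map output value class and the job's output value class.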
Setting up a Hadoop cluster where nodes are spread over the Internet
Hello, Can someone point me out what are the extra tasks that need to be performed in order to set up a cluster where nodes are spread over the Internet, in different LANs? Do I need to free any datanode/namenode ports? How do I get the datanodes to know the valid namenode IP, and not something like 10.1.1.1? Any help is appreciate. Lucas