Re: Namenode Exceptions with S3
Sorry for the time taken to respond, I've been doing some tests on this. Your workaround worked like a charm, thank you :) Now I'm able to fetch the data from S3, process it using HDFS, and put the results back in S3.

About the (a) problem I mentioned in my previous email: I now understand the error. I was starting the namenode and datanodes and then changing fs.default.name to s3://bucket/, so now I understand why it doesn't work.

Thank you *very* much for your help, now I can use EC2 and S3 :)

slitz

On Fri, Jul 11, 2008 at 10:46 PM, Tom White <[EMAIL PROTECTED]> wrote:
> On Fri, Jul 11, 2008 at 9:09 PM, slitz <[EMAIL PROTECTED]> wrote:
> > a) Use S3 only, without HDFS, configuring fs.default.name as s3://bucket
> > -> PROBLEM: we are getting ERROR org.apache.hadoop.dfs.NameNode:
> > java.lang.RuntimeException: Not a host:port pair: X
>
> What command are you using to start Hadoop?
>
> > b) Use HDFS as the default FS, specifying S3 only as input for the first job
> > and output for the last (assuming one has multiple jobs on the same data)
> > -> PROBLEM: https://issues.apache.org/jira/browse/HADOOP-3733
>
> Yes, this is a problem. I've added a comment to the Jira description
> describing a workaround.
>
> Tom
Re: Namenode Exceptions with S3
I've been learning a lot from this thread, and Tom just helped me understand some things about S3 and HDFS, thank you. To wrap everything up, if we want to use S3 with EC2 we can:

a) Use S3 only, without HDFS, configuring fs.default.name as s3://bucket
   -> PROBLEM: we are getting ERROR org.apache.hadoop.dfs.NameNode:
   java.lang.RuntimeException: Not a host:port pair: X

b) Use HDFS as the default FS, specifying S3 only as input for the first job
   and output for the last (assuming one has multiple jobs on the same data)
   -> PROBLEM: https://issues.apache.org/jira/browse/HADOOP-3733

So, in my case I cannot use S3 at all for now because of these two problems. Any advice?

slitz

On Fri, Jul 11, 2008 at 4:31 PM, Lincoln Ritter <[EMAIL PROTECTED]> wrote:
> Thanks Tom!
>
> Your explanation makes things a lot clearer. I think that changing
> 'fs.default.name' to something like 'dfs.namenode.address' would
> certainly be less confusing since it would clarify the purpose of
> these values.
>
> -lincoln
>
> --
> lincolnritter.com
>
> On Fri, Jul 11, 2008 at 4:21 AM, Tom White <[EMAIL PROTECTED]> wrote:
> > On Thu, Jul 10, 2008 at 10:06 PM, Lincoln Ritter
> > <[EMAIL PROTECTED]> wrote:
> >> Thank you, Tom.
> >>
> >> Forgive me for being dense, but I don't understand your reply:
> >
> > Sorry! I'll try to explain it better (see below).
> >
> >> Do you mean that it is possible to use the Hadoop daemons with S3 but
> >> the default filesystem must be HDFS?
> >
> > The HDFS daemons use the value of "fs.default.name" to set the
> > namenode host and port, so if you set it to an s3 URI, you can't run
> > the HDFS daemons. So in this case you would use the start-mapred.sh
> > script instead of start-all.sh.
> >
> >> If that is the case, can I
> >> specify the output filesystem on a per-job basis and can that be an S3
> >> FS?
> >
> > Yes, that's exactly how you do it.
> >
> >> Also, is there a particular reason to not allow S3 as the default FS?
> >
> > You can allow S3 as the default FS, it's just that then you can't run
> > HDFS at all in this case. You would only do this if you don't want to
> > use HDFS at all, for example, if you were running a MapReduce job
> > which read from S3 and wrote to S3.
> >
> > It might be less confusing if the HDFS daemons didn't use
> > fs.default.name to define the namenode host and port. Just like
> > mapred.job.tracker defines the host and port for the jobtracker,
> > dfs.namenode.address (or similar) could define the namenode. Would
> > this be a good change to make?
> >
> > Tom
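Option (b) from this thread, keeping HDFS as the default filesystem while a single job reads from and writes to S3, can be sketched as a job driver fragment. This is a minimal sketch using the old JobConf API of the Hadoop 0.17 era discussed here (the quoted reply mentions jobConf.addInputPath); the bucket name and paths are placeholders, not from the thread.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class S3JobDriver {
    public static JobConf configurePaths(JobConf conf) {
        // Fully qualified s3:// URIs override fs.default.name for these
        // paths only; the rest of the job (temp dirs, system dir) stays
        // on the default HDFS filesystem.
        conf.addInputPath(new Path("s3://my-bucket/input"));
        conf.setOutputPath(new Path("s3://my-bucket/output"));
        return conf;
    }
}
```

The fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey properties still need to be set in the configuration for the s3:// URIs to resolve.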
Re: Namenode Exceptions with S3
I'm having the exact same problem, any tip?

slitz

On Wed, Jul 2, 2008 at 12:34 AM, Lincoln Ritter <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I am trying to use S3 with Hadoop 0.17.0 on EC2, using this style of
> configuration:
>
> <property>
>   <name>fs.default.name</name>
>   <value>s3://$HDFS_BUCKET</value>
> </property>
>
> <property>
>   <name>fs.s3.awsAccessKeyId</name>
>   <value>$AWS_ACCESS_KEY_ID</value>
> </property>
>
> <property>
>   <name>fs.s3.awsSecretAccessKey</name>
>   <value>$AWS_SECRET_ACCESS_KEY</value>
> </property>
>
> On startup of the cluster, with the bucket having no non-alphabetic
> characters, I get:
>
> 2008-07-01 16:10:49,171 ERROR org.apache.hadoop.dfs.NameNode:
> java.lang.RuntimeException: Not a host:port pair: X
>    at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:121)
>    at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:121)
>    at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:178)
>    at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:164)
>    at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:848)
>    at org.apache.hadoop.dfs.NameNode.main(NameNode.java:857)
>
> If I use this style of configuration:
>
> <property>
>   <name>fs.default.name</name>
>   <value>s3://$AWS_ACCESS_KEY:[EMAIL PROTECTED]</value>
> </property>
>
> I get (where the all-caps portions are the actual values...):
>
> 2008-07-01 19:05:17,540 ERROR org.apache.hadoop.dfs.NameNode:
> java.lang.NumberFormatException: For input string:
> "[EMAIL PROTECTED]"
>    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
>    at java.lang.Integer.parseInt(Integer.java:447)
>    at java.lang.Integer.parseInt(Integer.java:497)
>    at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:128)
>    at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:121)
>    at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:178)
>    at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:164)
>    at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:848)
>    at org.apache.hadoop.dfs.NameNode.main(NameNode.java:857)
>
> These exceptions are taken from the namenode log. The datanode logs
> show the same exceptions.
>
> Other than the above configuration changes, the configuration is
> identical to that generated by the hadoop image creation script found
> in the 0.17.0 distribution.
>
> Can anybody point me in the right direction here?
>
> -lincoln
>
> --
> lincolnritter.com
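Putting the resolution of this thread together, a working S3-only hadoop-site.xml can be sketched as below. The bucket name and key values are placeholders; the point (from Tom's replies) is to keep credentials in the two dedicated properties rather than embedding them in the URI, since secret keys often contain characters such as '/' that break URI parsing.

```xml
<!-- S3-only setup: no HDFS daemons run with this default FS. -->
<property>
  <name>fs.default.name</name>
  <value>s3://my-bucket</value>
</property>

<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>

<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>
```

With this configuration, start only the MapReduce daemons (start-mapred.sh, not start-all.sh): the namenode cannot parse an s3:// URI as a host:port pair, which is exactly the "Not a host:port pair" error above.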
Re: Using S3 Block FileSystem as HDFS replacement
That's a good point, in fact it didn't occur to me that I could access it like that. But some questions came to my mind: how do I put something into the fs? Something like "bin/hadoop fs -put input input" will not work well since S3 is not the default fs, so I tried to do

bin/hadoop fs -put input s3://ID:[EMAIL PROTECTED]/input

(and some variations of it) but it didn't work; I always got an error complaining about not having provided the ID/secret for S3.

To experiment a little, I tried to edit conf/hadoop-site.xml (something that's only possible when experimenting, because of the lack of persistence of these changes, unless a new AMI is created), added the fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey properties, and changed fs.default.name to an s3:// one. This worked for things like:

mkdir input
cp conf/*.xml input
bin/hadoop fs -put input input
bin/hadoop fs -ls input

But then I faced another problem. When I tried to run

bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'

which now should be able to run using S3 as a FileSystem, I got this error:

08/07/01 22:12:55 INFO mapred.FileInputFormat: Total input paths to process : 2
08/07/01 22:12:57 INFO mapred.JobClient: Running job: job_200807012133_0010
08/07/01 22:12:58 INFO mapred.JobClient: map 100% reduce 100%
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
(...)

I tried several times, and with the wordcount example too, but the error was always the same. What could be the problem here? And how may I access the FileSystem with "bin/hadoop fs ..." if the default filesystem isn't S3?

Thank you very much :)

slitz

On Tue, Jul 1, 2008 at 4:43 PM, Chris K Wensel <[EMAIL PROTECTED]> wrote:
> by editing the hadoop-site.xml, you set the default. but I don't recommend
> changing the default on EC2.
>
> but you can specify the filesystem to use through the URL that references
> your data (jobConf.addInputPath etc) for a particular job.
> in the case of the S3 block filesystem, just use an s3:// url.
>
> ckw
>
> On Jun 30, 2008, at 8:04 PM, slitz wrote:
>
>> Hello,
>> I've been trying to set up hadoop to use S3 as the filesystem. I read in the wiki
>> that it's possible to choose either the S3 native FileSystem or the S3 Block
>> FileSystem. I would like to use the S3 Block FileSystem to avoid the task of
>> "manually" transferring data from S3 to HDFS every time I want to run a job.
>>
>> I'm still experimenting with the EC2 contrib scripts and those seem to be
>> excellent.
>> What I can't understand is how it may be possible to use S3 with a public
>> hadoop AMI, since from my understanding hadoop-site.xml gets written on each
>> instance startup with the options in hadoop-init, and it seems that the
>> public AMI (at least the 0.17.0 one) is not configured to use S3 at
>> all (which makes sense because the bucket would need individual configuration
>> anyway).
>>
>> So... to use the S3 block FileSystem with EC2 I need to create a custom AMI with
>> a modified hadoop-init script, right? Or am I completely confused?
>>
>> slitz
>
> --
> Chris K Wensel
> [EMAIL PROTECTED]
> http://chris.wensel.net/
> http://www.cascading.org/
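The question of reaching an S3 bucket from the fs shell when S3 is not the default filesystem can be sketched two ways. The bucket name and key values below are placeholders, and the second form assumes the fs shell accepts -D generic options (as ToolRunner-based commands do), which may vary by Hadoop version.

```shell
# 1) Credentials embedded in the URI. This only works when the secret key
#    contains no '/' characters, which break URI parsing:
bin/hadoop fs -put input s3://ID:SECRET@my-bucket/input

# 2) Credentials passed as configuration properties on the command line,
#    so the URI stays clean:
bin/hadoop fs \
  -D fs.s3.awsAccessKeyId=ID \
  -D fs.s3.awsSecretAccessKey=SECRET \
  -ls s3://my-bucket/input
```

If neither form works on a given release, setting the two fs.s3.* properties in hadoop-site.xml (as slitz did above) remains the fallback.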
Using S3 Block FileSystem as HDFS replacement
Hello,
I've been trying to set up hadoop to use S3 as the filesystem. I read in the wiki that it's possible to choose either the S3 native FileSystem or the S3 Block FileSystem. I would like to use the S3 Block FileSystem to avoid the task of "manually" transferring data from S3 to HDFS every time I want to run a job.

I'm still experimenting with the EC2 contrib scripts and those seem to be excellent. What I can't understand is how it may be possible to use S3 with a public hadoop AMI, since from my understanding hadoop-site.xml gets written on each instance startup with the options in hadoop-init, and it seems that the public AMI (at least the 0.17.0 one) is not configured to use S3 at all (which makes sense because the bucket would need individual configuration anyway).

So... to use the S3 block FileSystem with EC2 I need to create a custom AMI with a modified hadoop-init script, right? Or am I completely confused?

slitz
Re: MultipleOutputFormat example
Hello, I just did! Thank you! And indeed it is A LOT easier, or maybe it's just the included snippets that help a lot, or maybe both :)

Although I would still like to learn how to use MultipleOutputFormat/MultipleTextOutputFormat, since it should be more flexible, and I would like to know how to use this kind of thing in hadoop, as it could help me understand other classes and patterns. So it would be great if someone could give me an example of how to use it.

slitz

On Wed, Jun 25, 2008 at 7:53 PM, montag <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> You should check out the MultipleOutputs thread and patch of
> https://issues.apache.org/jira/browse/HADOOP-3149 There are
> some relevant and useful code snippets that address the issue of splitting
> output to multiple files within the discussion as well as in the patch
> documentation. I found implementing this patch easier than dealing with
> MultipleTextOutputFormat.
>
> Cheers,
> Mike
>
> slitz wrote:
> >
> > Hello,
> > I need the reduce to output to different files depending on the key. After
> > reading some jira entries and some previous threads on the mailing list I
> > think the MultipleTextOutputFormat class would fit my needs; the
> > problem is that I can't find any example of how to use it.
> >
> > Could someone please show me a quick example of how to use this class or
> > MultipleOutputFormat subclasses in general? I'm somewhat lost...
> >
> > slitz
>
> --
> View this message in context:
> http://www.nabble.com/MultipleOutputFormat-example-tp18118780p18119478.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
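For the MultipleTextOutputFormat example slitz asks for, the usual pattern is to subclass it and override generateFileNameForKeyValue, which maps each output record to a file name. A minimal sketch using the old org.apache.hadoop.mapred API of this era; the class name and key-per-directory naming scheme are illustrative choices, not from the thread.

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class KeyBasedOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        // "name" is the default part file name (e.g. part-00000); prefixing
        // it with the key sends each key's records to its own subdirectory.
        return key.toString() + "/" + name;
    }
}
```

It would then be registered in the driver with conf.setOutputFormat(KeyBasedOutputFormat.class) in place of TextOutputFormat.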
MultipleOutputFormat example
Hello,
I need the reduce to output to different files depending on the key. After reading some jira entries and some previous threads on the mailing list I think the MultipleTextOutputFormat class would fit my needs; the problem is that I can't find any example of how to use it.

Could someone please show me a quick example of how to use this class or MultipleOutputFormat subclasses in general? I'm somewhat lost...

slitz
Re: Using NFS without HDFS
Thank you for the file:/// tip, I was not including it in the paths.

I'm running the example with this line:

bin/hadoop jar hadoop-*-examples.jar grep file:///home/slitz/warehouse/input file:///home/slitz/warehouse/output 'dfs[a-z.]+'

But I'm getting the same error as before:

org.apache.hadoop.mapred.InvalidInputException: Input path doesn't exist :
/home/slitz/hadoop-0.15.3/grep-temp-1030179831
at org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:154)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:508)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:753)
(...stack continues...)

I think the problem may be the input path; it should be pointing to some path in the NFS share, right? The grep-temp-* dir is being created in the HADOOP_HOME of Box A (192.168.2.3).

slitz

On Fri, Apr 11, 2008 at 4:06 PM, Luca <[EMAIL PROTECTED]> wrote:
> slitz wrote:
> > I've read in the archive that it should be possible to use any distributed
> > filesystem since the data is available to all nodes, so it should be
> > possible to use NFS, right?
> > I've also read somewhere in the archive that this should be possible...
>
> As far as I know, you can refer to any file on a mounted file system
> (visible from all compute nodes) using the prefix file:// before the full
> path, unless another prefix has been specified.
>
> Cheers,
> Luca
>
> > slitz
> >
> > On Fri, Apr 11, 2008 at 1:43 PM, Peeyush Bishnoi <[EMAIL PROTECTED]>
> > wrote:
> > > Hello,
> > >
> > > To execute a Hadoop Map-Reduce job, input data should be on HDFS, not on
> > > NFS.
> > >
> > > Thanks
> > >
> > > ---
> > > Peeyush
> > >
> > > On Fri, 2008-04-11 at 12:40 +0100, slitz wrote:
> > > > Hello,
> > > > I'm trying to assemble a simple setup of 3 nodes using NFS as a
> > > > Distributed Filesystem.
> > > >
> > > > Box A: 192.168.2.3, this box is both the NFS server and working as a
> > > > slave node
> > > > Box B: 192.168.2.30, this box is only the JobTracker
> > > > Box C: 192.168.2.31, this box is only a slave
> > > >
> > > > Obviously all three nodes can access the NFS share, and the path to
> > > > the share is /home/slitz/warehouse on all three.
> > > >
> > > > My hadoop-site.xml file was copied over all nodes and looks like
> > > > this:
> > > >
> > > > <property>
> > > >   <name>fs.default.name</name>
> > > >   <value>local</value>
> > > >   <description>The name of the default file system. Either the literal string
> > > >   "local" or a host:port for NDFS.</description>
> > > > </property>
> > > >
> > > > <property>
> > > >   <name>mapred.job.tracker</name>
> > > >   <value>192.168.2.30:9001</value>
> > > >   <description>The host and port that the MapReduce job
> > > >   tracker runs at. If "local", then jobs are
> > > >   run in-process as a single map and reduce task.</description>
> > > > </property>
> > > >
> > > > <property>
> > > >   <name>mapred.system.dir</name>
> > > >   <value>/home/slitz/warehouse/hadoop_service/system</value>
> > > >   <description>omgrotfcopterlol.</description>
> > > > </property>
> > > >
> > > > As one can see, I'm not using HDFS at all.
> > > > (Because all the free space I have is located on only one node, so
> > > > using HDFS would be unnecessary overhead.)
> > > >
> > > > I've copied the input folder from hadoop to
> > > > /home/slitz/warehouse/input.
> > > > When I try to run the example line
> > > >
> > > > bin/hadoop jar hadoop-*-examples.jar grep /home/slitz/warehouse/input/
> > > > /home/slitz/warehouse/output 'dfs[a-z.]+'
> > > >
> > > > the job starts and finishes okay, but at the end I get this error:
> > > >
> > > > org.apache.hadoop.mapred.InvalidInputException: Input path doesn't exist :
> > > > /home/slitz/hadoop-0.15.3/grep-temp-141595661
> > > > at org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:154)
> > > > at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:508)
> > > > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:753)
> > > > (...the error stack continues...)
> > > >
> > > > I don't know why the input path being looked up is the local path
> > > > /home/slitz/hadoop(...) instead of /home/slitz/warehouse/(...)
> > > >
> > > > Maybe something is missing in my hadoop-site.xml?
> > > >
> > > > slitz
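The symptom in this thread is that job-temporary paths such as grep-temp-* land under HADOOP_HOME on the submitting node, which the other nodes cannot see. One configuration direction, sketched below as an untested assumption for this 0.15.x-era setup, is to point the scratch directories at the shared NFS mount so that every path a job creates is visible to all nodes. The hadoop_tmp directory name is a placeholder, and whether hadoop.tmp.dir covers the grep example's temp directory on this release is an assumption.

```xml
<property>
  <name>fs.default.name</name>
  <value>local</value>
</property>

<!-- Assumption: relocating scratch space onto the NFS share so
     intermediate job directories are visible from every node. -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/slitz/warehouse/hadoop_tmp</value>
</property>

<property>
  <name>mapred.system.dir</name>
  <value>/home/slitz/warehouse/hadoop_service/system</value>
</property>
```

Job input and output paths would still be given with the file:/// prefix, as Luca suggests above.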
Re: Using NFS without HDFS
I've read in the archive that it should be possible to use any distributed filesystem since the data is available to all nodes, so it should be possible to use NFS, right?
I've also read somewhere in the archive that this should be possible...

slitz

On Fri, Apr 11, 2008 at 1:43 PM, Peeyush Bishnoi <[EMAIL PROTECTED]> wrote:
> Hello,
>
> To execute a Hadoop Map-Reduce job, input data should be on HDFS, not on
> NFS.
>
> Thanks
>
> ---
> Peeyush
>
> On Fri, 2008-04-11 at 12:40 +0100, slitz wrote:
> > Hello,
> > I'm trying to assemble a simple setup of 3 nodes using NFS as a Distributed
> > Filesystem.
> >
> > Box A: 192.168.2.3, this box is both the NFS server and working as a slave
> > node
> > Box B: 192.168.2.30, this box is only the JobTracker
> > Box C: 192.168.2.31, this box is only a slave
> >
> > Obviously all three nodes can access the NFS share, and the path to the
> > share is /home/slitz/warehouse on all three.
> >
> > My hadoop-site.xml file was copied over all nodes and looks like this:
> >
> > <property>
> >   <name>fs.default.name</name>
> >   <value>local</value>
> >   <description>The name of the default file system. Either the literal string
> >   "local" or a host:port for NDFS.</description>
> > </property>
> >
> > <property>
> >   <name>mapred.job.tracker</name>
> >   <value>192.168.2.30:9001</value>
> >   <description>The host and port that the MapReduce job
> >   tracker runs at. If "local", then jobs are
> >   run in-process as a single map and reduce task.</description>
> > </property>
> >
> > <property>
> >   <name>mapred.system.dir</name>
> >   <value>/home/slitz/warehouse/hadoop_service/system</value>
> >   <description>omgrotfcopterlol.</description>
> > </property>
> >
> > As one can see, I'm not using HDFS at all.
> > (Because all the free space I have is located on only one node, so using
> > HDFS would be unnecessary overhead.)
> >
> > I've copied the input folder from hadoop to /home/slitz/warehouse/input.
> > When I try to run the example line
> >
> > bin/hadoop jar hadoop-*-examples.jar grep /home/slitz/warehouse/input/
> > /home/slitz/warehouse/output 'dfs[a-z.]+'
> >
> > the job starts and finishes okay, but at the end I get this error:
> >
> > org.apache.hadoop.mapred.InvalidInputException: Input path doesn't exist :
> > /home/slitz/hadoop-0.15.3/grep-temp-141595661
> > at org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:154)
> > at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:508)
> > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:753)
> > (...the error stack continues...)
> >
> > I don't know why the input path being looked up is the local path
> > /home/slitz/hadoop(...) instead of /home/slitz/warehouse/(...)
> >
> > Maybe something is missing in my hadoop-site.xml?
> >
> > slitz
Using NFS without HDFS
Hello,
I'm trying to assemble a simple setup of 3 nodes using NFS as a Distributed Filesystem.

Box A: 192.168.2.3, this box is both the NFS server and working as a slave node
Box B: 192.168.2.30, this box is only the JobTracker
Box C: 192.168.2.31, this box is only a slave

Obviously all three nodes can access the NFS share, and the path to the share is /home/slitz/warehouse on all three.

My hadoop-site.xml file was copied over all nodes and looks like this:

<property>
  <name>fs.default.name</name>
  <value>local</value>
  <description>The name of the default file system. Either the literal string
  "local" or a host:port for NDFS.</description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>192.168.2.30:9001</value>
  <description>The host and port that the MapReduce job
  tracker runs at. If "local", then jobs are
  run in-process as a single map and reduce task.</description>
</property>

<property>
  <name>mapred.system.dir</name>
  <value>/home/slitz/warehouse/hadoop_service/system</value>
  <description>omgrotfcopterlol.</description>
</property>

As one can see, I'm not using HDFS at all.
(Because all the free space I have is located on only one node, so using HDFS would be unnecessary overhead.)

I've copied the input folder from hadoop to /home/slitz/warehouse/input. When I try to run the example line

bin/hadoop jar hadoop-*-examples.jar grep /home/slitz/warehouse/input/ /home/slitz/warehouse/output 'dfs[a-z.]+'

the job starts and finishes okay, but at the end I get this error:

org.apache.hadoop.mapred.InvalidInputException: Input path doesn't exist :
/home/slitz/hadoop-0.15.3/grep-temp-141595661
at org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:154)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:508)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:753)
(...the error stack continues...)

I don't know why the input path being looked up is the local path /home/slitz/hadoop(...) instead of /home/slitz/warehouse/(...)

Maybe something is missing in my hadoop-site.xml?

slitz
Re: Different output classes from map and reducer
Hello, it worked like a charm! Thank you :)

slitz

On Thu, Feb 28, 2008 at 5:51 PM, Johannes Zillmann <[EMAIL PROTECTED]> wrote:
> Hi Slitz,
>
> try
> conf.setMapOutputValueClass(Text.class);
> conf.setMapOutputKeyClass(Text.class);
> conf.setOutputKeyClass(Text.class);
> conf.setOutputValueClass(IntWritable.class);
>
> Johannes
>
> slitz wrote:
> > Hello,
> > I'm experimenting with hadoop for a few days now, but I'm stuck trying to
> > output different classes from the map and reduce methods.
> >
> > I have something like:
> >
> > class test {
> >
> >   public static class Map extends MapReduceBase implements
> >       Mapper<LongWritable, Text, Text, Text> {
> >     public void map(LongWritable key, Text value,
> >         OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
> >
> >       (...)
> >
> >       output.collect(new Text(...), new Text(...));
> >     }
> >   }
> >
> >   public static class Reduce extends MapReduceBase implements
> >       Reducer<Text, Text, Text, IntWritable> {
> >     public void reduce(Text key, Iterator<Text> values,
> >         OutputCollector<Text, IntWritable> output, Reporter reporter) throws
> >         IOException {
> >
> >       (...)
> >
> >       output.collect(key, new IntWritable(...));
> >     }
> >   }
> > }
> >
> > The relevant part of my conf goes like:
> >
> > JobConf conf = new JobConf(test.class);
> > conf.setOutputKeyClass(Text.class);
> > conf.setOutputValueClass(Text.class);
> >
> > conf.setInputFormat(TextInputFormat.class);
> > conf.setOutputFormat(TextOutputFormat.class);
> >
> > I keep getting this error:
> >
> > 08/02/28 01:52:47 INFO mapred.JobClient: map 50% reduce 0%
> > 08/02/28 01:52:51 INFO mapred.JobClient: Task Id :
> > task_200802261545_0032_m_01_0, Status : FAILED
> > java.io.IOException: wrong value class: org.apache.hadoop.io.IntWritable is
> > not class org.apache.hadoop.io.Text
> > at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:938)
> > at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$1.collect(MapTask.java:414)
> > at org.myorg.WordCount$Reduce.reduce(WordCount.java:64)
> > at org.myorg.WordCount$Reduce.reduce(WordCount.java:49)
> > at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:439)
> > at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpillToDisk(MapTask.java:418)
> > at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:604)
> > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:193)
> > at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1804)
> >
> > If in the conf I switch this -> conf.setOutputValueClass(Text.class); (map's output value type)
> > for this -> conf.setOutputValueClass(IntWritable.class); (reduce's output value type)
> >
> > then I get this:
> >
> > 08/02/28 02:05:08 INFO mapred.JobClient: map 50% reduce 0%
> > 08/02/28 02:05:12 INFO mapred.JobClient: Task Id :
> > task_200802261545_0033_m_01_0, Status : FAILED
> > java.io.IOException: Type mismatch in value from map: expected
> > org.apache.hadoop.io.IntWritable, recieved org.apache.hadoop.io.Text
> > at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:336)
> > at org.myorg.WordCount$Map.map(WordCount.java:43)
> > at org.myorg.WordCount$Map.map(WordCount.java:16)
> > at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
> > at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1804)
> >
> > I'm just trying to modify the wordcount example slightly to fit my needs, but
> > I keep getting this kind of error.
> > Can somebody please point me in the right direction?
> >
> > Thank you
> >
> > slitz
>
> --
> ~~~
> 101tec GmbH
>
> Halle (Saale), Saxony-Anhalt, Germany
> http://www.101tec.com
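Johannes' fix, collected in one driver fragment: when the map output types (Text/Text) differ from the reduce output types (Text/IntWritable), both pairs must be declared explicitly, because setOutputKeyClass/setOutputValueClass alone apply to both stages. A sketch using the old JobConf API of this thread; the driver class name is illustrative.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;

public class TestDriver {
    public static void configureTypes(JobConf conf) {
        // Types of the intermediate (map) output...
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(Text.class);
        // ...and of the final (reduce) output. Without the two calls above,
        // the map output is expected to match the reduce output types,
        // producing the "wrong value class" / "Type mismatch" errors quoted.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
    }
}
```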
Different output classes from map and reducer
Hello,
I'm experimenting with hadoop for a few days now, but I'm stuck trying to output different classes from the map and reduce methods.

I have something like:

class test {

  public static class Map extends MapReduceBase implements
      Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value,
        OutputCollector<Text, Text> output, Reporter reporter) throws IOException {

      (...)

      output.collect(new Text(...), new Text(...));
    }
  }

  public static class Reduce extends MapReduceBase implements
      Reducer<Text, Text, Text, IntWritable> {
    public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {

      (...)

      output.collect(key, new IntWritable(...));
    }
  }
}

The relevant part of my conf goes like:

JobConf conf = new JobConf(test.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);

conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);

I keep getting this error:

08/02/28 01:52:47 INFO mapred.JobClient: map 50% reduce 0%
08/02/28 01:52:51 INFO mapred.JobClient: Task Id :
task_200802261545_0032_m_01_0, Status : FAILED
java.io.IOException: wrong value class: org.apache.hadoop.io.IntWritable is
not class org.apache.hadoop.io.Text
at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:938)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$1.collect(MapTask.java:414)
at org.myorg.WordCount$Reduce.reduce(WordCount.java:64)
at org.myorg.WordCount$Reduce.reduce(WordCount.java:49)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:439)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpillToDisk(MapTask.java:418)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:604)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:193)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1804)

If in the conf I switch this -> conf.setOutputValueClass(Text.class); (map's output value type)
for this -> conf.setOutputValueClass(IntWritable.class); (reduce's output value type)

then I get this:

08/02/28 02:05:08 INFO mapred.JobClient: map 50% reduce 0%
08/02/28 02:05:12 INFO mapred.JobClient: Task Id :
task_200802261545_0033_m_01_0, Status : FAILED
java.io.IOException: Type mismatch in value from map: expected
org.apache.hadoop.io.IntWritable, recieved org.apache.hadoop.io.Text
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:336)
at org.myorg.WordCount$Map.map(WordCount.java:43)
at org.myorg.WordCount$Map.map(WordCount.java:16)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1804)

I'm just trying to modify the wordcount example slightly to fit my needs, but I keep getting this kind of error.
Can somebody please point me in the right direction?

Thank you

slitz