Re: Namenode Exceptions with S3

2008-07-16 Thread slitz
Sorry for the time taken to respond; I've been doing some tests on this.
Your workaround worked like a charm, thank you :) Now I'm able to fetch the
data from S3, process it using HDFS, and put the results back in S3.
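
A minimal, untested sketch of that flow, for reference: the first job reads its
input straight from S3, the intermediate data stays on HDFS (the default
filesystem), and the last job writes its output back to S3. Bucket and path
names below are placeholders, the S3 credentials are assumed to already be set
in hadoop-site.xml (fs.s3.awsAccessKeyId / fs.s3.awsSecretAccessKey), and the
HADOOP-3733 workaround Tom mentions is not shown here.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class S3InHdfsMiddleS3Out {
  public static void main(String[] args) throws Exception {
    // First job: input comes straight from S3, output goes to HDFS
    // (paths without a scheme resolve against fs.default.name, i.e. HDFS).
    JobConf first = new JobConf(S3InHdfsMiddleS3Out.class);
    first.addInputPath(new Path("s3://my-bucket/input"));      // placeholder bucket
    first.setOutputPath(new Path("/tmp/job1-intermediate"));   // lands on HDFS
    // ... mapper/reducer/format settings for the first step ...
    JobClient.runJob(first);

    // Last job: input is the intermediate HDFS data, output goes back to S3.
    JobConf last = new JobConf(S3InHdfsMiddleS3Out.class);
    last.addInputPath(new Path("/tmp/job1-intermediate"));
    last.setOutputPath(new Path("s3://my-bucket/output"));     // placeholder bucket
    // ... mapper/reducer/format settings for the last step ...
    JobClient.runJob(last);
  }
}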

About the a) problem that I mentioned in my previous email: now I understand
the error. I was starting the namenode and datanodes and changing
fs.default.name to s3://bucket/ after that, so now I see why it doesn't
work.

Thank you *very* much for your help; now I can use EC2 and S3 :)

slitz

On Fri, Jul 11, 2008 at 10:46 PM, Tom White <[EMAIL PROTECTED]> wrote:

> On Fri, Jul 11, 2008 at 9:09 PM, slitz <[EMAIL PROTECTED]> wrote:
> > a) Use S3 only, without HDFS and configuring fs.default.name as
> s3://bucket
> >  -> PROBLEM: we are getting ERROR org.apache.hadoop.dfs.NameNode:
> > java.lang.RuntimeException: Not a host:port pair: X
>
> What command are you using to start Hadoop?
>
> > b) Use HDFS as the default FS, specifying S3 only as input for the first
> Job
> > and output for the last(assuming one has multiple jobs on same data)
> >  -> PROBLEM: https://issues.apache.org/jira/browse/HADOOP-3733
>
> Yes, this is a problem. I've added a comment to the Jira description
> describing a workaround.
>
> Tom
>


Re: Namenode Exceptions with S3

2008-07-11 Thread slitz
I've been learning a lot from this thread, and Tom just helped me
understand some things about S3 and HDFS, thank you.
To wrap everything up, if we want to use S3 with EC2 we can:

a) Use S3 only, without HDFS, configuring fs.default.name as s3://bucket
  -> PROBLEM: we are getting ERROR org.apache.hadoop.dfs.NameNode:
java.lang.RuntimeException: Not a host:port pair: X
  (see the sketch just after this list)
b) Use HDFS as the default FS, specifying S3 only as input for the first job
and output for the last (assuming one has multiple jobs on the same data)
  -> PROBLEM: https://issues.apache.org/jira/browse/HADOOP-3733
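
For reference, a minimal, untested sketch of what option a) is aiming at: a job
that talks only to S3, with no HDFS involved at all, so only the MapReduce
daemons would be running. The bucket name and keys are placeholders, and these
properties would normally live in hadoop-site.xml rather than be set inline.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class S3OnlyJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(S3OnlyJob.class);

    // S3 is the default filesystem; no namenode/datanodes are involved,
    // so only the JobTracker and TaskTrackers need to be started.
    conf.set("fs.default.name", "s3://my-bucket");           // placeholder bucket
    conf.set("fs.s3.awsAccessKeyId", "MY_ACCESS_KEY_ID");    // placeholder
    conf.set("fs.s3.awsSecretAccessKey", "MY_SECRET_KEY");   // placeholder

    conf.addInputPath(new Path("s3://my-bucket/input"));
    conf.setOutputPath(new Path("s3://my-bucket/output"));

    // ... mapper/reducer/format settings as usual ...
    JobClient.runJob(conf);
  }
}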


So, in my case I cannot use S3 at all for now because of these two problems.
Any advice?

slitz

On Fri, Jul 11, 2008 at 4:31 PM, Lincoln Ritter <[EMAIL PROTECTED]>
wrote:

> Thanks Tom!
>
> Your explanation makes things a lot clearer.  I think that changing
> the 'fs.default.name' to something like 'dfs.namenode.address' would
> certainly be less confusing since it would clarify the purpose of
> these values.
>
> -lincoln
>
> --
> lincolnritter.com
>
>
>
> On Fri, Jul 11, 2008 at 4:21 AM, Tom White <[EMAIL PROTECTED]> wrote:
> > On Thu, Jul 10, 2008 at 10:06 PM, Lincoln Ritter
> > <[EMAIL PROTECTED]> wrote:
> >> Thank you, Tom.
> >>
> >> Forgive me for being dense, but I don't understand your reply:
> >>
> >
> > Sorry! I'll try to explain it better (see below).
> >
> >>
> >> Do you mean that it is possible to use the Hadoop daemons with S3 but
> >> the default filesystem must be HDFS?
> >
> > The HDFS daemons use the value of "fs.default.name" to set the
> > namenode host and port, so if you set it to an s3 URI, you can't run
> > the HDFS daemons. So in this case you would use the start-mapred.sh
> > script instead of start-all.sh.
> >
> >> If that is the case, can I
> >> specify the output filesystem on a per-job basis and can that be an S3
> >> FS?
> >
> > Yes, that's exactly how you do it.
> >
> >>
> >> Also, is there a particular reason to not allow S3 as the default FS?
> >
> > You can allow S3 as the default FS, it's just that then you can't run
> > HDFS at all in this case. You would only do this if you don't want to
> > use HDFS at all, for example, if you were running a MapReduce job
> > which read from S3 and wrote to S3.
> >
> > It might be less confusing if the HDFS daemons didn't use
> > fs.default.name to define the namenode host and port. Just like
> > mapred.job.tracker defines the host and port for the jobtracker,
> > dfs.namenode.address (or similar) could define the namenode. Would
> > this be a good change to make?
> >
> > Tom
> >
>


Re: Namenode Exceptions with S3

2008-07-09 Thread slitz
I'm having the exact same problem, any tip?

slitz

On Wed, Jul 2, 2008 at 12:34 AM, Lincoln Ritter <[EMAIL PROTECTED]>
wrote:

> Hello,
>
> I am trying to use S3 with Hadoop 0.17.0 on EC2.  Using this style of
> configuration:
>
> <property>
>  <name>fs.default.name</name>
>  <value>s3://$HDFS_BUCKET</value>
> </property>
>
> <property>
>  <name>fs.s3.awsAccessKeyId</name>
>  <value>$AWS_ACCESS_KEY_ID</value>
> </property>
>
> <property>
>  <name>fs.s3.awsSecretAccessKey</name>
>  <value>$AWS_SECRET_ACCESS_KEY</value>
> </property>
>
> on startup of the cluster with the bucket having no non-alphabetic
> characters, I get:
>
> 2008-07-01 16:10:49,171 ERROR org.apache.hadoop.dfs.NameNode:
> java.lang.RuntimeException: Not a host:port pair: X
>     at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:121)
>     at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:121)
>     at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:178)
>     at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:164)
>     at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:848)
>     at org.apache.hadoop.dfs.NameNode.main(NameNode.java:857)
>
> If I use this style of configuration:
>
> <property>
>  <name>fs.default.name</name>
>  <value>s3://$AWS_ACCESS_KEY:[EMAIL PROTECTED]</value>
> </property>
>
> I get (where the all-caps portions are the actual values...):
>
> 2008-07-01 19:05:17,540 ERROR org.apache.hadoop.dfs.NameNode:
> java.lang.NumberFormatException: For input string:
> "[EMAIL PROTECTED]"
>     at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
>     at java.lang.Integer.parseInt(Integer.java:447)
>     at java.lang.Integer.parseInt(Integer.java:497)
>     at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:128)
>     at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:121)
>     at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:178)
>     at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:164)
>     at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:848)
>     at org.apache.hadoop.dfs.NameNode.main(NameNode.java:857)
>
> These exceptions are taken from the namenode log.  The datanode logs
> show the same exceptions.
>
> Other than the above configuration changes, the configuration is
> identical to that generated by the hadoop image creation script found
> in the 0.17.0 distribution.
>
> Can anybody point me in the right direction here?
>
> -lincoln
>
> --
> lincolnritter.com
>


Re: Using S3 Block FileSystem as HDFS replacement

2008-07-01 Thread slitz
That's a good point; in fact it didn't occur to me that I could access it
like that.
But some questions came to my mind:

How do I put something into the fs?
Something like "bin/hadoop fs -put input input" will not work well since S3
is not the default fs, so I tried to do bin/hadoop fs -put input
s3://ID:[EMAIL PROTECTED]/input (and some variations of it), but it didn't work; I
always got an error complaining about not having provided the ID/secret for
S3.
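
A minimal, untested sketch of the programmatic equivalent, with the credentials
supplied through the configuration properties discussed in this thread rather
than through the URL (the bucket name and keys are placeholders):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutToS3 {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // These could equally live in hadoop-site.xml; values are placeholders.
    conf.set("fs.s3.awsAccessKeyId", "MY_ACCESS_KEY_ID");
    conf.set("fs.s3.awsSecretAccessKey", "MY_SECRET_KEY");

    // Ask explicitly for the S3 block filesystem even though it is not the
    // default fs, then copy a local directory into it.
    FileSystem s3 = FileSystem.get(URI.create("s3://my-bucket/"), conf);
    s3.copyFromLocalFile(new Path("input"), new Path("/input"));
  }
}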

To experiment a little I tried to edit conf/hadoop-site.xml (something
that's only practical when experimenting, because of the lack of persistence
of these changes unless a new AMI is created), added
the fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey properties, and changed
fs.default.name to an s3:// one.
This worked for things like:

mkdir input
cp conf/*.xml input
bin/hadoop fs -put input input
bin/hadoop fs -ls input

But then I faced another problem: when I tried to run
bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
(which should now be able to run using S3 as the FileSystem), I got this error:

08/07/01 22:12:55 INFO mapred.FileInputFormat: Total input paths to process
: 2
08/07/01 22:12:57 INFO mapred.JobClient: Running job: job_200807012133_0010
08/07/01 22:12:58 INFO mapred.JobClient:  map 100% reduce 100%
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
(...)

I tried several times, and also with the wordcount example, but the error was
always the same.

What could be the problem here? And how can I access the FileSystem with
"bin/hadoop fs ..." if the default filesystem isn't S3?

thank you very much :)

slitz

On Tue, Jul 1, 2008 at 4:43 PM, Chris K Wensel <[EMAIL PROTECTED]> wrote:

> by editing the hadoop-site.xml, you set the default. but I don't recommend
> changing the default on EC2.
>
> but you can specify the filesystem to use through the URL that references
> your data (jobConf.addInputPath etc) for a particular job. in the case of
> the S3 block filesystem, just use a s3:// url.
>
> ckw
>
>
> On Jun 30, 2008, at 8:04 PM, slitz wrote:
>
>  Hello,
>> I've been trying to setup hadoop to use s3 as filesystem, i read in the
>> wiki
>> that it's possible to choose either S3 native FileSystem or S3 Block
>> Filesystem. I would like to use S3 Block FileSystem to avoid the task of
>> "manually" transferring data from S3 to HDFS every time i want to run a
>> job.
>>
>> I'm still experimenting with EC2 contrib scripts and those seem to be
>> excellent.
>> What i can't understand is how may be possible to use S3 using a public
>> hadoop AMI since from my understanding hadoop-site.xml gets written on
>> each
>> instance startup with the options on hadoop-init, and it seems that the
>> public AMI (at least the 0.17.0 one) is not configured to use S3 at
>> all(which makes sense because the bucket would need individual
>> configuration
>> anyway).
>>
>> So... to use S3 block FileSystem with EC2 i need to create a custom AMI
>> with
>> a modified hadoop-init script right? or am I completely confused?
>>
>>
>> slitz
>>
>
> --
> Chris K Wensel
> [EMAIL PROTECTED]
> http://chris.wensel.net/
> http://www.cascading.org/
>
>
>
>
>
>
>


Using S3 Block FileSystem as HDFS replacement

2008-06-30 Thread slitz
Hello,
I've been trying to set up hadoop to use S3 as its filesystem; I read in the wiki
that it's possible to choose either the S3 native FileSystem or the S3 Block
FileSystem. I would like to use the S3 Block FileSystem to avoid the task of
"manually" transferring data from S3 to HDFS every time I want to run a job.

I'm still experimenting with the EC2 contrib scripts and those seem to be
excellent.
What I can't understand is how it may be possible to use S3 with a public
hadoop AMI, since from my understanding hadoop-site.xml gets written on each
instance startup with the options in hadoop-init, and it seems that the
public AMI (at least the 0.17.0 one) is not configured to use S3 at
all (which makes sense, because the bucket would need individual configuration
anyway).

So... to use the S3 Block FileSystem with EC2 I need to create a custom AMI with
a modified hadoop-init script, right? Or am I completely confused?


slitz


Re: MultipleOutputFormat example

2008-06-25 Thread slitz
Hello,
I just did! Thank you! And indeed it is A LOT easier, or maybe it's just the
included snippets that help a lot, or maybe both things help :)

Although I would still like to learn how to use
MultipleOutputFormat/MultipleTextOutputFormat, since it should be more
flexible, and I would like to know how to use this kind of thing in hadoop
as it could help me understand other classes and patterns.

So it would be great if someone could give me an example of how to use it.

slitz

On Wed, Jun 25, 2008 at 7:53 PM, montag <[EMAIL PROTECTED]> wrote:

>
> Hi,
>
>  You should check out the MultipleOutputs thread and patch of
> HADOOP-3149 (https://issues.apache.org/jira/browse/HADOOP-3149).  There are
> some relevant and useful code snippets that address the issue of splitting
> output to multiple files within the discussion as well as in the patch
> documentation.  I found implementing this patch easier than dealing with
> MultipleTextOutputFormat.
>
> Cheers,
> Mike
>
>
>
> slitz wrote:
> >
> > Hello,
> > I need the reduce to output to different files depending on the key,
> after
> > reading some jira entries and some previous threads of the mailing list i
> > think that the MultipleTextOutputFormat class would fit my needs, the
> > problem is that i can't find any example of how to use it.
> >
> > Could someone please show me a quick example of how to use this class or
> > MultipleOutputFormat subclasses in general? i'm somewhat lost...
> >
> > slitz
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/MultipleOutputFormat-example-tp18118780p18119478.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>
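
A hedged sketch of the MultipleOutputs approach Mike points to above, written
against the org.apache.hadoop.mapred.lib.MultipleOutputs API as it eventually
shipped; at the time of this thread it was still a patch, so details may
differ. The "counts" output name and the key/value types are made up for the
example.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

public class CountsReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  private MultipleOutputs mos;

  public void configure(JobConf conf) {
    mos = new MultipleOutputs(conf);
  }

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    // Send the record to the named output "counts" instead of (or in
    // addition to) the job's regular output.
    mos.getCollector("counts", reporter).collect(key, new IntWritable(sum));
  }

  public void close() throws IOException {
    mos.close();
  }
}

// In the driver, the named output has to be declared up front:
//   MultipleOutputs.addNamedOutput(conf, "counts",
//       TextOutputFormat.class, Text.class, IntWritable.class);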


MultipleOutputFormat example

2008-06-25 Thread slitz
Hello,
I need the reduce to output to different files depending on the key. After
reading some jira entries and some previous threads on the mailing list, I
think the MultipleTextOutputFormat class would fit my needs; the problem is
that I can't find any example of how to use it.

Could someone please show me a quick example of how to use this class, or
MultipleOutputFormat subclasses in general? I'm somewhat lost...

slitz
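
A hedged sketch of the kind of example being asked for here, assuming the
org.apache.hadoop.mapred.lib.MultipleTextOutputFormat class and its
generateFileNameForKeyValue hook, and a reduce that emits (Text, IntWritable);
this is untested:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Route each record to a file derived from its key, so records with
// different keys end up in different output files.
public class KeyBasedTextOutputFormat
    extends MultipleTextOutputFormat<Text, IntWritable> {

  protected String generateFileNameForKeyValue(Text key, IntWritable value,
                                               String name) {
    // "name" is the default leaf name (e.g. part-00000); prefix it with the
    // key so each key gets its own file under the output directory.
    return key.toString() + "/" + name;
  }
}

// In the driver:
//   conf.setOutputFormat(KeyBasedTextOutputFormat.class);
//   conf.setOutputKeyClass(Text.class);
//   conf.setOutputValueClass(IntWritable.class);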


Re: Using NFS without HDFS

2008-04-11 Thread slitz
Thank you for the file:/// tip, I was not including it in the paths.
I'm running the example with this line -> bin/hadoop jar
hadoop-*-examples.jar grep file:///home/slitz/warehouse/input
file:///home/slitz/warehouse/output 'dfs[a-z.]+'

But I'm getting the same error as before:

org.apache.hadoop.mapred.InvalidInputException: Input path doesn't exist :
/home/slitz/hadoop-0.15.3/grep-temp-1030179831
at org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:154)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:508)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:753)
(...stack continues...)

I think the problem may be the input path; it should be pointing to some
path in the NFS share, right?

The grep-temp-* dir is being created in the HADOOP_HOME of Box A
(192.168.2.3).

slitz

On Fri, Apr 11, 2008 at 4:06 PM, Luca <[EMAIL PROTECTED]> wrote:

> slitz wrote:
>
> > I've read in the archive that it should be possible to use any
> > distributed
> > filesystem since the data is available to all nodes, so it should be
> > possible to use NFS, right?
> > I've also read somewhere in the archive that this should be possible...
> >
> >
> As far as I know, you can refer to any file on a mounted file system
> (visible from all compute nodes) using the prefix file:// before the full
> path, unless another prefix has been specified.
>
> Cheers,
> Luca
>
>
>
> > slitz
> >
> >
> > On Fri, Apr 11, 2008 at 1:43 PM, Peeyush Bishnoi <[EMAIL PROTECTED]>
> > wrote:
> >
> > > Hello ,
> > >
> > > To execute Hadoop Map-Reduce job input data should be on HDFS not on
> > > NFS.
> > >
> > > Thanks
> > >
> > > ---
> > > Peeyush
> > >
> > > On Fri, 2008-04-11 at 12:40 +0100, slitz wrote:
> > >
> > > > Hello,
> > > > I'm trying to assemble a simple setup of 3 nodes using NFS as the
> > > > distributed filesystem.
> > > >
> > > > Box A: 192.168.2.3, this box is both the NFS server and working as a
> > > > slave node
> > > > Box B: 192.168.2.30, this box is only JobTracker
> > > > Box C: 192.168.2.31, this box is only slave
> > > >
> > > > Obviously all three nodes can access the NFS share, and the path to the
> > > > share is /home/slitz/warehouse in all three.
> > > >
> > > > My hadoop-site.xml file was copied to all nodes and looks like this:
> > > >
> > > > <configuration>
> > > >
> > > > <property>
> > > >   <name>fs.default.name</name>
> > > >   <value>local</value>
> > > >   <description>The name of the default file system. Either the literal
> > > >   string "local" or a host:port for NDFS.</description>
> > > > </property>
> > > >
> > > > <property>
> > > >   <name>mapred.job.tracker</name>
> > > >   <value>192.168.2.30:9001</value>
> > > >   <description>The host and port that the MapReduce job tracker runs
> > > >   at. If "local", then jobs are run in-process as a single map and
> > > >   reduce task.</description>
> > > > </property>
> > > >
> > > > <property>
> > > >   <name>mapred.system.dir</name>
> > > >   <value>/home/slitz/warehouse/hadoop_service/system</value>
> > > >   <description>omgrotfcopterlol.</description>
> > > > </property>
> > > >
> > > > </configuration>
> > > >
> > > > As one can see, I'm not using HDFS at all.
> > > > (Because all the free space I have is located in only one node, using
> > > > HDFS would be unnecessary overhead.)
> > > >
> > > > I've copied the input folder from hadoop to /home/slitz/warehouse/input.
> > > > When I try to run the example line
> > > >
> > > > bin/hadoop jar hadoop-*-examples.jar grep /home/slitz/warehouse/input/
> > > > /home/slitz/warehouse/output 'dfs[a-z.]+'
> > > >
> > > > the job starts and finishes okay, but at the end I get this error:
> > > >
> > > > org.apache.hadoop.mapred.InvalidInputException: Input path doesn't exist :
> > > > /home/slitz/hadoop-0.15.3/grep-temp-141595661
> > > > at org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:154)
> > > > at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:508)
> > > > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:753)
> > > > (...the error stack continues...)
> > > >
> > > > I don't know why the input path being looked up is in the local path
> > > > /home/slitz/hadoop(...) instead of /home/slitz/warehouse/(...)
> > > >
> > > > Maybe something is missing in my hadoop-site.xml?
> > > >
> > > > slitz
> > > >
> > >
> >
>
>


Re: Using NFS without HDFS

2008-04-11 Thread slitz
I've read in the archive that it should be possible to use any distributed
filesystem since the data is available to all nodes, so it should be
possible to use NFS, right?
I've also read somewhere in the archive that this should be possible...


slitz


On Fri, Apr 11, 2008 at 1:43 PM, Peeyush Bishnoi <[EMAIL PROTECTED]>
wrote:

> Hello ,
>
> To execute Hadoop Map-Reduce job input data should be on HDFS not on
> NFS.
>
> Thanks
>
> ---
> Peeyush
>
>
>
> On Fri, 2008-04-11 at 12:40 +0100, slitz wrote:
>
> > Hello,
> > I'm trying to assemble a simple setup of 3 nodes using NFS as the
> > distributed filesystem.
> >
> > Box A: 192.168.2.3, this box is both the NFS server and working as a
> > slave node
> > Box B: 192.168.2.30, this box is only JobTracker
> > Box C: 192.168.2.31, this box is only slave
> >
> > Obviously all three nodes can access the NFS share, and the path to the
> > share is /home/slitz/warehouse in all three.
> >
> > My hadoop-site.xml file was copied to all nodes and looks like this:
> >
> > <configuration>
> >
> > <property>
> >   <name>fs.default.name</name>
> >   <value>local</value>
> >   <description>The name of the default file system. Either the literal
> >   string "local" or a host:port for NDFS.</description>
> > </property>
> >
> > <property>
> >   <name>mapred.job.tracker</name>
> >   <value>192.168.2.30:9001</value>
> >   <description>The host and port that the MapReduce job tracker runs
> >   at. If "local", then jobs are run in-process as a single map and
> >   reduce task.</description>
> > </property>
> >
> > <property>
> >   <name>mapred.system.dir</name>
> >   <value>/home/slitz/warehouse/hadoop_service/system</value>
> >   <description>omgrotfcopterlol.</description>
> > </property>
> >
> > </configuration>
> >
> >
> > As one can see, i'm not using HDFS at all.
> > (Because all the free space i have is located in only one node, so using
> > HDFS would be unnecessary overhead)
> >
> > I've copied the input folder from hadoop to /home/slitz/warehouse/input.
> > When i try to run the example line
> >
> > bin/hadoop jar hadoop-*-examples.jar grep /home/slitz/warehouse/input/
> > /home/slitz/warehouse/output 'dfs[a-z.]+'
> >
> > the job starts and finishes okay, but at the end I get this error:
> >
> > org.apache.hadoop.mapred.InvalidInputException: Input path doesn't exist :
> > /home/slitz/hadoop-0.15.3/grep-temp-141595661
> > at org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:154)
> > at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:508)
> > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:753)
> > (...the error stack continues...)
> >
> > I don't know why the input path being looked up is in the local path
> > /home/slitz/hadoop(...) instead of /home/slitz/warehouse/(...)
> >
> > Maybe something is missing in my hadoop-site.xml?
> >
> >
> >
> > slitz
>


Using NFS without HDFS

2008-04-11 Thread slitz
Hello,
I'm trying to assemble a simple setup of 3 nodes using NFS as the distributed
filesystem.

Box A: 192.168.2.3, this box is both the NFS server and working as a slave
node
Box B: 192.168.2.30, this box is only JobTracker
Box C: 192.168.2.31, this box is only slave

Obviously all three nodes can access the NFS share, and the path to the
share is /home/slitz/warehouse in all three.

My hadoop-site.xml file was copied to all nodes and looks like this:

<configuration>

<property>
  <name>fs.default.name</name>
  <value>local</value>
  <description>The name of the default file system. Either the literal
  string "local" or a host:port for NDFS.</description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>192.168.2.30:9001</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map and
  reduce task.</description>
</property>

<property>
  <name>mapred.system.dir</name>
  <value>/home/slitz/warehouse/hadoop_service/system</value>
  <description>omgrotfcopterlol.</description>
</property>

</configuration>

As one can see, I'm not using HDFS at all.
(Because all the free space I have is located in only one node, using
HDFS would be unnecessary overhead.)

I've copied the input folder from hadoop to /home/slitz/warehouse/input.
When I try to run the example line

bin/hadoop jar hadoop-*-examples.jar grep /home/slitz/warehouse/input/
/home/slitz/warehouse/output 'dfs[a-z.]+'

the job starts and finishes okay, but at the end I get this error:

org.apache.hadoop.mapred.InvalidInputException: Input path doesn't exist :
/home/slitz/hadoop-0.15.3/grep-temp-141595661
at org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:154)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:508)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:753)
(...the error stack continues...)

I don't know why the input path being looked up is in the local path
/home/slitz/hadoop(...) instead of /home/slitz/warehouse/(...)

Maybe something is missing in my hadoop-site.xml?



slitz


Re: Different output classes from map and reducer

2008-03-05 Thread slitz
Hello,
it worked like a charm! thank you :)


slitz

On Thu, Feb 28, 2008 at 5:51 PM, Johannes Zillmann <[EMAIL PROTECTED]> wrote:

> Hi Slitz,
>
> try
> conf.setMapOutputValueClass(Text.class);
> conf.setMapOutputKeyClass(Text.class);
> conf.setOutputKeyClass(Text.class);
> conf.setOutputValueClass(IntWritable.class);
>
> Johannes
>
> slitz wrote:
> > Hello,
> > I'm experimenting with hadoop a few days now, but i'm stuck trying to
> output
> > different classes from map and reduce methods.
> >
> > I have something like:
> >
> > class test {
> >
> > public static class Map extends MapReduceBase implements
> > Mapper<LongWritable, Text, Text, Text> {
> >    public void map(LongWritable key, Text value,
> > OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
> >
> >    (...)
> >
> >    output.collect(new Text(...), new Text(...));
> >    }
> > }
> >
> > public static class Reduce extends MapReduceBase implements
> > Reducer<Text, Text, Text, IntWritable> {
> > public void reduce(Text key, Iterator<Text> values,
> > OutputCollector<Text, IntWritable> output, Reporter reporter) throws
> > IOException {
> >
> >
> >  (...)
> >
> >   output.collect(key, new IntWritable(...));
> >
> >  }
> >
> > }
> >
> > }
> >
> >
> > the relevant part of my conf goes like:
> >
> > JobConf conf = new JobConf(test.class);
> > conf.setOutputKeyClass(Text.class);
> > conf.setOutputValueClass(Text.class);
> >
> > conf.setInputFormat(TextInputFormat.class);
> > conf.setOutputFormat(TextOutputFormat.class);
> >
> > i keep getting this error:
> >
> > 08/02/28 01:52:47 INFO mapred.JobClient:  map 50% reduce 0%
> > 08/02/28 01:52:51 INFO mapred.JobClient: Task Id :
> > task_200802261545_0032_m_01_0, Status : FAILED
> > java.io.IOException: wrong value class: org.apache.hadoop.io.IntWritable is
> > not class org.apache.hadoop.io.Text
> > at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java
> :938)
> > at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$1.collect(
> MapTask.java
> > :414)
> > at org.myorg.WordCount$Reduce.reduce(WordCount.java:64)
> > at org.myorg.WordCount$Reduce.reduce(WordCount.java:49)
> > at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(
> > MapTask.java:439)
> > at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpillToDisk(
> > MapTask.java:418)
> > at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java
> :604)
> > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:193)
> > at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java
> :1804)
> >
> >
> > if in the conf i switch this -> conf.setOutputValueClass(Text.class);
> (map's
> > output value  type
> > for this -> conf.setOutputValueClass(IntWritable.class); (reduce's
> output
> > value type)
> >
> > then i get this:
> >
> > 08/02/28 02:05:08 INFO mapred.JobClient:  map 50% reduce 0%
> > 08/02/28 02:05:12 INFO mapred.JobClient: Task Id :
> > task_200802261545_0033_m_01_0, Status : FAILED
> > java.io.IOException: Type mismatch in value from map: expected
> > org.apache.hadoop.io.IntWritable, recieved org.apache.hadoop.io.Text
> > at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java
> > :336)
> > at org.myorg.WordCount$Map.map(WordCount.java:43)
> > at org.myorg.WordCount$Map.map(WordCount.java:16)
> > at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
> > at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java
> :1804)
> >
> >
> > i'm just trying to modify slightly the wordcount example to fit my needs
> but
> > i keep getting this kind of errors.
> > Can somebody please point me the right direction?
> >
> >
> > Thank you
> >
> > slitz
> >
> >
>
>
> --
> ~~~
> 101tec GmbH
>
> Halle (Saale), Saxony-Anhalt, Germany
> http://www.101tec.com
>
>


Different output classes from map and reducer

2008-02-27 Thread slitz
Hello,
I've been experimenting with hadoop for a few days now, but I'm stuck trying to
output different classes from the map and reduce methods.

I have something like:

class test {

public static class Map extends MapReduceBase implements
Mapper<LongWritable, Text, Text, Text> {
   public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {

   (...)

   output.collect(new Text(...), new Text(...));
   }
}

public static class Reduce extends MapReduceBase implements Reducer<Text, Text, Text, IntWritable> {
public void reduce(Text key, Iterator<Text> values,
OutputCollector<Text, IntWritable> output, Reporter reporter) throws
IOException {


 (...)

  output.collect(key, new IntWritable(...));

 }

}

}


the relevant part of my conf goes like:

JobConf conf = new JobConf(test.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);

conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);

I keep getting this error:

08/02/28 01:52:47 INFO mapred.JobClient:  map 50% reduce 0%
08/02/28 01:52:51 INFO mapred.JobClient: Task Id :
task_200802261545_0032_m_01_0, Status : FAILED
java.io.IOException: wrong value class: org.apache.hadoop.io.IntWritable is
not class org.apache.hadoop.io.Text
at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:938)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$1.collect(MapTask.java
:414)
at org.myorg.WordCount$Reduce.reduce(WordCount.java:64)
at org.myorg.WordCount$Reduce.reduce(WordCount.java:49)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(
MapTask.java:439)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpillToDisk(
MapTask.java:418)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:604)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:193)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1804)


If in the conf I switch this -> conf.setOutputValueClass(Text.class); (the map's
output value type)
for this -> conf.setOutputValueClass(IntWritable.class); (the reduce's output
value type)

then I get this:

08/02/28 02:05:08 INFO mapred.JobClient:  map 50% reduce 0%
08/02/28 02:05:12 INFO mapred.JobClient: Task Id :
task_200802261545_0033_m_01_0, Status : FAILED
java.io.IOException: Type mismatch in value from map: expected
org.apache.hadoop.io.IntWritable, recieved org.apache.hadoop.io.Text
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java
:336)
at org.myorg.WordCount$Map.map(WordCount.java:43)
at org.myorg.WordCount$Map.map(WordCount.java:16)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1804)


I'm just trying to slightly modify the wordcount example to fit my needs, but
I keep getting this kind of error.
Can somebody please point me in the right direction?


Thank you

slitz
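
The fix Johannes gives further up boils down to declaring the map output
classes separately from the job (reduce) output classes. A minimal driver
sketch, assuming the (Text, Text) map output and (Text, IntWritable) reduce
output from this thread, and the test.Map / test.Reduce classes defined above:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class TestDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(test.class);

    // The map emits (Text, Text), so declare that explicitly...
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(Text.class);

    // ...while the reduce (and therefore the job) emits (Text, IntWritable).
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(test.Map.class);
    conf.setReducerClass(test.Reduce.class);
    // No combiner is set here; a combiner would have to emit the same
    // key/value types as the map output.

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    // Input/output paths etc. as in the original driver.
    JobClient.runJob(conf);
  }
}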