Particularly if you know which types to expect in your structured
data, rolling your own Writable is strongly preferred to
TupleWritable. The latter serializes to a comically verbose format and
should only be used when the types and nesting depth are unknown. -C
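P.S. For the simple fixed-type case, a hand-rolled Writable can be as
small as the sketch below (untested; the class and field names are
illustrative, not from any Hadoop release):

    import java.io.*;
    import org.apache.hadoop.io.Writable;

    // Fixed, known fields serialize directly; no per-record type
    // information or "written" bookkeeping as in TupleWritable.
    public class CountAndLength implements Writable {
        private int count;
        private long totalLength;

        public void write(DataOutput out) throws IOException {
            out.writeInt(count);
            out.writeLong(totalLength);
        }

        public void readFields(DataInput in) throws IOException {
            count = in.readInt();
            totalLength = in.readLong();
        }

        public void set(int count, long totalLength) {
            this.count = count;
            this.totalLength = totalLength;
        }

        public String toString() {
            return count + "\t" + totalLength;
        }
    }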
On Aug 7, 2008, at 5:45 PM, Michael Andrews wrote:
OK, thanks for the information. I guess it seems strange to want to
use TupleWritable in this way, but it just seemed like the right thing
to do based on the API docs. Is it more idiomatic to implement
Writable when processing structured data? Again, I am really new to
the Hadoop community, but I will try to file something in JIRA on
this. I'm not really sure how to proceed with a patch; maybe I could
just try to clarify the docs?
On 8/7/08 4:38 PM, "Chris Douglas" <[EMAIL PROTECTED]> wrote:
You need access to TupleWritable::setWritten(int). If you want to use
TupleWritable outside the join package, then you need to make this
(and probably related methods, like clearWritten(int)) public and
recompile.
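With those public, marking the slot before collecting would look
something like this in the map method below (a sketch, against a
locally rebuilt Hadoop):

    TupleWritable tuple = new TupleWritable(new Writable[] { one });
    tuple.setWritten(0);          // mark position 0 as holding a value
    output.collect(word, tuple);  // write() now serializes position 0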
Please file a JIRA if you think it should be more general. -C
On Aug 7, 2008, at 4:18 PM, Michael Andrews wrote:
Hi,
I am a new Hadoop developer and am struggling to understand why I
cannot pass a TupleWritable between the map and reduce functions. I
have modified the wordcount example to demonstrate the issue. Also, I
am using Hadoop 0.17.1.
package wordcount;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.join.*;

public class WordCount {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, TupleWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                OutputCollector<Text, TupleWritable> output,
                Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            TupleWritable tuple = new TupleWritable(new Writable[] { one });
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, tuple);
            }
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, TupleWritable, Text, TupleWritable> {
        public void reduce(Text key, Iterator<TupleWritable> values,
                OutputCollector<Text, TupleWritable> output,
                Reporter reporter) throws IOException {
            IntWritable i = new IntWritable();
            int sum = 0;
            while (values.hasNext()) {
                i = (IntWritable) values.next().get(0);
                sum += i.get();
            }
            TupleWritable tuple =
                new TupleWritable(new Writable[] { new IntWritable(sum) });
            output.collect(key, tuple);
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(TupleWritable.class);

        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
The output is always empty tuples ('[]'). Using the debugger, I
have determined that the line:
TupleWritable tuple = new TupleWritable(new Writable[] { one } );
is properly constructing the desired tuple. I am not sure whether it
is being written out correctly by output.collect, as I cannot find the
field in the OutputCollector data structure. When I check in the
reduce method, the values are always empty tuples. I have a feeling it
has something to do with this line in the JavaDoc:
TupleWritable(Writable[] vals)
Initialize tuple with storage; unknown whether any of them
contain "written" values.
Thanks in advance for any and all help,
Michael