Hi.
I'm quite new to Hadoop programming, so to get a good start I began by
writing my own program that summarizes a column in a large tab-separated
file (~100 000 000 lines). My first naive implementation was quite simple: a
small rework of the WordCount example that ships with Hadoop. That program
calculated the correct answer, but it performed quite badly, since every
line in the file triggers a call to map(). To solve this, I wrote my own
RecordReader, one that returns a List<Text> instead of just a Text (a
simplified sketch of that setup follows the error below). It type-checks in
Eclipse and all seems fine until I actually run the program. When I do, I
get the following error:

java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
java.util.List
        at Summarizer$TokenizerMapper.map(Summarizer.java:1)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:518)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:303)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

(repeated several times)
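For illustration, here is a simplified sketch of the kind of RecordReader we
mean (this is not our actual code, which is in the pastebin links below; the
class name, the batch size, and the use of LineRecordReader are just
placeholders):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Sketch: a RecordReader that batches many lines into one record, so a
// single map() call processes a whole List<Text> of lines instead of one.
public class MultiLineRecordReader extends RecordReader<LongWritable, List<Text>> {

    private static final int LINES_PER_RECORD = 10000; // placeholder batch size

    private final LineRecordReader lineReader = new LineRecordReader();
    private final LongWritable key = new LongWritable();
    private final List<Text> lines = new ArrayList<Text>();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        lineReader.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        lines.clear();
        while (lines.size() < LINES_PER_RECORD && lineReader.nextKeyValue()) {
            // Copy each line, since LineRecordReader reuses its Text object.
            lines.add(new Text(lineReader.getCurrentValue()));
        }
        if (lines.isEmpty()) {
            return false; // end of split
        }
        key.set(lineReader.getCurrentKey().get());
        return true;
    }

    @Override
    public LongWritable getCurrentKey() { return key; }

    @Override
    public List<Text> getCurrentValue() { return lines; }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return lineReader.getProgress();
    }

    @Override
    public void close() throws IOException {
        lineReader.close();
    }
}

The mapper is then declared along the lines of
Mapper<LongWritable, List<Text>, Text, DoubleWritable> (output types here
are just an example), and its map() iterates over the list.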

What might be the problem?
And is there perhaps an existing InputFormat (one that is not marked as
Deprecated) that already solves this problem?

Source code:
Summarizer: http://pastebin.com/m52876939
RecordReader: http://pastebin.com/m2c541a00
InputFormat: http://pastebin.com/m7714b0c

Hadoop version: 0.20.0
Java JDK version: 1.6 u14

Regards,
Per and Felix
