Hi all,
I'm experimenting with Hadoop streaming on release 1.0.3.

To give some background, I'm streaming a text file into a mapper written in
C.  With the default settings, streaming uses TextInputFormat, which
creates one record from each line.  The problem I am having is that my
record boundaries need to fall every 4 lines, so when the input is split
among the mappers, I end up with partial records at the split boundaries.
To address this, my approach was to write a new RecordReader class in Java
that is almost identical to LineRecordReader, but with a modified next()
method that reads 4 lines instead of one.
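To illustrate, the core of the 4-line grouping I have in mind looks roughly like the sketch below. This is a standalone simplification, not the actual Hadoop RecordReader interface: the class name, constructor, and nextRecord() method here are illustrative stand-ins for the modified next() in my LineRecordReader copy.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Illustrative sketch: group every 4 input lines into one record value,
// the way my modified next() does inside the real RecordReader.
public class FourLineReader {
    private final BufferedReader in;

    public FourLineReader(BufferedReader in) {
        this.in = in;
    }

    // Returns the next record (4 lines joined with '\n'),
    // a shorter trailing record at end of input, or null at EOF.
    public String nextRecord() throws IOException {
        StringBuilder record = new StringBuilder();
        for (int i = 0; i < 4; i++) {
            String line = in.readLine();
            if (line == null) {
                // End of input: emit any partial trailing record.
                return record.length() == 0 ? null : record.toString();
            }
            if (i > 0) {
                record.append('\n');
            }
            record.append(line);
        }
        return record.toString();
    }

    public static void main(String[] args) throws IOException {
        String input = "a\nb\nc\nd\ne\nf\ng\nh\n";
        FourLineReader r =
            new FourLineReader(new BufferedReader(new StringReader(input)));
        System.out.println(r.nextRecord().replace("\n", "|"));
        System.out.println(r.nextRecord().replace("\n", "|"));
    }
}
```

The real class would additionally have to respect the split's end offset so that a record straddling a split boundary is read entirely by one mapper, which is exactly the partial-record problem described above.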

I then compiled the new class and packaged it in a jar.  I wanted to
include this at run time using the -libjars argument, like so:

hadoop jar ../contrib/streaming/hadoop-streaming-1.0.3.jar -libjars
NLineRecordReader.jar -files test_stream.sh -inputreader
mypackage.NLineRecordReader -input /Users/hadoop/test/test.txt -output
/Users/hadoop/test/output -mapper "test_stream.sh" -reducer NONE

Unfortunately, I keep getting the following error:
-inputreader: class not found: mypackage.NLineRecordReader

My question is two-fold.  Is a custom RecordReader implementation the
right approach to handle the 4-line records?  And why isn't -libjars
making my class available to Hadoop streaming at runtime?

Thanks,
Jason
