Hi all, I'm experimenting with Hadoop Streaming on build 1.0.3. For background: I'm streaming a text file into a mapper written in C. With the default settings, streaming uses TextInputFormat, which creates one record from each line. The problem I'm having is that I need record boundaries to fall every 4 lines; when the input is split across the mappers, I end up with partial records at the split boundaries. To address this, my approach was to write a new RecordReader class in Java that is almost identical to LineRecordReader, but with a modified next() method that reads 4 lines instead of one.
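To make the idea concrete, here is a rough sketch of the grouping logic in my next() method. This is just an illustration of the 4-line-per-record idea using plain java.io (the real class follows the Hadoop RecordReader API and copies most of LineRecordReader; the class and method names below are only for the demo):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Demo of the core idea behind the modified next(): read up to 4 lines
// and join them into a single record value. Not the actual Hadoop API,
// just the grouping logic in isolation.
public class FourLineReaderDemo {

    // Returns the next record (up to 4 lines joined with '\n'),
    // or null when the input is exhausted.
    static String nextRecord(BufferedReader in) throws IOException {
        StringBuilder record = new StringBuilder();
        int linesRead = 0;
        String line;
        while (linesRead < 4 && (line = in.readLine()) != null) {
            if (linesRead > 0) {
                record.append('\n');
            }
            record.append(line);
            linesRead++;
        }
        return linesRead == 0 ? null : record.toString();
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(
                new StringReader("a\nb\nc\nd\ne\nf\ng\nh\n"));
        String r;
        while ((r = nextRecord(in)) != null) {
            // Show each 4-line record on one line for readability.
            System.out.println(r.replace('\n', '|'));
        }
    }
}
```

In the real RecordReader, the joined text goes into the Text value and the key is the byte offset of the first of the 4 lines, just as LineRecordReader does for a single line.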
I then compiled the new class and created a jar. I wanted to import it at run time using the -libjars argument, like so:

hadoop jar ../contrib/streaming/hadoop-streaming-1.0.3.jar \
    -libjars NLineRecordReader.jar \
    -files test_stream.sh \
    -inputreader mypackage.NLineRecordReader \
    -input /Users/hadoop/test/test.txt \
    -output /Users/hadoop/test/output \
    -mapper "test_stream.sh" \
    -reducer NONE

Unfortunately, I keep getting the following error:

-inputreader: class not found: mypackage.NLineRecordReader

My question is twofold. Am I using the right approach to handle the 4-line records with a custom RecordReader implementation? And why isn't -libjars working to include my class in Hadoop Streaming at runtime?

Thanks,
Jason