Zhenyu,

It's a bit complicated and involves some layers of indirection. CombineFileRecordReader is a sort of shell RecordReader that passes the actual work of reading records to another, child record reader. That child is the class you name in the final constructor argument. Instructing it to use CombineFileRecordReader again as its child RR doesn't tell it to do anything useful. You must give it the name of another RecordReader class that actually understands how to parse your particular records.
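To make the delegation concrete, here is a toy model in plain Java. It is only a sketch: CombinedSplit, Reader, LineChildReader and ShellReader are hypothetical stand-ins I made up for illustration, not the Hadoop classes, and the real CombineFileRecordReader has a different constructor and interface. The point it tries to show is that the shell never parses anything; it only opens one child reader per file in the combined split, so the child must be a class that can actually read records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.BiFunction;

// Toy model only: every type here is a hypothetical stand-in, not a Hadoop class.
class CombineDelegationSketch {

    /** Stand-in for CombineFileSplit: several files bundled into one split. */
    static final class CombinedSplit {
        final List<String> fileContents; // one newline-separated string per "file"
        CombinedSplit(List<String> fileContents) { this.fileContents = fileContents; }
    }

    /** Stand-in for RecordReader: returns the next record, or null when done. */
    interface Reader {
        String next();
    }

    /** Child reader: the only class that knows how to parse records (lines here). */
    static final class LineChildReader implements Reader {
        private final Iterator<String> lines;
        LineChildReader(CombinedSplit split, int fileIndex) {
            lines = Arrays.asList(split.fileContents.get(fileIndex).split("\n")).iterator();
        }
        @Override public String next() { return lines.hasNext() ? lines.next() : null; }
    }

    /** Shell reader: parses nothing itself, just opens one child per file. */
    static final class ShellReader implements Reader {
        private final CombinedSplit split;
        private final BiFunction<CombinedSplit, Integer, Reader> childFactory;
        private int nextFile = 0;
        private Reader current;

        ShellReader(CombinedSplit split, BiFunction<CombinedSplit, Integer, Reader> childFactory) {
            this.split = split;
            this.childFactory = childFactory;
        }

        @Override public String next() {
            while (true) {
                if (current == null) {
                    if (nextFile >= split.fileContents.size()) return null; // all files consumed
                    current = childFactory.apply(split, nextFile++);        // open child for next file
                }
                String record = current.next();
                if (record != null) return record;
                current = null; // this child is exhausted; move on to the next file
            }
        }
    }

    /** Reads every record of every file through the shell and its children. */
    static List<String> readAll(List<String> fileContents) {
        Reader shell = new ShellReader(new CombinedSplit(fileContents), LineChildReader::new);
        List<String> records = new ArrayList<>();
        for (String r = shell.next(); r != null; r = shell.next()) records.add(r);
        return records;
    }

    public static void main(String[] args) {
        System.out.println(readAll(Arrays.asList("a\nb", "c"))); // prints [a, b, c]
    }
}
```

In this model, handing ShellReader a factory that builds another ShellReader would just add a layer that still has no parser at the bottom, which is roughly why passing CombineFileRecordReader.class as the child does nothing useful.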

Unfortunately, TextInputFormat's LineRecordReader and SequenceFileInputFormat's SequenceFileRecordReader both require the InputSplit to be a FileSplit, so you can't use them directly. (CombineFileInputFormat passes a CombineFileSplit to the CombineFileRecordReader, which then passes it along to the child RR that you specify.)

In Sqoop I got around this by creating (another!) indirection class called CombineShimRecordReader. The export functionality of Sqoop uses CombineFileInputFormat to allow the user to specify the number of map tasks; it then organizes a set of input files into that many tasks. This instantiates a CombineFileRecordReader configured to forward its InputSplit to CombineShimRecordReader. CombineShimRecordReader then translates the CombineFileSplit into a regular FileSplit and forwards that to LineRecordReader (for text) or SequenceFileRecordReader (for SequenceFiles). The grandchild (LineRR or SequenceFileRR) is determined on a file-by-file basis by CombineShimRecordReader, by calling a static method of Sqoop's ExportJobBase.

You can take a look at the source of these classes here:

* http://github.com/cloudera/sqoop/blob/master/src/shims/common/org/apache/hadoop/sqoop/mapreduce/ExportInputFormat.java
* http://github.com/cloudera/sqoop/blob/master/src/shims/common/org/apache/hadoop/sqoop/mapreduce/CombineShimRecordReader.java
* http://github.com/cloudera/sqoop/blob/master/src/java/org/apache/hadoop/sqoop/mapreduce/ExportJobBase.java

(apologies for the lengthy URLs; you could also just download the whole project's source at http://github.com/cloudera/sqoop) :)

Cheers,
- Aaron

On Thu, May 6, 2010 at 7:32 AM, Zhenyu Zhong <zhongresea...@gmail.com> wrote:

> Hi,
>
> I tried to use CombineFileInputFormat in 0.20.2. It seems I need to extend
> it because it is an abstract class.
> However, I need to implement getRecordReader method in the extended class.
>
> May I ask how to implement this getRecordReader method?
>
> I tried to do something like this:
>
> public RecordReader getRecordReader(InputSplit genericSplit, JobConf job,
>     Reporter reporter) throws IOException {
>   // TODO Auto-generated method stub
>   reporter.setStatus(genericSplit.toString());
>   return new CombineFileRecordReader(job, (CombineFileSplit) genericSplit,
>       reporter, CombineFileRecordReader.class);
> }
>
> It doesn't seem to be working. I would be very appreciated if someone can
> shed a light on this.
>
> thanks
> zhenyu