Re-factor InputFormat/RecordReader related classes
--------------------------------------------------
Key: HADOOP-1204
URL: https://issues.apache.org/jira/browse/HADOOP-1204
Project: Hadoop
Issue Type: Bug
Components: mapred
Reporter: Runping Qi
This Jira is the first small step toward unifying the
InputFormat/RecordReader code used by streaming
with the Hadoop main framework.
It makes the following cleanups to the related parts of the Hadoop main
framework.
1. Add a constructor
public LineRecordReader(Configuration job, FileSplit split)
to LineRecordReader. This gives the constructors of
SequenceFileRecordReader and LineRecordReader
the same signature, which makes it easier to write a factory class that creates
various record readers once
the record reader classes for hadoop streaming are brought into the main
framework.
2. Implement the next() method using the following newly added protected method
in the LineRecordReader class:
protected long readLine() throws IOException {
    return LineRecordReader.readLine(in, buffer);
}
This allows users to easily override the readLine logic to use a
different line terminator (e.g. treat '\r' as part of the data, not as a line terminator).
3. Rename class InputFormatBase to FileInputFormat to better reflect the
functionality of the class.
For backward compatibility, keep the InputFormatBase class, but make it a
deprecated shell class that simply inherits from FileInputFormat.
4. Change TextInputFormat and SequenceFileInputFormat to extend FileInputFormat.
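
To illustrate item 1, here is a minimal sketch of how a shared constructor signature could let one factory create either reader. The Configuration, FileSplit, RecordReader, and reader classes below are simplified stand-ins written for this example, not the real Hadoop classes, and RecordReaderFactory is a hypothetical name; the issue does not specify what the eventual factory looks like.

```java
import java.lang.reflect.Constructor;

// Simplified stand-ins for the Hadoop types named in the issue (assumptions).
class Configuration {}

class FileSplit {
    final String path;
    FileSplit(String path) { this.path = path; }
}

interface RecordReader {
    String describe();
}

class LineRecordReader implements RecordReader {
    private final FileSplit split;
    // The constructor signature proposed in item 1.
    public LineRecordReader(Configuration job, FileSplit split) { this.split = split; }
    public String describe() { return "lines:" + split.path; }
}

class SequenceFileRecordReader implements RecordReader {
    private final FileSplit split;
    // Same signature as LineRecordReader, which is what makes the factory possible.
    public SequenceFileRecordReader(Configuration job, FileSplit split) { this.split = split; }
    public String describe() { return "records:" + split.path; }
}

// Hypothetical factory: looks up the (Configuration, FileSplit) constructor reflectively.
class RecordReaderFactory {
    static RecordReader create(Class<? extends RecordReader> type,
                               Configuration job, FileSplit split) {
        try {
            Constructor<? extends RecordReader> ctor =
                type.getConstructor(Configuration.class, FileSplit.class);
            return ctor.newInstance(job, split);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot create " + type.getName(), e);
        }
    }
}

class FactoryDemo {
    public static void main(String[] args) {
        Configuration job = new Configuration();
        FileSplit split = new FileSplit("part-00000");
        System.out.println(RecordReaderFactory.create(LineRecordReader.class, job, split).describe());
        System.out.println(RecordReaderFactory.create(SequenceFileRecordReader.class, job, split).describe());
    }
}
```

Because the factory depends only on the constructor signature, a reader class that lacks it fails at creation time with a clear error rather than needing a special case.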
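
To illustrate item 2, the following sketch shows a subclass overriding the protected readLine() hook so that '\r' stays in the data. SketchLineReader is a hypothetical stand-in that models only the hook and a simplified next(); its default line-splitting rule ('\n', '\r', or "\r\n" ends a line) is an assumption made for the example, not a statement about the real LineRecordReader.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for LineRecordReader; only the readLine hook is modeled.
class SketchLineReader {
    protected final InputStream in;   // must support mark/reset in this sketch
    protected final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

    SketchLineReader(InputStream in) { this.in = in; }

    // Assumed default: '\n', '\r', or "\r\n" ends a line. Returns bytes consumed.
    static long readLine(InputStream in, OutputStream out) throws IOException {
        long bytes = 0;
        int b;
        while ((b = in.read()) != -1) {
            bytes++;
            if (b == '\n') break;
            if (b == '\r') {
                in.mark(1);
                int c = in.read();
                if (c == '\n') bytes++;          // swallow the '\n' of "\r\n"
                else if (c != -1) in.reset();    // not part of the terminator
                break;
            }
            out.write(b);
        }
        return bytes;
    }

    // The overridable hook described in item 2.
    protected long readLine() throws IOException {
        return readLine(in, buffer);
    }

    // Returns the next line, or null at end of input.
    String next() throws IOException {
        buffer.reset();
        if (readLine() == 0) return null;
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }
}

// Overrides the hook so that only '\n' ends a line; '\r' is kept as data.
class NewlineOnlyReader extends SketchLineReader {
    NewlineOnlyReader(InputStream in) { super(in); }

    @Override
    protected long readLine() throws IOException {
        long bytes = 0;
        int b;
        while ((b = in.read()) != -1) {
            bytes++;
            if (b == '\n') break;
            buffer.write(b);
        }
        return bytes;
    }
}

class ReadLineDemo {
    static List<String> readAll(SketchLineReader reader) {
        List<String> lines = new ArrayList<>();
        try {
            String line;
            while ((line = reader.next()) != null) lines.add(line);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "a\rb\nc".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new SketchLineReader(new ByteArrayInputStream(data))).size());
        System.out.println(readAll(new NewlineOnlyReader(new ByteArrayInputStream(data))).size());
    }
}
```

On the input "a\rb\nc" the default reader yields three lines, while the overriding reader yields two, the first of which still contains the '\r'.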
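
Items 3 and 4 together amount to a rename with a compatibility shim. A rough sketch of the resulting hierarchy follows; the class bodies, the kind() and describe() methods, and LegacyUserFormat are all invented for illustration and do not reflect the real Hadoop members.

```java
// Stand-in for the renamed base class; the methods are hypothetical.
abstract class FileInputFormat {
    // Behavior shared by all file-based formats would live here.
    String kind() { return "file-based"; }
    abstract String describe();
}

/**
 * Kept only so that existing subclasses keep compiling.
 * @deprecated Extend FileInputFormat instead.
 */
@Deprecated
abstract class InputFormatBase extends FileInputFormat {
    // Intentionally empty: everything is inherited from FileInputFormat.
}

// Item 4: a framework format now extends FileInputFormat directly.
class TextInputFormat extends FileInputFormat {
    String describe() { return "text"; }
}

// Hypothetical pre-existing user code that still extends the old name.
@SuppressWarnings("deprecation")
class LegacyUserFormat extends InputFormatBase {
    String describe() { return "legacy"; }
}

class RenameDemo {
    public static void main(String[] args) {
        FileInputFormat[] formats = { new TextInputFormat(), new LegacyUserFormat() };
        for (FileInputFormat f : formats) {
            System.out.println(f.describe() + " is " + f.kind());
        }
    }
}
```

With this arrangement, code written against the old name still compiles (with a deprecation warning) and is usable wherever a FileInputFormat is expected.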