[
http://issues.apache.org/jira/browse/HADOOP-450?page=comments#action_12427980 ]
Runping Qi commented on HADOOP-450:
-----------------------------------
It seems to me that it is better to add getKeyClass()/getValueClass() methods to
RecordReader instead: creating an object of these classes needs a reference to a
JobConf object, and RecordReader does not have such a reference. The MapRunner
class can then create the key/value objects by calling ReflectionUtils.
MapRunner would look like:
  public void run(RecordReader input, OutputCollector output, Reporter reporter)
      throws IOException {
    try {
      // allocate key & value instances that are re-used for all entries
      this.inputKeyClass = input.getKeyClass();
      this.inputValueClass = input.getValueClass();
      WritableComparable key =
          (WritableComparable) ReflectionUtils.newInstance(inputKeyClass, job);
      Writable value =
          (Writable) ReflectionUtils.newInstance(inputValueClass, job);
      Class mapperClass =
          job.getMapperClassFor(this.inputKeyClass, this.inputValueClass);
      this.mapper = (Mapper) ReflectionUtils.newInstance(mapperClass, job);
      while (input.next(key, value)) {
        // map pair to output
        mapper.map(key, value, output, reporter);
      }
    } finally {
      if (mapper != null) {  // guard: instantiation above may have thrown
        mapper.close();
      }
    }
  }
In the above code, the mapper class is obtained from the job object
through a new method on the JobConf class:

  Class getMapperClassFor(Class keyClass, Class valueClass);
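One possible way to back setMapperClassFor/getMapperClassFor is to store the
mapping in the job's property table, keyed by the key/value class names. The
sketch below is illustrative only: the property-name scheme
("mapred.mapper.for.<K>.<V>") and the standalone MapperTable class are my
assumptions, not part of the patch; a real implementation would live inside
JobConf itself.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: backing get/setMapperClassFor with string properties,
// the way JobConf stores its other settings. MapperTable stands in for JobConf.
class MapperTable {
    private final Map<String, String> props = new HashMap<>();

    // Assumed property-name scheme; not taken from the patch.
    private static String propName(Class<?> keyClass, Class<?> valueClass) {
        return "mapred.mapper.for." + keyClass.getName() + "." + valueClass.getName();
    }

    public void setMapperClassFor(Class<?> theClass, Class<?> keyClass,
                                  Class<?> valueClass) {
        props.put(propName(keyClass, valueClass), theClass.getName());
    }

    public Class<?> getMapperClassFor(Class<?> keyClass, Class<?> valueClass) {
        String name = props.get(propName(keyClass, valueClass));
        if (name == null) {
            return null; // no mapper registered for this (key, value) pair
        }
        try {
            return Class.forName(name);
        } catch (ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }
}
```

MapRunner would then instantiate the returned class with
ReflectionUtils.newInstance, just as it does for the key and value objects.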
With these, and a few other minor changes, this patch would address
HADOOP-372 as well.
The other changes include adding the following methods to the JobConf class:

  public InputFormat getInputFormat(Path p)
  public void setInputFormat(Class theClass, Path p)
  public Class getMapperClassFor(Class keyClass, Class valueClass)
  public void setMapperClassFor(Class theClass, Class keyClass, Class valueClass)
and replacing the following in the MapTask class:

  final RecordReader rawIn = // open input
    job.getInputFormat().getRecordReader
      (FileSystem.get(job), split, job, reporter);

with

  final RecordReader rawIn = // open input
    job.getInputFormat(split.getPath()).getRecordReader
      (FileSystem.get(job), split, job, reporter);
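For the path-keyed lookup, getInputFormat(Path) could fall back to the
job-wide input format when no path-specific one was set. A hedged sketch:
FormatTable and its string-keyed map are illustrative stand-ins for the
JobConf additions, not code from the patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of per-path input-format resolution with a job-wide
// default; FormatTable stands in for the proposed JobConf additions.
class FormatTable {
    private final Map<String, String> formatByPath = new HashMap<>();
    private String defaultFormat = "org.apache.hadoop.mapred.TextInputFormat";

    // job-wide default, as JobConf.setInputFormat works today
    public void setInputFormat(String formatClassName) {
        defaultFormat = formatClassName;
    }

    // proposed per-path override
    public void setInputFormat(String formatClassName, String path) {
        formatByPath.put(path, formatClassName);
    }

    public String getInputFormat(String path) {
        // fall back to the job-wide format when this path has no override
        return formatByPath.getOrDefault(path, defaultFormat);
    }
}
```

With this fallback, jobs that never call the per-path setter behave exactly
as they do now.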
With these changes, the application will specify a MapReduce job in the
following way:
  Configuration defaults = new Configuration();
  JobConf theJob = new JobConf(defaults, My.class);
  theJob.addInputPath(myInputPath_1);
  theJob.setInputFormat(SequenceFileInputFormat.class, myInputPath_1);
  theJob.addInputPath(myInputPath_2);
  theJob.setInputFormat(TextInputFormat.class, myInputPath_2);
  theJob.addInputPath(myInputPath_3);
  theJob.setInputFormat(SequenceFileInputFormat.class, myInputPath_3);
  theJob.setMapperClassFor(MapperClass_a.class, LongWritable.class, Text.class);
  theJob.setMapperClassFor(MapperClass_b.class, Key_1.class, Value_1.class);
  theJob.setMapperClassFor(MapperClass_c.class, Key_2.class, Value_2.class);
  ....
> Remove the need for users to specify the types of the inputs
> ------------------------------------------------------------
>
> Key: HADOOP-450
> URL: http://issues.apache.org/jira/browse/HADOOP-450
> Project: Hadoop
> Issue Type: Improvement
> Components: mapred
> Affects Versions: 0.5.0
> Reporter: Owen O'Malley
> Assigned To: Owen O'Malley
> Fix For: 0.6.0
>
>
> Currently, the application specifies the types of the input keys and values
> and the RecordReader checks them for consistency. It would make more sense to
> have the RecordReader define the types of keys that it will produce.
> Therefore, I propose that we add two new methods to RecordReader:
> WritableComparable createKey();
> Writable createValue();
> Note that I propose adding them to the RecordReader rather than the
> InputFormat, so that they can be specific to a particular input split.