    [ http://issues.apache.org/jira/browse/HADOOP-450?page=comments#action_12427980 ]

Runping Qi commented on HADOOP-450:
-----------------------------------
It seems to me that it is better to add getKeyClass()/getValueClass() methods to RecordReader, since creating instances of these classes needs a reference to the JobConf object, and RecordReader does not have such a reference. The MapRunner class can create the key/value objects by calling ReflectionUtils. MapRunner would look like:

    public void run(RecordReader input, OutputCollector output,
                    Reporter reporter) throws IOException {
      try {
        // allocate key & value instances that are re-used for all entries
        this.inputKeyClass = input.getKeyClass();
        this.inputValueClass = input.getValueClass();
        WritableComparable key =
            (WritableComparable) ReflectionUtils.newInstance(inputKeyClass, job);
        Writable value =
            (Writable) ReflectionUtils.newInstance(inputValueClass, job);
        // pick the mapper registered for this key/value type pair
        Class mapperClass =
            job.getMapperClassFor(this.inputKeyClass, this.inputValueClass);
        this.mapper = (Mapper) ReflectionUtils.newInstance(mapperClass, job);
        while (input.next(key, value)) {
          // map pair to output
          mapper.map(key, value, output, reporter);
        }
      } finally {
        mapper.close();
      }
    }

In the above code, the mapper class is obtained from the job object through a new method of the JobConf class:

    Class getMapperClassFor(Class keyClass, Class valueClass);

With these, and a few other minor changes, this patch would address JIRA issue HADOOP-372 as well. The other changes are to add the following methods to the JobConf class:

    public InputFormat getInputFormat(Path p)
    public void setInputFormat(Class theClass, Path p)
    public Class getMapperClassFor(Class keyClass, Class valueClass)
    public void setMapperClassFor(Class theClass, Class keyClass, Class valueClass)

and to replace the following in the MapTask class:

    final RecordReader rawIn =                       // open input
        job.getInputFormat().getRecordReader
            (FileSystem.get(job), split, job, reporter);

with:

    final RecordReader rawIn =                       // open input
        job.getInputFormat(split.getPath()).getRecordReader
            (FileSystem.get(job), split, job, reporter);

With these changes, the application would specify a MapReduce job in the following way:

    Configuration defaults = new Configuration();
    JobConf theJob = new JobConf(defaults, My.class);

    theJob.addInputPath(myInputPath_1);
    theJob.setInputFormat(SequenceFileInputFormat.class, myInputPath_1);
    theJob.addInputPath(myInputPath_2);
    theJob.setInputFormat(TextInputFormat.class, myInputPath_2);
    theJob.addInputPath(myInputPath_3);
    theJob.setInputFormat(SequenceFileInputFormat.class, myInputPath_3);

    theJob.setMapperClassFor(MapperClass_a.class, LongWritable.class, Text.class);
    theJob.setMapperClassFor(MapperClass_b.class, Key_1.class, Value_1.class);
    theJob.setMapperClassFor(MapperClass_c.class, Key_2.class, Value_2.class);
    ...

> Remove the need for users to specify the types of the inputs
> ------------------------------------------------------------
>
>                 Key: HADOOP-450
>                 URL: http://issues.apache.org/jira/browse/HADOOP-450
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.5.0
>            Reporter: Owen O'Malley
>         Assigned To: Owen O'Malley
>             Fix For: 0.6.0
>
>
> Currently, the application specifies the types of the input keys and values,
> and the RecordReader checks them for consistency. It would make more sense to
> have the RecordReader define the types of keys that it will produce.
> Therefore, I propose that we add two new methods to RecordReader:
>
>     WritableComparable createKey();
>     Writable createValue();
>
> Note that I propose adding them to the RecordReader rather than the
> InputFormat, so that they can be specific to a particular input split.
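For comparison, under the createKey()/createValue() proposal in the quoted description, the type discovery and instantiation would move into the reader itself, and the MapRunner loop above would presumably reduce to something like this (a sketch of the proposed direction only, not code from an actual patch):

    public void run(RecordReader input, OutputCollector output,
                    Reporter reporter) throws IOException {
      try {
        // the reader supplies correctly-typed, re-usable instances, so
        // MapRunner no longer needs the key/value classes, a JobConf
        // reference for reflection, or a per-type mapper lookup
        WritableComparable key = input.createKey();
        Writable value = input.createValue();
        while (input.next(key, value)) {
          // map pair to output
          mapper.map(key, value, output, reporter);
        }
      } finally {
        mapper.close();
      }
    }

Either way, the key and value instances are allocated once and re-used for every record in the split; the two variants differ only in whether the reader or the runner knows the concrete types.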