    [ http://issues.apache.org/jira/browse/HADOOP-450?page=comments#action_12427980 ]

Runping Qi commented on HADOOP-450:
-----------------------------------
It seems to me that it is better to add getKeyClass()/getValueClass() methods to RecordReader, since creating instances of these classes needs a reference to the JobConf object, and RecordReader does not have such a reference. The MapRunner class can create the key/value objects by calling ReflectionUtils. MapRunner would look like:

    public void run(RecordReader input, OutputCollector output,
                    Reporter reporter) throws IOException {
      try {
        // allocate key & value instances that are re-used for all entries
        this.inputKeyClass = input.getKeyClass();
        this.inputValueClass = input.getValueClass();
        WritableComparable key =
            (WritableComparable) ReflectionUtils.newInstance(inputKeyClass, job);
        Writable value =
            (Writable) ReflectionUtils.newInstance(inputValueClass, job);
        // pick the mapper registered for this key/value type pair
        Class mapperClass =
            job.getMapperClassFor(this.inputKeyClass, this.inputValueClass);
        this.mapper = (Mapper) ReflectionUtils.newInstance(mapperClass, job);
        while (input.next(key, value)) {
          // map pair to output
          mapper.map(key, value, output, reporter);
        }
      } finally {
        mapper.close();
      }
    }

In the above code, the mapper class is obtained from the job object through a new method of the JobConf class:

    Class getMapperClassFor(Class keyClass, Class valueClass);

With these, and a few other minor changes, this patch would address JIRA issue HADOOP-372 as well. The other changes are to add the following methods to the JobConf class:

    public InputFormat getInputFormat(Path p)
    public void setInputFormat(Class theClass, Path p)
    public Class getMapperClassFor(Class keyClass, Class valueClass)
    public void setMapperClassFor(Class theClass, Class keyClass, Class valueClass)

and to replace the following in the MapTask class:

    final RecordReader rawIn =                       // open input
        job.getInputFormat().getRecordReader
            (FileSystem.get(job), split, job, reporter);

with:

    final RecordReader rawIn =                       // open input
        job.getInputFormat(split.getPath()).getRecordReader
            (FileSystem.get(job), split, job, reporter);

With these changes, the application would specify a MapReduce job in the following way:

    Configuration defaults = new Configuration();
    JobConf theJob = new JobConf(defaults, My.class);

    theJob.addInputPath(myInputPath_1);
    theJob.setInputFormat(SequenceFileInputFormat.class, myInputPath_1);
    theJob.addInputPath(myInputPath_2);
    theJob.setInputFormat(TextInputFormat.class, myInputPath_2);
    theJob.addInputPath(myInputPath_3);
    theJob.setInputFormat(SequenceFileInputFormat.class, myInputPath_3);

    theJob.setMapperClassFor(MapperClass_a.class, LongWritable.class, Text.class);
    theJob.setMapperClassFor(MapperClass_b.class, Key_1.class, Value_1.class);
    theJob.setMapperClassFor(MapperClass_c.class, Key_2.class, Value_2.class);
    ...

> Remove the need for users to specify the types of the inputs
> ------------------------------------------------------------
>
>                 Key: HADOOP-450
>                 URL: http://issues.apache.org/jira/browse/HADOOP-450
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.5.0
>            Reporter: Owen O'Malley
>         Assigned To: Owen O'Malley
>             Fix For: 0.6.0
>
>
> Currently, the application specifies the types of the input keys and values,
> and the RecordReader checks them for consistency. It would make more sense to
> have the RecordReader define the types of keys that it will produce.
> Therefore, I propose that we add two new methods to RecordReader:
>
>     WritableComparable createKey();
>     Writable createValue();
>
> Note that I propose adding them to the RecordReader rather than the
> InputFormat, so that they can be specific to a particular input split.
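For comparison, under the createKey()/createValue() proposal in the quoted description, the type discovery and instantiation would move into the reader itself, and the MapRunner loop above would presumably reduce to something like this (a sketch of the proposed direction only, not code from an actual patch):

    public void run(RecordReader input, OutputCollector output,
                    Reporter reporter) throws IOException {
      try {
        // the reader supplies correctly-typed, re-usable instances, so
        // MapRunner no longer needs the key/value classes, a JobConf
        // reference for reflection, or a per-type mapper lookup
        WritableComparable key = input.createKey();
        Writable value = input.createValue();
        while (input.next(key, value)) {
          // map pair to output
          mapper.map(key, value, output, reporter);
        }
      } finally {
        mapper.close();
      }
    }

Either way, the key and value instances are allocated once and re-used for every record in the split; the two variants differ only in whether the reader or the runner knows the concrete types.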