[ http://issues.apache.org/jira/browse/HADOOP-372?page=comments#action_12429553 ]

Owen O'Malley commented on HADOOP-372:
--------------------------------------

Ok, let me modify this a bit. How about we define a new class that represents
a map input processing pipeline? Each pipeline is given the input directory and
the JobConf when it is created, and then picks the appropriate InputFormat and
Mapper classes.

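public class InputPipeline {
  // Created per input directory, with the job's configuration.
  public InputPipeline(Path inputDir, JobConf conf);
  // InputFormat class to use for this directory.
  public Class createInputFormat();
  // Mapper class to run over this directory's records.
  public Class getMapper();
  // Number of map tasks requested for this directory.
  public int getRequestedNumMaps();
}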

JobConf then picks up two methods:

JobConf:
   public void setInputPipelineClass(Class cls);
   public Class getInputPipelineClass();

The default will be the base InputPipeline, which just uses the values from the
JobConf for all paths.
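
A minimal sketch of that default behaviour, under stated assumptions: the
JobConf below is a simplified stand-in rather than Hadoop's class, a String
stands in for Path, the setInputFormatClass/getInputFormatClass names are
hypothetical, and TextInputFormat and IdentityMapper are empty placeholders.
It only illustrates the idea that the base pipeline returns the job-wide
values no matter which directory it was created for.

```java
/** Sketch only: every type here is a simplified stand-in, not the Hadoop class. */
final class DefaultPipelineSketch {

  /** Stand-in for JobConf: job-wide settings plus the two proposed hooks. */
  static class JobConf {
    private Class<?> inputFormatClass;
    private Class<?> mapperClass;
    private int numMapTasks = 1;
    private Class<? extends InputPipeline> pipelineClass = InputPipeline.class;

    void setInputFormatClass(Class<?> cls) { inputFormatClass = cls; }
    Class<?> getInputFormatClass() { return inputFormatClass; }
    void setMapperClass(Class<?> cls) { mapperClass = cls; }
    Class<?> getMapperClass() { return mapperClass; }
    void setNumMapTasks(int n) { numMapTasks = n; }
    int getNumMapTasks() { return numMapTasks; }

    // The two methods the proposal adds; the default is the base pipeline.
    void setInputPipelineClass(Class<? extends InputPipeline> cls) { pipelineClass = cls; }
    Class<? extends InputPipeline> getInputPipelineClass() { return pipelineClass; }
  }

  /** Base pipeline: ignores the directory and answers from the JobConf. */
  static class InputPipeline {
    protected final String inputDir; // String stands in for Path
    protected final JobConf conf;

    public InputPipeline(String inputDir, JobConf conf) {
      this.inputDir = inputDir;
      this.conf = conf;
    }

    public Class<?> createInputFormat() { return conf.getInputFormatClass(); }
    public Class<?> getMapper() { return conf.getMapperClass(); }
    public int getRequestedNumMaps() { return conf.getNumMapTasks(); }
  }

  // Empty placeholders so the example compiles without Hadoop on the classpath.
  static class TextInputFormat {}
  static class IdentityMapper {}

  public static void main(String[] args) {
    JobConf conf = new JobConf();
    conf.setInputFormatClass(TextInputFormat.class);
    conf.setMapperClass(IdentityMapper.class);
    conf.setNumMapTasks(4);

    // Every directory gets the same answers from the default pipeline.
    for (String dir : new String[] {"rankTable", "metaDataTable"}) {
      InputPipeline p = new InputPipeline(dir, conf);
      System.out.println(dir + " -> " + p.createInputFormat().getSimpleName()
          + ", " + p.getMapper().getSimpleName()
          + ", maps=" + p.getRequestedNumMaps());
    }
  }
}
```

Jobs that never call setInputPipelineClass would see no change in behaviour
under this reading, since the base class only echoes what the JobConf
already holds.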

The framework changes are pretty light: create the InputPipeline when
iterating through the input directories and use it to create the splits. We
also need to add the InputFormat and Mapper classes to the MapTask and make
MapTask.localizeConfiguration set the Mapper class.
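
The framework side could be sketched roughly as follows. This is an
illustration under assumptions, not the actual JobTracker or MapTask code:
all types are simplified stand-ins, a String stands in for Path, planTasks
and ByNamePipeline are hypothetical names, and one task per requested map
replaces the real split computation, which would be delegated to the
InputFormat.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: all types are simplified stand-ins, not Hadoop's classes. */
final class SplitPlanningSketch {

  /** Stand-in for JobConf: input dirs, job-wide defaults, and the pipeline hook. */
  static class JobConf {
    final List<String> inputDirs = new ArrayList<>();
    Class<? extends InputPipeline> pipelineClass = InputPipeline.class;
    Class<?> inputFormatClass;
    Class<?> mapperClass;
    int numMapTasks = 1;
  }

  /** Default pipeline: answers with the job-wide values for every directory. */
  static class InputPipeline {
    protected final String inputDir; // String stands in for Path
    protected final JobConf conf;
    public InputPipeline(String inputDir, JobConf conf) {
      this.inputDir = inputDir;
      this.conf = conf;
    }
    public Class<?> createInputFormat() { return conf.inputFormatClass; }
    public Class<?> getMapper() { return conf.mapperClass; }
    public int getRequestedNumMaps() { return conf.numMapTasks; }
  }

  /** Hypothetical custom pipeline: text for one directory, sequence files elsewhere. */
  static class ByNamePipeline extends InputPipeline {
    public ByNamePipeline(String inputDir, JobConf conf) { super(inputDir, conf); }
    @Override public Class<?> createInputFormat() {
      return inputDir.equals("classificationTable")
          ? TextInputFormat.class : SequenceFileInputFormat.class;
    }
  }

  /** Stand-in for MapTask, extended to carry the classes its split needs. */
  static class MapTask {
    final String inputDir;
    final Class<?> inputFormatClass;
    final Class<?> mapperClass;
    MapTask(String inputDir, Class<?> inputFormatClass, Class<?> mapperClass) {
      this.inputDir = inputDir;
      this.inputFormatClass = inputFormatClass;
      this.mapperClass = mapperClass;
    }
    /** Mirrors the proposed change: the task's own conf gets the Mapper class. */
    void localizeConfiguration(JobConf taskConf) { taskConf.mapperClass = mapperClass; }
  }

  /** Creates one pipeline per input directory and lets it describe the tasks. */
  static List<MapTask> planTasks(JobConf conf) {
    List<MapTask> tasks = new ArrayList<>();
    for (String dir : conf.inputDirs) {
      InputPipeline pipeline;
      try {
        pipeline = conf.pipelineClass
            .getDeclaredConstructor(String.class, JobConf.class)
            .newInstance(dir, conf);
      } catch (ReflectiveOperationException e) {
        throw new IllegalStateException("cannot create pipeline for " + dir, e);
      }
      // Simplification: one task per requested map instead of real splits.
      for (int i = 0; i < pipeline.getRequestedNumMaps(); i++) {
        tasks.add(new MapTask(dir, pipeline.createInputFormat(), pipeline.getMapper()));
      }
    }
    return tasks;
  }

  // Empty placeholders so the example compiles without Hadoop on the classpath.
  static class TextInputFormat {}
  static class SequenceFileInputFormat {}
  static class IdentityMapper {}

  public static void main(String[] args) {
    JobConf conf = new JobConf();
    conf.inputDirs.add("rankTable");
    conf.inputDirs.add("classificationTable");
    conf.pipelineClass = ByNamePipeline.class;
    conf.mapperClass = IdentityMapper.class;
    conf.numMapTasks = 2;
    for (MapTask t : planTasks(conf)) {
      System.out.println(t.inputDir + " -> " + t.inputFormatClass.getSimpleName());
    }
  }
}
```

Instantiating the pipeline reflectively from the configured class is one way
to keep the framework unaware of any particular pipeline implementation.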

To complete the picture, I'd also add a class in org.apache.hadoop.mapred.lib 
that looks like:

public class MultiInputPipeline extends InputPipeline {
  public static void addInputPipeline(JobConf conf, Path inputDir,
                                      Class inputFormat, Class mapper,
                                      int numMaps);
  ...
}

That makes this look like the other hooks that are currently in Hadoop and 
provides the flexibility that the users need.
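
One way MultiInputPipeline might work is sketched below. The storage scheme
is purely an assumption, since the proposal leaves the body elided: here
addInputPipeline records the per-directory choices in the conf under
hypothetical "sketch.*" keys, and the pipeline looks them up by directory,
falling back to the job-wide values. JobConf is a simplified stand-in, a
String stands in for Path, and the format and mapper classes are empty
placeholders.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch only: all types are simplified stand-ins, not Hadoop's classes. */
final class MultiPipelineSketch {

  /** Stand-in for JobConf: a plain key/value bag. */
  static class JobConf {
    private final Map<String, Object> props = new HashMap<>();
    void set(String key, Object value) { props.put(key, value); }
    Object get(String key, Object fallback) { return props.getOrDefault(key, fallback); }
  }

  /** Base pipeline: answers from the job-wide keys (hypothetical key names). */
  static class InputPipeline {
    static final String FORMAT = "sketch.inputFormat";
    static final String MAPPER = "sketch.mapper";
    static final String NUM_MAPS = "sketch.numMaps";

    protected final String inputDir; // String stands in for Path
    protected final JobConf conf;

    public InputPipeline(String inputDir, JobConf conf) {
      this.inputDir = inputDir;
      this.conf = conf;
    }

    public Class<?> createInputFormat() { return (Class<?>) conf.get(FORMAT, null); }
    public Class<?> getMapper() { return (Class<?>) conf.get(MAPPER, null); }
    public int getRequestedNumMaps() { return (Integer) conf.get(NUM_MAPS, 1); }
  }

  /** Per-directory pipeline: registered directories override the job-wide values. */
  static class MultiInputPipeline extends InputPipeline {
    private static final String PREFIX = "sketch.multi."; // assumed storage scheme

    public static void addInputPipeline(JobConf conf, String inputDir,
                                        Class<?> inputFormat, Class<?> mapper,
                                        int numMaps) {
      conf.set(PREFIX + inputDir + ".format", inputFormat);
      conf.set(PREFIX + inputDir + ".mapper", mapper);
      conf.set(PREFIX + inputDir + ".numMaps", numMaps);
    }

    public MultiInputPipeline(String inputDir, JobConf conf) { super(inputDir, conf); }

    @Override public Class<?> createInputFormat() {
      return (Class<?>) conf.get(PREFIX + inputDir + ".format", super.createInputFormat());
    }
    @Override public Class<?> getMapper() {
      return (Class<?>) conf.get(PREFIX + inputDir + ".mapper", super.getMapper());
    }
    @Override public int getRequestedNumMaps() {
      return (Integer) conf.get(PREFIX + inputDir + ".numMaps", super.getRequestedNumMaps());
    }
  }

  // Empty placeholders so the example compiles without Hadoop on the classpath.
  static class TextInputFormat {}
  static class SequenceFileInputFormat {}
  static class RankMapper {}
  static class ClassificationMapper {}

  public static void main(String[] args) {
    JobConf conf = new JobConf();
    MultiInputPipeline.addInputPipeline(conf, "rankTable",
        SequenceFileInputFormat.class, RankMapper.class, 8);
    MultiInputPipeline.addInputPipeline(conf, "classificationTable",
        TextInputFormat.class, ClassificationMapper.class, 2);

    for (String dir : new String[] {"rankTable", "classificationTable"}) {
      InputPipeline p = new MultiInputPipeline(dir, conf);
      System.out.println(dir + " -> " + p.createInputFormat().getSimpleName()
          + ", " + p.getMapper().getSimpleName()
          + ", maps=" + p.getRequestedNumMaps());
    }
  }
}
```

Keeping the registration method static on the pipeline class means the user
only touches the JobConf, which matches how the other lib helpers are
described here as hooks.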

> should allow to specify different inputformat classes for different input 
> dirs for Map/Reduce jobs
> --------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-372
>                 URL: http://issues.apache.org/jira/browse/HADOOP-372
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.4.0
>         Environment: all
>            Reporter: Runping Qi
>         Assigned To: Runping Qi
>
> Right now, the user can specify multiple input directories for a map reduce 
> job. 
> However, the files under all the directories are assumed to be in the same 
> format, 
> with the same key/value classes. This proves to be a serious limitation in many 
> situations. 
> Here is an example. Suppose I have three simple tables: 
> one has URLs and their rank values (page ranks), 
> another has URLs and their classification values, 
> and the third one has the URL meta data such as crawl status, last crawl 
> time, etc. 
> Suppose now I need a job to generate a list of URLs to be crawled next. 
> The decision depends on the info in all the three tables.
> Right now, there is no easy way to accomplish this.
> However, this job can be done if the framework allows specifying different 
> input formats for different input dirs.
> Suppose my three tables are in the following directories, respectively: 
> rankTable, classificationTable, and metaDataTable. 
> If we extend JobConf class with the following method (as Owen suggested to 
> me):
>     addInputPath(aPath, anInputFormatClass, anInputKeyClass,
>                  anInputValueClass)
> Then I can specify my job as follows:
>     addInputPath(rankTable, SequenceFileInputFormat.class, UTF8.class,
>                  DoubleWritable.class)
>     addInputPath(classificationTable, TextInputFormat.class, UTF8.class,
>                  UTF8.class)
>     addInputPath(metaDataTable, SequenceFileInputFormat.class, UTF8.class,
>                  MyRecord.class)
> If an input directory is added through the current API, it will have the same 
> meaning as it is now. 
> Thus this extension will not affect any applications that do not need this 
> new feature.
> It is relatively easy for the M/R framework to create an appropriate record 
> reader for a map task based on the above information.
> And that is the only change needed for supporting this extension.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
