    [ http://issues.apache.org/jira/browse/HADOOP-372?page=comments#action_12429835 ]

Owen O'Malley commented on HADOOP-372:
--------------------------------------

The question is how the user requests multiple InputPipelines. Either they do:

  job.setInputPipeline(MyInputPipelineFactory.class);

or

  job.addInputPath(myInputDir, MyInputFormat.class, MyMapper.class, numMaps);

Setting a class is more similar to the other hooks that control processing in the JobConf, but the expanded addInputPath is more user friendly. I guess you're right that it is better to be friendly. *smile*

Whether the InputFormat and Mapper classes are stored in the MapTask or in the FileSplit is a separate question. I think it makes more sense to put them into the MapTask, since that is a private type and changing it doesn't change an API. That matters all the more when you consider that we are going to generalize the FileSplit to InputSplit.

Finally, I think that getSplits should take a Path[] rather than just a Path, so that we can split evenly over a set of input paths. So I propose:

  getSplits(FileSystem, Path[], JobConf, int numMaps)

In the long run, as we move Paths over to URLs that include their FileSystem, we should drop explicit FileSystem parameters like this. But that is another patch. *grin*

> should allow to specify different inputformat classes for different input
> dirs for Map/Reduce jobs
> --------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-372
>                 URL: http://issues.apache.org/jira/browse/HADOOP-372
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.4.0
>         Environment: all
>            Reporter: Runping Qi
>         Assigned To: Owen O'Malley
>
> Right now, the user can specify multiple input directories for a map reduce
> job.
> However, the files under all the directories are assumed to be in the same
> format,
> with the same key/value classes. This proves to be a serious limit in many
> situations.
> Here is an example.
> Suppose I have three simple tables:
> one has URLs and their rank values (page ranks),
> another has URLs and their classification values,
> and the third one has the URL meta data such as crawl status, last crawl
> time, etc.
> Suppose now I need a job to generate a list of URLs to be crawled next.
> The decision depends on the info in all the three tables.
> Right now, there is no easy way to accomplish this.
> However, this job can be done if the framework allows to specify different
> inputformats for different input dirs.
> Suppose my three tables are in the following directories respectively:
> rankTable, classificationTable, and metaDataTable.
> If we extend JobConf class with the following method (as Owen suggested to
> me):
>     addInputPath(aPath, anInputFormatClass, anInputKeyClass,
> anInputValueClass)
> Then I can specify my job as follows:
>     addInputPath(rankTable, SequenceFileInputFormat.class, UTF8.class,
> DoubleWritable.class)
>     addInputPath(classificationTable, TextInputFormat.class, UTF8.class,
> UTF8.class)
>     addInputPath(metaDataTable, SequenceFileInputFormat.class, UTF8.class,
> MyRecord.class)
> If an input directory is added through the current API, it will have the same
> meaning as it is now.
> Thus this extension will not affect any applications that do not need this
> new feature.
> It is relatively easy for the M/R framework to create an appropriate record
> reader for a map task based on the above information.
> And that is the only change needed for supporting this extension.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
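
For readers following the thread, the two ideas in the comment (registering a format and mapper per input directory, and cutting splits evenly across a set of paths) might be sketched roughly as below. This is only an illustration under stated assumptions, not the patch: MultiInputSketch, InputSpec, TaskSpec and the mapper names are hypothetical stand-ins, class names are carried as plain strings, directory sizes are assumed to be known up front, and a single job-wide numMaps is used instead of one value per path.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-ins only; none of these are the real Hadoop mapred types.
public class MultiInputSketch {

    /** One registered input directory with the classes used to read and map it. */
    public static final class InputSpec {
        public final String path;
        public final long length;        // total bytes under the path (assumed known)
        public final String inputFormat; // stand-in for an InputFormat class
        public final String mapper;      // stand-in for a Mapper class

        InputSpec(String path, long length, String inputFormat, String mapper) {
            this.path = path;
            this.length = length;
            this.inputFormat = inputFormat;
            this.mapper = mapper;
        }
    }

    /** A byte range of one input; the task, not the split, names format and mapper. */
    public static final class TaskSpec {
        public final InputSpec input;
        public final long offset;
        public final long length;

        TaskSpec(InputSpec input, long offset, long length) {
            this.input = input;
            this.offset = offset;
            this.length = length;
        }
    }

    private final List<InputSpec> inputs = new ArrayList<>();

    /** Loosely mirrors the shape of the expanded addInputPath discussed above. */
    public void addInputPath(String path, long length, String inputFormat, String mapper) {
        inputs.add(new InputSpec(path, length, inputFormat, mapper));
    }

    /** Cuts all registered inputs into ranges of about total/numMaps bytes each. */
    public List<TaskSpec> getSplits(int numMaps) {
        if (numMaps <= 0) {
            throw new IllegalArgumentException("numMaps must be positive");
        }
        long total = 0;
        for (InputSpec in : inputs) {
            total += in.length;
        }
        // Round up so that the target range size is never zero.
        long target = Math.max(1, (total + numMaps - 1) / numMaps);
        List<TaskSpec> tasks = new ArrayList<>();
        for (InputSpec in : inputs) {
            for (long off = 0; off < in.length; off += target) {
                tasks.add(new TaskSpec(in, off, Math.min(target, in.length - off)));
            }
        }
        return tasks;
    }

    public static void main(String[] args) {
        MultiInputSketch job = new MultiInputSketch();
        job.addInputPath("rankTable", 100, "SequenceFileInputFormat", "RankMapper");
        job.addInputPath("classificationTable", 300, "TextInputFormat", "ClassMapper");
        job.addInputPath("metaDataTable", 200, "SequenceFileInputFormat", "MetaMapper");
        for (TaskSpec t : job.getSplits(6)) {
            System.out.println(t.input.path + " offset=" + t.offset + " length=" + t.length
                    + " " + t.input.inputFormat + " / " + t.input.mapper);
        }
    }
}
```

In this sketch numMaps acts as a hint: ranges never cross a directory boundary, so inputs whose sizes are not multiples of the target range size produce more ranges than requested.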