[ http://issues.apache.org/jira/browse/HADOOP-372?page=comments#action_12430107 ]

Doug Cutting commented on HADOOP-372:
-------------------------------------

> A very typical case is to have the same input format, but different Mappers

But if the mapper is a function of the input format, this can instead be:

    job.addInputPath("foo", FooInput.class);
    job.addInputPath("bar", BarInput.class);

where FooInput is defined with something like:

    public class FooInput extends TextInput {
      public void map(...) { ... }
    }

In other words, if you're going to define custom mappers anyway, then it's
no more work to define custom input formats.

> should allow to specify different inputformat classes for different input dirs for Map/Reduce jobs
> --------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-372
>                 URL: http://issues.apache.org/jira/browse/HADOOP-372
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.4.0
>         Environment: all
>            Reporter: Runping Qi
>         Assigned To: Owen O'Malley
>
> Right now, the user can specify multiple input directories for a map
> reduce job. However, the files under all the directories are assumed to be
> in the same format, with the same key/value classes. This proves to be a
> serious limitation in many situations.
>
> Here is an example. Suppose I have three simple tables:
> one has URLs and their rank values (page ranks),
> another has URLs and their classification values,
> and the third one has the URL metadata, such as crawl status, last crawl
> time, etc.
>
> Suppose now I need a job to generate a list of URLs to be crawled next.
> The decision depends on the info in all three tables.
> Right now, there is no easy way to accomplish this.
>
> However, this job can be done if the framework allows different input
> formats to be specified for different input dirs.
> Suppose my three tables are in the following directories, respectively:
> rankTable, classificationTable, and metaDataTable.
> If we extend the JobConf class with the following method (as Owen
> suggested to me):
>
>     addInputPath(aPath, anInputFormatClass, anInputKeyClass,
>                  anInputValueClass)
>
> then I can specify my job as follows:
>
>     addInputPath(rankTable, SequenceFileInputFormat.class, UTF8.class,
>                  DoubleWritable.class)
>     addInputPath(classificationTable, TextInputFormat.class, UTF8.class,
>                  UTF8.class)
>     addInputPath(metaDataTable, SequenceFileInputFormat.class, UTF8.class,
>                  MyRecord.class)
>
> If an input directory is added through the current API, it will have the
> same meaning as it has now.
> Thus this extension will not affect any applications that do not need this
> new feature.
>
> It is relatively easy for the M/R framework to create an appropriate
> record reader for a map task based on the above information.
> And that is the only change needed for supporting this extension.
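
To make the proposal in the issue description concrete, here is a minimal, self-contained sketch of the bookkeeping it implies: each input directory is registered together with a format, key class, and value class, and the framework later looks up that triple for the file a map task is about to read. This is not Hadoop code. PerPathInputSketch, JobSpec, InputSpec, lookup, and the "legacyDir" directory are hypothetical names invented for illustration; class names are held as plain strings so the example compiles without Hadoop on the classpath; and the prefix match in lookup is an assumption about how a file could be tied back to its input directory.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: models per-directory input settings, not the real JobConf. */
public class PerPathInputSketch {

    /** The triple the proposal attaches to each input directory (hypothetical type). */
    public static final class InputSpec {
        public final String formatName;
        public final String keyClassName;
        public final String valueClassName;

        public InputSpec(String formatName, String keyClassName, String valueClassName) {
            this.formatName = formatName;
            this.keyClassName = keyClassName;
            this.valueClassName = valueClassName;
        }
    }

    /** Hypothetical stand-in for a JobConf extended as the issue suggests. */
    public static final class JobSpec {
        private final Map<String, InputSpec> specsByDir = new LinkedHashMap<>();
        private final InputSpec jobWideDefault;

        public JobSpec(InputSpec jobWideDefault) {
            this.jobWideDefault = jobWideDefault;
        }

        /** Proposed form: the directory carries its own format and key/value classes. */
        public void addInputPath(String dir, String format, String keyClass, String valueClass) {
            specsByDir.put(dir, new InputSpec(format, keyClass, valueClass));
        }

        /** Existing form: the directory keeps the job-wide settings, as before. */
        public void addInputPath(String dir) {
            specsByDir.put(dir, jobWideDefault);
        }

        /** Assumed lookup: match the file against each registered directory prefix. */
        public InputSpec lookup(String filePath) {
            for (Map.Entry<String, InputSpec> entry : specsByDir.entrySet()) {
                if (filePath.startsWith(entry.getKey() + "/")) {
                    return entry.getValue();
                }
            }
            throw new IllegalArgumentException("No input path covers " + filePath);
        }
    }

    /** Builds the three-table job from the issue, plus one directory added the old way. */
    public static JobSpec exampleJob() {
        // The job-wide default here is a placeholder, not a statement about Hadoop's defaults.
        JobSpec job = new JobSpec(new InputSpec("JobWideFormat", "JobWideKey", "JobWideValue"));
        job.addInputPath("rankTable", "SequenceFileInputFormat", "UTF8", "DoubleWritable");
        job.addInputPath("classificationTable", "TextInputFormat", "UTF8", "UTF8");
        job.addInputPath("metaDataTable", "SequenceFileInputFormat", "UTF8", "MyRecord");
        job.addInputPath("legacyDir"); // hypothetical directory using the current API
        return job;
    }

    public static void main(String[] args) {
        JobSpec job = exampleJob();
        String[] files = {
            "rankTable/part-00000",
            "classificationTable/part-00000",
            "metaDataTable/part-00000",
            "legacyDir/part-00000"
        };
        for (String file : files) {
            InputSpec spec = job.lookup(file);
            System.out.println(file + " -> " + spec.formatName
                + " <" + spec.keyClassName + ", " + spec.valueClassName + ">");
        }
    }
}
```

In this sketch the only framework-side work is the lookup that picks a spec for a split, which matches the issue's claim that creating the right record reader is the one change needed. Doug's alternative would fold the mapper into the input format class itself, so the registry would need only the path and the format.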