[ http://issues.apache.org/jira/browse/HADOOP-372?page=comments#action_12429529 ]

Owen O'Malley commented on HADOOP-372:
--------------------------------------

I really think this should be done outside of the framework, without adding four new, fairly complicated public methods to JobConf. I also don't like using types to distinguish which Mapper to use. A typical case will be two directories that both use TextInputFormat, where you want to apply separate Mappers to pull out different fields. I think it is much nicer to define the entire pipeline, rather than using types to infer it:

  InputSplit (from an input directory) -> RecordReader (via InputFormat) -> Mapper

> should allow to specify different inputformat classes for different input dirs for Map/Reduce jobs
> --------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-372
>                 URL: http://issues.apache.org/jira/browse/HADOOP-372
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.4.0
>         Environment: all
>            Reporter: Runping Qi
>         Assigned To: Runping Qi
>
> Right now, the user can specify multiple input directories for a map reduce job.
> However, the files under all the directories are assumed to be in the same format,
> with the same key/value classes. This proves to be a serious limitation in many situations.
> Here is an example. Suppose I have three simple tables:
> one has URLs and their rank values (page ranks),
> another has URLs and their classification values,
> and the third one has the URL metadata such as crawl status, last crawl time, etc.
> Suppose now I need a job to generate a list of URLs to be crawled next.
> The decision depends on the information in all three tables.
> Right now, there is no easy way to accomplish this.
> However, this job can be done if the framework allows specifying different
> input formats for different input dirs.
> Suppose my three tables are in the following directories, respectively:
> rankTable, classificationTable, and metaDataTable.
> If we extend the JobConf class with the following method (as Owen suggested to me):
>     addInputPath(aPath, anInputFormatClass, anInputKeyClass, anInputValueClass)
> Then I can specify my job as follows:
>     addInputPath(rankTable, SequenceFileInputFormat.class, UTF8.class, DoubleWritable.class)
>     addInputPath(classificationTable, TextInputFormat.class, UTF8.class, UTF8.class)
>     addInputPath(metaDataTable, SequenceFileInputFormat.class, UTF8.class, MyRecord.class)
> If an input directory is added through the current API, it will have the same meaning as it has now.
> Thus this extension will not affect any applications that do not need this new feature.
> It is relatively easy for the M/R framework to create an appropriate record reader
> for a map task based on the above information.
> And that is the only change needed to support this extension.
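
To make the "define the entire pipeline" idea from the comment concrete, here is a minimal sketch in plain Java. It is not Hadoop code and does not use any Hadoop API: the class PerDirectoryPipelines, its addInputPath and map methods, and the tab-separated record layout are all hypothetical names and assumptions chosen for illustration. Each input directory is bound explicitly to a parser (standing in for the InputFormat/RecordReader step) and to its own mapper, so nothing is inferred from key/value types.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/**
 * Hypothetical sketch, not part of Hadoop: binds each input directory to an
 * explicit parser -> mapper pipeline instead of inferring the mapper from types.
 */
final class PerDirectoryPipelines {

    /** Parser (stand-in for InputFormat/RecordReader) plus mapper for one directory. */
    private static final class Pipeline {
        final Function<String, String[]> parser;
        final Function<String[], Map.Entry<String, String>> mapper;

        Pipeline(Function<String, String[]> parser,
                 Function<String[], Map.Entry<String, String>> mapper) {
            this.parser = parser;
            this.mapper = mapper;
        }
    }

    private final Map<String, Pipeline> byDirectory = new LinkedHashMap<>();

    /** Registers the pipeline for one input directory (hypothetical method name). */
    void addInputPath(String dir,
                      Function<String, String[]> parser,
                      Function<String[], Map.Entry<String, String>> mapper) {
        byDirectory.put(dir, new Pipeline(parser, mapper));
    }

    /** Applies the pipeline registered for dir to each raw record, in order. */
    List<Map.Entry<String, String>> map(String dir, List<String> rawRecords) {
        Pipeline p = byDirectory.get(dir);
        if (p == null) {
            throw new IllegalArgumentException("no pipeline registered for " + dir);
        }
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (String raw : rawRecords) {
            out.add(p.mapper.apply(p.parser.apply(raw)));
        }
        return out;
    }

    public static void main(String[] args) {
        PerDirectoryPipelines job = new PerDirectoryPipelines();
        // Both directories hold tab-separated text (an assumption), but each
        // mapper tags the second field differently so a reducer could tell them apart.
        job.addInputPath("rankTable", line -> line.split("\t"),
                f -> new SimpleEntry<String, String>(f[0], "rank=" + f[1]));
        job.addInputPath("classificationTable", line -> line.split("\t"),
                f -> new SimpleEntry<String, String>(f[0], "class=" + f[1]));

        System.out.println(job.map("rankTable", Arrays.asList("http://a.example/\t0.75")));
        System.out.println(job.map("classificationTable", Arrays.asList("http://a.example/\tnews")));
    }
}
```

Because every mapper emits the URL as the key, records from all directories would meet at the same reducer, which is what the crawl-list example in the issue needs. Whether this binding lives in JobConf or in a helper outside the framework is exactly the design question under discussion; the sketch takes no position on that.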
