[
http://issues.apache.org/jira/browse/HADOOP-372?page=comments#action_12423908 ]
Owen O'Malley commented on HADOOP-372:
--------------------------------------
I like the idea of moving the getInput{Key,Value}Class to the RecordReader,
since there is really no need for the user to specify it except as potential
error checking.
The goal is to support joins across different kinds of tables. If you have two
tables that both contain URLs, you could write a mapper for each table and
produce the join by using the URL as the map output key. I agree that you
could implement it by wrapping all of the input types in ObjectWritable, but
you would end up using a lot of "instanceof" checks to work out each type
dynamically and dispatch to the right kind of map.
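To make that cost concrete, here is a rough sketch of the single-mapper
approach in plain Java. RankRecord, CategoryRecord, and SingleMapperDispatch
are hypothetical stand-ins, not Hadoop classes; a real job would first unwrap
the value from an ObjectWritable.

```java
// Hypothetical record types standing in for rows of two different input tables.
final class RankRecord {
    final String url;
    final double rank;
    RankRecord(String url, double rank) { this.url = url; this.rank = rank; }
}

final class CategoryRecord {
    final String url;
    final String category;
    CategoryRecord(String url, String category) { this.url = url; this.category = category; }
}

final class SingleMapperDispatch {
    // A single map step shared by all inputs has to test the runtime type of
    // every value before it can do any table-specific work.
    static String map(Object value) {
        if (value instanceof RankRecord) {
            RankRecord r = (RankRecord) value;
            return r.url + "\trank=" + r.rank;           // URL is the join key
        } else if (value instanceof CategoryRecord) {
            CategoryRecord c = (CategoryRecord) value;
            return c.url + "\tcategory=" + c.category;   // URL is the join key
        }
        String type = (value == null) ? "null" : value.getClass().getName();
        throw new IllegalArgumentException("Unexpected input type: " + type);
    }
}
```

Every new table adds another branch to the same method, which is the
dispatching overhead that a mapper per input would avoid.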
I would propose being able to add input directories with specific processing
chains:
JobConf conf = new JobConf();
conf.addInputPath(new Path("dir1"));
conf.addInputPath(new Path("dir2"));
conf.addInputPath(new Path("my-input"), MyInputFormat.class, MyMapper.class);
conf.addInputPath(new Path("other-input"), OtherInputFormat.class,
                  OtherMapper.class);
So "dir1" and "dir2" would be processed with the default InputFormat and Mapper.
"my-input" would be processed with MyInputFormat and MyMapper and "other-input"
would be processed with OtherInputFormat and OtherMapper.
This allows it to be backward compatible with a minimum of fuss for users that
don't want to use multiple input sources.
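The fallback behaviour described above could be sketched roughly as follows.
This is a standalone illustration, not the JobConf implementation:
InputRegistry and its methods are hypothetical, and class names are held as
plain strings so the example runs without Hadoop on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical registry: maps each input path to its processing chain,
// falling back to the job-wide defaults when none was given.
final class InputRegistry {
    private final String defaultFormat;
    private final String defaultMapper;
    // A null value marks a path that was added through the one-argument call.
    private final Map<String, String[]> chains = new LinkedHashMap<>();

    InputRegistry(String defaultFormat, String defaultMapper) {
        this.defaultFormat = defaultFormat;
        this.defaultMapper = defaultMapper;
    }

    // Existing-style call: the path uses the default InputFormat and Mapper.
    void addInputPath(String path) {
        chains.put(path, null);
    }

    // Proposed-style call: the path carries its own InputFormat and Mapper.
    void addInputPath(String path, String format, String mapper) {
        chains.put(path, new String[] { format, mapper });
    }

    String formatFor(String path) {
        String[] chain = lookup(path);
        return chain == null ? defaultFormat : chain[0];
    }

    String mapperFor(String path) {
        String[] chain = lookup(path);
        return chain == null ? defaultMapper : chain[1];
    }

    private String[] lookup(String path) {
        if (!chains.containsKey(path)) {
            throw new IllegalArgumentException("Not an input path: " + path);
        }
        return chains.get(path);
    }
}
```

Jobs that only ever call the one-argument form see no change in behaviour,
which is the backward-compatibility point.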
Clearly, the class names would need to be encoded along with the paths when
the input paths are encoded as strings.
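One possible encoding, purely as a sketch: each entry is either a bare path or
"path;InputFormatClass;MapperClass", and entries are joined with commas. The
separators, InputSpec, and InputSpecCodec are assumptions made for this
example, and it assumes paths and class names contain neither ',' nor ';'.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical value type: one input path plus an optional processing chain.
final class InputSpec {
    final String path;
    final String inputFormat; // null means "use the job default"
    final String mapper;      // null means "use the job default"

    InputSpec(String path, String inputFormat, String mapper) {
        if ((inputFormat == null) != (mapper == null)) {
            throw new IllegalArgumentException("Give both class names or neither");
        }
        this.path = path;
        this.inputFormat = inputFormat;
        this.mapper = mapper;
    }

    boolean usesDefaults() {
        return inputFormat == null;
    }
}

final class InputSpecCodec {
    // Bare paths stay exactly as they are, so a string holding only
    // default-chain paths is a plain comma-separated path list.
    static String encode(List<InputSpec> specs) {
        StringBuilder sb = new StringBuilder();
        for (InputSpec s : specs) {
            if (sb.length() > 0) sb.append(',');
            sb.append(s.path);
            if (!s.usesDefaults()) {
                sb.append(';').append(s.inputFormat).append(';').append(s.mapper);
            }
        }
        return sb.toString();
    }

    static List<InputSpec> decode(String encoded) {
        List<InputSpec> specs = new ArrayList<>();
        if (encoded.isEmpty()) return specs;
        for (String entry : encoded.split(",")) {
            String[] fields = entry.split(";", -1);
            if (fields.length == 1) {
                specs.add(new InputSpec(fields[0], null, null));
            } else if (fields.length == 3) {
                specs.add(new InputSpec(fields[0], fields[1], fields[2]));
            } else {
                throw new IllegalArgumentException("Malformed entry: " + entry);
            }
        }
        return specs;
    }
}
```

A real implementation would also need to escape the separator characters if
they can appear in paths.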
> should allow to specify different inputformat classes for different input
> dirs for Map/Reduce jobs
> --------------------------------------------------------------------------------------------------
>
> Key: HADOOP-372
> URL: http://issues.apache.org/jira/browse/HADOOP-372
> Project: Hadoop
> Issue Type: New Feature
> Components: mapred
> Affects Versions: 0.4.0
> Environment: all
> Reporter: Runping Qi
>
> Right now, the user can specify multiple input directories for a map reduce
> job.
> However, the files under all the directories are assumed to be in the same
> format,
> with the same key/value classes. This proves to be a serious limit in many
> situations.
> Here is an example. Suppose I have three simple tables:
> one has URLs and their rank values (page ranks),
> another has URLs and their classification values,
> and the third one has the URL meta data such as crawl status, last crawl
> time, etc.
> Suppose now I need a job to generate a list of URLs to be crawled next.
> The decision depends on the info in all the three tables.
> Right now, there is no easy way to accomplish this.
> However, this job can be done if the framework allows different
> inputformats to be specified for different input dirs.
> Suppose my three tables are in the following directories, respectively:
> rankTable, classificationTable, and metaDataTable.
> If we extend JobConf class with the following method (as Owen suggested to
> me):
> addInputPath(aPath, anInputFormatClass, anInputKeyClass,
> anInputValueClass)
> Then I can specify my job as follows:
> addInputPath(rankTable, SequenceFileInputFormat.class, UTF8.class,
> DoubleWritable.class)
> addInputPath(classificationTable, TextInputFormat.class, UTF8.class,
> UTF8.class)
> addInputPath(metaDataTable, SequenceFileInputFormat.class, UTF8.class,
> MyRecord.class)
> If an input directory is added through the current API, it will have the same
> meaning as it is now.
> Thus this extension will not affect any applications that do not need this
> new feature.
> It is relatively easy for the M/R framework to create an appropriate record
> reader for a map task based on the above information.
> And that is the only change needed for supporting this extension.