This sounds good to me! A couple of questions. If the user uses the default JobInput implementation, how does the user specify the input directory? Currently, this is done through methods of JobConf. Will that change?
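(For reference, today that looks roughly like the following; MyJob and the paths are placeholders, and I may be slightly off on the exact JobConf method names:)

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    JobConf conf = new JobConf(MyJob.class);         // MyJob: placeholder driver class
    conf.setInputPath(new Path("/user/foo/input"));  // set the input directory
    conf.addInputPath(new Path("/user/foo/more"));   // append an additional input directory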
> But we don't expect end users to be implementing JobInput. Rather, they
> might job.setJobInputClass(MultiJobInput.class) and then use a static
> utility method like:
>
> void MultiJobInput.addInput(job, dir, recordReaderClass, mapperClass);

But can the end user implement such a class if s/he wants to? In other words, will the user have only two choices: the default implementation of JobInput and MultiJobInput? Or can the user implement his own:

    public class MyMultipleInput implements JobInput { };

Runping

> -----Original Message-----
> From: Doug Cutting [mailto:[EMAIL PROTECTED]
> Sent: Friday, August 25, 2006 4:13 PM
> To: hadoop-dev@lucene.apache.org
> Subject: Re: [jira] Commented: (HADOOP-372) should allow to specify
> different inputformat classes for different input dirs for Map/Reduce jobs
>
> Arkady Borkovsky wrote:
> > Defining interfaces that are easy to understand is more than just
> > syntactic sugar, and usability should not be sacrificed to orthogonality.
>
> A simple, easy-to-use and easy-to-understand API can be constructed on
> top of a perhaps not-quite-so-simple, yet more general API. The converse
> is not always true. So I would prefer that the input primitives, the
> things called by the system, are abstract, free of notions like file
> names, directories, positions, offsets, etc., and that such features are
> supported by library/utility code.
>
> The case has been made that it would be good to dynamically determine
> the map function, rather than having a fixed map function for all splits
> of a job. We already intend to make splits abstract (HADOOP-451). So it
> makes sense to make the mapper a function of the split. Thus a given
> job's input is determined by methods like the following:
>
> Split[] getSplits(job);
> RecordReader getRecordReader(job, split);
> Mapper getMapper(job, split, recordReader);
>
> The first two are currently InputFormat methods. The latter does not
> yet exist.
>
> In the normal case, all mappers are instances of the same class. When
> they're different, the splits will be somehow different (in order to
> indicate which mapper to use), and the record readers may also be
> different. So whenever one uses a non-default implementation of
> getMapper() it will require coordination with the getSplits() and
> perhaps the getRecordReader() implementations. Since they must be
> coordinated, it makes sense to combine these methods in a single
> interface, called something like JobInput.
>
> But we don't expect end users to be implementing JobInput. Rather, they
> might job.setJobInputClass(MultiJobInput.class) and then use a static
> utility method like:
>
> void MultiJobInput.addInput(job, dir, recordReaderClass, mapperClass);
>
> This will set some properties that MultiJobInput's getSplits(),
> getRecordReader() and getMapper() implementations will use.
>
> So we can still have a simple, easy-to-use API, without hardwiring a lot
> of assumptions about input/output formats into the kernel.
>
> Does this make sense? Does it sound reasonable?
>
> Doug
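P.S. To make sure I read the proposal correctly, here is a rough sketch of what I imagine the interface and its use would look like. The names and signatures are only my reading of the discussion above, not existing classes; Split is the abstract split type from HADOOP-451.

    import java.io.IOException;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.RecordReader;

    // Proposed interface, combining the three methods listed above.
    public interface JobInput {
      Split[] getSplits(JobConf job) throws IOException;
      RecordReader getRecordReader(JobConf job, Split split) throws IOException;
      Mapper getMapper(JobConf job, Split split, RecordReader reader);
    }

    // Convenience layer for the common multi-input case, as described above:
    //   job.setJobInputClass(MultiJobInput.class);
    //   MultiJobInput.addInput(job, dir, MyRecordReader.class, MyMapper.class);
    //
    // My question is whether a user could instead do
    //   job.setJobInputClass(MyMultipleInput.class);
    // where MyMultipleInput is his own implementation of JobInput.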