Arkady Borkovsky wrote:
Defining interfaces that are easy to understand is more than just
syntactic sugar, and usability should not be sacrificed to orthogonality.
A simple API that is easy to use and understand can be built on top of a
more general API that is perhaps not quite so simple. The converse is not
always true. So I would prefer that the input primitives, the things
called by the system, are abstract, free of notions like file names,
directories, positions, offsets, etc., and that such features are
supported by library/utility code.
The case has been made that it would be good to dynamically determine
the map function, rather than having a fixed map function for all splits
of a job. We already intend to make split abstract (HADOOP-451). So it
makes sense to make the mapper a function of the split. Thus a given
job's input is determined by methods like the following:
Split[] getSplits(job);
RecordReader getRecordReader(job, split);
Mapper getMapper(job, split, recordReader);
The first two are currently InputFormat methods. The third does not
yet exist.
In the normal case, all mappers are instances of the same class. When
they're different, then the splits will be somehow different (in
order to indicate which mapper to use), and the record readers may also
be different. So whenever one uses a non-default implementation of
getMapper() it will require coordination with the getSplits() and
perhaps the getRecordReader() implementations. Since they must be
coordinated, it makes sense to combine these methods in a single
interface, called something like JobInput.
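
To make the shape of the proposal concrete, here is a minimal sketch of what a combined interface might look like, along with an implementation for the normal case where every split uses the same mapper class. Everything in it is an assumption for illustration: Job, Split, RecordReader and Mapper are empty stand-ins rather than the real Hadoop types, and UniformJobInput, LineMapper and the "input.splits" property are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// All types here are hypothetical stand-ins, not Hadoop's real classes.
class JobInputSketch {
    interface Split {}
    interface RecordReader {}
    interface Mapper {}

    /** Stand-in for a job configuration: just a bag of string properties. */
    static final class Job {
        final Map<String, String> props = new HashMap<>();
    }

    /** The three coordinated methods gathered into a single interface. */
    interface JobInput {
        Split[] getSplits(Job job);
        RecordReader getRecordReader(Job job, Split split);
        Mapper getMapper(Job job, Split split, RecordReader reader);
    }

    static final class LineMapper implements Mapper {}

    /** The normal case: every split gets the same kind of reader and mapper. */
    static final class UniformJobInput implements JobInput {
        @Override
        public Split[] getSplits(Job job) {
            // "input.splits" is an assumed property name; it defaults to one split.
            int n = Integer.parseInt(job.props.getOrDefault("input.splits", "1"));
            Split[] splits = new Split[n];
            for (int i = 0; i < n; i++) {
                splits[i] = new Split() {};
            }
            return splits;
        }

        @Override
        public RecordReader getRecordReader(Job job, Split split) {
            return new RecordReader() {};
        }

        @Override
        public Mapper getMapper(Job job, Split split, RecordReader reader) {
            // The split is ignored here, so all mappers share one class.
            return new LineMapper();
        }
    }

    public static void main(String[] args) {
        Job job = new Job();
        job.props.put("input.splits", "3");
        JobInput input = new UniformJobInput();
        // This is how the system might drive the three methods for each split.
        for (Split split : input.getSplits(job)) {
            RecordReader reader = input.getRecordReader(job, split);
            Mapper mapper = input.getMapper(job, split, reader);
            System.out.println(mapper.getClass().getSimpleName());
        }
    }
}
```

A non-default implementation would inspect the split inside getMapper(), which is why the three methods have to agree on what a split carries.
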
But we don't expect end users to be implementing JobInput. Rather, they
might call job.setJobInputClass(MultiJobInput.class) and then use a static
utility method like:
void MultiJobInput.addInput(job, dir, recordReaderClass, mapperClass);
This will set some properties that MultiJobInput's getSplits(),
getRecordReader() and getMapper() implementations will use.
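
As a rough sketch of how such a utility could work, the code below has addInput() record each input under numbered property keys, and the three methods read those keys back. It is only one possible design under stated assumptions: the "multi.input.*" key names, the one-split-per-directory rule, the reflection-based instantiation, and all stand-in types (Job, Split, TextReader, LogMapper, and so on) are hypothetical, and the job.setJobInputClass() call is left out.

```java
import java.util.HashMap;
import java.util.Map;

// All types here are hypothetical stand-ins, not Hadoop's real classes.
class MultiJobInputSketch {
    interface RecordReader {}
    interface Mapper {}

    /** Stand-in for a job configuration: just a bag of string properties. */
    static final class Job {
        final Map<String, String> props = new HashMap<>();
    }

    /** A split that remembers which registered input it belongs to. */
    static final class Split {
        final int inputIndex;
        final String dir;
        Split(int inputIndex, String dir) {
            this.inputIndex = inputIndex;
            this.dir = dir;
        }
    }

    static final class MultiJobInput {
        private static final String PREFIX = "multi.input."; // assumed key prefix
        private static final String COUNT = PREFIX + "count";

        /** Records one (dir, reader, mapper) triple as job properties. */
        static void addInput(Job job, String dir,
                             Class<? extends RecordReader> readerClass,
                             Class<? extends Mapper> mapperClass) {
            int i = Integer.parseInt(job.props.getOrDefault(COUNT, "0"));
            job.props.put(PREFIX + i + ".dir", dir);
            job.props.put(PREFIX + i + ".reader", readerClass.getName());
            job.props.put(PREFIX + i + ".mapper", mapperClass.getName());
            job.props.put(COUNT, Integer.toString(i + 1));
        }

        /** Simplification: one split per registered directory. */
        Split[] getSplits(Job job) {
            int n = Integer.parseInt(job.props.getOrDefault(COUNT, "0"));
            Split[] splits = new Split[n];
            for (int i = 0; i < n; i++) {
                splits[i] = new Split(i, job.props.get(PREFIX + i + ".dir"));
            }
            return splits;
        }

        RecordReader getRecordReader(Job job, Split split) {
            return create(job, split, "reader", RecordReader.class);
        }

        Mapper getMapper(Job job, Split split, RecordReader reader) {
            return create(job, split, "mapper", Mapper.class);
        }

        /** Looks up the class name stored for this split's input and instantiates it. */
        private static <T> T create(Job job, Split split, String kind, Class<T> type) {
            String name = job.props.get(PREFIX + split.inputIndex + "." + kind);
            try {
                return Class.forName(name).asSubclass(type)
                        .getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("cannot create " + name, e);
            }
        }
    }

    // Example reader and mapper classes, purely for illustration.
    static final class TextReader implements RecordReader {}
    static final class LogReader implements RecordReader {}
    static final class TextMapper implements Mapper {}
    static final class LogMapper implements Mapper {}

    public static void main(String[] args) {
        Job job = new Job();
        MultiJobInput.addInput(job, "/data/text", TextReader.class, TextMapper.class);
        MultiJobInput.addInput(job, "/data/logs", LogReader.class, LogMapper.class);
        MultiJobInput input = new MultiJobInput();
        for (Split split : input.getSplits(job)) {
            RecordReader reader = input.getRecordReader(job, split);
            Mapper mapper = input.getMapper(job, split, reader);
            System.out.println(split.dir + " -> " + mapper.getClass().getSimpleName());
        }
    }
}
```

The end user only ever touches addInput(); the coordination between splits, readers and mappers stays inside the utility class.
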
So we can still have a simple, easy-to-use API, without hardwiring a lot
of assumptions about input/output formats into the kernel.
Does this make sense? Does it sound reasonable?
Doug