Arkady Borkovsky wrote:
Defining interfaces that are easy to understand is more than just syntactic sugar, and usability should not be sacrificed to orthogonality.

A simple API that is easy to use and understand can be constructed on top of a perhaps not-quite-so-simple, yet more general API. The converse is not always true. So I would prefer that the input primitives, the things called by the system, are abstract, free of notions like file names, directories, positions, offsets, etc., and that such features are supported by library/utility code.

The case has been made that it would be good to dynamically determine the map function, rather than having a fixed map function for all splits of a job. We already intend to make split abstract (HADOOP-451). So it makes sense to make the mapper a function of the split. Thus a given job's input is determined by methods like the following:

Split[] getSplits(job);
RecordReader getRecordReader(job, split);
Mapper getMapper(job, split, recordReader);

The first two are currently InputFormat methods. The third does not yet exist.

In the normal case, all mappers are instances of the same class. When they're different, then the splits will be somehow different (in order to indicate which mapper to use), and the record readers may also be different. So whenever one uses a non-default implementation of getMapper() it will require coordination with the getSplits() and perhaps the getRecordReader() implementations. Since they must be coordinated, it makes sense to combine these methods in a single interface, called something like JobInput.
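
To illustrate why the three methods belong together, here is a minimal sketch of what a JobInput might look like. Only the three method names and the name JobInput come from the proposal above; Job, Split, RecordReader and Mapper are reduced to hypothetical stand-ins, and TaggedSplit and TaggedJobInput are invented purely for this example.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-ins for the framework types; none of these shapes
// are taken from Hadoop itself.
class Job {}

interface Split {}

interface RecordReader {
    String next(); // returns null when the split is exhausted
}

interface Mapper {
    String map(String record);
}

// The three coordinated methods, gathered in one interface.
interface JobInput {
    Split[] getSplits(Job job);
    RecordReader getRecordReader(Job job, Split split);
    Mapper getMapper(Job job, Split split, RecordReader reader);
}

// A split that carries its own records plus a flag that selects the mapper.
class TaggedSplit implements Split {
    final List<String> records;
    final boolean upperCase;

    TaggedSplit(boolean upperCase, String... records) {
        this.upperCase = upperCase;
        this.records = Arrays.asList(records);
    }
}

// Toy JobInput: getSplits() decides the tags and getMapper() interprets
// them, which is why the two have to be written together.
class TaggedJobInput implements JobInput {
    public Split[] getSplits(Job job) {
        return new Split[] {
            new TaggedSplit(true, "a", "b"),
            new TaggedSplit(false, "C"),
        };
    }

    public RecordReader getRecordReader(Job job, Split split) {
        Iterator<String> it = ((TaggedSplit) split).records.iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    public Mapper getMapper(Job job, Split split, RecordReader reader) {
        if (((TaggedSplit) split).upperCase) {
            return record -> record.toUpperCase();
        }
        return record -> record.toLowerCase();
    }
}
```

The cast to TaggedSplit in getRecordReader() and getMapper() is the coordination in question: those methods only work with splits produced by the matching getSplits().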

But we don't expect end users to be implementing JobInput. Rather, they might call job.setJobInputClass(MultiJobInput.class) and then use a static utility method like:

void MultiJobInput.addInput(job, dir, recordReaderClass, mapperClass);

This will set some properties that MultiJobInput's getSplits(), getRecordReader() and getMapper() implementations will use.
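
As a rough sketch of how addInput() could record its arguments as job properties, assuming the job configuration behaves like a java.util.Properties: the Job class here is a hypothetical stand-in, and the property names ("multi.input.count", "multi.input.N.dir", and so on), the Input holder and getInputs() are all invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical stand-in for the job configuration.
class Job {
    final Properties props = new Properties();
}

class MultiJobInput {
    // Hypothetical property names, not taken from any real configuration.
    private static final String COUNT = "multi.input.count";
    private static final String PREFIX = "multi.input.";

    // One registered input: a directory plus the class names to use for it.
    static final class Input {
        final String dir;
        final String recordReaderClass;
        final String mapperClass;

        Input(String dir, String recordReaderClass, String mapperClass) {
            this.dir = dir;
            this.recordReaderClass = recordReaderClass;
            this.mapperClass = mapperClass;
        }
    }

    // Appends one (dir, reader, mapper) triple to the job's properties.
    static void addInput(Job job, String dir,
                         Class<?> recordReaderClass, Class<?> mapperClass) {
        int n = Integer.parseInt(job.props.getProperty(COUNT, "0"));
        job.props.setProperty(PREFIX + n + ".dir", dir);
        job.props.setProperty(PREFIX + n + ".reader", recordReaderClass.getName());
        job.props.setProperty(PREFIX + n + ".mapper", mapperClass.getName());
        job.props.setProperty(COUNT, Integer.toString(n + 1));
    }

    // Reads the triples back; getSplits(), getRecordReader() and getMapper()
    // would consult this list to decide what to return for each split.
    static List<Input> getInputs(Job job) {
        int n = Integer.parseInt(job.props.getProperty(COUNT, "0"));
        List<Input> inputs = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            inputs.add(new Input(
                job.props.getProperty(PREFIX + i + ".dir"),
                job.props.getProperty(PREFIX + i + ".reader"),
                job.props.getProperty(PREFIX + i + ".mapper")));
        }
        return inputs;
    }
}
```

Storing class names as strings keeps everything in plain properties, so the settings can travel with the job to wherever the splits are processed.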

So we can still have a simple, easy-to-use API, without hardwiring a lot of assumptions about input/output formats into the kernel.

Does this make sense?  Does it sound reasonable?

Doug
