Hey all

Looking at (converting to) the new .20 API, I see that the static config setters take Job or JobContext, not Configuration.
>> public static Path[] getInputPaths(JobContext context)

I get the utility of this from the perspective of a user writing Hadoop jobs: a lot fewer job.getConfiguration() calls.

But I do find it odd that FileInputFormat, for example, knows about Job and JobContext (and their children) when it feels as if it should only know about Configuration (considering that all these methods do is get/set properties).

From my perspective, Cascading is in part not much more than a fancy Configuration builder. And the internals really only care about Configuration, since they may be asked to provide a property outside the context of a job.

So, being a builder, Cascading passes a Configuration object around throughout the system at different stages (planning, execution, etc.) in order to accumulate properties from nested components.

With the new API, this all adds up to the need to wrap the Configuration in a Job/JobContext and then unwrap it so the Configuration instance can move down the configuration chain.

But this isn't really possible, because new Job( configuration ) takes the passed configuration as a default property collection, and any set() on the Job won't influence that original instance. The result is a lot of Configuration algebra to merge the final results back (or a bit of reflection).
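To make that concrete, here's a rough sketch of the dance I mean against the 0.20 org.apache.hadoop.mapreduce API (the /some/input path is just illustrative, and I'm assuming mapred.input.dir is still the property key the input format writes to):

  import java.util.Map;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

  public class ConfBuilderSketch
    {
    public static void main( String[] args ) throws Exception
      {
      Configuration conf = new Configuration();

      // the 0.20 setter only accepts a Job, so the Configuration must be wrapped
      Job job = new Job( conf );
      FileInputFormat.addInputPath( job, new Path( "/some/input" ) );

      // the Job holds its own copy of the properties, so the original conf
      // never sees the new input path...
      System.out.println( conf.get( "mapred.input.dir" ) ); // prints null

      // ...only the copy inside the Job does (prints the qualified path)
      System.out.println( job.getConfiguration().get( "mapred.input.dir" ) );

      // to keep handing the original instance down the configuration chain, the
      // properties must be merged back by hand -- the "Configuration algebra"
      for( Map.Entry<String, String> entry : job.getConfiguration() )
        conf.set( entry.getKey(), entry.getValue() );
      }
    }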

Would it make sense to accept Configuration instead of JobContext and its sub-classes?

You could argue I should just use JobContext in my own APIs. But again, many of my subsystems shouldn't really know about JobContext; they only care about manipulating the Configuration object. Further, the use of Job, JobContext, TaskAttemptContext, etc. in the static setters is inconsistent:
>> public static void addInputPath(Job job, Path path) throws IOException {
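For what it's worth, what I'm picturing is something along these lines. These Configuration-based overloads are purely hypothetical (they don't exist in the 0.20 API), and I'm skipping the comma escaping the real methods do, but they show where the work actually happens:

  // hypothetical only -- sketched against what the existing methods store
  public static void addInputPath( Configuration conf, Path path ) throws IOException
    {
    path = path.getFileSystem( conf ).makeQualified( path );
    String dirs = conf.get( "mapred.input.dir" );
    conf.set( "mapred.input.dir", dirs == null ? path.toString() : dirs + "," + path );
    }

  public static Path[] getInputPaths( Configuration conf )
    {
    String[] dirs = conf.getStrings( "mapred.input.dir", new String[ 0 ] );
    Path[] paths = new Path[ dirs.length ];

    for( int i = 0; i < dirs.length; i++ )
      paths[ i ] = new Path( dirs[ i ] );

    return paths;
    }

Everything here only ever touches the Configuration, which is the point.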

I wonder if Hive and Pig (will) have similar issues.

cheers,
chris

--
Chris K Wensel
ch...@concurrentinc.com
http://www.concurrentinc.com
