Hey all
Looking at (converting to) the new .20 API, I see that the static
config setters take Job or JobContext, not Configuration.
>> public static Path[] getInputPaths(JobContext context)
I get the utility of this from the perspective of a user writing
Hadoop jobs. A lot fewer job.getConfiguration() calls.
But I do find it odd that FileInputFormat, for example, knows about
Job and JobContext (and their children) when it feels as if it should
only know about Configuration (considering that all these methods do
is get/set properties).
From my perspective, Cascading is in part not much more than a fancy
Configuration builder, and its internals really only care about
Configuration, since they may be asked to provide a property outside
the context of a job.
So, being a builder, Cascading passes a Configuration object around
the system at different stages (planning, execution, etc.) in order to
accumulate properties from nested components.
With the new API, it all adds up to the need to wrap Configuration in
a Job/JobContext and then unwrap it so the Configuration instance can
move down the configuration chain.
But this isn't really possible, simply because new Job( configuration )
copies the configuration into a default property collection, and any
set() on the Job won't influence those defaults. The result is a lot
of Configuration algebra to merge the final results (or a bit of
reflection).
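To make the merge-back dance concrete, here is a minimal sketch. It uses
java.util.Properties as a stand-in for Hadoop's Configuration so it is
self-contained; the property name "mapred.input.dir" is assumed to be what
addInputPath() writes to, and the wrap/copy behavior mimics what new
Job(conf) does internally.

```java
import java.util.Properties;

// Sketch of the "Configuration algebra" described above, using
// java.util.Properties as a stand-in for Hadoop's Configuration.
// new Job(conf) copies the conf it is handed, so any set() done by a
// static setter like FileInputFormat.addInputPath() lands in the copy
// and must be merged back into the original by hand.
public class MergeBack {
    public static void main(String[] args) {
        Properties conf = new Properties();   // the conf moving down our chain
        conf.setProperty("some.upstream.key", "value");

        // wrap: roughly what new Job(conf) does internally
        Properties jobCopy = new Properties();
        jobCopy.putAll(conf);

        // the static setter mutates the copy, not the original
        // ("mapred.input.dir" assumed as the property addInputPath() sets)
        jobCopy.setProperty("mapred.input.dir", "/input");

        // unwrap/merge: copy everything back so the original conf sees it
        for (String key : jobCopy.stringPropertyNames()) {
            conf.setProperty(key, jobCopy.getProperty(key));
        }

        if (!"/input".equals(conf.getProperty("mapred.input.dir"))) {
            throw new AssertionError("merge-back failed");
        }
        System.out.println(conf.getProperty("mapred.input.dir"));
    }
}
```

Multiply that loop by every nested component that wants to contribute
properties, and it adds up fast.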
Would it make sense to accept Configuration instead of JobContext and
its subclasses?
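Something like these hypothetical overloads is what I have in mind (not
the current API; the existing Job-taking variants could simply delegate):

```java
// hypothetical Configuration-based overloads
public static void addInputPath(Configuration conf, Path path) throws IOException;
public static Path[] getInputPaths(Configuration conf);

// the existing Job/JobContext variants would just delegate:
public static void addInputPath(Job job, Path path) throws IOException {
    addInputPath(job.getConfiguration(), path);
}
```

That would let builder-style code work against Configuration directly,
while job-writing users keep the convenience methods.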
You could argue I should just use JobContext in my APIs. But again,
many of my subsystems shouldn't really know of JobContext; they only
care about manipulating the Configuration object. Further, the use of
Job, JobContext, TaskAttemptContext, etc. in the static setters is
inconsistent.
>> public static void addInputPath(Job job, Path path) throws IOException {
I wonder if Hive and Pig (will) have similar issues.
cheers,
chris
--
Chris K Wensel
ch...@concurrentinc.com
http://www.concurrentinc.com