[
https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Doug Cutting updated HADOOP-1558:
---------------------------------
Status: Open (was: Patch Available)
This does not seem like the best solution. We need inputs and outputs to be
user-extensible, including things like the output 'commit' hook added here. So
jobs must be able to provide custom implementations of these methods. But, for
reliability, we've worked hard to remove all user code from the jobtracker.
So these must be run either in a separate jvm on the jobtracker or as new task
subclasses run on a tasktracker. I think the latter is preferable, since it
would avoid a lot of code duplication (keeping track of child processes) but
would require chasing down all of the places in the code where we assume that
tasks are either map or reduce, which may not be easy.
A third option might be to run them in JobClient. The initialize() method can
certainly be run there, no? The commit() method is trickier, since we want it
to be run even if the JobClient process dies. Perhaps we could have the
jobtracker advance jobs to a state where they're complete but not committed,
and then, when a JobClient polls for completion and finds it in this state, it
runs the commit method for the job, regardless of whether it was the originally
submitting jvm. Could something like that work? Probably not, but it's worth
consideration...
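To make the "complete but not committed" idea concrete, here is a minimal
sketch in plain Java. Everything in it is hypothetical: JobState, TrackedJob,
PollingClient and the extra COMMITTING state are invented for illustration and
are not existing Hadoop classes. The COMMITTING state is an added assumption
so that only one polling client runs the commit; the sketch does not handle a
client that dies in the middle of the commit.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical job states; SUCCEEDED_UNCOMMITTED is "complete but not committed". */
enum JobState { RUNNING, SUCCEEDED_UNCOMMITTED, COMMITTING, COMMITTED }

/** Stand-in for the jobtracker's record of one job. It never runs user code. */
class TrackedJob {
    private final AtomicReference<JobState> state =
        new AtomicReference<>(JobState.RUNNING);

    /** Called when all tasks have finished successfully. */
    void tasksFinished() {
        state.compareAndSet(JobState.RUNNING, JobState.SUCCEEDED_UNCOMMITTED);
    }

    /** Returns true for exactly one caller, which then owns the commit. */
    boolean tryClaimCommit() {
        return state.compareAndSet(JobState.SUCCEEDED_UNCOMMITTED, JobState.COMMITTING);
    }

    /** Called by the claiming client once its commit has returned. */
    void commitDone() {
        state.compareAndSet(JobState.COMMITTING, JobState.COMMITTED);
    }

    JobState state() {
        return state.get();
    }
}

/** Stand-in for any JobClient process, not only the one that submitted the job. */
class PollingClient {
    /** Polls once; runs the user's commit hook locally if this client claims it. */
    static boolean pollOnce(TrackedJob job, Runnable userCommit) {
        if (!job.tryClaimCommit()) {
            return false; // still running, or another client owns the commit
        }
        userCommit.run(); // user code executes in the client jvm
        job.commitDone();
        return true;
    }
}
```

The jobtracker side only flips states, so no user code runs there. The open
question from the paragraph above remains: if no client ever polls, or the
claiming client dies, the job stays uncommitted.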
Finally, your interface still contains file-specifics. We must not assume that
inputs or outputs are files. We want to permit input and output from, e.g.,
HBase. So a top-level output interface must not use Path.
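As an illustration of a top-level output interface with no Path in it, here
is a rough sketch. The names JobOutput and InMemoryTableOutput are made up
for this example, and a plain Map stands in for JobConf so that the code
compiles on its own; none of this is the interface from the attached patch.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical storage-neutral contract: nothing here mentions files or Path. */
interface JobOutput {
    /** Called to initialize output for this job. */
    void initialize(Map<String, String> conf) throws IOException;

    /** Called to finalize output for this job. */
    void commit(Map<String, String> conf) throws IOException;
}

/** Toy table-backed output, standing in for a non-file store such as HBase. */
class InMemoryTableOutput implements JobOutput {
    private final Map<String, String> staged = new HashMap<>();
    private final Map<String, String> visible = new HashMap<>();

    @Override
    public void initialize(Map<String, String> conf) {
        staged.clear(); // discard rows left behind by a crashed earlier run
    }

    /** Rows written during the job are staged, not yet visible to readers. */
    public void write(String key, String value) {
        staged.put(key, value);
    }

    @Override
    public void commit(Map<String, String> conf) {
        visible.putAll(staged); // publish the staged rows
        staged.clear();
    }

    public Map<String, String> visibleRows() {
        return Collections.unmodifiableMap(visible);
    }
}
```

A file-based implementation could satisfy the same two methods with a
temporary directory and a rename, keeping Path as its own private detail.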
> changes to OutputFormat to work on temporary directory to enable re-running
> crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-1558
> URL: https://issues.apache.org/jira/browse/HADOOP-1558
> Project: Hadoop
> Issue Type: Improvement
> Components: mapred
> Environment: all
> Reporter: Alejandro Abdelnur
> Fix For: 0.14.0
>
> Attachments: hadoop-1558-JUN1007-1934.txt,
> hadoop-1558-JUN1107-1533.txt
>
>
> Add OutputFormat methods like:
> /** Called to initialize output for this job. */
> void initialize(JobConf job) throws IOException;
> /** Called to finalize output for this job. */
> void commit(JobConf job) throws IOException;
> In the base implementation for FileSystem output, initialize() might then
> create a temporary directory for the job, removing any that already exists,
> and commit could rename the temporary output directory to the final name.
> The existing checkOutputSpecs() would continue to throw an exception if the
> final output already exists.
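
The temporary-directory behaviour described in the quoted issue can be
sketched as follows. This is an assumption-laden illustration: it uses
java.nio.file on a local filesystem rather than Hadoop's FileSystem API, and
the class name TempDirOutputSketch and the "_temporary_" prefix are invented
for the example.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical file-based output: write to a temp directory, rename on commit. */
class TempDirOutputSketch {
    private final Path finalDir;
    private final Path tempDir;

    TempDirOutputSketch(Path finalDir) {
        this.finalDir = finalDir;
        // Sibling directory, so the final rename stays on one filesystem.
        this.tempDir = finalDir.resolveSibling("_temporary_" + finalDir.getFileName());
    }

    /** Fails if the final output already exists, as checkOutputSpecs() does today. */
    void checkOutputSpecs() throws IOException {
        if (Files.exists(finalDir)) {
            throw new FileAlreadyExistsException(finalDir.toString());
        }
    }

    /** Creates a fresh temporary directory, removing any left by a crashed run. */
    void initialize() throws IOException {
        deleteRecursively(tempDir);
        Files.createDirectories(tempDir);
    }

    /** Directory that tasks write into while the job runs. */
    Path workDir() {
        return tempDir;
    }

    /** Renames the temporary directory to the final output name. */
    void commit() throws IOException {
        Files.move(tempDir, finalDir);
    }

    private static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order lists children before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }
}
```

Because a crashed job leaves only the temporary directory behind, a re-run
passes checkOutputSpecs() and starts from a clean directory in initialize().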