[jira] Updated: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Doug Cutting (JIRA) Wed, 11 Jul 2007 10:28:25 -0700

     [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Doug Cutting updated HADOOP-1558:
---------------------------------

    Status: Open  (was: Patch Available)

This does not seem like the best solution.  We need inputs and outputs to be 
user-extensible, including things like the output 'commit' hook added here.  So 
jobs must be able to provide custom implementations of these methods.  But, for 
reliability, we've worked hard to remove all user code from the jobtracker.

So these must be run either in a separate JVM on the jobtracker or as new task 
subclasses run on a tasktracker.  I think the latter is preferable, since it 
would avoid duplicating a lot of code (keeping track of child processes).  It 
would, however, require chasing down all of the places in the code where we 
assume that tasks are either map or reduce, which may not be easy.

A third option might be to run them in JobClient.  The initialize() method can 
certainly be run there, no?  The commit() method is trickier, since we want it 
to be run even if the JobClient process dies.  Perhaps we could have the 
jobtracker advance jobs to a state where they're complete but not committed, 
and then, when a JobClient polls for completion and finds it in this state, it 
runs the commit method for the job, regardless of whether it was the originally 
submitting jvm.  Could something like that work?  Probably not, but it's worth 
consideration...
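
To make the "complete but not committed" idea concrete, here is a minimal 
sketch of the state handling it would need.  All of the names below (JobState, 
TrackedJob, pollOnce) are hypothetical and are not existing JobTracker or 
JobClient code; the sketch only assumes that the jobtracker can atomically hand 
the commit to exactly one polling client, and that a failed commit returns the 
job to the uncommitted state so a later poll can retry it.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch: commit is run by whichever client polls first. */
final class CommitOnPollSketch {

    /** Assumed job states; SUCCEEDED_UNCOMMITTED is "complete but not committed". */
    enum JobState { RUNNING, SUCCEEDED_UNCOMMITTED, COMMITTING, COMMITTED }

    /** Stand-in for the jobtracker's record of one job (no user code runs here). */
    static final class TrackedJob {
        private final AtomicReference<JobState> state =
            new AtomicReference<>(JobState.RUNNING);

        /** Called when the last task finishes. */
        void allTasksFinished() {
            state.compareAndSet(JobState.RUNNING, JobState.SUCCEEDED_UNCOMMITTED);
        }

        /** Returns true for exactly one caller, who then owns the commit. */
        boolean tryClaimCommit() {
            return state.compareAndSet(JobState.SUCCEEDED_UNCOMMITTED, JobState.COMMITTING);
        }

        void commitSucceeded() {
            state.compareAndSet(JobState.COMMITTING, JobState.COMMITTED);
        }

        /** Put the job back so that a later poll can retry the commit. */
        void commitFailed() {
            state.compareAndSet(JobState.COMMITTING, JobState.SUCCEEDED_UNCOMMITTED);
        }

        JobState state() {
            return state.get();
        }
    }

    /**
     * What any polling client would do, whether or not it submitted the job:
     * if the job is complete but uncommitted, claim it and run the user's commit.
     */
    static JobState pollOnce(TrackedJob job, Runnable userCommit) {
        if (job.tryClaimCommit()) {
            try {
                userCommit.run();          // user code runs in the client JVM
                job.commitSucceeded();
            } catch (RuntimeException e) {
                job.commitFailed();
                throw e;
            }
        }
        return job.state();
    }
}
```

Note that this sketch does not cover a client that dies while holding the 
COMMITTING state; a real design would need some timeout or lease for that case.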

Finally, your interface still contains file-specifics.  We must not assume that 
inputs or outputs are files.  We want to permit input and output from, e.g., 
HBase.  So a top-level output interface must not use Path.
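
As an illustration of what "must not use Path" could mean, here is a small 
sketch of storage-neutral hooks together with a non-file implementation.  The 
names OutputCommitHooks and StagedTableOutput are hypothetical, a plain 
Map<String, String> stands in for JobConf so that the example is self-contained, 
and the in-memory table is only a stand-in for a store such as HBase, not its 
actual API.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical top-level hooks: nothing here mentions files or Path. */
interface OutputCommitHooks {
    /** Called to initialize output for this job. */
    void initialize(Map<String, String> jobConf) throws IOException;

    /** Called to finalize output for this job. */
    void commit(Map<String, String> jobConf) throws IOException;
}

/** A table-like sink; rows are staged and only become visible on commit. */
final class StagedTableOutput implements OutputCommitHooks {
    private final Map<String, String> visible = new TreeMap<>();
    private final Map<String, String> staged = new TreeMap<>();

    @Override
    public void initialize(Map<String, String> jobConf) {
        // Discard anything left behind by an earlier, crashed run.
        staged.clear();
    }

    /** Called by tasks while the job runs. */
    void put(String row, String value) {
        staged.put(row, value);
    }

    @Override
    public void commit(Map<String, String> jobConf) {
        // Publish the staged rows, then empty the staging area.
        visible.putAll(staged);
        staged.clear();
    }

    Map<String, String> visibleRows() {
        return Collections.unmodifiableMap(visible);
    }
}
```

A file-based implementation would keep its Path handling inside its own class, 
so that callers of the top-level interface never see it.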

> changes to OutputFormat to work on temporary directory to enable re-running 
> crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.14.0
>
>         Attachments: hadoop-1558-JUN1007-1934.txt, 
> hadoop-1558-JUN1107-1533.txt
>
>
> Add  OutputFormat methods like:
> /** Called to initialize output for this job. */
> void initialize(JobConf job) throws IOException;
> /** Called to finalize output for this job. */
> void commit(JobConf job) throws IOException;
> In the base implementation for FileSystem output, initialize() might then 
> create a temporary directory for the job, removing any that already exists, 
> and commit could rename the temporary output directory to the final name. 
> The existing checkOutputSpecs() would continue to throw an exception if the 
> final output already exists.
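
The temporary-directory behaviour described in the quoted proposal can be 
sketched roughly as follows.  This is an illustration only: it uses 
java.nio.file on a local filesystem rather than Hadoop's FileSystem API, and 
the class name TempDirOutputSketch, the "_tmp_" prefix, and the workDir() 
accessor are assumptions made for the example.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical file-based output that writes to a temporary directory. */
final class TempDirOutputSketch {
    private final Path finalDir;
    private final Path tempDir;

    TempDirOutputSketch(Path finalDir) {
        this.finalDir = finalDir;
        // Assumed naming scheme: a sibling directory with a "_tmp_" prefix.
        this.tempDir = finalDir.resolveSibling("_tmp_" + finalDir.getFileName());
    }

    /** Still refuses to run if the final output already exists. */
    void checkOutputSpecs() throws IOException {
        if (Files.exists(finalDir)) {
            throw new FileAlreadyExistsException(finalDir.toString());
        }
    }

    /** Creates the temporary directory, removing any left by a crashed run. */
    void initialize() throws IOException {
        deleteRecursively(tempDir);
        Files.createDirectories(tempDir);
    }

    /** Where tasks should write while the job is running. */
    Path workDir() {
        return tempDir;
    }

    /** Renames the temporary directory to the final name. */
    void commit() throws IOException {
        Files.move(tempDir, finalDir, StandardCopyOption.ATOMIC_MOVE);
    }

    private static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order so that children are deleted before their parents.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }
}
```

Because a crashed run leaves only the temporary directory behind, re-running 
the job passes checkOutputSpecs() and starts again from a clean initialize().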

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
