[jira] Commented: (HADOOP-444) In streaming with a NONE reducer, you get duplicate files if a mapper fails, is restarted, and succeeds next time.

Michel Tourn (JIRA) Tue, 15 Aug 2006 10:50:26 -0700

    [ http://issues.apache.org/jira/browse/HADOOP-444?page=comments#action_12428184 ]

Michel Tourn commented on HADOOP-444:
-------------------------------------


I plan to fix this problem as follows:
name the "side-effect output files" after the input split rather than after the 
task ID:

/in/myfile
/sideout/myfile+200000-100000

And for the first split, at offset zero, drop the split info, 
i.e. name the output file with the unqualified filename:
/sideout/myfile
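
A minimal sketch of the naming rule, to make the two cases concrete. This is not the streaming code itself: the helper name `side_output_path` is hypothetical, and reading the suffix as "+<start offset>-<length>" is an assumption based on the example above.

```python
import posixpath


def side_output_path(input_path, start, length, out_dir="/sideout"):
    """Name a side-effect output file after its input split (hypothetical helper)."""
    base = posixpath.basename(input_path)
    if start == 0:
        # First split: drop the split info and reuse the bare input filename.
        return posixpath.join(out_dir, base)
    # Later splits: append "+<start>-<length>" (assumed meaning of the two numbers).
    return posixpath.join(out_dir, f"{base}+{start}-{length}")


print(side_output_path("/in/myfile", 200000, 100000))  # /sideout/myfile+200000-100000
print(side_output_path("/in/myfile", 0, 100000))       # /sideout/myfile
```

The name depends only on the input file and the split, never on the task attempt.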

That way, assuming your InputFormat creates only one split per file (*),
the input filenames and partitioning are preserved (in a different output 
directory).

And in the case of speculative execution and failed / re-executed tasks,
those output files are assigned the same name. 
Only one version of the file will remain in the end.
This solves the duplicate-output issue.
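
To illustrate why this removes the duplicates, here is a small simulation. It is only a sketch: a dict stands in for the output directory, the helper `run_attempts` is hypothetical, and the split values are assumed for the example. The two task IDs are the ones from the report quoted below.

```python
def run_attempts(name_for_attempt, attempts):
    """Simulate task attempts writing into a dict that stands in for the output directory."""
    outputs = {}
    for attempt in attempts:
        # A later attempt that maps to the same name overwrites the earlier one.
        outputs[name_for_attempt(attempt)] = attempt["data"]
    return outputs


# Two attempts of the same map task over the same (assumed) input split.
attempts = [
    {"task_id": "task_0026_m_007384_0", "split": ("myfile", 200000, 100000), "data": "records"},
    {"task_id": "task_0026_m_007384_1", "split": ("myfile", 200000, 100000), "data": "records"},
]

by_task = run_attempts(lambda a: a["task_id"], attempts)
by_split = run_attempts(lambda a: "%s+%d-%d" % a["split"], attempts)

print(len(by_task))    # 2 files when named after the task ID
print(len(by_split))   # 1 file when named after the split
```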

Only full output files are committed to HDFS on close. 
So there are no issues with concurrent writes / partial writes.

One question is what needs to be done to guarantee that the version from the 
last execution "wins".
Is it a good idea to force an HDFS-delete of a target output file before 
opening the new version?
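
A rough sketch of the delete-before-open idea, using the local filesystem as a stand-in for HDFS (an assumption; HDFS semantics may differ). The helper `write_replacing` is hypothetical.

```python
import os
import tempfile


def write_replacing(path, data):
    """Delete any earlier version of `path`, then write the new one."""
    if os.path.exists(path):
        os.remove(path)  # forced delete of the previous attempt's output
    with open(path, "w") as f:
        f.write(data)


out_dir = tempfile.mkdtemp()
target = os.path.join(out_dir, "myfile")

write_replacing(target, "attempt 0")
write_replacing(target, "attempt 1")

with open(target) as f:
    final = f.read()

print(final)                # attempt 1
print(os.listdir(out_dir))  # ['myfile']
```

Note that in this sketch the attempt that writes last wins, which under speculative execution is not necessarily the attempt that started last.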

(*) For example, inputs compressed with file-level compression, such as gzip (.gz). 
It could also be useful to let the user force a single split.

> In streaming with a NONE reducer, you get duplicate files if a mapper fails, 
> is restarted, and succeeds next time.
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-444
>                 URL: http://issues.apache.org/jira/browse/HADOOP-444
>             Project: Hadoop
>          Issue Type: Bug
>          Components: contrib/streaming
>    Affects Versions: 0.5.0
>            Reporter: Dick King
>         Assigned To: Michel Tourn
>
> When the dust settled after a streaming run, the directory ended up looking 
> like this:
>   
> /user/dking/<project-name>/K-HTML-UTF8-2006-08-09-rescued-abstracted/task_0026_m_007384_0
>    <r 3>   10563406
>   
> /user/dking/<project-name>/K-HTML-UTF8-2006-08-09-rescued-abstracted/task_0026_m_007384_1
>    <r 3>   10563406
> Future processing will receive duplicated data.
> -dk

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
