    [ http://issues.apache.org/jira/browse/HADOOP-444?page=comments#action_12428184 ]

Michel Tourn commented on HADOOP-444:
-------------------------------------

I plan to fix this problem as follows: name the "side-effect output files"
after the input split rather than after the task ID:

  /in/myfile
  /sideout/myfile+200000-100000

For the first split, at offset zero, drop the split info, i.e. name the
output file with the unqualified filename:

  /sideout/myfile

That way, assuming your InputFormat creates only one split per file (*),
the input filenames and partitioning are preserved (in a different output
directory).

In the case of speculative execution and failed or re-executed tasks, those
output files are assigned the same name, so only one version of the file
will remain in the end. This solves the duplicate-output issue.

Only full output files are committed to HDFS on close, so there are no
issues with concurrent writes or partial writes.

One open question is what needs to be done to guarantee that the version
from the last execution "wins". Is it a good idea to force an HDFS delete
of a target output file before opening the new version?

(*) For example, inputs compressed with file-level compression, like .gzip.
It could also be useful to let the user force a single split.

> In streaming with a NONE reducer, you get duplicate files if a mapper fails,
> is restarted, and succeeds next time.
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-444
>                 URL: http://issues.apache.org/jira/browse/HADOOP-444
>             Project: Hadoop
>          Issue Type: Bug
>          Components: contrib/streaming
>    Affects Versions: 0.5.0
>            Reporter: Dick King
>         Assigned To: Michel Tourn
>
> When the dust settled after a streaming run, the directory ended up looking
> like this:
>
> /user/dking/<project-name>/K-HTML-UTF8-2006-08-09-rescued-abstracted/task_0026_m_007384_0
> <r 3> 10563406
>
> /user/dking/<project-name>/K-HTML-UTF8-2006-08-09-rescued-abstracted/task_0026_m_007384_1
> <r 3> 10563406
> Future processing will receive duplicated data.
> -dk
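
The naming scheme proposed in the comment can be sketched roughly as follows. This is only an illustration, not the actual streaming code: the function name `side_output_path` is hypothetical, and reading the suffix in `myfile+200000-100000` as `+<offset>-<length>` is an assumption, since the comment does not spell out what the two numbers mean. The point it shows is that the name depends only on the input split, so every attempt of the same task (the `_0` and `_1` suffixes in the report) would resolve to the same output file.

```python
import posixpath


def side_output_path(input_path, offset, length, out_dir):
    """Derive a side-effect output path from the input split.

    Hypothetical helper. The "+<offset>-<length>" suffix layout is an
    assumed reading of the example "myfile+200000-100000".
    """
    name = posixpath.basename(input_path)
    if offset == 0:
        # First split: drop the split info and keep the bare filename.
        return posixpath.join(out_dir, name)
    return posixpath.join(out_dir, f"{name}+{offset}-{length}")


# Two attempts of the same map task see the same split, so they would
# target the same file instead of producing task_..._0 and task_..._1.
first_attempt = side_output_path("/in/myfile", 200000, 100000, "/sideout")
second_attempt = side_output_path("/in/myfile", 200000, 100000, "/sideout")
print(first_attempt)
print(first_attempt == second_attempt)
```

Which attempt's file survives is not decided by the naming alone; that is the open question raised in the comment about deleting the target before opening the new version.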
