[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257608#comment-14257608
 ] 

Gera Shegalov commented on MAPREDUCE-4815:
------------------------------------------

I think we should strive for a solution that does not create any sibling 
directories: they would surprise users, and checkOutputSpecs would need to be 
adjusted in every derived class. I think we can modify the behavior of the 
FileOutputCommitter (FOC) based on [~l201514]'s idea but still use the 
existing directory structure for backwards compatibility:

Task attempts write as usual to 
$joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/

{{commitTask}}: 
# rename *all* files 
'$joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/foo' to 
'$joboutput/foo'.
# rename '$joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID' to 
'$joboutput/_temporary/$appAttemptID/$taskID'; this rename is the actual commit

{{recoverTask}}:
# if '$joboutput/_temporary/$(appAttemptID - 1)/$taskID' exists: rename to 
'$joboutput/_temporary/$appAttemptID/$taskID'
# for backwards compatibility after upgrade to the new logic, check for any 
files '$joboutput/_temporary/$appAttemptID/$taskID/foo' and rename them to 
'$joboutput/foo'

{{commitJob}}:
# blow away $joboutput/_temporary
# write $joboutput/_SUCCESS
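
The three steps above can be sketched against a local filesystem as follows. This is only an illustration of the proposed renames, not the FileOutputCommitter code or any attached patch: the class name CommitSketch, its method signatures, and the use of java.nio.file in place of the Hadoop FileSystem API are all assumptions made to keep the example self-contained.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical local-filesystem sketch of the proposed commit protocol. */
final class CommitSketch {
    private final Path jobOutput;

    CommitSketch(Path jobOutput) {
        this.jobOutput = jobOutput;
    }

    /** $joboutput/_temporary/$appAttemptID */
    Path attemptDir(int appAttemptId) {
        return jobOutput.resolve("_temporary").resolve(Integer.toString(appAttemptId));
    }

    /** $joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID */
    Path taskAttemptDir(int appAttemptId, String taskAttemptId) {
        return attemptDir(appAttemptId).resolve("_temporary").resolve(taskAttemptId);
    }

    /** Moves every entry directly under src into $joboutput. */
    private void promote(Path src) throws IOException {
        List<Path> entries;
        try (Stream<Path> listing = Files.list(src)) {
            entries = listing.collect(Collectors.toList());
        }
        for (Path entry : entries) {
            // Throws FileAlreadyExistsException if the target name is taken.
            Files.move(entry, jobOutput.resolve(entry.getFileName()));
        }
    }

    void commitTask(int appAttemptId, String taskId, String taskAttemptId) throws IOException {
        Path work = taskAttemptDir(appAttemptId, taskAttemptId);
        // Step 1: rename all task output files into $joboutput.
        promote(work);
        // Step 2: rename the (now empty) attempt directory; this marks the commit.
        Files.move(work, attemptDir(appAttemptId).resolve(taskId));
    }

    void recoverTask(int appAttemptId, String taskId) throws IOException {
        Path previous = attemptDir(appAttemptId - 1).resolve(taskId);
        Path current = attemptDir(appAttemptId).resolve(taskId);
        // Step 1: carry the commit marker over from the previous app attempt.
        if (Files.isDirectory(previous)) {
            Files.createDirectories(current.getParent());
            Files.move(previous, current);
        }
        // Step 2: files committed under the old logic still sit in the marker
        // directory, so move them into $joboutput as well.
        if (Files.isDirectory(current)) {
            promote(current);
        }
    }

    void commitJob() throws IOException {
        Path temporary = jobOutput.resolve("_temporary");
        if (Files.exists(temporary)) {
            List<Path> doomed;
            try (Stream<Path> walk = Files.walk(temporary)) {
                // Reverse order so children are deleted before their parents.
                doomed = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
            }
            for (Path p : doomed) {
                Files.delete(p);
            }
        }
        Files.createFile(jobOutput.resolve("_SUCCESS"));
    }
}
```

In this sketch commitJob does no per-file work, because the output files already reached $joboutput during commitTask. It does not model HDFS rename semantics, name collisions between tasks, or failures between the two commitTask steps; those would still need to be handled in a real change.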

> FileOutputCommitter.commitJob can be very slow for jobs with many output files
> ------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4815
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4815
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.3, 2.0.1-alpha, 2.4.1
>            Reporter: Jason Lowe
>            Assignee: Siqi Li
>         Attachments: MAPREDUCE-4815.v3.patch, MAPREDUCE-4815.v4.patch, 
> MAPREDUCE-4815.v5.patch, MAPREDUCE-4815.v6.patch, MAPREDUCE-4815.v7.patch, 
> MAPREDUCE-4815.v8.patch
>
>
> If a job generates many files to commit then the commitJob method call at the 
> end of the job can take minutes.  This is a performance regression from 1.x, 
> as 1.x had the tasks commit directly to the final output directory as they 
> were completing and commitJob had very little to do.  The commit work was 
> processed in parallel and overlapped the processing of outstanding tasks.  In 
> 0.23/2.x, the commit is single-threaded and waits until all tasks have 
> completed before commencing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
