[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14275733#comment-14275733
 ] 

Siqi Li commented on MAPREDUCE-4815:
------------------------------------

I have attached patch v9, based on the design suggestions from Jason and Gera.

I have also run several performance test jobs, with the following results:

1. Teragen job with 500 mappers 

              Job Execution Time    Job Commit Time
Old APIs      43 sec                31 sec
New APIs      31 sec                0.2 sec
Savings       ~38.7%

2. Teragen job with 5K mappers 

              Job Execution Time    Job Commit Time
Old APIs      6 min 8 sec           2 min
New APIs      4 min 10 sec          0.3 sec
Savings       ~33.3%

3. Teragen job with 20K mappers 

              Job Execution Time    Job Commit Time
Old APIs      23 min 45 sec         10 min
New APIs      15 min 36 sec         0.5 sec
Savings       ~33.3%

According to the tables above, the average time saving for the teragen jobs is 
~33.3%. The job commit time with the new APIs is almost instant, whereas with 
the old APIs it grows linearly with the number of tasks. Note that these 
numbers were measured with the entire cluster dedicated to this one job. In a 
real-world scenario, the job commit may take much longer when the NNs are under 
heavy load.

In addition, the new APIs are optimized for large jobs with a short average 
task run time. Such jobs need little time to finish all of their tasks but 
spend a lot of time committing with the old APIs, so the commit makes up a 
large portion of the overall job time. With the new APIs the commit time is 
greatly reduced, hence the large savings.

For long-running small jobs, the savings may be negligible, but performance 
will not be worse than with the old APIs.
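
To illustrate why the old job commit time grows with the number of tasks, here 
is a rough sketch that simulates the two schemes from the issue description on 
a local filesystem with java.nio.file. It is not the FileOutputCommitter code 
from the patch: the class CommitSketch, its method names, and the directory 
layout (one "task_N" directory per task under a pending directory) are all 
hypothetical and chosen only for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical simulation; not the real Hadoop FileOutputCommitter.
final class CommitSketch {

    // Lists the direct children of dir up front, so moving them afterwards is safe.
    static List<Path> children(Path dir) throws IOException {
        try (Stream<Path> s = Files.list(dir)) {
            return s.collect(Collectors.toList());
        }
    }

    // Moves every file of one task into the final output directory.
    // Returns the number of renames issued.
    static int moveTaskOutput(Path taskDir, Path outputDir) throws IOException {
        int renames = 0;
        for (Path file : children(taskDir)) {
            Files.move(file, outputDir.resolve(file.getFileName()));
            renames++;
        }
        return renames;
    }

    // Old scheme: a single thread walks all task directories at job commit,
    // so the work is proportional to the number of tasks (and files).
    static int commitJobSerially(Path pendingDir, Path outputDir) throws IOException {
        int renames = 0;
        for (Path taskDir : children(pendingDir)) {
            renames += moveTaskOutput(taskDir, outputDir);
        }
        return renames;
    }

    // Creates `tasks` task directories under pendingDir, each holding one file.
    static void writeTaskOutputs(Path pendingDir, int tasks) throws IOException {
        for (int t = 0; t < tasks; t++) {
            Path taskDir = Files.createDirectory(pendingDir.resolve("task_" + t));
            Files.write(taskDir.resolve("part-" + t), new byte[] {1});
        }
    }

    public static void main(String[] args) throws IOException {
        int tasks = 500;

        // Old scheme: all renames are deferred to job commit.
        Path pendingOld = Files.createTempDirectory("pending-old");
        Path outOld = Files.createTempDirectory("out-old");
        writeTaskOutputs(pendingOld, tasks);
        int leftForJobCommitOld = commitJobSerially(pendingOld, outOld);

        // New scheme: each task moves its own output when it commits. On a real
        // cluster this happens in parallel while other tasks are still running;
        // here it is a plain loop.
        Path pendingNew = Files.createTempDirectory("pending-new");
        Path outNew = Files.createTempDirectory("out-new");
        writeTaskOutputs(pendingNew, tasks);
        for (Path taskDir : children(pendingNew)) {
            moveTaskOutput(taskDir, outNew);
        }
        int leftForJobCommitNew = commitJobSerially(pendingNew, outNew);

        System.out.println("renames left for job commit, old scheme: " + leftForJobCommitOld);
        System.out.println("renames left for job commit, new scheme: " + leftForJobCommitNew);
    }
}
```

The sketch only counts renames; it does not model NameNode latency, which is 
what turns each rename into real wall-clock time on a busy cluster. The 
temporary directories it creates are left behind for inspection.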





> FileOutputCommitter.commitJob can be very slow for jobs with many output files
> ------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4815
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4815
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.3, 2.0.1-alpha, 2.4.1
>            Reporter: Jason Lowe
>            Assignee: Siqi Li
>         Attachments: MAPREDUCE-4815.v3.patch, MAPREDUCE-4815.v4.patch, 
> MAPREDUCE-4815.v5.patch, MAPREDUCE-4815.v6.patch, MAPREDUCE-4815.v7.patch, 
> MAPREDUCE-4815.v8.patch, MAPREDUCE-4815.v9.patch
>
>
> If a job generates many files to commit then the commitJob method call at the 
> end of the job can take minutes.  This is a performance regression from 1.x, 
> as 1.x had the tasks commit directly to the final output directory as they 
> were completing and commitJob had very little to do.  The commit work was 
> processed in parallel and overlapped the processing of outstanding tasks.  In 
> 0.23/2.x, the commit is single-threaded and waits until all tasks have 
> completed before commencing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
