[
https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14275733#comment-14275733
]
Siqi Li commented on MAPREDUCE-4815:
------------------------------------
I have attached patch v9, based on the design suggestions from Jason and Gera.
I have also run a set of performance test jobs, with the following results:
1. Teragen job with 500 mappers
             Job Execution Time   Job Commit Time
   Old APIs  43 sec               31 sec
   New APIs  31 sec               0.2 sec
   Savings   ~38.7%
2. Teragen job with 5K mappers
             Job Execution Time   Job Commit Time
   Old APIs  6 min 8 sec          2 min
   New APIs  4 min 10 sec         0.3 sec
   Savings   ~33.3%
3. Teragen job with 20K mappers
             Job Execution Time   Job Commit Time
   Old APIs  23 min 45 sec        10 min
   New APIs  15 min 36 sec        0.5 sec
   Savings   ~33.3%
According to the tables above, the average time saving for the teragen job is
~33.3%, and the job commit time with the new APIs is almost instant, whereas
with the old APIs it grows linearly with the number of tasks. Note that these
numbers were measured with the entire cluster dedicated to this one job. In a
real deployment, job commit may take much longer when the NameNodes (NNs) are
under heavy load.
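
As a cross-check on the savings figures, the short Python sketch below recomputes the reduction in job execution time from the times reported for the three runs. It assumes that savings are measured relative to the old-API execution time; the names `RUNS` and `savings_pct` are illustrative only and are not part of the patch. Under that assumption the per-run values differ somewhat from the rounded figures quoted above, so those percentages are best read as approximate.

```python
# Job execution times in seconds, taken from the three runs reported above.
RUNS = {
    "500 mappers": {"old": 43, "new": 31},
    "5K mappers": {"old": 6 * 60 + 8, "new": 4 * 60 + 10},
    "20K mappers": {"old": 23 * 60 + 45, "new": 15 * 60 + 36},
}


def savings_pct(old_s, new_s):
    """Reduction in execution time as a percentage of the old-API time.

    Assumption: the baseline is the old-API execution time.
    """
    return 100.0 * (old_s - new_s) / old_s


for name, t in RUNS.items():
    print(f"{name}: {savings_pct(t['old'], t['new']):.1f}% less execution time")
```

A different baseline (for example, dividing by the new-API time) would give different percentages, which is why the baseline is stated explicitly here.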
In addition, the new APIs are optimized for large jobs with a small average task
finish time. Such jobs need little time to finish all of their tasks but spend a
lot of time committing with the old APIs, so a large portion of the overall job
time goes to the commit. With the new APIs the commit time is greatly reduced,
hence the saving is large.
For long-running small jobs the saving might be negligible, but the new APIs
will not be worse than the old APIs.
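
To make the point about commit overhead concrete, the sketch below computes what share of the old-API job execution time was spent in job commit, using the times from the three teragen runs. It assumes that the reported job execution time includes the job commit time, which the comment does not state explicitly; the names `OLD_API_RUNS` and `commit_share_pct` are hypothetical.

```python
# (job execution time, job commit time) in seconds with the old APIs,
# taken from the three teragen runs reported above.
OLD_API_RUNS = {
    "500 mappers": (43, 31),
    "5K mappers": (6 * 60 + 8, 2 * 60),
    "20K mappers": (23 * 60 + 45, 10 * 60),
}


def commit_share_pct(execution_s, commit_s):
    """Commit time as a percentage of total job execution time.

    Assumption: execution_s already includes commit_s.
    """
    return 100.0 * commit_s / execution_s


for name, (execution_s, commit_s) in OLD_API_RUNS.items():
    share = commit_share_pct(execution_s, commit_s)
    print(f"{name}: commit is {share:.1f}% of the job execution time")
```

The larger this share is for a given job, the more that job stands to gain from a near-instant commit.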
> FileOutputCommitter.commitJob can be very slow for jobs with many output files
> ------------------------------------------------------------------------------
>
> Key: MAPREDUCE-4815
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4815
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2
> Affects Versions: 0.23.3, 2.0.1-alpha, 2.4.1
> Reporter: Jason Lowe
> Assignee: Siqi Li
> Attachments: MAPREDUCE-4815.v3.patch, MAPREDUCE-4815.v4.patch,
> MAPREDUCE-4815.v5.patch, MAPREDUCE-4815.v6.patch, MAPREDUCE-4815.v7.patch,
> MAPREDUCE-4815.v8.patch, MAPREDUCE-4815.v9.patch
>
>
> If a job generates many files to commit then the commitJob method call at the
> end of the job can take minutes. This is a performance regression from 1.x,
> as 1.x had the tasks commit directly to the final output directory as they
> were completing and commitJob had very little to do. The commit work was
> processed in parallel and overlapped the processing of outstanding tasks. In
> 0.23/2.x, the commit is single-threaded and waits until all tasks have
> completed before commencing.