[ https://issues.apache.org/jira/browse/HADOOP-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650847#action_12650847 ]
Christian Kunz commented on HADOOP-4730:
----------------------------------------
I was monitoring the long tail (a single reducer) of a job and noticed that it
was spending a lot of time in the merge phase, doing its merges in a single
thread. Here is the log:
2008-11-25 16:27:52,222 INFO org.apache.hadoop.mapred.ReduceTask: Initiating final on-disk merge with 394 files
2008-11-25 16:27:52,343 INFO org.apache.hadoop.mapred.Merger: Merging 394 sorted segments
2008-11-25 16:27:57,982 INFO org.apache.hadoop.mapred.Merger: Merging 97 intermediate segments out of a total of 394
2008-11-25 17:10:23,569 INFO org.apache.hadoop.mapred.Merger: Merging 100 intermediate segments out of a total of 298
2008-11-25 17:59:22,272 INFO org.apache.hadoop.mapred.Merger: Merging 100 intermediate segments out of a total of 199
2008-11-25 18:48:48,813 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 100 segments left of total size: 113074719385 bytes
2008-11-25 18:48:50,521 INFO org.apache.hadoop.mapred.pipes.PipesReducer: starting application
Between 16:28 and 18:48, three intermediate merges were executed one after
another, each taking 40-50 minutes. Running them in parallel could have saved
about 1.5 hours.
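
The estimate can be checked against the log timestamps. The following is only
a back-of-the-envelope sketch: the timestamps are copied from the log above,
the names (STAMPS, pass_minutes) are made up for illustration, and it assumes
the three intermediate passes are independent of one another and could overlap
completely, which would also need enough cores and disk bandwidth.

```python
from datetime import datetime

# Start of each intermediate merge pass, followed by the start of the
# final merge pass (which marks the end of the third intermediate pass).
STAMPS = [
    "2008-11-25 16:27:57,982",
    "2008-11-25 17:10:23,569",
    "2008-11-25 17:59:22,272",
    "2008-11-25 18:48:48,813",
]


def pass_minutes(stamps):
    """Return the duration in minutes between consecutive timestamps."""
    times = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S,%f") for s in stamps]
    return [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]


durations = pass_minutes(STAMPS)
serial = sum(durations)    # what the log shows: one pass after another
parallel = max(durations)  # assumption: ideal overlap, bounded by the slowest pass
saved = serial - parallel

print("pass durations (min):", [round(d, 1) for d in durations])
print("saved (hours): %.2f" % (saved / 60))
```

Under that assumption the three passes would finish in roughly the time of the
slowest one instead of the sum of all three.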
> multi-threaded merge phase
> --------------------------
>
> Key: HADOOP-4730
> URL: https://issues.apache.org/jira/browse/HADOOP-4730
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Affects Versions: 0.18.1
> Reporter: Christian Kunz
>
> Doing merges in multiple threads (when enough cores are available -- a
> monitoring issue) could cut the time spent in merging by a factor equal
> to the number of threads.
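
As a rough illustration of the idea in the issue description, the sketch below
merges groups of sorted segments concurrently and then does one final merge
over the results. It is not Hadoop's Merger: the names merge_group and
parallel_merge, the in-memory lists standing in for on-disk segments, and the
fixed-size grouping are all assumptions made for the example. Note also that
threads in standard CPython do not run CPU-bound work in parallel, so this
shows the structure only, not the speedup.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def merge_group(segments):
    """Merge one group of sorted segments into a single sorted list."""
    return list(heapq.merge(*segments))


def parallel_merge(segments, factor, threads):
    """Merge sorted segments: groups of `factor` concurrently, then a final pass."""
    # Split the segments into groups of at most `factor` (hypothetical merge factor).
    groups = [segments[i:i + factor] for i in range(0, len(segments), factor)]
    # Intermediate passes: each group is merged on its own worker thread.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        intermediate = list(pool.map(merge_group, groups))
    # Final pass: merge the intermediate results into one sorted output.
    return list(heapq.merge(*intermediate))


segments = [[1, 4, 7], [2, 5, 8], [3, 6, 9], [0, 10]]
result = parallel_merge(segments, factor=2, threads=2)
print(result)
```

The ideal speedup by a factor equal to the number of threads holds only if the
intermediate merges are independent and the disks can keep all threads fed.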