[
https://issues.apache.org/jira/browse/MAPREDUCE-5010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Li Junjun updated MAPREDUCE-5010:
---------------------------------
Attachment: (was: 未标题-1.jpg)
> use multithreading to speed up Merger and try MapPartitionsCompleteEvent to
> schedule fetch in reduce
> -----------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-5010
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5010
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: mrv1
> Affects Versions: 1.0.1
> Reporter: Li Junjun
> Assignee: Todd Lipcon
>
> use multithreading to speed up Merger and try MapPartitionsCompleteEvent to
> schedule fetch in reduce
> This is for multicore CPUs; the performance gain will depend on your hardware and
> configuration.
> In the map task:
> [code]
> for (int parts = 0; parts < partitions; parts++) {
>   // merge this partition and append it to the final output file (file.out)
> }
> [/code]
> This loop uses only one thread.
> So I think we can use more threads (conf: mapred.map.mergerthreads) to do the
> merge if you have many cores or CPUs.
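To make the proposal concrete, here is a minimal sketch of merging partitions on a fixed-size thread pool. It is not the MapTask code: the class ParallelPartitionMerge, the method names, and the in-memory lists of integers standing in for spill segments are all assumptions made for illustration, and the mergerThreads parameter only mirrors the proposed mapred.map.mergerthreads setting. Results are collected in partition order, since file.out stores partitions sequentially and the final append would presumably still have to happen in that order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration; none of these names exist in Hadoop.
final class ParallelPartitionMerge {

    // Stand-in for merging the sorted spill segments of ONE partition (k-way merge).
    static List<Integer> mergeSortedRuns(List<List<Integer>> runs) {
        // Heap entries are {value, runIndex, offsetInRun}, ordered by value.
        PriorityQueue<int[]> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int r = 0; r < runs.size(); r++) {
            if (!runs.get(r).isEmpty()) {
                heap.add(new int[] {runs.get(r).get(0), r, 0});
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] head = heap.poll();
            merged.add(head[0]);
            int next = head[2] + 1;
            List<Integer> run = runs.get(head[1]);
            if (next < run.size()) {
                heap.add(new int[] {run.get(next), head[1], next});
            }
        }
        return merged;
    }

    // Merges every partition on a pool of mergerThreads workers.
    // The returned list is indexed by partition number.
    static List<List<Integer>> mergeAll(List<List<List<Integer>>> partitions,
                                        int mergerThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(mergerThreads);
        try {
            List<Future<List<Integer>>> pending = new ArrayList<>();
            for (List<List<Integer>> runs : partitions) {
                pending.add(pool.submit(() -> mergeSortedRuns(runs)));
            }
            List<List<Integer>> result = new ArrayList<>();
            for (Future<List<Integer>> f : pending) {
                result.add(f.get()); // waits in partition order
            }
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("merge interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("merge failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Whether this helps in practice would depend on whether the merge is CPU-bound or disk-bound on the node, which is the hardware dependence mentioned above.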
> Currently, reduce tasks fetch a map's output only after that map task completes,
> which means that
> when map x completes, all the reduces fetch its output at the same time,
> even though we use
> [code]
> // Randomize the map output locations to prevent
> // all reduce-tasks swamping the same tasktracker
> List<String> hostList = new ArrayList<String>();
> hostList.addAll(mapLocations.keySet());
> Collections.shuffle(hostList, this.random);
> [/code]
> in the reduce task.
> For example, 100 reduces wait for 2 maps to complete, because the cluster's map task
> capacity is 98 but the job has
> 100 map tasks.
> So I think: while the threads are merging, for example if a map has 8
> partitions and uses 3 threads to merge,
> as soon as one of the threads completes a partition we can inform the reduce to fetch
> that partition's file immediately,
> or we can wait until 3 partitions are complete and then send the event (conf:
> mapred.map.parts.inform) to reduce the load on the JobTracker (JT),
> rather than waiting for the whole map task to complete. Doing this will prevent all
> reduce-tasks from swamping the same tasktracker
> more effectively.
> Is this acceptable?
> Are there any other good ideas?
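The batching idea (send one event per N completed partitions instead of one per partition) could look roughly like the sketch below. Everything in it is hypothetical: PartitionCompletionBatcher and its methods are invented for illustration, partsPerEvent only mirrors the proposed mapred.map.parts.inform setting, and the eventSink callback stands in for whatever would actually deliver a MapPartitionsCompleteEvent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical illustration; none of these names exist in Hadoop.
final class PartitionCompletionBatcher {
    private final int partsPerEvent;
    private final Consumer<List<Integer>> eventSink;
    private final List<Integer> ready = new ArrayList<>();

    PartitionCompletionBatcher(int partsPerEvent, Consumer<List<Integer>> eventSink) {
        if (partsPerEvent < 1) {
            throw new IllegalArgumentException("partsPerEvent must be >= 1");
        }
        this.partsPerEvent = partsPerEvent;
        this.eventSink = eventSink;
    }

    // Called by a merger thread when it finishes one partition.
    synchronized void partitionMerged(int partition) {
        ready.add(partition);
        if (ready.size() >= partsPerEvent) {
            emit();
        }
    }

    // Called once after the last partition so a partial batch is not lost.
    synchronized void flush() {
        if (!ready.isEmpty()) {
            emit();
        }
    }

    // Sends a copy of the pending partition numbers, then resets the batch.
    // Only called while holding the lock, so the sink should return quickly.
    private void emit() {
        eventSink.accept(new ArrayList<>(ready));
        ready.clear();
    }
}
```

With 8 partitions and partsPerEvent set to 3, as in the example above, this would produce three events covering 3, 3, and 2 partitions, the last one coming from flush(). Setting partsPerEvent to 1 gives the "inform immediately" variant.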
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira