[ https://issues.apache.org/jira/browse/MAPREDUCE-5010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tsz Wo Nicholas Sze reassigned MAPREDUCE-5010:
----------------------------------------------

    Assignee:     (was: Tsz Wo Nicholas Sze)

> use multithreading to speed up mergeParts and try MapPartitionsCompleteEvent
> to schedule fetch in reduce
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5010
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5010
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mrv1
>    Affects Versions: 1.0.1
>            Reporter: Li Junjun
>         Attachments: MAPREDUCE-5010.jpg
>
>
> Use multithreading to speed up the Merger, and try MapPartitionsCompleteEvent to
> schedule fetches in reduce.
> This targets multicore CPUs; the performance gain will depend on your hardware and
> config.
> In the map task, the loop
> <code>
> for (int parts = 0; parts < partitions; parts++) {
>     // merge this partition and append it to the final output file (file.out)
> }
> </code>
> uses only one thread!
> So I think we can use more threads (conf: mapred.map.mergerthreads) to do
> the merge if you have many cores or CPUs.
> Currently, reduce tasks can fetch a map's output only after that map task has
> completed, which means that when map x completes, all the reducers fetch its
> output at the same time. This happens even though the reduce task uses
> <code>
> // Randomize the map output locations to prevent
> // all reduce-tasks swamping the same tasktracker
> List<String> hostList = new ArrayList<String>();
> hostList.addAll(mapLocations.keySet());
> Collections.shuffle(hostList, this.random);
> </code>
> For example, 100 reducers wait for 2 maps to complete, because the cluster's map
> task capacity is 98 but the job has 100 map tasks.
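The multithreaded merge proposed above could look roughly like the following. This is a minimal sketch, not Hadoop code: `ParallelPartitionMerge`, `mergePartition`, and the in-memory byte arrays are hypothetical stand-ins for the real per-partition merge, and `mergeThreads` stands for the proposed (not existing) setting mapred.map.mergerthreads. It assumes partitions can be merged independently and that the final output must still be written in partition order.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelPartitionMerge {

    // Hypothetical stand-in for merging all spill segments of one partition.
    static byte[] mergePartition(int partition) {
        return ("partition-" + partition + "\n").getBytes(StandardCharsets.UTF_8);
    }

    // Merges partitions on a pool of mergeThreads workers, then appends the
    // results in partition order so the layout of the final output is unchanged.
    static byte[] mergeParts(int partitions, int mergeThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(mergeThreads);
        try {
            List<Future<byte[]>> merged = new ArrayList<>();
            for (int parts = 0; parts < partitions; parts++) {
                final int p = parts;
                Callable<byte[]> task = () -> mergePartition(p);
                merged.add(pool.submit(task));
            }
            ByteArrayOutputStream finalOut = new ByteArrayOutputStream();
            for (Future<byte[]> f : merged) {
                byte[] bytes = f.get(); // blocks until this partition is merged
                finalOut.write(bytes, 0, bytes.length);
            }
            return finalOut.toByteArray();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("partition merge failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // 8 partitions merged by 3 threads (assumed value of the proposed setting).
        byte[] out = mergeParts(8, 3);
        System.out.print(new String(out, StandardCharsets.UTF_8));
    }
}
```

Collecting the futures in submission order keeps the append step sequential, so only the merge work itself runs in parallel; whether that helps depends on how much of the merge is CPU-bound rather than disk-bound.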
> So I think: while the merge threads are running, for example when a map has 8
> partitions and uses 3 merge threads, as soon as one thread completes a partition
> we can inform the reducers so that they fetch that partition file immediately.
> Alternatively, we can wait until 3 partitions are complete before sending the
> event (conf: mapred.map.parts.inform), to reduce the stress on the JobTracker.
> Either way, reducers do not have to wait for the whole map task to complete.
> Doing this would prevent all reduce tasks from swamping the same tasktracker,
> and would make the reduce phase more effective and faster.
> Is this acceptable?
> Are there other good ideas?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
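The batched notification described in the issue (inform reducers after every N merged partitions rather than after the whole map task) can be sketched as below. This is an illustration under assumptions, not the Hadoop implementation: `PartitionCompletionNotifier` and its `eventSink` callback are hypothetical, the callback stands in for whatever would carry a MapPartitionsCompleteEvent, and `batchSize` stands for the proposed setting mapred.map.parts.inform.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class PartitionCompletionNotifier {
    private final int totalPartitions;
    private final int batchSize; // proposed conf: mapred.map.parts.inform
    private final Consumer<List<Integer>> eventSink; // hypothetical event channel
    private final List<Integer> pending = new ArrayList<>();
    private int completed = 0;

    PartitionCompletionNotifier(int totalPartitions, int batchSize,
                                Consumer<List<Integer>> eventSink) {
        this.totalPartitions = totalPartitions;
        this.batchSize = batchSize;
        this.eventSink = eventSink;
    }

    // Called by a merge thread when it has finished one partition.
    // Synchronized because several merge threads may report concurrently.
    synchronized void partitionMerged(int partition) {
        pending.add(partition);
        completed++;
        // Send one event per full batch, and flush the remainder at the end.
        if (pending.size() >= batchSize || completed == totalPartitions) {
            eventSink.accept(new ArrayList<>(pending));
            pending.clear();
        }
    }

    public static void main(String[] args) {
        // 8 partitions, one event per 3 merged partitions (values from the example).
        PartitionCompletionNotifier notifier =
                new PartitionCompletionNotifier(8, 3, System.out::println);
        for (int p = 0; p < 8; p++) {
            notifier.partitionMerged(p);
        }
    }
}
```

A batch size of 1 corresponds to informing the reducers immediately after each partition; larger values trade fetch latency for fewer events reaching the JobTracker.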