[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Junjun updated MAPREDUCE-5010:
---------------------------------

    Description: 
Use multithreading to speed up the Merger, and try a MapPartitionsCompleteEvent 
to schedule fetches in reduce.


This is aimed at multicore CPUs; the performance gain will depend on your 
hardware and configuration.

In MapTask, the final merge is:
[code]
for (int parts = 0; parts < partitions; parts++) {
        // do the merge for this partition, append to the final output file (file.out)
}
[/code]
It uses only one thread!
So I think we can use more threads (conf: mapred.map.mergerthreads) to do the 
merge, if you have many cores or CPUs.
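
One way the idea could look, as a minimal sketch rather than actual MapTask code: each partition is merged as its own task on a fixed-size pool, and the results are collected in partition order so that they can still be appended to file.out sequentially. The class ParallelMergeSketch, its methods, and the use of int arrays as stand-ins for sorted spill segments are all hypothetical; mergerThreads stands for the proposed mapred.map.mergerthreads setting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch; not Hadoop's MapTask implementation. */
class ParallelMergeSketch {

    /** Stand-in for merging one partition's sorted segments into one sorted run. */
    static int[] mergePartition(List<int[]> sortedSegments) {
        int total = 0;
        for (int[] s : sortedSegments) total += s.length;
        int[] out = new int[total];
        int[] pos = new int[sortedSegments.size()]; // read cursor per segment
        for (int i = 0; i < total; i++) {
            int best = -1; // segment whose next element is smallest
            for (int s = 0; s < pos.length; s++) {
                int[] seg = sortedSegments.get(s);
                if (pos[s] < seg.length
                        && (best < 0 || seg[pos[s]] < sortedSegments.get(best)[pos[best]])) {
                    best = s;
                }
            }
            out[i] = sortedSegments.get(best)[pos[best]++];
        }
        return out;
    }

    /** Merges all partitions on mergerThreads threads; results keep partition order. */
    static List<int[]> mergeAll(List<List<int[]>> partitions, int mergerThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(mergerThreads);
        try {
            List<Future<int[]>> futures = new ArrayList<>();
            for (List<int[]> segments : partitions) {
                futures.add(pool.submit(() -> mergePartition(segments)));
            }
            List<int[]> merged = new ArrayList<>();
            for (Future<int[]> f : futures) {
                merged.add(f.get()); // waits in partition order, not completion order
            }
            return merged;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Collecting the futures in submission order keeps the layout of the final output file unchanged; only the merge work itself runs concurrently.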


Currently, reduce tasks fetch a map's output only after that map task 
completes. That means when map x completes, all the reduces fetch its output 
at the same time, even though the reduce task uses
[code]
   // Randomize the map output locations to prevent
   // all reduce-tasks swamping the same tasktracker
   List<String> hostList = new ArrayList<String>();
   hostList.addAll(mapLocations.keySet());
   Collections.shuffle(hostList, this.random);
[/code]
For example, 100 reduces wait for 2 maps to complete, because the cluster's 
map task capacity is 98 but the job has 100 map tasks.


So I think: while the threads are merging, for example if a map has 8 
partitions and uses 3 threads for the merge, then as soon as one thread 
completes one partition we can inform the reduce to fetch that partition file 
immediately. Or we can wait until 3 partitions complete and then send the 
event (conf: mapred.map.parts.inform), to reduce the stress on the JT. 
Either way, we do not wait for the whole map task to complete. Doing this will 
prevent all reduce-tasks from swamping the same tasktracker more effectively.
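
The batching rule could be sketched as follows, assuming merge threads report each finished partition to a shared notifier. The class PartitionNotifierSketch and its callback are hypothetical; batchSize stands for the proposed mapred.map.parts.inform setting, and the callback stands in for whatever would send a MapPartitionsCompleteEvent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch of batching partition-complete notifications. */
class PartitionNotifierSketch {
    private final int totalPartitions;
    private final int batchSize; // stands for mapred.map.parts.inform
    private final Consumer<List<Integer>> sendEvent; // stand-in for sending the event
    private final List<Integer> pending = new ArrayList<>();
    private int completed = 0;

    PartitionNotifierSketch(int totalPartitions, int batchSize,
                            Consumer<List<Integer>> sendEvent) {
        this.totalPartitions = totalPartitions;
        this.batchSize = batchSize;
        this.sendEvent = sendEvent;
    }

    /** Called by a merge thread when it has finished one partition. */
    synchronized void partitionDone(int partition) {
        pending.add(partition);
        completed++;
        // Send when a full batch is ready, or flush the remainder at the end.
        if (pending.size() >= batchSize || completed == totalPartitions) {
            sendEvent.accept(new ArrayList<>(pending));
            pending.clear();
        }
    }
}
```

With 8 partitions and a batch size of 3, this sends three events covering 3, 3, and 2 partitions; a batch size of 1 gives the "inform immediately" variant.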



Is this acceptable?
Any other good ideas?


  was:
the same description, except that the second code block was closed with 
[code] instead of [/code].

    
> use multithreading to speed up Merger and try MapPartitionsCompleteEvent to 
> schedule fetch in reduce 
> -----------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5010
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5010
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mrv1
>    Affects Versions: 1.0.1
>            Reporter: Li Junjun
>            Assignee: Todd Lipcon
>         Attachments: MAPREDUCE-5010.jpg
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
