[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer resolved MAPREDUCE-1939.
-----------------------------------------
    Resolution: Won't Fix

stale

> split reduce compute phase into two threads one for reading and another for 
> computing
> -------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1939
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1939
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>    Affects Versions: 0.20.2
>            Reporter: wangxiaowei
>
> It is known that a reduce task is made up of three phases: shuffle, sort, and 
> reduce. During the reduce phase, the reduce function first reads a record 
> from disk or memory, then processes it, and finally writes the result to 
> HDFS. To turn this serial process into a parallel one, I split the reduce 
> phase into two threads, called producer and consumer. The producer reads 
> records from disk, and the consumer processes the records the producer has 
> read. I use two buffers: while the producer is writing to one buffer, the 
> consumer reads from the other. In theory the two activities overlap, so the 
> total reduce time should go down.
> I wonder why Hadoop did not implement it this way originally. Are there 
> potential problems with this idea?
> I have already implemented a prototype. The producer just reads bytes from 
> the disk and leaves the conversion into real key and value objects to the 
> consumer. The results are not good: only a 13% improvement in time. I think 
> it has something to do with the buffer size and the time spent in each 
> thread. Maybe the consumer thread takes too long, so the producer has to 
> wait until the next buffer is available.
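
For readers following the description above, here is a minimal sketch of the two-buffer producer/consumer scheme in plain Java. It is not the reporter's prototype and uses no Hadoop APIs: the class name DoubleBufferedReduce, the BUFFER_CAPACITY constant, and the "decode and upper-case" step that stands in for the reduce function are all assumptions made for illustration. The producer only moves raw bytes into a buffer, and the consumer does the conversion, as the report describes.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class DoubleBufferedReduce {
    // Hypothetical number of records per buffer; the report suggests this needs tuning.
    static final int BUFFER_CAPACITY = 4;
    // Sentinel instance (compared by identity) that marks the end of the input.
    private static final List<byte[]> END = new ArrayList<>();

    static List<String> run(List<byte[]> rawRecords) {
        // Two buffers circulate between the "free" and "filled" queues.
        BlockingQueue<List<byte[]>> free = new ArrayBlockingQueue<>(2);
        BlockingQueue<List<byte[]>> filled = new ArrayBlockingQueue<>(3);
        free.add(new ArrayList<>());
        free.add(new ArrayList<>());

        // Producer: copies raw bytes into the current buffer, no deserialization.
        Thread producer = new Thread(() -> {
            try {
                List<byte[]> buf = free.take();
                for (byte[] rec : rawRecords) {
                    buf.add(rec);
                    if (buf.size() == BUFFER_CAPACITY) {
                        filled.put(buf);      // hand the full buffer to the consumer
                        buf = free.take();    // blocks if the consumer is still busy
                    }
                }
                if (!buf.isEmpty()) {
                    filled.put(buf);          // flush the last partial buffer
                }
                filled.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        // Consumer (the calling thread): decodes and processes each record.
        List<String> out = new ArrayList<>();
        try {
            while (true) {
                List<byte[]> buf = filled.take();
                if (buf == END) {
                    break;
                }
                for (byte[] rec : buf) {
                    // Stand-in for deserialization plus the reduce function.
                    out.add(new String(rec, StandardCharsets.UTF_8).toUpperCase());
                }
                buf.clear();
                free.put(buf);                // return the buffer for reuse
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while reducing", e);
        }
        return out;
    }
}
```

Calling DoubleBufferedReduce.run(records) with a list of UTF-8 byte arrays returns the processed strings in input order. The producer's free.take() call is where it stalls when the consumer is the slower side, which is the waiting the report suspects limits the speedup.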



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
