[
https://issues.apache.org/jira/browse/MAPREDUCE-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Allen Wittenauer resolved MAPREDUCE-1939.
-----------------------------------------
Resolution: Won't Fix
stale
> split reduce compute phase into two threads one for reading and another for
> computing
> -------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-1939
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1939
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: task
> Affects Versions: 0.20.2
> Reporter: wangxiaowei
>
> It is known that a reduce task is made up of three phases: shuffle, sort, and
> reduce. During the reduce phase, the reduce function first reads a record from
> disk or memory, processes it, and finally writes the result to HDFS. To turn
> this serial process into a parallel one, I split the reduce phase into two
> threads, called the producer and the consumer. The producer reads records from
> disk, and the consumer processes the records the producer has read. I use two
> buffers: while the producer is filling one buffer, the consumer reads from the
> other. In theory the two phases overlap, so the total reduce time should drop.
> I wonder why Hadoop does not implement this already. Are there any potential
> problems with this idea?
> I have already implemented a prototype. The producer just reads bytes from
> disk and leaves the transformation into real key and value objects to the
> consumer. The result is not good: only a 13% improvement in time. I think it
> has something to do with the buffer size and the time spent in the different
> threads. Maybe the time spent by the consumer thread is too long, so the
> producer has to wait until the next buffer is available.
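The double-buffer scheme described above can be sketched as follows. This is a minimal, self-contained illustration, not the reporter's actual patch: two fixed-size buffers cycle between a "free" queue and a "filled" queue, the producer stands in for the disk read, and the consumer stands in for the reduce function (here it just sums the records). An empty array is used as a poison pill to signal end of input; all names and sizes are assumptions for the sketch.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of the producer/consumer double-buffering idea from the issue
// description (not the actual Hadoop prototype).
public class DoubleBufferReduce {
    static final int BUFFER_SIZE = 4;      // records per buffer (assumed)
    static final int TOTAL_RECORDS = 20;   // simulated input size (assumed)

    public static long run() throws InterruptedException {
        // Two buffers cycle between a "free" queue and a "filled" queue,
        // so the producer can fill one while the consumer drains the other.
        BlockingQueue<int[]> free = new ArrayBlockingQueue<>(2);
        BlockingQueue<int[]> filled = new ArrayBlockingQueue<>(2);
        free.put(new int[BUFFER_SIZE]);
        free.put(new int[BUFFER_SIZE]);

        final long[] sum = {0};

        Thread producer = new Thread(() -> {
            try {
                int next = 1;
                while (next <= TOTAL_RECORDS) {
                    int[] buf = free.take();          // wait for an empty buffer
                    for (int n = 0; n < BUFFER_SIZE; n++) {
                        buf[n] = next++;              // stands in for a disk read
                    }
                    filled.put(buf);                  // hand it to the consumer
                }
                filled.put(new int[0]);               // poison pill: end of input
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    int[] buf = filled.take();        // wait for a full buffer
                    if (buf.length == 0) break;       // end of input
                    for (int v : buf) sum[0] += v;    // stands in for reduce()
                    free.put(buf);                    // recycle the buffer
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
        return sum[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run());  // sum of 1..20 = 210
    }
}
```

Note that this sketch also exhibits the bottleneck the reporter observed: if the consumer (the reduce function plus deserialization) is slower than the producer, the producer simply blocks on `free.take()`, and the overlap yields only a modest speedup.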
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)