May be of help ... In my experience there is not a single bottleneck.
Even your tuple representation may disproportionately impact performance.
Overly granular tuples that repeat redundant values will slow down the
shuffle, which does at least two serialize/deserialize operations per tuple.
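For example, packing related fields into one coarser-grained Writable,
instead of emitting several small tuples that repeat the same key, cuts
the number of objects the shuffle has to serialize, sort and deserialize.
A minimal sketch (the class and its fields are hypothetical, just to show
the shape):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical coarse-grained value: packs fields that might otherwise
// be emitted as three separate (key, value) pairs repeating the same key.
public class OrderStatsWritable implements Writable {
    private long count;
    private double totalAmount;
    private double maxAmount;

    public OrderStatsWritable() {}

    public OrderStatsWritable(long count, double totalAmount, double maxAmount) {
        this.count = count;
        this.totalAmount = totalAmount;
        this.maxAmount = maxAmount;
    }

    public void write(DataOutput out) throws IOException {
        // One serialization pass covers all three fields.
        out.writeLong(count);
        out.writeDouble(totalAmount);
        out.writeDouble(maxAmount);
    }

    public void readFields(DataInput in) throws IOException {
        count = in.readLong();
        totalAmount = in.readDouble();
        maxAmount = in.readDouble();
    }
}

Emitting one of these per key instead of three fine-grained tuples means
the shuffle moves roughly a third as many objects for the same data.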
Benchmarks on my internal cluster show HDFS seek times around 25 ms,
higher than on a typical physical disk. Hence I'm under the impression
that the number of input/output files affects I/O performance more than
it would on a typical physical disk.
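So it can pay to keep the number of files down, e.g. by having the
upstream job produce fewer, larger files. A rough sketch with the old
mapred API (the paths and the reducer count of 4 are just placeholders);
each reduce task writes one part file to HDFS, so fewer reducers means
fewer files for the next job to open and seek into:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ConsolidatedOutputJob {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(ConsolidatedOutputJob.class);
        conf.setJobName("consolidated-output");

        // Identity map/reduce by default; only the file layout matters here.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Each reducer emits one part-NNNNN file, so 4 reducers -> 4 output
        // files, rather than one small file per map task in a map-only job.
        conf.setNumReduceTasks(4);

        JobClient.runJob(conf);
    }
}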

A paper from Berkeley EECS about pipelining between Reduce and Map may
be of interest.
http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-136.html

Gang Luo wrote:
> Hi all,
> I have some questions about I/O performance in Hadoop. What I want to 
> know is which of the three phases (mapper input, shuffle, and reducer 
> output) is the bottleneck.
>
> First, compared with the shuffle phase (mapper output, data transfer over 
> the network, and the reducer writing it to local disk), does it cost more 
> time to read the input source into the mappers? My concern is that reading 
> a large table also requires transferring different splits to different 
> mappers over the network. Is Hadoop smart enough to make input consume 
> less time than shuffle?
>
> Second, since the final result will be stored in HDFS, we need to keep 
> several copies of it (by default, 3 copies). This requires transferring 
> more than one copy of the result over the network. If we run another 
> MapReduce job that depends on this result, I think the latency for 
> replicating the result is non-trivial, right? How does this compare with 
> mapper input and shuffle?
>
> Thanks.
>
>  
> Gang Luo
> ---------
> Department of Computer Science
> Duke University
> (919)316-0993
> gang....@duke.edu
>
>
