May be of help ... In my experience there is no single bottleneck. Even your tuple representation can disproportionately impact performance: too-granular tuples that carry redundant values slow down the shuffle, which performs at least two serialize/deserialize operations per tuple (see the sketch below). Benchmarks on my internal cluster show HDFS seek times around 25 ms, higher than on a typical physical disk, so I'm under the impression that the number of input/output files affects IO performance more than it would on a typical physical disk.
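To make the tuple-granularity point concrete, here is a minimal sketch of packing related fields into one custom Writable instead of emitting them as several small Text tuples, so the shuffle serializes one compact binary value per record. ClickRecord and its fields are hypothetical names for illustration, not from this thread:

    // A minimal sketch: one compact binary value per tuple instead of
    // several string-encoded Text fields. Names here are hypothetical.
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    public class ClickRecord implements Writable {
        private long timestamp;
        private int  pageId;
        private int  clickCount;

        public ClickRecord() {}   // Hadoop needs a no-arg constructor

        public void set(long timestamp, int pageId, int clickCount) {
            this.timestamp = timestamp;
            this.pageId = pageId;
            this.clickCount = clickCount;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            // 16 bytes per tuple, no string parsing on either side.
            out.writeLong(timestamp);
            out.writeInt(pageId);
            out.writeInt(clickCount);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            timestamp = in.readLong();
            pageId = in.readInt();
            clickCount = in.readInt();
        }
    }

Reusing a single ClickRecord instance in the mapper and calling set(...) per record also avoids allocating one object per tuple.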
A paper from Berkeley EECS about pipelining between Reduce and Map may be of interest: http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-136.html

Gang Luo wrote:
> Hi all,
> I have some questions about IO performance in Hadoop. What I want to
> know is which of the three phases (mapper input, shuffle, and reducer
> output) is the bottleneck.
>
> First, compared with the shuffle phase (mapper output, data transfer
> over the network, and the reducers writing to local disk), does it cost
> more time to read the input into the mappers? My concern is that reading
> a large table also requires transferring different splits to different
> mappers over the network. Is Hadoop smart enough to make input consume
> less time than shuffle?
>
> Second, since the final result is stored in HDFS, it is replicated
> several times (by default, 3 copies). This requires transferring more
> than one copy of the result over the network. If we run another
> MapReduce job that depends on this result, I think the latency from
> replicating the result is non-trivial, right? How does this cost compare
> with mapper input and shuffle?
>
> Thanks.
>
> Gang Luo
> ---------
> Department of Computer Science
> Duke University
> (919)316-0993
> gang....@duke.edu
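On the replication question in the quoted mail: one common mitigation, not specific to this thread, is to lower dfs.replication for a job whose output is purely intermediate and will be consumed by the next job. A minimal sketch using the 0.20-style API; the class name and path arguments are hypothetical:

    // A minimal sketch, assuming this job's output is intermediate data
    // read by a follow-up job. One replica instead of the default three
    // avoids two extra copies of every block crossing the network.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class IntermediateStage {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Write this job's HDFS output with 1 replica, not 3.
            conf.setInt("dfs.replication", 1);

            Job job = new Job(conf, "intermediate-stage");
            job.setJarByClass(IntermediateStage.class);
            // Identity map/reduce: TextInputFormat keys/values pass through.
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The trade-off is fault tolerance: a lost block forces recomputation of the upstream job, so this only pays off when the follow-up job runs soon afterwards.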