Hi Gang, thanks for your reply.
To clarify: I am looking at the statistics through the job tracker. In the web
interface for my job I have columns for map, reduce and total. What I was
referring to is "map", i.e. I see FILE_BYTES_WRITTEN = 3 * Map Output Bytes in
the map column.

About the replication factor: I would expect exactly the same thing, namely
that changing it to 6 has no influence on FILE_BYTES_WRITTEN.

About the sorting: I have io.sort.mb = 100 and io.sort.factor = 10.
Furthermore, I have 40 mappers and the map output data is ~300GB. I can't see
how that ends up in a factor of 3?

- tim

On 23.02.2010 14:39, Gang Luo wrote:
> Hi Tim,
> the intermediate data is materialized to the local file system. Before it is
> available to the reducers, the mappers sort it. If the buffer (io.sort.mb)
> is too small for the intermediate data, multi-phase sorting happens, which
> means you read and write the same data more than once.
>
> Besides, are you looking at the statistics per mapper through the job
> tracker, or just at the information output when a job finishes? If you look
> at the information given at the end of the job, note that this is an overall
> statistic which includes sorting on the reduce side. It also includes the
> amount of data written to HDFS (I am not 100% sure).
>
> And FILE_BYTES_WRITTEN has nothing to do with the replication factor. I
> think if you change the factor to 6, FILE_BYTES_WRITTEN will still be the
> same.
>
> -Gang
>
>
> ----- Original Message ----
> From: Tim Kiefer <tim-kie...@gmx.de>
> To: "common-user@hadoop.apache.org" <common-user@hadoop.apache.org>
> Sent: 2010/2/23 (Tue) 6:44:28 AM
> Subject: How are intermediate key/value pairs materialized between map and
> reduce?
>
> Hi there,
>
> can anybody help me out with a (most likely) simple point that is unclear to
> me?
>
> I am wondering how intermediate key/value pairs are materialized. I have a
> job where the map phase produces 600,000 records and map output bytes is
> ~300GB.
> What I thought (up to now) is that these 600,000 records, i.e., 300GB, are
> materialized locally by the mappers and that later on the reducers pull
> these records (based on the key).
> What I see (and cannot explain) is that the FILE_BYTES_WRITTEN counter is as
> high as ~900GB.
>
> So where does the factor of 3 between Map output bytes and
> FILE_BYTES_WRITTEN come from? I thought about the replication factor of 3 in
> the file system, but that should apply to HDFS only?!
>
> Thanks
> - tim
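
For reference, the figures mentioned in this thread (io.sort.mb = 100, io.sort.factor = 10, 40 mappers, ~300GB of map output) can be put into a rough back-of-the-envelope estimate of how often each byte might be written on the map side. This is only a sketch under several assumptions that are not stated in the thread: the output is spread evenly over the mappers, the buffer spills at 80% of io.sort.mb (the `spill_percent` parameter here is an assumed value), and every merge pass rewrites all of a mapper's data. The real merge logic is more involved, so the result is indicative at best. The function name `estimate_write_factor` is made up for this illustration.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def estimate_write_factor(map_output_mb, num_mappers, io_sort_mb,
                          io_sort_factor, spill_percent=80):
    """Rough estimate of how many times each map output byte is written.

    Assumptions (not taken from the thread): even distribution over the
    mappers, spills triggered at spill_percent of io_sort_mb, and every
    merge pass rewriting all of a mapper's data.
    Returns (spill_files, merge_passes, write_factor) for one mapper.
    """
    per_mapper_mb = ceil_div(map_output_mb, num_mappers)
    spill_mb = io_sort_mb * spill_percent // 100
    spill_files = ceil_div(per_mapper_mb, spill_mb)

    # Merge at most io_sort_factor files at a time until one file is left.
    merge_passes = 0
    files = spill_files
    while files > 1:
        files = ceil_div(files, io_sort_factor)
        merge_passes += 1

    # One write for the initial spill plus one write per merge pass.
    return spill_files, merge_passes, 1 + merge_passes


if __name__ == "__main__":
    spills, passes, factor = estimate_write_factor(
        map_output_mb=300 * 1024, num_mappers=40,
        io_sort_mb=100, io_sort_factor=10)
    print(spills, passes, factor)  # prints: 96 2 3
```

Under these assumptions each mapper would produce about 96 spill files, which need two merge passes at a merge factor of 10, so each byte would be written roughly three times. That happens to match the observed ratio, but it is not a confirmed explanation.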