Hi Gang,

thanks for your reply.

To clarify: I look at the statistics through the job tracker. In the
web interface for my job I have columns for map, reduce and total. What I
was referring to is "map", i.e. I see FILE_BYTES_WRITTEN = 3 * Map
Output Bytes in the map column.

About the replication factor: I would expect the exact same thing -
changing to 6 has no influence on FILE_BYTES_WRITTEN.

About the sorting: I have io.sort.mb = 100 and io.sort.factor = 10.
Furthermore, I have 40 mappers and the map output data is ~300GB. I can't
see how that ends up as a factor of 3.
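
For reference, a rough back-of-the-envelope sketch of the arithmetic behind
these settings. It is a simplified model, not Hadoop's actual merge code, and
it rests on assumptions not confirmed in this thread: the ~300GB is spread
evenly over 40 map tasks, a spill happens only when the full io.sort.mb buffer
is used (ignoring io.sort.spill.percent and record metadata), and every merge
step rewrites all data. The function name is hypothetical.

```python
import math

def estimated_write_passes(output_mb_per_mapper, sort_buffer_mb, merge_factor):
    """Rough count of how often each map output byte is written to local disk.

    Simplified model (an assumption, not Hadoop's real merge logic):
    one write for the initial spill, plus one write per merge round.
    """
    # Each time the sort buffer fills, one sorted spill file is written.
    spills = math.ceil(output_mb_per_mapper / sort_buffer_mb)
    passes = 1  # the initial spill write
    files = spills
    # Merge at most merge_factor files per step until one file remains.
    while files > 1:
        files = math.ceil(files / merge_factor)
        passes += 1
    return spills, passes

# ~300 GB over 40 mappers, assumed to be evenly distributed
per_mapper_mb = 300 * 1024 / 40
spills, passes = estimated_write_passes(per_mapper_mb, 100, 10)
print(spills, passes)  # prints: 77 3
```

Under this model each mapper would produce about 77 spill files, and with a
merge factor of 10 those need two merge rounds, so each byte would be written
roughly three times. Whether that is what actually happens in this job is
still open.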

- tim

On 23.02.2010 14:39, Gang Luo wrote:
> Hi Tim,
> the intermediate data is materialized to the local file system. Before it is 
> available to the reducers, the mappers sort it. If the buffer (io.sort.mb) is 
> too small for the intermediate data, multi-phase sorting happens, which means 
> the same data is read and written more than once. 
>
> Besides, are you looking at the statistics per mapper through the job 
> tracker, or just at the information printed when a job finishes? If you look 
> at the information given at the end of the job, note that this is an overall 
> statistic which includes sorting on the reduce side. It also includes the 
> amount of data written to HDFS (I am not 100% sure).
>
> And FILE_BYTES_WRITTEN has nothing to do with the replication factor. I 
> think if you change the factor to 6, FILE_BYTES_WRITTEN will still be the same.
>
>  -Gang
>
>
> ----- Original Message ----
> From: Tim Kiefer <tim-kie...@gmx.de>
> To: "common-user@hadoop.apache.org" <common-user@hadoop.apache.org>
> Sent: 2010/2/23 (Tue) 6:44:28 AM
> Subject: How are intermediate key/value pairs materialized between map and 
> reduce?
>
> Hi there,
>
> can anybody help me out with a (most likely) simple point of confusion?
>
> I am wondering how intermediate key/value pairs are materialized. I have a 
> job where the map phase produces 600,000 records and map output bytes is 
> ~300GB. What I thought (up to now) is that these 600,000 records, i.e., 
> 300GB, are materialized locally by the mappers and that later on reducers 
> pull these records (based on the key).
> What I see (and cannot explain) is that the FILE_BYTES_WRITTEN counter is as 
> high as ~900GB.
>
> So - where does the factor 3 come from between Map output bytes and 
> FILE_BYTES_WRITTEN??? I thought about the replication factor of 3 in the file 
> system - but that should be HDFS only?!
>
> Thanks
> - tim
>
