Hi,

Can you let us know what the values are for:

  Map input records
  Map spilled records
  Map output bytes

Is there any side-effect file written?
Thanks,
Amogh

On 2/23/10 8:57 PM, "Tim Kiefer" <tim-kie...@gmx.de> wrote:

No... 900GB is in the map column. Reduce adds another ~70GB of
FILE_BYTES_WRITTEN, and the total column consequently shows ~970GB.

Am 23.02.2010 16:11, schrieb Ed Mazur:
> Hi Tim,
>
> I'm guessing a lot of these writes are happening on the reduce side.
> On the JT web interface, there are three columns: map, reduce,
> overall. Is the 900GB figure from the overall column? The value in the
> map column will probably be closer to what you were expecting. There
> are writes on the reduce side too during the shuffle and multi-pass
> merge.
>
> Ed
>
> 2010/2/23 Tim Kiefer <tim-kie...@gmx.de>:
>
>> Hi Gang,
>>
>> thanks for your reply.
>>
>> To clarify: I look at the statistics through the job tracker. In the
>> web interface for my job I have columns for map, reduce, and total. What I
>> was referring to is "map" - i.e., I see FILE_BYTES_WRITTEN = 3 * Map
>> Output Bytes in the map column.
>>
>> About the replication factor: I would expect the exact same thing -
>> changing it to 6 has no influence on FILE_BYTES_WRITTEN.
>>
>> About the sorting: I have io.sort.mb = 100 and io.sort.factor = 10.
>> Furthermore, I have 40 mappers and the map output data is ~300GB. I can't
>> see how that ends up at a factor of 3?
>>
>> - tim
>>
>> Am 23.02.2010 14:39, schrieb Gang Luo:
>>
>>> Hi Tim,
>>> the intermediate data is materialized to the local file system. Before it
>>> is available to reducers, mappers will sort it. If the buffer (io.sort.mb)
>>> is too small for the intermediate data, multi-phase sorting happens, which
>>> means you read and write the same bits more than once.
>>>
>>> Besides, are you looking at the statistics per mapper through the job
>>> tracker, or just the information output when a job finishes? If you look at
>>> the information given out at the end of the job, note that this is an
>>> overall statistic which includes sorting on the reduce side. It may also
>>> include the amount of data written to HDFS (I am not 100% sure).
>>>
>>> And FILE_BYTES_WRITTEN has nothing to do with the replication factor.
>>> I think if you change the factor to 6, FILE_BYTES_WRITTEN will still be
>>> the same.
>>>
>>> -Gang
>>>
>>>
>>> Hi there,
>>>
>>> can anybody help me out with a (most likely) simple point of confusion?
>>>
>>> I am wondering how intermediate key/value pairs are materialized. I have a
>>> job where the map phase produces 600,000 records and map output bytes is
>>> ~300GB. What I thought (up to now) is that these 600,000 records, i.e.,
>>> 300GB, are materialized locally by the mappers and that later on reducers
>>> pull these records (based on the key).
>>> What I see (and cannot explain) is that the FILE_BYTES_WRITTEN counter is
>>> as high as ~900GB.
>>>
>>> So - where does the factor of 3 between Map output bytes and
>>> FILE_BYTES_WRITTEN come from??? I thought about the replication factor of
>>> 3 in the file system - but that should be HDFS only?!
>>>
>>> Thanks
>>> - tim
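
For what it's worth, Tim's own numbers do work out to a factor of roughly 3.
Below is a minimal back-of-the-envelope sketch (illustrative code, not Hadoop
source), assuming the default io.sort.spill.percent of 0.80 and that each
extra merge pass rewrites all of the data once; neither assumption is stated
in the thread.

    // Rough spill/merge I/O estimate from the numbers in this thread.
    public class SpillEstimate {
        public static void main(String[] args) {
            double mapOutputGB = 300.0;  // "Map output bytes" (from the thread)
            int mappers = 40;            // from the thread
            double ioSortMB = 100;       // io.sort.mb (from the thread)
            int ioSortFactor = 10;       // io.sort.factor (from the thread)
            double spillPercent = 0.80;  // io.sort.spill.percent, assumed default

            // Output per mapper and the size of each spill file.
            double perMapperMB = mapOutputGB * 1024 / mappers;   // ~7680 MB
            double spillMB = ioSortMB * spillPercent;            // ~80 MB
            int spills = (int) Math.ceil(perMapperMB / spillMB); // ~96 spill files

            // Merging with fan-in io.sort.factor cuts the file count ~10x per
            // pass, and each pass rewrites the data it merges.
            int passes = (int) Math.ceil(
                    Math.log(spills) / Math.log(ioSortFactor)); // 2 passes

            // One write for the initial spills plus one per merge pass.
            double writtenGB = mapOutputGB * (1 + passes);      // ~900 GB
            System.out.printf("spills/mapper=%d, merge passes=%d, ~%.0f GB written%n",
                    spills, passes, writtenGB);
        }
    }

With ~96 spill files per mapper and a merge factor of 10, two merge passes are
needed, so every byte is written roughly three times: once at spill time and
once per merge pass. That matches the ~900GB in the map column. Under this
model, raising io.sort.mb (fewer spills) or io.sort.factor (fewer passes)
should bring FILE_BYTES_WRITTEN back down toward the map output size.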
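And on Amogh's question: rather than reading the values off the JobTracker
page, the counters can also be fetched from the driver once the job finishes.
A sketch against the newer org.apache.hadoop.mapreduce API (the thread
predates that API being standard, and the PrintMapCounters class is
illustrative only):

    import java.io.IOException;
    import org.apache.hadoop.mapreduce.Counters;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.TaskCounter;

    public class PrintMapCounters {
        // Call after job.waitForCompletion(true) in the driver.
        static void print(Job job) throws IOException {
            Counters c = job.getCounters();
            System.out.println("Map input records: "
                    + c.findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue());
            System.out.println("Map output bytes:  "
                    + c.findCounter(TaskCounter.MAP_OUTPUT_BYTES).getValue());
            System.out.println("Spilled records:   "
                    + c.findCounter(TaskCounter.SPILLED_RECORDS).getValue());
        }
    }

If Spilled Records comes out near 3x the map output records, that should be
the multi-pass merge at work, since each spill and re-merge of a record
increments the counter again.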