On 11/18/08 6:36 PM, "Paco NATHAN" <[EMAIL PROTECTED]> wrote:
> Thank you, Devaraj -
> That explanation helps a lot.
>
> Is the following reasonable to say?
>
> Combine input records count shown in the Map phase column of the
> report is a measure of how many times records have passed through the
> Combiner during merges of intermediate spills. Therefore, it may be
> larger than the actual count of records which are being merged.
>
>
Yes, but to be precise you should say "sorts and merges" instead of just
"merges". As you may know, the map task sorts its output buffer whenever it
has collected enough data, and the combiner runs at that point too; the
records that get spilled to disk are the ones the combiner outputs.
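
To make the counting concrete, here is a toy Python model of the map side.
It is only a sketch and not Hadoop's actual code: the names
simulate_map_side, buffer_size and merge_factor are made up for
illustration, the combiner is assumed to be a word-count style sum, and the
real spill and merge policy differs in detail. It shows how "Combine input
records" can end up larger than "Map output records".

```python
from collections import Counter


def combine(records, counters):
    """Hypothetical sum-per-key combiner that also counts records in and out."""
    totals = Counter()
    n_in = 0
    for key, value in records:
        totals[key] += value
        n_in += 1
    counters["combine_input"] += n_in
    counters["combine_output"] += len(totals)
    return sorted(totals.items())


def simulate_map_side(map_output, buffer_size, merge_factor):
    """Toy model: sort-and-spill with a combiner, then combine again on merges."""
    if merge_factor < 2:
        raise ValueError("merge_factor must be at least 2")
    counters = Counter(map_output=len(map_output))

    # Sort phase: every full buffer is sorted, combined and written as a spill.
    spills = [
        combine(sorted(map_output[i:i + buffer_size]), counters)
        for i in range(0, len(map_output), buffer_size)
    ]

    # Merge phase: merge at most merge_factor spills at a time, running the
    # combiner again, until a single spill is left.
    while len(spills) > 1:
        group, spills = spills[:merge_factor], spills[merge_factor:]
        merged = [record for spill in group for record in spill]
        spills.append(combine(merged, counters))

    counters["final_records"] = len(spills[0]) if spills else 0
    return counters


if __name__ == "__main__":
    # 100 map output records over 5 distinct keys (made-up data).
    records = [(f"k{i % 5}", 1) for i in range(100)]
    print(dict(simulate_map_side(records, buffer_size=10, merge_factor=3)))
```

In this toy run the combiner sees 170 input records for only 100 map output
records, because already-combined records go through it again on every
merge pass.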
> Paco
>
>
>> On the map side, the combiner is called after sort and during the merges of
>> the intermediate spills. At the end a single spill file is generated. Note
>> that, during the merges, the same record may pass multiple times through the
>> combiner.
>
> On Mon, Nov 17, 2008 at 23:04, Devaraj Das <[EMAIL PROTECTED]> wrote:
>>
>>
>>
>> On 11/18/08 3:59 AM, "Paco NATHAN" <[EMAIL PROTECTED]> wrote:
>>
>>> Could someone please help explain the job counters shown for Combine
>>> records on the JobTracker JSP page?
>>>
>>> Here's an example from one of our MR jobs. There are Combine input
>>> and output record counters shown for both Map phase and Reduce phase.
>>> We're not quite sure how to interpret them -
>>>
>>> Map Phase:
>>> Map input records 85,013,261,279
>>> Map output records 85,013,261,279
>>> Combine input records 114,936,724,505
>>> Combine output records 38,750,511,975
>>>
>>> Reduce Phase:
>>> Combine input records 8,827,017,275
>>> Combine output records 17,986,654
>>> Reduce input groups 2,221,796
>>> Reduce input records 17,986,654
>>> Reduce output records 4,443,590
>>>
>>>
>>> What makes sense:
>>> * Considering the MR job and its data, the 85.0b count for Map
>>> output records is expected
>>> * I would believe a rate of 85.0b / 38.8b = 2.2 for our combiner
>>> * Reduce phase shows Combine output records at 18.0m = Reduce input
>>> records at 18.0m
>>> * Reduce input groups at 2.2m is expected
>>> * Reduce output records at 4.4m is verified
>>>
>>> What doesn't make sense:
>>> * The 115b count for Combine input records during Map phase
>>> * The 8.8b count for Combine input records during Reduce phase
>>>
>>
>> On the map side, the combiner is called after sort and during the merges of
>> the intermediate spills. At the end a single spill file is generated. Note
>> that, during the merges, the same record may pass multiple times through the
>> combiner.
>> On the reducer side, the combiner is called only during merges of
>> intermediate data, and the intermediate merges stop at a certain point
>> (once <= io.sort.factor files remain). Hence the combiner may be called
>> fewer times here...
>>
>>> What would be the actual count of records coming out of the Map phase?
>>>
>>> Thanks,
>>> Paco
>>
>>
>>
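
For reference, the ratios implied by the counters quoted in this thread can
be checked with a few lines of Python. The variable names are made up for
illustration, and the reading of each ratio is only an interpretation based
on the explanation above, not something the counters state directly.

```python
# Counter values copied from the JobTracker report quoted in this thread.
map_output = 85_013_261_279
map_combine_in = 114_936_724_505
map_combine_out = 38_750_511_975
reduce_combine_in = 8_827_017_275
reduce_combine_out = 17_986_654

# Records seen by the map-side combiner per map output record. A value
# above 1 is consistent with records passing through the combiner again
# during merges of spills.
map_side_passes = map_combine_in / map_output

# The ratio quoted in the original question (85.0b / 38.8b).
map_out_per_combine_out = map_output / map_combine_out

# Input records per output record for the reduce-side combiner.
reduce_side_ratio = reduce_combine_in / reduce_combine_out

print(f"map-side combine input / map output:   {map_side_passes:.2f}")
print(f"map output / map-side combine output:  {map_out_per_combine_out:.2f}")
print(f"reduce-side combine input / output:    {reduce_side_ratio:.0f}")
```

The first ratio comes out at roughly 1.35, the second at roughly 2.19
(the 2.2 mentioned above), and the third at roughly 491.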