I am not quite sure what you mean by "this".

If you mean that the second approach is only an approximation, then you are
correct.

The only simple, correct algorithm that I know of is to compute the counts
correctly in a first pass and then do the main processing with a kill list.
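To make that concrete, here is a minimal Python simulation of the two-pass
approach, not actual Hadoop code; the cutoff value and the record format are
assumptions for illustration:

```python
# Sketch of the two-pass algorithm: pass 1 is a word count, pass 2
# filters using the counts as a kill list. CUTOFF is a hypothetical
# threshold above which keys are suppressed.
from collections import Counter

CUTOFF = 2

def pass_one(records):
    """First MR job: word-count pattern, results stored in a file."""
    return Counter(records)

def pass_two(records, counts):
    """Second MR job: each task loads the counts and skips killed keys."""
    kill = {k for k, c in counts.items() if c > CUTOFF}
    return [r for r in records if r not in kill]

data = ["a", "b", "a", "c", "a", "b"]
counts = pass_one(data)
print(pass_two(data, counts))  # "a" occurs 3 times and is suppressed
```

Because the counts come from a separate global pass, this version gives the
exact answer even when the cutoff depends on the total count of each key.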

On 4/16/08 9:04 PM, "Amar Kamat" <[EMAIL PROTECTED]> wrote:

> Ted Dunning wrote:
>> The easiest solution is to not worry too much about running an extra MR
>> step.
>> 
>> So,
>> 
>> - run a first pass to get the counts.  Use word count as the pattern.  Store
>> the results in a file.
>> 
>> - run the second pass.  You can now read the hash-table from the file you
>> stored in pass 1.
>> 
>> Another approach is to do the counting in your maps as specified and then
>> before exiting, you can emit special records for each key to suppress.  With
>> the correct sort and partition functions, you can make these killer records
>> appear first in the reduce input.  Then, if your reducer sees the kill flag
>> in the front of the values, it can avoid processing any extra data.
>> 
>>   
> Ted,
> Will this work for the case where the cutoff frequency/count requires a
> global picture? I guess not.
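For reference, the killer-record trick from the quote can be simulated in
plain Python as below; this is a sketch, not Hadoop API. The KILL sentinel
and the use of a plain sort in place of Hadoop's custom sort/partition
functions are assumptions:

```python
# Each mapper counts only its own input, then emits a kill record for
# each over-cutoff key before exiting. The reducer sorts values so the
# kill flag appears first and skips the key if it is present. Because
# counts are per-mapper, this is only an approximation of a global cutoff.
from collections import defaultdict

KILL = ""  # sentinel chosen to sort before any real (non-empty) value

def map_phase(mapper_records, cutoff):
    out = []
    counts = defaultdict(int)
    for key, value in mapper_records:
        counts[key] += 1
        out.append((key, value))
    for key, count in counts.items():
        if count > cutoff:
            out.append((key, KILL))  # emit the killer record
    return out

def reduce_phase(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    results = {}
    for key, values in groups.items():
        values.sort()  # stands in for the custom sort: kill flag first
        if values and values[0] == KILL:
            continue  # kill flag at the front: skip all extra processing
        results[key] = values
    return results
```

For example, with cutoff 2, `reduce_phase(map_phase([("a", "x"), ("a", "y"),
("a", "z"), ("b", "w")], 2))` keeps only key `"b"`, since `"a"` exceeded the
cutoff within that single mapper.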