Note that picking the top 10 from the output of N reducers, each of which
produces its own top 10, works; but picking the top 10 in a single reducer
that consumes the output of combiners which each keep only the top 10 they
see does not work correctly.

Using combiners to limit the data volume is a great thing, but it has to be
done carefully, because different combiners will each see only part of the
data for the same key, so a combiner's local top 10 is based on incomplete
counts.
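
As a concrete illustration of that first approach, here is a rough,
hypothetical sketch of the final selection step (the class name, the
"keyword<TAB>count" line format, and reading the reducer part files from
paths given on the command line are all assumptions, not part of the job
itself); it merges the per-reducer top-10 outputs and keeps the global top 10:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Comparator;
    import java.util.PriorityQueue;

    // Hypothetical post-processing step: merge the per-reducer top-10 files
    // (lines of the form "keyword<TAB>count") and print the 10 largest overall.
    public class MergeTopTen {
       public static void main(String[] args) throws Exception {
          // Min-heap ordered by count; peek() is the smallest of the current top 10.
          PriorityQueue<String[]> top = new PriorityQueue<String[]>(11,
                new Comparator<String[]>() {
                   public int compare(String[] a, String[] b) {
                      return Integer.valueOf(a[1]).compareTo(Integer.valueOf(b[1]));
                   }
                });
          for (String file : args) {
             BufferedReader in = new BufferedReader(new FileReader(file));
             String line;
             while ((line = in.readLine()) != null) {
                top.add(line.split("\t"));
                if (top.size() > 10) {
                   top.poll();          // drop the smallest count seen so far
                }
             }
             in.close();
          }
          while (!top.isEmpty()) {      // prints in ascending order of count
             String[] entry = top.poll();
             System.out.println(entry[0] + "\t" + entry[1]);
          }
       }
    }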


On 2/5/08 11:44 AM, "Stu Hood" <[EMAIL PROTECTED]> wrote:

> That structure will not be synchronized between reducers. With an output that
> small, your options are either to use a single reducer, in which case the 10
> results that end up in your OrderedSet are the answer to your question, or to
> use multiple reducers and then do post-processing to pick the top 10 from the
> top 10 * (# of reducers).
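> 
> For the single-reducer option, a minimal driver sketch might look like the
> following (KeywordCountMapper and TopKeywords are made-up placeholder names,
> and TopNReducer is the reducer sketched in the quoted message below):
> 
>    import org.apache.hadoop.fs.Path;
>    import org.apache.hadoop.io.IntWritable;
>    import org.apache.hadoop.io.Text;
>    import org.apache.hadoop.mapred.FileInputFormat;
>    import org.apache.hadoop.mapred.FileOutputFormat;
>    import org.apache.hadoop.mapred.JobClient;
>    import org.apache.hadoop.mapred.JobConf;
> 
>    public class TopKeywords {
>       public static void main(String[] args) throws Exception {
>          JobConf conf = new JobConf(TopKeywords.class);
>          conf.setJobName("top-keywords");
>          conf.setOutputKeyClass(Text.class);
>          conf.setOutputValueClass(IntWritable.class);
>          conf.setMapperClass(KeywordCountMapper.class);  // hypothetical count mapper
>          conf.setReducerClass(TopNReducer.class);        // reducer sketched below
>          // A single reducer sees every key, so its accumulated top 10 is global.
>          conf.setNumReduceTasks(1);
>          FileInputFormat.setInputPaths(conf, new Path(args[0]));
>          FileOutputFormat.setOutputPath(conf, new Path(args[1]));
>          JobClient.runJob(conf);
>       }
>    }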
> 
> Thanks,
> Stu
> 
> 
> -----Original Message-----
> From: Tarandeep Singh <[EMAIL PROTECTED]>
> Sent: Tuesday, February 5, 2008 2:30pm
> To: core-user@hadoop.apache.org
> Subject: Re: hadoop: how to find top N frequently occurring words
> 
> On Feb 4, 2008 3:30 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>> 
>> Approximately this:
>> 
>>    import java.io.IOException;
>>    import java.io.PrintWriter;
>>    import java.util.Iterator;
>>    import java.util.SortedSet;
>>    import java.util.TreeSet;
>>    import org.apache.hadoop.fs.FileSystem;
>>    import org.apache.hadoop.fs.Path;
>>    import org.apache.hadoop.io.IntWritable;
>>    import org.apache.hadoop.io.Text;
>>    import org.apache.hadoop.mapred.*;
>> 
>>    // KeyWordIntegerPair is assumed to be a small (keyword, count) helper class
>>    // that implements Comparable and orders primarily by count.
>>    public class TopNReducer extends MapReduceBase
>>          implements Reducer<Text, IntWritable, Text, IntWritable> {
>>       // Ordered ascending by count, so first() is the smallest of the top 10.
>>       private SortedSet<KeyWordIntegerPair> top =
>>             new TreeSet<KeyWordIntegerPair>();
>>       private FileSystem fs;
>> 
>>       public void configure(JobConf conf) {
>>          try {
>>             fs = FileSystem.get(conf);
>>          } catch (IOException e) {
>>             throw new RuntimeException(e);
>>          }
>>       }
>> 
>>       public void reduce(Text keyword, Iterator<IntWritable> counts,
>>             OutputCollector<Text, IntWritable> out, Reporter reporter)
>>             throws IOException {
>>          int sum = 0;
>>          while (counts.hasNext()) {
>>             sum += counts.next().get();
>>          }
>> 
>>          if (top.size() < 10 || sum > top.first().count()) {
>>             // Copy the key; Hadoop reuses the Text instance between calls.
>>             top.add(new KeyWordIntegerPair(new Text(keyword), sum));
>>          }
>> 
>>          // Trim back to the 10 largest counts.
>>          while (top.size() > 10) {
>>             top.remove(top.first());
>>          }
>>       }
>> 
>>       public void close() throws IOException {
>>          // The OutputCollector is not available here, so write a side file.
>>          PrintWriter out = new PrintWriter(fs.create(new Path("top-counts")));
>>          for (KeyWordIntegerPair v : top) {
>>             out.printf("%s\t%d%n", v.keyword(), v.count());
>>          }
>>          out.close();
>>       }
>>    }
>> 
> 
> Correct me if I am wrong ...
> Here the Reducer class is using an OrderedSet data structure. The JobTracker
> (the master) schedules the reduce tasks on slave nodes, which means each
> slave node has its own copy of that data structure. If that is correct,
> then this OrderedSet is local to each slave node, and I don't think it is
> going to work.
> 
> Or is this data structure global? In that case, do I have to take care of
> synchronization?
> 
> thanks,
> Taran
> 
>> You will have to fix the errors I made in typing this off the cuff, of
>> course.
>> 
>> 
>> 
>> 
>> On 2/4/08 3:19 PM, "Tarandeep Singh" <[EMAIL PROTECTED]> wrote:
>> 
>>> On Feb 4, 2008 3:07 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>>>> 
>>>> Yes, you can do it in one pass, but the reducer will have to accumulate the
>>>> top N items.  If N is small, this is no problem.  If N is large, that is a
>>>> problem.  You also have the problem that the reducer has a close method
>>>> where it can output the accumulated data, but it can't go into the normal
>>>> output channel because you don't have access to the output collector at
>>>> that
>>>> point.
>>>> 
>>> 
>>> Could you elaborate on this a bit?
>>> My log file looks like this -
>>> 
>>> keyword  source  dateID
>>> 
>>> Right now my mapper outputs the following as a key-value pair:
>>> keyword_source_dateID - 1
>>> 
>>> The reducer counts the 1s and outputs:
>>> keyword_source_dateID  frequency
>>> 
>>> So it is just the word count program so far. I have another program
>>> that identifies the top N keywords. Please tell me how I can modify
>>> the reducer to accumulate the top N items. N is small, e.g. 10.
>>> 
>>> thanks,
>>> Taran
>>> 
>>>> In practical situations, your count reducer can eliminate items with very
>>>> small counts that you know cannot be in the top N.  This makes the total
>>>> output much smaller than the input.  This means that making a second pass
>>>> over the data costs very little.  Even without the threshold, the second
>>>> pass will likely be so much faster than the first that it doesn't matter.
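>>>> 
>>>> A rough sketch of that thresholding, written as a variant of the first-pass
>>>> count reducer (MIN_COUNT here is an arbitrary, job-specific assumption):
>>>> 
>>>>    import java.io.IOException;
>>>>    import java.util.Iterator;
>>>>    import org.apache.hadoop.io.IntWritable;
>>>>    import org.apache.hadoop.io.Text;
>>>>    import org.apache.hadoop.mapred.*;
>>>> 
>>>>    public class ThresholdCountReducer extends MapReduceBase
>>>>          implements Reducer<Text, IntWritable, Text, IntWritable> {
>>>>       // Keys with totals below this (assumed) threshold are presumed unable
>>>>       // to make the top N, so they are dropped from the first-pass output.
>>>>       private static final int MIN_COUNT = 100;
>>>> 
>>>>       public void reduce(Text key, Iterator<IntWritable> values,
>>>>             OutputCollector<Text, IntWritable> output, Reporter reporter)
>>>>             throws IOException {
>>>>          int sum = 0;
>>>>          while (values.hasNext()) {
>>>>             sum += values.next().get();
>>>>          }
>>>>          if (sum >= MIN_COUNT) {
>>>>             output.collect(key, new IntWritable(sum));
>>>>          }
>>>>       }
>>>>    }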
>>>> 
>>>> If you are counting things that satisfy Zipf's law, then the counts will be
>>>> proportional to 1/r where r is the rank of the item.  Using this, you can
>>>> show that the average count for your keywords will be
>>>> 
>>>>      E(k) = N H_m,2 / (H_m)^2
>>>> 
>>>> where N is the total number of words counted, m is the size of the possible
>>>> vocabulary, H_m is the m-th harmonic number (approximately log m), and H_m,2
>>>> is the m-th second-order harmonic number (approximately 1.6).
>>>> 
>>>> This means that you should have a compression of approximately
>>>> 
>>>>      1.6 / log(m)^2
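>>>> 
>>>> For example, taking log as the natural log (since H_m ~ ln m) and assuming an
>>>> illustrative vocabulary of m = 100,000:
>>>> 
>>>>      1.6 / ln(100000)^2  =  1.6 / 11.5^2  ~  0.012
>>>> 
>>>> so the second pass would see only about 1% of the original data volume.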
>>>> 
>>>> 
>>>> 
>>>> On 2/4/08 2:20 PM, "Tarandeep Singh" <[EMAIL PROTECTED]> wrote:
>>>> 
>>>>> On Feb 4, 2008 2:11 PM, Miles Osborne <[EMAIL PROTECTED]> wrote:
>>>>>> This is exactly the same as word counting, except that you have a second
>>>>>> pass to find the top n per block of data (this can be done in a mapper)
>>>>>> and
>>>>>> then a reducer can quite easily merge the results together.
>>>>>> 
>>>>> 
>>>>> This would mean I have to write a second program that reads the output
>>>>> of the first one and does the job. I was wondering if it could be done in
>>>>> one program.
>>>> 
>>>> 
>> 
>> 
> 
> 
