That structure will not be synchronized between reducers. With an output
that small, your options are either to use a single reducer, in which case
the 10 results that end up in your SortedSet are the answer to your
question, or to use multiple reducers and then do a post-processing pass to
pick the overall top 10 from the top 10 * (# of reducers) candidates.
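
A rough sketch of that merge step, assuming each reducer wrote plain
"keyword<TAB>count" lines to its own file (the class name and file handling
here are illustrative, not tested):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.Comparator;
    import java.util.TreeSet;

    // Plain local program, not a MapReduce job: merge the per-reducer
    // top-10 files given on the command line and print the global top 10.
    public class MergeTopN {
        public static void main(String[] args) throws IOException {
            // Entries are {keyword, count}, ordered ascending by count,
            // with ties broken on the keyword so equal counts don't collide.
            TreeSet<String[]> top = new TreeSet<String[]>(new Comparator<String[]>() {
                public int compare(String[] a, String[] b) {
                    long ca = Long.parseLong(a[1]), cb = Long.parseLong(b[1]);
                    if (ca != cb) return ca < cb ? -1 : 1;
                    return a[0].compareTo(b[0]);
                }
            });
            for (String file : args) {                // one file per reducer
                BufferedReader in = new BufferedReader(new FileReader(file));
                for (String line = in.readLine(); line != null; line = in.readLine()) {
                    top.add(line.split("\t"));        // "keyword<TAB>count"
                    if (top.size() > 10) {
                        top.remove(top.first());      // evict the smallest
                    }
                }
                in.close();
            }
            for (String[] e : top.descendingSet()) {  // largest first
                System.out.println(e[0] + "\t" + e[1]);
            }
        }
    }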

Thanks,
Stu


-----Original Message-----
From: Tarandeep Singh <[EMAIL PROTECTED]>
Sent: Tuesday, February 5, 2008 2:30pm
To: core-user@hadoop.apache.org
Subject: Re: hadoop: how to find top N frequently occurring words

On Feb 4, 2008 3:30 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
> Approximately this:
>
>    import java.io.IOException;
>    import java.io.PrintWriter;
>    import java.util.Iterator;
>    import java.util.SortedSet;
>    import java.util.TreeSet;
>    import org.apache.hadoop.fs.FileSystem;
>    import org.apache.hadoop.fs.Path;
>    import org.apache.hadoop.io.IntWritable;
>    import org.apache.hadoop.io.Text;
>    import org.apache.hadoop.mapred.*;
>
>    // Each reducer instance keeps its own running top 10 and writes it
>    // out in close(); with multiple reducers you get multiple files.
>    public class TopNReducer extends MapReduceBase
>           implements Reducer<Text, IntWritable, Text, IntWritable> {
>       // Ordered ascending by count, so first() is the smallest survivor.
>       SortedSet<KeyWordIntegerPair> top =
>             new TreeSet<KeyWordIntegerPair>();
>       FileSystem fs;
>
>       public void configure(JobConf conf) {
>          try {
>             fs = FileSystem.get(conf);
>          } catch (IOException e) {
>             throw new RuntimeException(e);
>          }
>       }
>
>       public void reduce(Text keyword, Iterator<IntWritable> counts,
>             OutputCollector<Text, IntWritable> out, Reporter reporter) {
>          int sum = 0;
>          while (counts.hasNext()) {
>             sum += counts.next().get();
>          }
>
>          // Keep this word only if we have fewer than 10 entries or its
>          // count beats the smallest one retained so far.
>          if (top.size() < 10 || sum > top.first().count()) {
>             // The framework reuses the key object, so copy it.
>             top.add(new KeyWordIntegerPair(new Text(keyword), sum));
>          }
>
>          // Evict the smallest entries until only 10 remain.
>          while (top.size() > 10) {
>             top.remove(top.first());
>          }
>       }
>
>       public void close() throws IOException {
>          PrintWriter out = new PrintWriter(fs.create(new Path("top-counts")));
>          for (KeyWordIntegerPair v : top) {
>             out.printf("%s\t%d\n", v.keyword(), v.count());
>          }
>          out.close();
>       }
>    }
>
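
For the TreeSet logic above to work, KeyWordIntegerPair has to be
Comparable, ordered ascending by count with ties broken on the keyword
(otherwise two words with equal counts collide in the set). The class is
not shown in the mail above; a minimal guess at it:

    import org.apache.hadoop.io.Text;

    // Guessed helper class for the reducer sketch above; not part of the
    // original mail.
    public class KeyWordIntegerPair implements Comparable<KeyWordIntegerPair> {
        private final Text keyword;
        private final int count;

        public KeyWordIntegerPair(Text keyword, int count) {
            this.keyword = keyword;
            this.count = count;
        }

        public Text keyword() { return keyword; }
        public int count()    { return count; }

        // Ascending by count so TreeSet.first() is the smallest survivor;
        // ties broken on the keyword so equal counts are kept, not merged.
        public int compareTo(KeyWordIntegerPair other) {
            if (count != other.count) {
                return count < other.count ? -1 : 1;
            }
            return keyword.compareTo(other.keyword);
        }
    }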

Correct me if I am wrong ...
Here the Reducer class is using a SortedSet data structure. The JobTracker
(the master) schedules reduce tasks on the slave nodes, so each slave node
has its own instance of that data structure. If this is correct, then the
SortedSet is local to each slave node, and I don't think it is going to
work.

Or is this data structure global? In that case, do I have to take
care of synchronization?

thanks,
Taran

> You will have to fix the errors I made in typing this off the cuff, of
> course.
>
>
>
>
> On 2/4/08 3:19 PM, "Tarandeep Singh" <[EMAIL PROTECTED]> wrote:
>
> > On Feb 4, 2008 3:07 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> >>
> >> Yes, you can do it in one pass, but the reducer will have to accumulate the
> >> top N items.  If N is small, this is no problem.  If N is large, that is a
> >> problem.  You also have the problem that the reducer has a close method
> >> where it can output the accumulated data, but it can't go into the normal
> >> output channel because you don't have access to the output collector
> >> at that point.
> >>
> >
> > Could you elaborate on this a bit?
> > My log file looks like this -
> >
> > keyword  source  dateID
> >
> > Right now my mapper outputs the following as a key-value pair:
> > keyword_source_dateID - 1
> >
> > The reducer counts the 1s and outputs:
> > keyword_source_dateID  frequency
> >
> > So it is just the word count program so far. I have another program
> > that identifies the top N keywords. Please tell me how I can modify
> > the reducer to accumulate the top N items. N is small, e.g. 10.
> >
> > thanks,
> > Taran
> >
> >> In practical situations, your count reducer can eliminate items with very
> >> small counts that you know cannot be in the top N.  This makes the total
> >> output much smaller than the input.  This means that making a second pass
> >> over the data costs very little.  Even without the threshold, the second
> >> pass will likely be so much faster than the first that it doesn't matter.
> >>
> >> If you are counting things that satisfy Zipf's law, then the counts will
> >> be proportional to 1/r, where r is the rank of the item.  Using this, you
> >> can show that the average count for your keywords will be
> >>
> >>      E(k) = N * H_{m,2} / (H_m)^2
> >>
> >> where N is the total number of words counted, m is the size of the
> >> possible vocabulary, H_m is the mth harmonic number (approximately
> >> log m), and H_{m,2} is the mth second-order harmonic number
> >> (approximately 1.6).
> >>
> >> This means that you should see a compression ratio of approximately
> >>
> >>      1.6 / log(m)^2
> >>
> >>
> >>
> >> On 2/4/08 2:20 PM, "Tarandeep Singh" <[EMAIL PROTECTED]> wrote:
> >>
> >>> On Feb 4, 2008 2:11 PM, Miles Osborne <[EMAIL PROTECTED]> wrote:
> >>>> This is exactly the same as word counting, except that you have a
> >>>> second pass to find the top n per block of data (this can be done in
> >>>> a mapper) and then a reducer can quite easily merge the results
> >>>> together.
> >>>>
> >>>
> >>> This would mean I have to write a second program that reads the output
> >>> of the first and does the job. I was wondering if it could be done in
> >>> one program.
> >>
> >>
>
>
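
As a quick numeric check on the 1.6 / log(m)^2 estimate quoted above (the
vocabulary size m below is made up for illustration):

    // Back-of-the-envelope check: H_m ~ log(m), H_{m,2} ~ pi^2/6 ~ 1.645.
    public class ZipfEstimate {
        public static void main(String[] args) {
            long m = 1000000L;                   // hypothetical vocabulary size
            double hm = Math.log(m);             // H_m, about 13.8 here
            double hm2 = Math.PI * Math.PI / 6;  // H_{m,2}, about 1.645
            double ratio = hm2 / (hm * hm);
            // Prints about 0.0086: the counted output would be under 1%
            // of the raw input under these assumptions.
            System.out.printf("estimated compression: %.4f%n", ratio);
        }
    }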

