Note that picking the top 10 from the output of N reducers, each of which produces its own 10 outputs, works correctly; picking the top 10 in a single reducer that eats the output of combiners which each pick the top 10 they see does not.
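A minimal sketch of that final selection step is below, assuming each of the N reducers has written its local top 10 as tab-separated keyword/count lines and that those files have been pulled out of HDFS (for example with "hadoop fs -get"); the MergeTopTen class name and the exact line format are illustrative assumptions, not something specified in this thread.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Post-processing sketch: with R reducers there are only 10 * R candidate
// lines, so it is enough to read them all, sort by count, and print the
// first 10. Assumes each input file holds "keyword<TAB>count" lines.
public class MergeTopTen {
    public static void main(String[] args) throws IOException {
        List<String[]> candidates = new ArrayList<String[]>();
        for (String file : args) {                      // one top-10 file per reducer
            BufferedReader in = new BufferedReader(new FileReader(file));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    candidates.add(line.split("\t"));   // [keyword, count]
                }
            } finally {
                in.close();
            }
        }
        // Sort the candidates with the highest count first.
        Collections.sort(candidates, new Comparator<String[]>() {
            public int compare(String[] a, String[] b) {
                return Long.valueOf(b[1]).compareTo(Long.valueOf(a[1]));
            }
        });
        int n = Math.min(10, candidates.size());
        for (int i = 0; i < n; i++) {
            System.out.printf("%s\t%s%n", candidates.get(i)[0], candidates.get(i)[1]);
        }
    }
}

Because each key is handled by exactly one reducer, every candidate carries its complete global count, so this merge is exact; per-combiner top-10 lists give no such guarantee, since several combiners can each see only part of the counts for the same key.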
Using the combiners to limit the data volume is a great thing, but it has to be done carefully, because different combiners will see the same key.

On 2/5/08 11:44 AM, "Stu Hood" <[EMAIL PROTECTED]> wrote:

> That structure will not be synchronized between reducers. With an output that
> small, your options are either to use a single reducer, in which case the 10
> results that end up in your OrderedSet are the answer to your question, or to
> use multiple reducers and then do post-processing to pick the top 10 from the
> top 10 * (# of reducers).
>
> Thanks,
> Stu
>
> -----Original Message-----
> From: Tarandeep Singh <[EMAIL PROTECTED]>
> Sent: Tuesday, February 5, 2008 2:30pm
> To: core-user@hadoop.apache.org
> Subject: Re: hadoop: how to find top N frequently occurring words
>
> On Feb 4, 2008 3:30 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>>
>> Approximately this:
>>
>>   TopNReducer extends MapReduceBase
>>       implements Reducer<Text, IntWritable, Text, IntWritable> {
>>     OrderedSet<KeyWordIntegerPair> top =
>>         new TreeSet<KeyWordIntegerPair>();
>>     FileSystem fs;
>>
>>     void configure(JobConf conf) {
>>       fs = FileSystem.get(conf);
>>     }
>>
>>     void reduce(Text keyword, IntWritable counts,
>>         OutputCollector<Text, IntWritable> out, Reporter reporter) {
>>       int sum = 0;
>>       while (counts.hasNext) {
>>         sum += counts.next();
>>       }
>>
>>       if (top.size() < 10 || sum > top.first().getCount()) {
>>         top.add(new KeyWordIntegerPair(keyword, sum);
>>       }
>>
>>       while (top.size() > 10) {
>>         top.remove(0);
>>       }
>>     }
>>
>>     void close() {
>>       PrintWriter out = new PW(fs.create(new Path("top-counts")));
>>       for (v : top) {
>>         out.printf("%s\t%d\n", v.keyword(), v.count());
>>       }
>>     }
>>   }
>>
>
> Correct me if I am wrong ...
> Here the Reducer class is using an OrderedSet data structure. The JobTracker
> (the master) schedules reduce tasks on slave nodes, so each slave node has its
> own copy of this data structure. If that is correct, then this OrderedSet is
> local to the slave node, and I don't think it is going to work.
>
> Or is this data structure global? In that case, do I have to take care of
> synchronization?
>
> thanks,
> Taran
>
>> You will have to fix the errors I made in typing this off the cuff, of
>> course.
>>
>> On 2/4/08 3:19 PM, "Tarandeep Singh" <[EMAIL PROTECTED]> wrote:
>>
>>> On Feb 4, 2008 3:07 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>>>>
>>>> Yes, you can do it in one pass, but the reducer will have to accumulate the
>>>> top N items. If N is small, this is no problem. If N is large, that is a
>>>> problem. You also have the problem that the reducer has a close method
>>>> where it can output the accumulated data, but it can't go into the normal
>>>> output channel because you don't have access to the output collector at
>>>> that point.
>>>>
>>>
>>> Could you elaborate on this a bit?
>>> My log file looks like this -
>>>
>>>   keyword source dateID
>>>
>>> Right now my mapper outputs the following as a key-value pair:
>>>
>>>   keyword_source_dateID - 1
>>>
>>> The reducer counts the 1s and outputs:
>>>
>>>   keyword_source_dateID frequency
>>>
>>> So it is just the word count program so far. I have another program
>>> that identifies the top N keywords. Please tell me how I can modify
>>> the reducer to accumulate the top N items. N is small, e.g. 10.
>>>
>>> thanks,
>>> Taran
>>>
>>>> In practical situations, your count reducer can eliminate items with very
>>>> small counts that you know cannot be in the top N. This makes the total
>>>> output much smaller than the input.
>>>> This means that making a second pass over the data costs very little. Even
>>>> without the threshold, the second pass will likely be so much faster than
>>>> the first that it doesn't matter.
>>>>
>>>> If you are counting things that satisfy Zipf's law, then the counts will be
>>>> proportional to 1/r, where r is the rank of the item. Using this, you can
>>>> show that the average count for your keywords will be
>>>>
>>>>   E(k) = N * H_{m,2} / (H_m)^2
>>>>
>>>> where N is the total number of words counted, m is the possible vocabulary,
>>>> H_m is the mth harmonic number (approximately log m) and H_{m,2} is the mth
>>>> second-order harmonic number (approximately 1.6).
>>>>
>>>> This means that you should have a compression of approximately
>>>>
>>>>   1.6 / log(m)^2
>>>>
>>>> On 2/4/08 2:20 PM, "Tarandeep Singh" <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> On Feb 4, 2008 2:11 PM, Miles Osborne <[EMAIL PROTECTED]> wrote:
>>>>>> This is exactly the same as word counting, except that you have a second
>>>>>> pass to find the top n per block of data (this can be done in a mapper),
>>>>>> and then a reducer can quite easily merge the results together.
>>>>>>
>>>>>
>>>>> This would mean I have to write a second program that reads the output
>>>>> of the first and does the job. I was wondering if it could be done in
>>>>> one program.
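As a rough illustration of the compression estimate above: plugging a vocabulary of m ≈ 100,000 distinct keys into 1.6 / log(m)^2 (reading log as the natural log, as in the H_m approximation) gives about 1.6 / 132, i.e. counted output on the order of 1% of the raw input.

Ted's TopNReducer above is explicitly an off-the-cuff sketch and will not compile as written. Below is one hedged reconstruction against the old org.apache.hadoop.mapred API of that era; the KeywordCount helper class, the fixed N = 10, the per-task output file name, and the use of the mapred.task.id property are illustrative choices rather than part of the original post.

import java.io.IOException;
import java.io.PrintWriter;
import java.util.Iterator;
import java.util.TreeSet;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class TopNReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    private static final int N = 10;

    // Smallest count sorts first, so the element to evict is always top.first().
    private static final class KeywordCount implements Comparable<KeywordCount> {
        final String keyword;
        final int count;
        KeywordCount(String keyword, int count) {
            this.keyword = keyword;
            this.count = count;
        }
        public int compareTo(KeywordCount o) {
            if (count != o.count) {
                return count < o.count ? -1 : 1;
            }
            return keyword.compareTo(o.keyword);   // keep distinct keywords with equal counts
        }
    }

    private final TreeSet<KeywordCount> top = new TreeSet<KeywordCount>();
    private FileSystem fs;
    private String taskId;

    public void configure(JobConf conf) {
        try {
            fs = FileSystem.get(conf);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        // Suffix the output file with the task id so concurrent reduce tasks
        // do not overwrite each other's file (assumes the old mapred.task.id property).
        taskId = conf.get("mapred.task.id", "unknown-task");
    }

    public void reduce(Text keyword, Iterator<IntWritable> counts,
            OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (counts.hasNext()) {
            sum += counts.next().get();
        }
        if (top.size() < N || sum > top.first().count) {
            top.add(new KeywordCount(keyword.toString(), sum));
            if (top.size() > N) {
                top.remove(top.first());           // evict the current smallest count
            }
        }
    }

    public void close() throws IOException {
        // close() runs once per reduce task, after all of its keys have been seen.
        PrintWriter out = new PrintWriter(fs.create(new Path("top-counts-" + taskId)));
        try {
            for (KeywordCount kc : top) {          // ascending by count
                out.printf("%s\t%d%n", kc.keyword, kc.count);
            }
        } finally {
            out.close();
        }
    }
}

Each reduce task writes only its own local top 10, so with more than one reducer the final answer still requires the small merge step sketched earlier, or a single reducer.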