You are foregoing the use of a combiner in your solution, so yours is a non-scalable solution. Imagine what happens, without a combiner, when all or most words have the same length: most of the computation goes through one reducer while the others watch -- and you with them.

You can fix it by having the mapper emit (word, 1, 1) (where word can take the special value #n for the overall counts) and the reducer emit (word, wordCount / statistics[k2[0]], statistics[k2[0]]). Then you have to turn the sum into a weighted sum ... I don't even want to go there.

Just do three separate MapReduces and you'll be much better off, with reusable modules:

1) The first MapReduce is a standard word count.
2) The second maps (word, count) to (length, count) and reduces to (length, sum(counts)). Again you can use a combiner, and the data set is smaller to start with.
3) For the final MapReduce, distribute the per-length totals to every mapper via the distributed cache and map (word, count) to (word, count / total for that word's length).
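To make the three-job pipeline concrete, here is a toy in-memory sketch in plain Python -- not real Hadoop code, just each "job" simulated as a function (all function names here are made up for illustration):

```python
from collections import Counter

def word_count(docs):
    # Job 1: standard word count -> {word: count}
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())
    return counts

def length_totals(word_counts):
    # Job 2: map (word, count) -> (len(word), count); reduce by summing
    totals = Counter()
    for word, count in word_counts.items():
        totals[len(word)] += count
    return totals

def normalized(word_counts, totals):
    # Job 3: "totals" plays the role of the distributed cache; each
    # (word, count) maps to (word, count / total for that word's length)
    return {w: c / totals[len(w)] for w, c in word_counts.items()}

docs = ["the cat sat on the mat", "the dog ran"]
wc = word_count(docs)
result = normalized(wc, length_totals(wc))
```

With this input, "the" occurs 3 times out of 8 three-letter words, so result["the"] is 0.375, while "on" is the only two-letter word and gets 1.0.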
Antonio

On Thu, Oct 28, 2010 at 10:03 AM, Ricky Ho <[email protected]> wrote:
> Of course you can use two rounds of map reduce, with the first round
> computing the statistics and the second round computing the percentile.
>
> But I don't think this is better than your solution ... which is the most
> optimal one that I can think of. Here is the pseudo code of your solution:
>
> map(k1, doc) {
>   for each word in doc {
>     k2 = [word.length, "#"]
>     emit(k2, 1)
>     k2 = [word.length, word]
>     emit(k2, 1)
>   }
> }
>
> partition(k2) {
>   k2[0] % NoOfReducers
> }
>
> # key = word length, value = count
> statistics = Hash.new
>
> reduce(k2, listOfCounts) {
>   if k2[1] == "#" {
>     for each count in listOfCounts {
>       statistics[k2[0]] = statistics[k2[0]] + count
>     }
>   } else {
>     wordCount = 0
>     for each count in listOfCounts {
>       wordCount = wordCount + count
>     }
>     emit(k2[1], wordCount / statistics[k2[0]])
>   }
> }
>
> Rgds,
> Ricky
>
> -----Original Message-----
> From: Steve Lewis [mailto:[email protected]]
> Sent: Thursday, October 28, 2010 8:53 AM
> To: common-user
> Subject: Statistics and Early Keys to Reducers
>
> Imagine I have the following problem - I want to call a standard word
> count program, but instead of having the reducer output the word and its
> count, I want it to output the word and the count / (total count of words
> of that length).
>
> The total count of words of a given length - say 1..100 - seen by each
> mapper is known at the end of the map step.
>
> In theory each mapper could send its total to every reducer, and before
> the rest of the reduce step each reducer could compute the grand total.
>
> This requires:
> 1) Statistics are sent with a key which sorts ahead of all others
> 2) Statistics are sent as the mapper is closing
> 3) Somehow each mapper sends statistics with proper keys so a copy is
>    delivered to every reducer
>
> Is this a reasonable approach - are there others?
> What do folks think?
>
> --
> Steven M. Lewis PhD
> 4221 105th Ave Ne
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Institute for Systems Biology
> Seattle WA
