Of course you can use two rounds of MapReduce, with the first round computing the
statistics and the second round computing the percentile.
But I don't think this is better than your solution ... which is the best
one that I can think of. Here is the pseudo code of your solution ...
map(k1, doc) {
  for each word in doc {
    # marker key: "#" sorts ahead of every word of the same length
    k2 = [word.length, "#"]
    emit(k2, 1)
    k2 = [word.length, word]
    emit(k2, 1)
  }
}
# partition on word length only, so the "#" marker key and all words
# of that length land on the same reducer
partition(k2) {
  return k2[0] % NoOfReducers
}
# key = word length, value = total count of words of that length
statistics = Hash.new
reduce(k2, listOfCounts) {
  if k2[1] == "#" {
    # the "#" key sorts ahead of all words of the same length,
    # so this total is known before any word of that length arrives
    statistics[k2[0]] = sum(listOfCounts)
  } else {
    wordCount = 0
    for each count in listOfCounts {
      wordCount = wordCount + count
    }
    emit(k2[1], wordCount / statistics[k2[0]])
  }
}
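To make the flow concrete, here is a hypothetical single-process Python sketch of the same idea (not actual Hadoop code): the map phase emits a `(length, "#")` marker pair per word alongside the `(length, word)` pair, and the reduce phase processes keys in sorted order so each length's total arrives before its words. Function names and the sample sentence are illustrative only.

```python
from collections import defaultdict

def map_phase(doc):
    """Emit two pairs per word: a length-total marker and the word itself."""
    pairs = []
    for word in doc.split():
        pairs.append(((len(word), "#"), 1))   # "#" sorts before any word
        pairs.append(((len(word), word), 1))
    return pairs

def reduce_phase(pairs):
    """Group by key; iterate in sorted order so '#' is seen first."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)

    statistics = {}   # word length -> total words of that length
    results = {}
    for length, word in sorted(grouped):
        counts = grouped[(length, word)]
        if word == "#":
            statistics[length] = sum(counts)
        else:
            results[word] = sum(counts) / statistics[length]
    return results

print(reduce_phase(map_phase("the cat sat on the mat")))
```

In real Hadoop the sorted-iteration trick corresponds to the framework's key sort within a reducer partition, which is why the marker key must compare less than every word.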
Rgds,
Ricky
-----Original Message-----
From: Steve Lewis [mailto:[email protected]]
Sent: Thursday, October 28, 2010 8:53 AM
To: common-user
Subject: Statistics and Early Keys to Reducers
Imagine I have the following problem - I want to run a standard word count
program, but instead of having the reducer output the word and
its count, I want it to output the word and the count / (total count of words
of that length).
The total count of words of a given length - say 1..100 - seen by each mapper
is known at the end of the map step.
In theory each mapper could send its total to every reducer, and before the
rest of the reduce step each reducer could
compute the grand total.
This requires:
1) Statistics are sent with a key which sorts ahead of all others
2) Statistics are sent as the mapper is closing
3) Somehow each mapper sends statistics with proper keys so a copy is
delivered to every reducer
Is this a reasonable approach - are there others?
What do folks think?
--
Steven M. Lewis PhD
4221 105th Ave Ne
Kirkland, WA 98033
206-384-1340 (cell)
Institute for Systems Biology
Seattle WA