Aaron,
I actually do something different from word count: I count all possible
contiguous phrases for every sentence in my corpus. So for instance, if I
have a sentence like "Hello world", my mappers emit:
Hello 1
world 1
Hello world 1
As you can easily see, for longer sentences the number of emitted phrases
grows quadratically with sentence length.
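The enumeration the mappers perform can be sketched in plain Java (Hadoop types omitted; the class and method names here are illustrative, not from the original job):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the phrase enumeration, without Hadoop types:
// emit every contiguous word sequence of the sentence once.
public class PhraseEnumerator {
    public static List<String> phrases(String sentence) {
        String[] words = sentence.split("\\s+");
        List<String> out = new ArrayList<>();
        for (int start = 0; start < words.length; start++) {
            StringBuilder sb = new StringBuilder();
            for (int end = start; end < words.length; end++) {
                if (end > start) sb.append(' ');
                sb.append(words[end]);
                out.add(sb.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A sentence of n words yields n*(n+1)/2 phrases.
        System.out.println(phrases("Hello world")); // [Hello, Hello world, world]
    }
}
```

This is where the quadratic blow-up comes from: a 40-word sentence already produces 820 phrases.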
Hmm. Check your math on the data set size. Your input corpus may be a few
(dozen, hundred) TB, but how many distinct words are there? The output data
set should be at least a thousand times smaller. If you've got the hardware
to do that initial word count step on a few TB of data, the second pass over
the much smaller output should be comparatively cheap.
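A rough back-of-envelope makes the size argument concrete (all numbers below are assumed for illustration, not taken from the thread):

```java
// Illustrative estimate: even a generous vocabulary makes the word-count
// output tiny compared to a multi-TB input corpus.
public class SizeEstimate {
    // Ratio of input size to word-count output size.
    static long ratio(long corpusBytes, long distinctWords, long bytesPerRecord) {
        return corpusBytes / (distinctWords * bytesPerRecord);
    }

    public static void main(String[] args) {
        long corpusBytes = 10L << 40;     // assume a 10 TB input corpus
        long distinctWords = 10_000_000L; // assume 10M distinct words
        long bytesPerRecord = 30;         // word text + count + record overhead
        // The output is tens of thousands of times smaller than the input.
        System.out.println(ratio(corpusBytes, distinctWords, bytesPerRecord));
    }
}
```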
Hello again,
I think I found an answer to my question. If I write a new
WritableComparable class that extends IntWritable and then override the
compareTo method, I can change the sort order from ascending to descending.
That will solve my problem of getting the top 100 most frequent words.
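The core of that idea can be sketched in plain Java without the Hadoop dependency (in the real job you would extend `IntWritable` and override `compareTo` the same way; the class name here is made up):

```java
import java.util.Arrays;

// Sketch of the inverted-comparison trick: a Comparable whose compareTo
// reverses the natural int ordering, so sorting yields descending values.
public class DescendingInt implements Comparable<DescendingInt> {
    final int value;
    DescendingInt(int value) { this.value = value; }

    @Override
    public int compareTo(DescendingInt other) {
        // Arguments swapped relative to the natural ordering.
        return Integer.compare(other.value, this.value);
    }

    public static void main(String[] args) {
        DescendingInt[] counts = {
            new DescendingInt(3), new DescendingInt(42), new DescendingInt(7)
        };
        Arrays.sort(counts); // 42, 7, 3
        for (DescendingInt c : counts) System.out.println(c.value);
    }
}
```

With counts as keys in this descending order, the first 100 records the reducer sees are the top 100 words.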
Hello,
I was wondering whether Hadoop provides thread-safe shared variables that
can be accessed from individual mappers/reducers, along with a proper
locking mechanism. To clarify, let's say that in the word count example I
want to know which word has the highest frequency and how many times it
occurs.
Hi Jim,
The ability to lock shared mutable state is a distinct anti-goal of the
MapReduce paradigm. One of the major benefits of writing MapReduce programs
is knowing that you don't have to worry about deadlock in your code. If
mappers could lock objects, then the failure and recovery semantics of
individual tasks would become far more complicated.
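The lock-free way to get the most frequent word is to let each mapper compute a purely local result and have a single reducer merge them. A plain-Java sketch of that pattern (names and setup are illustrative, not Hadoop API):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Sketch: no shared "current maximum" variable. Each mapper finds its own
// local maximum; one reducer merges the per-mapper candidates. No locks.
public class MaxWithoutLocks {
    // Per-mapper step: most frequent word among this mapper's counts.
    static Map.Entry<String, Integer> localMax(Map<String, Integer> counts) {
        Map.Entry<String, Integer> best = null;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (best == null || e.getValue() > best.getValue()) best = e;
        }
        return best;
    }

    // Reducer step: merge all per-mapper candidates into the global maximum.
    static Map.Entry<String, Integer> globalMax(List<Map.Entry<String, Integer>> candidates) {
        Map.Entry<String, Integer> best = candidates.get(0);
        for (Map.Entry<String, Integer> c : candidates) {
            if (c.getValue() > best.getValue()) best = c;
        }
        return best;
    }

    public static void main(String[] args) {
        Map.Entry<String, Integer> m1 = localMax(Map.of("hello", 4, "world", 2));
        Map.Entry<String, Integer> m2 = localMax(Map.of("foo", 7, "bar", 1));
        System.out.println(globalMax(Arrays.asList(m1, m2))); // foo=7
    }
}
```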
Hi Aaron,
Thanks for the advice. I actually thought of using multiple combiners and a
single reducer, but I was worried that the key sorting phase would be wasted
effort for my purpose. If the input is just a bunch of (word, count) pairs
on the order of terabytes, wouldn't sorting be overkill?
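One way to avoid a full sort, sketched in plain Java (class and method names are illustrative): each combiner keeps only its local top N in a small min-heap, so the single reducer merges a few hundred candidates instead of sorting terabytes of pairs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Sketch: per-combiner top-N via a bounded min-heap, no global sort needed.
public class TopN {
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> counts, int n) {
        // Min-heap ordered by count; the root is the smallest kept candidate.
        PriorityQueue<Map.Entry<String, Integer>> heap =
            new PriorityQueue<>((a, b) -> a.getValue() - b.getValue());
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > n) heap.poll(); // evict the smallest
        }
        List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);
        result.sort((a, b) -> b.getValue() - a.getValue()); // descending
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = Map.of("a", 5, "b", 9, "c", 2, "d", 7);
        System.out.println(topN(counts, 2)); // b=9, d=7
    }
}
```

Each combiner emits at most N pairs, so the reducer's merge is over (number of combiners) x N records, which is trivial compared to the raw input.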