Re: Shared thread safe variables?

2009-01-01 Thread Jim Twensky
Aaron, I actually do something different from word count. I count all possible phrases for every sentence in my corpus. So for instance, if I have a sentence like "Hello World", my mappers emit: "Hello 1", "World 1", "Hello World 1". As you can see, for longer sentences the number of
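To make the growth concrete, here is a minimal sketch of the phrase enumeration described above. It assumes that "all possible phrases" means all contiguous word sequences, which is consistent with the two-word example but is not confirmed by the message. The class and method names (`PhraseEmitter`, `phrases`) are hypothetical, and the code is plain Java rather than an actual Hadoop mapper.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not from the thread): lists every contiguous phrase
// of a sentence, i.e. the keys a mapper would emit with a count of 1.
class PhraseEmitter {
    static List<String> phrases(String sentence) {
        String[] words = sentence.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        // Splitting an empty string yields one empty token; treat it as no words.
        if (words.length == 1 && words[0].isEmpty()) {
            return out;
        }
        for (int start = 0; start < words.length; start++) {
            StringBuilder phrase = new StringBuilder();
            for (int end = start; end < words.length; end++) {
                if (end > start) {
                    phrase.append(' ');
                }
                phrase.append(words[end]);
                out.add(phrase.toString()); // one phrase per (start, end) pair
            }
        }
        return out;
    }
}
```

Under that assumption a sentence of n words yields n(n+1)/2 phrases, so a 4-word sentence produces 10 keys and a 20-word sentence produces 210.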

Re: Shared thread safe variables?

2008-12-31 Thread Aaron Kimball
Hmm. Check your math on the data set size. Your input corpus may be a few (dozen, hundred) TB, but how many distinct words are there? The output data set should be at least a thousand times smaller. If you've got the hardware to do that initial word count step on a few TB of data, the second pass

Re: Shared thread safe variables?

2008-12-25 Thread Jim Twensky
Hello again, I think I found an answer to my question. If I write a new WritableComparable object that extends IntWritable and then override the compareTo method, I can change the sorting order from ascending to descending. That will solve my problem of getting the top 100 most frequent words
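The idea can be sketched without any Hadoop dependency. The class below is a hypothetical stand-in for the IntWritable subclass mentioned above: it wraps an int and reverses the comparison, so a normal ascending sort puts the largest counts first. In a real job the key would extend IntWritable and also need Hadoop's serialization; that part is omitted here, and the names `DescendingCount` and `sortedValues` are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for an IntWritable subclass with a reversed compareTo.
class DescendingCount implements Comparable<DescendingCount> {
    final int value;

    DescendingCount(int value) {
        this.value = value;
    }

    @Override
    public int compareTo(DescendingCount other) {
        // Arguments are swapped on purpose: larger values compare as "smaller",
        // so an ascending sort yields descending counts.
        return Integer.compare(other.value, this.value);
    }

    // Sorts the given counts using the reversed ordering above.
    static List<Integer> sortedValues(int... counts) {
        List<DescendingCount> keys = new ArrayList<>();
        for (int c : counts) {
            keys.add(new DescendingCount(c));
        }
        Collections.sort(keys);
        List<Integer> out = new ArrayList<>();
        for (DescendingCount k : keys) {
            out.add(k.value);
        }
        return out;
    }
}
```

Using `Integer.compare` with swapped arguments avoids the overflow that `other.value - this.value` can produce for large counts.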

Shared thread safe variables?

2008-12-24 Thread Jim Twensky
Hello, I was wondering if Hadoop provides thread-safe shared variables that can be accessed from individual mappers/reducers along with a proper locking mechanism. To clarify things, let's say that in the word count example, I want to know the word that has the highest frequency and how many

Re: Shared thread safe variables?

2008-12-24 Thread Aaron Kimball
Hi Jim, The ability to perform locking of shared mutable state is a distinct anti-goal of the MapReduce paradigm. One of the major benefits of writing MapReduce programs is knowing that you don't have to worry about deadlock in your code. If mappers could lock objects, then the failure and

Re: Shared thread safe variables?

2008-12-24 Thread Jim Twensky
Hi Aaron, Thanks for the advice. I actually thought of using multiple combiners and a single reducer, but I was worried that the key sorting phase would be a waste for my purpose. If the input is just a bunch of (word, count) pairs on the order of terabytes, wouldn't sorting be overkill?
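One way to picture the "multiple combiners and a single reducer" idea without a full sort is a bounded heap. The sketch below is only an illustration in plain Java, not Hadoop code and not something proposed in the thread. It assumes each word appears in exactly one (word, count) pair, as it would after a completed word count. The names `LocalTopK`, `pair`, `topK` and `merge` are hypothetical. Each partition keeps its own k largest pairs, and a single final step merges those short lists.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch: per-partition top-k followed by a single merge step.
class LocalTopK {
    private static final Comparator<Map.Entry<String, Integer>> BY_COUNT =
        (a, b) -> Integer.compare(a.getValue(), b.getValue());

    // Convenience constructor for a (word, count) pair.
    static Map.Entry<String, Integer> pair(String word, int count) {
        return new SimpleEntry<>(word, count);
    }

    // Combiner-like step: keep only the k largest counts of one partition.
    // The min-heap evicts the smallest kept pair whenever it grows past k.
    static List<Map.Entry<String, Integer>> topK(List<Map.Entry<String, Integer>> pairs, int k) {
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(BY_COUNT);
        for (Map.Entry<String, Integer> p : pairs) {
            heap.add(p);
            if (heap.size() > k) {
                heap.poll();
            }
        }
        List<Map.Entry<String, Integer>> out = new ArrayList<>(heap);
        out.sort(BY_COUNT.reversed()); // largest count first
        return out;
    }

    // Single-reducer-like step: merge the partial lists and take the top k again.
    static List<Map.Entry<String, Integer>> merge(List<List<Map.Entry<String, Integer>>> partials, int k) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> part : partials) {
            all.addAll(part);
        }
        return topK(all, k);
    }
}
```

With k = 100, each partition forwards at most 100 pairs, so the final step handles at most 100 pairs per partition regardless of the input size. This only gives correct results under the stated assumption that counts for a word are already fully aggregated.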