The HashMap solution won't scale well, because the output data has to fit entirely in the heap space of a single machine.

You establish the threshold only after you're done with all the keys, right? Is that the reason you cannot do something like:
if (frequency < threshold)
   output.collect(...);

in the reducer?

If that's the case, doing a second simple map-reduce pass on your data to eliminate frequent keys is probably the most scalable solution.
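To make the two-pass idea concrete, here is a minimal plain-Java sketch of the logic only. It uses no Hadoop classes; the class and method names (TwoPassFilter, countFrequencies, dropFrequentKeys) are hypothetical, and the threshold is assumed to be computed between the two passes and handed to the second one (for example through the job configuration).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for the two passes; all names here are hypothetical.
class TwoPassFilter {

    // Pass 1: what the first job's reducer computes, one frequency per key.
    static Map<String, Integer> countFrequencies(List<String> keys) {
        Map<String, Integer> counts = new HashMap<>();
        for (String key : keys) {
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    // Pass 2: runs once the threshold is known. Each (key, frequency)
    // record is checked on its own, so no global table is needed here.
    static Map<String, Integer> dropFrequentKeys(Map<String, Integer> counts,
                                                 int threshold) {
        Map<String, Integer> kept = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() < threshold) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "a", "c", "a", "b");
        Map<String, Integer> counts = countFrequencies(keys);
        int threshold = 3; // assumed to be derived after pass 1 completes
        System.out.println(dropFrequentKeys(counts, threshold)); // prints {b=2, c=1}
    }
}
```

In a real job the second pass would read the first job's output files, so the per-record test in dropFrequentKeys is the only part that has to run on each machine.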

alex.r.

Aayush Garg wrote:
We cannot read the HashMap in the configure method of the reducer, because
it is called before the reduce job runs.
I need to eliminate rows from the HashMap once all the keys have been read.
Also, my concern is: if the dataset is large, will this HashMap approach work?


On Wed, Apr 16, 2008 at 10:07 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

That design is fine.

You should read your map in the configure method of the reducer.

There is a MapFile format supported by Hadoop, but MapFiles tend to be
pretty slow.  I usually find it better to just load my hash table by hand.
If you do this, you can use whatever format you like.
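As an illustration of loading the table by hand, here is a minimal sketch that assumes one "key<TAB>count" pair per line. That format, and the names SideTableLoader and load, are assumptions made for the example, not anything prescribed by Hadoop. In a reducer, a method like this would be called from configure with a reader opened on the side file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: parses "key<TAB>count" lines into a HashMap.
class SideTableLoader {

    static Map<String, Long> load(BufferedReader reader) {
        Map<String, Long> table = new HashMap<>();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // tolerate blank lines
                }
                int tab = line.indexOf('\t');
                if (tab < 0) {
                    throw new IllegalArgumentException("Malformed line: " + line);
                }
                String key = line.substring(0, tab);
                long count = Long.parseLong(line.substring(tab + 1).trim());
                table.put(key, count);
            }
        } catch (IOException e) {
            // Rethrow unchecked so callers need no throws clause.
            throw new UncheckedIOException(e);
        }
        return table;
    }

    public static void main(String[] args) {
        BufferedReader in = new BufferedReader(
                new StringReader("apple\t3\nbanana\t10\n"));
        Map<String, Long> table = load(in);
        System.out.println(table.get("banana")); // prints 10
    }
}
```

The whole table still has to fit in one machine's heap, so this only suits side data that is known to be small.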


On 4/16/08 12:41 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote:

Hi,

The current structure of my program is:
Upper class{
class Reduce{
  reduce function(K1,V1,K2,V2){
        // I count the frequency for each key
     // Add output in  HashMap(Key,value)  instead  of  output.collect()
   }
 }

void run()
 {
      runjob();
     // Now eliminate the top-frequency keys from the HashMap built in the
     // reduce function; only at this point is the HashMap complete.
     // Then write this HashMap to a file in a format that the next
     // MapReduce job can read, so that the keys of this HashMap are taken
     // as the keys in that job's mapper function. How should I do this,
     // and which format should I choose? Is this design and approach OK?

  }

  public static void main() {}
}
I hope my question is clear.

Thanks,


On Wed, Apr 16, 2008 at 8:33 AM, Amar Kamat <[EMAIL PROTECTED]> wrote:
Aayush Garg wrote:

Hi,

Are you sure that another MR job is required for eliminating some rows?
Can't I just somehow eliminate them from main() once I know which keys
need to be removed?



Can you provide some more details on how exactly you are filtering?
Amar





