>> For efficiency, I don't want these pieces of the spam filters moving
>> around the cluster.
If you are flexible on this, you can pass both the mails and the config data 
to the mappers, do the common processing for the mails, transform the K,V 
pair for each user/mailbox, and use a custom partitioner and comparator to 
deliver each user's mails and filters to a single reducer, which processes 
them as needed. If the config data is much smaller than the mails (maybe a 
naïve assumption, but it should hold), this is not much of an inefficiency. 
It *should* still be better than two mapred jobs, where you would be writing 
to HDFS twice.
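Roughly, the partitioner half of that would look like the sketch below; the 
"userId#recordType" key layout and the class name are just my assumptions 
about your setup.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Sends every record for one user (mails and filter entries alike) to the
// same reducer. A paired comparator can then sort the filter records ahead
// of the mails within that user's reduce group.
public class UserPartitioner implements Partitioner<Text, Text> {

    public void configure(JobConf job) {
        // no per-job setup needed for this sketch
    }

    public int getPartition(Text key, Text value, int numPartitions) {
        // Partition on the userId prefix only, ignoring the record type,
        // so all of a user's records land in the same partition.
        String user = key.toString().split("#", 2)[0];
        return (user.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

You would register it on the job with 
conf.setPartitionerClass(UserPartitioner.class).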
Hope this helps, just the first thing that came to my mind.

Thanks,
Amogh

________________________________
From: fan wei fang [mailto:eagleeye8...@gmail.com]
Sent: Monday, August 24, 2009 12:03 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: Location reduce task running.

Hi Amogh,

I appreciate your quick response.
Please correct me if I'm wrong. If the workload of the reducers is transferred 
to combiners, does that mean every map node must hold a copy of my config 
data? If so, that is completely unacceptable for my app.

Let me further explain the situation.
I am trying to build an anti-spam system using Map-Reduce. In this system, 
users are allowed to have their own spam filters. The whole set of these 
filters is so huge that it cannot be put on any single node, so I have to 
split it across nodes. Each node will be responsible for only a small number 
of mailboxes.
For efficiency, I don't want these pieces of the spam filters moving
around the cluster.

This is the data flow of my app.

Mails ---> Map (do common processing for emails) ---> Reduce (do 
user-specific processing) ---> Store mails in their designated mailboxes.
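
In rough job-wiring terms, I picture something like the sketch below; the 
IdentityMapper/IdentityReducer classes are only stand-ins for my actual 
common and user-specific processing steps.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class AntiSpamJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(AntiSpamJob.class);
        conf.setJobName("anti-spam");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(IdentityMapper.class);   // stand-in: common mail processing
        conf.setReducerClass(IdentityReducer.class); // stand-in: user-specific filtering
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}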

Do you have any suggestions? I am thinking about Hadoop's JVM re-use feature, 
or I could set up a chain of two map-reduce jobs.
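
For the JVM re-use option, my understanding is that it is enabled per job 
roughly like this (just a sketch of what I have in mind):

import org.apache.hadoop.mapred.JobConf;

public class JvmReuseExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf(JvmReuseExample.class);
        // -1 = unlimited re-use: tasks of the same job on a node share one
        // JVM, so static state (e.g. loaded filters) could survive between
        // tasks instead of being reloaded each time.
        conf.setNumTasksToExecutePerJvm(-1);
    }
}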

Best regards.
Fang.


On Mon, Aug 24, 2009 at 1:25 PM, Amogh Vasekar 
<am...@yahoo-inc.com> wrote:

No, but if you want reducer-like functionality on the same node, have a look 
at combiners. To get the exact functionality you might need to tweak things a 
little with respect to buffers, flushing etc.
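
A combiner is simply a Reducer that the framework may run on the map side, 
over a single map task's output. A minimal sketch; the class name and the 
pass-through logic are placeholders for your own per-key pre-processing:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MailCombiner extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Do whatever per-key work is safe map-side, then re-emit the
        // records so the real reducer still sees them.
        while (values.hasNext()) {
            output.collect(key, values.next());
        }
    }
}

It is registered with conf.setCombinerClass(MailCombiner.class). Note that 
the framework decides when (and whether) the combiner runs, which is why 
getting exact reducer semantics needs the buffer/flush tweaking I mentioned.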



Cheers!

Amogh



________________________________

From: fan wei fang 
[mailto:eagleeye8...@gmail.com]
Sent: Monday, August 24, 2009 9:17 AM
To: mapreduce-user@hadoop.apache.org
Subject: Location reduce task running.



Hello guys,

I am new to Hadoop and am doing an experiment with it.
My situation is:
 +My job is expected to run continuously/frequently.
 +My reduce tasks require a large amount of configuration data, and this 
config data is specific to the map output's key.
 -->That's why I want to avoid moving this config data around.
As far as I have read, the nodes that reduce tasks are assigned to are picked 
without consideration of data locality.

My question is: is there any way to force the reduce tasks for a specific key 
to run on the same node?

Thnx.
