Hi,

I am very new to the MapReduce paradigm, so this could be a dumb question.

What do you do if your mapper functions need to know more than just the data
being processed in order to do their job? The simplest example I can think
of is implementing a selective, phrase-based version of wordcount. 

Imagine you want to count the occurrences of all notable names (from the
notable names database) in a large collection of news stories. You can't
just count every phrase: the number of possible word combinations is
ridiculously large, and the vast majority are irrelevant.

You have a limited (large, but bounded) vocabulary of phrases you are
interested in--this list of names. You want each mapper to be aware of it
and to count only the relevant phrases. You basically want to give each
mapper read-only access to a HashSet of phrases as well as the documents it
should be counting over. How would you do that?
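
To make this concrete, here is roughly what I imagine each mapper doing,
written as plain Java with no Hadoop APIs at all. The class name, the
mapDocument helper, the maxWords limit and the sample names are all made up
for illustration, and the tokenizing is deliberately crude. How the set
actually gets to the mapper is exactly the part I don't know.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the per-document "map" step; not tied to any Hadoop API.
class PhraseCountSketch {

    // Counts only those word n-grams (up to maxWords long) that appear in the
    // read-only phrase set. Every other word combination is ignored.
    static Map<String, Integer> mapDocument(String text, Set<String> phrases, int maxWords) {
        Map<String, Integer> counts = new HashMap<>();
        // Crude tokenizer (an assumption): lowercase, split on anything that is not a-z.
        // This also lets a candidate phrase span a sentence boundary.
        String[] words = text.toLowerCase(Locale.ROOT).split("[^a-z]+");
        for (int start = 0; start < words.length; start++) {
            StringBuilder candidate = new StringBuilder();
            for (int len = 1; len <= maxWords && start + len <= words.length; len++) {
                String word = words[start + len - 1];
                if (word.isEmpty()) {
                    break; // leading empty token when the text starts with punctuation
                }
                if (len > 1) {
                    candidate.append(' ');
                }
                candidate.append(word);
                String phrase = candidate.toString();
                if (phrases.contains(phrase)) {
                    counts.merge(phrase, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Stand-in for the notable names list; the real one would be much larger.
        Set<String> names = new HashSet<>();
        names.add("ada lovelace");
        names.add("alan turing");

        String story = "Ada Lovelace met Alan Turing. Ada Lovelace wrote notes.";
        System.out.println(mapDocument(story, names, 3));
    }
}
```

In a real job the per-document counts would be emitted as (phrase, count)
pairs and summed in the reducer, just as in ordinary wordcount.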

Cheers, 
Dave


-- 
View this message in context: 
http://old.nabble.com/Context-needed-by-mapper-tp28532164p28532164.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
