Hi Dave,

This is a great question.  Hadoop provides a few mechanisms to get additional
data to the map/reduce tasks:


   - Configuration parameter
   - Distributed cache
   - Direct RDBMS access from a node
   - Distributed HashTable like memcached


Let's go through each one.  The Configuration object is a map of key/value
pairs that is passed to every task.  If the vocabulary is not very large, say
no larger than a few MB, the best way is to serialize it and pass it as a
configuration parameter.  Each task will get a deserialized copy.  If the
vocabulary is relatively large, you can always pass it as a file in HDFS.
There is a mechanism that optimizes network traffic by increasing the default
replication on these files and copying them to the local disk of each task
node.  The mechanism is called the 'distributed cache'.  Don't ask me why.  I
would actually have reserved that name for the last option, memcached.  You
can search the web to clarify what it is.
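To make the first option concrete, here is a minimal sketch in plain Java of
packing a vocabulary into a single string and unpacking it again.  The key
name "vocab.phrases" is hypothetical; in a real job you would put the packed
string into the Hadoop Configuration with conf.set(...) and read it back in
the task with conf.get(...).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class VocabConfig {
    // Pack the vocabulary into one string, as you would before
    // conf.set("vocab.phrases", ...).  '\n' is a safe delimiter here
    // because phrases never contain newlines.
    static String pack(Set<String> vocab) {
        return String.join("\n", vocab);
    }

    // Unpack inside the task, as you would after conf.get("vocab.phrases").
    static Set<String> unpack(String packed) {
        return new HashSet<>(Arrays.asList(packed.split("\n")));
    }

    public static void main(String[] args) {
        Set<String> vocab = new HashSet<>(
                Arrays.asList("Barack Obama", "Angela Merkel"));
        Set<String> roundTrip = unpack(pack(vocab));
        System.out.println(roundTrip.equals(vocab));  // prints true
    }
}
```

Each mapper can then do a fast roundTrip.contains(phrase) check instead of
counting every word combination.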

Finally, it is always possible, but not recommended, to open a direct ODBC
or JDBC connection from the task.  It's not recommended because it doesn't
scale: on a large cluster the RDBMS will be flooded with connections and
will likely just go down (unless you at least cache the vocabulary in
memory at the beginning of each map/reduce task).
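If you do go the RDBMS route, the caching I mentioned might look like the
sketch below: the vocabulary is fetched once per task JVM and every map()
call after that hits the in-memory copy.  This is plain Java with a
hypothetical loadFromDatabase() standing in for the real JDBC query.

```java
import java.util.HashSet;
import java.util.Set;

public class VocabCache {
    private static Set<String> vocab;  // shared by all map() calls in this task's JVM
    static int loads = 0;              // counts DB round trips, for illustration

    // Load once per task; every later call reuses the cached copy.
    static synchronized Set<String> get() {
        if (vocab == null) {
            vocab = loadFromDatabase();
        }
        return vocab;
    }

    // Hypothetical stand-in for the real JDBC query against the names table.
    private static Set<String> loadFromDatabase() {
        loads++;
        Set<String> s = new HashSet<>();
        s.add("Barack Obama");  // placeholder row
        return s;
    }

    public static void main(String[] args) {
        // Simulate many map() calls: the "database" is queried only once.
        for (int i = 0; i < 1000; i++) {
            VocabCache.get().contains("Barack Obama");
        }
        System.out.println(loads);  // prints 1
    }
}
```

With this pattern the RDBMS sees one query per task rather than one per
record, which is the difference between an annoyance and an outage.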

Hope this helps,

Alex K

On Tue, May 11, 2010 at 10:33 PM, DNMILNE <d.n.mi...@gmail.com> wrote:

>
> Hi,
>
> I am very new to the MapReduce paradigm so this could be a dumb question.
>
> What do you do if your mapper functions need to know more than just the
> data
> being processed in order to do their job? The simplest example I can think
> of is implementing a selective, phrase-based version of wordcount.
>
> Imagine you want to count the occurrences of all notable names (from the
> notable names database) in a large collection of news stories. You can't
> just count phrases - the number of potential word combinations is
> ridiculously large, and the vast majority are irrelevant.
>
> You have a limited (large, but bounded) vocabulary of phrases you are
> interested in--this list of names. You want each mapper to be aware of it,
> and only count the relevant phrases. You basically want to give each mapper
> read-only access to a HashSet of phrases as well as the documents they
> should be counting over. How would you do that?
>
> Cheers,
> Dave
>
>
> --
> View this message in context:
> http://old.nabble.com/Context-needed-by-mapper-tp28532164p28532164.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>
