Stuart White wrote:
The nodes in my cluster have 4 cores & 4 GB RAM.  So, I've set
mapred.tasktracker.map.tasks.maximum to 3 (leaving 1 core for
"breathing room").

My process requires a large dictionary of terms (~2GB when loaded
into RAM).  The terms are looked up very frequently, so I want the
terms memory-resident.

So, the problem is, I want 3 processes (to utilize CPU), but each
process requires ~2GB, but my nodes don't have enough memory to each
have their own copy of the 2GB of data.  So, I need to somehow share
the 2GB between the processes.

What I have currently implemented is a standalone RMI service that,
during startup, loads the 2GB dictionaries.  My mappers are simply RMI
clients that call this RMI service.
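
A stripped-down sketch of what I mean (the interface and names here
are just illustrative, not my actual code):

import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;
import java.util.HashMap;
import java.util.Map;

interface DictionaryService extends Remote {
    String lookup(String term) throws RemoteException;
}

public class DictionaryServer implements DictionaryService {
    private final Map<String, String> dict = new HashMap<String, String>();

    public String lookup(String term) {
        return dict.get(term);   // served out of the one in-memory copy
    }

    public static void main(String[] args) throws Exception {
        DictionaryServer server = new DictionaryServer();
        // ... load the ~2GB dictionary into server.dict here ...
        DictionaryService stub =
            (DictionaryService) UnicastRemoteObject.exportObject(server, 0);
        Registry registry = LocateRegistry.createRegistry(1099);
        registry.rebind("dictionary", stub);
        Thread.sleep(Long.MAX_VALUE);   // keep the service (and registry) alive
    }
}

and each mapper is just a client:

    Registry reg = LocateRegistry.getRegistry("localhost", 1099);
    DictionaryService dict = (DictionaryService) reg.lookup("dictionary");
    String value = dict.lookup(term);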

This works just fine.  The only problem is that my standalone RMI
service is totally "outside" Hadoop.  I have to ssh onto each of the
nodes, start/stop/reconfigure the services manually, etc...

There's nothing wrong with doing this outside Hadoop; the only problem is that manual deployment is not the way forward.

1. Some kind of JavaSpaces system where you put the facts into the tuple space and let all the mappers share it (see the sketch after this list).

2. (CofI warning) Use something like SmartFrog's Anubis tuplespace to bring up one, and only one, node running the dictionary application. This may be hard to get started, but it keeps availability high: the Anubis nodes keep track of all other members of the cluster via a heartbeat/election protocol, and can handle failures of the dictionary node by automatically bringing up a new one.

3. Roll your own multicast/voting protocol, thus avoiding RMI. Something scatter/gather style is needed as part of the Apache cloud-computing product portfolio, so you could try implementing it; Doug Cutting will probably provide constructive feedback.
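
For (1), the shape of it would be something like the sketch below. I've
made up the TermEntry class, and the Jini discovery, lease and
transaction plumbing is left out entirely:

import net.jini.core.entry.Entry;
import net.jini.core.lease.Lease;
import net.jini.space.JavaSpace;

public class TermEntry implements Entry {
    public String term;    // Entry fields must be public and serializable
    public String value;

    public TermEntry() {}  // required no-arg constructor
    public TermEntry(String term, String value) {
        this.term = term;
        this.value = value;
    }
}

class DictionarySpaceClient {
    // a loader writes every term into the space once
    static void load(JavaSpace space, String term, String value) throws Exception {
        space.write(new TermEntry(term, value), null, Lease.FOREVER);
    }

    // mappers read (rather than take) matching entries, so they stay shared
    static String lookup(JavaSpace space, String term) throws Exception {
        TermEntry match =
            (TermEntry) space.read(new TermEntry(term, null), null, JavaSpace.NO_WAIT);
        return match == null ? null : match.value;
    }
}

Whether a space keeps up with that lookup rate is another question entirely.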

I haven't played with ZooKeeper enough to say whether it would work here.

-steve
