Stuart White wrote:
The nodes in my cluster have 4 cores & 4 GB RAM.  So, I've set
mapred.tasktracker.map.tasks.maximum to 3 (leaving 1 core for
"breathing room").

My process requires a large dictionary of terms (~2GB when loaded
into RAM).  The terms are looked up very frequently, so I want the
terms memory-resident.

So, the problem is, I want 3 processes (to utilize CPU), but each
process requires ~2GB, but my nodes don't have enough memory to each
have their own copy of the 2GB of data.  So, I need to somehow share
the 2GB between the processes.

What I have currently implemented is a standalone RMI service that,
during startup, loads the 2GB dictionaries.  My mappers are simply RMI
clients that call this RMI service.
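
A stripped-down sketch of what I mean (the interface and names here
are just illustrative, not my actual code):

import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;
import java.util.HashMap;
import java.util.Map;

interface DictionaryService extends Remote {
    String lookup(String term) throws RemoteException;
}

public class DictionaryServer implements DictionaryService {
    private final Map<String, String> dict = new HashMap<String, String>();

    public String lookup(String term) {
        return dict.get(term);   // served out of the one in-memory copy
    }

    public static void main(String[] args) throws Exception {
        DictionaryServer server = new DictionaryServer();
        // ... load the ~2GB dictionary into server.dict here ...
        DictionaryService stub =
            (DictionaryService) UnicastRemoteObject.exportObject(server, 0);
        Registry registry = LocateRegistry.createRegistry(1099);
        registry.rebind("dictionary", stub);
        Thread.sleep(Long.MAX_VALUE);   // keep the service (and registry) alive
    }
}

and each mapper is just a client:

    Registry reg = LocateRegistry.getRegistry("localhost", 1099);
    DictionaryService dict = (DictionaryService) reg.lookup("dictionary");
    String value = dict.lookup(term);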

This works just fine.  The only problem is that my standalone RMI
service is totally "outside" Hadoop.  I have to ssh onto each of the
nodes, start/stop/reconfigure the services manually, etc...

There's nothing wrong with doing this outside Hadoop; the only problem is that manual deployment is not the way forward.

1. Some kind of JavaSpaces system where you put the facts into the tuple space and let all the mappers share it (see the sketch after this list).

2. (CofI warning) Use something like SmartFrog's Anubis tuplespace to bring up one, and only one, node running the dictionary application. This may be hard to get started, but it keeps availability high: the Anubis nodes keep track of all other members of the cluster via a heartbeat/election protocol, and can handle failures of the dictionary node by automatically bringing up a new one.

3. Roll your own multicast/voting protocol, thus avoiding RMI. Something scatter/gather style is needed as part of the Apache cloud-computing product portfolio, so you could try implementing it; Doug Cutting will probably provide constructive feedback.
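
For (1), the shape of it would be something like the sketch below. I've
made up the TermEntry class, and the Jini discovery, lease and
transaction plumbing is left out entirely:

import net.jini.core.entry.Entry;
import net.jini.core.lease.Lease;
import net.jini.space.JavaSpace;

public class TermEntry implements Entry {
    public String term;    // Entry fields must be public and serializable
    public String value;

    public TermEntry() {}  // required no-arg constructor
    public TermEntry(String term, String value) {
        this.term = term;
        this.value = value;
    }
}

class DictionarySpaceClient {
    // a loader writes every term into the space once
    static void load(JavaSpace space, String term, String value) throws Exception {
        space.write(new TermEntry(term, value), null, Lease.FOREVER);
    }

    // mappers read (rather than take) matching entries, so they stay shared
    static String lookup(JavaSpace space, String term) throws Exception {
        TermEntry match =
            (TermEntry) space.read(new TermEntry(term, null), null, JavaSpace.NO_WAIT);
        return match == null ? null : match.value;
    }
}

Whether a space keeps up with that lookup rate is another question entirely.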

I haven't played with ZooKeeper enough to say whether it would work here.

-steve
