Hi Stuart,

You might want to look at a memcached solution some students and I worked out for exactly this problem. It's written up in:

Jimmy Lin, Anand Bahety, Shravya Konda, and Samantha Mahindrakar. Low-Latency, High-Throughput Access to Static Global Resources within the Hadoop Framework. Technical Report HCIL-2009-01, University of Maryland, College Park, January 2009.

Available at:

http://www.umiacs.umd.edu/~jimmylin/publications/by_year.html

Best,
Jimmy

Stuart White wrote:
Thanks to everyone for your feedback.  I'm unfamiliar with many of the
technologies you've mentioned, so it may take me some time to digest
all your responses.  The first thing I'm going to look at is Ted's
suggestion of a pure map-reduce solution by pre-joining my data with
my lookup values.

On Fri, Mar 20, 2009 at 9:55 AM, Owen O'Malley <owen.omal...@gmail.com> wrote:
On Thu, Mar 19, 2009 at 6:42 PM, Stuart White <stuart.whi...@gmail.com>wrote:

My process requires a large dictionary of terms (~ 2GB when loaded
into RAM).  The terms are looked up very frequently, so I want the
terms memory-resident.

So, the problem is, I want 3 processes (to utilize CPU), but each
process requires ~2GB, but my nodes don't have enough memory to each
have their own copy of the 2GB of data.  So, I need to somehow share
the 2GB between the processes.
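To make the threads-versus-processes trade-off concrete, here is a minimal, hypothetical sketch (class name and toy data invented; this is plain Java, not Hadoop code): three worker threads all look terms up in a single in-memory dictionary, so the data is resident only once rather than once per process.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedDictDemo {
    // Loaded once; stands in for the ~2GB term dictionary.
    static final Map<String, Integer> DICT = new HashMap<>();
    static {
        DICT.put("hadoop", 1);
        DICT.put("mapreduce", 2);
    }

    // Spawn three worker threads that all query the ONE shared map.
    static int countHits() throws InterruptedException {
        final AtomicInteger hits = new AtomicInteger();
        Thread[] workers = new Thread[3];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                // Read-only lookups after loading need no synchronization.
                if (DICT.containsKey("hadoop")) {
                    hits.incrementAndGet();
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        return hits.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(countHits());
    }
}
```

Since the dictionary is only read after loading, the threads can share it without locks; separate processes would each need their own 2GB copy (or an external store like memcached).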

I would recommend using the multi-threaded map runner. Have one map per node
and use 3 worker threads that all consume the input. The only disadvantage
is that it works best for CPU-heavy loads (or maps that are doing crawling,
etc.), since all three map threads share a single record reader.
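For reference, a job configuration along these lines might look roughly like the fragment below (property names are from the Hadoop 0.20-era old API, so verify them against your version; the tasktracker slot limit is a cluster-level setting, not a per-job one):

```xml
<!-- Run the multi-threaded map runner with 3 threads per map task -->
<property>
  <name>mapred.map.runner.class</name>
  <value>org.apache.hadoop.mapred.lib.MultithreadedMapRunner</value>
</property>
<property>
  <name>mapred.map.multithreadedrunner.threads</name>
  <value>3</value>
</property>
<!-- One map slot per node, so the 2GB dictionary is loaded only once -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value>
</property>
```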

In the longer term, it might make sense to enable parallel JVM reuse in
addition to serial JVM reuse.
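(The serial reuse mentioned here can already be turned on in recent Hadoop versions with the following job property; -1 means a JVM may run an unlimited number of tasks for the job, one after another:)

```xml
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
</property>
```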

-- Owen

