Thanks to everyone for your feedback.  I'm unfamiliar with many of the
technologies you've mentioned, so it may take me some time to digest
all your responses.  The first thing I'm going to look at is Ted's
suggestion of a pure map-reduce solution that pre-joins my data with
my lookup values.
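
If I'm understanding Ted's idea correctly, the pre-join would itself be a
map-reduce job, something like the reduce-side join sketched below.  This is
just my reading of the suggestion against the old org.apache.hadoop.mapred
API; the class names, the tab-separated "term<TAB>value" record format, and
the "dict" path convention are all placeholders of my own.

// Sketch of a reduce-side join: dictionary entries and data records are
// both keyed by term, and the reducer attaches the dictionary value to
// every data record that shares the term.
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class TermJoin {

  // Tags each record with its origin so the reducer can tell dictionary
  // entries apart from data records.
  public static class TagMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {

    private boolean fromDictionary;

    public void configure(JobConf job) {
      // "map.input.file" holds the path of the file backing this split;
      // here I simply assume the dictionary lives under a "dict" directory.
      fromDictionary = job.get("map.input.file", "").contains("dict");
    }

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      String[] parts = line.toString().split("\t", 2);
      if (parts.length < 2) {
        return;  // skip malformed lines
      }
      out.collect(new Text(parts[0]),
                  new Text((fromDictionary ? "D\t" : "R\t") + parts[1]));
    }
  }

  // For each term, remembers the dictionary value and emits it alongside
  // every data record carrying that term.
  public static class JoinReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text term, Iterator<Text> values,
                       OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      String dictValue = null;
      List<String> records = new ArrayList<String>();
      while (values.hasNext()) {
        String v = values.next().toString();
        if (v.startsWith("D\t")) {
          dictValue = v.substring(2);
        } else {
          records.add(v.substring(2));  // buffer until the dict entry shows up
        }
      }
      for (String record : records) {
        out.collect(term, new Text(record + "\t" + dictValue));
      }
    }
  }
}

One wrinkle I can see: since there's no guarantee the dictionary record
arrives first within a term's group, the reducer buffers the data records in
memory; a secondary sort that forces the "D" record to the front of each
group would avoid that.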

On Fri, Mar 20, 2009 at 9:55 AM, Owen O'Malley <owen.omal...@gmail.com> wrote:
> On Thu, Mar 19, 2009 at 6:42 PM, Stuart White <stuart.whi...@gmail.com> wrote:
>
>>
>> My process requires a large dictionary of terms (~ 2GB when loaded
>> into RAM).  The terms are looked-up very frequently, so I want the
>> terms memory-resident.
>>
>> So the problem is: I want 3 processes (to utilize the CPUs), but each
>> process requires ~2GB, and my nodes don't have enough memory for each
>> process to hold its own copy of the 2GB of data.  So I need to somehow
>> share the 2GB between the processes.
>
>
> I would recommend using the multi-threaded map runner. Have 1 map/node and
> just use 3 worker threads that all consume the input. The only disadvantage
> is that a single record reader feeds all three map threads, so it works best
> for CPU-heavy loads (or maps that are doing crawling, etc.).
>
> In the longer term, it might make sense to enable parallel JVM reuse in
> addition to serial JVM reuse.
>
> -- Owen
>
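
If I'm reading the API right, Owen's multi-threaded map runner suggestion
translates to something like the sketch below (old mapred API; the property
names are the ones I believe MultithreadedMapRunner and the tasktracker read,
so treat them as assumptions on my part):

// One map task per node, fanned out to three map threads that share the
// task JVM, and therefore a single in-memory copy of the dictionary.
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

public class ThreadedJobSetup {
  public static void configure(JobConf conf) {
    // Run the mapper through the multi-threaded runner instead of the
    // default MapRunner.
    conf.setMapRunnerClass(MultithreadedMapRunner.class);

    // Three worker threads per map task, all fed by a single record reader.
    conf.setInt("mapred.map.multithreadedrunner.threads", 3);

    // Capping each node at one simultaneous map task is a tasktracker-side
    // setting (mapred.tasktracker.map.tasks.maximum=1 in the node's Hadoop
    // config), not something the job can set for itself.
  }
}

The mapper would then need to be thread-safe, and the dictionary would be
loaded once (say, into a static from configure()) so the three threads share
it instead of each building its own 2GB copy.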
