Hi Alan,

Thanks for your message.

The object can be read-only once it is initialized - I do not need to modify
it. Essentially it is an object that lets me analyze/modify the data I am
mapping/reducing, and it comes to about 3-4 GB of RAM. The problem I have is
that if I run multiple mappers, this object gets replicated in each of the
child VMs and I run out of memory on the node. I pretty much need to have the
full object in memory to do my processing. It is possible (though quite
difficult) to keep it partially on disk and query it there (like a Lucene
store implementation), but there is a significant performance hit. For
example, say I use the xlarge CPU instance at Amazon (8 CPUs, 8 GB RAM). In
that scenario I can really only run 1 mapper per node even though there are 8
CPUs. But if sharing the object (e.g. via RMI) or persisting it (e.g. in
Lucene) makes each access more than 8 times slower than an in-memory lookup,
then it is cheaper to just run 1 mapper per node. I tried sharing with
Terracotta and saw roughly a 600x slowdown versus in-memory access.
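
Just to make the on-disk option concrete, here is a rough sketch of the kind
of thing I am picturing - a local Berkeley DB JE store that each mapper opens
read-only and queries per record. The directory, database name and key layout
are placeholders, not code I am actually running:

import java.io.File;

import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.DatabaseException;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;

// Read-only, disk-backed lookups against a BDB JE store copied to the node.
public class LocalLookup {

  private final Environment env;
  private final Database db;

  public LocalLookup(File dir) throws DatabaseException {
    EnvironmentConfig envConfig = new EnvironmentConfig();
    envConfig.setReadOnly(true);
    env = new Environment(dir, envConfig);

    DatabaseConfig dbConfig = new DatabaseConfig();
    dbConfig.setReadOnly(true);
    // "refdata" is a placeholder database name.
    db = env.openDatabase(null, "refdata", dbConfig);
  }

  // Disk-backed lookup; returns null if the key is absent.
  public byte[] get(byte[] key) throws DatabaseException {
    DatabaseEntry k = new DatabaseEntry(key);
    DatabaseEntry v = new DatabaseEntry();
    OperationStatus status = db.get(null, k, v, LockMode.DEFAULT);
    return status == OperationStatus.SUCCESS ? v.getData() : null;
  }

  public void close() throws DatabaseException {
    db.close();
    env.close();
  }
}

Every get() there is a disk-backed lookup, which is where the performance hit
versus the in-memory object comes from.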

So ideally, if I could get all the mappers into the same VM, I could create a
singleton and still have multiple mappers access it at memory speed.
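
In that case the mapper could look something like the sketch below.
ReferenceData and the "model.path" property are hypothetical stand-ins for my
3-4 GB object and wherever it gets loaded from, and this only helps if the
mapper tasks really do end up in one JVM, which I understand stock Hadoop does
not do today:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SharedObjectMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  // Loaded at most once per JVM; every mapper in that JVM reads the same copy.
  private static volatile ReferenceData shared;

  // ReferenceData and "model.path" are hypothetical stand-ins for the large
  // read-only object and wherever it is loaded from.
  private static ReferenceData getShared(JobConf conf) throws IOException {
    if (shared == null) {
      synchronized (SharedObjectMapper.class) {
        if (shared == null) {
          shared = ReferenceData.loadFrom(conf.get("model.path"));
        }
      }
    }
    return shared;
  }

  private ReferenceData model;

  public void configure(JobConf conf) {
    try {
      model = getShared(conf);
    } catch (IOException e) {
      throw new RuntimeException("could not load shared object", e);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    // Read-only lookups against the shared in-memory object.
    out.collect(value, new Text(model.lookup(value.toString())));
  }
}

Each mapper instance would then just read from the one shared copy at memory
speed.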

Please do let me know if I am looking at this correctly and if the above is
possible.

Thanks a lot for all your help.

Cheers,
Dev




On Fri, Oct 3, 2008 at 12:49 PM, Alan Ho <[EMAIL PROTECTED]> wrote:

> It really depends on what type of data you are sharing, how you are looking
> up the data, whether the data is read-write, and whether you care about
> consistency. If you don't care about consistency, I suggest that you shove
> the data into a BDB store (for key-value lookup) or a Lucene store, and copy
> the data to all the nodes. That way all data access will be in-process with
> no GC problems, and you will get very fast results. BDB and Lucene both have
> easy replication strategies.
>
> If the data is RW, and you need consistency, you should probably forget
> about MapReduce and just run everything on big-iron.
>
> Regards,
> Alan Ho
>
>
>
>
> ----- Original Message ----
> From: Devajyoti Sarkar <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Thursday, October 2, 2008 8:41:04 PM
> Subject: Sharing an object across mappers
>
> I think each mapper/reducer runs in its own JVM, which makes it impossible
> to share objects. I need to share a large object so that I can access it at
> memory speeds across all the mappers. Is it possible to have all the mappers
> run in the same VM? Or is there a way to do this across VMs at high speed? I
> guess RMI and other such methods will be just too slow.
>
> Thanks,
> Dev
>
>
>
