On Oct 3, 2008, at 1:10 AM, Devajyoti Sarkar wrote:

Hi Alan,

Thanks for your message.

The object can be read-only once it is initialized - I do not need to modify it. Essentially it is an object that allows me to analyze/modify data that I am mapping/reducing. It comes to about 3-4GB of RAM. The problem I have is that if I run multiple mappers, this object gets replicated in the different VMs and I run out of memory on my node. I pretty much need to have the full
object in memory to do my processing. It is possible (though quite
difficult) to have it partially on disk and query it (like a Lucene store implementation), but there is a significant performance hit. As an example, say I use the xlarge CPU instance at Amazon (8 CPUs, 8GB RAM). In this scenario I can really only have 1 mapper per node even though there are 8 CPUs.
But if the overhead of sharing the object (e.g. via RMI) or persisting it (e.g. in a Lucene store) makes access more than 8 times slower than in-memory access, then running 8 mappers that way is no faster than 1 in-memory mapper, and it is cheaper to just run 1 mapper per node. I tried sharing with Terracotta and I was seeing roughly a 600x slowdown versus in-memory access.

So ideally, if I could have all the mappers in the same VM, then I could create a singleton and still have multiple mappers access it at memory speed.
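
One concrete way to get that effect with the old org.apache.hadoop.mapred API is MultithreadedMapRunner: one map task JVM per node, driven by several threads that all see the same statics, so a lazily-initialized static singleton is built once and shared. A rough sketch under that assumption - SharedModel, its loader, and the paths are made-up placeholders, and map() has to be thread-safe:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

public class SingletonModelMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  // One copy per task JVM, shared by every mapper thread in that JVM.
  private static volatile SharedModel model;

  public void configure(JobConf job) {
    synchronized (SingletonModelMapper.class) {
      if (model == null) {
        model = SharedModel.loadFrom("/local/path/model.dat"); // hypothetical loader
      }
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
    // Read-only lookups against the shared object; this method runs concurrently.
    out.collect(value, new Text(model.lookup(value.toString())));
  }

  public static JobConf buildJob() {
    JobConf conf = new JobConf(SingletonModelMapper.class);
    conf.setMapperClass(SingletonModelMapper.class);
    conf.setMapRunnerClass(MultithreadedMapRunner.class);
    conf.setInt("mapred.map.multithreadedrunner.threads", 8); // roughly one thread per core
    // On the cluster side, mapred.tasktracker.map.tasks.maximum would also be
    // kept low so only one such task (and one copy of the model) runs per node.
    return conf;
  }
}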

Please do let me know if I am looking at this correctly and if the above is possible.

Thanks a lot for all your help.

Cheers,
Dev


Please take a look at DistributedCache:
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#DistributedCache

An example:
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Example%3A+WordCount+v2.0

Arun
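
For reference, a minimal sketch of the DistributedCache wiring those tutorial links describe, against the 0.18-era org.apache.hadoop.mapred API; the HDFS path and class names are placeholders, and how the localized file is actually consumed is left open:

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CachedLookupMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private Path localCopy; // node-local path of the cached file

  // Driver side: register the side file once; the framework localizes it to
  // every node that runs a task of this job.
  public static JobConf buildJob() throws URISyntaxException {
    JobConf conf = new JobConf(CachedLookupMapper.class);
    DistributedCache.addCacheFile(new URI("/user/dev/lookup.dat#lookup"), conf);
    return conf;
  }

  // Task side: find the localized copy in configure() and load from there.
  public void configure(JobConf job) {
    try {
      Path[] cached = DistributedCache.getLocalCacheFiles(job);
      localCopy = cached[0];
    } catch (IOException e) {
      throw new RuntimeException("Could not locate cached file", e);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
    // ... read from localCopy (a plain local-filesystem path) as needed ...
    out.collect(value, new Text(localCopy.toString()));
  }
}

Note that DistributedCache gives each node one local on-disk copy per job; any task that wants the data in RAM still builds its own heap copy inside its own JVM.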




On Fri, Oct 3, 2008 at 12:49 PM, Alan Ho <[EMAIL PROTECTED]> wrote:

It really depends on what type of data you are sharing, how you are looking up the data, whether the data is read-write, and whether you care about consistency. If you don't care about consistency, I suggest that you shove the data into a BDB store (for key-value lookup) or a Lucene store, and copy the data to all the nodes. That way all data access will be in-process, with no GC problems, and you will get very fast results. BDB and Lucene both have easy replication strategies.
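
To make that concrete, a minimal sketch of an in-process, read-only key-value lookup, assuming Berkeley DB Java Edition and a store that has already been copied to a local directory on each node (the environment path and database name are placeholders):

import java.io.File;
import java.io.UnsupportedEncodingException;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.DatabaseException;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;

public class LocalBdbLookup {
  private final Environment env;
  private final Database db;

  // Open the node-local store read-only; the JE cache keeps hot pages in memory.
  public LocalBdbLookup(File envDir) throws DatabaseException {
    EnvironmentConfig envConfig = new EnvironmentConfig();
    envConfig.setReadOnly(true);
    env = new Environment(envDir, envConfig);
    DatabaseConfig dbConfig = new DatabaseConfig();
    dbConfig.setReadOnly(true);
    db = env.openDatabase(null, "lookup", dbConfig);
  }

  public byte[] get(String key) throws DatabaseException, UnsupportedEncodingException {
    DatabaseEntry k = new DatabaseEntry(key.getBytes("UTF-8"));
    DatabaseEntry v = new DatabaseEntry();
    return db.get(null, k, v, LockMode.DEFAULT) == OperationStatus.SUCCESS ? v.getData() : null;
  }

  public void close() throws DatabaseException {
    db.close();
    env.close();
  }
}

A Lucene variant would play the same role: open an IndexSearcher over the local copy of the index and query it in-process.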

If the data is RW, and you need consistency, you should probably forget
about MapReduce and just run everything on big-iron.

Regards,
Alan Ho




----- Original Message ----
From: Devajyoti Sarkar <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Thursday, October 2, 2008 8:41:04 PM
Subject: Sharing an object across mappers

I think each mapper/reducer runs in its own JVM, which makes it impossible to share objects. I need to share a large object so that I can access it at memory speed across all the mappers. Is it possible to have all the mappers run in the same VM? Or is there a way to do this across VMs at high speed? I guess RMI and other such methods will just be too slow.

Thanks,
Dev




