Hi Arun,

I went briefly through the DistributedCache documentation. It appears to be a way to distribute files to the nodes running mappers/reducers, but one still needs to read the contents into each map/reduce task VM, so the data gets replicated across the VMs on a single node. It therefore does not address my basic problem, which is to have one large shared object across multiple map/reduce tasks at a given node without replicating it in each VM.
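To make the memory problem concrete, here is roughly what each map task would do with DistributedCache (old 0.18-style API; SharedModel and its loadFrom/analyze methods are just stand-ins for my actual 3-4GB analysis object). The cached file lands once on the node's local disk, but every task JVM still builds its own in-memory copy in configure():

    import java.io.IOException;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class AnalysisMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private SharedModel model; // ~3-4GB once deserialized

      @Override
      public void configure(JobConf conf) {
        try {
          // DistributedCache has already copied the file to this node's
          // local disk, but each task JVM still loads its own copy here.
          Path[] cached = DistributedCache.getLocalCacheFiles(conf);
          model = SharedModel.loadFrom(cached[0]); // placeholder loader
        } catch (IOException e) {
          throw new RuntimeException("could not load model", e);
        }
      }

      public void map(LongWritable key, Text value,
          OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        output.collect(new Text(model.analyze(value.toString())), value);
      }
    }

With N task slots per node, that configure() runs in N separate JVMs and I end up with N copies of the model, which is exactly what I cannot afford.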
Is there a setting in Hadoop where one can tell Hadoop to create the individual map/reduce tasks in the same JVM? (A rough sketch of what I am hoping for is below, after the quoted thread.)

Thanks,
Dev

On Fri, Oct 3, 2008 at 10:32 PM, Arun C Murthy <[EMAIL PROTECTED]> wrote:
>
> On Oct 3, 2008, at 1:10 AM, Devajyoti Sarkar wrote:
>
>> Hi Alan,
>>
>> Thanks for your message.
>>
>> The object can be read-only once it is initialized - I do not need to
>> modify
>
> Please take a look at DistributedCache:
>
> http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#DistributedCache
>
> An example:
>
> http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Example%3A+WordCount+v2.0
>
> Arun
>
>> it. Essentially it is an object that allows me to analyze/modify data
>> that I am mapping/reducing. It comes to about 3-4GB of RAM. The problem
>> I have is that if I run multiple mappers, this object gets replicated
>> in the different VMs and I run out of memory on my node. I pretty much
>> need to have the full object in memory to do my processing. It is
>> possible (though quite difficult) to have it partially on disk and
>> query it (like a lucene store implementation) but there is a
>> significant performance hit. As an example, say I use the xlarge CPU
>> instance at Amazon (8 CPUs, 8GB RAM). In this scenario, I can really
>> only have 1 mapper per node whereas there are 8 CPUs. But if the
>> overhead of sharing the object (e.g. RMI) or persisting the object
>> (e.g. lucene) is greater than 8 times the memory speed, then it is
>> cheaper to run 1 mapper/node. I tried sharing with Terracotta and I was
>> getting a roughly 600 times decrease in performance versus in-memory
>> access.
>>
>> So ideally, if I could have all the mappers in the same VM, then I
>> could create a singleton and still have multiple mappers access it at
>> memory speeds.
>>
>> Please do let me know if I am looking at this correctly and if the
>> above is possible.
>>
>> Thanks a lot for all your help.
>>
>> Cheers,
>> Dev
>>
>> On Fri, Oct 3, 2008 at 12:49 PM, Alan Ho <[EMAIL PROTECTED]> wrote:
>>
>>> It really depends on what type of data you are sharing, how you are
>>> looking up the data, whether the data is read-write, and whether you
>>> care about consistency. If you don't care about consistency, I suggest
>>> that you shove the data into a BDB store (for key-value lookup) or a
>>> lucene store, and copy the data to all the nodes. That way all data
>>> access will be in-process, no GC problems, and you will get very fast
>>> results. BDB and lucene both have easy replication strategies.
>>>
>>> If the data is RW, and you need consistency, you should probably
>>> forget about MapReduce and just run everything on big-iron.
>>>
>>> Regards,
>>> Alan Ho
>>>
>>> ----- Original Message ----
>>> From: Devajyoti Sarkar <[EMAIL PROTECTED]>
>>> To: core-user@hadoop.apache.org
>>> Sent: Thursday, October 2, 2008 8:41:04 PM
>>> Subject: Sharing an object across mappers
>>>
>>> I think each mapper/reducer runs in its own JVM, which makes it
>>> impossible to share objects. I need to share a large object so that I
>>> can access it at memory speeds across all the mappers. Is it possible
>>> to have all the mappers run in the same VM? Or is there a way to do
>>> this across VMs at high speed? I guess RMI and other such methods will
>>> be just too slow.
>>>
>>> Thanks,
>>> Dev
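P.S. Digging around a bit more after writing the above: two things look potentially relevant, though I have not tried either against a released version, so the sketch below is untested guesswork on my part. The 0.19 branch seems to add a JVM-reuse setting, mapred.job.reuse.jvm.num.tasks (HADOOP-249, if I am reading the JIRA right), which would let the tasks of one job on a node run sequentially in a single reused JVM; and MultithreadedMapRunner runs several map threads concurrently inside one task JVM. Either way, a lazily-initialized static singleton (SharedModel again being a placeholder for my object) would then be shared by everything in that JVM:

    import java.io.IOException;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

    public class JobSetup {
      public static void configureSharing(JobConf conf) {
        // Reuse one JVM for consecutive tasks of this job on a node
        // (-1 = no limit). 0.19-branch setting, as far as I can tell.
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);

        // Alternatively: run several map threads inside ONE task JVM,
        // all sharing the same statics. map() must then be thread-safe.
        conf.setMapRunnerClass(MultithreadedMapRunner.class);
        conf.setInt("mapred.map.multithreadedrunner.threads", 8);
      }
    }

    // Loaded at most once per JVM; synchronized so only the first
    // caller pays the load cost, later tasks/threads just reuse it.
    class ModelHolder {
      private static SharedModel model; // placeholder type

      static synchronized SharedModel get(JobConf conf) throws IOException {
        if (model == null) {
          model = SharedModel.loadFrom(conf); // hypothetical loader
        }
        return model;
      }
    }

One caveat I can see: JVM reuse only helps tasks that run one after another in the same slot, so with 8 concurrent slots per node I would still end up with 8 copies; the multithreaded runner is what would actually collapse them into one, at the price of making map() thread-safe.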