Hi Arun,

From a quick look at the DistributedCache documentation, it seems to be a way
to distribute files to the mappers/reducers. One still needs to read the
contents into each map/reduce task JVM, so the data gets replicated across
the JVMs on a single node. It does not seem to address my basic problem,
which is to share one large object across multiple map/reduce tasks on a
given node without replicating it in every JVM.
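
As I understand it, the usage would look roughly like the sketch below
(written against the old org.apache.hadoop.mapred API; MyLargeObject and its
loadFrom() method are just placeholders for my own code). Each task re-reads
the cached file in configure(), which is exactly why every task JVM ends up
holding its own copy:

import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// At submission time the file is registered once with the job:
//   DistributedCache.addCacheFile(new java.net.URI("/user/dev/lookup.dat"), conf);

public class LookupMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private MyLargeObject lookup;   // placeholder for the 3-4GB structure

  public void configure(JobConf conf) {
    try {
      // The file is already on the node's local disk, but each task
      // still deserializes it into its own JVM heap.
      Path[] cached = DistributedCache.getLocalCacheFiles(conf);
      lookup = MyLargeObject.loadFrom(cached[0]);   // hypothetical loader
    } catch (IOException e) {
      throw new RuntimeException("Could not load cached lookup file", e);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // ... use lookup to process each record ...
  }
}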

Is there a setting that tells Hadoop to run the individual map/reduce tasks
in the same JVM?

Thanks,
Dev


On Fri, Oct 3, 2008 at 10:32 PM, Arun C Murthy <[EMAIL PROTECTED]> wrote:

>
> On Oct 3, 2008, at 1:10 AM, Devajyoti Sarkar wrote:
>
>  Hi Alan,
>>
>> Thanks for your message.
>>
>> The object can be read-only once it is initialized - I do not need to
>> modify
>>
>
> Please take a look at DistributedCache:
>
> http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#DistributedCache
>
> An example:
>
> http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Example%3A+WordCount+v2.0
>
> Arun
>
>
>
>> it. Essentially it is an object that allows me to analyze/modify the data
>> that I am mapping/reducing. It comes to about 3-4GB of RAM. The problem I
>> have is that if I run multiple mappers, this object gets replicated in the
>> different VMs and I run out of memory on my node. I pretty much need the
>> full object in memory to do my processing. It is possible (though quite
>> difficult) to keep it partially on disk and query it (like a lucene store
>> implementation), but there is a significant performance hit. For example,
>> say I use the xlarge CPU instance at Amazon (8 CPUs, 8GB RAM). In that
>> scenario I can really only have 1 mapper per node even though there are 8
>> CPUs. But if the overhead of sharing the object (e.g. RMI) or persisting it
>> (e.g. lucene) makes access more than 8 times slower than plain in-memory
>> access, then it is cheaper to run just 1 mapper per node. When I tried
>> sharing with Terracotta, I saw roughly a 600x slowdown versus in-memory
>> access.
>>
>> So ideally, if I could have all the mappers in the same VM, I could create
>> a singleton and still have multiple mappers access it at memory speed.
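>>
>> Something along these lines is what I have in mind - just a rough sketch,
>> with MyLargeObject standing in for my own class:
>>
>> public class SharedLookup {
>>   private static volatile MyLargeObject INSTANCE;
>>
>>   // Build the object lazily, once per JVM; every mapper that later runs
>>   // in the same JVM reuses it at memory speed.
>>   public static MyLargeObject get() {
>>     if (INSTANCE == null) {
>>       synchronized (SharedLookup.class) {
>>         if (INSTANCE == null) {
>>           INSTANCE = MyLargeObject.build();   // hypothetical loader
>>         }
>>       }
>>     }
>>     return INSTANCE;
>>   }
>> }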
>>
>> Please do let me know if I am looking at this correctly and if the above
>> is possible.
>>
>> Thanks a lot for all your help.
>>
>> Cheers,
>> Dev
>>
>>
>>
>>
>> On Fri, Oct 3, 2008 at 12:49 PM, Alan Ho <[EMAIL PROTECTED]> wrote:
>>
>>  It really depends on what type of data you are sharing, how you are
>>> looking up the data, whether the data is read-write, and whether you care
>>> about consistency. If you don't care about consistency, I suggest that you
>>> shove the data into a BDB store (for key-value lookup) or a lucene store
>>> and copy the data to all the nodes. That way all data access will be
>>> in-process, with no GC problems, and you will get very fast results. BDB
>>> and lucene both have easy replication strategies.
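>>>
>>> For the BDB route, something like this (Berkeley DB Java Edition,
>>> com.sleepycat.je - the store name, paths and key format here are just
>>> placeholders), opened read-only against the locally copied store, keeps
>>> every lookup in-process:
>>>
>>> import java.io.File;
>>> import com.sleepycat.je.*;
>>>
>>> public class LocalLookup {
>>>   private final Environment env;
>>>   private final Database db;
>>>
>>>   public LocalLookup(File dir) throws DatabaseException {
>>>     EnvironmentConfig envConf = new EnvironmentConfig();
>>>     envConf.setReadOnly(true);          // the local copy is never modified
>>>     env = new Environment(dir, envConf);
>>>     DatabaseConfig dbConf = new DatabaseConfig();
>>>     dbConf.setReadOnly(true);
>>>     db = env.openDatabase(null, "lookup", dbConf);
>>>   }
>>>
>>>   // Returns the stored bytes for a key, or null if the key is absent.
>>>   public byte[] get(byte[] keyBytes) throws DatabaseException {
>>>     DatabaseEntry key = new DatabaseEntry(keyBytes);
>>>     DatabaseEntry value = new DatabaseEntry();
>>>     OperationStatus status = db.get(null, key, value, LockMode.DEFAULT);
>>>     return status == OperationStatus.SUCCESS ? value.getData() : null;
>>>   }
>>> }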
>>>
>>> If the data is RW, and you need consistency, you should probably forget
>>> about MapReduce and just run everything on big-iron.
>>>
>>> Regards,
>>> Alan Ho
>>>
>>>
>>>
>>>
>>> ----- Original Message ----
>>> From: Devajyoti Sarkar <[EMAIL PROTECTED]>
>>> To: core-user@hadoop.apache.org
>>> Sent: Thursday, October 2, 2008 8:41:04 PM
>>> Subject: Sharing an object across mappers
>>>
>>> I think each mapper/reducer runs in its own JVM, which makes it impossible
>>> to share objects. I need to share a large object so that I can access it
>>> at memory speed across all the mappers. Is it possible to have all the
>>> mappers run in the same VM? Or is there a way to do this across VMs at
>>> high speed? I guess RMI and other such methods will just be too slow.
>>>
>>> Thanks,
>>> Dev
>>>
>>>
>>>
>
