Re: Sharing an object across mappers

2008-10-03 Thread Devajyoti Sarkar
Hi Alan,

Thanks for your message.

The object can be read-only once it is initialized - I do not need to modify
it. Essentially it is an object that allows me to analyze/modify data that I
am mapping/reducing. It comes to about 3-4GB of RAM. The problem I have is
that if I run multiple mappers, this object gets replicated in the different
VMs and I run out of memory on my node. I pretty much need to have the full
object in memory to do my processing. It is possible (though quite
difficult) to have it partially on disk and query it (like a lucene store
implementation), but there is a significant performance hit. For example, say
I use the xlarge CPU instance at Amazon (8 CPUs, 8 GB RAM). In that scenario I
can really only run 1 mapper per node even though there are 8 CPUs. But if
access through a shared object (e.g. RMI) or a persisted store (e.g. lucene)
is more than 8 times slower than in-memory access, then it is still cheaper to
run 1 mapper per node. I tried sharing with Terracotta and saw roughly a 600x
slowdown versus in-memory access.

So ideally, if I could have all the mappers in the same VM, then I can
create a singleton and still have multiple mappers access it at memory
speeds.
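
To make that concrete, the kind of per-JVM singleton I have in mind is roughly
the following. This is just a sketch with made-up class and method names, and
it only helps if the map tasks really do run as threads inside one JVM:

// Sketch only: a lazily initialized, read-only object shared by every map
// task that runs as a thread inside the same JVM. All names are illustrative.
public final class SharedModel {
    private static volatile SharedModel instance;
    private final java.util.Map<String, float[]> data;   // the ~3-4GB structure

    private SharedModel(java.util.Map<String, float[]> data) {
        this.data = data;
    }

    // Double-checked locking so the expensive load happens at most once per JVM.
    public static SharedModel get(String localPath) throws java.io.IOException {
        if (instance == null) {
            synchronized (SharedModel.class) {
                if (instance == null) {
                    instance = new SharedModel(loadFrom(localPath));
                }
            }
        }
        return instance;
    }

    // Read-only lookups, safe from any number of map threads.
    public float[] lookup(String key) {
        return data.get(key);
    }

    private static java.util.Map<String, float[]> loadFrom(String path)
            throws java.io.IOException {
        // Placeholder: parse the on-disk representation into memory here.
        return new java.util.HashMap<String, float[]>();
    }
}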

Please do let me know if I am looking at this correctly and if the above is
possible.

Thanks a lot for all your help.

Cheers,
Dev




On Fri, Oct 3, 2008 at 12:49 PM, Alan Ho [EMAIL PROTECTED] wrote:

 It really depends on what type of data you are sharing, how you are looking
 up the data, whether the data is Read-write, and whether you care about
 consistency. If you don't care about consistency, I suggest that you shove
 the data into a BDB store (for key-value lookup) or a lucene store, and copy
 the data to all the nodes. That way all data access will be in-process, no
 gc problems, and you will get very fast results. BDB and lucene both have
 easy replication strategies.

 If the data is RW, and you need consistency, you should probably forget
 about MapReduce and just run everything on big-iron.

 Regards,
 Alan Ho




 - Original Message 
 From: Devajyoti Sarkar [EMAIL PROTECTED]
 To: core-user@hadoop.apache.org
 Sent: Thursday, October 2, 2008 8:41:04 PM
 Subject: Sharing an object across mappers

 I think each mapper/reducer runs in its own JVM, which makes it impossible to
 share objects. I need to share a large object so that I can access it at
 memory speeds across all the mappers. Is it possible to have all the mappers
 run in the same VM? Or is there a way to do this across VMs at high speed? I
 guess RMI and other such methods will be just too slow.

 Thanks,
 Dev






Re: Sharing an object across mappers

2008-10-03 Thread Arun C Murthy


On Oct 3, 2008, at 1:10 AM, Devajyoti Sarkar wrote:


Hi Alan,

Thanks for your message.

The object can be read-only once it is initialized - I do not need to modify it.


Please take a look at DistributedCache:
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#DistributedCache

An example:
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Example%3A+WordCount+v2.0
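
Roughly, the usage looks like the following. This is only a sketch against the
old mapred API; the HDFS path and the mapper/job class names are placeholders:

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CacheExample {

  // Driver side: register the file; the framework copies it to each node once.
  public static void addModelToCache(JobConf conf) throws Exception {
    DistributedCache.addCacheFile(new URI("/user/dev/shared/model.dat"), conf);
  }

  // Task side: each task locates the node-local copy when it starts up.
  public static class ModelMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

    private Path localModel;

    public void configure(JobConf job) {
      try {
        Path[] cached = DistributedCache.getLocalCacheFiles(job);
        localModel = cached[0];          // load (or mmap) this local file
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      // ... analyze 'value' using whatever was loaded from localModel ...
    }
  }
}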

Arun




Re: Sharing an object across mappers

2008-10-03 Thread Devajyoti Sarkar
Hi Arun,

Briefly going through the DistributedCache information, it seems to be a way
to distribute files to mappers/reducers. One still needs to read the contents
into each map/reduce task VM, so the data gets replicated across the VMs on a
single node. It does not seem to address my basic problem, which is to have a
large shared object across multiple map/reduce tasks at a given node without
replicating it across the VMs.

Is there a setting in Hadoop where one can tell Hadoop to create the
individual map/reduce tasks in the same JVM?

Thanks,
Dev


On Fri, Oct 3, 2008 at 10:32 PM, Arun C Murthy [EMAIL PROTECTED] wrote:


 On Oct 3, 2008, at 1:10 AM, Devajyoti Sarkar wrote:

  Hi Alan,

 Thanks for your message.

 The object can be read-only once it is initialized - I do not need to modify it.


 Please take a look at DistributedCache:

 http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#DistributedCache

 An example:

 http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Example%3A+WordCount+v2.0

 Arun








Re: Sharing an object across mappers

2008-10-03 Thread Owen O'Malley


On Oct 3, 2008, at 7:49 AM, Devajyoti Sarkar wrote:

Briefly going through the DistributedCache information, it seems to be a way
to distribute files to mappers/reducers.


Sure, but it handles the distribution problem for you.


One still needs to read the
contents into each map/reduce task VM.


If the data is straight binary data, you could just mmap it from the various
tasks. It would be pretty efficient.
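
Something along these lines; just a sketch, and the file path stands in for
wherever the cached copy ends up on the node:

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapSketch {
  public static void main(String[] args) throws IOException {
    File f = new File("/path/to/local/model.dat");   // node-local copy
    RandomAccessFile raf = new RandomAccessFile(f, "r");
    FileChannel ch = raf.getChannel();

    // The OS page cache backs the mapping, so several task JVMs mapping the
    // same file share one physical copy instead of each loading 3-4GB on heap.
    // Note a single MappedByteBuffer is capped at 2GB, so a 3-4GB file has to
    // be mapped as two or more regions; only the first region is shown here.
    long regionSize = Math.min(ch.size(), (long) Integer.MAX_VALUE);
    MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, regionSize);

    float first = buf.getFloat(0);   // random access at close to memory speed
    System.out.println(first);

    ch.close();
    raf.close();
  }
}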


The other direction is to use the MultithreadedMapRunner and run multiple
maps as threads in the same VM. But unless your maps are CPU heavy or
contacting external servers, it probably won't help as much as you'd like.
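
The switch itself is just a couple of JobConf calls, roughly (sketch; the
thread count is arbitrary, and if I remember right the property is
mapred.map.multithreadedrunner.threads):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

public class ThreadedJobSetup {
  public static JobConf configure(Class jobClass) {
    JobConf conf = new JobConf(jobClass);
    // Run map() calls on multiple threads inside one task JVM...
    conf.setMapRunnerClass(MultithreadedMapRunner.class);
    // ...and choose how many threads share that JVM (and its heap).
    conf.setInt("mapred.map.multithreadedrunner.threads", 8);
    return conf;
  }
}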


-- Owen


Re: Sharing an object across mappers

2008-10-03 Thread Devajyoti Sarkar
Hi Owen,

Thanks a lot for the pointers.

In order to use the MultithreadedMapRunner, if I set it with the
setMapRunnerClass() method on the JobConf, does the rest of my code remain
the same (apart from making it thread-safe)?
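
What I am imagining is something like the sketch below (all names made up):
the shared object is loaded once and only ever read, and map() keeps its
per-record state in local variables:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ThreadSafeMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  // Shared and read-only after construction, so many threads can read it.
  private static volatile Object model;

  public void configure(JobConf job) {
    // Load once; later threads in the same JVM reuse the same instance.
    synchronized (ThreadSafeMapper.class) {
      if (model == null) {
        model = loadModel(job);
      }
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    // No writes to instance fields here: all per-record state is local,
    // so concurrent map() calls do not step on each other.
    String token = value.toString().trim();
    output.collect(new Text(token), new LongWritable(1));
  }

  private static Object loadModel(JobConf job) {
    return new Object();   // placeholder for the real 3-4GB structure
  }
}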

Thanks in advance,
Dev


On Sat, Oct 4, 2008 at 12:29 AM, Owen O'Malley [EMAIL PROTECTED] wrote:


 On Oct 3, 2008, at 7:49 AM, Devajyoti Sarkar wrote:

 Briefly going through the DistributedCache information, it seems to be a way
 to distribute files to mappers/reducers.


 Sure, but it handles the distribution problem for you.

  One still needs to read the
 contents into each map/reduce task VM.


 If the data is straight binary data, you could just mmap it from the
 various tasks. It would be pretty efficient.

 The other direction is to use the MultithreadedMapRunner and run multiple
 maps as threads in the same VM. But unless your maps are CPU heavy or
 contacting external servers, it probably won't help as much as you'd like.

 -- Owen



Sharing an object across mappers

2008-10-02 Thread Devajyoti Sarkar
I think each mapper/reducer runs in its own JVM, which makes it impossible to
share objects. I need to share a large object so that I can access it at
memory speeds across all the mappers. Is it possible to have all the mappers
run in the same VM? Or is there a way to do this across VMs at high speed? I
guess RMI and other such methods will be just too slow.

Thanks,
Dev


Re: Sharing an object across mappers

2008-10-02 Thread Alan Ho
It really depends on what type of data you are sharing, how you are looking up 
the data, whether the data is Read-write, and whether you care about 
consistency. If you don't care about consistency, I suggest that you shove the 
data into a BDB store (for key-value lookup) or a lucene store, and copy the 
data to all the nodes. That way all data access will be in-process, no gc 
problems, and you will get very fast results. BDB and lucene both have easy 
replication strategies.
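
For the BDB route, the read side on each node is only a few lines, something
like this (a sketch with BDB Java Edition; the environment path and key are
placeholders):

import java.io.File;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;

public class BdbLookup {
  public static void main(String[] args) throws Exception {
    // Open the locally copied store read-only; no consistency to worry about.
    EnvironmentConfig envConfig = new EnvironmentConfig();
    envConfig.setReadOnly(true);
    Environment env = new Environment(new File("/data/shared-bdb"), envConfig);

    DatabaseConfig dbConfig = new DatabaseConfig();
    dbConfig.setReadOnly(true);
    Database db = env.openDatabase(null, "shared", dbConfig);

    // In-process key-value lookup: no RPC, no extra JVM, just local reads.
    DatabaseEntry key = new DatabaseEntry("some-key".getBytes("UTF-8"));
    DatabaseEntry value = new DatabaseEntry();
    if (db.get(null, key, value, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
      System.out.println(new String(value.getData(), "UTF-8"));
    }

    db.close();
    env.close();
  }
}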

If the data is RW, and you need consistency, you should probably forget about 
MapReduce and just run everything on big-iron.

Regards,
Alan Ho




- Original Message 
From: Devajyoti Sarkar [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Sent: Thursday, October 2, 2008 8:41:04 PM
Subject: Sharing an object across mappers

I think each mapper/reducer runs in its own JVM, which makes it impossible to
share objects. I need to share a large object so that I can access it at
memory speeds across all the mappers. Is it possible to have all the mappers
run in the same VM? Or is there a way to do this across VMs at high speed? I
guess RMI and other such methods will be just too slow.

Thanks,
Dev


