Re: Sharing an object across mappers
Hi Owen,

Thanks a lot for the pointers. To use the MultiThreadedMapRunner, if I set it via setMapRunnerClass() on the JobConf, does the rest of my code remain the same (apart from making it thread-safe)?

Thanks in advance,
Dev

On Sat, Oct 4, 2008 at 12:29 AM, Owen O'Malley <[EMAIL PROTECTED]> wrote:
>
> On Oct 3, 2008, at 7:49 AM, Devajyoti Sarkar wrote:
>
>> Briefly going through the DistributedCache information, it seems to be a
>> way to distribute files to mappers/reducers.
>
> Sure, but it handles the distribution problem for you.
>
>> One still needs to read the contents into each map/reduce task VM.
>
> If the data is straight binary data, you could just mmap it from the
> various tasks. It would be pretty efficient.
>
> The other direction is to use the MultiThreadedMapRunner and run multiple
> maps as threads in the same VM. But unless your maps are CPU heavy or
> contacting external servers, it probably won't help as much as you'd like.
>
> -- Owen
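A minimal sketch of the JobConf change under discussion, assuming the classic org.apache.hadoop.mapred API of that era; the thread-count property name is an assumption based on MultithreadedMapRunner's configuration, and everything else in the job setup stays as it was:

```java
// Sketch only: assumes Hadoop's old mapred API and the
// MultithreadedMapRunner shipped in org.apache.hadoop.mapred.lib.
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

public class MultithreadedJobSetup {
  public static void configure(JobConf conf) {
    // Swap the default MapRunner for the multithreaded one. The Mapper
    // class, input/output formats, etc. stay exactly as before, but the
    // Mapper implementation must now be thread-safe, since several
    // map() calls run concurrently in one task JVM.
    conf.setMapRunnerClass(MultithreadedMapRunner.class);
    // Number of concurrent map threads per task JVM (assumed property name).
    conf.setInt("mapred.map.multithreadedrunner.threads", 8);
  }
}
```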
Re: Sharing an object across mappers
On Oct 3, 2008, at 7:49 AM, Devajyoti Sarkar wrote:

> Briefly going through the DistributedCache information, it seems to be a
> way to distribute files to mappers/reducers.

Sure, but it handles the distribution problem for you.

> One still needs to read the contents into each map/reduce task VM.

If the data is straight binary data, you could just mmap it from the various tasks. It would be pretty efficient.

The other direction is to use the MultiThreadedMapRunner and run multiple maps as threads in the same VM. But unless your maps are CPU heavy or contacting external servers, it probably won't help as much as you'd like.

-- Owen
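Owen's mmap suggestion relies on the OS page cache: if each task JVM memory-maps the same local file (for example, the DistributedCache's local copy), the bytes occupy physical memory once per node, not once per VM. A small self-contained sketch with java.nio (file names here are illustrative):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class SharedBinaryData {
  // Map a local binary file read-only. Every JVM that maps the same file
  // shares the underlying pages through the OS page cache, so the data
  // is held in physical memory only once per node. The mapping stays
  // valid after the channel is closed.
  static MappedByteBuffer map(String path) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(path, "r");
         FileChannel ch = raf.getChannel()) {
      return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
    }
  }

  public static void main(String[] args) throws IOException {
    // Demo with a small temp file; in a real job the path would point at
    // the node-local copy of the distributed file.
    File f = File.createTempFile("shared", ".bin");
    f.deleteOnExit();
    try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
      raf.write(new byte[] {1, 2, 3, 4});
    }
    MappedByteBuffer buf = map(f.getPath());
    System.out.println(buf.get(0) + buf.get(3)); // prints 5
  }
}
```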
Re: Sharing an object across mappers
Hi Arun,

Briefly going through the DistributedCache information, it seems to be a way to distribute files to mappers/reducers. One still needs to read the contents into each map/reduce task VM. Therefore, the data gets replicated across the VMs on a single node. It does not address my basic problem, which is to have a large shared object across multiple map/reduce tasks on a given node without having to replicate it across the VMs. Is there a setting in Hadoop where one can tell Hadoop to create the individual map/reduce tasks in the same JVM?

Thanks,
Dev

On Fri, Oct 3, 2008 at 10:32 PM, Arun C Murthy <[EMAIL PROTECTED]> wrote:
>
> On Oct 3, 2008, at 1:10 AM, Devajyoti Sarkar wrote:
>
>> Hi Alan,
>>
>> Thanks for your message.
>>
>> The object can be read-only once it is initialized - I do not need to
>> modify it.
>
> Please take a look at DistributedCache:
>
> http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#DistributedCache
>
> An example:
>
> http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Example%3A+WordCount+v2.0
>
> Arun
>
>> Essentially it is an object that allows me to analyze/modify data that I
>> am mapping/reducing. It comes to about 3-4GB of RAM. The problem I have is
>> that if I run multiple mappers, this object gets replicated in the
>> different VMs and I run out of memory on my node. I pretty much need to
>> have the full object in memory to do my processing. It is possible (though
>> quite difficult) to have it partially on disk and query it (like a lucene
>> store implementation) but there is a significant performance hit. As an
>> example, let us say I use the xlarge CPU instance at Amazon (8 CPUs, 8GB
>> RAM). In this scenario, I can really only have 1 mapper per node whereas
>> there are 8 CPUs. But if the overhead of sharing the object (e.g. RMI) or
>> persisting the object (e.g. lucene) is greater than 8 times the memory
>> speed, then it is cheaper to run 1 mapper/node. I tried sharing with
>> Terracotta and I was getting a roughly 600 times decrease in performance
>> versus in-memory access.
>>
>> So ideally, if I could have all the mappers in the same VM, then I can
>> create a singleton and still have multiple mappers access it at memory
>> speeds.
>>
>> Please do let me know if I am looking at this correctly and if the above
>> is possible.
>>
>> Thanks a lot for all your help.
>>
>> Cheers,
>> Dev
>>
>> On Fri, Oct 3, 2008 at 12:49 PM, Alan Ho <[EMAIL PROTECTED]> wrote:
>>
>>> It really depends on what type of data you are sharing, how you are
>>> looking up the data, whether the data is read-write, and whether you
>>> care about consistency. If you don't care about consistency, I suggest
>>> that you shove the data into a BDB store (for key-value lookup) or a
>>> lucene store, and copy the data to all the nodes. That way all data
>>> access will be in-process, no gc problems, and you will get very fast
>>> results. BDB and lucene both have easy replication strategies.
>>>
>>> If the data is RW, and you need consistency, you should probably forget
>>> about MapReduce and just run everything on big-iron.
>>>
>>> Regards,
>>> Alan Ho
>>>
>>> ----- Original Message -----
>>> From: Devajyoti Sarkar <[EMAIL PROTECTED]>
>>> To: core-user@hadoop.apache.org
>>> Sent: Thursday, October 2, 2008 8:41:04 PM
>>> Subject: Sharing an object across mappers
>>>
>>> I think each mapper/reducer runs in its own JVM which makes it
>>> impossible to share objects. I need to share a large object so that I
>>> can access it at memory speeds across all the mappers. Is it possible to
>>> have all the mappers run in the same VM? Or is there a way to do this
>>> across VMs at high speed? I guess RMI and other such methods will be
>>> just too slow.
>>>
>>> Thanks,
>>> Dev
>>>
>>> __________________________________________________
>>> Instant Messaging, free SMS, sharing photos and more... Try the new
>>> Yahoo! Canada Messenger at http://ca.beta.messenger.yahoo.com/
Re: Sharing an object across mappers
On Oct 3, 2008, at 1:10 AM, Devajyoti Sarkar wrote:

> Hi Alan,
>
> Thanks for your message.
>
> The object can be read-only once it is initialized - I do not need to
> modify it.

Please take a look at DistributedCache:

http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#DistributedCache

An example:

http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Example%3A+WordCount+v2.0

Arun

> Essentially it is an object that allows me to analyze/modify data that I
> am mapping/reducing. It comes to about 3-4GB of RAM. The problem I have is
> that if I run multiple mappers, this object gets replicated in the
> different VMs and I run out of memory on my node. I pretty much need to
> have the full object in memory to do my processing. It is possible (though
> quite difficult) to have it partially on disk and query it (like a lucene
> store implementation) but there is a significant performance hit. As an
> example, let us say I use the xlarge CPU instance at Amazon (8 CPUs, 8GB
> RAM). In this scenario, I can really only have 1 mapper per node whereas
> there are 8 CPUs. But if the overhead of sharing the object (e.g. RMI) or
> persisting the object (e.g. lucene) is greater than 8 times the memory
> speed, then it is cheaper to run 1 mapper/node. I tried sharing with
> Terracotta and I was getting a roughly 600 times decrease in performance
> versus in-memory access.
>
> So ideally, if I could have all the mappers in the same VM, then I can
> create a singleton and still have multiple mappers access it at memory
> speeds.
>
> Please do let me know if I am looking at this correctly and if the above
> is possible.
>
> Thanks a lot for all your help.
>
> Cheers,
> Dev
>
> On Fri, Oct 3, 2008 at 12:49 PM, Alan Ho <[EMAIL PROTECTED]> wrote:
>
>> It really depends on what type of data you are sharing, how you are
>> looking up the data, whether the data is read-write, and whether you care
>> about consistency. If you don't care about consistency, I suggest that
>> you shove the data into a BDB store (for key-value lookup) or a lucene
>> store, and copy the data to all the nodes. That way all data access will
>> be in-process, no gc problems, and you will get very fast results. BDB
>> and lucene both have easy replication strategies.
>>
>> If the data is RW, and you need consistency, you should probably forget
>> about MapReduce and just run everything on big-iron.
>>
>> Regards,
>> Alan Ho
>>
>> ----- Original Message -----
>> From: Devajyoti Sarkar <[EMAIL PROTECTED]>
>> To: core-user@hadoop.apache.org
>> Sent: Thursday, October 2, 2008 8:41:04 PM
>> Subject: Sharing an object across mappers
>>
>> I think each mapper/reducer runs in its own JVM which makes it impossible
>> to share objects. I need to share a large object so that I can access it
>> at memory speeds across all the mappers. Is it possible to have all the
>> mappers run in the same VM? Or is there a way to do this across VMs at
>> high speed? I guess RMI and other such methods will be just too slow.
>>
>> Thanks,
>> Dev
>>
>> __________________________________________________
>> Instant Messaging, free SMS, sharing photos and more... Try the new
>> Yahoo! Canada Messenger at http://ca.beta.messenger.yahoo.com/
Re: Sharing an object across mappers
Hi Alan,

Thanks for your message.

The object can be read-only once it is initialized - I do not need to modify it. Essentially it is an object that allows me to analyze/modify data that I am mapping/reducing. It comes to about 3-4GB of RAM. The problem I have is that if I run multiple mappers, this object gets replicated in the different VMs and I run out of memory on my node. I pretty much need to have the full object in memory to do my processing. It is possible (though quite difficult) to have it partially on disk and query it (like a lucene store implementation) but there is a significant performance hit. As an example, let us say I use the xlarge CPU instance at Amazon (8 CPUs, 8GB RAM). In this scenario, I can really only have 1 mapper per node whereas there are 8 CPUs. But if the overhead of sharing the object (e.g. RMI) or persisting the object (e.g. lucene) is greater than 8 times the memory speed, then it is cheaper to run 1 mapper/node. I tried sharing with Terracotta and I was getting a roughly 600 times decrease in performance versus in-memory access.

So ideally, if I could have all the mappers in the same VM, then I can create a singleton and still have multiple mappers access it at memory speeds.

Please do let me know if I am looking at this correctly and if the above is possible.

Thanks a lot for all your help.

Cheers,
Dev

On Fri, Oct 3, 2008 at 12:49 PM, Alan Ho <[EMAIL PROTECTED]> wrote:

> It really depends on what type of data you are sharing, how you are
> looking up the data, whether the data is read-write, and whether you care
> about consistency. If you don't care about consistency, I suggest that you
> shove the data into a BDB store (for key-value lookup) or a lucene store,
> and copy the data to all the nodes. That way all data access will be
> in-process, no gc problems, and you will get very fast results. BDB and
> lucene both have easy replication strategies.
>
> If the data is RW, and you need consistency, you should probably forget
> about MapReduce and just run everything on big-iron.
>
> Regards,
> Alan Ho
>
> ----- Original Message -----
> From: Devajyoti Sarkar <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Thursday, October 2, 2008 8:41:04 PM
> Subject: Sharing an object across mappers
>
> I think each mapper/reducer runs in its own JVM which makes it impossible
> to share objects. I need to share a large object so that I can access it
> at memory speeds across all the mappers. Is it possible to have all the
> mappers run in the same VM? Or is there a way to do this across VMs at
> high speed? I guess RMI and other such methods will be just too slow.
>
> Thanks,
> Dev
>
> __________________________________________________
> Instant Messaging, free SMS, sharing photos and more... Try the new Yahoo!
> Canada Messenger at http://ca.beta.messenger.yahoo.com/
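The singleton idea in the message above, one copy per JVM shared by all map threads when the maps run in a single VM, can be sketched with the initialization-on-demand holder idiom. Class and field names are illustrative, and the tiny table stands in for the real multi-gigabyte object:

```java
public final class SharedModel {
  private final java.util.Map<String, Integer> table;

  private SharedModel() {
    // Expensive one-time load (in the real case, 3-4 GB read from local
    // disk). Illustrated here with a tiny in-memory table.
    table = new java.util.HashMap<>();
    table.put("example", 42);
  }

  // Initialization-on-demand holder: the JVM's class-initialization
  // guarantees build INSTANCE exactly once, the first time get() is
  // called, with no locking on the read path afterwards. All map
  // threads in the JVM then share the single instance at memory speed.
  private static final class Holder {
    static final SharedModel INSTANCE = new SharedModel();
  }

  public static SharedModel get() {
    return Holder.INSTANCE;
  }

  // Read-only lookups are safe to call from many threads concurrently.
  public int lookup(String key) {
    return table.getOrDefault(key, -1);
  }
}
```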
Re: Sharing an object across mappers
It really depends on what type of data you are sharing, how you are looking up the data, whether the data is read-write, and whether you care about consistency. If you don't care about consistency, I suggest that you shove the data into a BDB store (for key-value lookup) or a lucene store, and copy the data to all the nodes. That way all data access will be in-process, no gc problems, and you will get very fast results. BDB and lucene both have easy replication strategies.

If the data is RW, and you need consistency, you should probably forget about MapReduce and just run everything on big-iron.

Regards,
Alan Ho

----- Original Message -----
From: Devajyoti Sarkar <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Thursday, October 2, 2008 8:41:04 PM
Subject: Sharing an object across mappers

I think each mapper/reducer runs in its own JVM, which makes it impossible to share objects. I need to share a large object so that I can access it at memory speeds across all the mappers. Is it possible to have all the mappers run in the same VM? Or is there a way to do this across VMs at high speed? I guess RMI and other such methods will be just too slow.

Thanks,
Dev

__________________________________________________
Instant Messaging, free SMS, sharing photos and more... Try the new Yahoo! Canada Messenger at http://ca.beta.messenger.yahoo.com/
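As a toy illustration of the per-node, in-process store Alan describes: the data set is copied to each node as a flat file and loaded once into the process, so every lookup is a plain in-memory read with no network hop. A real deployment would use BDB or a Lucene index rather than this hypothetical HashMap loader:

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Stand-in for a node-local read-only store. Loading happens once at
// startup; after that, all reads are in-process with no GC of remote
// proxies and no serialization cost.
public class LocalLookup {
  private final Map<String, String> index = new HashMap<>();

  // Expects one "key<TAB>value" pair per line in the local file.
  public LocalLookup(File tsv) throws IOException {
    try (BufferedReader r = new BufferedReader(new FileReader(tsv))) {
      String line;
      while ((line = r.readLine()) != null) {
        String[] kv = line.split("\t", 2);
        if (kv.length == 2) index.put(kv[0], kv[1]);
      }
    }
  }

  // In-memory lookup; returns null for unknown keys.
  public String get(String key) {
    return index.get(key);
  }
}
```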
Sharing an object across mappers
I think each mapper/reducer runs in its own JVM, which makes it impossible to share objects. I need to share a large object so that I can access it at memory speeds across all the mappers. Is it possible to have all the mappers run in the same VM? Or is there a way to do this across VMs at high speed? I guess RMI and other such methods will be just too slow.

Thanks,
Dev