Re: strategies to share information between mapreduce tasks

2012-09-26 Thread Bertrand Dechoux
The difficulty with data transfer between tasks is handling synchronisation and failure. You may want to look at graph processing done on top of Hadoop (like Giraph). That's one way to do it, but whether it is relevant to you will depend on your context. Regards Bertrand On Wed, Sep 26,

Re: strategies to share information between mapreduce tasks

2012-09-26 Thread Jonathan Bishop
Yes, Giraph seems like the best way to go - it is mainly vertex evaluation with message passing between vertices. Synchronization is handled for you. On Wed, Sep 26, 2012 at 8:36 AM, Jane Wayne jane.wayne2...@gmail.com wrote: hi, i know that some algorithms cannot be parallelized and adapted
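The vertex/message model Jonathan describes can be sketched without any Giraph dependency. The following is a toy, single-process sketch in plain Java, not Giraph's actual Computation API: it only illustrates the BSP idea that vertices exchange messages and a barrier between supersteps defers delivery to the next superstep. The example is max-value propagation, the classic Giraph demo.

```java
import java.util.*;

// Toy, in-memory sketch of the BSP vertex/message model (NOT Giraph's API):
// vertices send messages; messages sent in one superstep are only delivered
// in the next, which is what the synchronization barrier buys you.
public class MaxValueBsp {

    // edges: vertex id -> neighbour ids; init: vertex id -> starting value.
    static Map<Integer, Integer> propagateMax(Map<Integer, int[]> edges,
                                              Map<Integer, Integer> init) {
        Map<Integer, Integer> value = new HashMap<>(init);
        // superstep 0: every vertex announces its value to its neighbours
        Map<Integer, List<Integer>> inbox = new HashMap<>();
        for (int v : value.keySet())
            for (int n : edges.get(v))
                inbox.computeIfAbsent(n, k -> new ArrayList<>()).add(value.get(v));

        boolean anyActive = true;
        while (anyActive) {                      // one loop pass == one superstep
            anyActive = false;
            Map<Integer, List<Integer>> next = new HashMap<>();
            for (int v : value.keySet()) {
                int max = value.get(v);
                for (int m : inbox.getOrDefault(v, List.of()))
                    max = Math.max(max, m);
                if (max > value.get(v)) {        // learned a bigger value: stay active
                    value.put(v, max);
                    anyActive = true;
                    for (int n : edges.get(v))
                        next.computeIfAbsent(n, k -> new ArrayList<>()).add(max);
                }
            }
            inbox = next;                        // the "barrier": delivery is deferred
        }
        return value;
    }

    public static void main(String[] args) {
        Map<Integer, int[]> edges = Map.of(
            1, new int[]{2}, 2, new int[]{1, 3}, 3, new int[]{2});
        // every vertex converges to the global maximum, 9
        System.out.println(propagateMax(edges, Map.of(1, 5, 2, 1, 3, 9)));
    }
}
```

In real Giraph the framework distributes vertices across workers and runs the barrier between cluster-wide supersteps; the structure of the per-vertex compute step is the same.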

Re: strategies to share information between mapreduce tasks

2012-09-26 Thread Jane Wayne
my problem is more general (than graph problems) and doesn't need to have logic built around synchronization or failure. for example, when a mapper is finished successfully, it just writes/persists to a storage location (could be disk, could be database, could be memory, etc...). when the next
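The write-on-success handoff described above can be sketched in plain Java against the local filesystem. The names below (publish, readStage) are illustrative, not any Hadoop API. The one real caveat worth encoding is that MapReduce may run duplicate (speculative) attempts of the same task; Hadoop's own output commit protocol handles this by writing to an attempt-private temp file and promoting it only on success, which this sketch mimics with an atomic rename:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Plain-Java sketch of "task persists on success, later task reads".
// publish/readStage are hypothetical names, not a Hadoop API. The atomic
// rename mirrors what MapReduce's commit step does: duplicate attempts
// write private temp files, and only a successful attempt's output is
// promoted, so readers never observe partial results.
public class SharedStageStore {

    static Path publish(Path stageDir, String taskId, String payload) throws IOException {
        Path tmp = stageDir.resolve("_tmp-" + taskId);   // attempt-private file
        Files.writeString(tmp, payload);
        Path fin = stageDir.resolve(taskId + ".out");    // name visible to readers
        return Files.move(tmp, fin, StandardCopyOption.ATOMIC_MOVE);
    }

    static List<String> readStage(Path stageDir) throws IOException {
        List<String> out = new ArrayList<>();
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(stageDir, "*.out")) {
            for (Path p : ds) out.add(Files.readString(p));
        }
        Collections.sort(out);                           // directory order is unspecified
        return out;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("stage1");
        publish(dir, "task-0", "alpha");
        publish(dir, "task-1", "beta");
        System.out.println(readStage(dir));              // [alpha, beta]
    }
}
```

With HDFS or a database as the storage location the same shape applies: write somewhere private, then make the result visible in a single atomic step.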

Re: strategies to share information between mapreduce tasks

2012-09-26 Thread Harsh J
Apache Giraph is a framework for graph processing; it currently runs over MR (but is getting its own coordination via YARN soon): http://giraph.apache.org. You may also check out the generic BSP system (Giraph uses BSP too, if I am not wrong, but doesn't use Hama - it works over MR instead), Apache Hama:

Re: strategies to share information between mapreduce tasks

2012-09-26 Thread Jay Vyas
The reason this is so rare is that map/reduce tasks are by nature orthogonal: word count, batch image recognition, terasort - all the things Hadoop is famous for are largely orthogonal tasks. It's much rarer (I think) to see people using hadoop to do traffic

Re: strategies to share information between mapreduce tasks

2012-09-26 Thread Bertrand Dechoux
I wouldn't be so surprised. It takes time, energy and money to solve problems and build solutions that are production-ready. A few people would consider the namenode/secondary SPOF a limit for Hadoop itself in a critical production environment. (I am only quoting it and

Re: strategies to share information between mapreduce tasks

2012-09-26 Thread Harsh J
Also read: http://arxiv.org/abs/1209.2191 ;-)

Re: strategies to share information between mapreduce tasks

2012-09-26 Thread Jane Wayne
thanks. the issues pointed out there do cover the pain points i'm experiencing. On Wed, Sep 26, 2012 at 3:11 PM, Harsh J ha...@cloudera.com wrote: Also read: http://arxiv.org/abs/1209.2191 ;-)