Yup, I get your point. For repetitive MR jobs it would be better. On a side
note, I wanted to ask: would a BSP implementation of PageRank be faster than
an MR implementation? Please correct me if I am wrong, but I think of it
this way: since PageRank is a graph problem and nodes need to communicate
messages, BSP would be a better model for implementing PageRank than MR.
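
As a way to check my own understanding of the messaging, here is a plain-Java
paraphrase of the shuffle routing from your earlier mail (hash the key and send
the record to the matching peer); routeToPeer is my own name for it, not part
of the Hama API:

```java
// Plain-Java paraphrase of the message-based shuffle: each map output key
// is hashed and routed to peer (hash mod numPeers), so all values for a
// given key land on the same reducer peer. routeToPeer is a hypothetical
// helper, not a Hama method.
public class ShuffleRouting {
    static int routeToPeer(Object key, int numPeers) {
        // Mask off the sign bit so a negative hashCode() still yields a
        // valid, non-negative peer index.
        return (key.hashCode() & Integer.MAX_VALUE) % numPeers;
    }

    public static void main(String[] args) {
        int numPeers = 4;
        for (String key : new String[]{"hama", "hadoop", "bsp", "hama"}) {
            System.out.println(key + " -> peer " + routeToPeer(key, numPeers));
        }
    }
}
```

The point being that the routing is deterministic: the same key always goes to
the same peer, which is what the reducer-side grouping relies on.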

But I get your point that for complex workflows and iterative MR jobs, a
BSP implementation with streaming would be better. I had not thought about
streaming in my implementation. Now I will need to include a hook for input
coming directly from a previous job.
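
Also, to make sure I follow the (count, vector) bookkeeping from your earlier
mail, here is a rough self-contained sketch; all class and method names are
mine (not from MRQL or Hama), and it ignores concurrency:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Rough sketch of the per-node cache: each path name maps to
// (count, vector), and after `count` scans the vector is dropped so it
// becomes eligible for garbage collection. Names are hypothetical.
public class DataCache {
    private static class Entry {
        int remainingScans;      // how many more times we may scan the vector
        List<String> vector;     // the data partition assigned to this node
        Entry(int count, List<String> vector) {
            this.remainingScans = count;
            this.vector = vector;
        }
    }

    private final Map<String, Entry> cache = new HashMap<>();

    // Register a partition for `path` that may be scanned `count` times.
    public void put(String path, int count, List<String> partition) {
        cache.put(path, new Entry(count, partition));
    }

    // Return the partition for one scan; once the scan budget is used up,
    // remove the entry so the vector can be garbage-collected.
    public List<String> scan(String path) {
        Entry e = cache.get(path);
        if (e == null) return null;
        List<String> data = e.vector;
        if (--e.remainingScans == 0) {
            cache.remove(path);
        }
        return data;
    }

    public boolean contains(String path) {
        return cache.containsKey(path);
    }
}
```

As I understand it, count=1 is the streaming case (scan once, then gone), and
graph data that must survive all jobs would use a very large count.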

Please feel free to go forth with your idea, and let me know if I can help
in any manner.

--
Regards,
Apurv Verma





On Thu, Oct 11, 2012 at 10:07 PM, Leonidas Fegaras <[email protected]> wrote:

> Hi Apurv,
> Allow me to disagree. For complex workflows and repetitive map-reduce
> jobs, such as PageRank,
> a Hama implementation of map-reduce would be far superior. I haven't
> looked at your code yet but
> I hope it does not use HDFS or memory for I/O for each map-reduce job. It
> must use streaming to pass
> results across map-reduce jobs. If you are currently doing this, there is
> no point for me to step in.
> But I really think that this would be very useful in practice, not just
> for demonstration.
> Best regards,
> Leonidas Fegaras
>
>
> On Oct 11, 2012, at 11:04 AM, Apurv Verma wrote:
>
>> Hey Leonidas,
>> Glad that you are thinking about it. IMO it would be good to have such
>> functionality for demonstration purposes, but only for demonstration
>> purposes. It's best to let Hadoop do what it's best at and Hama do what
>> it's best at. ;) Suraj and I tried our hands at this in the past; please
>> see [0] and [1]. I was also working on such a module on my GitHub
>> account, but I couldn't find much time. Basically it's mostly the way
>> you expressed it in your last mail. Do you have a GitHub account? I have
>> already done some work on GitHub; I could share it with you and divide
>> the work if you want.
>>
>> [0]
>> http://code.google.com/p/anahad/source/browse/trunk/src/main/java/org/anahata/bsp/WordCount.java
>>     My POC of the WordCount example on Hama. Super dirty and not generic
>> but works.
>>
>> [1] https://github.com/ssmenon/hama
>> Suraj's in memory implementation.
>>
>>
>> Let me know what you think
>>
>> --
>> Regards,
>> Apurv Verma
>>
>>
>>
>>
>>
>> On Thu, Oct 11, 2012 at 8:45 PM, Leonidas Fegaras
>> <[email protected]> wrote:
>>
>>> I have seen some emails in this mailing list asking questions, such as:
>>> I have an X algorithm running on Hadoop map-reduce. Is it suitable for
>>> Hama?
>>> I think it would be great if we had a good implementation of the
>>> Hadoop map-reduce classes on Hama. Other distributed main-memory
>>> systems have already done so. See:
>>> M3R (http://vldb.org/pvldb/vol5/p1736_avrahamshinnar_vldb2012.pdf)
>>>
>>> and Spark.
>>> It is actually easier than you think. I have done something similar
>>> for my query system, MRQL. What we need is to reimplement
>>> org.apache.hadoop.mapreduce.Job to execute one superstep for each
>>> map-reduce job. Then a Hadoop map-reduce program that may contain
>>> complex workflows and/or loops of map-reduce jobs would need minor
>>> changes to run on Hama as a single BSPJob. Obviously, to implement
>>> map-reduce in Hama, the mapper output can be shuffled to reducers
>>> based on key by sending messages using hashing:
>>> peer.getPeerName(key.hashValue() % peer.getNumPeers())
>>> Then the reducer superstep groups the data by the key in memory and
>>> applies the reducer method. To handle input/intermediate data, we can
>>> use a mapping from path_name to (count,vector) at each node. The
>>> path_name is the path name of some input or intermediate HDFS file,
>>> vector contains the data partition from this file assigned to the node,
>>> and
>>> count is the max number of times we can scan this vector (after count
>>> times, the vector is garbage-collected). The special case where
>>> count=1 can be implemented using a stream (a Java inner class that
>>> implements a stream Iterator). Given that the map-reduce Job output
>>> is rarely accessed more than once, the translation of most map-reduce
>>> jobs to Hama will not require any data to be stored in memory other
>>> than those used by the map-reduce jobs. One exception is the graph
>>> data that need to persist in memory across all jobs (then count=maxint).
>>> Based on my experience with MRQL, the implementation of these ideas
>>> may need up to 1K lines of Java code. Let me know if you are interested.
>>> Leonidas Fegaras
>>>
>>>
>>>
>
