Hi Apurv,
Allow me to disagree. For complex workflows and repetitive map-reduce
jobs, such as PageRank, a Hama implementation of map-reduce would be
far superior. I haven't looked at your code yet, but I hope it does not
use HDFS or memory for the I/O of each map-reduce job; it must use
streaming to pass results across map-reduce jobs. If you are already
doing this, there is no point in me stepping in. But I really think
that this would be very useful in practice, not just for demonstration.
Best regards,
Leonidas Fegaras
On Oct 11, 2012, at 11:04 AM, Apurv Verma wrote:
Hey Leonidas,
Glad that you are thinking about it. IMO it would be good to have such
functionality for demonstration purposes, but only for demonstration
purposes. It's best to let Hadoop do what it's best at and Hama do what
it's best at. ;) Suraj and I tried our hands at this in the past;
please see [0] and [1]. I was also working on such a module in my
GitHub account, but I couldn't find much time. It is basically the way
you expressed it in your last mail. Do you have a GitHub account? I
have already done some work on GitHub; I could share it with you and
divide the work if you want.
[0]
http://code.google.com/p/anahad/source/browse/trunk/src/main/java/org/anahata/bsp/WordCount.java
My POC of the WordCount example on Hama. Super dirty and not
generic, but it works.
[1] https://github.com/ssmenon/hama
Suraj's in-memory implementation.
Let me know what you think.
--
Regards,
Apurv Verma
On Thu, Oct 11, 2012 at 8:45 PM, Leonidas Fegaras
<[email protected]> wrote:
I have seen some emails on this mailing list asking questions such as:
"I have an X algorithm running on Hadoop map-reduce. Is it suitable
for Hama?"
I think it would be great if we had a good implementation of the
Hadoop map-reduce classes on Hama. Other distributed main-memory
systems have already done so. See:
M3R (http://vldb.org/pvldb/vol5/p1736_avrahamshinnar_vldb2012.pdf)
and Spark.
It is actually easier than you think. I have done something similar
for my query system, MRQL. What we need is to reimplement
org.apache.hadoop.mapreduce.Job to execute one superstep for each
map-reduce job. Then a Hadoop map-reduce program that may contain
complex workflows and/or loops of map-reduce jobs would need only
minor changes to run on Hama as a single BSPJob.
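Roughly, such a program could be structured as follows. This is only a
sketch assuming Hama's BSP class; WorkflowAsBSP, numJobs, mapSuperstep,
and reduceSuperstep are hypothetical names, not code from MRQL:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hama.bsp.BSP;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.sync.SyncException;

// Hypothetical skeleton: a workflow of chained map-reduce jobs runs as
// one BSP program, one pair of supersteps per job, instead of separate
// Hadoop jobs. The abstract hooks stand in for the reimplemented Job.
public abstract class WorkflowAsBSP
    extends BSP<Text, Text, Text, Text, Text> {

  protected abstract int numJobs();

  protected abstract void mapSuperstep(
      int job, BSPPeer<Text, Text, Text, Text, Text> peer)
      throws IOException;

  protected abstract void reduceSuperstep(
      int job, BSPPeer<Text, Text, Text, Text, Text> peer)
      throws IOException;

  @Override
  public void bsp(BSPPeer<Text, Text, Text, Text, Text> peer)
      throws IOException, SyncException, InterruptedException {
    for (int job = 0; job < numJobs(); job++) {
      mapSuperstep(job, peer);     // emit and route this job's mapper output
      peer.sync();                 // barrier ends the "map" superstep
      reduceSuperstep(job, peer);  // group messages and apply the reducer
      peer.sync();                 // barrier ends the "reduce" superstep
    }
  }
}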
Obviously, to implement map-reduce in Hama, the mapper output can be
shuffled to reducers based on key by sending messages using hashing:
peer.getPeerName(key.hashCode() % peer.getNumPeers())
Then the reducer superstep groups the data by key in memory and
applies the reducer method.
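For instance, here is a minimal, self-contained sketch of the two
supersteps for a WordCount-style job, assuming Hama's BSPPeer
messaging API; encoding each pair as a Text message "key\tvalue" is
just an illustrative choice. The mask on hashCode() keeps the peer
index non-negative:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hama.bsp.BSP;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.sync.SyncException;
import org.apache.hama.util.KeyValuePair;

public class MapReduceAsBSP
    extends BSP<LongWritable, Text, Text, LongWritable, Text> {

  @Override
  public void bsp(BSPPeer<LongWritable, Text, Text, LongWritable, Text> peer)
      throws IOException, SyncException, InterruptedException {
    // Superstep 1 (map + shuffle): emit (word, 1) pairs and route each
    // one to the peer selected by hashing the key, as in the formula above.
    KeyValuePair<LongWritable, Text> pair;
    while ((pair = peer.readNext()) != null)
      for (String word : pair.getValue().toString().split("\\s+")) {
        int dest = (word.hashCode() & Integer.MAX_VALUE) % peer.getNumPeers();
        peer.send(peer.getPeerName(dest), new Text(word + "\t1"));
      }
    peer.sync();  // barrier: all mapper output has been delivered

    // Superstep 2 (reduce): group the received messages by key in
    // memory, then apply the reduce method to each group.
    Map<String, Long> groups = new HashMap<String, Long>();
    Text msg;
    while ((msg = peer.getCurrentMessage()) != null) {
      String[] kv = msg.toString().split("\t");
      Long sum = groups.get(kv[0]);
      groups.put(kv[0], (sum == null ? 0L : sum) + Long.parseLong(kv[1]));
    }
    for (Map.Entry<String, Long> e : groups.entrySet())
      peer.write(new Text(e.getKey()), new LongWritable(e.getValue()));
  }
}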
To handle input/intermediate data, we can use a mapping from path_name
to (count, vector) at each node. The path_name is the path name of
some input or intermediate HDFS file, vector contains the data
partition from this file assigned to the node, and count is the
maximum number of times we can scan this vector (after count scans,
the vector is garbage-collected).
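A sketch of that per-node mapping (all names here are illustrative,
not from an existing codebase):

import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class DataCache<T> {

  private static final class Entry<T> {
    int count;       // number of scans still allowed on this vector
    List<T> vector;  // this node's partition of the file's data
    Entry(int count, List<T> vector) {
      this.count = count;
      this.vector = vector;
    }
  }

  private final Map<String, Entry<T>> cache =
      new HashMap<String, Entry<T>>();

  // Register the local partition of an input or intermediate file.
  public void put(String pathName, int count, List<T> vector) {
    cache.put(pathName, new Entry<T>(count, vector));
  }

  // Scan a partition once; after the last permitted scan the entry is
  // unlinked, so the vector becomes garbage once the iteration finishes.
  public Iterator<T> scan(String pathName) {
    Entry<T> e = cache.get(pathName);
    Iterator<T> result = e.vector.iterator();
    if (--e.count == 0)
      cache.remove(pathName);
    return result;
  }
}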
The special case where count=1 can be implemented using a stream (a
Java inner class that implements the Iterator interface).
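For example, here is such a one-shot stream over a peer's incoming
messages (a sketch, written as a top-level class for readability, and
assuming getCurrentMessage() returns null once the queue is exhausted):

import java.io.IOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

import org.apache.hadoop.io.Text;
import org.apache.hama.bsp.BSPPeer;

// A one-shot stream over the messages received by a peer: each record
// is handed out exactly once and never materialized as a vector.
public class MessageStream implements Iterator<Text> {
  private final BSPPeer<?, ?, ?, ?, Text> peer;
  private Text lookahead;  // next message, prefetched

  public MessageStream(BSPPeer<?, ?, ?, ?, Text> peer) throws IOException {
    this.peer = peer;
    this.lookahead = peer.getCurrentMessage();
  }

  public boolean hasNext() {
    return lookahead != null;
  }

  public Text next() {
    if (lookahead == null)
      throw new NoSuchElementException();
    Text current = lookahead;
    try {
      lookahead = peer.getCurrentMessage();  // advance; consumed once
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
    return current;
  }

  public void remove() {
    throw new UnsupportedOperationException();
  }
}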
Given that the output of a map-reduce Job is rarely accessed more than
once, the translation of most map-reduce jobs to Hama will not require
any data to be stored in memory other than those used by the
map-reduce jobs themselves. One exception is graph data, which need to
persist in memory across all jobs (then count=Integer.MAX_VALUE).
Based on my experience with MRQL, the implementation of these ideas
may need up to 1K lines of Java code. Let me know if you are
interested.
Leonidas Fegaras