Hi Apurv,
Allow me to disagree. For complex workflows and repetitive map-reduce
jobs, such as PageRank, a Hama implementation of map-reduce would be
far superior. I haven't looked at your code yet, but I hope it does not
use HDFS or memory for the I/O of each map-reduce job; it must use
streaming to pass results across map-reduce jobs. If you are already
doing this, there is no point in me stepping in. But I really think
that this would be very useful in practice, not just for demonstration.
Best regards,
Leonidas Fegaras
On Oct 11, 2012, at 11:04 AM, Apurv Verma wrote:
Hey Leonidas,
Glad that you are thinking about it. IMO it would be good to have such
functionality for demonstration purposes, but only for demonstration
purposes. It's best to let Hadoop do what it's best at and Hama do what
it's best at. ;) Suraj and I tried our hands at this in the past;
please see [0] and [1]. I was also working on such a module in my
GitHub account, but I couldn't find much time. It is basically the way
you expressed it in your last mail. Do you have a GitHub account? I
have already done some work on GitHub; I could share it with you and
divide the work if you want.
[0]
http://code.google.com/p/anahad/source/browse/trunk/src/main/java/org/anahata/bsp/WordCount.java
My POC of the WordCount example on Hama. Super dirty and not
generic, but it works.
[1] https://github.com/ssmenon/hama
Suraj's in-memory implementation.
Let me know what you think.
--
Regards,
Apurv Verma
On Thu, Oct 11, 2012 at 8:45 PM, Leonidas Fegaras
<[email protected]> wrote:
I have seen some emails on this mailing list asking questions such as:
"I have an X algorithm running on Hadoop map-reduce. Is it suitable
for Hama?"
I think it would be great if we had a good implementation of the
Hadoop map-reduce classes on Hama. Other distributed main-memory
systems have already done so. See:
M3R (http://vldb.org/pvldb/vol5/p1736_avrahamshinnar_vldb2012.pdf)
and Spark.
It is actually easier than you think. I have done something similar
for my query system, MRQL. What we need is to reimplement
org.apache.hadoop.mapreduce.Job to execute one superstep for each
map-reduce job. Then a Hadoop map-reduce program that may contain
complex workflows and/or loops of map-reduce jobs would need only
minor changes to run on Hama as a single BSPJob.
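Roughly, such a program could be structured as follows. This is only a
sketch assuming Hama's BSP class; WorkflowAsBSP, numJobs, mapSuperstep,
and reduceSuperstep are hypothetical names, not code from MRQL:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hama.bsp.BSP;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.sync.SyncException;

// Hypothetical skeleton: a workflow of chained map-reduce jobs runs as
// one BSP program, one pair of supersteps per job, instead of separate
// Hadoop jobs. The abstract hooks stand in for the reimplemented Job.
public abstract class WorkflowAsBSP
    extends BSP<Text, Text, Text, Text, Text> {

  protected abstract int numJobs();

  protected abstract void mapSuperstep(
      int job, BSPPeer<Text, Text, Text, Text, Text> peer)
      throws IOException;

  protected abstract void reduceSuperstep(
      int job, BSPPeer<Text, Text, Text, Text, Text> peer)
      throws IOException;

  @Override
  public void bsp(BSPPeer<Text, Text, Text, Text, Text> peer)
      throws IOException, SyncException, InterruptedException {
    for (int job = 0; job < numJobs(); job++) {
      mapSuperstep(job, peer);     // emit and route this job's mapper output
      peer.sync();                 // barrier ends the "map" superstep
      reduceSuperstep(job, peer);  // group messages and apply the reducer
      peer.sync();                 // barrier ends the "reduce" superstep
    }
  }
}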
Obviously, to implement map-reduce in Hama, the mapper output can be
shuffled to reducers based on key by sending messages using hashing:
peer.getPeerName(key.hashCode() % peer.getNumPeers())
Then the reducer superstep groups the data by key in memory and
applies the reducer method.
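For instance, here is a minimal, self-contained sketch of the two
supersteps for a WordCount-style job, assuming Hama's BSPPeer
messaging API; encoding each pair as a Text message "key\tvalue" is
just an illustrative choice. The mask on hashCode() keeps the peer
index non-negative:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hama.bsp.BSP;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.sync.SyncException;
import org.apache.hama.util.KeyValuePair;

public class MapReduceAsBSP
    extends BSP<LongWritable, Text, Text, LongWritable, Text> {

  @Override
  public void bsp(BSPPeer<LongWritable, Text, Text, LongWritable, Text> peer)
      throws IOException, SyncException, InterruptedException {
    // Superstep 1 (map + shuffle): emit (word, 1) pairs and route each
    // one to the peer selected by hashing the key, as in the formula above.
    KeyValuePair<LongWritable, Text> pair;
    while ((pair = peer.readNext()) != null)
      for (String word : pair.getValue().toString().split("\\s+")) {
        int dest = (word.hashCode() & Integer.MAX_VALUE) % peer.getNumPeers();
        peer.send(peer.getPeerName(dest), new Text(word + "\t1"));
      }
    peer.sync();  // barrier: all mapper output has been delivered

    // Superstep 2 (reduce): group the received messages by key in
    // memory, then apply the reduce method to each group.
    Map<String, Long> groups = new HashMap<String, Long>();
    Text msg;
    while ((msg = peer.getCurrentMessage()) != null) {
      String[] kv = msg.toString().split("\t");
      Long sum = groups.get(kv[0]);
      groups.put(kv[0], (sum == null ? 0L : sum) + Long.parseLong(kv[1]));
    }
    for (Map.Entry<String, Long> e : groups.entrySet())
      peer.write(new Text(e.getKey()), new LongWritable(e.getValue()));
  }
}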
To handle input/intermediate data, we can use a mapping from path_name
to (count, vector) at each node. The path_name is the path name of
some input or intermediate HDFS file, vector contains the data
partition from this file assigned to the node, and count is the
maximum number of times we can scan this vector (after count scans,
the vector is garbage-collected).
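A sketch of that per-node mapping (all names here are illustrative,
not from an existing codebase):

import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class DataCache<T> {

  private static final class Entry<T> {
    int count;       // number of scans still allowed on this vector
    List<T> vector;  // this node's partition of the file's data
    Entry(int count, List<T> vector) {
      this.count = count;
      this.vector = vector;
    }
  }

  private final Map<String, Entry<T>> cache =
      new HashMap<String, Entry<T>>();

  // Register the local partition of an input or intermediate file.
  public void put(String pathName, int count, List<T> vector) {
    cache.put(pathName, new Entry<T>(count, vector));
  }

  // Scan a partition once; after the last permitted scan the entry is
  // unlinked, so the vector becomes garbage once the iteration finishes.
  public Iterator<T> scan(String pathName) {
    Entry<T> e = cache.get(pathName);
    Iterator<T> result = e.vector.iterator();
    if (--e.count == 0)
      cache.remove(pathName);
    return result;
  }
}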
The special case where count=1 can be implemented using a stream (a
Java inner class that implements the Iterator interface).
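For example, here is such a one-shot stream over a peer's incoming
messages (a sketch, written as a top-level class for readability, and
assuming getCurrentMessage() returns null once the queue is exhausted):

import java.io.IOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

import org.apache.hadoop.io.Text;
import org.apache.hama.bsp.BSPPeer;

// A one-shot stream over the messages received by a peer: each record
// is handed out exactly once and never materialized as a vector.
public class MessageStream implements Iterator<Text> {
  private final BSPPeer<?, ?, ?, ?, Text> peer;
  private Text lookahead;  // next message, prefetched

  public MessageStream(BSPPeer<?, ?, ?, ?, Text> peer) throws IOException {
    this.peer = peer;
    this.lookahead = peer.getCurrentMessage();
  }

  public boolean hasNext() {
    return lookahead != null;
  }

  public Text next() {
    if (lookahead == null)
      throw new NoSuchElementException();
    Text current = lookahead;
    try {
      lookahead = peer.getCurrentMessage();  // advance; consumed once
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
    return current;
  }

  public void remove() {
    throw new UnsupportedOperationException();
  }
}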
Given that the output of a map-reduce Job is rarely accessed more than
once, the translation of most map-reduce jobs to Hama will not require
any data to be stored in memory other than those used by the
map-reduce jobs themselves. One exception is graph data, which need to
persist in memory across all jobs (then count=Integer.MAX_VALUE).
Based on my experience with MRQL, the implementation of these ideas
may need up to 1K lines of Java code. Let me know if you are
interested.
Leonidas Fegaras