Yup, I get your point. For repetitive MR jobs it would be better. On a side note, I wanted to ask: would a BSP implementation of PageRank be faster than an MR implementation? Please correct me if I am wrong, but the way I think of it, since PageRank is a graph problem and nodes need to exchange messages, BSP should be a better model for implementing PageRank than MR.
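To make the intuition concrete, here is a minimal sketch of one PageRank superstep in plain Java. It only simulates the BSP pattern (send contributions along out-links, barrier, then sum incoming messages); it is an illustration, not Hama's actual graph API, and all class and method names here are made up for the example.

```java
import java.util.*;

// One simulated PageRank superstep. Every node sends its current rank,
// split among its out-links, to its neighbors; after the barrier, each
// node sums the incoming contributions and applies the PageRank update.
// This message-per-edge pattern maps directly onto BSP supersteps,
// whereas map-reduce must re-shuffle the whole graph every iteration.
public class PageRankStep {

    static Map<String, Double> step(Map<String, List<String>> links,
                                    Map<String, Double> rank,
                                    double damping) {
        // "Send" phase: accumulate each node's contribution to its neighbors.
        Map<String, Double> incoming = new HashMap<>();
        for (String node : links.keySet()) incoming.put(node, 0.0);
        for (Map.Entry<String, List<String>> e : links.entrySet()) {
            double share = rank.get(e.getKey()) / e.getValue().size();
            for (String dst : e.getValue())
                incoming.merge(dst, share, Double::sum);
        }
        // "Sync + compute" phase: apply the damped PageRank update.
        Map<String, Double> next = new HashMap<>();
        int n = links.size();
        for (String node : links.keySet())
            next.put(node, (1 - damping) / n + damping * incoming.get(node));
        return next;
    }
}
```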
But I get your point: for complex workflows and iterative MR jobs, the BSP implementation with streaming would be better. I had not thought about using streaming in my implementation, so now I will need to add a hook for input coming directly from a previous job. Please feel free to go ahead with your idea, and let me know if I can help in any manner.

--
Regards,
Apurv Verma


On Thu, Oct 11, 2012 at 10:07 PM, Leonidas Fegaras <[email protected]> wrote:

> Hi Apurv,
> Allow me to disagree. For complex workflows and repetitive map-reduce
> jobs, such as PageRank, a Hama implementation of map-reduce would be far
> superior. I haven't looked at your code yet, but I hope it does not use
> HDFS or memory for I/O for each map-reduce job. It must use streaming to
> pass results across map-reduce jobs. If you are currently doing this,
> there is no point for me to step in. But I really think that this would
> be very useful in practice, not just for demonstration.
> Best regards,
> Leonidas Fegaras
>
>
> On Oct 11, 2012, at 11:04 AM, Apurv Verma wrote:
>
>> Hey Leonidas,
>> Glad that you are thinking about it. IMO it would be good to have such
>> functionality for demonstration purposes, but only for demonstration
>> purposes. It's best to let Hadoop do what it's best at and Hama do what
>> it's best at. ;) Suraj and I had tried our hands at it in the past.
>> Please see [0] and [1]. I was also working on such a module on my
>> GitHub account, but I couldn't find much time. Basically it is mostly
>> the way you have expressed it in your last mail. Do you have a GitHub
>> account? I have already done some work on GitHub; I could share it with
>> you and divide the work if you want.
>>
>> [0] http://code.google.com/p/anahad/source/browse/trunk/src/main/java/org/anahata/bsp/WordCount.java
>> My POC of the WordCount example on Hama. Super dirty and not generic,
>> but it works.
>>
>> [1] https://github.com/ssmenon/hama
>> Suraj's in-memory implementation.
>>
>> Let me know what you think.
>>
>> --
>> Regards,
>> Apurv Verma
>>
>>
>> On Thu, Oct 11, 2012 at 8:45 PM, Leonidas Fegaras <[email protected]> wrote:
>>
>>> I have seen some emails in this mailing list asking questions such as:
>>> "I have an X algorithm running on Hadoop map-reduce. Is it suitable
>>> for Hama?" I think it would be great if we had a good implementation
>>> of the Hadoop map-reduce classes on Hama. Other distributed
>>> main-memory systems have already done so; see M3R
>>> (http://vldb.org/pvldb/vol5/p1736_avrahamshinnar_vldb2012.pdf)
>>> and Spark.
>>> It is actually easier than you think. I have done something similar
>>> for my query system, MRQL. What we need is to reimplement
>>> org.apache.hadoop.mapreduce.Job to execute one superstep for each
>>> map-reduce job. Then a Hadoop map-reduce program that may contain
>>> complex workflows and/or loops of map-reduce jobs would need only
>>> minor changes to run on Hama as a single BSPJob. Obviously, to
>>> implement map-reduce in Hama, the mapper output can be shuffled to
>>> reducers by key by sending messages using hashing:
>>> peer.getPeerName(key.hashValue() % peer.getNumPeers())
>>> Then the reducer superstep groups the data by key in memory and
>>> applies the reducer method.
>>> To handle input/intermediate data, we can use a mapping from
>>> path_name to (count, vector) at each node. The path_name is the path
>>> name of some input or intermediate HDFS file, the vector contains the
>>> data partition from this file assigned to the node, and count is the
>>> maximum number of times we can scan this vector (after count scans,
>>> the vector is garbage-collected). The special case count=1 can be
>>> implemented using a stream (a Java inner class that implements a
>>> stream Iterator). Given that a map-reduce Job output is rarely
>>> accessed more than once, the translation of most map-reduce jobs to
>>> Hama will not require any data to be stored in memory other than that
>>> used by the map-reduce jobs themselves. One exception is graph data
>>> that needs to persist in memory across all jobs (then count=maxint).
>>> Based on my experience with MRQL, the implementation of these ideas
>>> may need up to 1K lines of Java code. Let me know if you are
>>> interested.
>>> Leonidas Fegaras
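The hash-based shuffle Leonidas describes can be sketched in plain Java as below. This only simulates the peer assignment and the in-memory grouping step rather than using the real Hama BSPPeer messaging; the class and method names are assumptions made for the example, and in Hama itself one would instead send each record as a message to the chosen peer and group after the sync barrier.

```java
import java.util.*;

// Sketch of the shuffle step: each mapper output record (key, value) is
// routed to the peer whose index is hash(key) % numPeers, then each peer
// groups its received records by key before applying the reducer method.
public class ShuffleSketch {

    // Choose the destination peer index for a key (non-negative modulo).
    static int peerFor(String key, int numPeers) {
        return Math.floorMod(key.hashCode(), numPeers);
    }

    // Route records to per-peer buckets, then group each bucket by key.
    static List<Map<String, List<Integer>>> shuffle(
            List<Map.Entry<String, Integer>> mapperOutput, int numPeers) {
        List<Map<String, List<Integer>>> peers = new ArrayList<>();
        for (int i = 0; i < numPeers; i++) peers.add(new TreeMap<>());
        for (Map.Entry<String, Integer> rec : mapperOutput) {
            peers.get(peerFor(rec.getKey(), numPeers))
                 .computeIfAbsent(rec.getKey(), k -> new ArrayList<>())
                 .add(rec.getValue());
        }
        return peers;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> out = List.of(
            Map.entry("a", 1), Map.entry("b", 1), Map.entry("a", 2));
        List<Map<String, List<Integer>>> peers = shuffle(out, 3);
        // All values for the same key land on the same peer, grouped.
        System.out.println(peers.get(peerFor("a", 3)).get("a"));  // prints [1, 2]
    }
}
```

The invariant that matters is that every value for a given key reaches the same peer, so the reducer superstep sees the complete group.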

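The path_name → (count, vector) bookkeeping described above could look roughly like the following plain-Java sketch. The class and method names are illustrative assumptions, not MRQL's or Hama's actual code; the point is only the reference-counted lifetime of each cached partition.

```java
import java.util.*;

// Sketch of the per-node cache: each input or intermediate file path maps
// to (count, vector), where count is the maximum number of scans allowed
// before the vector is dropped and becomes eligible for garbage collection.
public class DataCache {

    static final class Entry {
        int remainingScans;        // how many more times the vector may be read
        final List<Object> vector; // this node's partition of the file's data
        Entry(int count, List<Object> vector) {
            this.remainingScans = count;
            this.vector = vector;
        }
    }

    private final Map<String, Entry> cache = new HashMap<>();

    void put(String pathName, int count, List<Object> vector) {
        cache.put(pathName, new Entry(count, vector));
    }

    // Return the vector for one scan; after the last permitted scan the
    // entry is removed so the data can be garbage-collected.
    List<Object> scan(String pathName) {
        Entry e = cache.get(pathName);
        if (e == null) throw new NoSuchElementException(pathName);
        e.remainingScans--;
        if (e.remainingScans == 0) cache.remove(pathName);
        return e.vector;
    }

    boolean contains(String pathName) {
        return cache.containsKey(pathName);
    }
}
```

With count=1 this degenerates to a single streaming pass, matching the Iterator-based special case in the mail; graph data that must survive all jobs would be registered with count=Integer.MAX_VALUE.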