Hi all,

*>>Apache Spark is definitely more suited for ML (iterative algorithms) than legacy Hadoop due to its preservation of state and optimized execution strategy (RDDs). However, their approaches are still in synchronous iterative communication pattern.*

So, Hama has a better communication model. That is a good point.

*>>Moreover, BSP can have virtual shared memory and many more benefits.*

I read somewhere that Spark has shared variables. Is BSP virtual shared memory something else, or is it like shared variables in Spark?

*>>In addition, another one convincing point I think can be a utilization ability of modern acceleration accessories such as InfiniBand and GPUs*

Yes, it is a good point, but I found the following link. Apparently, Spark is also capable of processing on GPUs.
https://spark-summit.org/east-2015/talk/heterospark-a-heterogeneous-cpugpu-spark-platform-for-deep-learning-algorithms-2

*>>I'm sure that this feature will bring a completely new wave of big data. The problem we faced is only a lack of interest to BSP programming model. :-)*

My knowledge is quite limited, but I think you are right. With the rise of IoT and stream processing, GPUs will become vital. And yes, I do not understand why BSP is not the programming model of choice nowadays. It has a strong theoretical background and was proposed decades ago, yet the MapReduce/Spark models are still the ones in use.

*>>Just FYI, one of my friends said after reading this thread, "if Amazon EC2 = MR or BSP, Google App Engine = Spark". Maybe usability side.*

I have not written a Spark job before, but I have seen the code. BSP looks more intuitive to me somehow.

*>>Hama = GraphX (Library of Spark (Pregel model) [1])*

The graph module of Hama is definitely equivalent to GraphX of Spark.

Regards,
Behroz

On Thu, Jun 25, 2015 at 1:46 AM, Edward J. Yoon <[email protected]> wrote:

> Hi, here's my few thoughts.
>
> Apache Spark is definitely more suited for ML (iterative algorithms) than
> legacy Hadoop due to its preservation of state and optimized execution
> strategy (RDDs).
> However, their approaches are still in synchronous iterative
> communication pattern.
>
> In Apache Hama case, it's a general-purpose pure BSP framework. While I admit
> that synchronization costs are high, the communication can be more efficiently
> realized with the message-passing BSP model. Moreover, BSP can have virtual
> shared memory and many more benefits. In addition, another one convincing
> point I think can be a utilization ability of modern acceleration accessories
> such as InfiniBand and GPUs. I'm sure that this feature will bring a
> completely new wave of big data. The problem we faced is only a lack of
> interest to BSP programming model. :-)
>
> > 2) Do we have any recent benchmarks between the 2 systems ?
>
> It's in my todo list.
>
> --
> Best Regards, Edward J. Yoon
>
> -----Original Message-----
> From: Behroz Sikander [mailto:[email protected]]
> Sent: Thursday, June 25, 2015 12:57 AM
> To: [email protected]
> Subject: Hama vs Spark
>
> Hi,
> A few days back, I started reading about Apache Spark. It is a pretty good
> BigData platform. But a question arises to my mind that where Hama lies in
> comparison with Spark if we have to implement an iterative algorithm which
> is compute intensive (Machine learning or Optimization) ?
>
> I found some resources online but none answers my questions.
>
> 1) BSP vs MapReduce paper <http://arxiv.org/pdf/1203.2081v2.pdf>
> 2) https://people.apache.org/~edwardyoon/documents/Hama_BSP_for_Advanced_Analytics.pdf
> 3) I actually found the following benchmark but it is quite old.
>    http://markmail.org/message/vyjsdpv355kua7rm#query:+page:1+mid:vstgda4fhmz52pdw+state:results
>
> Questions:
> 1) Is there any specific advantage when we chose BSP model instead of SPARK
> paradigm ?
> 2) Do we have any recent benchmarks between the 2 systems ?
> 3) What is the main convincing point to use Hama over Spark ?
> 4) Any scientific paper that compares both systems ? (I was not able to
> find any)
>
> Regards,
> Behroz Sikander
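
For anyone following the thread, the BSP pattern discussed above (local computation, message passing, then a barrier) can be sketched roughly as below. This is a plain-Python simulation and not Hama's API; the function name `bsp_max`, the ring topology, and the "propagate the maximum" task are all assumptions chosen only for illustration.

```python
from collections import defaultdict


def bsp_max(local_values, supersteps):
    """Simulate BSP supersteps on a ring of peers (hypothetical example).

    Each peer keeps one value. In every superstep it folds in the messages
    delivered at the previous barrier, then sends its value to its right-hand
    neighbour. Messages become visible only after the barrier.
    """
    values = list(local_values)
    num_peers = len(values)
    inbox = defaultdict(list)  # messages delivered at the last barrier

    for _ in range(supersteps):
        outbox = defaultdict(list)  # messages sent during this superstep
        for peer in range(num_peers):
            # 1) Local computation on the peer's own state plus its inbox.
            values[peer] = max([values[peer]] + inbox[peer])
            # 2) Communication: message to the right-hand neighbour.
            outbox[(peer + 1) % num_peers].append(values[peer])
        # 3) Barrier synchronisation: sent messages are delivered now.
        inbox = outbox

    return values


if __name__ == "__main__":
    print(bsp_max([3, 9, 1, 5], supersteps=4))
```

The point of the sketch is only the ordering: nothing sent in a superstep is read until the barrier has passed, which is where the synchronization cost mentioned in the thread comes from.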
