Hi, I don't fully understand how GraphLab works, but I'm sure that there are pros and cons either way. At the moment I have no plan. :-)
However, I noticed that the region barrier synchronization feature within a single BSP job (the default is global barrier synchronization) is quite useful. This can be used for performing asynchronous mini-batches.

--
Best Regards, Edward J. Yoon

-----Original Message-----
From: Behroz Sikander [mailto:[email protected]]
Sent: Monday, August 03, 2015 7:38 PM
To: [email protected]
Subject: Re: Hama vs Spark

I think I wrote it wrong. It should be Asynchronous Iterations. I found the following a few months back. It was a thesis description:

*SUPPORT FOR ASYNCHRONOUS ITERATIONS IN FLINK (IN COLLABORATION WITH KTH ROYAL INSTITUTE FOR TECHNOLOGY, SWE)*

*Context:* Currently, most of the large scale graph processing systems adopt the bulk synchronous parallel (BSP) model. According to this model, iterative computations happen in well-defined supersteps, which are marked by a global barrier. BSP simplifies application development and ensures determinism. However, it has been shown that asynchronous execution often leads to faster convergence, for several algorithms [LBG+12]. The main goal of this thesis is to add support for asynchronous iterative execution, in Apache Flink, a general-purpose, distributed data processing system.

http://vldb.org/pvldb/vol5/p716_yuchenglow_vldb2012.pdf

On Mon, Aug 3, 2015 at 3:16 AM, Edward J. Yoon <[email protected]> wrote:

> I'm not sure how it can be possible. However, I think user can find
> the slowest machine in each superstep and re-balance the loads. This
> can be handled from client (user) side.
>
> On Sat, Aug 1, 2015 at 4:17 AM, Behroz Sikander <[email protected]>
> wrote:
> > +1. This is great.
> >
> > Btw our current implementation of Hama is Synchronous BSP i.e we have to
> > wait for the slowest machine to sync in order to move to the next super
> > step. Is there anything like Asynchronous BSP out yet ? If yes, do you have
> > plans to add it to this framework ?
> >
> > Regards,
> > Behroz
> >
> > On Wed, Jul 29, 2015 at 3:12 AM, Edward J. Yoon <[email protected]>
> > wrote:
> >
> >> I found research paper somewhat related with this topic.
> >>
> >> "Both the disk based method, i.e., MR, and the memory based method,
> >> i.e., BSP and Spark, need to load the data into main memory and
> >> conduct the expensive computation. However, when processing topk
> >> joins, BSP is clearly the best method as it is the only one that is
> >> able to perform top-k joins on large datasets. This is because BSP
> >> supports the frequent synchronizations between workers when performing
> >> the joining procedure, which quickly lowers the joining threshold for
> >> a given k. The winner between the MR and the Spark algorithms change
> >> from datasets to datasets: Spark is beaten by MR on A and B while
> >> beats MR on C." -
> >> http://www.ruizhang.info/publications/TPDS2015-Heads_Join.pdf
> >>
> >> On Thu, Jun 25, 2015 at 9:02 PM, Behroz Sikander <[email protected]>
> >> wrote:
> >> > Hi all,
> >> > *>>Apache Spark is definitely more suited for ML (iterative algorithms)
> >> > than legacy Hadoop due to its preservation of state and optimized
> >> > execution strategy (RDDs). However, their approaches are still in
> >> > synchronous iterative communication pattern.*
> >> > So, Hama has a better communication model. That is a good point.
> >> >
> >> > *>>Moreover, BSP can have virtual shared memory and many more benefits.*
> >> > I read somewhere that Spark has shared variables. BSP virtual shared memory
> >> > is something else or is it like shared variables in Spark ?
> >> >
> >> > *>>In addition, another one convincing point I think can be a utilization
> >> > ability of modern acceleration accessories such as InfiniBand and GPUs*
> >> > Yes, it is a good point but I found the following link. Apparently, Spark
> >> > is also capable of doing processing on GPU's.
> >> >
> >> > https://spark-summit.org/east-2015/talk/heterospark-a-heterogeneous-cpugpu-spark-platform-for-deep-learning-algorithms-2
> >> >
> >> > *>>I'm sure that this feature will bring a completely new wave of big
> >> > data. The problem we faced is only a lack of interest to BSP
> >> > programming model. :-)*
> >> > My knowledge is quite limited but I think you are right. With the rise of
> >> > IoT and stream processing, GPU's will become vital. Yes, I do not
> >> > understand that why BSP is not the programming model of choice now a days.
> >> > It has a strong theoretical background which was proposed decades back and
> >> > still MapReduce/Spark models are used.
> >> >
> >> > *>>Just FYI, one of my friends said after reading this thread, "if
> >> > AmazonEC2 = MR or BSP, Google App Engine = Spark". Maybe usability side.*
> >> > I have not written a Spark job before, but I have seen the code. BSP looks
> >> > more intuitive to me somehow.
> >> >
> >> > *>>Hama = GraphX (Library of Spark (Pregel model) [1])*
> >> > The graph module of Hama is definitely equal to GraphX of Spark.
> >> >
> >> > Regards,
> >> > Behroz
> >> >
> >> > On Thu, Jun 25, 2015 at 1:46 AM, Edward J. Yoon <[email protected]>
> >> > wrote:
> >> >
> >> >> Hi, here's my few thoughts.
> >> >>
> >> >> Apache Spark is definitely more suited for ML (iterative algorithms) than
> >> >> legacy Hadoop due to its preservation of state and optimized execution
> >> >> strategy (RDDs). However, their approaches are still in synchronous
> >> >> iterative communication pattern.
> >> >>
> >> >> In Apache Hama case, it's a general-purpose pure BSP framework. While I
> >> >> admit that synchronization costs are high, the communication can be more
> >> >> efficiently realized with the message-passing BSP model. Moreover, BSP can
> >> >> have virtual shared memory and many more benefits.
> >> >> In addition, another one convincing
> >> >> point I think can be a utilization ability of modern acceleration
> >> >> accessories such as InfiniBand and GPUs. I'm sure that this feature will
> >> >> bring a completely new wave of big data. The problem we faced is only a
> >> >> lack of interest to BSP programming model. :-)
> >> >>
> >> >> > 2) Do we have any recent benchmarks between the 2 systems ?
> >> >>
> >> >> It's in my todo list.
> >> >>
> >> >> --
> >> >> Best Regards, Edward J. Yoon
> >> >>
> >> >> -----Original Message-----
> >> >> From: Behroz Sikander [mailto:[email protected]]
> >> >> Sent: Thursday, June 25, 2015 12:57 AM
> >> >> To: [email protected]
> >> >> Subject: Hama vs Spark
> >> >>
> >> >> Hi,
> >> >> A few days back, I started reading about Apache Spark. It is a pretty good
> >> >> BigData platform. But a question arises to my mind that where Hama lies in
> >> >> comparison with Spark if we have to implement an iterative algorithm which
> >> >> is compute intensive (Machine learning or Optimization) ?
> >> >>
> >> >> I found some resources online but none answers my questions.
> >> >>
> >> >> 1) BSP vs MapReduce paper <http://arxiv.org/pdf/1203.2081v2.pdf>
> >> >> 2) https://people.apache.org/~edwardyoon/documents/Hama_BSP_for_Advanced_Analytics.pdf
> >> >> 3) I actually found the following benchmark but it is quite old.
> >> >> http://markmail.org/message/vyjsdpv355kua7rm#query:+page:1+mid:vstgda4fhmz52pdw+state:results
> >> >>
> >> >> Questions:
> >> >> 1) Is there any specific advantage when we chose BSP model instead of SPARK
> >> >> paradigm ?
> >> >> 2) Do we have any recent benchmarks between the 2 systems ?
> >> >> 3) What is the main convincing point to use Hama over Spark ?
> >> >> 4) Any scientific paper that compares both systems ?
> >> >> (I was not able to find any)
> >> >>
> >> >> Regards,
> >> >> Behroz Sikander
> >> >>
> >>
> >> --
> >> Best Regards, Edward J. Yoon
> >>
>
> --
> Best Regards, Edward J. Yoon
>
