Yes, that's the distinction I was trying to (briefly & imprecisely) draw: "fully async DAG" referring to Dryad, versus generalizing MR to BSP. I should have referred to Dryad as "a general DAG with a rich composition algebra that the user can directly manipulate".
Spark is more than just MapReduce, so the clarification is helpful; I've flinched each time I've used the shorthand "really fast MapReduce". The practical point here is that this is actually how Spark has become so successful: the map*() and reduce*() abstraction is well known to people looking for a speedy way out of the "batch-oriented Hadoop MapReduce" problem while still taking advantage of that strong ecosystem.
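For concreteness, here is a minimal sketch of that familiar style in Spark's Scala API. The "local[2]" master, the input path, and the word-count task are placeholders for illustration, not anything from this thread:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    // Sketch only: count words in a text file with the usual
    // map/reduce combinators. Master URL and input path are placeholders.
    object WordCountSketch {
      def main(args: Array[String]) {
        val sc = new SparkContext("local[2]", "WordCountSketch")
        val counts = sc.textFile("hdfs:///tmp/input.txt")
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
        counts.take(10).foreach(println)
        sc.stop()
      }
    }

Nothing novel there, which is rather the point: the combinators people already know from MapReduce carry over almost verbatim.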
--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen


On Wed, Oct 23, 2013 at 7:06 PM, Matei Zaharia <[email protected]> wrote:

> Just to be clear, Spark actually *does* support general task graphs, similar to Dryad (though a bit simpler in that there's a notion of "stages" and a fixed set of connection patterns between them). However, MBrace goes a step beyond that, in that the graphs can be modified dynamically based on user code. It's also not clear what the granularity of task spawns in MBrace is -- can you spawn stuff that runs for 1 millisecond, or 1 second, or 1 hour? The choice there greatly affects system design.
>
> Matei
>
> On Oct 23, 2013, at 6:54 PM, Christopher Nguyen <[email protected]> wrote:
>
> > Re MBrace: very interesting work. I'm a bit surprised though that the paper makes no mention of DryadLINQ (http://research.microsoft.com/en-us/projects/dryadlinq/dryadlinq.pdf).
> >
> > Architecturally it's a lot easier to see an MBrace implementation specialized to a MapReduce (or more generically, a BSP) computation, than to have a Spark implement the fully async DAG model of an MBrace/Dryad engine.
> >
> > More practically, as interesting as it might be as a side effort, I think for the core Spark effort to attempt something like that would be "off mission". Spark's success to date has been more due to beautiful implementation of a known architecture, than beautiful new architecture. Basically, Spark does MapReduce 10-100x faster than Hadoop, and more people by now understand how to get MapReduce to solve their problems than any other parallel model. Spark sits natively on HDFS so that makes adoption a lot easier to swallow. So at present, for Spark to mature quickly along that successful trajectory, the key problems to address are more practical "user interface" or "productivity" things like manageability, deployability, fault-tolerance improvements, multi-user access, a bigger library of pre-packaged algorithms, etc.
> >
> > Whether MapReduce's own success is an accident of history or something more fundamental is subject to interesting debate. I remember being constantly amazed by the number of problems that when squinted at the right way becomes an MR-soluble problem at Google (starting ironically with PageRank itself). Yes, apparently sometimes it does pay to see many things as a nail when you have invested in a powerful hammer.
> >
> > Along those lines, here are some interesting perspectives on the beauty of Dryad/DryadLINQ, and at least one practical reason why it didn't succeed as an implementation.
> >
> > - http://blogs.msdn.com/b/dryad/archive/2010/02/15/some-dryad-and-dryadlinq-history.aspx
> > - http://geekswithblogs.net/johnsPerfBlog/archive/2011/12/12/rip-dryadlinq-or-long-live-linq-to-hadoop.aspx
> >
> >
> > --
> > Christopher T. Nguyen
> > Co-founder & CEO, Adatao <http://adatao.com>
> > linkedin.com/in/ctnguyen
> >
> >
> > On Wed, Oct 23, 2013 at 2:33 PM, Alex Boisvert <[email protected]> wrote:
> >
> >> (Resending to @apache list instead of old google-group)
> >>
> >> A bit of a random question but I was wondering if there were efforts underway to generalize / expand the Spark API towards something that would be similar to the MBrace [1] model ... there's certainly an overlap between the features of the systems already ... so I guess I'm thinking about an API that's less centered around RDDs (as a collection) and more towards distributed dataflow that would feel more like composing Promises/Futures ... or even generalizing to support various sorts of container/context monads.
> >>
> >> [1] "MBrace: Cloud Computing with Monads"
> >> http://plosworkshop.org/2013/preprint/dzik.pdf
> >>
> >
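PS: to make the last part of Alex's question above a bit more concrete, here is a rough, purely local sketch of what "composing Futures" could feel like. This is not Spark's or MBrace's API; plain Scala Futures stand in for a distributed dataflow engine, and loadPartition() is invented just for illustration:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._

    // Sketch only: fan out four hypothetical partition loads as Futures,
    // then combine their results. A real distributed API would schedule
    // these remotely; here everything runs on the local thread pool.
    object DataflowSketch {
      def loadPartition(i: Int): Future[Seq[Int]] = Future { Seq.fill(100)(i) }

      def main(args: Array[String]) {
        val total: Future[Int] =
          Future.sequence((1 to 4).map(loadPartition))  // fan out
            .map(_.flatten.sum)                         // combine
        println(Await.result(total, 10.seconds))
      }
    }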
