Hi Sean, I think your point about the ETL costs is the winning argument here, but I would like to see more research on the topic.

What I would like to see researched is the ability to run a specialized set of common algorithms in a "fast-local-mode", just as a compiler optimizer can decide to inline some methods, or rewrite a recursive function as a loop when the call is in tail position. The future of GraphX could be that when an algorithm is a well-known one (e.g. shortest paths) and can be run locally faster than on a distributed cluster (taking into account the cost of bringing all the data to one machine), it will do so.
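To make that concrete, here is a minimal, self-contained sketch of the dispatch idea in Scala. Everything in it is hypothetical: the CostModel parameters, the cost formulas, and the chooseLocal decision rule are illustrative assumptions, not part of any Spark or GraphX API.

```scala
// Hypothetical "fast-local-mode" dispatch sketch. None of these names
// (CostModel, localCost, chooseLocal) exist in Spark/GraphX; they are
// stand-ins for whatever a real query planner would use.
object FastLocalMode {

  // Back-of-envelope cost model:
  //   time = bytes moved / bandwidth + work / processing rate.
  final case class CostModel(
      networkBytesPerSec: Double,  // effective rate of collecting data onto one node
      localEdgesPerSec: Double,    // single-threaded edge-processing rate
      clusterEdgesPerSec: Double   // aggregate distributed rate, incl. overheads
  )

  // Estimated wall-clock time to ship the graph to one machine and run there.
  def localCost(edgeBytes: Long, numEdges: Long, m: CostModel): Double =
    edgeBytes / m.networkBytesPerSec + numEdges / m.localEdgesPerSec

  // Estimated wall-clock time of the general distributed implementation.
  def distributedCost(numEdges: Long, m: CostModel): Double =
    numEdges / m.clusterEdgesPerSec

  // The dispatch rule: run locally only when the one-time transfer cost
  // plus single-threaded compute beats the distributed plan.
  def chooseLocal(edgeBytes: Long, numEdges: Long, m: CostModel): Boolean =
    localCost(edgeBytes, numEdges, m) < distributedCost(numEdges, m)

  def main(args: Array[String]): Unit = {
    // Illustrative numbers only: a 154 GB graph with 128 billion edges
    // (as in the COST post), 1 GB/s collect bandwidth, 500M edges/s
    // locally, 100M edges/s in aggregate on the cluster.
    val m = CostModel(1e9, 5e8, 1e8)
    val local = chooseLocal(
      154L * 1024 * 1024 * 1024,
      128L * 1000 * 1000 * 1000,
      m
    )
    println(s"run locally: $local")
  }
}
```

In a real system the estimates would of course come from graph metadata and measured rates, and the planner would fall back to the distributed path whenever the graph cannot fit on a single machine.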
Thanks!

On Sat, Mar 28, 2015 at 1:34 AM, Sean Owen <so...@cloudera.com> wrote:

> (I bet the Spark implementation could be improved. I bet GraphX could be optimized.)
>
> Not sure about this one, but "in core" benchmarks often start by assuming that the data is local. In the real world, data is unlikely to be. The benchmark has to include the cost of bringing all the data to the local computation too, since the point of distributed computation is bringing work to the data.
>
> Specialist implementations for a special problem should always win over generalist ones, and Spark is a generalist. Likewise, you can factor matrices far faster on a GPU than in Spark. These aren't entirely either/or propositions; you can use Rust or a GPU within a larger distributed program.
>
> Typically a real-world problem involves more than the core computation: ETL, security, monitoring. Generalists are more likely to have an answer to hand for these.
>
> Specialist implementations do just one thing, and they typically have to be custom built. Compare the cost of highly skilled developer time to generalist computing resources; $1M buys several dev-years but also rents a small data center.
>
> Speed is an important issue but by no means everything in the real world, and these are rarely mutually exclusive options in the OSS world. This is a great piece of work, but I don't think it's some kind of argument against distributed computing.
>
> On Fri, Mar 27, 2015 at 6:32 PM, Eran Medan <ehrann.meh...@gmail.com> wrote:
> > Remember that article that went viral on HN? (Where a guy showed how GraphX / Giraph / GraphLab / Spark have worse performance on a 128-core cluster than on a single-threaded machine? If not, here is the article: http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html)
> >
> > Well, as you may recall, this stirred up a lot of commotion in the big data community (and Spark/GraphX in particular).
> >
> > People (justly, I guess) blamed him for not really having "big data", as all of his data set fits in memory, so it doesn't really count.
> >
> > So he took the challenge and came up with a pretty hard-to-argue-with counter benchmark, now with a huge data set (1TB of data, encoded using Hilbert curves down to 154GB, but still large). See http://www.frankmcsherry.org/graph/scalability/cost/2015/02/04/COST2.html
> >
> > He provided the source as an example here: https://github.com/frankmcsherry/COST
> >
> > His benchmark shows how, on a graph with 128 billion edges, he got 2x to 10x faster results with a single-threaded Rust-based implementation.
> >
> > So, what is the counter-argument? It pretty much seems like a blow to the face of Spark / GraphX etc. (which I like and use on a daily basis).
> >
> > Before I dive into re-validating his benchmarks with my own use cases: what is your opinion on this? If this is the case, then what IS the use case for using Spark/GraphX at all?