Remember that article that went viral on HN, where a guy showed how GraphX
/ Giraph / GraphLab / Spark perform worse on a 128-core cluster than a
single-threaded implementation does on one machine? If not, here it is:
http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html


As you may recall, this stirred up a lot of commotion in the big data
community (and around Spark/GraphX in particular).

People (fairly, I guess) criticized him for not really having "big data":
his entire data set fit in memory, so it didn't really count.


So he took up the challenge and came back with a counter-benchmark that is
pretty hard to argue with, this time on a huge data set (1TB of data,
encoded using Hilbert curves down to 154GB, but still large). See:
http://www.frankmcsherry.org/graph/scalability/cost/2015/02/04/COST2.html
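
For anyone curious about the Hilbert curve part: the idea, as far as I
understand it, is to treat each edge (src, dst) as a point in a 2^32 x 2^32
grid and map it to a single 64-bit index along a Hilbert curve; sorting the
edges by that index keeps nearby src/dst pairs close together on disk. Here
is a rough sketch I wrote to make sense of the mapping - it is NOT his code,
just the textbook (x, y) -> Hilbert-index conversion, assuming 32-bit vertex
ids:

    // Sketch only: map an edge (src, dst) to its index along a Hilbert curve
    // covering a 2^32 x 2^32 grid. Not taken from the COST repo.
    fn hilbert_index(src: u32, dst: u32) -> u64 {
        let n_minus_1: u64 = u32::MAX as u64; // grid side is 2^32
        let (mut x, mut y) = (src as u64, dst as u64);
        let mut d: u64 = 0;
        let mut s: u64 = 1 << 31; // start from the most significant bit
        while s > 0 {
            let rx: u64 = if (x & s) > 0 { 1 } else { 0 };
            let ry: u64 = if (y & s) > 0 { 1 } else { 0 };
            d += s * s * ((3 * rx) ^ ry);
            // rotate/flip the quadrant so the lower bits are traversed in
            // Hilbert order
            if ry == 0 {
                if rx == 1 {
                    x = n_minus_1 - x;
                    y = n_minus_1 - y;
                }
                std::mem::swap(&mut x, &mut y);
            }
            s /= 2;
        }
        d
    }

    fn main() {
        // map a couple of edges to their Hilbert indices
        println!("{}", hilbert_index(42, 1000));
        println!("{}", hilbert_index(42, 1001));
    }

Sort the edge list by hilbert_index(src, dst) and consecutive edges tend to
share the high bits of both endpoints, which (as I understand it) is what
lets the encoding shrink 1TB down to 154GB.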

He published the source code as an example at
https://github.com/frankmcsherry/COST

His benchmark shows that on a graph with 128 billion edges, his
single-threaded Rust implementation came out 2x to 10x faster.
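
To give a feel for what "single threaded" means here: the computations in
his posts (PageRank, label propagation) boil down to plain loops over the
edge list. Something in the spirit of the toy sketch below - again my own
illustration, not his code, using a made-up in-memory edge list instead of
his streamed, Hilbert-ordered layout:

    // Toy sketch of single-threaded PageRank over an edge list.
    // Not from the COST repo; the graph in main() is made up.
    fn pagerank(num_nodes: usize, edges: &[(u32, u32)], iterations: usize) -> Vec<f32> {
        // out-degree of every vertex
        let mut degree = vec![0u32; num_nodes];
        for &(src, _) in edges {
            degree[src as usize] += 1;
        }

        let mut ranks = vec![1.0f32; num_nodes];
        for _ in 0..iterations {
            // the amount each vertex sends along each of its outgoing edges
            let shares: Vec<f32> = ranks
                .iter()
                .zip(degree.iter())
                .map(|(&r, &d)| if d > 0 { r / d as f32 } else { 0.0 })
                .collect();

            // one sequential pass over the edges updates every rank
            let mut next = vec![0.15f32; num_nodes];
            for &(src, dst) in edges {
                next[dst as usize] += 0.85 * shares[src as usize];
            }
            ranks = next;
        }
        ranks
    }

    fn main() {
        // tiny toy graph: 0 -> 1, 1 -> 2, 2 -> 0, 2 -> 1
        let edges = vec![(0u32, 1u32), (1, 2), (2, 0), (2, 1)];
        println!("{:?}", pagerank(3, &edges, 20));
    }

A loop like this, fed edges sequentially from an SSD, pays essentially no
coordination, serialization or shuffle overhead, which seems to be his whole
point.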

So, what is the counter-argument? This pretty much looks like a slap in the
face for Spark / GraphX and the like (which I like and use on a daily basis).

Before I dive into re-validating his benchmarks against my own use cases:
what is your opinion on this? If his results hold up, what IS the use case
for Spark/GraphX at all?
