Remember that article that went viral on HN, where Frank McSherry showed that GraphX / Giraph / GraphLab / Spark can perform worse on a 128-core cluster than a single thread on one machine? If not, here it is: http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html
As you may recall, this stirred up a lot of commotion in the big data community (and around Spark/GraphX in particular). People blamed him (justly, I guess) for not really having "big data": his entire data set fit in memory, so it didn't really count.

So he took up the challenge and came back with a counter-benchmark that is pretty hard to argue with, this time on a genuinely large data set (1TB of edge data, encoded with Hilbert curves down to 154GB, but still large): http://www.frankmcsherry.org/graph/scalability/cost/2015/02/04/COST2.html The source is available at https://github.com/frankmcsherry/COST. On a graph with 128 billion edges, his single-threaded Rust implementation came out 2x to 10x faster than the published cluster results (roughly the flavor of the sketch at the end of this post).

So, what is the counter-argument? This pretty much looks like a blow to Spark / GraphX and friends (which I like and use on a daily basis). Before I dive into re-validating his benchmarks against my own use cases: what is your opinion on this? If his numbers hold up, what IS the use case for using Spark/GraphX at all?
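For context, here is a minimal sketch of the kind of single-threaded computation McSherry is benchmarking: a plain PageRank loop over an in-memory edge list. This is not his code (his version streams a Hilbert-curve-encoded binary edge file from disk, which is where much of the cleverness lies); the tiny graph, node count, and iteration count below are made up for illustration.

```rust
// Minimal single-threaded PageRank over an edge list, in the spirit of the
// COST benchmarks. Assumes every node has at least one outgoing edge
// (no dangling-node handling), which is fine for this illustrative sketch.

fn pagerank(num_nodes: usize, edges: &[(u32, u32)], iterations: usize) -> Vec<f32> {
    let mut ranks = vec![1.0f32 / num_nodes as f32; num_nodes];

    // Out-degree of each node, computed in one pass over the edges.
    let mut degrees = vec![0u32; num_nodes];
    for &(src, _) in edges {
        degrees[src as usize] += 1;
    }

    for _ in 0..iterations {
        // Standard damping: 0.15 base rank, 0.85 propagated along edges.
        let mut next = vec![0.15f32 / num_nodes as f32; num_nodes];
        // One sequential pass over the edges per iteration; the whole
        // "framework" is this loop.
        for &(src, dst) in edges {
            next[dst as usize] += 0.85 * ranks[src as usize] / degrees[src as usize] as f32;
        }
        ranks = next;
    }
    ranks
}

fn main() {
    // Hypothetical toy graph: 4 nodes, a handful of directed edges.
    let edges = vec![(0u32, 1u32), (1, 2), (2, 0), (2, 3), (3, 0)];
    let ranks = pagerank(4, &edges, 20);
    for (node, rank) in ranks.iter().enumerate() {
        println!("node {}: {:.4}", node, rank);
    }
}
```

As I understand the COST argument, the point is that the real work per iteration is just that inner loop over edges, so a distributed system has to amortize all of its coordination and serialization overhead against something this cheap.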