Remember that article that went viral on HN, where Frank McSherry showed that GraphX / Giraph / GraphLab / Spark can perform worse on a 128-core cluster than a single thread on one machine? If not, here it is: http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html
As you may recall, this stirred up a lot of commotion in the big data community (and around Spark/GraphX in particular). People blamed him (justly, I guess) for not really having "big data": his entire data set fit in memory, so it didn't really count.

So he took up the challenge and came back with a counter-benchmark that is pretty hard to argue with, this time on a genuinely large data set (1TB of edge data, encoded with Hilbert curves down to 154GB, but still large): http://www.frankmcsherry.org/graph/scalability/cost/2015/02/04/COST2.html The source is available at https://github.com/frankmcsherry/COST. On a graph with 128 billion edges, his single-threaded Rust implementation came out 2x to 10x faster than the published cluster results (roughly the flavor of the sketch at the end of this post).

So, what is the counter-argument? This pretty much looks like a blow to Spark / GraphX and friends (which I like and use on a daily basis). Before I dive into re-validating his benchmarks against my own use cases: what is your opinion on this? If his numbers hold up, what IS the use case for using Spark/GraphX at all?
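For context, here is a minimal sketch of the kind of single-threaded computation McSherry is benchmarking: a plain PageRank loop over an in-memory edge list. This is not his code (his version streams a Hilbert-curve-encoded binary edge file from disk, which is where much of the cleverness lies); the tiny graph, node count, and iteration count below are made up for illustration.

```rust
// Minimal single-threaded PageRank over an edge list, in the spirit of the
// COST benchmarks. Assumes every node has at least one outgoing edge
// (no dangling-node handling), which is fine for this illustrative sketch.

fn pagerank(num_nodes: usize, edges: &[(u32, u32)], iterations: usize) -> Vec<f32> {
    let mut ranks = vec![1.0f32 / num_nodes as f32; num_nodes];

    // Out-degree of each node, computed in one pass over the edges.
    let mut degrees = vec![0u32; num_nodes];
    for &(src, _) in edges {
        degrees[src as usize] += 1;
    }

    for _ in 0..iterations {
        // Standard damping: 0.15 base rank, 0.85 propagated along edges.
        let mut next = vec![0.15f32 / num_nodes as f32; num_nodes];
        // One sequential pass over the edges per iteration; the whole
        // "framework" is this loop.
        for &(src, dst) in edges {
            next[dst as usize] += 0.85 * ranks[src as usize] / degrees[src as usize] as f32;
        }
        ranks = next;
    }
    ranks
}

fn main() {
    // Hypothetical toy graph: 4 nodes, a handful of directed edges.
    let edges = vec![(0u32, 1u32), (1, 2), (2, 0), (2, 3), (3, 0)];
    let ranks = pagerank(4, &edges, 20);
    for (node, rank) in ranks.iter().enumerate() {
        println!("node {}: {:.4}", node, rank);
    }
}
```

As I understand the COST argument, the point is that the real work per iteration is just that inner loop over edges, so a distributed system has to amortize all of its coordination and serialization overhead against something this cheap.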