Andrew, We conducted some tests using GraphX to solve the connected-components problem and were disappointed. On 8 nodes with 16 GB each, we could not get above 100M edges. On 8 nodes with 60 GB each, we could not process 1bn edges. RDD serialization would take excessive time, and then we would get failures. By contrast, we have a C++ algorithm that solves 1bn edges using memory+disk on a single 16 GB node in about an hour. I suspect a very large cluster would do better, but we did not explore that.
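[For context, a minimal in-memory sketch of connected components via union-find, the general single-machine approach John describes; his C++ implementation also spills to disk, which this sketch omits, and all names here are illustrative:]

```python
def connected_components(num_nodes, edges):
    """Return a component label for each node, given an edge list."""
    parent = list(range(num_nodes))

    def find(x):
        # Path halving: point nodes at their grandparent as we walk up,
        # flattening the tree so later lookups are nearly O(1).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # One linear pass over the edges; each edge merges at most two components.
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv  # union the two components

    # Label every node with its component root.
    return [find(x) for x in range(num_nodes)]


# Example: nodes 0-1-2 form one component, 3-4 another.
labels = connected_components(5, [(0, 1), (1, 2), (3, 4)])
```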
John Lilley
Chief Architect, RedPoint Global Inc.
T: +1 303 541 1516 | M: +1 720 938 5761 | F: +1 781 705 2077
Skype: jlilley.redpoint | john.lil...@redpoint.net | www.redpoint.net

From: Andrew A [mailto:andrew.a...@gmail.com]
Sent: Thursday, March 10, 2016 2:44 PM
To: u...@spark.incubator.apache.org
Subject: Graphx

Hi, is there anyone using GraphX in production? What is the maximum graph size you have processed with Spark, and what cluster did you use for it? I tried calculating PageRank on a 1 GB edge dataset (the LiveJournal dataset used by LiveJournalPageRank in the Spark examples), and I ran into large shuffles produced by Spark that failed my job. Thank you, Andrew