Hi John, I am very interested in your experiment. How did you determine that RDD serialization took so much time? From the logs, or from some other tools?
On Fri, Mar 11, 2016 at 8:46 PM, John Lilley <john.lil...@redpoint.net> wrote:
>
> Andrew,
>
> We conducted some tests using GraphX to solve the connected-components
> problem and were disappointed. On 8 nodes of 16GB each, we could not get
> above 100M edges. On 8 nodes of 60GB each, we could not process 1bn
> edges. RDD serialization would take excessive time and then we would get
> failures. By contrast, we have a C++ algorithm that solves 1bn edges using
> memory+disk on a single 16GB node in about an hour. I think that a very
> large cluster will do better, but we did not explore that.
>
> John Lilley
> Chief Architect, RedPoint Global Inc.
> T: +1 303 541 1516 | M: +1 720 938 5761 | F: +1 781-705-2077
> Skype: jlilley.redpoint | john.lil...@redpoint.net | www.redpoint.net
>
> From: Andrew A [mailto:andrew.a...@gmail.com]
> Sent: Thursday, March 10, 2016 2:44 PM
> To: u...@spark.incubator.apache.org
> Subject: Graphx
>
> Hi, is there anyone who uses GraphX in production? What is the maximum
> size of graph you have processed with Spark, and what cluster did you
> use for it?
>
> I tried to calculate PageRank on 1 GB of edges (the LiveJournal dataset
> used by LiveJournalPageRank from the Spark examples) and ran into large
> shuffles produced by Spark, which failed my Spark job.
>
> Thank you,
>
> Andrew
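
For context, below is roughly the kind of job I am picturing when you say a connected-components test. It is only a minimal sketch, not your actual code; the input path, partition count, and app name are placeholders I made up, and it just uses the stock GraphX ConnectedComponents implementation:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.graphx.GraphLoader

    object CCExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("ConnectedComponentsTest")
        val sc = new SparkContext(conf)

        // Edge list file: one "srcId<tab>dstId" pair per line.
        // The path and partition count are hypothetical.
        val graph = GraphLoader.edgeListFile(
          sc, "hdfs:///data/edges.tsv", numEdgePartitions = 128)

        // Label each vertex with the smallest vertex id in its component.
        val cc = graph.connectedComponents().vertices

        // Count distinct components; this action triggers the job.
        println("components: " + cc.map(_._2).distinct().count())

        sc.stop()
      }
    }

If your setup differed from this (custom partitioning, Kryo registration, checkpointing, etc.), I would be interested to hear where the serialization time showed up.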