PS: This is the code I use to generate clustered test data:

public class GenCCInput {
  public static void main(String[] args) {
    if (args.length != 2) {
      System.err.println("Usage:\njava GenCCInput <edges> <groupsize>");
      System.exit(-1);
    }
    int edges = Integer.parseInt(args[0]);
    int groupSize = Integer.parseInt(args[1]);
    int currentEdge = 1;
    int currentGroupSize = 0;
    for (int i = 0; i < edges; i++) {
      System.out.println(currentEdge + " " + (currentEdge + 1));
      if (currentGroupSize == 0) {
        currentGroupSize = 2;
      } else {
        currentGroupSize++;
      }
      if (currentGroupSize >= groupSize) {
        currentGroupSize = 0;
        currentEdge += 2;
      } else {
        currentEdge++;
      }
    }
  }
}
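As a quick sanity check on the generator (a standalone trace, not part of the original program): with edges = 6 and groupsize = 3 it emits three disjoint 2-edge chains, i.e. three connected components. The class name GenCCTrace below is made up for illustration; the loop logic is copied from GenCCInput.

```java
// Standalone trace of GenCCInput's loop for edges = 6, groupsize = 3.
// Each group of `groupsize` vertices forms an isolated chain, so the
// output is three separate connected components: 1-2-3, 4-5-6, 7-8-9.
public class GenCCTrace {
    public static void main(String[] args) {
        int edges = 6, groupSize = 3;
        int currentEdge = 1, currentGroupSize = 0;
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < edges; i++) {
            out.append(currentEdge).append(' ').append(currentEdge + 1).append('\n');
            // First edge of a group counts two vertices; later edges add one.
            currentGroupSize = (currentGroupSize == 0) ? 2 : currentGroupSize + 1;
            if (currentGroupSize >= groupSize) {
                currentGroupSize = 0;
                currentEdge += 2;  // skip a vertex to start a new component
            } else {
                currentEdge++;
            }
        }
        System.out.print(out);  // 1 2 / 2 3 / 4 5 / 5 6 / 7 8 / 8 9
    }
}
```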
John Lilley
Chief Architect, RedPoint Global Inc.
T: +1 303 541 1516 | M: +1 720 938 5761 | F: +1 781-705-2077
Skype: jlilley.redpoint | john.lil...@redpoint.net | www.redpoint.net

From: Ovidiu-Cristian MARCU [mailto:ovidiu-cristian.ma...@inria.fr]
Sent: Friday, March 11, 2016 8:14 AM
To: John Lilley <john.lil...@redpoint.net>
Cc: lihu <lihu...@gmail.com>; Andrew A <andrew.a...@gmail.com>; u...@spark.incubator.apache.org
Subject: Re: Graphx

Hi,

I wonder which Spark version and which parameter configuration you used. I was able to run CC on 1.8bn edges in about 8 minutes (23 iterations) using 16 nodes with around 80GB RAM each (Spark 1.5, default parameters).

John: I suppose your C++ app (algorithm) does not scale, since you used only one node. I don't understand how the RDD serialization takes excessive time: compared to the total time, or to some other expected time? For the different RDD timings you have the event log, the UI console, and a number of papers describing how to measure these things.

lihu: did you use some incomplete tool, or what are you looking for?

Best,
Ovidiu

On 11 Mar 2016, at 16:02, John Lilley <john.lil...@redpoint.net> wrote:

A colleague did the experiments and I don't know exactly how he observed that. I think it was indirect: from the Spark diagnostics indicating the amount of I/O, he deduced that this was RDD serialization. Also, when he added light compression to RDD serialization, it improved matters.
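The thread does not say which compression settings were used; for reference, these are the usual Spark knobs for serializer and RDD-block compression (a hedged sketch for spark-defaults.conf, values hypothetical):

```
# Hypothetical settings; the thread does not state what was actually used.
# Note: spark.rdd.compress only affects RDD blocks stored with a
# serialized storage level such as MEMORY_ONLY_SER.
spark.serializer            org.apache.spark.serializer.KryoSerializer
spark.rdd.compress          true
spark.io.compression.codec  lz4
```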
From: lihu [mailto:lihu...@gmail.com]
Sent: Friday, March 11, 2016 7:58 AM
To: John Lilley <john.lil...@redpoint.net>
Cc: Andrew A <andrew.a...@gmail.com>; u...@spark.incubator.apache.org
Subject: Re: Graphx

Hi John,

I am very interested in your experiment. How did you determine that RDD serialization cost a lot of time: from the logs, or from some other tool?

On Fri, Mar 11, 2016 at 8:46 PM, John Lilley <john.lil...@redpoint.net> wrote:

Andrew,

We conducted some tests using GraphX to solve the connected-components problem and were disappointed. On 8 nodes of 16GB each, we could not get above 100M edges. On 8 nodes of 60GB each, we could not process 1bn edges. RDD serialization would take excessive time, and then we would get failures. By contrast, we have a C++ algorithm that solves 1bn edges using memory+disk on a single 16GB node in about an hour. I think a very large cluster would do better, but we did not explore that.

From: Andrew A [mailto:andrew.a...@gmail.com]
Sent: Thursday, March 10, 2016 2:44 PM
To: u...@spark.incubator.apache.org
Subject: Graphx

Hi,

Is there anyone who uses GraphX in production? What is the maximum size of graph you have processed with Spark, and what cluster did you use for it?
I tried to calculate PageRank on the ~1 GB LJ (LiveJournal) edge dataset, using LiveJournalPageRank from the Spark examples, and I ran into large shuffles produced by Spark that failed my job.

Thank you,
Andrew
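For comparison with the single-node approach John describes earlier in the thread: his C++ implementation is not shown, so the following is only a hypothetical in-memory Java sketch of that style of algorithm (union-find with path halving and union by size), run on edges like those emitted by GenCCInput. It ignores the memory+disk spilling his version uses.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical single-machine connected components via union-find.
// Not John's C++ code: this sketch assumes the parent/size arrays fit in RAM.
public class UnionFindCC {
    final int[] parent, size;

    UnionFindCC(int n) {
        parent = new int[n + 1];
        size = new int[n + 1];
        for (int v = 0; v <= n; v++) { parent[v] = v; size[v] = 1; }
    }

    int find(int v) {
        // Path halving: point every other node at its grandparent.
        while (parent[v] != v) { parent[v] = parent[parent[v]]; v = parent[v]; }
        return v;
    }

    void union(int a, int b) {
        int ra = find(a), rb = find(b);
        if (ra == rb) return;
        if (size[ra] < size[rb]) { int t = ra; ra = rb; rb = t; }
        parent[rb] = ra;          // attach smaller tree under larger root
        size[ra] += size[rb];
    }

    public static void main(String[] args) {
        // Edge list produced by GenCCInput 6 3: three disjoint chains.
        int[][] edges = {{1, 2}, {2, 3}, {4, 5}, {5, 6}, {7, 8}, {8, 9}};
        UnionFindCC uf = new UnionFindCC(9);
        for (int[] e : edges) uf.union(e[0], e[1]);
        Set<Integer> roots = new HashSet<>();
        for (int v = 1; v <= 9; v++) roots.add(uf.find(v));
        System.out.println("components: " + roots.size()); // prints "components: 3"
    }
}
```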