Hi,

I wonder which version of Spark and which parameter configuration you used.
I was able to run connected components (CC) on 1.8bn edges in about 8 minutes (23 iterations) using
16 nodes with around 80GB of RAM each (Spark 1.5, default parameters).
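
For reference, a minimal sketch of such a run; the edge-list path and the
partition count below are placeholders, not the exact setup I used:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.graphx.GraphLoader

    object ConnectedComponentsRun {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("GraphX CC"))

        // Edge list with one "srcId dstId" pair per line; path and
        // partition count are placeholders.
        val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt", false, 256)

        // Pregel-based connected components: every vertex ends up labelled
        // with the smallest vertex id in its component.
        val cc = graph.connectedComponents().vertices
        println(s"components: ${cc.map(_._2).distinct().count()}")

        sc.stop()
      }
    }
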
John: I suppose your C++ app (algorithm) does not scale, given that you ran it on only one
node.
I also don't understand in what sense RDD serialization is taking excessive time: excessive
compared to the total run time, or to some other expected time?

For the various RDD timings you have the event logs, the web UI, and a number of
papers describing how to measure different things. lihu: did you use some tool
that turned out to be incomplete, or what exactly are you looking for?
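
For example, the listener API exposes per-task (de)serialization times directly.
A small sketch (the listener class name is made up; the metric fields are the
standard TaskMetrics ones):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    // Hypothetical helper: print serialization-related times per finished task.
    class SerializationTimeListener extends SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val m = taskEnd.taskMetrics
        if (m != null) {
          println(s"stage ${taskEnd.stageId}: " +
            s"resultSerialization=${m.resultSerializationTime} ms, " +
            s"executorDeserialize=${m.executorDeserializeTime} ms")
        }
      }
    }

    // On an existing context:
    // sc.addSparkListener(new SerializationTimeListener())

Setting spark.eventLog.enabled=true additionally writes the same metrics to the
event log, so they can be replayed later in the history server.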

Best,
Ovidiu

> On 11 Mar 2016, at 16:02, John Lilley <john.lil...@redpoint.net> wrote:
> 
> A colleague did the experiments and I don’t know exactly how he observed 
> that.  I think it was indirect: from the Spark diagnostics indicating the 
> amount of I/O, he deduced that this was RDD serialization.  Also, when he added 
> light compression to RDD serialization, this improved matters.
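
A sketch of what enabling such compression usually looks like; the codec choice
below is an assumption, not necessarily what was used in that experiment:

    import org.apache.spark.SparkConf

    // Compress serialized RDD partitions (cached blocks) with a lightweight codec.
    val conf = new SparkConf()
      .set("spark.rdd.compress", "true")
      .set("spark.io.compression.codec", "lz4")
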
>  
> John Lilley
> Chief Architect, RedPoint Global Inc.
> T: +1 303 541 1516  | M: +1 720 938 5761 | F: +1 781-705-2077
> Skype: jlilley.redpoint | john.lil...@redpoint.net | www.redpoint.net
>  
> From: lihu [mailto:lihu...@gmail.com] 
> Sent: Friday, March 11, 2016 7:58 AM
> To: John Lilley <john.lil...@redpoint.net>
> Cc: Andrew A <andrew.a...@gmail.com>; u...@spark.incubator.apache.org
> Subject: Re: Graphx
>  
> Hi, John:
>        I am very interested in your experiment. How did you find out that RDD 
> serialization costs a lot of time: from the logs, or with some other tools?
>  
> On Fri, Mar 11, 2016 at 8:46 PM, John Lilley <john.lil...@redpoint.net> wrote:
> Andrew,
>  
> We conducted some tests using GraphX to solve the connected-components 
> problem and were disappointed.  On 8 nodes of 16GB each, we could not get 
> above 100M edges.  On 8 nodes of 60GB each, we could not process 1bn edges.  
> RDD serialization would take excessive time and then we would get failures.  
> By contrast, we have a C++ algorithm that solves 1bn edges using memory+disk 
> on a single 16GB node in about an hour.  I think that a very large cluster 
> will do better, but we did not explore that.
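
A mitigation often suggested for GraphX serialization overhead is Kryo with the
GraphX classes registered; a sketch, with no claim that it would have rescued
these particular runs:

    import org.apache.spark.SparkConf
    import org.apache.spark.graphx.GraphXUtils

    // Use Kryo and register GraphX's internal classes so vertex and edge
    // data serialize more compactly than with Java serialization.
    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    GraphXUtils.registerKryoClasses(conf)
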
>  
> John Lilley
> Chief Architect, RedPoint Global Inc.
> T: +1 303 541 1516  | M: +1 720 938 5761 | F: +1 781-705-2077
> Skype: jlilley.redpoint | john.lil...@redpoint.net | www.redpoint.net
>  
> From: Andrew A [mailto:andrew.a...@gmail.com] 
> Sent: Thursday, March 10, 2016 2:44 PM
> To: u...@spark.incubator.apache.org
> Subject: Graphx
>  
> Hi, is there anyone who uses GraphX in production? What is the maximum graph size 
> you have processed with Spark, and what cluster did you use for it?
> 
> I tried to calculate PageRank on a ~1 GB LiveJournal edge dataset using 
> LiveJournalPageRank from the Spark examples, and I ran into large 
> shuffles produced by Spark which failed my job.
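
For context, a sketch of a plain GraphX PageRank run on such an edge list, with
the edge-partition count raised so individual shuffle blocks stay smaller; the
path, partition count, and iteration count are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.graphx.GraphLoader

    val sc = new SparkContext(new SparkConf().setAppName("LiveJournal PageRank"))

    // More edge partitions -> more, smaller shuffle blocks, which can help
    // when large shuffles make a job fail.
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/soc-LiveJournal1.txt", false, 128)

    // 10 iterations of static PageRank; print the five highest-ranked vertices.
    val ranks = graph.staticPageRank(10).vertices
    ranks.sortBy(-_._2).take(5).foreach(println)

    sc.stop()
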
> 
> Thank you,
> Andrew
