A colleague ran the experiments, so I don't know exactly how he observed that.  
I believe it was indirect: the Spark diagnostics indicated a large amount of 
I/O, from which he deduced that RDD serialization was the culprit.  Also, when 
he added light compression to RDD serialization, this improved matters.
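
If it helps, that change is just configuration; here is a minimal sketch of 
the relevant settings (the exact codec and serializer below are illustrative, 
I don't know precisely which ones he used):

    import org.apache.spark.{SparkConf, SparkContext}

    // Compress serialized RDD partitions (e.g. ones cached with
    // MEMORY_ONLY_SER); LZ4 is a light, fast codec.
    val conf = new SparkConf()
      .setAppName("GraphxConnectedComponents")
      .set("spark.rdd.compress", "true")
      .set("spark.io.compression.codec", "lz4")
      // Kryo is typically faster and more compact than Java serialization.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)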

John Lilley
Chief Architect, RedPoint Global Inc.
T: +1 303 541 1516  | M: +1 720 938 5761 | F: +1 781-705-2077
Skype: jlilley.redpoint | john.lil...@redpoint.net | www.redpoint.net

From: lihu [mailto:lihu...@gmail.com]
Sent: Friday, March 11, 2016 7:58 AM
To: John Lilley <john.lil...@redpoint.net>
Cc: Andrew A <andrew.a...@gmail.com>; u...@spark.incubator.apache.org
Subject: Re: Graphx

Hi, John:
       I am very interested in your experiment. How did you determine that RDD 
serialization cost a lot of time, from the logs or some other tools?

On Fri, Mar 11, 2016 at 8:46 PM, John Lilley <john.lil...@redpoint.net> wrote:
Andrew,

We ran some tests using GraphX to solve the connected-components problem and 
were disappointed.  On 8 nodes of 16GB each, we could not get above 100M 
edges.  On 8 nodes of 60GB each, we could not process 1bn edges.  RDD 
serialization would take excessive time and then we would get failures.  By 
contrast, we have a C++ algorithm that solves 1bn edges using memory+disk on a 
single 16GB node in about an hour.  I suspect a very large cluster would do 
better, but we did not explore that.
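
For context, the GraphX side of such a test is essentially the built-in 
connected-components call; a minimal sketch (the input path is a placeholder, 
and this is not our exact harness):

    import org.apache.spark.graphx.GraphLoader

    // Assuming an existing SparkContext `sc`. Load an edge list (one
    // "srcId dstId" pair per line); connectedComponents() labels every
    // vertex with the smallest vertex ID in its component.
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")
    val cc = graph.connectedComponents().vertices
    println(cc.take(10).mkString("\n"))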

John Lilley
Chief Architect, RedPoint Global Inc.
T: +1 303 541 1516 | M: +1 720 938 5761 | F: +1 781-705-2077
Skype: jlilley.redpoint | john.lil...@redpoint.net | www.redpoint.net

From: Andrew A [mailto:andrew.a...@gmail.com]
Sent: Thursday, March 10, 2016 2:44 PM
To: u...@spark.incubator.apache.org
Subject: Graphx

Hi, is there anyone who uses GraphX in production? What is the maximum graph 
size you have processed with Spark, and what cluster did you use for it?

I tried to calculate PageRank on the ~1 GB LiveJournal edge dataset used by 
LiveJournalPageRank from the Spark examples, and I ran into large shuffles 
produced by Spark that caused my job to fail.
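
For reference, that example's core computation is essentially the standard 
GraphX pageRank call; a minimal sketch (the HDFS path and tolerance below are 
placeholders, not my exact settings):

    import org.apache.spark.graphx.GraphLoader

    // Assuming an existing SparkContext `sc` (e.g. the spark-shell one).
    // Load the LiveJournal edge list and run PageRank until the change
    // per vertex drops below the given tolerance.
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/soc-LiveJournal1.txt")
    val ranks = graph.pageRank(tol = 0.001).vertices
    println(ranks.take(5).mkString("\n"))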
Thank you,
Andrew
