Ovidiu,

IMHO, this is one of the biggest issues facing GraphX and Spark.  There are a 
lot of knobs and levers to pull to affect performance, with very little 
guidance about which settings work in general.  We cannot ship software that 
requires end-user tuning; it just has to work.  Unfortunately, GraphX seems very 
sensitive to working-set size relative to available RAM, and it fails 
catastrophically rather than gracefully when the working set is too large.  It is 
also very sensitive to the nature of the data.  For example, if we build a test 
file with an edge-list representation like:
1 2
2 3
3 4
5 6
6 7
7 8
…
this represents a graph whose connected components are groups of four vertices.  
We found experimentally that when this data is input in clustered order, the 
required memory is lower and the runtime is much faster than when the data is 
input in random order.  This makes intuitive sense given the additional 
communication required for the random order.
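
A generator for this kind of synthetic input could look roughly like the 
following (a sketch, not our actual tool; the component count, group size, and 
file name are placeholders):

import java.io.PrintWriter
import scala.util.Random

// Sketch: emit components of `groupSize` consecutive vertices, each as a
// chain of (groupSize - 1) edges, in either clustered or shuffled order.
object GenSyntheticEdges {
  def main(args: Array[String]): Unit = {
    val numComponents = 1000000      // placeholder; scale up for larger tests
    val groupSize     = 4            // 4-vertex components as in the example above
    val clustered     = true         // false => random input order

    val edges = (0L until numComponents).flatMap { c =>
      val base = c * groupSize
      (1 until groupSize).map(i => s"${base + i} ${base + i + 1}")
    }
    val ordered = if (clustered) edges else Random.shuffle(edges)

    val out = new PrintWriter("edges.txt")
    ordered.foreach(line => out.println(line))
    out.close()
  }
}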

Our 1bn-edge test case was of this same form, input in clustered order, with 
groups of 10 vertices per component.  It failed on 8 nodes of 60GB each.  This 
is the kind of data that our application processes, so it is a realistic test 
for us.  I’ve found that social-media test data sets tend to follow power-law 
distributions, and that GraphX has much less trouble with them.

A comparable test scaled to your cluster (16 x 80GB) would be 2.6bn edges in 
10-vertex components, using the synthetic test input I describe above.  I would 
be curious to know whether this works and which settings you use to succeed, 
and whether it continues to succeed with random input order.
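
For reference, a minimal GraphX connected-components run over such a file would 
look roughly like this (a sketch; the path and partition count are placeholders, 
not the settings we used):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader

// Sketch: load the synthetic edge list and run GraphX connected components.
object CCTest {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cc-test"))
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///path/to/edges.txt",
      numEdgePartitions = 512)                 // placeholder partition count
    val cc = graph.connectedComponents()       // Pregel-based label propagation
    println("components: " + cc.vertices.map(_._2).distinct().count())
    sc.stop()
  }
}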

As for the C++ algorithm, it scales across multiple cores.  It exhibits O(N^2) 
behavior for large data sets, but it processes the 1bn-edge case on a single 
60GB node in about 20 minutes.  It degrades gracefully along the O(N^2) curve, 
and additional memory reduces the runtime.

John Lilley

From: Ovidiu-Cristian MARCU [mailto:ovidiu-cristian.ma...@inria.fr]
Sent: Friday, March 11, 2016 8:14 AM
To: John Lilley <john.lil...@redpoint.net>
Cc: lihu <lihu...@gmail.com>; Andrew A <andrew.a...@gmail.com>; 
u...@spark.incubator.apache.org
Subject: Re: Graphx

Hi,

I wonder what version of Spark and which parameter configuration you used.
I was able to run CC for 1.8bn edges in about 8 minutes (23 iterations) using 
16 nodes with around 80GB RAM each (Spark 1.5, default parameters).
John: I suppose your C++ app (algorithm) does not scale, given that you used 
only one node.
I don’t understand how the RDD serialization is taking excessive time: compared 
to the total time, or to some other expected time?

For the different RDD times, you have events, the UI console, and a bunch of 
papers describing how to measure different things.  lihu: did you use some 
incomplete tool, or what are you looking for?
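
For example, the standard event log that feeds the web UI / history server can 
be enabled with something like the following in spark-defaults.conf (the 
directory is just a placeholder):

# illustrative; point the directory at your own storage
spark.eventLog.enabled   true
spark.eventLog.dir       hdfs:///path/to/spark-events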

Best,
Ovidiu

On 11 Mar 2016, at 16:02, John Lilley <john.lil...@redpoint.net> wrote:

A colleague did the experiments, and I don’t know exactly how he observed that.  
I think it was indirect: from the Spark diagnostics indicating the amount of 
I/O, he deduced that this was RDD serialization.  Also, when he added light 
compression to RDD serialization, this improved matters.
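
For reference, the knobs involved are of this general kind, set in 
spark-defaults.conf or on the SparkConf (illustrative; I don’t have his exact 
settings):

# illustrative settings; not necessarily the exact ones he used
spark.serializer             org.apache.spark.serializer.KryoSerializer
spark.rdd.compress           true
spark.io.compression.codec   lz4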

John Lilley
Chief Architect, RedPoint Global Inc.
T: +1 303 541 1516  | M: +1 720 938 5761 | F: +1 781-705-2077
Skype: jlilley.redpoint | john.lil...@redpoint.net | www.redpoint.net

From: lihu [mailto:lihu...@gmail.com]
Sent: Friday, March 11, 2016 7:58 AM
To: John Lilley <john.lil...@redpoint.net>
Cc: Andrew A <andrew.a...@gmail.com>; u...@spark.incubator.apache.org
Subject: Re: Graphx

Hi, John:
       I am very interested in your experiment.  How did you determine that 
RDD serialization cost a lot of time: from the log, or from some other tools?

On Fri, Mar 11, 2016 at 8:46 PM, John Lilley <john.lil...@redpoint.net> wrote:
Andrew,

We conducted some tests using GraphX to solve the connected-components problem 
and were disappointed.  On 8 nodes of 16GB each, we could not get above 100M 
edges.  On 8 nodes of 60GB each, we could not process 1bn edges.  RDD 
serialization would take excessive time, and then we would get failures.  By 
contrast, we have a C++ algorithm that solves 1bn edges using memory+disk on a 
single 16GB node in about an hour.  I think that a very large cluster would do 
better, but we did not explore that.

John Lilley
Chief Architect, RedPoint Global Inc.
T: +1 303 541 1516 | M: +1 720 938 5761 | F: +1 781-705-2077
Skype: jlilley.redpoint | john.lil...@redpoint.net | www.redpoint.net

From: Andrew A [mailto:andrew.a...@gmail.com]
Sent: Thursday, March 10, 2016 2:44 PM
To: u...@spark.incubator.apache.org
Subject: Graphx

Hi, is there anyone who uses GraphX in production?  What is the maximum size of 
graph you have processed with Spark, and what cluster did you use for it?

I tried to calculate PageRank on the ~1 GB LiveJournal (LJ) edge dataset used 
by LiveJournalPageRank from the Spark examples, and I ran into large shuffles 
produced by Spark which failed my Spark job.
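
For context, a minimal PageRank run with the GraphX API looks roughly like this 
(a sketch; the path, partition count, and tolerance are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader

// Sketch: PageRank over an edge-list file, along the lines of the examples driver.
object PageRankTest {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pagerank-test"))
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///path/to/soc-LiveJournal1.txt",
      numEdgePartitions = 256)                  // placeholder partition count
    val ranks = graph.pageRank(0.0001).vertices // tolerance is a placeholder
    println("top vertex: " + ranks.reduce((a, b) => if (a._2 > b._2) a else b))
    sc.stop()
  }
}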
Thank you,
Andrew
