We have almost zero node info – just an identifying integer.
John Lilley

From: Alexis Roos [mailto:alexis.r...@gmail.com]
Sent: Friday, March 11, 2016 11:24 AM
To: Alexander Pivovarov <apivova...@gmail.com>
Cc: John Lilley <john.lil...@redpoint.net>; Ovidiu-Cristian MARCU 
<ovidiu-cristian.ma...@inria.fr>; lihu <lihu...@gmail.com>; Andrew A 
<andrew.a...@gmail.com>; u...@spark.incubator.apache.org; Geoff Thompson 
<geoff.thomp...@redpoint.net>
Subject: Re: Graphx

Also, we keep the node info minimal (only what connected components needs) and 
rejoin the full records later.
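
(A minimal sketch in Scala of that pattern, with illustrative names rather than 
the actual code: run connected components over a graph whose vertices carry only 
a placeholder attribute, then join the component ids back onto the full records.)

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

case class Person(id: Long, name: String)   // stand-in for a heavier record type

def componentsThenRejoin(people: RDD[Person],
                         edges: RDD[Edge[Int]]): RDD[(Person, Long)] = {
  // Only the Long vertex ids travel through the iterative CC job;
  // the vertex attribute is a throwaway placeholder.
  val slim = Graph.fromEdges(edges, defaultValue = 0)
  val cc = slim.connectedComponents()        // vertex attribute = component id
  // Join the component id back onto the full records afterwards.
  people.keyBy(_.id).join(cc.vertices).values
}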

Alexis

On Fri, Mar 11, 2016 at 10:12 AM, Alexander Pivovarov 
<apivova...@gmail.com> wrote:
we use it in prod

70 boxes, 61GB RAM each

GraphX Connected Components works fine on 250M Vertices and 1B Edges (takes 
about 5-10 min)

Spark likes memory, so use r3.2xlarge boxes (61GB)
For example 10 x r3.2xlarge (61GB) work much faster than 20 x r3.xlarge (30.5 
GB) (especially if you have skewed data)

Also, use checkpoints before and after Connected Components to reduce DAG delays

You can also try to enable Kryo and register classes used in RDD
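
(A minimal sketch in Scala of those two suggestions, checkpointing around the 
connected-components call plus Kryo registration, assuming Spark 1.5/1.6; the app 
name, paths, and registered classes are illustrative only.)

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, GraphLoader}

object CCWithCheckpoints {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("graphx-cc")
      // Kryo is opt-in; registering the classes stored in your RDDs keeps
      // the serialized records compact.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[Edge[Int]]))
    val sc = new SparkContext(conf)
    sc.setCheckpointDir("hdfs:///tmp/cc-checkpoints")   // illustrative path

    val graph = GraphLoader.edgeListFile(sc, args(0)).cache()
    graph.checkpoint()                // checkpoint before CC: truncates the input lineage
    val cc = graph.connectedComponents()
    cc.vertices.checkpoint()          // checkpoint after CC: truncates the iterative lineage
    cc.vertices.saveAsTextFile(args(1))
    sc.stop()
  }
}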


On Fri, Mar 11, 2016 at 8:07 AM, John Lilley 
<john.lil...@redpoint.net> wrote:
I suppose for a 2.6bn case we’d need Long:

public class GenCCInput {
  // Emits <edges> whitespace-separated edge lines on stdout.  Vertices are
  // numbered consecutively; every <groupsize> vertices form one chain
  // (connected component) of <groupsize> - 1 edges, e.g. "1 2", "2 3", "3 4",
  // then the next component starts at a fresh vertex id.
  public static void main(String[] args) {
    if (args.length != 2) {
      System.err.println("Usage: \njava GenCCInput <edges> <groupsize>");
      System.exit(-1);
    }
    long edges = Long.parseLong(args[0]);
    long groupSize = Long.parseLong(args[1]);
    long currentEdge = 1;        // left vertex of the edge about to be printed
    long currentGroupSize = 0;   // vertices emitted so far in this component
    for (long i = 0; i < edges; i++) {
      System.out.println(currentEdge + " " + (currentEdge + 1));
      if (currentGroupSize == 0) {
        currentGroupSize = 2;    // first edge of a component adds two vertices
      } else {
        currentGroupSize++;      // each later edge adds one vertex
      }
      if (currentGroupSize >= groupSize) {
        currentGroupSize = 0;    // component complete: skip a vertex id so the
        currentEdge += 2;        // next chain is disconnected from this one
      } else {
        currentEdge++;
      }
    }
  }
}
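
(Usage note: something like "java GenCCInput 2600000000 10 > cc-edges.txt" should 
emit the 2.6bn-edge case with 10 vertices, i.e. 9 edges, per component, as 
whitespace-separated pairs that GraphX's GraphLoader.edgeListFile can read directly.)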

John Lilley
Chief Architect, RedPoint Global Inc.
T: +1 303 541 1516 | M: +1 720 938 5761 | F: +1 781-705-2077
Skype: jlilley.redpoint | john.lil...@redpoint.net | www.redpoint.net

From: John Lilley [mailto:john.lil...@redpoint.net]
Sent: Friday, March 11, 2016 8:46 AM
To: Ovidiu-Cristian MARCU <ovidiu-cristian.ma...@inria.fr>
Cc: lihu <lihu...@gmail.com>; Andrew A <andrew.a...@gmail.com>; 
u...@spark.incubator.apache.org; Geoff Thompson <geoff.thomp...@redpoint.net>
Subject: RE: Graphx

Ovidiu,

IMHO, this is one of the biggest issues facing GraphX and Spark.  There are a 
lot of knobs and levers to pull to affect performance, with very little 
guidance about which settings work in general.  We cannot ship software that 
requires end-user tuning; it just has to work.  Unfortunately, GraphX seems very 
sensitive to working-set size relative to available RAM, and it fails 
catastrophically rather than gracefully when the working set is too large.  It is 
also very sensitive to the nature of the data.  For example, if we build a test 
file with an input-edge representation like:
1 2
2 3
3 4
5 6
6 7
7 8
…
this represents a graph with connected components in groups of four vertices.  We found 
experimentally that when this data is input in clustered order, the required 
memory is lower and the runtime is much faster than when the data is input in random 
order.  This makes intuitive sense because of the additional communication 
required for the random order.

Our 1bn-edge test case was of this same form, input in clustered order, with 
groups of 10 vertices per component.  It failed at 8 x 60GB.  This is the kind 
of data that our application processes, so it is a realistic test for us.  I’ve 
found that social media test data sets tend to follow power-law distributions, 
and that GraphX has much less trouble with them.

A comparable test scaled to your cluster (16 x 80GB) would be 2.6bn edges in 
10-vertex components using the synthetic test input I describe above.  I would 
be curious to know if this works and what settings you use to succeed, and if 
it continues to succeed for random input order.

As for the C++ algorithm, it scales across multiple cores.  It exhibits O(N^2) behavior 
for large data sets, but it processes the 1bn-edge case on a single 60GB node 
in about 20 minutes.  It degrades gracefully along the O(N^2) curve, and 
additional memory reduces the run time.

John Lilley

From: Ovidiu-Cristian MARCU [mailto:ovidiu-cristian.ma...@inria.fr]
Sent: Friday, March 11, 2016 8:14 AM
To: John Lilley <john.lil...@redpoint.net>
Cc: lihu <lihu...@gmail.com>; Andrew A <andrew.a...@gmail.com>; 
u...@spark.incubator.apache.org
Subject: Re: Graphx

Hi,

I wonder what version of Spark and what parameter configuration you used.
I was able to run CC for 1.8bn edges in about 8 minutes (23 iterations) using 
16 nodes with around 80GB RAM each (Spark 1.5, default parameters).
John: I suppose your C++ app (algorithm) does not scale out, if you used only one 
node.
I don’t understand how RDD serialization could be taking excessive time; excessive 
compared to what, the total time or some other expected time?

For the different RDD timings you have the event logs, the UI console, and a bunch of 
papers describing how to measure different things. lihu: did you use some 
incomplete tool, or what are you looking for?

Best,
Ovidiu

On 11 Mar 2016, at 16:02, John Lilley 
<john.lil...@redpoint.net> wrote:

A colleague did the experiments, and I don’t know exactly how he observed that.  
I think it was indirect: from the Spark diagnostics indicating the amount of I/O, 
he deduced that the time was going into RDD serialization.  Also, when he added 
light compression to RDD serialization, this improved matters.
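
(For reference, "light compression on RDD serialization" presumably means something 
along these lines; the exact settings used are not stated, so treat this as an 
illustrative Scala sketch.)

  val conf = new org.apache.spark.SparkConf()
    .set("spark.rdd.compress", "true")          // compress serialized RDD partitions
    .set("spark.io.compression.codec", "lz4")   // lightweight codec for internal data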

John Lilley
Chief Architect, RedPoint Global Inc.
T: +1 303 541 1516 | M: +1 720 938 5761 | F: +1 781-705-2077
Skype: jlilley.redpoint | john.lil...@redpoint.net | www.redpoint.net

From: lihu [mailto:lihu...@gmail.com]
Sent: Friday, March 11, 2016 7:58 AM
To: John Lilley <john.lil...@redpoint.net>
Cc: Andrew A <andrew.a...@gmail.com>; u...@spark.incubator.apache.org
Subject: Re: Graphx

Hi, John:
       I am very interested in your experiment. How did you determine that RDD 
serialization cost a lot of time: from the logs, or from some other tools?

On Fri, Mar 11, 2016 at 8:46 PM, John Lilley 
<john.lil...@redpoint.net> wrote:
Andrew,

We conducted some tests using GraphX to solve the connected-components 
problem and were disappointed.  On 8 nodes of 16GB each, we could not get above 
100M edges.  On 8 nodes of 60GB each, we could not process 1bn edges.  RDD 
serialization would take excessive time, and then we would get failures.  By 
contrast, we have a C++ algorithm that solves 1bn edges using memory+disk on a 
single 16GB node in about an hour.  I think that a very large cluster will do 
better, but we did not explore that.

John Lilley
Chief Architect, RedPoint Global Inc.
T: +1 303 541 1516 | M: +1 720 938 5761 | F: +1 781-705-2077
Skype: jlilley.redpoint | john.lil...@redpoint.net | www.redpoint.net

From: Andrew A [mailto:andrew.a...@gmail.com]
Sent: Thursday, March 10, 2016 2:44 PM
To: u...@spark.incubator.apache.org
Subject: Graphx

Hi, is there anyone who uses GraphX in production? What is the maximum size of graph 
you have processed with Spark, and what cluster do you use for it?

I tried to calculate PageRank on the 1 GB LiveJournal (LJ) edges dataset, using the 
LiveJournalPageRank example from the Spark examples, and I ran into large shuffles 
produced by Spark which failed my job.
Thank you,
Andrew


