We use it in prod.

70 boxes, 61GB RAM each.

GraphX Connected Components works fine on 250M vertices and 1B edges (takes
about 5-10 minutes).

Spark likes memory, so use r3.2xlarge boxes (61GB). For example, 10 x
r3.2xlarge (61GB) nodes work much faster than 20 x r3.xlarge (30.5GB) nodes,
especially if you have skewed data.

Also, use checkpoints before and after Connected Components to reduce DAG
delays.
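
For example (a rough sketch in spark-shell, so sc is available; the checkpoint
directory and input path are placeholders, not our real ones):

  import org.apache.spark.graphx.GraphLoader

  sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   // placeholder path

  val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")  // placeholder path
  graph.checkpoint()            // checkpoint the input graph's vertex and edge RDDs
  graph.vertices.count()        // actions force the checkpoint to materialize
  graph.edges.count()

  val cc = graph.connectedComponents()
  cc.checkpoint()               // and again afterwards, before any downstream stages
  cc.vertices.count()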

You can also try enabling Kryo serialization and registering the classes used
in your RDDs.
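
For example (a rough sketch; VertexAttr below is just a stand-in for whatever
your RDDs actually carry):

  import org.apache.spark.SparkConf
  import org.apache.spark.graphx.GraphXUtils

  case class VertexAttr(label: Long)   // stand-in for your own attribute type

  val conf = new SparkConf()
    .setAppName("graphx-cc")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  conf.registerKryoClasses(Array(classOf[VertexAttr]))
  GraphXUtils.registerKryoClasses(conf)   // also registers GraphX's internal classes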


On Fri, Mar 11, 2016 at 8:07 AM, John Lilley <john.lil...@redpoint.net>
wrote:

> I suppose for a 2.6bn case we’d need Long:
>
>
>
> public class GenCCInput {
>   public static void main(String[] args) {
>     if (args.length != 2) {
>       System.err.println("Usage: \njava GenCCInput <edges> <groupsize>");
>       System.exit(-1);
>     }
>     long edges = Long.parseLong(args[0]);
>     long groupSize = Long.parseLong(args[1]);
>     long currentEdge = 1;
>     long currentGroupSize = 0;
>     for (long i = 0; i < edges; i++) {
>       System.out.println(currentEdge + " " + (currentEdge + 1));
>       // Track how many vertices the current component has (the first edge of a
>       // group contributes two vertices, each later edge adds one).
>       if (currentGroupSize == 0) {
>         currentGroupSize = 2;
>       } else {
>         currentGroupSize++;
>       }
>       if (currentGroupSize >= groupSize) {
>         // Group is full: skip a vertex id so the next edge starts a new component.
>         currentGroupSize = 0;
>         currentEdge += 2;
>       } else {
>         currentEdge++;
>       }
>     }
>   }
> }
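>
> For example, the 2.6bn-edge, 10-vertex-component case would be generated
> with something like the following, redirecting stdout to a file:
>
>   java GenCCInput 2600000000 10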
>
>
>
> *John Lilley*
>
> Chief Architect, RedPoint Global Inc.
>
> T: +1 303 541 1516  *| *M: +1 720 938 5761 *|* F: +1 781-705-2077
>
> Skype: jlilley.redpoint *|* john.lil...@redpoint.net *|* www.redpoint.net
>
>
>
> *From:* John Lilley [mailto:john.lil...@redpoint.net]
> *Sent:* Friday, March 11, 2016 8:46 AM
> *To:* Ovidiu-Cristian MARCU <ovidiu-cristian.ma...@inria.fr>
> *Cc:* lihu <lihu...@gmail.com>; Andrew A <andrew.a...@gmail.com>;
> u...@spark.incubator.apache.org; Geoff Thompson <
> geoff.thomp...@redpoint.net>
> *Subject:* RE: Graphx
>
>
>
> Ovidiu,
>
>
>
> IMHO, this is one of the biggest issues facing GraphX and Spark.  There
> are a lot of knobs and levers to pull to affect performance, with very
> little guidance about which settings work in general.  We cannot ship
> software that requires end-user tuning; it just has to work.  Unfortunately,
> GraphX seems very sensitive to working-set size relative to available RAM,
> and it fails catastrophically rather than gracefully when the working set
> is too large.  It is also very sensitive to the nature of the data.  For
> example, if we build a test file with an input-edge representation like:
>
> 1 2
> 2 3
> 3 4
> 5 6
> 6 7
> 7 8
> …
>
> this represents a graph with connected components in groups of four.  We
> found experimentally that when this data is input in clustered order, the
> required memory is lower and the runtime is much faster than when the data
> is input in random order.  This makes intuitive sense because of the
> additional communication required for the random order.
>
>
>
> Our 1bn-edge test case was of this same form, input in clustered order,
> with groups of 10 vertices per component.  It failed at 8 x 60GB.  This is
> the kind of data that our application processes, so it is a realistic test
> for us.  I’ve found that social-media test data sets tend to follow
> power-law distributions, and that GraphX has much less trouble with them.
>
>
>
> A comparable test scaled to your cluster (16 x 80GB, about 2.6x our 8 x
> 60GB of total memory) would be 2.6bn edges in 10-vertex components, using
> the synthetic test input I describe above.  I would be curious to know
> whether this works and what settings you use to succeed, and whether it
> continues to succeed for random input order.
>
>
>
> As for the C++ algorithm, it scales across multiple cores.  It exhibits
> O(N^2) behavior for large data sets, but it processes the 1bn-edge case on
> a single 60GB node in about 20 minutes.  It degrades gracefully along the
> O(N^2) curve, and additional memory reduces the runtime.
>
>
>
> *John Lilley*
>
>
>
> *From:* Ovidiu-Cristian MARCU [mailto:ovidiu-cristian.ma...@inria.fr
> <ovidiu-cristian.ma...@inria.fr>]
> *Sent:* Friday, March 11, 2016 8:14 AM
> *To:* John Lilley <john.lil...@redpoint.net>
> *Cc:* lihu <lihu...@gmail.com>; Andrew A <andrew.a...@gmail.com>;
> u...@spark.incubator.apache.org
> *Subject:* Re: Graphx
>
>
>
> Hi,
>
>
>
> I wonder what Spark version and parameter configuration you used.
>
> I was able to run CC for 1.8bn edges in about 8 minutes (23 iterations)
> using 16 nodes with around 80GB RAM each (Spark 1.5, default parameters).
>
> John: I suppose your C++ app (algorithm) does not scale out, given that you
> used only one node.
>
> I don’t understand the claim that RDD serialization takes excessive time:
> excessive compared to the total time, or to some other expected time?
>
> For the different RDD timings you have the event logs, the UI console, and
> a number of papers describing how to measure these things. lihu: did you
> use some incomplete tool, or what exactly are you looking for?
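>
> For instance, a crude listener that prints one per-task serialization
> metric (this only covers serializing task results back to the driver; the
> Web UI and event logs expose much more):
>
>   import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
>
>   sc.addSparkListener(new SparkListener {
>     override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
>       val m = taskEnd.taskMetrics
>       if (m != null) {
>         println(s"stage ${taskEnd.stageId}: result serialization ${m.resultSerializationTime} ms")
>       }
>     }
>   })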
>
>
>
> Best,
>
> Ovidiu
>
>
>
> On 11 Mar 2016, at 16:02, John Lilley <john.lil...@redpoint.net> wrote:
>
>
>
> A colleague did the experiments and I don’t know exactly how he observed
> that.  I think it was indirect: from the Spark diagnostics indicating the
> amount of I/O, he deduced that this was RDD serialization.  Also, when he
> added light compression to RDD serialization, this improved matters.
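>
> I don’t have his exact settings, but the usual knobs for that would look
> something like this (rough sketch only):
>
>   import org.apache.spark.SparkConf
>
>   val conf = new SparkConf()
>     .set("spark.rdd.compress", "true")          // compress serialized RDD blocks
>     .set("spark.io.compression.codec", "lz4")   // a lightweight codec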
>
>
>
> *John Lilley*
>
> Chief Architect, RedPoint Global Inc.
>
> T: +1 303 541 1516  *| *M: +1 720 938 5761 *|* F: +1 781-705-2077
>
> Skype: jlilley.redpoint *|* john.lil...@redpoint.net *|* www.redpoint.net
>
>
>
> *From:* lihu [mailto:lihu...@gmail.com <lihu...@gmail.com>]
> *Sent:* Friday, March 11, 2016 7:58 AM
> *To:* John Lilley <john.lil...@redpoint.net>
> *Cc:* Andrew A <andrew.a...@gmail.com>; u...@spark.incubator.apache.org
> *Subject:* Re: Graphx
>
>
>
> Hi, John:
>
>        I am very interested in your experiment. How did you determine that
> RDD serialization costs a lot of time: from the logs, or from some other
> tools?
>
>
>
> On Fri, Mar 11, 2016 at 8:46 PM, John Lilley <john.lil...@redpoint.net>
> wrote:
>
> Andrew,
>
>
>
> We conducted some tests using GraphX to solve the connected-components
> problem and were disappointed.  On 8 nodes of 16GB each, we could not get
> above 100M edges.  On 8 nodes of 60GB each, we could not process 1bn
> edges.  RDD serialization would take excessive time and then we would get
> failures.  By contrast, we have a C++ algorithm that solves 1bn edges using
> memory+disk on a single 16GB node in about an hour.  I think that a very
> large cluster will do better, but we did not explore that.
>
>
>
> *John Lilley*
>
> Chief Architect, RedPoint Global Inc.
>
> T: +1 303 541 1516  *| *M: +1 720 938 5761 *|* F: +1 781-705-2077
>
> Skype: jlilley.redpoint *|* john.lil...@redpoint.net *|* www.redpoint.net
>
>
>
> *From:* Andrew A [mailto:andrew.a...@gmail.com]
> *Sent:* Thursday, March 10, 2016 2:44 PM
> *To:* u...@spark.incubator.apache.org
> *Subject:* Graphx
>
>
>
> Hi, is there anyone who uses GraphX in production? What is the maximum
> graph size you have processed with Spark, and what cluster did you use for
> it?
>
> I tried calculating PageRank on a 1GB LiveJournal (LJ) edge dataset, using
> LiveJournalPageRank from the Spark examples, and I ran into large shuffles
> produced by Spark that failed my Spark job.
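>
> For context, that job boils down to roughly the following in spark-shell
> (the path and partition count here are illustrative, not the exact job
> above):
>
>   import org.apache.spark.graphx.GraphLoader
>
>   val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/soc-LiveJournal1.txt",
>     numEdgePartitions = 128)
>   val ranks = graph.staticPageRank(numIter = 10).vertices
>   ranks.take(5).foreach(println)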
>
> Thank you,
>
> Andrew
>
>
>
