This is an interesting discussion,

I have had some success running GraphX on large graphs with more than a
billion edges using clusters of different sizes, up to 64 machines. However,
performance goes down when I double the cluster size to 128 r3.xlarge
machines. Does anyone have experience with very large GraphX clusters?

@Ovidiu-Cristian, @Alexis and @Alexander, could you please share the
Spark / GraphX configurations that work best for you?

Thanks,
-Khaled

On Fri, Mar 11, 2016 at 1:25 PM, John Lilley <john.lil...@redpoint.net>
wrote:

> We have almost zero node info – just an identifying integer.
>
> *John Lilley*
>
>
>
> *From:* Alexis Roos [mailto:alexis.r...@gmail.com]
> *Sent:* Friday, March 11, 2016 11:24 AM
> *To:* Alexander Pivovarov <apivova...@gmail.com>
> *Cc:* John Lilley <john.lil...@redpoint.net>; Ovidiu-Cristian MARCU <
> ovidiu-cristian.ma...@inria.fr>; lihu <lihu...@gmail.com>; Andrew A <
> andrew.a...@gmail.com>; u...@spark.incubator.apache.org; Geoff Thompson <
> geoff.thomp...@redpoint.net>
> *Subject:* Re: Graphx
>
>
>
> Also, we keep the node info minimal, just what is needed for connected
> components, and rejoin the rest later.
>
>
>
> Alexis
>
>
>
> On Fri, Mar 11, 2016 at 10:12 AM, Alexander Pivovarov <
> apivova...@gmail.com> wrote:
>
> we use it in prod
>
>
>
> 70 boxes, 61GB RAM each
>
>
>
> GraphX Connected Components works fine on 250M vertices and 1B edges
> (it takes about 5-10 min)
>
>
>
> Spark likes memory, so use r3.2xlarge boxes (61GB)
>
> For example, 10 x r3.2xlarge (61GB) boxes work much faster than 20 x
> r3.xlarge (30.5GB) boxes, especially if you have skewed data
>
>
>
> Also, use checkpoints before and after Connected Components to reduce DAG
> delays.
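>
> A minimal sketch of that (assuming the Scala GraphX API, an existing
> SparkContext sc, and hypothetical HDFS paths):
>
> import org.apache.spark.graphx.GraphLoader
>
> sc.setCheckpointDir("hdfs:///tmp/checkpoints")  // hypothetical directory
>
> val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")
> // Checkpoint before CC: truncate the lineage built up while loading.
> graph.vertices.checkpoint()
> graph.edges.checkpoint()
> graph.vertices.count()  // an action forces the checkpoint to materialize
>
> val cc = graph.connectedComponents()
> // Checkpoint after CC: keep the long iterative Pregel DAG from being
> // replayed in later stages.
> cc.vertices.checkpoint()
> cc.vertices.count()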
>
>
>
> You can also try enabling Kryo and registering the classes used in your
> RDDs.
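>
> E.g., a sketch (assuming Spark 1.x; the registered class is just an
> example, register whatever your vertex/edge attributes actually are):
>
> import org.apache.spark.SparkConf
> import org.apache.spark.graphx.GraphXUtils
>
> val conf = new SparkConf()
>   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
> // Registers GraphX's internal message and edge classes with Kryo.
> GraphXUtils.registerKryoClasses(conf)
> // Register your own RDD element classes as well.
> conf.registerKryoClasses(Array(classOf[Array[Long]]))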
>
>
>
>
>
> On Fri, Mar 11, 2016 at 8:07 AM, John Lilley <john.lil...@redpoint.net>
> wrote:
>
> I suppose for a 2.6bn case we’d need Long:
>
>
>
> public class GenCCInput {
>   public static void main(String[] args) {
>     if (args.length != 2) {
>       System.err.println("Usage: \njava GenCCInput <edges> <groupsize>");
>       System.exit(-1);
>     }
>     long edges = Long.parseLong(args[0]);
>     long groupSize = Long.parseLong(args[1]);
>     long currentEdge = 1;
>     long currentGroupSize = 0;
>     // Emit "v v+1" pairs; skip a vertex ID between groups so that each
>     // connected component contains exactly <groupsize> vertices.
>     for (long i = 0; i < edges; i++) {
>       System.out.println(currentEdge + " " + (currentEdge + 1));
>       if (currentGroupSize == 0) {
>         currentGroupSize = 2;
>       } else {
>         currentGroupSize++;
>       }
>       if (currentGroupSize >= groupSize) {
>         // Group complete: jump past the last vertex to start a new component.
>         currentGroupSize = 0;
>         currentEdge += 2;
>       } else {
>         currentEdge++;
>       }
>     }
>   }
> }
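>
> For example, the 2.6bn-edge case with 10-vertex components would be
> generated with a (hypothetical) invocation like:
>
> java GenCCInput 2600000000 10 > edges.txt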
>
>
>
> *John Lilley*
>
> Chief Architect, RedPoint Global Inc.
>
> T: +1 303 541 1516  *| *M: +1 720 938 5761 *|* F: +1 781-705-2077
>
> Skype: jlilley.redpoint *|* john.lil...@redpoint.net *|* www.redpoint.net
>
>
>
> *From:* John Lilley [mailto:john.lil...@redpoint.net]
> *Sent:* Friday, March 11, 2016 8:46 AM
> *To:* Ovidiu-Cristian MARCU <ovidiu-cristian.ma...@inria.fr>
> *Cc:* lihu <lihu...@gmail.com>; Andrew A <andrew.a...@gmail.com>;
> u...@spark.incubator.apache.org; Geoff Thompson <
> geoff.thomp...@redpoint.net>
> *Subject:* RE: Graphx
>
>
>
> Ovidiu,
>
>
>
> IMHO, this is one of the biggest issues facing GraphX and Spark.  There
> are a lot of knobs and levers to pull to affect performance, with very
> little guidance about which settings work in general.  We cannot ship
> software that requires end-user tuning; it just has to work.  Unfortunately,
> GraphX seems very sensitive to working-set size relative to available RAM,
> and it fails catastrophically rather than gracefully when the working set is
> too large.  It is also very sensitive to the nature of the data.  For
> example, if we build a test file with an input-edge representation like:
>
> 1 2
> 2 3
> 3 4
> 5 6
> 6 7
> 7 8
> …
>
> this represents a graph with connected components in groups of four.  We
> found experimentally that when this data is input in clustered order, the
> required memory is lower and the runtime is much faster than when the data
> is input in random order.  This makes intuitive sense because of the
> additional communication required for the random order.
>
>
>
> Our 1bn-edge test case was of this same form, input in clustered order,
> with groups of 10 vertices per component.  It failed at 8 x 60GB.  This is
> the kind of data that our application processes, so it is a realistic test
> for us.  I’ve found that social media test data sets tend to follow
> power-law distributions, and that GraphX has much less trouble with them.
>
>
>
> A comparable test scaled to your cluster (16 x 80GB) would be 2.6bn edges
> in 10-vertex components using the synthetic test input I describe above.  I
> would be curious to know if this works and what settings you use to
> succeed, and if it continues to succeed for random input order.
>
>
>
> As for the C++ algorithm, it scales across multiple cores.  It exhibits
> O(N^2) behavior for large data sets, but it processes the 1bn-edge case on
> a single 60GB node in about 20 minutes.  It degrades gracefully along the
> O(N^2) curve, and additional memory reduces the runtime.
>
>
>
> *John Lilley*
>
>
>
> *From:* Ovidiu-Cristian MARCU [mailto:ovidiu-cristian.ma...@inria.fr]
> *Sent:* Friday, March 11, 2016 8:14 AM
> *To:* John Lilley <john.lil...@redpoint.net>
> *Cc:* lihu <lihu...@gmail.com>; Andrew A <andrew.a...@gmail.com>;
> u...@spark.incubator.apache.org
> *Subject:* Re: Graphx
>
>
>
> Hi,
>
>
>
> I wonder what version of Spark and what parameter configuration you used.
>
> I was able to run CC on 1.8bn edges in about 8 minutes (23 iterations)
> using 16 nodes with around 80GB RAM each (Spark 1.5, default parameters).
>
> John: I suppose your C++ app (algorithm) does not scale out, since you used
> only one node.
>
> I don’t understand how RDD serialization could take excessive time:
> excessive compared to the total time, or to some other expected time?
>
>
>
> For measuring the different RDD times, you have the event logs, the UI
> console, and a bunch of papers describing how to measure different things.
> lihu: did you use some incomplete tool, or what exactly are you looking for?
>
>
>
> Best,
>
> Ovidiu
>
>
>
> On 11 Mar 2016, at 16:02, John Lilley <john.lil...@redpoint.net> wrote:
>
>
>
> A colleague did the experiments, and I don’t know exactly how he observed
> that.  I think it was indirect: from the Spark diagnostics indicating the
> amount of I/O, he deduced that this was RDD serialization.  Also, when he
> added light compression to RDD serialization, matters improved.
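>
> The standard knobs for that are along these lines (a sketch assuming Spark
> 1.x; not necessarily the exact settings he used):
>
> import org.apache.spark.SparkConf
>
> val conf = new SparkConf()
>   // Compress serialized RDD partitions (off by default).
>   .set("spark.rdd.compress", "true")
>   // Use a light, fast codec for that compression.
>   .set("spark.io.compression.codec", "lz4")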
>
>
>
> *John Lilley*
>
> Chief Architect, RedPoint Global Inc.
>
> T: +1 303 541 1516  *| *M: +1 720 938 5761 *|* F: +1 781-705-2077
>
> Skype: jlilley.redpoint *|* john.lil...@redpoint.net *|* www.redpoint.net
>
>
>
> *From:* lihu [mailto:lihu...@gmail.com]
> *Sent:* Friday, March 11, 2016 7:58 AM
> *To:* John Lilley <john.lil...@redpoint.net>
> *Cc:* Andrew A <andrew.a...@gmail.com>; u...@spark.incubator.apache.org
> *Subject:* Re: Graphx
>
>
>
> Hi, John:
>
>        I am very interested in your experiment. How did you determine that
> RDD serialization cost a lot of time: from the logs, or from some other
> tools?
>
>
>
> On Fri, Mar 11, 2016 at 8:46 PM, John Lilley <john.lil...@redpoint.net>
> wrote:
>
> Andrew,
>
>
>
> We conducted some tests using GraphX to solve the connected-components
> problem and were disappointed.  On 8 nodes of 16GB each, we could not get
> above 100M edges.  On 8 nodes of 60GB each, we could not process 1bn
> edges.  RDD serialization would take excessive time, and then we would get
> failures.  By contrast, we have a C++ algorithm that solves 1bn edges using
> memory+disk on a single 16GB node in about an hour.  I think that a very
> large cluster would do better, but we did not explore that.
>
>
>
> *John Lilley*
>
> Chief Architect, RedPoint Global Inc.
>
> T: +1 303 541 1516  *| *M: +1 720 938 5761 *|* F: +1 781-705-2077
>
> Skype: jlilley.redpoint *|* john.lil...@redpoint.net *|* www.redpoint.net
>
>
>
> *From:* Andrew A [mailto:andrew.a...@gmail.com]
> *Sent:* Thursday, March 10, 2016 2:44 PM
> *To:* u...@spark.incubator.apache.org
> *Subject:* Graphx
>
>
>
> Hi, is there anyone who uses GraphX in production? What is the maximum size
> of graph you have processed with Spark, and what cluster do you use for it?
>
> I tried calculating PageRank on the ~1GB LiveJournal (LJ) edge dataset with
> LiveJournalPageRank from the Spark examples, and I ran into large shuffles
> produced by Spark, which failed my Spark job.
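>
> The core of that example boils down to roughly this (a minimal sketch,
> assuming the Scala GraphX API, an existing SparkContext sc, and a local
> copy of the edge file):
>
> import org.apache.spark.graphx.GraphLoader
>
> val graph = GraphLoader.edgeListFile(sc, "soc-LiveJournal1.txt")
> // Run PageRank until the ranks converge within the given tolerance.
> val ranks = graph.pageRank(tol = 0.0001).vertices
> ranks.take(5).foreach(println)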
>
> Thank you,
>
> Andrew
>
>
>
>
>
>
>



-- 
Thanks,
-Khaled
