Also, we keep the Node info as minimal as needed for connected components, and
rejoin the full data later.
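
Roughly, it looks like this (a sketch only; the names and types are illustrative,
not our actual code):

  import org.apache.spark.graphx.{Edge, Graph, VertexId, VertexRDD}
  import org.apache.spark.rdd.RDD

  // Build the graph with a throwaway vertex attribute so CC only shuffles ids,
  // then join the component id back onto the full records afterwards.
  def ccThenRejoin(edges: RDD[Edge[Int]],
                   fullRecords: RDD[(VertexId, String)]): RDD[(VertexId, (String, VertexId))] = {
    val slim = Graph.fromEdges(edges, defaultValue = 0)
    val components: VertexRDD[VertexId] = slim.connectedComponents().vertices
    fullRecords.join(components)
  }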

Alexis

On Fri, Mar 11, 2016 at 10:12 AM, Alexander Pivovarov <apivova...@gmail.com>
wrote:

> we use it in prod
>
> 70 boxes, 61GB RAM each
>
> GraphX Connected Components works fine on 250M Vertices and 1B Edges
> (takes about 5-10 min)
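>
> For reference, the basic call looks roughly like this (the path is just an
> example):
>
>   import org.apache.spark.graphx.GraphLoader
>
>   val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")
>   val cc = graph.connectedComponents().vertices    // (vertexId, componentId) pairs
>   println(cc.map(_._2).distinct().count())         // number of components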
>
> Spark likes memory, so use r3.2xlarge boxes (61GB).
> For example, 10 x r3.2xlarge (61GB) boxes work much faster than 20 x r3.xlarge
> (30.5GB) boxes (especially if you have skewed data).
>
> Also, use checkpoints before and after Connected Components to reduce DAG
> delays
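>
> Something along these lines (the checkpoint dir is just an example; the
> checkpoint is only written once the data is materialized by an action):
>
>   sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")
>
>   graph.checkpoint()                              // before CC: truncate the input lineage
>   graph.vertices.count(); graph.edges.count()     // actions so the checkpoint gets written
>
>   val cc = graph.connectedComponents().vertices
>   cc.checkpoint()                                 // after CC: keep downstream DAGs short
>   cc.count()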
>
> You can also try enabling Kryo and registering the classes used in your RDDs.
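>
> For example (GraphXUtils registers the GraphX-internal classes; MyVertexAttr
> below is a stand-in for whatever you store per vertex):
>
>   import org.apache.spark.SparkConf
>   import org.apache.spark.graphx.GraphXUtils
>
>   case class MyVertexAttr(name: String)   // stand-in for your own attribute class
>
>   val conf = new SparkConf()
>     .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>   GraphXUtils.registerKryoClasses(conf)                     // GraphX-internal message/vertex types
>   conf.registerKryoClasses(Array(classOf[MyVertexAttr]))    // plus your own RDD element classes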
>
>
> On Fri, Mar 11, 2016 at 8:07 AM, John Lilley <john.lil...@redpoint.net>
> wrote:
>
>> I suppose for a 2.6bn-edge case we’d need Long:
>>
>> // Emits one "a b" edge per line, forming chain components of <groupsize> vertices each.
>> public class GenCCInput {
>>
>>   public static void main(String[] args) {
>>     if (args.length != 2) {
>>       System.err.println("Usage: \njava GenCCInput <edges> <groupsize>");
>>       System.exit(-1);
>>     }
>>     long edges = Long.parseLong(args[0]);
>>     long groupSize = Long.parseLong(args[1]);
>>     long currentEdge = 1;
>>     long currentGroupSize = 0;
>>     for (long i = 0; i < edges; i++) {
>>       System.out.println(currentEdge + " " + (currentEdge + 1));
>>       if (currentGroupSize == 0) {
>>         currentGroupSize = 2;     // first edge of a component spans two vertices
>>       } else {
>>         currentGroupSize++;
>>       }
>>       if (currentGroupSize >= groupSize) {
>>         currentGroupSize = 0;     // component complete; jump past its last vertex
>>         currentEdge += 2;         // so the next component starts on a fresh id
>>       } else {
>>         currentEdge++;
>>       }
>>     }
>>   }
>> }
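>>
>> (Run as e.g. "java GenCCInput 2600000000 10 > edges.txt" to get the 2.6bn-edge,
>> 10-vertex-per-component input discussed below.)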
>>
>>
>>
>> *John Lilley*
>>
>> Chief Architect, RedPoint Global Inc.
>>
>> T: +1 303 541 1516  *| *M: +1 720 938 5761 *|* F: +1 781-705-2077
>>
>> Skype: jlilley.redpoint *|* john.lil...@redpoint.net *|* www.redpoint.net
>>
>>
>>
>> *From:* John Lilley [mailto:john.lil...@redpoint.net]
>> *Sent:* Friday, March 11, 2016 8:46 AM
>> *To:* Ovidiu-Cristian MARCU <ovidiu-cristian.ma...@inria.fr>
>> *Cc:* lihu <lihu...@gmail.com>; Andrew A <andrew.a...@gmail.com>;
>> u...@spark.incubator.apache.org; Geoff Thompson <
>> geoff.thomp...@redpoint.net>
>> *Subject:* RE: Graphx
>>
>>
>>
>> Ovidiu,
>>
>>
>>
>> IMHO, this is one of the biggest issues facing GraphX and Spark.  There
>> are a lot of knobs and levers to pull to affect performance, with very
>> little guidance about which settings work in general.  We cannot ship
>> software that requires end-user tuning; it just has to work.  Unfortunately,
>> GraphX seems very sensitive to working-set size relative to available RAM,
>> and it fails catastrophically rather than gracefully when the working set is
>> too large.  It is also very sensitive to the nature of the data.  For example,
>> if we build a test file with an input-edge representation like:
>>
>> 1 2
>> 2 3
>> 3 4
>> 5 6
>> 6 7
>> 7 8
>> …
>>
>> this represents a graph with connected components in groups of four vertices.
>> We found experimentally that when this data is input in clustered order, the
>> required memory is lower and the runtime is much faster than when the data is
>> input in random order.  This makes intuitive sense because of the additional
>> communication required for the random order.
>>
>>
>>
>> Our 1bn-edge test case was of this same form, input in clustered order,
>> with groups of 10 vertices per component.  It failed at 8 x 60GB.  This is
>> the kind of data that our application processes, so it is a realistic test
>> for us.  I’ve found that social media test data sets tend to follow
>> power-law distributions, and that GraphX has much less trouble with them.
>>
>>
>>
>> A comparable test scaled to your cluster (16 x 80GB) would be 2.6bn edges
>> in 10-vertex components using the synthetic test input I describe above.  I
>> would be curious to know if this works and what settings you use to
>> succeed, and if it continues to succeed for random input order.
>>
>>
>>
>> As for the C++ algorithm, it scales across multiple cores.  It exhibits O(N^2)
>> behavior for large data sets, but it processes the 1bn-edge case on a
>> single 60GB node in about 20 minutes.  It degrades gracefully along the
>> O(N^2) curve, and additional memory reduces the runtime.
>>
>>
>>
>> *John Lilley*
>>
>>
>>
>> *From:* Ovidiu-Cristian MARCU [mailto:ovidiu-cristian.ma...@inria.fr]
>> *Sent:* Friday, March 11, 2016 8:14 AM
>> *To:* John Lilley <john.lil...@redpoint.net>
>> *Cc:* lihu <lihu...@gmail.com>; Andrew A <andrew.a...@gmail.com>;
>> u...@spark.incubator.apache.org
>> *Subject:* Re: Graphx
>>
>>
>>
>> Hi,
>>
>>
>>
>> I wonder what version of Spark and what parameter configuration you used.
>>
>> I was able to run CC for 1.8bn edges in about 8 minutes (23 iterations)
>> using 16 nodes with around 80GB RAM each (Spark 1.5, default parameters).
>>
>> John: I suppose your C++ app (algorithm) does not scale if you used only
>> one node.
>>
>> I don’t understand how the RDD serialization is taking excessive time:
>> compared to the total time, or to some other expected time?
>>
>>
>>
>> For the different RDD timings you have the event logs, the UI console, and a
>> bunch of papers describing how to measure different things. lihu: did you use
>> some incomplete tool, or what are you looking for?
>>
>>
>>
>> Best,
>>
>> Ovidiu
>>
>>
>>
>> On 11 Mar 2016, at 16:02, John Lilley <john.lil...@redpoint.net> wrote:
>>
>>
>>
>> A colleague did the experiments and I don’t know exactly how he observed
>> that.  I think it was indirect: from the Spark diagnostics indicating the
>> amount of I/O, he deduced that this was RDD serialization.  Also, when he
>> added light compression to RDD serialization, this improved matters.
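>>
>> Presumably something along these lines (I don’t know the exact settings he
>> used; the values here are just illustrative):
>>
>>   import org.apache.spark.SparkConf
>>
>>   val conf = new SparkConf()
>>     .set("spark.rdd.compress", "true")             // compress serialized RDD partitions
>>     .set("spark.io.compression.codec", "lz4")      // a light/fast codec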
>>
>>
>>
>> *John Lilley*
>>
>> Chief Architect, RedPoint Global Inc.
>>
>> T: +1 303 541 1516  *| *M: +1 720 938 5761 *|* F: +1 781-705-2077
>>
>> Skype: jlilley.redpoint *|* john.lil...@redpoint.net *|* www.redpoint.net
>>
>>
>>
>> *From:* lihu [mailto:lihu...@gmail.com]
>> *Sent:* Friday, March 11, 2016 7:58 AM
>> *To:* John Lilley <john.lil...@redpoint.net>
>> *Cc:* Andrew A <andrew.a...@gmail.com>; u...@spark.incubator.apache.org
>> *Subject:* Re: Graphx
>>
>>
>>
>> Hi, John:
>>
>>        I am very interested in your experiment. How did you find that RDD
>> serialization cost a lot of time: from the logs or some other tools?
>>
>>
>>
>> On Fri, Mar 11, 2016 at 8:46 PM, John Lilley <john.lil...@redpoint.net>
>> wrote:
>>
>> Andrew,
>>
>>
>>
>> We conducted some tests using GraphX to solve the connected-components
>> problem and were disappointed.  On 8 nodes of 16GB each, we could not get
>> above 100M edges.  On 8 nodes of 60GB each, we could not process 1bn edges.
>> RDD serialization would take excessive time and then we would get failures.
>> By contrast, we have a C++ algorithm that solves 1bn edges using memory+disk
>> on a single 16GB node in about an hour.  I think that a very large cluster
>> will do better, but we did not explore that.
>>
>>
>>
>> *John Lilley*
>>
>> Chief Architect, RedPoint Global Inc.
>>
>> T: +1 303 541 1516  *| *M: +1 720 938 5761 *|* F: +1 781-705-2077
>>
>> Skype: jlilley.redpoint *|* john.lil...@redpoint.net *|* www.redpoint.net
>>
>>
>>
>> *From:* Andrew A [mailto:andrew.a...@gmail.com]
>> *Sent:* Thursday, March 10, 2016 2:44 PM
>> *To:* u...@spark.incubator.apache.org
>> *Subject:* Graphx
>>
>>
>>
>> Hi, is there anyone who uses GraphX in production? What is the maximum size
>> of graph you have processed with Spark, and what cluster do you use for it?
>>
>> I tried to calculate PageRank for a 1 GB edge list (the LJ dataset for
>> LiveJournalPageRank from the Spark examples) and ran into large shuffles
>> produced by Spark, which failed my Spark job.
>>
>> Thank you,
>>
>> Andrew
>>
>>
>>
>
>
