Thanks Igor,
We are definitely thinking along these lines, but I am hoping to shortcut our 
search of the Spark/GraphX tuning parameter space to find a reasonable set of 
starting points.  There are simultaneous questions of “what should we expect 
from GraphX?” and “what are the best parameters to achieve that?”.

What I’m asking is fairly specific:
-- What is a good set of tuning parameters (partitions, memory?) for a large 
data set that “should” fit into memory on an 8-node cluster with 8GB/node 
available to YARN?  (A strawman sketch follows this list.)
-- Does anyone have or know of sample code that performs well on a real data 
set without adjusting lots of tuning knobs first?
-- How much available YARN memory is required to hold a given number of 
vertices+edges, with enough cushion to be comfortable?  You are giving some 
tantalizing hints (3x as much as I expected…), but no clear indication of how 
much memory should be needed.  Arriving at the answer through experimentation 
isn’t a good approach, because that assumes -- chicken-and-egg problem -- that 
we have already arrived at an optimal configuration.
-- Does GraphX connected-components performance degrade slowly or 
catastrophically when that memory limit is reached?  Are there tuning 
parameters that optimize for data all fitting in memory vs. data that must 
spill?
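
Regarding the first question, here is the kind of strawman I have in mind, 
just to make it concrete.  Every value below is my guess (including one 
executor per node), not a validated setting, and exactly the sort of thing I'm 
hoping someone can correct:

import org.apache.spark.SparkConf

// Strawman for 8 nodes x 8GB YARN memory and 4 cores per node -- all values are guesses.
val conf = new SparkConf()
  .setAppName("graphx-cc")
  .set("spark.executor.instances", "8")               // one executor per node (assumption)
  .set("spark.executor.cores", "4")
  .set("spark.executor.memory", "6g")                 // leave room for the YARN overhead below
  .set("spark.yarn.executor.memoryOverhead", "1024")  // MB of off-heap headroom per container
  .set("spark.default.parallelism", "128")            // ~4x total cores -- or should it be much higher?
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")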

Thanks,
John Lilley

From: Igor Berman [mailto:igor.ber...@gmail.com]
Sent: Saturday, October 10, 2015 12:06 PM
To: John Lilley <john.lil...@redpoint.net>
Cc: user@spark.apache.org; Geoff Thompson <geoff.thomp...@redpoint.net>
Subject: Re: Question about GraphX connected-components

let's start with some basics: maybe you need to split your data into more 
partitions?
spilling depends on your configuration when you create the graph (look for the 
storage level params) and on your global configuration.
in addition, your assumption of 64GB/100M is probably wrong, since Spark 
divides memory into 3 regions - one for in-memory caching, one for shuffling, 
and one as a "workspace" for serialization/deserialization etc. (see the 
fraction parameters).

so depending on the number of partitions, a worker may try to ingest too much 
data at once (#cores * memory pressure of one task per partition).
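
rough arithmetic, with made-up numbers just to illustrate:

// all numbers below are illustrative guesses
val coresPerExecutor = 4
val perTaskGb        = 1.5                           // deserialized partition + shuffle buffers per task
val inFlightGb       = coresPerExecutor * perTaskGb  // ~6 GB consumed by one wave of concurrent tasks
// on an 8 GB container, that leaves little room for the cached graph itself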

there is no such thing as a "right" configuration; it depends on your 
application. You can post your configuration and people will suggest some 
tuning, but the best way is still to try what works for your case based on 
what you see in the Spark UI metrics (as a starting point)

On 10 October 2015 at 00:13, John Lilley 
<john.lil...@redpoint.net> wrote:
Greetings,
We are looking into using the GraphX connected-components algorithm on Hadoop 
for grouping operations.  Our typical data is on the order of 50-200M vertices 
with an edge:vertex ratio between 2 and 30.  While there are pathological cases 
of very large groups, most groups tend to be small.  I am trying to get a handle on 
the level of performance and scaling we should expect, and how to best 
configure GraphX/Spark to get there.  After some trying, we cannot get to 100M 
vertices/edges without running out of memory on a small cluster (8 nodes with 4 
cores and 8GB available for YARN on each node).  This limit seems low, as 
64GB/100M is 640 bytes per vertex, which should be enough.  Is this within 
reason?  Does anyone have a sample they can share that has the right 
configuration for succeeding with this size of data and cluster?  What level 
of performance should we expect?  What happens when the data set exceeds 
memory -- does it spill to disk “nicely” or does it degrade catastrophically?
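
For reference, what we are running is essentially the textbook pattern below 
(simplified; the path argument and partition count are placeholders for our 
real job), in case someone spots something we should be doing differently:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

object ConnectedComponentsJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("graphx-cc"))

    // load the edge list and run connected components with mostly-default settings
    val graph = GraphLoader.edgeListFile(sc, args(0), numEdgePartitions = 256)
      .partitionBy(PartitionStrategy.EdgePartition2D)

    val components = graph.connectedComponents().vertices   // (vertexId, componentId)

    // group sizes, since grouping is what we ultimately need
    val groupSizes = components.map { case (_, comp) => (comp, 1L) }.reduceByKey(_ + _)
    println(s"groups: ${groupSizes.count()}")

    sc.stop()
  }
}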

Thanks,
John Lilley

