Thank you for the reply Deepak.

I know it will work with more executors / more memory per executor; we actually 
ran a bunch of experiments with various setups. I'm just trying to confirm 
whether the limits we are hitting are expected, or whether there are other 
configuration parameters we haven't tried yet that would push those limits 
further. Without any tuning, the limits on what we could run were much worse.

The errors vary: executors lost after the 10-minute heartbeat timeout, 
out-of-memory errors, or the job simply making no progress (not completing any 
tasks) for many hours, after which we'd kill it.
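
For reference, those failures map to standard Spark knobs; a minimal sketch 
with Spark 1.6 names (the 600s value is just our reading of the 10-minute 
window, not a recommendation):

  import org.apache.spark.SparkConf

  // Timeout/heartbeat settings behind the "executor lost" errors.
  // 600s is assumed to match the 10-minute heartbeat window we mention.
  val conf = new SparkConf()
    .set("spark.network.timeout", "600s")           // executor declared lost after this
    .set("spark.executor.heartbeatInterval", "10s") // how often executors report to the driver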

Maja

From: Deepak Goel <deic...@gmail.com>
Date: Wednesday, June 15, 2016 at 7:13 PM
To: Maja Kabiljo <majakabi...@fb.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: GraphX performance and settings


I am not an expert, but some thoughts inline...

On Jun 16, 2016 6:31 AM, "Maja Kabiljo" <majakabi...@fb.com> wrote:
>
> Hi,
>
> We are running some experiments with GraphX in order to compare it with other 
> systems. There are multiple settings which significantly affect performance, 
> and we experimented a lot in order to tune them well. I'll share the best 
> settings we found so far and the results we got with them. We would really 
> appreciate it if anyone who has used GraphX before has advice on what else 
> could make it even better, or can confirm that these results are as good as 
> it gets.
>
> Algorithms we used are pagerank and connected components. We used the Twitter 
> and UK graphs from the GraphX paper 
> (https://amplab.cs.berkeley.edu/wp-content/uploads/2014/09/graphx.pdf), 
> and also generated graphs with properties similar to the Facebook social 
> graph with various numbers of edges. Apart from performance, we tried to see 
> the minimum amount of resources it requires to handle a graph of a given 
> size.
>
> We ran experiments using Spark 1.6.1, on machines which have 20 cores with 
> 2-way SMT, always fixing the number of executors (min=max=initial), giving 
> 40GB or 80GB per executor, and making sure we ran only a single executor per 
> machine.

*******Deepak*******
I guess you have 16 machines in your test. Is that right?
******Deepak*******

> Additionally we used the following (wired together in the sketch below):
> spark.shuffle.manager=hash, spark.shuffle.service.enabled=false
> Parallel GC
> PartitionStrategy.EdgePartition2D
> 8*numberOfExecutors partitions
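>
> A minimal sketch of that setup (Spark 1.6 API; the path and the executor 
> count are placeholders, not our actual values):
>
>   import org.apache.spark.{SparkConf, SparkContext}
>   import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}
>
>   val conf = new SparkConf()
>     .set("spark.shuffle.manager", "hash")
>     .set("spark.shuffle.service.enabled", "false")
>     .set("spark.executor.extraJavaOptions", "-XX:+UseParallelGC")
>   val sc = new SparkContext(conf)
>
>   val numExecutors = 4 // we always fix min = max = initial
>   val graph = GraphLoader
>     .edgeListFile(sc, "hdfs://.../edges", numEdgePartitions = 8 * numExecutors)
>     .partitionBy(PartitionStrategy.EdgePartition2D, 8 * numExecutors)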
> Here are some data points which we got:
> Running on a Facebook-like graph with 2 billion edges, using 4 executors with 
> 80GB each, it took 451 seconds to do 20 iterations of pagerank and 236 
> seconds to find connected components. It failed when we tried to use 2 
> executors, or 4 executors with 40GB each.
> For the graph with 10 billion edges we needed 16 executors with 80GB each (it 
> failed with 8): 1041 seconds for 20 iterations of pagerank and 716 seconds 
> for connected components.
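>
> (For reference, these runs use the stock GraphX entry points; a sketch, 
> assuming the graph built as in the setup sketch above:)
>
>   // 20 iterations of static pagerank, then connected components.
>   val ranks = graph.staticPageRank(20).vertices
>   val cc = graph.connectedComponents().vertices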

******Deepak*****
The executors are not scaling linearly; you should need at most 10 executors. 
Also, what error does it show with 8 executors?
*****Deepak******

> Twitter-2010 graph (1.5 billion edges), 8 executors, 40GB each: pagerank 
> 473s, connected components 264s. With 4 executors, 80GB each, it worked but 
> was struggling (pr 2475s, cc 4499s); with 8 executors, 80GB each: pr 362s, 
> cc 255s.

*****Deepak*****
For 4 executors, can you try with 160GB? Also, if you could spell out the 
system statistics during the test, that would be great. My guess is that with 
4 executors a lot of spilling is happening.
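
If it is spilling, the Spark 1.6 unified-memory knobs would be the first thing 
I'd check (just my guess; the values shown are the 1.6 defaults):

  import org.apache.spark.SparkConf

  // Spark 1.6 unified memory manager settings (defaults shown).
  // A smaller spark.memory.fraction leaves less room before spilling to disk.
  val conf = new SparkConf()
    .set("spark.memory.fraction", "0.75")       // heap share for execution + storage
    .set("spark.memory.storageFraction", "0.5") // part of that protected from eviction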
*****Deepak***

> One more thing: we were not able to reproduce what's mentioned in the paper 
> about fault tolerance (section 5.2). If we kill an executor during the first 
> few iterations it recovers successfully, but if one is killed in a later 
> iteration, reconstruction of each iteration takes exponentially longer and 
> doesn't finish even after letting it run for a few hours. Are there some 
> additional parameters which we need to set in order for this to work?
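>
> (One thing we have not tried yet, purely an assumption on our side rather 
> than anything from the paper, is truncating the lineage with periodic 
> checkpointing, roughly like this:)
>
>   // Sketch: checkpoint the graph every few iterations so recovery does not
>   // have to replay the whole lineage. runOneIteration is a hypothetical
>   // stand-in for a single pagerank step; checkpoint data is written out on
>   // the next action over the graph.
>   sc.setCheckpointDir("hdfs://.../checkpoints")
>   var g = graph
>   for (i <- 1 to 20) {
>     g = runOneIteration(g)          // hypothetical per-iteration update
>     if (i % 5 == 0) g.checkpoint()
>   }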
>
> Any feedback would be highly appreciated!
>
> Thank you,
> Maja
