Hi all,

I have a question regarding the PowerIterationClusteringExample.
I have adjusted the code so that it reads a file via
`sc.textFile("path/to/input")`, which works fine.
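
For reference, my adjustment looks roughly like this (a sketch against the Spark 2.0.2 MLlib API; the parsing of the similarity triples, and the values for k and maxIterations, are placeholders for my actual setup):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.PowerIterationClustering

object PICFromFile {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("PowerIterationClusteringExample"))

    // Read similarity triples "srcId dstId similarity", one per line.
    val similarities = sc.textFile("path/to/input").map { line =>
      val parts = line.split(' ')
      (parts(0).toLong, parts(1).toLong, parts(2).toDouble)
    }

    val model = new PowerIterationClustering()
      .setK(2)              // example value
      .setMaxIterations(20) // example value
      .run(similarities)

    model.assignments.collect().foreach { a =>
      println(s"${a.id} -> ${a.cluster}")
    }

    sc.stop()
  }
}
```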

Now I wanted to benchmark the algorithm with different numbers of nodes to see
how well the implementation scales. As a testbed I have up to 32 nodes
available, each with 16 cores, running Spark 2.0.2 on YARN.
For my smallest input data set (16 MB), the runtime barely changes whether I
use 1, 2, 4, 8, 16, or 32 nodes (always ~1.5 minutes).
Same behavior for my largest data set (2.3 GB): the runtime stays around 1 h
whether I use 16 or 32 nodes.

I was expecting that when I, for example, double the number of nodes, the
runtime would shrink.
For setting up my cluster environment I tried different suggestions from
this paper: https://hal.inria.fr/hal-01347638v1/document

Has someone experienced the same? Or does someone have suggestions about what
might have gone wrong?

Thanks in advance!
Lydia
