Hi all,

I have a question regarding the PowerIterationClusteringExample. I have adjusted the code so that it reads a file via sc.textFile("path/to/input"), which works fine.
Now I wanted to benchmark the algorithm using different numbers of nodes to see how well the implementation scales. As a testbed I have up to 32 nodes available, each with 16 cores, running Spark 2.0.2 on YARN.

For my smallest input data set (16 MB), the runtime does not really change whether I use 1, 2, 4, 8, 16, or 32 nodes (always about 1.5 minutes). I see the same behavior for my largest data set (2.3 GB): the runtime stays around 1 hour whether I use 16 or 32 nodes. I was expecting the runtime to shrink when I, for example, double the number of nodes.

To set up my cluster environment, I tried different suggestions from this paper: https://hal.inria.fr/hal-01347638v1/document

Has anyone experienced the same? Or does anyone have suggestions as to what might have gone wrong?

Thanks in advance!
Lydia