Does spark performance really scale out with multiple machines?

2015-06-15 Thread Wang, Ningjun (LNG-NPV)
I am trying to measure how Spark standalone cluster performance scales out across multiple machines. I ran a test that trains an SVM model, which is heavy on in-memory computation. I measured the run time on Spark standalone clusters of 1 to 3 nodes; the results are as follows:
1 node: 35 minutes
2 nodes: 30.1

Re: Does spark performance really scale out with multiple machines?

2015-06-15 Thread William Briggs
There are a lot of variables to consider. I'm not an expert on Spark, and my ML knowledge is rudimentary at best, but here are some questions whose answers might help us to help you:
- What type of Spark cluster are you running (e.g., Stand-alone, Mesos, YARN)?
- What does the HTTP UI

Re: Does spark performance really scale out with multiple machines?

2015-06-15 Thread William Briggs
I just wanted to clarify - when I said you hit your maximum level of parallelism, I meant that the default number of partitions might not be large enough to take advantage of more hardware, not that there was no way to increase your parallelism - the documentation I linked gives a few suggestions
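
The point about partitions limiting parallelism can be illustrated with a toy model. This is a minimal sketch in plain Python, not Spark code: the function names and the figure of 4 cores per node are assumptions made for illustration, and it relies only on the general rule that one task processes one partition, so a stage never runs more tasks at once than it has partitions. In a real job the partition count would typically be raised with something like `RDD.repartition` or the `spark.default.parallelism` setting.

```python
import math


def max_concurrent_tasks(num_partitions: int, total_cores: int) -> int:
    """Upper bound on tasks running at once: one task per partition,
    and at most one running task per available core."""
    if num_partitions <= 0 or total_cores <= 0:
        raise ValueError("num_partitions and total_cores must be positive")
    return min(num_partitions, total_cores)


def task_waves(num_partitions: int, total_cores: int) -> int:
    """Number of sequential 'waves' needed to run every task of a stage."""
    concurrent = max_concurrent_tasks(num_partitions, total_cores)
    return math.ceil(num_partitions / concurrent)


if __name__ == "__main__":
    cores_per_node = 4  # assumed value, not taken from the thread
    for partitions in (4, 24):
        for nodes in (1, 2, 3):
            cores = nodes * cores_per_node
            print(
                f"partitions={partitions} nodes={nodes} "
                f"concurrent={max_concurrent_tasks(partitions, cores)} "
                f"waves={task_waves(partitions, cores)}"
            )
```

Under these assumptions, a stage with 4 partitions keeps only 4 cores busy no matter how many nodes are added, while a stage with 24 partitions can use all 12 cores of a three-node cluster.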