I'm trying to measure how Spark standalone cluster performance scales out with multiple machines. I ran a test training an SVM model, which is heavy on in-memory computation. I measured the run time on a Spark standalone cluster of 1-3 nodes; the results are as follows:
1 node: 35 minutes
2 nodes: 30.1 minutes
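For reference, a minimal sketch of the kind of timing driver this test might use, assuming MLlib's SVMWithSGD; the input path and iteration count are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.util.MLUtils

    object SvmTiming {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("svm-scaling-test"))

        // Hypothetical LIBSVM-format input; cache and materialize it first
        // so load time is excluded from the measurement.
        val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/train.libsvm").cache()
        data.count()

        val start = System.nanoTime()
        SVMWithSGD.train(data, 100) // 100 iterations is a placeholder
        val minutes = (System.nanoTime() - start) / 1e9 / 60
        println(f"training took $minutes%.1f minutes")

        sc.stop()
      }
    }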
There are a lot of variables to consider. I'm not an expert on Spark, and my ML knowledge is rudimentary at best, but here are some questions whose answers might help us help you:
- What type of Spark cluster are you running (e.g., Standalone, Mesos, YARN)?
- What does the HTTP UI show?
I just wanted to clarify: when I said you hit your maximum level of parallelism, I meant that the default number of partitions might not be large enough to take advantage of more hardware, not that there is no way to increase your parallelism. The documentation I linked gives a few suggestions.
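As a sketch of two common approaches (the path, partition counts, and app name here are assumptions, not taken from the linked docs):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.util.MLUtils

    // Raise the default used by shuffles and other operations that
    // don't specify a partition count.
    val sc = new SparkContext(new SparkConf()
      .setAppName("partitioning-sketch")
      .set("spark.default.parallelism", "48"))

    // Ask for more partitions at load time (minPartitions is a lower-bound
    // hint); numFeatures = -1 lets MLlib infer the dimensionality.
    val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/train.libsvm",
      numFeatures = -1, minPartitions = 48)

    // Or widen an existing RDD so there are enough tasks to keep every
    // core on every node busy.
    val wider = data.repartition(48).cache()
    println(s"partitions: ${wider.partitions.length}")

A common rule of thumb is 2-3 tasks per CPU core in the cluster; with too few partitions, adding nodes just adds idle cores.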