I am trying to measure how Spark standalone cluster performance scales out across multiple machines. I ran a test training an SVM model, which is heavy on in-memory computation. I measured the run time on a Spark standalone cluster of 1 to 3 nodes, with the following results:
1 node: 35 minutes
2 nodes: 30.1 minutes
3 nodes: 30.8 minutes

So the speed does not seem to increase much with more machines. I know there is overhead for coordinating tasks among different machines, but it seems to me the overhead here is over 30% of the total run time. Is this typical? Does anybody see a significant performance increase with more machines? Is there anything I can tune in my Spark cluster to make it scale out with more machines?

Thanks
Ningjun
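P.S. For context, these are the kinds of settings I am wondering about tuning. The values below are placeholders, not my actual configuration; the class name and master URL are made up for illustration:

```shell
# Hypothetical spark-submit invocation -- values are placeholders, not my real setup.
spark-submit \
  --master spark://master-host:7077 \
  --class com.example.TrainSVM \
  --executor-memory 8g \
  --total-executor-cores 12 \
  --conf spark.default.parallelism=24 \
  myapp.jar
```

In particular, I am unsure whether the number of input partitions (relative to total cores across the cluster) and the executor memory/core settings are what limit scale-out, or whether something else dominates.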