I'm not sure about your data and model size, but intuitively there is a tradeoff between parallelism and network overhead. For a given data set and model there is an optimum cluster size, and performance may degrade once you grow past it. You may want to test a larger data set if you want to do a performance benchmark.
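The tradeoff above can be sketched with a toy cost model: compute time shrinks with more workers while per-worker coordination/network cost grows, so total time has a minimum. The constants below are made up for illustration, not measured from this benchmark.

```python
# Toy model of the parallelism-vs-overhead tradeoff (illustrative constants,
# not measurements from the KMeans job in this thread).
def job_time(workers, compute_s=600.0, overhead_per_worker_s=15.0):
    """Total time = perfectly parallel compute + per-worker network/coordination cost."""
    return compute_s / workers + overhead_per_worker_s * workers

times = {w: round(job_time(w), 1) for w in (1, 2, 4, 6, 8, 16)}
best = min(times, key=times.get)  # cluster size with the lowest modeled time
print(times)   # time keeps falling, bottoms out, then rises again
print(best)    # the modeled optimum cluster size
```

With these (hypothetical) constants the minimum lands at 6 workers; past that point the added overhead outweighs the extra parallelism, which matches the intuition that scaling can flatten or even regress.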
Thanks.
Zhan Zhang

On Dec 11, 2015, at 9:34 AM, Wei Da <xwd0...@qq.com> wrote:

Hi all,

I have done a test with different hardware configurations of Spark 1.5.0. A KMeans algorithm was run in four different Spark environments: the first in local mode, the other three in cluster mode, with all nodes having the same CPU (6 cores) and memory (8 GB). The running times are recorded below. I expected performance to improve as the number of workers increases, but the results show no obvious improvement. Does anybody know the reason? Thanks a lot in advance!

The test data has about 2.6 million rows; the input file is about 810 MB and is stored in HDFS.

[image omitted: running times]

Following is a snapshot of the Spark WebUI:

[image omitted: Spark WebUI]

Wei Da
xwd0...@qq.com