I'm not sure about your data and model size, but intuitively there is a tradeoff 
between parallelism and network overhead. For a given data set and model there is 
an optimum cluster size, and performance may actually degrade past that point as 
the cluster grows. You may want to test a larger data set if you want to do a 
performance benchmark.
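If it helps, below is a rough sketch of how such a run could be set up with the 
RDD-based MLlib API in 1.5.0 (Scala). The HDFS path, partition count, k, and 
iteration count are placeholders, not values from your test. Checking that the 
input has enough partitions is one way to confirm the extra workers actually get 
work.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansBench {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("KMeansBench"))

    // Read from HDFS; minPartitions bounds how much parallelism the
    // workers can use (48 is just an example value).
    val raw = sc.textFile("hdfs:///path/to/input", 48)

    // Parse comma-separated features and cache, since KMeans iterates
    // over the data many times.
    val points = raw
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
      .cache()

    val start = System.nanoTime()
    // Train with k = 10 clusters and 20 iterations (placeholder values).
    val model = KMeans.train(points, 10, 20)
    val elapsedSec = (System.nanoTime() - start) / 1e9

    println(s"cost: ${model.computeCost(points)}, time: $elapsedSec s")
    sc.stop()
  }
}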

Thanks.

Zhan Zhang



On Dec 11, 2015, at 9:34 AM, Wei Da <xwd0...@qq.com> wrote:

Hi, all

I have done a test with different HW configurations of Spark 1.5.0. A KMeans 
algorithm has been run in four different Spark environments: the first ran in 
local mode, the other three ran in cluster mode, and all nodes have the same CPU 
(6 cores) and memory (8G). The running times are recorded below. I thought 
performance should improve as the number of workers increases, but the results 
show no obvious improvement. Does anybody know the reason? Thanks a lot in 
advance!

The test data has about 2.6 million rows; the input file is about 810M and is 
stored in HDFS.
[image: running time results]


Following is a snapshot of the Spark WebUI.
[image: Spark WebUI snapshot]

Wei Da
xwd0...@qq.com



