[ https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16020693#comment-16020693 ]
zhengruifeng edited comment on SPARK-14174 at 5/23/17 6:12 AM:
---------------------------------------------------------------

[~mlnick] I ran some performance experiments last week. However, I got a strange result: the {{MiniBatch}} process yields only a very small speedup, which is quite different from the earlier tests run about 14 months ago.

{code}
Env: Spark 2.1.0, --driver-memory 16G --executor-memory 1G --num-executors 16

Low Dimensionality & Dense

val random = new Random(123)
val n = 1000000000
val dim = 10
val rdd = sc.parallelize(1 to n).map(i => Vectors.dense(Array.fill(dim)(random.nextDouble()))).persist()

MiniBatchFraction    Duration per iter    Cost after 3 iters
1                    125471               7.807840581259936E8
0.1                  99093                7.807848215462652E8
0.01                 95800                7.807861719437395E8
{code}

The {{high dimension, dense}} tests gave similar results. I'm going to study this further.

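To put the table in the comment above in perspective, the relative speedups can be worked out directly from the reported per-iteration durations. The following is only a small arithmetic sketch: the object and method names are hypothetical, and the durations are used as unitless numbers because the comment does not state a time unit. It gives roughly 1.27x for fraction 0.1 and 1.31x for fraction 0.01.

```scala
object MiniBatchSpeedup {
  // (miniBatchFraction, duration per iteration) pairs copied from the comment's table.
  // The comment does not state the time unit, so the values are treated as unitless.
  val durations: Seq[(Double, Double)] =
    Seq(1.0 -> 125471.0, 0.1 -> 99093.0, 0.01 -> 95800.0)

  // Speedup of each run relative to the first entry (the full-batch run).
  def speedups(runs: Seq[(Double, Double)]): Seq[(Double, Double)] = {
    val baseline = runs.head._2
    runs.map { case (fraction, duration) => fraction -> baseline / duration }
  }

  def main(args: Array[String]): Unit =
    speedups(durations).foreach { case (fraction, s) =>
      println(f"miniBatchFraction=$fraction%.2f speedup=$s%.2fx")
    }
}
```

By comparison, the timings in the original issue description (2876 s, 263 s, 90 s) correspond to speedups of roughly 11x and 32x, which is the gap the comment calls strange.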
> Accelerate KMeans via Mini-Batch EM
> -----------------------------------
>
>                 Key: SPARK-14174
>                 URL: https://issues.apache.org/jira/browse/SPARK-14174
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: zhengruifeng
>
> MiniBatchKMeans is a variant of the KMeans algorithm that uses
> mini-batches to reduce the computation time while still attempting to
> optimise the same objective function. Mini-batches are subsets of the input
> data, randomly sampled in each training iteration. These mini-batches
> drastically reduce the amount of computation required to converge to a local
> solution. In contrast to other algorithms that reduce the convergence time of
> k-means, mini-batch k-means produces results that are generally only slightly
> worse than the standard algorithm.
> I have implemented mini-batch k-means in MLlib, and the speedup is
> significant.
> In the following code, the mini-batch KMeans implementation is named XMeans.
> {code}
> val path = "/tmp/mnist8m.scale"
> val data = MLUtils.loadLibSVMFile(sc, path)
> val vecs = data.map(_.features).persist()
> val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1,
> initializationMode="k-means||", seed=123L)
> km.computeCost(vecs)
> res0: Double = 3.317029898599564E8
> val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1,
> initializationMode="k-means||", miniBatchFraction=0.1, seed=123L)
> xm.computeCost(vecs)
> res1: Double = 3.3169865959604424E8
> val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1,
> initializationMode="k-means||", miniBatchFraction=0.01, seed=123L)
> xm2.computeCost(vecs)
> res2: Double = 3.317195831216454E8
> {code}
> All three training runs above reached the maximum of 10 iterations.
> We can see that the WSSSEs are almost the same.
> However, their running times differ significantly:
> {code}
> KMeans                              2876sec
> MiniBatch KMeans (fraction=0.1)      263sec
> MiniBatch KMeans (fraction=0.01)      90sec
> {code}
> With an appropriate fraction, the larger the dataset, the greater the speedup.
> The dataset used above has 8,100,000 samples and 784 features. It can be
> downloaded here
> (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2)
> Comparison of K-Means and MiniBatchKMeans in sklearn:
> http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
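
For readers unfamiliar with the technique, the mini-batch idea from the issue description (sample a random subset of the data in each iteration and update the centers from that subset only) can be sketched in plain Scala without Spark. This is a minimal illustration under stated assumptions, not the XMeans implementation discussed in the issue: the object name, the method names, and the simple update rule (replace each center with the mean of the sampled points assigned to it) are all hypothetical choices made for this sketch.

```scala
import scala.util.Random

// Illustrative sketch only; NOT the XMeans code referenced in the issue.
object MiniBatchKMeansSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double = {
    var sum = 0.0
    var i = 0
    while (i < a.length) { val d = a(i) - b(i); sum += d * d; i += 1 }
    sum
  }

  // Index of the center closest to p.
  def closest(p: Point, centers: IndexedSeq[Point]): Int =
    centers.indices.minBy(j => sqDist(p, centers(j)))

  // One iteration: sample roughly `fraction` of the data, assign the sampled
  // points to their closest centers, and move each center to the mean of its
  // assigned sample points. Centers with no assigned points are left unchanged.
  def step(data: IndexedSeq[Point], centers: IndexedSeq[Point],
           fraction: Double, rng: Random): IndexedSeq[Point] = {
    val batch = data.filter(_ => rng.nextDouble() < fraction)
    val groups = batch.groupBy(p => closest(p, centers))
    centers.indices.map { j =>
      groups.get(j) match {
        case Some(points) =>
          val dim = centers(j).length
          val sum = new Array[Double](dim)
          points.foreach(p => for (i <- 0 until dim) sum(i) += p(i))
          sum.map(_ / points.size)
        case None => centers(j)
      }
    }
  }

  // Sum of squared distances from every point to its closest center,
  // computed on the full dataset (the quantity the issue calls WSSSE).
  def cost(data: IndexedSeq[Point], centers: IndexedSeq[Point]): Double =
    data.map(p => sqDist(p, centers(closest(p, centers)))).sum

  // Run a fixed number of mini-batch iterations from the given initial centers.
  def train(data: IndexedSeq[Point], init: IndexedSeq[Point], fraction: Double,
            iterations: Int, seed: Long): IndexedSeq[Point] = {
    val rng = new Random(seed)
    (1 to iterations).foldLeft(init)((centers, _) => step(data, centers, fraction, rng))
  }
}
```

With `fraction = 1.0` every point is sampled and each step reduces to an ordinary Lloyd iteration, so the per-iteration work in this sketch scales with the sampled fraction only in the assignment and update phases; any fixed per-iteration overhead is unaffected.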