[ https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16020693#comment-16020693 ]

zhengruifeng edited comment on SPARK-14174 at 5/23/17 6:12 AM:
---------------------------------------------------------------

[~mlnick] I ran some performance experiments last week. However, I got a 
strange result: the {{MiniBatch}} processing yields only a very small speed-up, 
which is quite different from the early tests about 14 months ago.

{code}
Env: Spark 2.1.0, --driver-memory 16G --executor-memory 1G --num-executors 16
Low Dimensionality & Dense

import scala.util.Random
import org.apache.spark.mllib.linalg.Vectors

val random = new Random(123)
val n = 1000000000
val dim = 10

val rdd = sc.parallelize(1 to n).map(i =>
  Vectors.dense(Array.fill(dim)(random.nextDouble()))).persist()

MiniBatchFraction    Duration per iter    Cost after 3 iters
1                    125471               7.807840581259936E8
0.1                  99093                7.807848215462652E8
0.01                 95800                7.807861719437395E8
{code}

The {{high dimension, dense}} tests produced similar results.
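
For reference, here is a minimal sketch of one mini-batch assignment step, assuming the batch is drawn with {{RDD.sample}} each iteration (the helper below is hypothetical, not the actual patch code). Note that {{sample}} still makes a Bernoulli draw for every record, so the full dataset is scanned on every iteration and only the distance computations shrink with the fraction, which may partly explain the small speed-up:

{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Hypothetical helper: per-center coordinate sums and counts over one mini-batch.
def miniBatchSums(
    data: RDD[Vector],
    centers: Array[Vector],
    fraction: Double,
    seed: Long): Map[Int, (Array[Double], Long)] = {
  data.sample(withReplacement = false, fraction, seed)
    .map { point =>
      // index of the nearest center by squared Euclidean distance
      val best = centers.indices.minBy(i => Vectors.sqdist(centers(i), point))
      (best, (point.toArray, 1L))
    }
    .reduceByKey { case ((sum1, cnt1), (sum2, cnt2)) =>
      // accumulate coordinate-wise sums and point counts for each center
      (sum1.zip(sum2).map { case (a, b) => a + b }, cnt1 + cnt2)
    }
    .collectAsMap()
    .toMap
}
{code}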

I'm going to study it further.



> Accelerate KMeans via Mini-Batch EM
> -----------------------------------
>
>                 Key: SPARK-14174
>                 URL: https://issues.apache.org/jira/browse/SPARK-14174
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: zhengruifeng
>
> The MiniBatchKMeans is a variant of the KMeans algorithm which uses 
> mini-batches to reduce the computation time, while still attempting to 
> optimise the same objective function. Mini-batches are subsets of the input 
> data, randomly sampled in each training iteration. These mini-batches 
> drastically reduce the amount of computation required to converge to a local 
> solution. In contrast to other algorithms that reduce the convergence time of 
> k-means, mini-batch k-means produces results that are generally only slightly 
> worse than the standard algorithm.
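> A quick sketch of the update rule for context: classic mini-batch k-means
> (Sculley, WWW 2010, which is what scikit-learn implements) moves each center
> toward its assigned points with a per-center learning rate of 1/count. The
> helper below is illustrative only, not the proposed API:
> {code}
> // Hypothetical driver-side sketch of one mini-batch update step.
> def miniBatchUpdate(
>     centers: Array[Array[Double]],  // cluster centers, mutated in place
>     counts: Array[Long],            // per-center assignment counts, mutated in place
>     batch: Seq[Array[Double]]): Unit = {
>   batch.foreach { point =>
>     // nearest center by squared Euclidean distance
>     val j = centers.indices.minBy { i =>
>       centers(i).zip(point).map { case (c, p) => (c - p) * (c - p) }.sum
>     }
>     counts(j) += 1
>     val eta = 1.0 / counts(j)       // learning rate decays as the count grows
>     for (d <- centers(j).indices) {
>       centers(j)(d) = (1 - eta) * centers(j)(d) + eta * point(d)
>     }
>   }
> }
> {code}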
> I have implemented mini-batch k-means in MLlib, and the acceleration is really 
> significant.
> The MiniBatch KMeans is named XMeans in the following lines.
> {code}
> val path = "/tmp/mnist8m.scale"
> val data = MLUtils.loadLibSVMFile(sc, path)
> val vecs = data.map(_.features).persist()
> val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1,
>   initializationMode="k-means||", seed=123L)
> km.computeCost(vecs)
> res0: Double = 3.317029898599564E8
> val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1,
>   initializationMode="k-means||", miniBatchFraction=0.1, seed=123L)
> xm.computeCost(vecs)
> res1: Double = 3.3169865959604424E8
> val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1,
>   initializationMode="k-means||", miniBatchFraction=0.01, seed=123L)
> xm2.computeCost(vecs)
> res2: Double = 3.317195831216454E8
> {code}
> All three training runs above reached the maximum number of iterations (10).
> We can see that the WSSSEs are almost the same, while their running times 
> differ significantly:
> {code}
> KMeans                              2876 sec
> MiniBatch KMeans (fraction=0.1)      263 sec
> MiniBatch KMeans (fraction=0.01)      90 sec
> {code}
> With an appropriate fraction, the bigger the dataset, the higher the speedup.
> The dataset used above has 8,100,000 samples with 784 features. It can be 
> downloaded here: 
> (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2)
> A comparison of K-Means and MiniBatchKMeans in scikit-learn: 
> http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py


