zhengruifeng edited a comment on pull request #27473:
URL: https://github.com/apache/spark/pull/27473#issuecomment-624531334


   Test on the first 1M rows of the HIGGS dataset.
   
   Test code:
   ```scala
   
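   import org.apache.spark.ml.clustering._
   import org.apache.spark.storage.StorageLevel
   import org.apache.spark.ml.linalg._
   
   // Load the first 1M rows of HIGGS into a single partition and cache them.
   val df = spark.read.format("libsvm").load("/data1/Datasets/higgs/HIGGS.1m").repartition(1)
   df.persist(StorageLevel.MEMORY_AND_DISK)
   df.count  // materialize the cache before timing anything
   
   // Short initial fit (2 iterations) before the timed runs.
   val gmm = new GaussianMixture().setSeed(0).setK(4).setMaxIter(2).setBlockSize(64)
   gmm.fit(df)
   
   // Fit once per block size and record the wall-clock time in milliseconds.
   val results = Seq(1, 4, 16, 64, 256, 1024, 4096).map { size =>
     val start = System.currentTimeMillis
     val model = gmm.setK(4).setMaxIter(20).setBlockSize(size).fit(df)
     val end = System.currentTimeMillis
     (size, model, end - start)
   }
   
   results.map(_._2.summary.numIter)        // iterations per block size
   results.map(_._2.summary.logLikelihood)  // final log-likelihood per block size
   results.map(_._3)                        // elapsed time (ms) per block size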
   ```
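
The inline `start`/`end` bookkeeping above could also be factored into a small helper. The following is only a sketch: `time` is a hypothetical name that is not part of this PR or of Spark, and it uses `System.nanoTime` rather than `System.currentTimeMillis`, on the assumption that a monotonic clock is preferable for measuring intervals. The demo workload is a plain sum so that the block runs without a Spark session.

```scala
// Hypothetical helper (not part of the PR): run `body` once and return
// its result together with the elapsed wall-clock time in milliseconds.
def time[T](body: => T): (T, Long) = {
  val start = System.nanoTime()
  val result = body
  val elapsedMs = (System.nanoTime() - start) / 1000000L
  (result, elapsedMs)
}

// Stand-in workload so the sketch is self-contained.
val (sum, elapsedMs) = time { (1L to 1000000L).sum }
println(s"sum=$sum elapsedMs=$elapsedMs")
```

In the benchmark, the body of the `map` would then reduce to something like `val (model, ms) = time { gmm.setBlockSize(size).fit(df) }`.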
   
   Results **WITHOUT** native BLAS:
   ```
   scala> results.map(_._2.summary.numIter)
   res3: Seq[Int] = List(20, 20, 20, 20, 20, 20, 20)
   
   scala> results.map(_._2.summary.logLikelihood)
   res4: Seq[Double] = List(-2.3353357834421366E7, -2.3353357834421184E7, -2.3353357834421184E7, -2.3353357834421184E7, -2.3353357834421184E7, -2.3353357834421184E7, -2.3353357834421184E7)
   
   scala> results.map(_._3)
   res5: Seq[Long] = List(105777, 113261, 110608, 106573, 108141, 109825, 113094)
   ```
   
   Surprisingly, without native BLAS there is a small performance regression on dense input: 105777 ms (blockSize=1) -> 106573 ms (blockSize=64).
   
   **blockSize==1**
   
![gmm_1](https://user-images.githubusercontent.com/7322292/81160260-6affe400-8fbc-11ea-82d0-dc63a901a584.png)
   
   **blockSize==1024**
   
![gmm_1024](https://user-images.githubusercontent.com/7322292/81160291-76eba600-8fbc-11ea-9009-9322884a3ac5.png)
   
   
-------------------------------------------------------------------------------------------
   Results **WITH** native BLAS (OPENBLAS_NUM_THREADS=1):
   ```
   scala> results.map(_._2.summary.numIter)
   res3: Seq[Int] = List(20, 20, 20, 20, 20, 20, 20)
   
   scala> results.map(_._2.summary.logLikelihood)
   res4: Seq[Double] = List(-2.3353357834421374E7, -2.3353357834422573E7, -2.3353357834422797E7, -2.335335783442225E7, -2.3353357834422205E7, -2.3353357834422156E7, -2.335335783442218E7)
   
   scala> results.map(_._3)
   res5: Seq[Long] = List(108005, 54975, 39802, 35807, 35027, 36369, 38717)
   ```
   
   With OpenBLAS, larger block sizes give up to about a 3x speedup over blockSize=1 (108005 ms -> 35027 ms at blockSize=256).
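
To make the speedup figure easy to check, the ratios can be computed directly from the timings reported in the OpenBLAS run. This is a minimal sketch: the numbers are copied from the `results.map(_._3)` output above, and the names `blockSizes`, `openBlasMs` and `speedups` are introduced here purely for illustration.

```scala
// Wall-clock times (ms) copied from the OpenBLAS run, in blockSize order.
val blockSizes = Seq(1, 4, 16, 64, 256, 1024, 4096)
val openBlasMs = Seq(108005L, 54975L, 39802L, 35807L, 35027L, 36369L, 38717L)

// Speedup of each block size relative to blockSize == 1.
val baseline = openBlasMs.head.toDouble
val speedups = blockSizes.zip(openBlasMs).map { case (size, ms) =>
  (size, baseline / ms)
}

speedups.foreach { case (size, s) =>
  println(f"blockSize=$size%4d speedup=$s%.2fx")
}
```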
   
   **blockSize==1** with OpenBLAS
   
![gmm_openBlas_1](https://user-images.githubusercontent.com/7322292/81160345-866aef00-8fbc-11ea-9a73-8b740ade880f.png)
   
   **blockSize==1024** with OpenBLAS
   
![gmm_openBlas_1024](https://user-images.githubusercontent.com/7322292/81160352-88cd4900-8fbc-11ea-9f02-be8ceb75f9d2.png)
   
   
   
   
-------------------------------------------------------------------------------------------
   Comparison with master (**WITHOUT** native BLAS):
   ```
   scala> val start = System.currentTimeMillis; val model = gmm.setK(4).setMaxIter(20).fit(df); val end = System.currentTimeMillis; end - start
   start: Long = 1587976220511
   
   model: org.apache.spark.ml.clustering.GaussianMixtureModel = GaussianMixtureModel: uid=GaussianMixture_753da885644b, k=4, numFeatures=28
   end: Long = 1587976324361
   res4: Long = 103850
   ```
   
   This PR keeps the original behavior and performance when `blockSize==1`: 105777 ms with this PR vs. 103850 ms on master, both without native BLAS.
   
   
   

