zhengruifeng commented on issue #28229:
URL: https://github.com/apache/spark/pull/28229#issuecomment-616919554


   @srowen Thanks for pinging me
   
   @xwu99 Could you please provide some performance results of your PR?
   
   I made a [similar attempt](https://github.com/apache/spark/compare/master...zhengruifeng:blockify_km?expand=1) to optimize KMeans based on high-level BLAS.
   I also blockified vectors into blocks and used BLAS.gemm to find the best costs (a sketch of the idea follows after the list below). But I found that:
   1, it causes a performance regression when the input dataset is sparse (I notice that you added `spark.ml.kmeans.matrixImplementation.rowsPerMatrix`; I am not sure whether we should maintain two implementations);
   2, when the input dataset is dense, I found no performance gain with `distanceMeasure = EUCLIDEAN`; with `distanceMeasure = COSINE`, a speedup of about 10%~20% can be obtained;
   3, native BLAS did not help much when a single thread is used (as suggested [in Spark](https://spark.apache.org/docs/latest/ml-guide.html#dependencies));
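   
   For reference, the core of the blockified approach is that, for squared Euclidean distance, `||x - c||^2 = ||x||^2 + ||c||^2 - 2 * (x dot c)`, so all cross terms between a block of points and all centers can be computed with a single GEMM. Below is a minimal sketch of that idea, not the code from my branch or from this PR; it assumes a dense block and uses the public `DenseMatrix.multiply`, which is GEMM-backed:
   
   ```scala
   import org.apache.spark.ml.linalg.DenseMatrix
   
   // Assign each point in a dense block to its closest center, computing all
   // cross terms with one matrix multiply:
   //   ||x - c||^2 = ||x||^2 + ||c||^2 - 2 * (x dot c)
   def blockBestCenters(
       points: Array[Array[Double]],                    // n points, each of length d
       centers: Array[Array[Double]]): Array[Int] = {   // k centers, each of length d
     val n = points.length
     val k = centers.length
     val d = points.head.length
   
     // DenseMatrix is column-major: entry (r, c) lives at index c * numRows + r.
     val pointsMat = new DenseMatrix(n, d, Array.tabulate(n * d)(i => points(i % n)(i / n)))
     val centersT  = new DenseMatrix(d, k, Array.tabulate(d * k)(i => centers(i / d)(i % d)))
     val cross = pointsMat.multiply(centersT)           // n x k dot products, one GEMM call
   
     val pNorm2 = points.map(_.map(v => v * v).sum)
     val cNorm2 = centers.map(_.map(v => v * v).sum)
   
     // Pick, for each point, the center with the smallest squared distance.
     Array.tabulate(n) { i =>
       (0 until k).minBy(j => pNorm2(i) + cNorm2(j) - 2.0 * cross(i, j))
     }
   }
   ```
   
   Note that building the block this way densifies each row, which hints at why the sparse case can regress.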
   
   
   Then I switched to another optimization approach based on the [triangle inequality](https://github.com/apache/spark/pull/27758); it works on both dense and sparse datasets, and gains about 10%~30% when `numFeatures` and/or `k` are large.
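   
   For context, the pruning rule is: if `d(c_best, c_j) >= 2 * d(x, c_best)`, then by the triangle inequality `d(x, c_j) >= d(x, c_best)`, so center `c_j` can be skipped without computing its distance. A minimal single-point sketch of that idea (not the PR's code; `centerDist` is assumed to be a precomputed pairwise center-distance matrix):
   
   ```scala
   // Find the closest center to `point`, skipping centers ruled out by the
   // triangle inequality: d(point, c_j) >= d(c_best, c_j) - d(point, c_best).
   def findClosest(
       point: Array[Double],
       centers: Array[Array[Double]],
       centerDist: Array[Array[Double]]): (Int, Double) = {
   
     def dist(a: Array[Double], b: Array[Double]): Double =
       math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
   
     var best = 0
     var bestDist = dist(point, centers(0))
     var j = 1
     while (j < centers.length) {
       if (centerDist(best)(j) >= 2.0 * bestDist) {
         // d(point, c_j) >= 2 * bestDist - bestDist = bestDist, so c_j cannot win; skip it.
       } else {
         val dj = dist(point, centers(j))
         if (dj < bestDist) { best = j; bestDist = dj }
       }
       j += 1
     }
     (best, bestDist)
   }
   ```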
   

