[ 
https://issues.apache.org/jira/browse/SPARK-31454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31454:
------------------------------------

    Assignee:     (was: Apache Spark)

> An optimized K-Means based on DenseMatrix and GEMM
> --------------------------------------------------
>
>                 Key: SPARK-31454
>                 URL: https://issues.apache.org/jira/browse/SPARK-31454
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 3.1.0
>            Reporter: Xiaochang Wu
>            Priority: Major
>              Labels: performance
>
> The main computations in K-Means are calculating distances between individual 
> points and center points. Currently K-Means implementation is vector-based 
> which can't take advantage of optimized native BLAS libraries.
> When the original points are represented as dense vectors, our approach is to 
> modify the original input data structures to a DenseMatrix-based one by 
> grouping several points together. The original distance calculations can be 
> translated into a Matrix multiplication then optimized native GEMM routines 
> (Intel MKL, OpenBLAS etc.) can be used. This approach can also work with 
> sparse vectors despite having larger memory consumption when translating 
> sparse vectors to dense matrix.
> Our preliminary benchmark shows this DenseMatrix+GEMM approach can boost the 
> training performance by *3.5x* with Intel MKL, looks very promising!
> To minimize end user impact, proposed changes are to use config parameters to 
> control if turn on this implementation without modifying public interfaces. 
> Parameter rowsPerMatrix is used to control how many points are grouped 
> together to build a DenseMatrix. An example:
> $ spark-submit --master $SPARK_MASTER \
>     --conf "spark.ml.kmeans.matrixImplementation.enabled=true" \
>     --conf "spark.ml.kmeans.matrixImplementation.rowsPerMatrix=5000" \
>     --class org.apache.spark.examples.ml.KMeansExample 
> Several code changes are made in "spark.ml" namespace as we think 
> "spark.mllib" is in maintenance mode, some are duplications from spark.mllib 
> for using private definitions in the same package: 
>  - Modified: KMeans.scala, DatasetUtils.scala
>  - Added: KMeansMatrixImpl.scala
>  - Duplications: DistanceMeasure.scala, LocalKMeans.scala
> If this general idea is accepted by community, we are willing to contribute 
> our code to upstream and polish the implementation according to feedbacks and 
> produce benchmarks.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to