[ 
https://issues.apache.org/jira/browse/SPARK-31454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaochang Wu updated SPARK-31454:
---------------------------------
    Description: 
The main computations in K-Means are calculating distances between individual 
points and center points. Currently K-Means implementation is vector-based 
which can't take advantage of optimized native BLAS libraries.

When the original points are represented as dense vectors, our approach is to 
modify the original input data structures to a DenseMatrix-based one by 
grouping several points together. The original distance calculations can be 
translated into a Matrix multiplication then optimized native GEMM routines 
(Intel MKL, OpenBLAS etc.) can be used. This approach can also work with sparse 
vectors despite having larger memory consumption when translating sparse 
vectors to dense matrix.

Our preliminary benchmark shows this DenseMatrix+GEMM approach can boost the 
training performance by *3.5x* with Intel MKL, looks very promising!

To minimize end user impact, proposed changes are to use config parameters to 
control if turn on this implementation without modifying public interfaces. 
Parameter rowsPerMatrix is used to control how many points are grouped together 
to build a DenseMatrix. An example:

$ spark-submit --master $SPARK_MASTER \

    --conf "spark.ml.kmeans.matrixImplementation.enabled=true" \

    --conf "spark.ml.kmeans.matrixImplementation.rowsPerMatrix=5000" \

    --class org.apache.spark.examples.ml.KMeansExample 

Several code changes are made in "spark.ml" namespace as we think "spark.mllib" 
is in maintenance mode, some are duplications from spark.mllib for using 
private definitions in the same package: 
 - Modified: KMeans.scala, DatasetUtils.scala

 - Added: KMeansMatrixImpl.scala

 - Duplications: DistanceMeasure.scala, LocalKMeans.scala

If this general idea is accepted by community, we are willing to contribute our 
code to upstream and polish the implementation according to feedbacks and 
produce benchmarks.

 

  was:
The main computations in K-Means are calculating distances between individual 
points and center points. Currently K-Means implementation is vector-based 
which can't take advantage of optimized native BLAS libraries.

When the original points are represented as dense vectors, our approach is to 
modify the original input data structures to a DenseMatrix-based one by 
grouping several points together. The original distance calculations can be 
translated into a Matrix multiplication then optimized native GEMM routines 
(Intel MKL, OpenBLAS etc.) can be used. This approach can also work with sparse 
vectors despite having larger memory consumption when translating sparse 
vectors to dense matrix.

Our preliminary benchmark shows this DenseMatrix+GEMM approach can boost the 
training performance by *3.5x* with Intel MKL, looks very promising!

To minimize end user impact, proposed changes are to use config parameters to 
control if turn on this implementation without modifying public interfaces. 
Parameter rowsPerMatrix is used to control how many points are grouped together 
to build a DenseMatrix. An example:

$ spark-submit --master $SPARK_MASTER \

    --conf "spark.ml.kmeans.matrixImplementation.enabled=true" \

    --conf "spark.ml.kmeans.matrixImplementation.rowsPerMatrix=5000" \

    --class org.apache.spark.examples.ml.KMeansExample 

Several code changes are made in "spark.ml" namespace as we think "spark.mllib" 
is in maintenance mode, some are duplications from spark.mllib for using 
private definitions in the same package: 
 - Modified: KMeans.scala, DatasetUtils

 - Added: KMeansMatrixImpl.scala

 - Duplications: DistanceMeasure.scala, LocalKMeans.scala

If this general idea is accepted by community, we are willing to contribute our 
code to upstream and polish the implementation according to feedbacks and 
produce benchmarks.

 


> An optimized K-Means based on DenseMatrix and GEMM
> --------------------------------------------------
>
>                 Key: SPARK-31454
>                 URL: https://issues.apache.org/jira/browse/SPARK-31454
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 3.1.0
>            Reporter: Xiaochang Wu
>            Priority: Major
>              Labels: performance
>
> The main computations in K-Means are calculating distances between individual 
> points and center points. Currently K-Means implementation is vector-based 
> which can't take advantage of optimized native BLAS libraries.
> When the original points are represented as dense vectors, our approach is to 
> modify the original input data structures to a DenseMatrix-based one by 
> grouping several points together. The original distance calculations can be 
> translated into a Matrix multiplication then optimized native GEMM routines 
> (Intel MKL, OpenBLAS etc.) can be used. This approach can also work with 
> sparse vectors despite having larger memory consumption when translating 
> sparse vectors to dense matrix.
> Our preliminary benchmark shows this DenseMatrix+GEMM approach can boost the 
> training performance by *3.5x* with Intel MKL, looks very promising!
> To minimize end user impact, proposed changes are to use config parameters to 
> control if turn on this implementation without modifying public interfaces. 
> Parameter rowsPerMatrix is used to control how many points are grouped 
> together to build a DenseMatrix. An example:
> $ spark-submit --master $SPARK_MASTER \
>     --conf "spark.ml.kmeans.matrixImplementation.enabled=true" \
>     --conf "spark.ml.kmeans.matrixImplementation.rowsPerMatrix=5000" \
>     --class org.apache.spark.examples.ml.KMeansExample 
> Several code changes are made in "spark.ml" namespace as we think 
> "spark.mllib" is in maintenance mode, some are duplications from spark.mllib 
> for using private definitions in the same package: 
>  - Modified: KMeans.scala, DatasetUtils.scala
>  - Added: KMeansMatrixImpl.scala
>  - Duplications: DistanceMeasure.scala, LocalKMeans.scala
> If this general idea is accepted by community, we are willing to contribute 
> our code to upstream and polish the implementation according to feedbacks and 
> produce benchmarks.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to