[ https://issues.apache.org/jira/browse/SPARK-31454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-31454: ------------------------------------ Assignee: (was: Apache Spark) > An optimized K-Means based on DenseMatrix and GEMM > -------------------------------------------------- > > Key: SPARK-31454 > URL: https://issues.apache.org/jira/browse/SPARK-31454 > Project: Spark > Issue Type: Improvement > Components: ML > Affects Versions: 3.1.0 > Reporter: Xiaochang Wu > Priority: Major > Labels: performance > > The main computations in K-Means are calculating distances between individual > points and center points. Currently K-Means implementation is vector-based > which can't take advantage of optimized native BLAS libraries. > When the original points are represented as dense vectors, our approach is to > modify the original input data structures to a DenseMatrix-based one by > grouping several points together. The original distance calculations can be > translated into a Matrix multiplication then optimized native GEMM routines > (Intel MKL, OpenBLAS etc.) can be used. This approach can also work with > sparse vectors despite having larger memory consumption when translating > sparse vectors to dense matrix. > Our preliminary benchmark shows this DenseMatrix+GEMM approach can boost the > training performance by *3.5x* with Intel MKL, looks very promising! > To minimize end user impact, proposed changes are to use config parameters to > control if turn on this implementation without modifying public interfaces. > Parameter rowsPerMatrix is used to control how many points are grouped > together to build a DenseMatrix. An example: > $ spark-submit --master $SPARK_MASTER \ > --conf "spark.ml.kmeans.matrixImplementation.enabled=true" \ > --conf "spark.ml.kmeans.matrixImplementation.rowsPerMatrix=5000" \ > --class org.apache.spark.examples.ml.KMeansExample > Several code changes are made in "spark.ml" namespace as we think > "spark.mllib" is in maintenance mode, some are duplications from spark.mllib > for using private definitions in the same package: > - Modified: KMeans.scala, DatasetUtils.scala > - Added: KMeansMatrixImpl.scala > - Duplications: DistanceMeasure.scala, LocalKMeans.scala > If this general idea is accepted by community, we are willing to contribute > our code to upstream and polish the implementation according to feedbacks and > produce benchmarks. > -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org