[ https://issues.apache.org/jira/browse/SPARK-4581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiangrui Meng updated SPARK-4581:
---------------------------------
    Target Version/s: 1.2.0
            Assignee: DB Tsai

> Refactorize StandardScaler to improve the transformation performance
> --------------------------------------------------------------------
>
>                 Key: SPARK-4581
>                 URL: https://issues.apache.org/jira/browse/SPARK-4581
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: DB Tsai
>            Assignee: DB Tsai
>
> The following optimizations are done to improve the StandardScaler model
> transformation performance.
> 1) Convert the Breeze dense vector to a primitive vector to reduce the overhead.
> 2) Since the mean can potentially be a sparse vector, we explicitly convert it
> to a dense primitive vector.
> 3) Keep local references to the `shift` and `factor` arrays so the JVM can
> locate each value with one operation call.
> 4) In the pattern matching part, we use the mllib SparseVector/DenseVector
> instead of breeze's vector to make the codebase cleaner.
>
> Benchmark with mnist8m dataset:
>
> Before,
> DenseVector withMean and withStd: 50.97secs
> DenseVector withMean and withoutStd: 42.11secs
> DenseVector withoutMean and withStd: 8.75secs
> SparseVector withoutMean and withStd: 5.437secs
>
> With this PR,
> DenseVector withMean and withStd: 5.76secs
> DenseVector withMean and withoutStd: 5.28secs
> DenseVector withoutMean and withStd: 5.30secs
> SparseVector withoutMean and withStd: 1.27secs
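
The four optimizations listed in the issue can be illustrated with a minimal Scala sketch. This is not the code from the patch. `DenseVec`, `SparseVec`, and `ScalerSketch` are hypothetical stand-ins for the MLlib classes so that the example runs without Spark; only the names `shift` and `factor` come from the issue text. Mapping a zero standard deviation to a factor of 0.0 and rejecting centering for sparse input are assumptions of this sketch.

```scala
// Hypothetical stand-ins for MLlib's DenseVector / SparseVector (no Spark needed).
sealed trait Vec { def size: Int }
final case class DenseVec(values: Array[Double]) extends Vec {
  def size: Int = values.length
}
final case class SparseVec(size: Int, indices: Array[Int], values: Array[Double]) extends Vec

// Hypothetical model class; `shift` and `factor` follow the naming in the issue.
final class ScalerSketch(mean: Array[Double], std: Array[Double],
                         withMean: Boolean, withStd: Boolean) {
  require(mean.length == std.length, "mean and std must have the same length")

  // Items 1 and 2: dense primitive arrays, computed once rather than per vector.
  private val shift: Array[Double] =
    if (withMean) mean.clone() else Array.fill(mean.length)(0.0)
  // Assumption: a zero standard deviation maps to a factor of 0.0.
  private val factor: Array[Double] =
    if (withStd) std.map(s => if (s != 0.0) 1.0 / s else 0.0)
    else Array.fill(std.length)(1.0)

  def transform(v: Vec): Vec = {
    require(v.size == shift.length, "dimension mismatch")
    // Item 3: local references, so the hot loop does not go through field accessors.
    val localShift = shift
    val localFactor = factor
    // Item 4: pattern match on the vector types directly.
    v match {
      case DenseVec(values) =>
        val out = values.clone()
        var i = 0
        while (i < out.length) {
          out(i) = (out(i) - localShift(i)) * localFactor(i)
          i += 1
        }
        DenseVec(out)
      case SparseVec(size, indices, values) =>
        // Assumption: centering would densify a sparse vector, so it is rejected here.
        require(!withMean, "cannot center a sparse vector without densifying it")
        val out = values.clone()
        var k = 0
        while (k < out.length) {
          out(k) *= localFactor(indices(k))
          k += 1
        }
        SparseVec(size, indices, out)
    }
  }
}
```

The `while` loops over primitive `Array[Double]` values avoid boxing and iterator overhead, which is the kind of per-element cost the issue describes removing. In the sparse case only the stored non-zero values are scaled, so the work is proportional to the number of non-zeros rather than to the vector dimension.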