[ https://issues.apache.org/jira/browse/SPARK-4129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188037#comment-14188037 ]
Apache Spark commented on SPARK-4129:
-------------------------------------

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/2992

> Performance tuning in MultivariateOnlineSummarizer
> --------------------------------------------------
>
>                 Key: SPARK-4129
>                 URL: https://issues.apache.org/jira/browse/SPARK-4129
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: DB Tsai
>
> In MultivariateOnlineSummarizer, breeze's activeIterator is used to loop
> through the nonzero elements in the vector. However, activeIterator doesn't
> perform well due to lots of overhead. In this PR, a native while loop is used
> for both DenseVector and SparseVector.
>
> The benchmark result with 20 executors using the mnist8m dataset:
>
> Before:
>   DenseVector: 48.2 seconds
>   SparseVector: 16.3 seconds
> After:
>   DenseVector: 17.8 seconds
>   SparseVector: 11.2 seconds
>
> Since MultivariateOnlineSummarizer is used in several places, the overall
> performance gain in the mllib library will be significant with this PR.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
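
For readers unfamiliar with the pattern the issue describes, here is a minimal sketch of replacing iterator-based traversal with an index-based while loop over dense and sparse vectors. It is not the code from the pull request: `Vec`, `Dense`, `Sparse`, `sumIter`, and `sumWhile` are hypothetical stand-ins that use plain arrays rather than breeze or MLlib types, and the per-column sum is only an example accumulation.

```scala
// Hypothetical stand-ins for dense and sparse vectors (not the MLlib or breeze classes).
sealed trait Vec { def size: Int }
final case class Dense(values: Array[Double]) extends Vec {
  def size: Int = values.length
}
// A sparse vector stores only the listed positions: values(k) sits at index indices(k).
final case class Sparse(size: Int, indices: Array[Int], values: Array[Double]) extends Vec

object WhileLoopSketch {
  // Iterator-style traversal: every element passes through a tuple and a closure call.
  def sumIter(v: Vec, acc: Array[Double]): Unit = v match {
    case Dense(vs) =>
      vs.iterator.zipWithIndex.foreach { case (x, i) => acc(i) += x }
    case Sparse(_, idx, vs) =>
      idx.iterator.zip(vs.iterator).foreach { case (i, x) => acc(i) += x }
  }

  // While-loop traversal: direct array indexing, no per-element tuple or closure.
  def sumWhile(v: Vec, acc: Array[Double]): Unit = v match {
    case Dense(vs) =>
      var i = 0
      while (i < vs.length) {
        acc(i) += vs(i)
        i += 1
      }
    case Sparse(_, idx, vs) =>
      var k = 0
      while (k < idx.length) {
        acc(idx(k)) += vs(k)
        k += 1
      }
  }
}
```

Both functions compute the same result; the second only changes how the elements are visited. Whether it is faster in a given setting depends on the JVM and the data, so the timings quoted in the issue should not be read as applying to this sketch.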