GitHub user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1778#issuecomment-56789080

@rezazadeh I made some changes in a local branch:

https://github.com/mengxr/spark/blob/rezazadeh-dimsumv2/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala

Could you merge the latest master into your branch? Then it will be easy to compare the diff. I changed the following:

1) Renamed `similarColumns` to `columnSimilarities`.
2) Removed `activeIterator` and specialized the code for dense and sparse vectors.
3) Cached the probabilities and denominators.

Together these should improve performance by ~5x. But the shuffle is still expensive, because the records are very small.

Another question I have is about the sparsity of the result. I ran some tests locally and found that even with gamma = 1.0, the result is still dense (it contains all (i, j) pairs), though the shuffle size is smaller.
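
For illustration only, item 3 could look roughly like the sketch below. It is not the code from the branch. It assumes the usual DIMSUM sampling rule, in which an entry in column j is kept with probability min(1, sqrt(gamma) / ||c_j||) and scaled by 1 / min(sqrt(gamma), ||c_j||). The object name `DimsumCacheSketch`, the method `cache`, and the parameter `colMags` are hypothetical names introduced here.

```scala
// Hypothetical sketch, not the code in the PR: compute the per-column
// sampling probabilities and scaling denominators once, up front,
// instead of recomputing them for every nonzero entry of every row.
object DimsumCacheSketch {
  // colMags: L2 norm of each column (assumed to be computed beforehand).
  // gamma:   oversampling parameter mentioned in the comment.
  def cache(colMags: Array[Double], gamma: Double): (Array[Double], Array[Double]) = {
    val sg = math.sqrt(gamma)
    // Treat all-zero columns as having norm 1 to avoid division by zero.
    val mags = colMags.map(m => if (m == 0.0) 1.0 else m)
    val p = mags.map(m => math.min(1.0, sg / m)) // keep probability per column
    val q = mags.map(m => math.min(sg, m))       // scaling denominator per column
    (p, q)
  }

  def main(args: Array[String]): Unit = {
    val (p, q) = cache(Array(0.5, 2.0, 0.0, 10.0), gamma = 4.0)
    println(p.mkString(", "))
    println(q.mkString(", "))
  }
}
```

In a Spark job the two arrays would typically be broadcast to the executors so that each row only performs array lookups; that step is omitted here to keep the sketch free of Spark dependencies.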