[ https://issues.apache.org/jira/browse/MAHOUT-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16189151#comment-16189151 ]
ASF GitHub Bot commented on MAHOUT-2019:
----------------------------------------

GitHub user pferrel opened a pull request:

    https://github.com/apache/mahout/pull/342

    MAHOUT-2019 Sparse speedup

    ### Purpose of PR:
    to review an apparent speedup of spark-itemsimilarity and the underlying SimilarityAnalysis.cooccurrence by using an iterateNonZero instead of the previous for loops in SparseRowMatrix. For discussion only at present

    MAHOUT-2019 https://issues.apache.org/jira/projects/MAHOUT/issues/MAHOUT-2019?filter=allopenissues&orderby=priority+DESC%2C+updated+DESC

    ### Important ToDos
    Please mark each with an "x"
    - [x] A JIRA ticket exists (if not, please create this first)[https://issues.apache.org/jira/browse/ZEPPELIN/]
    - [x] Title of PR is "MAHOUT-XXXX Brief Description of Changes" where XXXX is the JIRA number.
    - [ ] Created unit tests where appropriate
    - [ ] Added licenses correct on newly added files
    - [ ] Assigned JIRA to self
    - [ ] Added documentation in scala docs/java docs, and to website
    - [ ] Successfully built and ran all unit tests, verified that all tests pass locally.

    If all of these things aren't complete, but you still feel it is appropriate to open a PR, please add [WIP] after MAHOUT-XXXX before the descriptions- e.g. "MAHOUT-XXXX [WIP] Description of Change"

    Does this change break earlier versions?

    Is this the beginning of a larger project for which a feature branch should be made?

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/pferrel/mahout sparse-speedup

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/mahout/pull/342.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #342

----
commit 26a2efa65e9f09df358e1021ebf45e3735e2ec6c
Author: pferrel <p...@occamsmachete.com>
Date:   2017-10-02T18:39:54Z

    minimum speedup fix

commit 9330a2ed6d1211459c57863a5d664377c55aa747
Author: pferrel <p...@occamsmachete.com>
Date:   2017-10-02T19:27:47Z

    minimum speedup fix with cast exception check

commit 722bd11f01e7250f99f21f17ec7211bf5abb2089
Author: pferrel <p...@occamsmachete.com>
Date:   2017-10-02T20:33:07Z

    added cast exception logging to SparseRowMatrix

commit 02700ef13c44e403cba58288dcbab5cfabed8585
Author: pferrel <p...@occamsmachete.com>
Date:   2017-10-02T20:35:14Z

    Merge branch 'master' into sparse-speedup

----

> SparseRowMatrix assign ops use for loops instead of iterateNonZero and so can be optimized
> -------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-2019
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-2019
>             Project: Mahout
>          Issue Type: Bug
>          Components: Math
>    Affects Versions: 0.13.0
>            Reporter: Pat Ferrel
>            Assignee: Pat Ferrel
>             Fix For: 0.13.1
>
>
> DRMs get blockified into SparseRowMatrix instances if the density is low. But
> SRM inherits the implementation of methods like "assign" from AbstractMatrix,
> which uses nested for loops to traverse rows. For multiplying 2 matrices that
> are extremely sparse, the kind of data you see in collaborative filtering,
> this is extremely wasteful of execution time. Better to use a sparse vector's
> iterateNonZero Iterator for some function types.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
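
To make the issue description above concrete, here is a minimal, self-contained sketch of the two traversal strategies it contrasts. It does not use Mahout: `SparseRow`, `assignDense`, and `assignNonZero` are hypothetical stand-ins, not the code in the pull request. The sketch also assumes that "some function types" means functions with f(0) = 0, since only those leave the implicit zeros unchanged and can therefore skip them safely.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.DoubleUnaryOperator;

public class SparseAssignSketch {

    /** Hypothetical stand-in for a sparse row: column index -> stored non-zero value. */
    static final class SparseRow {
        final int cardinality;
        final Map<Integer, Double> nonZeros = new TreeMap<>();

        SparseRow(int cardinality) {
            this.cardinality = cardinality;
        }

        double get(int col) {
            return nonZeros.getOrDefault(col, 0.0);
        }

        void set(int col, double value) {
            if (value == 0.0) {
                nonZeros.remove(col);   // zeros are implicit, never stored
            } else {
                nonZeros.put(col, value);
            }
        }
    }

    /** Nested for loops over every cell, zero or not. Returns the number of cells visited. */
    static long assignDense(SparseRow[] rows, DoubleUnaryOperator f) {
        long visited = 0;
        for (SparseRow row : rows) {
            for (int col = 0; col < row.cardinality; col++) {
                row.set(col, f.applyAsDouble(row.get(col)));
                visited++;
            }
        }
        return visited;
    }

    /**
     * Visits only the stored entries. Assumption of this sketch: that is valid
     * only when f(0) == 0, so other functions are rejected.
     */
    static long assignNonZero(SparseRow[] rows, DoubleUnaryOperator f) {
        if (f.applyAsDouble(0.0) != 0.0) {
            throw new IllegalArgumentException("f(0) != 0: every cell must be visited");
        }
        long visited = 0;
        for (SparseRow row : rows) {
            for (Map.Entry<Integer, Double> entry : row.nonZeros.entrySet()) {
                entry.setValue(f.applyAsDouble(entry.getValue()));
                visited++;
            }
        }
        return visited;
    }
}
```

For a block of rows with cardinality 1000 and a handful of stored values each, the dense version touches 1000 cells per row while the non-zero version touches only the stored entries; that gap is the waste the issue describes for collaborative-filtering data.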