[GitHub] spark pull request: [SPARK-11968] [MLlib] : MatrixFactorizationMod...
Github user rekhajoshm closed the pull request at: https://github.com/apache/spark/pull/9980 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-11968] [MLlib] : MatrixFactorizationMod...
Github user rekhajoshm commented on the pull request: https://github.com/apache/spark/pull/9980#issuecomment-160803099 I concur, @mengxr. I tried profiling with YourKit and VisualVM. This does not fix the concern, based on my runs with MovieLensALS and RecommendationExample. I do run into a set of other issues :-) If I do not get anything on this soon, I will close this pull request. Thanks.
[GitHub] spark pull request: [SPARK-11968] [MLlib] : MatrixFactorizationMod...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/9980#issuecomment-160798418 @rekhajoshm You need to do profiling on big datasets. If the improvement is not significant, then this is not the right fix. Essentially we are shuffling many small objects `(srcId, (dstId, rating))`. I don't think the fix would be trivial. We could probably see improvement if we switch the backend to DataFrame/Tungsten.
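To make the "many small objects" point concrete, here is a minimal sketch in plain Scala, with no Spark involved. `RatingBlock` and `fromScores` are hypothetical names introduced only for illustration, and a nested `Array[Array[Double]]` stands in for the product matrix. It keeps one block of predictions in three primitive arrays instead of allocating a nested tuple per rating. This is an assumption about what a more compact layout could look like, not what the PR or Spark does.

```scala
// Hypothetical columnar container for one block of predictions (not a Spark API).
final case class RatingBlock(srcIds: Array[Int], dstIds: Array[Int], ratings: Array[Double]) {
  def size: Int = ratings.length
}

object RatingBlockSketch {
  // scores(i)(j) is assumed to be the predicted rating of dstIds(j) for srcIds(i).
  def fromScores(srcIds: Array[Int], dstIds: Array[Int], scores: Array[Array[Double]]): RatingBlock = {
    val m = srcIds.length
    val n = dstIds.length
    // Three flat arrays per block, regardless of how many ratings it holds.
    val src = new Array[Int](m * n)
    val dst = new Array[Int](m * n)
    val r = new Array[Double](m * n)
    var k = 0
    var i = 0
    while (i < m) {
      var j = 0
      while (j < n) {
        src(k) = srcIds(i)
        dst(k) = dstIds(j)
        r(k) = scores(i)(j)
        k += 1
        j += 1
      }
      i += 1
    }
    RatingBlock(src, dst, r)
  }
}
```

A block built this way would still have to be regrouped by source id downstream, which is part of why the fix is not trivial.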
[GitHub] spark pull request: [SPARK-11968] [MLlib] : MatrixFactorizationMod...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/9980#discussion_r46220915
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
@@ -275,16 +276,13 @@ object MatrixFactorizationModel extends Loader[MatrixFactorizationModel] {
       num: Int): RDD[(Int, Array[(Int, Double)])] = {
     val srcBlocks = blockify(rank, srcFeatures)
     val dstBlocks = blockify(rank, dstFeatures)
+    val output = new ArrayBuffer[(Int, (Int, Double))]()
     val ratings = srcBlocks.cartesian(dstBlocks).flatMap {
       case ((srcIds, srcFactors), (dstIds, dstFactors)) =>
-        val m = srcIds.length
-        val n = dstIds.length
         val ratings = srcFactors.transpose.multiply(dstFactors)
-        val output = new Array[(Int, (Int, Double))](m * n)
-        var k = 0
+        output.clear()
         ratings.foreachActive { (i, j, r) =>
--- End diff --
We don't need `output` to hold the buffer. The following should work, though it doesn't really fix the GC problem:
~~~scala
for (i <- 0 until m; j <- 0 until n) yield {
  (srcIds(i), dstIds(j), ratings(i, j))
}
~~~
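A self-contained version of that `for`/`yield` suggestion might look like the sketch below. It is plain Scala, not Spark code: `ForYieldSketch` and `blockRatings` are hypothetical names, and a nested `Array[Array[Double]]` is assumed in place of the matrix returned by `multiply`. The yielded value uses the nested `(srcId, (dstId, rating))` shape from the diff rather than the flat triple in the comment.

```scala
object ForYieldSketch {
  // scores(i)(j) is assumed to be the predicted rating of dstIds(j) for srcIds(i).
  def blockRatings(
      srcIds: Array[Int],
      dstIds: Array[Int],
      scores: Array[Array[Double]]): IndexedSeq[(Int, (Int, Double))] = {
    val m = srcIds.length
    val n = dstIds.length
    // No explicit output buffer or index counter: the comprehension builds the result.
    for (i <- 0 until m; j <- 0 until n) yield {
      (srcIds(i), (dstIds(j), scores(i)(j)))
    }
  }
}
```

As the comment says, this removes the hand-managed buffer but still allocates one nested tuple per rating, so it should not be expected to change GC behaviour much.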
[GitHub] spark pull request: [SPARK-11968] [MLlib] : MatrixFactorizationMod...
Github user rekhajoshm commented on the pull request: https://github.com/apache/spark/pull/9980#issuecomment-159772331 Thanks @mengxr. Any alternative suggestions for improving on the objects needed for the recommendAll functionality? I did multiple profiling runs and heap dumps by running MatrixFactorizationModelSuite with IntelliJ/VisualVM. The GC %, used heap space, and heap dumps/instances are inconclusive. Thanks. Thanks @srowen, fixed per your comment.
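When profiler output is inconclusive, one rough complement is to read the JVM's own GC counters before and after the code under test. The sketch below uses the standard `java.lang.management` API; `GcProbe`, `snapshot`, and `measure` are hypothetical helper names, and this is only one possible way to compare two implementations, not the method used in this thread.

```scala
import java.lang.management.ManagementFactory

object GcProbe {
  // Total collection count and collection time (ms) over all collectors the JVM reports.
  // A collector may report -1 when a value is undefined; those values are skipped.
  def snapshot(): (Long, Long) = {
    val beans = ManagementFactory.getGarbageCollectorMXBeans
    var count = 0L
    var timeMs = 0L
    var i = 0
    while (i < beans.size) {
      val bean = beans.get(i)
      if (bean.getCollectionCount >= 0) count += bean.getCollectionCount
      if (bean.getCollectionTime >= 0) timeMs += bean.getCollectionTime
      i += 1
    }
    (count, timeMs)
  }

  // Runs `body` and returns its result together with the GC count and GC time deltas.
  def measure[A](body: => A): (A, Long, Long) = {
    val (count0, time0) = snapshot()
    val result = body
    val (count1, time1) = snapshot()
    (result, count1 - count0, time1 - time0)
  }
}
```

The counters are JVM-wide, so other threads add noise, and on a small test suite the deltas may well be zero, which is consistent with the advice to profile on big datasets.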
[GitHub] spark pull request: [SPARK-11968] [MLlib] : MatrixFactorizationMod...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9980#issuecomment-159763910 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/46720/ Test PASSed.
[GitHub] spark pull request: [SPARK-11968] [MLlib] : MatrixFactorizationMod...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9980#issuecomment-159763909 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-11968] [MLlib] : MatrixFactorizationMod...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9980#issuecomment-159763828 **[Test build #46720 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46720/consoleFull)** for PR 9980 at commit [`4104978`](https://github.com/apache/spark/commit/41049787a1b2f3cba8e77623c69a9f590006199f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11968] [MLlib] : MatrixFactorizationMod...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9980#issuecomment-159755471 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-11968] [MLlib] : MatrixFactorizationMod...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9980#issuecomment-159755473 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/46718/ Test PASSed.
[GitHub] spark pull request: [SPARK-11968] [MLlib] : MatrixFactorizationMod...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9980#issuecomment-159755392 **[Test build #46718 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46718/consoleFull)** for PR 9980 at commit [`4b2bb59`](https://github.com/apache/spark/commit/4b2bb59f46dad86cd7f09671040800f2664dfad0). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11968] [MLlib] : MatrixFactorizationMod...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9980#issuecomment-159754992 **[Test build #46720 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46720/consoleFull)** for PR 9980 at commit [`4104978`](https://github.com/apache/spark/commit/41049787a1b2f3cba8e77623c69a9f590006199f).
[GitHub] spark pull request: [SPARK-11968] [MLlib] : MatrixFactorizationMod...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/9980#discussion_r45929208
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
@@ -275,15 +276,13 @@ object MatrixFactorizationModel extends Loader[MatrixFactorizationModel] {
       num: Int): RDD[(Int, Array[(Int, Double)])] = {
     val srcBlocks = blockify(rank, srcFeatures)
     val dstBlocks = blockify(rank, dstFeatures)
+    val output = new ArrayBuffer[(Int, (Int, Double))]()
     val ratings = srcBlocks.cartesian(dstBlocks).flatMap {
       case ((srcIds, srcFactors), (dstIds, dstFactors)) =>
-        val m = srcIds.length
-        val n = dstIds.length
         val ratings = srcFactors.transpose.multiply(dstFactors)
-        val output = new Array[(Int, (Int, Double))](m * n)
         var k = 0
         ratings.foreachActive { (i, j, r) =>
-          output(k) = (srcIds(i), (dstIds(j), r))
+          output.append((srcIds(i), (dstIds(j), r)))
--- End diff --
Is k even needed now?
[GitHub] spark pull request: [SPARK-11968] [MLlib] : MatrixFactorizationMod...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/9980#issuecomment-159750518 This won't help much, and it may cause issues because the buffer is not cleared. It would be helpful if you could profile the implementation and show that the number of temporary objects is reduced.
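The buffer concern can be reproduced outside Spark. The following is a minimal sketch with plain Scala collections; `SharedBufferSketch` and its two methods are hypothetical names, and each inner list stands in for the ratings computed for one pair of blocks. In Spark the closure and the captured buffer are serialized per task, so actual behaviour there may differ; the sketch only shows the general hazard of returning a shared, never-cleared buffer from `flatMap`.

```scala
import scala.collection.mutable.ArrayBuffer

object SharedBufferSketch {
  type Rating = (Int, (Int, Double))

  // One buffer shared by every call and never cleared: each call returns
  // everything appended so far, so ratings from earlier blocks are emitted again.
  def withSharedBuffer(blocks: List[List[Rating]]): List[Rating] = {
    val output = new ArrayBuffer[Rating]()
    blocks.flatMap { block =>
      output ++= block
      output
    }
  }

  // A fresh buffer per call: each call returns only the ratings of its own block.
  def withFreshBuffer(blocks: List[List[Rating]]): List[Rating] =
    blocks.flatMap { block =>
      val output = new ArrayBuffer[Rating](block.length)
      output ++= block
      output
    }
}
```

With two blocks of sizes 2 and 1, the shared-buffer variant yields 5 elements (the first block appears twice) while the fresh-buffer variant yields the expected 3.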
[GitHub] spark pull request: [SPARK-11968] [MLlib] : MatrixFactorizationMod...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9980#issuecomment-159749151 **[Test build #46718 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46718/consoleFull)** for PR 9980 at commit [`4b2bb59`](https://github.com/apache/spark/commit/4b2bb59f46dad86cd7f09671040800f2664dfad0).
[GitHub] spark pull request: [SPARK-11968] [MLlib] : MatrixFactorizationMod...
GitHub user rekhajoshm opened a pull request:

    https://github.com/apache/spark/pull/9980

    [SPARK-11968] [MLlib] : MatrixFactorizationModel recommendAll for GC times

    Fix for ALS recommend all methods for GC times

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rekhajoshm/spark SPARK-11968

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9980.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9980

commit e3677c9fa9697e0d34f9df52442085a6a481c9e9
Author: Rekha Joshi
Date:   2015-05-05T23:10:08Z

    Merge pull request #1 from apache/master

    Pulling functionality from apache spark

commit 106fd8eee8f6a6f7c67cfc64f57c1161f76d8f75
Author: Rekha Joshi
Date:   2015-05-08T21:49:09Z

    Merge pull request #2 from apache/master

    pull latest from apache spark

commit 0be142d6becba7c09c6eba0b8ea1efe83d649e8c
Author: Rekha Joshi
Date:   2015-06-22T00:08:08Z

    Merge pull request #3 from apache/master

    Pulling functionality from apache spark

commit 6c6ee12fd733e3f9902e10faf92ccb78211245e3
Author: Rekha Joshi
Date:   2015-09-17T01:03:09Z

    Merge pull request #4 from apache/master

    Pulling functionality from apache spark

commit b123c601e459d1ad17511fd91dd304032154882a
Author: Rekha Joshi
Date:   2015-11-25T18:50:32Z

    Merge pull request #5 from apache/master

    pull request from apache/master

commit 4b2bb59f46dad86cd7f09671040800f2664dfad0
Author: Joshi
Date:   2015-11-25T22:48:56Z

    Fix for ALS recommend all methods for GC times