[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18748 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82458/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18748 **[Test build #82458 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82458/testReport)** for PR 18748 at commit [`526675d`](https://github.com/apache/spark/commit/526675d009a0f800d62e0e0334e87fef15bdd86c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18748 Merged build finished. Test PASSed.
[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18748 **[Test build #82458 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82458/testReport)** for PR 18748 at commit [`526675d`](https://github.com/apache/spark/commit/526675d009a0f800d62e0e0334e87fef15bdd86c).
[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/18748 Jenkins retest this please
[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18748 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82456/ Test FAILed.
[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18748 Merged build finished. Test FAILed.
[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18748 **[Test build #82456 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82456/testReport)** for PR 18748 at commit [`526675d`](https://github.com/apache/spark/commit/526675d009a0f800d62e0e0334e87fef15bdd86c). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18748 **[Test build #82456 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82456/testReport)** for PR 18748 at commit [`526675d`](https://github.com/apache/spark/commit/526675d009a0f800d62e0e0334e87fef15bdd86c).
[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18748 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81825/ Test PASSed.
[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18748 **[Test build #81825 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81825/testReport)** for PR 18748 at commit [`8ed91ab`](https://github.com/apache/spark/commit/8ed91ab283ccaa0b47ebe8467acc186aeca20c54). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18748 Merged build finished. Test PASSed.
[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18748 **[Test build #81825 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81825/testReport)** for PR 18748 at commit [`8ed91ab`](https://github.com/apache/spark/commit/8ed91ab283ccaa0b47ebe8467acc186aeca20c54).
[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18748 LGTM --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/18748 Any further comments on this? @srowen @mpjlu @jkbradley?
[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/18748 @srowen I'm not really sure which of `Set` or `Dataset` would be more common. I'm inclined to stick with `Dataset` to keep the API consistent.
[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/18748 Ah ok - that clears things up. Yes that `predict` method is very inefficient relative to the `recommendForAll` setup.
[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18748 Thanks @MLnick. I have double-checked my test. Since there was no `recommendForUserSubset`, my previous test used MLlib's `MatrixFactorizationModel::predict(RDD(Int, Int))`, which predicts the ratings of many users for many products. That function performs poorly compared with `recommendForAll`. This PR calls `recommendForAll` with a subset of the users, so I agree with your test results. Thanks.
[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/18748

Ok, so I ran some larger-scale tests on a cluster (3 workers, each with 48 cores / 100GB allocated RAM and 1 executor). On the same `movielens-latest` dataset (~250,000 users and ~33,000 movies), using a **30% sample** of user ids:

```
scala> // all users
scala> spark.time { model.recommendForAllUsers(k).foreach(_ => Unit) }
Time taken: 25104 ms

scala> // user sample
scala> spark.time { model.recommendForUserSubset(userSample, k).foreach(_ => Unit) }
Time taken: 8963 ms

scala> 8963 / 25104.0
res16: Double = 0.35703473550031867
```

On a much larger dataset, the Amazon books ratings data (8 million users, 2.3 million items), also using a **30% user sample**:

```
scala> // all users
scala> spark.time { model.recommendForAllUsers(k).foreach(_ => Unit) }
Time taken: 32985936 ms => 9.16 hours

scala> // user sample
scala> spark.time { model.recommendForUserSubset(userSample, k).foreach(_ => Unit) }
Time taken: 8164421 ms => 2.26 hours

scala> 8164421 / 32985936.0
res7: Double = 0.24751218216151272
```

So it's a reasonably consistent range of *25-35%* of the time for a *30%* user sample (I found broadly similar results with a 70% user sample, which took about 60% of the recommend-for-all time).

@mpjlu could you double check your results? What I find is consistent with my expectation that computing for a subset should take time roughly proportional to the ratio of the ids in the subset to the total. It appears to me that the extra `distinct` and `join` don't have much impact on overall runtime. However, your results are very different, so we should understand why.
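The ratios quoted in the benchmark message above can be recomputed in one place. The following is only a small sketch in plain Scala (no Spark needed); the `Run` case class and the dataset labels are hypothetical names introduced for illustration, while the millisecond figures are copied from the message.

```scala
object SubsetTimingRatios {
  // Hypothetical container for one benchmark run; timings are in milliseconds.
  final case class Run(dataset: String, allUsersMs: Long, subsetMs: Long) {
    // Fraction of the recommend-for-all time spent on the 30% user sample.
    def ratio: Double = subsetMs.toDouble / allUsersMs
  }

  // Figures copied from the two runs reported in the thread.
  val runs: Seq[Run] = Seq(
    Run("movielens-latest", 25104L, 8963L),
    Run("amazon-books", 32985936L, 8164421L)
  )

  def main(args: Array[String]): Unit =
    runs.foreach { r =>
      // Print the subset time as a percentage of the full time.
      println(f"${r.dataset}%-18s subset/all = ${r.ratio * 100}%.1f%%")
    }
}
```

Both ratios fall between 0.24 and 0.36, which is the 25-35% range mentioned in the message.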
[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18748 Thanks. This is my test setting: 3 workers, each with 40 cores, 196G memory, 1 executor. Data size: 480,000 users, 17,000 items.
[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/18748

I don't get similar results to you (granted, I have only tested locally).

```
scala> spark.time { userRecsAll.foreach(_ => Unit) }
Time taken: 122422 ms

scala> spark.time { userRecsPart.foreach(_ => Unit) }
Time taken: 50228 ms
```

Here, `userRecsPart` is a 30% sample, and the time is ~40% of the `recommendForAllUsers` time. I will try some larger-scale tests.

It could be that the `join` and `distinct` cause the underperformance. However, those operations would increase the number of partitions in the computation a lot, because of the `spark.sql.shuffle.partitions` setting, if using the defaults. Setting this to, say, `8` (the number of threads I have locally), I get

```
scala> spark.time { userRecsPart.foreach(_ => Unit) }
Time taken: 37362 ms
```

So, about 30% of the full time for the 30% sample.
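For readers who want to repeat the local experiment, the shuffle-partition setting mentioned above can be changed on a running session. This is a sketch only, assuming Spark SQL is on the classpath; the object name, app name, and `local[8]` master are illustrative assumptions, not values from the thread (apart from the `8` partitions). In `spark-shell` the `spark` session already exists, so only the `spark.conf.set` line is needed.

```scala
import org.apache.spark.sql.SparkSession

object ShufflePartitionsSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical local session; in spark-shell, `spark` is predefined.
    val spark = SparkSession.builder()
      .master("local[8]")
      .appName("shuffle-partitions-sketch")
      .getOrCreate()

    // Show the current value (Spark's default is 200 unless overridden).
    println(spark.conf.get("spark.sql.shuffle.partitions"))

    // Match the number of shuffle partitions to the local thread count,
    // as described in the message above.
    spark.conf.set("spark.sql.shuffle.partitions", "8")

    spark.stop()
  }
}
```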
[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18748 Did you test the performance of this? I tested the performance of MLlib `recommendForUserSubset` some days ago, and it was not good. Suppose `recommendForAll` takes 35s; recommending for 1/3 of the users this way may need 90s. It may be faster to use `recommendForAll` and then select 1/3 of the users. But when recommending for tens or hundreds of users, this is faster than `recommendForAll`. So should we add some comments in the code about when it is better to use `recommendForUserSubset`? Thanks.
[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18748 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79998/ Test PASSed.
[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18748 Merged build finished. Test PASSed.
[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18748 **[Test build #79998 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79998/testReport)** for PR 18748 at commit [`4bd91f1`](https://github.com/apache/spark/commit/4bd91f12e1b15657d92fea6d7b91dae2e6e68c29). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/18748 Seems reasonable. Would it be more or less common/natural for someone to specify the users as a simple set, rather than a Dataset? Not sure.
[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18748 Merged build finished. Test PASSed.
[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18748 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79997/ Test PASSed.
[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18748 **[Test build #79997 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79997/testReport)** for PR 18748 at commit [`5a8c421`](https://github.com/apache/spark/commit/5a8c4216ce636dea3ba67baa9b169db7486f37f2). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18748 **[Test build #79998 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79998/testReport)** for PR 18748 at commit [`4bd91f1`](https://github.com/apache/spark/commit/4bd91f12e1b15657d92fea6d7b91dae2e6e68c29).
[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18748 **[Test build #79997 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79997/testReport)** for PR 18748 at commit [`5a8c421`](https://github.com/apache/spark/commit/5a8c4216ce636dea3ba67baa9b169db7486f37f2).
[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/18748

**Note 1** This implementation must perform a `distinct` on the input data frame id column to guarantee correct results, since otherwise multiple "copies" of the same recommendations would be generated for duplicate ids, and the resulting recommendations would contain duplicates. This could alternatively be left to the user to handle, with the assumption that the input data frame contains no duplicates. But for now I've opted for the safest option even if it introduces this inefficiency.

**Note 2** This does not support `coldStartStrategy`. Therefore no recommendations will be returned for ids in the input data frame that are not contained in the model (this is analogous to `coldStartStrategy=drop` for `transform`). I believe this makes most sense, since supporting something like the `na` option would be a bit involved and not add that much value. However, it could be done (but would need to return `null` rows in the `recommendation` column for these cases). Later, when other cold start strategies might be supported (e.g. average factor vectors), this method could return recommendations even for ids that are not contained in the model.

cc @srowen @jkbradley @yanboliang @mpjlu @sethah
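The effect of the two notes above can be illustrated without Spark. The sketch below uses plain Scala collections as a stand-in for the id column and the model's factors; it is not the PR's implementation, and `userFactors`, `selectFactors`, and the sample ids are hypothetical names and values chosen for illustration. It shows that an inner join repeats rows for duplicate ids unless they are deduplicated first, and that ids unknown to the model simply drop out.

```scala
object SubsetJoinSketch {
  // Hypothetical stand-in for the model's user factors: id -> factor vector.
  val userFactors: Map[Int, Array[Float]] = Map(
    1 -> Array(0.1f, 0.2f),
    2 -> Array(0.3f, 0.4f),
    3 -> Array(0.5f, 0.6f)
  )

  // Inner-join the requested ids against the known factors.
  // With dedupe = false, a repeated id yields a repeated output row.
  // Ids missing from the model produce no row (analogous to "drop").
  def selectFactors(ids: Seq[Int], dedupe: Boolean): Seq[(Int, Array[Float])] = {
    val requested = if (dedupe) ids.distinct else ids
    requested.flatMap(id => userFactors.get(id).map(f => id -> f))
  }

  def main(args: Array[String]): Unit = {
    // Id 1 is duplicated; id 99 is not contained in the model.
    val requested = Seq(1, 1, 3, 99)
    println(selectFactors(requested, dedupe = false).map(_._1)) // List(1, 1, 3)
    println(selectFactors(requested, dedupe = true).map(_._1))  // List(1, 3)
  }
}
```

Deduplicating up front costs an extra pass over the ids, which is the inefficiency Note 1 accepts in exchange for correct output.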