GitHub user MLnick commented on the issue:

https://github.com/apache/spark/pull/17090

For performance tests, I've been using the MovieLens `ml-latest` dataset [here](https://grouplens.org/datasets/movielens/). It has `24,404,096` ratings with `259,137` users and `39,443` movies. So it's not enormous, but "recommend all" does a lot of work: it generates `10,221,140,691` raw predicted ratings (`259,137 × 39,443`) before the `top-k`.

A quick test of the existing `recommendProductsForUsers` gives `306 sec`:

```
scala> spark.time { oldModel.recommendProductsForUsers(k).count }
Time taken: 306512 ms
res11: Long = 259137
```

As part of my performance testing I've tried a few approaches roughly similar to this PR, but using `Window` and `filter` rather than this top-k aggregator (which is a neat idea).

At first I thought this PR was really good:

```
scala> spark.time { newModel.recommendForAllUsers(k).count }
Time taken: 151504 ms
res3: Long = 259137
```

`151 sec` seems fast! But then I tried this:

```
scala> spark.time { newModel.recommendForAllUsers(k).show }
+------+--------------------+
|userId|     recommendations|
+------+--------------------+
| 35982|[[131382,15.53116...|
| 67782|[[131382,29.72169...|
| 82672|[[132954,12.19152...|
|155042|[[148954,16.09084...|
|167532|[[118942,13.94282...|
|168802|[[27212,11.881494...|
|216112|[[109159,25.46359...|
|243392|[[153010,9.85302]...|
|255132|[[131382,15.50626...|
|255362|[[131382,10.08476...|
| 17389|[[152711,16.09958...|
|120899|[[156956,12.61003...|
|213089|[[82055,13.293286...|
|253769|[[152711,16.57459...|
|258129|[[152711,22.50499...|
| 24347|[[152711,12.31282...|
| 35947|[[153184,11.04110...|
|103357|[[132954,13.26898...|
|130557|[[118942,14.00168...|
|156017|[[153010,12.24449...|
+------+--------------------+
only showing top 20 rows

Time taken: 672524 ms
```

`672 sec`, over 2x slower than the `mllib` implementation. I'm not sure why `count` is so much faster than `show`. Perhaps Spark SQL skips some of the actual computation for `count`, whereas `show` needs all of it?
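
To put the "recommend all" workload mentioned above in numbers: assuming every user is scored against every movie, there is one raw prediction per (user, movie) pair. The sketch below computes that pair count from the dataset sizes quoted in the comment. The object and helper names are hypothetical and not part of the PR. It also shows that the same multiplication done in 32-bit `Int` arithmetic silently wraps around to 1,631,206,099, so counts at this scale are best held in a `Long`.

```scala
object RecommendAllSize {
  // Dataset sizes quoted in the comment for MovieLens ml-latest.
  val numUsers: Int = 259137
  val numItems: Int = 39443

  // Widen to Long before multiplying so the full product is kept.
  def pairsAsLong(users: Int, items: Int): Long = users.toLong * items.toLong

  // Int * Int on the JVM wraps modulo 2^32 without raising an error.
  def pairsAsInt(users: Int, items: Int): Int = users * items

  def main(args: Array[String]): Unit = {
    println(s"Long product: ${pairsAsLong(numUsers, numItems)}") // 10221140691
    println(s"Int product:  ${pairsAsInt(numUsers, numItems)}")  // 1631206099
  }
}
```

Roughly 10.2 billion scored pairs, reduced to `k` recommendations for each of the `259,137` users, is the amount of work behind each of the timings above.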