Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    For performance tests, I've been using the MovieLens `ml-latest` dataset 
[here](https://grouplens.org/datasets/movielens/). It has `24,404,096` ratings 
with `259,137` users and `39,443` movies.
    
    So it's not enormous, but "recommend all" still does a lot of work - generating `1,631,206,099` raw predicted ratings before the `top-k` selection.
    
    A quick test of the existing `recommendProductsForUsers` gives `306 sec`:
    ```
    scala> spark.time { oldModel.recommendProductsForUsers(k).count }
    Time taken: 306512 ms
    res11: Long = 259137
    ```
    
    As part of my performance testing I've tried a few approaches roughly similar to this PR, but using `Window` and `filter` rather than this top-k aggregator (which is a neat idea); a sketch of that variant is below.
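    For reference, here's a minimal sketch of the `Window`-and-`filter` variant. The `scored` DataFrame and its column names are illustrative - it stands in for the raw (user, item, prediction) scores before any top-k:
    ```
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, row_number}

    val k = 10
    // Rank each user's candidate items by predicted rating, descending.
    val byScore = Window.partitionBy(col("userId")).orderBy(col("prediction").desc)
    val topK = scored
      .withColumn("rank", row_number().over(byScore))
      .filter(col("rank") <= k)   // keep only the k best items per user
      .drop("rank")
    ```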
    
    At first glance this PR looked really good: 
    ```
    scala> spark.time { newModel.recommendForAllUsers(k).count }
    Time taken: 151504 ms
    res3: Long = 259137
    ```
    
    `151 sec` seems fast!
    
    But then I tried this: 
    ```
    scala> spark.time { newModel.recommendForAllUsers(k).show }
    +------+--------------------+
    |userId|     recommendations|
    +------+--------------------+
    | 35982|[[131382,15.53116...|
    | 67782|[[131382,29.72169...|
    | 82672|[[132954,12.19152...|
    |155042|[[148954,16.09084...|
    |167532|[[118942,13.94282...|
    |168802|[[27212,11.881494...|
    |216112|[[109159,25.46359...|
    |243392|[[153010,9.85302]...|
    |255132|[[131382,15.50626...|
    |255362|[[131382,10.08476...|
    | 17389|[[152711,16.09958...|
    |120899|[[156956,12.61003...|
    |213089|[[82055,13.293286...|
    |253769|[[152711,16.57459...|
    |258129|[[152711,22.50499...|
    | 24347|[[152711,12.31282...|
    | 35947|[[153184,11.04110...|
    |103357|[[132954,13.26898...|
    |130557|[[118942,14.00168...|
    |156017|[[153010,12.24449...|
    +------+--------------------+
    only showing top 20 rows
    
    Time taken: 672524 ms
    ```
    
    `672 sec` - over 2x slower than the `mllib` implementation.
    
    Not sure why `count` is so much faster than `show` - maybe Spark SQL can skip some of the actual compute for `count`, while `show` has to fully materialize the recommendations?
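    One way to check that guess (same `newModel` and `k` as above) would be to force every row to be materialized, e.g. via `.rdd.count`, and to compare the physical plans:
    ```
    // Unlike Dataset.count, converting to an RDD and counting has to
    // actually compute the recommendation arrays for every user.
    spark.time { newModel.recommendForAllUsers(k).rdd.count }

    // Inspect the physical plan to see what Catalyst does differently.
    newModel.recommendForAllUsers(k).explain()
    ```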


