[ https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221379#comment-14221379 ]

Debasish Das commented on SPARK-3066:
-------------------------------------

I ran experiments on the MovieLens dataset with varying rank, on a local Spark instance with 4 GB RAM and 4 cores, to see how much MAP improves as the rank is scaled up.

=== rank=10 (default)
Got 1000209 ratings from 6040 users on 3706 movies.
Training: 799747, test: 200462.
Test RMSE = 0.8528377625407709.
Test users 6036 MAP 0.03851426277536059
Runtime: 30 s

=== rank=25
Got 1000209 ratings from 6040 users on 3706 movies.
Training: 800417, test: 199792.
Test RMSE = 0.8518001349769724.
Test users 6037 MAP 0.04508057348514959
Runtime: 30 s

=== rank=50
Got 1000209 ratings from 6040 users on 3706 movies.
Training: 800823, test: 199386.
Test RMSE = 0.8487416471685229.
Test users 6038 MAP 0.05145126538369158
Runtime: 42 s

=== rank=100
Got 1000209 ratings from 6040 users on 3706 movies.
Training: 800720, test: 199489.
Test RMSE = 0.8508095863317275.
Test users 6033 MAP 0.0561225429735388
Runtime: 1.5 min

=== rank=150
Got 1000209 ratings from 6040 users on 3706 movies.
Training: 800257, test: 199952.
Test RMSE = 0.8435902056186158.
Test users 6035 MAP 0.05855252471878818
Runtime: 3.6 min

=== rank=200
Got 1000209 ratings from 6040 users on 3706 movies.
Training: 800356, test: 199853.
Test RMSE = 0.8452385688272382.
Test users 6037 MAP 0.059176892052172934
Runtime: 7.4 min

I will also run the MovieLens 10M and Netflix datasets with varying ranks and report those numbers. I need to run them on a cluster.

> Support recommendAll in matrix factorization model
> --------------------------------------------------
>
>                 Key: SPARK-3066
>                 URL: https://issues.apache.org/jira/browse/SPARK-3066
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Xiangrui Meng
>
> ALS returns a matrix factorization model, which we can use to predict ratings
> for individual queries as well as small batches. In practice, users may want
> to compute top-k recommendations offline for all users.
> It is very expensive
> but a common problem. We can do some optimization like
> 1) collect one side (either user or product) and broadcast it as a matrix
> 2) use level-3 BLAS to compute inner products
> 3) use Utils.takeOrdered to find top-k
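
The three steps listed in the issue description can be sketched roughly as below. This is a single-machine, plain-Python illustration under stated assumptions, not the MLlib implementation: the function name recommend_all, the block_size parameter, and the dict-based factor layout are hypothetical choices made for this example. The nested-loop dot products stand in for a level-3 BLAS (GEMM) call, and heapq.nlargest stands in for Utils.takeOrdered.

```python
import heapq


def recommend_all(user_factors, item_factors, k, block_size=2):
    """Return the top-k items for every user.

    user_factors / item_factors: dict mapping an id to a list of floats
    (one latent-factor vector per id). This layout is an assumption made
    for the sketch, not the layout MLlib uses.
    """
    # Step 1: collect one side (here: items) into a dense matrix.
    # In Spark this would be the side that gets broadcast to the workers.
    item_ids = list(item_factors)
    item_matrix = [item_factors[item_id] for item_id in item_ids]

    user_ids = list(user_factors)
    result = {}

    # Step 2: score one block of users against all items at a time.
    # The block-by-matrix product below is what a GEMM call would compute.
    for start in range(0, len(user_ids), block_size):
        block = user_ids[start:start + block_size]
        scores = [
            [sum(u * v for u, v in zip(user_factors[uid], item_row))
             for item_row in item_matrix]
            for uid in block
        ]

        # Step 3: keep only the k best items per user with a bounded heap,
        # instead of sorting the full score row.
        for uid, row in zip(block, scores):
            best = heapq.nlargest(k, range(len(item_ids)), key=row.__getitem__)
            result[uid] = [(item_ids[j], row[j]) for j in best]

    return result


if __name__ == "__main__":
    # Toy factors, invented purely for illustration.
    users = {"u1": [1.0, 0.0], "u2": [0.0, 1.0], "u3": [1.0, 1.0]}
    items = {"a": [2.0, 0.0], "b": [0.0, 3.0], "c": [1.0, 1.0]}
    for uid, recs in recommend_all(users, items, k=2).items():
        print(uid, recs)
```

Processing users in blocks keeps the intermediate score matrix at block_size x (number of items) rather than (number of users) x (number of items), which is the main reason to batch the products before selecting the top k.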