[ https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218667#comment-14218667 ]

Debasish Das edited comment on SPARK-3066 at 11/19/14 10:59 PM:
----------------------------------------------------------------

[~mengxr] as per our discussions, I added APIs for batch user and product 
recommendation, and MAP computation for recommending topK products to users.

Note that I don't use reservoir sampling; instead I used your idea of filtering 
out the test-set users for which no model was built. I thought reservoir 
sampling should be part of a separate PR.

APIs added:

recommendProductsForUsers(num: Int): topK is fixed for all users
recommendProductsForUsers(userTopK: RDD[(Int, Int)]): variable topK for every 
user

recommendUsersForProducts(num: Int): topK is fixed for all products
recommendUsersForProducts(productTopK: RDD[(Int, Int)]): variable topK for 
every product

I used mllib BLAS for all the computation and cleaned the DoubleMatrix code out 
of MatrixFactorizationModel. I have not used level-3 BLAS yet; I can add that 
as well if the rest of the flow makes sense.
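For reference, the variable-topK semantics of recommendProductsForUsers(userTopK) can be sketched in plain Scala. This is a standalone approximation, not the PR code: the object and the Map-based signatures here are hypothetical, whereas the mllib version keeps the factors in RDDs and does the dot products via BLAS.

```scala
// Hypothetical standalone sketch: each user carries its own k.
object VariableTopK {
  // Plain inner product; mllib would call BLAS dot here.
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => a(i) * b(i)).sum

  def recommendProductsForUsers(
      userFactors: Map[Int, Array[Double]],
      productFactors: Map[Int, Array[Double]],
      userTopK: Map[Int, Int]): Map[Int, Seq[(Int, Double)]] =
    userTopK.map { case (user, k) =>
      val uf = userFactors(user)
      val ranked = productFactors.toSeq
        .map { case (p, pf) => (p, dot(uf, pf)) } // score every product
        .sortBy(-_._2)                            // highest predicted rating first
        .take(k)                                  // per-user cutoff
      user -> ranked
    }
}
```

The fixed-topK overload recommendProductsForUsers(num: Int) is then just the special case where every user maps to the same k.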

On examples.MovieLensALS we can activate the user MAP calculation using the 
--validateRecommendation flag:

./bin/spark-submit --master spark://localhost:7077 --jars scopt_2.10-3.2.0.jar 
--total-executor-cores 4 --executor-memory 4g --driver-memory 1g --class 
org.apache.spark.examples.mllib.MovieLensALS 
./examples/target/spark-examples_2.10-1.3.0-SNAPSHOT.jar --kryo --lambda 0.065 
--validateRecommendation hdfs://localhost:8020/sandbox/movielens/

Got 1000209 ratings from 6040 users on 3706 movies.
Training: 799617, test: 200592.
Test RMSE = 0.8495476608536306.
Test users 6032 MAP 0.03798337814233403
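The MAP number above is the mean, over the retained test users, of the average precision of each user's ranked topK list against that user's held-out items. A minimal standalone sketch of the metric in plain Scala (the object and signatures are hypothetical; the PR presumably computes this over RDDs):

```scala
// Hypothetical sketch of MAP over topK recommendation lists.
object MeanAP {
  // Average precision of one user's ranked list against the relevant set.
  def averagePrecision(ranked: Seq[Int], relevant: Set[Int]): Double = {
    if (relevant.isEmpty) return 0.0
    var hits = 0
    var precSum = 0.0
    for ((item, idx) <- ranked.zipWithIndex if relevant(item)) {
      hits += 1
      precSum += hits.toDouble / (idx + 1) // precision at this rank
    }
    precSum / math.min(relevant.size, ranked.size)
  }

  // Mean over the users present in the test set; users with no trained
  // model are assumed filtered out upstream, as described in the comment.
  def meanAP(recs: Map[Int, Seq[Int]], test: Map[Int, Set[Int]]): Double = {
    val users = test.keys.toSeq
    users.map(u => averagePrecision(recs.getOrElse(u, Nil), test(u))).sum / users.size
  }
}
```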

I will run the Netflix dataset and report the MAP measures for it.

On our internal datasets, I have tested with 1M users, 10K products, 120 cores, 
and 240 GB of memory, computing topK users for each product; that takes around 
5 minutes. On average I generate a ranked list of 6000 users for each product. 
Internally we are basically using the batch API:

recommendUsersForProducts(productTopK: RDD[(Int, Int)]): variable topK for 
every product



> Support recommendAll in matrix factorization model
> --------------------------------------------------
>
>                 Key: SPARK-3066
>                 URL: https://issues.apache.org/jira/browse/SPARK-3066
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Xiangrui Meng
>
> ALS returns a matrix factorization model, which we can use to predict ratings 
> for individual queries as well as small batches. In practice, users may want 
> to compute top-k recommendations offline for all users. It is very expensive 
> but a common problem. We can do some optimization like
> 1) collect one side (either user or product) and broadcast it as a matrix
> 2) use level-3 BLAS to compute inner products
> 3) use Utils.takeOrdered to find top-k
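The three steps above can be sketched in plain Scala on local arrays. This is a hypothetical stand-in, not the mllib implementation: `products` plays the role of the broadcast side, the per-row inner-product loop is what a level-3 BLAS gemm would compute in one batched call, and a bounded sort stands in for Utils.takeOrdered.

```scala
// Hypothetical local sketch of offline top-k recommendation for all users.
object RecommendAll {
  // userBlock: the user-factor rows held by one partition
  // products:  the collected product-factor matrix (the broadcast side)
  def topK(userBlock: Array[Array[Double]],
           products: Array[Array[Double]],
           k: Int): Array[Seq[(Int, Double)]] =
    userBlock.map { u =>
      val scores = products.zipWithIndex.map { case (p, j) =>
        (j, u.zip(p).map { case (a, b) => a * b }.sum) // one inner product
      }
      scores.sortBy(-_._2).take(k).toSeq               // takeOrdered stand-in
    }
}
```

In Spark, each partition of user factors would run this against the broadcast product matrix, so the expensive all-pairs scoring never shuffles the factor data.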



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
