Hi,

I am trying to develop a recommender system for about 1 million users and 10
thousand items. Currently it's a simple regression-based model: for every
(user, item) pair in the dataset we generate some features and learn a model
from them. Training and evaluation are fine; the bottleneck is prediction and
ranking for deployment, since at the end of the day we need to recommend each
user their top 10 personalized items. To do this, for every user I need to use
the model to predict their rating/preference for all items and take the top 10
items from that list. Hence, after learning the model I need to do 10K x
1 million predictions (model.predict(featureVector)).

Currently I have the following process; the feature vectors are sparse and of
length ~300 each.

1. userFeatures: RDD[(Int, Vector)], itemFeatures: RDD[(Int, Vector)]

2. I take the cartesian product of the two to generate every (user, item)
combination and its corresponding feature vector:
val allUIFeat: RDD[(Int, Int, Vector)] =
  userFeatures.cartesian(itemFeatures).map(...)

3. Then I use the model to do prediction as follows:
val allUIPred: RDD[(Int, Int, Double)] =
  allUIFeat.map { x => (x._1, x._2, model.predict(x._3)) }

4. Then we group by user and sort to get the top 10 items.
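
For concreteness, the four steps above can be sketched end to end as follows.
This is only a sketch: it assumes `userFeatures`, `itemFeatures`, and a trained
`model` with a `predict(Vector): Double` method are already in scope, and the
`combine` helper stands in for whatever the real `map(...)` in step 2 does to
merge user and item features.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Hypothetical stand-in for the feature-combination logic inside the
// original map(...): here it simply concatenates the two dense arrays.
def combine(u: Vector, i: Vector): Vector =
  Vectors.dense(u.toArray ++ i.toArray)

// Step 2: every (user, item) pair with its combined feature vector.
val allUIFeat: RDD[(Int, Int, Vector)] =
  userFeatures.cartesian(itemFeatures).map {
    case ((uid, uFeat), (iid, iFeat)) => (uid, iid, combine(uFeat, iFeat))
  }

// Step 3: score every pair with the already-trained model.
val allUIPred: RDD[(Int, Int, Double)] =
  allUIFeat.map { case (uid, iid, feat) => (uid, iid, model.predict(feat)) }

// Step 4: group by user and keep each user's 10 highest-scoring items.
val top10: RDD[(Int, Seq[(Int, Double)])] =
  allUIPred
    .map { case (uid, iid, score) => (uid, (iid, score)) }
    .groupByKey()
    .mapValues(_.toSeq.sortBy(-_._2).take(10))
```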

We are not able to get past step 3: it takes a really long time (~5 hrs) to
compute all the predictions, which seems excessive considering we already have
the model and it just needs to do some computation per prediction. I have tried
partitioning userFeatures across 800 partitions before doing the above steps,
but it was of no help.

I am using about 100 executors, each with 2 cores and 2 GB of RAM.

Are there any suggestions to make these predictions faster?

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Large-scale-ranked-recommendation-tp10098.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
