Hi, I am trying to develop a recommender system for about 1 million users and 10 thousand items. Currently it is a simple regression-based model: for every (user, item) pair in the dataset we generate some features and learn a model from them. Training and evaluation are fine; the bottleneck is prediction and ranking for deployment, since at the end of the day we need to recommend each user their top 10 personalized items. To do this, for every user I need the model to predict their rating/preference on all items and take the top 10 from that list. Hence, after learning the model, I need to do 10K x 1 million predictions (i.e., roughly 10 billion calls to model.predict(featureVector)).
Currently I have the following process; the feature vectors are sparse and of length ~300 each:

1. userFeatures: RDD[(Int, Vector)], itemFeatures: RDD[(Int, Vector)]

2. I take the cartesian product of the above to generate every (user, item) combination and its corresponding feature vector:
   val allUIFeat: RDD[(Int, Int, Vector)] = userFeatures.cartesian(itemFeatures).map(...)

3. Then I use the model to do prediction as follows:
   val allUIPred: RDD[(Int, Int, Double)] = allUIFeat.map { x => (x._1, x._2, model.predict(x._3)) }

4. Then we group by user and sort to get the top 10 items per user.

We are not able to complete step 3 above: it is taking a really long time (~5 hours) to get all the predictions, which seems very long given that we already have the model and it just needs to do some computation per prediction. I have tried repartitioning userFeatures across 800 partitions before doing the above steps, but it did not help. I am using about 100 executors, each with 2 cores and 2 GB of RAM. Are there any suggestions to make these predictions fast?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Large-scale-ranked-recommendation-tp10098.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
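For reference, the per-user scoring and top-10 selection in the steps above can be sketched in plain Scala (no Spark, just local collections). This is only an illustration of the logic, not the actual pipeline: it assumes a hypothetical linear model whose prediction is a dot product of a weight vector with the features, and a hypothetical combine step that concatenates the user and item vectors (the real feature construction inside the .map(...) is not shown in the post):

```scala
// Local sketch of steps 1-4: score every (user, item) pair and keep the
// top-k items per user. Dense arrays stand in for the sparse vectors.
object TopKSketch {
  type Vec = Array[Double]

  // Hypothetical linear model: prediction = dot(weights, features).
  // Stands in for model.predict(featureVector).
  def predict(weights: Vec, features: Vec): Double =
    weights.zip(features).map { case (w, f) => w * f }.sum

  // Hypothetical feature construction for a (user, item) pair:
  // here simply the concatenation of the two vectors.
  def combine(u: Vec, i: Vec): Vec = u ++ i

  // For every user, score all items and keep the k highest-scoring ones.
  def topK(weights: Vec,
           userFeatures: Map[Int, Vec],
           itemFeatures: Map[Int, Vec],
           k: Int): Map[Int, Seq[(Int, Double)]] =
    userFeatures.map { case (uid, uVec) =>
      val scored = itemFeatures.toSeq.map { case (iid, iVec) =>
        (iid, predict(weights, combine(uVec, iVec)))
      }
      uid -> scored.sortBy(-_._2).take(k)
    }
}
```

The Spark version in steps 2-3 distributes exactly this nested loop via cartesian + map; the cost is the same number of predict calls, plus the shuffle that cartesian introduces.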