Apache Spark ALS recommendations approach

2015-03-18 Thread Aram Mkrtchyan
Trying to build a recommendation system using Spark MLlib's ALS. Currently, we're trying to pre-build recommendations for all users on a daily basis. We're using simple implicit feedback and ALS. The problem is, we have 20M users and 30M products, and to call the main predict() method, we need to
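
[Editor's note: a minimal sketch of the setup described above, assuming implicit-feedback counts in an RDD of Rating; the rank, iteration count, lambda, and alpha values are illustrative placeholders, not values from the thread.]

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    object ImplicitAlsSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[*]", "implicit-als-sketch")

        // Toy implicit-feedback data: (user, product, interaction count).
        // In the thread's setting this would be 20M users and 30M products.
        val ratings = sc.parallelize(Seq(
          Rating(1, 10, 3.0), Rating(1, 20, 1.0),
          Rating(2, 10, 5.0), Rating(2, 30, 2.0)))

        // trainImplicit treats each value as a confidence weight on an
        // observed interaction, not as an explicit rating.
        // Args: ratings, rank, iterations, lambda, alpha.
        val model = ALS.trainImplicit(ratings, 10, 10, 0.01, 1.0)

        // Scoring a single (user, product) pair.
        println(model.predict(1, 30))

        sc.stop()
      }
    }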

Re: Apache Spark ALS recommendations approach

2015-03-18 Thread gen tang
Hi, if you do a cartesian join to predict users' preferences over all the products, I think that 8 nodes with 64GB RAM would not be enough for the data. Recently, I used ALS for a similar situation, but with just 10M users and 0.1M products; the minimum requirement was 9 nodes with 10GB RAM. Moreover,

Re: Apache Spark ALS recommendations approach

2015-03-18 Thread Aram Mkrtchyan
Thanks much for your reply. By saying on the fly, do you mean caching the trained model and querying it for each user, joined with 30M products, when needed? Our question is more about the general approach: what if we have 7M DAU? How do companies deal with that using Spark? On Wed, Mar 18, 2015

Re: Apache Spark ALS recommendations approach

2015-03-18 Thread Sean Owen
Not just the join: this means you're trying to compute 600 trillion dot products. It will never finish fast. Basically: don't do this :) In general you don't compute all recommendations for all users; you recompute for a small subset of users that were or are likely to be active soon. (Or
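
[Editor's note: a hedged sketch of the subset approach Sean describes, assuming a model trained as in the earlier sketch. The name recommendForActive and the activeUserIds input are hypothetical; MatrixFactorizationModel.recommendProducts scores one user against the product factors and keeps only the top k, so the full users x products cartesian is never materialized.]

    import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}

    // `activeUserIds` stands in for the small subset of users that
    // were or are likely to be active soon, per Sean's advice.
    def recommendForActive(model: MatrixFactorizationModel,
                           activeUserIds: Seq[Int],
                           topK: Int = 10): Seq[(Int, Array[Rating])] =
      activeUserIds.map(u => u -> model.recommendProducts(u, topK))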

Apache Spark ALS recommendations approach

2015-03-18 Thread Aram
)
val usersProducts = users.cartesian(products)
val recommendations = model.predict(usersProducts)

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-ALS-recommendations-approach-tp22116.html

Re: Apache Spark ALS recommendations approach

2015-03-18 Thread Sean Owen
I don't think you need to put the whole joined data set in memory. However, memory is unlikely to be the limiting factor; it's the massive shuffle. OK, you really do have a large recommendation problem if you're recommending for at least 7M users per day! My hunch is that it won't be

Re: Apache Spark ALS recommendations approach

2015-03-18 Thread Debasish Das
There is also a batch prediction API in PR https://github.com/apache/spark/pull/3098. The idea here is what Sean said: don't try to reconstruct the whole matrix, which will be dense, but pick a set of users and calculate top-k recommendations for them using dense level-3 BLAS. We are going to merge
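
[Editor's note: not the API from the PR itself, but a hedged sketch of the blocked level-3 BLAS idea Debasish describes, using Breeze (already an MLlib dependency). The helper name topKForBlock and the matrix layouts are assumptions: one GEMM scores a whole block of users against every product, and only the top-k column indices per row are kept.]

    import breeze.linalg.DenseMatrix

    // userBlock:      usersInBlock x rank  (rows from model.userFeatures)
    // productFactors: numProducts  x rank  (rows from model.productFeatures)
    def topKForBlock(userBlock: DenseMatrix[Double],
                     productFactors: DenseMatrix[Double],
                     k: Int): Array[Array[Int]] = {
      // One dense GEMM (level-3 BLAS) replaces millions of dot products.
      val scores = userBlock * productFactors.t  // usersInBlock x numProducts
      Array.tabulate(scores.rows) { i =>
        // Keep only the k best product indices for user row i.
        scores(i, ::).t.toArray.zipWithIndex.sortBy(-_._1).take(k).map(_._2)
      }
    }

Sorting each full row is the simplest thing to write; a bounded priority queue per row would avoid the O(n log n) sort, at the cost of a longer sketch.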

Re: Apache Spark ALS recommendations approach

2015-03-18 Thread Aram Mkrtchyan
Thanks, gen, for the helpful post. Thank you Sean, we're currently exploring this world of recommendations with Spark, and your posts are very helpful to us. We've noticed that you're a co-author of Advanced Analytics with Spark; just not to get too deep into off-topic, will it be finished soon? On Wed,