Apache Spark ALS recommendations approach

2015-03-18 Thread Aram Mkrtchyan
Trying to build a recommendation system using Spark MLlib's ALS.

Currently, we're trying to pre-build recommendations for all users on a daily
basis. We're using simple implicit feedback and ALS.

The problem is that we have 20M users and 30M products, and to call the main
predict() method we need the cartesian join of users and products, which is
far too large; it may take days just to generate the join. Is there a way to
avoid the cartesian join and make the process faster?

Currently we have 8 nodes with 64GB of RAM each; I think that should be
enough for the data.

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.{ALS, Rating}

val users: RDD[Int] = ???       // RDD with 20M userIds
val products: RDD[Int] = ???    // RDD with 30M productIds
val ratings: RDD[Rating] = ???  // RDD with all user-product feedback

// Train an implicit-feedback ALS model on the raw feedback
val model = new ALS().setRank(10).setIterations(10)
  .setLambda(0.0001).setImplicitPrefs(true)
  .setAlpha(40).run(ratings)

// Score every user against every product (20M x 30M pairs)
val usersProducts = users.cartesian(products)
val recommendations = model.predict(usersProducts)


Re: Apache Spark ALS recommendations approach

2015-03-18 Thread gen tang
Hi,

If you do a cartesian join to predict users' preferences over all the
products, I think that 8 nodes with 64GB of RAM would not be enough for the
data. Recently I used ALS for a similar situation with just 10M users and
0.1M products, and the minimum requirement was 9 nodes with 10GB of RAM.
Moreover, even if the job succeeds, the processing time will be very long.
Maybe you should try to reduce the set of products to predict for each
client; in practice, you never need to predict preferences over all products
to make a recommendation.
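
For what it's worth, a minimal sketch of that idea, assuming you can build a
small per-user candidate set (the candidates RDD and the helper below are
hypothetical, not part of MLlib; only predict() on an RDD of (user, product)
pairs is the real API):

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}

// Score each user only against a small candidate set (e.g. products from
// categories the user has interacted with), not against all 30M products.
def recommendFromCandidates(model: MatrixFactorizationModel,
                            candidates: RDD[(Int, Int)], // (userId, productId)
                            k: Int = 10): RDD[(Int, Seq[Rating])] =
  model.predict(candidates)        // same predict() call, far fewer pairs
    .groupBy(_.user)
    .mapValues(_.toSeq.sortBy(-_.rating).take(k))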

Hope this will be helpful.

Cheers
Gen


Re: Apache Spark ALS recommendations approach

2015-03-18 Thread Aram Mkrtchyan
Thanks much for your reply.

By "on the fly", do you mean caching the trained model and querying it for
each user, against the 30M products, when needed?

Our question is more about the general approach: what if we have 7M DAU?
How do companies deal with that using Spark?


Re: Apache Spark ALS recommendations approach

2015-03-18 Thread Sean Owen
Not just the join; this means you're trying to compute 600 trillion dot
products. It will never finish fast. Basically: don't do this :) You don't,
in general, compute all recommendations for all users; you recompute them
for a small subset of users who were recently active or are likely to be
active soon. (Or compute them on the fly.) Is anything like that an option?
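
To make the "small subset of users" option concrete, here's a minimal
sketch. It assumes activeUsers (a hypothetical list of recently active user
ids) is small enough to loop over on the driver; recommendProducts is
MLlib's per-user method on the trained MatrixFactorizationModel:

import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}

// `model` is the MatrixFactorizationModel trained in the original post.
def recommendForActiveUsers(model: MatrixFactorizationModel,
                            activeUsers: Seq[Int],
                            k: Int = 10): Map[Int, Array[Rating]] =
  activeUsers.map { userId =>
    // recommendProducts scores one user against all products, keeping the top k
    userId -> model.recommendProducts(userId, k)
  }.toMap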


Apache Spark ALS recommendations approach

2015-03-18 Thread Aram
Hi all,

Trying to build a recommendation system using Spark MLlib's ALS.

Currently, we're trying to pre-build recommendations for all users on a daily
basis. We're using simple implicit feedback and ALS.

The problem is that we have 20M users and 30M products, and to call the main
predict() method we need the cartesian join of users and products, which is
far too large; it may take days just to generate the join. Is there a way to
avoid the cartesian join and make the process faster?

Currently we have 8 nodes with 64GB of RAM each; I think that should be
enough for the data.

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.{ALS, Rating}

val users: RDD[Int] = ???       // RDD with 20M userIds
val products: RDD[Int] = ???    // RDD with 30M productIds
val ratings: RDD[Rating] = ???  // RDD with all user-product feedback

// Train an implicit-feedback ALS model on the raw feedback
val model = new ALS().setRank(10).setIterations(10)
  .setLambda(0.0001).setImplicitPrefs(true)
  .setAlpha(40).run(ratings)

// Score every user against every product (20M x 30M pairs)
val usersProducts = users.cartesian(products)
val recommendations = model.predict(usersProducts)




Re: Apache Spark ALS recommendations approach

2015-03-18 Thread Sean Owen
I don't think you need enough memory to hold the whole joined data set in
memory. However, memory is unlikely to be the limiting factor; it's the
massive shuffle.

OK, you really do have a large recommendation problem if you're
recommending for at least 7M users per day!

My hunch is that it won't be fast enough to use the simple predict()
or recommendProducts() methods repeatedly. There was a proposal to add
a recommendAll() method which you could crib
(https://issues.apache.org/jira/browse/SPARK-3066), but that looks like
it is still a work in progress, since the point there was to do more work
to make it scale.

You may consider writing a bit of custom code to do the scoring. For
example, cache parts of the item-factor matrix in memory on the workers
and score user feature vectors in bulk against them.
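
A rough sketch of that idea (not an MLlib API; scoreBlock and the blocking
scheme are assumptions). model.userFeatures and model.productFeatures expose
the factor RDDs; with 30M items the full item-factor matrix is too big to
broadcast at once, so you would loop over item blocks and merge the
per-block top-k results:

import org.apache.spark.rdd.RDD
import org.apache.spark.broadcast.Broadcast

def scoreBlock(userFactors: RDD[(Int, Array[Double])],
               itemBlock: Broadcast[Array[(Int, Array[Double])]],
               k: Int): RDD[(Int, Array[(Int, Double)])] =
  userFactors.mapPartitions { users =>
    val items = itemBlock.value
    users.map { case (userId, uf) =>
      val scored = items.map { case (itemId, itf) =>
        // plain dot product between the user and item factor vectors
        var dot = 0.0
        var i = 0
        while (i < uf.length) { dot += uf(i) * itf(i); i += 1 }
        (itemId, dot)
      }
      (userId, scored.sortBy(-_._2).take(k))
    }
  }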

There's a different school of thought, which is to compute only what you
need, on the fly, and cache it if you like. That is good in that it doesn't
waste effort and keeps the results fresh, but, of course, it means building
or adopting some other system to do the scoring and getting *that* to run
fast.

You can also look into techniques like LSH for probabilistically guessing
which tiny subset of all items is worth considering, but that too requires
building more code.
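
A very rough sketch of what that could look like with random-hyperplane LSH
over the factor vectors (cosine similarity as a stand-in for the dot
product; nothing here is an MLlib API, and the number of hyperplanes is a
made-up parameter):

import scala.util.Random

// `model` is the trained MatrixFactorizationModel; its productFeatures is
// an RDD[(Int, Array[Double])] of (productId, factor vector).
val rank = 10            // must match the ALS rank
val numHyperplanes = 16
val rng = new Random(42)
val hyperplanes: Array[Array[Double]] =
  Array.fill(numHyperplanes)(Array.fill(rank)(rng.nextGaussian()))

// Hash a factor vector to a bit signature: one bit per hyperplane.
def signature(v: Array[Double]): Int =
  hyperplanes.zipWithIndex.foldLeft(0) { case (sig, (h, i)) =>
    val dot = h.zip(v).map { case (a, b) => a * b }.sum
    if (dot >= 0) sig | (1 << i) else sig
  }

// Bucket the item factors by signature; a user is then scored only against
// the items that share its bucket (plus, in practice, nearby buckets).
val itemBuckets = model.productFeatures.groupBy { case (_, f) => signature(f) }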

I'm sure a couple of people could chime in on that here, but it's kind of
a separate topic.


Re: Apache Spark ALS recommendations approach

2015-03-18 Thread Debasish Das
There is also a batch prediction API in PR
https://github.com/apache/spark/pull/3098

The idea here is what Sean said: don't try to reconstruct the whole matrix,
which will be dense, but pick a set of users and calculate top-k
recommendations for them using dense level-3 BLAS. We are going to merge
this for 1.4. It is also useful in general for cross-validating on a prec@k
measure to tune the model.

Right now it uses level-1 BLAS, but the next extension is to use level-3
BLAS to make the computation even faster.
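
For anyone curious what the block-at-a-time version looks like, here is a
rough sketch (my own, not the code in the PR), assuming Breeze, which ships
with MLlib, is on the classpath. Each call multiplies one block of user
factors against one block of item factors with a single dense matrix-matrix
product (level-3 BLAS via Breeze/netlib) and keeps the top k per user; in
practice you would pair up blocked factor RDDs and merge the per-block
results:

import breeze.linalg.DenseMatrix

def topKForBlocks(userBlock: Array[(Int, Array[Double])],  // (userId, factors)
                  itemBlock: Array[(Int, Array[Double])],  // (productId, factors)
                  k: Int): Array[(Int, Array[(Int, Double)])] = {
  val rank = userBlock.head._2.length
  // Rows = users / items, columns = latent factors.
  val u = DenseMatrix.tabulate(userBlock.length, rank)((i, j) => userBlock(i)._2(j))
  val p = DenseMatrix.tabulate(itemBlock.length, rank)((i, j) => itemBlock(i)._2(j))
  val scores = u * p.t   // one gemm call: a (numUsers x numItems) score block
  userBlock.indices.map { i =>
    val row = itemBlock.indices.map(j => (itemBlock(j)._1, scores(i, j)))
    (userBlock(i)._1, row.sortBy(-_._2).take(k).toArray)
  }.toArray
}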


Re: Apache Spark ALS recommendations approach

2015-03-18 Thread Aram Mkrtchyan
Thanks, Gen, for the helpful post.

Thank you, Sean. We're currently exploring this world of recommendations
with Spark, and your posts are very helpful to us.
We've noticed that you're a co-author of Advanced Analytics with Spark; not
to go too far off topic, but will it be finished soon?
