[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering

2017-03-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896190#comment-15896190
 ] 

Sean Owen commented on SPARK-6407:
--

I did some work on this, but it's not a paper or anything, just some code, in 
and around these bits of code, which try to compute new user/item updates on 
the fly:

https://github.com/OryxProject/oryx/blob/master/app/oryx-app/src/main/java/com/cloudera/oryx/app/speed/als/ALSSpeedModelManager.java#L198
https://github.com/OryxProject/oryx/blob/master/app/oryx-app-common/src/main/java/com/cloudera/oryx/app/als/ALSUtils.java

The choices about the semantics of the updates are in ALSUtils. If you dig into 
it, we can discuss offline and I can probably write more in the docs to make it 
clearer what's happening.


> Streaming ALS for Collaborative Filtering
> -
>
> Key: SPARK-6407
> URL: https://issues.apache.org/jira/browse/SPARK-6407
> Project: Spark
> Issue Type: New Feature
> Components: DStreams
> Reporter: Felix Cheung
> Priority: Minor
>
> Like MLLib's ALS implementation for recommendation, and applying to streaming.
> Similar to streaming linear regression, logistic regression, could we apply 
> gradient updates to batches of data and reuse existing MLLib implementation?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering

2017-03-05 Thread Daniel Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896177#comment-15896177
 ] 

Daniel Li commented on SPARK-6407:
--

bq. In practice fold-in works fine. Folding in a day or so of updates has been 
OK.
The question isn't RMSE but how it affects actual rankings of items in 
recommendations, and it takes a while before the effect of the approximation 
actually changes a rank.

Hmm, I see.  This would be something I'd be interested in implementing for 
Spark if there's need.  Are there implementations (or papers) of this you know 
of that I could look at?




[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering

2017-03-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896154#comment-15896154
 ] 

Sean Owen commented on SPARK-6407:
--

Computing one or two iterations per update, as in every time someone clicks 
on a product or something? No, that's way too slow. Each update would launch 
tens of large distributed jobs.

In practice fold-in works fine. Folding in a day or so of updates has been OK.
The question isn't RMSE but how it affects actual rankings of items in 
recommendations, and it takes a while before the effect of the approximation 
actually changes a rank. 




[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering

2017-03-04 Thread Daniel Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896033#comment-15896033
 ] 

Daniel Li commented on SPARK-6407:
--

Appreciate the quick reply, [~srowen].

Yeah, we'd be recomputing them, but not from scratch since we'd be starting 
with optimized _U_ and _V_.  It would likely take only one or two iterations 
before reconvergence.  Would this still be considered too expensive?

The thing I hesitate about regarding fold-in updating is that the assumption 
that only the corresponding user row and item row will change may be too 
simplifying (since, of course, there's a "rippling out" effect: all items the 
user rated previously need to be updated, then all users that rated any of those 
items would need updating, etc.).  Then again, even if we take this rippling 
into account the computation may not be too expensive, since a single update 
likely won't affect the RMSE enough to delay convergence.  (Though I haven't 
worked out the math showing this; it's just a hunch.)
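
To make the "rippling out" concern concrete, here is a minimal sketch (not from the thread) that lists which users and items would need re-solving after one new rating, hop by hop. The function name `affected_by_hop` and the toy ratings are hypothetical. Hop 0 corresponds to the fold-in assumption that only one user row and one item row change.

```python
from collections import defaultdict

# Toy (user, item) rating pairs; purely illustrative data.
RATINGS = [("u1", "a"), ("u1", "b"), ("u2", "b"),
           ("u2", "c"), ("u3", "c"), ("u4", "d")]


def affected_by_hop(ratings, user, item, hops):
    """Return (users, items) whose factor rows are touched within `hops`
    alternating steps after the new rating (user, item) arrives."""
    items_of = defaultdict(set)
    users_of = defaultdict(set)
    for u, i in list(ratings) + [(user, item)]:
        items_of[u].add(i)
        users_of[i].add(u)

    users, items = {user}, {item}
    for _ in range(hops):
        # Every item rated by an affected user depends on that user's row ...
        for u in list(users):
            items |= items_of[u]
        # ... and every user who rated an affected item depends on that item's row.
        for i in list(items):
            users |= users_of[i]
    return users, items


for hops in range(3):
    users, items = affected_by_hop(RATINGS, "u1", "e", hops)
    print(hops, len(users), len(items))
```

On this toy data the affected set grows from 1 user and 1 item at hop 0 to 3 users and 4 items at hop 2, while the disconnected pair (u4, d) is never reached. On real, densely connected data the set would be expected to cover most of both matrices within a few hops.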

Do you have any insights into this?




[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering

2017-03-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895605#comment-15895605
 ] 

Sean Owen commented on SPARK-6407:
--

How is it different from recomputing all of U and V?
Doing anything to all of the matrices is probably out of the question for an 
online update. 
The point of fold-in is to update only the two affected rows and make the 
simplifying assumption that nothing else changes, because it would be too 
expensive to recompute anything.
If you mean batch together enough to make it worthwhile to update, then yes at 
some point that's worth it, but it just reduces to re-running the batch 
algorithm for a few iterations again.
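
As a rough illustration of fold-in, the sketch below re-solves a single user row by regularized least squares while holding all item factors fixed. This is the textbook single-row ALS solve, not necessarily the semantics any particular implementation uses; the names `solve` and `fold_in_user` and the `reg` value are assumptions made for the example.

```python
def solve(a, b):
    """Solve the small dense system a x = b by Gauss-Jordan elimination
    with partial pivoting (adequate for a rank-k normal-equation matrix)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fold_in_user(item_factors, user_ratings, reg=0.1):
    """Recompute one user's factor row from that user's ratings only.

    Solves (sum_i y_i y_i^T + reg * I) x = sum_i r_i y_i, where y_i is the
    fixed factor row of each item the user rated. No other row is touched.
    """
    k = len(next(iter(item_factors.values())))
    a = [[reg if i == j else 0.0 for j in range(k)] for i in range(k)]
    b = [0.0] * k
    for item, rating in user_ratings.items():
        y = item_factors[item]
        for i in range(k):
            b[i] += rating * y[i]
            for j in range(k):
                a[i][j] += y[i] * y[j]
    return solve(a, b)


item_factors = {"a": [1.0, 0.0], "b": [0.0, 1.0]}  # hypothetical trained factors
new_row = fold_in_user(item_factors, {"a": 4.0, "b": 2.0}, reg=1.0)
```

The cost is one k-by-k solve per affected row, which is why it suits per-event updates. The price is the simplifying assumption discussed above: the item rows, and every other user row, stay as they were.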




[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering

2017-03-04 Thread Daniel Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895589#comment-15895589
 ] 

Daniel Li commented on SPARK-6407:
--

Reviving this thread since I'm interested in implementing streaming CF for 
Spark.

bq. Using ALS for online updates is expensive.

Recomputing the factor matrices _U_ and _V_ from scratch for every update would 
be terribly expensive, but what about keeping _U_ and _V_ around and simply 
recomputing another round or two after each new rating that comes in?  The 
algorithm would simply be continually following a moving optimum.  I can't 
imagine the RMSE changing much due to small updates if we use a convergence 
threshold _à la_ [Y. Zhou, et al., “Large-Scale Parallel Collaborative 
Filtering for the Netflix Prize”|http://dl.acm.org/citation.cfm?id=1424269] 
instead of a fixed number of iterations.

(In fact, since calculating _(U^T) * V_ would probably take a nontrivial slice 
of time, new updates that come in during a round of calculation could be 
"batched" into the next round of calculation, increasing efficiency.)
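
A minimal single-machine sketch of the warm-start idea, under loud assumptions: rank-1 factors (so each row update is a scalar division rather than a matrix solve), plain L2 regularization, and the regularized training loss as the stopping criterion instead of the probe-set RMSE used in the cited paper. All names here (`loss`, `als_sweep`, `als_until_converged`, `tol`) are hypothetical.

```python
def loss(ratings, u, v, reg):
    """Regularized squared error of the rank-1 model r ~ u[i] * v[j]."""
    err = sum((r - u[i] * v[j]) ** 2 for (i, j), r in ratings.items())
    return err + reg * (sum(x * x for x in u.values())
                        + sum(x * x for x in v.values()))


def als_sweep(ratings, u, v, reg):
    """One alternating pass: re-solve every user factor, then every item factor."""
    for i in u:
        num = sum(r * v[j] for (a, j), r in ratings.items() if a == i)
        den = reg + sum(v[j] ** 2 for (a, j) in ratings if a == i)
        u[i] = num / den
    for j in v:
        num = sum(r * u[i] for (i, b), r in ratings.items() if b == j)
        den = reg + sum(u[i] ** 2 for (i, b) in ratings if b == j)
        v[j] = num / den


def als_until_converged(ratings, u, v, reg=0.1, tol=1e-6, max_sweeps=100):
    """Sweep in place until the loss improves by less than `tol`.
    Returns the number of sweeps performed."""
    prev = loss(ratings, u, v, reg)
    for sweep in range(1, max_sweeps + 1):
        als_sweep(ratings, u, v, reg)
        cur = loss(ratings, u, v, reg)
        if prev - cur < tol:
            return sweep
        prev = cur
    return max_sweeps


ratings = {("u1", "a"): 5.0, ("u1", "b"): 3.0, ("u2", "a"): 4.0, ("u2", "b"): 2.0}
u = {"u1": 1.0, "u2": 1.0}
v = {"a": 1.0, "b": 1.0}
cold_sweeps = als_until_converged(ratings, u, v)   # initial batch fit

# New ratings that arrived during the previous round are batched in,
# and the existing factors are reused as the starting point.
ratings[("u2", "c")] = 1.0
v["c"] = 1.0
warm_sweeps = als_until_converged(ratings, u, v)
```

Note that every sweep still touches all rows of both factors. In a distributed setting that is a full pass over the data, so the sketch says nothing about whether one or two such sweeps per update would be affordable.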

Thoughts?




[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering

2015-04-06 Thread Burak Yavuz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481874#comment-14481874
 ] 

Burak Yavuz commented on SPARK-6407:


I actually worked on this over the weekend for fun and have a streaming, 
gradient descent based, matrix factorization model implemented here: 
https://github.com/brkyvz/streaming-matrix-factorization

It is a very naive implementation, but it might be something to work on top of. 
I will publish a Spark Package for it as soon as I get the tests in. The model 
it uses for predicting ratings for user `u` and product `p` is:
{code}
r = U(u) * P^T(p) + bu(u) + bp(p) + mu
{code}
where U(u) is the u'th row of the User matrix, P(p) is the p'th row for the 
product matrix, bu(u) is the u'th element of the user bias vector, bp(p) is the 
p'th element of the product bias vector and mu is the global average.
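
For readers who want to check the formula by hand, here is a small sketch of the prediction rule exactly as written above. The function name `predict` and the sample numbers are made up for illustration and are not taken from the linked repository.

```python
def predict(user_row, product_row, user_bias, product_bias, global_mean):
    """r = U(u) * P^T(p) + bu(u) + bp(p) + mu for a single (user, product) pair."""
    dot = sum(a * b for a, b in zip(user_row, product_row))
    return dot + user_bias + product_bias + global_mean


# Hypothetical rank-2 factors and biases.
r = predict(user_row=[0.5, 1.0], product_row=[2.0, 0.5],
            user_bias=0.25, product_bias=-0.5, global_mean=3.5)
print(r)  # 1.5 (dot product) + 0.25 - 0.5 + 3.5 = 4.75
```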

> Streaming ALS for Collaborative Filtering
> -
>
> Key: SPARK-6407
> URL: https://issues.apache.org/jira/browse/SPARK-6407
> Project: Spark
> Issue Type: New Feature
> Components: Streaming
> Reporter: Felix Cheung
> Priority: Minor
>
> Like MLLib's ALS implementation for recommendation, and applying to streaming.
> Similar to streaming linear regression, logistic regression, could we apply 
> gradient updates to batches of data and reuse existing MLLib implementation?






[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering

2015-04-06 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481711#comment-14481711
 ] 

Xiangrui Meng commented on SPARK-6407:
--

Attached the comment from Chunnan Yao in SPARK-6711:

On-line Collaborative Filtering (CF) has been widely used and studied. 
Re-training a CF model from scratch every time new data comes in is very 
inefficient 
(http://stackoverflow.com/questions/27734329/apache-spark-incremental-training-of-als-model).
However, in the Spark community we see little discussion about collaborative 
filtering on streaming data. Given streaming k-means, streaming logistic 
regression, and the on-going incremental model training of the Naive Bayes 
classifier (SPARK-4144), we think it is meaningful to consider streaming 
Collaborative Filtering support in MLlib. 

We have already been considering this issue during the past week. We plan 
to refer to this paper
(https://www.cs.utexas.edu/~cjohnson/ParallelCollabFilt.pdf). It is based on 
SGD instead of ALS, which makes it easier to apply to streaming data. 

Fortunately, the authors of this paper have implemented their algorithm as a 
GitHub project, based on Storm:
https://github.com/MrChrisJohnson/CollabStream




[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering

2015-04-05 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14423680#comment-14423680
 ] 

Xiangrui Meng commented on SPARK-6407:
--

Using ALS for online updates is expensive. I think we should use the factors 
from ALS as the initial point and use a stochastic gradient descent scheme for 
online update, e.g. DSGD: http://dl.acm.org/citation.cfm?id=2020426. I'm not 
sure whether this would work. Someone should work out the math first.
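
A sketch of what such an online step could look like, assuming the factor rows were already produced by a batch ALS run. This is plain single-rating SGD on the regularized squared error, not the distributed, stratified scheme of DSGD; the function name `sgd_update`, the learning rate `lr`, and `reg` are assumptions for illustration.

```python
def sgd_update(u_vec, v_vec, rating, lr=0.05, reg=0.01):
    """Apply one stochastic gradient step for a single observed rating.

    Starts from existing factor rows (for example, taken from a batch ALS
    model) and returns updated copies; only these two rows change.
    """
    err = rating - sum(a * b for a, b in zip(u_vec, v_vec))
    # Both rows are updated from the old values of the other row.
    new_u = [a + lr * (err * b - reg * a) for a, b in zip(u_vec, v_vec)]
    new_v = [b + lr * (err * a - reg * b) for a, b in zip(u_vec, v_vec)]
    return new_u, new_v


# Hypothetical rows from a previously trained model, and one new rating.
user_row, item_row = [1.0, 0.0], [1.0, 1.0]
user_row, item_row = sgd_update(user_row, item_row, rating=3.0)
```

Like fold-in, each event touches only one user row and one item row, so the per-update cost is O(k). Whether the factors stay close to what a full ALS re-run would produce is exactly the open question raised above.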




[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering

2015-04-02 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392646#comment-14392646
 ] 

Chris Fregly commented on SPARK-6407:
-

from [~mengxr] 

The online update should be implemented with GraphX or IndexedRDD, 
which may take some time. There is no open-source solution.

Try doing a survey on existing algorithms for online matrix 
factorization updates.




[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering

2015-04-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392742#comment-14392742
 ] 

Sean Owen commented on SPARK-6407:
--

ALS doesn't use gradient descent, at least not in the sense that these 
linear models do, so you couldn't reuse that implementation. I am accustomed to 
fold-in for approximate streaming updates to an ALS model, but yes, it does kind 
of need to mutate some RDD-based data structure efficiently, like an 
IndexedRDD. Although the idea is simple, I also don't know of good theoretical 
approaches and have just made up reasonable heuristics in the past.




[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering

2015-04-02 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393021#comment-14393021
 ] 

Joseph K. Bradley commented on SPARK-6407:
--

I'm not too familiar with the area, but it seems similar to randomized linear 
algebra work if you can assume the incoming data is i.i.d.  But [~mengxr] may 
be more familiar with this literature than me...
