[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering
[ https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896190#comment-15896190 ]

Sean Owen commented on SPARK-6407:
----------------------------------

I did some work on this, but it's not a paper or anything, just some code, in and around these bits of code, which try to compute new user/item updates on the fly:

https://github.com/OryxProject/oryx/blob/master/app/oryx-app/src/main/java/com/cloudera/oryx/app/speed/als/ALSSpeedModelManager.java#L198
https://github.com/OryxProject/oryx/blob/master/app/oryx-app-common/src/main/java/com/cloudera/oryx/app/als/ALSUtils.java

The choices about the semantics of the updates are in ALSUtils. If you dig into it, we can discuss offline and I can probably write more in the docs to make it clearer what's happening.

> Streaming ALS for Collaborative Filtering
> -----------------------------------------
>
> Key: SPARK-6407
> URL: https://issues.apache.org/jira/browse/SPARK-6407
> Project: Spark
> Issue Type: New Feature
> Components: DStreams
> Reporter: Felix Cheung
> Priority: Minor
>
> Like MLLib's ALS implementation for recommendation, and applying to streaming.
> Similar to streaming linear regression, logistic regression, could we apply
> gradient updates to batches of data and reuse existing MLLib implementation?

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering
[ https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896177#comment-15896177 ]

Daniel Li commented on SPARK-6407:
----------------------------------

bq. In practice fold-in works fine. Folding in a day or so of updates has been OK. The question isn't RMSE but how it affects actual rankings of items in recommendations, and it takes a while before the effect of the approximation actually changes a rank.

Hmm, I see. This would be something I'd be interested in implementing for Spark if there's need. Are there implementations (or papers) of this you know of that I could look at?
[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering
[ https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896154#comment-15896154 ]

Sean Owen commented on SPARK-6407:
----------------------------------

Computing one or two iterations per update -- as in every time someone clicks on a product or something? No, that's way, way too slow. Each update would launch tens of large distributed jobs.

In practice fold-in works fine. Folding in a day or so of updates has been OK. The question isn't RMSE but how it affects actual rankings of items in recommendations, and it takes a while before the effect of the approximation actually changes a rank.
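The distinction between score error and rank change can be illustrated with a toy example. This is only a sketch with made-up numbers: `item_factors`, `exact_vec` and `approx_vec` are hypothetical, and nothing here comes from Spark or Oryx. It shows a case where an approximate user vector shifts the predicted scores but leaves the item ordering untouched.

```python
def scores(user_vec, item_factors):
    """Predicted score of every item for one user: dot(user_vec, item_vec)."""
    return {item: sum(a * b for a, b in zip(user_vec, vec))
            for item, vec in item_factors.items()}


def ranking(user_vec, item_factors):
    """Item ids ordered from highest to lowest predicted score."""
    s = scores(user_vec, item_factors)
    return sorted(s, key=s.get, reverse=True)


# Hypothetical item factors and two versions of the same user's vector.
item_factors = {"a": [1.0, 0.0], "b": [0.5, 0.5], "c": [0.0, 1.0]}
exact_vec = [1.0, 0.2]    # as a full batch recomputation might give
approx_vec = [0.9, 0.3]   # as an approximate fold-in might give

exact = scores(exact_vec, item_factors)
approx = scores(approx_vec, item_factors)

# The scores differ, so any score-based error metric is nonzero ...
max_score_error = max(abs(exact[i] - approx[i]) for i in item_factors)
# ... but the recommendation order is the same in this example.
same_order = ranking(exact_vec, item_factors) == ranking(approx_vec, item_factors)
print(max_score_error > 0, same_order)
```

Whether the ordering survives in practice depends on how close neighbouring scores are, which is why it would need to be measured on real data rather than inferred from RMSE.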
[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering
[ https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896033#comment-15896033 ]

Daniel Li commented on SPARK-6407:
----------------------------------

Appreciate the quick reply, [~srowen].

Yeah, we'd be recomputing them, but not from scratch, since we'd be starting with optimized _U_ and _V_. It would likely take only one or two iterations before reconvergence. Would this still be considered too expensive?

The thing I hesitate about regarding fold-in updating is that the assumption that only the corresponding user row and item row will change may be too simplifying, since, of course, there's a "rippling out" effect: all items the user rated previously need to be updated, then all users that rated any of those items would need updating, etc. Then again, even if we take this rippling into account, the computation may not be too expensive, since a single update likely won't affect the RMSE enough to delay convergence. (Though I haven't worked out the math showing this; it's just a hunch.)

Do you have any insights into this?
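The "rippling out" effect described above can be made concrete as a breadth-first expansion over the bipartite user/item rating graph. The following is a minimal sketch, not taken from any existing implementation; the function name `ripple` and the sample ratings are hypothetical. It reports which user rows and item rows would be touched after a given number of hops from one new rating.

```python
from collections import defaultdict


def ripple(ratings, user, item, hops):
    """Return (users, items) whose factor rows are reached within `hops`
    expansions after `user` rates `item`.

    ratings: iterable of existing (user, item) pairs.
    """
    items_of = defaultdict(set)   # user -> items that user rated
    users_of = defaultdict(set)   # item -> users who rated that item
    for u, i in ratings:
        items_of[u].add(i)
        users_of[i].add(u)
    # Record the new rating itself.
    items_of[user].add(item)
    users_of[item].add(user)

    users, items = {user}, {item}          # hop 0: the two rows fold-in touches
    for _ in range(hops):
        # Items rated by any affected user, users who rated any affected item.
        new_items = set().union(*(items_of[u] for u in users))
        new_users = set().union(*(users_of[i] for i in items))
        users |= new_users
        items |= new_items
    return users, items


existing = [("u1", "a"), ("u2", "a"), ("u2", "b"),
            ("u3", "b"), ("u3", "c"), ("u4", "d")]

for hops in range(3):
    users, items = ripple(existing, "u1", "b", hops)
    print(hops, sorted(users), sorted(items))
```

Hop 0 is exactly what fold-in updates. Each further hop widens the set until the whole connected component of the rating graph is covered, which on real data is typically most of the matrix.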
[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering
[ https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895605#comment-15895605 ]

Sean Owen commented on SPARK-6407:
----------------------------------

How is it different from recomputing all of U and V? Doing anything to all of the matrices is probably out of the question for an online update. The point of fold-in is to update only the two affected rows and make the simplifying assumption that nothing else changes, because it would be too expensive to recompute anything.

If you mean batch together enough to make it worthwhile to update, then yes at some point that's worth it, but it just reduces to re-running the batch algorithm for a few iterations again.
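One textbook way to realise "update only the two affected rows" is to re-solve a single regularized least-squares problem for the user row with all item rows held fixed, then do the same for the item row. The sketch below assumes explicit ratings and an L2 penalty `reg`; the helper names (`solve`, `fold_in_row`, `fold_in`) and the data are hypothetical, and this is not the heuristic used in the Oryx code linked elsewhere in the thread.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fold_in_row(row_ratings, other_factors, reg):
    """Recompute one row: x = (sum y y^T + reg*I)^-1 (sum r*y),
    summing over the ids in row_ratings, with other_factors held fixed."""
    k = len(next(iter(other_factors.values())))
    A = [[reg if i == j else 0.0 for j in range(k)] for i in range(k)]
    b = [0.0] * k
    for other_id, r in row_ratings.items():
        y = other_factors[other_id]
        for i in range(k):
            b[i] += r * y[i]
            for j in range(k):
                A[i][j] += y[i] * y[j]
    return solve(A, b)


def fold_in(user_ratings, item_ratings, user_factors, item_factors, u, i, r, reg=0.1):
    """Apply one new rating by updating only row u and row i.
    Assumes both u and i already have factor rows."""
    user_ratings.setdefault(u, {})[i] = r
    item_ratings.setdefault(i, {})[u] = r
    user_factors[u] = fold_in_row(user_ratings[u], item_factors, reg)
    item_factors[i] = fold_in_row(item_ratings[i], user_factors, reg)


# Made-up rank-2 factors and rating history.
user_factors = {"u1": [1.0, 0.0], "u2": [0.0, 1.0]}
item_factors = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
user_ratings = {"u1": {"a": 1.0}, "u2": {"b": 1.0}}
item_ratings = {"a": {"u1": 1.0}, "b": {"u2": 1.0}}

fold_in(user_ratings, item_ratings, user_factors, item_factors, "u1", "b", 1.0)
prediction = sum(p * q for p, q in zip(user_factors["u1"], item_factors["b"]))
print(prediction)
```

Every other row (`u2` and `a` here) is left exactly as it was, which is the simplifying assumption under discussion: the cost is two small `k x k` solves regardless of how large the full matrices are.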
[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering
[ https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895589#comment-15895589 ]

Daniel Li commented on SPARK-6407:
----------------------------------

Reviving this thread since I'm interested in implementing streaming CF for Spark.

bq. Using ALS for online updates is expensive.

Recomputing the factor matrices _U_ and _V_ from scratch for every update would be terribly expensive, but what about keeping _U_ and _V_ around and simply recomputing another round or two after each new rating that comes in? The algorithm would simply be continually following a moving optimum. I can't imagine the RMSE changing much due to small updates if we use a convergence threshold _à la_ [Y. Zhou, et al., “Large-Scale Parallel Collaborative Filtering for the Netflix Prize”|http://dl.acm.org/citation.cfm?id=1424269] instead of a fixed number of iterations.

(In fact, since calculating _(U^T) * V_ would probably take a nontrivial slice of time, new updates that come in during a round of calculation could be "batched" into the next round of calculation, increasing efficiency.)

Thoughts?
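The proposal above, keeping the factors around and running ALS rounds until the RMSE stops moving, can be sketched on a toy problem. To keep each row solve in closed form, this sketch assumes rank-1 factors (one number per user and per item); the function names, the tolerance `tol`, the penalty `reg` and the ratings are all hypothetical, and it says nothing about the cost of doing this as distributed Spark jobs.

```python
import math


def rmse(ratings, x, y):
    """Root-mean-square error of rank-1 predictions x[u] * y[i]."""
    se = sum((r - x[u] * y[i]) ** 2 for (u, i), r in ratings.items())
    return math.sqrt(se / len(ratings))


def als_round(ratings, x, y, reg):
    """One ALS round: re-solve every user with items fixed, then every item."""
    for u in x:
        rated = [(y[i], r) for (uu, i), r in ratings.items() if uu == u]
        x[u] = sum(r * yi for yi, r in rated) / (sum(yi * yi for yi, _ in rated) + reg)
    for i in y:
        rated = [(x[u], r) for (u, ii), r in ratings.items() if ii == i]
        y[i] = sum(r * xu for xu, r in rated) / (sum(xu * xu for xu, _ in rated) + reg)


def als_until_converged(ratings, x, y, reg=0.01, tol=1e-4, max_rounds=100):
    """Repeat ALS rounds until RMSE changes by less than tol; return rounds used."""
    prev = rmse(ratings, x, y)
    for rounds in range(1, max_rounds + 1):
        als_round(ratings, x, y, reg)
        cur = rmse(ratings, x, y)
        if abs(prev - cur) < tol:
            return rounds
        prev = cur
    return max_rounds


ratings = {("u1", "a"): 1.0, ("u1", "b"): 2.0,
           ("u2", "a"): 2.0, ("u2", "b"): 4.0, ("u2", "c"): 6.0}
x = {"u1": 1.0, "u2": 1.0}            # arbitrary cold-start user factors
y = {"a": 1.0, "b": 1.0, "c": 1.0}    # arbitrary cold-start item factors

cold_rounds = als_until_converged(ratings, x, y)   # from the arbitrary start

ratings[("u1", "c")] = 3.0                          # one new rating arrives
warm_rounds = als_until_converged(ratings, x, y)   # resume from current factors
final_rmse = rmse(ratings, x, y)
print(cold_rounds, warm_rounds, final_rmse)
```

One would expect the warm restart to stop after fewer rounds than the cold start, but a toy like this does not establish that in general, and each round still touches every row of both factor sets.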
[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering
[ https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481874#comment-14481874 ]

Burak Yavuz commented on SPARK-6407:

I actually worked on this over the weekend for fun and have a streaming, gradient descent based, matrix factorization model implemented here:
https://github.com/brkyvz/streaming-matrix-factorization

It is a very naive implementation, but it might be something to work on top of. I will publish a Spark Package for it as soon as I get the tests in.

The model it uses for predicting ratings for user `u` and product `p` is:
{code}
r = U(u) * P^T(p) + bu(u) + bp(p) + mu
{code}
where U(u) is the u'th row of the User matrix, P(p) is the p'th row of the product matrix, bu(u) is the u'th element of the user bias vector, bp(p) is the p'th element of the product bias vector, and mu is the global average.

Streaming ALS for Collaborative Filtering
-----------------------------------------

Key: SPARK-6407
URL: https://issues.apache.org/jira/browse/SPARK-6407
Project: Spark
Issue Type: New Feature
Components: Streaming
Reporter: Felix Cheung
Priority: Minor

Like MLLib's ALS implementation for recommendation, and applying to streaming.
Similar to streaming linear regression, logistic regression, could we apply gradient updates to batches of data and reuse existing MLLib implementation?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
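For readers who want to see the prediction formula above evaluated, here is a small worked sketch. It only transcribes `r = U(u) * P^T(p) + bu(u) + bp(p) + mu` into plain Python; the function name `predict` and all the numbers are made up for illustration and are not taken from the linked repository.

```python
def predict(U, P, bu, bp, mu, u, p):
    """Predicted rating: dot(U[u], P[p]) + user bias + product bias + global mean."""
    dot = sum(a * b for a, b in zip(U[u], P[p]))
    return dot + bu[u] + bp[p] + mu


# Hypothetical rank-2 model with two users and two products.
U = [[0.5, 1.0], [1.0, 0.0]]    # user factor rows
P = [[2.0, 1.0], [0.0, 3.0]]    # product factor rows
bu = [0.25, -0.5]               # user biases
bp = [0.5, 0.0]                 # product biases
mu = 3.0                        # global average rating

# User 0, product 1: (0.5*0.0 + 1.0*3.0) + 0.25 + 0.0 + 3.0
r = predict(U, P, bu, bp, mu, 0, 1)
print(r)  # 6.25
```

The bias terms absorb per-user and per-product offsets from the global average, so the factor rows only have to model the interaction between a user and a product.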
[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering
[ https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481711#comment-14481711 ]

Xiangrui Meng commented on SPARK-6407:
--------------------------------------

Attaching the comment from Chunnan Yao in SPARK-6711:

On-line Collaborative Filtering (CF) has been widely used and studied. Re-training a CF model from scratch every time new data comes in is very inefficient (http://stackoverflow.com/questions/27734329/apache-spark-incremental-training-of-als-model). However, in the Spark community we see little discussion about collaborative filtering on streaming data. Given streaming k-means, streaming logistic regression, and the ongoing incremental model training of the Naive Bayes Classifier (SPARK-4144), we think it is meaningful to consider streaming Collaborative Filtering support in MLlib.

We have been considering this issue during the past week. We plan to refer to this paper (https://www.cs.utexas.edu/~cjohnson/ParallelCollabFilt.pdf). It is based on SGD instead of ALS, which is easier to apply to streaming data.

Fortunately, the authors of this paper have implemented their algorithm as a GitHub project, based on Storm:
https://github.com/MrChrisJohnson/CollabStream
[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering
[ https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14423680#comment-14423680 ]

Xiangrui Meng commented on SPARK-6407:
--------------------------------------

Using ALS for online updates is expensive. I think we should use the factors from ALS as the initial point and use a stochastic gradient descent scheme for online update, e.g. DSGD: http://dl.acm.org/citation.cfm?id=2020426. I'm not sure whether this would work. Someone should work out the math first.
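As a rough illustration of "ALS factors as the initial point, SGD for the online update", the sketch below applies plain single-machine SGD steps on regularized squared error for one incoming rating. It is not DSGD, which additionally partitions the matrix so that blocks can be updated in parallel; the function name `sgd_update`, the learning rate `lr`, the penalty `reg` and the starting factors are all assumptions made for this example.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def sgd_update(user_vec, item_vec, rating, lr=0.05, reg=0.01):
    """One SGD step on (rating - dot(user, item))^2 + reg * (|user|^2 + |item|^2).
    Returns updated copies; both gradients use the pre-update vectors."""
    err = rating - dot(user_vec, item_vec)
    new_user = [a + lr * (err * b - reg * a) for a, b in zip(user_vec, item_vec)]
    new_item = [b + lr * (err * a - reg * b) for a, b in zip(user_vec, item_vec)]
    return new_user, new_item


# Factors as they might come out of a batch ALS run (made-up numbers).
user_factors = {"u1": [0.8, 0.3]}
item_factors = {"i1": [1.0, 0.5]}

# A new rating of 4.0 arrives on the stream for (u1, i1).
error_before = abs(4.0 - dot(user_factors["u1"], item_factors["i1"]))
for _ in range(20):
    user_factors["u1"], item_factors["i1"] = sgd_update(
        user_factors["u1"], item_factors["i1"], 4.0)
error_after = abs(4.0 - dot(user_factors["u1"], item_factors["i1"]))
print(error_before, error_after)
```

Like fold-in, each step touches only one user row and one item row. Unlike fold-in, it needs no linear solve, at the price of a learning rate that has to be tuned.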
[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering
[ https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392646#comment-14392646 ]

Chris Fregly commented on SPARK-6407:
-------------------------------------

From [~mengxr]:

The online update should be implemented with GraphX or IndexedRDD, which may take some time. There is no open-source solution. Try doing a survey of existing algorithms for online matrix factorization updates.
[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering
[ https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392742#comment-14392742 ]

Sean Owen commented on SPARK-6407:
----------------------------------

ALS doesn't use gradient descent, at least not in the sense that these linear models do, so you couldn't reuse the implementation. I am accustomed to fold-in for approximate streaming updates to an ALS model, but yes, it does kind of need to mutate some RDD-based data structure efficiently, like an IndexedRDD. Although the idea is simple, I also don't know of good theoretical approaches and have just made up reasonable heuristics in the past.
[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering
[ https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393021#comment-14393021 ]

Joseph K. Bradley commented on SPARK-6407:
------------------------------------------

I'm not too familiar with the area, but it seems similar to randomized linear algebra work if you can assume the incoming data is i.i.d. But [~mengxr] may be more familiar with this literature than me...