Re: possible bug in Spark's ALS implementation...
I think multiplying lambda by the number of ratings is a heuristic that worked on rating-related problems like the Netflix dataset and other ratings datasets, but the scope of NMF is much broader than that. @Sean, please correct me if you disagree. It's definitely good to add the rating-dataset optimizations, but they should not go in the core algorithm; put them in, say, a RatingALS that extends the core ALS.

On Wed, Apr 2, 2014 at 12:06 AM, Michael Allman m...@allman.ms wrote:

Hi Nick,

I don't have my Spark clone in front of me, but off the top of my head the major differences are/were:

1. Oryx multiplies lambda by alpha.

2. Oryx uses a different matrix inversion algorithm. It maintains a certain symmetry which the Spark algorithm does not; however, I don't think this difference has a real impact on the results.

3. Oryx supports specifying a convergence threshold for terminating the algorithm, based on the delta RMSE on a subset of the training set, if I recall correctly. I've been using that as the termination criterion instead of a fixed number of iterations.

4. Oryx uses the weighted regularization scheme you alluded to below, multiplying lambda by the number of ratings (a sketch of this scheme follows this message).

I've patched the Spark implementation to support (4) but haven't pushed it to my clone on GitHub. I think it would be a valuable feature to support officially. I'd also like to work on (3) but don't have time now.

I've only been using Oryx for the past couple of weeks.

Cheers,

Michael

On Tue, 1 Apr 2014, Nick Pentreath [via Apache Spark User List] wrote:

Hi Michael,

Would you mind setting out exactly what differences you did find between the Spark and Oryx implementations? It would be good to be clear on them, and also to see if there are further tricks/enhancements from the Oryx one that can be ported (such as the lambda * numRatings adjustment).

N

On Sat, Mar 15, 2014 at 2:52 AM, Michael Allman [hidden email] wrote:

[quoted message trimmed; the full text appears as its own entry later in this thread]
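To make the weighted regularization in (4) concrete, here is a minimal sketch (plain Scala, hypothetical helper and variable names, not the actual Spark or Oryx code) of how one user's regularized normal matrix changes under the ALS-WR scheme:

    object WeightedLambdaSketch {
      // Build the left-hand side of one user's normal equations,
      //   A = Y_u^T Y_u + lambda * n_u * I,
      // where `yu` holds the factor vectors of the n_u items this user
      // rated. Under plain L2 regularization the diagonal term would be
      // lambda * I instead.
      def normalMatrix(yu: Array[Array[Double]], lambda: Double): Array[Array[Double]] = {
        val rank = yu(0).length
        val nu = yu.length // this user's rating count
        val a = Array.ofDim[Double](rank, rank)
        for (y <- yu; i <- 0 until rank; j <- 0 until rank)
          a(i)(j) += y(i) * y(j)
        for (i <- 0 until rank)
          a(i)(i) += lambda * nu // ALS-WR: scale lambda by n_u
        a
      }

      def main(args: Array[String]): Unit = {
        val yu = Array(Array(1.0, 0.5), Array(0.2, 0.8), Array(0.4, 0.1))
        normalMatrix(yu, lambda = 0.1).foreach(row => println(row.mkString(" ")))
      }
    }

The only difference from the unweighted scheme is the lambda * nu term on the diagonal: users with many ratings are regularized more heavily, which keeps the effective regularization roughly comparable across users.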
Re: possible bug in Spark's ALS implementation...
Sorry, the link was wrong. Should be https://github.com/apache/spark/pull/131

-Xiangrui

On Tue, Mar 18, 2014 at 10:20 AM, Michael Allman m...@allman.ms wrote:

Hi Xiangrui,

I don't see how https://github.com/apache/spark/pull/161 relates to ALS. Can you explain?

Also, thanks for addressing the issue with factor matrix persistence in PR 165. I was probably not going to get to that for a while. I will try to test your changes today for speed improvements.

Cheers,

Michael
Re: possible bug in Spark's ALS implementation...
I just ran a runtime performance comparison between 0.9.0-incubating and your als branch. I saw a 1.5x improvement in performance.
Re: possible bug in Spark's ALS implementation...
Glad to hear about the speed-up. I hope we can improve the implementation further in the future.

-Xiangrui

On Tue, Mar 18, 2014 at 1:55 PM, Michael Allman m...@allman.ms wrote:

I just ran a runtime performance comparison between 0.9.0-incubating and your als branch. I saw a 1.5x improvement in performance.
Re: possible bug in Spark's ALS implementation...
The factor matrix Y is used twice in the implicit ALS computation: once to compute the global Y^T Y, and again to compute the local Y_i^T C_i Y_i.

-Xiangrui

On Sun, Mar 16, 2014 at 1:18 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

On Mar 14, 2014, at 5:52 PM, Michael Allman m...@allman.ms wrote:

I also found that the product and user RDDs were being rebuilt many times over in my tests, even for tiny data sets. By persisting the RDD returned from updateFeatures() I was able to avoid a raft of duplicate computations. Is there a reason not to do this?

This sounds like a good thing to add, though I'd like to understand why these are being recomputed (it seemed that the code would only use each one once). Do you have any sense why that is?

Matei
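To illustrate the first of those two uses, here is a toy sketch (plain Scala, hypothetical names, not the MLlib code): the global Y^T Y can be accumulated once as a sum of rank-by-rank outer products over all item factors and then shared by every user's solve, while the Y_i^T C_i Y_i correction must be assembled per user from only the items that user interacted with:

    object YtYSketch {
      // Accumulate the global Gram matrix Y^T Y = sum over items of
      // y_i * y_i^T, where each y_i is one item's factor vector.
      def computeYtY(itemFactors: Seq[Array[Double]], rank: Int): Array[Array[Double]] = {
        val yty = Array.ofDim[Double](rank, rank)
        for (y <- itemFactors; i <- 0 until rank; j <- 0 until rank)
          yty(i)(j) += y(i) * y(j)
        yty
      }

      def main(args: Array[String]): Unit = {
        val factors = Seq(Array(1.0, 2.0), Array(3.0, 4.0))
        // Expected output: 10 14 / 14 20
        computeYtY(factors, rank = 2).foreach(row => println(row.mkString(" ")))
      }
    }

Because this Gram matrix is the same for every user, computing it once per iteration rather than inside each user's solve is what makes the Y^T Y + Y^T (Cu - I) Y decomposition from the paper pay off.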
Re: possible bug in Spark's ALS implementation...
Hi Michael,

I made a couple of changes to implicit ALS. One gives faster construction of YtY (https://github.com/apache/spark/pull/161), which was merged into master. The other caches intermediate matrix factors properly (https://github.com/apache/spark/pull/165). They should give you the same result as before, but faster (~2x in my local tests). If you have time to try the improved version, please let me know the speed-up on your data. Thanks!

Best,

Xiangrui

On Mon, Mar 17, 2014 at 5:07 PM, Michael Allman m...@allman.ms wrote:

I've created https://spark-project.atlassian.net/browse/SPARK-1263 to address the issue of the factor matrix recomputation. I'm planning to submit a related pull request shortly.
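As a rough illustration of the caching issue those changes address (a standalone sketch with made-up data, not the actual ALS code), an RDD that feeds two downstream actions is recomputed from scratch for each action unless it is persisted:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    import org.apache.spark.storage.StorageLevel

    object PersistSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local", "PersistSketch")

        // Stand-in for the factor RDD returned by updateFeatures():
        // (id, factor vector) pairs produced by an expensive computation.
        val factors = sc.parallelize(1 to 1000)
          .map(i => (i, Array.fill(10)(math.random)))

        // Without this call, BOTH actions below would re-run the map
        // above, i.e. the factors would be computed twice.
        factors.persist(StorageLevel.MEMORY_AND_DISK)

        // First use, analogous to building the global Y^T Y...
        val sumOfSquares = factors.values.map(_.map(x => x * x).sum).sum()
        // ...and a second use, analogous to the per-user solves.
        val n = factors.count()

        println(s"sumOfSquares=$sumOfSquares over $n factor vectors")
        sc.stop()
      }
    }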
Re: possible bug in Spark's ALS implementation...
On Mar 14, 2014, at 5:52 PM, Michael Allman m...@allman.ms wrote:

I also found that the product and user RDDs were being rebuilt many times over in my tests, even for tiny data sets. By persisting the RDD returned from updateFeatures() I was able to avoid a raft of duplicate computations. Is there a reason not to do this?

This sounds like a good thing to add, though I'd like to understand why these are being recomputed (it seemed that the code would only use each one once). Do you have any sense why that is?

Matei
Re: possible bug in Spark's ALS implementation...
I've been thoroughly investigating this issue over the past couple of days and have discovered quite a bit. For one thing, there is definitely (at least) one issue/bug in the Spark implementation that leads to incorrect results for models generated with rank 1 or a large number of iterations. I will post a bug report with a thorough explanation this weekend or on Monday.

I believe I've been able to track down every difference between the Spark and Oryx implementations that leads to different results. I made some adjustments to the Spark implementation so that, given the same initial product/item vectors, the resulting model is identical to the one produced by Oryx within a small numerical tolerance. I've verified this for small data sets and am working on verifying this with some large data sets.

Aside from those already identified in this thread, another significant difference in the Spark implementation is that it begins the factorization process by computing the product matrix (Y) from the initial user matrix (X). Both of the papers on ALS referred to in this thread begin the process by computing the user matrix. I haven't done any testing comparing the models generated starting from Y or X, but they are very different. Is there a reason Spark begins the iteration by computing Y?

Initializing both X and Y as is done in the Spark implementation seems unnecessary unless I'm overlooking some desired side effect. Only the factor matrix which generates the other in the first iteration needs to be initialized.

I also found that the product and user RDDs were being rebuilt many times over in my tests, even for tiny data sets. By persisting the RDD returned from updateFeatures() I was able to avoid a raft of duplicate computations. Is there a reason not to do this?

Thanks.
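On the initialization point, a bare-bones sketch of the outer loop (hypothetical names, with the least-squares solve elided, since its details are irrelevant to the initialization question) shows why only one side needs random values: the side solved first is overwritten before it is ever read:

    import scala.util.Random

    object AlsLoopSketch {
      type Factors = Map[Int, Array[Double]]

      // Stand-in for one half-iteration: solve one side's factors while
      // holding the other side fixed.
      def solve(fixed: Factors, ids: Range, rank: Int): Factors =
        ids.map(id => id -> Array.fill(rank)(0.0)).toMap

      def main(args: Array[String]): Unit = {
        val rank = 10
        val userIds = 1 to 100
        val productIds = 1 to 50

        // Only Y (the side held fixed in the first half-iteration) needs
        // a random initialization. X below is assigned on the first pass
        // through the loop before anything reads it, so initializing it
        // randomly as well is wasted work.
        var products: Factors =
          productIds.map(id => id -> Array.fill(rank)(Random.nextDouble())).toMap
        var users: Factors = Map.empty

        for (_ <- 1 to 10) {
          users = solve(products, userIds, rank)      // X from Y
          products = solve(users, productIds, rank)   // Y from X
        }
      }
    }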
possible bug in Spark's ALS implementation...
Hi,

I'm implementing a recommender based on the algorithm described in http://www2.research.att.com/~yifanhu/PUB/cf.pdf. This algorithm forms the basis for Spark's ALS implementation for data sets with implicit features. The data set I'm working with is proprietary and I cannot share it; however, I can say that it's based on the same kind of data as in the paper---relative viewing time of videos. (Specifically, the rating for each video is defined as total viewing time across all visitors divided by video duration.)

I'm seeing counterintuitive, sometimes nonsensical recommendations. For comparison, I've run the training data through Oryx's in-VM implementation of implicit ALS with the same parameters. Oryx uses the same algorithm. (Source in this file: https://github.com/cloudera/oryx/blob/master/als-common/src/main/java/com/cloudera/oryx/als/common/factorizer/als/AlternatingLeastSquares.java)

The recommendations made by the two systems are very different---more so than I think could be explained by differences in initial state. The recommendations made by the Oryx models look much better, especially as I increase the number of latent factors and the number of iterations. The Spark models' recommendations don't improve with increases in either latent factors or iterations. Sometimes they get worse.

Because of the (understandably) highly optimized and terse style of Spark's ALS implementation, I've had a very hard time following it well enough to debug the issue definitively. However, I have found a section of code that looks incorrect. As described in the paper, part of the implicit ALS algorithm involves computing the matrix product Y^T Cu Y (equation 4 in the paper). To optimize this computation, the expression is rewritten as Y^T Y + Y^T (Cu - I) Y. I believe that's what should be happening here:

https://github.com/apache/incubator-spark/blob/v0.9.0-incubating/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L376

However, it looks like this code is in fact computing Y^T Y + Y^T Y (Cu - I), which is the same as Y^T Y Cu. If so, that's a bug. Can someone familiar with this code evaluate my claim?

Cheers,

Michael
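To see why the distinction matters, here is a self-contained numerical check (toy 2x2 matrices in plain Scala; an illustration of the algebra, not the MLlib code): Y^T Cu Y and Y^T Y Cu generally differ, and only the former is symmetric:

    object AlsAlgebraCheck {
      type Mat = Array[Array[Double]]

      def multiply(a: Mat, b: Mat): Mat =
        Array.tabulate(a.length, b(0).length) { (i, j) =>
          (0 until b.length).map(k => a(i)(k) * b(k)(j)).sum
        }

      def transpose(a: Mat): Mat =
        Array.tabulate(a(0).length, a.length)((i, j) => a(j)(i))

      def main(args: Array[String]): Unit = {
        // Toy factor matrix Y (2 items x 2 factors) and a diagonal
        // confidence matrix Cu that is not the identity.
        val y: Mat  = Array(Array(1.0, 2.0), Array(3.0, 4.0))
        val cu: Mat = Array(Array(2.0, 0.0), Array(0.0, 5.0))

        val correct = multiply(multiply(transpose(y), cu), y) // Y^T Cu Y
        val wrong   = multiply(multiply(transpose(y), y), cu) // Y^T Y Cu

        // Prints rows (47, 64) and (64, 88): symmetric, as expected.
        correct.foreach(row => println(row.mkString(" ")))
        // Prints rows (20, 70) and (28, 100): not symmetric, not equal.
        wrong.foreach(row => println(row.mkString(" ")))
      }
    }

In this toy case Y^T Y Cu even fails to be symmetric, while Y^T Cu Y must be, so a cheap symmetry assertion is one way such a discrepancy could be caught in tests.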
Re: possible bug in Spark's ALS implementation...
Hi Michael,

I can help check the current implementation. Would you please go to https://spark-project.atlassian.net/browse/SPARK and create a ticket about this issue with component MLlib? Thanks!

Best,

Xiangrui

On Tue, Mar 11, 2014 at 3:18 PM, Michael Allman m...@allman.ms wrote:

[quoted message trimmed; see the original report in the preceding entry]