Re: possible bug in Spark's ALS implementation...

2014-04-02 Thread Debasish Das
I think multiplying lambda by the number of ratings is a heuristic that has
worked on rating-style problems like the Netflix dataset and other ratings
datasets, but the scope of NMF is much broader than that

@Sean please correct me in case you don't agree...

It's definitely good to add all the rating-dataset-related optimizations, but
they should not go into the core algorithm; put them in, say, a RatingALS that
extends the core ALS...
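
For context, the adjustment under discussion (item 4 in the quoted message
below) scales the regularization term for each user or item by its number of
ratings, as in the ALS-WR scheme. A minimal single-user sketch of that normal
equation, using Breeze purely for illustration (names and values are made up;
this is not Spark's or Oryx's actual code):

    import breeze.linalg._

    object WeightedLambdaSketch {
      def main(args: Array[String]): Unit = {
        val rank   = 3
        val lambda = 0.1
        val Yu = DenseMatrix.rand(4, rank)   // factors of the 4 items this user rated
        val ru = DenseVector.rand(4)         // this user's 4 ratings
        val nu = Yu.rows                     // number of ratings by this user

        // Weighted-lambda normal equation: (Yu^T Yu + lambda * nu * I) xu = Yu^T ru.
        // Dropping the "* nu" factor gives the unweighted regularization.
        val A  = Yu.t * Yu + DenseMatrix.eye[Double](rank) * (lambda * nu)
        val b  = Yu.t * ru
        val xu = A \ b                       // this user's factor vector
        println(xu)
      }
    }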



On Wed, Apr 2, 2014 at 12:06 AM, Michael Allman m...@allman.ms wrote:

 Hi Nick,

 I don't have my Spark clone in front of me, but off the top of my head the
 major differences are/were:

 1. Oryx multiplies lambda by alpha.

 2. Oryx uses a different matrix inverse algorithm. It maintains a certain
 symmetry which the Spark algo does not; however, I don't think this
 difference has a real impact on the results.

 3. Oryx supports the specification of a convergence threshold for
 termination of the algorithm, based on delta rmse on a subset of the
 training set if I recall correctly. I've been using that as the
 termination criterion instead of a fixed number of iterations.

 4. Oryx uses the weighted regularization scheme you alluded to below,
 multiplying lambda by the number of ratings.

 I've patched the spark impl to support (4) but haven't pushed it to my
 clone on github. I think it would be a valuable feature to support
 officially. I'd also like to work on (3) but don't have time now. I've
 only been using Oryx the past couple of weeks.
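
For illustration, here is a minimal sketch of the delta-RMSE stopping rule
described in (3), in plain Scala. The update step is a stand-in, not ALS, and
all names and values are made up:

    object ConvergenceSketch {
      def rmse(pred: Array[Double], actual: Array[Double]): Double = {
        val se = pred.zip(actual).map { case (p, a) => (p - a) * (p - a) }.sum
        math.sqrt(se / actual.length)
      }

      def main(args: Array[String]): Unit = {
        val actual    = Array(3.0, 1.0, 4.0, 5.0)  // held-out subset of the training set
        var pred      = Array.fill(actual.length)(0.0)
        val threshold = 1e-3                       // convergence threshold on delta RMSE
        var prevRmse  = Double.MaxValue
        var iter      = 0
        var converged = false
        while (iter < 20 && !converged) {
          // Stand-in for one ALS sweep: nudge predictions toward the targets.
          pred = pred.zip(actual).map { case (p, a) => p + 0.5 * (a - p) }
          val cur = rmse(pred, actual)
          converged = prevRmse - cur < threshold   // stop once the improvement falls below the threshold
          prevRmse = cur
          iter += 1
        }
        println(s"stopped after $iter sweeps, rmse = $prevRmse")
      }
    }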

 Cheers,

 Michael

 On Tue, 1 Apr 2014, Nick Pentreath [via Apache Spark User List] wrote:

  Hi Michael
  Would you mind setting out exactly what differences you did find between the
  Spark and Oryx implementations? Would be good to be clear on them, and also
  see if there are further tricks/enhancements from the Oryx one that can be
  ported (such as the lambda * numRatings adjustment).

  N
 
 
  On Sat, Mar 15, 2014 at 2:52 AM, Michael Allman [hidden email] wrote:
    I've been thoroughly investigating this issue over the past couple of days
    and have discovered quite a bit. For one thing, there is definitely (at
    least) one issue/bug in the Spark implementation that leads to incorrect
    results for models generated with rank > 1 or a large number of iterations.
    I will post a bug report with a thorough explanation this weekend or on
    Monday.

    I believe I've been able to track down every difference between the Spark
    and Oryx implementations that leads to different results. I made some
    adjustments to the Spark implementation so that, given the same initial
    product/item vectors, the resulting model is identical to the one produced
    by Oryx within a small numerical tolerance. I've verified this for small
    data sets and am working on verifying this with some large data sets.

    Aside from those already identified in this thread, another significant
    difference in the Spark implementation is that it begins the factorization
    process by computing the product matrix (Y) from the initial user matrix
    (X). Both of the papers on ALS referred to in this thread begin the process
    by computing the user matrix. I haven't done any testing comparing the
    models generated starting from Y or X, but they are very different. Is there
    a reason Spark begins the iteration by computing Y?

    Initializing both X and Y as is done in the Spark implementation seems
    unnecessary unless I'm overlooking some desired side-effect. Only the factor
    matrix which generates the other in the first iteration needs to be
    initialized.

    I also found that the product and user RDDs were being rebuilt many times
    over in my tests, even for tiny data sets. By persisting the RDD returned
    from updateFeatures() I was able to avoid a raft of duplicate computations.
    Is there a reason not to do this?

    Thanks.
 
 
 

Re: possible bug in Spark's ALS implementation...

2014-04-02 Thread Sean Owen


Re: possible bug in Spark's ALS implementation...

2014-03-18 Thread Xiangrui Meng
Sorry, the link was wrong. Should be
https://github.com/apache/spark/pull/131 -Xiangrui


On Tue, Mar 18, 2014 at 10:20 AM, Michael Allman m...@allman.ms wrote:
 Hi Xiangrui,

 I don't see how https://github.com/apache/spark/pull/161 relates to ALS. Can
 you explain?

 Also, thanks for addressing the issue with factor matrix persistence in PR
 165. I was probably not going to get to that for a while.

 I will try to test your changes today for speed improvements.

 Cheers,

 Michael





Re: possible bug in Spark's ALS implementation...

2014-03-18 Thread Michael Allman
I just ran a runtime performance comparison between 0.9.0-incubating and your
als branch. I saw a 1.5x improvement in performance.





Re: possible bug in Spark's ALS implementation...

2014-03-18 Thread Xiangrui Meng
Glad to hear about the speed-up. I hope we can improve the implementation
further in the future. -Xiangrui

On Tue, Mar 18, 2014 at 1:55 PM, Michael Allman m...@allman.ms wrote:
 I just ran a runtime performance comparison between 0.9.0-incubating and your
 als branch. I saw a 1.5x improvement in performance.





Re: possible bug in Spark's ALS implementation...

2014-03-17 Thread Xiangrui Meng
The factor matrix Y is used twice in the implicit ALS computation: once to
compute the global Y^T Y, and once to compute the local Y_i^T C_i Y_i.
-Xiangrui
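
One way to see that global/local split is the following small Breeze sketch
(not Spark's code; item indices and confidence values are made up), which
builds Y^T C_u Y for one user from the precomputed Y^T Y plus per-item
rank-one corrections, without ever forming the diagonal matrix C_u:

    import breeze.linalg._

    object YtYReuseSketch {
      def main(args: Array[String]): Unit = {
        val numItems = 6
        val rank     = 3
        val Y = DenseMatrix.rand(numItems, rank)

        // Global part: computed once per sweep and shared across all users.
        val YtY = Y.t * Y

        // Local part for one user who interacted with items 1 and 4,
        // with confidences c_ui (illustrative values).
        val rated = Seq(1 -> 5.0, 4 -> 2.0)
        val A = YtY.copy
        for ((i, c) <- rated) {
          val yi = Y(i, ::).t              // item i's factor vector
          A += (yi * yi.t) * (c - 1.0)     // add (c_ui - 1) * y_i * y_i^T
        }
        // A now equals Y^T C_u Y for this user.
        println(A)
      }
    }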

On Sun, Mar 16, 2014 at 1:18 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
 On Mar 14, 2014, at 5:52 PM, Michael Allman m...@allman.ms wrote:

 I also found that the product and user RDDs were being rebuilt many times
 over in my tests, even for tiny data sets. By persisting the RDD returned
 from updateFeatures() I was able to avoid a raft of duplicate computations.
 Is there a reason not to do this?


 This sounds like a good thing to add, though I'd like to understand why
 these are being recomputed (it seemed that the code would only use each one
 once). Do you have any sense why that is?

 Matei


Re: possible bug in Spark's ALS implementation...

2014-03-17 Thread Xiangrui Meng
Hi Michael,

I made a couple of changes to implicit ALS. One gives faster construction
of YtY (https://github.com/apache/spark/pull/161), which was merged
into master. The other caches intermediate matrix factors properly
(https://github.com/apache/spark/pull/165). They should give you the
same result as before, but faster (~2x in my local tests). If you have
time to try the improved version, please let me know the speed-up on
your data. Thanks!

Best,
Xiangrui

On Mon, Mar 17, 2014 at 5:07 PM, Michael Allman m...@allman.ms wrote:
 I've created https://spark-project.atlassian.net/browse/SPARK-1263 to address
 the issue of the factor matrix recomputation. I'm planning to submit a
 related pull request shortly.





Re: possible bug in Spark's ALS implementation...

2014-03-16 Thread Matei Zaharia
On Mar 14, 2014, at 5:52 PM, Michael Allman m...@allman.ms wrote:

 I also found that the product and user RDDs were being rebuilt many times
 over in my tests, even for tiny data sets. By persisting the RDD returned
 from updateFeatures() I was able to avoid a raft of duplicate computations.
 Is there a reason not to do this?

This sounds like a good thing to add, though I’d like to understand why these 
are being recomputed (it seemed that the code would only use each one once). Do 
you have any sense why that is?

Matei

Re: possible bug in Spark's ALS implementation...

2014-03-14 Thread Michael Allman
I've been thoroughly investigating this issue over the past couple of days
and have discovered quite a bit. For one thing, there is definitely (at
least) one issue/bug in the Spark implementation that leads to incorrect
results for models generated with rank > 1 or a large number of iterations.
I will post a bug report with a thorough explanation this weekend or on
Monday.

I believe I've been able to track down every difference between the Spark
and Oryx implementations that leads to different results. I made some
adjustments to the Spark implementation so that, given the same initial
product/item vectors, the resulting model is identical to the one produced
by Oryx within a small numerical tolerance. I've verified this for small
data sets and am working on verifying this with some large data sets.

Aside from those already identified in this thread, another significant
difference in the Spark implementation is that it begins the factorization
process by computing the product matrix (Y) from the initial user matrix
(X). Both of the papers on ALS referred to in this thread begin the process
by computing the user matrix. I haven't done any testing comparing the
models generated starting from Y or X, but they are very different. Is there
a reason Spark begins the iteration by computing Y?
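
As an aside, the ordering in question is easy to see in a toy alternating
loop. This Breeze sketch (explicit feedback, plain ridge solves, illustrative
names, nothing like Spark's actual code) updates the item factors Y from X
first; the papers' presentation would swap the two lines inside the loop:

    import breeze.linalg._

    object IterationOrderSketch {
      val lambda = 0.1

      // Ridge solve for A in R ~= A * B^T, one row at a time.
      def solve(R: DenseMatrix[Double], B: DenseMatrix[Double]): DenseMatrix[Double] = {
        val G = B.t * B + DenseMatrix.eye[Double](B.cols) * lambda
        val A = DenseMatrix.zeros[Double](R.rows, B.cols)
        for (i <- 0 until R.rows) {
          A(i, ::) := (G \ (B.t * R(i, ::).t)).t
        }
        A
      }

      def main(args: Array[String]): Unit = {
        val R = DenseMatrix.rand(8, 6)     // toy user x item matrix
        val rank = 2
        var X = DenseMatrix.rand(8, rank)  // user factors
        var Y = DenseMatrix.rand(6, rank)  // item (product) factors
        for (_ <- 1 to 10) {
          Y = solve(R.t, X)                // item factors from the current user factors
          X = solve(R, Y)                  // then user factors from the new item factors
        }
        println(norm((R - X * Y.t).toDenseVector))
      }
    }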

Initializing both X and Y as is done in the Spark implementation seems
unnecessary unless I'm overlooking some desired side-effect. Only the factor
matrix which generates the other in the first iteration needs to be
initialized.

I also found that the product and user RDDs were being rebuilt many times
over in my tests, even for tiny data sets. By persisting the RDD returned
from updateFeatures() I was able to avoid a raft of duplicate computations.
Is there a reason not to do this?
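
For reference, the general pattern being suggested looks something like the
following standalone sketch. The update step is a stand-in, not Spark's
updateFeatures(); the storage level and names are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    object PersistFactorsSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("persist-sketch").setMaster("local[*]"))

        // Toy factor RDD standing in for the user/product factor RDDs.
        var factors = sc.parallelize(1 to 1000).map(i => (i, Array.fill(10)(1.0)))

        for (_ <- 1 to 5) {
          val updated = factors.map { case (id, v) => (id, v.map(_ * 0.9)) }  // stand-in update
          // Persist the new factors so later stages reuse them instead of
          // recomputing the whole lineage every time they are referenced.
          updated.persist(StorageLevel.MEMORY_AND_DISK)
          updated.count()       // materialize before dropping the previous copy
          factors.unpersist()
          factors = updated
        }

        sc.stop()
      }
    }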

Thanks.





possible bug in Spark's ALS implementation...

2014-03-11 Thread Michael Allman

Hi,

I'm implementing a recommender based on the algorithm described in 
http://www2.research.att.com/~yifanhu/PUB/cf.pdf. This algorithm forms the 
basis for Spark's ALS implementation for data sets with implicit features. 
The data set I'm working with is proprietary and I cannot share it, 
however I can say that it's based on the same kind of data in the 
paper---relative viewing time of videos. (Specifically, the rating for 
each video is defined as total viewing time across all visitors divided by 
video duration).
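
(As a made-up example of that definition: if three visitors watch 120, 45, and
300 seconds of a 150-second video, that video's implicit rating would be
(120 + 45 + 300) / 150 = 3.1.)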


I'm seeing counterintuitive, sometimes nonsensical recommendations. For 
comparison, I've run the training data through Oryx's in-VM implementation 
of implicit ALS with the same parameters. Oryx uses the same algorithm. 
(Source in this file: 
https://github.com/cloudera/oryx/blob/master/als-common/src/main/java/com/cloudera/oryx/als/common/factorizer/als/AlternatingLeastSquares.java)


The recommendations made by each system compared to one another are very 
different---more so than I think could be explained by differences in 
initial state. The recommendations made by the Oryx models look much 
better, especially as I increase the number of latent factors and the 
iterations. The Spark models' recommendations don't improve with increases 
in either latent factors or iterations. Sometimes, they get worse.


Because of the (understandably) highly-optimized and terse style of 
Spark's ALS implementation, I've had a very hard time following it well 
enough to debug the issue definitively. However, I have found a section of 
code that looks incorrect. As described in the paper, part of the implicit 
ALS algorithm involves computing a matrix product YtCuY (equation 4 in the 
paper). To optimize this computation, this expression is rewritten as YtY 
+ Yt(Cu - I)Y. I believe that's what should be happening here:


https://github.com/apache/incubator-spark/blob/v0.9.0-incubating/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L376

However, it looks like this code is in fact computing YtY + YtY(Cu - I), 
which is the same as YtYCu. If so, that's a bug. Can someone familiar with 
this code evaluate my claim?
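
For anyone who wants to sanity-check the algebra behind that rewrite, here is
a tiny Breeze sketch (made-up sizes and values, unrelated to Spark's code)
confirming that YtY + Yt(Cu - I)Y reproduces YtCuY exactly:

    import breeze.linalg._
    import breeze.numerics.abs

    object EquationFourCheck {
      def main(args: Array[String]): Unit = {
        val numItems = 5
        val rank     = 3
        val Y  = DenseMatrix.rand(numItems, rank)
        // Diagonal confidence matrix C_u for one user (illustrative values).
        val Cu = diag(DenseVector.tabulate(numItems)(i => 1.0 + i))
        val I  = DenseMatrix.eye[Double](numItems)

        val direct    = Y.t * Cu * Y                  // Yt Cu Y, equation 4 in the paper
        val rewritten = Y.t * Y + Y.t * (Cu - I) * Y  // the optimized rewrite

        // The two should agree to machine precision.
        println(max(abs(direct - rewritten)))
      }
    }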


Cheers,

Michael


Re: possible bug in Spark's ALS implementation...

2014-03-11 Thread Xiangrui Meng
Hi Michael,

I can help check the current implementation. Would you please go to
https://spark-project.atlassian.net/browse/SPARK and create a ticket
about this issue with component MLlib? Thanks!

Best,
Xiangrui

On Tue, Mar 11, 2014 at 3:18 PM, Michael Allman m...@allman.ms wrote:
 Hi,

 I'm implementing a recommender based on the algorithm described in
 http://www2.research.att.com/~yifanhu/PUB/cf.pdf. This algorithm forms the
 basis for Spark's ALS implementation for data sets with implicit features.
 The data set I'm working with is proprietary and I cannot share it, however
 I can say that it's based on the same kind of data in the paper---relative
 viewing time of videos. (Specifically, the rating for each video is
 defined as total viewing time across all visitors divided by video
 duration).

 I'm seeing counterintuitive, sometimes nonsensical recommendations. For
 comparison, I've run the training data through Oryx's in-VM implementation
 of implicit ALS with the same parameters. Oryx uses the same algorithm.
 (Source in this file:
 https://github.com/cloudera/oryx/blob/master/als-common/src/main/java/com/cloudera/oryx/als/common/factorizer/als/AlternatingLeastSquares.java)

 The recommendations made by each system compared to one another are very
 different---more so than I think could be explained by differences in initial
 state. The recommendations made by the Oryx models look much better,
 especially as I increase the number of latent factors and the iterations.
 The Spark models' recommendations don't improve with increases in either
 latent factors or iterations. Sometimes, they get worse.

 Because of the (understandably) highly-optimized and terse style of Spark's
 ALS implementation, I've had a very hard time following it well enough to
 debug the issue definitively. However, I have found a section of code that
 looks incorrect. As described in the paper, part of the implicit ALS
 algorithm involves computing a matrix product YtCuY (equation 4 in the
 paper). To optimize this computation, this expression is rewritten as YtY +
 Yt(Cu - I)Y. I believe that's what should be happening here:

 https://github.com/apache/incubator-spark/blob/v0.9.0-incubating/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L376

 However, it looks like this code is in fact computing YtY + YtY(Cu - I),
 which is the same as YtYCu. If so, that's a bug. Can someone familiar with
 this code evaluate my claim?

 Cheers,

 Michael