Hi,
I'm implementing a recommender based on the algorithm described in
http://www2.research.att.com/~yifanhu/PUB/cf.pdf. This algorithm forms the
basis for Spark's ALS implementation for data sets with implicit feedback.
The data set I'm working with is proprietary and I cannot share it.
However, I can say that it's based on the same kind of data as in the
paper: relative viewing time of videos. (Specifically, the "rating" for
each video is defined as total viewing time across all visitors divided by
video duration.)
I'm seeing counterintuitive, sometimes nonsensical recommendations. For
comparison, I've run the training data through Oryx's in-VM implementation
of implicit ALS with the same parameters. Oryx uses the same algorithm.
(Source in this file:
https://github.com/cloudera/oryx/blob/master/als-common/src/main/java/com/cloudera/oryx/als/common/factorizer/als/AlternatingLeastSquares.java)
The recommendations made by the two systems are very different from one
another, more so than I think could be explained by differences in
initial state. The recommendations made by the Oryx models look much
better, especially as I increase the number of latent factors and the
iterations. The Spark models' recommendations don't improve with increases
in either latent factors or iterations. Sometimes, they get worse.
Because of the (understandably) highly optimized and terse style of
Spark's ALS implementation, I've had a very hard time following it well
enough to debug the issue definitively. However, I have found a section of
code that looks incorrect. As described in the paper, part of the implicit
ALS algorithm involves computing a matrix product YtCuY (equation 4 in the
paper). To optimize this computation, this expression is rewritten as YtY
+ Yt(Cu - I)Y. I believe that's what should be happening here:
https://github.com/apache/incubator-spark/blob/v0.9.0-incubating/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L376
However, it looks like this code is in fact computing YtY + YtY(Cu - I),
which is the same as YtYCu. If so, that's a bug. Can someone familiar with
this code evaluate my claim?
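
To make the identity concrete, here is a small numeric check in plain
Python. It is only a sketch on made-up numbers, not the Spark or Oryx code:
the factor matrix Y, the ratings r_u, and alpha = 40 are illustrative
assumptions, and the confidence weights follow the paper's
c_ui = 1 + alpha * r_ui. It confirms that YtCuY equals YtY + Yt(Cu - I)Y,
and that the second term can be accumulated from only the items the user
actually rated.

```python
# Sketch only: toy values, not taken from Spark, Oryx, or any real data set.

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def diag(values):
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

def close(a, b, tol=1e-9):
    return all(abs(x - y) <= tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Y: 4 items x 2 latent factors (made-up values).
Y = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 1.5]]
# Ratings of a single user u; 0.0 means the item was never viewed.
r_u = [0.0, 2.5, 0.0, 1.0]
alpha = 40.0  # assumed confidence scaling
c_u = [1.0 + alpha * r for r in r_u]  # diagonal of Cu

Yt = transpose(Y)
YtY = matmul(Yt, Y)  # shared across all users, computed once

# Direct form: Yt Cu Y.
direct = matmul(matmul(Yt, diag(c_u)), Y)

# Rewritten form: YtY + Yt (Cu - I) Y.
rewritten = add(YtY, matmul(matmul(Yt, diag([c - 1.0 for c in c_u])), Y))

# Sparse form: start from YtY and add (c_ui - 1) * y_i y_i^T for rated items only.
sparse = [row[:] for row in YtY]
for i, c in enumerate(c_u):
    if c != 1.0:
        y = Y[i]
        for a in range(len(y)):
            for b in range(len(y)):
                sparse[a][b] += (c - 1.0) * y[a] * y[b]

assert close(direct, rewritten)
assert close(direct, sparse)
```

A check like this on a few users' data might help narrow down whether the
per-user matrix built in ALS.scala matches the direct form.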
Cheers,
Michael