Hi Michael,

Thanks for looking into the details! Computing X first and computing Y
first can produce different results, because the initial objective
values can differ by a lot. But the algorithm should converge after a
few iterations either way. It is hard to say which should go first;
after all, the definitions of "user" and "product" are arbitrary. One
trick we can use is to rescale the columns of X and Y after each
iteration so that they have the same column norms.
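For intuition, here is a minimal pure-Python sketch (not the Spark code) of that rescaling trick. The matrices and the `rescale` helper are made up for illustration; the key property is that scaling column j of X by s and column j of Y by 1/s leaves every prediction x_u . y_i unchanged.

```python
import math

# Toy factors: X is n_users x k, Y is n_items x k (k = 2 here).
X = [[1.0, 4.0], [2.0, 2.0]]
Y = [[0.5, 1.0], [0.5, 3.0]]

def col_norms(M):
    k = len(M[0])
    return [math.sqrt(sum(row[j] ** 2 for row in M)) for j in range(k)]

def rescale(X, Y):
    """Rescale column j of X and Y so both end up with the same norm.

    Column j of X is multiplied by s_j and column j of Y by 1/s_j
    (up to a shared target norm), so every product x_u . y_i, and
    hence every prediction, is unchanged.
    """
    nx, ny = col_norms(X), col_norms(Y)
    for j in range(len(nx)):
        target = math.sqrt(nx[j] * ny[j])
        sx, sy = target / nx[j], target / ny[j]
        for row in X:
            row[j] *= sx
        for row in Y:
            row[j] *= sy
    return X, Y

X, Y = rescale(X, Y)
```

After the call, `col_norms(X)` equals `col_norms(Y)`, while dot products between user and item rows are preserved.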

For the comparison, you should compute some metrics to verify the convergence.
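As a sketch of what "compute some metrics" could look like, here is a toy rank-1 ALS loop in pure Python that records training RMSE after each iteration. The ratings matrix, the rank-1 closed-form updates, and the `als` helper are all made up for illustration, not taken from the Spark implementation; the point is just that the metric should settle down as the factors converge.

```python
import math

def rmse(R, x, y):
    """Root-mean-square error of the rank-1 reconstruction x * y^T over all entries."""
    se = sum((R[u][i] - x[u] * y[i]) ** 2
             for u in range(len(x)) for i in range(len(y)))
    return math.sqrt(se / (len(x) * len(y)))

def als(R, iters=10):
    n, m = len(R), len(R[0])
    y = [1.0] * m  # initialize Y, then compute X first, as in the papers
    history = []
    for _ in range(iters):
        # Closed-form least-squares update for each user factor given Y (rank 1).
        sy = sum(v * v for v in y)
        x = [sum(R[u][i] * y[i] for i in range(m)) / sy for u in range(n)]
        # Then update each item factor given X.
        sx = sum(v * v for v in x)
        y = [sum(R[u][i] * x[u] for u in range(n)) / sx for i in range(m)]
        history.append(rmse(R, x, y))
    return x, y, history

x, y, history = als([[5.0, 3.0], [4.0, 2.0], [1.0, 1.0]])
```

Each half-step solves an exact least-squares problem, so the training RMSE in `history` is non-increasing; comparing such a curve for the X-first and Y-first orders would show whether both runs reach a similar objective value.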

I don't think initializing Y is necessary if we start with X. However,
because RDDs are evaluated lazily, if Y_0 is never used its data is
never actually generated. So the overhead should be small.

Best,
Xiangrui

On Fri, Mar 14, 2014 at 5:52 PM, Michael Allman <m...@allman.ms> wrote:
> I've been thoroughly investigating this issue over the past couple of days
> and have discovered quite a bit. For one thing, there is definitely (at
> least) one issue/bug in the Spark implementation that leads to incorrect
> results for models generated with rank > 1 or a large number of iterations.
> I will post a bug report with a thorough explanation this weekend or on
> Monday.
>
> I believe I've been able to track down every difference between the Spark
> and Oryx implementations that leads to different results. I made some
> adjustments to the Spark implementation so that, given the same initial
> product/item vectors, the resulting model is identical to the one produced
> by Oryx within a small numerical tolerance. I've verified this for small
> data sets and am working on verifying this with some large data sets.
>
> Aside from those already identified in this thread, another significant
> difference in the Spark implementation is that it begins the factorization
> process by computing the product matrix (Y) from the initial user matrix
> (X). Both of the papers on ALS referred to in this thread begin the process
> by computing the user matrix. I haven't done any testing comparing the
> models generated starting from Y versus X, but the resulting models are
> very different. Is there a reason Spark begins the iteration by computing Y?
>
> Initializing both X and Y as is done in the Spark implementation seems
> unnecessary unless I'm overlooking some desired side-effect. Only the factor
> matrix which generates the other in the first iteration needs to be
> initialized.
>
> I also found that the product and user RDDs were being rebuilt many times
> over in my tests, even for tiny data sets. By persisting the RDD returned
> from updateFeatures() I was able to avoid a raft of duplicate computations.
> Is there a reason not to do this?
>
> Thanks.
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/possible-bug-in-Spark-s-ALS-implementation-tp2567p2704.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
