Hi Steven,

  Glad to see you here in the land of more-than-140chars!

On Wed, Feb 24, 2010 at 1:53 PM, Steven Buss <[email protected]> wrote:
>
>
> My research partner and I have a dataset that consists of 400,000
> users and 1.6 million articles, with about 22 million nonzeros. We are
> trying to use this data to make recommendations to users. We have
> tried using the SVD and PLSI, both with unsatisfactory results, and
> are now attempting kPCA.
>

You've done SVD on the 400K x 1.6M input matrix already, and found
the results unsatisfactory?  What was it you were using the singular
vectors for, and how did you do the SVD computation?


> We have a 400,000 by 400,000 sparse symmetric positive-definite
> matrix, H, that we need the top couple hundred eigenvectors/values
> for. Jake has told me that I can use MAHOUT-180 unchanged, but it will
> be doing redundant work and the output eigenvalues are the squares of
> the ones we actually want. This sounds like a good approach, but it
> would be great if mahout had an optimized eigendecomposition for
> symmetric matrices. Jake suggested I submit a JIRA ticket regarding
> this, which I plan to do.
>

Depending on how much RAM you have on your box, "couple hundred"
eigenvectors on 400k dimensions is at least "a couple" GB, and since
Lanczos tends to give you fewer eigenvectors than you expect (so you
need to ask for more than you think you want), "a couple" could turn
into maybe 10GB.  Doable, but hitting the boundaries of scalability.
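
For a rough sense of the numbers, here's a back-of-the-envelope sketch
(it assumes dense double-precision basis vectors and a 3x overshoot on
the rank, which are guesses, not measurements):

  // Back-of-the-envelope memory estimate for the Lanczos basis.
  public class BasisMemoryEstimate {
    public static void main(String[] args) {
      long numRows = 400000L;      // dimension of H
      long desiredRank = 200L;     // "a couple hundred" eigenvectors
      long overshoot = 3L;         // ask Lanczos for ~3x what you want
      long bytesPerDouble = 8L;

      long basisBytes = numRows * desiredRank * overshoot * bytesPerDouble;
      // ~2GB for the basis alone; working copies and orthogonalization
      // against the basis push the real footprint higher.
      System.out.printf("Lanczos basis: ~%.1f GB%n", basisBytes / 1e9);
    }
  }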

I'm in the process of filing another JIRA ticket about removing the
need to keep the full Lanczos basis in RAM (it's not strictly
necessary, although things will be slower without that RAM available),
to make this more scalable: the current implementation scales to an
"infinite" number of rows, but not to an infinite number of columns.

> H is the pairwise distance in feature space (calculated using a kernel
> function) between each pair of users (or some subset of users). After
> I mentioned this to Jake, he asked me "why aren't you just doing it
> all in one go? Kernelize on the rows, and do SVD on that? Why do the
> M*M^t intermediate step?" Unfortunately, I'm not sure what you're
> asking, Jake, can you clarify?
>

H is the *distance* using your kernel function, or *similarity*?  If it's
sparse, I imagine you mean it's really H(i,j) = k(r_i, r_j) where k is the
generalized inner product / kernel, right?

What I mean is this: when computing the SVD of your original matrix
via Lanczos, you compute (A^t * A) * v by taking the rows r_i and
summing up all of the ( v.dot(r_i) ) * r_i terms (a similar thing
happens with GHA).  If your kernel can be represented as

  k(r_i, r_j) = f(r_i).dot( f(r_j) )

where f is some map from R^(A.numCols) to R^(kernelDimension), then
you can just apply f to the rows as you iterate over them and compute
the SVD of the kernelized form of A.
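
Concretely, the loop I have in mind looks something like this (a
plain-Java sketch with dense arrays, not the actual Mahout solver; f
here is just a stand-in for whatever explicit feature map your kernel
admits):

  // Sketch (not Mahout code): the row-wise accumulation of
  // (A'^t * A') * v, where A' is A with the feature map f applied to
  // each row on the fly, so the kernelized matrix is never stored.
  public class KernelizedTimesSquared {

    interface FeatureMap {             // stand-in for the kernel's map f
      double[] apply(double[] row);
    }

    static double[] timesSquared(double[][] rows, double[] v, FeatureMap f) {
      double[] result = new double[v.length];  // v is in R^(kernelDimension)
      for (double[] row : rows) {
        double[] fRow = f.apply(row);           // f(r_i), the kernelized row
        double dot = 0.0;
        for (int j = 0; j < v.length; j++) {    // v.dot( f(r_i) )
          dot += v[j] * fRow[j];
        }
        for (int j = 0; j < v.length; j++) {    // add ( v.dot(f(r_i)) ) * f(r_i)
          result[j] += dot * fRow[j];
        }
      }
      return result;
    }
  }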

Of course, if you're not doing a kernel representable in that way,
then you're not getting far down this road.  The other problem with
this approach is that you really need to make sure you're doing
a hashing trick on the kernelized rows (a la stochastic
decomposition), or else you're probably going to run into problems
of size on your singular vector set (their dimensionality will be too
big).  Stochastic decomposition is really the thing that will make
this work much better (and it's what I've got planned for the 0.4 release).
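
By "hashing trick" I mean something along these lines (again just a
sketch of the idea, not Mahout API and not necessarily what 0.4 will
end up doing; the mixing function and bucket count are arbitrary):

  // Sketch of feature hashing: fold a huge kernelized feature vector
  // down to a fixed number of buckets by hashing the feature indices.
  public class HashedFeatures {

    static double[] hashFeatures(int[] indices, double[] values,
                                 int numBuckets) {
      double[] hashed = new double[numBuckets];
      for (int k = 0; k < indices.length; k++) {
        int bucket = (mix(indices[k]) & Integer.MAX_VALUE) % numBuckets;
        // a hashed sign keeps dot products unbiased in expectation
        double sign = (mix(~indices[k]) & 1) == 0 ? 1.0 : -1.0;
        hashed[bucket] += sign * values[k];
      }
      return hashed;
    }

    // cheap integer mixing; any reasonable hash would do here
    static int mix(int x) {
      x ^= (x >>> 16);
      x *= 0x85ebca6b;
      x ^= (x >>> 13);
      return x;
    }
  }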

What kinds of kernels are you planning to use?  How do you
plan on dealing with the userSparsity * numUsers^2 * kernelCost
computational cost of producing H?  That's even higher than the
desiredRank * numUsers * userSparsity cost of Lanczos.
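
To put rough numbers on that (back-of-the-envelope, using the ~22M
nonzeros over 400K users from your first paragraph, and assuming a
kernel that costs O(1) per nonzero and a 3x overshoot on the rank):

  // Back-of-the-envelope cost comparison; all numbers approximate.
  public class CostEstimate {
    public static void main(String[] args) {
      double numUsers = 4.0e5;
      double userSparsity = 2.2e7 / numUsers;  // ~55 nonzeros per user row
      double desiredRank = 600;        // ~3x the couple hundred you want
      double kernelCost = 1;           // assumed cost per nonzero

      double buildH = userSparsity * numUsers * numUsers * kernelCost;  // ~9e12
      double lanczosOnA = desiredRank * numUsers * userSparsity;        // ~1e10
      System.out.printf("building H: ~%.1e ops, Lanczos on A: ~%.1e ops%n",
                        buildH, lanczosOnA);
    }
  }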

  -jake
