Hi Steven,

Glad to see you here in the land of more-than-140-chars!
On Wed, Feb 24, 2010 at 1:53 PM, Steven Buss <[email protected]> wrote:
>
> My research partner and I have a dataset that consists of 400,000
> users and 1.6 million articles, with about 22 million nonzeros. We are
> trying to use this data to make recommendations to users. We have
> tried using the SVD and PLSI, both with unsatisfactory results, and
> are now attempting kPCA.
>

You've done SVD on the 400K x 1.6M input matrix already, and found the
results unsatisfactory? What were you using the singular vectors for, and
how did you do the SVD computation?

> We have a 400,000 by 400,000 sparse symmetric positive-definite
> matrix, H, that we need the top couple hundred eigenvectors/values
> for. Jake has told me that I can use MAHOUT-180 unchanged, but it will
> be doing redundant work and the output eigenvalues are the squares of
> the ones we actually want. This sounds like a good approach, but it
> would be great if Mahout had an optimized eigendecomposition for
> symmetric matrices. Jake suggested I submit a JIRA ticket regarding
> this, which I plan to do.
>

Depending on how much RAM you have on your box, "a couple hundred"
eigenvectors on 400k dimensions is at least a couple GB (each dense vector
of 400k doubles is about 3.2MB, so a few hundred of them is already a GB or
two before any overhead), and since Lanczos tends to produce fewer usable
eigenvectors than you ask for (so you need to ask for more than you think
you want), "a couple" could turn into maybe 10GB. Doable, but hitting the
boundaries of scalability. I'm in the process of filing another JIRA ticket
about removing the need to keep the full Lanczos basis in RAM (it's not
really necessary, although things will be slower without the RAM
available), to make this more scalable: the current implementation scales
to an "infinite" number of rows, but not an infinite number of columns.

> H is the pairwise distance in feature space (calculated using a kernel
> function) between each pair of users (or some subset of users). After
> I mentioned this to Jake, he asked me "why aren't you just doing it
> all in one go? Kernelize on the rows, and do SVD on that? Why do the
> M*M^t intermediate step?" Unfortunately, I'm not sure what you're
> asking, Jake, can you clarify?
>

Is H the *distance* under your kernel function, or the *similarity*? If
it's sparse, I imagine you mean it's really H(i,j) = k(r_i, r_j), where k
is the generalized inner product / kernel, right?

What I mean is this: when computing the SVD of your original matrix via
Lanczos, you compute (A*A^t) * v by taking the rows r_i and summing up all
of the terms

  ( v.dot(r_i) ) * r_i

(a similar thing happens with GHA). If your kernel can be represented as

  k(r_i, r_j) = f(r_i).dot( f(r_j) )

where f is some map from R^(A.numCols) to R^(kernelDimension), then you
can just apply f to the rows as you iterate over them and compute the SVD
of the kernelized form of A directly. Of course, if your kernel can't be
represented in that way, then you're not getting far down this road.

The other problem with this approach is that you really need to make sure
you're doing a hashing trick on the kernelized rows (a la stochastic
decomposition), or else you're probably going to run into size problems
with your singular vector set (their dimensionality will be too big).
Stochastic decomposition is really the thing that will make this work much
better (and it's something I've got planned for the 0.4 release).

What kinds of kernels are you planning to use? And how do you plan on
dealing with the userSparsity * numUsers^2 * kernelCost computational cost
of producing H? That's even higher than the desiredRank * numUsers *
userSparsity cost of Lanczos: with numUsers = 400k and desiredRank in the
hundreds, it's on the order of a couple thousand times more work, even
before counting the kernel evaluations themselves.
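In case it helps, here is a rough sketch in plain Java (bare arrays, not
the actual Mahout Vector classes) of what I mean by kernelizing the rows
inside the (A*A^t)*v computation; the FeatureMap interface is just a
hypothetical stand-in for whatever feature map your kernel factors through:

// Rough sketch with plain Java arrays (not Mahout's Vector classes).
// Computes y = (F * F^t) * v one row at a time, where F has rows f(r_i):
//   y = sum_i ( v.dot(f(r_i)) ) * f(r_i)
// With f = identity this is exactly the (A * A^t) * v product Lanczos needs.
public class KernelizedTimesSquared {

  /** Hypothetical stand-in for the map f with k(r_i, r_j) = f(r_i).dot(f(r_j)). */
  interface FeatureMap {
    double[] apply(double[] row);
  }

  static double dot(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** One pass over the rows of A; v lives in the kernelized row space. */
  static double[] timesSquared(double[][] rowsOfA, FeatureMap f, double[] v) {
    double[] y = null;
    for (double[] row : rowsOfA) {
      double[] fRow = f.apply(row);       // kernelize each row as it streams by
      if (y == null) {
        y = new double[fRow.length];
      }
      double weight = dot(v, fRow);       // v.dot(f(r_i))
      for (int j = 0; j < fRow.length; j++) {
        y[j] += weight * fRow[j];         // accumulate ( v.dot(f(r_i)) ) * f(r_i)
      }
    }
    return y;
  }
}

The point is that the kernel only changes which vectors you stream over;
the Lanczos iteration itself doesn't care.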
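And here is a toy illustration of what I mean by the hashing trick (again
just a sketch with a made-up mixing function, not what will actually land
in 0.4): each column of the possibly enormous kernel feature space gets
sent to one of d buckets with a +/-1 sign, so every kernelized row, and
hence every singular vector, stays d-dimensional.

// Toy illustration of the hashing trick on one kernelized row: column j goes
// to bucket hash(j) % d with a +/-1 sign, so every row lands in a fixed
// d-dimensional space. The mixing function is made up for the example;
// anything that scatters column indices well will do.
public class HashedRow {

  static double[] hashProject(double[] kernelizedRow, int d) {
    double[] out = new double[d];
    for (int j = 0; j < kernelizedRow.length; j++) {
      double value = kernelizedRow[j];
      if (value == 0.0) {
        continue;                                    // only nonzeros cost anything
      }
      int h = mix(j);
      int bucket = Math.abs(h % d);
      double sign = (h >>> 31) == 0 ? 1.0 : -1.0;    // top bit decides the sign
      out[bucket] += sign * value;
    }
    return out;
  }

  /** Cheap integer mixing so nearby column indices don't collide trivially. */
  private static int mix(int j) {
    int h = j * 0x9E3779B9;                          // multiply by a large odd constant
    return h ^ (h >>> 16);
  }
}

You'd apply something like this to each kernelized row before feeding it
into the (A*A^t)*v loop above, with v then living in R^d.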
-jake
