I was chatting with Jake Mannix on Twitter about MAHOUT-180 and whether that patch is suitable for sparse symmetric positive-definite matrices, and he suggested we continue the conversation on the mailing list, so:
My research partner and I have a dataset of 400,000 users and 1.6 million articles, with about 22 million nonzeros. We are trying to use this data to make recommendations to users. We have tried SVD and PLSI, both with unsatisfactory results, and are now attempting kernel PCA (kPCA).

We have a 400,000 by 400,000 sparse symmetric positive-definite matrix, H, for which we need the top couple hundred eigenvectors and eigenvalues. Jake has told me that I can use MAHOUT-180 unchanged, but that it will do redundant work and the output eigenvalues will be the squares of the ones we actually want. This sounds like a workable approach, but it would be great if Mahout had an optimized eigendecomposition for symmetric matrices. Jake suggested I file a JIRA ticket about this, which I plan to do.

For background: H holds the pairwise distances in feature space (calculated with a kernel function) between each pair of users (or some subset of users). After I mentioned this to Jake, he asked me: "why aren't you just doing it all in one go? Kernelize on the rows, and do SVD on that? Why do the M*M^t intermediate step?" Unfortunately, I'm not sure what you're asking, Jake; can you clarify? To make sure I have the algebra straight, I worked through a small sanity check, below.
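If I have the algebra right, there are two separate facts in play, and a small NumPy sanity check seems to confirm both. This is only a sketch: the dimensions are toy stand-ins for our real matrix, and it assumes a linear kernel so that H = M M^T, which may be exactly where my confusion lies.

import numpy as np

# Toy stand-in for our 400,000 x 1,600,000 user-article matrix.
rng = np.random.default_rng(0)
M = rng.random((40, 160))

# Assuming a linear kernel, the Gram matrix H = M M^T is the
# symmetric positive semi-definite matrix we want to decompose.
H = M @ M.T

# Ground truth: eigenpairs of H, sorted in descending order.
lam, V = np.linalg.eigh(H)
lam, V = lam[::-1], V[:, ::-1]

# (1) A Lanczos SVD solver fed the symmetric H implicitly works on
# H^T H = H^2, so the "eigenvalues" it reports are the squares of
# the ones we want, while the vectors are unchanged.
lam_sq = np.linalg.eigvalsh(H.T @ H)[::-1]
assert np.allclose(lam_sq, lam**2)

# (2) Jake's suggestion, as I understand it: skip forming H and run
# the SVD on M directly. M = U S V^T implies H = U S^2 U^T, so the
# left singular vectors of M are the eigenvectors of H, and the
# squared singular values are its eigenvalues.
U, S, Vt = np.linalg.svd(M, full_matrices=False)
assert np.allclose(S**2, lam)

# The top eigenvectors agree up to sign.
k = 5
assert np.allclose(np.abs(U[:, :k]), np.abs(V[:, :k]))

print("eig(H) recovered from svd(M); squared values confirmed")

If that is right, the appeal of the one-step route is that it never materializes the 400,000 by 400,000 H at all, and the eigenvalues we want are just the squared singular values of M.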
Steven Buss
[email protected]
http://www.stevenbuss.com/