Hi Mike,

Welcome to the long journey down the road of dimensional reduction. :)
On Fri, Mar 5, 2010 at 5:05 PM, mike bowles <m...@mbowles.com> wrote:

> Really large matrices require using one of the randomizing methods to get
> done.

"Require" is a strong term. Really, really large (but still sparse!) matrices are SVD'able by both parallel and (really fast) serial means, and Mahout has implementations of both: in trunk, see the o.a.m.math.hadoop.decomposer package, for example. I would agree that randomizing methods (for a good and fairly recent review, see http://arxiv.org/abs/0909.4061) are probably the best way to go, and they are on the roadmap for mahout-0.4, as you can see from MAHOUT-309 <https://issues.apache.org/jira/browse/MAHOUT-309>.

> Lanczos Algorithm - The Lanczos algorithm is covered pretty well in
> Chapter 9 of Golub and Van Loan, "Matrix Computations". The Lanczos
> algorithm constructs an orthogonal basis set for the Krylov spaces (e.g.
> the space spanned by (x, Ax, A^2 x, ...)). The basic Lanczos algorithm is
> only suited to symmetric matrices and has some numerical difficulties
> maintaining orthonormality. The numerical issues are addressed in "Matrix
> Computations". That book also describes the Arnoldi algorithm, which is a
> derivative of the Lanczos algorithm. The Arnoldi algorithm works for
> square non-symmetric matrices, but it also has some numerical issues,
> which are touched upon in "Matrix Computations".

Lanczos is currently implemented in Mahout, in a way which does the repeated multiplication by A in parallel over Hadoop. We haven't really benchmarked how fast it is, but it seems to scale horizontally pretty well: a matrix with twice the number of non-zero entries gets done in roughly the same amount of time if you double the number of machines going at it.
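For intuition, here's what the basic iteration looks like as a toy sketch in plain Python (purely illustrative, not the Mahout/Hadoop Java code; the matrix, the vector names, and the full re-orthogonalization pass are my assumptions). It builds an orthonormal basis for the Krylov space span(v, Av, A^2 v, ...), with the brute-force re-orthogonalization that papers over Lanczos's loss of orthonormality:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def lanczos(A, v0, k):
    """Toy Lanczos for a symmetric matrix A: k steps of the three-term
    recurrence, returning the orthonormal Krylov basis and the alpha/beta
    coefficients that define the small tridiagonal matrix T."""
    n = len(v0)
    nrm = math.sqrt(dot(v0, v0))
    v = [x / nrm for x in v0]
    basis, alphas, betas = [v], [], []
    v_prev, beta = [0.0] * n, 0.0
    for _ in range(k):
        w = matvec(A, v)                      # the only touch of A per step
        alpha = dot(w, v)
        w = [w[i] - alpha * v[i] - beta * v_prev[i] for i in range(n)]
        # Brute-force re-orthonormalization against every earlier basis
        # vector -- the ad hoc fix for the numerical drift mentioned above.
        for b in basis:
            c = dot(w, b)
            w = [w[i] - c * b[i] for i in range(n)]
        alphas.append(alpha)
        beta = math.sqrt(dot(w, w))
        if beta < 1e-12:                      # Krylov space exhausted
            break
        betas.append(beta)
        v_prev, v = v, [x / beta for x in w]
        basis.append(v)
    return basis, alphas, betas
```

The payoff is that the eigenproblem for the small tridiagonal T (built from alphas/betas) is cheap, and its extreme eigenvalues approximate those of the huge A.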
The numerical overflow/underflow issues are well known, and are handled by a variety of fairly ad hoc methods in Mahout: we brute-force re-orthonormalize after every iteration, and after the whole thing is done, another pass may be executed which strips out all the spurious "duplicate" or non-converging eigenvectors. There are more sophisticated techniques, like implicitly restarted Lanczos, as well as others; if there is need/desire, we may go down that route at some point.

> Divide and Conquer - A more current favorite is the "divide and conquer"
> algorithm, also described in "Matrix Computations", in Section 8.54 (in

As Ted said, these are techniques for dealing with matrices which are tridiagonal, which you'll rarely have if you have a monstrously sized matrix. Additionally, there are already algorithms out there which do this in basically linear time: both Colt (whose codebase Mahout has absorbed) and Apache Commons Math can do eigendecomposition on tridiagonal matrices of reasonable size very, very quickly; we use the Colt version internally as part of Lanczos.

> Non-square matrices - All of the methods I've found for actually producing
> singular values or singular vectors are targeted at symmetric matrices or
> at most square unsymmetric ones (the Arnoldi algorithm). I haven't seen
> anything that attacks the non-square matrix that inevitably arises in
> latent semantic analysis (for example). One approach that comes to mind is
> the following. Suppose we have an m x n matrix M that needs svd'ing. It's
> easy to show that MM^T has the same LHS singular vectors as M.

Yep, as Ted said as well, finding the eigenvectors of M*M^T or M^T*M will get you either the left or the right singular vectors of M.

> So an obvious way to use existing svd machinery would be to
> calculate the eigenvectors and eigenvalues of MM^T and of M^TM.
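That relationship is easy to check numerically. A throwaway sketch in plain Python (the toy 3x2 matrix and the naive power iteration are my own illustration, not Mahout code): power-iterate on M M^T to get its top eigenvector u with eigenvalue lambda; then sigma = sqrt(lambda) is M's top singular value, and v = M^T u / sigma is the matching right singular vector, so M v = sigma u.

```python
import math

def matvec(M, v):
    return [sum(r[j] * v[j] for j in range(len(v))) for r in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

M = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]
Mt = transpose(M)

u = [1.0, 1.0, 1.0]
for _ in range(200):                     # power iteration on B = M M^T
    u = matvec(M, matvec(Mt, u))         # apply B without ever forming it
    nrm = math.sqrt(sum(x * x for x in u))
    u = [x / nrm for x in u]

# Rayleigh quotient gives the top eigenvalue of M M^T = sigma^2.
lam = sum(a * b for a, b in zip(u, matvec(M, matvec(Mt, u))))
sigma = math.sqrt(lam)                   # top singular value of M
v = [x / sigma for x in matvec(Mt, u)]   # matching right singular vector

# Verify the SVD relation M v = sigma u holds.
Mv = matvec(M, v)
resid = max(abs(Mv[i] - sigma * u[i]) for i in range(3))
```

Note that the iteration computes M (M^T u) rather than materializing M M^T, which matters once the matrices are large.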
> Typically one of these is relatively small (on the order of the number of
> singular values in the final approximation, squared (or so), or k log(k)
> for better randomized methods).

Quite possibly neither of these is very small, if you're not doing a stochastic decomposition. For natural language corpora, one of the matrices is numDocuments x numDocuments, and is relatively sparse (but not as sparse as M, by a couple of orders of magnitude sometimes); the other is numUniqueTerms x numUniqueTerms, which, while much more bounded than document space, is still often really huge (if you use ngrams or phrases, it could be millions by millions), and even more dense.

Also, as Ted mentioned, there are tricks to do the decomposition of M^T M far more efficiently than ever actually computing M^T M in the first place. In Mahout, we do it by a simple reordering of the summation that goes into (M^T M) * v, and do it in one pass over the data.

Incidentally, I had never thought of Ted's technique of doing it on the wider matrix (0 M^T \\ M 0). It may be more stable, but is it as fast? It seems that with each "single iteration" of multiplying by this matrix, you are actually multiplying by both M *and* M^T, which is two passes over the data set, not one. It should be pretty easy to implement, though.

> This is the best I've got so far. If anyone has better, I'd really love to
> hear about it.

If you're not familiar with the Generalized Hebbian Algorithm, it's another way to rapidly find singular vectors by simply streaming over the rows, keeping very little in memory; it can run on a single machine even for enormous data sets (though it will take a while if the data size is really huge). It's described in detail on Genevieve Gorrell's website: http://www.dcs.shef.ac.uk/~genevieve/ml.html (it's implemented in Mahout in o.a.m.math.decomposer.HebbianSolver, and does not require Hadoop).

Check out trunk (or grab 0.3 when it comes out "Real Soon Now") and give our implementations a spin!

  -jake

> Mike
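P.S. The summation reordering for (M^T M) * v is simple enough to sketch in a few lines of throwaway Python (illustrative only, not the actual Mahout Java): since (M^T M) v = sum over rows r of (r . v) * r, one streaming pass over the rows of M suffices, and M^T M itself, which can be enormous and dense, is never materialized.

```python
def ata_times_v(rows, v):
    """Compute (M^T M) * v in a single streaming pass over M's rows,
    without ever forming M^T M: accumulate (r . v) * r for each row r."""
    y = [0.0] * len(v)
    for r in rows:
        s = sum(r[j] * v[j] for j in range(len(v)))   # r . v
        for j in range(len(v)):
            y[j] += s * r[j]                          # y += (r . v) * r
    return y
```

Each row is touched once and only a vector of length numColumns is kept in memory, which is exactly what makes the trick map nicely onto a pass over a distributed row matrix.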