Hi Mike,

  Welcome to the long journey down the road of dimensional reduction. :)

On Fri, Mar 5, 2010 at 5:05 PM, mike bowles <m...@mbowles.com> wrote:
>
> Really large matrices require using one of the randomizing methods to get
> done.


"Require" is a strong term.  Really really large (but still sparse!)
matrices are SVD'able by both parallel and (really fast) serial means
(and Mahout has implementations of both: in trunk, see the
o.a.m.math.hadoop.decomposer package, for example).

I would agree that randomizing methods (for a good and fairly recent
review, see http://arxiv.org/abs/0909.4061) are probably the best way to
go, and they are on the roadmap for mahout-0.4, as you can see from
MAHOUT-309 (https://issues.apache.org/jira/browse/MAHOUT-309).


> Lanczos Algorithm - The Lanczos algorithm is covered pretty well in Chapter
> 9 of (golub and van loan "Matrix Computations").  The Lanczos algorithm
> constructs an orthogonal basis set for the Krylov spaces (e.g. the space
> spanned by (x, Ax, A^2x, ...)).  The basic Lanczos algorithm is only suited to
> symmetric matrices and has some numerical difficulties maintaining
> orthonormality.  The numerical issues are addressed in "Matrix
> Computations".  That book also described the Arnoldi algorithm, which is a
> derivative of the Lanczos algorithm.  The Arnoldi algorithm works for
> square
> non-symmetric matrices.  The Arnoldi algorithm also has some numerical
> issues which are touched upon in "Matrix Computations".
>

Lanczos is currently implemented in Mahout, in a way which does the
repeated multiplication by A in parallel over Hadoop.  We haven't really
benchmarked how fast it is, but it seems to scale horizontally pretty
well (a matrix with twice the number of non-zero entries gets done in
roughly the same amount of time if you double the number of machines
going at it).
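
For anyone who wants to see what the inner loop actually looks like, here's
a rough single-machine sketch of the basic symmetric Lanczos iteration
(dense vectors, no Hadoop, and the class/method names are made up for
illustration - this is not Mahout's actual code, which distributes the
multiply-by-A step and works on sparse vectors):

  // Minimal Lanczos sketch for a symmetric n x n matrix A.  Builds an
  // orthonormal basis of the Krylov space (v, Av, A^2 v, ...) plus the
  // tridiagonal coefficients alpha/beta whose eigenvalues approximate
  // the extreme eigenvalues of A.
  public class LanczosSketch {
    public static void lanczos(double[][] a, double[] start, int k,
                               double[][] basis, double[] alpha, double[] beta) {
      int n = start.length;
      double[] v = normalize(start.clone());
      double[] vPrev = new double[n];
      for (int i = 0; i < k; i++) {
        basis[i] = v.clone();
        double[] w = multiply(a, v);             // the only pass over A per iteration
        alpha[i] = dot(w, v);
        for (int j = 0; j < n; j++) {
          w[j] -= alpha[i] * v[j] + (i > 0 ? beta[i - 1] * vPrev[j] : 0.0);
        }
        beta[i] = Math.sqrt(dot(w, w));
        if (beta[i] == 0.0) break;               // Krylov space exhausted early
        vPrev = v;
        v = scale(w, 1.0 / beta[i]);
      }
    }

    static double[] multiply(double[][] a, double[] v) {
      double[] r = new double[v.length];
      for (int i = 0; i < a.length; i++) {
        for (int j = 0; j < v.length; j++) { r[i] += a[i][j] * v[j]; }
      }
      return r;
    }
    static double dot(double[] x, double[] y) {
      double s = 0.0;
      for (int i = 0; i < x.length; i++) { s += x[i] * y[i]; }
      return s;
    }
    static double[] scale(double[] x, double c) {
      double[] r = new double[x.length];
      for (int i = 0; i < x.length; i++) { r[i] = c * x[i]; }
      return r;
    }
    static double[] normalize(double[] x) {
      return scale(x, 1.0 / Math.sqrt(dot(x, x)));
    }
  }

The distributed version essentially just swaps that multiply() step for a
map/reduce pass over the rows of the matrix.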

The numerical overflow/underflow issues are well known, and are handled
by a variety of fairly ad hoc methods in Mahout (we brute-force
re-orthonormalize after every iteration, and after the whole thing is done,
another pass may be executed which strips out all the spurious "duplicate"
or non-converging eigenvectors).  There are more sophisticated techniques
like "implicitly-restarted Lanczos" as well as others.  If there is
need/desire, we may go down that route at some point.
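
By "brute-force re-orthonormalize" I just mean something along these lines
(a sketch, not the actual code): after each new Lanczos vector is produced,
project out every previously accepted basis vector and renormalize.

  // Sketch of full re-orthogonalization (modified Gram-Schmidt).  In exact
  // arithmetic Lanczos keeps the basis orthogonal for free; in floating
  // point it drifts, hence this extra pass over the accumulated basis.
  static double[] reorthonormalize(double[] w, java.util.List<double[]> basis) {
    for (double[] b : basis) {
      double proj = 0.0;
      for (int j = 0; j < w.length; j++) { proj += w[j] * b[j]; }
      for (int j = 0; j < w.length; j++) { w[j] -= proj * b[j]; }  // strip drift toward b
    }
    double norm = 0.0;
    for (int j = 0; j < w.length; j++) { norm += w[j] * w[j]; }
    norm = Math.sqrt(norm);
    for (int j = 0; j < w.length; j++) { w[j] /= norm; }
    return w;
  }

It's O(k * n) extra work per iteration for k basis vectors of length n, which
is part of why the fancier restarting schemes exist.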


> Divide and Conquer - A more current favorite, the "divide and conquer"
> algorithm, is also described in "Matrix Computations".  In Section 8.54 (in
>

As Ted said, these are techniques for dealing with matrices which are
tridiagonal, which you'll rarely have if you have a monstrously sized
matrix.  Additionally, existing algorithms already do this in basically
linear time (both Colt, whose codebase Mahout has absorbed, and Apache
Commons Math can do eigendecomposition on tridiagonal matrices of
reasonable size very, very quickly - we use the Colt version internally
as part of Lanczos).
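
For example, if what you actually have is a smallish symmetric tridiagonal
matrix, something like the following chews through it essentially instantly
(a sketch against the Commons Math EigenDecomposition API - class and
constructor names vary a bit between releases, so adjust to whichever
version you have; the Colt equivalent is EigenvalueDecomposition):

  import org.apache.commons.math3.linear.Array2DRowRealMatrix;
  import org.apache.commons.math3.linear.EigenDecomposition;
  import org.apache.commons.math3.linear.RealMatrix;

  public class TridiagonalEigen {
    public static void main(String[] args) {
      // A small symmetric tridiagonal matrix: main diagonal plus one
      // off-diagonal on each side, zeros everywhere else.
      RealMatrix t = new Array2DRowRealMatrix(new double[][] {
          {2, 1, 0, 0},
          {1, 2, 1, 0},
          {0, 1, 2, 1},
          {0, 0, 1, 2}
      });
      EigenDecomposition eig = new EigenDecomposition(t);
      double[] values = eig.getRealEigenvalues();
      for (int i = 0; i < values.length; i++) {
        System.out.println("lambda_" + i + " = " + values[i]
            + ", v_" + i + " = " + eig.getEigenvector(i));
      }
    }
  }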


> Non-square matrices - All of the methods I've found for actually producing
> singular values or singular vectors are targeted at symmetric matrices or
> at most square unsymmetric ones (the Arnoldi algorithm).  I haven't seen
> anything that attacks the non-square matrix that inevitably arises in
> latent semantic analysis (for example).  One approach that comes to mind
> is the following.  Suppose we have an mxn matrix M that needs svd'ing.
> It's easy to show that MM^T has the same LHS singular vectors as M.


Yep, as Ted said as well, finding the eigenvectors of M*M^T or M^T*M will
get you either the left or right singular vectors of M.
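
For anyone following along, the one-line derivation: if M = U S V^T is the
SVD of M (U and V with orthonormal columns, S diagonal and non-negative),
then

  M M^T = U S V^T V S U^T = U S^2 U^T
  M^T M = V S U^T U S V^T = V S^2 V^T

so the eigenvectors of M M^T are the left singular vectors (columns of U),
the eigenvectors of M^T M are the right singular vectors (columns of V), and
in both cases the eigenvalues are the squares of the singular values - take
square roots to recover S.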


>  So an obvious way to use existing svd machinery would be to
> calculate the eigenvectors and eigenvalues of MM^T and of M^TM.  Typically
> one of these is relatively small (on the order of the number of singular
> values in the final approximation squared (or so) or k log(k) for better
> randomized methods).


Quite possibly neither of these is very small, if you're not doing a
stochastic decomposition.  For natural language corpora, one of the
matrices is numDocuments x numDocuments, and is relatively sparse
(but not as sparse as M, sometimes by a couple orders of magnitude);
the other is numUniqueTerms x numUniqueTerms, which, while much more
bounded than document space, is still often really huge (if you use
ngrams or phrases, it could be millions by millions), and even more
dense.

Also, as Ted mentioned, there are tricks to do the decomposition
of (M^T M) far more efficiently than ever actually computing M^T M
in the first place.  In Mahout, we do it by simply reordering the
summation that goes into (M^T * M) * v, and we do it in one pass over
the data.  Incidentally, I had never thought of Ted's technique of
doing it on the wider matrix (0 M^T \\ M 0).  It may be more stable,
but is it as fast?  It seems that in each "single iteration" of
multiplying by this matrix, you are actually multiplying by both M
*and* M^T, which is two passes over the data set, not one.  Should be
pretty easy to implement though.
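
Concretely, the reordering is just: (M^T M) v = sum over the rows r of M of
(r . v) * r, so you can compute the product in a single streaming pass over
the rows of M and never materialize M^T M at all.  A sketch (sparse rows as
parallel index/value arrays; the method name is made up, this isn't the
Mahout code itself):

  // One pass over the rows of M computes y = (M^T M) v without forming M^T M:
  //   (M^T M) v  =  sum over rows r of M of  (r . v) * r
  static double[] timesSquaredSketch(int[][] rowIndices, double[][] rowValues,
                                     double[] v) {
    double[] y = new double[v.length];
    for (int r = 0; r < rowIndices.length; r++) {
      int[] idx = rowIndices[r];
      double[] val = rowValues[r];
      double rDotV = 0.0;
      for (int k = 0; k < idx.length; k++) {
        rDotV += val[k] * v[idx[k]];            // r . v
      }
      for (int k = 0; k < idx.length; k++) {
        y[idx[k]] += rDotV * val[k];            // accumulate (r . v) * r into y
      }
    }
    return y;
  }

In a Hadoop setting, each mapper would handle a slice of rows and emit its
partial y, with a reducer summing them.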

> This is the best I've got so far.  If anyone has better, I'd really love
> to hear about it.
>

If you're not familiar with the Generalized Hebbian Algorithm, it's
another way to rapidly find singular vectors by simply streaming over
the rows, keeping very little in memory, and it can run on a single
machine even for enormous data sets (though it will take a while if
the data size is really huge).  It's described in detail on Genevieve
Gorrell's website: http://www.dcs.shef.ac.uk/~genevieve/ml.html
(it's implemented in Mahout in o.a.m.math.decomposer.HebbianSolver
and does not require Hadoop).
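
If it helps to see why so little state is needed: the core update (one
common formulation of Sanger's rule) really is just a few lines.  The
sketch below is illustrative only - it is not the HebbianSolver API, it
omits learning-rate decay and convergence testing, and the method name is
made up:

  // Generalized Hebbian Algorithm sketch: stream over the rows, keeping only
  // k vectors of length numCols in memory.  With a suitable (decaying)
  // learning rate, w[0..k-1] converge, up to sign, to the top k right
  // singular vectors of the row matrix.
  static double[][] ghaSketch(Iterable<double[]> rows, int numCols, int k,
                              double learningRate, int numPasses) {
    java.util.Random rnd = new java.util.Random(42);
    double[][] w = new double[k][numCols];
    for (int i = 0; i < k; i++) {
      for (int j = 0; j < numCols; j++) { w[i][j] = rnd.nextGaussian(); }
    }
    for (int pass = 0; pass < numPasses; pass++) {
      for (double[] x : rows) {
        double[] residual = x.clone();
        for (int i = 0; i < k; i++) {
          // Activation of component i on what earlier components haven't explained.
          double y = 0.0;
          for (int j = 0; j < numCols; j++) { y += w[i][j] * residual[j]; }
          // Oja-style Hebbian update for this component.
          for (int j = 0; j < numCols; j++) {
            w[i][j] += learningRate * y * (residual[j] - y * w[i][j]);
          }
          // Deflate before training the next component.
          for (int j = 0; j < numCols; j++) { residual[j] -= y * w[i][j]; }
        }
      }
    }
    return w;
  }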

Check out trunk (or get 0.3 when it comes out "Real Soon Now") and
give our implementations a spin!

  -jake



> Mike