[jira] Commented: (MAHOUT-180) port Hadoop-ified Lanczos SVD implementation from decomposer

Jake Mannix (JIRA) Fri, 09 Oct 2009 09:22:05 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764075#action_12764075
 ]


Jake Mannix commented on MAHOUT-180:
------------------------------------

Hey Prasen,  the approach to doing parallelized Lanczos is parallelized 
multiplication of (the square of) your input matrix by vectors, so my 
decomposer project has a DistributedMatrix class which is backed by matrix with 
rows which live in HDFS, and then Lanczos iteration just leaves your matrix 
where it is (so no additional data transfer of order the size of your matrix) 
and sends one vector at a time out to it to do parallelized matrix-by-vector 
multiplication.

Then, when you've gone far enough, you do a non-parallel SVD on a very small 
matrix which lives in memory, and you've got one half of your set of singular 
vector pairs (which is often all you need).

I'm not sure where the best write up on this approach is - once you decide on 
doing Lanczos, the parallelization procedure is straightforward (in the words 
of far-too-many CS professors: "At this point, it's only a matter of 
'engineering'" - heh).  The main not-completely-straightforward trick I used 
here is avoiding squaring the input matrix anywhere (which is needed in some 
form if you've got a non-square or non-symmetric matrix), and staying in 
completely sparse row-matrix form.

Currently I'm a bit blocked on providing a patch for this, given that the 
underlying linear algebra primitives in Mahout are probably going to change (to 
commons-math vectors and matrices most likely), in which case my port of 
decomposer will need to get completely refactored, and just doing the refactor 
once makes a lot more sense.

> port Hadoop-ified Lanczos SVD implementation from decomposer
> ------------------------------------------------------------
>
>                 Key: MAHOUT-180
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-180
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Jake Mannix
>            Priority: Minor
>
> I wrote up a hadoop version of the Lanczos algorithm for performing SVD on 
> sparse matrices available at http://decomposer.googlecode.com/, which is 
> Apache-licensed, and I'm willing to donate it.  I'll have to port over the 
> implementation to use Mahout vectors, or else add in these vectors as well.
> Current issues with the decomposer implementation include: if your matrix is 
> really big, you need to re-normalize before decomposition: find the largest 
> eigenvalue first, and divide all your rows by that value, then decompose, or 
> else you'll blow over Double.MAX_VALUE once you've run too many iterations 
> (the L^2 norm of intermediate vectors grows roughly as 
> (largest-eigenvalue)^(num-eigenvalues-found-so-far), so losing precision on 
> the lower end is better than blowing over MAX_VALUE).  When this is ported to 
> Mahout, we should add in the capability to do this automatically (run a 
> couple iterations to find the largest eigenvalue, save that, then iterate 
> while scaling vectors by 1/max_eigenvalue).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-180) port Hadoop-ified Lanczos SVD implementation from decomposer

Reply via email to