[ https://issues.apache.org/jira/browse/MAHOUT-180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jake Mannix updated MAHOUT-180:
-------------------------------

    Attachment: MAHOUT-180.patch

Ok, new patch. In addition to the woefully unfinished DistributedRowMatrix (only timesSquared() is really implemented, for the purposes of Lanczos / SVD), it contains DistributedLanczosSolver, for running raw Lanczos, and EigenVerificationJob, which checks what the cosAngle errors on the purported eigenvectors are, what their eigenvalues are, and whether they're all orthonormal. It then removes "bad" eigenvector/value pairs (based on a user-specified error margin), trims out any whose eigenvalue falls below a user-specified minimum threshold, and saves the rest back to HDFS, using a totally hacky "decorated" vector which bakes the eigenvalue and error into the name of the vector (that part should be redone less horribly).

Usage is like so:

$HADOOP_HOME/bin/hadoop -jar examples/target/mahout-examples-0.3-SNAPSHOT.job \
  org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver \
  --input path/to/vector-sequence-file --output outpath \
  --numRows 0 --numCols 12345 --rank 100

(--numRows is currently ignored and not needed, so 0 is fine.)

> port Hadoop-ified Lanczos SVD implementation from decomposer
> ------------------------------------------------------------
>
>                 Key: MAHOUT-180
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-180
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Math
>    Affects Versions: 0.2
>            Reporter: Jake Mannix
>            Assignee: Jake Mannix
>            Priority: Minor
>             Fix For: 0.3
>
>         Attachments: MAHOUT-180.patch, MAHOUT-180.patch, MAHOUT-180.patch, MAHOUT-180.patch, MAHOUT-180.patch
>
>
> I wrote up a Hadoop version of the Lanczos algorithm for performing SVD on sparse matrices, available at http://decomposer.googlecode.com/, which is Apache-licensed, and I'm willing to donate it. I'll have to port the implementation over to use Mahout vectors, or else add those vectors in as well.
> Current issues with the decomposer implementation: if your matrix is really big, you need to re-normalize before decomposition. Find the largest eigenvalue first and divide all your rows by that value, then decompose; otherwise you'll overflow Double.MAX_VALUE once you've run too many iterations (the L^2 norm of the intermediate vectors grows roughly as (largest-eigenvalue)^(num-eigenvalues-found-so-far), so losing precision on the low end is better than overflowing MAX_VALUE). When this is ported to Mahout, we should add the capability to do this automatically: run a couple of iterations to find the largest eigenvalue, save it, then iterate while scaling the vectors by 1/max_eigenvalue.
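For illustration only, here is a minimal, self-contained Java sketch of why the 1/max_eigenvalue rescaling matters. The lambdaMax and rank values are hypothetical, and the class name is made up for the example; none of this is taken from the patch itself.

/**
 * Illustration of the overflow issue described above: without rescaling,
 * the L^2 norm of the intermediate Lanczos vectors grows roughly like
 * (largest eigenvalue)^(number of eigenvalues found so far), which can
 * exceed Double.MAX_VALUE. Dividing the input rows by the largest
 * eigenvalue first keeps the norms bounded. All numbers are hypothetical.
 */
public class LanczosScalingSketch {
  public static void main(String[] args) {
    double lambdaMax = 1e20; // hypothetical largest eigenvalue of a big matrix
    int rank = 20;           // hypothetical number of eigenvalue/vector pairs requested

    // Without rescaling: the rough norm estimate overflows to Infinity.
    double unscaled = Math.pow(lambdaMax, rank);
    System.out.println("unscaled norm estimate ~ " + unscaled); // Infinity

    // With rescaling by 1/lambdaMax: the largest eigenvalue of the scaled
    // matrix is ~1, so the same growth law stays near 1.0 and never overflows
    // (only low-order precision is lost, as noted above).
    double scaled = Math.pow(lambdaMax / lambdaMax, rank);
    System.out.println("scaled norm estimate   ~ " + scaled);   // 1.0
  }
}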