[ https://issues.apache.org/jira/browse/MAHOUT-180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jake Mannix updated MAHOUT-180:
-------------------------------

    Attachment: MAHOUT-180.patch

Adds an EigenVerificationJob, which takes just as long as the
DistributedLanczosJob but, at the end of the day, prunes away superfluous
eigenvectors and those with eigenvalues below a chosen threshold, saves the
remaining clean eigenvectors back to HDFS, and produces nice output like this:

10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|0| = |2905105.325066675|, err = 8.231193504570911E-13 to verifiedOutput/largestCleanEigens
10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|2| = |921522.2179480934|, err = 3.312905505481467E-13 to verifiedOutput/largestCleanEigens
10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|5| = |422593.76267148677|, err = 9.690026558928366E-13 to verifiedOutput/largestCleanEigens
10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|7| = |326205.19577841857|, err = 8.280043317654417E-13 to verifiedOutput/largestCleanEigens
10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|8| = |284990.59331463446|, err = 2.398081733190338E-13 to verifiedOutput/largestCleanEigens
10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|11| = |253500.28860628756|, err = 0.03797261160467913 to verifiedOutput/largestCleanEigens
10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|12| = |253433.24512060767|, err = 1.4885870314174099E-12 to verifiedOutput/largestCleanEigens
10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|15| = |222354.15336025952|, err = 7.081002451059248E-13 to verifiedOutput/largestCleanEigens
10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|17| = |215156.97325760772|, err = 1.2456702336294256E-13 to verifiedOutput/largestCleanEigens
10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|18| = |200592.7782982773|, err = 3.72257780156815E-13 to verifiedOutput/largestCleanEigens
10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|19| = |191270.06867188454|, err = 3.610445276081009E-13 to verifiedOutput/largestCleanEigens
10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|20| = |168446.0868356986|, err = 1.2598810883446276E-12 to verifiedOutput/largestCleanEigens
10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|21| = |166320.23361954943|, err = 1.0635936575909E-13 to verifiedOutput/largestCleanEigens
10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|22| = |158882.90261344844|, err = 2.708944180085382E-13 to verifiedOutput/largestCleanEigens
10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|25| = |134577.41793521316|, err = 1.7830181775480014E-13 to verifiedOutput/largestCleanEigens
10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|26| = |124021.7093344012|, err = 1.773026170326375E-13 to verifiedOutput/largestCleanEigens
10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|27| = |121824.37156673158|, err = 1.3680168109431179E-12 to verifiedOutput/largestCleanEigens
10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|28| = |119613.86741751489|, err = 1.1823875212257917E-12 to verifiedOutput/largestCleanEigens
10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|29| = |119104.75278971005|, err = 1.2567724638756772E-13 to verifiedOutput/largestCleanEigens
10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|30| = |100623.36519880772|, err = 1.155742168634788E-12 to verifiedOutput/largestCleanEigens
10/02/10 01:24:06 INFO decomposer.EigenVerificationJob: appending e|31| = |88661.27936320615|, err = 1.0856870957809406E-12 to verifiedOutput/largestCleanEigens


The indices skip (e|0|, e|2|, e|5|, e|7|, ...) because superfluous eigenvectors
were found in between (their error was too high, and they were not orthogonal
to the kept ones either); the results are in descending eigenvalue order (this
run is on the 20news dataset).
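For the curious, the per-eigenvector check amounts to something like the
following minimal plain-Java sketch. Everything here is an illustrative
stand-in rather than the patch's actual API, and reading err as the angular
deviation of v from A*v (and reusing the same tolerance for the orthogonality
test) is my assumption about what the log is reporting:

    // Illustrative sketch only: the class and method names below are
    // hypothetical, not the API in the attached patch.
    public class EigenVerificationSketch {

      /** Cosine of the angle between x and y. */
      static double cosAngle(double[] x, double[] y) {
        double dot = 0.0, nx = 0.0, ny = 0.0;
        for (int i = 0; i < x.length; i++) {
          dot += x[i] * y[i];
          nx += x[i] * x[i];
          ny += y[i] * y[i];
        }
        return dot / Math.sqrt(nx * ny);
      }

      static double norm(double[] x) {
        double s = 0.0;
        for (double xi : x) {
          s += xi * xi;
        }
        return Math.sqrt(s);
      }

      /**
       * A true eigenvector v of A satisfies A*v = lambda*v, so v and A*v
       * should point in the same direction.  Assumption: "err" in the log
       * above is this angular deviation, err = 1 - cos(v, A*v).
       */
      static boolean isClean(double[] v, double[] av, double[][] keptSoFar,
                             double maxError, double minEigenvalue) {
        double err = 1.0 - cosAngle(v, av);
        if (err > maxError) {
          return false;                      // not really an eigenvector
        }
        if (norm(av) / norm(v) < minEigenvalue) {
          return false;                      // eigenvalue below threshold
        }
        for (double[] kept : keptSoFar) {
          // Reusing maxError as the orthogonality tolerance for simplicity.
          if (Math.abs(cosAngle(v, kept)) > maxError) {
            return false;                    // not orthogonal to a kept one
          }
        }
        return true;                         // clean: append to the output
      }
    }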

Next up is to try it on Wikipedia; after that, some cleanup, and this should be
ready for public consumption.

> port Hadoop-ified Lanczos SVD implementation from decomposer
> ------------------------------------------------------------
>
>                 Key: MAHOUT-180
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-180
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Math
>    Affects Versions: 0.2
>            Reporter: Jake Mannix
>            Assignee: Jake Mannix
>            Priority: Minor
>             Fix For: 0.3
>
>         Attachments: MAHOUT-180.patch, MAHOUT-180.patch, MAHOUT-180.patch, 
> MAHOUT-180.patch
>
>
> I wrote up a Hadoop version of the Lanczos algorithm for performing SVD on 
> sparse matrices, available at http://decomposer.googlecode.com/, which is 
> Apache-licensed, and I'm willing to donate it.  I'll have to port the 
> implementation to use Mahout vectors, or else add in those vectors as well.
> Current issues with the decomposer implementation include: if your matrix is 
> really big, you need to re-normalize before decomposition: find the largest 
> eigenvalue first, divide all your rows by that value, and then decompose, or 
> else you'll blow past Double.MAX_VALUE once you've run too many iterations 
> (the L^2 norm of the intermediate vectors grows roughly as 
> (largest-eigenvalue)^(num-eigenvalues-found-so-far), so losing precision on 
> the lower end is better than overflowing MAX_VALUE).  When this is ported to 
> Mahout, we should add the capability to do this automatically (run a 
> couple of iterations to find the largest eigenvalue, save it, then iterate 
> while scaling vectors by 1/max_eigenvalue).
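
For scale: with the top eigenvalue from the log above (about 2.9 x 10^6), that
growth estimate crosses Double.MAX_VALUE (about 1.8 x 10^308) after only
roughly 48 eigenvalues, since 48 * log10(2.9 x 10^6) is about 310 > 308.  Here
is a minimal plain-Java sketch of the proposed automatic renormalization,
assuming the matrix at hand is the symmetric positive semi-definite A^T A that
Lanczos works on in the SVD setting; the dense double[][] and all names are
hypothetical stand-ins for the real distributed sparse matrix, not Mahout or
decomposer API:

    // Sketch of the proposed auto-renormalization (hypothetical names; a
    // dense double[][] stands in for the real distributed sparse matrix).
    public class RenormalizeSketch {

      /** A few power iterations to estimate the largest eigenvalue of a
          symmetric positive semi-definite matrix a (e.g. A^T A). */
      static double estimateMaxEigenvalue(double[][] a, int iterations) {
        double[] v = new double[a.length];
        java.util.Arrays.fill(v, 1.0 / Math.sqrt(v.length));
        double lambda = 0.0;
        for (int iter = 0; iter < iterations; iter++) {
          double[] av = times(a, v);
          lambda = norm(av);               // |a*v| / |v|, with |v| == 1
          for (int i = 0; i < v.length; i++) {
            v[i] = av[i] / lambda;         // re-normalize for the next pass
          }
        }
        return lambda;
      }

      /** Divide every entry by the largest eigenvalue before Lanczos, so
          intermediate vector norms stay well below Double.MAX_VALUE. */
      static void scaleInPlace(double[][] a, double maxEigenvalue) {
        for (double[] row : a) {
          for (int j = 0; j < row.length; j++) {
            row[j] /= maxEigenvalue;
          }
        }
      }

      static double[] times(double[][] a, double[] v) {
        double[] result = new double[a.length];
        for (int i = 0; i < a.length; i++) {
          double s = 0.0;
          for (int j = 0; j < v.length; j++) {
            s += a[i][j] * v[j];
          }
          result[i] = s;
        }
        return result;
      }

      static double norm(double[] x) {
        double s = 0.0;
        for (double xi : x) {
          s += xi * xi;
        }
        return Math.sqrt(s);
      }
    }

The trade-off is as described above: scaling by 1/max_eigenvalue sacrifices a
little precision on the smallest eigenvalues in exchange for never overflowing.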

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
