[ 
https://issues.apache.org/jira/browse/MAHOUT-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014986#comment-13014986
 ] 

Sean Owen commented on MAHOUT-369:
----------------------------------

I'd like to commit the patch. Danny seems confident it's the right change, and 
Jake felt it was probably right. Derek suggests there may be more that needs 
to go into the patch. Danny, could you confirm whether these are all the 
changes that are necessary?

> Issues with DistributedLanczosSolver output
> -------------------------------------------
>
>                 Key: MAHOUT-369
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-369
>             Project: Mahout
>          Issue Type: Bug
>          Components: Math
>    Affects Versions: 0.3, 0.4
>            Reporter: Danny Leshem
>            Assignee: Jake Mannix
>              Labels: DistributedLanczosSolver, decomposer
>             Fix For: 0.5
>
>         Attachments: MAHOUT-369.patch
>
>
> DistributedLanczosSolver (line 99) claims to persist eigenVectors.numRows() 
> vectors.
> {code}
>     log.info("Persisting " + eigenVectors.numRows()
>         + " eigenVectors and eigenValues to: " + outputPath);
> {code}
> However, a few lines later (line 106) we have
> {code}
>     for(int i=0; i<eigenVectors.numRows() - 1; i++) {
>         ...
>     }
> {code}
> which only persists eigenVectors.numRows()-1 vectors.
> Seems like the most significant eigenvector (i.e. the one with the largest 
> eigenvalue) is omitted... off by one bug?
> Also, I think it would be better if the eigenvectors are persisted in 
> *reverse* order, meaning the most significant vector is marked "0", the 2nd 
> most significant is marked "1", etc.
> This, for two reasons:
> 1) When performing another PCA on the same corpus (say, with more principal 
> components), corresponding eigenvalues can be easily matched and compared.
> 2) Makes it easier to discard the least significant principal components, 
> which for Lanczos decomposition are usually garbage.
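
A minimal sketch of the two changes the description asks for: iterating over
all numRows() rows, and writing them in reverse order. This is not the
attached patch. The class EigenvectorPersistSketch, the method persistAll, the
double[][] stand-in for the eigenvector matrix, and the Map stand-in for the
output are all hypothetical. It also assumes that the last row holds the most
significant eigenvector, which is what the description implies.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EigenvectorPersistSketch {

    // Hypothetical stand-in for writing vectors to outputPath:
    // returns a map from output key to vector.
    static Map<Integer, double[]> persistAll(double[][] eigenVectors) {
        int numRows = eigenVectors.length;
        Map<Integer, double[]> out = new LinkedHashMap<>();
        // Loop bound is numRows (not numRows - 1), so every row is written.
        for (int i = 0; i < numRows; i++) {
            // Assumption: the last row is the most significant eigenvector,
            // so reading rows back to front stores it under key 0.
            out.put(i, eigenVectors[numRows - 1 - i]);
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] rows = { {0.1, 0.2}, {0.3, 0.4}, {0.5, 0.6} };
        Map<Integer, double[]> out = persistAll(rows);
        System.out.println("Persisted " + out.size() + " eigenVectors");
    }
}
```

With this ordering, dropping the least significant components amounts to
keeping only the lowest-numbered keys.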

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
