[ https://issues.apache.org/jira/browse/MAHOUT-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855014#action_12855014 ]
Jake Mannix commented on MAHOUT-369:
------------------------------------
Hold on that, Sean. I made the loop like that for a reason. I need to check in
again and verify if/where it's wrong, but it was not an oversight; it has to do
with the way the Colt code does EigenDecomposition.
> Issues with DistributedLanczosSolver output
> -------------------------------------------
>
> Key: MAHOUT-369
> URL: https://issues.apache.org/jira/browse/MAHOUT-369
> Project: Mahout
> Issue Type: Bug
> Components: Math
> Affects Versions: 0.3, 0.4
> Reporter: Danny Leshem
> Fix For: 0.4
>
>
> DistributedLanczosSolver (line 99) claims to persist eigenVectors.numRows()
> vectors.
> {code}
> log.info("Persisting " + eigenVectors.numRows() + " eigenVectors and eigenValues to: " + outputPath);
> {code}
> However, a few lines later (line 106) we have
> {code}
> for(int i=0; i<eigenVectors.numRows() - 1; i++) {
> ...
> }
> {code}
> which only persists eigenVectors.numRows()-1 vectors.
> Seems like the most significant eigenvector (i.e. the one with the largest
> eigenvalue) is omitted... off-by-one bug?
> Also, I think it would be better if the eigenvectors are persisted in
> *reverse* order, meaning the most significant vector is marked "0", the 2nd
> most significant is marked "1", etc.
> This, for two reasons:
> 1) When performing another PCA on the same corpus (say, with more principal
> components), corresponding eigenvalues can be easily matched and compared.
> 2) Makes it easier to discard the least significant principal components,
> which for Lanczos decomposition are usually garbage.
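The ordering proposed in the description can be sketched as follows. This is a minimal illustration only, not the DistributedLanczosSolver code: it uses a plain double[] in place of Mahout's matrix types, the class and method names (EigenOrderSketch, reverseOrder) are hypothetical, and it assumes, as the report implies, that the decomposition yields rows in ascending eigenvalue order so that the last row is the most significant. Whether the solver's numRows() - 1 bound is intentional is still open per the comment above, so the sketch does not take a position on that loop.

```java
import java.util.Arrays;

// Hypothetical stand-in for the persistence step; not part of Mahout.
class EigenOrderSketch {

    // Returns the values in reverse order, so that output index 0 holds the
    // entry that was last in the input (assumed here to be the largest
    // eigenvalue). Every input entry is kept; none is skipped.
    static double[] reverseOrder(double[] ascending) {
        int n = ascending.length;
        double[] descending = new double[n];
        for (int i = 0; i < n; i++) {
            descending[i] = ascending[n - 1 - i];
        }
        return descending;
    }

    public static void main(String[] args) {
        // Made-up eigenvalues in ascending order, for illustration only.
        double[] eigenvalues = {0.1, 0.7, 2.5, 9.0};
        System.out.println(Arrays.toString(reverseOrder(eigenvalues)));
    }
}
```

With this numbering, index 0 refers to the most significant component regardless of how many components were requested, and dropping the least significant ones amounts to truncating the tail of the output.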