[ https://issues.apache.org/jira/browse/MAHOUT-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854971#action_12854971 ]
Ted Dunning commented on MAHOUT-369:
------------------------------------

Can you create a suggested patch?

> Issues with DistributedLanczosSolver output
> -------------------------------------------
>
> Key: MAHOUT-369
> URL: https://issues.apache.org/jira/browse/MAHOUT-369
> Project: Mahout
> Issue Type: Bug
> Components: Math
> Affects Versions: 0.3, 0.4
> Reporter: Danny Leshem
> Fix For: 0.4
>
>
> DistributedLanczosSolver (line 99) claims to persist eigenVectors.numRows() vectors.
> {code}
> log.info("Persisting " + eigenVectors.numRows() + " eigenVectors and eigenValues to: " + outputPath);
> {code}
> However, a few lines later (line 106) we have
> {code}
> for(int i=0; i<eigenVectors.numRows() - 1; i++) {
>     ...
> }
> {code}
> which only persists eigenVectors.numRows()-1 vectors.
> Seems like the most significant eigenvector (i.e. the one with the largest eigenvalue) is omitted... off by one bug?
> Also, I think it would be better if the eigenvectors are persisted in *reverse* order, meaning the most significant vector is marked "0", the 2nd most significant is marked "1", etc.
> This, for two reasons:
> 1) When performing another PCA on the same corpus (say, with more principal components), corresponding eigenvalues can be easily matched and compared.
> 2) Makes it easier to discard the least significant principal components, which for Lanczos decomposition are usually garbage.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
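
To make the two points in the report concrete, here is a minimal, self-contained Java sketch. It is not the Mahout code and not a proposed patch: the class EigenPersistSketch and both methods are hypothetical, and instead of writing vectors anywhere they only record which source row would be written under which key. The sketch also assumes, as the reporter suggests, that the last row of eigenVectors holds the most significant eigenvector.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the persistence loop discussed in MAHOUT-369.
// Each int[] is {outputKey, sourceRow}; nothing is actually written to disk.
final class EigenPersistSketch {

    // Loop bound as quoted in the issue: stops one row early,
    // so only numRows - 1 vectors are recorded.
    static List<int[]> buggyOrder(int numRows) {
        List<int[]> written = new ArrayList<>();
        for (int i = 0; i < numRows - 1; i++) {
            written.add(new int[] {i, i});
        }
        return written;
    }

    // Visits every row and reverses the order, so that key 0 maps to the
    // last row (assumed here to be the most significant eigenvector).
    static List<int[]> reversedOrder(int numRows) {
        List<int[]> written = new ArrayList<>();
        for (int i = 0; i < numRows; i++) {
            written.add(new int[] {i, numRows - 1 - i});
        }
        return written;
    }

    public static void main(String[] args) {
        int numRows = 4;
        System.out.println("quoted loop records " + buggyOrder(numRows).size() + " of " + numRows);
        for (int[] pair : reversedOrder(numRows)) {
            System.out.println("key " + pair[0] + " <- row " + pair[1]);
        }
    }
}
```

With this key assignment, dropping the least significant components amounts to keeping only the lowest keys, which is the second benefit listed in the report.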