[ https://issues.apache.org/jira/browse/MAHOUT-308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838069#action_12838069 ]
Jake Mannix commented on MAHOUT-308:
------------------------------------

Of course, individual mappers also have to be able to fit the dense soon-to-be-eigenvectors in memory, which becomes problematic as numCols grows too high; and even without this constraint, this solution would only allow scaling to go from millions of columns to close to a billion columns. Modifying the M/R job to shard the column space is doable, but realistically, stochastic SVD is the solution for multi-billions of features.

> Improve Lanczos to handle extremely large feature sets (without hashing)
> ------------------------------------------------------------------------
>
>                 Key: MAHOUT-308
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-308
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.3
>         Environment: all
>            Reporter: Jake Mannix
>            Assignee: Jake Mannix
>             Fix For: 0.4
>
>
> DistributedLanczosSolver currently keeps all Lanczos vectors in memory on the driver (client) computer while Hadoop is iterating. The memory requirement for this is (desiredRank) * (numColumnsOfInput) * 8 bytes, which for desiredRank of a few hundred starts to cap out usefulness at a few million columns for most commodity hardware (e.g., a rank-200 basis over 5 million columns already needs 8 GB on the driver).
> The solution (without doing stochastic decomposition) is to persist the Lanczos basis to disk, except for the most recent two vectors. Some care must be taken with the "orthogonalizeAgainstBasis()" method call, which uses the entire basis; that step would be slower this way. (A sketch of the streaming orthogonalization follows below.)
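For concreteness, here is a minimal sketch (not the actual DistributedLanczosSolver code) of what orthogonalizing the newest Lanczos vector against a disk-persisted basis could look like: each stored basis vector is streamed in one at a time, so only two dense vectors are ever in memory at once. The class name, the file layout (one flat file of raw doubles per basis vector), and the helper method are all hypothetical:

{code:java}
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class OnDiskOrthogonalizer {

  // v <- v - (v . b) * b, for each basis vector b persisted under basisDir.
  // Only v and one reusable buffer b are held in memory at any time.
  public static void orthogonalizeAgainstBasis(double[] v, File basisDir) throws IOException {
    double[] b = new double[v.length];
    for (File f : basisDir.listFiles()) {
      readVector(f, b);
      double dot = 0.0;
      for (int i = 0; i < v.length; i++) {
        dot += v[i] * b[i];
      }
      for (int i = 0; i < v.length; i++) {
        v[i] -= dot * b[i];  // classical Gram-Schmidt projection step
      }
    }
  }

  // Hypothetical on-disk format: each basis vector is a flat run of doubles
  // in DataOutputStream's big-endian encoding.
  private static void readVector(File f, double[] buf) throws IOException {
    DataInputStream in = new DataInputStream(new BufferedInputStream(new FileInputStream(f)));
    try {
      for (int i = 0; i < buf.length; i++) {
        buf[i] = in.readDouble();
      }
    } finally {
      in.close();
    }
  }
}
{code}

Under these assumptions, each Lanczos iteration streams roughly (currentRank) * (numColumnsOfInput) * 8 bytes from disk during orthogonalization; that I/O is the slowdown mentioned above, but the driver's memory footprint stays at two dense vectors regardless of desiredRank.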