[ https://issues.apache.org/jira/browse/MAHOUT-308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851843#action_12851843 ]
Danny Leshem edited comment on MAHOUT-308 at 3/31/10 12:39 PM:
---------------------------------------------------------------

Currently a blocker for me, with numColumnsOfInput ~ 10M and desiredRank ~ 100.
Will be more than happy to review and test any submitted patch!

Probably also a blocker for [Mahout's record-breaking attempt|http://mail-archives.apache.org/mod_mbox/lucene-mahout-dev/201002.mbox/<c7d45fc71002251150n3a8a70e4m3a7ca6e87df09...@mail.gmail.com>], if you ever manage to find a big enough dataset :)

      was (Author: dleshem):
    Currently a blocker for me, with numColumnsOfInput ~ 10M and desiredRank ~ 100.
Will be more than happy to review and test any submitted patch!

Probably also a blocker for Mahout's record-breaking attempt, if you ever manage to find a big enough dataset :)
http://mail-archives.apache.org/mod_mbox/lucene-mahout-dev/201002.mbox/<c7d45fc71002251150n3a8a70e4m3a7ca6e87df09...@mail.gmail.com>

> Improve Lanczos to handle extremely large feature sets (without hashing)
> ------------------------------------------------------------------------
>
>                 Key: MAHOUT-308
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-308
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.3
>         Environment: all
>            Reporter: Jake Mannix
>            Assignee: Jake Mannix
>             Fix For: 0.4
>
>
> DistributedLanczosSolver currently keeps all Lanczos vectors in memory on the
> driver (client) computer while Hadoop is iterating. The memory requirement
> is (desiredRank) * (numColumnsOfInput) * 8 bytes, which, for
> desiredRank = a few hundred, starts to cap out usefulness at
> some-small-number * millions of columns on most commodity hardware.
> The solution (without doing stochastic decomposition) is to persist the
> Lanczos basis to disk, except for the most recent two vectors. Some care
> must be taken in the "orthogonalizeAgainstBasis()" method call, which uses
> the entire basis. This step would be slower as a result.
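
To make the figures in the issue description concrete, here is a minimal sketch in Python. It is not Mahout's Java code: `basis_memory_bytes`, `DiskBasis`, and `orthogonalize_against_basis` are hypothetical names used only for illustration, and the orthogonalization shown is plain modified Gram-Schmidt under the assumption that the stored basis vectors have unit norm. It shows the memory formula from the description and one way a basis kept on disk could be streamed back one vector at a time.

```python
import os
import tempfile
from array import array


def basis_memory_bytes(desired_rank, num_columns, bytes_per_entry=8):
    """Memory needed to hold the full basis, per the formula in the issue."""
    return desired_rank * num_columns * bytes_per_entry


class DiskBasis:
    """Hypothetical store: one file of float64 values per basis vector."""

    def __init__(self, directory):
        self.directory = directory
        self.count = 0

    def _path(self, index):
        return os.path.join(self.directory, "basis-%06d.bin" % index)

    def append(self, vector):
        # Write the vector to its own file so it can be dropped from memory.
        with open(self._path(self.count), "wb") as f:
            array("d", vector).tofile(f)
        self.count += 1

    def load(self, index, length):
        # Read a single basis vector back from disk.
        vector = array("d")
        with open(self._path(index), "rb") as f:
            vector.fromfile(f, length)
        return vector


def orthogonalize_against_basis(vector, basis):
    """Modified Gram-Schmidt against every stored vector, one at a time.

    Assumes each stored basis vector has unit norm. Only one basis vector
    is resident in memory at any moment, at the cost of one disk read each.
    """
    n = len(vector)
    for i in range(basis.count):
        b = basis.load(i, n)
        dot = sum(x * y for x, y in zip(vector, b))
        for j in range(n):
            vector[j] -= dot * b[j]
    return vector
```

For the case in the comment (desiredRank ~ 100, numColumnsOfInput ~ 10M), `basis_memory_bytes(100, 10_000_000)` gives 8,000,000,000 bytes, about 8 GB on the driver. Keeping only two vectors resident would need `basis_memory_bytes(2, 10_000_000)`, or 160 MB. The trade-off is the one the description names: every orthogonalization pass rereads the whole basis from disk.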