[ 
https://issues.apache.org/jira/browse/MAHOUT-308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851843#action_12851843
 ] 

Danny Leshem edited comment on MAHOUT-308 at 3/31/10 12:39 PM:
---------------------------------------------------------------

Currently a blocker for me, with numColumnsOfInput ~ 10M and desiredRank ~ 100.
Will be more than happy to review and test any submitted patch!
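
To put numbers on this, here is a rough back-of-the-envelope sketch in Python that applies the formula from the issue description quoted below, (desiredRank) * (numColumnsOfInput) * 8 bytes, to the sizes mentioned above. The function name is made up for illustration and is not part of Mahout; it assumes dense vectors of 8-byte doubles.

```python
def lanczos_basis_bytes(desired_rank, num_columns, bytes_per_entry=8):
    """Driver-side memory for a dense Lanczos basis held fully in memory.

    Hypothetical helper (not a Mahout API); assumes every basis vector is a
    dense array of 8-byte doubles, as in the formula in the issue description.
    """
    return desired_rank * num_columns * bytes_per_entry


# Sizes from the comment above: ~10M columns, desired rank ~100.
full_basis_bytes = lanczos_basis_bytes(100, 10_000_000)
per_vector_bytes = lanczos_basis_bytes(1, 10_000_000)

print(f"full basis: {full_basis_bytes / 1e9:.1f} GB")   # prints "full basis: 8.0 GB"
print(f"one vector: {per_vector_bytes / 1e6:.0f} MB")   # prints "one vector: 80 MB"
```

So under those assumptions the basis alone needs roughly 8 GB on the driver, while a single vector is only about 80 MB.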

Probably also a blocker for [Mahout's record-breaking attempt|http://mail-archives.apache.org/mod_mbox/lucene-mahout-dev/201002.mbox/<c7d45fc71002251150n3a8a70e4m3a7ca6e87df09...@mail.gmail.com>],
if you ever manage to find a big enough dataset :)

      was (Author: dleshem):
    Currently a blocker for me, with numColumnsOfInput ~ 10M and desiredRank ~ 100.
Will be more than happy to review and test any submitted patch!

Probably also a blocker for Mahout's record-breaking attempt, if you ever 
manage to find a big enough dataset :)
http://mail-archives.apache.org/mod_mbox/lucene-mahout-dev/201002.mbox/<c7d45fc71002251150n3a8a70e4m3a7ca6e87df09...@mail.gmail.com>
  
> Improve Lanczos to handle extremely large feature sets (without hashing)
> ------------------------------------------------------------------------
>
>                 Key: MAHOUT-308
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-308
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.3
>         Environment: all
>            Reporter: Jake Mannix
>            Assignee: Jake Mannix
>             Fix For: 0.4
>
>
> DistributedLanczosSolver currently keeps all Lanczos vectors in memory on the 
> driver (client) computer while Hadoop is iterating.  The memory requirement 
> for this is (desiredRank) * (numColumnsOfInput) * 8 bytes, which for 
> desiredRank = a few hundred starts to cap out usefulness at 
> some-small-number * millions of columns on most commodity hardware.
> The solution (without doing stochastic decomposition) is to persist the 
> Lanczos basis to disk, except for the two most recent vectors.  Some care 
> must be taken in the "orthogonalizeAgainstBasis()" method call, which uses 
> the entire basis.  That step would be slower, since the persisted vectors 
> must be read back from disk.
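
As a rough illustration of the approach described in the issue, the Python sketch below keeps each basis vector in its own file and orthogonalizes a new vector by loading the stored vectors one at a time. It is only a sketch under stated assumptions, not the Mahout implementation: the class `DiskBackedBasis` and the function `orthogonalize_against_basis` are hypothetical names, the files live on the local filesystem rather than HDFS, the stored vectors are assumed to be unit length, and plain modified Gram-Schmidt stands in for whatever the real orthogonalizeAgainstBasis() does.

```python
import os
import tempfile
from array import array


class DiskBackedBasis:
    """Hypothetical store that writes one file of float64 values per vector."""

    def __init__(self, directory):
        self.directory = directory
        self.count = 0

    def _path(self, index):
        return os.path.join(self.directory, f"basis-{index}.bin")

    def append(self, vector):
        # Persist the vector so it no longer has to stay in memory.
        with open(self._path(self.count), "wb") as f:
            array("d", vector).tofile(f)
        self.count += 1

    def load(self, index, size):
        vec = array("d")
        with open(self._path(index), "rb") as f:
            vec.fromfile(f, size)
        return vec


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthogonalize_against_basis(vector, basis):
    """Modified Gram-Schmidt, holding only one stored vector in memory at a time."""
    result = array("d", vector)
    for i in range(basis.count):
        b = basis.load(i, len(result))   # disk read: the slow part
        coeff = dot(result, b)           # assumes b has unit length
        for j in range(len(result)):
            result[j] -= coeff * b[j]
    return result


with tempfile.TemporaryDirectory() as tmp:
    basis = DiskBackedBasis(tmp)
    basis.append([1.0, 0.0, 0.0])
    basis.append([0.0, 1.0, 0.0])
    residual = orthogonalize_against_basis([3.0, 4.0, 5.0], basis)
    print(list(residual))  # prints [0.0, 0.0, 5.0]
```

The trade-off is the one the description names: peak memory drops to a couple of vectors, but every orthogonalization pass rereads the whole basis from disk.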

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
