Improve Lanczos to handle extremely large feature sets (without hashing)
------------------------------------------------------------------------

                 Key: MAHOUT-308
                 URL: https://issues.apache.org/jira/browse/MAHOUT-308
             Project: Mahout
          Issue Type: Improvement
          Components: Math
    Affects Versions: 0.3
         Environment: all
            Reporter: Jake Mannix
            Assignee: Jake Mannix
             Fix For: 0.4


DistributedLanczosSolver currently keeps all Lanczos vectors in memory on the 
driver (client) machine while Hadoop is iterating.  The memory requirement for 
this is (desiredRank) * (numColumnsOfInput) * 8 bytes, so for a desiredRank of 
a few hundred, usefulness caps out at a few million columns on most commodity 
hardware.
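
As a rough illustration (the rank and column count here are made up, chosen 
only to show the scale): desiredRank = 300 with 5 million columns already means 
300 * 5,000,000 * 8 bytes = 12 GB of basis vectors held on the driver, well 
beyond a typical commodity heap.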

The solution (without doing stochastic decomposition) is to persist the Lanczos 
basis to disk, keeping only the two most recent vectors in memory.  Some care 
must be taken with the orthogonalizeAgainstBasis() method call, which uses the 
entire basis; streaming the basis from disk makes that step slower.  A rough 
sketch of the idea follows.
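
Below is a minimal sketch of what a disk-backed basis might look like.  The 
class DiskBackedLanczosBasis and its methods are hypothetical and not part of 
any proposed patch; only Vector, VectorWritable, and the Hadoop SequenceFile 
API are real.  Each basis vector is written out as it is produced, and 
orthogonalization streams the stored vectors back instead of holding them all 
on the heap.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class DiskBackedLanczosBasis {

  private final Configuration conf;
  private final FileSystem fs;
  private final Path basisPath;
  private int size = 0;

  public DiskBackedLanczosBasis(Configuration conf, Path basisPath) throws IOException {
    this.conf = conf;
    this.fs = FileSystem.get(conf);
    this.basisPath = basisPath;
  }

  /** Append one basis vector to disk instead of keeping it on the driver heap. */
  public void append(Vector v) throws IOException {
    Path part = new Path(basisPath, "basis-" + size);
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, part,
        IntWritable.class, VectorWritable.class);
    try {
      writer.append(new IntWritable(size), new VectorWritable(v));
    } finally {
      writer.close();
    }
    size++;
  }

  /**
   * Subtract from the candidate vector its projection onto every stored basis
   * vector.  This stands in for an in-memory orthogonalizeAgainstBasis() and
   * is slower because it re-reads the whole basis from disk on each call.
   */
  public void orthogonalize(Vector candidate) throws IOException {
    IntWritable key = new IntWritable();
    VectorWritable value = new VectorWritable();
    for (int i = 0; i < size; i++) {
      Path part = new Path(basisPath, "basis-" + i);
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
      try {
        while (reader.next(key, value)) {
          Vector basisVector = value.get();
          double alpha = candidate.dot(basisVector);
          // candidate -= alpha * basisVector
          candidate.assign(candidate.minus(basisVector.times(alpha)));
        }
      } finally {
        reader.close();
      }
    }
  }
}

Writing one small file per vector avoids needing SequenceFile append support; 
the trade-off is an extra pass over HDFS per Lanczos iteration, which is the 
slowdown noted above.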

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
