[ https://issues.apache.org/jira/browse/MAHOUT-308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177583#comment-13177583 ]
Nathan Halko commented on MAHOUT-308:
-------------------------------------
Not sure if this is the right place for this, but I have been searching
elsewhere without success.
Setup: finding 150 singular values of a 1e6 x 1e6, super sparse matrix (2 GB
on disk).
I'm getting Java heap errors using the Lanczos SVD in 0.6-SNAPSHOT. As I read
the code, specifying --workingDir uses HdfsBackedLanczosState, which stores
each basis vector in HDFS (and I can see that they live there). When the
vectors are needed (orthogonalization and projecting the eigenvectors), they
seem to be read from disk one by one, with only a few dense vectors in memory
at any one time (the current vector, basis vector i, and an accumulation
vector). This should have very light memory requirements and hammer the
network; however, I'm not seeing that behavior. A rough sketch of the access
pattern I'd expect follows.
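This is my reading of it, not the actual HdfsBackedLanczosState code; the
class name, method name, and IntWritable key type here are assumptions:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;
    import org.apache.mahout.math.function.Functions;

    /**
     * Sketch of the expected streaming pattern: orthogonalize the current
     * Lanczos vector against basis vectors streamed one at a time from a
     * SequenceFile on HDFS, so only one basis vector is resident at once.
     */
    public class StreamingOrthogonalizeSketch {
      public static void orthogonalize(Configuration conf, Path basisPath,
          Vector current) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, basisPath, conf);
        try {
          IntWritable key = new IntWritable();       // assumed key type
          VectorWritable value = new VectorWritable();
          while (reader.next(key, value)) {          // one basis vector at a time
            Vector basisVector = value.get();
            double alpha = current.dot(basisVector); // projection coefficient
            // current -= alpha * basisVector, in place
            current.assign(basisVector.times(alpha), Functions.MINUS);
          }
        } finally {
          reader.close();
        }
      }
    }

With that pattern, resident memory is a couple of dense vectors: at 1e6
columns, each dense vector is 1e6 * 8 bytes = ~8 MB, nowhere near a
multi-GB heap.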
Is this a known issue, a memory leak, or something else? Is there something
behind the scenes that keeps these vectors in memory? I can't make sense of
the error below.
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.lang.Object.clone(Native Method)
    at org.apache.mahout.math.DenseVector.<init>(DenseVector.java:44)
    at org.apache.mahout.math.DenseVector.<init>(DenseVector.java:39)
    at org.apache.mahout.math.VectorWritable.readFields(VectorWritable.java:99)
    at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1813)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1945)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileValueIterator.computeNext(SequenceFileValueIterator.java:76)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileValueIterator.computeNext(SequenceFileValueIterator.java:35)
    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
    at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
    at com.google.common.collect.AbstractIterator.next(AbstractIterator.java:151)
    at org.apache.mahout.math.hadoop.TimesSquaredJob.retrieveTimesSquaredOutputVector(TimesSquaredJob.java:190)
    at org.apache.mahout.math.hadoop.DistributedRowMatrix.timesSquared(DistributedRowMatrix.java:238)
    at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:104)
    at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.run(DistributedLanczosSolver.java:200)
    at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.run(DistributedLanczosSolver.java:123)
    at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver$DistributedLanczosSolverJob.run(DistributedLanczosSolver.java:283)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.main(DistributedLanczosSolver.java:289)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> Improve Lanczos to handle extremely large feature sets (without hashing)
> ------------------------------------------------------------------------
>
> Key: MAHOUT-308
> URL: https://issues.apache.org/jira/browse/MAHOUT-308
> Project: Mahout
> Issue Type: Improvement
> Components: Math
> Affects Versions: 0.3
> Environment: all
> Reporter: Jake Mannix
> Assignee: Jake Mannix
> Fix For: 0.5
>
> Attachments: MAHOUT-308.patch
>
>
> DistributedLanczosSolver currently keeps all Lanczos vectors in memory on the
> driver (client) computer while Hadoop is iterating. The memory requirement for
> this is (desiredRank) * (numColumnsOfInput) * 8 bytes, which, for desiredRank =
> a few hundred, starts to cap out usefulness at a few million columns on most
> commodity hardware.
> The solution (without doing stochastic decomposition) is to persist the
> Lanczos basis to disk, except for the most recent two vectors. Some care
> must be taken with the "orthogonalizeAgainstBasis()" method call, which uses
> the entire basis; that part would be slower this way.
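>
> For concreteness, a back-of-envelope check of that formula (a hypothetical
> snippet, not code from the patch):
>
>     // desiredRank = "a few hundred", numColumnsOfInput = a few million
>     long desiredRank = 300L;
>     long numColumns = 5000000L;
>     long basisBytes = desiredRank * numColumns * 8L; // 8 bytes per double
>     // 300 * 5e6 * 8 = 1.2e10 bytes, i.e. ~12 GB of driver heap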