To make things worse, the script bin/start-daemon.sh emits two heap-size settings instead of one:

    java blah blah blah -Xmx500m -Xmx1000m blah blah blah

where both numbers come from "somewhere". Shell environment variables are a lousy data store.
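For what it's worth, the duplicate flags are mostly a readability and trust problem: on HotSpot the last -Xmx on the command line normally wins, though that is JVM-specific. A tiny check (HeapCheck is a hypothetical class, not part of Mahout or Hadoop) makes the effective limit visible:

// Run as: java -Xmx500m -Xmx1000m HeapCheck
// On HotSpot the later flag usually takes effect, so this reports roughly 1000m.
public class HeapCheck {
    public static void main(String[] args) {
        long maxMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("effective max heap: " + maxMb + " MB");
    }
}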
On Fri, Apr 8, 2011 at 6:57 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> I don't think it is going to remedy his condition. He is having OOM in the
> driver, and hadoop-env.sh controls heap for the tasktracker and such (not
> even child task memory). He needs more memory in the frontend, which is
> indeed the bottleneck for that right now.
>
> apologies for brevity.
>
> Sent from my android.
> -Dmitriy
>
> On Apr 8, 2011 4:06 AM, "Danny Bickson" <danny.bick...@gmail.com> wrote:
>> Now try to increase the heap size in the file conf/hadoop-env.sh.
>> For example:
>>
>> HADOOP_HEAPSIZE=4000
>>
>> - Danny Bickson
>>
>> On Thu, Apr 7, 2011 at 10:32 PM, Wei Li <wei.le...@gmail.com> wrote:
>>
>>> Hi Danny and All:
>>>
>>> I have increased the JVM memory, the mapred.child.java.opts, but it
>>> still fails after 2 or 3 passes through the corpus.
>>>
>>> The matrix dimension is about 600,000 * 600,000, and the error is as
>>> follows:
>>>
>>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>>   at org.apache.mahout.math.map.OpenIntDoubleHashMap.rehash(OpenIntDoubleHashMap.java:434)
>>>   at org.apache.mahout.math.map.OpenIntDoubleHashMap.put(OpenIntDoubleHashMap.java:387)
>>>   at org.apache.mahout.math.RandomAccessSparseVector.setQuick(RandomAccessSparseVector.java:134)
>>>   at org.apache.mahout.math.RandomAccessSparseVector.assign(RandomAccessSparseVector.java:106)
>>>   at org.apache.mahout.math.SparseRowMatrix.assignRow(SparseRowMatrix.java:148)
>>>   at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:134)
>>>   at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.run(DistributedLanczosSolver.java:177)
>>>   at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.run(DistributedLanczosSolver.java:110)
>>>   at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver$DistributedLanczosSolverJob.run(DistributedLanczosSolver.java:253)
>>>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>>   at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.main(DistributedLanczosSolver.java:259)
>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>>   at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>>
>>> Best
>>> Wei
>>>
>>> On Thu, Apr 7, 2011 at 7:59 AM, Wei Li <wei.le...@gmail.com> wrote:
>>>
>>>> Hi All:
>>>>
>>>> Sorry for the misunderstanding, the dimension is about 600,000 * 600,000 :)
>>>>
>>>> Best
>>>> Wei
>>>>
>>>> On Wed, Apr 6, 2011 at 6:53 PM, Danny Bickson <danny.bick...@gmail.com> wrote:
>>>>
>>>>> Hi.
>>>>> Do you mean 60 million by 60 million? I guess this may be potentially
>>>>> rather big for Mahout.
>>>>> Another option you have is to try GraphLab: see
>>>>> http://bickson.blogspot.com/2011/04/yahoo-kdd-cup-using-graphlab.html
>>>>> I will be happy to give you support in case you would like to try
>>>>> GraphLab.
>>>>>
>>>>> Best,
>>>>>
>>>>> DB
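The stack trace above bottoms out in LanczosSolver.solve storing basis rows via SparseRowMatrix.assignRow, i.e. the Lanczos basis is accumulated in the frontend JVM, which matches Dmitriy's point that the driver heap, not the task heap, is what runs out. A minimal back-of-envelope sketch of how big that basis gets (the dimension comes from this thread; the rank of 200 is a made-up placeholder, and treating the basis vectors as dense doubles is only a lower-bound style estimate since the sparse-map structures carry extra overhead per entry):

// Rough estimate of driver-side memory for the Lanczos basis alone.
public class LanczosHeapEstimate {
    public static void main(String[] args) {
        long n = 600000L;   // matrix dimension from this thread
        long rank = 200L;   // hypothetical requested rank
        long bytes = n * rank * 8L;   // 8 bytes per double, ignoring object overhead
        System.out.println("basis alone: ~" + (bytes >> 20) + " MB");
        // prints roughly 915 MB -- already past a modest default client heap,
        // which is why the -Xmx has to reach the JVM that runs the solver's main()
    }
}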
>>>>>
>>>>> On Wed, Apr 6, 2011 at 2:13 AM, Wei Li <wei.le...@gmail.com> wrote:
>>>>>
>>>>>> Hi Danny:
>>>>>>
>>>>>> I have transformed the csv data into the DistributedRowMatrix
>>>>>> format, but it still failed due to the memory problem after 2 or 3
>>>>>> iterations.
>>>>>>
>>>>>> My matrix dimension is about 60w * 60w (600,000 * 600,000); is it
>>>>>> possible to do the SVD decomposition at this scale using Mahout?
>>>>>>
>>>>>> Best
>>>>>> Wei
>>>>>>
>>>>>> On Sat, Mar 26, 2011 at 1:43 AM, Danny Bickson <danny.bick...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Wei,
>>>>>>> You must verify that you use a SPARSE matrix and not a dense one, or
>>>>>>> else you will surely get out of memory.
>>>>>>> Take a look at this example on how to prepare the input:
>>>>>>> http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Danny Bickson
>>>>>>>
>>>>>>> On Fri, Mar 25, 2011 at 1:33 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Wei,
>>>>>>>>
>>>>>>>> 1) I think DenseMatrix is a RAM-only representation. Naturally, you
>>>>>>>> get OOM because it all has to fit in memory. If you want to run a
>>>>>>>> RAM-only SVD computation, you perhaps don't need Mahout. If you want
>>>>>>>> to run distributed SVD computations, you need to prepare your data in
>>>>>>>> what is called DistributedRowMatrix format. This is a sequence file
>>>>>>>> with keys being whatever key you need to identify your rows, and
>>>>>>>> values being VectorWritable wrapping either of the vector
>>>>>>>> implementations found in Mahout (dense, sparse sequential, sparse random).
>>>>>>>> 2) Once you've prepared your data in DRM format, you can run either of
>>>>>>>> the SVD algorithms found in Mahout. It can be the Lanczos solver
>>>>>>>> ('mahout svd ...') or, on the trunk, you can also find a stochastic
>>>>>>>> SVD method ('mahout ssvd ...'), which is issue MAHOUT-593 I mentioned earlier.
>>>>>>>>
>>>>>>>> Either way, I am not sure why you want DenseMatrix unless you want to
>>>>>>>> use the RAM-only Colt SVD solver -- but you certainly don't have to
>>>>>>>> focus on the Mahout implementation of one if you just want a RAM solver.
>>>>>>>>
>>>>>>>> -d
>>>>>>>>
>>>>>>>> On Fri, Mar 25, 2011 at 3:25 AM, Wei Li <wei.le...@gmail.com> wrote:
>>>>>>>> >
>>>>>>>> > Actually, I would like to perform spectral clustering on a large-scale
>>>>>>>> > sparse matrix, but it failed due to the OutOfMemory error when creating
>>>>>>>> > the DenseMatrix for the SVD decomposition.
>>>>>>>> >
>>>>>>>> > Best
>>>>>>>> > Wei
>>>>>>>> >
>>>>>>>> > On Fri, Mar 25, 2011 at 4:05 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>>>>>>> >>
>>>>>>>> >> SSVD != Lanczos. If you do PCA or LSI it is perhaps what you need; it
>>>>>>>> >> can take on these things. Well, at least some of my branches can, if
>>>>>>>> >> not the official patch.
>>>>>>>> >>
>>>>>>>> >> -d
>>>>>>>> >>
>>>>>>>> >> On Thu, Mar 24, 2011 at 11:09 PM, Wei Li <wei.le...@gmail.com> wrote:
>>>>>>>> >> >
>>>>>>>> >> > thanks for your reply
>>>>>>>> >> >
>>>>>>>> >> > my matrix is not very dense, a sparse matrix.
>>>>>>>> >> >
>>>>>>>> >> > I have tried the svd of Mahout, but failed due to the OutOfMemory
>>>>>>>> >> > error.
>>>>>>>> >> >
>>>>>>>> >> > Best
>>>>>>>> >> > Wei
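To make Dmitriy's description of the DistributedRowMatrix input concrete: it is a sequence file of (row key, VectorWritable) pairs. Below is a minimal sketch of writing one sparse row; the output path, the integer row-key scheme, and the single example cell are placeholders, while the classes are the ones already named in this thread.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class WriteDrm {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("drm/matrix.seq");           // hypothetical output path
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, IntWritable.class, VectorWritable.class);
    try {
      int n = 600000;                                // matrix dimension
      // one sparse row; set only the non-zero cells you actually have
      Vector row = new RandomAccessSparseVector(n);
      row.setQuick(42, 3.14);                        // placeholder cell
      writer.append(new IntWritable(0), new VectorWritable(row));
      // ... append the remaining rows the same way
    } finally {
      writer.close();
    }
  }
}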
>>>>>>>> >> >
>>>>>>>> >> > On Fri, Mar 25, 2011 at 2:03 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>>>>>>> >> >>
>>>>>>>> >> >> You can certainly try to write it out into a DRM (distributed row
>>>>>>>> >> >> matrix) and run stochastic SVD on hadoop (off the trunk now); see
>>>>>>>> >> >> MAHOUT-593. This is suitable if you have a good decay of singular
>>>>>>>> >> >> values (but if you don't, it probably just means you have so much
>>>>>>>> >> >> noise that it masks the problem you are trying to solve in your data).
>>>>>>>> >> >>
>>>>>>>> >> >> The currently committed solution is not the most efficient yet, but
>>>>>>>> >> >> it should be quite capable.
>>>>>>>> >> >>
>>>>>>>> >> >> If you do, let me know how it went.
>>>>>>>> >> >>
>>>>>>>> >> >> thanks.
>>>>>>>> >> >> -d
>>>>>>>> >> >>
>>>>>>>> >> >> On Thu, Mar 24, 2011 at 10:59 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>>>>>>> >> >> > Are you sure your matrix is dense?
>>>>>>>> >> >> >
>>>>>>>> >> >> > On Thu, Mar 24, 2011 at 9:59 PM, Wei Li <wei.le...@gmail.com> wrote:
>>>>>>>> >> >> >> Hi All:
>>>>>>>> >> >> >>
>>>>>>>> >> >> >> Is it possible to compute the SVD factorization for a 600,000 *
>>>>>>>> >> >> >> 600,000 matrix using Mahout?
>>>>>>>> >> >> >>
>>>>>>>> >> >> >> I have got the OutOfMemory error when creating the DenseMatrix.
>>>>>>>> >> >> >>
>>>>>>>> >> >> >> Best
>>>>>>>> >> >> >> Wei
>>>>>>>> >> >
>>>>>>>> >
>>>>>>
>>>>
>>

--
Lance Norskog
goks...@gmail.com
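For completeness, the arithmetic behind the original OutOfMemoryError: a fully dense in-memory 600,000 * 600,000 matrix cannot fit in any heap, which is why the advice in this thread converges on a sparse DistributedRowMatrix plus a distributed solver rather than a bigger -Xmx. A quick sanity check (cell values only, ignoring all object overhead):

// Size of a dense 600,000 x 600,000 matrix of doubles.
public class DenseSizeCheck {
    public static void main(String[] args) {
        long n = 600000L;
        long bytes = n * n * 8L;                       // 2.88e12 bytes
        System.out.println(bytes / (1L << 30) + " GiB just for the cells");
        // ~2682 GiB (about 2.6 TiB) -- far beyond what any heap setting can buy,
        // so a dense representation is ruled out before tuning even starts.
    }
}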