RE: Memory problems with KMeans

Palleti, Pallavi Thu, 13 Nov 2008 19:47:39 -0800

Hi Philippe,

 The problem is that the key (the cluster centroid) is very big. I
faced similar issues in Fuzzy KMeans and fixed them. More details can be
found at https://issues.apache.org/jira/browse/MAHOUT-79.
I have made the same changes locally for K-Means and should be able to
add a patch to Mahout by tomorrow.
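
To give a rough feel for why a big key hurts, here is a minimal sketch
(not the MAHOUT-79 patch itself) that compares the bytes needed to write a
dense centroid as the key with the bytes needed for a short integer
identifier. The class name KeySizeSketch, its methods, and the 10,000
dimensions are assumptions made up for this illustration; only the 1702
record count comes from the job log quoted below.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration only; none of these names exist in Mahout.
class KeySizeSketch {

    // Bytes needed to write a dense centroid as a length plus raw doubles.
    static int centroidKeyBytes(double[] centroid) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(centroid.length);
            for (double v : centroid) {
                out.writeDouble(v);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size();
    }

    // Bytes needed to write a short cluster identifier instead.
    static int idKeyBytes(int clusterId) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(clusterId);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size();
    }

    public static void main(String[] args) {
        int dimensions = 10_000; // assumed vector size, not taken from the thread
        int points = 1702;       // "Map input records" in the quoted job log
        long asCentroid = (long) points * centroidKeyBytes(new double[dimensions]);
        long asId = (long) points * idKeyBytes(7);
        System.out.println("centroid as key: " + asCentroid + " bytes");
        System.out.println("id as key:       " + asId + " bytes");
    }
}
```

With one key emitted per input point, the key size is multiplied by the
number of points, so shrinking the key is where the saving would come from
under these assumptions.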


Thanks
Pallavi

-----Original Message-----
From: Philippe Lamarche [mailto:[EMAIL PROTECTED] 
Sent: Friday, November 14, 2008 4:05 AM
To: [email protected]
Subject: Re: Memory problems with KMeans

I can run the default synthetic control k-means; it runs without
problems.

I can also run my "personalised" version with a dataset of 57 KB.

My bigger dataset, which gives me the memory problem, is 118 MB. Could
this be the problem? I was hoping that it would take forever to run, but
would still be able to crunch through.





On Thu, Nov 13, 2008 at 4:58 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

> Hmm, my guess is that you are actually giving the VM too much memory, such
> that you are constantly in swap, which is forcing the GC to take too long,
> which is what that exception is, I believe.
>
> I have a 2gb laptop and was able to run the Synthetic control without a
> problem, although it looks like you are doing things a little bit
> differently by using a diff. distance measure.  Can you run the "default"
> synthetic control K-means?
>
>
>
>
> On Nov 13, 2008, at 3:41 PM, Philippe Lamarche wrote:
>
>  2gig
>>
>>
>> On Thu, Nov 13, 2008 at 2:57 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>>
>>  How much memory does your laptop have?
>>>
>>>
>>> On Nov 13, 2008, at 11:53 AM, Philippe Lamarche wrote:
>>>
>>> Hi,
>>>
>>>>
>>>> I am using KMeans to do some text clustering and I am running into
>>>> memory problems. So far, I have only tried it on a laptop in pseudo
>>>> distributed master/slave mode.
>>>>
>>>> This is on Hadoop branch-0.19. The "texttovector.jar" contains a hacked
>>>> version of the syntheticcontrol KMeans example, the only difference is in
>>>> the first input phase.
>>>>
>>>> Is this memory error "normal"? I am running with
>>>> export HADOOP_OPTS="-server -XX:+UseParallelGC -XX:ParallelGCThreads=4 -XX:NewSize=1G -XX:MaxNewSize=1G -XX:-UseGCOverheadLimit"
>>>>
>>>> In my understanding, the "-XX:-UseGCOverheadLimit" should remove the
>>>> GCOverhead "feature".
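
One thing that may be worth checking here: as far as I know, in Hadoop of
this vintage HADOOP_OPTS is read by the bin/hadoop launcher, while the map
and reduce task JVMs take their options from the mapred.child.java.opts
property (default -Xmx200m), so the flags above may never reach the JVM
that throws the error. A minimal sketch, assuming a hypothetical helper
class JvmFlagCheck (not part of Hadoop or Mahout) whose logic is called
from inside a task, would print what that JVM actually received:

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper for illustration; not part of Hadoop or Mahout.
class JvmFlagCheck {

    // True if the given flag is among the arguments the JVM was started with.
    static boolean hasFlag(List<String> jvmArgs, String flag) {
        return jvmArgs.contains(flag);
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM this code is running in.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        // Largest heap this JVM will attempt to use, in megabytes.
        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("JVM arguments: " + jvmArgs);
        System.out.println("Max heap (MB): " + maxHeapMb);
        System.out.println("GC overhead limit disabled: "
                + hasFlag(jvmArgs, "-XX:-UseGCOverheadLimit"));
    }
}
```

If the output from inside a map task does not list the expected flags, the
options would need to go into mapred.child.java.opts instead.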
>>>>
>>>> Any ideas?
>>>>
>>>>
>>>>
>>>>
>>>> [EMAIL PROTECTED]:/usr/local/hadoop$ bin/hadoop jar /home/philippe/workspace/MTI830/dist/texttovector.jar org.apache.mahout.clustering.text.kmeans.Job testallmti/vectors/part* testallclusteroutput1 org.apache.mahout.utils.TanimotoDistanceMeasure 1.001 .001 .000005 10
>>>> 08/11/13 11:37:23 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
>>>> 08/11/13 11:37:23 INFO mapred.FileInputFormat: Total input paths to process : 1
>>>> 08/11/13 11:37:23 INFO mapred.JobClient: Running job: job_200811131133_0007
>>>> 08/11/13 11:37:24 INFO mapred.JobClient:  map 0% reduce 0%
>>>> 08/11/13 11:37:37 INFO mapred.JobClient:  map 31% reduce 0%
>>>> 08/11/13 11:37:42 INFO mapred.JobClient:  map 63% reduce 0%
>>>> 08/11/13 11:37:45 INFO mapred.JobClient:  map 83% reduce 0%
>>>> 08/11/13 11:37:50 INFO mapred.JobClient:  map 100% reduce 0%
>>>> 08/11/13 11:37:51 INFO mapred.JobClient: Job complete: job_200811131133_0007
>>>> 08/11/13 11:37:51 INFO mapred.JobClient: Counters: 7
>>>> 08/11/13 11:37:51 INFO mapred.JobClient:   File Systems
>>>> 08/11/13 11:37:51 INFO mapred.JobClient:     HDFS bytes read=118875664
>>>> 08/11/13 11:37:51 INFO mapred.JobClient:     HDFS bytes written=146866785
>>>> 08/11/13 11:37:51 INFO mapred.JobClient:   Job Counters
>>>> 08/11/13 11:37:51 INFO mapred.JobClient:     Launched map tasks=2
>>>> 08/11/13 11:37:51 INFO mapred.JobClient:     Data-local map tasks=2
>>>> 08/11/13 11:37:51 INFO mapred.JobClient:   Map-Reduce Framework
>>>> 08/11/13 11:37:51 INFO mapred.JobClient:     Map input records=1702
>>>> 08/11/13 11:37:51 INFO mapred.JobClient:     Map input bytes=118836254
>>>> 08/11/13 11:37:51 INFO mapred.JobClient:     Map output records=1702
>>>> 08/11/13 11:37:51 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
>>>> 08/11/13 11:37:51 INFO mapred.FileInputFormat: Total input paths to process : 2
>>>> 08/11/13 11:37:51 INFO mapred.JobClient: Running job: job_200811131133_0008
>>>> 08/11/13 11:37:52 INFO mapred.JobClient:  map 0% reduce 0%
>>>> 08/11/13 11:38:07 INFO mapred.JobClient:  map 4% reduce 0%
>>>> 08/11/13 11:38:12 INFO mapred.JobClient:  map 9% reduce 0%
>>>> 08/11/13 11:38:17 INFO mapred.JobClient:  map 11% reduce 0%
>>>> 08/11/13 11:38:22 INFO mapred.JobClient:  map 13% reduce 0%
>>>> 08/11/13 11:38:27 INFO mapred.JobClient:  map 15% reduce 0%
>>>> 08/11/13 11:38:32 INFO mapred.JobClient:  map 16% reduce 0%
>>>> 08/11/13 11:38:37 INFO mapred.JobClient:  map 18% reduce 0%
>>>> 08/11/13 11:38:42 INFO mapred.JobClient:  map 19% reduce 0%
>>>> 08/11/13 11:38:47 INFO mapred.JobClient:  map 21% reduce 0%
>>>> 08/11/13 11:38:52 INFO mapred.JobClient:  map 22% reduce 0%
>>>> 08/11/13 11:38:57 INFO mapred.JobClient:  map 23% reduce 0%
>>>> 08/11/13 11:39:01 INFO mapred.JobClient:  map 24% reduce 0%
>>>> 08/11/13 11:39:06 INFO mapred.JobClient:  map 25% reduce 0%
>>>> 08/11/13 11:39:12 INFO mapred.JobClient:  map 26% reduce 0%
>>>> 08/11/13 11:39:17 INFO mapred.JobClient:  map 27% reduce 0%
>>>> 08/11/13 11:39:27 INFO mapred.JobClient:  map 28% reduce 0%
>>>> 08/11/13 11:39:37 INFO mapred.JobClient:  map 29% reduce 0%
>>>> 08/11/13 11:39:47 INFO mapred.JobClient:  map 30% reduce 0%
>>>> 08/11/13 11:39:57 INFO mapred.JobClient:  map 31% reduce 0%
>>>> 08/11/13 11:40:07 INFO mapred.JobClient:  map 32% reduce 0%
>>>> 08/11/13 11:40:17 INFO mapred.JobClient:  map 33% reduce 0%
>>>> 08/11/13 11:40:32 INFO mapred.JobClient:  map 34% reduce 0%
>>>> 08/11/13 11:40:42 INFO mapred.JobClient:  map 35% reduce 0%
>>>> 08/11/13 11:40:52 INFO mapred.JobClient:  map 36% reduce 0%
>>>> 08/11/13 11:41:07 INFO mapred.JobClient:  map 37% reduce 0%
>>>> 08/11/13 11:41:17 INFO mapred.JobClient:  map 38% reduce 0%
>>>> 08/11/13 11:41:33 INFO mapred.JobClient:  map 39% reduce 0%
>>>> 08/11/13 11:41:38 INFO mapred.JobClient:  map 40% reduce 0%
>>>> 08/11/13 11:41:53 INFO mapred.JobClient:  map 41% reduce 0%
>>>> 08/11/13 11:42:03 INFO mapred.JobClient:  map 42% reduce 0%
>>>> 08/11/13 11:42:17 INFO mapred.JobClient:  map 43% reduce 0%
>>>> 08/11/13 11:42:32 INFO mapred.JobClient:  map 44% reduce 0%
>>>> 08/11/13 11:42:42 INFO mapred.JobClient:  map 45% reduce 0%
>>>> 08/11/13 11:42:57 INFO mapred.JobClient:  map 46% reduce 0%
>>>> 08/11/13 11:43:13 INFO mapred.JobClient:  map 47% reduce 0%
>>>> 08/11/13 11:43:33 INFO mapred.JobClient:  map 48% reduce 0%
>>>> 08/11/13 11:43:48 INFO mapred.JobClient:  map 49% reduce 0%
>>>> 08/11/13 11:44:08 INFO mapred.JobClient:  map 50% reduce 0%
>>>> 08/11/13 11:44:28 INFO mapred.JobClient:  map 51% reduce 0%
>>>> 08/11/13 11:44:53 INFO mapred.JobClient:  map 52% reduce 0%
>>>> 08/11/13 11:45:23 INFO mapred.JobClient:  map 53% reduce 0%
>>>> 08/11/13 11:46:03 INFO mapred.JobClient:  map 54% reduce 0%
>>>> 08/11/13 11:46:10 INFO mapred.JobClient:  map 28% reduce 0%
>>>> 08/11/13 11:46:10 INFO mapred.JobClient: Task Id : attempt_200811131133_0008_m_000000_0, Status : FAILED
>>>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>>  at org.apache.mahout.matrix.DenseVector$Iterator.next(DenseVector.java:184)
>>>>  at org.apache.mahout.matrix.DenseVector$Iterator.next(DenseVector.java:172)
>>>>  at org.apache.mahout.utils.TanimotoDistanceMeasure.distance(TanimotoDistanceMeasure.java:73)
>>>>  at org.apache.mahout.clustering.canopy.Canopy.emitPointToNewCanopies(Canopy.java:181)
>>>>  at org.apache.mahout.clustering.canopy.CanopyMapper.map(CanopyMapper.java:42)
>>>>  at org.apache.mahout.clustering.canopy.CanopyMapper.map(CanopyMapper.java:34)
>>>>  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>>  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
>>>>  at org.apache.hadoop.mapred.Child.main(Child.java:155)
>>>>
>>>> [EMAIL PROTECTED]:/usr/local/hadoop$
>>>>
>>>>
>>> --------------------------
>>> Grant Ingersoll
>>>
>>> Lucene Helpful Hints:
>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>>> http://wiki.apache.org/lucene-java/LuceneFAQ
> --------------------------
> Grant Ingersoll
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
