Hi Philippe,

The problem is that the key (the cluster centroid) is very big. I faced similar issues in Fuzzy KMeans and fixed them there; more details can be found at https://issues.apache.org/jira/browse/MAHOUT-79. I have made the same changes locally for K-Means and should be able to submit a patch to Mahout by tomorrow.
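To give a rough feel for the sizes involved, here is a back-of-the-envelope sketch, not the actual patch. It takes the "Map input bytes" and "Map input records" counters from the job output quoted further down in this thread and compares the average serialized vector size with a short identifier key. The class name, the "C<id>" key format, and the assumption that a centroid serializes to about the same size as an input vector are made up for illustration; see the JIRA issue for the real change.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: estimates how many bytes a map output key costs per record
// when the whole centroid is the key versus a short cluster identifier.
class KeySizeSketch {
    // Counters copied from the quoted job output (job_200811131133_0007).
    static final long MAP_INPUT_BYTES = 118_836_254L;
    static final long MAP_INPUT_RECORDS = 1_702L;

    /** Average serialized size of one input vector, in bytes (integer division). */
    public static long averageVectorBytes() {
        return MAP_INPUT_BYTES / MAP_INPUT_RECORDS;
    }

    /** Size in bytes of a hypothetical short key such as "C42". */
    public static int idKeyBytes(int clusterId) {
        return ("C" + clusterId).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Assumption: a centroid used as a key is about as large as an input vector.
        long centroidKey = averageVectorBytes();
        int idKey = idKeyBytes(1701);
        System.out.println("centroid-as-key, bytes per emitted record: " + centroidKey);
        System.out.println("id-as-key, bytes per emitted record:       " + idKey);
        System.out.println("approximate ratio: " + (centroidKey / idKey));
    }
}
```

With these counters each vector averages roughly 70 KB, so every record emitted with a full centroid as its key carries that much extra data through serialization and sorting, whereas an identifier key costs a few bytes.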

Thanks
Pallavi

-----Original Message-----
From: Philippe Lamarche [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2008 4:05 AM
To: [email protected]
Subject: Re: Memory problems with KMeans

I can run the default synthetic control k-means it runs without problems. I can also run my "personalised" version with a dataset of 57kb. My bigger dataset, which gives me memory problem, is 118meg. Could this be the problem? I was hoping that it would take forever to run, but would still be able to crunch through.

On Thu, Nov 13, 2008 at 4:58 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

> Hmm, my guess is that you are actually giving the VM too much memory, such
> that you are constantly in swap, which is forcing the GC to take too long,
> which is what that exception is, I believe.
>
> I have a 2gb laptop and was able to run the Synthetic control without a
> problem, although it looks like you are doing things a little bit
> differently by using a diff. distance measure. Can you run the "default"
> synthetic control K-means?
>
> On Nov 13, 2008, at 3:41 PM, Philippe Lamarche wrote:
>
>> 2gig
>>
>> On Thu, Nov 13, 2008 at 2:57 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>>
>>> How much memory does your laptop have?
>>>
>>> On Nov 13, 2008, at 11:53 AM, Philippe Lamarche wrote:
>>>
>>>> Hi,
>>>>
>>>> I am using KMeans to do some text clustering and I get into memory
>>>> problems. As of now, I only tried it on a laptop in pseudo distributed
>>>> master/slave mode.
>>>>
>>>> This is on Hadoop branch-0.19. The "texttovector.jar" contains a hacked
>>>> version of the syntheticcontrol KMeans example, the only difference is
>>>> in the first input phase.
>>>>
>>>> Is this memory error "normal"? I am running with
>>>>
>>>> export HADOOP_OPTS="-server -XX:+UseParallelGC -XX:ParallelGCThreads=4 -XX:NewSize=1G -XX:MaxNewSize=1G -XX:-UseGCOverheadLimit"
>>>>
>>>> In my understanding, the "-XX:-UseGCOverheadLimit" should remove the
>>>> GCOverhead "feature".
>>>>
>>>> Any ideas?
>>>>
>>>> [EMAIL PROTECTED]:/usr/local/hadoop$ bin/hadoop jar
>>>> /home/philippe/workspace/MTI830/dist/texttovector.jar
>>>> org.apache.mahout.clustering.text.kmeans.Job testallmti/vectors/part*
>>>> testallclusteroutput1 org.apache.mahout.utils.TanimotoDistanceMeasure
>>>> 1.001 .001 .000005 10
>>>> 08/11/13 11:37:23 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
>>>> 08/11/13 11:37:23 INFO mapred.FileInputFormat: Total input paths to process : 1
>>>> 08/11/13 11:37:23 INFO mapred.JobClient: Running job: job_200811131133_0007
>>>> 08/11/13 11:37:24 INFO mapred.JobClient: map 0% reduce 0%
>>>> 08/11/13 11:37:37 INFO mapred.JobClient: map 31% reduce 0%
>>>> 08/11/13 11:37:42 INFO mapred.JobClient: map 63% reduce 0%
>>>> 08/11/13 11:37:45 INFO mapred.JobClient: map 83% reduce 0%
>>>> 08/11/13 11:37:50 INFO mapred.JobClient: map 100% reduce 0%
>>>> 08/11/13 11:37:51 INFO mapred.JobClient: Job complete: job_200811131133_0007
>>>> 08/11/13 11:37:51 INFO mapred.JobClient: Counters: 7
>>>> 08/11/13 11:37:51 INFO mapred.JobClient:   File Systems
>>>> 08/11/13 11:37:51 INFO mapred.JobClient:     HDFS bytes read=118875664
>>>> 08/11/13 11:37:51 INFO mapred.JobClient:     HDFS bytes written=146866785
>>>> 08/11/13 11:37:51 INFO mapred.JobClient:   Job Counters
>>>> 08/11/13 11:37:51 INFO mapred.JobClient:     Launched map tasks=2
>>>> 08/11/13 11:37:51 INFO mapred.JobClient:     Data-local map tasks=2
>>>> 08/11/13 11:37:51 INFO mapred.JobClient:   Map-Reduce Framework
>>>> 08/11/13 11:37:51 INFO mapred.JobClient:     Map input records=1702
>>>> 08/11/13 11:37:51 INFO mapred.JobClient:     Map input bytes=118836254
>>>> 08/11/13 11:37:51 INFO mapred.JobClient:     Map output records=1702
>>>> 08/11/13 11:37:51 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
>>>> 08/11/13 11:37:51 INFO mapred.FileInputFormat: Total input paths to process : 2
>>>> 08/11/13 11:37:51 INFO mapred.JobClient: Running job: job_200811131133_0008
>>>> 08/11/13 11:37:52 INFO mapred.JobClient: map 0% reduce 0%
>>>> 08/11/13 11:38:07 INFO mapred.JobClient: map 4% reduce 0%
>>>> 08/11/13 11:38:12 INFO mapred.JobClient: map 9% reduce 0%
>>>> 08/11/13 11:38:17 INFO mapred.JobClient: map 11% reduce 0%
>>>> 08/11/13 11:38:22 INFO mapred.JobClient: map 13% reduce 0%
>>>> 08/11/13 11:38:27 INFO mapred.JobClient: map 15% reduce 0%
>>>> 08/11/13 11:38:32 INFO mapred.JobClient: map 16% reduce 0%
>>>> 08/11/13 11:38:37 INFO mapred.JobClient: map 18% reduce 0%
>>>> 08/11/13 11:38:42 INFO mapred.JobClient: map 19% reduce 0%
>>>> 08/11/13 11:38:47 INFO mapred.JobClient: map 21% reduce 0%
>>>> 08/11/13 11:38:52 INFO mapred.JobClient: map 22% reduce 0%
>>>> 08/11/13 11:38:57 INFO mapred.JobClient: map 23% reduce 0%
>>>> 08/11/13 11:39:01 INFO mapred.JobClient: map 24% reduce 0%
>>>> 08/11/13 11:39:06 INFO mapred.JobClient: map 25% reduce 0%
>>>> 08/11/13 11:39:12 INFO mapred.JobClient: map 26% reduce 0%
>>>> 08/11/13 11:39:17 INFO mapred.JobClient: map 27% reduce 0%
>>>> 08/11/13 11:39:27 INFO mapred.JobClient: map 28% reduce 0%
>>>> 08/11/13 11:39:37 INFO mapred.JobClient: map 29% reduce 0%
>>>> 08/11/13 11:39:47 INFO mapred.JobClient: map 30% reduce 0%
>>>> 08/11/13 11:39:57 INFO mapred.JobClient: map 31% reduce 0%
>>>> 08/11/13 11:40:07 INFO mapred.JobClient: map 32% reduce 0%
>>>> 08/11/13 11:40:17 INFO mapred.JobClient: map 33% reduce 0%
>>>> 08/11/13 11:40:32 INFO mapred.JobClient: map 34% reduce 0%
>>>> 08/11/13 11:40:42 INFO mapred.JobClient: map 35% reduce 0%
>>>> 08/11/13 11:40:52 INFO mapred.JobClient: map 36% reduce 0%
>>>> 08/11/13 11:41:07 INFO mapred.JobClient: map 37% reduce 0%
>>>> 08/11/13 11:41:17 INFO mapred.JobClient: map 38% reduce 0%
>>>> 08/11/13 11:41:33 INFO mapred.JobClient: map 39% reduce 0%
>>>> 08/11/13 11:41:38 INFO mapred.JobClient: map 40% reduce 0%
>>>> 08/11/13 11:41:53 INFO mapred.JobClient: map 41% reduce 0%
>>>> 08/11/13 11:42:03 INFO mapred.JobClient: map 42% reduce 0%
>>>> 08/11/13 11:42:17 INFO mapred.JobClient: map 43% reduce 0%
>>>> 08/11/13 11:42:32 INFO mapred.JobClient: map 44% reduce 0%
>>>> 08/11/13 11:42:42 INFO mapred.JobClient: map 45% reduce 0%
>>>> 08/11/13 11:42:57 INFO mapred.JobClient: map 46% reduce 0%
>>>> 08/11/13 11:43:13 INFO mapred.JobClient: map 47% reduce 0%
>>>> 08/11/13 11:43:33 INFO mapred.JobClient: map 48% reduce 0%
>>>> 08/11/13 11:43:48 INFO mapred.JobClient: map 49% reduce 0%
>>>> 08/11/13 11:44:08 INFO mapred.JobClient: map 50% reduce 0%
>>>> 08/11/13 11:44:28 INFO mapred.JobClient: map 51% reduce 0%
>>>> 08/11/13 11:44:53 INFO mapred.JobClient: map 52% reduce 0%
>>>> 08/11/13 11:45:23 INFO mapred.JobClient: map 53% reduce 0%
>>>> 08/11/13 11:46:03 INFO mapred.JobClient: map 54% reduce 0%
>>>> 08/11/13 11:46:10 INFO mapred.JobClient: map 28% reduce 0%
>>>> 08/11/13 11:46:10 INFO mapred.JobClient: Task Id : attempt_200811131133_0008_m_000000_0, Status : FAILED
>>>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>>     at org.apache.mahout.matrix.DenseVector$Iterator.next(DenseVector.java:184)
>>>>     at org.apache.mahout.matrix.DenseVector$Iterator.next(DenseVector.java:172)
>>>>     at org.apache.mahout.utils.TanimotoDistanceMeasure.distance(TanimotoDistanceMeasure.java:73)
>>>>     at org.apache.mahout.clustering.canopy.Canopy.emitPointToNewCanopies(Canopy.java:181)
>>>>     at org.apache.mahout.clustering.canopy.CanopyMapper.map(CanopyMapper.java:42)
>>>>     at org.apache.mahout.clustering.canopy.CanopyMapper.map(CanopyMapper.java:34)
>>>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
>>>>     at org.apache.hadoop.mapred.Child.main(Child.java:155)
>>>>
>>>> [EMAIL PROTECTED]:/usr/local/hadoop$
>>>
>>> --------------------------
>>> Grant Ingersoll
>>>
>>> Lucene Helpful Hints:
>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>>> http://wiki.apache.org/lucene-java/LuceneFAQ
>
> --------------------------
> Grant Ingersoll
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
