It also makes sense that the fuzzyk centroids would be completely dense, since every 
point is a member of every cluster. My reducer heaps are 4 GB.
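
For reference, the reducer heap here is just the standard Hadoop per-task child 
setting, mapred.child.java.opts in conf/mapred-site.xml; 4 GB corresponds to 
something like the following (the exact value and file path are illustrative):

    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx4096m</value>
    </property>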

-----Original Message-----
From: Jeff Eastman [mailto:jeast...@narus.com]
Sent: Monday, July 25, 2011 2:32 PM
To: user@mahout.apache.org; Jeffrey
Subject: RE: fkmeans or Cluster Dumper not working?

I'm able to run fuzzyk on your data set with k=10 and k=50 without problems. I 
also ran it fine with k=100 just to push it a bit harder. Runs took longer as k 
increased, as expected (39s, 2m50s, 5m57s), as did the clustering step (11s, 45s, 
1m11s). The cluster dumper is throwing an OutOfMemoryError (OME) with your data 
points, and probably would with the larger cluster volumes too; that suggests it 
needs a larger -Xmx value, since it runs locally and is not influenced by the 
cluster VM parameters.
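
If you want to give it more heap yourself, I believe bin/mahout honors a 
MAHOUT_HEAPSIZE environment variable (in MB) -- worth double-checking in your copy 
of the script -- so something along these lines should do it (pick a value your 
machine can spare):

    $ export MAHOUT_HEAPSIZE=2048
    $ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 \
        --pointsDir sensei/clusters/clusteredPoints --output image-tag-clusters.txt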

I will try some more and keep you updated.

-----Original Message-----
From: Jeffrey [mailto:mycyber...@yahoo.com]
Sent: Sunday, July 24, 2011 12:51 AM
To: user@mahout.apache.org
Subject: Re: fkmeans or Cluster Dumper not working?

Erm, is there any update? Is the problem reproducible?

Best wishes,
Jeffrey04



>________________________________
>From: Jeffrey <mycyber...@yahoo.com>
>To: Jeff Eastman <jeast...@narus.com>; "user@mahout.apache.org" 
><user@mahout.apache.org>
>Sent: Friday, July 22, 2011 12:40 AM
>Subject: Re: fkmeans or Cluster Dumper not working?
>
>
>Hi Jeff,
>
>
>lol, this is probably my last reply before I fall asleep (GMT+8 here).
>
>
>First thing first, data file is here: http://coolsilon.com/image-tag.mvc
>
>
>Q: What is the cardinality of your vector data?
>A: About 1,000+ rows (resources) × 14,000+ columns (tags).
>Q: Is it sparse or dense?
>A: Sparse (assuming sparse means each vector contains mostly zeros).
>Q: How many vectors are you trying to cluster?
>A: All of them? (1,000+ rows)
>Q: What is the exact error you see when fkmeans fails with k=10? With k=50?
>A: I think I already posted the exception for k=50, but I will post it again here.
>
>
>With k=10, fkmeans actually works, but the cluster dumper throws an exception; 
>however, if I take out --pointsDir, it works (the output looks OK, but without 
>the points):
>
>
>    $ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output 
> sensei/clusters --clusters sensei/clusters/clusters-0 --clustering 
> --overwrite --emitMostLikely false --numClusters 10 --maxIter 10 --m 5
>    ...
>    $ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 
> --pointsDir sensei/clusters/clusteredPoints --output image-tag-clusters.txt 
> Running on hadoop, using 
> HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>    HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>    MAHOUT-JOB: 
> /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>    11/07/22 00:14:50 INFO common.AbstractJob: Command line arguments: 
> {--dictionaryType=text, --endPhase=2147483647, 
> --output=image-tag-clusters.txt, --pointsDir=sensei/clusters/clusteredPoints, 
> --seqFileDir=sensei/clusters/clusters-1, --startPhase=0, --tempDir=temp}
>    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>            at java.lang.Object.clone(Native Method)
>            at org.apache.mahout.math.DenseVector.<init>(DenseVector.java:44)
>            at org.apache.mahout.math.DenseVector.<init>(DenseVector.java:39)
>            at 
> org.apache.mahout.math.VectorWritable.readFields(VectorWritable.java:94)
>            at 
> org.apache.mahout.clustering.WeightedVectorWritable.readFields(WeightedVectorWritable.java:55)
>            at 
> org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751)
>            at 
> org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1879)
>            at 
> org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:95)
>            at 
> org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:38)
>            at 
> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
>            at 
> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
>            at 
> com.google.common.collect.Iterators$5.hasNext(Iterators.java:525)
>            at 
> com.google.common.collect.ForwardingIterator.hasNext(ForwardingIterator.java:43)
>            at 
> org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:255)
>            at 
> org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:209)
>            at 
> org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:123)
>            at 
> org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:89)
>            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>            at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>            at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>            at java.lang.reflect.Method.invoke(Method.java:616)
>            at 
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>            at 
> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>            at 
> org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
>            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>            at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>            at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>            at java.lang.reflect.Method.invoke(Method.java:616)
>            at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>    $ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --output 
> image-tag-clusters.txt
>    Running on hadoop, using 
>    HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>    HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>    MAHOUT-JOB: 
> /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>    11/07/22 00:19:04 INFO common.AbstractJob: Command line arguments: 
> {--dictionaryType=text, --endPhase=2147483647, 
> --output=image-tag-clusters.txt, --seqFileDir=sensei/clusters/clusters-1, 
> --startPhase=0, --tempDir=temp}
>    11/07/22 00:19:13 INFO driver.MahoutDriver: Program took 9504 ms
>
>
>With k=50, fkmeans throws an exception after reaching map 100% reduce 0%, and then 
>retries (going back to map 0% reduce 0%) after the exception:
>
>
>    $ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output 
> sensei/clusters --clusters sensei/clusters/clusters-0 --clustering 
> --overwrite --emitMostLikely false --numClusters 50 --maxIter 10 --m 5
>    Running on hadoop, using 
> HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>    HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>    MAHOUT-JOB: 
> /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>    11/07/22 00:21:07 INFO common.AbstractJob: Command line arguments: 
> {--clustering=null, --clusters=sensei/clusters/clusters-0, 
> --convergenceDelta=0.5, 
> --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
>  --emitMostLikely=false, --endPhase=2147483647, 
> --input=sensei/image-tag.arff.mvc, --m=5, --maxIter=10, --method=mapreduce, 
> --numClusters=50, --output=sensei/clusters, --overwrite=null, --startPhase=0, 
> --tempDir=temp, --threshold=0}
>    11/07/22 00:21:09 INFO common.HadoopUtil: Deleting sensei/clusters
>    11/07/22 00:21:09 INFO util.NativeCodeLoader: Loaded the native-hadoop 
> library
>    11/07/22 00:21:09 INFO zlib.ZlibFactory: Successfully loaded & initialized 
> native-zlib library
>    11/07/22 00:21:09 INFO compress.CodecPool: Got brand-new compressor
>    11/07/22 00:21:10 INFO compress.CodecPool: Got brand-new decompressor
>    11/07/22 00:21:21 INFO kmeans.RandomSeedGenerator: Wrote 50 vectors to 
> sensei/clusters/clusters-0/part-randomSeed
>    11/07/22 00:21:24 INFO fuzzykmeans.FuzzyKMeansDriver: Fuzzy K-Means 
> Iteration 1
>    11/07/22 00:21:25 INFO input.FileInputFormat: Total input paths to process 
> : 1
>    11/07/22 00:21:26 INFO mapred.JobClient: Running job: job_201107211512_0029
>    11/07/22 00:21:27 INFO mapred.JobClient:  map 0% reduce 0%
>    11/07/22 00:22:08 INFO mapred.JobClient:  map 1% reduce 0%
>    11/07/22 00:22:20 INFO mapred.JobClient:  map 2% reduce 0%
>    11/07/22 00:22:33 INFO mapred.JobClient:  map 3% reduce 0%
>    11/07/22 00:22:42 INFO mapred.JobClient:  map 4% reduce 0%
>    11/07/22 00:22:50 INFO mapred.JobClient:  map 5% reduce 0%
>    11/07/22 00:23:00 INFO mapred.JobClient:  map 6% reduce 0%
>    11/07/22 00:23:09 INFO mapred.JobClient:  map 7% reduce 0%
>    11/07/22 00:23:18 INFO mapred.JobClient:  map 8% reduce 0%
>    11/07/22 00:23:27 INFO mapred.JobClient:  map 9% reduce 0%
>    11/07/22 00:23:33 INFO mapred.JobClient:  map 10% reduce 0%
>    11/07/22 00:23:42 INFO mapred.JobClient:  map 11% reduce 0%
>    11/07/22 00:23:45 INFO mapred.JobClient:  map 12% reduce 0%
>    11/07/22 00:23:54 INFO mapred.JobClient:  map 13% reduce 0%
>    11/07/22 00:24:03 INFO mapred.JobClient:  map 14% reduce 0%
>    11/07/22 00:24:09 INFO mapred.JobClient:  map 15% reduce 0%
>    11/07/22 00:24:15 INFO mapred.JobClient:  map 16% reduce 0%
>    11/07/22 00:24:24 INFO mapred.JobClient:  map 17% reduce 0%
>    11/07/22 00:24:30 INFO mapred.JobClient:  map 18% reduce 0%
>    11/07/22 00:24:42 INFO mapred.JobClient:  map 19% reduce 0%
>    11/07/22 00:24:51 INFO mapred.JobClient:  map 20% reduce 0%
>    11/07/22 00:24:57 INFO mapred.JobClient:  map 21% reduce 0%
>    11/07/22 00:25:06 INFO mapred.JobClient:  map 22% reduce 0%
>    11/07/22 00:25:09 INFO mapred.JobClient:  map 23% reduce 0%
>    11/07/22 00:25:19 INFO mapred.JobClient:  map 24% reduce 0%
>    11/07/22 00:25:25 INFO mapred.JobClient:  map 25% reduce 0%
>    11/07/22 00:25:31 INFO mapred.JobClient:  map 26% reduce 0%
>    11/07/22 00:25:37 INFO mapred.JobClient:  map 27% reduce 0%
>    11/07/22 00:25:43 INFO mapred.JobClient:  map 28% reduce 0%
>    11/07/22 00:25:51 INFO mapred.JobClient:  map 29% reduce 0%
>    11/07/22 00:25:58 INFO mapred.JobClient:  map 30% reduce 0%
>    11/07/22 00:26:04 INFO mapred.JobClient:  map 31% reduce 0%
>    11/07/22 00:26:10 INFO mapred.JobClient:  map 32% reduce 0%
>    11/07/22 00:26:19 INFO mapred.JobClient:  map 33% reduce 0%
>    11/07/22 00:26:25 INFO mapred.JobClient:  map 34% reduce 0%
>    11/07/22 00:26:34 INFO mapred.JobClient:  map 35% reduce 0%
>    11/07/22 00:26:40 INFO mapred.JobClient:  map 36% reduce 0%
>    11/07/22 00:26:49 INFO mapred.JobClient:  map 37% reduce 0%
>    11/07/22 00:26:55 INFO mapred.JobClient:  map 38% reduce 0%
>    11/07/22 00:27:04 INFO mapred.JobClient:  map 39% reduce 0%
>    11/07/22 00:27:14 INFO mapred.JobClient:  map 40% reduce 0%
>    11/07/22 00:27:23 INFO mapred.JobClient:  map 41% reduce 0%
>    11/07/22 00:27:28 INFO mapred.JobClient:  map 42% reduce 0%
>    11/07/22 00:27:34 INFO mapred.JobClient:  map 43% reduce 0%
>    11/07/22 00:27:40 INFO mapred.JobClient:  map 44% reduce 0%
>    11/07/22 00:27:49 INFO mapred.JobClient:  map 45% reduce 0%
>    11/07/22 00:27:56 INFO mapred.JobClient:  map 46% reduce 0%
>    11/07/22 00:28:05 INFO mapred.JobClient:  map 47% reduce 0%
>    11/07/22 00:28:11 INFO mapred.JobClient:  map 48% reduce 0%
>    11/07/22 00:28:20 INFO mapred.JobClient:  map 49% reduce 0%
>    11/07/22 00:28:26 INFO mapred.JobClient:  map 50% reduce 0%
>    11/07/22 00:28:35 INFO mapred.JobClient:  map 51% reduce 0%
>    11/07/22 00:28:41 INFO mapred.JobClient:  map 52% reduce 0%
>    11/07/22 00:28:47 INFO mapred.JobClient:  map 53% reduce 0%
>    11/07/22 00:28:53 INFO mapred.JobClient:  map 54% reduce 0%
>    11/07/22 00:29:02 INFO mapred.JobClient:  map 55% reduce 0%
>    11/07/22 00:29:08 INFO mapred.JobClient:  map 56% reduce 0%
>    11/07/22 00:29:17 INFO mapred.JobClient:  map 57% reduce 0%
>    11/07/22 00:29:26 INFO mapred.JobClient:  map 58% reduce 0%
>    11/07/22 00:29:32 INFO mapred.JobClient:  map 59% reduce 0%
>    11/07/22 00:29:41 INFO mapred.JobClient:  map 60% reduce 0%
>    11/07/22 00:29:50 INFO mapred.JobClient:  map 61% reduce 0%
>    11/07/22 00:29:53 INFO mapred.JobClient:  map 62% reduce 0%
>    11/07/22 00:29:59 INFO mapred.JobClient:  map 63% reduce 0%
>    11/07/22 00:30:09 INFO mapred.JobClient:  map 64% reduce 0%
>    11/07/22 00:30:15 INFO mapred.JobClient:  map 65% reduce 0%
>    11/07/22 00:30:23 INFO mapred.JobClient:  map 66% reduce 0%
>    11/07/22 00:30:35 INFO mapred.JobClient:  map 67% reduce 0%
>    11/07/22 00:30:41 INFO mapred.JobClient:  map 68% reduce 0%
>    11/07/22 00:30:50 INFO mapred.JobClient:  map 69% reduce 0%
>    11/07/22 00:30:56 INFO mapred.JobClient:  map 70% reduce 0%
>    11/07/22 00:31:05 INFO mapred.JobClient:  map 71% reduce 0%
>    11/07/22 00:31:15 INFO mapred.JobClient:  map 72% reduce 0%
>    11/07/22 00:31:24 INFO mapred.JobClient:  map 73% reduce 0%
>    11/07/22 00:31:30 INFO mapred.JobClient:  map 74% reduce 0%
>    11/07/22 00:31:39 INFO mapred.JobClient:  map 75% reduce 0%
>    11/07/22 00:31:42 INFO mapred.JobClient:  map 76% reduce 0%
>    11/07/22 00:31:50 INFO mapred.JobClient:  map 77% reduce 0%
>    11/07/22 00:31:59 INFO mapred.JobClient:  map 78% reduce 0%
>    11/07/22 00:32:11 INFO mapred.JobClient:  map 79% reduce 0%
>    11/07/22 00:32:28 INFO mapred.JobClient:  map 80% reduce 0%
>    11/07/22 00:32:37 INFO mapred.JobClient:  map 81% reduce 0%
>    11/07/22 00:32:40 INFO mapred.JobClient:  map 82% reduce 0%
>    11/07/22 00:32:49 INFO mapred.JobClient:  map 83% reduce 0%
>    11/07/22 00:32:58 INFO mapred.JobClient:  map 84% reduce 0%
>    11/07/22 00:33:04 INFO mapred.JobClient:  map 85% reduce 0%
>    11/07/22 00:33:13 INFO mapred.JobClient:  map 86% reduce 0%
>    11/07/22 00:33:19 INFO mapred.JobClient:  map 87% reduce 0%
>    11/07/22 00:33:32 INFO mapred.JobClient:  map 88% reduce 0%
>    11/07/22 00:33:38 INFO mapred.JobClient:  map 89% reduce 0%
>    11/07/22 00:33:47 INFO mapred.JobClient:  map 90% reduce 0%
>    11/07/22 00:33:52 INFO mapred.JobClient:  map 91% reduce 0%
>    11/07/22 00:34:01 INFO mapred.JobClient:  map 92% reduce 0%
>    11/07/22 00:34:10 INFO mapred.JobClient:  map 93% reduce 0%
>    11/07/22 00:34:13 INFO mapred.JobClient:  map 94% reduce 0%
>    11/07/22 00:34:25 INFO mapred.JobClient:  map 95% reduce 0%
>    11/07/22 00:34:31 INFO mapred.JobClient:  map 96% reduce 0%
>    11/07/22 00:34:40 INFO mapred.JobClient:  map 97% reduce 0%
>    11/07/22 00:34:47 INFO mapred.JobClient:  map 98% reduce 0%
>    11/07/22 00:34:56 INFO mapred.JobClient:  map 99% reduce 0%
>    11/07/22 00:35:02 INFO mapred.JobClient:  map 100% reduce 0%
>    11/07/22 00:35:07 INFO mapred.JobClient: Task Id : 
> attempt_201107211512_0029_m_000000_0, Status : FAILED
>    org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any 
> valid local directory for output/file.out
>            at 
> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
>            at 
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
>            at 
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
>            at 
> org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69)
>            at 
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1639)
>            at 
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1322)
>            at 
> org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:698)
>            at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:765)
>            at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
>            at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
>            at java.security.AccessController.doPrivileged(Native Method)
>            at javax.security.auth.Subject.doAs(Subject.java:416)
>            at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>            at org.apache.hadoop.mapred.Child.main(Child.java:253)
>
>
>    11/07/22 00:35:09 INFO mapred.JobClient:  map 0% reduce 0%
>    ...
>
>
>Q: What are the Hadoop heap settings you are using for your job?
>A: I am new to Hadoop and not sure where to find those, but I got this from 
>localhost:50070 -- is it the right thing?
>    147 files and directories, 60 blocks = 207 total. Heap Size is 31.57 MB / 
>    966.69 MB (3%)
>
>
>P/S: I keep forgetting to include my operating environment, sorry. I run everything 
>in a guest operating system inside a VirtualBox virtual machine, assigned 1 CPU core 
>and 1.5 GB of memory. The host operating system is OS X 10.6.8 on a MacBook 
>(aluminium, late-2008 model) with 4 GB of memory.
>
>
>    $ cat /etc/*-release
>    DISTRIB_ID=Ubuntu
>    DISTRIB_RELEASE=11.04
>    DISTRIB_CODENAME=natty
>    DISTRIB_DESCRIPTION="Ubuntu 11.04"
>    $ uname -a
>    Linux sensei 2.6.38-10-generic #46-Ubuntu SMP Tue Jun 28 15:05:41 UTC 2011 
> i686 i686 i386 GNU/Linux
>
>
>Best wishes,
>Jeffrey04
>
>>________________________________
>>From: Jeff Eastman <jeast...@narus.com>
>>To: "user@mahout.apache.org" <user@mahout.apache.org>; Jeffrey 
>><mycyber...@yahoo.com>
>>Sent: Thursday, July 21, 2011 11:54 PM
>>Subject: RE: fkmeans or Cluster Dumper not working?
>>
>>Excellent, so this appears to be localized to fuzzyk. Unfortunately, the 
>>Apache mail server strips off attachments so you'd need another mechanism (a 
>>JIRA?) to upload your data if it is not too large. Some more questions in the 
>>interim:
>>
>>- What is the cardinality of your vector data?
>>- Is it sparse or dense?
>>- How many vectors are you trying to cluster?
>>- What is the exact error you see when fkmeans fails with k=10? With k=50?
>>- What are the Hadoop heap settings you are using for your job?
>>
>>-----Original Message-----
>>From: Jeffrey [mailto:mycyber...@yahoo.com]
>>Sent: Thursday, July 21, 2011 11:21 AM
>>To: user@mahout.apache.org
>>Subject: Re: fkmeans or Cluster Dumper not working?
>>
>>Hi Jeff,
>>
>>Q: Did you change your invocation to specify a different -c directory (e.g. 
>>clusters-0)?
>>A: Yes :)
>>
>>Q: Did you add the -cl argument?
>>A: Yes :)
>>
>>$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output 
>>sensei/clusters --clusters sensei/clusters/clusters-0 --clustering 
>>--overwrite --emitMostLikely false --numClusters 5 --maxIter 10 --m 5
>>$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output 
>>sensei/clusters --clusters sensei/clusters/clusters-0 --clustering 
>>--overwrite --emitMostLikely false --numClusters 10 --maxIter 10 --m 5
>>$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output 
>>sensei/clusters --clusters sensei/clusters/clusters-0 --clustering 
>>--overwrite --emitMostLikely false --numClusters 50 --maxIter 10 --m 5
>>
>>Q: What is the new CLI invocation for clusterdump?
>>A:
>>$ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-4 --pointsDir 
>>sensei/clusters/clusteredPoints --output image-tag-clusters.txt
>>
>>
>>Q: Did this work for -k 10? What happens with -k 50?
>>A: It works for k=5 (but I don't see the points), but not for k=10. fkmeans itself 
>>fails when k=50, so I can't dump anything for k=50.
>>
>>Q: Have you tried kmeans?
>>A: Yes (all tested on 0.6-snapshot)
>>
>>k=5: no problem :)
>>k=10: no problem :)
>>k=50: no problem :)
>>
>>P/S: I have attached the test data I used (in mvc format); let me know if you 
>>guys prefer the raw data in arff format.
>>
>>Best wishes,
>>Jeffrey04
>>
>>
>>
>>>________________________________
>>>From: Jeff Eastman <jeast...@narus.com>
>>>To: "user@mahout.apache.org" <user@mahout.apache.org>; Jeffrey 
>>><mycyber...@yahoo.com>
>>>Sent: Thursday, July 21, 2011 9:36 PM
>>>Subject: RE: fkmeans or Cluster Dumper not working?
>>>
>>>You are correct, the wiki for fkmeans did not mention the -cl argument; I've 
>>>added that just now. I think this is what Frank means in his comment, but you 
>>>do *not* have to write any custom code to get the cluster dumper to do what 
>>>you want: just use the -cl argument and specify clusteredPoints as the -p 
>>>input to clusterdump.
>>>
>>>Check out TestClusterDumper.testKmeans and .testFuzzyKmeans. These show how 
>>>to invoke the clustering and cluster dumper from Java at least.
>>>
>>>Did you change your invocation to specify a different -c directory (e.g. 
>>>clusters-0)?
>>>Did you add the -cl argument?
>>>What is the new CLI invocation for clusterdump?
>>>Did this work for -k 10? What happens with -k 50?
>>>Have you tried kmeans?
>>>
>>>I can help you better if you will give me answers to my questions
>>>
>>>-----Original Message-----
>>>From: Jeffrey [mailto:mycyber...@yahoo.com]
>>>Sent: Thursday, July 21, 2011 4:30 AM
>>>To: user@mahout.apache.org
>>>Subject: Re: fkmeans or Cluster Dumper not working?
>>>
>>>Hi again,
>>>
>>>Let me update on what's working and what's not working.
>>>
>>>Works:
>>>fkmeans clustering (10 clusters) - thanks Jeff for the --cl tip
>>>fkmeans clustering (5 clusters)
>>>clusterdump (5 clusters) - so points are not included in the clusterdump and 
>>>I need to write a program for it?
>>>
>>>Not Working:
>>>fkmeans clustering (50 clusters) - same error
>>>clusterdump (10 clusters) - same error
>>>
>>>
>>>So it seems that, to attach points to the cluster dumper output like the synthetic 
>>>control example does, I would have to write some code, as pointed out by 
>>>@Frank_Scholten? 
>>>https://twitter.com/#!/Frank_Scholten/status/93617269296472064
>>>
>>>Best wishes,
>>>Jeffrey04
>>>
>>>>________________________________
>>>>From: Jeff Eastman <jeast...@narus.com>
>>>>To: "user@mahout.apache.org" <user@mahout.apache.org>; Jeffrey 
>>>><mycyber...@yahoo.com>
>>>>Sent: Wednesday, July 20, 2011 11:53 PM
>>>>Subject: RE: fkmeans or Cluster Dumper not working?
>>>>
>>>>Hi Jeffrey,
>>>>
>>>>It is always difficult to debug remotely, but here are some suggestions:
>>>>- First, you are specifying both an input clusters directory (--clusters) and 
>>>>--numClusters, so the job is sampling 10 points from your input data set and 
>>>>writing them to clusteredPoints as the prior clusters for the first iteration. 
>>>>You should pick a different name for this directory, because the clusteredPoints 
>>>>directory is used by the -cl (--clustering) option (which you did not supply) to 
>>>>write out the clustered (classified) input vectors. When you subsequently 
>>>>supplied clusteredPoints to the clusterdumper it was expecting a different 
>>>>format, and that caused the exception you saw. Change your --clusters directory 
>>>>(clusters-0 is good) and add a -cl argument, and things should go more smoothly; 
>>>>see the combined example after this list. The -cl option is not the default, so 
>>>>no clustering of the input points is performed without it. (Many people get 
>>>>caught by this, and perhaps the default should be changed, but clustering can be 
>>>>expensive, so it is not performed without request.)
>>>>- If you still have problems, try again with k-means. It is similar enough to 
>>>>fkmeans that, if you see the same problems with k-means, it will eliminate 
>>>>fkmeans itself as the cause.
>>>>- I don't see why changing the -k argument from 10 to 50 should cause any 
>>>>problems, unless your vectors are very large and you are getting an OME in 
>>>>the reducer. Since the reducer is calculating centroid vectors for the next 
>>>>iteration, these will become more dense and memory use will increase 
>>>>substantially.
>>>>- I can't figure out what might be causing your second exception. It is bombing 
>>>>inside of Hadoop file I/O, which makes me suspect command-argument problems.
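>>>>
>>>>Putting the first point together, here is a rough sketch of the sequence I have 
>>>>in mind (directory names are illustrative; dump whichever clusters-N directory 
>>>>the final iteration produces):
>>>>
>>>>$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters \
>>>>    --clusters sensei/clusters/clusters-0 --numClusters 10 --maxIter 10 --m 5 \
>>>>    --overwrite --clustering --emitMostLikely false
>>>>$ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 \
>>>>    --pointsDir sensei/clusters/clusteredPoints --output image-tag-clusters.txt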
>>>>
>>>>Hope this helps,
>>>>Jeff
>>>>
>>>>
>>>>-----Original Message-----
>>>>From: Jeffrey [mailto:mycyber...@yahoo.com]
>>>>Sent: Wednesday, July 20, 2011 2:41 AM
>>>>To: user@mahout.apache.org
>>>>Subject: fkmeans or Cluster Dumper not working?
>>>>
>>>>Hi,
>>>>
>>>>I am trying to generate clusters from my test data using the fkmeans command 
>>>>line tool. I am not sure if this is correct, as it only runs one iteration 
>>>>(output below is from 0.6-snapshot; I had to use a workaround for a weird bug - 
>>>>http://search.lucidimagination.com/search/document/d95ff0c29ac4a8a7/bug_in_fkmeans
>>>> )
>>>>
>>>>$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output 
>>>>sensei/clusters --clusters sensei/clusteredPoints --maxIter 10 
>>>>--numClusters 10 --overwrite --m 5
>>>>Running on hadoop, using 
>>>>HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>>>HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>>>MAHOUT-JOB: 
>>>>/home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>>>11/07/20 14:05:18 INFO common.AbstractJob: Command line arguments: 
>>>>{--clusters=sensei/clusteredPoints, --convergenceDelta=0.5, 
>>>>--distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, 
>>>>--emitMostLikely=true, --endPhase=2147483647, 
>>>>--input=sensei/image-tag.arff.mvc, --m=5, --maxIter=10, --method=mapreduce, 
>>>>--numClusters=10, --output=sensei/clusters, --overwrite=null, 
>>>>--startPhase=0, --tempDir=temp, --threshold=0}
>>>>11/07/20 14:05:20 INFO common.HadoopUtil: Deleting sensei/clusters
>>>>11/07/20 14:05:20 INFO common.HadoopUtil: Deleting sensei/clusteredPoints
>>>>11/07/20 14:05:20 INFO util.NativeCodeLoader: Loaded the native-hadoop library
>>>>11/07/20 14:05:20 INFO zlib.ZlibFactory: Successfully loaded & initialized 
>>>>native-zlib library
>>>>11/07/20 14:05:20 INFO compress.CodecPool: Got brand-new compressor
>>>>11/07/20 14:05:20 INFO compress.CodecPool: Got brand-new decompressor
>>>>11/07/20 14:05:29 INFO kmeans.RandomSeedGenerator: Wrote 10 vectors to 
>>>>sensei/clusteredPoints/part-randomSeed
>>>>11/07/20 14:05:29 INFO fuzzykmeans.FuzzyKMeansDriver: Fuzzy K-Means 
>>>>Iteration 1
>>>>11/07/20 14:05:30 INFO input.FileInputFormat: Total input paths to process 
>>>>: 1
>>>>11/07/20 14:05:30 INFO mapred.JobClient: Running job: job_201107201152_0021
>>>>11/07/20 14:05:31 INFO mapred.JobClient:  map 0% reduce 0%
>>>>11/07/20 14:05:54 INFO mapred.JobClient:  map 2% reduce 0%
>>>>11/07/20 14:05:57 INFO mapred.JobClient:  map 5% reduce 0%
>>>>11/07/20 14:06:00 INFO mapred.JobClient:  map 6% reduce 0%
>>>>11/07/20 14:06:03 INFO mapred.JobClient:  map 7% reduce 0%
>>>>11/07/20 14:06:07 INFO mapred.JobClient:  map 10% reduce 0%
>>>>11/07/20 14:06:10 INFO mapred.JobClient:  map 13% reduce 0%
>>>>11/07/20 14:06:13 INFO mapred.JobClient:  map 15% reduce 0%
>>>>11/07/20 14:06:16 INFO mapred.JobClient:  map 17% reduce 0%
>>>>11/07/20 14:06:19 INFO mapred.JobClient:  map 19% reduce 0%
>>>>11/07/20 14:06:22 INFO mapred.JobClient:  map 23% reduce 0%
>>>>11/07/20 14:06:25 INFO mapred.JobClient:  map 25% reduce 0%
>>>>11/07/20 14:06:28 INFO mapred.JobClient:  map 27% reduce 0%
>>>>11/07/20 14:06:31 INFO mapred.JobClient:  map 30% reduce 0%
>>>>11/07/20 14:06:34 INFO mapred.JobClient:  map 33% reduce 0%
>>>>11/07/20 14:06:37 INFO mapred.JobClient:  map 36% reduce 0%
>>>>11/07/20 14:06:40 INFO mapred.JobClient:  map 37% reduce 0%
>>>>11/07/20 14:06:43 INFO mapred.JobClient:  map 40% reduce 0%
>>>>11/07/20 14:06:46 INFO mapred.JobClient:  map 43% reduce 0%
>>>>11/07/20 14:06:49 INFO mapred.JobClient:  map 46% reduce 0%
>>>>11/07/20 14:06:52 INFO mapred.JobClient:  map 48% reduce 0%
>>>>11/07/20 14:06:55 INFO mapred.JobClient:  map 50% reduce 0%
>>>>11/07/20 14:06:57 INFO mapred.JobClient:  map 53% reduce 0%
>>>>11/07/20 14:07:00 INFO mapred.JobClient:  map 56% reduce 0%
>>>>11/07/20 14:07:03 INFO mapred.JobClient:  map 58% reduce 0%
>>>>11/07/20 14:07:06 INFO mapred.JobClient:  map 60% reduce 0%
>>>>11/07/20 14:07:09 INFO mapred.JobClient:  map 63% reduce 0%
>>>>11/07/20 14:07:13 INFO mapred.JobClient:  map 65% reduce 0%
>>>>11/07/20 14:07:16 INFO mapred.JobClient:  map 67% reduce 0%
>>>>11/07/20 14:07:19 INFO mapred.JobClient:  map 70% reduce 0%
>>>>11/07/20 14:07:22 INFO mapred.JobClient:  map 73% reduce 0%
>>>>11/07/20 14:07:25 INFO mapred.JobClient:  map 75% reduce 0%
>>>>11/07/20 14:07:28 INFO mapred.JobClient:  map 77% reduce 0%
>>>>11/07/20 14:07:31 INFO mapred.JobClient:  map 80% reduce 0%
>>>>11/07/20 14:07:34 INFO mapred.JobClient:  map 83% reduce 0%
>>>>11/07/20 14:07:37 INFO mapred.JobClient:  map 85% reduce 0%
>>>>11/07/20 14:07:40 INFO mapred.JobClient:  map 87% reduce 0%
>>>>11/07/20 14:07:43 INFO mapred.JobClient:  map 89% reduce 0%
>>>>11/07/20 14:07:46 INFO mapred.JobClient:  map 92% reduce 0%
>>>>11/07/20 14:07:49 INFO mapred.JobClient:  map 95% reduce 0%
>>>>11/07/20 14:07:55 INFO mapred.JobClient:  map 98% reduce 0%
>>>>11/07/20 14:07:59 INFO mapred.JobClient:  map 99% reduce 0%
>>>>11/07/20 14:08:02 INFO mapred.JobClient:  map 100% reduce 0%
>>>>11/07/20 14:08:23 INFO mapred.JobClient:  map 100% reduce 100%
>>>>11/07/20 14:08:31 INFO mapred.JobClient: Job complete: job_201107201152_0021
>>>>11/07/20 14:08:31 INFO mapred.JobClient: Counters: 26
>>>>11/07/20 14:08:31 INFO mapred.JobClient:   Job Counters
>>>>11/07/20 14:08:31 INFO mapred.JobClient:     Launched reduce tasks=1
>>>>11/07/20 14:08:31 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=149314
>>>>11/07/20 14:08:31 INFO mapred.JobClient:     Total time spent by all 
>>>>reduces waiting after reserving slots (ms)=0
>>>>11/07/20 14:08:31 INFO mapred.JobClient:     Total time spent by all maps 
>>>>waiting after reserving slots (ms)=0
>>>>11/07/20 14:08:31 INFO mapred.JobClient:     Launched map tasks=1
>>>>11/07/20 14:08:31 INFO mapred.JobClient:     Data-local map tasks=1
>>>>11/07/20 14:08:31 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=15618
>>>>11/07/20 14:08:31 INFO mapred.JobClient:   File Output Format Counters
>>>>11/07/20 14:08:31 INFO mapred.JobClient:     Bytes Written=2247222
>>>>11/07/20 14:08:31 INFO mapred.JobClient:   Clustering
>>>>11/07/20 14:08:31 INFO mapred.JobClient:     Converged Clusters=10
>>>>11/07/20 14:08:31 INFO mapred.JobClient:   FileSystemCounters
>>>>11/07/20 14:08:31 INFO mapred.JobClient:     FILE_BYTES_READ=130281382
>>>>11/07/20 14:08:31 INFO mapred.JobClient:     HDFS_BYTES_READ=254494
>>>>11/07/20 14:08:31 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=132572666
>>>>11/07/20 14:08:31 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=2247222
>>>>11/07/20 14:08:31 INFO mapred.JobClient:   File Input Format Counters
>>>>11/07/20 14:08:31 INFO mapred.JobClient:     Bytes Read=247443
>>>>11/07/20 14:08:31 INFO mapred.JobClient:   Map-Reduce Framework
>>>>11/07/20 14:08:31 INFO mapred.JobClient:     Reduce input groups=10
>>>>11/07/20 14:08:31 INFO mapred.JobClient:     Map output materialized 
>>>>bytes=2246233
>>>>11/07/20 14:08:32 INFO mapred.JobClient:     Combine output records=330
>>>>11/07/20 14:08:32 INFO mapred.JobClient:     Map input records=1113
>>>>11/07/20 14:08:32 INFO mapred.JobClient:     Reduce shuffle bytes=2246233
>>>>11/07/20 14:08:32 INFO mapred.JobClient:     Reduce output records=10
>>>>11/07/20 14:08:32 INFO mapred.JobClient:     Spilled Records=590
>>>>11/07/20 14:08:32 INFO mapred.JobClient:     Map output bytes=2499995001
>>>>11/07/20 14:08:32 INFO mapred.JobClient:     Combine input records=11450
>>>>11/07/20 14:08:32 INFO mapred.JobClient:     Map output records=11130
>>>>11/07/20 14:08:32 INFO mapred.JobClient:     SPLIT_RAW_BYTES=127
>>>>11/07/20 14:08:32 INFO mapred.JobClient:     Reduce input records=10
>>>>11/07/20 14:08:32 INFO driver.MahoutDriver: Program took 194096 ms
>>>>
>>>>if I increase the --numClusters argument (e.g. 50), then it will return 
>>>>exception after
>>>>11/07/20 14:08:02 INFO mapred.JobClient:  map 100% reduce 0%
>>>>
>>>>and would retry again (also reproducible using 0.6-snapshot)
>>>>
>>>>...
>>>>11/07/20 14:22:25 INFO mapred.JobClient:  map 100% reduce 0%
>>>>11/07/20 14:22:30 INFO mapred.JobClient: Task Id : 
>>>>attempt_201107201152_0022_m_000000_0, Status : FAILED
>>>>org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any 
>>>>valid local directory for output/file.out
>>>>        at 
>>>> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
>>>>        at 
>>>> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
>>>>        at 
>>>> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
>>>>        at 
>>>> org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69)
>>>>        at 
>>>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1639)
>>>>        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1322)
>>>>        at 
>>>> org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:698)
>>>>        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:765)
>>>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
>>>>        at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
>>>>        at java.security.AccessController.doPrivileged(Native Method)
>>>>        at javax.security.auth.Subject.doAs(Subject.java:416)
>>>>        at 
>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>>>>        at org.apache.hadoop.mapred.Child.main(Child.java:253)
>>>>
>>>>11/07/20 14:22:32 INFO mapred.JobClient:  map 0% reduce 0%
>>>>...
>>>>
>>>>Then I ran the cluster dumper to dump information about the clusters. This 
>>>>command works if I only care about the cluster centroids (on both the 0.5 
>>>>release and 0.6-snapshot):
>>>>
>>>>$ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --output 
>>>>image-tag-clusters.txt
>>>>Running on hadoop, using 
>>>>HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>>>HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>>>MAHOUT-JOB: 
>>>>/home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>>>11/07/20 14:33:45 INFO common.AbstractJob: Command line arguments: 
>>>>{--dictionaryType=text, --endPhase=2147483647, 
>>>>--output=image-tag-clusters.txt, --seqFileDir=sensei/clusters/clusters-1, 
>>>>--startPhase=0, --tempDir=temp}
>>>>11/07/20 14:33:56 INFO driver.MahoutDriver: Program took 11761 ms
>>>>
>>>>but if I want to see the degree of membership of each point, I get another 
>>>>exception (yes, reproducible on both the 0.5 release and 0.6-snapshot):
>>>>
>>>>$ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --output 
>>>>image-tag-clusters.txt --pointsDir sensei/clusteredPoints
>>>>Running on hadoop, using 
>>>>HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>>>HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>>>MAHOUT-JOB: 
>>>>/home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>>>11/07/20 14:35:08 INFO common.AbstractJob: Command line arguments: 
>>>>{--dictionaryType=text, --endPhase=2147483647, 
>>>>--output=image-tag-clusters.txt, --pointsDir=sensei/clusteredPoints, 
>>>>--seqFileDir=sensei/clusters/clusters-1, --startPhase=0, --tempDir=temp}
>>>>11/07/20 14:35:10 INFO util.NativeCodeLoader: Loaded the native-hadoop library
>>>>11/07/20 14:35:10 INFO zlib.ZlibFactory: Successfully loaded & initialized 
>>>>native-zlib library
>>>>11/07/20 14:35:10 INFO compress.CodecPool: Got brand-new decompressor
>>>>Exception in thread "main" java.lang.ClassCastException: 
>>>>org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
>>>>        at 
>>>> org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:261)
>>>>        at 
>>>> org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:209)
>>>>        at 
>>>> org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:123)
>>>>        at 
>>>> org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:89)
>>>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>        at 
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>        at java.lang.reflect.Method.invoke(Method.java:616)
>>>>        at 
>>>> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>>        at 
>>>> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>>        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
>>>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>        at 
>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>        at java.lang.reflect.Method.invoke(Method.java:616)
>>>>        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>>>
>>>>Erm, would writing a short program to call the API (by the way, I can't seem to 
>>>>find the latest API docs?) be a better choice here? Or did I do anything wrong 
>>>>here? (Yes, Java is not my main language, and I am very new to Mahout... and 
>>>>Hadoop.)
>>>>
>>>>The data is converted from an arff file with about 1,000 rows (resources) and 
>>>>14k columns (tags), and it is just a subset of my data. (I actually made a 
>>>>mistake, so it is now generating resource clusters instead of tag clusters, but 
>>>>I am only doing this as a proof of concept to see whether Mahout is good enough 
>>>>for the task.)
>>>>
>>>>Best wishes,
>>>>Jeffrey04
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>
>
