Hi again,

Here's an update on what's working and what's not.

Works:
fkmeans clustering (10 clusters) - thanks Jeff for the -cl tip
fkmeans clustering (5 clusters)
clusterdump (5 clusters) - though points are still not included in the 
clusterdump output, so do I need to write a program for that?

Not Working:
fkmeans clustering (50 clusters) - same error as before (the DiskErrorException)
clusterdump (10 clusters) - same error as before (the ClassCastException)


So it seems that to attach points to the cluster dumper output, the way the 
synthetic control example does, I would have to write some code, as pointed 
out by @Frank_Scholten? 
https://twitter.com/#!/Frank_Scholten/status/93617269296472064
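
Here is a rough sketch of the kind of program I have in mind (assuming the 
clusteredPoints directory was written by the -cl step, so its sequence files 
map IntWritable cluster ids to WeightedVectorWritable points; the part file 
name and the class/method names are my guesses from reading the 0.5/0.6 
sources, so corrections are welcome):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.WeightedVectorWritable;

public class DumpClusteredPoints {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // One part file written by the -cl (clustering) step; the name is a guess
    Path path = new Path("sensei/clusters/clusteredPoints/part-m-00000");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      IntWritable clusterId = new IntWritable();
      WeightedVectorWritable point = new WeightedVectorWritable();
      while (reader.next(clusterId, point)) {
        // For fuzzy k-means, the weight is the point's degree of membership
        System.out.println(clusterId.get() + "\t" + point.getWeight()
            + "\t" + point.getVector());
      }
    } finally {
      reader.close();
    }
  }
}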

Best wishes,
Jeffrey04

>________________________________
>From: Jeff Eastman <jeast...@narus.com>
>To: "user@mahout.apache.org" <user@mahout.apache.org>; Jeffrey 
><mycyber...@yahoo.com>
>Sent: Wednesday, July 20, 2011 11:53 PM
>Subject: RE: fkmeans or Cluster Dumper not working?
>
>Hi Jeffrey,
>
>It is always difficult to debug remotely, but here are some suggestions:
>- First, you are specifying both an input clusters directory (--clusters) and 
>--numClusters, so the job is sampling 10 points from your input data set and 
>writing them to clusteredPoints as the prior clusters for the first 
>iteration. You should pick a different name for this directory, because the 
>clusteredPoints directory is used by the -cl (--clustering) option (which you 
>did not supply) to write out the clustered (classified) input vectors. When 
>you subsequently supplied clusteredPoints to the clusterdumper, it was 
>expecting a different format, and that caused the exception you saw. Change 
>your --clusters directory (clusters-0 is good), add a -cl argument, and 
>things should go more smoothly (see the example command after these 
>suggestions). The -cl option is not the default, so no clustering of the 
>input points is performed without it. (Many people get caught by this, and 
>perhaps the default should be changed, but clustering can be expensive, so it 
>is not performed unless requested.)
>- If you still have problems, try again with k-means. It is similar enough to 
>fkmeans that seeing the same problems with k-means would rule out fkmeans 
>itself as the cause.
>- I don't see why changing the -k argument from 10 to 50 should cause any 
>problems, unless your vectors are very large and you are getting an OOME 
>(OutOfMemoryError) in the reducer. Since the reducer is calculating centroid 
>vectors for the next iteration, these will become more dense and memory use 
>will increase substantially.
>- I can't figure out what might be causing your second exception. It is 
>bombing inside of Hadoop file IO and this causes me to suspect command 
>argument problems.
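>
>For illustration, a corrected invocation might look something like this (the 
>clusters-0 directory name and the -cl flag follow the suggestions above; 
>adjust the paths to your layout):
>
>$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output 
>sensei/clusters --clusters sensei/clusters-0 --maxIter 10 --numClusters 10 
>--overwrite --m 5 -cl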
>
>Hope this helps,
>Jeff
>
>
>-----Original Message-----
>From: Jeffrey [mailto:mycyber...@yahoo.com] 
>Sent: Wednesday, July 20, 2011 2:41 AM
>To: user@mahout.apache.org
>Subject: fkmeans or Cluster Dumper not working?
>
>Hi,
>
>I am trying to generate clusters from my test data using the fkmeans command 
>line tool. I am not sure if this is correct, as it only runs one iteration 
>(the output below is from 0.6-snapshot; I had to use a workaround for a weird 
>bug 
>- http://search.lucidimagination.com/search/document/d95ff0c29ac4a8a7/bug_in_fkmeans
> )
>
>$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output 
>sensei/clusters --clusters sensei/clusteredPoints --maxIter 10 --numClusters 
>10 --overwrite --m 5
>Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>11/07/20 14:05:18 INFO common.AbstractJob: Command line arguments: 
>{--clusters=sensei/clusteredPoints, --convergenceDelta=0.5, 
>--distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, 
>--emitMostLikely=true, --endPhase=2147483647, 
>--input=sensei/image-tag.arff.mvc, --m=5, --maxIter=10, --method=mapreduce, 
>--numClusters=10, --output=sensei/clusters, --overwrite=null, --startPhase=0, 
>--tempDir=temp, --threshold=0}
>11/07/20 14:05:20 INFO common.HadoopUtil: Deleting sensei/clusters
>11/07/20 14:05:20 INFO common.HadoopUtil: Deleting sensei/clusteredPoints
>11/07/20 14:05:20 INFO util.NativeCodeLoader: Loaded the native-hadoop library
>11/07/20 14:05:20 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
>11/07/20 14:05:20 INFO compress.CodecPool: Got brand-new compressor
>11/07/20 14:05:20 INFO compress.CodecPool: Got brand-new decompressor
>11/07/20 14:05:29 INFO kmeans.RandomSeedGenerator: Wrote 10 vectors to 
>sensei/clusteredPoints/part-randomSeed
>11/07/20 14:05:29 INFO fuzzykmeans.FuzzyKMeansDriver: Fuzzy K-Means Iteration 1
>11/07/20 14:05:30 INFO input.FileInputFormat: Total input paths to process : 1
>11/07/20 14:05:30 INFO mapred.JobClient: Running job: job_201107201152_0021
>11/07/20 14:05:31 INFO mapred.JobClient:  map 0% reduce 0%
>11/07/20 14:05:54 INFO mapred.JobClient:  map 2% reduce 0%
>11/07/20 14:05:57 INFO mapred.JobClient:  map 5% reduce 0%
>11/07/20 14:06:00 INFO mapred.JobClient:  map 6% reduce 0%
>11/07/20 14:06:03 INFO mapred.JobClient:  map 7% reduce 0%
>11/07/20 14:06:07 INFO mapred.JobClient:  map 10% reduce 0%
>11/07/20 14:06:10 INFO mapred.JobClient:  map 13% reduce 0%
>11/07/20 14:06:13 INFO mapred.JobClient:  map 15% reduce 0%
>11/07/20 14:06:16 INFO mapred.JobClient:  map 17% reduce 0%
>11/07/20 14:06:19 INFO mapred.JobClient:  map 19% reduce 0%
>11/07/20 14:06:22 INFO mapred.JobClient:  map 23% reduce 0%
>11/07/20 14:06:25 INFO mapred.JobClient:  map 25% reduce 0%
>11/07/20 14:06:28 INFO mapred.JobClient:  map 27% reduce 0%
>11/07/20 14:06:31 INFO mapred.JobClient:  map 30% reduce 0%
>11/07/20 14:06:34 INFO mapred.JobClient:  map 33% reduce 0%
>11/07/20 14:06:37 INFO mapred.JobClient:  map 36% reduce 0%
>11/07/20 14:06:40 INFO mapred.JobClient:  map 37% reduce 0%
>11/07/20 14:06:43 INFO mapred.JobClient:  map 40% reduce 0%
>11/07/20 14:06:46 INFO mapred.JobClient:  map 43% reduce 0%
>11/07/20 14:06:49 INFO mapred.JobClient:  map 46% reduce 0%
>11/07/20 14:06:52 INFO mapred.JobClient:  map 48% reduce 0%
>11/07/20 14:06:55 INFO mapred.JobClient:  map 50% reduce 0%
>11/07/20 14:06:57 INFO mapred.JobClient:  map 53% reduce 0%
>11/07/20 14:07:00 INFO mapred.JobClient:  map 56% reduce 0%
>11/07/20 14:07:03 INFO mapred.JobClient:  map 58% reduce 0%
>11/07/20 14:07:06 INFO mapred.JobClient:  map 60% reduce 0%
>11/07/20 14:07:09 INFO mapred.JobClient:  map 63% reduce 0%
>11/07/20 14:07:13 INFO mapred.JobClient:  map 65% reduce 0%
>11/07/20 14:07:16 INFO mapred.JobClient:  map 67% reduce 0%
>11/07/20 14:07:19 INFO mapred.JobClient:  map 70% reduce 0%
>11/07/20 14:07:22 INFO mapred.JobClient:  map 73% reduce 0%
>11/07/20 14:07:25 INFO mapred.JobClient:  map 75% reduce 0%
>11/07/20 14:07:28 INFO mapred.JobClient:  map 77% reduce 0%
>11/07/20 14:07:31 INFO mapred.JobClient:  map 80% reduce 0%
>11/07/20 14:07:34 INFO mapred.JobClient:  map 83% reduce 0%
>11/07/20 14:07:37 INFO mapred.JobClient:  map 85% reduce 0%
>11/07/20 14:07:40 INFO mapred.JobClient:  map 87% reduce 0%
>11/07/20 14:07:43 INFO mapred.JobClient:  map 89% reduce 0%
>11/07/20 14:07:46 INFO mapred.JobClient:  map 92% reduce 0%
>11/07/20 14:07:49 INFO mapred.JobClient:  map 95% reduce 0%
>11/07/20 14:07:55 INFO mapred.JobClient:  map 98% reduce 0%
>11/07/20 14:07:59 INFO mapred.JobClient:  map 99% reduce 0%
>11/07/20 14:08:02 INFO mapred.JobClient:  map 100% reduce 0%
>11/07/20 14:08:23 INFO mapred.JobClient:  map 100% reduce 100%
>11/07/20 14:08:31 INFO mapred.JobClient: Job complete: job_201107201152_0021
>11/07/20 14:08:31 INFO mapred.JobClient: Counters: 26
>11/07/20 14:08:31 INFO mapred.JobClient:   Job Counters 
>11/07/20 14:08:31 INFO mapred.JobClient:     Launched reduce tasks=1
>11/07/20 14:08:31 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=149314
>11/07/20 14:08:31 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
>11/07/20 14:08:31 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
>11/07/20 14:08:31 INFO mapred.JobClient:     Launched map tasks=1
>11/07/20 14:08:31 INFO mapred.JobClient:     Data-local map tasks=1
>11/07/20 14:08:31 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=15618
>11/07/20 14:08:31 INFO mapred.JobClient:   File Output Format Counters 
>11/07/20 14:08:31 INFO mapred.JobClient:     Bytes Written=2247222
>11/07/20 14:08:31 INFO mapred.JobClient:   Clustering
>11/07/20 14:08:31 INFO mapred.JobClient:     Converged Clusters=10
>11/07/20 14:08:31 INFO mapred.JobClient:   FileSystemCounters
>11/07/20 14:08:31 INFO mapred.JobClient:     FILE_BYTES_READ=130281382
>11/07/20 14:08:31 INFO mapred.JobClient:     HDFS_BYTES_READ=254494
>11/07/20 14:08:31 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=132572666
>11/07/20 14:08:31 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=2247222
>11/07/20 14:08:31 INFO mapred.JobClient:   File Input Format Counters 
>11/07/20 14:08:31 INFO mapred.JobClient:     Bytes Read=247443
>11/07/20 14:08:31 INFO mapred.JobClient:   Map-Reduce Framework
>11/07/20 14:08:31 INFO mapred.JobClient:     Reduce input groups=10
>11/07/20 14:08:31 INFO mapred.JobClient:     Map output materialized bytes=2246233
>11/07/20 14:08:32 INFO mapred.JobClient:     Combine output records=330
>11/07/20 14:08:32 INFO mapred.JobClient:     Map input records=1113
>11/07/20 14:08:32 INFO mapred.JobClient:     Reduce shuffle bytes=2246233
>11/07/20 14:08:32 INFO mapred.JobClient:     Reduce output records=10
>11/07/20 14:08:32 INFO mapred.JobClient:     Spilled Records=590
>11/07/20 14:08:32 INFO mapred.JobClient:     Map output bytes=2499995001
>11/07/20 14:08:32 INFO mapred.JobClient:     Combine input records=11450
>11/07/20 14:08:32 INFO mapred.JobClient:     Map output records=11130
>11/07/20 14:08:32 INFO mapred.JobClient:     SPLIT_RAW_BYTES=127
>11/07/20 14:08:32 INFO mapred.JobClient:     Reduce input records=10
>11/07/20 14:08:32 INFO driver.MahoutDriver: Program took 194096 ms
>
>If I increase the --numClusters argument (e.g. to 50), it throws an exception 
>after 
>11/07/20 14:08:02 INFO mapred.JobClient:  map 100% reduce 0%
>
>and then retries (this is also reproducible with 0.6-snapshot):
>
>...
>11/07/20 14:22:25 INFO mapred.JobClient:  map 100% reduce 0%
>11/07/20 14:22:30 INFO mapred.JobClient: Task Id : attempt_201107201152_0022_m_000000_0, Status : FAILED
>org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/file.out
>        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
>        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
>        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
>        at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69)
>        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1639)
>        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1322)
>        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:698)
>        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:765)
>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
>        at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at javax.security.auth.Subject.doAs(Subject.java:416)
>        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>        at org.apache.hadoop.mapred.Child.main(Child.java:253)
>
>11/07/20 14:22:32 INFO mapred.JobClient:  map 0% reduce 0%
>...
>
>Then I ran the cluster dumper to dump information about the clusters. This 
>command works if I only care about the cluster centroids (with both the 0.5 
>release and the 0.6-snapshot):
>
>$ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --output 
>image-tag-clusters.txt
>Running on hadoop, using 
>HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>MAHOUT-JOB: 
>/home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>11/07/20 14:33:45 INFO common.AbstractJob: Command line arguments: 
>{--dictionaryType=text, --endPhase=2147483647, 
>--output=image-tag-clusters.txt, --seqFileDir=sensei/clusters/clusters-1, 
>--startPhase=0, --tempDir=temp}
>11/07/20 14:33:56 INFO driver.MahoutDriver: Program took 11761 ms
>
>But if I want to see the degree of membership of each point, I get another 
>exception (again, reproducible with both the 0.5 release and the 0.6-snapshot):
>
>$ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --output 
>image-tag-clusters.txt --pointsDir sensei/clusteredPoints
>Running on hadoop, using 
>HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>MAHOUT-JOB: 
>/home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>11/07/20 14:35:08 INFO common.AbstractJob: Command line arguments: 
>{--dictionaryType=text, --endPhase=2147483647, 
>--output=image-tag-clusters.txt, --pointsDir=sensei/clusteredPoints, 
>--seqFileDir=sensei/clusters/clusters-1, --startPhase=0, --tempDir=temp}
>11/07/20 14:35:10 INFO util.NativeCodeLoader: Loaded the native-hadoop library
>11/07/20 14:35:10 INFO zlib.ZlibFactory: Successfully loaded & initialized 
>native-zlib library
>11/07/20 14:35:10 INFO compress.CodecPool: Got brand-new decompressor
>Exception in thread "main" java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
>        at org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:261)
>        at org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:209)
>        at org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:123)
>        at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:89)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>        at java.lang.reflect.Method.invoke(Method.java:616)
>        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>        at java.lang.reflect.Method.invoke(Method.java:616)
>        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
>Erm, would writing a short program to call the API (by the way, I can't seem 
>to find the latest API doc?) be a better choice here? Or did I do something 
>wrong (yes, Java is not my main language, and I am very new to Mahout... and 
>Hadoop)?
>
>The data was converted from an ARFF file with about 1000 rows (resources) and 
>14k columns (tags), and it is just a subset of my data. (I actually made a 
>mistake, so it is now generating resource clusters instead of tag clusters, 
>but I am only doing this as a proof of concept to see whether Mahout is good 
>enough for the task.)
>
>Best wishes,
>Jeffrey04
>
>
>
