Hi Jeffrey,

It is always difficult to debug remotely, but here are some suggestions:

- First, you are specifying both an input clusters directory (--clusters sensei/clusteredPoints) and --numClusters, so the job is sampling 10 points from your input data set and writing them to clusteredPoints as the prior clusters for the first iteration. You should pick a different name for this directory, because clusteredPoints is the directory used by the -cl (--clustering) option (which you did not supply) to write out the clustered (classified) input vectors. When you subsequently supplied clusteredPoints to the clusterdumper, it was expecting a different format, and that caused the exception you saw. Change your --clusters directory (clusters-0 is a good choice), add a -cl argument, and things should go more smoothly; example commands are below this list. The -cl option is not the default, so no clustering of the input points is performed without it. (Many people get caught by this, and perhaps the default should be changed, but clustering can be expensive, so it is not performed unless requested.)

- If you still have problems, try again with k-means. It is very similar to fkmeans, so seeing the same problems with k-means would rule out fkmeans itself as the cause.

- I don't see why changing the -k (--numClusters) argument from 10 to 50 should cause any problems, unless your vectors are very large and you are hitting an OutOfMemoryError in the reducer. Since the reducer is calculating the centroid vectors for the next iteration, these become denser and memory use increases substantially.

- I can't figure out what might be causing your second exception. It is bombing inside of Hadoop file I/O, which makes me suspect command-argument problems.
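To make that concrete, something along these lines should work (untested; the sensei/clusters-0 name is just an example for the renamed prior-clusters directory, and clusters-N stands for whichever final-iteration directory your run actually produces):

$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusters-0 --maxIter 10 --numClusters 10 --overwrite --m 5 -cl

$ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-N --pointsDir sensei/clusters/clusteredPoints --output image-tag-clusters.txt

With -cl supplied, the clustered points should be written under your --output directory (sensei/clusters/clusteredPoints here, if I recall the layout correctly), and that is the directory to hand to --pointsDir.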
Hope this helps,
Jeff

-----Original Message-----
From: Jeffrey [mailto:[email protected]]
Sent: Wednesday, July 20, 2011 2:41 AM
To: [email protected]
Subject: fkmeans or Cluster Dumper not working?

Hi,

I am trying to generate clusters using the fkmeans command line tool from my test data. Not sure if this is correct, as it only runs one iteration (output from 0.6-snapshot; I had to use a workaround for a weird bug - http://search.lucidimagination.com/search/document/d95ff0c29ac4a8a7/bug_in_fkmeans ):

$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusteredPoints --maxIter 10 --numClusters 10 --overwrite --m 5

Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
11/07/20 14:05:18 INFO common.AbstractJob: Command line arguments: {--clusters=sensei/clusteredPoints, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --emitMostLikely=true, --endPhase=2147483647, --input=sensei/image-tag.arff.mvc, --m=5, --maxIter=10, --method=mapreduce, --numClusters=10, --output=sensei/clusters, --overwrite=null, --startPhase=0, --tempDir=temp, --threshold=0}
11/07/20 14:05:20 INFO common.HadoopUtil: Deleting sensei/clusters
11/07/20 14:05:20 INFO common.HadoopUtil: Deleting sensei/clusteredPoints
11/07/20 14:05:20 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/07/20 14:05:20 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
11/07/20 14:05:20 INFO compress.CodecPool: Got brand-new compressor
11/07/20 14:05:20 INFO compress.CodecPool: Got brand-new decompressor
11/07/20 14:05:29 INFO kmeans.RandomSeedGenerator: Wrote 10 vectors to sensei/clusteredPoints/part-randomSeed
11/07/20 14:05:29 INFO fuzzykmeans.FuzzyKMeansDriver: Fuzzy K-Means Iteration 1
11/07/20 14:05:30 INFO input.FileInputFormat: Total input paths to process : 1
11/07/20 14:05:30 INFO mapred.JobClient: Running job: job_201107201152_0021
11/07/20 14:05:31 INFO mapred.JobClient: map 0% reduce 0%
11/07/20 14:05:54 INFO mapred.JobClient: map 2% reduce 0%
11/07/20 14:05:57 INFO mapred.JobClient: map 5% reduce 0%
11/07/20 14:06:00 INFO mapred.JobClient: map 6% reduce 0%
11/07/20 14:06:03 INFO mapred.JobClient: map 7% reduce 0%
11/07/20 14:06:07 INFO mapred.JobClient: map 10% reduce 0%
11/07/20 14:06:10 INFO mapred.JobClient: map 13% reduce 0%
11/07/20 14:06:13 INFO mapred.JobClient: map 15% reduce 0%
11/07/20 14:06:16 INFO mapred.JobClient: map 17% reduce 0%
11/07/20 14:06:19 INFO mapred.JobClient: map 19% reduce 0%
11/07/20 14:06:22 INFO mapred.JobClient: map 23% reduce 0%
11/07/20 14:06:25 INFO mapred.JobClient: map 25% reduce 0%
11/07/20 14:06:28 INFO mapred.JobClient: map 27% reduce 0%
11/07/20 14:06:31 INFO mapred.JobClient: map 30% reduce 0%
11/07/20 14:06:34 INFO mapred.JobClient: map 33% reduce 0%
11/07/20 14:06:37 INFO mapred.JobClient: map 36% reduce 0%
11/07/20 14:06:40 INFO mapred.JobClient: map 37% reduce 0%
11/07/20 14:06:43 INFO mapred.JobClient: map 40% reduce 0%
11/07/20 14:06:46 INFO mapred.JobClient: map 43% reduce 0%
11/07/20 14:06:49 INFO mapred.JobClient: map 46% reduce 0%
11/07/20 14:06:52 INFO mapred.JobClient: map 48% reduce 0%
11/07/20 14:06:55 INFO mapred.JobClient: map 50% reduce 0%
11/07/20 14:06:57 INFO mapred.JobClient: map 53% reduce 0%
11/07/20 14:07:00 INFO mapred.JobClient: map 56% reduce 0%
11/07/20 14:07:03 INFO mapred.JobClient: map 58% reduce 0%
11/07/20 14:07:06 INFO mapred.JobClient: map 60% reduce 0%
11/07/20 14:07:09 INFO mapred.JobClient: map 63% reduce 0%
11/07/20 14:07:13 INFO mapred.JobClient: map 65% reduce 0%
11/07/20 14:07:16 INFO mapred.JobClient: map 67% reduce 0%
11/07/20 14:07:19 INFO mapred.JobClient: map 70% reduce 0%
11/07/20 14:07:22 INFO mapred.JobClient: map 73% reduce 0%
11/07/20 14:07:25 INFO mapred.JobClient: map 75% reduce 0%
11/07/20 14:07:28 INFO mapred.JobClient: map 77% reduce 0%
11/07/20 14:07:31 INFO mapred.JobClient: map 80% reduce 0%
11/07/20 14:07:34 INFO mapred.JobClient: map 83% reduce 0%
11/07/20 14:07:37 INFO mapred.JobClient: map 85% reduce 0%
11/07/20 14:07:40 INFO mapred.JobClient: map 87% reduce 0%
11/07/20 14:07:43 INFO mapred.JobClient: map 89% reduce 0%
11/07/20 14:07:46 INFO mapred.JobClient: map 92% reduce 0%
11/07/20 14:07:49 INFO mapred.JobClient: map 95% reduce 0%
11/07/20 14:07:55 INFO mapred.JobClient: map 98% reduce 0%
11/07/20 14:07:59 INFO mapred.JobClient: map 99% reduce 0%
11/07/20 14:08:02 INFO mapred.JobClient: map 100% reduce 0%
11/07/20 14:08:23 INFO mapred.JobClient: map 100% reduce 100%
11/07/20 14:08:31 INFO mapred.JobClient: Job complete: job_201107201152_0021
11/07/20 14:08:31 INFO mapred.JobClient: Counters: 26
11/07/20 14:08:31 INFO mapred.JobClient: Job Counters
11/07/20 14:08:31 INFO mapred.JobClient: Launched reduce tasks=1
11/07/20 14:08:31 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=149314
11/07/20 14:08:31 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
11/07/20 14:08:31 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/07/20 14:08:31 INFO mapred.JobClient: Launched map tasks=1
11/07/20 14:08:31 INFO mapred.JobClient: Data-local map tasks=1
11/07/20 14:08:31 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=15618
11/07/20 14:08:31 INFO mapred.JobClient: File Output Format Counters
11/07/20 14:08:31 INFO mapred.JobClient: Bytes Written=2247222
11/07/20 14:08:31 INFO mapred.JobClient: Clustering
11/07/20 14:08:31 INFO mapred.JobClient: Converged Clusters=10
11/07/20 14:08:31 INFO mapred.JobClient: FileSystemCounters
11/07/20 14:08:31 INFO mapred.JobClient: FILE_BYTES_READ=130281382
11/07/20 14:08:31 INFO mapred.JobClient: HDFS_BYTES_READ=254494
11/07/20 14:08:31 INFO mapred.JobClient: FILE_BYTES_WRITTEN=132572666
11/07/20 14:08:31 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=2247222
11/07/20 14:08:31 INFO mapred.JobClient: File Input Format Counters
11/07/20 14:08:31 INFO mapred.JobClient: Bytes Read=247443
11/07/20 14:08:31 INFO mapred.JobClient: Map-Reduce Framework
11/07/20 14:08:31 INFO mapred.JobClient: Reduce input groups=10
11/07/20 14:08:31 INFO mapred.JobClient: Map output materialized bytes=2246233
11/07/20 14:08:32 INFO mapred.JobClient: Combine output records=330
11/07/20 14:08:32 INFO mapred.JobClient: Map input records=1113
11/07/20 14:08:32 INFO mapred.JobClient: Reduce shuffle bytes=2246233
11/07/20 14:08:32 INFO mapred.JobClient: Reduce output records=10
11/07/20 14:08:32 INFO mapred.JobClient: Spilled Records=590
11/07/20 14:08:32 INFO mapred.JobClient: Map output bytes=2499995001
11/07/20 14:08:32 INFO mapred.JobClient: Combine input records=11450
11/07/20 14:08:32 INFO mapred.JobClient: Map output records=11130
11/07/20 14:08:32 INFO mapred.JobClient: SPLIT_RAW_BYTES=127
11/07/20 14:08:32 INFO mapred.JobClient: Reduce input records=10
11/07/20 14:08:32 INFO driver.MahoutDriver: Program took 194096 ms

If I increase the --numClusters argument (e.g. to 50), it returns an exception after "11/07/20 14:08:02 INFO mapred.JobClient: map 100% reduce 0%" and then retries (also reproducible using 0.6-snapshot):

...
11/07/20 14:22:25 INFO mapred.JobClient: map 100% reduce 0%
11/07/20 14:22:30 INFO mapred.JobClient: Task Id : attempt_201107201152_0022_m_000000_0, Status : FAILED
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/file.out
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
    at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1639)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1322)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:698)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:765)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.Child.main(Child.java:253)
11/07/20 14:22:32 INFO mapred.JobClient: map 0% reduce 0%
...

Then I ran the cluster dumper to dump information about the clusters. This command works if I only care about the cluster centroids (both the 0.5 release and 0.6-snapshot):

$ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --output image-tag-clusters.txt

Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
11/07/20 14:33:45 INFO common.AbstractJob: Command line arguments: {--dictionaryType=text, --endPhase=2147483647, --output=image-tag-clusters.txt, --seqFileDir=sensei/clusters/clusters-1, --startPhase=0, --tempDir=temp}
11/07/20 14:33:56 INFO driver.MahoutDriver: Program took 11761 ms

But if I want to see the degree of membership of each point, I get another exception (yes, reproducible for both the 0.5 release and 0.6-snapshot):

$ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --output image-tag-clusters.txt --pointsDir sensei/clusteredPoints

Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
11/07/20 14:35:08 INFO common.AbstractJob: Command line arguments: {--dictionaryType=text, --endPhase=2147483647, --output=image-tag-clusters.txt, --pointsDir=sensei/clusteredPoints, --seqFileDir=sensei/clusters/clusters-1, --startPhase=0, --tempDir=temp}
11/07/20 14:35:10 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/07/20 14:35:10 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
11/07/20 14:35:10 INFO compress.CodecPool: Got brand-new decompressor
Exception in thread "main" java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
    at org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:261)
    at org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:209)
    at org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:123)
    at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:89)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:616)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:616)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

Erm, would writing a short program to call the API be a better choice here (by the way, I can't seem to find the latest API doc)? Or did I do anything wrong (yes, Java is not my main language, and I am very new to Mahout... and Hadoop)?

The data is converted from an ARFF file with about 1000 rows (resources) and 14k columns (tags), and it is just a subset of my data. (I actually made a mistake, so it is now generating resource clusters instead of tag clusters, but I am just doing this as a proof of concept to see whether Mahout is good enough for the task.)

Best wishes,
Jeffrey04
