fkmeans or Cluster Dumper not working?

Jeffrey Tue, 19 Jul 2011 23:42:13 -0700

Hi,

I am trying to generate clusters from my test data with the fkmeans command 
line tool. I am not sure the result is correct, because it runs only one 
iteration. (The output below is from 0.6-snapshot, since I needed a workaround 
for a weird bug: 
http://search.lucidimagination.com/search/document/d95ff0c29ac4a8a7/bug_in_fkmeans )


$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters \
    --clusters sensei/clusteredPoints --maxIter 10 --numClusters 10 --overwrite --m 5
Running on hadoop, using 
HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
MAHOUT-JOB: 
/home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
11/07/20 14:05:18 INFO common.AbstractJob: Command line arguments: 
{--clusters=sensei/clusteredPoints, --convergenceDelta=0.5, 
--distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, 
--emitMostLikely=true, --endPhase=2147483647, 
--input=sensei/image-tag.arff.mvc, --m=5, --maxIter=10, --method=mapreduce, 
--numClusters=10, --output=sensei/clusters, --overwrite=null, --startPhase=0, 
--tempDir=temp, --threshold=0}
11/07/20 14:05:20 INFO common.HadoopUtil: Deleting sensei/clusters
11/07/20 14:05:20 INFO common.HadoopUtil: Deleting sensei/clusteredPoints
11/07/20 14:05:20 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/07/20 14:05:20 INFO zlib.ZlibFactory: Successfully loaded & initialized 
native-zlib library
11/07/20 14:05:20 INFO compress.CodecPool: Got brand-new compressor
11/07/20 14:05:20 INFO compress.CodecPool: Got brand-new decompressor
11/07/20 14:05:29 INFO kmeans.RandomSeedGenerator: Wrote 10 vectors to 
sensei/clusteredPoints/part-randomSeed
11/07/20 14:05:29 INFO fuzzykmeans.FuzzyKMeansDriver: Fuzzy K-Means Iteration 1
11/07/20 14:05:30 INFO input.FileInputFormat: Total input paths to process : 1
11/07/20 14:05:30 INFO mapred.JobClient: Running job: job_201107201152_0021
11/07/20 14:05:31 INFO mapred.JobClient:  map 0% reduce 0%
11/07/20 14:05:54 INFO mapred.JobClient:  map 2% reduce 0%
11/07/20 14:05:57 INFO mapred.JobClient:  map 5% reduce 0%
11/07/20 14:06:00 INFO mapred.JobClient:  map 6% reduce 0%
11/07/20 14:06:03 INFO mapred.JobClient:  map 7% reduce 0%
11/07/20 14:06:07 INFO mapred.JobClient:  map 10% reduce 0%
11/07/20 14:06:10 INFO mapred.JobClient:  map 13% reduce 0%
11/07/20 14:06:13 INFO mapred.JobClient:  map 15% reduce 0%
11/07/20 14:06:16 INFO mapred.JobClient:  map 17% reduce 0%
11/07/20 14:06:19 INFO mapred.JobClient:  map 19% reduce 0%
11/07/20 14:06:22 INFO mapred.JobClient:  map 23% reduce 0%
11/07/20 14:06:25 INFO mapred.JobClient:  map 25% reduce 0%
11/07/20 14:06:28 INFO mapred.JobClient:  map 27% reduce 0%
11/07/20 14:06:31 INFO mapred.JobClient:  map 30% reduce 0%
11/07/20 14:06:34 INFO mapred.JobClient:  map 33% reduce 0%
11/07/20 14:06:37 INFO mapred.JobClient:  map 36% reduce 0%
11/07/20 14:06:40 INFO mapred.JobClient:  map 37% reduce 0%
11/07/20 14:06:43 INFO mapred.JobClient:  map 40% reduce 0%
11/07/20 14:06:46 INFO mapred.JobClient:  map 43% reduce 0%
11/07/20 14:06:49 INFO mapred.JobClient:  map 46% reduce 0%
11/07/20 14:06:52 INFO mapred.JobClient:  map 48% reduce 0%
11/07/20 14:06:55 INFO mapred.JobClient:  map 50% reduce 0%
11/07/20 14:06:57 INFO mapred.JobClient:  map 53% reduce 0%
11/07/20 14:07:00 INFO mapred.JobClient:  map 56% reduce 0%
11/07/20 14:07:03 INFO mapred.JobClient:  map 58% reduce 0%
11/07/20 14:07:06 INFO mapred.JobClient:  map 60% reduce 0%
11/07/20 14:07:09 INFO mapred.JobClient:  map 63% reduce 0%
11/07/20 14:07:13 INFO mapred.JobClient:  map 65% reduce 0%
11/07/20 14:07:16 INFO mapred.JobClient:  map 67% reduce 0%
11/07/20 14:07:19 INFO mapred.JobClient:  map 70% reduce 0%
11/07/20 14:07:22 INFO mapred.JobClient:  map 73% reduce 0%
11/07/20 14:07:25 INFO mapred.JobClient:  map 75% reduce 0%
11/07/20 14:07:28 INFO mapred.JobClient:  map 77% reduce 0%
11/07/20 14:07:31 INFO mapred.JobClient:  map 80% reduce 0%
11/07/20 14:07:34 INFO mapred.JobClient:  map 83% reduce 0%
11/07/20 14:07:37 INFO mapred.JobClient:  map 85% reduce 0%
11/07/20 14:07:40 INFO mapred.JobClient:  map 87% reduce 0%
11/07/20 14:07:43 INFO mapred.JobClient:  map 89% reduce 0%
11/07/20 14:07:46 INFO mapred.JobClient:  map 92% reduce 0%
11/07/20 14:07:49 INFO mapred.JobClient:  map 95% reduce 0%
11/07/20 14:07:55 INFO mapred.JobClient:  map 98% reduce 0%
11/07/20 14:07:59 INFO mapred.JobClient:  map 99% reduce 0%
11/07/20 14:08:02 INFO mapred.JobClient:  map 100% reduce 0%
11/07/20 14:08:23 INFO mapred.JobClient:  map 100% reduce 100%
11/07/20 14:08:31 INFO mapred.JobClient: Job complete: job_201107201152_0021
11/07/20 14:08:31 INFO mapred.JobClient: Counters: 26
11/07/20 14:08:31 INFO mapred.JobClient:   Job Counters 
11/07/20 14:08:31 INFO mapred.JobClient:     Launched reduce tasks=1
11/07/20 14:08:31 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=149314
11/07/20 14:08:31 INFO mapred.JobClient:     Total time spent by all reduces 
waiting after reserving slots (ms)=0
11/07/20 14:08:31 INFO mapred.JobClient:     Total time spent by all maps 
waiting after reserving slots (ms)=0
11/07/20 14:08:31 INFO mapred.JobClient:     Launched map tasks=1
11/07/20 14:08:31 INFO mapred.JobClient:     Data-local map tasks=1
11/07/20 14:08:31 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=15618
11/07/20 14:08:31 INFO mapred.JobClient:   File Output Format Counters 
11/07/20 14:08:31 INFO mapred.JobClient:     Bytes Written=2247222
11/07/20 14:08:31 INFO mapred.JobClient:   Clustering
11/07/20 14:08:31 INFO mapred.JobClient:     Converged Clusters=10
11/07/20 14:08:31 INFO mapred.JobClient:   FileSystemCounters
11/07/20 14:08:31 INFO mapred.JobClient:     FILE_BYTES_READ=130281382
11/07/20 14:08:31 INFO mapred.JobClient:     HDFS_BYTES_READ=254494
11/07/20 14:08:31 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=132572666
11/07/20 14:08:31 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=2247222
11/07/20 14:08:31 INFO mapred.JobClient:   File Input Format Counters 
11/07/20 14:08:31 INFO mapred.JobClient:     Bytes Read=247443
11/07/20 14:08:31 INFO mapred.JobClient:   Map-Reduce Framework
11/07/20 14:08:31 INFO mapred.JobClient:     Reduce input groups=10
11/07/20 14:08:31 INFO mapred.JobClient:     Map output materialized 
bytes=2246233
11/07/20 14:08:32 INFO mapred.JobClient:     Combine output records=330
11/07/20 14:08:32 INFO mapred.JobClient:     Map input records=1113
11/07/20 14:08:32 INFO mapred.JobClient:     Reduce shuffle bytes=2246233
11/07/20 14:08:32 INFO mapred.JobClient:     Reduce output records=10
11/07/20 14:08:32 INFO mapred.JobClient:     Spilled Records=590
11/07/20 14:08:32 INFO mapred.JobClient:     Map output bytes=2499995001
11/07/20 14:08:32 INFO mapred.JobClient:     Combine input records=11450
11/07/20 14:08:32 INFO mapred.JobClient:     Map output records=11130
11/07/20 14:08:32 INFO mapred.JobClient:     SPLIT_RAW_BYTES=127
11/07/20 14:08:32 INFO mapred.JobClient:     Reduce input records=10
11/07/20 14:08:32 INFO driver.MahoutDriver: Program took 194096 ms
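
One way to see why only a single iteration ran is to pull the convergence 
settings out of the log above. The Python sketch below is a minimal helper and 
not part of Mahout; the function name summarize_convergence is made up for 
illustration, and the sample lines are copied from the output above. If every 
cluster is already reported as converged after iteration 1 with 
--convergenceDelta=0.5, the driver probably has nothing left to iterate on, so 
a smaller delta may be worth trying.

```python
import re


def summarize_convergence(log_text):
    """Extract convergenceDelta, numClusters and the converged-cluster count from a log."""
    # Collapse whitespace so values split by mail-client wrapping still match.
    flat = " ".join(log_text.split())
    patterns = {
        "convergence_delta": r"--convergenceDelta=([0-9.]+)",
        "num_clusters": r"--numClusters=(\d+)",
        "converged_clusters": r"Converged Clusters=(\d+)",
    }
    found = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, flat)
        # A missing value is reported as None rather than guessed.
        found[name] = float(match.group(1)) if match else None
    return found


# Lines copied from the fkmeans run above.
sample = """
{--clusters=sensei/clusteredPoints, --convergenceDelta=0.5,
--input=sensei/image-tag.arff.mvc, --m=5, --maxIter=10, --method=mapreduce,
--numClusters=10, --output=sensei/clusters, --overwrite=null, --startPhase=0,
11/07/20 14:08:31 INFO mapred.JobClient:     Converged Clusters=10
"""

info = summarize_convergence(sample)
all_converged = (
    info["converged_clusters"] is not None
    and info["converged_clusters"] == info["num_clusters"]
)
print(info)
print("all clusters converged after this iteration:", all_converged)
```

The same helper can be pointed at a saved copy of the full console output to 
compare runs with different --convergenceDelta values.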

If I increase the --numClusters argument (e.g. to 50), it throws an exception 
after 
11/07/20 14:08:02 INFO mapred.JobClient:  map 100% reduce 0%

and then retries (also reproducible using 0.6-snapshot):

...
11/07/20 14:22:25 INFO mapred.JobClient:  map 100% reduce 0%
11/07/20 14:22:30 INFO mapred.JobClient: Task Id : attempt_201107201152_0022_m_000000_0, Status : FAILED
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/file.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
        at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1639)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1322)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:698)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:765)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:416)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
        at org.apache.hadoop.mapred.Child.main(Child.java:253)

11/07/20 14:22:32 INFO mapred.JobClient:  map 0% reduce 0%
...
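
A back-of-the-envelope check on the counters from the 10-cluster run may help 
when reading the DiskErrorException. The Python sketch below assumes that raw 
map output grows linearly with --numClusters (the mapper emitted 11130 records 
for 1113 inputs, i.e. one per point per cluster). That scaling is an 
assumption, not something the logs confirm for the 50-cluster run, and the 
variable names are illustrative only.

```python
# Counters copied from the 10-cluster run above.
map_input_records = 1113
map_output_records = 11130
map_output_bytes = 2499995001
clusters_before = 10
clusters_after = 50

# One map output record per (point, cluster) pair in the 10-cluster run.
records_per_point = map_output_records // map_input_records

# Average size of a single map output record, in bytes.
bytes_per_record = map_output_bytes / map_output_records

# Assumption: raw map output scales linearly with the number of clusters.
projected_bytes = map_output_bytes * clusters_after // clusters_before

print(f"records per input point: {records_per_point}")
print(f"average bytes per map output record: {bytes_per_record:.0f}")
print(f"projected raw map output for {clusters_after} clusters: "
      f"{projected_bytes / 1e9:.1f} GB")
```

This is only a rough upper bound on the raw output: the combiner shrinks what 
is actually written to disk (Spilled Records=590 in the run above), so the 
estimate does not prove that disk space is the cause. It does suggest that 
checking free space and permissions on the directories configured in 
mapred.local.dir is a reasonable first step.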

Then I ran the cluster dumper to dump information about the clusters. The 
following command works if I only care about the cluster centroids (on both the 
0.5 release and 0.6-snapshot):

$ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 \
    --output image-tag-clusters.txt
Running on hadoop, using 
HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
MAHOUT-JOB: 
/home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
11/07/20 14:33:45 INFO common.AbstractJob: Command line arguments: 
{--dictionaryType=text, --endPhase=2147483647, --output=image-tag-clusters.txt, 
--seqFileDir=sensei/clusters/clusters-1, --startPhase=0, --tempDir=temp}
11/07/20 14:33:56 INFO driver.MahoutDriver: Program took 11761 ms

But if I want to see the degree of membership of each point, I get another 
exception (yes, reproducible on both the 0.5 release and 0.6-snapshot):

$ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 \
    --output image-tag-clusters.txt --pointsDir sensei/clusteredPoints
Running on hadoop, using 
HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
MAHOUT-JOB: 
/home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
11/07/20 14:35:08 INFO common.AbstractJob: Command line arguments: 
{--dictionaryType=text, --endPhase=2147483647, --output=image-tag-clusters.txt, 
--pointsDir=sensei/clusteredPoints, --seqFileDir=sensei/clusters/clusters-1, 
--startPhase=0, --tempDir=temp}
11/07/20 14:35:10 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/07/20 14:35:10 INFO zlib.ZlibFactory: Successfully loaded & initialized 
native-zlib library
11/07/20 14:35:10 INFO compress.CodecPool: Got brand-new decompressor
Exception in thread "main" java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
        at org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:261)
        at org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:209)
        at org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:123)
        at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:89)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
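
The ClassCastException suggests that the files under --pointsDir have Text keys 
where ClusterDumper expects IntWritable keys. Note that the fkmeans log above 
reports the random seed vectors being written to 
sensei/clusteredPoints/part-randomSeed, which is the same directory passed as 
--pointsDir here. One way to check what a SequenceFile holds is to read the 
class names from its header. The Python sketch below is an illustration under 
stated assumptions, not a Mahout or Hadoop API: it assumes the file has been 
copied to local disk, that the header uses the layout of recent SequenceFile 
versions (the bytes SEQ, one version byte, then the key and value class names 
as length-prefixed UTF-8 strings), and that each class name is shorter than 
128 bytes so its length fits in one byte. The names 
read_sequence_file_classes and fake_header, and the value class in the demo, 
are hypothetical.

```python
import io


def read_sequence_file_classes(stream):
    """Return (key_class, value_class) read from a SequenceFile header."""
    if stream.read(3) != b"SEQ":
        raise ValueError("not a SequenceFile: missing SEQ magic bytes")
    if len(stream.read(1)) != 1:
        raise ValueError("truncated header: no version byte")

    def read_short_string():
        # Assumption: the name is under 128 bytes, so its length is one byte.
        length_byte = stream.read(1)
        if len(length_byte) != 1 or length_byte[0] > 127:
            raise ValueError("unsupported or truncated string length")
        data = stream.read(length_byte[0])
        if len(data) != length_byte[0]:
            raise ValueError("truncated class name")
        return data.decode("utf-8")

    key_class = read_short_string()
    value_class = read_short_string()
    return key_class, value_class


def fake_header(key_class, value_class):
    """Build a synthetic header in memory, for demonstration only."""
    parts = [b"SEQ", bytes([6])]
    for name in (key_class, value_class):
        encoded = name.encode("utf-8")
        parts.append(bytes([len(encoded)]) + encoded)
    return b"".join(parts)


# The value class here is a made-up placeholder, not a real Mahout class.
demo = io.BytesIO(fake_header("org.apache.hadoop.io.Text",
                              "example.PlaceholderValue"))
key_class, value_class = read_sequence_file_classes(demo)
print("key class:", key_class)
print("value class:", value_class)
```

For a real file, the in-memory demo would be replaced by something like 
open("part-randomSeed", "rb") after copying the file out of HDFS. A key class 
of org.apache.hadoop.io.Text there would be consistent with the exception 
above.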

Would writing a short program that calls the API be a better choice here? (By 
the way, I can't seem to find the latest API doc.) Or did I do something wrong? 
Java is not my main language, and I am very new to Mahout and Hadoop.

The data is converted from an ARFF file with about 1000 rows (resources) and 
14k columns (tags), and it is just a subset of my data. (I actually made a 
mistake, so it is now generating resource clusters instead of tag clusters, but 
this is only a proof of concept to see whether Mahout is good enough for the 
task.)

Best wishes,
Jeffrey04
