Hi Jeffrey,

It is always difficult to debug remotely, but here are some suggestions:
- First, you are specifying both an input clusters directory (--clusters) and 
--numClusters, so the job is sampling 10 points from your input data set and 
writing them to clusteredPoints as the prior clusters for the first iteration. 
You should pick a different name for this directory, because clusteredPoints is 
the directory used by the -cl (--clustering) option (which you did not supply) 
to write out the clustered (classified) input vectors. When you subsequently 
supplied clusteredPoints to the clusterdumper, it was expecting a different 
format, and that caused the exception you saw. Change your --clusters 
directory (clusters-0 is a good choice) and add a -cl argument, and things 
should go more smoothly. Note that -cl is not on by default, so no clustering 
of the input points is performed without it. (Many people get caught by this, 
and perhaps the default should be changed, but clustering can be expensive, so 
it is not performed unless requested.)
- If you still have problems, try again with k-means. It is similar enough to 
fkmeans that seeing the same problems with k-means would rule out fkmeans 
itself as the cause.
- I don't see why changing the -k argument from 10 to 50 should cause any 
problems, unless your vectors are very large and you are getting an OOME 
(OutOfMemoryError) in the reducer. Since the reducer calculates the centroid 
vectors for the next iteration, these become more dense, and memory use can 
increase substantially.
- I can't figure out what might be causing your second exception. It is 
bombing inside Hadoop file IO, which makes me suspect command-argument 
problems.
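
Putting the first suggestion together, a corrected pair of commands might look 
something like this (a sketch based on the paths from your original command; 
clusters-0 is just a suggested name for the priors directory, and the final 
clusters-N directory name depends on how many iterations actually run):

```shell
# Use a distinct directory for the sampled prior clusters, and request
# classification of the input points with -cl:
bin/mahout fkmeans --input sensei/image-tag.arff.mvc \
  --output sensei/clusters --clusters sensei/clusters-0 \
  --maxIter 10 --numClusters 10 --overwrite --m 5 -cl

# With -cl supplied, the classified points are written under the output
# directory (sensei/clusters/clusteredPoints), which is the format the
# cluster dumper expects for --pointsDir:
bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 \
  --pointsDir sensei/clusters/clusteredPoints \
  --output image-tag-clusters.txt
```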

Hope this helps,
Jeff


-----Original Message-----
From: Jeffrey [mailto:[email protected]] 
Sent: Wednesday, July 20, 2011 2:41 AM
To: [email protected]
Subject: fkmeans or Cluster Dumper not working?

Hi,

I am trying to generate clusters from my test data using the fkmeans command 
line tool. Not sure if this is correct, as it only runs one iteration (output 
from 0.6-snapshot; I had to use a workaround for a weird bug 
- http://search.lucidimagination.com/search/document/d95ff0c29ac4a8a7/bug_in_fkmeans
 )

$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters 
--clusters sensei/clusteredPoints --maxIter 10 --numClusters 10 --overwrite --m 
5
Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
11/07/20 14:05:18 INFO common.AbstractJob: Command line arguments: 
{--clusters=sensei/clusteredPoints, --convergenceDelta=0.5, 
--distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
 --emitMostLikely=true, --endPhase=2147483647, 
--input=sensei/image-tag.arff.mvc, --m=5, --maxIter=10, --method=mapreduce, 
--numClusters=10, --output=sensei/clusters, --overwrite=null, --startPhase=0, 
--tempDir=temp, --threshold=0}
11/07/20 14:05:20 INFO common.HadoopUtil: Deleting sensei/clusters
11/07/20 14:05:20 INFO common.HadoopUtil: Deleting sensei/clusteredPoints
11/07/20 14:05:20 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/07/20 14:05:20 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
11/07/20 14:05:20 INFO compress.CodecPool: Got brand-new compressor
11/07/20 14:05:20 INFO compress.CodecPool: Got brand-new decompressor
11/07/20 14:05:29 INFO kmeans.RandomSeedGenerator: Wrote 10 vectors to 
sensei/clusteredPoints/part-randomSeed
11/07/20 14:05:29 INFO fuzzykmeans.FuzzyKMeansDriver: Fuzzy K-Means Iteration 1
11/07/20 14:05:30 INFO input.FileInputFormat: Total input paths to process : 1
11/07/20 14:05:30 INFO mapred.JobClient: Running job: job_201107201152_0021
11/07/20 14:05:31 INFO mapred.JobClient:  map 0% reduce 0%
11/07/20 14:05:54 INFO mapred.JobClient:  map 2% reduce 0%
11/07/20 14:05:57 INFO mapred.JobClient:  map 5% reduce 0%
11/07/20 14:06:00 INFO mapred.JobClient:  map 6% reduce 0%
11/07/20 14:06:03 INFO mapred.JobClient:  map 7% reduce 0%
11/07/20 14:06:07 INFO mapred.JobClient:  map 10% reduce 0%
11/07/20 14:06:10 INFO mapred.JobClient:  map 13% reduce 0%
11/07/20 14:06:13 INFO mapred.JobClient:  map 15% reduce 0%
11/07/20 14:06:16 INFO mapred.JobClient:  map 17% reduce 0%
11/07/20 14:06:19 INFO mapred.JobClient:  map 19% reduce 0%
11/07/20 14:06:22 INFO mapred.JobClient:  map 23% reduce 0%
11/07/20 14:06:25 INFO mapred.JobClient:  map 25% reduce 0%
11/07/20 14:06:28 INFO mapred.JobClient:  map 27% reduce 0%
11/07/20 14:06:31 INFO mapred.JobClient:  map 30% reduce 0%
11/07/20 14:06:34 INFO mapred.JobClient:  map 33% reduce 0%
11/07/20 14:06:37 INFO mapred.JobClient:  map 36% reduce 0%
11/07/20 14:06:40 INFO mapred.JobClient:  map 37% reduce 0%
11/07/20 14:06:43 INFO mapred.JobClient:  map 40% reduce 0%
11/07/20 14:06:46 INFO mapred.JobClient:  map 43% reduce 0%
11/07/20 14:06:49 INFO mapred.JobClient:  map 46% reduce 0%
11/07/20 14:06:52 INFO mapred.JobClient:  map 48% reduce 0%
11/07/20 14:06:55 INFO mapred.JobClient:  map 50% reduce 0%
11/07/20 14:06:57 INFO mapred.JobClient:  map 53% reduce 0%
11/07/20 14:07:00 INFO mapred.JobClient:  map 56% reduce 0%
11/07/20 14:07:03 INFO mapred.JobClient:  map 58% reduce 0%
11/07/20 14:07:06 INFO mapred.JobClient:  map 60% reduce 0%
11/07/20 14:07:09 INFO mapred.JobClient:  map 63% reduce 0%
11/07/20 14:07:13 INFO mapred.JobClient:  map 65% reduce 0%
11/07/20 14:07:16 INFO mapred.JobClient:  map 67% reduce 0%
11/07/20 14:07:19 INFO mapred.JobClient:  map 70% reduce 0%
11/07/20 14:07:22 INFO mapred.JobClient:  map 73% reduce 0%
11/07/20 14:07:25 INFO mapred.JobClient:  map 75% reduce 0%
11/07/20 14:07:28 INFO mapred.JobClient:  map 77% reduce 0%
11/07/20 14:07:31 INFO mapred.JobClient:  map 80% reduce 0%
11/07/20 14:07:34 INFO mapred.JobClient:  map 83% reduce 0%
11/07/20 14:07:37 INFO mapred.JobClient:  map 85% reduce 0%
11/07/20 14:07:40 INFO mapred.JobClient:  map 87% reduce 0%
11/07/20 14:07:43 INFO mapred.JobClient:  map 89% reduce 0%
11/07/20 14:07:46 INFO mapred.JobClient:  map 92% reduce 0%
11/07/20 14:07:49 INFO mapred.JobClient:  map 95% reduce 0%
11/07/20 14:07:55 INFO mapred.JobClient:  map 98% reduce 0%
11/07/20 14:07:59 INFO mapred.JobClient:  map 99% reduce 0%
11/07/20 14:08:02 INFO mapred.JobClient:  map 100% reduce 0%
11/07/20 14:08:23 INFO mapred.JobClient:  map 100% reduce 100%
11/07/20 14:08:31 INFO mapred.JobClient: Job complete: job_201107201152_0021
11/07/20 14:08:31 INFO mapred.JobClient: Counters: 26
11/07/20 14:08:31 INFO mapred.JobClient:   Job Counters 
11/07/20 14:08:31 INFO mapred.JobClient:     Launched reduce tasks=1
11/07/20 14:08:31 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=149314
11/07/20 14:08:31 INFO mapred.JobClient:     Total time spent by all reduces 
waiting after reserving slots (ms)=0
11/07/20 14:08:31 INFO mapred.JobClient:     Total time spent by all maps 
waiting after reserving slots (ms)=0
11/07/20 14:08:31 INFO mapred.JobClient:     Launched map tasks=1
11/07/20 14:08:31 INFO mapred.JobClient:     Data-local map tasks=1
11/07/20 14:08:31 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=15618
11/07/20 14:08:31 INFO mapred.JobClient:   File Output Format Counters 
11/07/20 14:08:31 INFO mapred.JobClient:     Bytes Written=2247222
11/07/20 14:08:31 INFO mapred.JobClient:   Clustering
11/07/20 14:08:31 INFO mapred.JobClient:     Converged Clusters=10
11/07/20 14:08:31 INFO mapred.JobClient:   FileSystemCounters
11/07/20 14:08:31 INFO mapred.JobClient:     FILE_BYTES_READ=130281382
11/07/20 14:08:31 INFO mapred.JobClient:     HDFS_BYTES_READ=254494
11/07/20 14:08:31 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=132572666
11/07/20 14:08:31 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=2247222
11/07/20 14:08:31 INFO mapred.JobClient:   File Input Format Counters 
11/07/20 14:08:31 INFO mapred.JobClient:     Bytes Read=247443
11/07/20 14:08:31 INFO mapred.JobClient:   Map-Reduce Framework
11/07/20 14:08:31 INFO mapred.JobClient:     Reduce input groups=10
11/07/20 14:08:31 INFO mapred.JobClient:     Map output materialized 
bytes=2246233
11/07/20 14:08:32 INFO mapred.JobClient:     Combine output records=330
11/07/20 14:08:32 INFO mapred.JobClient:     Map input records=1113
11/07/20 14:08:32 INFO mapred.JobClient:     Reduce shuffle bytes=2246233
11/07/20 14:08:32 INFO mapred.JobClient:     Reduce output records=10
11/07/20 14:08:32 INFO mapred.JobClient:     Spilled Records=590
11/07/20 14:08:32 INFO mapred.JobClient:     Map output bytes=2499995001
11/07/20 14:08:32 INFO mapred.JobClient:     Combine input records=11450
11/07/20 14:08:32 INFO mapred.JobClient:     Map output records=11130
11/07/20 14:08:32 INFO mapred.JobClient:     SPLIT_RAW_BYTES=127
11/07/20 14:08:32 INFO mapred.JobClient:     Reduce input records=10
11/07/20 14:08:32 INFO driver.MahoutDriver: Program took 194096 ms

If I increase the --numClusters argument (e.g. to 50), it throws an exception 
after 
11/07/20 14:08:02 INFO mapred.JobClient:  map 100% reduce 0%

and then retries (also reproducible with 0.6-snapshot):

...
11/07/20 14:22:25 INFO mapred.JobClient:  map 100% reduce 0%
11/07/20 14:22:30 INFO mapred.JobClient: Task Id : 
attempt_201107201152_0022_m_000000_0, Status : FAILED
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid 
local directory for output/file.out
        at 
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
        at 
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
        at 
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
        at 
org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69)
        at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1639)
        at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1322)
        at 
org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:698)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:765)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:416)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
        at org.apache.hadoop.mapred.Child.main(Child.java:253)

11/07/20 14:22:32 INFO mapred.JobClient:  map 0% reduce 0%
...

Then I ran the cluster dumper to dump information about the clusters. This 
command works if I only care about the cluster centroids (both the 0.5 release 
and 0.6-snapshot):

$ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --output 
image-tag-clusters.txt
Running on hadoop, using 
HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
MAHOUT-JOB: 
/home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
11/07/20 14:33:45 INFO common.AbstractJob: Command line arguments: 
{--dictionaryType=text, --endPhase=2147483647, --output=image-tag-clusters.txt, 
--seqFileDir=sensei/clusters/clusters-1, --startPhase=0, --tempDir=temp}
11/07/20 14:33:56 INFO driver.MahoutDriver: Program took 11761 ms

But if I want to see the degree of membership of each point, I get another 
exception (yes, reproducible for both the 0.5 release and 0.6-snapshot):

$ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --output 
image-tag-clusters.txt --pointsDir sensei/clusteredPoints
Running on hadoop, using 
HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
MAHOUT-JOB: 
/home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
11/07/20 14:35:08 INFO common.AbstractJob: Command line arguments: 
{--dictionaryType=text, --endPhase=2147483647, --output=image-tag-clusters.txt, 
--pointsDir=sensei/clusteredPoints, --seqFileDir=sensei/clusters/clusters-1, 
--startPhase=0, --tempDir=temp}
11/07/20 14:35:10 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/07/20 14:35:10 INFO zlib.ZlibFactory: Successfully loaded & initialized 
native-zlib library
11/07/20 14:35:10 INFO compress.CodecPool: Got brand-new decompressor
Exception in thread "main" java.lang.ClassCastException: 
org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
        at 
org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:261)
        at 
org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:209)
        at 
org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:123)
        at 
org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:89)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at 
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

Erm, would writing a short program to call the API be a better choice here 
(btw, I can't seem to find the latest API docs)? Or did I do something wrong 
(yes, Java is not my main language, and I am very new to Mahout... and 
Hadoop)?

The data is converted from an arff file with about 1000 rows (resources) and 
14k columns (tags), and it is just a subset of my data. (I actually made a 
mistake, so it is now generating resource clusters instead of tag clusters, 
but I am just doing this as a proof of concept to see whether Mahout is good 
enough for the task.)

Best wishes,
Jeffrey04
