Hi,
I've been going over the kmeans stuff the last few days to try and
understand how it works, and how I might extend it to work with
the data I'm looking to process. It's taken me a while to get a
basic understanding of things, and I really appreciate having lists
like this around for support.
I need to be able to label the vectors: each vector holds (for a
document) a set of similarity scores across a number of
attributes. I searched for information on payloads (after coming
across the term in some comments) but couldn't see how to add a
payload to a Vector. I then stumbled on MAHOUT-65
(https://issues.apache.org/jira/browse/MAHOUT-65), which mentions
the addition of the setName method to Vector. I've
tried building trunk, and although there were a few test failures
for other (seemingly unrelated) examples I continued and managed
to get the mahout-examples jar/job files built to give it a whirl.
When I run the following:
$ hadoop jar examples/target/mahout-examples-0.2-SNAPSHOT.job \
    org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
I see it run "Preparing Input" and "Running Canopy to get initial
clusters", and then finally it starts "Running KMeans". Shortly
afterwards, though, it fails with the following trace:
---snip---
Running KMeans
09/07/13 23:49:34 INFO kmeans.KMeansDriver: Input: output/data Clusters In: output/canopies Out: output Distance: org.apache.mahout.utils.EuclideanDistanceMeasure
09/07/13 23:49:34 INFO kmeans.KMeansDriver: convergence: 0.5 max Iterations: 10 num Reduce Tasks: 1 Input Vectors: org.apache.mahout.matrix.SparseVector
09/07/13 23:49:34 INFO kmeans.KMeansDriver: Iteration 0
09/07/13 23:49:34 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
09/07/13 23:49:34 INFO mapred.FileInputFormat: Total input paths to process : 2
09/07/13 23:49:34 INFO mapred.JobClient: Running job: job_200907132019_0040
09/07/13 23:49:35 INFO mapred.JobClient: map 0% reduce 0%
09/07/13 23:49:42 INFO mapred.JobClient: map 50% reduce 0%
09/07/13 23:49:43 INFO mapred.JobClient: map 100% reduce 0%
09/07/13 23:49:49 INFO mapred.JobClient: map 100% reduce 100%
09/07/13 23:49:50 INFO mapred.JobClient: Job complete: job_200907132019_0040
09/07/13 23:49:50 INFO mapred.JobClient: Counters: 16
09/07/13 23:49:50 INFO mapred.JobClient: File Systems
09/07/13 23:49:50 INFO mapred.JobClient: HDFS bytes read=465629
09/07/13 23:49:50 INFO mapred.JobClient: HDFS bytes written=5631
09/07/13 23:49:50 INFO mapred.JobClient: Local bytes read=7806
09/07/13 23:49:50 INFO mapred.JobClient: Local bytes written=15674
09/07/13 23:49:50 INFO mapred.JobClient: Job Counters
09/07/13 23:49:50 INFO mapred.JobClient: Launched reduce tasks=1
09/07/13 23:49:50 INFO mapred.JobClient: Launched map tasks=2
09/07/13 23:49:50 INFO mapred.JobClient: Data-local map tasks=2
09/07/13 23:49:50 INFO mapred.JobClient: Map-Reduce Framework
09/07/13 23:49:50 INFO mapred.JobClient: Reduce input groups=7
09/07/13 23:49:50 INFO mapred.JobClient: Combine output records=10
09/07/13 23:49:50 INFO mapred.JobClient: Map input records=600
09/07/13 23:49:50 INFO mapred.JobClient: Reduce output records=7
09/07/13 23:49:50 INFO mapred.JobClient: Map output bytes=465600
09/07/13 23:49:50 INFO mapred.JobClient: Map input bytes=448580
09/07/13 23:49:50 INFO mapred.JobClient: Combine input records=600
09/07/13 23:49:50 INFO mapred.JobClient: Map output records=600
09/07/13 23:49:50 INFO mapred.JobClient: Reduce input records=10
09/07/13 23:49:50 WARN kmeans.KMeansDriver: java.io.IOException: Cannot open filename /user/paul/output/clusters-0/_logs
java.io.IOException: Cannot open filename /user/paul/output/clusters-0/_logs
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1394)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1385)
    at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:338)
    at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:171)
    at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1437)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1424)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.isConverged(KMeansDriver.java:304)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:241)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.runJob(KMeansDriver.java:194)
    at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:100)
    at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:56)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
    at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
---snip---
This is against revision 793689, running on my development Mac Pro
(pseudo-distributed single node) with Hadoop 0.19.1.
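Going by the trace alone (I have not read the source yet), it looks as
though KMeansDriver.isConverged opens every entry under
output/clusters-0 as a SequenceFile, including the _logs directory,
which is not a data file. If that guess is right, one possible
workaround would be to skip anything that is not reducer output. The
sketch below is plain Java with no Hadoop dependency, just to show the
filtering idea; the class and method names are hypothetical, and the
"part-" prefix is my assumption about how the reducer output files are
named. In Hadoop itself the same predicate could presumably go into an
org.apache.hadoop.fs.PathFilter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Mahout or Hadoop.
public class ClusterOutputFilter {

    // Assumption: reducer output files are named part-00000, part-00001, ...
    public static boolean isDataFile(String name) {
        return name.startsWith("part-");
    }

    // Keeps only the paths whose last component looks like a data file,
    // so entries such as _logs are skipped instead of being opened.
    public static List<String> dataFiles(List<String> paths) {
        List<String> kept = new ArrayList<String>();
        for (String path : paths) {
            String name = path.substring(path.lastIndexOf('/') + 1);
            if (isDataFile(name)) {
                kept.add(path);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The _logs path is the one from the trace; the part file is assumed.
        List<String> listing = Arrays.asList(
            "/user/paul/output/clusters-0/_logs",
            "/user/paul/output/clusters-0/part-00000");
        System.out.println(dataFiles(listing));
    }
}
```

This only illustrates the idea; whether the convergence check really
iterates over the directory this way is something I still need to
confirm in KMeansDriver.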
It's a bit late to be digging through what's going on, but I will
try to take a look tomorrow. I'm really excited about giving kmeans
a whirl on the document processing I'm playing with. In the
meantime, I was wondering whether anyone else has seen the same
error, knows a way to accomplish something similar with the
released version, or could point me to a known-good past revision.
Thanks again,
Paul