I looked at the Reuters example in MiA and it has not yet been updated
to reflect the recent file-nomenclature changes in trunk. It was
actually incorrect for 0.3 too, as it shows the contents of
reuters-vectors after seq2sparse to be (on p. 132):
$ ls reuters-vectors/
dictionary.file-0
tfidf/
tokenized-documents/
vectors/
wordcount/
but then (on p. 144) it gives the input argument to k-means as:
-i reuters-vectors
which should have been:
-i reuters-vectors/tfidf (and maybe also /vectors after that, IIRC; it's
been a few months since it was changed)
As noted below, the current nomenclature after seq2sparse is:
$ ls reuters-out-seqdir-sparse/
df-count/
dictionary.file-0
frequency.file-0
tf-vectors/
tfidf-vectors/
tokenized-documents/
wordcount/
We will need to get the book examples and the code in sync with
whichever release coincides with its final publication. Both are moving
targets right now. Given Mahout's rate of change, we always recommend
using trunk, and the trunk examples are the most likely to work.
On 10/5/10 6:24 PM, Jeff Eastman wrote:
The random seed generator can't read the parts in the input folder
"reuters-vectors". What is in that directory? The program is expecting
part files containing VectorWritable points. If you ran
examples/bin/build-reuters.sh then the input to k-means (see the
script) should be:
-i ./examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/
I suggest running the script with the k-means clustering step
uncommented before venturing outside the standard file nomenclature.
Jeff
On 10/5/10 4:17 PM, Chris Bush wrote:
When I try k-means clustering on the Reuters example data (the
Reuters-21578 news collection) as covered in Mahout in Action, the
following stack trace occurs immediately (with and without HADOOP_HOME
set; with it set, the "no HADOOP_HOME set" warning is omitted):
$ bin/mahout kmeans -i reuters-vectors -c reuters-initial-clusters \
    -o reuters-kmeans-clusters \
    -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
    -r 1 -cd 1.0 -k 20 -x 10
no HADOOP_HOME set, running locally
Oct 5, 2010 2:27:28 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--clusters=reuters-initial-clusters,
--convergenceDelta=1.0,
--distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
--endPhase=2147483647, --input=reuters-vectors, --maxIter=10,
--maxRed=1,
--method=mapreduce, --numClusters=20, --output=reuters-kmeans-clusters,
--startPhase=0, --tempDir=temp}
Oct 5, 2010 2:27:29 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Deleting reuters-initial-clusters
Oct 5, 2010 2:27:29 PM org.apache.hadoop.util.NativeCodeLoader<clinit>
WARNING: Unable to load native-hadoop library for your platform... using
builtin-java classes where applicable
Oct 5, 2010 2:27:29 PM org.apache.hadoop.io.compress.CodecPool
getCompressor
INFO: Got brand-new compressor
Exception in thread "main" java.lang.ClassCastException: class
org.apache.hadoop.io.IntWritable
at java.lang.Class.asSubclass(Class.java:3018)
at
org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:86)
at
org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:139)
at
org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:53)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:175)
$
The org.apache.mahout.clustering.kmeans.RandomSeedGenerator class
successfully casts the key class from SequenceFile.Reader to
org.apache.hadoop.io.Writable, but its attempt to cast the value class
to org.apache.mahout.math.VectorWritable fails.
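For what it's worth, the failure mechanism can be reproduced with plain
JDK classes (the Writable/IntWritable/VectorWritable below are
stdlib-only stand-ins, not the actual Hadoop/Mahout types):
Class.asSubclass throws ClassCastException whenever the class read from
the SequenceFile header is not actually a subclass of the requested
type, which is presumably what happens when a non-vector file such as
dictionary.file-0 (Text keys, IntWritable values) is picked up from the
top-level reuters-vectors directory.

```java
// Minimal stdlib-only sketch of the suspected failure mode: stand-in
// types with the same subclass relationships as the Hadoop/Mahout ones.
public class AsSubclassDemo {
    // Stand-ins: both value types are Writables, but neither is the other.
    interface Writable {}                               // org.apache.hadoop.io.Writable
    static class IntWritable implements Writable {}     // dictionary.file-0 value type
    static class VectorWritable implements Writable {}  // what the seed generator expects

    public static void main(String[] args) {
        // As if read from a SequenceFile header of dictionary.file-0.
        Class<?> valueClass = IntWritable.class;

        // The key-style cast succeeds: IntWritable is a Writable.
        Class<? extends Writable> ok = valueClass.asSubclass(Writable.class);
        System.out.println("cast to Writable ok: " + ok.getSimpleName());

        // The value cast fails: IntWritable is not a VectorWritable,
        // so asSubclass throws ClassCastException naming the class.
        try {
            valueClass.asSubclass(VectorWritable.class);
        } catch (ClassCastException e) {
            System.out.println("ClassCastException: " + e.getMessage());
        }
    }
}
```

Pointing -i at a directory containing only VectorWritable part files
(e.g. tfidf-vectors/) should avoid the bad cast entirely.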
Thanks,
Chris