I looked at the Reuters example in MiA and it has not yet been updated
to reflect the recent file-nomenclature changes in trunk. It was
actually incorrect for 0.3 too, as it shows the contents of
reuters-vectors after seq2sparse to be (on p. 132):
$ ls reuters-vectors/
dictionary.file-0
tfidf/
tokenized-documents/
vectors/
wordcount/
but then (on p. 144) it gives the input argument to k-means as:
-i reuters-vectors
which should have been:
-i reuters-vectors/tfidf (and maybe also /vectors after that, IIRC; it's
been a few months since it was changed)
As noted below, the current nomenclature after seq2sparse is:
$ ls reuters-out-seqdir-sparse/
df-count/
dictionary.file-0
frequency.file-0
tf-vectors/
tfidf-vectors/
tokenized-documents/
wordcount/
We will need to get the book examples and the code in sync with
whichever release coincides with its final publication. Both are moving
targets right now. Given Mahout's rate of change, we always recommend
using trunk, and the trunk examples are the most likely to work.
On 10/5/10 6:24 PM, Jeff Eastman wrote:
The random seed generator can't read the parts in the input folder
"reuters-vectors". What is in that directory? The program is expecting
part files containing VectorWritable points. If you ran
examples/bin/build-reuters.sh then the input to k-means (see the
script) should be:
-i ./examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/
I suggest running the script with the k-means clustering step
uncommented before venturing outside the standard file nomenclature.
Jeff
On 10/5/10 4:17 PM, Chris Bush wrote:
When I try k-means clustering on the Reuters example data (the
Reuters-21578 news collection) as covered in Mahout in Action, the
following stack trace occurs immediately (with and without HADOOP_HOME
set; with it set, the "no HADOOP_HOME set" warning is omitted):
$ bin/mahout kmeans -i reuters-vectors -c reuters-initial-clusters \
    -o reuters-kmeans-clusters \
    -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
    -r 1 -cd 1.0 -k 20 -x 10
no HADOOP_HOME set, running locally
Oct 5, 2010 2:27:28 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--clusters=reuters-initial-clusters,
--convergenceDelta=1.0,
--distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
--endPhase=2147483647, --input=reuters-vectors, --maxIter=10,
--maxRed=1,
--method=mapreduce, --numClusters=20, --output=reuters-kmeans-clusters,
--startPhase=0, --tempDir=temp}
Oct 5, 2010 2:27:29 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Deleting reuters-initial-clusters
Oct 5, 2010 2:27:29 PM org.apache.hadoop.util.NativeCodeLoader<clinit>
WARNING: Unable to load native-hadoop library for your platform... using
builtin-java classes where applicable
Oct 5, 2010 2:27:29 PM org.apache.hadoop.io.compress.CodecPool
getCompressor
INFO: Got brand-new compressor
Exception in thread "main" java.lang.ClassCastException: class
org.apache.hadoop.io.IntWritable
at java.lang.Class.asSubclass(Class.java:3018)
at
org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:86)
at
org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:139)
at
org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:53)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:175)
$
The org.apache.mahout.clustering.kmeans.RandomSeedGenerator class
successfully casts the key class from SequenceFile.Reader to
org.apache.hadoop.io.Writable, but its attempt to cast the value class
to org.apache.mahout.math.VectorWritable fails.
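For what it's worth, the failure mechanism can be reproduced with plain
JDK classes (the Writable/IntWritable/VectorWritable below are
stdlib-only stand-ins, not the actual Hadoop/Mahout types):
Class.asSubclass throws ClassCastException whenever the class read from
the SequenceFile header is not actually a subclass of the requested
type, which is presumably what happens when a non-vector file such as
dictionary.file-0 (Text keys, IntWritable values) is picked up from the
top-level reuters-vectors directory.

```java
// Minimal stdlib-only sketch of the suspected failure mode: stand-in
// types with the same subclass relationships as the Hadoop/Mahout ones.
public class AsSubclassDemo {
    // Stand-ins: both value types are Writables, but neither is the other.
    interface Writable {}                               // org.apache.hadoop.io.Writable
    static class IntWritable implements Writable {}     // dictionary.file-0 value type
    static class VectorWritable implements Writable {}  // what the seed generator expects

    public static void main(String[] args) {
        // As if read from a SequenceFile header of dictionary.file-0.
        Class<?> valueClass = IntWritable.class;

        // The key-style cast succeeds: IntWritable is a Writable.
        Class<? extends Writable> ok = valueClass.asSubclass(Writable.class);
        System.out.println("cast to Writable ok: " + ok.getSimpleName());

        // The value cast fails: IntWritable is not a VectorWritable,
        // so asSubclass throws ClassCastException naming the class.
        try {
            valueClass.asSubclass(VectorWritable.class);
        } catch (ClassCastException e) {
            System.out.println("ClassCastException: " + e.getMessage());
        }
    }
}
```

Pointing -i at a directory containing only VectorWritable part files
(e.g. tfidf-vectors/) should avoid the bad cast entirely.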
Thanks,
Chris