When you specify -k to k-means, it randomly samples k vectors from your
input data set to use as the initial cluster centers. Those clusters
are written to the -c directory and form the starting point of the
k-means iteration. Each iteration then produces a revised set of
clusters in clusters-x; from your clusters-5 dump it looks like the
computation has not yet converged (CL-21551 has not converged, whereas
VL-21560 has). If you are running this against Reuters, your initial -k
value is probably too low, and you will need to increase --maxIter to
reach convergence.
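To illustrate the per-cluster convergence that flips the dump prefix from CL to VL, here is a minimal plain-Python k-means sketch (an illustration only, not Mahout's actual implementation; the `delta` threshold and 1-D points are my own simplifications): a cluster is marked converged once its center moves less than `delta` between iterations.

```python
# Minimal k-means sketch (illustration only, NOT Mahout's code).
# A cluster counts as "converged" when its center moves less than
# `delta` between iterations -- the analogue of Mahout's CL -> VL flip.
import random

def kmeans(points, k, max_iter, delta=1e-4):
    centers = random.sample(points, k)      # random seeding, like -k
    converged = [False] * k
    for _ in range(max_iter):
        # assign each point to its nearest center
        buckets = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: (p - centers[i]) ** 2)
            buckets[i].append(p)
        # recompute centers and test per-cluster convergence
        for i, b in enumerate(buckets):
            new = sum(b) / len(b) if b else centers[i]
            converged[i] = abs(new - centers[i]) < delta
            centers[i] = new
        if all(converged):
            break
    return centers, converged
```

If max_iter is too small, the loop exits with some clusters still unconverged, which is exactly the CL-prefixed state you see in the dump.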
As for importing into Weka, the cluster dumper probably won't be of
much use; I suggest writing your own job to convert the clusters into a
format you can use more directly.
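As a sketch of such a conversion, here is a small plain-Python script (not a Hadoop job; the line format is assumed from the two rows quoted below, and the `parse_cluster`/`to_arff` names are hypothetical) that parses CL-/VL- lines from the clusterdump text output and emits the centroids as a Weka ARFF file:

```python
# Hypothetical converter: clusterdump text lines -> Weka ARFF.
# Assumes lines shaped like "VL-21560{n=19722 c=[0:0.012,1:0.005]"
# with well-formed idx:val pairs inside the brackets.
import re

LINE = re.compile(r'(CL|VL)-(\d+)\s*\{?n=(\d+)\s+c\s*=?\s*\[([^\]]*)')

def parse_cluster(line):
    m = LINE.search(line)
    if not m:
        return None
    state, cid, n, coords = m.groups()
    center = {}
    for pair in coords.split(','):
        if ':' in pair:
            idx, val = pair.split(':')
            center[int(idx)] = float(val)
    return {'id': int(cid), 'converged': state == 'VL',
            'n': int(n), 'center': center}

def to_arff(clusters, dims, relation='centroids'):
    # One numeric attribute per dimension, one data row per centroid;
    # missing dimensions default to 0.0 (sparse vectors).
    lines = ['@relation ' + relation]
    lines += ['@attribute a%d numeric' % d for d in range(dims)]
    lines.append('@data')
    for c in clusters:
        lines.append(','.join(str(c['center'].get(d, 0.0))
                              for d in range(dims)))
    return '\n'.join(lines)
```

A real job would read the SequenceFiles in clusters-x directly rather than scraping the dumper's text, but the ARFF shape it needs to produce is the same.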
On 8/30/10 11:20 AM, Valerio Ceraudo wrote:
In the clusters folder I used there is the file part-randomSeed,
created by the command:
bin/mahout kmeans -i /home/vuvvo/reuters-out-seqdir-sparse/tfidf-vectors/ -c
/home/vuvvo/clusters -o /home/vuvvo/reuters-kmeans -k 3 --maxIter 5
Do I need to use the files in the reuters-kmeans folder? Inside it I
have some sub-directories called clusters-x, where x runs from 1 to 5.
I tried to give clusters-5 as input, and inside finalOutput I got a
1.4 MB file that is very hard to open, even with 4 GB of RAM on a
64-bit machine ^^
I can read the first and second rows:
CL-21551 {n=1855 c =[1:0.011,2:0.005...to 31:0.012
and
VL-21560{n=19722 c[0:0.012 etc etc...
Is this converged and correct now?
Is there a more comfortable way to read this file? I then need to
convert it into data for Weka.
Tonight I will try to convert an ARFF data set into an .sgm file to see
what I obtain with it.