When you specify -k to k-means, it randomly samples k vectors from your
input data set to use as the initial cluster centers. Those clusters
are written to the -c directory and form the starting point of the
k-means iteration. Each iteration then produces a revised set of
clusters in clusters-x; from your clusters-5 dump it looks like the
computation has not yet converged (CL-21551 has not converged, whereas
VL-21560 has). If you are running this against Reuters, your initial -k
value is probably too low, and you will need to increase --maxIter to
reach convergence.
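To illustrate the per-cluster convergence that flips the dump prefix from CL to VL, here is a minimal plain-Python k-means sketch (an illustration only, not Mahout's actual implementation; the `delta` threshold and 1-D points are my own simplifications): a cluster is marked converged once its center moves less than `delta` between iterations.

```python
# Minimal k-means sketch (illustration only, NOT Mahout's code).
# A cluster counts as "converged" when its center moves less than
# `delta` between iterations -- the analogue of Mahout's CL -> VL flip.
import random

def kmeans(points, k, max_iter, delta=1e-4):
    centers = random.sample(points, k)      # random seeding, like -k
    converged = [False] * k
    for _ in range(max_iter):
        # assign each point to its nearest center
        buckets = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: (p - centers[i]) ** 2)
            buckets[i].append(p)
        # recompute centers and test per-cluster convergence
        for i, b in enumerate(buckets):
            new = sum(b) / len(b) if b else centers[i]
            converged[i] = abs(new - centers[i]) < delta
            centers[i] = new
        if all(converged):
            break
    return centers, converged
```

If max_iter is too small, the loop exits with some clusters still unconverged, which is exactly the CL-prefixed state you see in the dump.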
As for importing into Weka, the cluster dumper probably won't be of
much use; I suggest writing your own job to convert the clusters into a
format you can use more directly.
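As a sketch of such a conversion, here is a small plain-Python script (not a Hadoop job; the line format is assumed from the two rows quoted below, and the `parse_cluster`/`to_arff` names are hypothetical) that parses CL-/VL- lines from the clusterdump text output and emits the centroids as a Weka ARFF file:

```python
# Hypothetical converter: clusterdump text lines -> Weka ARFF.
# Assumes lines shaped like "VL-21560{n=19722 c=[0:0.012,1:0.005]"
# with well-formed idx:val pairs inside the brackets.
import re

LINE = re.compile(r'(CL|VL)-(\d+)\s*\{?n=(\d+)\s+c\s*=?\s*\[([^\]]*)')

def parse_cluster(line):
    m = LINE.search(line)
    if not m:
        return None
    state, cid, n, coords = m.groups()
    center = {}
    for pair in coords.split(','):
        if ':' in pair:
            idx, val = pair.split(':')
            center[int(idx)] = float(val)
    return {'id': int(cid), 'converged': state == 'VL',
            'n': int(n), 'center': center}

def to_arff(clusters, dims, relation='centroids'):
    # One numeric attribute per dimension, one data row per centroid;
    # missing dimensions default to 0.0 (sparse vectors).
    lines = ['@relation ' + relation]
    lines += ['@attribute a%d numeric' % d for d in range(dims)]
    lines.append('@data')
    for c in clusters:
        lines.append(','.join(str(c['center'].get(d, 0.0))
                              for d in range(dims)))
    return '\n'.join(lines)
```

A real job would read the SequenceFiles in clusters-x directly rather than scraping the dumper's text, but the ARFF shape it needs to produce is the same.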
On 8/30/10 11:20 AM, Valerio Ceraudo wrote:
In the clusters folder I used there is the file part-randomSeed,
created by the command:
bin/mahout kmeans -i /home/vuvvo/reuters-out-seqdir-sparse/tfidf-vectors/ -c
/home/vuvvo/clusters -o /home/vuvvo/reuters-kmeans -k 3 --maxIter 5
Do I need to use the files in the reuters-kmeans folder? Inside it I
have some sub-directories called clusters-x, where x runs from 1 to 5.
I tried to give clusters-5 as input, and inside finalOutput I got a
1.4 MB file that is very hard to open, even with 4 GB of RAM on a
64-bit machine ^^
I can read the first and second rows:
CL-21551 {n=1855 c =[1:0.011,2:0.005...to 31:0.012
and
VL-21560{n=19722 c[0:0.012 etc etc...
Is this converged and correct now?
Is there a more comfortable way to read this file? I then need to
convert it into data for Weka.
Tonight I will try to convert an ARFF data set into an .sgm file to see
what I obtain with it.