Hi,

We have been using K-Means to cluster a fairly large dataset (just under ten 
million 128-dimensional vectors of floating-point values - about 9.2GB in 
space-delimited file format). We’re using Hadoop 2.2.0 and Mahout 0.9. The 
dataset is first converted from the simple space-delimited format into 
RandomAccessSparseVector format for K-Means using the 
org.apache.mahout.clustering.conversion.InputDriver utility.
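
In case it helps to see what the conversion produces on our side, this is 
roughly how we spot-check a few of the converted vectors afterwards. It is 
only a sketch: the part file name is illustrative, and we are assuming the 
values are VectorWritable (which is what InputDriver appears to write).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class VectorSpotCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Illustrative part file name under the vector directory we pass to kmeans.
    Path part = new Path("/lookandlearn/vectors_all/part-m-00000");
    SequenceFile.Reader reader =
        new SequenceFile.Reader(conf, SequenceFile.Reader.file(part));
    try {
      // Instantiate whatever key class the SequenceFile reports.
      Writable key = (Writable) reader.getKeyClass().newInstance();
      VectorWritable value = new VectorWritable();
      int shown = 0;
      while (reader.next(key, value) && shown < 5) {
        Vector v = value.get();
        System.out.println(key + " -> size=" + v.size()
            + ", nonZeroes=" + v.getNumNondefaultElements());
        shown++;
      }
    } finally {
      reader.close();
    }
  }
}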

We’re not using Canopy clustering to determine the initial clusters because we 
want a specific number of clusters (100,000), so we let K-Means pick the 
initial 100,000 centroids at random:

./mahout kmeans -i /lookandlearn/vectors_all -c /data/initial_centres -o 
/data/clusters_output -k 100000 -x 20 -ow -xm mapreduce

It all runs fine and we then extract the computed centroids using the 
clusterdump utility:

./mahout clusterdump -i /data/clusters_output/clusters-1-final/ -o 
./clusters.txt -of TEXT

The clusters.txt output file contains the expected 100,000 lines (one cluster 
per line); however, there seem to be some idiosyncrasies in the output format…

If we add up the values of n across all clusters, which we understand to be 
the number of data points assigned to each cluster, we get a total of 
39,160,754. We expected this to match the number of input points (9,769,004), 
since each input point should belong to exactly one cluster, so we are not 
sure why the sum of the n values is roughly four times the number of input 
points.
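
For reference, this is roughly how we are totalling the n values - just a 
quick sketch that pulls the n=<count> field out of each line of the 
clusters.txt dump above:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SumClusterSizes {
  // Extracts the n=<count> field from each clusterdump TEXT line and totals it.
  private static final Pattern N_FIELD = Pattern.compile("n=(\\d+)");

  public static void main(String[] args) throws Exception {
    long total = 0;
    BufferedReader in = new BufferedReader(new FileReader("clusters.txt"));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        Matcher m = N_FIELD.matcher(line);
        if (m.find()) {
          total += Long.parseLong(m.group(1));
        }
      }
    } finally {
      in.close();
    }
    System.out.println("sum of n over all clusters = " + total);
  }
}
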
We also notice that the vectors for the cluster centroids and radii appear in 
a couple of different formats. The majority are in a simple comma-separated 
array format e.g.

c=[0.008, 0.006, 0.009, 0.014, 0.006, 0.003, 0.007, 0.005, 0.032, 0.004, 0.001, 
0.003, 0.002, 0.002, 0.007, 0.017, 0.011, 0.002, 0.001, 0.014, 0.032, 0.015, 
0.001, 0.002, 0.025, 0.007, 0.001, 0.007, 0.031, 0.004, 0.000, 0.005, 0.006, 
0.003, 0.005, 0.029, 0.023, 0.001, 0.000, 0.005, 0.032, 0.007, 0.001, 0.009, 
0.014, 0.002, 0.000, 0.004, 0.011, 0.001, 0.002, 0.010, 0.032, 0.017, 0.000, 
0.002, 0.013, 0.019, 0.008, 0.009, 0.017, 0.005, 0.001, 0.003, 0.007, 0.005, 
0.002, 0.014, 0.021, 0.002, 0.001, 0.005, 0.032, 0.006, 0.005, 0.014, 0.016, 
0.003, 0.001, 0.004, 0.006, 0.000, 0.001, 0.005, 0.031, 0.026, 0.001, 0.002, 
0.009, 0.002, 0.003, 0.004, 0.006, 0.015, 0.004, 0.006, 0.006, 0.002, 0.002, 
0.006, 0.003, 0.001, 0.003, 0.009, 0.004, 0.002, 0.005, 0.018, 0.012, 0.001, 
0.000, 0.002, 0.001, 0.000, 0.007, 0.016, 0.021, 0.006, 0.001, 0.000, 0.006, 
0.003, 0.013, 0.012, 0.003, 0.002, 0.000, 0.001]

But there are also a significant number of clusters where the format appears 
to be a sparse array representation, with each value prefixed by its position 
index, e.g.

c=[0:0.056, 1:0.006, 2:0.000, 3:0.000, 4:0.000, 5:0.000, 6:0.000, 7:0.004, 
8:0.057, 9:0.002, 10:0.000, 11:0.000, 12:0.000, 13:0.000, 14:0.000, 15:0.005, 
16:0.056, 17:0.004, 18:0.000, 19:0.000, 20:0.000, 23:0.002, 24:0.024, 25:0.009, 
26:0.013, 27:0.005, 28:0.001, 29:0.001, 30:0.000, 31:0.000, 32:0.057, 33:0.006, 
34:0.000, 35:0.000, 36:0.000, 37:0.000, 38:0.000, 39:0.002, 40:0.057, 41:0.007, 
42:0.000, 43:0.000, 44:0.000, 45:0.000, 46:0.000, 47:0.004, 48:0.057, 49:0.008, 
50:0.000, 51:0.000, 52:0.000, 55:0.001, 56:0.050, 57:0.007, 58:0.000, 59:0.000, 
60:0.000, 61:0.000, 62:0.000, 63:0.001, 64:0.057, 65:0.003, 66:0.000, 67:0.000, 
68:0.000, 69:0.000, 70:0.000, 71:0.006, 72:0.057, 73:0.004, 74:0.000, 75:0.000, 
76:0.000, 77:0.000, 78:0.000, 79:0.009, 80:0.057, 81:0.003, 82:0.000, 83:0.000, 
84:0.000, 87:0.006, 88:0.047, 89:0.004, 90:0.000, 91:0.000, 92:0.000, 93:0.000, 
94:0.000, 95:0.006, 96:0.056, 97:0.005, 98:0.000, 99:0.000, 100:0.000, 
101:0.000, 102:0.000, 103:0.003, 104:0.057, 105:0.003, 106:0.000, 107:0.000, 
108:0.000, 109:0.000, 110:0.000, 111:0.006, 112:0.056, 113:0.000, 114:0.000, 
115:0.000, 116:0.000, 117:0.000, 118:0.000, 119:0.008, 120:0.038, 121:0.001, 
122:0.000, 123:0.000, 124:0.000, 125:0.000, 126:0.000, 127:0.006]

In this case, should values missing from the sparse vector format be 
interpreted as 0.0 (e.g. the value for dimension 21 in the above example)? And 
why are some zero values still included in this output format (e.g. dimensions 
2, 3, 4 etc. above)? It also seems awkward to us that the clusterdump output 
mixes different vector formats, as that makes it more complex to parse. 
Finally, we find that if we set the clusterdump output format to CSV instead 
of TEXT ("-of CSV") no output file is produced.
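
For what it's worth, this is roughly how we parse the c=[...] and r=[...] 
bodies at the moment. It is only a sketch, and it assumes that indices missing 
from the sparse form default to 0.0 - which is exactly the assumption we would 
like confirmed:

import java.util.Arrays;

public class CentroidParser {
  // Parses the text between the brackets of a c=[...] or r=[...] field.
  // Handles both the dense "0.008, 0.006, ..." form and the sparse
  // "0:0.056, 1:0.006, ..." form; indices absent from the sparse form are
  // assumed to be 0.0.
  static double[] parseVector(String body, int dimension) {
    double[] values = new double[dimension]; // defaults to 0.0
    String[] parts = body.split(",");
    for (int i = 0; i < parts.length; i++) {
      String p = parts[i].trim();
      if (p.isEmpty()) {
        continue;
      }
      int colon = p.indexOf(':');
      if (colon >= 0) {
        int index = Integer.parseInt(p.substring(0, colon));
        values[index] = Double.parseDouble(p.substring(colon + 1));
      } else {
        values[i] = Double.parseDouble(p);
      }
    }
    return values;
  }

  public static void main(String[] args) {
    // Tiny illustrative example, not real cluster data.
    System.out.println(Arrays.toString(parseVector("0:0.056, 1:0.006, 3:0.004", 5)));
    // -> [0.056, 0.006, 0.0, 0.004, 0.0]
  }
}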

Any information or feedback on the above would be greatly appreciated.

Regards,
Oisin.


