Looking at the paper, the final CDbw calculation doesn't seem to require
MR, right? For each cluster we only need to compare one of its
points with one point in each other cluster. With small numbers of
representative points per cluster that can be done easily in memory. I'd
love to see the code you have for computing representative points.
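
A rough in-memory sketch of that comparison, assuming each cluster's representative points are already available as a small list. The function name and the dict layout are hypothetical, and this covers only the closest-pair step between clusters, not the full CDbw formula:

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def closest_representatives(reps):
    """For every pair of clusters, find the closest pair of representative points.

    reps: hypothetical dict mapping cluster id -> list of points (tuples of floats).
    Returns a dict mapping (cluster_a, cluster_b) -> (distance, point_a, point_b).
    """
    closest = {}
    for a, b in combinations(sorted(reps), 2):
        # Brute force is fine here: only a handful of representatives per cluster.
        closest[(a, b)] = min(
            ((dist(p, q), p, q) for p in reps[a] for q in reps[b]),
            key=lambda t: t[0],
        )
    return closest
```

With k clusters and r representatives each, this is about k*k*r*r/2 distance computations, which stays small as long as r is small.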
Jeff
Robin Anil wrote:
On Wed, Apr 7, 2010 at 11:50 PM, Jeff Eastman <[email protected]> wrote:
Hi Robin,
Interesting paper. I'm beginning to see how to MR the representative point
selection already. The rest will hopefully become clearer with more study.
Lots of MR jobs are needed to:
a) get the data into Vectors [we have something for text; missing for other
formats]
b) iterate (e.g. kmeans) over the data to produce a set of clusters [Done]
c) cluster the data [Done]
d) iterate over the clustered data to derive representative points for each
cluster [Done ;)], and finally
e) produce the CDbw [TODO]
And, of course, all of this is again iterated with different values for the
clustering algorithm's parameters. Should keep the lights on at PG&E
producing power for the server farms.
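
For step d), a minimal single-cluster sketch, assuming a farthest-first selection: start with the point farthest from the centroid, then repeatedly add the point farthest from those already chosen. That selection rule is an assumption about the procedure, not a quote from the paper, and the function and parameter names are hypothetical. In an MR setting this could plausibly run per cluster inside a reducer keyed by cluster id:

```python
from math import dist  # Euclidean distance, Python 3.8+


def representative_points(points, n_reps):
    """Pick up to n_reps well-scattered points from one cluster (farthest-first).

    points: non-empty list of equal-length tuples of floats (one cluster's members).
    """
    dims = len(points[0])
    centroid = tuple(sum(p[i] for p in points) / len(points) for i in range(dims))

    # First representative: the point farthest from the cluster centroid.
    reps = [max(points, key=lambda p: dist(p, centroid))]

    while len(reps) < n_reps:
        candidates = [p for p in points if p not in reps]
        if not candidates:  # fewer distinct points than requested
            break
        # Next representative: the point whose nearest chosen point is farthest away.
        reps.append(max(candidates, key=lambda p: min(dist(p, r) for r in reps)))
    return reps
```

For example, `representative_points([(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)], 2)` picks the two extremes of the cluster.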
Robin Anil wrote:
Hi Jeff,
This is a good paper with a simple measure of cluster quality based on
intra-cluster density and inter-cluster separation. It's pretty easy to
compute. We need to make it a map/reduce job:
http://docs.google.com/viewer?a=v&q=cache:z5p9n04cBQEJ:www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf+clustering+quality&hl=en&gl=in&pid=bl&srcid=ADGEESiC-ocW6IWrKR4cb1t1ZqkzRKQ3tDv4UFBkVaUKU0gG3kADcPWIjs-60A0912nu8MFPsVM3pf9jKrP98dL-B-BaiOC9LObBS3VkJK6Mu6josZtVegLxp3BftduD3hFxtGOVZK_b&sig=AHIEtbSZwtgw9wmJoojQn7Dlz5OL67vICw
Robin