Looking at the paper, the final CDbw calculation doesn't seem to require MR, right? For each cluster we only need to compare one of its points with one point in each other cluster. With a small number of representative points per cluster, that can easily be done in memory. I'd love to see the code you have for computing representative points.
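A minimal sketch of that in-memory comparison, assuming the representative points have already been computed and fit in a plain dictionary. The names `closest_representatives` and `pairwise_closest` and the sample data are hypothetical and not from Mahout. It only finds the closest pair of representative points between every two clusters; the density terms that the paper combines into CDbw are not attempted here.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def closest_representatives(reps_a, reps_b):
    """Return (distance, point_a, point_b) for the closest pair of
    representative points drawn from two clusters."""
    return min(
        ((dist(p, q), p, q) for p in reps_a for q in reps_b),
        key=lambda t: t[0],
    )


def pairwise_closest(reps_by_cluster):
    """For every unordered pair of clusters, find their closest
    representative points. Cost is quadratic in clusters and in
    representatives per cluster, which stays small in memory."""
    result = {}
    for a, b in combinations(sorted(reps_by_cluster), 2):
        result[(a, b)] = closest_representatives(
            reps_by_cluster[a], reps_by_cluster[b]
        )
    return result


# Hypothetical representative points for three clusters.
reps = {
    "c1": [(0.0, 0.0), (1.0, 0.0)],
    "c2": [(4.0, 0.0), (6.0, 0.0)],
    "c3": [(1.0, 4.0)],
}
closest = pairwise_closest(reps)
for pair, (d, p, q) in closest.items():
    print(pair, d, p, q)
```

With k clusters and r representatives each, this does on the order of k*k*r*r distance computations, which is why a single process should be enough when r is small.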

Jeff


Robin Anil wrote:
On Wed, Apr 7, 2010 at 11:50 PM, Jeff Eastman <[email protected]> wrote:

Hi Robin,

Interesting paper. I'm beginning to see how to MR the representative point
selection already. The rest will hopefully become clearer with more study.
Lots of MR jobs are needed to:

a) get the data into Vectors,
   [Robin: We have something for text; missing for other formats.]

b) iterate (e.g. kmeans) over the data to produce a set of clusters,
   [Robin: Done.]

c) cluster the data,
   [Robin: Done.]

d) iterate over the clustered data to derive representative points for each
   cluster, and finally
   [Robin: Done ;)]

e) produce the CDbw.
   [Robin: TODO.]

And, of course, all of this is again iterated with different values for the
clustering algorithm's parameters. Should keep the lights on at PG&E
producing power for the server farms.
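A rough sketch of how step (d) might be laid out as a map/reduce-style job, written as plain Python rather than against the Hadoop or Mahout APIs. It assumes a farthest-first heuristic for picking representatives (the first is the point farthest from the cluster centroid, each later one is the point farthest from those already chosen). That heuristic is an assumption here and should be checked against the paper. The names `centroid`, `reduce_cluster` and `select_representatives` are hypothetical.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def reduce_cluster(points, num_reps):
    """Reducer for one cluster: pick up to num_reps representatives
    with a farthest-first heuristic (an assumption, see lead-in)."""
    center = centroid(points)
    reps = [max(points, key=lambda p: dist(p, center))]
    while len(reps) < num_reps:
        candidates = [p for p in points if p not in reps]
        if not candidates:
            break  # fewer distinct points than requested representatives
        # Next representative: the point farthest from all chosen so far.
        reps.append(max(candidates, key=lambda p: min(dist(p, r) for r in reps)))
    return reps


def select_representatives(assignments, num_reps):
    """Simulate the job: the 'map' output is (cluster_id, point) pairs,
    the 'shuffle' groups them by cluster, the 'reduce' picks the points."""
    grouped = defaultdict(list)
    for cluster_id, point in assignments:
        grouped[cluster_id].append(point)
    return {cid: reduce_cluster(pts, num_reps) for cid, pts in grouped.items()}


# Hypothetical clustered data: (cluster id, point) pairs from step (c).
assignments = [
    ("c1", (0.0, 0.0)),
    ("c1", (1.0, 0.0)),
    ("c1", (5.0, 0.0)),
    ("c2", (10.0, 10.0)),
]
reps = select_representatives(assignments, num_reps=2)
print(reps)
```

Each reducer here holds all points of one cluster in memory, which would not scale to very large clusters; a real job would likely need one pass per representative or a combiner that keeps only the current farthest candidates.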



Robin Anil wrote:

Hi Jeff,
This is a good paper with a simple measure of cluster quality
based on intra-cluster density and inter-cluster separation. It's
pretty easy to compute. We need to make it a map/reduce job.

http://docs.google.com/viewer?a=v&q=cache:z5p9n04cBQEJ:www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf+clustering+quality&hl=en&gl=in&pid=bl&srcid=ADGEESiC-ocW6IWrKR4cb1t1ZqkzRKQ3tDv4UFBkVaUKU0gG3kADcPWIjs-60A0912nu8MFPsVM3pf9jKrP98dL-B-BaiOC9LObBS3VkJK6Mu6josZtVegLxp3BftduD3hFxtGOVZK_b&sig=AHIEtbSZwtgw9wmJoojQn7Dlz5OL67vICw
Robin





