Re: MAHOUT-236 Cluster Evaluation Tools?

Jeff Eastman Thu, 08 Apr 2010 15:44:06 -0700

Looking at the paper, it doesn't seem to require MR for the final CDbw
calculation, right? For each cluster we only need to compare one of its
points with one point in each other cluster. With small numbers of
representative points per cluster, that can be done easily in memory. I'd
love to see the code you have for computing representative points.
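
To make the in-memory idea concrete, here is a minimal sketch of the pairwise step only, not the full CDbw formula from the paper. It assumes each cluster's representative points are already available as a small list of tuples and that Euclidean distance is the measure. The names `closest_representatives` and `pairwise_separation` and the toy data are hypothetical, not Mahout code.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_representatives(reps_a, reps_b):
    """Return (distance, point_a, point_b) for the closest cross-cluster pair."""
    return min(
        ((euclidean(p, q), p, q) for p in reps_a for q in reps_b),
        key=lambda t: t[0],
    )


def pairwise_separation(representatives):
    """Map each unordered pair of cluster ids to its closest representative pair.

    `representatives` is a dict: cluster id -> list of representative points.
    With r points per cluster and c clusters this is O(c^2 * r^2) distance
    computations, which stays small when r is small.
    """
    return {
        (i, j): closest_representatives(representatives[i], representatives[j])
        for i, j in combinations(sorted(representatives), 2)
    }


# Hypothetical toy data: three clusters with a few representative points each.
reps = {
    0: [(0.0, 0.0), (1.0, 0.0)],
    1: [(4.0, 0.0), (6.0, 0.0)],
    2: [(1.0, 5.0)],
}
separation = pairwise_separation(reps)
```

Everything here fits in a plain dictionary, so no MR job would be needed once the representative points have been collected.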


Jeff



Robin Anil wrote:

On Wed, Apr 7, 2010 at 11:50 PM, Jeff Eastman <[email protected]> wrote:

Hi Robin,

Interesting paper. I'm beginning to see how to MR the representative point
selection already. The rest will hopefully become clearer with more study.
Lots of MR jobs are needed to:

a) get the data into Vectors,
   [Robin: We have something for text, missing for other formats]

b) iterate (e.g. kmeans) over the data to produce a set of clusters,
   [Robin: Done]

c) cluster the data,
   [Robin: Done]

d) iterate over the clustered data to derive representative points for each
cluster, and finally
   [Robin: Done ;)]

e) produce the CDbw.
   [Robin: TODO]

And, of course, all of this is again iterated with different values for the
clustering algorithm's parameters. Should keep the lights on at PG&E
producing power for the server farms.
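
As an illustration of step (d), here is a single-machine sketch of one way representative points could be chosen for a cluster. It assumes a farthest-first rule: start with the point farthest from the cluster centroid, then repeatedly add the point farthest from those already chosen. That rule is an assumption made for this example, and the code is neither Robin's implementation nor an MR job. The name `select_representatives` and the sample points are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def select_representatives(points, k):
    """Pick up to k well-spread points from one cluster (farthest-first)."""
    if not points or k <= 0:
        return []
    center = centroid(points)
    # First representative: the point farthest from the centroid.
    chosen = [max(points, key=lambda p: euclidean(p, center))]
    while len(chosen) < min(k, len(points)):
        remaining = [p for p in points if p not in chosen]
        if not remaining:  # only duplicates of chosen points are left
            break
        # Next representative: the point whose nearest chosen point is farthest.
        chosen.append(
            max(remaining, key=lambda p: min(euclidean(p, c) for c in chosen))
        )
    return chosen


# Hypothetical cluster members.
cluster_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
representatives = select_representatives(cluster_points, 2)
```

In an MR setting, each reducer could apply a function like this to the points assigned to one cluster, though how best to split that work is left open here.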



Robin Anil wrote:

Hi Jeff,
           This is a good paper with a simple measure of cluster quality
based on intra-cluster density and inter-cluster separation. It's pretty
easy to compute. We need to make it a map/reduce job.

http://docs.google.com/viewer?a=v&q=cache:z5p9n04cBQEJ:www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf+clustering+quality&hl=en&gl=in&pid=bl&srcid=ADGEESiC-ocW6IWrKR4cb1t1ZqkzRKQ3tDv4UFBkVaUKU0gG3kADcPWIjs-60A0912nu8MFPsVM3pf9jKrP98dL-B-BaiOC9LObBS3VkJK6Mu6josZtVegLxp3BftduD3hFxtGOVZK_b&sig=AHIEtbSZwtgw9wmJoojQn7Dlz5OL67vICw
Robin
