Basic question: When I use cutree to set the number of clusters (say k=4), is it correct to assume that cutree determines the clusters by taking the largest distances among all potential clusters?
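For concreteness, here is a minimal sketch of the kind of call I mean. It is not my actual code: the built-in USArrests data stands in for my samples, and "ward.D2" is an assumption about which of hclust's Ward variants applies. The names x, d, hc, groups_k, groups_h and h_cut are just placeholders. The sketch also compares cutting by k with cutting by a height h placed between two consecutive merge heights, which is how I would check where the k groups come from.

```r
# Stand-in data (assumption): built-in USArrests, 50 rows, scaled.
x <- scale(USArrests)
d <- dist(x, method = "euclidean")

# "ward.D2" is one of the Ward variants in stats::hclust (assumed here).
hc <- hclust(d, method = "ward.D2")

# Cut the tree into k groups; the result is one cluster label per sample.
k <- 4
groups_k <- cutree(hc, k = k)
print(table(groups_k))

# hc$height holds the n - 1 merge heights in increasing order.
# A cut height between merge n - k and merge n - k + 1 leaves k groups.
n <- length(hc$order)
h_cut <- mean(hc$height[c(n - k, n - k + 1)])
groups_h <- cutree(hc, h = h_cut)

# TRUE if cutting by k and cutting by that height give the same labels.
print(all(groups_k == groups_h))

# Optional Dunn Index step; runs only if clValid is installed.
if (requireNamespace("clValid", quietly = TRUE)) {
  print(clValid::dunn(distance = d, clusters = groups_k))
}
```

Note that cutree is given the hclust object here, not the raw data.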
I've read the R help for cutree and am using it to define the number of groups to obtain Dunn Index scores (using the clValid library) for cluster analysis (using Euclidean distance and Ward's method).

More specific (if helpful): I understand that cutree is used to set the number of clusters on which the Dunn Index bases its score. But the R help doesn't explain how the groups are determined.

Prior to measuring the Dunn Index, the cluster hierarchy formed using Euclidean distance and Ward's method provides a certain number of connected pairs of samples. For example: Say at the 1st iteration (hierarchy level 1), my n=68 samples are connected into k=32 groups. The next iteration connects these 32 into k=16 groups (hierarchy level 2). 3rd iteration = 8; 4th iteration = 4; and 5th iteration = 2. The distances from one hierarchy level to the next will differ for each group.

Is it correct to assume that I could cut the tree into anywhere from k=2 to k=32+16+8+4+2=62 groups? That is, cutree(data,k=2) through cutree(data,k=62) is valid, whereas anything outside those values is not?

Now say I use cutree(data,k=3) to define 3 clusters. Will cutree look back at the cluster tree created by Ward's method and then take the 3 largest distance values from among these 62 potential groups, so that when I use the Dunn Index, those will be the only distances considered?

I can post code and/or data if helpful.

Thanks,
kbrownk

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.