I looked at the R function hclust (hierarchical clustering), but it appears to require the full lower triangle of an n x n dissimilarity matrix as input, where n is the number of observations. The number of variables is 200.
My data set has n = 50,000 observations (keywords), and I use ad hoc similarity measures, not available in R, to measure keyword similarity. The vast majority of the n x n similarities are equal to zero. So I am looking for a clustering procedure that would accept the following alternative input:

x1, y1, s1
x2, y2, s2
...
xk, yk, sk

where xi, yi are two keywords with similarity si > 0 (1 <= i <= k). This input would contain k = 10,000 rows, which is far smaller than the n x n = 50,000 x 50,000 elements of the full similarity matrix (about 1.25 billion pairs even in triangular form). hclust would crash if given the full dissimilarity matrix as input.

Do you know how to use my small input in R, instead of a very large sparse similarity matrix? Or in SAS? I need a simple solution; otherwise I'll just write the hierarchical clustering code myself, in C or Perl, or use a library. Writing it from scratch would take me about 2 hours, so I'm looking for something that takes less than 2 hours to implement.

Follow up at: http://www.analyticbridge.com/group/R_Packages/forum/topics/clustering-with-r-efficient
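
To make the sparse input format concrete, here is a minimal sketch of the from-scratch route, written in Python rather than R. It assumes single linkage, which the question does not specify; other linkages (average, complete) would need a different approach. The function name single_linkage and the min_similarity parameter are hypothetical, chosen only for illustration. The idea is to sort the k triplets by decreasing similarity and merge clusters with a union-find structure, so the cost is driven by k (sorting is O(k log k)) rather than by n x n.

```python
from collections import defaultdict


def _find(parent, x):
    """Return the cluster representative of x (registers x if unseen)."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving keeps chains short
        x = parent[x]
    return x


def single_linkage(triplets, min_similarity=0.0):
    """Single-linkage clustering from sparse (x, y, s) triplets.

    Pairs missing from `triplets` are treated as similarity 0 and are
    never merged. Returns (merges, groups): the merges in the order they
    happen (most similar first), and the clusters obtained when only
    links with s >= min_similarity are allowed.
    """
    parent = {}
    merges = []
    # Process the strongest links first, as agglomerative clustering does.
    for x, y, s in sorted(triplets, key=lambda t: t[2], reverse=True):
        rx, ry = _find(parent, x), _find(parent, y)
        if s >= min_similarity and rx != ry:
            parent[ry] = rx            # merge the two clusters
            merges.append((x, y, s))   # one step of the dendrogram
    groups = defaultdict(set)
    for keyword in list(parent):
        groups[_find(parent, keyword)].add(keyword)
    return merges, list(groups.values())


if __name__ == "__main__":
    # Made-up toy data, only to show the input shape.
    example = [
        ("car", "auto", 0.9),
        ("auto", "vehicle", 0.7),
        ("loan", "mortgage", 0.8),
        ("car", "loan", 0.1),
    ]
    merges, groups = single_linkage(example, min_similarity=0.5)
    print(merges)
    print(groups)
```

Keywords that never appear in any triplet are not returned at all; they would be singleton clusters. Raising min_similarity plays the role of cutting the dendrogram at a given height.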