[R] Clustering with R - efficient processing of large sparse data sets (text data)

dataguru Sun, 27 Sep 2009 13:36:56 -0700

I checked the R function hclust (hierarchical clustering), but it
looks like it requires a full triangular n x n dissimilarity matrix as
input, where n = number of observations. The number of variables is
200.


My data set has n = 50,000 observations (keywords), and I use ad-hoc
similarity measures, not available in R, to measure keyword
similarity. Here, the vast majority of the n x n similarities are
equal to zero.

So I am looking for a clustering procedure that would accept the
following alternate input:

x1, y1, s1
x2, y2, s2

...

xk, yk, sk

where xi, yi are 2 keywords with similarity si > 0 (1 <= i <= k). This
input would contain k = 10,000 rows, which is much smaller than the
n x n = 50,000 x 50,000 = 2.5 billion elements of the full similarity
matrix. The hclust function would likely run out of memory if given
the full dissimilarity matrix as input.
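
To make the input format concrete, here is a minimal sketch in base R of
clustering directly from such a pair list. It is not hclust and only
covers single linkage: pairs are processed in decreasing order of
similarity and merged with a small union-find structure, and any pair
absent from the input is assumed to have similarity 0. The toy data, the
function name single_linkage_sparse and the threshold argument are all
made up for illustration.

```r
# Hypothetical toy input: one row per keyword pair with similarity > 0.
pairs <- data.frame(
  x = c("car", "auto", "loan", "car", "mortgage"),
  y = c("auto", "vehicle", "mortgage", "vehicle", "credit"),
  s = c(0.9, 0.8, 0.7, 0.4, 0.2),
  stringsAsFactors = FALSE
)

# Single-linkage clustering straight from the sparse pair list.
# Pairs missing from the input count as similarity 0 and never merge.
# Only pairs with similarity strictly above `threshold` are used.
single_linkage_sparse <- function(pairs, threshold = 0) {
  keywords <- unique(c(pairs$x, pairs$y))
  parent <- seq_along(keywords)   # union-find forest: each keyword is its own root

  # Follow parent links until reaching the root of a keyword's cluster.
  find_root <- function(i) {
    while (parent[i] != i) i <- parent[i]
    i
  }

  xi <- match(pairs$x, keywords)
  yi <- match(pairs$y, keywords)
  merges <- list()

  # Visit pairs from most to least similar; each merge joins two clusters.
  for (j in order(pairs$s, decreasing = TRUE)) {
    if (pairs$s[j] <= threshold) break
    rx <- find_root(xi[j])
    ry <- find_root(yi[j])
    if (rx != ry) {
      parent[ry] <- rx
      merges[[length(merges) + 1]] <-
        data.frame(a = pairs$x[j], b = pairs$y[j], s = pairs$s[j],
                   stringsAsFactors = FALSE)
    }
  }

  # Label clusters 1, 2, ... in order of first appearance.
  roots <- vapply(seq_along(keywords), find_root, integer(1))
  cluster <- match(roots, unique(roots))
  names(cluster) <- keywords
  list(cluster = cluster, merges = do.call(rbind, merges))
}

res <- single_linkage_sparse(pairs)
print(res$cluster)   # cluster label per keyword
print(res$merges)    # merge history, most similar pairs first
```

Raising threshold cuts the tree at a higher similarity level, e.g.
single_linkage_sparse(pairs, threshold = 0.5). Memory use grows with k
and the number of distinct keywords rather than with n x n. Other
linkage rules (average, complete) would need more bookkeeping than this
sketch does.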

Do you know how to use my small data input in R, instead of a very
large sparse similarity matrix? Or in SAS? I need a simple solution;
otherwise I'll just write the hierarchical clustering code myself, in
C or Perl, or use a library. It would take me 2 hours to write the
code from scratch, so I'm looking for a solution that takes less than
2 hours to implement.

Follow up at: 
http://www.analyticbridge.com/group/R_Packages/forum/topics/clustering-with-r-efficient

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
