Dear all,

Thanks for your responses. The biggest problem seems to be cast() from the reshape package, which could not handle the dataset. Peter's solution using the mefa package worked fine. I also found another solution: table(), which works fine for crosstabulating presence-only data.
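For anyone who wants to try the table() route, here is a minimal sketch on made-up data. The data frame `long`, its column names and the choice of the binary distance are assumptions for illustration; they are not the real dataset or the settings used in this thread.

```r
# Hypothetical long-format presence records: one row per (location, species).
long <- data.frame(
  location = c("A", "A", "B", "C", "C", "C"),
  species  = c("sp1", "sp2", "sp1", "sp2", "sp3", "sp3"),
  stringsAsFactors = TRUE
)

# table() counts every location x species combination in one pass.
counts <- table(long$location, long$species)

# Drop the "table" class and collapse repeated records to presence (1) / absence (0).
wide <- unclass(counts) > 0
storage.mode(wide) <- "integer"

# One row per location, one column per species: the wide layout that
# dist() and hclust() expect. The binary distance is an assumption here.
cl <- hclust(dist(wide, method = "binary"))
```

With the real data the two arguments to table() would be the location and species columns of the long-format data frame.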
After crosstabulation I tried a few clustering methods. agnes(), diana() and hclust() gave a solution. daisy() gave an out-of-memory error.

A follow-up question: I'm looking at the group membership with cutree(). It gives me something like:

    > unique(cutree(test, k = 2:8))
         2 3 4 5 6 7 8
    [1,] 1 1 1 1 1 1 1
    [2,] 1 1 1 2 2 2 2
    [3,] 1 1 2 3 3 3 3
    [4,] 1 1 2 3 3 3 8
    [5,] 1 1 2 3 3 4 4
    [6,] 2 2 3 4 4 5 5
    [7,] 2 2 3 4 5 6 6
    [8,] 2 3 4 5 6 7 7

But I'm looking for a binary or dendrogram-like coding of the group membership. That would be more convenient for mapping the group membership.

    [1,] 111
    [2,] 110
    [3,] 1011
    [4,] 1010
    [5,] 100
    [6,] 011
    [7,] 010
    [8,] 00

Any suggestions on that?

Thierry

----------------------------------------------------------------------------
ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and Forest
Cel biometrie, methodologie en kwaliteitszorg / Section biometrics, methodology and quality assurance
Gaverstraat 4
9500 Geraardsbergen
Belgium
tel. + 32 54/436 185
[EMAIL PROTECTED]
www.inbo.be

To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of.
~ Sir Ronald Aylmer Fisher

The plural of anecdote is not data.
~ Roger Brinner

The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.
~ John Tukey

-----Original message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On behalf of Peter Solymos
Sent: Tuesday 7 October 2008 15:51
To: r-sig-ecology@r-project.org
Subject: Re: [R-sig-eco] Clustering large data

Dear Thierry,

the 'mefa' package should do this, and I am also interested in testing the package with such a large number of species. I have used it before with 75K records, but only with ~160 species and 1052 sites. So please let me know if it worked!
You can do the clustering like this (SAMPLES and SPECIES are the two columns in the long format and have to be the same length):

    library(mefa)
    # build the sample-by-species cross-table from the long format
    x <- mefa(stcs(data.frame(SAMPLES, SPECIES)))
    # hierarchical clustering on the distances between samples
    cl <- hclust(dist(x$xtab))

Hope this works,

Peter

Peter Solymos, PhD
Department of Mathematical and Statistical Sciences
University of Alberta
Edmonton, Alberta, T6G 2G1
CANADA

On Tue, Oct 7, 2008 at 4:12 AM, ONKELINX, Thierry <[EMAIL PROTECTED]> wrote:
> Dear all,
>
> We have a problem with a large dataset that we want to cluster. The
> dataset is in a long format: 1154024 rows with presence data. Each row
> has the name of the species and the location. We have 1381 species and
> 6354 locations.
> The main problem is that we need the data in wide format (one row for
> each location, one column for each species) for the clustering
> algorithms. But the 6354 x 1381 dataframe is too big to fit into
> memory, at least when we use cast() from the reshape package to convert
> the dataframe from long to wide format.
>
> Are there any clustering tools available that can work with the data in
> long format or with sparse matrices (only 13% of the matrix is
> non-zero)? If they work with sparse matrices: how do we convert our
> dataset to a sparse matrix? Other suggestions are welcome.
>
> We are working with R 2.7.2 on WinXP with 2 GB RAM. --max-mem-size is
> set to 2047M.
>
> Thanks,
>
> Thierry

The views expressed in this message and any annex are purely those of the writer and may not be regarded as stating an official position of INBO, as long as the message is not confirmed by a duly signed document.

_______________________________________________
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
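On the follow-up question about a binary, dendrogram-like coding: one possible approach, offered as a sketch and not tested on the real data, is to walk down the merge matrix of an hclust object from the root and append one bit at each of the k - 1 top splits. The function name dendro_codes, the use of USArrests as stand-in data, and the convention that the first column of a merge row gets "1" and the second gets "0" are all assumptions of this example.

```r
# Sketch: binary path codes for the k groups that cutree(hc, k) would return.
# dendro_codes is a hypothetical helper, not part of any package.
dendro_codes <- function(hc, k) {
  n <- nrow(hc$merge) + 1L           # number of observations
  stopifnot(k >= 2, k <= n)
  codes <- character(n)              # final code per observation
  node_code <- character(n - 1L)     # code per merge row; the root starts as ""
  # Rows of hc$merge only refer to earlier rows, so going from the last row
  # (the root) down to the first visits every parent before its children.
  for (i in seq(n - 1L, 1L)) {
    # Only the last k - 1 merges are undone when cutting into k groups.
    is_split <- i >= n - k + 1L
    for (side in 1:2) {
      child <- hc$merge[i, side]
      code <- if (is_split) {
        paste0(node_code[i], c("1", "0")[side])  # append a bit at a real split
      } else {
        node_code[i]                             # below the cut: inherit
      }
      if (child < 0) {
        codes[-child] <- code        # negative entry: a single observation
      } else {
        node_code[child] <- code     # positive entry: an earlier merge row
      }
    }
  }
  names(codes) <- hc$labels
  codes
}

# Stand-in data from the datasets package, not the species data from the thread.
hc <- hclust(dist(USArrests))
codes <- dendro_codes(hc, k = 8)
table(codes)
```

Groups that share a long common prefix are close in the dendrogram, which is what makes the codes convenient for mapping. Results from agnes() or diana() would first need converting with as.hclust().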