Thanks for the illustration of xtabs. A quibble: doesn't the following also work, with as.matrix() substituted for matrix()? (It does seem to preserve the dimensions and dimension names.)

matrify <- function(datatable, formula = units ~ site + spp, relativize = FALSE) {
  tbl <- xtabs(formula, data = datatable)
  mx <- as.matrix(tbl)        # keeps the dim and dimnames of tbl
  if (relativize) {
    mx <- mx / rowSums(mx)    # divide each row by its row total
  }
  return(mx)
}

"Christian A. Parker" <[EMAIL PROTECTED]>
Sent by: [EMAIL PROTECTED]
10/07/2008 11:04 AM

To: "ONKELINX, Thierry" <[EMAIL PROTECTED]>
cc: r-sig-ecology@r-project.org
Subject: Re: [R-sig-eco] Clustering large data

This method for converting long to wide format seems to work well with
pretty large datasets and it uses only base functions.

# this function will return a site*species matrix
# based on the formula variable. Data does not need
# to be grouped, the xtabs function will take care of
# summing any rows that are equal according to the
# formula.
### units are the cell value
### site is the row value
### spp is the column value
matrify <- function(datatable, formula = units ~ site + spp, relativize = FALSE) {
  tbl <- xtabs(formula, data = datatable)
  mx <- matrix(tbl, ncol = ncol(tbl))
  colnames(mx) <- colnames(tbl)
  rownames(mx) <- rownames(tbl)
  if (relativize) {
    mx <- mx / rowSums(mx)
  }
  return(mx)
}

ONKELINX, Thierry wrote:
> Dear all,
>
> We have a problem with a large dataset that we want to cluster. The
> dataset is in a long format: 1154024 rows with presence data. Each row
> has the name of the species and the location. We have 1381 species and
> 6354 locations.
> The main problem is that we need the data in wide format (one row for
> each location, one column for each species) for the clustering
> algorithms. But the 6354 x 1381 dataframe is too big to fit into
> memory, at least when we use cast from the reshape package to convert
> the dataframe from long to wide format.
>
> Are there any clustering tools available that can work with the data in
> a long format or with sparse matrices (only 13% of the matrix is
> non-zero)? If they work with sparse matrices: how do we convert our
> dataset to a sparse matrix? Other suggestions are welcome.
>
> We are working with R 2.7.2 on WinXP with 2 GB RAM.
> --max-mem-size is set to 2047M.
>
> Thanks,
>
> Thierry
>
> ----------------------------------------------------------------------------
> ir. Thierry Onkelinx
> Instituut voor natuur- en bosonderzoek / Research Institute for Nature
> and Forest
> Cel biometrie, methodologie en kwaliteitszorg / Section biometrics,
> methodology and quality assurance
> Gaverstraat 4
> 9500 Geraardsbergen
> Belgium
> tel. + 32 54/436 185
> [EMAIL PROTECTED]
> www.inbo.be
>
> _______________________________________________
> R-sig-ecology mailing list
> R-sig-ecology@r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
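
On the sparse-matrix part of the original question, here is a minimal sketch rather than a tested recipe for the real data. It assumes a version of R whose xtabs() accepts the sparse argument (as far as I know that is newer than the 2.7.2 mentioned above) and that the Matrix package is installed. The toy data frame and the names long, dense and sp are made up for illustration; the column names follow the units/site/spp convention of matrify() above.

```r
library(Matrix)  # sparse matrix classes; assumed to be installed

# Toy long-format presence data (made up for illustration):
# one row per observation, with a repeated (site, spp) pair on purpose.
long <- data.frame(
  site  = c("A", "A", "B", "C", "C", "C"),
  spp   = c("sp1", "sp2", "sp1", "sp2", "sp3", "sp3"),
  units = c(1, 1, 1, 1, 1, 1)
)

# Dense site x species table, as produced inside matrify().
dense <- as.matrix(xtabs(units ~ site + spp, data = long))

# Sparse equivalent: same formula, but only non-zero cells are stored.
# Repeated (site, spp) pairs are summed, as in the dense table.
sp <- xtabs(units ~ site + spp, data = long, sparse = TRUE)

# Both versions should agree on shape and on the grand total.
stopifnot(all(dim(sp) == dim(dense)))
stopifnot(sum(sp) == sum(dense))

# Fraction of non-zero cells (about 13% in the data set described above).
fill <- nnzero(sp) / prod(dim(sp))
```

Whether this helps depends on the clustering function: many of them convert their input to a dense matrix or a distance object anyway. For scale, a dense 6354 x 1381 numeric matrix needs about 6354 * 1381 * 8 bytes, roughly 70 MB, so the matrix itself may fit in 2 GB even when the intermediate data frame from cast does not.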