
> Dear all,
> We have a problem with a large dataset that we want to cluster. The
> dataset is in a long format: 1154024 rows with presence data. Each row
> has the name of the species and the location. We have 1381 species and
> 6354 locations.
> The main problem is that we need the data in wide format (one row for
> each location, one column for each species) for the clustering
> algorithms. But the 6354 x 1381 dataframe is too big to fit into the
> memory. At least when we use cast from the reshape package to convert
> the dataframe from a long to a wide format.
> Are there any clustering tools available that can work with the data in
> a long format or with sparse matrices (only 13% of the matrix is
> non-zero)? If they work with sparse matrices: how to convert our dataset
> to a sparse matrix? Other suggestions are welcome.

6354 x 1381 should be well within your memory limit, so I assume it's
the intermediate steps that are fouling you up. Maybe you can do it in
pieces:

1. subset the original two-column data frame to include only the first 100 sites
2. convert this subset to wide form
3. repeat 63 more times, once for each remaining block of 100 sites
   (the last block holds only 54 sites)
4. rbind the resulting matrices
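
The steps above can be sketched roughly as follows. This is only an
illustration on made-up data, not a tested recipe for your dataset: the
column names `site` and `species` are assumptions, and base R's table()
stands in for cast from the reshape package. Fixing the species levels
up front gives every block the same columns, so the rbind in step 4
lines up.

```r
# Toy stand-in for the long-format presence data.
# The column names `site` and `species` are assumed, not taken from the post.
set.seed(1)
long <- unique(data.frame(
  site    = sample(sprintf("site%03d", 1:250), 2000, replace = TRUE),
  species = sample(sprintf("sp%02d", 1:40), 2000, replace = TRUE),
  stringsAsFactors = FALSE
))

all_species <- sort(unique(long$species))
all_sites   <- sort(unique(long$site))
chunk_size  <- 100

# Consecutive blocks of at most chunk_size site names
site_blocks <- split(all_sites, ceiling(seq_along(all_sites) / chunk_size))

# Steps 1-3: subset to one block of sites, then convert that block to wide form
wide_blocks <- lapply(site_blocks, function(sites) {
  sub <- long[long$site %in% sites, ]
  # Shared factor levels give every block identical columns
  tab <- table(factor(sub$site, levels = sites),
               factor(sub$species, levels = all_species))
  matrix(as.integer(tab > 0), nrow = length(sites),
         dimnames = list(sites, all_species))
})

# Step 4: stack the blocks into one site-by-species 0/1 matrix
wide <- do.call(rbind, unname(wide_blocks))
dim(wide)
```

At full size the result would be a 6354 x 1381 integer matrix, i.e.
8,774,874 cells at 4 bytes each, or roughly 35 MB.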

