Hi Lishu,

I ran into similar large-scale problems recently. I used the parallel SGD k-means described in this paper for my problem:
http://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf

Let n be the number of samples, k the number of clusters, and m the number of nodes.

1. First, each node reads n / m samples and randomly generates enough 'mini-batches' (the mini-batch size and the number of SGD iterations must be chosen beforehand).
2. Sample k / m centers from the samples on each node.
3. Update the centers using the mini-batches generated in the first step. Note that at this stage it is not necessary to hold the sample data on each node.
4. Once the centers have been optimized by SGD, compute the distance matrix between samples and centers. I used spherical k-means, so this step can be split into a series of block matrix multiplications to save memory.

Note that each node only needs to hold part of the sample data and part of the centers, so this method works in a 'regular' MPI environment and does not need a shared-memory architecture. I used pbdMPI to parallelize the algorithm.

Hope this helps.

Wuming

On Wed, Jan 18, 2012 at 3:37 PM, Lishu Liu <lishu...@gmail.com> wrote:
> Hi,
>
> I have a 60k*600k matrix, which exceeds the vector length limit of 2^32-1.
> But it's rather sparse: only 0.02% of it has values. So I save it as a
> MatrixMarket (mm) file, which is about 300M in size. I use readMM in the
> Matrix package to read it in. If I do so, the data type becomes dgTMatrix
> from the 'Matrix' package instead of the common matrix type.
>
> The problem is, if I run k-means only on part of the data, to make sure
> the vector length does not exceed 2^32-1, there's no problem at all,
> meaning that the kmeans in R can recognize this type of matrix.
> If I run the entire matrix, R says "too many elements specified."
>
> I have considered the 'bigmemory' and 'biganalytics' packages. But saving
> the sparse matrix as a common CSV file would take approx. 70G, with 99%
> of it being 0. I just don't think it's necessary or efficient to treat it
> as a dense matrix.
>
> Is there any way to deal with the vector length limit?
> Can I split the whole matrix into small ones and then do k-means?
>
> Thanks,
> Lishu
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
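P.S. In case it is useful, here is a minimal single-node sketch of the mini-batch SGD k-means update from the Sculley paper, operating directly on a sparse Matrix object so the data never has to be densified. The function name and all parameter defaults are my own illustrative choices, not from the paper or from pbdMPI; the parallel version would run this per node on its n / m rows.

```r
library(Matrix)

## Mini-batch k-means sketch (after Sculley 2010) for a sparse matrix X.
## Hypothetical helper: names and defaults are illustrative, not canonical.
minibatch_kmeans <- function(X, k, batch_size = 100, iters = 50, seed = 1) {
  set.seed(seed)
  n <- nrow(X)
  ## Initialize centers from randomly chosen samples (step 2 above).
  centers <- as.matrix(X[sample(n, k), , drop = FALSE])
  counts  <- rep(0, k)                      # per-center update counts

  for (i in seq_len(iters)) {
    idx <- sample(n, batch_size)            # draw one mini-batch (step 1)
    B   <- X[idx, , drop = FALSE]           # stays sparse

    ## Squared Euclidean distances via the cross-term trick; the sparse
    ## matrix product does the heavy lifting without densifying B.
    cross <- as.matrix(B %*% t(centers))
    d <- outer(rowSums(B^2), rowSums(centers^2), "+") - 2 * cross
    a <- max.col(-d)                        # nearest center per batch row

    ## SGD center update with per-center learning rate 1/counts (step 3).
    for (j in seq_along(idx)) {
      cj <- a[j]
      counts[cj] <- counts[cj] + 1
      eta <- 1 / counts[cj]
      centers[cj, ] <- (1 - eta) * centers[cj, ] + eta * as.numeric(B[j, ])
    }
  }
  centers
}
```

Applied to your case, you would read the data with readMM, convert to a column-compressed form (as(X, "CsparseMatrix")) for fast row slicing, and call the function above; only the k x 600k centers matrix is held densely.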