Hi,

Sorry for my late response. Thanks Dmitriy and Ted for your suggestions about smaller values of k and statistical noise.

I have some knowledge of the problem I am dealing with, and that is why I expected this. There are some inherent groups (clusters) in my dataset, and items in one group are essentially not similar at all to items from other groups. That is the reason for the sparsity. For instance, if items 10, 11, 12, ..., 30 are in one group, then for those rows of the affinity matrix the only non-zero columns are 10, ..., 30 and the rest are all zero.

I expected the clustering algorithms to detect these groups and, in addition, to find some sub-groups, which is my main requirement. For example, 10, 11, ..., 20 and 21, 22, ..., 30 might be two clusters within that group. The clustering could be run for the separate groups sequentially, but I wanted to exploit the parallelism of Hadoop and perform the clustering faster.

With that said, can you suggest something for this problem?
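[Editorial note: since items from different groups are, by construction, not connected in the affinity matrix, the groups show up as connected components of the affinity graph and can be found cheaply before any spectral step; each component could then be sub-clustered as an independent, parallelizable job. A minimal sketch in plain Python, assuming a hypothetical dict-of-rows sparse format for the affinity matrix:]

```python
from collections import deque

def connected_components(affinity):
    """Find the inherent groups in a sparse affinity matrix, given as
    {row: {col: weight, ...}} holding non-zero entries only."""
    seen = set()
    components = []
    for start in affinity:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        seen.add(start)
        while queue:                       # breadth-first search from `start`
            i = queue.popleft()
            comp.add(i)
            for j in affinity.get(i, {}):  # neighbours = non-zero columns
                if j not in seen:
                    seen.add(j)
                    queue.append(j)
        components.append(sorted(comp))
    return components

# Toy block-diagonal affinity: items 0-2 form one group, items 3-4 another.
W = {
    0: {1: 0.9, 2: 0.8},
    1: {0: 0.9, 2: 0.7},
    2: {0: 0.8, 1: 0.7},
    3: {4: 0.6},
    4: {3: 0.6},
}
print(connected_components(W))  # → [[0, 1, 2], [3, 4]]
```

Each returned component is a candidate input for a separate sub-clustering run (e.g. one Hadoop job per group).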
Dan,

I have tested the spectral-kmeans code against my MATLAB script and it is running OK now! I did not test the final k-means clustering step, though, since it is initialization-dependent; I assumed that part to be fine. After my investigation I found two issues with the code:

1. I still could not find any reason why L = D^0.5 * W * D^0.5. I think it should be D^-0.5 * W * D^-0.5 (which is the change I made in the code I am using).

2. Lanczos solver: besides being time-consuming for large matrices, I think there is another important issue. For spectral clustering we need the largest "k" eigenvectors of the Laplacian, but the Lanczos solver gives the eigenvectors in ascending order of eigenvalue. Jake Mannix (http://lucene.472066.n3.nabble.com/Lanczos-Algorithm-td1929299.html) suggested going for 4k-5k eigenvectors when we need "k" of them. However, if we really need the largest "k", we have to compute all eigenvectors of the Laplacian.

I will try the SSVD and let you know.

Thanks,
Aniruddha

-----Original Message-----
From: Dan Brickley [mailto:dan...@danbri.org]
Sent: Thursday, July 19, 2012 11:12 PM
To: Aniruddha Basak
Subject: Re: eigendecomposition of very large matrices

Hi there,

I started replying below, and then (hey, it's early morning for me :) realised that you were the original poster here. I was going to say that we had been trying to get spectral clustering working again. What was the conclusion of your investigations? Did you get the original spectralkmeans code running OK, even if it does not suit your problem? Porting / adapting it to use SSVD (and maybe Ted's new k-means work) sounds useful...

Dan

On 20 July 2012 02:32, Aniruddha Basak <t-aba...@expedia.com> wrote:
> Thanks Dmitriy for your reply.
> The matrix I am working on has 10-20 non-zero entries per row, so it
> is very sparse.
> I am trying to do spectral clustering, which involves eigen-decomposition.
> I am wondering whether anyone has tried to do spectral clustering
> using Mahout for a very large affinity matrix (input).

Mahout does ship with a spectral clustering component, built on Mahout's SVD/Lanczos facility. It has suffered various forms of code rot. Shannon, Aniruddha, and I have been trying to get it running reliably again. See 'bin/mahout spectralkmeans', https://cwiki.apache.org/MAHOUT/spectral-clustering.html and nearby.

> Aniruddha
>
>
> -----Original Message-----
> From: Dmitriy Lyubimov [mailto:dlie...@gmail.com]
> Sent: Thursday, July 19, 2012 6:28 PM
> To: user@mahout.apache.org
> Subject: Re: eigendecomposition of very large matrices
>
> Very significant sparsity may be a problem, though, for -q >= 1 parameters.
> Again, it depends on the hardware you have and the number of non-zero
> elements in the input, but -q=1 is still the most recommended setting here.
>
>
> On Thu, Jul 19, 2012 at 6:20 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>> You may try SSVD:
>> https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition
>>
>> But 4k eigenvectors (or, rather, singular values) is still a lot and
>> may push the precision out of the error estimates; I don't think we
>> have a precision study for that many. It also needs quite a bit of
>> memory to compute (not to mention flops). More realistically, you
>> may try 1k singular values. You may try more if you have access to
>> more powerful hardware than we did in the studies, but distributed
>> computation time will grow at about k^1.5, i.e. faster than linear,
>> even if you have enough nodes for the tasks.
>>
>> -d
>>
>> On Thu, Jul 19, 2012 at 6:12 PM, Aniruddha Basak <t-aba...@expedia.com> wrote:
>>> Hi,
>>> I am working on a clustering problem which involves determining the
>>> largest "k" eigenvectors of a very large matrix. The matrices I
>>> work on are typically of the order of 10^6 by 10^6.
>>> Trying to do this using the Lanczos solver available in Mahout, I
>>> found it is very slow: it takes around 1.5 minutes to compute each
>>> eigenvector. Hence, to get 4000 eigenvectors, it takes 100 hours,
>>> or about 4 days!
>>>
>>> So I am looking for something faster to solve the eigen-decomposition
>>> problem for a very large sparse matrix. Please suggest what I should use.
>>>
>>>
>>> Thanks,
>>> Aniruddha
>>>
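[Editorial note: on point 1 in the reply above, the standard normalized affinity matrix used in spectral clustering is L = D^-0.5 * W * D^-0.5, where D is the diagonal matrix of row sums of W. For a connected graph its largest eigenvalue is exactly 1, which is why the *largest* "k" eigenvectors are the ones spectral clustering needs. A small self-contained sketch in plain Python on a toy matrix; power iteration stands in here for the Lanczos/SSVD solver, purely to illustrate the math:]

```python
import math

def normalized_affinity(W):
    """L = D^-0.5 * W * D^-0.5, with D the diagonal of row sums of W."""
    n = len(W)
    d = [sum(row) for row in W]
    inv_sqrt = [1.0 / math.sqrt(di) if di > 0 else 0.0 for di in d]
    return [[inv_sqrt[i] * W[i][j] * inv_sqrt[j] for j in range(n)]
            for i in range(n)]

def power_iteration(A, iters=200):
    """Eigenvector of largest-magnitude eigenvalue of a symmetric matrix."""
    n = len(A)
    v = [1.0] * n
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient of the converged unit vector gives the eigenvalue.
    lam = sum(v[i] * A[i][j] * v[j] for i in range(n) for j in range(n))
    return lam, v

# Toy affinity: a tight triangle (items 0, 1, 2) plus a weakly attached item 3.
W = [[0.0, 1.0, 0.8, 0.1],
     [1.0, 0.0, 0.7, 0.0],
     [0.8, 0.7, 0.0, 0.1],
     [0.1, 0.0, 0.1, 0.0]]
L = normalized_affinity(W)
lam, v = power_iteration(L)
print(round(lam, 6))  # → 1.0, the top eigenvalue of L for a connected graph
```

With the sign flipped to D^0.5 * W * D^0.5, the spectrum is no longer bounded by 1 and the eigenvector ordering that spectral clustering relies on breaks down, which is consistent with the fix described in the email above.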