Hi,
Sorry for my late response. 
Thanks, Dmitriy and Ted, for your suggestions about using a smaller value of k
and about statistical noise.
I have some prior knowledge of the problem I am dealing with, and that is why
I expected this behaviour.
It is like this: there are some inherent groups (clusters) in my dataset, and
items in one group are essentially not at all similar to items from other
groups. That is the reason for the sparsity. For instance, if items
10, 11, 12, ..., 30 are in one group, then for those rows of the affinity
matrix the only non-zero columns will be 10, ..., 30 and the rest will be zero.
But I expected the clustering algorithm to detect these groups and, in
addition, to find sub-groups within them, which is my main requirement. That
is, maybe 10, 11, ..., 20 and 21, 22, ..., 30 are two clusters within that group.
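To illustrate what I mean (toy numbers, not my real data -- the item ranges
and affinity values below are made up), the matrix looks roughly like this:

import numpy as np

n = 30
W = np.zeros((n, n))
# One inherent group: items 10..29; zero affinity to everything outside it.
W[10:30, 10:30] = 0.1      # weak similarity across the whole group
W[10:20, 10:20] = 0.9      # tighter sub-group A: items 10..19
W[20:30, 20:30] = 0.8      # tighter sub-group B: items 20..29
np.fill_diagonal(W, 0.0)

# For any row inside the group, the only non-zero columns are 10..29.
print(np.nonzero(W[12])[0])   # -> 10, 11, 13, ..., 29 (12 is the zero diagonal)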
The clustering could be run on each group separately and sequentially, but I
wanted to exploit Hadoop's parallelism and finish the clustering faster.
With that said, can you suggest something for this problem?

Dan, I have tested the spectral-kmeans code against my MATLAB script and it is
running OK now! However, I did not test the final k-means clustering step,
since it is initialization-dependent, and I assumed that part to be fine.
After my investigations I found two issues with the code:
1. I still cannot find any reason why L = D^0.5 * W * D^0.5. I think it should
be D^-0.5 * W * D^-0.5, which is the change I made in the code I am using
(there is a small sketch of this after point 2 below).
2. Lanczos solver: apart from being time-consuming for large matrices, I think
there is another important issue. For spectral clustering we need the largest
"k" eigenvectors of the Laplacian, but the Lanczos solver returns the
eigenvectors in ascending order of eigenvalue. Jake Mannix
(http://lucene.472066.n3.nabble.com/Lanczos-Algorithm-td1929299.html)
suggested going for 4k-5k eigenvectors when we need "k" of them. However, if
we really need the largest "k", we would have to compute all eigenvectors of
the Laplacian.
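To make both points concrete, here is a small NumPy/SciPy sketch (a toy
matrix, not the Mahout code): with the D^-0.5 normalization the eigenvalues
stay within [-1, 1], and a solver that can target the large end of the
spectrum (SciPy's eigsh with which='LA') returns exactly the k vectors
spectral clustering needs, whereas an ascending-order solver puts them at the
very end.

import numpy as np
from scipy.sparse.linalg import eigsh

def normalized_affinity(W):
    # D^-1/2 * W * D^-1/2 for a symmetric, non-negative affinity matrix W
    inv_sqrt_d = 1.0 / np.sqrt(W.sum(axis=1))
    return W * np.outer(inv_sqrt_d, inv_sqrt_d)

# Toy affinity matrix (made-up values).
rng = np.random.default_rng(0)
B = rng.random((50, 50))
W = (B + B.T) / 2.0
np.fill_diagonal(W, 0.0)

A = normalized_affinity(W)
vals = np.linalg.eigvalsh(A)
print(vals.min(), vals.max())            # stays within [-1, 1]

k = 4
# Largest k eigenpairs directly ('LA' = largest algebraic eigenvalues).
vals_top, vecs_top = eigsh(A, k=k, which='LA')

# An ascending-order eigensolver puts the vectors we want at the END:
all_vals, _ = np.linalg.eigh(A)
assert np.allclose(np.sort(vals_top), all_vals[-k:])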
I will try the SSVD and let you know.
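For reference, here is a tiny NumPy sketch of the randomized-projection idea
that SSVD is based on (my own simplification with made-up parameter names and
defaults, not Mahout's implementation). The q argument plays the role of the
-q power-iteration setting Dmitriy mentioned, and for a symmetric affinity
matrix the left singular vectors approximate the top eigenvectors:

import numpy as np

def randomized_svd(A, k, p=10, q=1, seed=0):
    # Rank-k randomized SVD with oversampling p and q power iterations.
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Y = A @ rng.standard_normal((n, k + p))   # random sampling of the range
    for _ in range(q):                        # power iterations sharpen the subspace
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)                    # orthonormal basis for the sampled range
    B = Q.T @ A                               # small (k+p) x n problem
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]

# Usage on a toy symmetric matrix:
M = np.random.default_rng(1).random((200, 200))
M = (M + M.T) / 2.0
U, s, Vt = randomized_svd(M, k=5, q=1)
print(s)   # approximate top-5 singular values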

Thanks,
Aniruddha

 

-----Original Message-----
From: Dan Brickley [mailto:dan...@danbri.org] 
Sent: Thursday, July 19, 2012 11:12 PM
To: Aniruddha Basak
Subject: Re: eigendecomposition of very large matrices

Hi there,

I started replying below, and then eventually (hey, it's early morning for me
:) realised that you were the original poster here. I was going to say that we
had been trying to get spectral clustering working again.

What was the conclusion of your investigations? Did you get the original 
spectralkmeans code running ok, even if it does not suit your problem?

Porting / adapting to use SSVD (and maybe Ted's new kmeans stuff) sounds 
useful...

Dan


On 20 July 2012 02:32, Aniruddha Basak <t-aba...@expedia.com> wrote:
> Thanks Dmitriy for your reply.
> The matrix I am working on has 10-20 non-zero entries per row, so it is very sparse.
> I am trying to do spectral clustering which involves eigen-decomposition.
> I am wondering whether anyone has tried to do spectral clustering 
> using mahout for very large affinity matrix (input).

Mahout does ship with a spectral clustering component, built on Mahout's
SVD/Lanczos facility. It has suffered various forms of code rot. Shannon,
Aniruddha, and I have been trying to get it running reliably again. See
'bin/mahout spectralkmeans',
https://cwiki.apache.org/MAHOUT/spectral-clustering.html and nearby.

> Aniruddha
>
>
> -----Original Message-----
> From: Dmitriy Lyubimov [mailto:dlie...@gmail.com]
> Sent: Thursday, July 19, 2012 6:28 PM
> To: user@mahout.apache.org
> Subject: Re: eigendecomposition of very large matrices
>
> Very significant sparsity may be a problem, though, for -q >= 1. Again, it
> depends on the hardware you have and the # of non-zero elements in the
> input, but -q=1 is still the most recommended setting here.
>
>
> On Thu, Jul 19, 2012 at 6:20 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>> you may try SSVD.
>> https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition
>>
>> but 4k eigenvectors (or, rather, singular values) is still kind of a lot,
>> though, and may push the precision out of the error estimates. I don't
>> think we had a precision study for that many. You also need quite a bit of
>> memory to compute that (not to mention flops). More realistically you
>> could try 1k singular values. You may try more if you have access to more
>> powerful hardware than we did in the studies, but distributed computation
>> time will grow at about k^1.5, i.e. faster than linear, even if you have
>> enough nodes for the tasks.
>>
>> -d
>>
>> On Thu, Jul 19, 2012 at 6:12 PM, Aniruddha Basak <t-aba...@expedia.com> 
>> wrote:
>>> Hi,
>>> I am working on a clustering problem which involves determining the
>>> largest "k" eigenvectors of a very large matrix. The matrices I work
>>> on are typically of the order of 10^6 by 10^6.
>>> Trying to do this using the Lanczos solver available in Mahout, I
>>> found it is very slow and takes around 1.5 minutes to compute each
>>> eigenvector. Hence to get 4000 eigenvectors, it takes 100 hours, or
>>> about 4 days!
>>>
>>> So I am looking for something faster to solve the eigendecomposition
>>> problem for very large sparse matrices. Please suggest what I should use.
>>>
>>>
>>> Thanks,
>>> Aniruddha
>>>
