Anonymous rows in clusters after SSVD

Pat Ferrel Sun, 09 Sep 2012 11:09:45 -0700

Regarding SSVD + clustering

I tried the command line version of kmeans on U*Sigma and don't get row IDs in 
clusteredPoints there either. Running the same command line kmeans on the input 
matrix A does generate row IDs, so there must be some difference between the 
two that causes this.


I used seq2sparse to create the NamedVectors and rowid to turn them into a DRM 
= A. Rowid creates a file "docIndex" that maps the row IDs of A (actually the 
Keys in the TFIDF vector files), so it does not put NamedVectors into A and 
relies on the Keys to identify rows. Running kmeans on A then produces row IDs 
in clusteredPoints.

Using the output of SSVD = U*Sigma as input to the same command line version of 
kmeans produces no row IDs in "clusteredPoints". As I said earlier, this makes 
it impossible to tie clustered vectors back to pre-SSVD input vectors.
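
To make the goal concrete, here is a minimal sketch of the join that row IDs in 
clusteredPoints would allow. It is plain Python over in-memory data, not Mahout 
code: the contents of doc_index and clustered_points, the file paths, and the 
helper label_clusters are all hypothetical and only stand in for what docIndex 
and clusteredPoints would hold once read out of their sequence files.

```python
# Hypothetical stand-in for "docIndex": integer row ID -> original Key.
doc_index = {0: "/docs/a.txt", 1: "/docs/b.txt", 2: "/docs/c.txt"}

# Hypothetical stand-in for "clusteredPoints": (cluster ID, row ID) pairs.
# This only works if kmeans carries the row ID through to its output.
clustered_points = [(10, 0), (10, 2), (11, 1)]


def label_clusters(clustered_points, doc_index):
    """Group the original document Keys by the cluster they were assigned to."""
    clusters = {}
    for cluster_id, row_id in clustered_points:
        clusters.setdefault(cluster_id, []).append(doc_index[row_id])
    return clusters


labeled = label_clusters(clustered_points, doc_index)
for cluster_id, docs in sorted(labeled.items()):
    print(cluster_id, docs)
```

Without row IDs on the clustered vectors there is nothing to look up in 
doc_index, which is the problem described above.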

This leads me to think there is some significant difference between A and 
U*Sigma that is causing this. Both A and U*Sigma appear to be 
<IntWritable, VectorWritable>, so I need to dig deeper.
