Regarding SSVD + clustering: I tried the command-line version of kmeans on U*Sigma and don't get row IDs in clusteredPoints there either. Running the command-line kmeans on the input matrix A does generate row IDs, so there must be some difference between the two inputs that causes this.
I used seq2sparse to create the NamedVectors and rowid to turn them into a DRM = A. Rowid creates a file "docIndex" that maps the row IDs of A to the original Keys in the TFIDF vector files, so it does not put NamedVectors into A and instead relies on the Keys to identify rows. Running kmeans on A then produces row IDs in clusteredPoints. Using the output of SSVD = U*Sigma as input to the same command-line version of kmeans produces no row IDs in "clusteredPoints". As I said earlier, this makes it impossible to tie clustered vectors back to the pre-SSVD input vectors. It leads me to think there is some significant difference between A and U*Sigma that is causing this, even though both A and U*Sigma look like <IntWritable, VectorWritable>. So I need to dig deeper.
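To make the goal concrete, here is a minimal sketch of the join I am after once row IDs do show up in clusteredPoints. It is plain Python, not Mahout code: the dictionaries and tuples are hypothetical stand-ins for the contents of docIndex and clusteredPoints, the names (doc_index, clustered_points, docs_by_cluster) and the sample paths are made up for illustration, and it assumes row i of U*Sigma corresponds to row i of A.

```python
# Hypothetical stand-in for the docIndex written by rowid:
# integer row ID of A -> original document key (illustrative values).
doc_index = {0: "/docs/a.txt", 1: "/docs/b.txt", 2: "/docs/c.txt"}

# Hypothetical stand-in for clusteredPoints, assuming each clustered
# vector kept its integer row ID: (cluster ID, row ID) pairs.
clustered_points = [(7, 0), (7, 2), (3, 1)]


def docs_by_cluster(points, index):
    """Group original document keys by cluster ID using the row-ID index."""
    clusters = {}
    for cluster_id, row_id in points:
        # Keep unmatched row IDs visible rather than dropping them silently.
        name = index.get(row_id, f"<unknown row {row_id}>")
        clusters.setdefault(cluster_id, []).append(name)
    return clusters


result = docs_by_cluster(clustered_points, doc_index)
for cluster_id, names in sorted(result.items()):
    print(cluster_id, names)
```

Without a row ID on each clustered vector the lookup has nothing to key on, which is exactly the problem with the U*Sigma run.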