Looks like they apply whitening, which is not implemented in TruncatedSVD.
I guess we could add that option. It's equivalent to using a StandardScaler after the TruncatedSVD.
Can you try and see if that reproduces the results?
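Something along these lines might do it (an untested sketch; X and
n_components are placeholders, and StandardScaler only matches the
Sigma⁻¹ scaling up to centering and a constant factor):

    from sklearn.decomposition import TruncatedSVD
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Whitened LSA: rescale each SVD component by its standard deviation,
    # which (for roughly centered components) is proportional to the
    # corresponding singular value.
    lsa = make_pipeline(TruncatedSVD(n_components=100), StandardScaler())
    X_lsa = lsa.fit_transform(X)  # X: tf-idf matrix, n_samples x n_features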


On 08/26/2016 10:09 AM, Roman Yurchak wrote:
Hi all,

I have a question about using the TruncatedSVD method for performing
Latent Semantic Analysis/Indexing (LSA/LSI). The docs imply that simply
applying TruncatedSVD to a tf-idf matrix is sufficient (cf.
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html),
but I'm not sure that is the case.

As far as I understand, for LSA one computes a truncated SVD
decomposition of the tf-idf matrix X (n_features x n_samples),
       X ≈ U @ Sigma @ V.T
and then for a document vector d, the projection is computed as,
       d_proj = d.T @ U @ Sigma⁻¹
(source: http://nlp.stanford.edu/IR-book/pdf/18lsi.pdf)
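(For concreteness, a minimal sketch of that projection computed directly
with scipy; X, d and k are placeholders here, and svds returns the
singular values in ascending order, hence the reordering:)

    import numpy as np
    from scipy.sparse.linalg import svds

    u, s, vt = svds(X, k=k)          # X ≈ u @ np.diag(s) @ vt
    order = np.argsort(s)[::-1]      # sort by decreasing singular value
    u, s = u[:, order], s[order]
    d_proj = (d @ u) / s             # d.T @ U @ Sigma⁻¹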
However, TruncatedSVD.fit_transform only computes,
       d_proj = d.T @ U
and, what's more, does not store the singular values (Sigma) internally,
so the Sigma⁻¹ normalization cannot easily be applied afterwards.
(the notation above is transposed with respect to that in the
scikit-learn docs).
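
(A rough workaround sketch, assuming X_tfidf is the tf-idf matrix with
documents as rows: since column i of the transformed data is sigma_i
times a unit vector, the singular values can be recovered as column
norms and divided out:)

    import numpy as np
    from sklearn.decomposition import TruncatedSVD

    svd = TruncatedSVD(n_components=100)
    X_proj = svd.fit_transform(X_tfidf)     # rows are d.T @ U, as above
    sigma = np.linalg.norm(X_proj, axis=0)  # column i has norm sigma_i
    X_lsa = X_proj / sigma                  # rows become d.T @ U @ Sigma⁻¹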

For instance, I have tried reproducing an LSA decomposition from the
literature, and I do not get the expected results unless I perform an
additional normalization by the Sigma matrix:
https://gist.github.com/rth/3af30c60bece7db4207821a6dddc5e8d

Am I missing something here?
Thank you,
