Maybe my mind is not in its right place, but how is that different from
using the PCA transformer?


On Thu, Jan 3, 2013 at 10:48 PM, Lars Buitinck <[email protected]> wrote:

> 2013/1/3 Jack Alan <[email protected]>:
> > I'm working in document classification and I wonder if there is a way of
> > having the feature vector calculated based on Latent Semantic Indexing (LSI)
> > instead of tf or tf-idf. As you know with LSI or Latent Dirichlet Allocation
> > (LDA), semantic features are captured.
>
> LSI is trivial to implement, apart from some scipy.sparse trickery. If
> you have a tf-idf matrix X as produced by
>
>     from sklearn.feature_extraction.text import TfidfVectorizer
>     vect = TfidfVectorizer()
>     X = vect.fit_transform(documents)
>
> then LSI is just the singular value decomposition of that matrix
> restricted to the k largest singular values. In SciPy, that's
>
>     from scipy.sparse.linalg import svds
>
>     U, sigma, V = svds(X, k=10)
>
> U is now an LSI feature matrix. To perform the same transformation on
> a test matrix Xtest, do
>
>     import numpy as np
>     from scipy.linalg import inv
>
>     Xtest = vect.transform(testdocs)
>     # svds returns the right singular vectors as rows, hence V.T
>     Utest = np.dot(Xtest * V.T, inv(np.diag(sigma)))
>
> (Note to self, package this in a transformer object and submit a pull
> request.)
>
> > I found an online Python library to do so called gensim. The point is, how
> > to merge gensim with sklearn to fulfill the requirement? Or any
> > alternatives?
>
> I've been fantasizing about a scikit-learn/gensim bridge, but I never
> got round to implementing it. If you hack one up, I'm interested.
>
> --
> Lars Buitinck
> Scientific programmer, ILPS
> University of Amsterdam
>
>
> ------------------------------------------------------------------------------
> Master Visual Studio, SharePoint, SQL, ASP.NET, C# 2012, HTML5, CSS,
> MVC, Windows 8 Apps, JavaScript and much more. Keep your skills current
> with LearnDevNow - 3,200 step-by-step video tutorials by Microsoft
> MVPs and experts. ON SALE this month only -- learn more at:
> http://p.sf.net/sfu/learnmore_122712
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
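
For reference, the LSI steps quoted above can be sketched end to end roughly
as follows. This is only a minimal sketch under stated assumptions: a small
hand-made sparse matrix stands in for the tf-idf output, and fit_lsi and
transform_lsi are hypothetical helper names, not functions from SciPy or
scikit-learn. It also divides by sigma instead of building inv(np.diag(sigma)),
which should be equivalent for a diagonal matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds


def fit_lsi(X, k):
    """Truncated SVD of a sparse (documents x terms) matrix X.

    Hypothetical helper: returns U, sigma, Vt for the k largest
    singular values, sorted in descending order.
    """
    U, sigma, Vt = svds(X, k=k)
    # Sort explicitly rather than relying on the order svds returns.
    order = np.argsort(sigma)[::-1]
    return U[:, order], sigma[order], Vt[order, :]


def transform_lsi(X, sigma, Vt):
    """Project documents into the LSI space: X * V * Sigma^-1."""
    # Dividing each column by its singular value is the same as
    # multiplying by inv(diag(sigma)).
    return (X @ Vt.T) / sigma


# Toy stand-in for a tf-idf matrix: 5 documents, 6 terms (assumed data).
X = csr_matrix(np.array([
    [1.0, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.8, 0.7, 0.1, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.9, 0.6, 0.0, 0.0],
    [0.0, 0.1, 0.7, 0.8, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.2, 1.0, 0.9],
]))

U, sigma, Vt = fit_lsi(X, k=2)

# Applying the test-time transformation to the training matrix itself
# should reproduce U, up to floating-point error.
U_again = transform_lsi(X, sigma, Vt)
```

Running the transformation on the training matrix is a cheap sanity check:
if U_again does not match U, the projection for test documents is probably
wrong as well.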