2013/1/3 Jack Alan <[email protected]>:
> I'm working in document classification and I wonder if there is a way of
> having the feature vector calculated based on Latent Semantic Indexing (LSI)
> instead of tf or tf-idf. As you know with LSI or Latent Dirichlet Allocation
> (LDA), semantic features are captured.

LSI is trivial to implement, apart from some scipy.sparse trickery. If
you have a tf-idf matrix X as produced by

    from sklearn.feature_extraction.text import TfidfVectorizer
    vect = TfidfVectorizer()
    X = vect.fit_transform(documents)

then LSI is just the singular value decomposition of that matrix
restricted to the k largest singular values. In Scipy, that's

    from scipy.sparse.linalg import svds

    U, sigma, Vt = svds(X, k=10)

U is now an LSI feature matrix. (Note that svds returns the right
singular vectors already transposed, and the singular values in
ascending order.) To perform the same transformation on a test matrix
Xtest, do

    import numpy as np
    from scipy.linalg import inv

    Xtest = vect.transform(testdocs)
    Utest = np.dot(Xtest.dot(Vt.T), inv(np.diag(sigma)))

(Note to self, package this in a transformer object and submit a pull request.)
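That note-to-self could look something like the following minimal sketch. The class name, attribute names, and constructor parameter are my invention for illustration, not an existing scikit-learn API:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

class LSITransformer:
    """Hypothetical LSI transformer (sketch, not an sklearn class).

    fit_transform computes a truncated SVD of a (sparse) tf-idf matrix;
    transform projects new documents into the same latent space.
    """
    def __init__(self, n_components=10):
        self.n_components = n_components

    def fit_transform(self, X):
        # svds returns the k largest singular triplets; singular values
        # come back in ascending order and V is already transposed
        U, sigma, Vt = svds(X, k=self.n_components)
        self.sigma_ = sigma
        self.Vt_ = Vt
        return U

    def transform(self, X):
        # project: X @ V @ diag(sigma)^{-1}, i.e. divide columns by sigma
        return X.dot(self.Vt_.T) / self.sigma_

# usage sketch on a random stand-in for a tf-idf matrix
rng = np.random.RandomState(0)
X = csr_matrix(rng.rand(20, 10))
lsi = LSITransformer(n_components=3)
U_train = lsi.fit_transform(X)  # shape (20, 3)
```

On the training matrix itself, transform(X) reproduces fit_transform's output, because the residual of the truncated SVD is orthogonal to the retained right singular vectors.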

> I found an online Python library to do so called gensim. The point is, how
> to merge gensim with sklearn to fulfill the requirement? Or any
> alternatives?

I've been fantasizing about a scikit-learn/gensim bridge, but I never
got round to implementing it. If you hack one up, I'm interested.

-- 
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
