Thank you for your answer. I checked it with many documents, both totally
different and similar ones. You can see an example of the text I used here:
https://dl.dropboxusercontent.com/u/37124455/documents.txt

Another script I wrote, which uses only tf-idf, reports 69% similarity for
those documents.
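
For reference, a tf-idf-only comparison of two documents could look roughly
like the sketch below. This is not the script mentioned above, only a minimal
standard-library illustration. The function names (tokenize, tfidf_vectors,
tfidf_similarity) are made up for this example, the tokenizer is a simple
assumption, no stop words are removed, and the smoothed idf with L2
normalisation is meant to resemble TfidfVectorizer's defaults as I understand
them, so the numbers will not necessarily match scikit-learn's.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase, keep word tokens of two or more characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_vectors(docs):
    # Raw term counts per document.
    counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for c in counts for term in c)
    vectors = []
    for c in counts:
        # Smoothed idf: ln((1 + n_docs) / (1 + df)) + 1, multiplied by the count.
        vec = {t: tf * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, tf in c.items()}
        # Rescale to unit length (L2 norm); an empty document stays empty.
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors


def tfidf_similarity(doc1, doc2):
    v1, v2 = tfidf_vectors([doc1, doc2])
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(w * v2.get(t, 0.0) for t, w in v1.items())
```

Calling tfidf_similarity(doc1, doc2) returns a value between 0 and 1, where 0
means the two documents share no terms and 1 means their tf-idf vectors point
in the same direction.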

> From: [email protected]
> Date: Sun, 29 Sep 2013 11:58:05 +0200
> To: [email protected]
> Subject: Re: [Scikit-learn-general] LSA for documents similarity
> 
> 2013/9/29 Tasos Ventouris <[email protected]>:
> > I am trying to create a script to compute the similarity of just two
> > documents. I wrote this code, but if I put only two docs in the data set, the
> > result is the 2x2 matrix [[1,0],[0,1]]. If I use more than two documents,
> > the results are almost correct. Any suggestions?
> 
> Have you inspected the vocabulary of the vectorizer? Do you have any
> reason to think the documents are similar in any way?
> 
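> > from sklearn.decomposition import TruncatedSVD
> > from sklearn.feature_extraction.text import TfidfVectorizer
> > from sklearn.metrics.pairwise import cosine_similarity
> > from sklearn.preprocessing import Normalizer
> >
> > def lsa(doc1, doc2):
> >     # Build tf-idf vectors for the two documents.
> >     dataset = [doc1, doc2]
> >     vectorizer = TfidfVectorizer(stop_words='english')
> >     X = vectorizer.fit_transform(dataset)
> >     # Project onto the LSA components, then rescale rows to unit length.
> >     lsa = TruncatedSVD(n_components=100)
> >     X = lsa.fit_transform(X)
> >     X = Normalizer(copy=False).fit_transform(X)
> >
> >     return cosine_similarity(X)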
> 
> ------------------------------------------------------------------------------
> October Webinars: Code for Performance
> Free Intel webinars can help you accelerate application performance.
> Explore tips for MPI, OpenMP, advanced profiling, and more. Get the most from 
> the latest Intel processors and coprocessors. See abstracts and register >
> http://pubads.g.doubleclick.net/gampad/clk?id=60133471&iu=/4140/ostg.clktrk
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
                                          