Just another remark regarding this:

I guess I cannot avoid the negative cosine similarity values. Maybe LSA
is a better approach? (TruncatedSVD)
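
Roughly what I have in mind (just a sketch, assuming the same docs as in the
quoted code below; n_components=100 is an arbitrary placeholder):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize

vec = TfidfVectorizer(max_df=0.8)
X = vec.fit_transform(docs)
svd = TruncatedSVD(n_components=100)      # LSA = TFIDF + truncated SVD
X_lsa = normalize(svd.fit_transform(X))   # L2-normalize so dot product = cosine
sim = np.dot(X_lsa, X_lsa.T)

(Though I realize the SVD coordinates can be negative as well, so this may
not guarantee non-negative similarities either.)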

On 08.08.2014 at 10:35, Philipp Singer <[email protected]> wrote:

> Hi,
> 
> I asked a question about the sparse random projection a few days ago, but 
> thought I should start a new topic regarding my current problem.
> 
> I am computing TFIDF weights for my text documents and then calculating the 
> cosine similarity between them to determine how similar the documents are. 
> For dimensionality reduction I am using the SparseRandomProjection class.
> 
> My current process looks like the following:
> 
> from sklearn.feature_extraction.text import TfidfVectorizer
> from sklearn.random_projection import SparseRandomProjection
> from sklearn.preprocessing import normalize
> 
> docs = [text1, text2,…]              # my raw text documents
> vec = TfidfVectorizer(max_df=0.8)
> X = vec.fit_transform(docs)          # sparse TFIDF term-document matrix
> proj = SparseRandomProjection()
> X2 = proj.fit_transform(X)           # project to a lower-dimensional space
> X2 = normalize(X2)                   # L2 normalization
> sim = X2 * X2.T                      # dot products of unit vectors = cosine
> 
> It works reasonably well. However, I found that the sparse random 
> projection turns many weights negative, so many similarity scores end up 
> negative as well. Given that TFIDF weights are never negative, and that the 
> corresponding cosine similarity scores should therefore always lie between 
> zero and one, I do not know whether this is an appropriate approach for my 
> task.
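> 
> To make the effect concrete (a sketch, assuming X and X2 from above and a 
> corpus small enough that the dense similarity matrices fit in memory):
> 
> import numpy as np
> from sklearn.metrics.pairwise import cosine_similarity
> 
> sim_exact = cosine_similarity(X)    # TFIDF cosines, always in [0, 1]
> sim_proj = cosine_similarity(X2)    # approximate cosines after projection
> print(np.abs(sim_exact - sim_proj).max())   # worst-case distortion
> print(sim_proj.min())                       # negative entries show up here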
> 
> Hope someone has some advice. Maybe I am also doing something wrong here.
> 
> Best,
> Philipp
> 
