Hi,
I asked a question about sparse random projections a few days ago, but
thought I should start a new topic for my current problem.
I am computing TF-IDF weights for my text documents and then the cosine
similarity between documents to measure how similar they are.
For dimensionality reduction I am using the SparseRandomProjection class.
My current process looks like the following:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.random_projection import SparseRandomProjection
from sklearn.preprocessing import normalize

docs = [text1, text2, …]
vec = TfidfVectorizer(max_df=0.8)
X = vec.fit_transform(docs)        # sparse TF-IDF matrix, all entries >= 0
proj = SparseRandomProjection()
X2 = proj.fit_transform(X)         # project onto a lower-dimensional space
X2 = normalize(X2)                 # L2 normalization
sim = X2 * X2.T                    # pairwise cosine similarities (dot products)
It works reasonably well. However, I found out that the sparse random
projection turns many weights negative, so many of the similarity scores
also end up being negative. Given that TF-IDF weights are never negative,
and that the corresponding cosine similarities should therefore always lie
between zero and one, I do not know whether this is an appropriate approach
for my task.
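For illustration, here is a minimal, self-contained sketch (the short
strings are just placeholder documents, not my real data) that shows
where the negative values come from:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.random_projection import SparseRandomProjection
from sklearn.preprocessing import normalize

docs = ["apple banana orange", "banana orange grape",
        "grape melon apple", "melon orange apple banana"]
X = TfidfVectorizer().fit_transform(docs)   # TF-IDF entries are all >= 0

proj = SparseRandomProjection(n_components=3, random_state=0)
X2 = normalize(proj.fit_transform(X))

print(proj.components_.toarray().min())  # the random matrix mixes +/- entries
print(X2.toarray().min())                # so projected vectors can go negative
print((X2 * X2.T).toarray().min())       # and some similarities can drop below 0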
Hope someone has some advice. Maybe I am also doing something wrong here.
Best,
Philipp