Hi,
I asked a question about sparse random projections a few days ago, but
thought I should start a new topic for my current problem.
I am computing TF-IDF weights for my text documents and then the cosine
similarity between documents to measure how similar they are.
For dimensionality reduction I am using the SparseRandomProjection class.
My current process looks like the following:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.random_projection import SparseRandomProjection
from sklearn.preprocessing import normalize

docs = [text1, text2, …]
vec = TfidfVectorizer(max_df=0.8)
X = vec.fit_transform(docs)        # sparse TF-IDF matrix, all entries >= 0
proj = SparseRandomProjection()
X2 = proj.fit_transform(X)         # project onto a lower-dimensional space
X2 = normalize(X2)                 # L2 normalization
sim = X2 * X2.T                    # pairwise cosine similarities (dot products)
It works reasonably well. However, I found out that the sparse random
projection turns many weights negative, so many of the similarity scores
also end up being negative. Given that TF-IDF weights are never negative,
and that the corresponding cosine similarities should therefore always lie
between zero and one, I do not know whether this is an appropriate approach
for my task.
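For illustration, here is a minimal, self-contained sketch (the short
strings are just placeholder documents, not my real data) that shows
where the negative values come from:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.random_projection import SparseRandomProjection
from sklearn.preprocessing import normalize

docs = ["apple banana orange", "banana orange grape",
        "grape melon apple", "melon orange apple banana"]
X = TfidfVectorizer().fit_transform(docs)   # TF-IDF entries are all >= 0

proj = SparseRandomProjection(n_components=3, random_state=0)
X2 = normalize(proj.fit_transform(X))

print(proj.components_.toarray().min())  # the random matrix mixes +/- entries
print(X2.toarray().min())                # so projected vectors can go negative
print((X2 * X2.T).toarray().min())       # and some similarities can drop below 0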
Hope someone has some advice. Maybe I am also doing something wrong here.
Best,
Philipp