Have you tried increasing the number of components, or adjusting the epsilon
parameter and density, of the SparseRandomProjection?
Have you tried normalising X prior to the random projection?
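
To make these suggestions concrete, here is a minimal sketch that compares
cosine similarities before and after the projection for several values of
n_components. The toy corpus and the helper name max_abs_error are made up
for illustration, so treat it as a starting point for your own data rather
than a tested recipe. Note that TfidfVectorizer already L2-normalises rows by
default, and that eps only has an effect when n_components="auto".

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.random_projection import SparseRandomProjection

# Toy corpus standing in for the real documents (hypothetical data).
docs = [
    "sparse random projection for text",
    "random projection reduces dimensionality",
    "cosine similarity between tfidf documents",
    "tfidf weights for text documents",
]

# TfidfVectorizer uses norm="l2" by default, so the rows of X are
# already normalised before the projection is applied.
vec = TfidfVectorizer()
X = vec.fit_transform(docs)
exact = cosine_similarity(X)


def max_abs_error(n_components, density="auto", seed=0):
    """Largest absolute difference between the exact cosine similarities
    and those computed after a sparse random projection."""
    proj = SparseRandomProjection(
        n_components=n_components, density=density, random_state=seed
    )
    X2 = proj.fit_transform(X)
    approx = cosine_similarity(X2)
    return float(np.abs(approx - exact).max())


# On this tiny vocabulary n_components exceeds n_features, which only
# triggers a warning; on a real corpus the projection would reduce the
# dimensionality instead.
for k in (4, 64, 1024):
    print(k, max_abs_error(k))
```

On real data you would look for the smallest n_components at which the
error is acceptable. Slightly negative similarities can still occur, since
the projection only preserves inner products approximately.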
Best regards,
Arnaud
On 08 Aug 2014, at 12:19, Philipp Singer <[email protected]> wrote:
> Just another remark regarding this:
>
> I guess I cannot circumvent the negative cosine similarity values. Maybe LSA
> is a better approach? (TruncatedSVD)
>
> On 08.08.2014 at 10:35, Philipp Singer <[email protected]> wrote:
>
>> Hi,
>>
>> I asked a question about the sparse random projection a few days ago, but
>> thought I should start a new topic regarding my current problem.
>>
>> I am calculating TFIDF weights for my text documents and then computing
>> the cosine similarity between documents. For dimensionality reduction I
>> am using the SparseRandomProjection class.
>>
>> My current process looks like the following:
>>
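>> from sklearn.feature_extraction.text import TfidfVectorizer
>> from sklearn.preprocessing import normalize
>> from sklearn.random_projection import SparseRandomProjection
>>
>> docs = [text1, text2,…]
>> vec = TfidfVectorizer(max_df=0.8)
>> X = vec.fit_transform(docs)
>> proj = SparseRandomProjection()
>> X2 = proj.fit_transform(X)
>> X2 = normalize(X2)  # L2 normalization (the default norm)
>> sim = X2 * X2.T  # sparse matrix product, i.e. pairwise cosine similarities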
>>
>> It works reasonably well. However, I found out that the sparse random
>> projection sets many weights to a negative value. Hence, also many
>> similarity scores end up being negative. Given the original intention of
>> tfidf weights (which should never be negative) and corresponding cosine
>> similarity scores (which then should always only range between zero and
>> one), I do not know whether this is an appropriate approach for my task.
>>
>> Hope someone has some advice. Maybe I am also doing something wrong here.
>>
>> Best,
>> Philipp
>>
>
> ------------------------------------------------------------------------------
> Want fast and easy access to all the code in your enterprise? Index and
> search up to 200,000 lines of code with a free copy of Black Duck
> Code Sight - the same software that powers the world's largest code
> search on Ohloh, the Black Duck Open Hub! Try it now.
> http://p.sf.net/sfu/bds
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general