It is a similarity, not a distance. Higher values mean more similarity, not less.

I agree that similarity ought to decrease with more dimensions. That is what you observe -- except that you see quite high average similarity with no dimension reduction! An average cosine similarity of 0.87 sounds "high" to me for anything but a few dimensions. What's the dimensionality of the input without dimension reduction? Something is amiss in this pipeline. It is an interesting question!
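For concreteness, here is what the two quantities look like side by side. This is a minimal sketch in plain Java, not Mahout's implementation, and the example vectors are made up:

    public class CosineDemo {

        // Cosine *similarity*: cos(angle) = dot(a, b) / (||a|| * ||b||).
        // 1.0 means identical direction, 0.0 means orthogonal.
        static double cosineSimilarity(double[] a, double[] b) {
            double dot = 0, normA = 0, normB = 0;
            for (int i = 0; i < a.length; i++) {
                dot   += a[i] * b[i];
                normA += a[i] * a[i];
                normB += b[i] * b[i];
            }
            return dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }

        // Cosine *distance* is the complement: it grows as vectors diverge.
        static double cosineDistance(double[] a, double[] b) {
            return 1.0 - cosineSimilarity(a, b);
        }

        public static void main(String[] args) {
            double[] a = {1, 0, 2, 0, 0, 3};
            double[] b = {0, 1, 2, 0, 1, 0};
            System.out.println("similarity = " + cosineSimilarity(a, b)); // ~0.44
            System.out.println("distance   = " + cosineDistance(a, b));   // ~0.56
        }
    }

So a plot hovering around 0.87 is saying "almost parallel", not "far apart".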
On Tue, Jun 14, 2011 at 7:39 PM, Stefan Wienert <[email protected]> wrote:
> Actually I'm using RowSimilarityJob() with
> --input input
> --output output
> --numberOfColumns documentCount
> --maxSimilaritiesPerRow documentCount
> --similarityClassname SIMILARITY_UNCENTERED_COSINE
>
> Actually I am not really sure what this SIMILARITY_UNCENTERED_COSINE
> calculates... The source says: "distributed implementation of cosine
> similarity that does not center its data"
>
> So... this seems to be the similarity and not the distance?
>
> Cheers,
> Stefan
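(Re: what SIMILARITY_UNCENTERED_COSINE calculates -- my reading of that javadoc line, not Mahout's actual code: "uncentered" cosine works on the raw values, while a *centered* variant would first subtract each vector's mean, which turns the cosine into the Pearson correlation. A standalone sketch of the difference, with made-up vectors:

    public class CenteringDemo {

        static double cosine(double[] a, double[] b) {
            double dot = 0, na = 0, nb = 0;
            for (int i = 0; i < a.length; i++) {
                dot += a[i] * b[i];
                na  += a[i] * a[i];
                nb  += b[i] * b[i];
            }
            return dot / (Math.sqrt(na) * Math.sqrt(nb));
        }

        // Subtract the vector's own mean from each component.
        static double[] center(double[] v) {
            double mean = 0;
            for (double x : v) mean += x;
            mean /= v.length;
            double[] c = new double[v.length];
            for (int i = 0; i < v.length; i++) c[i] = v[i] - mean;
            return c;
        }

        public static void main(String[] args) {
            double[] a = {3, 1, 4, 1, 5};
            double[] b = {2, 7, 1, 8, 2};
            // Uncentered: cosine on the raw values.
            System.out.println("uncentered = " + cosine(a, b));                   // ~0.44
            // Centered: cosine on mean-subtracted values = Pearson correlation.
            System.out.println("centered   = " + cosine(center(a), center(b)));   // ~-0.91
        }
    }

Either way it is a similarity, which matches the answer at the top of this thread.)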
> 2011/6/14 Stefan Wienert <[email protected]>:
>> But... why do I get such different results with cosine similarity and
>> no dimension reduction (with 100,000 dimensions)?
>>
>> 2011/6/14 Fernando Fernández <[email protected]>:
>>> Actually, that's what your results are showing, aren't they? With rank 1000
>>> the similarity avg is the lowest...
>>>
>>> 2011/6/14 Jake Mannix <[email protected]>:
>>>> Actually, wait - are your graphs showing *similarity*, or *distance*? In
>>>> higher dimensions, *distance* (1 - cos(angle)) should grow, and so should
>>>> the angle itself, but, on the other hand, *similarity* (cos(angle)) should
>>>> go toward 0.
>>>>
>>>> On Tue, Jun 14, 2011 at 10:15 AM, Stefan Wienert <[email protected]> wrote:
>>>> > Hey guys,
>>>> >
>>>> > I have some strange results in my LSA pipeline.
>>>> >
>>>> > First, I explain the steps my data goes through:
>>>> > 1) Extract the term-document matrix (TDM) from a Lucene datastore,
>>>> >    using TF-IDF weighting
>>>> > 2) Transpose the TDM
>>>> > 3a) Run Mahout SVD (Lanczos) on the transposed TDM
>>>> > 3b) Run Mahout SSVD (stochastic SVD) on the transposed TDM
>>>> > 3c) Use no dimension reduction (for testing purposes)
>>>> > 4) Transpose the result (ONLY none / svd)
>>>> > 5) Calculate cosine similarity (from Mahout)
>>>> >
>>>> > Now some strange things happen.
>>>> > First of all: the demo data shows the similarity from document 1 to
>>>> > all other documents.
>>>> >
>>>> > The results using only cosine similarity (without dimension reduction):
>>>> > http://the-lord.de/img/none.png
>>>> >
>>>> > The result using SVD, rank 10:
>>>> > http://the-lord.de/img/svd-10.png
>>>> > Some points fall down to the bottom.
>>>> >
>>>> > The results using SSVD, rank 10:
>>>> > http://the-lord.de/img/ssvd-10.png
>>>> >
>>>> > The result using SVD, rank 100:
>>>> > http://the-lord.de/img/svd-100.png
>>>> > More points fall down to the bottom.
>>>> >
>>>> > The results using SSVD, rank 100:
>>>> > http://the-lord.de/img/ssvd-100.png
>>>> >
>>>> > The results using SVD, rank 200:
>>>> > http://the-lord.de/img/svd-200.png
>>>> > Even more points fall down to the bottom.
>>>> >
>>>> > The results using SVD, rank 1000:
>>>> > http://the-lord.de/img/svd-1000.png
>>>> > Most points are at the bottom.
>>>> >
>>>> > Please note the scale:
>>>> > - the avg for none: 0.8712
>>>> > - the avg for svd rank 10: 0.2648
>>>> > - the avg for svd rank 100: 0.0628
>>>> > - the avg for svd rank 200: 0.0238
>>>> > - the avg for svd rank 1000: 0.0116
>>>> >
>>>> > So my question is:
>>>> > Can you explain this behavior? Why are the documents getting more equal
>>>> > with more ranks in SVD? I thought it would be the opposite.
>>>> >
>>>> > Cheers,
>>>> > Stefan
>>
>> --
>> Stefan Wienert
>>
>> http://www.wienert.cc
>> [email protected]
>>
>> Telefon: +495251-2026838
>> Mobil: +49176-40170270
>
> --
> Stefan Wienert
>
> http://www.wienert.cc
> [email protected]
>
> Telefon: +495251-2026838
> Mobil: +49176-40170270
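P.S. One more data point on the rank question: the average cosine similarity between random directions shrinks as the dimension grows, roughly like 1/sqrt(d). Here is a standalone sketch of that effect, with the assumption (mine, not from the thread) that i.i.d. Gaussian vectors are a crude stand-in for the reduced document vectors:

    import java.util.Random;

    public class DimensionDemo {

        // Same helper as in the CosineDemo sketch above.
        static double cosine(double[] a, double[] b) {
            double dot = 0, na = 0, nb = 0;
            for (int i = 0; i < a.length; i++) {
                dot += a[i] * b[i];
                na  += a[i] * a[i];
                nb  += b[i] * b[i];
            }
            return dot / (Math.sqrt(na) * Math.sqrt(nb));
        }

        public static void main(String[] args) {
            Random rnd = new Random(42);
            // Expected: avg |cos| ~ sqrt(2 / (pi * d)),
            // i.e. ~0.25 at d=10, ~0.08 at d=100, ~0.025 at d=1000.
            for (int d : new int[] {10, 100, 1000}) {
                int pairs = 2000;
                double sum = 0;
                for (int p = 0; p < pairs; p++) {
                    double[] a = new double[d], b = new double[d];
                    for (int i = 0; i < d; i++) {
                        a[i] = rnd.nextGaussian();
                        b[i] = rnd.nextGaussian();
                    }
                    sum += Math.abs(cosine(a, b));
                }
                System.out.printf("d = %4d  avg |cos| ~ %.4f%n", d, sum / pairs);
            }
        }
    }

Those baselines are in the same ballpark as the rank-10 and rank-100 averages reported above, which fits the reading in the replies: with more retained dimensions there is simply more room for document vectors to be nearly orthogonal.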
