Re: tf-idf + svd + cosine similarity

Stefan Wienert Tue, 14 Jun 2011 11:45:40 -0700

Actually I'm using RowSimilarityJob() with
--input input
--output output
--numberOfColumns documentCount
--maxSimilaritiesPerRow documentCount
--similarityClassname SIMILARITY_UNCENTERED_COSINE


I am not really sure what SIMILARITY_UNCENTERED_COSINE actually
calculates...
The source says: "distributed implementation of cosine similarity that
does not center its data"

So... this seems to be the similarity and not the distance?
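
For reference, a minimal sketch of what an uncentered cosine measure would
compute, assuming the quoted source comment means the plain cosine of the
angle between the raw row vectors. This is plain Python, not Mahout's
implementation; the function names and the toy vectors are made up for
illustration. The centered variant is only there for contrast, and the
distance is taken as 1 - similarity.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def uncentered_cosine(u, v):
    """cos(angle) between u and v on the raw vectors (no mean subtraction)."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

def centered_cosine(u, v):
    """Same formula after subtracting each vector's mean (Pearson correlation)."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return uncentered_cosine([a - mu for a in u], [b - mv for b in v])

def cosine_distance(u, v):
    """Distance derived from the similarity: 1 - cos(angle)."""
    return 1.0 - uncentered_cosine(u, v)

# Hypothetical toy "documents" (rows of a tiny document-term matrix).
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]   # same direction as doc_a
doc_c = [0.0, 0.0, 3.0]   # shares no terms with doc_a

for name, other in (("b", doc_b), ("c", doc_c)):
    print(name,
          "similarity:", uncentered_cosine(doc_a, other),
          "distance:", cosine_distance(doc_a, other),
          "centered:", centered_cosine(doc_a, other))
```

With these toy vectors the uncentered value is close to 1 for rows pointing
in the same direction and 0 for rows that share no terms, so values near 0
would mean "dissimilar" if the job really returns a similarity.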

Cheers,
Stefan



2011/6/14 Stefan Wienert <ste...@wienert.cc>:
> but... why do I get different results with cosine similarity and
> no dimension reduction (with 100,000 dimensions)?
>
> 2011/6/14 Fernando Fernández <fernando.fernandez.gonza...@gmail.com>:
>> Actually that's what your results are showing, aren't they? With rank 1000
>> the similarity avg is the lowest...
>>
>>
>> 2011/6/14 Jake Mannix <jake.man...@gmail.com>
>>
>>> actually, wait - are your graphs showing *similarity*, or *distance*?  In
>>> higher dimensions, *distance* (1-cos(angle), and the angle itself) should
>>> grow, but on the other hand, *similarity* (cos(angle)) should go toward 0.
>>>
>>> On Tue, Jun 14, 2011 at 10:15 AM, Stefan Wienert <ste...@wienert.cc>
>>> wrote:
>>>
>>> > Hey Guys,
>>> >
>>> > I have some strange results in my LSA pipeline.
>>> >
>>> > First, let me explain the steps my data goes through:
>>> > 1) Extract the term-document matrix (TDM) from a Lucene datastore
>>> > using TF-IDF as the weighting
>>> > 2) Transpose the TDM
>>> > 3a) Using Mahout SVD (Lanczos) with the transposed TDM
>>> > 3b) Using Mahout SSVD (stochastic SVD) with the transposed TDM
>>> > 3c) Using no dimension reduction (for testing purposes)
>>> > 4) Transpose the result (ONLY none / svd)
>>> > 5) Calculate cosine similarity (from Mahout)
>>> >
>>> > Now... some strange things happen:
>>> > First of all: The demo data shows the similarity from document 1 to
>>> > all other documents.
>>> >
>>> > the results using only cosine similarity (without dimension reduction):
>>> > http://the-lord.de/img/none.png
>>> >
>>> > the result using svd, rank 10
>>> > http://the-lord.de/img/svd-10.png
>>> > some points falling down to the bottom.
>>> >
>>> > the results using ssvd rank 10
>>> > http://the-lord.de/img/ssvd-10.png
>>> >
>>> > the result using svd, rank 100
>>> > http://the-lord.de/img/svd-100.png
>>> > more points falling down to the bottom.
>>> >
>>> > the results using ssvd rank 100
>>> > http://the-lord.de/img/ssvd-100.png
>>> >
>>> > the results using svd rank 200
>>> > http://the-lord.de/img/svd-200.png
>>> > even more points falling down to the bottom.
>>> >
>>> > the results using svd rank 1000
>>> > http://the-lord.de/img/svd-1000.png
>>> > most points are at the bottom
>>> >
>>> > please note the scale:
>>> > - the avg from none: 0.8712
>>> > - the avg from svd rank 10: 0.2648
>>> > - the avg from svd rank 100: 0.0628
>>> > - the avg from svd rank 200: 0.0238
>>> > - the avg from svd rank 1000: 0.0116
>>> >
>>> > so my question is:
>>> > Can you explain this behavior? Why are the documents getting more
>>> > equal with higher rank in the SVD? I thought it would be the opposite.
>>> >
>>> > Cheers
>>> > Stefan
>>> >
>>>
>>
>
>
>
> --
> Stefan Wienert
>
> http://www.wienert.cc
> ste...@wienert.cc
>
> Telefon: +495251-2026838
> Mobil: +49176-40170270
>



