Re: tf-idf + svd + cosine similarity

Fernando Fernández Tue, 14 Jun 2011 10:52:57 -0700

Actually, that's what your results are showing, isn't it? With rank 1000,
the average similarity is the lowest...
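
As a rough illustration (not Stefan's data, and not Mahout code), here is a
minimal Python sketch of the effect. It assumes random Gaussian vectors as a
stand-in for documents, which real SVD-projected TF-IDF vectors are not, and
the function names are made up for this example. It averages the absolute
cosine similarity from the first vector to all the others, so the drift
toward 0 in higher dimensions becomes visible.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Cosine distance, defined here as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


def mean_abs_similarity(dim, n_vectors=200, seed=0):
    """Average |cosine similarity| from the first random vector to the rest.

    Assumption: i.i.d. Gaussian components, purely for illustration.
    """
    rng = random.Random(seed)
    vectors = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
               for _ in range(n_vectors)]
    first = vectors[0]
    sims = [abs(cosine_similarity(first, v)) for v in vectors[1:]]
    return sum(sims) / len(sims)


if __name__ == "__main__":
    # Same ranks as in the thread; the printed averages shrink as dim grows.
    for dim in (10, 100, 200, 1000):
        print(dim, round(mean_abs_similarity(dim), 4))
```

If the plots show similarity, a falling average with growing rank is what
this toy setup would predict too; if they showed distance, the trend would
be reversed.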



2011/6/14 Jake Mannix <[email protected]>

> Actually, wait - are your graphs showing *similarity* or *distance*? In
> higher dimensions, *distance* (1-cos(angle)) and the angle itself should
> grow, but on the other hand, *similarity* (cos(angle)) should go toward 0.
>
> On Tue, Jun 14, 2011 at 10:15 AM, Stefan Wienert <[email protected]>
> wrote:
>
> > Hey Guys,
> >
> > I have some strange results in my LSA-Pipeline.
> >
> > First, here are the steps my data goes through:
> > 1) Extract the term-document matrix (TDM) from a Lucene datastore, using
> > TF-IDF as the weighting
> > 2) Transpose the TDM
> > 3a) Run Mahout SVD (Lanczos) on the transposed TDM
> > 3b) Run Mahout SSVD (stochastic SVD) on the transposed TDM
> > 3c) Use no dimension reduction (for testing purposes)
> > 4) Transpose the result (ONLY none / svd)
> > 5) Calculate cosine similarity (from Mahout)
> >
> > Now... some strange things happen:
> > First of all: The demo data shows the similarity from document 1 to
> > all other documents.
> >
> > the results using only cosine similarity (without dimension reduction):
> > http://the-lord.de/img/none.png
> >
> > the result using svd, rank 10
> > http://the-lord.de/img/svd-10.png
> > some points falling down to the bottom.
> >
> > the results using ssvd rank 10
> > http://the-lord.de/img/ssvd-10.png
> >
> > the result using svd, rank 100
> > http://the-lord.de/img/svd-100.png
> > more points falling down to the bottom.
> >
> > the results using ssvd rank 100
> > http://the-lord.de/img/ssvd-100.png
> >
> > the results using svd rank 200
> > http://the-lord.de/img/svd-200.png
> > even more points falling down to the bottom.
> >
> > the results using svd rank 1000
> > http://the-lord.de/img/svd-1000.png
> > most points are at the bottom
> >
> > please note the scale:
> > - the avg from none: 0.8712
> > - the avg from svd rank 10: 0.2648
> > - the avg from svd rank 100: 0.0628
> > - the avg from svd rank 200: 0.0238
> > - the avg from svd rank 1000: 0.0116
> >
> > so my question is:
> > Can you explain this behavior? Why are the documents getting more
> > equal with higher ranks in SVD? I thought it was the opposite.
> >
> > Cheers
> > Stefan
> >
>
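
For comparison, the no-reduction baseline from the quoted pipeline (steps 1,
3c and 5) can be sketched in a few lines of plain Python. This is only a
sketch under stated assumptions: it uses a textbook tf * log(N / df)
weighting, which is not necessarily what Lucene or Mahout compute, it splits
on whitespace instead of using a real analyzer, and the helper names and toy
documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document.

    Assumption: weight = raw term count * log(N / document frequency).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({term: count * math.log(n_docs / doc_freq[term])
                        for term, count in counts.items()})
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarities_to_first(docs):
    """Similarity from the first document to every other document."""
    vectors = tfidf_vectors(docs)
    return [cosine_similarity(vectors[0], v) for v in vectors[1:]]


if __name__ == "__main__":
    toy_docs = ["apple banana", "apple banana", "cherry date"]
    print(similarities_to_first(toy_docs))
```

Running the same comparison on the rank-reduced vectors, and checking
whether the plotted value is the similarity or one minus it, should make it
clear which of the two readings applies to the graphs.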
