tf-idf + svd + cosine similarity

Stefan Wienert Tue, 14 Jun 2011 10:16:21 -0700

Hey Guys,

I am getting some strange results from my LSA pipeline.


First, let me explain the steps my data goes through:
1) Extract the term-document matrix (TDM) from a Lucene datastore, using TF-IDF as the weighting
2) Transpose the TDM
3a) Run Mahout SVD (Lanczos) on the transposed TDM
3b) Run Mahout SSVD (stochastic SVD) on the transposed TDM
3c) Use no dimension reduction (for testing purposes)
4) Transpose the result (ONLY none / svd)
5) Calculate cosine similarity (from Mahout)
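
For readers who want to reproduce the idea outside Mahout, here is a minimal NumPy sketch of steps 2 to 5. It is not the Mahout job itself: the tiny matrix, the function names, and the use of `numpy.linalg.svd` in place of Lanczos/SSVD are all assumptions made for illustration. In particular, the sketch scales the document vectors by the singular values, which is one common choice but is not stated in the steps above.

```python
import numpy as np

# Hypothetical tf-idf weighted term-document matrix
# (rows = terms, columns = documents); values are made up.
tdm = np.array([
    [0.9, 0.8, 0.0, 0.0],
    [0.7, 0.9, 0.1, 0.0],
    [0.0, 0.1, 0.8, 0.9],
    [0.0, 0.0, 0.9, 0.7],
    [0.3, 0.2, 0.3, 0.2],
])

def reduce_documents(tdm, rank):
    """Steps 2-3: transpose to document-term form, keep `rank` singular triplets."""
    doc_term = tdm.T  # step 2: rows are now documents
    u, s, vt = np.linalg.svd(doc_term, full_matrices=False)
    # Assumption: document vectors are U_k scaled by the singular values.
    return u[:, :rank] * s[:rank]

def cosine_to_first(doc_vectors):
    """Step 5: cosine similarity of the first document to every document."""
    norms = np.linalg.norm(doc_vectors, axis=1)
    return doc_vectors @ doc_vectors[0] / (norms * norms[0])

full = cosine_to_first(tdm.T)                             # step 3c: no reduction
reduced = cosine_to_first(reduce_documents(tdm, rank=2))  # step 3a/3b stand-in
```

With this scaling, keeping every singular triplet reproduces the unreduced cosine similarities, which gives a simple sanity check for a pipeline like the one described.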

Now... some strange things happen:
First of all: The demo data shows the similarity from document 1 to
all other documents.

the results using only cosine similarity (without dimension reduction):
http://the-lord.de/img/none.png

the result using svd, rank 10
http://the-lord.de/img/svd-10.png
some points falling down to the bottom.

the results using ssvd rank 10
http://the-lord.de/img/ssvd-10.png

the result using svd, rank 100
http://the-lord.de/img/svd-100.png
more points falling down to the bottom.

the results using ssvd rank 100
http://the-lord.de/img/ssvd-100.png

the results using svd rank 200
http://the-lord.de/img/svd-200.png
even more points falling down to the bottom.

the results using svd rank 1000
http://the-lord.de/img/svd-1000.png
most points are at the bottom

please note the scale:
- the avg for none: 0.8712
- the avg for svd rank 10: 0.2648
- the avg for svd rank 100: 0.0628
- the avg for svd rank 200: 0.0238
- the avg for svd rank 1000: 0.0116
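
A minimal sketch of how such an average could be computed, assuming it is the mean cosine similarity from document 1 to every other document (document 1 itself excluded). The helper names and the toy vectors are hypothetical and only illustrate the arithmetic.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def average_similarity_to_first(doc_vectors):
    """Mean cosine similarity of the first document to all other documents."""
    first, others = doc_vectors[0], doc_vectors[1:]
    sims = [cosine(first, other) for other in others]
    return sum(sims) / len(sims)

# Toy document vectors: the second equals the first, the third is orthogonal.
docs = [
    [1.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
]
avg = average_similarity_to_first(docs)  # (1.0 + 0.0) / 2 = 0.5
```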

so my question is:
Can you explain this behavior? Why are the documents getting more
equal with a higher rank in the svd? I thought it would be the opposite.

Cheers
Stefan
