Hey guys, I am getting some strange results from my LSA pipeline.
First, let me explain the steps my data goes through:

1) Extract the term-document matrix (TDM) from a Lucene datastore, using TF-IDF as the weighting
2) Transpose the TDM
3a) Run Mahout SVD (Lanczos) on the transposed TDM
3b) Run Mahout SSVD (stochastic SVD) on the transposed TDM
3c) Use no dimension reduction (for testing purposes)
4) Transpose the result (ONLY for none / SVD)
5) Calculate the cosine similarity (from Mahout)

Now some strange things happen. First of all: the demo data shows the similarity of document 1 to all other documents.

Results using only cosine similarity (without dimension reduction):
http://the-lord.de/img/none.png

Results using SVD, rank 10:
http://the-lord.de/img/svd-10.png
Some points fall down to the bottom.

Results using SSVD, rank 10:
http://the-lord.de/img/ssvd-10.png

Results using SVD, rank 100:
http://the-lord.de/img/svd-100.png
More points fall down to the bottom.

Results using SSVD, rank 100:
http://the-lord.de/img/ssvd-100.png

Results using SVD, rank 200:
http://the-lord.de/img/svd-200.png
Even more points fall down to the bottom.

Results using SVD, rank 1000:
http://the-lord.de/img/svd-1000.png
Most points are at the bottom.

Please mind the scale of the plots:
- avg for none: 0.8712
- avg for SVD rank 10: 0.2648
- avg for SVD rank 100: 0.0628
- avg for SVD rank 200: 0.0238
- avg for SVD rank 1000: 0.0116

So my question is: can you explain this behavior? Why do the documents become more equal as the SVD rank increases? I thought it would be the opposite.

Cheers
Stefan
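
P.S. To make clear what the plots and averages measure, here is a minimal plain-Python sketch of step 5: the cosine similarity of document 1 to every other document, plus the average of those values. This is not the Mahout code; the function names and the toy vectors are hypothetical and only illustrate the computation, assuming one vector per document (rows are documents, columns are terms or reduced dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 if either vector is all zeros (an assumption made
    here to avoid division by zero).
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarities_to_first(doc_vectors):
    """Similarity of document 1 (row 0) to each of the other rows."""
    first = doc_vectors[0]
    return [cosine_similarity(first, other) for other in doc_vectors[1:]]


def average(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Hypothetical toy data: three documents, three dimensions.
docs = [
    [1.0, 0.0, 2.0],  # document 1
    [2.0, 0.0, 4.0],  # same direction as document 1
    [0.0, 3.0, 0.0],  # orthogonal to document 1
]

sims = similarities_to_first(docs)
print(sims, average(sims))
```

Each plot above corresponds to the list `sims` for one configuration, and each "avg" value to `average(sims)`.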
