Actually, wait: are your graphs showing *similarity* or *distance*? In higher dimensions, *distance* (1 - cos(angle), and the angle itself) should grow, while *similarity* (cos(angle)) should go toward 0.
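
To make the distinction concrete, here is a minimal sketch in plain Python. It is not Mahout code, and the helper names (`cosine_similarity`, `cosine_distance`, `mean_abs_similarity`) are made up for illustration. It assumes random Gaussian vectors, which is not the same as TF-IDF data (TF-IDF vectors are non-negative, so their unreduced similarities sit well above 0), but it shows the general trend: as the number of dimensions grows, cos(angle) between unrelated vectors shrinks toward 0 and 1 - cos(angle) moves toward 1.

```python
import math
import random


def cosine_similarity(u, v):
    """cos(angle) between u and v: 1 = same direction, 0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """1 - cos(angle): 0 = same direction, 1 = orthogonal."""
    return 1.0 - cosine_similarity(u, v)


def mean_abs_similarity(dim, pairs=200, seed=0):
    """Average |cos(angle)| over random Gaussian vector pairs (an assumption,
    standing in for unrelated documents) in `dim` dimensions."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(pairs):
        u = [rng.gauss(0, 1) for _ in range(dim)]
        v = [rng.gauss(0, 1) for _ in range(dim)]
        total += abs(cosine_similarity(u, v))
    return total / pairs


if __name__ == "__main__":
    # The average similarity magnitude drops as the dimension grows.
    for dim in (10, 100, 1000):
        print(dim, round(mean_abs_similarity(dim), 4))
```

If the plots show the first quantity, averages falling toward 0 at higher rank are what one would expect; if they show the second, the same data would climb toward 1 instead.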

On Tue, Jun 14, 2011 at 10:15 AM, Stefan Wienert <[email protected]> wrote:
> Hey Guys,
>
> I have some strange results in my LSA-Pipeline.
>
> First, I explain the steps my data is making:
> 1) Extract Term-Document-Matrix from a Lucene datastore using TFIDF as weighter
> 2) Transposing TDM
> 3a) Using Mahout SVD (Lanczos) with the transposed TDM
> 3b) Using Mahout SSVD (stochastic SVD) with the transposed TDM
> 3c) Using no dimension reduction (for testing purpose)
> 4) Transpose result (ONLY none / svd)
> 5) Calculating Cosine Similarity (from Mahout)
>
> Now... Some strange things happen:
> First of all: The demo data shows the similarity from document 1 to all other documents.
>
> the results using only cosine similarity (without dimension reduction):
> http://the-lord.de/img/none.png
>
> the result using svd, rank 10
> http://the-lord.de/img/svd-10.png
> some points falling down to the bottom.
>
> the results using ssvd rank 10
> http://the-lord.de/img/ssvd-10.png
>
> the result using svd, rank 100
> http://the-lord.de/img/svd-100.png
> more points falling down to the bottom.
>
> the results using ssvd rank 100
> http://the-lord.de/img/ssvd-100.png
>
> the results using svd rank 200
> http://the-lord.de/img/svd-200.png
> even more points falling down to the bottom.
>
> the results using svd rank 1000
> http://the-lord.de/img/svd-1000.png
> most points are at the bottom
>
> please beware of the scale:
> - the avg from none: 0,8712
> - the avg from svd rank 10: 0,2648
> - the avg from svd rank 100: 0,0628
> - the avg from svd rank 200: 0,0238
> - the avg from svd rank 1000: 0,0116
>
> so my question is:
> Can you explain this behavior? Why are the documents getting more equal with more ranks in svd. I thought it was the opposite.
>
> Cheers
> Stefan
