Actually that's what your results are showing, aren't they? With rank 1000 the similarity avg is the lowest...
2011/6/14 Jake Mannix <[email protected]>

> actually, wait - are your graphs showing *similarity*, or *distance*? In
> higher dimensions, *distance* (and cosine angle) should grow, but on the
> other hand, *similarity* (cos(angle)) should go toward 0.
>
> On Tue, Jun 14, 2011 at 10:15 AM, Stefan Wienert <[email protected]> wrote:
>
> > Hey Guys,
> >
> > I have some strange results in my LSA-Pipeline.
> >
> > First, I explain the steps my data is making:
> > 1) Extract Term-Document-Matrix from a Lucene datastore using TFIDF as
> >    weighter
> > 2) Transposing TDM
> > 3a) Using Mahout SVD (Lanczos) with the transposed TDM
> > 3b) Using Mahout SSVD (stochastic SVD) with the transposed TDM
> > 3c) Using no dimension reduction (for testing purpose)
> > 4) Transpose result (ONLY none / svd)
> > 5) Calculating Cosine Similarity (from Mahout)
> >
> > Now... Some strange things happen:
> > First of all: The demo data shows the similarity from document 1 to
> > all other documents.
> >
> > the results using only cosine similarity (without dimension reduction):
> > http://the-lord.de/img/none.png
> >
> > the result using svd, rank 10
> > http://the-lord.de/img/svd-10.png
> > some points falling down to the bottom.
> >
> > the results using ssvd rank 10
> > http://the-lord.de/img/ssvd-10.png
> >
> > the result using svd, rank 100
> > http://the-lord.de/img/svd-100.png
> > more points falling down to the bottom.
> >
> > the results using ssvd rank 100
> > http://the-lord.de/img/ssvd-100.png
> >
> > the results using svd rank 200
> > http://the-lord.de/img/svd-200.png
> > even more points falling down to the bottom.
> >
> > the results using svd rank 1000
> > http://the-lord.de/img/svd-1000.png
> > most points are at the bottom
> >
> > please beware of the scale:
> > - the avg from none: 0,8712
> > - the avg from svd rank 10: 0,2648
> > - the avg from svd rank 100: 0,0628
> > - the avg from svd rank 200: 0,0238
> > - the avg from svd rank 1000: 0,0116
> >
> > so my question is:
> > Can you explain this behavior? Why are the documents getting more
> > equal with more ranks in svd. I thought it was the opposite.
> >
> > Cheers
> > Stefan
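To make the similarity vs. distance question concrete, here is a small NumPy sketch. It is not the Mahout pipeline from the thread. The random matrix `A`, its size, and the helper names are assumptions made up for illustration. It computes the average cosine similarity from document 1 to all other documents for a few ranks, once with document coordinates scaled by the singular values and once with the raw singular vectors. With raw singular vectors the rows become orthonormal at full rank, so the average similarity goes to 0. With scaled coordinates the full-rank value matches the unreduced one. Whether this is what happens in the Lanczos output above is only a guess.

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(angle) between a and b: 1 = same direction, 0 = orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_distance(a, b):
    """1 - cos(angle): 0 = same direction, 1 = orthogonal."""
    return 1.0 - cosine_similarity(a, b)

def avg_similarity_to_first(docs):
    """Mean cosine similarity from row 0 to every other row."""
    return float(np.mean([cosine_similarity(docs[0], d) for d in docs[1:]]))

# Hypothetical stand-in for a TF-IDF document-term matrix:
# 50 documents x 200 terms, all entries non-negative.
rng = np.random.default_rng(0)
A = rng.random((50, 200))

# Thin SVD: A = U @ diag(s) @ Vt, with U of shape (50, 50).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print("no reduction:", avg_similarity_to_first(A))
for k in (2, 10, 50):
    scaled = U[:, :k] * s[:k]   # document coordinates scaled by singular values
    unscaled = U[:, :k]         # raw left singular vectors as coordinates
    print("rank", k,
          "scaled:", avg_similarity_to_first(scaled),
          "unscaled:", avg_similarity_to_first(unscaled))
```

If the averages in the graphs are similarities, a curve that drops toward 0 as the rank grows would look like the "unscaled" column here, so checking whether the document vectors are multiplied by the singular values before step 5 might be worth a try.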
