I'm using RowSimilarityJob() with the following options: --input input --output output --numberOfColumns documentCount --maxSimilaritiesPerRow documentCount --similarityClassname SIMILARITY_UNCENTERED_COSINE
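For readers unsure what an uncentered cosine measure computes, here is a minimal Python sketch. It illustrates the textbook formulas under stated assumptions and is not Mahout's implementation: the function names (uncentered_cosine, centered_cosine, cosine_distance) and the sample vectors are hypothetical, and returning 0 for an all-zero vector is a choice made here only to avoid dividing by zero.

```python
import math


def uncentered_cosine(a, b):
    """Cosine of the angle between a and b, computed on the raw values."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption for this sketch: an all-zero vector has similarity 0.
        return 0.0
    return dot / (norm_a * norm_b)


def centered_cosine(a, b):
    """Same formula after subtracting each vector's own mean (for contrast)."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return uncentered_cosine([x - mean_a for x in a],
                             [y - mean_b for y in b])


def cosine_distance(a, b):
    """One common convention: distance = 1 - similarity."""
    return 1.0 - uncentered_cosine(a, b)


# Hypothetical toy document vectors.
doc1 = [1.0, 2.0, 0.0]
doc2 = [2.0, 4.0, 0.0]  # same direction as doc1
doc3 = [0.0, 0.0, 5.0]  # shares no non-zero dimension with doc1

print("similarity doc1/doc2:", uncentered_cosine(doc1, doc2))
print("similarity doc1/doc3:", uncentered_cosine(doc1, doc3))
print("distance   doc1/doc3:", cosine_distance(doc1, doc3))
print("centered   doc1/doc3:", centered_cosine(doc1, doc3))
```

Under this reading, a value near 1 means two documents point in nearly the same direction, a value near 0 means they share almost nothing, and the corresponding distance is 1 minus the similarity. Centering changes the result, which is why the distinction in the class name matters.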
I am not really sure what SIMILARITY_UNCENTERED_COSINE calculates. The source says: "distributed implementation of cosine similarity that does not center its data".

So this seems to be the similarity and not the distance?

Cheers,
Stefan

2011/6/14 Stefan Wienert <ste...@wienert.cc>:
> But why do I get different results with cosine similarity and
> no dimension reduction (with 100,000 dimensions)?
>
> 2011/6/14 Fernando Fernández <fernando.fernandez.gonza...@gmail.com>:
>> Actually, that's what your results are showing, aren't they? With rank 1000
>> the similarity avg is the lowest...
>>
>> 2011/6/14 Jake Mannix <jake.man...@gmail.com>
>>
>>> Actually, wait - are your graphs showing *similarity*, or *distance*? In
>>> higher dimensions, *distance* (and cosine angle) should grow, but on the
>>> other hand, *similarity* (1-cos(angle)) should go toward 0.
>>>
>>> On Tue, Jun 14, 2011 at 10:15 AM, Stefan Wienert <ste...@wienert.cc>
>>> wrote:
>>>
>>> > Hey guys,
>>> >
>>> > I have some strange results in my LSA pipeline.
>>> >
>>> > First, let me explain the steps my data goes through:
>>> > 1) Extract the term-document matrix (TDM) from a Lucene datastore,
>>> >    using TF-IDF as the weighting
>>> > 2) Transpose the TDM
>>> > 3a) Use Mahout SVD (Lanczos) with the transposed TDM
>>> > 3b) Use Mahout SSVD (stochastic SVD) with the transposed TDM
>>> > 3c) Use no dimension reduction (for testing purposes)
>>> > 4) Transpose the result (ONLY none / svd)
>>> > 5) Calculate the cosine similarity (from Mahout)
>>> >
>>> > Now some strange things happen.
>>> > First of all: the demo data shows the similarity from document 1 to
>>> > all other documents.
>>> >
>>> > The results using only cosine similarity (without dimension reduction):
>>> > http://the-lord.de/img/none.png
>>> >
>>> > The results using svd, rank 10:
>>> > http://the-lord.de/img/svd-10.png
>>> > Some points are falling down to the bottom.
>>> >
>>> > The results using ssvd, rank 10:
>>> > http://the-lord.de/img/ssvd-10.png
>>> >
>>> > The results using svd, rank 100:
>>> > http://the-lord.de/img/svd-100.png
>>> > More points are falling down to the bottom.
>>> >
>>> > The results using ssvd, rank 100:
>>> > http://the-lord.de/img/ssvd-100.png
>>> >
>>> > The results using svd, rank 200:
>>> > http://the-lord.de/img/svd-200.png
>>> > Even more points are falling down to the bottom.
>>> >
>>> > The results using svd, rank 1000:
>>> > http://the-lord.de/img/svd-1000.png
>>> > Most points are at the bottom.
>>> >
>>> > Please note the scale:
>>> > - the avg from none: 0.8712
>>> > - the avg from svd rank 10: 0.2648
>>> > - the avg from svd rank 100: 0.0628
>>> > - the avg from svd rank 200: 0.0238
>>> > - the avg from svd rank 1000: 0.0116
>>> >
>>> > So my question is:
>>> > Can you explain this behavior? Why are the documents getting more
>>> > equal with more ranks in SVD? I thought it was the opposite.
>>> >
>>> > Cheers
>>> > Stefan

--
Stefan Wienert

http://www.wienert.cc
ste...@wienert.cc

Phone: +495251-2026838
Mobile: +49176-40170270