It is a similarity, not a distance. Higher values mean more similarity, not less.

I agree that similarity ought to decrease with more dimensions. That is what you observe -- except that you see quite high average similarity with no dimension reduction! An average cosine similarity of 0.87 sounds "high" to me for anything but a few dimensions. What's the dimensionality of the input without dimension reduction? Something is amiss in this pipeline. It is an interesting question!
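For concreteness, here is what the two quantities look like side by side. This is a minimal sketch in plain Java, not Mahout's implementation, and the example vectors are made up:

    public class CosineDemo {

        // Cosine *similarity*: cos(angle) = dot(a, b) / (||a|| * ||b||).
        // 1.0 means identical direction, 0.0 means orthogonal.
        static double cosineSimilarity(double[] a, double[] b) {
            double dot = 0, normA = 0, normB = 0;
            for (int i = 0; i < a.length; i++) {
                dot   += a[i] * b[i];
                normA += a[i] * a[i];
                normB += b[i] * b[i];
            }
            return dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }

        // Cosine *distance* is the complement: it grows as vectors diverge.
        static double cosineDistance(double[] a, double[] b) {
            return 1.0 - cosineSimilarity(a, b);
        }

        public static void main(String[] args) {
            double[] a = {1, 0, 2, 0, 0, 3};
            double[] b = {0, 1, 2, 0, 1, 0};
            System.out.println("similarity = " + cosineSimilarity(a, b)); // ~0.44
            System.out.println("distance   = " + cosineDistance(a, b));   // ~0.56
        }
    }

So a plot hovering around 0.87 is saying "almost parallel", not "far apart".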
On Tue, Jun 14, 2011 at 7:39 PM, Stefan Wienert <[email protected]> wrote:
> Actually I'm using RowSimilarityJob() with
> --input input
> --output output
> --numberOfColumns documentCount
> --maxSimilaritiesPerRow documentCount
> --similarityClassname SIMILARITY_UNCENTERED_COSINE
>
> Actually I am not really sure what this SIMILARITY_UNCENTERED_COSINE
> calculates... The source says: "distributed implementation of cosine
> similarity that does not center its data"
>
> So... this seems to be the similarity and not the distance?
>
> Cheers,
> Stefan
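(Re: what SIMILARITY_UNCENTERED_COSINE calculates -- my reading of that javadoc line, not Mahout's actual code: "uncentered" cosine works on the raw values, while a *centered* variant would first subtract each vector's mean, which turns the cosine into the Pearson correlation. A standalone sketch of the difference, with made-up vectors:

    public class CenteringDemo {

        static double cosine(double[] a, double[] b) {
            double dot = 0, na = 0, nb = 0;
            for (int i = 0; i < a.length; i++) {
                dot += a[i] * b[i];
                na  += a[i] * a[i];
                nb  += b[i] * b[i];
            }
            return dot / (Math.sqrt(na) * Math.sqrt(nb));
        }

        // Subtract the vector's own mean from each component.
        static double[] center(double[] v) {
            double mean = 0;
            for (double x : v) mean += x;
            mean /= v.length;
            double[] c = new double[v.length];
            for (int i = 0; i < v.length; i++) c[i] = v[i] - mean;
            return c;
        }

        public static void main(String[] args) {
            double[] a = {3, 1, 4, 1, 5};
            double[] b = {2, 7, 1, 8, 2};
            // Uncentered: cosine on the raw values.
            System.out.println("uncentered = " + cosine(a, b));                   // ~0.44
            // Centered: cosine on mean-subtracted values = Pearson correlation.
            System.out.println("centered   = " + cosine(center(a), center(b)));   // ~-0.91
        }
    }

Either way it is a similarity, which matches the answer at the top of this thread.)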
> 2011/6/14 Stefan Wienert <[email protected]>:
>> But... why do I get such different results with cosine similarity and
>> no dimension reduction (with 100,000 dimensions)?
>>
>> 2011/6/14 Fernando Fernández <[email protected]>:
>>> Actually, that's what your results are showing, aren't they? With rank 1000
>>> the similarity avg is the lowest...
>>>
>>> 2011/6/14 Jake Mannix <[email protected]>:
>>>> Actually, wait - are your graphs showing *similarity*, or *distance*? In
>>>> higher dimensions, *distance* (1 - cos(angle)) should grow, and so should
>>>> the angle itself, but, on the other hand, *similarity* (cos(angle)) should
>>>> go toward 0.
>>>>
>>>> On Tue, Jun 14, 2011 at 10:15 AM, Stefan Wienert <[email protected]> wrote:
>>>> > Hey guys,
>>>> >
>>>> > I have some strange results in my LSA pipeline.
>>>> >
>>>> > First, I explain the steps my data goes through:
>>>> > 1) Extract the term-document matrix (TDM) from a Lucene datastore,
>>>> >    using TF-IDF weighting
>>>> > 2) Transpose the TDM
>>>> > 3a) Run Mahout SVD (Lanczos) on the transposed TDM
>>>> > 3b) Run Mahout SSVD (stochastic SVD) on the transposed TDM
>>>> > 3c) Use no dimension reduction (for testing purposes)
>>>> > 4) Transpose the result (ONLY none / svd)
>>>> > 5) Calculate cosine similarity (from Mahout)
>>>> >
>>>> > Now some strange things happen.
>>>> > First of all: the demo data shows the similarity from document 1 to
>>>> > all other documents.
>>>> >
>>>> > The results using only cosine similarity (without dimension reduction):
>>>> > http://the-lord.de/img/none.png
>>>> >
>>>> > The result using SVD, rank 10:
>>>> > http://the-lord.de/img/svd-10.png
>>>> > Some points fall down to the bottom.
>>>> >
>>>> > The results using SSVD, rank 10:
>>>> > http://the-lord.de/img/ssvd-10.png
>>>> >
>>>> > The result using SVD, rank 100:
>>>> > http://the-lord.de/img/svd-100.png
>>>> > More points fall down to the bottom.
>>>> >
>>>> > The results using SSVD, rank 100:
>>>> > http://the-lord.de/img/ssvd-100.png
>>>> >
>>>> > The results using SVD, rank 200:
>>>> > http://the-lord.de/img/svd-200.png
>>>> > Even more points fall down to the bottom.
>>>> >
>>>> > The results using SVD, rank 1000:
>>>> > http://the-lord.de/img/svd-1000.png
>>>> > Most points are at the bottom.
>>>> >
>>>> > Please note the scale:
>>>> > - the avg for none: 0.8712
>>>> > - the avg for svd rank 10: 0.2648
>>>> > - the avg for svd rank 100: 0.0628
>>>> > - the avg for svd rank 200: 0.0238
>>>> > - the avg for svd rank 1000: 0.0116
>>>> >
>>>> > So my question is:
>>>> > Can you explain this behavior? Why are the documents getting more equal
>>>> > with more ranks in SVD? I thought it would be the opposite.
>>>> >
>>>> > Cheers,
>>>> > Stefan
>>
>> --
>> Stefan Wienert
>>
>> http://www.wienert.cc
>> [email protected]
>>
>> Telefon: +495251-2026838
>> Mobil: +49176-40170270
>
> --
> Stefan Wienert
>
> http://www.wienert.cc
> [email protected]
>
> Telefon: +495251-2026838
> Mobil: +49176-40170270
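P.S. One more data point on the rank question: the average cosine similarity between random directions shrinks as the dimension grows, roughly like 1/sqrt(d). Here is a standalone sketch of that effect, with the assumption (mine, not from the thread) that i.i.d. Gaussian vectors are a crude stand-in for the reduced document vectors:

    import java.util.Random;

    public class DimensionDemo {

        // Same helper as in the CosineDemo sketch above.
        static double cosine(double[] a, double[] b) {
            double dot = 0, na = 0, nb = 0;
            for (int i = 0; i < a.length; i++) {
                dot += a[i] * b[i];
                na  += a[i] * a[i];
                nb  += b[i] * b[i];
            }
            return dot / (Math.sqrt(na) * Math.sqrt(nb));
        }

        public static void main(String[] args) {
            Random rnd = new Random(42);
            // Expected: avg |cos| ~ sqrt(2 / (pi * d)),
            // i.e. ~0.25 at d=10, ~0.08 at d=100, ~0.025 at d=1000.
            for (int d : new int[] {10, 100, 1000}) {
                int pairs = 2000;
                double sum = 0;
                for (int p = 0; p < pairs; p++) {
                    double[] a = new double[d], b = new double[d];
                    for (int i = 0; i < d; i++) {
                        a[i] = rnd.nextGaussian();
                        b[i] = rnd.nextGaussian();
                    }
                    sum += Math.abs(cosine(a, b));
                }
                System.out.printf("d = %4d  avg |cos| ~ %.4f%n", d, sum / pairs);
            }
        }
    }

Those baselines are in the same ballpark as the rank-10 and rank-100 averages reported above, which fits the reading in the replies: with more retained dimensions there is simply more room for document vectors to be nearly orthogonal.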
