Dmitriy, be aware: there was a bug in none.png... I deleted it a minute ago, see http://the-lord.de/img/beispielwerte.pdf for better results.
First... U and V are the singular vectors, not the eigenvectors ;) Lanczos-SVD in Mahout computes the eigenvectors of M*M (it multiplies the input matrix with its transpose). In fact, I don't need U, just V, so I need to transpose M (because the eigenvectors of MM* = V).

So... normalizing the eigenvectors: isn't the cosine similarity doing this anyway, i.e. ignoring the length of the vectors?
http://en.wikipedia.org/wiki/Cosine_similarity

My parameters for ssvd:
--rank 100
--oversampling 10
--blockHeight 227
--computeU false
--input --output
The rest should be on default. Actually I do not really know what this oversampling parameter means...

2011/6/14 Dmitriy Lyubimov <dlie...@gmail.com>:
> Interesting.
>
> (I have one confusion of mine RE: lanczos -- is it computing U
> eigenvectors or V? The doc says "eigenvectors" but doesn't say left or
> right. If it's V (right eigenvectors), this sequence should be fine.)
>
> With ssvd I don't do a transpose, I just do the computation of U, which will
> produce document singular vectors directly.
>
> Also, I am not sure that Lanczos actually normalizes the eigenvectors,
> but SSVD does (or multiplies the normalized version by the square root of a
> singular value, whichever is requested). So depending on which space
> your rotated results are in, cosine similarities may be different. I assume
> you used normalized (true) eigenvectors from ssvd.
>
> Also it would be interesting to know which oversampling parameter (p) you
> used.
>
> Thanks.
> -d
>
>
> On Tue, Jun 14, 2011 at 2:04 PM, Stefan Wienert <ste...@wienert.cc> wrote:
>> So... let's check the dimensions:
>>
>> First step: Lucene output:
>> 227 rows (=docs) and 107909 cols (=terms)
>>
>> transposed to:
>> 107909 rows and 227 cols
>>
>> reduced with svd (rank 100) to:
>> 99 rows and 227 cols
>>
>> transposed to: (actually there was a bug (with no effect on the SVD
>> result, but on the NONE result))
>> 227 rows and 99 cols
>>
>> So... now the cosine results are very similar to SVD 200.
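A quick sanity check on the normalization question raised above: cosine similarity divides by both vectors' norms, so any rescaling of an eigenvector cancels out of the score. A minimal plain-Java sketch (not Mahout's actual implementation):

```java
// Minimal sketch: cosine similarity is invariant to scaling either input,
// which is why (non-)normalization of eigenvectors alone should not change
// cosine scores -- only mixing in sqrt(singular value) weights would.
public class CosineScaleInvariance {

    // cosine(a, b) = a.b / (|a| * |b|)
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3};
        double[] b = {4, 5, 6};
        double[] aScaled = {10, 20, 30}; // a multiplied by 10

        // both calls print the same value: vector length is ignored
        System.out.println(cosine(a, b));
        System.out.println(cosine(aScaled, b));
    }
}
```

Note that this invariance only holds per vector pair; if SSVD multiplies each eigenvector by the square root of its singular value, the *relative* weighting of the k dimensions changes, and cosine scores can differ from the plain-eigenvector case.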
>>
>> Results are added.
>>
>> @Sebastian: I will check if the bug affects my results.
>>
>> 2011/6/14 Fernando Fernández <fernando.fernandez.gonza...@gmail.com>:
>>> Hi Stefan,
>>>
>>> Are you sure you need to transpose the input matrix? I thought that what you
>>> get from the Lucene index was already a document(rows)-term(columns) matrix, but
>>> you say that you obtain a term-document matrix and transpose it. Is this
>>> correct? What are you using to obtain this matrix from Lucene? Is it
>>> possible that you are calculating similarities with the wrong matrix in one
>>> of the two cases? (With/without dimension reduction.)
>>>
>>> Best,
>>> Fernando.
>>>
>>> 2011/6/14 Sebastian Schelter <s...@apache.org>
>>>
>>>> Hi Stefan,
>>>>
>>>> I checked the implementation of RowSimilarityJob and we might still have a
>>>> bug in the 0.5 release... (f**k). I don't know if your problem is caused by
>>>> that, but the similarity scores might not be correct...
>>>>
>>>> We had this issue in 0.4 already, when someone realized that cooccurrences
>>>> were mapped out inconsistently, so for 0.5 we made sure that we always map
>>>> the smaller row as the first value. But apparently I did not adjust the value
>>>> setting for the Cooccurrence object...
>>>>
>>>> In 0.5 the code is:
>>>>
>>>> if (rowA <= rowB) {
>>>>   rowPair.set(rowA, rowB, weightA, weightB);
>>>> } else {
>>>>   rowPair.set(rowB, rowA, weightB, weightA);
>>>> }
>>>> coocurrence.set(column.get(), valueA, valueB);
>>>>
>>>> But it should be (already fixed in the current trunk some days ago):
>>>>
>>>> if (rowA <= rowB) {
>>>>   rowPair.set(rowA, rowB, weightA, weightB);
>>>>   coocurrence.set(column.get(), valueA, valueB);
>>>> } else {
>>>>   rowPair.set(rowB, rowA, weightB, weightA);
>>>>   coocurrence.set(column.get(), valueB, valueA);
>>>> }
>>>>
>>>> Maybe you could rerun your test with the current trunk?
>>>>
>>>> --sebastian
>>>>
>>>>
>>>> On 14.06.2011 20:54, Sean Owen wrote:
>>>>
>>>>> It is a similarity, not a distance.
>>>>> Higher values mean more
>>>>> similarity, not less.
>>>>>
>>>>> I agree that similarity ought to decrease with more dimensions. That
>>>>> is what you observe -- except that you see quite high average
>>>>> similarity with no dimension reduction!
>>>>>
>>>>> An average cosine similarity of 0.87 sounds "high" to me for anything
>>>>> but a few dimensions. What's the dimensionality of the input without
>>>>> dimension reduction?
>>>>>
>>>>> Something is amiss in this pipeline. It is an interesting question!
>>>>>
>>>>> On Tue, Jun 14, 2011 at 7:39 PM, Stefan Wienert <ste...@wienert.cc> wrote:
>>>>>
>>>>>> Actually I'm using RowSimilarityJob() with
>>>>>> --input input
>>>>>> --output output
>>>>>> --numberOfColumns documentCount
>>>>>> --maxSimilaritiesPerRow documentCount
>>>>>> --similarityClassname SIMILARITY_UNCENTERED_COSINE
>>>>>>
>>>>>> Actually I am not really sure what this SIMILARITY_UNCENTERED_COSINE
>>>>>> calculates...
>>>>>> The source says: "distributed implementation of cosine similarity that
>>>>>> does not center its data"
>>>>>>
>>>>>> So... this seems to be the similarity and not the distance?
>>>>>>
>>>>>> Cheers,
>>>>>> Stefan
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2011/6/14 Stefan Wienert <ste...@wienert.cc>:
>>>>>>
>>>>>>> But... why do I get the different results with cosine similarity with
>>>>>>> no dimension reduction (with 100,000 dimensions)?
>>>>>>>
>>>>>>> 2011/6/14 Fernando Fernández <fernando.fernandez.gonza...@gmail.com>:
>>>>>>>
>>>>>>>> Actually that's what your results are showing, aren't they? With rank
>>>>>>>> 1000 the similarity avg is the lowest...
>>>>>>>>
>>>>>>>>
>>>>>>>> 2011/6/14 Jake Mannix <jake.man...@gmail.com>
>>>>>>>>
>>>>>>>>> Actually, wait - are your graphs showing *similarity*, or *distance*?
>>>>>>>>> In higher dimensions, *distance* (and the angle) should grow, but on
>>>>>>>>> the other hand, *similarity* (cos(angle)) should go toward 0.
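Jake's point about high dimensions can be illustrated numerically: independent random vectors become nearly orthogonal as the dimension grows, so their average cosine similarity shrinks (roughly like 1/sqrt(d)). A small standalone sketch, not related to the Mahout code:

```java
import java.util.Random;

// Sketch: average |cosine similarity| of random Gaussian vector pairs
// drops as the dimension d grows -- the "curse of dimensionality" effect
// being discussed in this thread.
public class CosineVsDimension {

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // average |cosine| over random Gaussian vector pairs of dimension d
    static double avgAbsCosine(int d, int pairs, Random rnd) {
        double sum = 0;
        for (int p = 0; p < pairs; p++) {
            double[] a = new double[d], b = new double[d];
            for (int i = 0; i < d; i++) {
                a[i] = rnd.nextGaussian();
                b[i] = rnd.nextGaussian();
            }
            sum += Math.abs(cosine(a, b));
        }
        return sum / pairs;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42); // fixed seed for repeatability
        for (int d : new int[] {2, 10, 100, 1000}) {
            System.out.printf("d=%4d  avg |cos| = %.4f%n",
                    d, avgAbsCosine(d, 200, rnd));
        }
        // the average shrinks as d grows: roughly 0.6 at d=2,
        // down to a few hundredths at d=1000
    }
}
```

Note this models *random* data; real document vectors are correlated, which is why Stefan's unreduced 107909-dimensional input showing an average of 0.87 is surprising and hints at a pipeline problem rather than a dimensionality effect.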
>>>>>>>>>
>>>>>>>>> On Tue, Jun 14, 2011 at 10:15 AM, Stefan Wienert <ste...@wienert.cc> wrote:
>>>>>>>>>
>>>>>>>>>> Hey guys,
>>>>>>>>>>
>>>>>>>>>> I have some strange results in my LSA pipeline.
>>>>>>>>>>
>>>>>>>>>> First, I explain the steps my data goes through:
>>>>>>>>>> 1) Extract the term-document matrix from a Lucene datastore, using
>>>>>>>>>> TF-IDF as the weighting
>>>>>>>>>> 2) Transpose the TDM
>>>>>>>>>> 3a) Using Mahout SVD (Lanczos) with the transposed TDM
>>>>>>>>>> 3b) Using Mahout SSVD (stochastic SVD) with the transposed TDM
>>>>>>>>>> 3c) Using no dimension reduction (for testing purposes)
>>>>>>>>>> 4) Transpose the result (ONLY none / svd)
>>>>>>>>>> 5) Calculate the cosine similarity (from Mahout)
>>>>>>>>>>
>>>>>>>>>> Now... some strange things happen:
>>>>>>>>>> First of all: the demo data shows the similarity from document 1 to
>>>>>>>>>> all other documents.
>>>>>>>>>>
>>>>>>>>>> The results using only cosine similarity (without dimension reduction):
>>>>>>>>>> http://the-lord.de/img/none.png
>>>>>>>>>>
>>>>>>>>>> The result using svd, rank 10:
>>>>>>>>>> http://the-lord.de/img/svd-10.png
>>>>>>>>>> Some points falling down to the bottom.
>>>>>>>>>>
>>>>>>>>>> The results using ssvd, rank 10:
>>>>>>>>>> http://the-lord.de/img/ssvd-10.png
>>>>>>>>>>
>>>>>>>>>> The result using svd, rank 100:
>>>>>>>>>> http://the-lord.de/img/svd-100.png
>>>>>>>>>> More points falling down to the bottom.
>>>>>>>>>>
>>>>>>>>>> The results using ssvd, rank 100:
>>>>>>>>>> http://the-lord.de/img/ssvd-100.png
>>>>>>>>>>
>>>>>>>>>> The results using svd, rank 200:
>>>>>>>>>> http://the-lord.de/img/svd-200.png
>>>>>>>>>> Even more points falling down to the bottom.
>>>>>>>>>>
>>>>>>>>>> The results using svd, rank 1000:
>>>>>>>>>> http://the-lord.de/img/svd-1000.png
>>>>>>>>>> Most points are at the bottom.
>>>>>>>>>>
>>>>>>>>>> Please beware of the scale:
>>>>>>>>>> - the avg from none: 0.8712
>>>>>>>>>> - the avg from svd rank 10: 0.2648
>>>>>>>>>> - the avg from svd rank 100: 0.0628
>>>>>>>>>> - the avg from svd rank 200: 0.0238
>>>>>>>>>> - the avg from svd rank 1000: 0.0116
>>>>>>>>>>
>>>>>>>>>> So my question is:
>>>>>>>>>> Can you explain this behavior? Why are the documents getting more
>>>>>>>>>> equal with more ranks in svd? I thought it was the opposite.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Stefan
>>>>>>>
>>>>>>> --
>>>>>>> Stefan Wienert
>>>>>>>
>>>>>>> http://www.wienert.cc
>>>>>>> ste...@wienert.cc
>>>>>>>
>>>>>>> Telefon: +495251-2026838
>>>>>>> Mobil: +49176-40170270

--
Stefan Wienert

http://www.wienert.cc
ste...@wienert.cc

Telefon: +495251-2026838
Mobil: +49176-40170270
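Sebastian's RowSimilarityJob fix quoted earlier in the thread boils down to one invariant: canonicalizing a cooccurrence so the smaller row comes first must swap the values together with the rows, otherwise the same observation yields two different records depending on input order. A minimal sketch of the corrected logic in plain Java (hypothetical names, not Mahout's actual classes):

```java
// Sketch of the ordering invariant behind the 0.5 RowSimilarityJob fix:
// a cooccurrence (rowA, rowB, valueA, valueB) seen in either order must
// canonicalize to the identical record.
public class CooccurrenceOrdering {

    // returns {firstRow, secondRow, firstValue, secondValue}
    static double[] canonicalize(int rowA, int rowB, double valueA, double valueB) {
        if (rowA <= rowB) {
            return new double[] {rowA, rowB, valueA, valueB};
        } else {
            // the 0.5 bug kept (valueA, valueB) here even though the rows
            // were swapped; the fix swaps the values along with the rows
            return new double[] {rowB, rowA, valueB, valueA};
        }
    }

    public static void main(String[] args) {
        double[] ab = canonicalize(7, 3, 0.5, 2.0);
        double[] ba = canonicalize(3, 7, 2.0, 0.5);
        // identical records regardless of input order
        System.out.println(java.util.Arrays.equals(ab, ba)); // prints "true"
    }
}
```

With the 0.5 bug, roughly half the cooccurrences (those arriving with rowA > rowB) pair each value with the wrong row, which corrupts the dot products and could plausibly contribute to the odd unreduced-cosine averages seen in this thread.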