That actually looks more like it: not so many documents are similar to a randomly picked one.
On Tue, Jun 14, 2011 at 3:09 PM, Stefan Wienert <[email protected]> wrote:
> Dmitriy, be aware: there was a bug in none.png... I deleted it a minute ago, see
> http://the-lord.de/img/beispielwerte.pdf
> for better results.
>
> First... U and V are the singular vectors, not the eigenvectors ;)
>
> Lanczos-SVD in Mahout computes the eigenvectors of M*M (it
> multiplies the input matrix with the transposed one).
>
> In fact, I don't need U, just V, so I need to transpose M (because
> the eigenvectors of MM* = V).
>
> So... normalizing the eigenvectors: doesn't cosine similarity do this
> already, i.e. ignore the length of the vectors?
> http://en.wikipedia.org/wiki/Cosine_similarity
>
> My parameters for ssvd:
> --rank 100
> --oversampling 10
> --blockHeight 227
> --computeU false
> --input
> --output
>
> The rest should be at the defaults.
>
> Actually, I do not really know what this oversampling parameter means...
>
> 2011/6/14 Dmitriy Lyubimov <[email protected]>:
>> Interesting.
>>
>> (I have one confusion of mine re: Lanczos -- is it computing U
>> eigenvectors or V? The doc says "eigenvectors" but doesn't say left or
>> right. If it's V (right eigenvectors), this sequence should be fine.)
>>
>> With SSVD I don't do a transpose, I just compute U, which
>> produces the document singular vectors directly.
>>
>> Also, I am not sure that Lanczos actually normalizes the eigenvectors,
>> but SSVD does (or multiplies the normalized version by the square root
>> of a singular value, whichever is requested). So depending on which
>> space your rotated results are in, cosine similarities may be
>> different. I assume you used normalized (true) eigenvectors from SSVD.
>>
>> Also, it would be interesting to know which oversampling parameter (p)
>> you used.
>>
>> Thanks.
>> -d
>>
>>
>> On Tue, Jun 14, 2011 at 2:04 PM, Stefan Wienert <[email protected]> wrote:
>>> So...
>>> let's check the dimensions:
>>>
>>> First step, the Lucene output:
>>> 227 rows (= docs) and 107909 cols (= terms)
>>>
>>> transposed to:
>>> 107909 rows and 227 cols
>>>
>>> reduced with SVD (rank 100) to:
>>> 99 rows and 227 cols
>>>
>>> transposed to (actually there was a bug here, with no effect on the
>>> SVD result but on the NONE result):
>>> 227 rows and 99 cols
>>>
>>> So... now the cosine results are very similar to SVD 200.
>>>
>>> Results are added.
>>>
>>> @Sebastian: I will check whether the bug affects my results.
>>>
>>> 2011/6/14 Fernando Fernández <[email protected]>:
>>>> Hi Stefan,
>>>>
>>>> Are you sure you need to transpose the input matrix? I thought that
>>>> what you get from the Lucene index was already a document(rows)-
>>>> term(columns) matrix, but you say that you obtain a term-document
>>>> matrix and transpose it. Is this correct? What are you using to
>>>> obtain this matrix from Lucene? Is it possible that you are
>>>> calculating similarities with the wrong matrix in one of the two
>>>> cases (with/without dimension reduction)?
>>>>
>>>> Best,
>>>> Fernando.
>>>>
>>>> 2011/6/14 Sebastian Schelter <[email protected]>
>>>>
>>>>> Hi Stefan,
>>>>>
>>>>> I checked the implementation of RowSimilarityJob and we might still
>>>>> have a bug in the 0.5 release... (f**k). I don't know if your
>>>>> problem is caused by that, but the similarity scores might not be
>>>>> correct...
>>>>>
>>>>> We had this issue in 0.4 already, when someone realized that
>>>>> cooccurrences were mapped out inconsistently, so for 0.5 we made
>>>>> sure that we always map the smaller row as the first value. But
>>>>> apparently I did not adjust the value setting for the Cooccurrence
>>>>> object...
>>>>>
>>>>> In 0.5 the code is:
>>>>>
>>>>> if (rowA <= rowB) {
>>>>>   rowPair.set(rowA, rowB, weightA, weightB);
>>>>> } else {
>>>>>   rowPair.set(rowB, rowA, weightB, weightA);
>>>>> }
>>>>> coocurrence.set(column.get(), valueA, valueB);
>>>>>
>>>>> But it should be (already fixed in the current trunk some days ago):
>>>>>
>>>>> if (rowA <= rowB) {
>>>>>   rowPair.set(rowA, rowB, weightA, weightB);
>>>>>   coocurrence.set(column.get(), valueA, valueB);
>>>>> } else {
>>>>>   rowPair.set(rowB, rowA, weightB, weightA);
>>>>>   coocurrence.set(column.get(), valueB, valueA);
>>>>> }
>>>>>
>>>>> Maybe you could rerun your test with the current trunk?
>>>>>
>>>>> --sebastian
>>>>>
>>>>>
>>>>> On 14.06.2011 20:54, Sean Owen wrote:
>>>>>
>>>>>> It is a similarity, not a distance. Higher values mean more
>>>>>> similarity, not less.
>>>>>>
>>>>>> I agree that similarity ought to decrease with more dimensions.
>>>>>> That is what you observe -- except that you see quite high average
>>>>>> similarity with no dimension reduction!
>>>>>>
>>>>>> An average cosine similarity of 0.87 sounds "high" to me for
>>>>>> anything but a few dimensions. What's the dimensionality of the
>>>>>> input without dimension reduction?
>>>>>>
>>>>>> Something is amiss in this pipeline. It is an interesting question!
>>>>>>
>>>>>> On Tue, Jun 14, 2011 at 7:39 PM, Stefan Wienert <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Actually I'm using RowSimilarityJob() with
>>>>>>> --input input
>>>>>>> --output output
>>>>>>> --numberOfColumns documentCount
>>>>>>> --maxSimilaritiesPerRow documentCount
>>>>>>> --similarityClassname SIMILARITY_UNCENTERED_COSINE
>>>>>>>
>>>>>>> Actually, I am not really sure what this
>>>>>>> SIMILARITY_UNCENTERED_COSINE calculates... The source says:
>>>>>>> "distributed implementation of cosine similarity that does not
>>>>>>> center its data".
>>>>>>>
>>>>>>> So... this seems to be the similarity and not the distance?
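[Editor's note: SIMILARITY_UNCENTERED_COSINE is indeed a similarity (higher means more alike), and "uncentered" means the vectors' means are not subtracted before the cosine is taken; centering first would turn it into the Pearson correlation. A minimal pure-Python sketch of the difference (an illustration, not the Mahout implementation); it also shows that cosine ignores vector length, which bears on the normalization question earlier in the thread:]

```python
import math

def cosine(a, b):
    # cos(a, b) = a.b / (|a| |b|); dividing by both norms makes the
    # measure independent of vector length (scale-invariant).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def centered_cosine(a, b):
    # Subtracting each vector's mean first turns cosine similarity
    # into the Pearson correlation coefficient.
    ma = sum(a) / len(a)
    mb = sum(b) / len(b)
    return cosine([x - ma for x in a], [y - mb for y in b])

a = [3.0, 4.0, 5.0]
b = [1.0, 2.0, 3.0]
print(cosine(a, b))                       # uncentered: close to 1
print(centered_cosine(a, b))              # centered: 1.0 up to rounding
print(cosine(a, [10 * x for x in a]))     # 1.0 up to rounding: length is ignored
```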
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Stefan
>>>>>>>
>>>>>>>
>>>>>>> 2011/6/14 Stefan Wienert <[email protected]>:
>>>>>>>
>>>>>>>> But... why do I get different results with cosine similarity
>>>>>>>> with no dimension reduction (with 100,000 dimensions)?
>>>>>>>>
>>>>>>>> 2011/6/14 Fernando Fernández <[email protected]>:
>>>>>>>>
>>>>>>>>> Actually, that's what your results are showing, aren't they?
>>>>>>>>> With rank 1000 the similarity avg is the lowest...
>>>>>>>>>
>>>>>>>>> 2011/6/14 Jake Mannix <[email protected]>
>>>>>>>>>
>>>>>>>>>> Actually, wait - are your graphs showing *similarity*, or
>>>>>>>>>> *distance*? In higher dimensions, *distance* (and the angle)
>>>>>>>>>> should grow, but on the other hand, *similarity* (cos(angle))
>>>>>>>>>> should go toward 0.
>>>>>>>>>>
>>>>>>>>>> On Tue, Jun 14, 2011 at 10:15 AM, Stefan Wienert
>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey guys,
>>>>>>>>>>>
>>>>>>>>>>> I have some strange results in my LSA pipeline.
>>>>>>>>>>>
>>>>>>>>>>> First, the steps my data goes through:
>>>>>>>>>>> 1) Extract the term-document matrix (TDM) from a Lucene
>>>>>>>>>>> datastore, using TFIDF as the weighter
>>>>>>>>>>> 2) Transpose the TDM
>>>>>>>>>>> 3a) Run Mahout SVD (Lanczos) on the transposed TDM
>>>>>>>>>>> 3b) Run Mahout SSVD (stochastic SVD) on the transposed TDM
>>>>>>>>>>> 3c) Use no dimension reduction (for testing purposes)
>>>>>>>>>>> 4) Transpose the result (ONLY none / svd)
>>>>>>>>>>> 5) Calculate cosine similarity (from Mahout)
>>>>>>>>>>>
>>>>>>>>>>> Now... some strange things happen:
>>>>>>>>>>> First of all: the demo data shows the similarity from
>>>>>>>>>>> document 1 to all other documents.
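[Editor's note: the five pipeline steps above can be sketched as plain shape bookkeeping. This is a toy sketch with made-up small dimensions standing in for the 227 x 107909 matrix from the thread, using plain Python lists rather than Mahout's distributed matrices:]

```python
def shape(m):
    return (len(m), len(m[0]))

def transpose(m):
    # Swap rows and columns of a dense list-of-lists matrix.
    return [list(col) for col in zip(*m)]

docs, terms = 4, 7                            # stand-ins for 227 docs, 107909 terms
tdm = [[0.0] * terms for _ in range(docs)]    # step 1: doc x term from Lucene
t = transpose(tdm)                            # step 2: term x doc
print(shape(tdm))   # (4, 7)
print(shape(t))     # (7, 4)
# Steps 3a/3b reduce the term dimension to rank k; step 4 transposes
# back so rows are documents again, and step 5 compares those rows
# pairwise with cosine similarity.
```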
>>>>>>>>>>>
>>>>>>>>>>> The results using only cosine similarity (without dimension
>>>>>>>>>>> reduction):
>>>>>>>>>>> http://the-lord.de/img/none.png
>>>>>>>>>>>
>>>>>>>>>>> The result using SVD, rank 10:
>>>>>>>>>>> http://the-lord.de/img/svd-10.png
>>>>>>>>>>> Some points fall down to the bottom.
>>>>>>>>>>>
>>>>>>>>>>> The results using SSVD, rank 10:
>>>>>>>>>>> http://the-lord.de/img/ssvd-10.png
>>>>>>>>>>>
>>>>>>>>>>> The result using SVD, rank 100:
>>>>>>>>>>> http://the-lord.de/img/svd-100.png
>>>>>>>>>>> More points fall down to the bottom.
>>>>>>>>>>>
>>>>>>>>>>> The results using SSVD, rank 100:
>>>>>>>>>>> http://the-lord.de/img/ssvd-100.png
>>>>>>>>>>>
>>>>>>>>>>> The results using SVD, rank 200:
>>>>>>>>>>> http://the-lord.de/img/svd-200.png
>>>>>>>>>>> Even more points fall down to the bottom.
>>>>>>>>>>>
>>>>>>>>>>> The results using SVD, rank 1000:
>>>>>>>>>>> http://the-lord.de/img/svd-1000.png
>>>>>>>>>>> Most points are at the bottom.
>>>>>>>>>>>
>>>>>>>>>>> Please beware of the scale:
>>>>>>>>>>> - the avg for none: 0.8712
>>>>>>>>>>> - the avg for svd rank 10: 0.2648
>>>>>>>>>>> - the avg for svd rank 100: 0.0628
>>>>>>>>>>> - the avg for svd rank 200: 0.0238
>>>>>>>>>>> - the avg for svd rank 1000: 0.0116
>>>>>>>>>>>
>>>>>>>>>>> So my question is:
>>>>>>>>>>> Can you explain this behavior? Why do the documents get more
>>>>>>>>>>> equal with more ranks in SVD? I thought it would be the
>>>>>>>>>>> opposite.
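[Editor's note: one plausible reading of the averages above, my interpretation rather than something confirmed in the thread: TFIDF vectors are nonnegative, so any two of them have a nonnegative and typically large cosine, while vectors projected into an SVD basis have signed coordinates, whose cosines concentrate near 0 as the rank grows. A quick random-vector sketch of that contrast:]

```python
import math
import random

random.seed(42)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

n = 10000  # high-dimensional, like a large vocabulary

# Nonnegative "TFIDF-like" vectors: cosine concentrates around 0.75
# for uniform [0, 1] coordinates, well away from 0.
u = [random.random() for _ in range(n)]
v = [random.random() for _ in range(n)]
print(cosine(u, v))   # roughly 0.75

# Signed coordinates (as after projection into an SVD basis):
# the cosine concentrates near 0.
s = [random.uniform(-1, 1) for _ in range(n)]
t = [random.uniform(-1, 1) for _ in range(n)]
print(cosine(s, t))   # roughly 0
```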
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Stefan
>>>>>>>>
>>>>>>>> --
>>>>>>>> Stefan Wienert
>>>>>>>>
>>>>>>>> http://www.wienert.cc
>>>>>>>> [email protected]
>>>>>>>>
>>>>>>>> Telefon: +495251-2026838
>>>>>>>> Mobil: +49176-40170270
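[Editor's note: as a closing illustration of Sebastian's RowSimilarityJob fix quoted earlier in the thread: the invariant being restored is that when a row pair is swapped into canonical (smaller-row-first) order, the attached values must be swapped along with it. A stand-alone sketch of the buggy vs. fixed behavior, in hypothetical Python rather than the Mahout Java code:]

```python
def make_pair_buggy(rowA, rowB, valueA, valueB):
    # 0.5 behavior: rows are put in canonical (smaller-first) order,
    # but the values are NOT swapped along with them.
    if rowA <= rowB:
        pair = (rowA, rowB)
    else:
        pair = (rowB, rowA)
    return pair, (valueA, valueB)

def make_pair_fixed(rowA, rowB, valueA, valueB):
    # Trunk behavior: values are swapped together with the rows, so
    # the first value always belongs to the first (smaller) row.
    if rowA <= rowB:
        return (rowA, rowB), (valueA, valueB)
    return (rowB, rowA), (valueB, valueA)

# With rowA > rowB, the buggy version leaves value "a" attached to row 2:
print(make_pair_buggy(5, 2, "a", "b"))  # ((2, 5), ('a', 'b')) -- misaligned
print(make_pair_fixed(5, 2, "a", "b"))  # ((2, 5), ('b', 'a')) -- aligned
```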
