Hi Sebastian,

the bug does not affect me with:
NONE > bugcheck.pdf
SVD > bugcheck2.pdf
(although it was active)
Cheers,
Stefan

2011/6/14 Sebastian Schelter <[email protected]>:
> Hi Stefan,
>
> I checked the implementation of RowSimilarityJob and we might still have a
> bug in the 0.5 release... I don't know if your problem is caused by that,
> but the similarity scores might not be correct...
>
> We had this issue in 0.4 already, when someone realized that cooccurrences
> were mapped out inconsistently, so for 0.5 we made sure that we always map
> the smaller row as the first value. But apparently I did not adjust the
> value setting for the Cooccurrence object...
>
> In 0.5 the code is:
>
>     if (rowA <= rowB) {
>       rowPair.set(rowA, rowB, weightA, weightB);
>     } else {
>       rowPair.set(rowB, rowA, weightB, weightA);
>     }
>     coocurrence.set(column.get(), valueA, valueB);
>
> But it should be (already fixed in current trunk some days ago):
>
>     if (rowA <= rowB) {
>       rowPair.set(rowA, rowB, weightA, weightB);
>       coocurrence.set(column.get(), valueA, valueB);
>     } else {
>       rowPair.set(rowB, rowA, weightB, weightA);
>       coocurrence.set(column.get(), valueB, valueA);
>     }
>
> Maybe you could rerun your test with the current trunk?
>
> --sebastian
>
> On 14.06.2011 20:54, Sean Owen wrote:
>>
>> It is a similarity, not a distance. Higher values mean more similarity,
>> not less.
>>
>> I agree that similarity ought to decrease with more dimensions. That is
>> what you observe -- except that you see quite high average similarity
>> with no dimension reduction!
>>
>> An average cosine similarity of 0.87 sounds "high" to me for anything
>> but a few dimensions. What's the dimensionality of the input without
>> dimension reduction?
>>
>> Something is amiss in this pipeline. It is an interesting question!
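[Editor's note] The one-line difference in Sebastian's fix is easy to miss, so here is a minimal, self-contained sketch of the ordering rule it enforces, using plain arrays instead of Mahout's actual RowPair/Cooccurrence writables (the class and method names below are illustrative, not Mahout's API): whenever the rows are swapped so the smaller one comes first, the values must be swapped the same way.

```java
// Simplified sketch (not Mahout's actual classes) of the trunk fix:
// the value order must follow the row order.
public class CooccurrenceOrder {

    // Returns {firstRow, secondRow, firstValue, secondValue}, always with
    // the smaller row and *its* value in the first slots.
    static double[] orderPair(int rowA, int rowB, double valueA, double valueB) {
        if (rowA <= rowB) {
            return new double[] {rowA, rowB, valueA, valueB};
        } else {
            return new double[] {rowB, rowA, valueB, valueA};
        }
    }

    public static void main(String[] args) {
        // rowA = 7 > rowB = 3: the pair is emitted as (3, 7), so the first
        // value must be row 3's value (2.0). The 0.5 bug kept (5.0, 2.0).
        double[] p = orderPair(7, 3, 5.0, 2.0);
        System.out.println(java.util.Arrays.toString(p)); // [3.0, 7.0, 2.0, 5.0]
    }
}
```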
>>
>> On Tue, Jun 14, 2011 at 7:39 PM, Stefan Wienert <[email protected]> wrote:
>>>
>>> Actually I'm using RowSimilarityJob() with
>>> --input input
>>> --output output
>>> --numberOfColumns documentCount
>>> --maxSimilaritiesPerRow documentCount
>>> --similarityClassname SIMILARITY_UNCENTERED_COSINE
>>>
>>> Actually I am not really sure what this SIMILARITY_UNCENTERED_COSINE
>>> calculates... The source says: "distributed implementation of cosine
>>> similarity that does not center its data".
>>>
>>> So... this seems to be the similarity and not the distance?
>>>
>>> Cheers,
>>> Stefan
>>>
>>> 2011/6/14 Stefan Wienert <[email protected]>:
>>>>
>>>> But... why do I get different results with cosine similarity and
>>>> no dimension reduction (with 100,000 dimensions)?
>>>>
>>>> 2011/6/14 Fernando Fernández <[email protected]>:
>>>>>
>>>>> Actually that's what your results are showing, aren't they? With rank
>>>>> 1000 the similarity avg is the lowest...
>>>>>
>>>>> 2011/6/14 Jake Mannix <[email protected]>
>>>>>
>>>>>> Actually, wait - are your graphs showing *similarity*, or *distance*?
>>>>>> In higher dimensions, *distance* (and cosine angle) should grow, but
>>>>>> on the other hand, *similarity* (1 - cos(angle)) should go toward 0.
>>>>>>
>>>>>> On Tue, Jun 14, 2011 at 10:15 AM, Stefan Wienert <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hey guys,
>>>>>>>
>>>>>>> I have some strange results in my LSA pipeline.
>>>>>>>
>>>>>>> First, let me explain the steps my data goes through:
>>>>>>> 1) Extract the term-document matrix (TDM) from a Lucene datastore,
>>>>>>>    using TF-IDF as the weighter
>>>>>>> 2) Transpose the TDM
>>>>>>> 3a) Run Mahout SVD (Lanczos) on the transposed TDM
>>>>>>> 3b) Run Mahout SSVD (stochastic SVD) on the transposed TDM
>>>>>>> 3c) Use no dimension reduction (for testing purposes)
>>>>>>> 4) Transpose the result (only for none / SVD)
>>>>>>> 5) Calculate cosine similarity (from Mahout)
>>>>>>>
>>>>>>> Now some strange things happen.
>>>>>>> First of all: the demo data shows the similarity from document 1 to
>>>>>>> all other documents.
>>>>>>>
>>>>>>> The results using only cosine similarity (without dimension reduction):
>>>>>>> http://the-lord.de/img/none.png
>>>>>>>
>>>>>>> The result using SVD, rank 10:
>>>>>>> http://the-lord.de/img/svd-10.png
>>>>>>> Some points fall down to the bottom.
>>>>>>>
>>>>>>> The results using SSVD, rank 10:
>>>>>>> http://the-lord.de/img/ssvd-10.png
>>>>>>>
>>>>>>> The result using SVD, rank 100:
>>>>>>> http://the-lord.de/img/svd-100.png
>>>>>>> More points fall down to the bottom.
>>>>>>>
>>>>>>> The results using SSVD, rank 100:
>>>>>>> http://the-lord.de/img/ssvd-100.png
>>>>>>>
>>>>>>> The results using SVD, rank 200:
>>>>>>> http://the-lord.de/img/svd-200.png
>>>>>>> Even more points fall down to the bottom.
>>>>>>>
>>>>>>> The results using SVD, rank 1000:
>>>>>>> http://the-lord.de/img/svd-1000.png
>>>>>>> Most points are at the bottom.
>>>>>>>
>>>>>>> Please be aware of the scale:
>>>>>>> - the avg for none: 0.8712
>>>>>>> - the avg for SVD rank 10: 0.2648
>>>>>>> - the avg for SVD rank 100: 0.0628
>>>>>>> - the avg for SVD rank 200: 0.0238
>>>>>>> - the avg for SVD rank 1000: 0.0116
>>>>>>>
>>>>>>> So my question is:
>>>>>>> Can you explain this behavior? Why are the documents getting more
>>>>>>> equal with more ranks in SVD? I thought it would be the opposite.
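[Editor's note] One effect that may bear on the question above, sketched rather than claimed as a diagnosis of this pipeline: raw TF-IDF vectors are nonnegative, so their pairwise cosine similarities are biased upward, while SVD coordinates have mixed signs, which lets similarities spread around 0. A toy comparison with random vectors (all names and parameters here are illustrative):

```java
import java.util.Random;

public class SimilaritySpread {

    // Plain (uncentered) cosine similarity: dot(a, b) / (|a| * |b|).
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int dim = 100, trials = 1000;
        double nonnegSum = 0, signedSum = 0;
        for (int t = 0; t < trials; t++) {
            double[] a = new double[dim], b = new double[dim];
            double[] c = new double[dim], d = new double[dim];
            for (int i = 0; i < dim; i++) {
                a[i] = rnd.nextDouble();    // nonnegative, like TF-IDF weights
                b[i] = rnd.nextDouble();
                c[i] = rnd.nextGaussian();  // mixed signs, like SVD coordinates
                d[i] = rnd.nextGaussian();
            }
            nonnegSum += cosine(a, b);
            signedSum += cosine(c, d);
        }
        // Random nonnegative vectors average roughly 0.75 at this dimension;
        // random signed vectors average near 0.
        System.out.println("nonnegative avg: " + nonnegSum / trials);
        System.out.println("signed avg:      " + signedSum / trials);
    }
}
```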
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Stefan
>>>>
>>>> --
>>>> Stefan Wienert
>>>>
>>>> http://www.wienert.cc
>>>> [email protected]
>>>>
>>>> Telefon: +495251-2026838
>>>> Mobil: +49176-40170270

--
Stefan Wienert

http://www.wienert.cc
[email protected]

Telefon: +495251-2026838
Mobil: +49176-40170270
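[Editor's note] On the SIMILARITY_UNCENTERED_COSINE question raised in the thread: "does not center its data" means plain cosine similarity; centering each vector on its own mean first would give a Pearson-correlation-style score instead. A minimal single-machine sketch of the difference (simplified, not Mahout's distributed implementation):

```java
public class UncenteredVsCentered {

    // Uncentered cosine: dot(a, b) / (|a| * |b|), no mean subtraction.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Subtract the vector's own mean from each component.
    static double[] center(double[] v) {
        double mean = 0;
        for (double x : v) mean += x;
        mean /= v.length;
        double[] c = new double[v.length];
        for (int i = 0; i < v.length; i++) c[i] = v[i] - mean;
        return c;
    }

    public static void main(String[] args) {
        double[] a = {3, 1, 2};
        double[] b = {2, 3, 1};
        System.out.println(cosine(a, b));                  // 11/14 = 0.7857...
        System.out.println(cosine(center(a), center(b)));  // -0.5
    }
}
```

The same pair of vectors scores 0.79 uncentered but -0.5 after centering, which is why the two similarity measures are not interchangeable.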
