That actually looks more like it: not so many documents are similar to a randomly picked one.
On Tue, Jun 14, 2011 at 3:09 PM, Stefan Wienert <[email protected]> wrote:
> Dmitriy, be aware: there was a bug in none.png... I deleted it a minute ago, see
> http://the-lord.de/img/beispielwerte.pdf
> for better results.
>
> First... U and V are the singular vectors, not the eigenvectors ;)
>
> Lanczos-SVD in Mahout computes the eigenvectors of M*M (it
> multiplies the input matrix with the transposed one).
>
> In fact, I don't need U, just V, so I need to transpose M (because
> the eigenvectors of MM* = V).
>
> So... normalizing the eigenvectors: doesn't cosine similarity do this
> already, i.e. ignore the length of the vectors?
> http://en.wikipedia.org/wiki/Cosine_similarity
>
> My parameters for ssvd:
> --rank 100
> --oversampling 10
> --blockHeight 227
> --computeU false
> --input
> --output
>
> The rest should be at the defaults.
>
> Actually, I do not really know what this oversampling parameter means...
>
> 2011/6/14 Dmitriy Lyubimov <[email protected]>:
>> Interesting.
>>
>> (I have one confusion of mine re: Lanczos -- is it computing U
>> eigenvectors or V? The doc says "eigenvectors" but doesn't say left or
>> right. If it's V (right eigenvectors), this sequence should be fine.)
>>
>> With SSVD I don't do a transpose, I just compute U, which
>> produces the document singular vectors directly.
>>
>> Also, I am not sure that Lanczos actually normalizes the eigenvectors,
>> but SSVD does (or multiplies the normalized version by the square root
>> of a singular value, whichever is requested). So depending on which
>> space your rotated results are in, cosine similarities may be
>> different. I assume you used normalized (true) eigenvectors from SSVD.
>>
>> Also, it would be interesting to know which oversampling parameter (p)
>> you used.
>>
>> Thanks.
>> -d
>>
>>
>> On Tue, Jun 14, 2011 at 2:04 PM, Stefan Wienert <[email protected]> wrote:
>>> So...
>>> let's check the dimensions:
>>>
>>> First step, the Lucene output:
>>> 227 rows (= docs) and 107909 cols (= terms)
>>>
>>> transposed to:
>>> 107909 rows and 227 cols
>>>
>>> reduced with SVD (rank 100) to:
>>> 99 rows and 227 cols
>>>
>>> transposed to (actually there was a bug here, with no effect on the
>>> SVD result but on the NONE result):
>>> 227 rows and 99 cols
>>>
>>> So... now the cosine results are very similar to SVD 200.
>>>
>>> Results are added.
>>>
>>> @Sebastian: I will check whether the bug affects my results.
>>>
>>> 2011/6/14 Fernando Fernández <[email protected]>:
>>>> Hi Stefan,
>>>>
>>>> Are you sure you need to transpose the input matrix? I thought that
>>>> what you get from the Lucene index was already a document(rows)-
>>>> term(columns) matrix, but you say that you obtain a term-document
>>>> matrix and transpose it. Is this correct? What are you using to
>>>> obtain this matrix from Lucene? Is it possible that you are
>>>> calculating similarities with the wrong matrix in one of the two
>>>> cases (with/without dimension reduction)?
>>>>
>>>> Best,
>>>> Fernando.
>>>>
>>>> 2011/6/14 Sebastian Schelter <[email protected]>
>>>>
>>>>> Hi Stefan,
>>>>>
>>>>> I checked the implementation of RowSimilarityJob and we might still
>>>>> have a bug in the 0.5 release... (f**k). I don't know if your
>>>>> problem is caused by that, but the similarity scores might not be
>>>>> correct...
>>>>>
>>>>> We had this issue in 0.4 already, when someone realized that
>>>>> cooccurrences were mapped out inconsistently, so for 0.5 we made
>>>>> sure that we always map the smaller row as the first value. But
>>>>> apparently I did not adjust the value setting for the Cooccurrence
>>>>> object...
>>>>>
>>>>> In 0.5 the code is:
>>>>>
>>>>> if (rowA <= rowB) {
>>>>>   rowPair.set(rowA, rowB, weightA, weightB);
>>>>> } else {
>>>>>   rowPair.set(rowB, rowA, weightB, weightA);
>>>>> }
>>>>> coocurrence.set(column.get(), valueA, valueB);
>>>>>
>>>>> But it should be (already fixed in the current trunk some days ago):
>>>>>
>>>>> if (rowA <= rowB) {
>>>>>   rowPair.set(rowA, rowB, weightA, weightB);
>>>>>   coocurrence.set(column.get(), valueA, valueB);
>>>>> } else {
>>>>>   rowPair.set(rowB, rowA, weightB, weightA);
>>>>>   coocurrence.set(column.get(), valueB, valueA);
>>>>> }
>>>>>
>>>>> Maybe you could rerun your test with the current trunk?
>>>>>
>>>>> --sebastian
>>>>>
>>>>>
>>>>> On 14.06.2011 20:54, Sean Owen wrote:
>>>>>
>>>>>> It is a similarity, not a distance. Higher values mean more
>>>>>> similarity, not less.
>>>>>>
>>>>>> I agree that similarity ought to decrease with more dimensions.
>>>>>> That is what you observe -- except that you see quite high average
>>>>>> similarity with no dimension reduction!
>>>>>>
>>>>>> An average cosine similarity of 0.87 sounds "high" to me for
>>>>>> anything but a few dimensions. What's the dimensionality of the
>>>>>> input without dimension reduction?
>>>>>>
>>>>>> Something is amiss in this pipeline. It is an interesting question!
>>>>>>
>>>>>> On Tue, Jun 14, 2011 at 7:39 PM, Stefan Wienert <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Actually I'm using RowSimilarityJob() with
>>>>>>> --input input
>>>>>>> --output output
>>>>>>> --numberOfColumns documentCount
>>>>>>> --maxSimilaritiesPerRow documentCount
>>>>>>> --similarityClassname SIMILARITY_UNCENTERED_COSINE
>>>>>>>
>>>>>>> Actually, I am not really sure what this
>>>>>>> SIMILARITY_UNCENTERED_COSINE calculates... The source says:
>>>>>>> "distributed implementation of cosine similarity that does not
>>>>>>> center its data".
>>>>>>>
>>>>>>> So... this seems to be the similarity and not the distance?
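[Editor's note: SIMILARITY_UNCENTERED_COSINE is indeed a similarity (higher means more alike), and "uncentered" means the vectors' means are not subtracted before the cosine is taken; centering first would turn it into the Pearson correlation. A minimal pure-Python sketch of the difference (an illustration, not the Mahout implementation); it also shows that cosine ignores vector length, which bears on the normalization question earlier in the thread:]

```python
import math

def cosine(a, b):
    # cos(a, b) = a.b / (|a| |b|); dividing by both norms makes the
    # measure independent of vector length (scale-invariant).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def centered_cosine(a, b):
    # Subtracting each vector's mean first turns cosine similarity
    # into the Pearson correlation coefficient.
    ma = sum(a) / len(a)
    mb = sum(b) / len(b)
    return cosine([x - ma for x in a], [y - mb for y in b])

a = [3.0, 4.0, 5.0]
b = [1.0, 2.0, 3.0]
print(cosine(a, b))                       # uncentered: close to 1
print(centered_cosine(a, b))              # centered: 1.0 up to rounding
print(cosine(a, [10 * x for x in a]))     # 1.0 up to rounding: length is ignored
```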
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Stefan
>>>>>>>
>>>>>>>
>>>>>>> 2011/6/14 Stefan Wienert <[email protected]>:
>>>>>>>
>>>>>>>> But... why do I get different results with cosine similarity
>>>>>>>> with no dimension reduction (with 100,000 dimensions)?
>>>>>>>>
>>>>>>>> 2011/6/14 Fernando Fernández <[email protected]>:
>>>>>>>>
>>>>>>>>> Actually, that's what your results are showing, aren't they?
>>>>>>>>> With rank 1000 the similarity avg is the lowest...
>>>>>>>>>
>>>>>>>>> 2011/6/14 Jake Mannix <[email protected]>
>>>>>>>>>
>>>>>>>>>> Actually, wait - are your graphs showing *similarity*, or
>>>>>>>>>> *distance*? In higher dimensions, *distance* (and the angle)
>>>>>>>>>> should grow, but on the other hand, *similarity* (cos(angle))
>>>>>>>>>> should go toward 0.
>>>>>>>>>>
>>>>>>>>>> On Tue, Jun 14, 2011 at 10:15 AM, Stefan Wienert
>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey guys,
>>>>>>>>>>>
>>>>>>>>>>> I have some strange results in my LSA pipeline.
>>>>>>>>>>>
>>>>>>>>>>> First, the steps my data goes through:
>>>>>>>>>>> 1) Extract the term-document matrix (TDM) from a Lucene
>>>>>>>>>>> datastore, using TFIDF as the weighter
>>>>>>>>>>> 2) Transpose the TDM
>>>>>>>>>>> 3a) Run Mahout SVD (Lanczos) on the transposed TDM
>>>>>>>>>>> 3b) Run Mahout SSVD (stochastic SVD) on the transposed TDM
>>>>>>>>>>> 3c) Use no dimension reduction (for testing purposes)
>>>>>>>>>>> 4) Transpose the result (ONLY none / svd)
>>>>>>>>>>> 5) Calculate cosine similarity (from Mahout)
>>>>>>>>>>>
>>>>>>>>>>> Now... some strange things happen:
>>>>>>>>>>> First of all: the demo data shows the similarity from
>>>>>>>>>>> document 1 to all other documents.
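[Editor's note: the five pipeline steps above can be sketched as plain shape bookkeeping. This is a toy sketch with made-up small dimensions standing in for the 227 x 107909 matrix from the thread, using plain Python lists rather than Mahout's distributed matrices:]

```python
def shape(m):
    return (len(m), len(m[0]))

def transpose(m):
    # Swap rows and columns of a dense list-of-lists matrix.
    return [list(col) for col in zip(*m)]

docs, terms = 4, 7                            # stand-ins for 227 docs, 107909 terms
tdm = [[0.0] * terms for _ in range(docs)]    # step 1: doc x term from Lucene
t = transpose(tdm)                            # step 2: term x doc
print(shape(tdm))   # (4, 7)
print(shape(t))     # (7, 4)
# Steps 3a/3b reduce the term dimension to rank k; step 4 transposes
# back so rows are documents again, and step 5 compares those rows
# pairwise with cosine similarity.
```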
>>>>>>>>>>>
>>>>>>>>>>> The results using only cosine similarity (without dimension
>>>>>>>>>>> reduction):
>>>>>>>>>>> http://the-lord.de/img/none.png
>>>>>>>>>>>
>>>>>>>>>>> The result using SVD, rank 10:
>>>>>>>>>>> http://the-lord.de/img/svd-10.png
>>>>>>>>>>> Some points fall down to the bottom.
>>>>>>>>>>>
>>>>>>>>>>> The results using SSVD, rank 10:
>>>>>>>>>>> http://the-lord.de/img/ssvd-10.png
>>>>>>>>>>>
>>>>>>>>>>> The result using SVD, rank 100:
>>>>>>>>>>> http://the-lord.de/img/svd-100.png
>>>>>>>>>>> More points fall down to the bottom.
>>>>>>>>>>>
>>>>>>>>>>> The results using SSVD, rank 100:
>>>>>>>>>>> http://the-lord.de/img/ssvd-100.png
>>>>>>>>>>>
>>>>>>>>>>> The results using SVD, rank 200:
>>>>>>>>>>> http://the-lord.de/img/svd-200.png
>>>>>>>>>>> Even more points fall down to the bottom.
>>>>>>>>>>>
>>>>>>>>>>> The results using SVD, rank 1000:
>>>>>>>>>>> http://the-lord.de/img/svd-1000.png
>>>>>>>>>>> Most points are at the bottom.
>>>>>>>>>>>
>>>>>>>>>>> Please beware of the scale:
>>>>>>>>>>> - the avg for none: 0.8712
>>>>>>>>>>> - the avg for svd rank 10: 0.2648
>>>>>>>>>>> - the avg for svd rank 100: 0.0628
>>>>>>>>>>> - the avg for svd rank 200: 0.0238
>>>>>>>>>>> - the avg for svd rank 1000: 0.0116
>>>>>>>>>>>
>>>>>>>>>>> So my question is:
>>>>>>>>>>> Can you explain this behavior? Why do the documents get more
>>>>>>>>>>> equal with more ranks in SVD? I thought it would be the
>>>>>>>>>>> opposite.
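[Editor's note: one plausible reading of the averages above, my interpretation rather than something confirmed in the thread: TFIDF vectors are nonnegative, so any two of them have a nonnegative and typically large cosine, while vectors projected into an SVD basis have signed coordinates, whose cosines concentrate near 0 as the rank grows. A quick random-vector sketch of that contrast:]

```python
import math
import random

random.seed(42)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

n = 10000  # high-dimensional, like a large vocabulary

# Nonnegative "TFIDF-like" vectors: cosine concentrates around 0.75
# for uniform [0, 1] coordinates, well away from 0.
u = [random.random() for _ in range(n)]
v = [random.random() for _ in range(n)]
print(cosine(u, v))   # roughly 0.75

# Signed coordinates (as after projection into an SVD basis):
# the cosine concentrates near 0.
s = [random.uniform(-1, 1) for _ in range(n)]
t = [random.uniform(-1, 1) for _ in range(n)]
print(cosine(s, t))   # roughly 0
```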
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Stefan
>>>>>>>>
>>>>>>>> --
>>>>>>>> Stefan Wienert
>>>>>>>>
>>>>>>>> http://www.wienert.cc
>>>>>>>> [email protected]
>>>>>>>>
>>>>>>>> Telefon: +495251-2026838
>>>>>>>> Mobil: +49176-40170270
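[Editor's note: as a closing illustration of Sebastian's RowSimilarityJob fix quoted earlier in the thread: the invariant being restored is that when a row pair is swapped into canonical (smaller-row-first) order, the attached values must be swapped along with it. A stand-alone sketch of the buggy vs. fixed behavior, in hypothetical Python rather than the Mahout Java code:]

```python
def make_pair_buggy(rowA, rowB, valueA, valueB):
    # 0.5 behavior: rows are put in canonical (smaller-first) order,
    # but the values are NOT swapped along with them.
    if rowA <= rowB:
        pair = (rowA, rowB)
    else:
        pair = (rowB, rowA)
    return pair, (valueA, valueB)

def make_pair_fixed(rowA, rowB, valueA, valueB):
    # Trunk behavior: values are swapped together with the rows, so
    # the first value always belongs to the first (smaller) row.
    if rowA <= rowB:
        return (rowA, rowB), (valueA, valueB)
    return (rowB, rowA), (valueB, valueA)

# With rowA > rowB, the buggy version leaves value "a" attached to row 2:
print(make_pair_buggy(5, 2, "a", "b"))  # ((2, 5), ('a', 'b')) -- misaligned
print(make_pair_fixed(5, 2, "a", "b"))  # ((2, 5), ('b', 'a')) -- aligned
```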
