I beg to differ... U and V are the left and right singular vectors
(equivalently, the eigenvectors of AA' and A'A), and the singular values
are denoted Sigma (they are the square roots of the eigenvalues of AA',
as you correctly pointed out).
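To spell that relation out (a standard identity, nothing Mahout-specific):

  A = U \Sigma V^T,  with  U^T U = I,  V^T V = I
  A A^T = U \Sigma V^T V \Sigma U^T = U \Sigma^2 U^T

so the eigenvalues of AA' are exactly the squared singular values, and the
eigenvectors of AA' are the columns of U; the same argument applied to A'A
yields V.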

Yes, so I figured Lanczos must be computing V (otherwise your dimensions
wouldn't match). Also, I guess "eigenvectors" implies the right ones, not
the left ones, by default.

Normalization means that the 2-norm of every column of the eigenvector
matrix is 1. In the classic SVD A = U*Sigma*V', even if it is a thin one,
U and V are orthonormal. I might be wrong, but I was under the impression
that I saw some discussion saying the Lanczos singular vector matrix is not
necessarily orthonormal (although the columns do form an orthogonal basis).
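If you want to check that property on a small output, here is a minimal
sketch (plain Java on a dense array pulled from the job output; the
isOrthonormal helper is mine, not a Mahout API):

  // Check whether the columns of an n x k matrix form an orthonormal set:
  // every column has 2-norm 1 and distinct columns have dot product 0.
  static boolean isOrthonormal(double[][] v, double tol) {
    int n = v.length, k = v[0].length;
    for (int i = 0; i < k; i++) {
      for (int j = i; j < k; j++) {
        double dot = 0.0;
        for (int r = 0; r < n; r++) {
          dot += v[r][i] * v[r][j];
        }
        double expected = (i == j) ? 1.0 : 0.0; // 1 on the diagonal: unit norm
        if (Math.abs(dot - expected) > tol) {
          return false;
        }
      }
    }
    return true;
  }

If the off-diagonal checks pass but a diagonal one fails, the columns are
orthogonal but not unit length, which is exactly the Lanczos question above.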

Anyway, I know for sure that SSVD gives the option to rotate into either
the eigenspace or the space scaled by the square roots of the eigenvalues.
The latter puts row items and column items into a single space and enables
similarity measures between them.
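Concretely, my understanding of the second option (a sketch only; these
helpers are made up for illustration, not SSVD's API): each row of U and V
is multiplied elementwise by sqrt(sigma), which puts document rows and term
rows into one comparable space:

  // Scale one row of U (or V) by the square roots of the singular values.
  static double[] scaleBySqrtSigma(double[] row, double[] sigma) {
    double[] out = new double[row.length];
    for (int i = 0; i < row.length; i++) {
      out[i] = row[i] * Math.sqrt(sigma[i]);
    }
    return out;
  }

  // With both sides scaled this way, a row item and a column item can be
  // compared directly, e.g. with cosine similarity.
  static double cosine(double[] a, double[] b) {
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return dot / Math.sqrt(na * nb);
  }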

The oversampling parameter is the -p parameter you give to SSVD (didn't
you give it?). What was your command line for SSVD?

Basically it means that for a rank-10 thin SVD you give something like
k=10, p=90, which means the algorithm actually computes a 100-dimensional
random projection and computes the SVD on that (or rather, indeed, an
eigendecomposition of BB'), and then throws away the extra 90 singular
values, and the 90 corresponding latent factors, from the result.
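In-core, the idea looks roughly like this (a toy sketch using Apache
Commons Math on a dense matrix, purely to illustrate the mechanics; the
real SSVD job is distributed and never materializes these dense
intermediates):

  import java.util.Arrays;
  import java.util.Random;
  import org.apache.commons.math3.linear.*;

  class SsvdToy {
    // A is m x n; we want a rank-k thin SVD via k+p oversampled dimensions.
    static void sketch(RealMatrix a, int k, int p) {
      int n = a.getColumnDimension();
      int kp = k + p; // e.g. k=10, p=90 -> a 100-dimensional projection
      Random rnd = new Random(1234L);

      // Random projection: Y = A * Omega, Omega being n x (k+p) Gaussian.
      RealMatrix omega = new Array2DRowRealMatrix(n, kp);
      for (int i = 0; i < n; i++)
        for (int j = 0; j < kp; j++)
          omega.setEntry(i, j, rnd.nextGaussian());
      RealMatrix y = a.multiply(omega);

      // Orthonormalize Y (thin Q), then form the small matrix B = Q' * A.
      RealMatrix q = new QRDecomposition(y).getQ()
          .getSubMatrix(0, y.getRowDimension() - 1, 0, kp - 1);
      RealMatrix b = q.transpose().multiply(a);

      // Eigendecomposition of B*B' ((k+p) x (k+p)); the singular values are
      // the square roots of its eigenvalues. Keep the top k, drop the p extras.
      EigenDecomposition eig =
          new EigenDecomposition(b.multiply(b.transpose()));
      double[] ev = eig.getRealEigenvalues().clone();
      Arrays.sort(ev); // ascending; read the top k from the tail
      for (int i = 0; i < k; i++)
        System.out.println("sigma ~ "
            + Math.sqrt(Math.max(ev[ev.length - 1 - i], 0.0)));
    }
  }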


On Tue, Jun 14, 2011 at 3:09 PM, Stefan Wienert <ste...@wienert.cc> wrote:
> Dmitriy, be aware: there was a bug in none.png... I deleted it a minute ago, 
> see
> http://the-lord.de/img/beispielwerte.pdf
> for better results.
>
> First... U or V are the singular values, not the eigenvectors ;)
>
> Lanczos SVD in Mahout computes the eigenvectors of M'M (it multiplies
> the transpose of the input matrix with the input matrix).
>
> As a matter of fact, I don't need U, just V, so I need to transpose M
> (because the eigenvectors of MM' are the V of the transposed matrix).
>
> So... normalizing the eigenvectors: doesn't cosine similarity do that
> anyway, i.e. ignore the lengths of the vectors?
> http://en.wikipedia.org/wiki/Cosine_similarity
>
> my parameters for ssvd:
> --rank 100
> --oversampling 10
> --blockHeight 227
> --computeU false
> --input
> --output
>
> the rest should be on default.
>
> Actually, I do not really know what this oversampling parameter means...
>
> 2011/6/14 Dmitriy Lyubimov <dlie...@gmail.com>:
>> Interesting.
>>
>> (One lingering confusion of mine re: Lanczos -- is it computing the U
>> eigenvectors or the V? The doc says "eigenvectors" but doesn't say left
>> or right. If it's V (the right ones), this sequence should be fine.)
>>
>> With SSVD I don't do the transpose; I just do the computation of U,
>> which will produce the document singular vectors directly.
>>
>> Also, I am not sure that Lanczos actually normalizes the eigenvectors,
>> but SSVD does (or multiplies the normalized version by the square root
>> of the singular value, whichever is requested). So depending on which
>> space your results are rotated into, the cosine similarities may be
>> different. I assume you used the normalized (true) eigenvectors from SSVD.
>>
>> It would also be interesting to know what oversampling parameter (p)
>> you used.
>>
>> Thanks.
>> -d
>>
>>
>> On Tue, Jun 14, 2011 at 2:04 PM, Stefan Wienert <ste...@wienert.cc> wrote:
>>> So... let's check the dimensions:
>>>
>>> First step: Lucene Output:
>>> 227 rows (=docs) and 107909 cols (=terms)
>>>
>>> transposed to:
>>> 107909 rows and 227 cols
>>>
>>> reduced with svd (rank 100) to:
>>> 99 rows and 227 cols
>>>
>>> transposed to (actually there was a bug here, with no effect on the SVD
>>> result, but with an effect on the NONE result):
>>> 227 rows and 99 cols
>>>
>>> So... now the cosine results are very similar to SVD 200.
>>>
>>> Results are added.
>>>
>>> @Sebastian: I will check if the bug affects my results.
>>>
>>> 2011/6/14 Fernando Fernández <fernando.fernandez.gonza...@gmail.com>:
>>>> Hi Stefan,
>>>>
>>>> Are you sure you need to transpose the input matrix? I thought that
>>>> what you get from the Lucene index was already a document(rows) by
>>>> term(columns) matrix, but you say that you obtain a term-document
>>>> matrix and transpose it. Is this correct? What are you using to obtain
>>>> this matrix from Lucene? Is it possible that you are calculating
>>>> similarities with the wrong matrix in one of the two cases
>>>> (with/without dimension reduction)?
>>>>
>>>> Best,
>>>> Fernando.
>>>>
>>>> 2011/6/14 Sebastian Schelter <s...@apache.org>
>>>>
>>>>> Hi Stefan,
>>>>>
>>>>> I checked the implementation of RowSimilarityJob and we might still have a
>>>>> bug in the 0.5 release... (f**k). I don't know if your problem is caused 
>>>>> by
>>>>> that, but the similarity scores might not be correct...
>>>>>
>>>>> We had this issue in 0.4 already, when someone realized that cooccurrences
>>>>> were mapped out inconsistently, so for 0.5 we made sure that we always map
>>>>> the smaller row as the first value. But apparently I did not adjust the
>>>>> value setting for the Cooccurrence object...
>>>>>
>>>>> In 0.5 the code is:
>>>>>
>>>>>  if (rowA <= rowB) {
>>>>>   rowPair.set(rowA, rowB, weightA, weightB);
>>>>>  } else {
>>>>>   rowPair.set(rowB, rowA, weightB, weightA);
>>>>>  }
>>>>>  coocurrence.set(column.get(), valueA, valueB);
>>>>>
>>>>> But it should be (already fixed in current trunk some days ago):
>>>>>
>>>>>  if (rowA <= rowB) {
>>>>>   rowPair.set(rowA, rowB, weightA, weightB);
>>>>>   coocurrence.set(column.get(), valueA, valueB);
>>>>>  } else {
>>>>>   rowPair.set(rowB, rowA, weightB, weightA);
>>>>>   coocurrence.set(column.get(), valueB, valueA);
>>>>>  }
>>>>>
>>>>> Maybe you could rerun your test with the current trunk?
>>>>>
>>>>> --sebastian
>>>>>
>>>>>
>>>>> On 14.06.2011 20:54, Sean Owen wrote:
>>>>>
>>>>>> It is a similarity, not a distance. Higher values mean more
>>>>>> similarity, not less.
>>>>>>
>>>>>> I agree that similarity ought to decrease with more dimensions. That
>>>>>> is what you observe -- except that you see quite high average
>>>>>> similarity with no dimension reduction!
>>>>>>
>>>>>> An average cosine similarity of 0.87 sounds "high" to me for anything
>>>>>> but a few dimensions. What's the dimensionality of the input without
>>>>>> dimension reduction?
>>>>>>
>>>>>> Something is amiss in this pipeline. It is an interesting question!
>>>>>>
>>>>>> On Tue, Jun 14, 2011 at 7:39 PM, Stefan Wienert<ste...@wienert.cc>
>>>>>>  wrote:
>>>>>>
>>>>>>> Actually I'm using  RowSimilarityJob() with
>>>>>>> --input input
>>>>>>> --output output
>>>>>>> --numberOfColumns documentCount
>>>>>>> --maxSimilaritiesPerRow documentCount
>>>>>>> --similarityClassname SIMILARITY_UNCENTERED_COSINE
>>>>>>>
>>>>>>> Actually I am not really sure what this SIMILARITY_UNCENTERED_COSINE
>>>>>>> calculates...
>>>>>>> the source says: "distributed implementation of cosine similarity that
>>>>>>> does not center its data"
>>>>>>>
>>>>>>> So... this seems to be the similarity and not the distance?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Stefan
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 2011/6/14 Stefan Wienert<ste...@wienert.cc>:
>>>>>>>
>>>>>>>> but... why do I get such different results with cosine similarity
>>>>>>>> with no dimension reduction (with 100,000 dimensions)?
>>>>>>>>
>>>>>>>> 2011/6/14 Fernando Fernández<fernando.fernandez.gonza...@gmail.com>:
>>>>>>>>
>>>>>>>>> Actually that's what your results are showing, aren't they? With rank
>>>>>>>>> 1000
>>>>>>>>> the similarity avg is the lowest...
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2011/6/14 Jake Mannix<jake.man...@gmail.com>
>>>>>>>>>
>>>>>>>>>> actually, wait - are your graphs showing *similarity*, or
>>>>>>>>>> *distance*? In higher dimensions, *distance* (and the angle
>>>>>>>>>> between vectors) should grow, but on the other hand,
>>>>>>>>>> *similarity* (cos(angle)) should go toward 0.
>>>>>>>>>>
>>>>>>>>>> On Tue, Jun 14, 2011 at 10:15 AM, Stefan Wienert<ste...@wienert.cc>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey Guys,
>>>>>>>>>>>
>>>>>>>>>>> I have some strange results in my LSA-Pipeline.
>>>>>>>>>>>
>>>>>>>>>>> First, let me explain the steps my data goes through:
>>>>>>>>>>> 1) Extract the term-document matrix from a Lucene datastore, using
>>>>>>>>>>> TF-IDF as the weighting
>>>>>>>>>>> 2) Transpose the TDM
>>>>>>>>>>> 3a) Use Mahout SVD (Lanczos) on the transposed TDM
>>>>>>>>>>> 3b) Use Mahout SSVD (stochastic SVD) on the transposed TDM
>>>>>>>>>>> 3c) Use no dimension reduction (for testing purposes)
>>>>>>>>>>> 4) Transpose the result (ONLY none / svd)
>>>>>>>>>>> 5) Calculate the cosine similarity (with Mahout)
>>>>>>>>>>>
>>>>>>>>>>> Now... some strange things happen:
>>>>>>>>>>> First of all: the demo data shows the similarity from document 1 to
>>>>>>>>>>> all other documents.
>>>>>>>>>>>
>>>>>>>>>>> the results using only cosine similarity (without dimension
>>>>>>>>>>> reduction):
>>>>>>>>>>> http://the-lord.de/img/none.png
>>>>>>>>>>>
>>>>>>>>>>> the result using svd, rank 10
>>>>>>>>>>> http://the-lord.de/img/svd-10.png
>>>>>>>>>>> some points falling down to the bottom.
>>>>>>>>>>>
>>>>>>>>>>> the results using ssvd rank 10
>>>>>>>>>>> http://the-lord.de/img/ssvd-10.png
>>>>>>>>>>>
>>>>>>>>>>> the result using svd, rank 100
>>>>>>>>>>> http://the-lord.de/img/svd-100.png
>>>>>>>>>>> more points falling down to the bottom.
>>>>>>>>>>>
>>>>>>>>>>> the results using ssvd rank 100
>>>>>>>>>>> http://the-lord.de/img/ssvd-100.png
>>>>>>>>>>>
>>>>>>>>>>> the results using svd rank 200
>>>>>>>>>>> http://the-lord.de/img/svd-200.png
>>>>>>>>>>> even more points falling down to the bottom.
>>>>>>>>>>>
>>>>>>>>>>> the results using svd rank 1000
>>>>>>>>>>> http://the-lord.de/img/svd-1000.png
>>>>>>>>>>> most points are at the bottom
>>>>>>>>>>>
>>>>>>>>>>> please be aware of the scale:
>>>>>>>>>>> - the avg for none: 0.8712
>>>>>>>>>>> - the avg for svd rank 10: 0.2648
>>>>>>>>>>> - the avg for svd rank 100: 0.0628
>>>>>>>>>>> - the avg for svd rank 200: 0.0238
>>>>>>>>>>> - the avg for svd rank 1000: 0.0116
>>>>>>>>>>>
>>>>>>>>>>> so my question is:
>>>>>>>>>>> Can you explain this behavior? Why are the documents getting more
>>>>>>>>>>> similar with more ranks in SVD? I thought it was the opposite.
>>>>>>>>>>>
>>>>>>>>>>> Cheers
>>>>>>>>>>> Stefan
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>
>
>
> --
> Stefan Wienert
>
> http://www.wienert.cc
> ste...@wienert.cc
>
> Telefon: +495251-2026838
> Mobil: +49176-40170270
>
