Dmitriy, be aware: there was a bug in none.png... I deleted it a minute ago, see http://the-lord.de/img/beispielwerte.pdf for better results.
First... U and V are the singular vectors, not the eigenvectors ;) Lanczos-SVD in Mahout computes the eigenvectors of M*M (it multiplies the input matrix with its transpose). In fact, I don't need U, just V, so I need to transpose M (because the eigenvectors of MM* = V).

So... normalizing the eigenvectors: isn't the cosine similarity doing this anyway, i.e. ignoring the length of the vectors?
http://en.wikipedia.org/wiki/Cosine_similarity

My parameters for ssvd:
--rank 100
--oversampling 10
--blockHeight 227
--computeU false
--input --output
The rest should be on default. Actually I do not really know what this oversampling parameter means...

2011/6/14 Dmitriy Lyubimov <dlie...@gmail.com>:
> Interesting.
>
> (I have one confusion of mine RE: lanczos -- is it computing U
> eigenvectors or V? The doc says "eigenvectors" but doesn't say left or
> right. If it's V (right eigenvectors), this sequence should be fine.)
>
> With ssvd I don't do a transpose, I just do the computation of U, which will
> produce document singular vectors directly.
>
> Also, I am not sure that Lanczos actually normalizes the eigenvectors,
> but SSVD does (or multiplies the normalized version by the square root of a
> singular value, whichever is requested). So depending on which space
> your rotated results are in, cosine similarities may be different. I assume
> you used normalized (true) eigenvectors from ssvd.
>
> Also it would be interesting to know which oversampling parameter (p) you
> used.
>
> Thanks.
> -d
>
>
> On Tue, Jun 14, 2011 at 2:04 PM, Stefan Wienert <ste...@wienert.cc> wrote:
>> So... let's check the dimensions:
>>
>> First step: Lucene output:
>> 227 rows (=docs) and 107909 cols (=terms)
>>
>> transposed to:
>> 107909 rows and 227 cols
>>
>> reduced with svd (rank 100) to:
>> 99 rows and 227 cols
>>
>> transposed to: (actually there was a bug (with no effect on the SVD
>> result, but on the NONE result))
>> 227 rows and 99 cols
>>
>> So... now the cosine results are very similar to SVD 200.
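A quick sanity check on the normalization question raised above: cosine similarity divides by both vectors' norms, so any rescaling of an eigenvector cancels out of the score. A minimal plain-Java sketch (not Mahout's actual implementation):

```java
// Minimal sketch: cosine similarity is invariant to scaling either input,
// which is why (non-)normalization of eigenvectors alone should not change
// cosine scores -- only mixing in sqrt(singular value) weights would.
public class CosineScaleInvariance {

    // cosine(a, b) = a.b / (|a| * |b|)
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3};
        double[] b = {4, 5, 6};
        double[] aScaled = {10, 20, 30}; // a multiplied by 10

        // both calls print the same value: vector length is ignored
        System.out.println(cosine(a, b));
        System.out.println(cosine(aScaled, b));
    }
}
```

Note that this invariance only holds per vector pair; if SSVD multiplies each eigenvector by the square root of its singular value, the *relative* weighting of the k dimensions changes, and cosine scores can differ from the plain-eigenvector case.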
>>
>> Results are added.
>>
>> @Sebastian: I will check if the bug affects my results.
>>
>> 2011/6/14 Fernando Fernández <fernando.fernandez.gonza...@gmail.com>:
>>> Hi Stefan,
>>>
>>> Are you sure you need to transpose the input matrix? I thought that what you
>>> get from the Lucene index was already a document(rows)-term(columns) matrix, but
>>> you say that you obtain a term-document matrix and transpose it. Is this
>>> correct? What are you using to obtain this matrix from Lucene? Is it
>>> possible that you are calculating similarities with the wrong matrix in one
>>> of the two cases? (With/without dimension reduction.)
>>>
>>> Best,
>>> Fernando.
>>>
>>> 2011/6/14 Sebastian Schelter <s...@apache.org>
>>>
>>>> Hi Stefan,
>>>>
>>>> I checked the implementation of RowSimilarityJob and we might still have a
>>>> bug in the 0.5 release... (f**k). I don't know if your problem is caused by
>>>> that, but the similarity scores might not be correct...
>>>>
>>>> We had this issue in 0.4 already, when someone realized that cooccurrences
>>>> were mapped out inconsistently, so for 0.5 we made sure that we always map
>>>> the smaller row as the first value. But apparently I did not adjust the value
>>>> setting for the Cooccurrence object...
>>>>
>>>> In 0.5 the code is:
>>>>
>>>> if (rowA <= rowB) {
>>>>   rowPair.set(rowA, rowB, weightA, weightB);
>>>> } else {
>>>>   rowPair.set(rowB, rowA, weightB, weightA);
>>>> }
>>>> coocurrence.set(column.get(), valueA, valueB);
>>>>
>>>> But it should be (already fixed in the current trunk some days ago):
>>>>
>>>> if (rowA <= rowB) {
>>>>   rowPair.set(rowA, rowB, weightA, weightB);
>>>>   coocurrence.set(column.get(), valueA, valueB);
>>>> } else {
>>>>   rowPair.set(rowB, rowA, weightB, weightA);
>>>>   coocurrence.set(column.get(), valueB, valueA);
>>>> }
>>>>
>>>> Maybe you could rerun your test with the current trunk?
>>>>
>>>> --sebastian
>>>>
>>>>
>>>> On 14.06.2011 20:54, Sean Owen wrote:
>>>>
>>>>> It is a similarity, not a distance.
>>>>> Higher values mean more
>>>>> similarity, not less.
>>>>>
>>>>> I agree that similarity ought to decrease with more dimensions. That
>>>>> is what you observe -- except that you see quite high average
>>>>> similarity with no dimension reduction!
>>>>>
>>>>> An average cosine similarity of 0.87 sounds "high" to me for anything
>>>>> but a few dimensions. What's the dimensionality of the input without
>>>>> dimension reduction?
>>>>>
>>>>> Something is amiss in this pipeline. It is an interesting question!
>>>>>
>>>>> On Tue, Jun 14, 2011 at 7:39 PM, Stefan Wienert <ste...@wienert.cc> wrote:
>>>>>
>>>>>> Actually I'm using RowSimilarityJob() with
>>>>>> --input input
>>>>>> --output output
>>>>>> --numberOfColumns documentCount
>>>>>> --maxSimilaritiesPerRow documentCount
>>>>>> --similarityClassname SIMILARITY_UNCENTERED_COSINE
>>>>>>
>>>>>> Actually I am not really sure what this SIMILARITY_UNCENTERED_COSINE
>>>>>> calculates...
>>>>>> The source says: "distributed implementation of cosine similarity that
>>>>>> does not center its data"
>>>>>>
>>>>>> So... this seems to be the similarity and not the distance?
>>>>>>
>>>>>> Cheers,
>>>>>> Stefan
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2011/6/14 Stefan Wienert <ste...@wienert.cc>:
>>>>>>
>>>>>>> But... why do I get the different results with cosine similarity with
>>>>>>> no dimension reduction (with 100,000 dimensions)?
>>>>>>>
>>>>>>> 2011/6/14 Fernando Fernández <fernando.fernandez.gonza...@gmail.com>:
>>>>>>>
>>>>>>>> Actually that's what your results are showing, aren't they? With rank
>>>>>>>> 1000 the similarity avg is the lowest...
>>>>>>>>
>>>>>>>>
>>>>>>>> 2011/6/14 Jake Mannix <jake.man...@gmail.com>
>>>>>>>>
>>>>>>>>> Actually, wait - are your graphs showing *similarity*, or *distance*?
>>>>>>>>> In higher dimensions, *distance* (and the angle) should grow, but on
>>>>>>>>> the other hand, *similarity* (cos(angle)) should go toward 0.
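Jake's point about high dimensions can be illustrated numerically: independent random vectors become nearly orthogonal as the dimension grows, so their average cosine similarity shrinks (roughly like 1/sqrt(d)). A small standalone sketch, not related to the Mahout code:

```java
import java.util.Random;

// Sketch: average |cosine similarity| of random Gaussian vector pairs
// drops as the dimension d grows -- the "curse of dimensionality" effect
// being discussed in this thread.
public class CosineVsDimension {

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // average |cosine| over random Gaussian vector pairs of dimension d
    static double avgAbsCosine(int d, int pairs, Random rnd) {
        double sum = 0;
        for (int p = 0; p < pairs; p++) {
            double[] a = new double[d], b = new double[d];
            for (int i = 0; i < d; i++) {
                a[i] = rnd.nextGaussian();
                b[i] = rnd.nextGaussian();
            }
            sum += Math.abs(cosine(a, b));
        }
        return sum / pairs;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42); // fixed seed for repeatability
        for (int d : new int[] {2, 10, 100, 1000}) {
            System.out.printf("d=%4d  avg |cos| = %.4f%n",
                    d, avgAbsCosine(d, 200, rnd));
        }
        // the average shrinks as d grows: roughly 0.6 at d=2,
        // down to a few hundredths at d=1000
    }
}
```

Note this models *random* data; real document vectors are correlated, which is why Stefan's unreduced 107909-dimensional input showing an average of 0.87 is surprising and hints at a pipeline problem rather than a dimensionality effect.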
>>>>>>>>>
>>>>>>>>> On Tue, Jun 14, 2011 at 10:15 AM, Stefan Wienert <ste...@wienert.cc> wrote:
>>>>>>>>>
>>>>>>>>>> Hey guys,
>>>>>>>>>>
>>>>>>>>>> I have some strange results in my LSA pipeline.
>>>>>>>>>>
>>>>>>>>>> First, I explain the steps my data goes through:
>>>>>>>>>> 1) Extract the term-document matrix from a Lucene datastore, using
>>>>>>>>>> TF-IDF as the weighting
>>>>>>>>>> 2) Transpose the TDM
>>>>>>>>>> 3a) Using Mahout SVD (Lanczos) with the transposed TDM
>>>>>>>>>> 3b) Using Mahout SSVD (stochastic SVD) with the transposed TDM
>>>>>>>>>> 3c) Using no dimension reduction (for testing purposes)
>>>>>>>>>> 4) Transpose the result (ONLY none / svd)
>>>>>>>>>> 5) Calculate the cosine similarity (from Mahout)
>>>>>>>>>>
>>>>>>>>>> Now... some strange things happen:
>>>>>>>>>> First of all: the demo data shows the similarity from document 1 to
>>>>>>>>>> all other documents.
>>>>>>>>>>
>>>>>>>>>> The results using only cosine similarity (without dimension reduction):
>>>>>>>>>> http://the-lord.de/img/none.png
>>>>>>>>>>
>>>>>>>>>> The result using svd, rank 10:
>>>>>>>>>> http://the-lord.de/img/svd-10.png
>>>>>>>>>> Some points falling down to the bottom.
>>>>>>>>>>
>>>>>>>>>> The results using ssvd, rank 10:
>>>>>>>>>> http://the-lord.de/img/ssvd-10.png
>>>>>>>>>>
>>>>>>>>>> The result using svd, rank 100:
>>>>>>>>>> http://the-lord.de/img/svd-100.png
>>>>>>>>>> More points falling down to the bottom.
>>>>>>>>>>
>>>>>>>>>> The results using ssvd, rank 100:
>>>>>>>>>> http://the-lord.de/img/ssvd-100.png
>>>>>>>>>>
>>>>>>>>>> The results using svd, rank 200:
>>>>>>>>>> http://the-lord.de/img/svd-200.png
>>>>>>>>>> Even more points falling down to the bottom.
>>>>>>>>>>
>>>>>>>>>> The results using svd, rank 1000:
>>>>>>>>>> http://the-lord.de/img/svd-1000.png
>>>>>>>>>> Most points are at the bottom.
>>>>>>>>>>
>>>>>>>>>> Please beware of the scale:
>>>>>>>>>> - the avg from none: 0.8712
>>>>>>>>>> - the avg from svd rank 10: 0.2648
>>>>>>>>>> - the avg from svd rank 100: 0.0628
>>>>>>>>>> - the avg from svd rank 200: 0.0238
>>>>>>>>>> - the avg from svd rank 1000: 0.0116
>>>>>>>>>>
>>>>>>>>>> So my question is:
>>>>>>>>>> Can you explain this behavior? Why are the documents getting more
>>>>>>>>>> equal with more ranks in svd? I thought it was the opposite.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Stefan
>>>>>>>
>>>>>>> --
>>>>>>> Stefan Wienert
>>>>>>>
>>>>>>> http://www.wienert.cc
>>>>>>> ste...@wienert.cc
>>>>>>>
>>>>>>> Telefon: +495251-2026838
>>>>>>> Mobil: +49176-40170270

--
Stefan Wienert

http://www.wienert.cc
ste...@wienert.cc

Telefon: +495251-2026838
Mobil: +49176-40170270
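Sebastian's RowSimilarityJob fix quoted earlier in the thread boils down to one invariant: canonicalizing a cooccurrence so the smaller row comes first must swap the values together with the rows, otherwise the same observation yields two different records depending on input order. A minimal sketch of the corrected logic in plain Java (hypothetical names, not Mahout's actual classes):

```java
// Sketch of the ordering invariant behind the 0.5 RowSimilarityJob fix:
// a cooccurrence (rowA, rowB, valueA, valueB) seen in either order must
// canonicalize to the identical record.
public class CooccurrenceOrdering {

    // returns {firstRow, secondRow, firstValue, secondValue}
    static double[] canonicalize(int rowA, int rowB, double valueA, double valueB) {
        if (rowA <= rowB) {
            return new double[] {rowA, rowB, valueA, valueB};
        } else {
            // the 0.5 bug kept (valueA, valueB) here even though the rows
            // were swapped; the fix swaps the values along with the rows
            return new double[] {rowB, rowA, valueB, valueA};
        }
    }

    public static void main(String[] args) {
        double[] ab = canonicalize(7, 3, 0.5, 2.0);
        double[] ba = canonicalize(3, 7, 2.0, 0.5);
        // identical records regardless of input order
        System.out.println(java.util.Arrays.equals(ab, ba)); // prints "true"
    }
}
```

With the 0.5 bug, roughly half the cooccurrences (those arriving with rowA > rowB) pair each value with the wrong row, which corrupts the dot products and could plausibly contribute to the odd unreduced-cosine averages seen in this thread.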