Hi Ted,

Thanks for the help below. I ended up using the binary representation
approach. First, I converted the similarity scores to integer percentages
by multiplying each score by 100. Then I took the OR of the binary
representations of the two percentages and divided the result by 100 to
convert it back to a similarity. This seems to be working fine.
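
In Java, it was roughly the following (the helper below is mine, just a
sketch of the idea, not anything from Mahout):

    // combine two [0,1] similarity scores as described above
    static double combineScores(double simA, double simB) {
      int a = (int) Math.round(simA * 100); // score as an integer percentage
      int b = (int) Math.round(simB * 100);
      int merged = a | b;                   // bitwise OR of the two percentages
      return merged / 100.0;                // scale back to a similarity
    }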

I also learned over the weekend that there is a theory for computing such a
score from scores computed in different domains. The theory is called
"Utility Theory" and it uses a method called Kappa Statistics. I thought I
would share it with everyone here.

Thanks again for your help with this. It is very much appreciated.

-Ahmed

On Fri, Mar 23, 2012 at 6:32 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> My own recommendation is to reduce both scores to binary form using
> whatever sound statistical method you care to adopt and then use OR.
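>
> For example (a sketch only; choosing the thresholds is the "sound
> statistical method" part):
>
>     // reduce each score to binary by thresholding, then OR them
>     static boolean eitherSimilar(double scoreA, double thresholdA,
>                                  double scoreB, double thresholdB) {
>       return (scoreA >= thresholdA) || (scoreB >= thresholdB);
>     }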
>
> A viable alternative that is relatively good is to convert both scores to
> percentiles with the same polarity (i.e. 99-th %-ile is very close or very
> similar).  Then transform both percentiles using the logit function to get
> unbounded real numbers.  The logit of p is just log(p / (1-p)) where p is
> in the range (0,1).  These transformed percentiles can be added with
> reasonable impunity and the result can be interpreted pretty easily.  The time
> that this doesn't work so well is when one of the values is heavily
> quantized near the interesting end of the scale, but that problem is
> inherent in the data, not in the method.
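>
> In code that might look something like this (a sketch; computing the
> percentiles themselves is assumed to happen elsewhere):
>
>     // logit: map a percentile p in (0,1) to an unbounded real number
>     static double logit(double p) {
>       return Math.log(p / (1 - p));
>     }
>
>     // combine two percentiles of the same polarity by adding logits
>     static double combinedScore(double p1, double p2) {
>       return logit(p1) + logit(p2);
>     }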
>
> A similar result can be had by using -log(1-p) where p is the percentile
> in question.  For values of p near 1, this is approximately the same as
> using the logit function.  For values far from 1, we don't care what it
> means.
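>
> Or, as a sketch:
>
>     // approximately logit(p) for p near 1, which is the end we care about
>     static double tailScore(double p) {
>       return -Math.log(1 - p);
>     }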
>
>
> On Fri, Mar 23, 2012 at 1:52 PM, Sean Owen <sro...@gmail.com> wrote:
>
>> On Fri, Mar 23, 2012 at 8:33 PM, Ahmed Abdeen Hamed
>> <ahmed.elma...@gmail.com> wrote:
>> > As for merging the scores, I need an OR rule, which translates to
>> > addition. If I used AND, that would make the likelihood smaller because
>> > the probabilities would be multiplied. This would restrict the clusters
>> > to items that appear in the intersection of content-based similarity
>> > AND sales correlations. Does this sound right to you?
>>
>> Not really, because of course you multiply probabilities in all cases.
>> Yes, all similarities would be smaller in absolute terms, but that's
>> fine -- the absolute value does not matter.
>>
>> The problem with adding is that again it assumes the two terms are in
>> the same "units" and that is not clear here. The product doesn't
>> contain that assumption, at least.
>>
>> >
>> > A very important issue I am having now is about evaluation. How do we
>> > evaluate these clusters resulting from a TreeClusteringRecommender?
>> >
>>
>> In the context of recommenders, you don't. The clusters are not the
>> output, just a possible implementation by-product. You could compute
>> metrics like intra-cluster distance vs inter-cluster distance but I
>> don't know what it says about the quality of the recs.
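>>
>> If you did want to look at that, a rough sketch (plain Java, not part of
>> Mahout's evaluators):
>>
>>     import java.util.List;
>>
>>     // mean pairwise Euclidean distance within a cluster; compare this
>>     // to the mean distance between points in different clusters
>>     static double meanIntraDistance(List<double[]> cluster) {
>>       double sum = 0;
>>       int pairs = 0;
>>       for (int i = 0; i < cluster.size(); i++) {
>>         for (int j = i + 1; j < cluster.size(); j++) {
>>           sum += distance(cluster.get(i), cluster.get(j));
>>           pairs++;
>>         }
>>       }
>>       return pairs == 0 ? 0.0 : sum / pairs;
>>     }
>>
>>     static double distance(double[] a, double[] b) {
>>       double s = 0;
>>       for (int k = 0; k < a.length; k++) {
>>         double d = a[k] - b[k];
>>         s += d * d;
>>       }
>>       return Math.sqrt(s);
>>     }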
>>
>> You should start with the standard rec eval code if you can.
>>
>
>
