Of course it's possible. It does mean you need to make sure your
similarity metrics are meaningfully comparable.

This ranges from basic checks, like making sure both metrics output
values in the same range, to more subtle ones, like making sure a
"0.5" from each means something comparable. That is not true of all
metrics -- consider that 0.5 in Euclidean-distance similarity
(1/(1+d)) means "the distance is 1", while 0.5 in Jaccard/Tanimoto
means "half their items overlap". Are those about the same? I don't know.
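To make the scale mismatch concrete, here is a toy sketch (plain Java, not any Mahout API) where both metrics return exactly 0.5 for situations that are not obviously "equally similar" -- ratings one unit apart versus half the items overlapping:

```java
import java.util.HashSet;
import java.util.Set;

public class MetricScales {
    // Euclidean-distance similarity 1/(1+d): returns 0.5 whenever d == 1.
    static double euclideanSim(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return 1.0 / (1.0 + Math.sqrt(sum));
    }

    // Jaccard/Tanimoto: |intersection| / |union| over the users' item sets.
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        // Ratings one unit apart: distance 1, so similarity 0.5.
        System.out.println(euclideanSim(new double[]{4.0}, new double[]{5.0}));

        // Two items shared out of four distinct: similarity 0.5.
        Set<String> u1 = Set.of("book1", "book2", "book3");
        Set<String> u2 = Set.of("book2", "book3", "book4");
        System.out.println(jaccard(u1, u2));
    }
}
```

Both print 0.5, yet one is saying "ratings differ by one point" and the other "half the item sets overlap" -- the numbers agree only by coincidence of scale.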

If the similarity is construed as a probability in both cases, you
should likely multiply rather than add.
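A minimal sketch of the two combination rules, assuming two per-source similarities (the numbers and weight here are made up for illustration): a weighted sum treats them as scores, while a product treats each as an independent probability of a match, so the joint evidence multiplies.

```java
public class CombineSims {
    // Weighted average: treats the similarities as comparable scores.
    static double weightedSum(double simA, double simB, double weightA) {
        return weightA * simA + (1.0 - weightA) * simB;
    }

    // Product: appropriate if each similarity is construed as
    // P(match | that data source) and the sources are assumed independent.
    static double product(double simA, double simB) {
        return simA * simB;
    }

    public static void main(String[] args) {
        double bookSim = 0.8;   // hypothetical similarity from purchase data
        double videoSim = 0.6;  // hypothetical similarity from viewing data

        System.out.println(weightedSum(bookSim, videoSim, 0.7)); // 0.74
        System.out.println(product(bookSim, videoSim));          // 0.48
    }
}
```

Note the product can only ever shrink the score, which is the right behavior for conjoined probabilistic evidence but wrong for a plain average of scores.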


On Mon, Jul 9, 2012 at 7:55 AM, bangbig <lizhongliangg...@163.com> wrote:
> I have thought about this problem before, and I have read several posts 
> about it. Sean Owen is right that the math doesn't care what the 
> things are. But in practice I think a better way is to evaluate the 
> individual similarity for each kind of data, and then combine the 
> individual similarities into a final one.
> That means that for users with two different kinds of data, you first 
> derive two similarities, one from the Amazon data and one from the 
> YouTube video data, and then add them with weights to get the final 
> user-user similarity matrix.
> LinkedIn's example:
> http://www.quora.com/How-does-LinkedIns-recommendation-system-work
> When they compute the similarity of people's profiles on LinkedIn, the 
> speaker said this:
> "Here in order to compute overall of similarity between me and Adil, we are 
> first computing similarity between our specialties, our skills, our titles 
> and other attribute."
> and "Now we somehow need to combine the similarity score in the vector to a 
> single number". There are some pictures in the post that can help you 
> understand it.
> I wonder if any of you agree with me?
> thanks!
> zhongliang
> At 2012-07-04 15:42:16,"Sean Owen" <sro...@gmail.com> wrote:
>>The best default answer is to put them all in one model. The math
>>doesn't care what the things are. Unless you have a strong reason to
>>weight one data set I wouldn't. If you do, then two models is best. It
>>is hard to weight a subset of the data within most similarity
>>functions. I don't think it would work in Pearson, for instance, but
>>it could work in Tanimoto.
>>
>>On Wed, Jul 4, 2012 at 1:20 AM, Ken Krugler <kkrugler_li...@transpac.com> 
>>wrote:
>>> Hi all,
>>>
>>> I'm curious what approaches are recommended for generating user-user 
>>> similarity, when I've got two (or more) distinct types of item data, both 
>>> of which are fairly large.
>>>
>>> E.g. let's say I had a set of users where I knew both (a) what books they 
>>> had bought on Amazon, and (b) what YouTube videos they had watched.
>>>
>>> For each user, I want to find the 10 most similar other users.
>>>
>>>  - I could create two separate models, find the nearest 30 users for each 
>>> user, and combine (maybe with weighting) the results.
>>>  - I could toss all of the data into one model - and I could use a value of 
>>> < 1.0 for whichever type of preference is less important.
>>>
>>> Any other suggestions? Input on the above two approaches?
>>>
>>> Thanks!
>>>
>>> -- Ken
>>>
>>> --------------------------
>>> Ken Krugler
>>> http://www.scaleunlimited.com
>>> custom big data solutions & training
>>> Hadoop, Cascading, Mahout & Solr
>>>
>>>
>>>
>>>
