The best default answer is to put everything in one model. The math doesn't care what the items are. Unless you have a strong reason to weight one data set differently from the other, I wouldn't. If you do, then two models is best, because it is hard to weight a subset of the data within most similarity functions. I don't think weighting would work with Pearson, for instance, but it could work with Tanimoto.
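
To make the one-model idea concrete, here is a rough sketch in plain Python (not Mahout's API). It assumes boolean preferences, tags each item with its type so books and videos share one item space, and uses a weighted form of Tanimoto (weight of shared items over weight of all items). The names weighted_tanimoto and most_similar, the sample users, and the 0.5 weight are all made up for illustration.

```python
from typing import Dict, Hashable, List, Set, Tuple

# An item is (item type, item id), e.g. ("book", "b1"), so both data sets
# can live in one model without id collisions.
Item = Tuple[str, Hashable]


def weighted_tanimoto(a: Set[Item], b: Set[Item], weights: Dict[str, float]) -> float:
    """Weight of the shared items divided by the weight of all items.

    Item types missing from `weights` count as 1.0, so an empty dict
    gives the ordinary (unweighted) Tanimoto coefficient.
    """
    def total(items: Set[Item]) -> float:
        return sum(weights.get(kind, 1.0) for kind, _ in items)

    union = total(a | b)
    return total(a & b) / union if union else 0.0


def most_similar(user: str, prefs: Dict[str, Set[Item]],
                 weights: Dict[str, float], n: int = 10) -> List[Tuple[str, float]]:
    """Brute-force top-n neighbours for one user (compares against every other user)."""
    scores = [(other, weighted_tanimoto(prefs[user], prefs[other], weights))
              for other in prefs if other != user]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:n]


# Hypothetical sample data: books bought and videos watched, in one model.
prefs = {
    "alice": {("book", "b1"), ("book", "b2"), ("video", "v1")},
    "bob": {("book", "b1"), ("video", "v1"), ("video", "v2")},
    "carol": {("video", "v2"), ("video", "v3")},
}

# Videos count half as much as books (an arbitrary choice for the example).
print(most_similar("alice", prefs, {"video": 0.5}, n=2))
```

Passing an empty weights dict treats both data sets equally, which is the default suggested above. The pairwise loop is only for illustration; it compares every user against every other user, so it is not something to run as-is on large data.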

On Wed, Jul 4, 2012 at 1:20 AM, Ken Krugler <kkrugler_li...@transpac.com> wrote:
> Hi all,
>
> I'm curious what approaches are recommended for generating user-user
> similarity, when I've got two (or more) distinct types of item data, both of
> which are fairly large.
>
> E.g. let's say I had a set of users where I knew both (a) what books they had
> bought on Amazon, and (b) what YouTube videos they had watched.
>
> For each user, I want to find the 10 most similar other users.
>
> - I could create two separate models, find the nearest 30 users for each
>   user, and combine (maybe with weighting) the results.
> - I could toss all of the data into one model - and I could use a value of
>   < 1.0 for whichever type of preference is less important.
>
> Any other suggestions? Input on the above two approaches?
>
> Thanks!
>
> -- Ken
>
> --------------------------
> Ken Krugler
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr