What exactly does 'didn't buy' mean here? Was the user shown the item, or is it just an item they never considered?
To find the 'best' metric here you could simply run an offline evaluation across your dataset. But the most important question seems to be: what does each representation actually mean? This link is to a paper about a binary approach; read the evaluation section:

[PDF] from psu.edu <http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.109.2430&rep=rep1&type=pdf>

On Tue, Apr 26, 2011 at 8:16 AM, Sean Owen <[email protected]> wrote:

> I think my comment mostly addressed his comments. Yes, this is the
> definition of cosine distance, and is implemented. No it doesn't work over
> true binary data. There is no "0", only "1" or non-existent.
>
> What is the remaining question?
>
> On Tue, Apr 26, 2011 at 3:21 AM, Chris Waggoner <[email protected]> wrote:
>
> > I've never used Mahout but what this @allclaws wants sounds like a simple
> > proposition. Given a vector like
> >
> > bought
> > didn't buy
> > didn't buy
> > didn't buy
> > didn't buy
> > didn't buy
> > didn't buy
> > bought
> > didn't buy
> > bought
> > bought
> > bought
> >
> > define "bought" == 1 and "didn't buy" == 0. Define distance between two
> > such vectors to be { A dot B } over { |A| times |B| }. Not that I find this
> > compelling as a definition of similarity but @allclaws called this a first,
> > rough pass.
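
For reference, the measure Chris describes in the quoted message, { A dot B } over { |A| times |B| }, can be sketched in a few lines of plain Python. This is only an illustration and not Mahout's implementation. It follows Chris's assumption that "didn't buy" is an explicit 0, which, as Sean points out, is not how true binary data looks. The second vector (`user_b`) and the choice to return 0.0 for an all-zero vector are made up for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: (A . B) / (|A| * |B|)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: an all-zero vector is treated as having similarity 0.
        return 0.0
    return dot / (norm_a * norm_b)

# The vector from the quoted message, with bought = 1 and didn't buy = 0.
user_a = [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1]
# Hypothetical second user, invented for illustration.
user_b = [1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0]

sim = cosine_similarity(user_a, user_b)
print(sim)
```

With 0/1 entries the dot product is just the number of items both users bought, and each norm is the square root of the number of items that user bought. So the score only ever reflects co-purchases; a shared "didn't buy" contributes nothing, which is why the meaning of "didn't buy" matters more than the formula.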
