Setting didn't-buy to 0 and getting a valid cosine distance is pretty common
in these scenarios.

I still prefer what Sean is recommending, namely LLR (log-likelihood ratio)
for item-to-item links, but the cosine version does make sense to support,
especially for purchase histories.
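For concreteness, here is a minimal sketch of that cosine computation over
0/1 purchase vectors (the function name and toy vectors are mine, not
anything in Mahout's API):

```python
import math

def binary_cosine(a, b):
    # a, b: 0/1 purchase vectors ("didn't buy" = 0, "bought" = 1)
    dot = sum(x * y for x, y in zip(a, b))  # = number of co-purchased items
    # For 0/1 data, |A| is just sqrt(number of purchases)
    norm = math.sqrt(sum(a)) * math.sqrt(sum(b))
    return dot / norm if norm else 0.0

user_a = [1, 0, 0, 1, 1, 0]
user_b = [1, 1, 0, 0, 1, 0]
print(binary_cosine(user_a, user_b))  # 2 shared / sqrt(3 * 3) ~= 0.667
```

With the sparse data Sean describes (only "1" or non-existent), the same
number falls out of set intersections, |A n B| / sqrt(|A| * |B|), without
ever materializing the zeros.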

Even better would be to remember the number of times an item was offered as
well as the number of times it was purchased.  This allows regression
techniques to be applied, often with good results.
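As a sketch of what offer counts buy you: with (offered, purchased) pairs
per item, purchase propensity can be fit by, say, a small logistic
regression on the binomial counts. Everything below (the toy data, the
single illustrative feature, the plain gradient-ascent fit) is hypothetical
and not part of Mahout:

```python
import math

# Hypothetical aggregated data: (times offered, times purchased, feature x),
# where x might be something like a normalized price -- purely illustrative.
data = [
    (100, 30, 1.0),
    (80, 10, 2.0),
    (50, 40, 0.5),
]
total_offers = sum(offered for offered, _, _ in data)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Fit P(purchase | x) = sigmoid(w*x + b) by gradient ascent on the binomial
# log-likelihood; each item is weighted by how often it was offered.
w = b = 0.0
lr = 0.5
for _ in range(20000):
    gw = gb = 0.0
    for offered, purchased, x in data:
        residual = purchased - offered * sigmoid(w * x + b)
        gw += residual * x
        gb += residual
    w += lr * gw / total_offers
    b += lr * gb / total_offers

for offered, purchased, x in data:
    print(x, purchased / offered, round(sigmoid(w * x + b), 3))
```

The point is just that raw purchase flags can't support this: you need the
denominator (offers) to turn counts into rates a model can fit.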

On Tue, Apr 26, 2011 at 12:16 AM, Sean Owen <[email protected]> wrote:

> I think my comment mostly addressed his comments. Yes, this is the
> definition of cosine distance, and is implemented. No it doesn't work over
> true binary data. There is no "0", only "1" or non-existent.
>
> What is the remaining question?
>
> On Tue, Apr 26, 2011 at 3:21 AM, Chris Waggoner <[email protected]> wrote:
> >
> >
> > I've never used Mahout, but what @allclaws wants sounds like a simple
> > proposition. Given a vector like
> >
> > bought
> > didn't buy
> > didn't buy
> > didn't buy
> > didn't buy
> > didn't buy
> > didn't buy
> > bought
> > didn't buy
> > bought
> > bought
> > bought
> >
> >
> >
> > define "bought" == 1 and "didn't buy" == 0. Define the distance between
> > two such vectors to be (A dot B) / (|A| times |B|). Not that I find this
> > compelling as a definition of similarity, but @allclaws called this a
> > first, rough pass.
>
