Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Ted Dunning Mon, 19 Jan 2009 22:46:33 -0800

I can supply code for computing the measure itself, but not for the
map-reduce computation of the counts involved.

In my experience, this requires only about 10-15 lines of Pig, but a rather
larger amount of native map-reduce code.  At Veoh, we used this and other
mechanisms to reduce very large amounts of data (7 months at billions of
events per month) into a form usable for recommendation.  Even with a
relatively small cluster, this is not an extremely long computation.

The four inputs to the log-likelihood ratio test for independence are all
counts.  For item A and item B, the necessary counts are the number of
users who interacted with both item A and item B, the number of users who
interacted with A but not B, the number who interacted with B but not A,
and the number of users who interacted with neither item.  To minimize
issues with click spam, it is customary to count only one interaction per
user, so all of the counts can be considered counts of users rather than
events.
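
As a minimal sketch of how the four counts feed the test, the following
computes the standard G^2 form of the log-likelihood ratio for a 2x2 table.
The function name llr and the argument names k11, k12, k21, k22 are
illustrative assumptions, not names taken from Mahout or from this thread.

```python
import math

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: users who interacted with both A and B
    k12: users who interacted with A but not B
    k21: users who interacted with B but not A
    k22: users who interacted with neither
    (Argument names are hypothetical, chosen for this sketch.)
    """
    n = k11 + k12 + k21 + k22
    row_sums = (k11 + k12, k21 + k22)
    col_sums = (k11 + k21, k12 + k22)
    cells = ((k11, k12), (k21, k22))

    total = 0.0
    for i in range(2):
        for j in range(2):
            k = cells[i][j]
            if k > 0:  # an empty cell contributes nothing to the sum
                expected = row_sums[i] * col_sums[j] / n
                total += k * math.log(k / expected)
    return 2.0 * total
```

A table whose cells match what independence predicts scores zero, and the
score grows as the observed counts depart from that expectation.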

If you view your set of histories as a binary matrix H containing rows
that correspond to users and columns that correspond to items, then H' H is
the matrix of cooccurrence counts for all possible A's and B's.  Columns of
H' H provide information needed to get the A-not-B and B-not-A counts and
the total of the matrix gives the information for the not-A-not-B
counts.
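
A small sketch of that matrix view, in plain Python on a made-up history
matrix.  One assumption is worth flagging: here the per-item user counts are
read off the diagonal of H' H and the total number of users is taken as the
number of rows of H, which is one straightforward way to recover the other
three counts and not necessarily the exact bookkeeping meant above.  The
names H, cooccurrence and contingency are hypothetical.

```python
# Hypothetical toy history matrix: rows are users, columns are items.
H = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
]

def cooccurrence(h):
    """Return H' H: entry [a][b] is the number of users with both a and b."""
    n_items = len(h[0])
    return [
        [sum(row[a] * row[b] for row in h) for b in range(n_items)]
        for a in range(n_items)
    ]

def contingency(h, a, b):
    """Derive the four counts for items a and b from H' H.

    Assumes the diagonal of H' H holds per-item user counts and that
    the number of rows of h is the total number of users.
    """
    c = cooccurrence(h)
    n_users = len(h)
    k11 = c[a][b]                    # both a and b
    k12 = c[a][a] - k11              # a but not b
    k21 = c[b][b] - k11              # b but not a
    k22 = n_users - k11 - k12 - k21  # neither
    return k11, k12, k21, k22
```

For the toy matrix, contingency(H, 0, 1) gives 2 users with both items, 1
with only the first, 0 with only the second, and 1 with neither.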

This matrix multiplication is, in fact, the same as a join.
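
To make the join reading concrete, here is a single-machine sketch, assuming
the input is a list of (user, item) events.  Grouping by user plays the role
of the join key, and emitting every item pair within a group and summing
corresponds to what a reducer would do.  The function name and return shape
are illustrative only, not an existing Mahout or Pig API.

```python
from collections import Counter, defaultdict
from itertools import combinations

def cooccurrence_by_join(events):
    """Count item pairs per user, i.e. a self-join of events on user.

    events: iterable of (user, item) pairs; repeated events by the
    same user on the same item are collapsed to one interaction.
    """
    items_by_user = defaultdict(set)
    for user, item in events:
        items_by_user[user].add(item)  # one interaction per user per item

    pair_counts = Counter()   # users who interacted with both items
    item_counts = Counter()   # users who interacted with each item
    for items in items_by_user.values():
        item_counts.update(items)
        for a, b in combinations(sorted(items), 2):
            pair_counts[(a, b)] += 1

    return pair_counts, item_counts, len(items_by_user)
```

The three return values (pair counts, per-item counts, and the number of
users) are enough to fill in all four cells of the table for any pair.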

I have a blog posting on the subject of computing log-likelihood ratios
here:

 http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html

If need be, I can add a worked example of how to compute co-occurrence using
map-reduce.

On Mon, Jan 19, 2009 at 9:30 PM, Goel, Ankur <[email protected]> wrote:

> About Tanimoto measure, I thought of using it in hierarchical clustering
> but Ted suggested it might not solve the purpose. He suggested that we
> can try computing the log-likelihood of co-occurrence of items.
>
> I would like to try out both the item based recommender you suggested
> and also the log-likelihood approach. Do we have the map-red version of
> log-likelihood code in Mahout?
>
> Ted, any thoughts?
>

-- 
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)
