I can supply code for computing the measure itself, but not for the map-reduce computation of the counts involved.
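
A minimal sketch of the measure, for illustration only: this is the standard G^2 form of the log-likelihood ratio statistic for a 2x2 table of counts, not necessarily the exact code referred to in this message. The function name llr and the argument names k11, k12, k21, k22 are assumptions made for the example (users with both A and B, with A but not B, with B but not A, and with neither).

```python
import math


def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio statistic for a 2x2 table of counts.

    k11: users who interacted with both A and B
    k12: users who interacted with A but not B
    k21: users who interacted with B but not A
    k22: users who interacted with neither
    (argument names are illustrative, not taken from any library)
    """
    n = k11 + k12 + k21 + k22
    row_sums = (k11 + k12, k21 + k22)
    col_sums = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))

    total = 0.0
    for k, i, j in cells:
        # An empty cell contributes nothing (the limit of k * log(k) at 0 is 0).
        if k > 0:
            # Observed count times log(observed / expected under independence).
            total += k * math.log(k * n / (row_sums[i] * col_sums[j]))
    return 2.0 * total
```

A table in which A and B are independent, such as llr(10, 10, 10, 10), scores 0; larger scores indicate that the observed cooccurrence departs further from what independence would predict.
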
In my experience, computing those counts requires only about 10-15 lines of Pig, but a rather larger amount of native map-reduce code. At Veoh, we used this and other mechanisms to reduce very large amounts of data (7 months at billions of events per month) into a form usable for recommendation. Even with a relatively small cluster, this is not an extremely long computation.

The four inputs to the log-likelihood ratio test for independence are all counts. For item A and item B, the necessary counts are the number of users who interacted with both item A and item B, the number of users who interacted with A but not B, the number who interacted with B but not A, and the number of users who interacted with neither item. To minimize issues with click spam, it is customary to count only one interaction per user, so all of the counts can be considered counts of users rather than events.

If you view your set of histories as a binary matrix H containing rows that correspond to users and columns that correspond to items, then H' H is the matrix of cooccurrence counts for all possible A's and B's. Columns of H' H provide the information needed to get the A-not-B and B-not-A counts, and the total of the matrix gives the information for the not-A-not-B counts. This matrix multiplication is, in fact, the same as a join.

I have a blog posting on the subject of computing log-likelihood ratios here:

http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html

If need be, I can add a worked example of how to compute cooccurrence using map-reduce.

On Mon, Jan 19, 2009 at 9:30 PM, Goel, Ankur <[email protected]> wrote:

> About Tanimoto measure, I thought of using it in hierarchical clustering
> but Ted suggested it might not solve the purpose. He suggested that we
> can try computing the log-likelihood of co-occurrence of items.
>
> I would like to try out both the item based recommender you suggested
> and also the log-likelihood approach. Do we have the map-red version of
> log-likelihood code in Mahout?
>
> Ted, any thoughts?
>

--
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)
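
For reference, the counting step described in the message can be sketched in memory as follows. This is a small single-machine illustration, not map-reduce or Pig, and not the code used at Veoh. It assumes histories arrive as a dict mapping each user to the items that user interacted with, and that items are sortable values such as strings; the function name cooccurrence_counts and the input layout are hypothetical. Pairing up the items within each user's history plays the role of the self-join on user that corresponds to H' H. The per-item totals and the number of users are tracked directly here as a convenience.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(histories):
    """Build the four counts for every item pair that cooccurs at least once.

    histories: dict mapping user -> iterable of items (hypothetical layout).
    Returns a dict mapping (a, b), with a < b, to (k11, k12, k21, k22).
    """
    item_users = Counter()  # users per item
    pair_users = Counter()  # users per cooccurring item pair
    for items in histories.values():
        # Count at most one interaction per user per item.
        unique = sorted(set(items))
        item_users.update(unique)
        # Every pair within one user's history is one cooccurrence.
        pair_users.update(combinations(unique, 2))

    n_users = len(histories)
    tables = {}
    for (a, b), k11 in pair_users.items():
        k12 = item_users[a] - k11          # A but not B
        k21 = item_users[b] - k11          # B but not A
        k22 = n_users - k11 - k12 - k21    # neither A nor B
        tables[(a, b)] = (k11, k12, k21, k22)
    return tables
```

Each resulting tuple holds the four counts the log-likelihood ratio test takes as input. Pairs that never cooccur are omitted, which keeps the output proportional to the observed cooccurrences rather than to the square of the number of items.
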
