At first this indeed sounds like a CF problem. You can use clustering to solve a CF problem (see, for instance, TreeClusteringRecommender).
But you could use other algorithms just as well - see any of the other Recommenders. You don't have ratings for a URL, just a binary 'yes, visited' or nothing. You can take advantage of that by using the 'Boolean*' classes and the Tanimoto similarity metric.

This doesn't capture the fact that the ordering is important: URL A was clicked just before B, so when I am on A we should recommend B (but not necessarily the reverse). To capture this I think you want to try an item-based recommender with an item-item similarity that captures this relation. It won't be symmetric, which messes up some other things, so it may need more tweaking of existing code to get right.

But then again, is this a CF problem? It sounds like Markov chains: given the last 1 or 2 or 3 URLs visited, which URL has come next most often? I think that's relatively easy and fast. Does that work?

As for data, I would indeed consider throwing out data you believe is just noise.

Sean

On 16 Jan 2009, 12:25 PM, "Goel, Ankur" <[email protected]> wrote:

Ted / Karl,

Thank you both for your comments and suggestions. Continuing on the comments from Ted...

The end goal is definitely not clustering but rather recommendations. This can be broken down into 2 separate tasks typical of a recommendation engine.

1. Given a URL, show other URLs people have liked.
2. Given a user session and the URL the user is seeing, suggest other URLs he might like.

I experimented a bit with clustering but couldn't get good recommendations.

From your advice, log-likelihood ratio sounds like a potential solution for the first one. I remember having a discussion with you and Sean a long time back where you pointed to a useful paper:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.5962

Please pardon me if I am asking the question again, but do you think it's a promising approach for problem 1? Do we have an implementation for this in Mahout?
If not, I can open it and work on it (provided my other work commitments allow enough time). Also, since I have no formal statistics background, I am working on 'rebuilding' my statistics knowledge so that I can grasp these concepts better.

As for the data rates, I really don't know what counts as low or high in the context of these techniques, but what I have learnt after accumulating weeks of data is that there are a few users who have good engagement (sufficient clicks) over a period of time, a moderate number of users who have a small number of clicks, and a large number of users who have very few clicks and are just casual surfers.

Also, regarding building a user model as a simple mixture, I am not sure which one you are referring to. Is it the LDA JIRA that Jeff is working on?

Once again, thanks for all the help, much appreciated.

Regards
-Ankur

-----Original Message-----
From: Ted Dunning [mailto:[email protected]]
Sent: Thursday, Januar...
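
As a rough illustration of the Markov-chain idea in the reply at the top of this message (given the last 1 to 3 URLs visited, which URL most often came next), here is a minimal Python sketch. It is not Mahout code. The function names `build_transitions` and `recommend_next`, the session format (one list of URLs per session, in click order), the sample data, and the back-off from longer to shorter contexts are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_transitions(sessions, max_order=3):
    """Count, for each context of 1..max_order consecutive URLs, which URL followed it."""
    transitions = defaultdict(Counter)
    for session in sessions:
        for i in range(1, len(session)):
            next_url = session[i]
            for order in range(1, max_order + 1):
                if i - order < 0:
                    break  # not enough history in this session for a longer context
                context = tuple(session[i - order:i])
                transitions[context][next_url] += 1
    return transitions


def recommend_next(transitions, recent_urls, max_order=3, top_n=3):
    """Return up to top_n URLs that most often followed the most recent URLs.

    Tries the longest available context first and backs off to shorter ones
    (an assumed strategy, not something prescribed in the thread).
    """
    for order in range(min(max_order, len(recent_urls)), 0, -1):
        context = tuple(recent_urls[-order:])
        counts = transitions.get(context)
        if counts:
            return [url for url, _ in counts.most_common(top_n)]
    return []


# Hypothetical click sessions, each listed in click order.
sessions = [
    ["/a", "/b", "/c"],
    ["/a", "/b", "/d"],
    ["/x", "/b", "/c"],
    ["/a", "/b", "/c"],
]
model = build_transitions(sessions)
print(recommend_next(model, ["/a", "/b"]))  # URLs seen after /a then /b
print(recommend_next(model, ["/b"]))        # URLs seen after /b
print(recommend_next(model, ["/c"]))        # nothing was ever clicked after /c
```

Because the counts are keyed by what came before, the relation is directional: being on `/b` suggests `/c`, but being on `/c` does not suggest `/b`. That is the asymmetry a symmetric item-item similarity such as Tanimoto would not capture.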
