Not all URLs represent unique items or entities of interest. For example, many URLs are just site-specific search or listing pages, or pages that carry a lot of navigational information but do not actually represent an entity or item of interest.
Given such a page, we do not want to recommend links to items already on the page, but rather items that sit further ahead (listing page 3 or 4) and that were liked most by the site's users. Likewise, for a URL that does represent a unique entity (for example, a book on Amazon), we do not want to recommend other search, listing, or navigational pages, but pages with actual items that people have liked with respect to the current page.

The intent is to gauge the relative popularity of items, or to model their co-occurrence with respect to each other, while also removing anomalies. Let's say A = book1, B = book2, C = listing-page, D = book3. If we see patterns like A-C-B, B-C-D-A, and A-C-D-B, then A and B can each be recommended for the other, provided the page does not already carry a link to the other.

Will a Markov chain work? I do not know; I need to read up on Markov chains to find out. As for log-likelihood ratio tests, they sound like a reasonable candidate, but I am a bit worried about scalability.

Ted, what are your thoughts on this?

Thanks
-Ankur

-----Original Message-----
From: Sean Owen [mailto:[email protected]]
Sent: Monday, January 19, 2009 3:18 PM
To: [email protected]
Subject: Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Maybe a dumb question, but why subtract the links? I can only get from A to B via a hyperlink (well, if I navigate directly to B, is the fact that I was on A meaningful?). Normalizing for transitions that correspond to a link seems to do nothing. Maybe I do not understand the problem fully.

An A-C-B transition doesn't suggest that A should be recommended from B, right? But, say, A-B-A would. My point was only that it is not always symmetric, of course, and so applying CF gets a little trickier, since the algorithms would assume symmetry.

Would a short Markov chain work and scale? For 3 elements, it needs storage proportional to the cube of the average number of links per page.
I don't think CF will scale nearly as well here; it does not feel like quite the right tool for the job.

Sean

On 19 Jan 2009, 8:45 AM, "Goel, Ankur" <[email protected]> wrote:

Ted / Sean,

The link structure should definitely be subtracted. Whether that is done from the original dataset or from the recommended item-set is left to the implementation; I think it will be easier to do it from the recommended item-set.

As for not recommending URLs in reverse order (B for A but not A for B, given that B appeared after A), one will have to keep track of the user's current browsing history and remove the URLs the user has already seen. Although if the user does reach B through some other link C, then it does make sense to recommend A.

Given the size of the data-set, and keeping in mind that it could grow in future, what algorithms would you try out?

-----Original Message-----
From: Ted Dunning [mailto:[email protected]]
Sent: Sunday, January 18, 2009 2:06 AM
To: [email protected]
Subject: Re: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

> Predicting next URL is an i...
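
For illustration, the co-occurrence idea from the A-C-B example, combined with the log-likelihood ratio test raised in the thread, could be sketched roughly as below. This is only a sketch under assumptions, not anything taken from Mahout: the function names (`llr`, `cooccurrence_counts`, `recommend`), the `is_item` predicate that separates item pages from listing pages, the `links_on_page` map, and the sample sessions are all hypothetical. It also treats co-occurrence within a session as symmetric, so it does not address the ordering concern Sean raises.

```python
import math
from collections import Counter
from itertools import combinations


def xlogx(x):
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized entropy term used by the G^2 statistic.
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))


def cooccurrence_counts(sessions, is_item):
    """Count sessions per item and per item pair, skipping non-item pages."""
    item_sessions = Counter()
    pair_sessions = Counter()
    for session in sessions:
        # Listing/navigational pages (assumed to fail is_item) are dropped.
        items = sorted({url for url in session if is_item(url)})
        item_sessions.update(items)
        pair_sessions.update(combinations(items, 2))
    return item_sessions, pair_sessions


def recommend(url, sessions, is_item, links_on_page, top_n=5):
    """Rank items co-occurring with url, excluding items it already links to."""
    item_sessions, pair_sessions = cooccurrence_counts(sessions, is_item)
    n = len(sessions)
    scored = []
    for (a, b), k11 in pair_sessions.items():
        if url not in (a, b):
            continue
        other = b if a == url else a
        if other in links_on_page.get(url, set()):
            continue  # "subtract the link structure" from the result set
        k12 = item_sessions[url] - k11    # sessions with url but not other
        k21 = item_sessions[other] - k11  # sessions with other but not url
        k22 = n - k11 - k12 - k21         # sessions with neither
        scored.append((llr(k11, k12, k21, k22), other))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:top_n]]


# Hypothetical sessions: the three patterns from the thread plus three
# made-up ones, so that not every session contains A.
sessions = [
    ["A", "C", "B"],
    ["B", "C", "D", "A"],
    ["A", "C", "D", "B"],
    ["D", "C", "E"],
    ["E", "C", "D"],
    ["E", "C"],
]
print(recommend("A", sessions, lambda u: u != "C", {}))
```

With these made-up sessions, B ranks ahead of D for A, and passing `{"A": {"B"}}` as `links_on_page` removes B from the result. Note that this statistic flags any departure from independence, so a real implementation would likely also check that the pair occurs together more often than expected. Counting pairs per session is quadratic in the number of items per session, which is where the scalability worry would show up.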
