Not all URLs represent unique items or entities of interest. For example,
many URLs are just site-specific search/listing pages, or pages that
carry a lot of navigational information but do not actually represent an
entity or item of interest.

Given such a page, we do not want to recommend links to items already on
the page, but rather items that sit further on (listing pages 3, 4) and
that were also liked most by the users of the site.

Also, for a URL that does represent a unique entity (for example, a book
on Amazon), we do not want to recommend other search/listing/navigational
pages but pages with actual items that people have liked w.r.t. the
current page.
 
The intent is to gauge the relative popularity or model the
co-occurrence of items with respect to each other and also remove the
anomalies.

Let's say A = book1, C = listing-page, B = book2, D = book3.

So if we have patterns like A-C-B, B-C-D-A, A-C-D-B, then A and B can
each be recommended for the other, provided that the page does not
already carry a link to the other.
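
To make the A/B/C/D example concrete, here is a minimal sketch in Python
of counting item co-occurrence per session while skipping listing pages.
The names LISTING_PAGES and cooccurrence_counts are made up for
illustration, and it assumes we already know which URLs are listing
pages, which is itself an open problem.

```python
from collections import Counter
from itertools import combinations

# Assumption: URLs already classified as listing/search/navigational pages.
LISTING_PAGES = {"C"}


def cooccurrence_counts(sessions, listing_pages=LISTING_PAGES):
    """Count how often two item pages appear in the same session.

    Listing pages are dropped, so A-C-B counts as A and B co-occurring.
    Pairs are stored in sorted order, so the count is symmetric.
    """
    counts = Counter()
    for session in sessions:
        items = sorted({p for p in session if p not in listing_pages})
        for a, b in combinations(items, 2):
            counts[(a, b)] += 1
    return counts


# The three patterns from the example: A-C-B, B-C-D-A, A-C-D-B.
sessions = [["A", "C", "B"], ["B", "C", "D", "A"], ["A", "C", "D", "B"]]
counts = cooccurrence_counts(sessions)
```

With these three sessions the pair (A, B) is counted three times and the
pairs involving D twice each, so A and B come out as the strongest
candidates for each other.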

Will a Markov chain work? I do not know; I need to read up on Markov
chains first and find out.

As for log-likelihood ratio tests, they sound like a reasonable
candidate, but I am a bit worried about scalability.
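
For reference, a rough sketch of the log-likelihood ratio (G^2) score
for a 2x2 table of counts. The function name llr and the meaning given
to the four cells are my own assumptions for illustration, not something
taken from Mahout.

```python
import math


def llr(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table.

    Assumed cell meanings (illustrative only):
      k11 = sessions with both A and B
      k12 = sessions with A but not B
      k21 = sessions with B but not A
      k22 = sessions with neither
    """
    n = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    cells = (
        (k11, row1, col1),
        (k12, row1, col2),
        (k21, row2, col1),
        (k22, row2, col2),
    )
    total = 0.0
    for k, row, col in cells:
        if k > 0:  # a zero cell contributes nothing to the sum
            total += k * math.log(k * n / (row * col))
    return 2.0 * total
```

The score is 0 when A and B occur independently and grows as the
co-occurrence departs from chance. Each pair only needs four counts, so
the scoring itself is cheap; the cost is in collecting the pair counts.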

Ted, what's your thought on this?

Thanks
-Ankur

-----Original Message-----
From: Sean Owen [mailto:[email protected]] 
Sent: Monday, January 19, 2009 3:18 PM
To: [email protected]
Subject: Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Maybe a dumb question, but why subtract the links? I can only get from A
to B via a hyperlink (well, if I navigate directly to B, is the fact that
I was on A meaningful?) Normalizing for transitions that correspond to a
link seems to do nothing. Maybe I do not understand the problem fully.

An A-C-B transition doesn't suggest that A should be recommended from B,
right? But, say, A-B-A would. My point was only that it is not always
symmetric, of course, and so applying CF gets a little trickier since the
algorithms would assume symmetry.

Would a short Markov chain work and scale? For 3 elements, it needs
storage proportional to the cube of the average number of links per
page. I don't think CF will scale nearly as well here; it is not feeling
like quite the right tool for the job.
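
A minimal sketch of such a short chain in Python, using 3-element
windows over each session. The names build_chain and predict_next are
made up for illustration, and the sessions reuse the A/B/C/D example
from this thread.

```python
from collections import Counter, defaultdict


def build_chain(sessions):
    """Map each pair of consecutive pages to a Counter of the next page."""
    chain = defaultdict(Counter)
    for session in sessions:
        # Slide a 3-page window over the session.
        for a, b, c in zip(session, session[1:], session[2:]):
            chain[(a, b)][c] += 1
    return chain


def predict_next(chain, prev, current):
    """Return the most frequent follower of (prev, current), or None."""
    followers = chain.get((prev, current))
    if not followers:
        return None
    return followers.most_common(1)[0][0]


sessions = [["A", "C", "B"], ["B", "C", "D", "A"], ["A", "C", "D", "B"]]
chain = build_chain(sessions)
```

Here the counts are directional, so no symmetry is assumed. Only triples
that were actually observed take up space in this sketch.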

Sean

On 19 Jan 2009, 8:45 AM, "Goel, Ankur" <[email protected]> wrote:



Ted / Sean,

The link structure should definitely be subtracted. Whether it is
subtracted from the original dataset or from the recommended item-set is
left to the implementation. I think it will be easier to do this from
the recommended item-set.

As for not recommending URLs in reverse order (B for A but not A for B,
given B appeared after A), one will have to keep track of the user's
current browsing history and remove the URLs that the user has already
seen. However, if the user reaches B through some other link C, then it
does make sense to recommend A.
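
A small sketch of doing both subtractions on the recommended item-set,
in Python. The function filter_recommendations and the shape of the
outlinks map are assumptions for illustration only.

```python
def filter_recommendations(candidates, current_page, outlinks, history):
    """Drop candidates that are already linked from the current page
    or that the user has already visited in this browsing session.

    outlinks: dict mapping a page to the set of pages it links to
              (assumed to be available from a crawl).
    history:  pages the user has visited so far, in order.
    """
    linked = outlinks.get(current_page, set())
    seen = set(history)
    return [
        c for c in candidates
        if c != current_page and c not in linked and c not in seen
    ]


outlinks = {"A": {"C"}, "C": {"A", "B", "D"}}
recs = filter_recommendations(
    ["B", "C", "D"], "A", outlinks, history=["D", "C", "A"]
)
```

In this example C is dropped because A links to it and D is dropped
because the user has already seen it, which leaves B. The sketch does
not handle the exception above, where A is still worth recommending if
the user reached B through some other page C.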

Given the size of the data-set, and keeping in mind that it could grow
in future, what algorithms would you try out?

-----Original Message-----
From: Ted Dunning [mailto:[email protected]]
Sent: Sunday, January 18, 2009 2:06 AM
To: [email protected]
Subject: Re: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

> Predicting next URL is an i...
