Hi nutch-dev,

I know that we already have RegexUrlNormalizer for removing session ids from URLs, but lately I've been wondering whether there is a more general way to solve this, one that doesn't rely on pre-built patterns.

I think I have an answer that will work. I haven't seen this approach published anywhere, so any failings are entirely my fault. ;) What I'm wondering is:
- Does this seem like a good (effective, efficient) algorithm for catching session-id URLs?
- If so, where is the best place to implement it within Nutch?

Basic idea: session ids within URLs only cause problems for crawlers when they change. This typically occurs when a server-side session expires and a new id is issued. So, rather than looking for URL argument patterns (as RegexUrlNormalizer does), look for a value-transition pattern.

Algorithm:

1) Iterate over each page in a fetched segment

2) For each successful fetch, extract:
 - The fetched URL. Call this (u0)
 - All links on the page that refer to the same site/domain. Call this set (u1..N)

3) Parse u0 into parameters (p0) as follows:
 - named parameters: add (key,value) to Map
 - positional (path) params: add (position,value) to Map

So for the URL "http://foo.bar/spam/eggs?x=true&y=2", pseudocode would look like:
 Map<Object, String> p0 = new HashMap<>();
 p0.put(1, "spam");   // positional (path) params, keyed by position
 p0.put(2, "eggs");
 p0.put("x", "true"); // named params, keyed by name
 p0.put("y", "2");
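To make step 3 concrete, here is a minimal sketch of the parsing in plain Java. The class name UrlParams and the method parse are hypothetical names chosen for illustration, not existing Nutch code. It assumes path segments are split on "/" and query pairs on "&", and it makes no attempt to handle ";name=value" path parameters or URL decoding.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Nutch.
public class UrlParams {

    /** Splits a URL into positional (path) and named (query) parameters. */
    public static Map<Object, String> parse(String url) {
        Map<Object, String> params = new LinkedHashMap<>();
        URI uri = URI.create(url);

        // Positional params: non-empty path segments keyed by 1-based position.
        String path = uri.getRawPath();
        if (path != null) {
            int position = 1;
            for (String segment : path.split("/")) {
                if (!segment.isEmpty()) {
                    params.put(position++, segment);
                }
            }
        }

        // Named params: key=value pairs from the query string.
        String query = uri.getRawQuery();
        if (query != null) {
            for (String pair : query.split("&")) {
                if (pair.isEmpty()) {
                    continue;
                }
                int eq = pair.indexOf('=');
                String key = eq < 0 ? pair : pair.substring(0, eq);
                String value = eq < 0 ? "" : pair.substring(eq + 1);
                params.put(key, value);
            }
        }
        return params;
    }
}
```

Integer keys and String keys never collide in the map, so a query argument literally named "1" stays distinct from the first path segment.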

4) Parse u1..N into (p1..N) using the same method

5) Compare p0 with p1..N. Look for the following pattern:
 - keys that are present for all p0..N, and
 - values that are identical for all p1..N, and
 - the value in p0 is _different_

If you see this condition, flag the page as "contains session id that just changed" and deal with it accordingly (delete it from the crawldb, etc.).
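The comparison in step 5 could be sketched roughly as follows. SessionIdDetector and changedKeys are hypothetical names, and the maps are assumed to have been built already by the parsing in steps 3 and 4. This is only an illustration of the three conditions as stated, with no extra filtering for false positives.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch for illustration; not part of Nutch.
public class SessionIdDetector {

    /**
     * Returns the keys that are present in p0 and in every outlink map,
     * carry one identical value across all outlink maps, and carry a
     * different value in p0.
     */
    public static Set<Object> changedKeys(Map<Object, String> p0,
                                          List<Map<Object, String>> links) {
        Set<Object> flagged = new HashSet<>();
        if (links.isEmpty()) {
            return flagged; // nothing to compare against
        }
        for (Map.Entry<Object, String> entry : p0.entrySet()) {
            Object key = entry.getKey();
            String shared = links.get(0).get(key);
            if (shared == null) {
                continue; // key missing from the first outlink
            }
            boolean identicalInAllLinks = true;
            for (Map<Object, String> link : links) {
                if (!shared.equals(link.get(key))) {
                    identicalInAllLinks = false; // missing or different value
                    break;
                }
            }
            if (identicalInAllLinks && !shared.equals(entry.getValue())) {
                flagged.add(key);
            }
        }
        return flagged;
    }
}
```

A non-empty result would mean the page matches the pattern. With very few outlinks the conditions are easy to satisfy by accident, so a real implementation would probably want a minimum link count; that threshold is my assumption, not part of the algorithm above.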

So... for anyone who's still reading ;), does this seem like it would work for catching session ids? What corner cases would trip it up? Can you think of cases where it would fall flat? And if it still seems worthwhile, where's the best place within Nutch to put it? (Perhaps a new ExtensionPoint that is used by "nutch updatedb"?)

--Matt

--
Matt Kangas / [EMAIL PROTECTED]



