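I think it's safe to strip anchors, since they only tell the browser which part of the same page to jump to; the part after the # is not sent to the server, so the same document comes back either way. I do that for Simpy when normalizing URLs, so that I don't end up with duplicates like the ones in your example.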
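
As a rough sketch of the idea (Python, for illustration only; this is not Simpy's or Nutch's actual code, and both the `strip_fragment` name and the `#.*$` pattern are assumptions of mine rather than a rule shipped in regex-normalize.xml), stripping everything from the first # to the end of the URL makes the two example URLs compare equal:

```python
import re

# Assumed pattern: the first '#' and everything after it, to the end of the URL.
FRAGMENT_RE = re.compile(r"#.*$")


def strip_fragment(url: str) -> str:
    """Return the URL with any trailing '#anchor' part removed."""
    return FRAGMENT_RE.sub("", url)


# The two URLs from the original question.
with_anchor = "http://www.dina.kvl.dk/~sestoft/gcsharp/index.html#wordindex"
without_anchor = "http://www.dina.kvl.dk/~sestoft/gcsharp/index.html"

# Both forms normalize to the same string, so they are no longer duplicates.
same = strip_fragment(with_anchor) == strip_fragment(without_anchor)
print(same)
```

A URL with no # is returned unchanged, so the rule can be applied to every URL without checking first.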
Otis

----- Original Message ----
From: Ken Krugler <[EMAIL PROTECTED]>
To: nutch-dev@lucene.apache.org
Sent: Thu 05 Jan 2006 04:40:07 PM EST
Subject: Normalizing URLs with anchors

Hi all,

The default regex-normalize.xml currently strips out PHP session ids. I'm wondering whether it would also make sense to remove anchor text from URLs.

For example, currently these two URLs are treated as different:

http://www.dina.kvl.dk/~sestoft/gcsharp/index.html#wordindex

and

http://www.dina.kvl.dk/~sestoft/gcsharp/index.html

Is it safe to always strip # followed by (valid anchor characters) at the end of a URL?

Thanks,

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200