I think it's safe to strip anchors, since they only point to a different part 
of the same page for browser rendering. I do that for Simpy when normalizing 
URLs, to avoid duplicates like these.
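
A minimal sketch of that kind of normalization, assuming Python's standard urllib.parse module. This is not Simpy's or Nutch's actual code, and strip_fragment is a hypothetical helper name used only for illustration:

```python
from urllib.parse import urldefrag


def strip_fragment(url: str) -> str:
    """Return the URL without its '#fragment' part (hypothetical helper)."""
    # urldefrag splits a URL into (url_without_fragment, fragment).
    return urldefrag(url).url


# The two URLs from the original message below.
urls = [
    "http://www.dina.kvl.dk/~sestoft/gcsharp/index.html#wordindex",
    "http://www.dina.kvl.dk/~sestoft/gcsharp/index.html",
]

# After normalization both entries collapse into a single URL.
unique = {strip_fragment(u) for u in urls}
```

With the fragment removed, the set holds one entry, so the page would be treated as a single URL rather than two.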

Otis

----- Original Message ----
From: Ken Krugler <[EMAIL PROTECTED]>
To: nutch-dev@lucene.apache.org
Sent: Thu 05 Jan 2006 04:40:07 PM EST
Subject: Normalizing URLs with anchors

Hi all,

The default regex-normalize.xml currently strips out PHP session ids.

I'm wondering whether it would also make sense to remove anchor text 
from URLs. For example, currently these two URLs are treated as 
different:

http://www.dina.kvl.dk/~sestoft/gcsharp/index.html#wordindex

and

http://www.dina.kvl.dk/~sestoft/gcsharp/index.html

Is it safe to always strip # followed by (valid anchor characters) at 
the end of a URL?

Thanks,

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-470-9200

