Take a look at conf/regex-normalize.xml

I don't know exactly how it works, but it seems to do just what you need:
it removes session data from GET URLs. By default it is configured to
remove PHPSESSID variables, but you should easily be able to
figure out how to customize it for your needs.
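
To illustrate the general idea (this is not the contents of
regex-normalize.xml, just a rough Python sketch of regex-based URL
normalization), a session parameter can be stripped with a single
substitution. The pattern and the function name strip_session_id are
my own assumptions for illustration; only the PHPSESSID parameter name
comes from the config mentioned above.

```python
import re

# Hypothetical pattern (not copied from Nutch): matches a PHPSESSID query
# parameter together with its leading separator and an optional trailing "&".
_SESSION_PARAM = re.compile(r"([?&])PHPSESSID=[^&#]*(&?)")


def strip_session_id(url: str) -> str:
    """Return url with any PHPSESSID query parameter removed."""

    def _fix(match: re.Match) -> str:
        # If another parameter follows, keep the original separator ("?" or
        # "&") so the rest of the query string stays valid; otherwise drop
        # the separator along with the parameter.
        return match.group(1) if match.group(2) else ""

    return _SESSION_PARAM.sub(_fix, url)
```

With something like this, URLs that differ only in their session id
collapse to the same string, so the crawler can treat them as one page.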

 - Juho Mäkinen, http://www.juhonkoti.net

On 6/27/05, Hans Benedict <[EMAIL PROTECTED]> wrote:
> Hi,
> 
> I am crawling some sites that use session ids. As the crawler does not
> use cookies, the ids are put in the URL's query string. This results in
> thousands of pages that are, based on the visible content, duplicates
> but are not detected as such, because the URLs contained in the HTML are
> different.
> 
> Has anybody found a solution to this problem? Is there a way to activate
> cookies for the crawler?
> 
> --
> Kind regards,
> 
> Hans Benedict
> 
> _________________________________________________________________
> Chemie.DE Information Service GmbH     Hans Benedict
> Seydelstraße 28                        mailto: [EMAIL PROTECTED]
> 10117 Berlin, Germany                  Tel +49 30 204568-40
>                                        Fax +49 30 204568-70
> 
> www.Chemie.DE               |          www.ChemieKarriere.NET
> www.Bionity.COM             |          www.BioKarriere.NET
> 
>


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general