Take a look at conf/regex-normalize.xml. I don't know exactly how it works, but it seems to do just what you need: removing session data from GET URLs. By default it is configured to remove PHPSESSID variables, but you should easily be able to figure out how to customize it for your needs.
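
To illustrate the idea (this is not the actual Nutch code or its config syntax, just a minimal Python sketch of regex-based URL normalization), the snippet below strips a PHPSESSID parameter from a query string so that URLs differing only in their session id collapse to the same string. The function name strip_session_id and the example.com URLs are made up for the example.

```python
import re

# Assumption: the session id travels in a query parameter literally named
# PHPSESSID (the name mentioned above). Matches the parameter and, if present,
# the "&" that follows it, but only directly after "?" or "&".
SESSION_PARAM = re.compile(r"(?<=[?&])PHPSESSID=[^&#]*&?")

# A "?" or "&" left dangling at the end of the query (before "#" or end of URL).
DANGLING_SEPARATOR = re.compile(r"[?&](?=#|$)")


def strip_session_id(url: str) -> str:
    """Return the URL with any PHPSESSID query parameter removed."""
    cleaned = SESSION_PARAM.sub("", url)
    return DANGLING_SEPARATOR.sub("", cleaned)


if __name__ == "__main__":
    # Hypothetical URLs that differ only in their session id.
    crawled = [
        "http://example.com/page.php?PHPSESSID=aaa111&id=7",
        "http://example.com/page.php?id=7&PHPSESSID=bbb222",
        "http://example.com/page.php?id=7",
    ]
    # After normalization all three collapse to a single distinct URL.
    print({strip_session_id(u) for u in crawled})
```

For a different session variable, only the parameter name in the pattern would need to change.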

- Juho Mäkinen, http://www.juhonkoti.net

On 6/27/05, Hans Benedict <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I am crawling some sites that use session ids. As the crawler does not
> use cookies, they are put in the url's querystring. This results in
> thousands of pages that are - based on the visible content - duplicates,
> but are not detected as such, because the urls contained in the html are
> different.
>
> Has anybody found a solution to this problem? Is there a way to activate
> cookies for the crawler?
>
> --
> Kind regards,
>
> Hans Benedict
>
> _________________________________________________________________
> Chemie.DE Information Service GmbH    Hans Benedict
> Seydelstraße 28                       mailto: [EMAIL PROTECTED]
> 10117 Berlin, Germany                 Tel +49 30 204568-40
>                                       Fax +49 30 204568-70
>
> www.Chemie.DE | www.ChemieKarriere.NET
> www.Bionity.COM | www.BioKarriere.NET

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
