Thank you, Luke. Is there a file out there with transformation and exclusion rules that works with Nutch or Egothor? I'd be more than happy to take the lead and maintain one for use with Nutch. We currently have one with 20+ rules, but I imagine that's a small subset.
Also, is the regular expression engine optimized, or does it take O(n) time if one has n rules?

Thankx
CC-

On 10/19/2004 05:34 PM, CC Chaman wrote:
> At what point does the URL normalization happen?
>
> Example: I give the fetcher 100 starting URLs, and it gets the 100
> pages, which have another 1000 URLs in them. Based on rules specified in
> the RegexUrlNormalizer, all those URLs will get modified.
>
> Does the WebDB hold the normalized URLs or the raw URLs?
> Or does it normalize the URL just before doing a fetch?
>
> Thankx for taking the time to answer
>
> CC-

URL normalization occurs when new URLs are injected or found by crawling. So, prior to being stored in WebDB, the URLs are normalized.

Luke Baker

-------------------------------------------------------
This SF.net email is sponsored by: IT Product Guide on ITManagersJournal
Use IT products in your business? Tell us what you think of them. Give us
Your Opinions, Get Free ThinkGeek Gift Certificates! Click to find out more
http://productguide.itmanagersjournal.com/guidepromo.tmpl
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
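
For readers following the thread, here is a rough sketch of the idea being discussed: an ordered list of regex rules applied to each URL before it is stored. It is written in Python for brevity and is not Nutch's RegexUrlNormalizer; the rules, the `sessionid` parameter name, and the function names are all hypothetical. It assumes the simplest possible design, where every rule is tried in turn, so the work per URL grows linearly with the number of rules. The thread does not confirm whether Nutch does this or something smarter.

```python
import re

# Hypothetical rules for illustration only; not taken from any Nutch rule file.
# Each rule is (compiled pattern, replacement) and rules are applied in order.
RULES = [
    # Drop a session-id query parameter (the parameter name is an assumption).
    (re.compile(r"([?&])sessionid=[^&]*&?", re.IGNORECASE), r"\1"),
    # Remove a trailing "?" or "&" that the previous rule may leave behind.
    (re.compile(r"[?&]$"), ""),
    # Collapse a default index page at the end of the path to "/".
    (re.compile(r"/index\.html?$"), "/"),
]


def normalize(url, rules=RULES):
    """Apply every rule in order: one pass over the rule list per URL."""
    for pattern, replacement in rules:
        url = pattern.sub(replacement, url)
    return url


def normalize_all(urls):
    """Normalize before storing, so equivalent URLs collapse to one entry."""
    return {normalize(u) for u in urls}
```

Normalizing at the point where URLs are injected or discovered, as Luke describes, means the store only ever sees one form of each URL, so two raw URLs that differ only by a session id end up as a single entry.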
