Thank you, Luke.

Is there a file out there with transformation and exclusion rules that
works with Nutch or Egothor?
I'll be more than happy to take the lead and maintain one for use with Nutch.
We currently have one with 20+ rules, but I imagine that's a small
subset.

Also, is the regular expression engine optimized, or does normalizing a URL
take O(n) time if one has n rules?
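
To make the question concrete, here is a rough sketch of what I mean by O(n).
It is not Nutch's actual RegexUrlNormalizer code; the class name, the Rule
type, and the two example rules are made up for illustration. Each pattern is
compiled once, but every URL still goes through every rule in turn:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch, not Nutch's RegexUrlNormalizer: n precompiled rules
// applied one after another, so each URL costs n regex passes.
public class RuleListSketch {
    // One rule: a pattern compiled once, plus its replacement text.
    static final class Rule {
        final Pattern pattern;
        final String replacement;

        Rule(String regex, String replacement) {
            this.pattern = Pattern.compile(regex);
            this.replacement = replacement;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    public void addRule(String regex, String replacement) {
        rules.add(new Rule(regex, replacement));
    }

    // Applies every rule in order; the work grows linearly with rules.size().
    public String normalize(String url) {
        String result = url;
        for (Rule rule : rules) {
            result = rule.pattern.matcher(result).replaceAll(rule.replacement);
        }
        return result;
    }

    public static void main(String[] args) {
        RuleListSketch sketch = new RuleListSketch();
        // Illustrative rules only: strip a session id, then drop the fragment.
        sketch.addRule("[;&?]jsessionid=[^&#]*", "");
        sketch.addRule("#.*$", "");
        System.out.println(
            sketch.normalize("http://example.com/a;jsessionid=ABC123#top"));
    }
}
```

With a loop like this, 20 rules mean 20 matcher passes per URL, which is the
cost I am asking about.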

Thanks
CC-


On 10/19/2004 05:34 PM, CC Chaman wrote:
> At what point does the URL normalization happen?
> 
> Example: I give the fetcher 100 starting URLs, and it gets the 100
> pages, which have another 1000 URLs in them. Based on rules specified in
> the RegexUrlNormalizer, all those URLs will get modified.
> 
> Does the WebDB hold the normalized URLs or the raw URLs?
> Or does it normalize the URL just before doing a fetch?
> 
> Thanks for taking the time to answer
> 
> CC-

URL normalization occurs when new URLs are injected or found by crawling.
So, prior to being stored in WebDB, the URLs are normalized.
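
As a rough illustration of that ordering, here is a minimal sketch. It is not
Nutch's actual code: the class name, the in-memory set standing in for the
WebDB, and the fragment-stripping normalizer are all assumptions made for the
example. The point is only that the raw URL never reaches the store:

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for the WebDB, only to show the ordering:
// normalize first, store second. Not Nutch's actual classes.
public class NormalizeBeforeStoreSketch {
    private final UnaryOperator<String> normalizer;
    private final Set<String> storedUrls = new LinkedHashSet<>();

    public NormalizeBeforeStoreSketch(UnaryOperator<String> normalizer) {
        this.normalizer = normalizer;
    }

    // Called for injected URLs and for links found while crawling.
    // Returns true if the normalized URL was not already stored.
    public boolean add(String rawUrl) {
        return storedUrls.add(normalizer.apply(rawUrl));
    }

    public Set<String> stored() {
        return storedUrls;
    }

    public static void main(String[] args) {
        // Illustrative normalizer: drop the fragment part of the URL.
        NormalizeBeforeStoreSketch db =
            new NormalizeBeforeStoreSketch(url -> url.replaceAll("#.*$", ""));
        db.add("http://example.com/page#intro");   // injected
        db.add("http://example.com/page#details"); // found by crawling
        System.out.println(db.stored());
    }
}
```

Because both URLs normalize to the same string, only one entry ends up in the
store.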

Luke Baker


-------------------------------------------------------
This SF.net email is sponsored by: IT Product Guide on ITManagersJournal Use
IT products in your business? Tell us what you think of them. Give us Your
Opinions, Get Free ThinkGeek Gift Certificates! Click to find out more
http://productguide.itmanagersjournal.com/guidepromo.tmpl
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers




