Over the weekend the fetcher crashed and kept crashing. The culprit was a site pointing to malformed links -- http://:80/, http://:0/, and the like.
These links were getting through the filter, so we changed the URL filter to accept only valid URLs. Since someone else may hit the same issue, here is the rule; it should go toward the end of your regex-urlfilter.txt. It would be nice if one of the committers could add this to the default file, commented out.

# accept http only - valid URLs only
+^http://[a-zA-Z0-9\-]+\.[a-zA-Z0-9\-\.]+[\:0-9]*

NOTE: This is only suitable for Web crawling. If you need intranet crawling, do not use it, as it will not let any URL through without at least one period in the host name.

CC-
--------------------------------------------
Filangy, Inc.
www.filangy.com

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
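As a quick sanity check of the pattern above, here is a small sketch using Python's re module (the leading '+' is Nutch's accept marker, not part of the regular expression itself; the example URLs are illustrative):

```python
import re

# The accept pattern from regex-urlfilter.txt, without the leading '+'.
URL_FILTER = re.compile(r'^http://[a-zA-Z0-9\-]+\.[a-zA-Z0-9\-\.]+[\:0-9]*')

def accepts(url):
    """Return True if the filter would let this URL through."""
    return URL_FILTER.match(url) is not None

# Malformed links like the ones that crashed the fetcher are rejected:
print(accepts('http://:80/'))            # False
print(accepts('http://:0/'))             # False
# Ordinary web URLs pass:
print(accepts('http://www.nutch.org/'))  # True
# A host name with no period (typical on intranets) is rejected,
# which is why the rule is unsuitable for intranet crawls:
print(accepts('http://intranet/'))       # False
```

Note that the host part requires at least one character before the first period, which is what screens out the empty-host links, and the mandatory period is what makes the rule Web-only.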
