Over the weekend the fetcher crashed and kept crashing. The culprit was a site pointing to malformed links -- http://:80/, http://:0/, and the like.
These links were getting through the filter, so we changed the URL filter to accept only valid URLs. Since someone else may hit the same issue, here is the rule; it should go toward the end of your regex-urlfilter.txt. It would be nice if one of the committers could add this to the default file, commented out.

# accept http only - valid URLs only
+^http://[a-zA-Z0-9\-]+\.[a-zA-Z0-9\-\.]+[\:0-9]*

NOTE: This is only suitable for Web crawling. If you need intranet crawling, do not use it, as it will not let any URL through without at least one period in the host name.

CC-
--------------------------------------------
Filangy, Inc.
www.filangy.com

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
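As a quick sanity check of the pattern above, here is a small sketch using Python's re module (the leading '+' is Nutch's accept marker, not part of the regular expression itself; the example URLs are illustrative):

```python
import re

# The accept pattern from regex-urlfilter.txt, without the leading '+'.
URL_FILTER = re.compile(r'^http://[a-zA-Z0-9\-]+\.[a-zA-Z0-9\-\.]+[\:0-9]*')

def accepts(url):
    """Return True if the filter would let this URL through."""
    return URL_FILTER.match(url) is not None

# Malformed links like the ones that crashed the fetcher are rejected:
print(accepts('http://:80/'))            # False
print(accepts('http://:0/'))             # False
# Ordinary web URLs pass:
print(accepts('http://www.nutch.org/'))  # True
# A host name with no period (typical on intranets) is rejected,
# which is why the rule is unsuitable for intranet crawls:
print(accepts('http://intranet/'))       # False
```

Note that the host part requires at least one character before the first period, which is what screens out the empty-host links, and the mandatory period is what makes the rule Web-only.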
