Yes, I think this should definitely be in the default file.

Also, here are a few rules that I've used and found very useful for keeping junk out of the DB. They may be helpful for someone doing internet crawling. These need not go in the default file, but it would be nice if people contributed regexes/filters and an example file were made.

# ignore local IPs
-^http://(127|10|192\.168|172)\.*
-^http://localhost.*
# Blocked sites - major ad servers
-atwola\.com
-servedby\.advertising\.
-ad\.doubleclick\.

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Friday, April 22, 2005 4:21 PM
To: [email protected]
Subject: Re: [Nutch-dev] [jira] Commented: (NUTCH-7) please update it with the svn

Chirag Chaman wrote:
> I like this solution, simple and elegant

The credit should go to Gordon Mohr, of the Heritrix crawler. He suggested this to me yesterday.

> Just a modification which might make it faster for longer URLs. This
> makes the RE non-greedy, thereby causing it to match without having to
> examine the whole string.
>
> -http://.*(/.+?)/.*?\1/.*?\1.*?/

Should we put something like this in the default url filter config file?

Doug
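
To try rules like the ones in this thread before putting them in a filter file, a small harness can help. The following is a minimal Python sketch, not Nutch's own code. It assumes rules are checked top to bottom, that the first pattern found anywhere in the URL decides, and that '-' rejects while '+' accepts. The closing `+.` rule and the names `parse_rules`, `accepts`, and `STRICT_LOCAL` are assumptions made for this illustration. It also shows that the local-IP rule as posted ends in `\.*`, which makes the dot optional, so it would also reject hosts such as 101.example.org. `STRICT_LOCAL` is one hypothetical tighter variant.

```python
import re

# Rules copied from this thread. The final "+." (accept everything else)
# is an assumption added so the demo has a default.
RULES = r"""
# ignore local IPs
-^http://(127|10|192\.168|172)\.*
-^http://localhost.*
# blocked sites - major ad servers
-atwola\.com
-servedby\.advertising\.
-ad\.doubleclick\.
# repeated path segments (likely crawler loops)
-http://.*(/.+?)/.*?\1/.*?\1.*?/
+.
"""

# Hypothetical stricter local-address rule: requires the dot after each
# prefix and limits 172.x to the private 172.16-172.31 range.
STRICT_LOCAL = re.compile(
    r"^http://(127\.|10\.|192\.168\.|172\.(1[6-9]|2[0-9]|3[01])\.)"
)


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule whose pattern is found in url."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumed default: no rule matched, so drop the URL


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in [
        "http://example.com/a/b/c/",
        "http://192.168.1.1/",
        "http://ad.doubleclick.net/adj/foo",
        "http://example.com/foo/bar/foo/bar/foo/bar/",
        "http://101.example.org/",
    ]:
        print(accepts(rules, url), url)
```

Running candidate rules against a handful of known-good and known-bad URLs like this is a cheap way to catch patterns that match more than intended before they silently drop pages from a crawl.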
