I am trying to crawl a wikipedia site. I want to skip any URL which contains the term Special:
E.g.:

https://wiki.mydomain.com/index.php/Special:Whatlinkshere/Main_Page
https://wiki.mydomain.com/index.php/Special:Recentchangeslinked/Main_Page
https://wiki.mydomain.com/index.php/Special:Watchlist
https://wiki.mydomain.com/index.php/Special:Contributions/SName
https://wiki.mydomain.com/index.php/Special:Recentchanges

This is my crawl-urlfilter.txt:

-^http://wiki.mydomain.com/index.php/Special:
-^http://wiki.mydomain.com/index.php/Special:*
-^http://wiki.mydomain.com/index.php/Special:*/
-^http://wiki.mydomain.com/index.php/Special:*/*
-^https://wiki.mydomain.com/index.php/Special:Upload
+^https://wiki.mydomain.com/index.php
-.

But I still see the Special: URLs in the fetcher logs:

2007-03-22 12:52:15,387 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php
2007-03-22 12:52:32,128 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Telecom
2007-03-22 12:52:32,159 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Contributions/SName
2007-03-22 12:52:32,159 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Watchlist
2007-03-22 12:52:32,179 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Preferences
2007-03-22 12:52:32,198 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Recentchanges
2007-03-22 12:52:32,322 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Talk:Main_Page
2007-03-22 12:52:32,323 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Whatlinkshere/Main_Page
2007-03-22 12:52:32,326 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/BCP
2007-03-22 12:52:32,339 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Recentchangeslinked/Main_Page
2007-03-22 12:52:32,343 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Network_Engineering

Not sure what's wrong in my regular expression. Any help please.
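For anyone wanting to reproduce this outside of Nutch, here is a rough sketch in Python that replays the rules against the URLs from the log. It assumes the filter checks the rules from top to bottom, that the first rule whose regex is found anywhere in the URL decides the outcome, and that a URL matching no rule is rejected. The names RULES, CANDIDATE_RULES and accepts are made up for this illustration and are not part of Nutch. Under those assumptions it shows that the first four rules start with http:// while every fetched URL starts with https://, so only the final + rule ever matches the Special: pages. The candidate rule set is one possible fix, not a tested configuration.

```python
import re

# Rules copied from the crawl-urlfilter.txt in this post, in file order.
RULES = [
    ("-", r"^http://wiki.mydomain.com/index.php/Special:"),
    ("-", r"^http://wiki.mydomain.com/index.php/Special:*"),
    ("-", r"^http://wiki.mydomain.com/index.php/Special:*/"),
    ("-", r"^http://wiki.mydomain.com/index.php/Special:*/*"),
    ("-", r"^https://wiki.mydomain.com/index.php/Special:Upload"),
    ("+", r"^https://wiki.mydomain.com/index.php"),
    ("-", r"."),
]

# Hypothetical alternative: one exclusion rule that covers both schemes,
# with the dots escaped so they match literal dots only.
CANDIDATE_RULES = [
    ("-", r"^https?://wiki\.mydomain\.com/index\.php/Special:"),
    ("+", r"^https://wiki\.mydomain\.com/index\.php"),
    ("-", r"."),
]


def accepts(url, rules=RULES):
    """Return True if the first rule that matches the URL is a '+' rule.

    Assumption: first match wins, and a URL matching no rule is rejected.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    urls = [
        "https://wiki.mydomain.com/index.php/Special:Watchlist",
        "https://wiki.mydomain.com/index.php/Special:Whatlinkshere/Main_Page",
        "https://wiki.mydomain.com/index.php/Telecom",
    ]
    for url in urls:
        print(url, accepts(url, RULES), accepts(url, CANDIDATE_RULES))
```

If the first-match assumption holds, the order matters as much as the regex: the exclusion for Special: has to come before the + rule for index.php, otherwise the + rule accepts the URL first.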
-- View this message in context: http://www.nabble.com/Need-Help-with-crawl-urlfilter.txt-tf3450339.html#a9623983 Sent from the Nutch - User mailing list archive at Nabble.com.
