Your web page may be disallowed in the robots.txt of the Yahoo website, or marked as noindex, nofollow.
Also, review your regular expression! The character '.' matches any character, so you have to put a '\' in front of each literal '.', like this: \.

    +^http://answers.yahoo.com/dir/index;_ylt=*

should be written like this:

    +^http://answers\.yahoo\.com/dir/index;_ylt=*

> Date: Wed, 4 Nov 2009 07:06:54 -0800
> From: saravanan-2.krishnamoorth...@cognizant.com
> To: nutch-user@lucene.apache.org
> Subject: How to fetch URLs with special charaters '?' & '='
>
> I am trying to crawl the URL:
> http://answers.yahoo.com/dir/index;_ylt=AmQOyqS3boseCSYsZxA495Xpy6IX;_ylv=3?link=list&sid=396545327
> with special characters '?' and '='. This URL belongs to the Dining-out category
> of answers.yahoo.com, and I want to crawl the URLs that fall under this
> subcategory. But it seems to get skipped. I have attached my urllist.txt,
> regex-urlfilter.txt and crawl-urlfilter.txt with this. Has anyone done a
> similar kind of crawling before?
> http://old.nabble.com/file/p26197881/regex-urlfilter.txt regex-urlfilter.txt
> http://old.nabble.com/file/p26197881/crawl-urlfilter.txt crawl-urlfilter.txt
> http://old.nabble.com/file/p26197881/urllist.txt urllist.txt
> --
> View this message in context:
> http://old.nabble.com/How-to-fetch-URLs-with-special-charaters-%27-%27---%27%3D%27-tp26197881p26197881.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
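
To see what escaping the '.' changes, here is a minimal Python sketch. It is not Nutch itself: it assumes the filter applies each pattern as a find-style partial match, it drops the leading '+' that marks an accept rule, and the look-alike host is invented purely for illustration.

```python
import re

# Patterns from this thread, without the leading '+' that marks an
# accept rule in regex-urlfilter.txt.
UNESCAPED = r"^http://answers.yahoo.com/dir/index;_ylt=*"
ESCAPED = r"^http://answers\.yahoo\.com/dir/index;_ylt=*"

# The URL from the original question.
TARGET = (
    "http://answers.yahoo.com/dir/index;"
    "_ylt=AmQOyqS3boseCSYsZxA495Xpy6IX;_ylv=3?link=list&sid=396545327"
)

# Hypothetical look-alike host, invented for illustration only.
LOOKALIKE = "http://answersXyahooXcom/dir/index;_ylt=abc"


def accepts(pattern, url):
    """Return True if the pattern is found in the URL (partial match)."""
    return re.search(pattern, url) is not None


if __name__ == "__main__":
    for name, pattern in (("unescaped", UNESCAPED), ("escaped", ESCAPED)):
        for url in (TARGET, LOOKALIKE):
            print(name, accepts(pattern, url), url)
```

In this sketch both patterns accept the URL from the question, and only the unescaped one also accepts the look-alike host. So escaping makes the rule stricter, but by itself it may not explain why the URL is skipped. Note also that the trailing `=*` means "zero or more '=' characters" in a regular expression, not "anything after '='".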