Your page may be excluded by the Yahoo site's robots.txt (a Disallow rule), or
marked noindex, nofollow in its robots meta tag.

Also review your regular expression: the character '.' matches any character,
so to match a literal dot you have to put a '\' before the '.', like this: \.

+^http://answers.yahoo.com/dir/index;_ylt=*

should be written like this:

+^http://answers\.yahoo\.com/dir/index;_ylt=*
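
To illustrate the difference between the two rules, here is a minimal sketch in Python. It is only an approximation: Nutch evaluates its URL filters with Java regular expressions, but the meaning of '.' and '\.' is the same in both. The look-alike host and the helper name accepts() are made up for the example, and the sketch assumes the pattern only has to match the beginning of the URL.

```python
import re

# The URL from the original question.
url = ("http://answers.yahoo.com/dir/index;_ylt=AmQOyqS3boseCSYsZxA495Xpy6IX"
       ";_ylv=3?link=list&sid=396545327")

# Hypothetical look-alike host, invented only to show what an unescaped '.' lets through.
lookalike = "http://answersXyahoo.com/dir/index;_ylt=abc"

# Rule bodies without the leading '+' that marks an "include" rule in the filter file.
unescaped = re.compile(r"^http://answers.yahoo.com/dir/index;_ylt=*")
escaped = re.compile(r"^http://answers\.yahoo\.com/dir/index;_ylt=*")


def accepts(pattern, candidate):
    """Return True if the pattern matches at the start of the candidate URL."""
    return pattern.search(candidate) is not None


print(accepts(unescaped, url), accepts(escaped, url))
print(accepts(unescaped, lookalike), accepts(escaped, lookalike))
```

Both patterns accept the intended URL; only the escaped one rejects the look-alike host, because '\.' matches a literal dot and nothing else.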






> Date: Wed, 4 Nov 2009 07:06:54 -0800
> From: saravanan-2.krishnamoorth...@cognizant.com
> To: nutch-user@lucene.apache.org
> Subject: How to fetch URLs with special charaters '?' & '='
> 
> 
> I am trying to crawl the URL:
> http://answers.yahoo.com/dir/index;_ylt=AmQOyqS3boseCSYsZxA495Xpy6IX;_ylv=3?link=list&sid=396545327
> with special characters '?' and '='. This URL belongs to Dining-out category
> of answers.yahoo.com. And I want to crawl the URLs that fall under this sub
> category. But it seemed to get skipped. I have attached my urllist.txt,
> regex-urlfilter.txt and crawl-urlfilter.txt with this. Has anyone done
> similar kind of crawling before? 
> http://old.nabble.com/file/p26197881/regex-urlfilter.txt regex-urlfilter.txt 
> http://old.nabble.com/file/p26197881/crawl-urlfilter.txt crawl-urlfilter.txt 
> http://old.nabble.com/file/p26197881/urllist.txt urllist.txt 
> -- 
> View this message in context: 
> http://old.nabble.com/How-to-fetch-URLs-with-special-charaters-%27-%27---%27%3D%27-tp26197881p26197881.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
> 
                                          
