[ https://issues.apache.org/jira/browse/NUTCH-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tejas Patil resolved NUTCH-1418.
--------------------------------
    Resolution: Fixed
 Fix Version/s: 2.2

After the robots handling was delegated to crawler-commons (NUTCH-1031), this issue is no longer reproducible. The URL in question gets crawled:

{noformat}
http://en.wikipedia.org/wiki/Districts_of_India
Version: 7
Status: 2 (db_fetched)
Fetch time: Tue Jun 11 04:47:14 PDT 2013
Modified time: Wed Dec 31 16:00:00 PST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.4599998
Signature: b0ec6daf534d9d28f3b49ad7915af89c
Metadata: Content-Type: text/html_pst_: success(1), lastModified=0
{noformat}

> error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
> ------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1418
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1418
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.4
>            Reporter: Arijit Mukherjee
>             Fix For: 1.7, 2.2
>
>
> After learning that Nutch is unable to crawl JavaScript function calls in href attributes, I started looking for alternatives and decided to crawl http://en.wikipedia.org/wiki/Districts_of_India.
> I first tried injecting this URL and following the step-by-step approach up to the fetcher, when I realized that Nutch did not fetch anything from this website. Looking into logs/hadoop.log, I found the following three lines, which I believe indicate that Nutch is unable to parse the site's robots.txt and that the fetcher therefore stopped:
>
> 2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
> 2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia_talk%3Mediation_Committee/
> 2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Cabal/Cases/
>
> I tried checking the URL using parsechecker and found no issues there. I think this means the robots.txt for this website is malformed, which is preventing the fetcher from fetching anything. Is there a way to get around this problem, given that parsechecker goes on its merry way parsing?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
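The root cause of the warnings above is that `%3M` is not a valid percent-escape (`M` is not a hex digit), so a strict URL decoder rejects the whole path. Below is a minimal sketch of the contrast: Java's strict `URLDecoder` throws on the malformed escape, while a lenient decoder keeps invalid escapes verbatim so the robots rule can still be applied. This is an illustration of the tolerant behaviour a robots parser such as crawler-commons can adopt, not Nutch's or crawler-commons' actual code; the class and method names are made up for the example (assumes Java 10+ for the `Charset` overload of `URLDecoder.decode`).

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class LenientDecode {

    // Lenient percent-decoding: a valid escape like "%41" is decoded,
    // while a malformed one like "%3M" (or a bare '%') is copied through
    // unchanged instead of aborting the rule. Byte-to-char decoding here
    // is ASCII-only, which is enough for this sketch.
    static String lenientDecode(String path) {
        StringBuilder out = new StringBuilder(path.length());
        for (int i = 0; i < path.length(); i++) {
            char c = path.charAt(i);
            if (c == '%' && i + 2 < path.length()
                    && isHex(path.charAt(i + 1)) && isHex(path.charAt(i + 2))) {
                int hi = Character.digit(path.charAt(i + 1), 16);
                int lo = Character.digit(path.charAt(i + 2), 16);
                out.append((char) ((hi << 4) | lo));
                i += 2; // skip the two hex digits just consumed
            } else {
                out.append(c); // keep invalid escape characters as-is
            }
        }
        return out.toString();
    }

    static boolean isHex(char c) {
        return Character.digit(c, 16) >= 0;
    }

    public static void main(String[] args) {
        String bad = "/wiki/Wikipedia%3Mediation_Committee/";

        // Strict decoding rejects the malformed escape entirely:
        try {
            URLDecoder.decode(bad, StandardCharsets.UTF_8);
        } catch (IllegalArgumentException e) {
            System.out.println("strict decoder failed: " + e.getMessage());
        }

        // Lenient decoding keeps the path usable for rule matching:
        System.out.println("lenient: " + lenientDecode(bad));
    }
}
```

With this approach, one malformed `Disallow` path no longer poisons the parse of the entire robots.txt, which matches the resolution above: after delegating to crawler-commons, the unrelated URL fetches fine.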