Hello everyone: I have a question about an issue I discovered while trying to crawl the macys.com robots.txt. I am using Nutch 1.8, and I have tried both crawler-commons 0.3 and crawler-commons 0.4. This is the robots.txt file from Macy's:
User-agent: *
Crawl-delay: 120
Disallow: /compare
Disallow: /registry/wedding/compare
Disallow: /catalog/product/zoom.jsp
Disallow: /search
Disallow: /shop/search
Disallow: /shop/registry/wedding/search
Disallow: *natuzzi*
noindex: *natuzzi*
Disallow: *Natuzzi*
noindex: *Natuzzi*
Disallow: /bag/add*

When I run this robots.txt through RobotsRulesParser with this URL (http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=), I get the following warnings:

2014-05-30 17:02:20,570 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *natuzzi*
2014-05-30 17:02:20,571 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *Natuzzi*
2014-05-30 17:02:20,574 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *natuzzi*
2014-05-30 17:02:20,574 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *Natuzzi*

Is there anything I can do to resolve this? Is this a problem with Nutch, or does macys.com simply have a malformed robots.txt file?

Nima Falaki
Software Engineer
<http://www.popsugar.com>
[email protected]
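For reference, here is a minimal sketch of feeding this robots.txt to crawler-commons directly, outside of Nutch, to reproduce the warnings in isolation. It assumes crawler-commons 0.4 is on the classpath; the class name `RobotsCheck` and the agent name "mycrawler" are placeholders, and the robots.txt body is a shortened excerpt of the Macy's file above:

```java
import java.nio.charset.StandardCharsets;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsCheck {
    public static void main(String[] args) {
        // Shortened excerpt of the Macy's robots.txt; the nonstandard
        // "noindex:" line is what triggers the "Unknown line" warning.
        String robotsTxt =
            "User-agent: *\n" +
            "Crawl-delay: 120\n" +
            "Disallow: /compare\n" +
            "Disallow: *natuzzi*\n" +
            "noindex: *natuzzi*\n" +
            "Disallow: /bag/add*\n";

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        // parseContent(url, content, contentType, robotNames)
        BaseRobotRules rules = parser.parseContent(
            "http://www1.macys.com/robots.txt",
            robotsTxt.getBytes(StandardCharsets.UTF_8),
            "text/plain",
            "mycrawler");

        // getCrawlDelay() reports the delay in milliseconds.
        System.out.println("crawl delay (ms): " + rules.getCrawlDelay());

        // The product URL does not match any Disallow rule, so it
        // should come back as allowed despite the warnings.
        System.out.println("product page allowed: " + rules.isAllowed(
            "http://www1.macys.com/shop/product/some-shirt?ID=1430219"));
        System.out.println("/compare allowed: " + rules.isAllowed(
            "http://www1.macys.com/compare"));
    }
}
```

The "Unknown line" messages are only warnings: the parser skips the `noindex:` lines and still builds usable rules from the recognized directives.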

