From Wikipedia:
The Robot Exclusion Standard does not mention anything about the "*"
character in the "Disallow:" statement. Some crawlers like Googlebot
recognize strings containing "*", while MSNbot and Teoma interpret it in
different ways.
So the 'problem' is with Macy's. Really, there is no problem for you:
presumably those lines are simply ignored when the robots.txt is parsed.
One might also question the crawl-delay setting of 120 seconds, but
that's another issue...
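To illustrate the "warn and skip" behaviour: here is a minimal, self-contained sketch (not the actual crawler-commons implementation, whose internals I haven't copied here) of a lenient robots.txt line handler. Like SimpleRobotRulesParser, it reports unknown directives such as "noindex:" and keeps going rather than aborting, so the recognized Disallow rules still take effect:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: collect unknown robots.txt directives while
// continuing to parse, mirroring the warnings SimpleRobotRulesParser logs.
public class LenientRobotsParser {

    // Directives this sketch understands; anything else is reported.
    private static final List<String> KNOWN =
        List.of("user-agent", "disallow", "allow", "crawl-delay", "sitemap");

    /** Returns the unknown-directive lines found; they are skipped, not fatal. */
    public static List<String> unknownDirectives(String robotsTxt) {
        List<String> unknown = new ArrayList<>();
        for (String raw : robotsTxt.split("\n")) {
            String line = raw.trim();
            int colon = line.indexOf(':');
            if (line.isEmpty() || line.startsWith("#") || colon < 0) {
                continue; // blank lines, comments, malformed lines
            }
            String directive =
                line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            if (!KNOWN.contains(directive)) {
                unknown.add(line); // warn and ignore; parsing continues
            }
        }
        return unknown;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\n"
            + "Crawl-delay: 120\n"
            + "Disallow: *natuzzi*\n"
            + "noindex: *natuzzi*\n";
        // Only the noindex line is unknown; the rest parse normally.
        System.out.println(unknownDirectives(robots));
    }
}
```

So the WARN lines below are informational: the `noindex:` entries are dropped, and the valid Disallow rules are still applied.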
On 31/05/2014 12:16 AM, Nima Falaki wrote:
Hello Everyone:
I have a question about an issue I discovered while trying to crawl
using the macys.com robots.txt. I am using Nutch 1.8 and tried both
crawler-commons 0.3 and crawler-commons 0.4. This is the robots.txt
file from macys.com:
User-agent: *
Crawl-delay: 120
Disallow: /compare
Disallow: /registry/wedding/compare
Disallow: /catalog/product/zoom.jsp
Disallow: /search
Disallow: /shop/search
Disallow: /shop/registry/wedding/search
Disallow: *natuzzi*
noindex: *natuzzi*
Disallow: *Natuzzi*
noindex: *Natuzzi*
Disallow: /bag/add*
When I run this robots.txt through the RobotsRulesParser with this url
(http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=)
I get the following exceptions
2014-05-30 17:02:20,570 WARN robots.SimpleRobotRulesParser
(SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in
robots.txt file (size 672): noindex: *natuzzi*
2014-05-30 17:02:20,571 WARN robots.SimpleRobotRulesParser
(SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in
robots.txt file (size 672): noindex: *Natuzzi*
2014-05-30 17:02:20,574 WARN robots.SimpleRobotRulesParser
(SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in
robots.txt file (size 672): noindex: *natuzzi*
2014-05-30 17:02:20,574 WARN robots.SimpleRobotRulesParser
(SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in
robots.txt file (size 672): noindex: *Natuzzi*
Is there anything I can do to solve this problem? Is this a problem
with Nutch, or does macys.com have a really bad robots.txt file?
<http://www.popsugar.com>
Nima Falaki
Software Engineer
[email protected]