Hi,

There seem to be two small bugs in lib-http's RobotRulesParser.

The first is about reading Crawl-delay. The code doesn't check addRules,
so the nutch bot will pick up the crawl-delay value from another robot's
block in robots.txt. Let me try to be clearer:

User-agent: foobot
Crawl-delay: 3600

User-agent: *
Disallow:


With a robots.txt file like this, the nutch bot will get 3600 as its
crawl-delay value, no matter what the nutch bot's name actually is.
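
For reference, the relevant branch currently looks like this (trimmed
from the same code the patch below touches): the parsed value is stored
into currentRules whether or not addRules is set for the current
User-agent block.

} else if ( (line.length() >= 12)
            && (line.substring(0, 12).equalsIgnoreCase("Crawl-Delay:"))) {
  doneAgents = true;
  long crawlDelay = -1;
  String delay = line.substring("Crawl-Delay:".length(), line.length()).trim();
  if (delay.length() > 0) {
    try {
      crawlDelay = Long.parseLong(delay) * 1000; // sec to millisec
    } catch (Exception e) {
      LOG.info("can not parse Crawl-Delay:" + e.toString());
    }
    // applied unconditionally, even when this Crawl-delay line
    // belongs to some other robot's User-agent block
    currentRules.setCrawlDelay(crawlDelay);
  }
}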

The second is about the main method. RobotRulesParser.main advertises its
usage as "<robots-file> <url-file> <agent-name>+", but if you give it more
than one agent name it refuses them.
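
For example, an invocation like this one (file names here are just
placeholders) matches the advertised usage but trips the
argv.length != 3 check:

java org.apache.nutch.protocol.http.api.RobotRulesParser robots.txt urls.txt Nutch SomeOtherBot

There is also a related off-by-one right after the check: robotNames is
allocated with argv.length - 1 slots while the copy loop only fills
argv.length - 2 of them, leaving a trailing null entry.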

Trivial patch attached.

--
Doğacan Güney
Index: src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/RobotRulesParser.java
===================================================================
--- src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/RobotRulesParser.java	(revision 507852)
+++ src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/RobotRulesParser.java	(working copy)
@@ -389,15 +389,17 @@
       } else if ( (line.length() >= 12)
                   && (line.substring(0, 12).equalsIgnoreCase("Crawl-Delay:"))) {
         doneAgents = true;
-        long crawlDelay = -1;
-        String delay = line.substring("Crawl-Delay:".length(), line.length()).trim();
-        if (delay.length() > 0) {
-          try {
-            crawlDelay = Long.parseLong(delay) * 1000; // sec to millisec
-          } catch (Exception e) {
-            LOG.info("can not parse Crawl-Delay:" + e.toString());
+        if (addRules) {
+          long crawlDelay = -1;
+          String delay = line.substring("Crawl-Delay:".length(), line.length()).trim();
+          if (delay.length() > 0) {
+            try {
+              crawlDelay = Long.parseLong(delay) * 1000; // sec to millisec
+            } catch (Exception e) {
+              LOG.info("can not parse Crawl-Delay:" + e.toString());
+            }
+            currentRules.setCrawlDelay(crawlDelay);
           }
-          currentRules.setCrawlDelay(crawlDelay);
         }
       }
     }
@@ -500,7 +502,7 @@
 
   /** command-line main for testing */
   public static void main(String[] argv) {
-    if (argv.length != 3) {
+    if (argv.length < 3) {
       System.out.println("Usage:");
       System.out.println("   java <robots-file> <url-file> <agent-name>+");
       System.out.println("");
@@ -513,7 +515,7 @@
     try { 
       FileInputStream robotsIn= new FileInputStream(argv[0]);
       LineNumberReader testsIn= new LineNumberReader(new FileReader(argv[1]));
-      String[] robotNames= new String[argv.length - 1];
+      String[] robotNames= new String[argv.length - 2];
 
       for (int i= 0; i < argv.length - 2; i++) 
         robotNames[i]= argv[i+2];