Hi,
There seem to be two small bugs in lib-http's RobotRulesParser.
The first concerns reading Crawl-delay. The code doesn't check addRules
before storing the value, so the nutch bot picks up a Crawl-delay that
robots.txt specifies for a different robot. For example:
User-agent: foobot
Crawl-delay: 3600
User-agent: *
Disallow:
In such a robots.txt file, nutch bot will get 3600 as its crawl-delay
value, no matter what nutch bot's name actually is.
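
To illustrate the intended behaviour, here is a minimal, self-contained
sketch. It is not the actual RobotRulesParser code: the class name
CrawlDelaySketch, the method crawlDelayFor, and the simplified agent
matching (exact name or "*") are assumptions made for illustration. Only
the addRules/doneAgents idea is taken from the patch below.

```java
import java.util.Locale;

// Hypothetical sketch; names and matching rules are simplified assumptions.
class CrawlDelaySketch {

  /** Returns the Crawl-delay (in ms) that applies to agentName, or -1 if none does. */
  static long crawlDelayFor(String robotsTxt, String agentName) {
    long crawlDelay = -1;
    boolean addRules = false;   // true while the current record applies to agentName
    boolean doneAgents = false; // true once the current record's rule lines have started
    String agent = agentName.toLowerCase(Locale.ROOT);

    for (String raw : robotsTxt.split("\\r?\\n")) {
      String line = raw.trim();
      String lower = line.toLowerCase(Locale.ROOT);

      if (lower.startsWith("user-agent:")) {
        if (doneAgents) {       // a User-agent line after rules starts a new record
          addRules = false;
          doneAgents = false;
        }
        String name = lower.substring("user-agent:".length()).trim();
        if (name.equals("*") || name.equals(agent)) {
          addRules = true;
        }
      } else if (lower.startsWith("crawl-delay:")) {
        doneAgents = true;
        if (addRules) {         // the check that the patch adds
          String delay = line.substring("crawl-delay:".length()).trim();
          try {
            crawlDelay = Long.parseLong(delay) * 1000; // sec to millisec
          } catch (NumberFormatException e) {
            // unparsable value: keep the previous crawlDelay
          }
        }
      } else if (lower.startsWith("disallow:") || lower.startsWith("allow:")) {
        doneAgents = true;
      }
    }
    return crawlDelay;
  }

  public static void main(String[] args) {
    String robots = "User-agent: foobot\nCrawl-delay: 3600\nUser-agent: *\nDisallow:\n";
    System.out.println("nutchbot: " + crawlDelayFor(robots, "nutchbot"));
    System.out.println("foobot:   " + crawlDelayFor(robots, "foobot"));
  }
}
```

With the addRules check in place, only foobot is given the 3600 second
delay from the file above; any other agent name is left without a
Crawl-delay.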
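The second concerns the main method. RobotRulesParser.main advertises its
usage as "<robots-file> <url-file> <agent-name>+", but if you give it more
than one agent name it rejects the arguments and prints the usage message.
The robotNames array is also allocated one element too large.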
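
A small sketch of the argument handling that the usage string implies.
The class AgentArgsSketch and the method agentNames are hypothetical
names used only for illustration; the real fix is the patch below.

```java
import java.util.Arrays;

// Hypothetical sketch of "<robots-file> <url-file> <agent-name>+" handling.
class AgentArgsSketch {

  /** Returns the agent names (argv[2..]), or null if fewer than 3 arguments were given. */
  static String[] agentNames(String[] argv) {
    if (argv.length < 3) {
      return null; // caller should print the usage message
    }
    // One slot per agent name: argv.length - 2 elements.
    return Arrays.copyOfRange(argv, 2, argv.length);
  }

  public static void main(String[] args) {
    String[] names = agentNames(new String[] {"robots.txt", "urls.txt", "nutchbot", "foobot"});
    System.out.println(Arrays.toString(names));
  }
}
```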
Trivial patch attached.
--
Doğacan Güney
Index: src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/RobotRulesParser.java
===================================================================
--- src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/RobotRulesParser.java (revision 507852)
+++ src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/RobotRulesParser.java (working copy)
@@ -389,15 +389,17 @@
} else if ( (line.length() >= 12)
&& (line.substring(0, 12).equalsIgnoreCase("Crawl-Delay:"))) {
doneAgents = true;
- long crawlDelay = -1;
- String delay = line.substring("Crawl-Delay:".length(), line.length()).trim();
- if (delay.length() > 0) {
- try {
- crawlDelay = Long.parseLong(delay) * 1000; // sec to millisec
- } catch (Exception e) {
- LOG.info("can not parse Crawl-Delay:" + e.toString());
+ if (addRules) {
+ long crawlDelay = -1;
+ String delay = line.substring("Crawl-Delay:".length(), line.length()).trim();
+ if (delay.length() > 0) {
+ try {
+ crawlDelay = Long.parseLong(delay) * 1000; // sec to millisec
+ } catch (Exception e) {
+ LOG.info("can not parse Crawl-Delay:" + e.toString());
+ }
+ currentRules.setCrawlDelay(crawlDelay);
}
- currentRules.setCrawlDelay(crawlDelay);
}
}
}
@@ -500,7 +502,7 @@
/** command-line main for testing */
public static void main(String[] argv) {
- if (argv.length != 3) {
+ if (argv.length < 3) {
System.out.println("Usage:");
System.out.println(" java <robots-file> <url-file> <agent-name>+");
System.out.println("");
@@ -513,7 +515,7 @@
try {
FileInputStream robotsIn= new FileInputStream(argv[0]);
LineNumberReader testsIn= new LineNumberReader(new FileReader(argv[1]));
- String[] robotNames= new String[argv.length - 1];
+ String[] robotNames= new String[argv.length - 2];
for (int i= 0; i < argv.length - 2; i++)
robotNames[i]= argv[i+2];