I am working on the scenario you just pointed out. By Apache Nutch, do you mean the current codebase with CC, or a version before that?
CC differs from the original Nutch code in that CC takes a somewhat greedy
approach: it tries to get a match/mismatch after every line it sees in the
robots file. Back when I was working on the delegation of robots parsing to
Crawler-Commons (CC), I remember that there was a difference between the
semantics of the original parsing code and CC's implementation for multiple
robots agents. Here was my observation at that time:
https://issues.apache.org/jira/browse/NUTCH-1031?focusedCommentId=13558217&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13558217

~tejas

On Fri, Jan 24, 2014 at 8:11 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:

> Tejas, the problem exists in Apache Nutch as well. We'll take localhost as
> an example, with the following config and robots.txt:
>
> # cat /var/www/robots.txt
> User-agent: *
> Disallow: /
>
> User-agent: nutch
> Allow: /
>
> config:
>
> <property>
>   <name>http.agent.name</name>
>   <value>Mozilla</value>
> </property>
> <property>
>   <name>http.agent.version</name>
>   <value>5.0</value>
> </property>
> <property>
>   <name>http.robots.agents</name>
>   <value>nutch,*</value>
> </property>
> <property>
>   <name>http.agent.description</name>
>   <value>compatible; NutchCrawler</value>
> </property>
> <property>
>   <name>http.agent.url</name>
>   <value>+http://example.org/</value>
> </property>
>
> URL: http://localhost/
> Version: 7
> Status: 3 (db_gone)
> Fetch time: Mon Mar 10 15:36:48 CET 2014
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 3888000 seconds (45 days)
> Score: 0.0
> Signature: null
> Metadata:
>   _pst_=robots_denied(18), lastModified=0
>
> Can you confirm?
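For reference, the reproduction above can also be checked directly against
crawler-commons, without going through Nutch at all. What follows is a
minimal, hypothetical sketch (the class name RobotsOrderCheck and the
hard-coded strings are illustrative, not part of either codebase): it parses
the same robots.txt with the two groups in both orders, using the agent list
"nutch,*" from the http.robots.agents setting above.

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

// Hypothetical stand-alone check, not part of Nutch or its test suite:
// parse the robots.txt from the reproduction in both group orders and
// print the verdict for http://localhost/ when crawling as "nutch,*".
public class RobotsOrderCheck {

  public static void main(String[] args) {
    String wildcardFirst =
        "User-agent: *\n"
        + "Disallow: /\n"
        + "\n"
        + "User-agent: nutch\n"
        + "Allow: /\n";

    String crawlerFirst =
        "User-agent: nutch\n"
        + "Allow: /\n"
        + "\n"
        + "User-agent: *\n"
        + "Disallow: /\n";

    SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
    for (String robotsTxt : new String[] { wildcardFirst, crawlerFirst }) {
      // "nutch,*" mirrors the http.robots.agents value in the config above.
      BaseRobotRules rules = parser.parseContent("http://localhost/robots.txt",
          robotsTxt.getBytes(), "text/plain", "nutch,*");
      System.out.println("allowed(http://localhost/) = "
          + rules.isAllowed("http://localhost/"));
    }
  }
}

If the two orderings print different verdicts for the same agent list, the
order sensitivity sits in the crawler-commons layer rather than in the
lib-http wrapper around it.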
> -----Original message-----
> > From: Markus Jelsma <markus.jel...@openindex.io>
> > Sent: Friday 24th January 2014 15:29
> > To: user@nutch.apache.org
> > Subject: RE: Order of robots file
> >
> > Hi, sorry for being unclear. You understand correctly: I had to change
> > the robots.txt order and put our crawler ABOVE User-Agent: *.
> >
> > I have tried a unit test for lib-http to demonstrate the problem, but it
> > does not fail. I think I did something wrong in merging the code base.
> > I'll look further.
> >
> > markus@midas:~/projects/apache/nutch/trunk$ svn diff src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java
> > Index: src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java
> > ===================================================================
> > --- src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java (revision 1560984)
> > +++ src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java (working copy)
> > @@ -50,6 +50,23 @@
> >        + "" + CR
> >        + "User-Agent: *" + CR
> >        + "Disallow: /foo/bar/" + CR; // no crawl delay for other agents
> > +
> > +  private static final String ROBOTS_STRING_REVERSE =
> > +      "User-Agent: *" + CR
> > +      + "Disallow: /foo/bar/" + CR // no crawl delay for other agents
> > +      + "" + CR
> > +      + "User-Agent: Agent1 #foo" + CR
> > +      + "Disallow: /a" + CR
> > +      + "Disallow: /b/a" + CR
> > +      + "#Disallow: /c" + CR
> > +      + "Crawl-delay: 10" + CR // set crawl delay for Agent1 as 10 sec
> > +      + "" + CR
> > +      + "" + CR
> > +      + "User-Agent: Agent2" + CR
> > +      + "Disallow: /a/bloh" + CR
> > +      + "Disallow: /c" + CR
> > +      + "Disallow: /foo" + CR
> > +      + "Crawl-delay: 20" + CR;
> >
> >    private static final String[] TEST_PATHS = new String[] {
> >        "http://example.com/a",
> > @@ -80,6 +97,29 @@
> >    /**
> >     * Test that the robots rules are interpreted correctly by the robots rules parser.
> >     */
> > +  public void testRobotsAgentReverse() {
> > +    rules = parser.parseRules("testRobotsAgent", ROBOTS_STRING_REVERSE.getBytes(), CONTENT_TYPE, SINGLE_AGENT);
> > +
> > +    for (int counter = 0; counter < TEST_PATHS.length; counter++) {
> > +      assertTrue("testing on agent (" + SINGLE_AGENT + "), and "
> > +          + "path " + TEST_PATHS[counter]
> > +          + " got " + rules.isAllowed(TEST_PATHS[counter]),
> > +          rules.isAllowed(TEST_PATHS[counter]) == RESULTS[counter]);
> > +    }
> > +
> > +    rules = parser.parseRules("testRobotsAgent", ROBOTS_STRING_REVERSE.getBytes(), CONTENT_TYPE, MULTIPLE_AGENTS);
> > +
> > +    for (int counter = 0; counter < TEST_PATHS.length; counter++) {
> > +      assertTrue("testing on agents (" + MULTIPLE_AGENTS + "), and "
> > +          + "path " + TEST_PATHS[counter]
> > +          + " got " + rules.isAllowed(TEST_PATHS[counter]),
> > +          rules.isAllowed(TEST_PATHS[counter]) == RESULTS[counter]);
> > +    }
> > +  }
> > +
> > +  /**
> > +   * Test that the robots rules are interpreted correctly by the robots rules parser.
> > +   */
> >    public void testRobotsAgent() {
> >      rules = parser.parseRules("testRobotsAgent", ROBOTS_STRING.getBytes(), CONTENT_TYPE, SINGLE_AGENT);
> >
> > -----Original message-----
> > > From: Tejas Patil <tejas.patil...@gmail.com>
> > > Sent: Friday 24th January 2014 15:24
> > > To: user@nutch.apache.org
> > > Subject: Re: Order of robots file
> > >
> > > Hi Markus,
> > > I am trying to understand the problem you described. You meant that with
> > > the original Nutch robots parsing code, the robots file below allowed
> > > your crawler to crawl stuff:
> > >
> > > User-agent: *
> > > Disallow: /
> > >
> > > User-agent: our_crawler
> > > Allow: /
> > >
> > > But now that you started using the change from NUTCH-1031 [0] (i.e. the
> > > delegation of robots parsing to crawler-commons), it blocked your
> > > crawler. To make things work, you had to change your robots file to
> > > this:
> > >
> > > User-agent: our_crawler
> > > Allow: /
> > >
> > > User-agent: *
> > > Disallow: /
> > >
> > > Did I understand the problem correctly?
> > >
> > > [0] : https://issues.apache.org/jira/browse/NUTCH-1031
> > >
> > > Thanks,
> > > Tejas
> > >
> > > On Fri, Jan 24, 2014 at 7:29 PM, Markus Jelsma
> > > <markus.jel...@openindex.io> wrote:
> > >
> > > > Hi,
> > > >
> > > > I am attempting to merge some Nutch changes back to our own. We
> > > > aren't using Nutch's CrawlerCommons impl but the old stuff. But
> > > > because of the recording of response time and rudimentary SSL
> > > > support, I decided to move it back to our version. Suddenly I
> > > > realized a local crawl does not work anymore, it seems because of
> > > > the order of the robots definitions.
> > > >
> > > > For example:
> > > >
> > > > User-agent: *
> > > > Disallow: /
> > > >
> > > > User-agent: our_crawler
> > > > Allow: /
> > > >
> > > > does not allow our crawler to fetch URLs. But
> > > >
> > > > User-agent: our_crawler
> > > > Allow: /
> > > >
> > > > User-agent: *
> > > > Disallow: /
> > > >
> > > > does! This was not the case before. Is anyone here aware of this? Is
> > > > it by design, or is it a flaw?
> > > >
> > > > Thanks
> > > > Markus
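For comparison, the robots exclusion convention treats group selection as
order-independent: a crawler first collects all User-agent groups in the
file, then obeys the group that matches its own token, falling back to *
only when nothing else matches. The sketch below illustrates that selection
rule. It is a simplified, hypothetical illustration (one User-agent line per
group, no token-prefix matching), not the actual code of Nutch or
crawler-commons.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified, hypothetical illustration of order-independent group
// selection; real parsers also handle several User-agent lines per group
// and partial token matches.
public class GroupSelectionSketch {

  static List<String> selectGroup(String robotsTxt, List<String> agentNames) {
    // First pass: collect every group, whatever its position in the file.
    Map<String, List<String>> groups = new LinkedHashMap<>();
    List<String> current = null;
    for (String raw : robotsTxt.split("\n")) {
      String line = raw.trim();
      if (line.isEmpty()) {
        current = null; // a blank line ends the current group
      } else if (line.toLowerCase().startsWith("user-agent:")) {
        String agent = line.substring("user-agent:".length()).trim().toLowerCase();
        current = groups.computeIfAbsent(agent, k -> new ArrayList<>());
      } else if (current != null) {
        current.add(line); // Allow:/Disallow:/Crawl-delay: rules
      }
    }
    // Second pass: agent names win by preference order, never by file order.
    for (String name : agentNames) {
      List<String> rules = groups.get(name.toLowerCase());
      if (rules != null) {
        return rules;
      }
    }
    return groups.getOrDefault("*", new ArrayList<>());
  }

  public static void main(String[] args) {
    String robotsTxt =
        "User-agent: *\nDisallow: /\n\nUser-agent: our_crawler\nAllow: /\n";
    // Prints [Allow: /] although the our_crawler group comes last in the file.
    System.out.println(selectGroup(robotsTxt, List.of("our_crawler")));
  }
}

Read this way, both of Markus's orderings should produce the same verdict
for our_crawler, so the behavior he observed looks like a flaw in the parser
in use rather than something mandated by the robots.txt format.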