I am working on the scenario you just pointed out. By Apache Nutch, do you mean the current codebase with CC, or a version before that?
CC differs from the original Nutch code in that CC takes a somewhat greedy
approach: it tries to get a match/mismatch after every line it sees in the
robots file. Back when I was working on the delegation of robots parsing to
Crawler-Commons (CC), I remember that there was a difference between the
semantics of the original parsing code and CC's implementation for multiple
robots agents. Here was my observation at that time:
https://issues.apache.org/jira/browse/NUTCH-1031?focusedCommentId=13558217&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13558217

~tejas

On Fri, Jan 24, 2014 at 8:11 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:

> Tejas, the problem exists in Apache Nutch as well. We'll take localhost as
> an example, with the following config and robots.txt:
>
> # cat /var/www/robots.txt
> User-agent: *
> Disallow: /
>
> User-agent: nutch
> Allow: /
>
> config:
>
> <property>
>   <name>http.agent.name</name>
>   <value>Mozilla</value>
> </property>
> <property>
>   <name>http.agent.version</name>
>   <value>5.0</value>
> </property>
> <property>
>   <name>http.robots.agents</name>
>   <value>nutch,*</value>
> </property>
> <property>
>   <name>http.agent.description</name>
>   <value>compatible; NutchCrawler</value>
> </property>
> <property>
>   <name>http.agent.url</name>
>   <value>+http://example.org/</value>
> </property>
>
> URL: http://localhost/
> Version: 7
> Status: 3 (db_gone)
> Fetch time: Mon Mar 10 15:36:48 CET 2014
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 3888000 seconds (45 days)
> Score: 0.0
> Signature: null
> Metadata:
>   _pst_=robots_denied(18), lastModified=0
>
> Can you confirm?
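For reference, the reproduction above can also be checked directly against
crawler-commons, without going through Nutch at all. What follows is a
minimal, hypothetical sketch (the class name RobotsOrderCheck and the
hard-coded strings are illustrative, not part of either codebase): it parses
the same robots.txt with the two groups in both orders, using the agent list
"nutch,*" from the http.robots.agents setting above.

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

// Hypothetical stand-alone check, not part of Nutch or its test suite:
// parse the robots.txt from the reproduction in both group orders and
// print the verdict for http://localhost/ when crawling as "nutch,*".
public class RobotsOrderCheck {

  public static void main(String[] args) {
    String wildcardFirst =
        "User-agent: *\n"
        + "Disallow: /\n"
        + "\n"
        + "User-agent: nutch\n"
        + "Allow: /\n";

    String crawlerFirst =
        "User-agent: nutch\n"
        + "Allow: /\n"
        + "\n"
        + "User-agent: *\n"
        + "Disallow: /\n";

    SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
    for (String robotsTxt : new String[] { wildcardFirst, crawlerFirst }) {
      // "nutch,*" mirrors the http.robots.agents value in the config above.
      BaseRobotRules rules = parser.parseContent("http://localhost/robots.txt",
          robotsTxt.getBytes(), "text/plain", "nutch,*");
      System.out.println("allowed(http://localhost/) = "
          + rules.isAllowed("http://localhost/"));
    }
  }
}

If the two orderings print different verdicts for the same agent list, the
order sensitivity sits in the crawler-commons layer rather than in the
lib-http wrapper around it.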
> -----Original message-----
> > From: Markus Jelsma <markus.jel...@openindex.io>
> > Sent: Friday 24th January 2014 15:29
> > To: user@nutch.apache.org
> > Subject: RE: Order of robots file
> >
> > Hi, sorry for being unclear. You understand correctly: I had to change
> > the robots.txt order and put our crawler ABOVE User-Agent: *.
> >
> > I have tried a unit test for lib-http to demonstrate the problem, but it
> > does not fail. I think I did something wrong in merging the code base.
> > I'll look further.
> >
> > markus@midas:~/projects/apache/nutch/trunk$ svn diff src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java
> > Index: src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java
> > ===================================================================
> > --- src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java (revision 1560984)
> > +++ src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java (working copy)
> > @@ -50,6 +50,23 @@
> >        + "" + CR
> >        + "User-Agent: *" + CR
> >        + "Disallow: /foo/bar/" + CR; // no crawl delay for other agents
> > +
> > +  private static final String ROBOTS_STRING_REVERSE =
> > +      "User-Agent: *" + CR
> > +      + "Disallow: /foo/bar/" + CR // no crawl delay for other agents
> > +      + "" + CR
> > +      + "User-Agent: Agent1 #foo" + CR
> > +      + "Disallow: /a" + CR
> > +      + "Disallow: /b/a" + CR
> > +      + "#Disallow: /c" + CR
> > +      + "Crawl-delay: 10" + CR // set crawl delay for Agent1 as 10 sec
> > +      + "" + CR
> > +      + "" + CR
> > +      + "User-Agent: Agent2" + CR
> > +      + "Disallow: /a/bloh" + CR
> > +      + "Disallow: /c" + CR
> > +      + "Disallow: /foo" + CR
> > +      + "Crawl-delay: 20" + CR;
> >
> >    private static final String[] TEST_PATHS = new String[] {
> >        "http://example.com/a",
> > @@ -80,6 +97,29 @@
> >    /**
> >     * Test that the robots rules are interpreted correctly by the robots rules parser.
> >     */
> > +  public void testRobotsAgentReverse() {
> > +    rules = parser.parseRules("testRobotsAgent", ROBOTS_STRING_REVERSE.getBytes(), CONTENT_TYPE, SINGLE_AGENT);
> > +
> > +    for (int counter = 0; counter < TEST_PATHS.length; counter++) {
> > +      assertTrue("testing on agent (" + SINGLE_AGENT + "), and "
> > +          + "path " + TEST_PATHS[counter]
> > +          + " got " + rules.isAllowed(TEST_PATHS[counter]),
> > +          rules.isAllowed(TEST_PATHS[counter]) == RESULTS[counter]);
> > +    }
> > +
> > +    rules = parser.parseRules("testRobotsAgent", ROBOTS_STRING_REVERSE.getBytes(), CONTENT_TYPE, MULTIPLE_AGENTS);
> > +
> > +    for (int counter = 0; counter < TEST_PATHS.length; counter++) {
> > +      assertTrue("testing on agents (" + MULTIPLE_AGENTS + "), and "
> > +          + "path " + TEST_PATHS[counter]
> > +          + " got " + rules.isAllowed(TEST_PATHS[counter]),
> > +          rules.isAllowed(TEST_PATHS[counter]) == RESULTS[counter]);
> > +    }
> > +  }
> > +
> > +  /**
> > +   * Test that the robots rules are interpreted correctly by the robots rules parser.
> > +   */
> >    public void testRobotsAgent() {
> >      rules = parser.parseRules("testRobotsAgent", ROBOTS_STRING.getBytes(), CONTENT_TYPE, SINGLE_AGENT);
> >
> > -----Original message-----
> > > From: Tejas Patil <tejas.patil...@gmail.com>
> > > Sent: Friday 24th January 2014 15:24
> > > To: user@nutch.apache.org
> > > Subject: Re: Order of robots file
> > >
> > > Hi Markus,
> > > I am trying to understand the problem you described. You meant that with
> > > the original Nutch robots parsing code, the robots file below allowed
> > > your crawler to crawl stuff:
> > >
> > > User-agent: *
> > > Disallow: /
> > >
> > > User-agent: our_crawler
> > > Allow: /
> > >
> > > But now that you started using the change from NUTCH-1031 [0] (i.e. the
> > > delegation of robots parsing to crawler-commons), it blocked your
> > > crawler. To make things work, you had to change your robots file to
> > > this:
> > >
> > > User-agent: our_crawler
> > > Allow: /
> > >
> > > User-agent: *
> > > Disallow: /
> > >
> > > Did I understand the problem correctly?
> > >
> > > [0] : https://issues.apache.org/jira/browse/NUTCH-1031
> > >
> > > Thanks,
> > > Tejas
> > >
> > > On Fri, Jan 24, 2014 at 7:29 PM, Markus Jelsma
> > > <markus.jel...@openindex.io> wrote:
> > >
> > > > Hi,
> > > >
> > > > I am attempting to merge some Nutch changes back to our own. We
> > > > aren't using Nutch's CrawlerCommons impl but the old stuff. But
> > > > because of the recording of response time and rudimentary SSL
> > > > support, I decided to move it back to our version. Suddenly I
> > > > realized a local crawl does not work anymore, it seems because of
> > > > the order of the robots definitions.
> > > >
> > > > For example:
> > > >
> > > > User-agent: *
> > > > Disallow: /
> > > >
> > > > User-agent: our_crawler
> > > > Allow: /
> > > >
> > > > does not allow our crawler to fetch URLs. But
> > > >
> > > > User-agent: our_crawler
> > > > Allow: /
> > > >
> > > > User-agent: *
> > > > Disallow: /
> > > >
> > > > does! This was not the case before. Is anyone here aware of this? Is
> > > > it by design, or is it a flaw?
> > > >
> > > > Thanks
> > > > Markus
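For comparison, the robots exclusion convention treats group selection as
order-independent: a crawler first collects all User-agent groups in the
file, then obeys the group that matches its own token, falling back to *
only when nothing else matches. The sketch below illustrates that selection
rule. It is a simplified, hypothetical illustration (one User-agent line per
group, no token-prefix matching), not the actual code of Nutch or
crawler-commons.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified, hypothetical illustration of order-independent group
// selection; real parsers also handle several User-agent lines per group
// and partial token matches.
public class GroupSelectionSketch {

  static List<String> selectGroup(String robotsTxt, List<String> agentNames) {
    // First pass: collect every group, whatever its position in the file.
    Map<String, List<String>> groups = new LinkedHashMap<>();
    List<String> current = null;
    for (String raw : robotsTxt.split("\n")) {
      String line = raw.trim();
      if (line.isEmpty()) {
        current = null; // a blank line ends the current group
      } else if (line.toLowerCase().startsWith("user-agent:")) {
        String agent = line.substring("user-agent:".length()).trim().toLowerCase();
        current = groups.computeIfAbsent(agent, k -> new ArrayList<>());
      } else if (current != null) {
        current.add(line); // Allow:/Disallow:/Crawl-delay: rules
      }
    }
    // Second pass: agent names win by preference order, never by file order.
    for (String name : agentNames) {
      List<String> rules = groups.get(name.toLowerCase());
      if (rules != null) {
        return rules;
      }
    }
    return groups.getOrDefault("*", new ArrayList<>());
  }

  public static void main(String[] args) {
    String robotsTxt =
        "User-agent: *\nDisallow: /\n\nUser-agent: our_crawler\nAllow: /\n";
    // Prints [Allow: /] although the our_crawler group comes last in the file.
    System.out.println(selectGroup(robotsTxt, List.of("our_crawler")));
  }
}

Read this way, both of Markus's orderings should produce the same verdict
for our_crawler, so the behavior he observed looks like a flaw in the parser
in use rather than something mandated by the robots.txt format.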