[jira] [Updated] (NUTCH-1718) update description of property http.robots.agent
[ https://issues.apache.org/jira/browse/NUTCH-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1718:
---
    Attachment: NUTCH-1718-trunk.v2.patch

Updated patch:
* for backward compatibility: take care that the agent name itself is not given twice
* removed the obsolete check from Fetcher whether http.agent.name is included in http.robots.agents

Backward compatibility with an old nutch-site.xml has been tested. I'll continue testing, but I would opt for applying this: the behavior of Nutch regarding robots.txt has changed significantly with NUTCH-1759, and we can accept this minor change.

update description of property http.robots.agent

                Key: NUTCH-1718
                URL: https://issues.apache.org/jira/browse/NUTCH-1718
            Project: Nutch
         Issue Type: Bug
         Components: fetcher
  Affects Versions: 1.7, 2.2, 2.2.1
           Reporter: Sebastian Nagel
           Priority: Trivial
            Fix For: 1.9
        Attachments: NUTCH-1718-trunk.v1.patch, NUTCH-1718-trunk.v2.patch

The description of property http.robots.agents in nutch-default.xml recommends adding a '*' to the list of agent names. This will cause the same problem as described in NUTCH-1715. The description should be updated. The same applies to the order of precedence, which since NUTCH-1031 is dictated only by the ordering of user agents in robots.txt.

{code:xml}
<property>
  <name>http.robots.agents</name>
  <value>*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>
{code}

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (NUTCH-1718) update description of property http.robots.agent
[ https://issues.apache.org/jira/browse/NUTCH-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tejas Patil updated NUTCH-1718:
---
    Attachment: NUTCH-1718-trunk.v1.patch

Thanks [~wastl-nagel] for bringing this up. I should have updated the documentation with NUTCH-1715 but lost track of it.

In addition to updating the documentation, I am proposing this: instead of requiring users to put 'http.agent.name' first in 'http.robots.agents', make the program do that automatically. Users would then use 'http.robots.agents' only to specify additional agents apart from 'http.agent.name'. Here is a patch for that.

update description of property http.robots.agent

                Key: NUTCH-1718
                URL: https://issues.apache.org/jira/browse/NUTCH-1718
            Project: Nutch
         Issue Type: Bug
         Components: fetcher
  Affects Versions: 1.7, 2.2, 2.2.1
           Reporter: Sebastian Nagel
           Priority: Trivial
            Fix For: 2.3, 1.8
        Attachments: NUTCH-1718-trunk.v1.patch

The description of property http.robots.agents in nutch-default.xml recommends adding a '*' to the list of agent names. This will cause the same problem as described in NUTCH-1715. The description should be updated. The same applies to the order of precedence, which since NUTCH-1031 is dictated only by the ordering of user agents in robots.txt.

{code:xml}
<property>
  <name>http.robots.agents</name>
  <value>*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>
{code}

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
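For readers who do not want to open the patches, the idea discussed in this thread (put http.agent.name first automatically, treat http.robots.agents as additional names only, and do not list the agent name twice when an old nutch-site.xml still contains it) could be sketched roughly as follows. This is not the code from NUTCH-1718-trunk.v1.patch or v2.patch: the class RobotsAgents and the method build are hypothetical names, and dropping the '*' entry and comparing names case-sensitively are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration; not taken from the attached patches. */
class RobotsAgents {

  /**
   * Builds the agent names to look for in robots.txt: the value of
   * http.agent.name first, then any additional names from
   * http.robots.agents, each name at most once.
   */
  static List<String> build(String agentName, String additionalAgents) {
    // LinkedHashSet keeps insertion order and ignores repeated names,
    // so an old config that already lists the agent name does no harm.
    Set<String> agents = new LinkedHashSet<>();
    agents.add(agentName.trim());
    if (additionalAgents != null) {
      for (String agent : additionalAgents.split(",")) {
        String name = agent.trim();
        // Assumption: empty entries and the '*' wildcard are skipped.
        if (name.isEmpty() || name.equals("*")) {
          continue;
        }
        agents.add(name);
      }
    }
    return new ArrayList<>(agents);
  }

  public static void main(String[] args) {
    // Old-style value that repeats the agent name and ends with '*'.
    System.out.println(build("Blurfl", "BlurflDev,Blurfl,*"));
  }
}
```

With the old-style value from the property description, the sketch prints [Blurfl, BlurflDev]: the configured agent name comes first and appears only once.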