[jira] [Updated] (NUTCH-1718) update description of property http.robots.agent

2014-05-16 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1718:
---

Attachment: NUTCH-1718-trunk.v2.patch

Updated patch:
* for backward compatibility: make sure the agent name itself is not listed 
twice
* removed the obsolete check in Fetcher for whether http.agent.name is 
included in http.robots.agents

Backward compatibility with an old nutch-site.xml has been tested, and I'll 
continue testing. Still, I would opt for applying this: the behavior of Nutch 
regarding robots.txt has changed significantly with NUTCH-1759, and we can 
accept this minor change.
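
As a rough illustration of the merge described above (this is not the code 
from the attached patch): the value of http.agent.name goes first, the names 
from http.robots.agents follow, and a name that an old nutch-site.xml already 
repeats is kept only once. The class and method names are hypothetical, and 
treating duplicates as exact, case-sensitive matches is an assumption of this 
sketch. It also does nothing special with '*'.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration; not part of Nutch or the patch. */
final class RobotAgentNames {

  /**
   * Builds the list of agent names to look for in robots.txt:
   * agentName first, then the comma-separated additional names,
   * skipping blank entries and exact duplicates.
   */
  static List<String> merge(String agentName, String additionalAgents) {
    // LinkedHashSet keeps insertion order and silently drops duplicates.
    Set<String> names = new LinkedHashSet<>();
    names.add(agentName.trim());
    if (additionalAgents != null) {
      for (String name : additionalAgents.split(",")) {
        String trimmed = name.trim();
        if (!trimmed.isEmpty()) {
          names.add(trimmed);
        }
      }
    }
    return new ArrayList<>(names);
  }

  public static void main(String[] args) {
    // Old-style value of http.robots.agents that repeats the agent name.
    System.out.println(merge("Blurfl", "Blurfl,BlurflDev"));
    // prints [Blurfl, BlurflDev]
  }
}
```

With a merge like this, an existing configuration that still lists the agent 
name in http.robots.agents keeps working without the name being matched twice.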

 update description of property http.robots.agent
 

 Key: NUTCH-1718
 URL: https://issues.apache.org/jira/browse/NUTCH-1718
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.7, 2.2, 2.2.1
Reporter: Sebastian Nagel
Priority: Trivial
 Fix For: 1.9

 Attachments: NUTCH-1718-trunk.v1.patch, NUTCH-1718-trunk.v2.patch


 The description of property http.robots.agents in nutch-default.xml recommends 
 adding a '*' to the list of agent names. This will cause the same problem as 
 described in NUTCH-1715. The description should be updated, also regarding the 
 order of precedence, which since NUTCH-1031 is dictated only by the ordering 
 of user agents in robots.txt.
 {code:xml}
 <property>
   <name>http.robots.agents</name>
   <value>*</value>
   <description>The agent strings we'll look for in robots.txt files,
   comma-separated, in decreasing order of precedence. You should
   put the value of http.agent.name as the first agent name, and keep the
   default * at the end of the list. E.g.: BlurflDev,Blurfl,*
   </description>
 </property>
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (NUTCH-1718) update description of property http.robots.agent

2014-01-28 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1718:
---

Attachment: NUTCH-1718-trunk.v1.patch

Thanks [~wastl-nagel] for bringing this up. I should have updated the 
documentation with NUTCH-1715 but lost track of it.

In addition to updating the documentation, I am proposing this: 
Instead of requiring users to put 'http.agent.name' first in 
'http.robots.agents', make the program do that automatically. Users would 
then use 'http.robots.agents' only to specify additional agents apart from 
'http.agent.name'. Here is a patch for that.

 update description of property http.robots.agent
 

 Key: NUTCH-1718
 URL: https://issues.apache.org/jira/browse/NUTCH-1718
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.7, 2.2, 2.2.1
Reporter: Sebastian Nagel
Priority: Trivial
 Fix For: 2.3, 1.8

 Attachments: NUTCH-1718-trunk.v1.patch


 The description of property http.robots.agents in nutch-default.xml recommends 
 adding a '*' to the list of agent names. This will cause the same problem as 
 described in NUTCH-1715. The description should be updated, also regarding the 
 order of precedence, which since NUTCH-1031 is dictated only by the ordering 
 of user agents in robots.txt.
 {code:xml}
 <property>
   <name>http.robots.agents</name>
   <value>*</value>
   <description>The agent strings we'll look for in robots.txt files,
   comma-separated, in decreasing order of precedence. You should
   put the value of http.agent.name as the first agent name, and keep the
   default * at the end of the list. E.g.: BlurflDev,Blurfl,*
   </description>
 </property>
 {code}


