[ https://issues.apache.org/jira/browse/NUTCH-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Asitang Mishra updated NUTCH-1941:
----------------------------------
    Attachment: NUTCH-1941-itr3.patch

Added NUTCH-1941-itr3.patch

As suggested by Sebastian Nagel, I did the following:

1. Fixed the formatting and corrected the syntax.
2. To use this feature, the following properties now need to be included in the nutch-site.xml file:

<property>
  <name>agent.rotate</name>
  <value>false</value>
  <description>
  If this is true, the agent name will be rotated using the names in the
  file set in the property agent.rotate.file.
  </description>
</property>

<property>
  <name>agent.rotate.file</name>
  <value>agents.txt</value>
  <description>
  The file to rotate agent names from, with one agent name per line. If not
  set, a file called agents.txt is looked up in the conf folder by default
  (only when agent.rotate is set to true).
  </description>
</property>

<property>
  <name>agent.rotate.interval</name>
  <value>10</value>
  <description>
  The default value is 50 (if nothing is set here and agent.rotate is true).
  This number is used to draw a random number between 1 and this number. The
  rotator waits for that many URL responses before changing the agent name
  again, and then draws a new random number.
  </description>
</property>

3. Resources are now loaded from the classpath.
4. Removed static variables.
5. Made the code non-synchronized.

> Optional rolling http.agent.name's
> ----------------------------------
>
>                 Key: NUTCH-1941
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1941
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, protocol
>            Reporter: Lewis John McGibbney
>            Priority: Trivial
>         Attachments: NUTCH-1941-ITR2.patch, NUTCH-1941-itr3.patch,
> NUTCH-1941-ver1.patch, agent.names.txt, nutch.patch
>
>
> In some scenarios, even whilst adhering to fetcher.crawl.delay, web admins
> can block your fetcher based merely on your crawler name.
> I propose the ability to implement rolling http.agent.name's which could be
> substituted every 5 seconds, for example.
> This would mean that successive requests to the same domain would be sent
> with different http.agent.name values.
> This behavior should be off by default.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
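
For readers following the agent.rotate.interval description above, the counting logic could look roughly like the sketch below. It is only an illustration and not the code from NUTCH-1941-itr3.patch: the class name AgentRotator, its methods, and the round-robin order of agent names are all assumptions. The message only says that a random wait between 1 and the configured interval is drawn, and that the name changes after that many URL responses.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/** Hypothetical illustration; not the class from the NUTCH-1941 patch. */
class AgentRotator {
  private final List<String> agents; // one entry per line of the agents file
  private final int maxInterval;     // value of agent.rotate.interval
  private final Random random;
  private int index = 0;             // position of the current agent name
  private int remaining;             // responses left before the next switch

  AgentRotator(List<String> agents, int maxInterval, Random random) {
    if (agents.isEmpty() || maxInterval < 1) {
      throw new IllegalArgumentException("need at least one agent and an interval >= 1");
    }
    this.agents = agents;
    this.maxInterval = maxInterval;
    this.random = random;
    this.remaining = nextWait();
  }

  /** Draws a wait between 1 and maxInterval, inclusive. */
  private int nextWait() {
    return 1 + random.nextInt(maxInterval);
  }

  String currentAgent() {
    return agents.get(index);
  }

  /** Call once per URL response; switches the agent when the wait is used up. */
  void onResponse() {
    remaining--;
    if (remaining == 0) {
      index = (index + 1) % agents.size(); // assumption: round-robin order
      remaining = nextWait();              // draw a fresh random wait
    }
  }

  public static void main(String[] args) {
    // Made-up agent names; in Nutch they would come from agent.rotate.file.
    AgentRotator rotator = new AgentRotator(
        Arrays.asList("agent-a", "agent-b", "agent-c"), 10, new Random());
    for (int i = 0; i < 30; i++) {
      System.out.println(rotator.currentAgent());
      rotator.onResponse();
    }
  }
}
```

Keeping the counter and index as instance fields, rather than static ones, matches the spirit of item 4 above, but a real fetcher with several threads would still need to decide how this state is shared.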