[ https://issues.apache.org/jira/browse/NUTCH-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375693#comment-14375693 ]
Sebastian Nagel commented on NUTCH-1941:
----------------------------------------

Hi [~asitangm],

thanks! The patch needs some rework: formatting (see []), Java syntax (e.g., {{(word)!="\n"}}), and:
* resources have to be loaded from the class path, e.g.
{code}
Reader reader = conf.getConfResourceAsReader(agentFile);
{code}
If Nutch is run via Hadoop, resources and configuration files are wrapped into one single job file, so they cannot be opened as plain files on the local file system.
* the name of the agent file should also be configurable via a property. The default should be not to rotate agent names (because it's not polite!), see [~markus.jel...@openindex.io]'s comment. So the rotation has to be turned on explicitly, e.g., by setting the agent file.
* the way agents are rotated seems quite complex: why not just take a random element from the list? That also requires no thread synchronization, because it does not matter if the same agent name is accidentally used by two threads in sequence. There is also no need for that many variables: the list of agent names and a ThreadLocalRandom should be enough.
* "HttpBase class( A single instance of this is used by different fetcher thread": yes, correct. So there is no need for static variables.

> Optional rolling http.agent.name's
> ----------------------------------
>
>                 Key: NUTCH-1941
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1941
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, protocol
>            Reporter: Lewis John McGibbney
>            Priority: Trivial
>         Attachments: NUTCH-1941-ITR2.patch, NUTCH-1941-ver1.patch, agent.names.txt, nutch.patch
>
>
> In some scenarios, even whilst adhering to fetcher.crawl.delay, web admins can block your fetcher based merely on your crawler name.
> I propose the ability to implement rolling http.agent.name's which could be substituted every 5 seconds for example. This would mean that successive requests to the same domain would be sent with different http.agent.name.
> This behavior should be off by default.
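
For illustration, the random-selection approach suggested in the comment could look roughly like the sketch below. It is not taken from any of the attached patches: the class {{AgentNameRotator}} and its methods are hypothetical names, and the agent file is assumed to hold one agent name per line. In Nutch the reader would presumably come from {{conf.getConfResourceAsReader(agentFile)}} as shown in the comment; here it is passed in so the sketch has no Hadoop dependency.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper (not part of Nutch): picks a random agent name per request. */
final class AgentNameRotator {

  // Instance field, not static: one rotator can be shared by all fetcher threads.
  private final List<String> agentNames;

  private AgentNameRotator(List<String> agentNames) {
    this.agentNames = Collections.unmodifiableList(agentNames);
  }

  /** Reads agent names from the reader, assuming one name per line; blank lines are skipped. */
  static AgentNameRotator fromReader(Reader reader) {
    List<String> names = new ArrayList<>();
    try (BufferedReader in = new BufferedReader(reader)) {
      String line;
      while ((line = in.readLine()) != null) {
        line = line.trim();
        if (!line.isEmpty()) {
          names.add(line);
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return new AgentNameRotator(names);
  }

  /** Rotation is only active if at least one agent name was loaded. */
  boolean isEnabled() {
    return !agentNames.isEmpty();
  }

  /**
   * Returns a random agent name, or the given default if rotation is off.
   * No synchronization needed: the list is read-only and ThreadLocalRandom
   * is per-thread.
   */
  String pick(String defaultAgent) {
    if (agentNames.isEmpty()) {
      return defaultAgent;
    }
    int i = ThreadLocalRandom.current().nextInt(agentNames.size());
    return agentNames.get(i);
  }
}
```

With no agent file configured (or an empty one), {{pick}} simply returns the default agent name, which keeps rotation off unless it is explicitly turned on.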
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)