[ 
https://issues.apache.org/jira/browse/NUTCH-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375693#comment-14375693
 ] 

Sebastian Nagel commented on NUTCH-1941:
----------------------------------------

Hi [~asitangm], thanks! The patch needs some rework: formatting (see []), Java 
syntax (e.g., {{(word)!="\n"}}), and:
* resources have to be loaded from the class path, e.g.
{code}
Reader reader = conf.getConfResourceAsReader(agentFile);
{code}
If Nutch is run via Hadoop, resources and configuration files are packed into 
a single job file.
* the name of the agent file should also be configurable via a property. The 
default should be to not rotate agent names (because it's not polite!), see 
[~markus.jel...@openindex.io]'s comment. So, the rotation has to be explicitly 
turned on, e.g., by setting the agent file.
* the way agents are rotated seems to be quite complex: why not just take a 
random element from the list? That would also not require thread 
synchronization, because it does not matter if the same agent name is 
accidentally used by two threads in sequence. There is also no need for that 
many variables: the list of agent names and a ThreadLocalRandom should be 
enough.
* "HttpBase class( A single instance of this is used by different fetcher 
thread": yes, correct. So, there is no need for static variables.
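
A minimal sketch of the points above, not a finished patch: the class name 
AgentRotator and its methods are hypothetical and not part of Nutch. It 
assumes the agent file holds one agent name per line and that the caller 
passes the Reader obtained from conf.getConfResourceAsReader(agentFile), or 
null if no agent file is configured (rotation stays off).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper (not part of Nutch): picks a random agent name per request. */
class AgentRotator {

  // Instance field, not static: a single protocol instance is shared by the
  // fetcher threads. The list is only read after construction.
  private final List<String> agentNames = new ArrayList<>();

  /**
   * @param reader agent names, one per line (assumed file format);
   *               null means no agent file is configured, rotation is off
   */
  AgentRotator(Reader reader) throws IOException {
    if (reader == null) {
      return;
    }
    try (BufferedReader in = new BufferedReader(reader)) {
      String line;
      while ((line = in.readLine()) != null) {
        line = line.trim();
        if (!line.isEmpty()) {
          agentNames.add(line);
        }
      }
    }
  }

  /** True only if at least one agent name was loaded. */
  boolean isEnabled() {
    return !agentNames.isEmpty();
  }

  /** Returns a random agent name, or defaultAgent if rotation is off. */
  String pick(String defaultAgent) {
    if (agentNames.isEmpty()) {
      return defaultAgent;
    }
    // ThreadLocalRandom needs no synchronization; two threads picking the
    // same name in sequence is harmless.
    int i = ThreadLocalRandom.current().nextInt(agentNames.size());
    return agentNames.get(i);
  }
}
```

The only state is the list of agent names, and each call draws from the 
calling thread's own ThreadLocalRandom, so no locks or static fields are 
needed.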

> Optional rolling http.agent.name's
> ----------------------------------
>
>                 Key: NUTCH-1941
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1941
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, protocol
>            Reporter: Lewis John McGibbney
>            Priority: Trivial
>         Attachments: NUTCH-1941-ITR2.patch, NUTCH-1941-ver1.patch, 
> agent.names.txt, nutch.patch
>
>
> In some scenarios, even whilst adhering to fetcher.crawl.delay, web admins 
> can block your fetcher based merely on your crawler name. 
> I propose the ability to implement rolling http.agent.name's which could be 
> substituted every 5 seconds for example. This would mean that successive 
> requests to the same domain would be sent with different http.agent.name. 
> This behavior should be off by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
