Hello,

I am currently trying to crawl the web using the Nutch 1.11 trunk version from
https://github.com/apache/nutch

I am trying to use a particular property from the nutch-default.xml named:

<property>
  <name>http.agent.rotate</name>
  <value>false</value>
  <description>
    If true, instead of http.agent.name, alternating agent names are
    chosen from a list provided via http.agent.rotate.file.
  </description>
</property>

<property>
  <name>http.agent.rotate.file</name>
  <value>agents.txt</value>
  <description>
    File containing alternative user agent names to be used instead of
    http.agent.name on a rotating basis if http.agent.rotate is true.
    Each line of the file should contain exactly one agent
    specification including name, version, description, URL, etc.
  </description>
</property>


This is how I have modified my nutch-site.xml (other basic properties are
not shown):


<property>
  <name>http.agent.rotate</name>
  <value>true</value>
  <description>
    If true, instead of http.agent.name, alternating agent names are
    chosen from a list provided via http.agent.rotate.file.
  </description>
</property>

<property>
  <name>http.agent.rotate.file</name>
  <value>agents.txt</value>
  <description>
    File containing alternative user agent names to be used instead of
    http.agent.name on a rotating basis if http.agent.rotate is true.
    Each line of the file should contain exactly one agent
    specification including name, version, description, URL, etc.
  </description>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>
    Regular expression naming plugin directory names to
    include. Any plugin not matching this expression is excluded.
    In any case you need at least include the nutch-extensionpoints plugin. By
    default Nutch includes crawling just HTML and plain text via HTTP,
    and basic indexing and search plugins. In order to use HTTPS please enable
    protocol-httpclient, but be aware of possible intermittent problems with
    the underlying commons-httpclient library. Set parsefilter-naivebayes for
    classification based focused crawler.
  </description>
</property>


This is what my agents.txt looks like:
NutchTry1
NutchTry2
NutchTry3
NutchTry4
NutchTry5

and it is stored inside the runtime/local/conf folder.
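
To make clear what I expect to happen, here is a small Python sketch of the behaviour as I read it from the property descriptions: one agent name per line, used in turn. It is only an illustration, not Nutch's actual code. The helper names load_agents and rotate_agents are hypothetical, and I have not confirmed whether Nutch picks the names in order or at random.

```python
from itertools import cycle, islice


def load_agents(text):
    """Return the non-empty lines of an agents file, one agent per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def rotate_agents(agents, n):
    """Return the first n agent names when cycling through the list in order."""
    return list(islice(cycle(agents), n))


# Contents of my agents.txt, inlined so the sketch is self-contained.
AGENTS_TXT = """NutchTry1
NutchTry2
NutchTry3
NutchTry4
NutchTry5
"""

agents = load_agents(AGENTS_TXT)
picked = rotate_agents(agents, 7)  # agent names for 7 consecutive requests
print(picked)
```

With rotation enabled I would therefore expect to see several of the NutchTry names in the logs, rather than only the value of http.agent.name.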

But when I check my logs, the agent name does not seem to change, even though
protocol-http is activated via the plugin.includes property.

Could you please suggest what changes I could try, or point out anything that
I may have configured incorrectly?

Thanks,
Manali
