Hi,
I have added my plugin (called "recommended") to nutch-site.xml but it seems
that Nutch is not using it.
I say this because when I search for "recom" I get no results, even though
there is a page that has the meta-tag:
I have attached my nutch-site.xml and nutch-default.xml files, maybe you see
something wrong.
Apart from that, my plugin compiles ok, but when I run "ant test" I get
errors. I have also attached the output for "ant test".
On Sun, May 11, 2008 at 8:08 PM, <[EMAIL PROTECTED]> wrote:
> Hi,
>
> Yes, you have to add your plugin to nutch-site.xml, along with other
> plugins you probably already have defined there. If you don't have them in
> nutch-site.xml, look at nutch-default.xml
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
> - Original Message
> > From: Pau <[EMAIL PROTECTED]>
> > To: nutch-dev@lucene.apache.org
> > Sent: Sunday, May 11, 2008 8:28:53 AM
> > Subject: Writing a plugin
> >
> > Hello,
> > I am following the WritingPluginExample-0.9 and I am a bit confused about
> > how to get Nutch to use my plugin.
> > In the section called "Getting Ant to Compile Your Plugin" it says:
> > "The next time you run a crawl your parser and index filter should get
> > used".
> > But at the end of the document, there is another section called "Getting
> > Nutch to Use Your Plugin".
> > Do I have to edit the nutch-site.xml file as "Getting Nutch to Use Your
> > Plugin" says? Or is it not necessary?
> > Thank you.
>
>
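Following the advice above, a minimal nutch-site.xml entry that registers a custom plugin might look like the sketch below. It is a sketch only: the plugin id "recommended" and the rest of the list are taken from this thread, and any property defined in nutch-site.xml overrides the same property in nutch-default.xml.

```xml
<property>
  <name>plugin.includes</name>
  <value>recommended|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Adds the custom "recommended" plugin to the default plugin set.</description>
</property>
```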
nutch-site.xml:

<property>
  <name>http.agent.name</name>
  <value>PauSpider</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.
  NOTE: You should also check other related properties:
    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version
  and set their values appropriately.</description>
</property>

<property>
  <name>http.agent.description</name>
  <value>Nutch Crawler</value>
  <description>Further description of our bot - this text is used in
  the User-Agent header. It appears in parenthesis after the agent name.</description>
</property>

<property>
  <name>http.agent.email</name>
  <value>[EMAIL PROTECTED]</value>
  <description></description>
</property>

<property>
  <name>plugin.includes</name>
  <value>recommended|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin id names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.</description>
</property>
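A quick way to sanity-check that a plugin id is actually covered by the plugin.includes expression is to match it as a regular expression. This is a Python sketch, using `re.fullmatch` to mirror the full-match semantics of Java's `Pattern.matches`, which is what Nutch's plugin loading effectively applies to each plugin id:

```python
import re

# The plugin.includes value from the nutch-site.xml in this thread
includes = (
    "recommended|protocol-http|urlfilter-regex|parse-(text|html|js)|"
    "index-basic|query-(basic|site|url)|summary-basic|scoring-opic|"
    "urlnormalizer-(pass|regex|basic)"
)

def is_included(plugin_id: str) -> bool:
    """True if the plugin id matches the whole plugin.includes pattern."""
    return re.fullmatch(includes, plugin_id) is not None

for plugin_id in ("recommended", "parse-html", "index-more"):
    print(plugin_id, is_included(plugin_id))
# recommended True
# parse-html True
# index-more False
```

If `is_included("recommended")` is true here but the plugin still is not loaded, the problem is more likely in the plugin's plugin.xml or in which config file Nutch is actually reading.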
nutch-default.xml:

<property>
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.</description>
</property>

<property>
  <name>file.content.ignored</name>
  <value>true</value>
  <description>If true, no file content will be saved during fetch.
  And it is probably what we want to set most of time, since file:// URLs
  are meant to be local and we can always use them directly at parsing
  and indexing stages. Otherwise file contents will be saved.
  !! NO IMPLEMENTED YET !!</description>
</property>

<property>
  <name>http.agent.name</name>
  <value></value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.
  NOTE: You should also check other related properties:
    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version
  and set their values appropriately.</description>
</property>

<property>
  <name>http.robots.agents</name>
  <value>*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*</description>
</property>

<property>
  <name>http.robots.403.allow</name>
  <value>true</value>
  <description>Some servers return HTTP status 403 (Forbidden) if
  /robots.txt doesn't exist. This should probably mean that we are
  allowed to crawl the site nonetheless. If this is set to false,
  then such sites will be treated as forbidden.</description>
</property>

<property>
  <name>http.agent.description</name>
  <value></value>
  <description>Further description of our bot - this text is used in
  the User-Agent header. It appears in parenthesis after the agent name.</description>
</property>

<property>
  <name>http.agent.url</name>
  <value></value>
  <description>A URL to advertise in the User-Agent header. This will
  appear in parenthesis after the agent name. Custom dictates that this
  should be a URL of a page explaining the purpose and behavior of this
  crawler.</description>
</property>

<property>
  <name>http.agent.email</name>
  <value></value>
  <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header. A good practice is to mangle this
  address (e.g. 'info at example dot com') to avoid spamming.</description>
</property>

<property>
  <name>http.agent.version</name>
  <value>Nutch-0.9</value>
  <description>A version string to advertise in the User-Agent
  header.</description>
</property>

<property>
  <name>http.timeout</name>
  <value>1</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

<property>
  <name>http.max.delays</name>
  <value>100</value>
  <description>The number of times a thread will delay when trying to
  fetch a page. Each time it finds that a host is busy, it will wait
  fetcher.server.delay. After h</description>
</property>