Hi,
I have added my plugin (called "recommended") to nutch-site.xml but it seems
that Nutch is not using it.
I say this because when I search for "recom" I get no results, even though
there is a page that has the meta-tag:
I have attached my nutch-site.xml and nutch-default.xml files, maybe you see
something wrong.
Apart from that, my plugin compiles ok, but when I run "ant test" I get
errors. I have also attached the output for "ant test".
On Sun, May 11, 2008 at 8:08 PM, <[EMAIL PROTECTED]> wrote:
> Hi,
>
> Yes, you have to add your plugin to nutch-site.xml, along with other
> plugins you probably already have defined there. If you don't have them in
> nutch-site.xml, look at nutch-default.xml
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
> - Original Message
> > From: Pau <[EMAIL PROTECTED]>
> > To: nutch-dev@lucene.apache.org
> > Sent: Sunday, May 11, 2008 8:28:53 AM
> > Subject: Writing a plugin
> >
> > Hello,
> > I am following the WritingPluginExample-0.9 and I am a bit confused about
> > how to get Nutch to use my plugin.
> > In the section called "Getting Ant to Compile Your Plugin" it says:
> > "The next time you run a crawl your parser and index filter should get
> > used".
> > But at the end of the document, there is another section called "Getting
> > Nutch to Use Your Plugin".
> > Do I have to edit the nutch-site.xml file as "Getting Nutch to Use Your
> > Plugin" says? Or is it not necessary?
> > Thank you.
>
>
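Following the advice above, a minimal nutch-site.xml entry that registers a custom plugin might look like the sketch below. It is a sketch only: the plugin id "recommended" and the rest of the list are taken from this thread, and any property defined in nutch-site.xml overrides the same property in nutch-default.xml.

```xml
<property>
  <name>plugin.includes</name>
  <value>recommended|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Adds the custom "recommended" plugin to the default plugin set.</description>
</property>
```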
nutch-site.xml:

<property>
  <name>http.agent.name</name>
  <value>PauSpider</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.
  NOTE: You should also check other related properties:
    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version
  and set their values appropriately.</description>
</property>

<property>
  <name>http.agent.description</name>
  <value>Nutch Crawler</value>
  <description>Further description of our bot - this text is used in
  the User-Agent header. It appears in parenthesis after the agent name.</description>
</property>

<property>
  <name>http.agent.email</name>
  <value>[EMAIL PROTECTED]</value>
  <description></description>
</property>

<property>
  <name>plugin.includes</name>
  <value>recommended|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin id names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.</description>
</property>
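A quick way to sanity-check that a plugin id is actually covered by the plugin.includes expression is to match it as a regular expression. This is a Python sketch, using `re.fullmatch` to mirror the full-match semantics of Java's `Pattern.matches`, which is what Nutch's plugin loading effectively applies to each plugin id:

```python
import re

# The plugin.includes value from the nutch-site.xml in this thread
includes = (
    "recommended|protocol-http|urlfilter-regex|parse-(text|html|js)|"
    "index-basic|query-(basic|site|url)|summary-basic|scoring-opic|"
    "urlnormalizer-(pass|regex|basic)"
)

def is_included(plugin_id: str) -> bool:
    """True if the plugin id matches the whole plugin.includes pattern."""
    return re.fullmatch(includes, plugin_id) is not None

for plugin_id in ("recommended", "parse-html", "index-more"):
    print(plugin_id, is_included(plugin_id))
# recommended True
# parse-html True
# index-more False
```

If `is_included("recommended")` is true here but the plugin still is not loaded, the problem is more likely in the plugin's plugin.xml or in which config file Nutch is actually reading.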
nutch-default.xml:

<property>
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.</description>
</property>

<property>
  <name>file.content.ignored</name>
  <value>true</value>
  <description>If true, no file content will be saved during fetch.
  And it is probably what we want to set most of time, since file:// URLs
  are meant to be local and we can always use them directly at parsing
  and indexing stages. Otherwise file contents will be saved.
  !! NO IMPLEMENTED YET !!</description>
</property>

<property>
  <name>http.agent.name</name>
  <value></value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.
  NOTE: You should also check other related properties:
    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version
  and set their values appropriately.</description>
</property>

<property>
  <name>http.robots.agents</name>
  <value>*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*</description>
</property>

<property>
  <name>http.robots.403.allow</name>
  <value>true</value>
  <description>Some servers return HTTP status 403 (Forbidden) if
  /robots.txt doesn't exist. This should probably mean that we are
  allowed to crawl the site nonetheless. If this is set to false,
  then such sites will be treated as forbidden.</description>
</property>

<property>
  <name>http.agent.description</name>
  <value></value>
  <description>Further description of our bot - this text is used in
  the User-Agent header. It appears in parenthesis after the agent name.</description>
</property>

<property>
  <name>http.agent.url</name>
  <value></value>
  <description>A URL to advertise in the User-Agent header. This will
  appear in parenthesis after the agent name. Custom dictates that this
  should be a URL of a page explaining the purpose and behavior of this
  crawler.</description>
</property>

<property>
  <name>http.agent.email</name>
  <value></value>
  <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header. A good practice is to mangle this
  address (e.g. 'info at example dot com') to avoid spamming.</description>
</property>

<property>
  <name>http.agent.version</name>
  <value>Nutch-0.9</value>
  <description>A version string to advertise in the User-Agent
  header.</description>
</property>

<property>
  <name>http.timeout</name>
  <value>1</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

<property>
  <name>http.max.delays</name>
  <value>100</value>
  <description>The number of times a thread will delay when trying to
  fetch a page. Each time it finds that a host is busy, it will wait
  fetcher.server.delay. After h</description>
</property>