Bugs item #954964, was opened at 2004-05-16 20:38
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=491356&aid=954964&group_id=59548
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Stefan Groschupf (joa23)
Assigned to: Nobody/Anonymous (nobody)
Summary: nutch plugin system
Initial Comment:
The patch contains:
1.) an about.txt (see text below)
2.) the nutch plugin system code
3.) the required updated ant build script
4.) a content extractor plugin with
4.a.) html content extractor
4.b.) text content extractor
4.c.) rtf content extractor
4.d.) SWF content extractor (I _think_ nutch is the first search
engine that can parse flash movies)
5.) The required refactored code of fetcher and outputthread to
use the content extractor extension point
6.) A new FetchHandler object that contains the duplicated
handleFetch methods and is used by fetcher and outputthread.
7.) the dom4j.jar library that is required by the plugin system and
should be copied to the nutch/lib folder.
All new code comes with working unit tests, and both crawling
tutorials work without any problem.
===================================
Here is the text of the about.txt file:
===================================
Stefan Groschupf
www.media-style.com
This patch contains the long discussed plugin-system for nutch and
the first nutch-plugin.
Why a plugin-system?
The idea behind the plugin-system is to provide more flexibility,
more flexibility and more flexibility to the nutch software
architecture.
With the help of the nutch-plugin-system it is easily possible to
plug in or replace your own content extractors, summary algorithms,
file systems, url filters and so on.
The nutch-plugin-system lets developers concentrate on their own
implementation without having to care about how to integrate their
code into the core of nutch.
Developers do not need to care about library version conflicts or
license limitations, since a plugin is an encapsulated component and
does not need to be open source.
For users, the plugin-system provides the possibility to assemble the
best search engine for their needs and to easily evaluate different
functionality just by changing plugins.
How does the plugin-system work in general?
The plugin-system allows extending the nutch core functionality at
defined points (extension points) with custom, easy to create
components called extensions. Custom code, the extension
implementations and an xml manifest descriptor file are bundled
together into a plugin.
A plugin can contain one or a set of extensions for one or a set of
extension points.
Furthermore, a plugin can provide its own extension points where
other plugins can register extensions as well.
The nutch plugin system is a kind of publisher-listener pattern with
a central registry tree (the so-called Plugin-Repository), where
publishers and listeners are registered and assembled together at
system start up.
At runtime an extension point requests all installed extensions from
the plugin repository and invokes them with a set of parameters to
process the results from the extensions.
Nutch comes with a set of core extension points (currently it is only
the content extractor extension point) that need at least one
extension for nutch to run.
Extensions are registered as listeners to the extension points by
using unique string keys.
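The registry idea described above can be sketched roughly as follows.
This is only an illustration, not the code from the patch: the names
ContentExtractor, SketchRepository, ExtractorPoint and the key
"example.contentExtractor" are all hypothetical, and the real
PluginRepository may look quite different.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical extension point contract; not the interface from the patch. */
interface ContentExtractor {
    /** Returns true if this extension can handle the given content type. */
    boolean accepts(String contentType);

    /** Returns the plain text extracted from the raw content. */
    String extract(String rawContent);
}

/** Hypothetical central registry: extension point key -> registered extensions. */
class SketchRepository {
    private final Map<String, List<ContentExtractor>> extensions = new LinkedHashMap<>();

    /** Registers an extension as a listener under a unique string key. */
    void register(String extensionPointKey, ContentExtractor extension) {
        extensions.computeIfAbsent(extensionPointKey, k -> new ArrayList<>()).add(extension);
    }

    /** Returns all extensions registered for the key (empty list if none). */
    List<ContentExtractor> getExtensions(String extensionPointKey) {
        return Collections.unmodifiableList(
            extensions.getOrDefault(extensionPointKey,
                Collections.<ContentExtractor>emptyList()));
    }
}

/** Hypothetical caller side: asks the registry at runtime and invokes a match. */
class ExtractorPoint {
    static final String KEY = "example.contentExtractor"; // assumed key

    static String extract(SketchRepository repo, String contentType, String raw) {
        for (ContentExtractor extension : repo.getExtensions(KEY)) {
            if (extension.accepts(contentType)) {
                return extension.extract(raw);
            }
        }
        throw new IllegalStateException("no extension registered for " + contentType);
    }
}

/** Trivial example extension: passes plain text through unchanged. */
class PlainTextExtractor implements ContentExtractor {
    public boolean accepts(String contentType) { return "text/plain".equals(contentType); }
    public String extract(String rawContent) { return rawContent; }
}

/** Trivial example extension: strips anything that looks like a tag. */
class TagStrippingExtractor implements ContentExtractor {
    public boolean accepts(String contentType) { return "text/html".equals(contentType); }
    public String extract(String rawContent) { return rawContent.replaceAll("<[^>]*>", ""); }
}

class RegistryDemo {
    public static void main(String[] args) {
        SketchRepository repo = new SketchRepository();
        repo.register(ExtractorPoint.KEY, new PlainTextExtractor());
        repo.register(ExtractorPoint.KEY, new TagStrippingExtractor());
        System.out.println(ExtractorPoint.extract(repo, "text/html", "<b>hi</b>"));
    }
}
```

The core only knows the interface and the string key; which
extensions answer is decided by whatever was registered at start up.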
An extension point is described by an interface that needs to be
implemented by the extension and an xml schema that describes the
attributes that need to be set up for the extension in the plugin
manifest file. Currently the xml schema is not validated by the
plugin-system and only serves to help developers.
Besides the extension point setup and the announcement of new
extension points, the plugin manifest file provides information about
the libraries required by the plugin.
It is possible to use libraries only internally within a plugin, or
to export them so that third-party plugins that depend on that plugin
can reuse the library.
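To make the manifest idea more concrete, here is a rough sketch of how
such a file could be read. Everything in it is an assumption for
illustration: the element and attribute names (plugin, library,
export, extension, point, class) are invented and not taken from the
patch, and the sketch uses the JDK's built-in DOM parser instead of
dom4j so that it runs without extra libraries.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class ManifestSketch {
    // Hypothetical manifest layout; all element and attribute names are invented.
    static final String MANIFEST =
        "<plugin id=\"example-extractor\">"
      + "<library name=\"internal.jar\"/>"
      + "<library name=\"shared.jar\" export=\"true\"/>"
      + "<extension point=\"example.contentExtractor\" class=\"example.RtfExtractor\"/>"
      + "</plugin>";

    /** Parses the manifest text with the JDK DOM parser (not dom4j). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("cannot parse manifest", e);
        }
    }

    /** Names of libraries marked as exported, i.e. reusable by dependent plugins. */
    static List<String> exportedLibraries(Document doc) {
        List<String> result = new ArrayList<>();
        NodeList libraries = doc.getElementsByTagName("library");
        for (int i = 0; i < libraries.getLength(); i++) {
            Element library = (Element) libraries.item(i);
            // getAttribute returns "" when the attribute is absent
            if ("true".equals(library.getAttribute("export"))) {
                result.add(library.getAttribute("name"));
            }
        }
        return result;
    }

    /** Keys of the extension points this plugin registers extensions for. */
    static List<String> extensionPointKeys(Document doc) {
        List<String> result = new ArrayList<>();
        NodeList extensions = doc.getElementsByTagName("extension");
        for (int i = 0; i < extensions.getLength(); i++) {
            result.add(((Element) extensions.item(i)).getAttribute("point"));
        }
        return result;
    }

    public static void main(String[] args) {
        Document doc = parse(MANIFEST);
        System.out.println("exported libraries: " + exportedLibraries(doc));
        System.out.println("extension points:   " + extensionPointKeys(doc));
    }
}
```

In this sketch a library without the export attribute stays private
to the plugin, which mirrors the internal/exported distinction above.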
How does the plugin-system work in detail?
The first and best information source is the code. This is the
standard answer you get in most open source projects, and I am
happy to give it today as well. ;)
The best place to start is to browse the code of the
content extractor plugin that comes with this patch. After that you
can browse the code of TestPluginSystem.java and
PluginRepository.java.
There is a paper about the ideas behind the plugin system from some
months ago, which I wrote together with Rathna Prabhu Rajendran
and Antonio Gullì. It can be found here:
http://www.media-style.com/index.jsp?folderPK=422
Since the nutch-plugin-system is strongly inspired by the eclipse
plugin system vocabulary (but does not depend on it and is a
very different implementation), a good read is the "Notes on
the Eclipse Plug-in Architecture":
http://eclipse.org/articles/Article-Plug-in-architecture/plugin_architecture.html
Happy hacking, and help to make information free.