Bugs item #954964, was opened at 2004-05-16 20:38
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=491356&aid=954964&group_id=59548

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Stefan Groschupf (joa23)
Assigned to: Nobody/Anonymous (nobody)
Summary: nutch plugin system

Initial Comment:
The patch contains: 

1.) an about.txt (see text below)
2.) the nutch plugin system code
3.) the required updated ant build script
4.) a content extractor plugin with
4.a.) html content extractor
4.b.) text content extractor
4.c.) rtf content extractor
4.d.) SWF content extractor (I _think_ nutch is the first search 
engine that can parse flash movies)

5.) The required refactored code of fetcher and outputthread to 
use the content extractor extension point
6.) A new FetchHandler object that contains the previously 
duplicated handleFetch methods and is used by fetcher and 
outputthread.
7.) the dom4j.jar library that is required by the plugin system and 
should be copied to the nutch/lib folder.

All new code comes with working unit tests, and both crawling 
tutorials still work without any problem.


===================================
Here is the text of the about.txt file:
===================================

Stefan Groschupf
www.media-style.com


This patch contains the long-discussed plugin-system for nutch and 
the first nutch plugin. 


Why a plugin-system?                                  

The idea behind the plugin-system is to provide more flexibility, 
more flexibility and more flexibility to the nutch software 
architecture.

With the help of the nutch plugin-system it is easily possible to 
plug in or replace your own content extractors, summary algorithms, 
file systems, URL filters and so on.

The nutch plugin-system lets developers concentrate on their own 
implementation without having to care about how to integrate their 
code into the core of nutch.
Developers do not have to care about library version conflicts or 
license limitations, since a plugin is an encapsulated component and 
does not need to be open source. 
For users the plugin-system provides the possibility to assemble the 
best search engine for their needs and to easily evaluate different 
functionality just by changing plugins. 
    


How does the plugin-system work in general?

The plugin-system allows the nutch core functionality to be extended 
at defined points (extension points) with custom, easy-to-create 
components called extensions. Custom code, the extension 
implementations and an XML manifest descriptor file are bundled 
together into a plugin.
 
A plugin can contain one or more extensions for one or more 
extension points.
Furthermore, a plugin can provide its own extension points, where 
other plugins can register extensions as well.

The nutch plugin system is a kind of publisher-listener pattern with 
a central registry tree (the so-called Plugin-Repository), where 
publishers and listeners are registered and assembled together at 
system start-up.

At runtime an extension point requests all installed extensions from 
the plugin repository and invokes them with a set of parameters in 
order to process the results returned by the extensions. 
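
The registry idea described above can be illustrated with a small 
sketch. This is not the code from the patch: the names Extension, 
RepositorySketch and ExtensionPointSketch are invented for this 
example, and the real PluginRepository may be organized quite 
differently. It only shows the two phases, registration at start-up 
and lookup plus invocation at runtime.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical contract an extension implements; not the interface from the patch. */
interface Extension {
    String process(String input);
}

/** Hypothetical stand-in for the central registry described above. */
class RepositorySketch {
    private final Map<String, List<Extension>> extensionsByPoint = new HashMap<>();

    /** Start-up phase: attach an extension to the extension point with the given id. */
    void register(String pointId, Extension extension) {
        extensionsByPoint.computeIfAbsent(pointId, k -> new ArrayList<>()).add(extension);
    }

    /** Runtime phase: return the extensions installed for a point (possibly none). */
    List<Extension> extensionsFor(String pointId) {
        List<Extension> found = extensionsByPoint.get(pointId);
        return found == null ? Collections.<Extension>emptyList() : found;
    }
}

/** An extension point asks the repository for its extensions and invokes each one. */
class ExtensionPointSketch {
    private final String id;
    private final RepositorySketch repository;

    ExtensionPointSketch(String id, RepositorySketch repository) {
        this.id = id;
        this.repository = repository;
    }

    /** Invokes every installed extension with the same input and collects the results. */
    List<String> invokeAll(String input) {
        List<String> results = new ArrayList<>();
        for (Extension extension : repository.extensionsFor(id)) {
            results.add(extension.process(input));
        }
        return results;
    }
}
```

In this sketch an extension point without any registered extension 
simply gets an empty list back; how the real system reacts to a core 
extension point without extensions is not shown here.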

Nutch comes with a set of core extension points (currently only the 
content extractor extension point), each of which needs at least one 
extension for nutch to run. 


Extensions are registered as listeners to the extension points by 
using unique string keys.
An extension point is described by an interface that needs to be 
implemented by the extension and an XML schema that describes the 
attributes that need to be set up for the extension in the plugin 
manifest file. Currently the XML schema is not validated by the 
plugin-system; it is only there to help developers.
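
To make the relation between interface, unique key and manifest 
attributes more concrete, here is a rough sketch. All names in it 
(ContentExtractorSketch, ExtensionEntrySketch, the "contentType" 
attribute and so on) are assumptions made up for this example and are 
not taken from the patch; the attributes are modelled as a plain 
string map, as if they had already been read from the manifest file.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical extension point interface; the one in the patch may look different. */
interface ContentExtractorSketch {
    String extractText(byte[] rawContent);
}

/** Trivial extension: treats the fetched bytes as UTF-8 plain text. */
class PlainTextExtractorSketch implements ContentExtractorSketch {
    @Override
    public String extractText(byte[] rawContent) {
        return new String(rawContent, StandardCharsets.UTF_8);
    }
}

/** What a manifest entry might carry for one extension: a unique key plus attributes. */
class ExtensionEntrySketch {
    final String key;
    final Map<String, String> attributes;
    final ContentExtractorSketch implementation;

    ExtensionEntrySketch(String key, Map<String, String> attributes,
                         ContentExtractorSketch implementation) {
        this.key = key;
        this.attributes = attributes;
        this.implementation = implementation;
    }
}

/** Keeps extensions under their unique string keys. */
class ExtractorRegistrySketch {
    private final Map<String, ExtensionEntrySketch> entriesByKey = new HashMap<>();

    /** Registers an entry and rejects a key that is already in use. */
    void register(ExtensionEntrySketch entry) {
        if (entriesByKey.containsKey(entry.key)) {
            throw new IllegalArgumentException("duplicate extension key: " + entry.key);
        }
        entriesByKey.put(entry.key, entry);
    }

    /** Returns an extension whose assumed "contentType" attribute matches, or null. */
    ContentExtractorSketch forContentType(String contentType) {
        for (ExtensionEntrySketch entry : entriesByKey.values()) {
            if (contentType.equals(entry.attributes.get("contentType"))) {
                return entry.implementation;
            }
        }
        return null;
    }
}
```

Rejecting duplicate keys is a design choice of this sketch, made 
because the keys are meant to be unique.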

Besides the extension point setup and the announcement of new 
extension points, the plugin manifest file provides information 
about the libraries required by the plugin. 
It is possible to use libraries only internally within your own 
plugin, or to export them so that third-party plugins that depend on 
your plugin can reuse them.
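
The difference between internal and exported libraries can be 
sketched as a simple visibility rule. The class below is purely 
illustrative: PluginLibsSketch and its fields are invented names, 
libraries are represented only by their file names, and the 
assumption that only the exports of directly required plugins are 
visible (no transitive lookup) is mine, not something this patch 
description states.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of a plugin's library declarations in the manifest. */
class PluginLibsSketch {
    final String id;
    /** Libraries used only inside this plugin. */
    final Set<String> privateLibraries = new LinkedHashSet<>();
    /** Libraries this plugin makes available to plugins that depend on it. */
    final Set<String> exportedLibraries = new LinkedHashSet<>();
    /** Plugins this plugin depends on. */
    final List<PluginLibsSketch> requiredPlugins = new ArrayList<>();

    PluginLibsSketch(String id) {
        this.id = id;
    }

    /**
     * Libraries this plugin can see: all of its own, plus the exported
     * libraries of the plugins it directly requires (assumed rule).
     */
    Set<String> visibleLibraries() {
        Set<String> visible = new LinkedHashSet<>(privateLibraries);
        visible.addAll(exportedLibraries);
        for (PluginLibsSketch required : requiredPlugins) {
            visible.addAll(required.exportedLibraries);
        }
        return visible;
    }
}
```

With a plugin that exports one library and keeps another private, a 
third-party plugin that requires it would see its own libraries and 
the exported one, but not the private one.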


How does the plugin-system work in detail?

The first and best source of information is the code. This is the 
standard answer you get in almost all open source projects, and I am 
happy to give it today as well. ;)
The best place to start is to browse the code of the content 
extractor plugin that comes with this patch. After that you can 
browse the code of TestPluginSystem.java and PluginRepository.java.

There is a paper about the ideas behind the plugin system from some 
months ago, which I wrote together with Rathna Prabhu Rajendran 
and Antonio Gullì. It can be found here:
http://www.media-style.com/index.jsp?folderPK=422
Since the nutch plugin-system is strongly inspired by the vocabulary 
of the eclipse plugin system (but does not depend on it and is a 
very different implementation), a good read is the "Notes on 
the Eclipse Plug-in Architecture":
http://eclipse.org/articles/Article-Plug-in-architecture/plugin_architecture.html


Happy hacking, and help to make information free.

