On Tue, May 18, 2004 at 11:47:22PM +0200, Stefan Groschupf wrote:
> > Besides text stripping, my patch provides new capabilities/mechanisms
> > at indexing stage and in search output.
> >
> > As in its current state, Stefan's plugin does text stripping only.
>
> My patch is about a plugin system; the content-extractor stuff was a
> giveaway, since Doug mentioned months ago that we would only take the
> plugin system with a default plugin and a build system. ;-)
> The good thing about a plugin system is that if you don't like my
> plugin, you can write your own plugin. ;-]
> In case Byron doesn't like your plugin and doesn't like my plugin, he
> writes his own plugin, and so on and so on.
> That is flexibility. We don't need to spend the next 3 weeks discussing
> the best way to extract content; it always depends on the needs of the
> user.
> However, we have to find an extension point that allows for all the
> possibilities the different plugins need.
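(For concreteness, an extension point of the kind Stefan describes might look roughly like the sketch below. Every name in it is hypothetical and illustrative only, not Nutch's actual plugin API: one interface per kind of pluggable behavior, plus a default plugin shipped alongside it.)

```java
// Hypothetical sketch of an extension point with a default plugin.
// None of these names are Nutch's real API.

interface ContentExtractor {
    // Extension point: turn raw fetched bytes into indexable text.
    String extractText(byte[] content, String contentType);
}

// The default plugin: pass plain text through unchanged. Other plugins
// (PDF, Word, ...) would implement the same interface.
class PlainTextExtractor implements ContentExtractor {
    public String extractText(byte[] content, String contentType) {
        return new String(content);
    }
}

public class ExtensionPointDemo {
    public static void main(String[] args) {
        ContentExtractor extractor = new PlainTextExtractor();
        System.out.println(extractor.extractText("hello".getBytes(), "text/plain"));
    }
}
```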
Hi Stefan,

I do not doubt what a plugin system can offer in general. I am just not
sure it needs to be in the current crawler. I would prefer (per one of
Doug's suggestions) that the outlink-extraction and text-stripping
functionality be removed from the current crawler and a new tool created
for outlink extraction, possibly text stripping, and some other related
things. This new tool could do things in plugin style.

> As you mentioned, we should separate it. At first I think it is useful
> to have pluggable content extractors for different file formats.
> Furthermore, it would be good to have a pluggable crawler, with the web
> crawler being just one plugin.

Is your current plugin patch for the content extractor or for the
crawler? I would consider these two plugins different, at least in
implementation. To me, a "plugin" is basically a dynamic class loading
scheme. Any tool can have one; IndexSimple.java in my patch uses it too.

> As you may not know, most of the world's information is available
> behind z39.50 and OAI interfaces.

I wouldn't say "most". There are many other sources out there.

> http://www.niso.org/z39.50/z3950.html
> http://www.openarchives.org

I have never dealt with them before. After a quick look, I have these
questions: (1) shouldn't z39.50 be handled in parallel with http and ftp
under ./net/nutch/protocols/? (2) OAI seems to be implemented on top of
http; isn't subclassing the http client a logical way to do it? Correct
me if I am wrong.

A mailing list is very inefficient for this kind of discussion. We would
probably understand each other in 10 minutes if talking face to face.

> > Stefan: Your earlier message mentioned that you may want to use the
> > plugin system to do some magic stuff. Could you be more specific?
> > Must it be done in the crawler?
>
> Sorry, there is a misunderstanding. I was just saying I wish to write
> a url-normalizer with some "magic" logic inside as a plugin.
> (IP database, DNS lookup, text classification)

All done through one plugin in the crawler? Probably not.

Could we summarize what we have discussed so far? Doug, will you take up
this task? Maybe we should separate near-term tasks from long-term
dreams. Do we at least agree that a separate tool (for outlink
extraction and text stripping) needs to be created? This would leave
Fetcher.java or RequestScheduler.java for crawling only.

John
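(To illustrate the two ideas above, plugin-as-dynamic-class-loading and Stefan's url-normalizer, here is a minimal sketch. Every interface and class name in it is hypothetical, not Nutch's real API, and the normalization rules are just examples; Stefan's "magic" steps would replace or extend them.)

```java
// Hypothetical sketch: a url-normalizer plugin located by dynamic class
// loading. Neither the interface nor the class names are Nutch's real API.
import java.net.MalformedURLException;
import java.net.URL;

interface UrlNormalizer {
    String normalize(String url) throws MalformedURLException;
}

// One example plugin: lower-case the host and drop default ports. The
// "magic" steps Stefan mentions (IP database, DNS lookup) would slot in here.
class BasicUrlNormalizer implements UrlNormalizer {
    public String normalize(String urlString) throws MalformedURLException {
        URL u = new URL(urlString);
        StringBuilder sb = new StringBuilder();
        sb.append(u.getProtocol()).append("://").append(u.getHost().toLowerCase());
        if (u.getPort() != -1 && u.getPort() != u.getDefaultPort()) {
            sb.append(':').append(u.getPort());
        }
        sb.append(u.getFile().isEmpty() ? "/" : u.getFile());
        return sb.toString();
    }
}

public class NormalizerPluginDemo {
    // "Plugin" as dynamic class loading: the implementation is chosen by
    // name (e.g. from a config file), so no core code changes are needed.
    static UrlNormalizer load(String className) throws Exception {
        return (UrlNormalizer) Class.forName(className)
                .getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        UrlNormalizer n = load("BasicUrlNormalizer");
        // prints http://www.example.com/Index.html
        System.out.println(n.normalize("HTTP://WWW.Example.COM:80/Index.html"));
    }
}
```

The point of the reflective `load` step is that a crawler (or any other tool) only ever depends on the interface; which normalizer runs is a deployment decision, which is why the same scheme fits both the crawler and a separate extraction tool.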
