Hi John,
Hi Stephen, let me answer this mail, in case you were addressing me. ;)
A great idea. Personally, I would be happy if we could extract / generate flexible metadata during content extraction, for example by text classification.
I do not doubt what a plugin system can offer in general.
However, I am not sure it needs to be in the current crawler.
I would prefer (as one of Doug's suggestions) that the outlink
extraction & text stripping functionality
be removed from the current crawler and a new tool created
for outlink extraction, possibly text stripping, and some other related things.
Great, how can I help? This new tool can do things in a plugin style.
My current patch adds the plugin mechanism and demonstrates it by making content extraction pluggable, so it is easy to add more content extractors.
Regarding your current plugin patch: is it for the content extractor or for the crawler?
I would consider these two plugins to be different, at least
in implementation.
But again, I like the concept of a new tool that uses the plugin mechanism.
Right, at least it is, but in a smarter way: the plugin system includes life-cycle management, its own class loading, instance exchange, this kind of listener / publisher pattern, and so on. To me, a "plugin" is basically a dynamic class loading scheme. Any tool can have it. IndexSimple.java in my patch uses it too.
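To illustrate the "dynamic class loading scheme" point, the core of such a plugin mechanism in Java boils down to something like the following. This is a minimal sketch; the `ContentExtractor` interface and `SimpleTextExtractor` class are hypothetical names for illustration, not from the actual patch:

```java
// Minimal sketch of a dynamic-class-loading plugin scheme.
// ContentExtractor and SimpleTextExtractor are hypothetical names,
// not from the actual Nutch patch.

interface ContentExtractor {
  String extract(String rawContent);
}

class SimpleTextExtractor implements ContentExtractor {
  public String extract(String rawContent) {
    // Trivial "extraction": strip angle-bracket tags.
    return rawContent.replaceAll("<[^>]*>", "");
  }
}

public class PluginLoader {
  // Load a plugin implementation by class name at runtime,
  // so new extractors can be added without touching the tool itself.
  public static ContentExtractor load(String className) throws Exception {
    Class<?> clazz = Class.forName(className);
    return (ContentExtractor) clazz.getDeclaredConstructor().newInstance();
  }

  public static void main(String[] args) throws Exception {
    ContentExtractor extractor = load("SimpleTextExtractor");
    System.out.println(extractor.extract("<p>hello</p>"));
  }
}
```

A full plugin system would add the life-cycle management and listener wiring on top of this, but the class-name-to-instance step is the essential trick.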
If you go to a university library, they will tell you that.
I wouldn't say "most". There are many others out there.
(1) shouldn't z3950 be handled in parallel with http and ftp under ./net/nutch/protocols/?
(2) oai seems to be currently implemented on top of http, isn't subclassing http client a logical way to do it?
We don't need to discuss these issues now, but those are interfaces for querying digital archives. I have to learn more about that as well, but I got a hint from a university library employee.
Right! ;-) If you pay for the flight, I'll come visit you for a week of code hacking. ;-] Correct me if I am wrong.
A mailing list is very inefficient for this kind of discussion. We might well understand each other in 10 minutes if we talked face to face.
Seriously: I have a phone, surprise surprise, and we can do web conferences, maybe with more people who are interested.
Then we can drink a virtual beer together and discuss some nutch issues.
I think good communication is the basis for a successful software development process.
We still misunderstand each other. I have to improve my communication, sorry. Just delete the magic things and plugins from the database.
Stephan: Your earlier message mentioned that you may want to use the plugin system to do some magic stuff. Could you be more specific? Must it be done in the crawler?
Sorry, there is a misunderstanding. I was just saying I wish to write a URL normalizer with some "magic" logic inside as a plugin (IP database, DNS lookup, text classification).
All done through one plugin in the crawler? Probably not.
I wish to implement a URL filter that only accepts regional web pages, using more than just regular expressions. It's a new project and has nothing to do with plugins.
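As a strawman for what "more than just regular expressions" could mean, a regional URL filter might combine a pattern check with a host-based rule. Everything below is a hypothetical sketch, not the actual design; a real version might consult an IP database or a text classifier instead, as mentioned above:

```java
import java.net.URL;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical sketch of a regional URL filter: a URL is accepted only
// if it matches a basic pattern AND its host ends with an allowed
// country-code TLD. The class and method names are illustrative.
public class RegionalUrlFilter {
  private static final Pattern HTTP = Pattern.compile("^https?://.*");
  private final Set<String> allowedTlds;

  public RegionalUrlFilter(Set<String> allowedTlds) {
    this.allowedTlds = allowedTlds;
  }

  public boolean accept(String url) {
    if (!HTTP.matcher(url).matches()) return false;
    try {
      String host = new URL(url).getHost();
      int dot = host.lastIndexOf('.');
      return dot >= 0 && allowedTlds.contains(host.substring(dot + 1));
    } catch (Exception e) {
      return false; // unparsable URLs are rejected
    }
  }

  public static void main(String[] args) {
    RegionalUrlFilter filter = new RegionalUrlFilter(Set.of("de", "at", "ch"));
    System.out.println(filter.accept("http://www.example.de/page"));  // true
    System.out.println(filter.accept("http://www.example.com/page")); // false
  }
}
```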
I agree. Do you have a design idea? Maybe I can spend some time coding on the weekend.
Could we summarize what we have discussed so far? Doug, would you take up this task?
Maybe we should separate near-term tasks and long-term dreams.
Do we at least agree that a separate tool (for outlink extraction and text stripping) needs to be created? This leaves Fetcher.java or RequestScheduler.java for crawling only.
At least UML, just for communication (not MDA), I have sometimes found useful.
Ste*FAN*
_______________________________________________ Nutch-developers mailing list https://lists.sourceforge.net/lists/listinfo/nutch-developers
