Hi John,
Hi Stephen, let me answer this mail, in case you were addressing me. ;)
A great idea. Personally, I would be happy if we could extract / generate flexible metadata during content extraction, for example by text classification.
I do not doubt what a plugin system can offer in general.
However, I am not sure it needs to be in the current crawler.
I would prefer (as one of Doug's suggestions) that the outlink
extraction & text stripping functionality
be removed from the current crawler and a new tool created
for outlink extraction, possibly text stripping, and some other related things.
Great, how can I help? This new tool can do things in a plugin style.
My current patch adds the plugin mechanism and demonstrates it by making content extraction pluggable, so it is easy to add more content extractors.
Regarding your current plugin patch: is it for the content extractor or for the crawler?
I would consider these two plugins to be different, at least
in implementation.
But again, I like the concept of a new tool that uses the plugin mechanism.
Right, at least it is, but in a smarter way: the plugin system includes life-cycle management, its own class loading, instance exchange, this kind of listener / publisher pattern, and so on. To me, a "plugin" is basically a dynamic class loading scheme. Any tool can have it. IndexSimple.java in my patch uses it too.
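To illustrate the "dynamic class loading scheme" point, the core of such a plugin mechanism in Java boils down to something like the following. This is a minimal sketch; the `ContentExtractor` interface and `SimpleTextExtractor` class are hypothetical names for illustration, not from the actual patch:

```java
// Minimal sketch of a dynamic-class-loading plugin scheme.
// ContentExtractor and SimpleTextExtractor are hypothetical names,
// not from the actual Nutch patch.

interface ContentExtractor {
  String extract(String rawContent);
}

class SimpleTextExtractor implements ContentExtractor {
  public String extract(String rawContent) {
    // Trivial "extraction": strip angle-bracket tags.
    return rawContent.replaceAll("<[^>]*>", "");
  }
}

public class PluginLoader {
  // Load a plugin implementation by class name at runtime,
  // so new extractors can be added without touching the tool itself.
  public static ContentExtractor load(String className) throws Exception {
    Class<?> clazz = Class.forName(className);
    return (ContentExtractor) clazz.getDeclaredConstructor().newInstance();
  }

  public static void main(String[] args) throws Exception {
    ContentExtractor extractor = load("SimpleTextExtractor");
    System.out.println(extractor.extract("<p>hello</p>"));
  }
}
```

A full plugin system would add the life-cycle management and listener wiring on top of this, but the class-name-to-instance step is the essential trick.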
If you go to a university library, they will tell you that.
I wouldn't say "most". There are many others out there.
(1) shouldn't z3950 be handled in parallel with http and ftp under ./net/nutch/protocols/?
(2) oai seems to be currently implemented on top of http, isn't subclassing http client a logical way to do it?
We don't need to discuss these issues now, but those are interfaces for querying digital archives. I have to learn more about that as well, but I got a hint from a university library employee.
Right! ;-) If you pay for the flight, I'll come visit you for a week of code hacking. ;-] Correct me if I am wrong.
A mailing list is very inefficient for this kind of discussion. We might well understand each other in 10 minutes if we talked face to face.
Seriously: I have a phone, surprise surprise, and we can do web conferences, maybe with more people who are interested.
Then we can drink a virtual beer together and discuss some nutch issues.
I think good communication is the basis for a successful software development process.
We still misunderstand each other. I have to improve my communication, sorry. Just delete the magic things and plugins from the database.
Stephan: Your earlier message mentioned that you may want to use the plugin system to do some magic stuff. Could you be more specific? Must it be done in the crawler?
Sorry, there is a misunderstanding. I was just saying I wish to write a URL normalizer with some "magic" logic inside as a plugin (IP database, DNS lookup, text classification).
All done through one plugin in the crawler? Probably not.
I wish to implement a URL filter that only accepts regional web pages, using more than just regular expressions. It's a new project and has nothing to do with plugins.
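As a strawman for what "more than just regular expressions" could mean, a regional URL filter might combine a pattern check with a host-based rule. Everything below is a hypothetical sketch, not the actual design; a real version might consult an IP database or a text classifier instead, as mentioned above:

```java
import java.net.URL;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical sketch of a regional URL filter: a URL is accepted only
// if it matches a basic pattern AND its host ends with an allowed
// country-code TLD. The class and method names are illustrative.
public class RegionalUrlFilter {
  private static final Pattern HTTP = Pattern.compile("^https?://.*");
  private final Set<String> allowedTlds;

  public RegionalUrlFilter(Set<String> allowedTlds) {
    this.allowedTlds = allowedTlds;
  }

  public boolean accept(String url) {
    if (!HTTP.matcher(url).matches()) return false;
    try {
      String host = new URL(url).getHost();
      int dot = host.lastIndexOf('.');
      return dot >= 0 && allowedTlds.contains(host.substring(dot + 1));
    } catch (Exception e) {
      return false; // unparsable URLs are rejected
    }
  }

  public static void main(String[] args) {
    RegionalUrlFilter filter = new RegionalUrlFilter(Set.of("de", "at", "ch"));
    System.out.println(filter.accept("http://www.example.de/page"));  // true
    System.out.println(filter.accept("http://www.example.com/page")); // false
  }
}
```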
I agree. Do you have a design idea? Maybe I can spend some time coding on the weekend.
Could we summarize what we have discussed so far? Doug, would you take up this task?
Maybe we should separate near-term tasks and long-term dreams.
Do we at least agree that a separate tool (for outlink extraction and text stripping) needs to be created? This leaves Fetcher.java or RequestScheduler.java for crawling only.
At least UML, just for communication (not MDA), I have sometimes found useful.
Ste*FAN*
_______________________________________________ Nutch-developers mailing list https://lists.sourceforge.net/lists/listinfo/nutch-developers
