Besides text stripping, my patch provides new capabilities and mechanisms at the indexing stage and in the search output.

In its current state, Stefan's plugin does text stripping only.

My patch is about a plugin system; the content extractor stuff was just a giveaway. As Doug mentioned months ago, we will only take the plugin system together with a default plugin and a build system. ;-)
The good thing about a plugin system is: if you don't like my plugin, write your own plugin. ;-]
In case Byron doesn't like your plugin and doesn't like my plugin, he writes his own plugin, and so on and so on.
That is flexibility. We don't need to discuss for the next 3 weeks what the best way to extract content is, since in the end it always depends on the needs of the user.
However, we have to find an extension point that allows all the possibilities the different plugins need.
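To make the discussion concrete, here is a minimal sketch of what such an extension point could look like, with a trivial default plugin along the lines Doug suggested. The interface and method names are my own illustration, not existing Nutch code:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical extension point: one implementation per content type. */
interface ContentExtractor {
    /** MIME types this plugin can handle, e.g. "text/html". */
    String[] getContentTypes();

    /** Extract plain text from the raw fetched bytes. */
    String extractText(byte[] content);

    /** Extract metadata (title, author, ...) for indexing. */
    Map<String, String> extractMetadata(byte[] content);
}

/** Trivial default plugin for plain text. */
class PlainTextExtractor implements ContentExtractor {
    public String[] getContentTypes() {
        return new String[] { "text/plain" };
    }

    public String extractText(byte[] content) {
        return new String(content);
    }

    public Map<String, String> extractMetadata(byte[] content) {
        Map<String, String> meta = new HashMap<String, String>();
        meta.put("Content-Length", String.valueOf(content.length));
        return meta;
    }
}
```

A plugin for PDF, HTML, or OAI records would simply be another implementation registered for its content types.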


I very much like the idea of extracting metadata, since I will need that very soon as well, so let us just bring it all together. ;-)
A short excursion:
The head of a United Nations anti-hunger program once said something very interesting:
Bringing things together is better than uniting them, since during the uniting a lot of important things get lost because one party is stronger.






However I am in favor of unix way: a tool should only
do one task and do it well. The crawlers (Fetcher.java and
RequestScheduler.java) need only concern themselves with going out
to fetch urls.
Joon,
I agree with you, since a separation would allow Nutch to scale in a more fine-grained way.



Currently they do text stripping (on text/html),
mostly for the purpose of outlink extraction. Since there are only
a few file formats that have meaningful amount of embedded links worth
harvesting, the benefit of having a full-blown plugin system in crawler
(for the sole purpose of outlink extraction) is not that great.

Two things:
I don't agree that only a few file formats can contain interesting outlinks.
The "simple" content extractor plugins extract links via regular expressions from any kind of text.
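For illustration, such a regex-based extractor needs nothing format-specific; this sketch (class and method names are hypothetical, and the pattern is deliberately simple) pulls http urls out of arbitrary text:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Pull http(s) urls out of arbitrary text with a regular expression. */
class RegexLinkExtractor {
    // Deliberately simple pattern: scheme, host, optional path/query.
    private static final Pattern URL =
        Pattern.compile("https?://[\\w.-]+(?:/[\\w./?&=%-]*)?");

    static List<String> extractLinks(String text) {
        List<String> links = new ArrayList<String>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }
}
```

It works just as well on mail bodies, log files, or PDF-extracted text as on HTML.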


As you mentioned, we should separate it. First, I think it is useful to have pluggable content extractors for different file formats.
Furthermore, it would be good to have a pluggable crawler where the web crawler is just one plugin.
As you may not know, most of the world's information is available behind Z39.50 and OAI interfaces.


http://www.niso.org/z39.50/z3950.html
http://www.openarchives.org
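A pluggable crawler along these lines could dispatch each url to a protocol plugin, so that an http, oai, or z3950 fetcher would each be just one plugin. This is only a sketch with hypothetical names, not an actual design:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: each fetch protocol is a plugin. */
interface ProtocolPlugin {
    /** Url scheme this plugin handles, e.g. "http", "oai", "z3950". */
    String getScheme();

    byte[] fetch(String url);
}

/** Registry that dispatches a url to the plugin for its scheme. */
class ProtocolRegistry {
    private final Map<String, ProtocolPlugin> plugins =
        new HashMap<String, ProtocolPlugin>();

    void register(ProtocolPlugin p) {
        plugins.put(p.getScheme(), p);
    }

    byte[] fetch(String url) {
        String scheme = url.substring(0, url.indexOf(':'));
        ProtocolPlugin p = plugins.get(scheme);
        if (p == null) {
            throw new IllegalArgumentException("no plugin for " + scheme);
        }
        return p.fetch(url);
    }
}
```

The scheduler would only talk to the registry and never need to know which protocols are installed.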

Let me come back to your unix tools metaphor. Please think this metaphor through to the end: all unix tools do only one thing, and each is a plugin to the unix kernel.
Every tool has at least one extension point, its output, and may have an extension, its input; you can assemble them using pipes.
How many compression tools are available for unix, how many different file formats? Isn't it this kind of flexibility that makes unix so powerful and stable?


I think we are on the same page, just use a different vocabulary.

Stephan: Your earlier message mentioned that you may want to
use the plugin system to do some magic stuff. Could you be more
specific? Must it be done in the crawler?

Sorry, there is a misunderstanding. I was just saying that I wish to write a url-normalizer with some "magic" logic inside as a plugin (IP database, DNS lookup, text classification).
The plugin system itself isn't magic stuff. It has been proven for years, for example in the unix architecture.
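To show what I mean, here is a sketch of such a url-normalizer extension point with one simple implementation (lowercased host, default port dropped). The names and logic are illustrative only; the "magic" (IP database, DNS lookup, classification) would live in other implementations:

```java
/** Hypothetical url-normalizer extension point. */
interface UrlNormalizer {
    String normalize(String url);
}

/** Simple example plugin: lowercase the host, drop a default port. */
class BasicUrlNormalizer implements UrlNormalizer {
    public String normalize(String url) {
        try {
            java.net.URL u = new java.net.URL(url);
            String host = u.getHost().toLowerCase();
            int port = u.getPort();
            // Omit the port when it is absent or the scheme's default.
            String portPart =
                (port == -1 || port == u.getDefaultPort()) ? "" : ":" + port;
            String path = u.getPath().length() == 0 ? "/" : u.getPath();
            String query = (u.getQuery() == null) ? "" : "?" + u.getQuery();
            return u.getProtocol() + "://" + host + portPart + path + query;
        } catch (java.net.MalformedURLException e) {
            return url; // leave unparseable urls untouched
        }
    }
}
```

A "magic" implementation would plug in at the same extension point without the crawler noticing the difference.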




Best,
Stefan



_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
