Re: [Nutch-dev] code for index content of mime type beyond text/html

Doug Cutting Tue, 18 May 2004 13:39:46 -0700

[EMAIL PROTECTED] wrote:

Besides text stripping, my patch provides new capabilities/mechanisms
at the indexing stage and in search output.

In its current state, Stefan's plugin does text stripping only.

For the text-stripping part, I don't think there is a total conflict.
His handles the content analysis on the fly.
Mine does that at a later stage, with the support of meta info saved
in FetcherOutput.


John,

I like your metadata stuff, and don't want to lose that. However, you make the architectural assumption that only HTML contains links, while other formats, e.g. PDF, MS Word, and even plain text, can contain them too.

So if we only want to parse a page once, then I think we need to either do all of the metadata and link extraction at fetch time, or have the fetcher just store the raw content, then do parsing in a separate pass.

In either case, I think we need a single interface that combines your Textable with Stefan's IContentExtractor. In particular, I think it should look something like:

public interface ParsedContentFactory {
  ParsedContent getParsedContent(FetchListEntry fle, Response response,
                                 URL base);
}

public interface ParsedContent {
  String getText();
  MetaData getMetaData();
  Outlink[] getOutlinks();
}

These will be implemented for each content type.
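
To make the per-content-type idea concrete, here is a rough sketch of what a plain-text implementation might look like. Only the ParsedContent interface comes from the proposal above; Outlink and MetaData are minimal stand-ins (their real shape isn't settled in this thread), and PlainTextParsedContent, the "contentType" key, and the URL regex are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in for the proposed Outlink type (assumption: it carries a target URL).
final class Outlink {
  final String toUrl;
  Outlink(String toUrl) { this.toUrl = toUrl; }
}

// Stand-in for the proposed MetaData type: a simple name-to-value map.
final class MetaData {
  final Map<String, String> values = new LinkedHashMap<>();
}

// The interface proposed above, repeated so the sketch compiles on its own.
interface ParsedContent {
  String getText();
  MetaData getMetaData();
  Outlink[] getOutlinks();
}

// Hypothetical plain-text implementation: the text is the content itself,
// and outlinks are any http(s) URLs that appear in it.
final class PlainTextParsedContent implements ParsedContent {
  // Crude URL matcher; it stops at whitespace, quotes, and angle brackets,
  // so trailing punctuation would be kept as part of the URL.
  private static final Pattern URL_PATTERN =
      Pattern.compile("https?://[^\\s<>\"]+");

  private final String text;
  private final MetaData metaData = new MetaData();
  private final Outlink[] outlinks;

  PlainTextParsedContent(String text) {
    this.text = text;
    metaData.values.put("contentType", "text/plain");

    // Collect every URL-looking token as an outlink.
    List<Outlink> found = new ArrayList<>();
    Matcher m = URL_PATTERN.matcher(text);
    while (m.find()) {
      found.add(new Outlink(m.group()));
    }
    this.outlinks = found.toArray(new Outlink[0]);
  }

  public String getText() { return text; }
  public MetaData getMetaData() { return metaData; }
  public Outlink[] getOutlinks() { return outlinks; }
}
```

A PDF or MS Word implementation would expose the same three methods, so the fetcher or a later parsing pass wouldn't need to know which content type it is handling.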

Then we need an extension point like:

public interface DocumentFactory {
  Document getDocument(String segment, long doc,
                       FetcherOutput fo, FetcherText text);
}

A base implementation would index all of the standard fields, and subclasses could index other metadata.
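
Roughly, the base/subclass split could look like the sketch below. To keep it self-contained, Document here is a tiny stand-in for Lucene's Document, and FetcherOutput and FetcherText are stand-ins with assumed fields; the field names ("segment", "docNo", "url", "content", "author") and both factory class names are hypothetical, not existing Nutch code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in for Lucene's Document: an ordered field-name to value map.
final class Document {
  final Map<String, String> fields = new LinkedHashMap<>();
  void add(String name, String value) { fields.put(name, value); }
}

// Stand-in for FetcherOutput (assumption: it exposes a URL and saved meta info).
final class FetcherOutput {
  final String url;
  final Map<String, String> metaData;
  FetcherOutput(String url, Map<String, String> metaData) {
    this.url = url;
    this.metaData = metaData;
  }
}

// Stand-in for FetcherText (assumption: it wraps the stripped text).
final class FetcherText {
  final String text;
  FetcherText(String text) { this.text = text; }
}

// The extension point proposed above.
interface DocumentFactory {
  Document getDocument(String segment, long doc,
                       FetcherOutput fo, FetcherText text);
}

// Base implementation: indexes only the standard fields.
class BasicDocumentFactory implements DocumentFactory {
  public Document getDocument(String segment, long doc,
                              FetcherOutput fo, FetcherText text) {
    Document d = new Document();
    d.add("segment", segment);
    d.add("docNo", Long.toString(doc));
    d.add("url", fo.url);
    d.add("content", text.text);
    return d;
  }
}

// Subclass: reuses the standard fields and adds one piece of metadata.
class AuthorDocumentFactory extends BasicDocumentFactory {
  @Override
  public Document getDocument(String segment, long doc,
                              FetcherOutput fo, FetcherText text) {
    Document d = super.getDocument(segment, doc, fo, text);
    String author = fo.metaData.get("author");
    if (author != null) {
      d.add("author", author);  // only indexed when the parser saved it
    }
    return d;
  }
}
```

The point of the split is that a plugin adding a new metadata field only overrides getDocument and calls super, without touching how the standard fields are indexed.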

Does this sound reasonable?

Doug
