Hi,

Jukka Zitting wrote:
> Hi,
>
> On Thu, Oct 23, 2008 at 5:32 PM, Stephane Bastian
> <[EMAIL PROTECTED]> wrote:
>> However, a ParserFactory class (which doesn't exist yet) would really help
>> us here and could provide public method(s) to do what's currently done
>> internally by the class AutoDetectParser
>
> You should be able to achieve this functionality by overriding the
> getParser(Metadata) method in CompositeParser (which AutoDetectParser
> inherits).
>
> Alternatively, you could simply modify the Tika configuration and pass
> the modified configuration to the AutoDetectParser instance.

I could certainly subclass CompositeParser and override getParser(Metadata), but it seems odd that Tika doesn't provide an easy way to get a Parser based upon a stream, a document name, and a content type. In fact, looking more closely at Tika's internals, we would simply need to add the following new method to MimeTypes:

MimeType mimeType = MimeTypes.getMimeType(inputStream, metadata);

We can then easily get the parser via the class TikaConfig (even though it's not optimal, since TikaConfig.getDefaultConfig() creates a new instance each time it's called. By the way, I can help you here as well in case you want to make TikaConfig immutable; just let me know what you have in mind and I can work on it).
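To make this concrete, here is a rough sketch of what client code could look like once that method exists. Note that getMimeType(InputStream, Metadata) is the proposed addition and doesn't exist yet, the getParser(String) lookup on TikaConfig is my assumption about the API, and myContentHandler is just a placeholder:

// Sketch only: getMimeType(InputStream, Metadata) is the method proposed
// above (not yet in Tika), and getParser(String) on TikaConfig is assumed.
TikaConfig config = TikaConfig.getDefaultConfig();
MimeTypes mimeTypes = config.getMimeRepository();
MimeType mimeType = mimeTypes.getMimeType(inputStream, metadata);
Parser parser = config.getParser(mimeType.getName());
parser.parse(inputStream, myContentHandler, metadata);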

Going back to the original question, don't you feel that it is a common use case to be able to get a Parser from a Stream and metadata?
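For what it's worth, here is roughly what the subclassing workaround looks like in practice. MyHtmlParser is a placeholder for our own parser, and I'm assuming getParser(Metadata) sees the content type that the auto-detection step put into the metadata:

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.Parser;

public class CustomDetectParser extends AutoDetectParser {

    private final Parser htmlParser = new MyHtmlParser(); // placeholder

    @Override
    protected Parser getParser(Metadata metadata) {
        // CONTENT_TYPE should have been filled in by the detection step
        String type = metadata.get(Metadata.CONTENT_TYPE);
        if (type != null && type.startsWith("text/html")) {
            return htmlParser;
        }
        return super.getParser(metadata);
    }
}

It works, but every application that needs type-based parser lookup has to re-implement this kind of boilerplate, which is exactly why a ParserFactory would be nice.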

> More generally, is there a specific reason why you need custom
> processing for HTML?

We are using Tika to get metadata, as you may have guessed :), and to extract other data as well. For instance, in the case of HTML, we were planning on using the content handler to do screen scraping, based upon the known structure of the HTML document. We were also planning on using the content handler to extract links that have specific names, or links that come from specific tags (such as HREF, SCRIPT, IMG, ...). In our case we don't want all the links, but only some of them, based on some internal logic that we'll put in the content handler. And we can't rely on the full text, because the names of links (and other information we need to filter on) are missing.
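To give you an idea, the kind of handler we had in mind looks something like this (all class and method names here are ours, not Tika's, and of course it only works if the original tags and attributes actually reach the handler):

import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class LinkScrapingHandler extends DefaultHandler {

    private final List<String> links = new ArrayList<String>();

    @Override
    public void startElement(String uri, String localName,
            String qName, Attributes atts) {
        // Collect link targets only from the tags we care about
        if ("a".equalsIgnoreCase(localName)) {
            keep(atts.getValue("href"));
        } else if ("img".equalsIgnoreCase(localName)
                || "script".equalsIgnoreCase(localName)) {
            keep(atts.getValue("src"));
        }
    }

    private void keep(String link) {
        // Placeholder for our internal filtering logic
        if (link != null && link.length() > 0) {
            links.add(link);
        }
    }

    public List<String> getLinks() {
        return links;
    }
}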

However, this morning I realized that the content handler of the HTML parser filters out tags such as divs, spans, and the like, and doesn't return the original body of the document. This is a bummer... Therefore we are out of luck and can't do screen scraping, because the structure of the document we get has been altered by Tika.

Since the HTML parser uses CyberNeko, the content handler is already being fed proper XHTML, right? Can't the content handler just return the original document structure unmodified?
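For reference, here is roughly how one could get at the unmodified structure by running NekoHTML directly and serializing its SAX events with an identity transformer. A sketch only, assuming NekoHTML is on the classpath; it bypasses Tika entirely:

import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;

import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class RawXhtmlDump {
    public static void main(String[] args) throws Exception {
        // NekoHTML's SAXParser is an XMLReader that repairs HTML
        // into well-formed XHTML-like SAX events
        XMLReader reader = new org.cyberneko.html.parsers.SAXParser();
        SAXTransformerFactory factory =
            (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        // Identity transformation: echo the SAX events back out as XML
        TransformerHandler serializer = factory.newTransformerHandler();
        serializer.setResult(new StreamResult(System.out));
        reader.setContentHandler(serializer);
        reader.parse(new InputSource(args[0]));
    }
}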

All the best,

Stephane Bastian