Hi,

Jukka Zitting wrote:
> Hi,
>
> On Thu, Oct 23, 2008 at 5:32 PM, Stephane Bastian
> <[EMAIL PROTECTED]> wrote:
>> However, a ParserFactory class (which doesn't exist yet) would really help
>> us here and could provide public method(s) to do what's currently done
>> internally by the class AutoDetectParser
>
> You should be able to achieve this functionality by overriding the
> getParser(Metadata) method in CompositeParser (which AutoDetectParser
> inherits).
>
> Alternatively, you could simply modify the Tika configuration and pass
> the modified configuration to the AutoDetectParser instance.

I could certainly subclass CompositeParser and override getParser(Metadata), but it seems odd that Tika doesn't provide an easy way to get a Parser based upon a stream, a document name, and a content type. In fact, looking more closely at Tika's internals, we would simply need to add the following new method to MimeTypes:

MimeType mimeType = MimeTypes.getMimeType(inputStream, metadata);

We can then easily get the parser via the class TikaConfig (even though it's not optimal, since TikaConfig.getDefaultConfig() creates a new instance each time it's called. By the way, I can help you here as well in case you want to make TikaConfig immutable; just let me know what you have in mind and I can work on it).
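To make this concrete, here is a rough sketch of what client code could look like once that method exists. Note that getMimeType(InputStream, Metadata) is the proposed addition and doesn't exist yet, the getParser(String) lookup on TikaConfig is my assumption about the API, and myContentHandler is just a placeholder:

// Sketch only: getMimeType(InputStream, Metadata) is the method proposed
// above (not yet in Tika), and getParser(String) on TikaConfig is assumed.
TikaConfig config = TikaConfig.getDefaultConfig();
MimeTypes mimeTypes = config.getMimeRepository();
MimeType mimeType = mimeTypes.getMimeType(inputStream, metadata);
Parser parser = config.getParser(mimeType.getName());
parser.parse(inputStream, myContentHandler, metadata);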

Going back to the original question, don't you feel that it is a common use case to be able to get a Parser from a Stream and metadata?
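For what it's worth, here is roughly what the subclassing workaround looks like in practice. MyHtmlParser is a placeholder for our own parser, and I'm assuming getParser(Metadata) sees the content type that the auto-detection step put into the metadata:

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.Parser;

public class CustomDetectParser extends AutoDetectParser {

    private final Parser htmlParser = new MyHtmlParser(); // placeholder

    @Override
    protected Parser getParser(Metadata metadata) {
        // CONTENT_TYPE should have been filled in by the detection step
        String type = metadata.get(Metadata.CONTENT_TYPE);
        if (type != null && type.startsWith("text/html")) {
            return htmlParser;
        }
        return super.getParser(metadata);
    }
}

It works, but every application that needs type-based parser lookup has to re-implement this kind of boilerplate, which is exactly why a ParserFactory would be nice.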

> More generally, is there a specific reason why you need custom
> processing for HTML?

We are using Tika to get metadata, as you may have guessed :), and to extract other data as well. For instance, in the case of HTML, we were planning on using the content handler to do screen scraping, based upon the known structure of the HTML document. We were also planning on using the content handler to extract links that have specific names, or links that come from specific tags (such as HREF, SCRIPT, IMG, ...). In our case we don't want all the links, but only some of them, based on some internal logic that we'll put in the content handler. And we can't rely on the full text, because the names of links (and other information we need to filter on) are missing.
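To give you an idea, the kind of handler we had in mind looks something like this (all class and method names here are ours, not Tika's, and of course it only works if the original tags and attributes actually reach the handler):

import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class LinkScrapingHandler extends DefaultHandler {

    private final List<String> links = new ArrayList<String>();

    @Override
    public void startElement(String uri, String localName,
            String qName, Attributes atts) {
        // Collect link targets only from the tags we care about
        if ("a".equalsIgnoreCase(localName)) {
            keep(atts.getValue("href"));
        } else if ("img".equalsIgnoreCase(localName)
                || "script".equalsIgnoreCase(localName)) {
            keep(atts.getValue("src"));
        }
    }

    private void keep(String link) {
        // Placeholder for our internal filtering logic
        if (link != null && link.length() > 0) {
            links.add(link);
        }
    }

    public List<String> getLinks() {
        return links;
    }
}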

However, this morning I realized that the content handler of the HTML parser filters out tags such as divs, spans, and the like, and doesn't return the original body of the document. This is a bummer... Therefore we are out of luck and can't do screen scraping, because the structure of the document we get has been altered by Tika.

Since the HTML parser uses CyberNeko, the content handler is already being fed proper XHTML, right? Can't the content handler just return the original document structure unmodified?
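For reference, here is roughly how one could get at the unmodified structure by running NekoHTML directly and serializing its SAX events with an identity transformer. A sketch only, assuming NekoHTML is on the classpath; it bypasses Tika entirely:

import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;

import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class RawXhtmlDump {
    public static void main(String[] args) throws Exception {
        // NekoHTML's SAXParser is an XMLReader that repairs HTML
        // into well-formed XHTML-like SAX events
        XMLReader reader = new org.cyberneko.html.parsers.SAXParser();
        SAXTransformerFactory factory =
            (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        // Identity transformation: echo the SAX events back out as XML
        TransformerHandler serializer = factory.newTransformerHandler();
        serializer.setResult(new StreamResult(System.out));
        reader.setContentHandler(serializer);
        reader.parse(new InputSource(args[0]));
    }
}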

All the best,

Stephane Bastian