Hi,
Jukka Zitting wrote:
> Hi,
>
> On Thu, Oct 23, 2008 at 5:32 PM, Stephane Bastian
> <[EMAIL PROTECTED]> wrote:
>> However, a ParserFactory class (which doesn't exist yet) would really help
>> us here and could provide public method(s) to do what's currently done
>> internally by the class AutoDetectParser
>
> You should be able to achieve this functionality by overriding the
> getParser(Metadata) method in CompositeParser (that AutoDetectParser
> inherits).
>
> Alternatively you could simply modify the Tika configuration and pass
> the modified configuration to the AutoDetectParser instance.

I could certainly subclass CompositeParser and override
getParser(metadata), but it seems odd that Tika doesn't provide an easy
way to get a Parser based upon a Stream, documentName, and contentType.
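Just so we're on the same page, I understand that workaround to look
roughly like this (untested sketch, names are just for illustration; it
assumes getParser(Metadata) is protected in CompositeParser so its
visibility can be widened):

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.Parser;

    // Expose the parser lookup that CompositeParser/AutoDetectParser
    // already performs internally.
    public class LookupParser extends AutoDetectParser {

        public Parser parserFor(Metadata metadata) {
            return getParser(metadata);
        }
    }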
As a matter of fact, by looking closer into Tika's internals, we'll simply
need to add the following new method to MimeTypes:

    MimeType mimeType = MimeTypes.getMimeType(inputStream, metadata);
We can then easily get the parser via the TikaConfig class (even though
it's not optimal, since TikaConfig.getDefaultConfig() creates a new
instance each time it's called). By the way, I can help you here as well in
case you want to make TikaConfig immutable; just let me know what you had
in mind and I can work on it.
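Put together, the ParserFactory I have in mind would be little more than
this (hypothetical sketch: getMimeType(InputStream, Metadata) is the method
proposed above and doesn't exist yet, and the per-type lookup on TikaConfig
is an assumption as well):

    import java.io.InputStream;

    import org.apache.tika.config.TikaConfig;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.mime.MimeType;
    import org.apache.tika.mime.MimeTypes;
    import org.apache.tika.parser.Parser;

    // Hypothetical sketch only; see the caveats above.
    public class ParserFactory {

        private final TikaConfig config;

        // e.g. new ParserFactory(TikaConfig.getDefaultConfig())
        public ParserFactory(TikaConfig config) {
            this.config = config;
        }

        public Parser getParser(InputStream stream, Metadata metadata)
                throws Exception {
            MimeTypes mimeTypes = config.getMimeRepository();
            // proposed detection method: stream magic plus metadata hints
            // (resource name, declared content type); not in Tika yet
            MimeType type = mimeTypes.getMimeType(stream, metadata);
            // assumes TikaConfig can return the parser registered for a type
            return config.getParser(type.getName());
        }
    }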
Going back to the original question, don't you feel that it is a common
use case to be able to get a Parser from a Stream and metadata?

> More generally, is there a specific reason why you need custom
> processing for HTML?

We are using Tika to get metadata, as you may have guessed :), and to
extract other data as well. For instance, in the case of HTML, we were
planning on using the content handler to do screen scraping, based upon the
known structure of the HTML document.
We were also planning on using the content handler to extract links that
have specific names, or links that come from specific tags (such as href,
script, img...). In our case we don't want all the links, but only some of
them, based on some internal logic that we'll put in the content handler.
And we can't rely on the full-text output because the names of links (and
other information we need to filter on) are missing.
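Roughly, the kind of handler I have in mind looks like this (untested
sketch: the element/attribute names and the accept() rule are placeholders,
and it assumes those elements actually reach the handler, which brings me
to the next point):

    import java.util.ArrayList;
    import java.util.List;

    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    // Collects only the links we care about from the XHTML SAX events.
    public class LinkFilterHandler extends DefaultHandler {

        private final List<String> links = new ArrayList<String>();

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes attributes) {
            if ("a".equalsIgnoreCase(localName)) {
                String href = attributes.getValue("href");
                if (href != null && accept(href)) {
                    links.add(href);
                }
            } else if ("img".equalsIgnoreCase(localName)) {
                String src = attributes.getValue("src");
                if (src != null && accept(src)) {
                    links.add(src);
                }
            }
        }

        // placeholder for the internal filtering logic mentioned above
        private boolean accept(String link) {
            return link.startsWith("http");
        }

        public List<String> getLinks() {
            return links;
        }
    }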
However, this morning I realized that the content handler of the HTML
parser filters out tags such as divs, spans and the like, and doesn't
return the original body of the document, which is a bummer... Therefore we
are out of luck and can't do screen scraping, because the structure of the
document we get has been altered by Tika.
Since the HTML parser uses CyberNeko, the content handler is already
receiving proper XHTML, right?
Can't the content handler just return the original document structure
unmodified?
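For what it's worth, an easy way to see exactly what HtmlParser sends to
the handler is to serialize the SAX events back to XML with a plain
identity transformer, something like this untested sketch (assuming the
parse(InputStream, ContentHandler, Metadata) signature):

    import java.io.InputStream;
    import java.io.StringWriter;

    import javax.xml.transform.OutputKeys;
    import javax.xml.transform.sax.SAXTransformerFactory;
    import javax.xml.transform.sax.TransformerHandler;
    import javax.xml.transform.stream.StreamResult;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.html.HtmlParser;

    // Dumps the XHTML that HtmlParser emits, so we can see which elements
    // survive and which ones get dropped.
    public class DumpHtmlParserOutput {

        public static String parseToXhtml(InputStream stream) throws Exception {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) SAXTransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");

            StringWriter writer = new StringWriter();
            handler.setResult(new StreamResult(writer));

            new HtmlParser().parse(stream, handler, new Metadata());
            return writer.toString();
        }
    }
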
All the best,
Stephane Bastian