Hi,

On 10/15/07, Sami Siren (JIRA) <[EMAIL PROTECTED]> wrote:
> Add encode detection support for HTML parser

This feature sounds like something that the upstream HTML parser
library might want to do. I'm not sure whether NekoHTML is still
maintained anywhere, but if it is, we should probably consider sending
a patch upstream.

More generally, IMHO the Tika parsers should ideally be lightweight
adapters to the native interface of the underlying parsing library.
Whenever we find ourselves adding non-trivial features to the Parser
classes within Tika, we should at
least consider sending the improvements as patches to the upstream
parser projects. Otherwise we'll soon end up with tons of bug reports
about the details of parsing specific content types.

BR,

Jukka Zitting
