Hi,

On 10/15/07, Sami Siren <[EMAIL PROTECTED]> wrote:
> Jukka Zitting wrote:
> > On 10/15/07, Sami Siren (JIRA) <[EMAIL PROTECTED]> wrote:
> >> Add encode detection support for HTML parser
> >
> > This feature sounds like something that the upstream HTML parser
> > library might want to do. I'm not sure if NekoHTML is maintained
> > anywhere, but if it was we should probably consider sending a patch
> > for that.
>
> Well I used the same piece of code that you used for txtparser so
> detection/decoding is provided by icu4j and not Tika. So I was
> basically just using icu for getting a properly decoded reader,
> nothing fancier than that.

Agreed, it's a relatively simple and straightforward enhancement, but
wouldn't it be useful also for other users of NekoHTML? Also, how
about handling <meta http-equiv='Content-Type'
content='text/html;charset=...'> tags or <?xml version="1.0"
encoding="..."?> prefixes?
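
To make concrete what handling those declarations would involve, here is
a rough Java sketch. It is not code from Tika or NekoHTML: the class name
DeclaredCharsetSniffer and its sniff method are hypothetical, and the
regular expressions are a simplification that ignores byte order marks,
comments and malformed markup. It only looks for a declared charset in
the first bytes of a document.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
public class DeclaredCharsetSniffer {

    // Matches <?xml version="1.0" encoding="..."?> at the start of the input.
    private static final Pattern XML_DECL = Pattern.compile(
        "^\\s*<\\?xml[^>]*?\\bencoding\\s*=\\s*[\"']([^\"']+)[\"']",
        Pattern.CASE_INSENSITIVE);

    // Matches a charset=... value anywhere inside a <meta ...> tag, e.g.
    // <meta http-equiv='Content-Type' content='text/html;charset=...'>.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta\\b[^>]*?\\bcharset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
        Pattern.CASE_INSENSITIVE);

    // Returns the declared charset name, if one is found in the prefix.
    public static Optional<String> sniff(byte[] prefix) {
        // ISO-8859-1 maps every byte to one char, so ASCII markup survives
        // regardless of the real encoding of the document.
        String head = new String(prefix, StandardCharsets.ISO_8859_1);

        Matcher xml = XML_DECL.matcher(head);
        if (xml.find()) {
            return Optional.of(xml.group(1));
        }
        Matcher meta = META_CHARSET.matcher(head);
        if (meta.find()) {
            return Optional.of(meta.group(1));
        }
        return Optional.empty();
    }
}
```

A caller would pass the first kilobyte or so of the stream and fall back
to statistical detection (such as icu4j) when the result is empty. Even
this small sketch already has to decide which declaration wins and how
much input to scan.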

IMHO concerns like that are a slippery slope that we should avoid
getting involved with within Tika. It's best if all such knowledge is
embedded in the external parser libraries we use.

BR,

Jukka Zitting
