HtmlParser should use CharsetDetector whenever no charset is specified via meta
http-equiv tag
----------------------------------------------------------------------------------------------
Key: TIKA-334
URL: https://issues.apache.org/jira/browse/TIKA-334
Project: Tika
Issue Type: Improvement
Affects Versions: 0.5
Reporter: Ken Krugler
Currently the HtmlParser will just call TagSoup to parse, without specifying a
charset, if no charset is passed in via metadata.
TagSoup uses the platform encoding in this case, which is often going to be
wrong.
The right thing to do is to first check for a charset specified by a meta tag.
If that doesn't exist, then create a CharsetDetector. If there's a charset in
the incoming meta-data, use that to call setDeclaredEncoding().
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.