[ https://issues.apache.org/jira/browse/TIKA-332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved TIKA-332.
--------------------------------
Resolution: Fixed
Fix Version/s: 0.6
Assignee: Jukka Zitting
Patches applied in revision 890009.
> Use http-equiv meta tag charset info when processing HTML documents
> -------------------------------------------------------------------
>
> Key: TIKA-332
> URL: https://issues.apache.org/jira/browse/TIKA-332
> Project: Tika
> Issue Type: Improvement
> Affects Versions: 0.5
> Reporter: Ken Krugler
> Assignee: Jukka Zitting
> Priority: Critical
> Fix For: 0.6
>
> Attachments: TIKA-332-2.patch, TIKA-332.patch
>
>
> Currently Tika doesn't use the charset info that's optionally present in HTML
> documents, via the <meta http-equiv="Content-type" content="text/html;
> charset=xxx"> tag.
> If the mime-type is detected as being one that's handled by the HtmlParser,
> then the first 4-8K of the document should be decoded from bytes as us-ascii,
> and then scanned using a regex something like:
> private static final Pattern HTTP_EQUIV_CHARSET_PATTERN =
> Pattern.compile("<meta\\s+http-equiv\\s*=\\s*['\"]\\s*Content-Type['\"]\\s+content\\s*=\\s*['\"][^;]+;\\s*charset\\s*=\\s*([^'\"]+)['\"]",
> Pattern.CASE_INSENSITIVE);
> If a charset is detected, it should take precedence over a charset in the
> HTTP response headers, and (obviously) be used to convert the bytes to text
> before the actual parsing of the document begins.
> In a test I did of 100 random HTML pages, roughly 15% contained charset info
> in the meta tag that wound up being different from the detected or HTTP
> response header charset, so this is a pretty important improvement to make.
> Without it, Tika isn't that useful for processing HTML pages.
> A separate problem is that the HtmlParser code doesn't use the
> CharsetDetector, which is another source of incorrectly decoded text. I'll
> file a separate issue about that.