[ https://issues.apache.org/jira/browse/TIKA-339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved TIKA-339.
--------------------------------
Resolution: Fixed
Assignee: Jukka Zitting
Committed in revision 890130.
> HtmlParser & TXTParser should not use language returned by CharsetDetector if
> language hint has been provided
> -------------------------------------------------------------------------------------------------------------
>
> Key: TIKA-339
> URL: https://issues.apache.org/jira/browse/TIKA-339
> Project: Tika
> Issue Type: Bug
> Affects Versions: 0.6
> Reporter: Ken Krugler
> Assignee: Jukka Zitting
> Priority: Minor
> Fix For: 0.6
>
> Attachments: TIKA-339.patch
>
>
> Currently, the code that calls CharsetDetector in both TXTParser and
> HtmlParser replaces any incoming language in the metadata map whenever
> the detector returns a language.
> Given the low reliability of the detector's language result, it should only
> be used when no language has been provided. A provided language typically
> comes from the HTTP response header or (for the HtmlParser) a meta tag or
> some other tag attribute. In all those cases, the incoming language is more
> accurate than the guess by the CharsetDetector.
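
The requested behaviour can be sketched roughly as follows. This is only an illustration of the "keep the hint, fall back to the guess" rule, not the code from TIKA-339.patch or revision 890130. A plain Map stands in for Tika's metadata object, and the class name, the method name applyDetectedLanguage, and the key constant are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class LanguageHintSketch {
    // Assumed metadata key, chosen for illustration only.
    static final String CONTENT_LANGUAGE = "Content-Language";

    /**
     * Stores the detector's language guess only when the metadata
     * does not already carry a language hint.
     */
    static void applyDetectedLanguage(Map<String, String> metadata,
                                      String detectedLanguage) {
        if (detectedLanguage == null || detectedLanguage.isEmpty()) {
            return; // the detector had no guess, nothing to do
        }
        String hint = metadata.get(CONTENT_LANGUAGE);
        if (hint == null || hint.isEmpty()) {
            // No hint from an HTTP header or meta tag: use the guess.
            metadata.put(CONTENT_LANGUAGE, detectedLanguage);
        }
        // Otherwise the provided hint is kept untouched.
    }

    public static void main(String[] args) {
        Map<String, String> withHint = new HashMap<>();
        withHint.put(CONTENT_LANGUAGE, "fr");
        applyDetectedLanguage(withHint, "en");
        System.out.println(withHint.get(CONTENT_LANGUAGE)); // prints "fr"

        Map<String, String> noHint = new HashMap<>();
        applyDetectedLanguage(noHint, "en");
        System.out.println(noHint.get(CONTENT_LANGUAGE)); // prints "en"
    }
}
```

In this sketch an empty hint is treated the same as a missing one, which is a design choice of the example rather than something stated in the issue.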
--
This message is automatically generated by JIRA.