Hans Brende created ANY23-418:
---------------------------------
Summary: Take another look at encoding detection
Key: ANY23-418
URL: https://issues.apache.org/jira/browse/ANY23-418
Project: Apache Any23
Issue Type: Improvement
Components: encoding
Affects Versions: 2.3
Reporter: Hans Brende
Fix For: 2.3
In order to address various shortcomings of Tika encoding detection, I've had
to modify the TikaEncodingDetector several times (cf. ANY23-385 and ANY23-411).
In the former, I placed a much greater weight on charsets declared in HTML
meta elements and XML declarations. In the latter, I placed a much greater
weight on charsets declared in HTTP Content-Type headers.
However, after taking a look at TIKA-539, I'm thinking I should reduce this
added weight (at least for HTML meta elements), and perhaps ignore such
declarations altogether unless they declare UTF-8. It seems that incorrect
declarations usually declare something *other than* UTF-8 when the correct
charset is actually UTF-8.
Roughly speaking, more than 90% of all webpages use UTF-8, and all of our
encoding detection errors to date have involved *something other than UTF-8*
being detected when the correct encoding was actually UTF-8, not the other
way around.
Therefore, what I propose is the following:
(1) In the absence of a Content-Type header, any declared hints that the
charset is UTF-8 should add to the weight for UTF-8, while any declared hints
that the charset is not UTF-8 should be ignored.
(2) In the presence of a Content-Type header, any other declared hints should
be ignored, unless they match UTF-8 and do not match the Content-Type header,
in which case all hints, including the Content-Type header, should be ignored.
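
To make the two rules concrete, here is a minimal Java sketch of the decision
logic. It is not the actual TikaEncodingDetector code: the CharsetHintPolicy
class, its Decision type, and the decide method are hypothetical names, and the
weights (one unit per UTF-8 declaration, one unit for the header) are assumed
purely for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical helper illustrating rules (1) and (2); not part of Any23 or Tika. */
final class CharsetHintPolicy {

    /** The charset to favor (null means "ignore all hints") and the weight to add for it. */
    static final class Decision {
        final Charset favored;
        final int weight;

        Decision(Charset favored, int weight) {
            this.favored = favored;
            this.weight = weight;
        }
    }

    private static final Decision IGNORE_ALL = new Decision(null, 0);

    /**
     * @param contentType charset from the HTTP Content-Type header, or null if absent
     * @param declared    charsets declared in the document (meta elements, XML declaration)
     */
    static Decision decide(Charset contentType, List<Charset> declared) {
        // Count only the declarations that name UTF-8.
        int utf8Hints = 0;
        for (Charset c : declared) {
            if (StandardCharsets.UTF_8.equals(c)) {
                utf8Hints++;
            }
        }

        if (contentType == null) {
            // Rule (1): UTF-8 declarations add weight; all other declarations are ignored.
            return utf8Hints > 0
                    ? new Decision(StandardCharsets.UTF_8, utf8Hints)
                    : IGNORE_ALL;
        }

        // Rule (2): a UTF-8 declaration that contradicts the header cancels every hint.
        if (utf8Hints > 0 && !StandardCharsets.UTF_8.equals(contentType)) {
            return IGNORE_ALL;
        }

        // Otherwise only the header counts (unit weight is an assumption).
        return new Decision(contentType, 1);
    }
}
```

Under this sketch, a page served with an ISO-8859-1 Content-Type header but a
UTF-8 meta declaration yields no hint at all, so the outcome is left entirely
to statistical detection.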
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)