Hans Brende created ANY23-418:
---------------------------------

             Summary: Take another look at encoding detection
                 Key: ANY23-418
                 URL: https://issues.apache.org/jira/browse/ANY23-418
             Project: Apache Any23
          Issue Type: Improvement
          Components: encoding
    Affects Versions: 2.3
            Reporter: Hans Brende
             Fix For: 2.3


To address various shortcomings of Tika's encoding detection, I've had 
to modify the TikaEncodingDetector several times; cf. ANY23-385 and ANY23-411. 
In the former, I placed a much greater weight on charsets declared in 
HTML meta elements and XML declarations. In the latter, I placed a much greater 
weight on charsets declared in HTTP Content-Type headers.

However, after taking a look at TIKA-539, I'm thinking I should reduce this 
added weight (at least for HTML meta elements), and perhaps ignore such 
declarations altogether unless they declare UTF-8, since it seems that 
incorrect declarations usually declare something *other than* UTF-8 when the 
correct charset is UTF-8.

Something like 90% or more of all webpages use UTF-8 encoding, and all of our 
encoding detection errors to date have involved *something other than UTF-8* 
being detected when the correct encoding was actually UTF-8, not the other way 
around.

Therefore, what I propose is the following: 

(1) In the absence of a Content-Type header, any declared hints that the 
charset is UTF-8 should add to the weight for UTF-8, while any declared hints 
that the charset is not UTF-8 should be ignored. 

(2) In the presence of a Content-Type header, any other declared hints should 
be ignored, with one exception: if another hint declares UTF-8 and the 
Content-Type header declares a different charset, then all hints, including 
the Content-Type header, should be ignored.
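
For illustration, the two rules above could be sketched in Java roughly as 
follows. This is only a sketch of the decision logic, not the actual 
TikaEncodingDetector change: the class name CharsetHintPolicy, the Action 
enum, and the decide method are hypothetical names, and how much weight a 
"boost" adds is deliberately left open.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of Any23 or Tika.
final class CharsetHintPolicy {

    /** What the detector should do with the declared charset hints. */
    enum Action {
        IGNORE_ALL_HINTS,      // rely on statistical detection only
        BOOST_UTF8,            // add weight to UTF-8, drop non-UTF-8 hints
        USE_CONTENT_TYPE_ONLY  // keep the header's charset, drop other hints
    }

    /**
     * @param contentTypeCharset charset from the HTTP Content-Type header, if any
     * @param declaredHints      charsets declared elsewhere (HTML meta, XML declaration)
     */
    static Action decide(Optional<Charset> contentTypeCharset,
                         List<Charset> declaredHints) {
        boolean anyHintIsUtf8 = declaredHints.contains(StandardCharsets.UTF_8);

        if (!contentTypeCharset.isPresent()) {
            // Rule (1): only UTF-8 hints count; non-UTF-8 hints are ignored.
            return anyHintIsUtf8 ? Action.BOOST_UTF8 : Action.IGNORE_ALL_HINTS;
        }

        // Rule (2): a UTF-8 hint that contradicts the header discards everything.
        boolean headerIsUtf8 =
                contentTypeCharset.get().equals(StandardCharsets.UTF_8);
        if (anyHintIsUtf8 && !headerIsUtf8) {
            return Action.IGNORE_ALL_HINTS;
        }
        // Otherwise only the header is considered.
        return Action.USE_CONTENT_TYPE_ONLY;
    }
}
```

For example, a page served with "Content-Type: text/html; charset=ISO-8859-1" 
whose meta element declares UTF-8 would fall into the IGNORE_ALL_HINTS case, 
leaving the decision to the detector's analysis of the bytes themselves.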

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
