[ https://issues.apache.org/jira/browse/ANY23-418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hans Brende updated ANY23-418:
------------------------------
    Fix Version/s:     (was: 2.4)
                   2.3

> Take another look at encoding detection
> ---------------------------------------
>
>                 Key: ANY23-418
>                 URL: https://issues.apache.org/jira/browse/ANY23-418
>             Project: Apache Any23
>          Issue Type: Improvement
>          Components: encoding
>    Affects Versions: 2.3
>            Reporter: Hans Brende
>            Priority: Major
>             Fix For: 2.3
>
>
> In order to address various shortcomings of Tika encoding detection, I've had
> to modify the TikaEncodingDetector several times. Cf. ANY23-385 and
> ANY23-411. In the former, I placed a much greater weight on detected charsets
> declared in html meta elements & xml declarations. In the latter, I placed a
> much greater weight on charsets returned from HTTP Content-Type headers.
> However, after taking a look at TIKA-539, I'm thinking I should reduce this
> added weight (for at least html meta elements), and perhaps ignore it
> altogether (unless it happens to match UTF-8, since it seems that incorrect
> declarations usually declare something *other than* UTF-8, when the correct
> charset should be UTF-8).
> Something like > 90% of all webpages use UTF-8 encoding, and all of our
> encoding detection errors to date have revolved around *something other than
> UTF-8* being detected when the correct encoding was actually UTF-8, not the
> other way around.
> Therefore, what I propose is the following:
> (1) In the absence of a Content-Type header, any declared hints that the
> charset is UTF-8 should add to the weight for UTF-8, while any declared hints
> that the charset is not UTF-8 should be ignored.
> (2) In the presence of a Content-Type header, any other declared hints should
> be ignored, unless they match UTF-8 and do not match the Content-Type header,
> in which case all hints, including the Content-Type header, should be ignored.
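
For readers skimming the thread, here is a minimal Java sketch of proposals (1) and (2) read literally. It is not the TikaEncodingDetector code and not what the pull request implements. The class name HintFilter, the method trustedHints, and the idea of returning a list of "hints that still count" are all hypothetical, chosen only for illustration. A null contentTypeCharset stands for "no Content-Type charset available", which is also an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Any23 or Tika.
final class HintFilter {

    /**
     * Returns the declared charsets that should still influence detection.
     *
     * @param contentTypeCharset charset from the HTTP Content-Type header,
     *                           or null if the header gave none (assumption)
     * @param declaredHints      charsets declared in html meta elements,
     *                           xml declarations, etc.
     */
    static List<Charset> trustedHints(Charset contentTypeCharset,
                                      List<Charset> declaredHints) {
        List<Charset> trusted = new ArrayList<>();

        if (contentTypeCharset == null) {
            // Proposal (1): without a header, only UTF-8 hints add weight;
            // hints for any other charset are dropped.
            for (Charset hint : declaredHints) {
                if (StandardCharsets.UTF_8.equals(hint)) {
                    trusted.add(hint);
                }
            }
            return trusted;
        }

        // Proposal (2): with a header, other hints are dropped, unless one
        // of them says UTF-8 while the header says something else. In that
        // case nothing is trusted, the header included.
        boolean headerIsUtf8 = StandardCharsets.UTF_8.equals(contentTypeCharset);
        for (Charset hint : declaredHints) {
            if (StandardCharsets.UTF_8.equals(hint) && !headerIsUtf8) {
                return new ArrayList<>();
            }
        }
        trusted.add(contentTypeCharset);
        return trusted;
    }
}
```

With this reading, a page served with "Content-Type: text/html; charset=ISO-8859-1" whose meta element declares UTF-8 yields an empty list, so detection would rest on the byte content alone. The same header paired with a windows-1252 meta declaration yields only ISO-8859-1.
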
> EDIT: The above 2 points are a simplification of what I've actually
> implemented (specifically, I don't necessarily ignore non-UTF-8 hints). See
> PR 131 for details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)