[jira] [Commented] (ANY23-418) Take another look at encoding detection

Hudson (JIRA) Wed, 06 Feb 2019 22:46:55 -0800


    [ 
https://issues.apache.org/jira/browse/ANY23-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16762411#comment-16762411
 ]


Hudson commented on ANY23-418:
------------------------------

SUCCESS: Integrated in Jenkins build Any23-trunk #1654 (See [https://builds.apache.org/job/Any23-trunk/1654/])
ANY23-418 improve TikaEncodingDetector (hans: rev d64dac9dfe0752c45d3ff9fbca37bbe447e5c55b)
* (edit) encoding/src/main/java/org/apache/any23/encoding/TikaEncodingDetector.java
ANY23-418 add additional unit tests (hans: rev e9f11b4979f491d395f76ad22f11869220099be2)
* (edit) encoding/src/test/java/org/apache/any23/encoding/TikaEncodingDetectorTest.java
* (edit) encoding/src/main/java/org/apache/any23/encoding/TikaEncodingDetector.java
ANY23-418 update f8 artifact, cleanup (hans: rev dce3c098e8a4c0662e663d847d345a67a978e343)
* (edit) encoding/src/test/java/org/apache/any23/encoding/TikaEncodingDetectorTest.java
* (edit) encoding/pom.xml
* (edit) encoding/src/main/java/org/apache/any23/encoding/EncodingUtils.java
ANY23-418 update NOTICE.txt (hans: rev e9c001ffa7bcfb7914a91649d2a190857569d054)
* (edit) NOTICE.txt


> Take another look at encoding detection
> ---------------------------------------
>
>                 Key: ANY23-418
>                 URL: https://issues.apache.org/jira/browse/ANY23-418
>             Project: Apache Any23
>          Issue Type: Improvement
>          Components: encoding
>    Affects Versions: 2.3
>            Reporter: Hans Brende
>            Assignee: Hans Brende
>            Priority: Major
>             Fix For: 2.3
>
>
> In order to address various shortcomings of Tika encoding detection, I've had 
> to modify the TikaEncodingDetector several times. Cf. ANY23-385 and 
> ANY23-411. In the former, I placed a much greater weight on detected charsets 
> declared in html meta elements & xml declarations. In the latter, I placed a 
> much greater weight on charsets returned from HTTP Content-Type headers.
> However, after taking a look at TIKA-539, I'm thinking I should reduce this 
> added weight (for at least html meta elements), and perhaps ignore it 
> altogether (unless it happens to match UTF-8, since it seems that incorrect 
> declarations usually declare something *other than* UTF-8, when the correct 
> charset should be UTF-8).
> More than roughly 90% of all webpages use UTF-8 encoding, and every one of 
> our encoding detection errors to date has involved *something other than 
> UTF-8* being detected when the correct encoding was actually UTF-8, never 
> the other way around.
> Therefore, what I propose is the following: 
> (1) In the absence of a Content-Type header, any declared hints that the 
> charset is UTF-8 should add to the weight for UTF-8, while any declared hints 
> that the charset is not UTF-8 should be ignored. 
> (2) In the presence of a Content-Type header, any other declared hints should 
> be ignored, unless they match UTF-8 and do not match the Content-Type header, 
> in which case all hints, including the Content-Type header, should be ignored.
> EDIT: The above two points are a simplification of what I've actually 
> implemented (specifically, I don't necessarily ignore non-UTF-8 hints). See 
> PR 131 for details.
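
The two proposed rules can be read as a filter over declared charsets. The following is a minimal sketch of the simplified proposal only, not the code merged for this issue (the description itself notes that the actual implementation differs). The class name HintFilterSketch and the method effectiveHints are hypothetical and are not part of TikaEncodingDetector; how much weight each surviving hint receives is left out entirely.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of rules (1) and (2); not Any23 code.
final class HintFilterSketch {

    /**
     * Returns the declared charsets that should still contribute weight.
     *
     * @param contentTypeCharset charset from the HTTP Content-Type header,
     *                           or null if no such header was present
     * @param declaredHints      charsets declared in html meta elements
     *                           or xml declarations
     */
    static List<Charset> effectiveHints(Charset contentTypeCharset,
                                        List<Charset> declaredHints) {
        List<Charset> kept = new ArrayList<>();

        if (contentTypeCharset == null) {
            // Rule (1): without a Content-Type header, only UTF-8 hints
            // count; hints naming any other charset are ignored.
            for (Charset hint : declaredHints) {
                if (StandardCharsets.UTF_8.equals(hint)) {
                    kept.add(hint);
                }
            }
            return kept;
        }

        // Rule (2): with a Content-Type header, the other hints are ignored,
        // unless one of them says UTF-8 while the header says something
        // else. In that case every hint, header included, is dropped.
        boolean utf8Declared = declaredHints.contains(StandardCharsets.UTF_8);
        boolean headerIsUtf8 = StandardCharsets.UTF_8.equals(contentTypeCharset);
        if (!(utf8Declared && !headerIsUtf8)) {
            kept.add(contentTypeCharset);
        }
        return kept;
    }
}
```

Under this reading, a page served with an ISO-8859-1 Content-Type header but carrying a UTF-8 meta declaration yields no hints at all, so detection would fall back entirely on the byte-level statistics.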



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
