[jira] [Commented] (TIKA-3479) UniversalCharsetDetector in 2.x is misidentifying windows-1250 as ISO-8859-1

Tim Allison (Jira) Wed, 14 Jul 2021 08:40:06 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380688#comment-17380688
 ]


Tim Allison commented on TIKA-3479:
-----------------------------------

As I look at this, the 1.x detection of these files as windows-1252 is quite 
broken as well, just less broken than the 2.x detection as ISO-8859-1.  I think 
we should punt on this distinction for now.  I don't think this is a blocker 
for 2.x.

> UniversalCharsetDetector in 2.x is misidentifying windows-1250 as ISO-8859-1
> ----------------------------------------------------------------------------
>
>                 Key: TIKA-3479
>                 URL: https://issues.apache.org/jira/browse/TIKA-3479
>             Project: Tika
>          Issue Type: Task
>    Affects Versions: 2.0.0-BETA
>            Reporter: Tim Allison
>            Priority: Minor
>
> We've lost quite a few "common words" for Czech and Slovak text files in 2.x 
> vs. 1.x.  The key issue appears to be the following (which we do not have in 
> 1.x).
> {noformat}
>     /*
>      * hex values 0x81, 0x8d, 0x8f, 0x90 and 0x9d don't exist in charset
>      * windows-1252. If the count of any of these values is > 0, return true
>      * */
>     private Boolean hasNonexistentHexInCharsetWindows1252() {
>         return (statistics.count(0x81) > 0 || statistics.count(0x8d) > 0 ||
>                 statistics.count(0x8f) > 0 || statistics.count(0x90) > 0 ||
>                 statistics.count(0x9d) > 0);
>     }
> {noformat}
> The icu4j detector detects windows-1250 (not supported by the 
> UniversalEncodingDetector), and the characters decoded with that encoding do 
> better on google. windows-1252 is _generally_ a better match for windows-1250 
> than ISO-8859-1.
> Not sure how best to handle this...
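
To make the failure mode concrete, here is a minimal standalone sketch of the
same byte check, not the Tika code itself. The class name, method name and
sample bytes are hypothetical and chosen only for illustration. It relies on
the fact that in windows-1250 the byte 0x9d is the letter "ť" (and 0x8d is
"Ť"), both ordinary in Czech and Slovak, while windows-1252 leaves those
positions undefined. Any windows-1250 text containing "ť" therefore trips the
check, which would explain windows-1252 being ruled out for such files.

```java
public class Windows1252GapCheck {

    // Byte values that windows-1252 leaves undefined (same set as the quoted check).
    private static final int[] UNDEFINED_IN_1252 = {0x81, 0x8d, 0x8f, 0x90, 0x9d};

    // Hypothetical stand-in for the quoted method: true if any of the
    // undefined byte values occurs at least once in the input.
    static boolean hasByteUndefinedIn1252(byte[] data) {
        int[] counts = new int[256];
        for (byte b : data) {
            counts[b & 0xff]++; // mask to treat the signed byte as 0..255
        }
        for (int value : UNDEFINED_IN_1252) {
            if (counts[value] > 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "šťastný" as windows-1250 bytes: š=0x9a, ť=0x9d, ý=0xfd (written by hand
        // so the sketch does not depend on which charsets the JVM ships).
        byte[] withT = {(byte) 0x9a, (byte) 0x9d, 'a', 's', 't', 'n', (byte) 0xfd};
        // "žena" as windows-1250 bytes: ž=0x9e, no byte from the undefined set.
        byte[] withoutT = {(byte) 0x9e, 'e', 'n', 'a'};

        System.out.println(hasByteUndefinedIn1252(withT));
        System.out.println(hasByteUndefinedIn1252(withoutT));
    }
}
```

The first sample contains 0x9d and is flagged; the second is not, so the
effect depends on which letters happen to appear in a given file.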



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
