[
https://issues.apache.org/jira/browse/TIKA-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380703#comment-17380703
]
Tim Allison commented on TIKA-3479:
-----------------------------------
One way to fix this problem (and open a new can of worms?) is to add this to
the UniversalEncodingListener:
{noformat}
if (charset == null && statistics.isMostlyAscii()) {
report(Constants.CHARSET_WINDOWS_1252);
}
{noformat}
This returns windows-1252 whenever the detector finds nothing else and the
content is mostly ASCII. It also means that the ICU4j encoding detector is
effectively never run, IIUC. If we got rid of this default behavior, this
particular problem would be fixed...
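As a quick sanity check (JDK built-in charsets only, no Tika or ICU4j involved), the byte values at issue can be decoded both ways. Bytes such as 0x8d, 0x8f, and 0x9d are unassigned in windows-1252 but map to the Czech/Slovak letters Ť, Ź, and ť in windows-1250; the class name below is just for the demo:

```java
import java.nio.charset.Charset;

public class CodePageDemo {
    public static void main(String[] args) {
        // Bytes unassigned in windows-1252 but mapped in windows-1250
        byte[] bytes = {(byte) 0x8d, (byte) 0x8f, (byte) 0x9d};

        String asCp1250 = new String(bytes, Charset.forName("windows-1250"));
        String asCp1252 = new String(bytes, Charset.forName("windows-1252"));

        // windows-1250 yields real letters: ŤŹť
        System.out.println(asCp1250);

        // windows-1252 yields no letters at all (the JDK decodes the
        // unassigned positions to C1 control characters)
        System.out.println(asCp1252.chars().anyMatch(Character::isLetter));
    }
}
```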
> UniversalCharsetDetector in 2.x is misidentifying windows-1250 as ISO-8859-1
> ----------------------------------------------------------------------------
>
> Key: TIKA-3479
> URL: https://issues.apache.org/jira/browse/TIKA-3479
> Project: Tika
> Issue Type: Task
> Affects Versions: 2.0.0-BETA
> Reporter: Tim Allison
> Priority: Minor
> Attachments: Bates.Motel.S02E08.HDTV.x264-KILLERS.srt
>
>
> We've lost quite a few "common words" for Czech and Slovak text files in 2.x
> vs. 1.x. The key issue appears to be the following (which we do not have in
> 1.x).
> {noformat}
> /*
>  * Hex values 0x81, 0x8d, 0x8f, 0x90, and 0x9d don't exist in charset
>  * windows-1252. If any of them occurs in the input, return true.
>  */
> private boolean hasNonexistentHexInCharsetWindows1252() {
>     return (statistics.count(0x81) > 0 || statistics.count(0x8d) > 0 ||
>             statistics.count(0x8f) > 0 || statistics.count(0x90) > 0 ||
>             statistics.count(0x9d) > 0);
> }
> {noformat}
> The icu4j detector detects windows-1250 (which is not supported by the
> UniversalEncodingDetector), and the characters decoded with that encoding
> look far more plausible in a Google search. windows-1252 is _generally_ a
> better match for windows-1250 than ISO-8859-1 is.
> Not sure how best to handle this...
--
This message was sent by Atlassian Jira
(v8.3.4#803005)