[jira] [Created] (TIKA-3479) UniversalCharsetDetector in 2.x is misidentifying windows-1252(?) as ISO-8859-1

Tim Allison (Jira) Wed, 14 Jul 2021 08:18:04 -0700

Tim Allison created TIKA-3479:
---------------------------------

             Summary: UniversalCharsetDetector in 2.x is misidentifying 
windows-1252(?) as ISO-8859-1
                 Key: TIKA-3479
                 URL: https://issues.apache.org/jira/browse/TIKA-3479
             Project: Tika
          Issue Type: Task
            Reporter: Tim Allison



We've lost quite a few "common words" for Czech and Slovak text files in 2.x 
vs. 1.x.  The key issue appears to be the following check, which 1.x does not 
have.

{noformat}
    /*
     * Hex values 0x81, 0x8d, 0x8f, 0x90 and 0x9d don't exist in charset
     * windows-1252. If the count of any of these values is > 0, return true.
     */
    private Boolean hasNonexistentHexInCharsetWindows1252() {
        return (statistics.count(0x81) > 0 || statistics.count(0x8d) > 0 ||
                statistics.count(0x8f) > 0 || statistics.count(0x90) > 0 ||
                statistics.count(0x9d) > 0);
    }
{noformat}

I _think_ the files are actually cp852 
(https://en.wikipedia.org/wiki/Code_page_852), and they do contain these byte 
values. windows-1252 is _generally_ a better match for cp852 than ISO-8859-1.
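
To make the effect concrete, here is a minimal standalone sketch of the same 
kind of check. It is not the Tika code: the class and method names are made up 
for illustration, and a plain int[256] histogram stands in for the detector's 
statistics object. It shows that a single 0x90 byte (which is "É" in cp852) is 
enough to trip the check, while ordinary windows-1252 text is not affected.

```java
final class Windows1252GapCheck {

    // The five byte values that windows-1252 leaves unassigned.
    private static final int[] UNDEFINED_IN_WINDOWS_1252 =
            {0x81, 0x8d, 0x8f, 0x90, 0x9d};

    // Count how often each unsigned byte value (0..255) occurs in the input.
    // Hypothetical stand-in for the detector's byte statistics.
    static int[] countBytes(byte[] data) {
        int[] counts = new int[256];
        for (byte b : data) {
            counts[b & 0xff]++;
        }
        return counts;
    }

    // True if the input contains at least one byte that windows-1252
    // does not define.
    static boolean hasBytesUndefinedInWindows1252(byte[] data) {
        int[] counts = countBytes(data);
        for (int value : UNDEFINED_IN_WINDOWS_1252) {
            if (counts[value] > 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // 0x90 is "É" in cp852 but unassigned in windows-1252.
        byte[] cp852Like = {(byte) 0x90, 'd', 'e', 'n'};
        // 0xe9 is "é" in windows-1252; none of the five gap values occur.
        byte[] windows1252Like = {'d', 'e', 'n', (byte) 0xe9};

        System.out.println(hasBytesUndefinedInWindows1252(cp852Like));
        System.out.println(hasBytesUndefinedInWindows1252(windows1252Like));
    }
}
```

Because the check fires on a count of one, a cp852 file with even a single 
such byte is steered away from windows-1252.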

Not sure how best to handle this...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
