UTF16-LE not detected
---------------------

                 Key: TIKA-721
                 URL: https://issues.apache.org/jira/browse/TIKA-721
             Project: Tika
          Issue Type: Bug
          Components: parser
            Reporter: Michael McCandless
            Priority: Minor
         Attachments: Chinese_Simplified_utf16.txt

I have a test file encoded in UTF16-LE, but Tika fails to detect it.

Note that the file is missing the BOM, which is not allowed for
little-endian UTF-16 (for UTF16-BE the BOM is optional).
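
For reference, a minimal sketch of what a BOM check looks like in Python. The helper name `utf16_bom` and the sample string are made up for illustration and are not part of Tika or of the attached file. The check only inspects the first two bytes, so it would also match a UTF-32LE BOM, which begins with the same bytes.

```python
import codecs

def utf16_bom(data):
    """Return 'utf_16_le' or 'utf_16_be' if data starts with a UTF-16 BOM, else None."""
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return 'utf_16_le'
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return 'utf_16_be'
    return None

# Hypothetical sample: two CJK characters encoded little-endian, without a BOM.
sample = u'\u4e2d\u6587'.encode('utf_16_le')

print(utf16_bom(sample))                        # BOM-less input
print(utf16_bom(codecs.BOM_UTF16_LE + sample))  # same bytes with a BOM prepended
```

A file like the attached one falls into the first case, so a detector has nothing but the content bytes to go on.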

Not sure we can realistically fix this; I have no idea how...

Here's what Tika detects:

{noformat}
windows-1250:   confidence=9
windows-1250:   confidence=7
windows-1252:   confidence=7
windows-1252:   confidence=6
windows-1252:   confidence=5
IBM420_ltr:     confidence=4
windows-1252:   confidence=3
windows-1254:   confidence=2
windows-1250:   confidence=2
windows-1252:   confidence=2
IBM420_rtl:     confidence=1
windows-1253:   confidence=1
windows-1250:   confidence=1
windows-1252:   confidence=1
windows-1252:   confidence=1
windows-1252:   confidence=1
windows-1252:   confidence=1
windows-1252:   confidence=1
{noformat}

The test file decodes fine as UTF16-LE; e.g., in Python, just run this:

{noformat}
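import codecs
# Open in binary mode so the decoder receives the raw bytes of the file.
data = open('Chinese_Simplified_utf16.txt', 'rb').read()
text, consumed = codecs.getdecoder('utf_16_le')(data)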
{noformat}
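
A successful decode does not by itself identify the byte order, which may be part of why detection is hard here. The sketch below uses a made-up two-character sample (not the attached file) to show the same BOM-less bytes decoding without error under both byte orders, just to different text.

```python
# Hypothetical sample text; not taken from the attached file.
raw = u'\u4e2d\u6587'.encode('utf_16_le')   # bytes 2D 4E 87 65, no BOM

as_le = raw.decode('utf_16_le')   # the intended characters
as_be = raw.decode('utf_16_be')   # also decodes, but to different code points

print(repr(as_le))
print(repr(as_be))
```

Both calls succeed, so a detector would need some additional signal, such as character statistics, to prefer one interpretation over the other.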



        
