UTF16-LE not detected
---------------------

                 Key: TIKA-721
                 URL: https://issues.apache.org/jira/browse/TIKA-721
             Project: Tika
          Issue Type: Bug
          Components: parser
            Reporter: Michael McCandless
            Priority: Minor
         Attachments: Chinese_Simplified_utf16.txt

I have a test file encoded in UTF16-LE, but Tika fails to detect it.

Note that the file is missing the BOM, which is not allowed (for UTF16-BE the BOM is optional). I'm not sure we can realistically fix this; I have no idea how...

Here's what Tika detects:

{noformat}
windows-1250: confidence=9
windows-1250: confidence=7
windows-1252: confidence=7
windows-1252: confidence=6
windows-1252: confidence=5
IBM420_ltr: confidence=4
windows-1252: confidence=3
windows-1254: confidence=2
windows-1250: confidence=2
windows-1252: confidence=2
IBM420_rtl: confidence=1
windows-1253: confidence=1
windows-1250: confidence=1
windows-1252: confidence=1
windows-1252: confidence=1
windows-1252: confidence=1
windows-1252: confidence=1
windows-1252: confidence=1
{noformat}

The test file decodes fine as UTF16-LE; e.g. in Python just run this:

{noformat}
import codecs
# Open in binary mode so the decoder receives raw bytes rather than text.
with open('Chinese_Simplified_utf16.txt', 'rb') as f:
    text, consumed = codecs.getdecoder('utf_16_le')(f.read())
{noformat}
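
One possible direction for the "no idea how" part, offered only as a rough sketch and not as anything Tika does today: for BOM-less input, count the 16-bit code units that are very common in real text (space, line breaks, CJK punctuation) under each byte order, and prefer the order with more hits. The function name `guess_utf16_byte_order` and the `COMMON_UNITS` set are hypothetical, chosen for illustration. The heuristic only picks between the two byte orders; it does not establish that the input is UTF-16 in the first place.

```python
# Code units that are frequent in real text and turn into rare characters
# when their bytes are swapped: space, LF, CR, ideographic full stop,
# fullwidth comma. This set is an assumption for illustration only.
COMMON_UNITS = {0x0020, 0x000A, 0x000D, 0x3002, 0xFF0C}


def guess_utf16_byte_order(data: bytes):
    """Return 'utf-16-le', 'utf-16-be', or None if there is no clear winner."""
    usable = len(data) - (len(data) % 2)  # ignore a trailing odd byte
    le_hits = 0
    be_hits = 0
    for i in range(0, usable, 2):
        first, second = data[i], data[i + 1]
        # Little-endian: the second byte is the high byte of the code unit.
        if ((second << 8) | first) in COMMON_UNITS:
            le_hits += 1
        # Big-endian: the first byte is the high byte of the code unit.
        if ((first << 8) | second) in COMMON_UNITS:
            be_hits += 1
    if le_hits == be_hits:
        return None
    return "utf-16-le" if le_hits > be_hits else "utf-16-be"
```

For the attached file, this could be tried with `guess_utf16_byte_order(open('Chinese_Simplified_utf16.txt', 'rb').read())`. A real detector would also need to weigh this signal against the single-byte charset candidates listed above.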