[ https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated TIKA-721:
------------------------------------

    Attachment: TIKA-721.patch

Attached patch, using three simple heuristics:

First, I compute the count distribution of each of the 256 possible byte values for the even vs odd bytes, and compute the dot-product between those two (unit-length-normalized) vectors. For double-byte charsets, the dot-product will usually be low, because even vs odd bytes act very differently; for single-byte charsets it is usually very high (near 1.0).

Second, I decode all the bytes according to LE or BE, into UTF16 code units, and then count up basic stats: the number of valid and invalid surrogates, the number of valid and invalid code points.

Finally, for the valid code points, I count how many times each unicode block had a character; usually a doc will be in a single language and have a high percentage of its chars from a single block (I think!?).

Then I use simple heuristics from these stats to get a rough confidence. I made [educated] guesses for thresholds to set the confidence choices, having run on random files I have locally... but I'd really prefer to find a nice corpus somewhere to do a more thorough test.

> UTF16-LE not detected
> ---------------------
>
>                 Key: TIKA-721
>                 URL: https://issues.apache.org/jira/browse/TIKA-721
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: Chinese_Simplified_utf16.txt, TIKA-721.patch
>
>
> I have a test file encoded in UTF16-LE, but Tika fails to detect it.
> Note that it is missing the BOM, which is not allowed (for UTF16-BE
> the BOM is optional).
> Not sure we can realistically fix this; I have no idea how...
> Here's what Tika detects:
> {noformat}
> windows-1250: confidence=9
> windows-1250: confidence=7
> windows-1252: confidence=7
> windows-1252: confidence=6
> windows-1252: confidence=5
> IBM420_ltr: confidence=4
> windows-1252: confidence=3
> windows-1254: confidence=2
> windows-1250: confidence=2
> windows-1252: confidence=2
> IBM420_rtl: confidence=1
> windows-1253: confidence=1
> windows-1250: confidence=1
> windows-1252: confidence=1
> windows-1252: confidence=1
> windows-1252: confidence=1
> windows-1252: confidence=1
> windows-1252: confidence=1
> {noformat}
> The test file decodes fine as UTF16-LE; eg in Python just run this:
> {noformat}
> import codecs
> # Open in binary mode so the raw bytes reach the decoder unchanged.
> codecs.getdecoder('utf_16_le')(open('Chinese_Simplified_utf16.txt', 'rb').read())
> {noformat}
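
For illustration, here is a rough Python sketch of the first two heuristics described in the comment above: the even/odd byte histogram dot-product, and the valid/invalid surrogate counts for a given byte order. It is not the code in TIKA-721.patch. The function names `even_odd_similarity` and `surrogate_stats` are hypothetical, and no confidence thresholds are included because the comment does not state them.

```python
import math
import struct
from typing import Tuple


def even_odd_similarity(data: bytes) -> float:
    """Dot-product of the unit-normalized byte histograms at even vs odd offsets.

    Hypothetical helper: near 1.0 when even and odd bytes look alike
    (typical of single-byte charsets), lower when they behave differently.
    """
    even = [0] * 256
    odd = [0] * 256
    for i, b in enumerate(data):
        if i % 2 == 0:
            even[b] += 1
        else:
            odd[b] += 1
    norm_even = math.sqrt(sum(c * c for c in even))
    norm_odd = math.sqrt(sum(c * c for c in odd))
    if norm_even == 0 or norm_odd == 0:
        return 0.0  # too little data to compare
    dot = sum(e * o for e, o in zip(even, odd))
    return dot / (norm_even * norm_odd)


def surrogate_stats(data: bytes, little_endian: bool) -> Tuple[int, int]:
    """Return (valid_surrogate_pairs, invalid_surrogates) for one byte order.

    Hypothetical helper: a trailing odd byte, if any, is ignored.
    """
    n = len(data) // 2
    fmt = ("<" if little_endian else ">") + "%dH" % n
    units = struct.unpack(fmt, data[: 2 * n])
    valid = 0
    invalid = 0
    i = 0
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: must be followed by a low surrogate.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                valid += 1
                i += 2
                continue
            invalid += 1
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            invalid += 1
        i += 1
    return valid, invalid
```

A detector along these lines would presumably call `surrogate_stats` once with each byte order and compare the results, then combine them with `even_odd_similarity` and the per-block character counts; how those are weighted is left open here.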