[ https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119044#comment-13119044 ]
Michael McCandless commented on TIKA-721:
-----------------------------------------

{quote}
bq. Finally, for the valid code points, I count how many times each unicode block had a character; usually a doc will be in a single language and have a high percentage of its chars from a single block (I think!?).

I don't think this is a good idea: languages like Japanese use multiple blocks, and many writing systems (e.g. Cyrillic/Arabic/etc.) tend to use ASCII digits and punctuation...
{quote}

Hmm, but what this means is that for such docs the new detector gives a worse confidence than it "should". Ie it will result in false negatives, not false positives.

Maybe we can use "total number of unique blocks" somehow. For false matches I see lots of random blocks being used (a "long tail"), but for a good match, just a few.

{quote}
bq. If I decode to a Unicode code point, I then call Java's Character.isDefined to see if it's really valid

I don't think this is that great either: e.g. Java 6 supports a very old version of the Unicode standard (4.x), and that method will return false for completely valid newer Unicode characters.
{quote}

Is there a more accurate way to check validity? We can use the coarse checks from the FAQ, but that doesn't rule out much.

So this means newer Unicode docs (using chars added after Unicode 4.x) will be seen as invalid and we won't detect them. But this will also cause false negatives, not false positives... what pctg of the world's docs use the newer chars?

Maybe we'll have to couple language detection w/ UTF16 LE/BE detection to get better accuracy.

Remember we do no detection for UTF16 LE/BE at all now, and this patch would at least allow some (if not all) cases to be detected. So that'd be progress, even if it doesn't catch all the cases it should.
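For discussion's sake, the two signals above — rejecting code points that fail Character.isDefined, then counting how many distinct Unicode blocks the survivors fall into — could be sketched roughly like this (a hypothetical illustration only, not the code in the attached patch; the class and method names here are made up):

{noformat}
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: count distinct Unicode blocks among the
// code points that the running JRE considers defined.
public class BlockStats {
    public static int uniqueBlocks(String text) {
        Map<Character.UnicodeBlock, Integer> counts = new HashMap<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            // Caveat from the comment above: isDefined() reflects the
            // Unicode version the JRE ships with, so characters added
            // in newer Unicode versions are rejected here.
            if (!Character.isDefined(cp)) {
                continue;
            }
            Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
            if (block != null) {
                counts.merge(block, 1, Integer::sum);
            }
        }
        return counts.size();
    }
}
{noformat}

A "long tail" of random blocks from a false match would show up here as a large return value, while a genuine single-language doc would typically yield just one or two blocks (plus Basic Latin for digits/punctuation, per the objection above).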
It's the risk of false positives I'm more concerned about, ie where some other double-byte charset is correctly identified today but breaks when we commit this; that said, I produce fairly low confidence from the detector, except when I see valid surrogate pairs, so this *should* be rare. Still, I would really love to test against a corpus...

> UTF16-LE not detected
> ---------------------
>
>                 Key: TIKA-721
>                 URL: https://issues.apache.org/jira/browse/TIKA-721
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: Chinese_Simplified_utf16.txt, TIKA-721.patch
>
>
> I have a test file encoded in UTF16-LE, but Tika fails to detect it.
> Note that it is missing the BOM, which is not allowed (for UTF16-BE
> the BOM is optional).
> Not sure we can realistically fix this; I have no idea how...
> Here's what Tika detects:
> {noformat}
> windows-1250: confidence=9
> windows-1250: confidence=7
> windows-1252: confidence=7
> windows-1252: confidence=6
> windows-1252: confidence=5
> IBM420_ltr: confidence=4
> windows-1252: confidence=3
> windows-1254: confidence=2
> windows-1250: confidence=2
> windows-1252: confidence=2
> IBM420_rtl: confidence=1
> windows-1253: confidence=1
> windows-1250: confidence=1
> windows-1252: confidence=1
> windows-1252: confidence=1
> windows-1252: confidence=1
> windows-1252: confidence=1
> windows-1252: confidence=1
> {noformat}
> The test file decodes fine as UTF16-LE; eg in Python just run this:
> {noformat}
> import codecs
> codecs.getdecoder('utf_16_le')(open('Chinese_Simplified_utf16.txt').read())
> {noformat}
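The surrogate-pair signal mentioned at the top of this comment can be sketched as follows (again a hypothetical illustration, not the patch itself): decode the raw bytes as UTF-16 in a given byte order and count well-formed high/low surrogate pairs. Random non-UTF-16 bytes rarely line up into matched pairs, which is why seeing them justifies a higher confidence.

{noformat}
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.CharBuffer;

// Hypothetical sketch: count valid surrogate pairs when the bytes
// are interpreted as UTF-16 with the given byte order.
public class SurrogateSignal {
    public static int countValidPairs(byte[] bytes, ByteOrder order) {
        CharBuffer chars = ByteBuffer.wrap(bytes).order(order).asCharBuffer();
        int pairs = 0;
        while (chars.remaining() >= 2) {
            char hi = chars.get();
            if (Character.isHighSurrogate(hi)
                    && Character.isLowSurrogate(chars.get(chars.position()))) {
                chars.get(); // consume the matching low surrogate
                pairs++;
            }
        }
        return pairs;
    }
}
{noformat}

Decoding the same supplementary-plane text with the wrong byte order scrambles the surrogate ranges, so the count drops to (near) zero — which is the asymmetry that makes this a fairly safe confidence boost.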