[ https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15405785#comment-15405785 ]
Tim Allison edited comment on TIKA-721 at 8/3/16 12:03 PM:
-----------------------------------------------------------

While working on TIKA-2038, I found that ICU4J is now correctly identifying this file. If we add a stripper to ignore the contents of <script>/<style> elements, we might consider promoting ICU4J to run before UniversalChardet. This would currently, in effect, turn off UniversalChardet, IIRC, because I think ICU4J is guaranteed to return a non-null value.

A general test corpus would be great. If we follow the approach and test corpus of [~faghani], we should be able to evaluate the results of potential changes at least against the encodings in his corpus. We also have a decent number of files in our regression corpus (TIKA-1302); most are depressingly English and/or UTF-8. We could augment Shabanali's corpus by transcoding to UTF-8/UTF-16, etc.

Proposed eval approach 1 ([~faghani]'s approach): assume the actual HTTP header or the http-equiv meta header is accurate [1], run ICU4J and UniversalChardet against the files, and compare the results with the declared charset.

Proposed eval approach 2: compare potential changes against the current method. Run our tika-eval module (TIKA-1332) against the output and evaluate a random sample of files that have differing contents.

[1] This generally gives me great pause, but random sampling suggests it is reasonable for Shabanali's corpus.
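Eval approach 1 can be sketched in a few lines of stdlib Python. This is hypothetical scaffolding, not anything in Tika: ICU4J and UniversalChardet are Java libraries, so the `detect` argument below is just a stand-in callable, and `meta_charset`/`score_detector` are names invented for the sketch. It shows the scoring loop: pull the declared charset out of the http-equiv/meta tag, treat it as ground truth, and tally detector agreement.

```python
import re

# Matches <meta charset="..."> as well as
# <meta http-equiv="Content-Type" content="text/html; charset=...">
# in the raw, undecoded bytes of an HTML file.
_META_RE = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)',
                      re.IGNORECASE)

def meta_charset(raw):
    """Return the charset declared in the document's meta tags, or None."""
    m = _META_RE.search(raw[:4096])  # declarations live near the top
    return m.group(1).decode('ascii', 'replace').lower() if m else None

def score_detector(corpus, detect):
    """corpus: iterable of raw byte strings; detect: bytes -> charset name.
    Treats the meta-declared charset as ground truth (eval approach 1)
    and returns (agreements, files_with_a_declaration)."""
    hits = total = 0
    for raw in corpus:
        declared = meta_charset(raw)
        if declared is None:
            continue  # no declared charset -> cannot score this file
        total += 1
        if detect(raw).lower() == declared:
            hits += 1
    return hits, total
```

Files without a declared charset are skipped rather than counted as misses, which keeps the denominator honest but also means the score says nothing about undeclared files.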
> UTF16-LE not detected
> ---------------------
>
>                 Key: TIKA-721
>                 URL: https://issues.apache.org/jira/browse/TIKA-721
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: Chinese_Simplified_utf16.txt, TIKA-721.patch
>
> I have a test file encoded in UTF16-LE, but Tika fails to detect it.
> Note that it is missing the BOM, which is not allowed (for UTF16-BE
> the BOM is optional).
> Not sure we can realistically fix this; I have no idea how...
> Here's what Tika detects:
> {noformat}
> windows-1250: confidence=9
> windows-1250: confidence=7
> windows-1252: confidence=7
> windows-1252: confidence=6
> windows-1252: confidence=5
> IBM420_ltr: confidence=4
> windows-1252: confidence=3
> windows-1254: confidence=2
> windows-1250: confidence=2
> windows-1252: confidence=2
> IBM420_rtl: confidence=1
> windows-1253: confidence=1
> windows-1250: confidence=1
> windows-1252: confidence=1
> windows-1252: confidence=1
> windows-1252: confidence=1
> windows-1252: confidence=1
> windows-1252: confidence=1
> {noformat}
> The test file decodes fine as UTF16-LE; e.g. in Python just run this:
> {noformat}
> import codecs
> codecs.getdecoder('utf_16_le')(open('Chinese_Simplified_utf16.txt').read())
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
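One aside on the quoted issue: BOM-less UTF-16LE is trivial to decode once you know the encoding; the hard part is guessing it. For Latin-script text, the zero bytes at odd offsets are a strong giveaway, but for CJK text like the attached Chinese_Simplified_utf16.txt both bytes of each code unit are usually non-zero, so even that crude signal disappears. A stdlib-only illustration (this is not Tika's or ICU4J's actual algorithm, just a sketch of why the statistical detectors struggle):

```python
def odd_null_ratio(raw):
    """Fraction of odd-offset bytes that are 0x00.

    Near 1.0 strongly suggests BOM-less UTF-16LE Latin text; for CJK
    text both bytes of each code unit are usually non-zero, so this
    naive heuristic sees nothing."""
    odd = raw[1::2]
    return odd.count(0) / len(odd) if odd else 0.0

latin = 'hello, world'.encode('utf-16-le')    # no BOM
chinese = '简体中文测试'.encode('utf-16-le')    # no BOM either

# Decoding works fine when the encoding is known up front.
assert latin.decode('utf-16-le') == 'hello, world'

print(odd_null_ratio(latin))    # 1.0 -- screams UTF-16LE
print(odd_null_ratio(chinese))  # 0.0 -- heuristic is blind here
```

This is why a byte-frequency detector hands the Chinese test file to windows-125x guesses with low confidence: without the BOM or a null-byte pattern, the byte stream carries little that distinguishes it from a single-byte encoding.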