[ https://issues.apache.org/jira/browse/TIKA-720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107977#comment-13107977 ]
Nick Burch commented on TIKA-720:
---------------------------------

A few IBM-specific encodings are already supported in CharsetRecog_sbcs, but it looks like this one is missing.

We'll need to find some suitable detection ngrams, which shouldn't be too hard: as I recall, EBCDIC puts a-z, A-Z and 0-9 in very different places from ASCII / the ISO 8859 formats.

> EBCDIC encoding not detected
> ----------------------------
>
>                 Key: TIKA-720
>                 URL: https://issues.apache.org/jira/browse/TIKA-720
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: English_EBCDIC.txt
>
>
> I have a test file encoded in EBCDIC, but Tika fails to detect it.
> Not sure we can realistically fix this; I have no idea how (and,
> realistically, one really ought to convert out of EBCDIC on export
> from a mainframe...).
> Here's what Tika detects:
> {noformat}
> Shift_JIS: confidence=51
> Big5: confidence=40
> GB18030: confidence=10
> KOI8-R: confidence=5
> windows-1252: confidence=5
> windows-1253: confidence=2
> IBM866: confidence=1
> windows-1251: confidence=1
> windows-1250: confidence=1
> {noformat}
> The test file decodes fine as cp500; e.g. in Python just run this:
> {noformat}
> import codecs
> # Open in binary mode so the decoder receives raw bytes.
> codecs.getdecoder('cp500')(open('English_EBCDIC.txt', 'rb').read())
> {noformat}
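
To illustrate the point about EBCDIC placing letters and digits in different positions from ASCII, here is a small Python 3 sketch. It uses only the standard cp500 and ascii codecs. The `high_byte_ratio` helper is a hypothetical illustration of why the byte distribution differs so much; it is not how CharsetRecog_sbcs works and is not a proposed detector.

```python
# Compare where letters, digits and the space character sit in
# EBCDIC (cp500) versus ASCII, using Python's built-in codecs.
SAMPLE = "azAZ09 "

ebcdic_positions = list(SAMPLE.encode("cp500"))
ascii_positions = list(SAMPLE.encode("ascii"))

for ch, e, a in zip(SAMPLE, ebcdic_positions, ascii_positions):
    print(f"{ch!r}: cp500=0x{e:02X}  ascii=0x{a:02X}")


def high_byte_ratio(data: bytes) -> float:
    """Hypothetical helper: fraction of bytes with the high bit set.

    In cp500 every letter and digit is >= 0x81, whereas plain ASCII
    text contains no bytes >= 0x80 at all.
    """
    if not data:
        return 0.0
    return sum(1 for b in data if b >= 0x80) / len(data)


text = "Hello World 123"
ratio_ebcdic = high_byte_ratio(text.encode("cp500"))
ratio_ascii = high_byte_ratio(text.encode("ascii"))
print(f"cp500 high-byte ratio: {ratio_ebcdic:.2f}")
print(f"ascii high-byte ratio: {ratio_ascii:.2f}")
```

In cp500 the lowercase letters start at 0x81, the uppercase letters at 0xC1, the digits at 0xF0, and the space is 0x40, so English text is dominated by high bytes. That may be part of why the existing detectors guess multi-byte encodings such as Shift_JIS and Big5 for the attached file, though this has not been verified here.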