[ https://issues.apache.org/jira/browse/TIKA-720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112592#comment-13112592 ]
Nick Burch commented on TIKA-720:
---------------------------------

I've spent a bit of time studying the code (which comes from icu4j), and I think I know roughly how it works.

I've sent an email to the icu mailing list asking for some clarifications, though; hopefully, armed with the answers, we can add this support.

In the meantime, do you have some more sample files we could use for testing/ngram identification? Especially interesting would be ones in other variants of EBCDIC.

> EBCDIC encoding not detected
> ----------------------------
>
>                 Key: TIKA-720
>                 URL: https://issues.apache.org/jira/browse/TIKA-720
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: English_EBCDIC.txt
>
>
> I have a test file encoded in EBCDIC, but Tika fails to detect it.
> Not sure we can realistically fix this; I have no idea how (and,
> realistically, one really ought to convert out of EBCDIC on export
> from a mainframe...).
> Here's what Tika detects:
> {noformat}
> Shift_JIS: confidence=51
> Big5: confidence=40
> GB18030: confidence=10
> KOI8-R: confidence=5
> windows-1252: confidence=5
> windows-1253: confidence=2
> IBM866: confidence=1
> windows-1251: confidence=1
> windows-1250: confidence=1
> {noformat}
> The test file decodes fine as cp500; e.g. in Python just run this:
> {noformat}
> import codecs
> # Open in binary mode so the raw EBCDIC bytes reach the decoder unchanged.
> codecs.getdecoder('cp500')(open('English_EBCDIC.txt', 'rb').read())
> {noformat}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
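
For anyone collecting further sample files, a rough sanity check along these lines may help confirm that a file really is cp500 before attaching it. This is only a sketch under stated assumptions: the function name looks_like_ebcdic, the 0.9 threshold, and the set of "plausible" characters are made up for illustration, and this is not how Tika or icu4j detect charsets (they rely on ngram statistics). It simply decodes the bytes as cp500 and measures how much of the result looks like plain English text.

```python
import string

# Characters we would expect to dominate ordinary English text.
# This set is an illustrative assumption, not taken from Tika or icu4j.
PLAUSIBLE = set(string.ascii_letters + string.digits + " .,;:'\"!?-()\r\n\t")


def looks_like_ebcdic(data, threshold=0.9):
    """Return True if `data` (bytes) reads as English text when decoded as cp500.

    Hypothetical helper: the threshold is arbitrary and only suits
    English-language samples.
    """
    if not data:
        return False
    # cp500 maps every byte value, so this decode cannot raise.
    text = data.decode('cp500')
    plausible = sum(1 for ch in text if ch in PLAUSIBLE)
    return plausible / float(len(text)) >= threshold
```

A possible use would be looks_like_ebcdic(open('English_EBCDIC.txt', 'rb').read()). ASCII or UTF-8 English text should fail the check, because its bytes decode under cp500 to mostly accented letters and control characters.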