EBCDIC encoding not detected
----------------------------

                 Key: TIKA-720
                 URL: https://issues.apache.org/jira/browse/TIKA-720
             Project: Tika
          Issue Type: Bug
          Components: parser
            Reporter: Michael McCandless
            Priority: Minor

I have a test file encoded in EBCDIC, but Tika fails to detect it.

Not sure we can realistically fix this; I have no idea how (and really, one ought to convert out of EBCDIC on export from a mainframe...).

Here's what Tika detects:

{noformat}
Shift_JIS: confidence=51
Big5: confidence=40
GB18030: confidence=10
KOI8-R: confidence=5
windows-1252: confidence=5
windows-1253: confidence=2
IBM866: confidence=1
windows-1251: confidence=1
windows-1250: confidence=1
{noformat}

The test file decodes fine as cp500; e.g., in Python just run this:

{noformat}
import codecs
# Open in binary mode: the decoder takes bytes and returns (text, length)
codecs.getdecoder('cp500')(open('English_EBCDIC.txt', 'rb').read())
{noformat}

-- 
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
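
One rough way to check the "decodes fine as cp500" observation without eyeballing the output is to measure how much of the decoded text is printable ASCII. The sketch below is only an illustration of that idea, not how Tika detects encodings; `printable_ratio` is a hypothetical helper name, and the sample bytes are generated in the script rather than taken from the attached test file.

```python
import string

# Hypothetical helper (not part of Tika): the fraction of characters that
# are printable ASCII after decoding `data` with the given encoding.
PRINTABLE = set(string.printable)


def printable_ratio(data: bytes, encoding: str = "cp500") -> float:
    if not data:
        return 0.0
    # cp500 is a single-byte codec, so every byte maps to one character;
    # errors="replace" only matters for other encodings.
    text = data.decode(encoding, errors="replace")
    return sum(ch in PRINTABLE for ch in text) / len(text)


# Stand-in for the contents of an EBCDIC file (assumption: English text).
sample = "Hello, EBCDIC world.".encode("cp500")

print("cp500:  ", printable_ratio(sample, "cp500"))
print("latin-1:", printable_ratio(sample, "latin-1"))
```

For English text the cp500 score should be close to 1, while the same bytes read as latin-1 score much lower, because EBCDIC letters sit in the 0x81-0xE9 range. That only separates EBCDIC from ASCII-compatible encodings on mostly-ASCII content, so treat it as a sanity check rather than a detector.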