[ 
https://issues.apache.org/jira/browse/TIKA-720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112592#comment-13112592
 ] 

Nick Burch commented on TIKA-720:
---------------------------------

I've spent a bit of time studying the code (which comes from icu4j), and I 
think I know roughly how it works.

I've sent an email to the ICU mailing list asking for some clarifications, 
though. Hopefully, armed with the answers, we can add this support.

In the meantime, do you have some more sample files we could use for 
testing/ngram identification? Ones in other variants of EBCDIC would be 
especially interesting.
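
For anyone putting sample files together, here is a rough sketch of what 
"other variants" means at the byte level. It uses only the Python standard 
library; the sample sentence and the variable names are made up for 
illustration and have nothing to do with the icu4j detector code. It encodes 
the same text as cp037 and cp500, lists the characters whose bytes differ, 
and shows how heavily the bytes sit at 0x80 and above. That last point may 
be part of why the multi-byte detectors score highest, but that is a guess.

```python
from collections import Counter

# Hypothetical sample text; any English sentence with brackets would do.
sample = "The quick brown fox jumps over the lazy dog [42]!"

# Two EBCDIC code pages that ship with Python's standard codecs.
encoded = {name: sample.encode(name) for name in ("cp037", "cp500")}

# Byte statistics for the cp500 encoding.
data = encoded["cp500"]
high_fraction = sum(b >= 0x80 for b in data) / len(data)
most_common_byte, _ = Counter(data).most_common(1)[0]

# Characters that the two code pages map to different byte values.
differing = sorted(
    {sample[i]
     for i, (a, b) in enumerate(zip(encoded["cp037"], encoded["cp500"]))
     if a != b}
)

print("fraction of bytes >= 0x80:", round(high_fraction, 2))
print("most common byte: 0x%02X" % most_common_byte)  # EBCDIC space is 0x40
print("characters encoded differently:", differing)
```

Letters and digits come out identical in both code pages, so sample files 
that include punctuation such as brackets would be the most useful for 
telling the variants apart.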

> EBCDIC encoding not detected
> ----------------------------
>
>                 Key: TIKA-720
>                 URL: https://issues.apache.org/jira/browse/TIKA-720
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: English_EBCDIC.txt
>
>
> I have a test file encoded in EBCDIC, but Tika fails to detect it.
> Not sure we can realistically fix this; I have no idea how (and,
> realistically, one really ought to convert out of EBCDIC on export
> from a mainframe...).
> Here's what Tika detects:
> {noformat}
> Shift_JIS:      confidence=51
> Big5:           confidence=40
> GB18030:        confidence=10
> KOI8-R:         confidence=5
> windows-1252:   confidence=5
> windows-1253:   confidence=2
> IBM866:         confidence=1
> windows-1251:   confidence=1
> windows-1250:   confidence=1
> {noformat}
> The test file decodes fine as cp500; e.g., in Python just run this:
> {noformat}
> import codecs
> codecs.getdecoder('cp500')(open('English_EBCDIC.txt', 'rb').read())
> {noformat}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
