[ 
https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated TIKA-721:
------------------------------------

    Attachment: TIKA-721.patch

Attached patch, using three simple heuristics:

First, I compute the count distribution of each of the 256 possible
byte values for the even vs odd bytes, and compute the dot product
between those two (unit-length-normalized) vectors.  For double-byte
charsets, the dot product will usually be low, because even vs odd
bytes act very differently; for single-byte charsets it will usually
be very high (near 1.0).
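
As a rough illustration of this first heuristic (not the code in the
attached patch, which is Java), here is a minimal Python sketch.  The
function name even_odd_similarity is made up for this example:

```python
import math


def even_odd_similarity(data: bytes) -> float:
    """Cosine similarity of the byte-value histograms at even vs odd offsets.

    Hypothetical helper for illustration; not taken from TIKA-721.patch.
    """
    even = [0] * 256
    odd = [0] * 256
    for i, b in enumerate(data):
        if i % 2 == 0:
            even[b] += 1
        else:
            odd[b] += 1

    # Normalizing both vectors to unit length makes the dot product a
    # cosine similarity in the range [0.0, 1.0].
    norm_even = math.sqrt(sum(c * c for c in even))
    norm_odd = math.sqrt(sum(c * c for c in odd))
    if norm_even == 0 or norm_odd == 0:
        return 0.0  # too little data to compare
    dot = sum(e * o for e, o in zip(even, odd))
    return dot / (norm_even * norm_odd)
```

For ASCII text encoded as UTF16-LE every odd byte is 0x00 while the
even bytes are letters, so the two histograms do not overlap and the
result is 0.0.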

Second, I decode all the bytes according to LE or BE, into UTF16 code
units, and then count up basic stats: the number of valid and invalid
surrogates, the number of valid and invalid code points.
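
A sketch of what this second pass might look like in Python (again,
not the patch itself).  The name utf16_stats and the dict keys are
invented here, and treating "invalid code point" as "unassigned
(general category Cn)" is my assumption; the patch may define it
differently:

```python
import unicodedata


def utf16_stats(data: bytes, little_endian: bool = True) -> dict:
    """Count surrogate and code point stats for bytes read as UTF-16.

    Hypothetical helper; "invalid code point" is assumed to mean an
    unassigned code point (category Cn).
    """
    order = "little" if little_endian else "big"
    # Split into 16-bit code units, ignoring a trailing odd byte.
    units = [int.from_bytes(data[i:i + 2], order)
             for i in range(0, len(data) - 1, 2)]
    stats = {"valid_surrogates": 0, "invalid_surrogates": 0,
             "valid_code_points": 0, "invalid_code_points": 0}
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                stats["valid_surrogates"] += 1
                cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
                i += 2
            else:
                stats["invalid_surrogates"] += 1
                i += 1
                continue
        elif 0xDC00 <= u <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            stats["invalid_surrogates"] += 1
            i += 1
            continue
        else:
            cp = u
            i += 1
        # A valid surrogate pair also counts as one code point here.
        if unicodedata.category(chr(cp)) == "Cn":
            stats["invalid_code_points"] += 1
        else:
            stats["valid_code_points"] += 1
    return stats
```

Running it once with little_endian=True and once with False gives the
two sets of stats to compare.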

Finally, for the valid code points, I count how many times each
Unicode block had a character; usually a doc will be in a single
language and have a high percentage of its chars from a single block
(I think!?).
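
The block counting could be sketched in Python roughly as follows.
Python has no built-in block lookup, so this uses a small hand-written
table covering only a few blocks (everything else is lumped into
"Other"); the names block_of and dominant_block are hypothetical:

```python
from bisect import bisect_right
from collections import Counter

# Abbreviated table of (start, end, name), sorted by start.  A real
# implementation would cover every block in the Unicode Blocks.txt file.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0590, 0x05FF, "Hebrew"),
    (0x0600, 0x06FF, "Arabic"),
    (0x3000, 0x303F, "CJK Symbols and Punctuation"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xAC00, 0xD7AF, "Hangul Syllables"),
]
STARTS = [start for start, _, _ in BLOCKS]


def block_of(cp: int) -> str:
    """Return the block name for a code point, or "Other" if not in the table."""
    i = bisect_right(STARTS, cp) - 1
    if i >= 0 and cp <= BLOCKS[i][1]:
        return BLOCKS[i][2]
    return "Other"


def dominant_block(text: str):
    """Return (block name, fraction of chars) for the most common block."""
    counts = Counter(block_of(ord(ch)) for ch in text)
    if not counts:
        return None, 0.0
    name, count = counts.most_common(1)[0]
    return name, count / len(text)
```

A high fraction for a single block under one byte order, and a
scattered distribution under the other, would hint at which decoding
is the plausible one.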

Then I use simple heuristics from these stats to get a rough
confidence.  I made [educated] guesses for thresholds to set the
confidence choices, having run on random files I have locally... but
I'd really prefer to find a nice corpus somewhere to do a more
thorough test.

                
> UTF16-LE not detected
> ---------------------
>
>                 Key: TIKA-721
>                 URL: https://issues.apache.org/jira/browse/TIKA-721
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: Chinese_Simplified_utf16.txt, TIKA-721.patch
>
>
> I have a test file encoded in UTF16-LE, but Tika fails to detect it.
> Note that it is missing the BOM, which is not allowed (for UTF16-BE
> the BOM is optional).
> Not sure we can realistically fix this; I have no idea how...
> Here's what Tika detects:
> {noformat}
> windows-1250:   confidence=9
> windows-1250:   confidence=7
> windows-1252:   confidence=7
> windows-1252:   confidence=6
> windows-1252:   confidence=5
> IBM420_ltr:     confidence=4
> windows-1252:   confidence=3
> windows-1254:   confidence=2
> windows-1250:   confidence=2
> windows-1252:   confidence=2
> IBM420_rtl:     confidence=1
> windows-1253:   confidence=1
> windows-1250:   confidence=1
> windows-1252:   confidence=1
> windows-1252:   confidence=1
> windows-1252:   confidence=1
> windows-1252:   confidence=1
> windows-1252:   confidence=1
> {noformat}
> The test file decodes fine as UTF16-LE; e.g., in Python just run this:
> {noformat}
> import codecs
> codecs.getdecoder('utf_16_le')(open('Chinese_Simplified_utf16.txt', 'rb').read())
> {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
