[ https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119044#comment-13119044 ]
Michael McCandless commented on TIKA-721:
-----------------------------------------

{quote}
bq. Finally, for the valid code points, I count how many times each unicode block had a character; usually a doc will be in a single language and have a high percentage of its chars from a single block (I think!?).

I don't think this is a good idea: languages like Japanese use multiple blocks, and many writing systems (e.g. Cyrillic/Arabic/etc.) tend to use ASCII digits and punctuation...
{quote}

Hmm, but what this means is that for such docs the new detector gives a worse confidence than it "should". Ie it will result in false negatives, not false positives.

Maybe we can use "total number of unique blocks" somehow. For false matches I see lots of random blocks being used (a "long tail"), but for a good match, just a few.

{quote}
bq. If I decode to a Unicode code point, I then call Java's Character.isDefined to see if it's really valid

I don't think this is that great either: e.g. Java 6 supports a very old version of the Unicode standard (4.x), and that method will return false for completely valid newer Unicode characters.
{quote}

Is there a more accurate way to check validity? We can use the coarse checks from the FAQ, but that doesn't rule out much.

So this means newer Unicode docs (using chars added after Unicode 4.x) will be seen as invalid and we won't detect them. But this will also cause false negatives, not false positives... what pctg of the world's docs use the newer chars?

Maybe we'll have to couple language detection w/ UTF16 LE/BE detection to get better accuracy.

Remember we do no detection for UTF16 LE/BE at all now, and this patch would at least allow some (if not all) cases to be detected. So that'd be progress, even if it doesn't catch all the cases it should.
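For discussion's sake, the two signals above — rejecting code points that fail Character.isDefined, then counting how many distinct Unicode blocks the survivors fall into — could be sketched roughly like this (a hypothetical illustration only, not the code in the attached patch; the class and method names here are made up):

{noformat}
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: count distinct Unicode blocks among the
// code points that the running JRE considers defined.
public class BlockStats {
    public static int uniqueBlocks(String text) {
        Map<Character.UnicodeBlock, Integer> counts = new HashMap<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            // Caveat from the comment above: isDefined() reflects the
            // Unicode version the JRE ships with, so characters added
            // in newer Unicode versions are rejected here.
            if (!Character.isDefined(cp)) {
                continue;
            }
            Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
            if (block != null) {
                counts.merge(block, 1, Integer::sum);
            }
        }
        return counts.size();
    }
}
{noformat}

A "long tail" of random blocks from a false match would show up here as a large return value, while a genuine single-language doc would typically yield just one or two blocks (plus Basic Latin for digits/punctuation, per the objection above).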
It's the risk of false positives I'm more concerned about, ie where some other double-byte charset is correctly identified today but breaks when we commit this; that said, I produce fairly low confidence from the detector, except when I see valid surrogate pairs, so this *should* be rare. Still, I would really love to test against a corpus...

> UTF16-LE not detected
> ---------------------
>
>                 Key: TIKA-721
>                 URL: https://issues.apache.org/jira/browse/TIKA-721
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: Chinese_Simplified_utf16.txt, TIKA-721.patch
>
>
> I have a test file encoded in UTF16-LE, but Tika fails to detect it.
> Note that it is missing the BOM, which is not allowed (for UTF16-BE
> the BOM is optional).
> Not sure we can realistically fix this; I have no idea how...
> Here's what Tika detects:
> {noformat}
> windows-1250: confidence=9
> windows-1250: confidence=7
> windows-1252: confidence=7
> windows-1252: confidence=6
> windows-1252: confidence=5
> IBM420_ltr: confidence=4
> windows-1252: confidence=3
> windows-1254: confidence=2
> windows-1250: confidence=2
> windows-1252: confidence=2
> IBM420_rtl: confidence=1
> windows-1253: confidence=1
> windows-1250: confidence=1
> windows-1252: confidence=1
> windows-1252: confidence=1
> windows-1252: confidence=1
> windows-1252: confidence=1
> windows-1252: confidence=1
> {noformat}
> The test file decodes fine as UTF16-LE; eg in Python just run this:
> {noformat}
> import codecs
> codecs.getdecoder('utf_16_le')(open('Chinese_Simplified_utf16.txt').read())
> {noformat}
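The surrogate-pair signal mentioned at the top of this comment can be sketched as follows (again a hypothetical illustration, not the patch itself): decode the raw bytes as UTF-16 in a given byte order and count well-formed high/low surrogate pairs. Random non-UTF-16 bytes rarely line up into matched pairs, which is why seeing them justifies a higher confidence.

{noformat}
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.CharBuffer;

// Hypothetical sketch: count valid surrogate pairs when the bytes
// are interpreted as UTF-16 with the given byte order.
public class SurrogateSignal {
    public static int countValidPairs(byte[] bytes, ByteOrder order) {
        CharBuffer chars = ByteBuffer.wrap(bytes).order(order).asCharBuffer();
        int pairs = 0;
        while (chars.remaining() >= 2) {
            char hi = chars.get();
            if (Character.isHighSurrogate(hi)
                    && Character.isLowSurrogate(chars.get(chars.position()))) {
                chars.get(); // consume the matching low surrogate
                pairs++;
            }
        }
        return pairs;
    }
}
{noformat}

Decoding the same supplementary-plane text with the wrong byte order scrambles the surrogate ranges, so the count drops to (near) zero — which is the asymmetry that makes this a fairly safe confidence boost.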