Tim Allison created TIKA-3147:
---------------------------------

             Summary: Strip punctuation in lang id component within tika-eval
                 Key: TIKA-3147
                 URL: https://issues.apache.org/jira/browse/TIKA-3147
             Project: Tika
          Issue Type: Task
            Reporter: Tim Allison


I noticed that "the quick brown fox jumped over the lazy dog" was identified as 
English in tika-eval.  However, if I added semi-colons, it was identified as 
Chinese.

This is consistent with what I've recently seen on a new batch of technical 
documents that are mostly numbers and abbreviations; these are being 
identified as Chinese.

Ideally, we'd strip out punctuation both while building the models and while 
running language id.

As a short term fix, we should at least strip out punctuation during detection.
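
A minimal sketch of what the detection-time stripping could look like. The 
class and method names here (PunctuationStripper, strip) are hypothetical and 
not existing tika-eval code; the idea is only to normalize the text before it 
is handed to whatever language detector is in use.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of tika-eval: cleans text before language id.
final class PunctuationStripper {

    // \p{P} matches any character in the Unicode "Punctuation" category.
    private static final Pattern PUNCTUATION = Pattern.compile("\\p{P}+");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    private PunctuationStripper() {}

    // Replace each run of punctuation with a space, then collapse whitespace
    // so that removed characters do not leave gaps or glue tokens together.
    static String strip(String text) {
        String noPunct = PUNCTUATION.matcher(text).replaceAll(" ");
        return WHITESPACE.matcher(noPunct).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String input = "the; quick; brown; fox; jumped; over; the; lazy; dog;";
        // The cleaned string is what would be passed on to the detector.
        System.out.println(strip(input));
    }
}
```

Note that \p{P} does not cover symbol characters such as "+" or "$" (Unicode 
category S), nor digits, so number-heavy documents may need further handling 
beyond this sketch.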



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
