Tim Allison created TIKA-3147:
---------------------------------
Summary: Strip punctuation in lang id component within tika-eval
Key: TIKA-3147
URL: https://issues.apache.org/jira/browse/TIKA-3147
Project: Tika
Issue Type: Task
Reporter: Tim Allison
I noticed that "the quick brown fox jumped over the lazy dog" was identified as
English in tika-eval. However, if I added semi-colons, it was identified as
Chinese.
This is consistent with what I've recently seen on a new batch of technical
documents that are mostly numbers and abbreviations: these are being
identified as Chinese.
Ideally, we'd strip out punctuation both while building the models and while
running language id.
As a short term fix, we should at least strip out punctuation during detection.
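A minimal sketch of what the short-term fix could look like, assuming the text
is available as a plain String before it reaches the language detector. The
class and method names (PunctuationStripper, strip) are hypothetical and are
not taken from tika-eval; the regex uses the Unicode punctuation category
\p{P}, which covers semi-colons but not symbols such as "$" or "+".

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of tika-eval: pre-processes text before lang id.
public class PunctuationStripper {
    // \p{P} matches any character in a Unicode punctuation category.
    private static final Pattern PUNCT = Pattern.compile("\\p{P}+");
    private static final Pattern SPACES = Pattern.compile("\\s+");

    // Replace runs of punctuation with a space, then collapse whitespace.
    static String strip(String text) {
        String noPunct = PUNCT.matcher(text).replaceAll(" ");
        return SPACES.matcher(noPunct).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // The cleaned string would then be passed to the language detector.
        System.out.println(strip("the; quick; brown; fox; jumped;"));
    }
}
```

Replacing punctuation with a space rather than deleting it keeps tokens such
as "fox;jumped" from being fused into a single word.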
--
This message was sent by Atlassian Jira
(v8.3.4#803005)