Language detection is weak.
---------------------------
Key: TIKA-209
URL: https://issues.apache.org/jira/browse/TIKA-209
Project: Tika
Issue Type: Bug
Affects Versions: 0.3
Reporter: Robert Newson
In org.apache.tika.utils.Utils, the getUTF8Reader method assigns a language
determination without checking the confidence rating from ICU's CharsetDetector.
Please add a configurable confidence threshold (0-100), for example:
    if (language != null && match.getConfidence() > THRESHOLD) {
        metadata.set(Metadata.CONTENT_LANGUAGE, match.getLanguage());
        metadata.set(Metadata.LANGUAGE, match.getLanguage());
    }
Obviously, using the charset to imply the language is generally weak, but it
would be sufficient if the confidence threshold were configurable. Today, for
example, the text "hello" is tagged as French.
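
To make the request concrete, here is a minimal, self-contained sketch of the
confidence-gated check. It is not the Tika implementation. The Match interface
is a hypothetical stand-in for the two ICU CharsetMatch getters used above
(getLanguage and getConfidence), and the class, method, and parameter names are
assumptions made for illustration only.

```java
import java.util.Optional;

/** Sketch of confidence-gated language tagging; all names here are hypothetical. */
final class LanguageThreshold {

    /** Stand-in for the two ICU CharsetMatch getters referenced in this issue. */
    interface Match {
        String getLanguage();  // language code, or null when none was determined
        int getConfidence();   // detector confidence on a 0-100 scale
    }

    /** Simple immutable Match, used only to exercise the sketch. */
    static final class SimpleMatch implements Match {
        private final String language;
        private final int confidence;

        SimpleMatch(String language, int confidence) {
            this.language = language;
            this.confidence = confidence;
        }

        @Override public String getLanguage() { return language; }
        @Override public int getConfidence() { return confidence; }
    }

    private LanguageThreshold() {}

    /**
     * Returns the detected language only when the match's confidence is
     * strictly greater than the threshold; otherwise returns empty, so the
     * caller leaves the language metadata unset.
     */
    static Optional<String> languageIfConfident(Match match, int threshold) {
        if (threshold < 0 || threshold > 100) {
            throw new IllegalArgumentException("threshold must be in 0-100: " + threshold);
        }
        if (match == null) {
            return Optional.empty();
        }
        String language = match.getLanguage();
        if (language != null && match.getConfidence() > threshold) {
            return Optional.of(language);
        }
        return Optional.empty();
    }
}
```

With a threshold of 50, a match reporting "fr" at confidence 15 (chosen here
only as an example value) would yield an empty result, so a short input such
as "hello" would no longer be tagged with a language.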