[ https://issues.apache.org/jira/browse/NUTCH-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lewis John McGibbney resolved NUTCH-2449.
-----------------------------------------
    Resolution: Fixed

> Usage of Tika LanguageIdentifier in language-identifier plugin
> --------------------------------------------------------------
>
>                 Key: NUTCH-2449
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2449
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin
>    Affects Versions: 1.13
>            Reporter: Yossi Tamari
>            Priority: Major
>             Fix For: 1.19
>
>
> The language-identifier plugin uses
> org.apache.tika.language.LanguageIdentifier for extracting the language from
> the document text. There are two issues with that:
> # LanguageIdentifier is deprecated in Tika.
> # It does not support CJK languages (and I suspect a lot of other languages -
> https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes),
> and it doesn’t even fail gracefully with them - in my experience Chinese was
> recognized as Italian.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)