[ https://issues.apache.org/jira/browse/NUTCH-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461770#comment-17461770 ]

Hudson commented on NUTCH-2449:
-------------------------------

ABORTED: Integrated in Jenkins build Nutch » Nutch-trunk #63 (See [https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/63/])
NUTCH-2449 Replace Tika LanguageIdentifier in language-identifier (#716) (github: [https://github.com/apache/nutch/commit/a9b50a7c7e0ab83865883bf87f2c98f1ce354388])
* (add) src/plugin/language-identifier/build-ivy.xml
* (edit) src/plugin/language-identifier/build.xml


> Usage of Tika LanguageIdentifier in language-identifier plugin
> --------------------------------------------------------------
>
>                 Key: NUTCH-2449
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2449
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin
>    Affects Versions: 1.13
>            Reporter: Yossi Tamari
>            Priority: Major
>             Fix For: 1.19
>
>
> The language-identifier plugin uses
> org.apache.tika.language.LanguageIdentifier for extracting the language from
> the document text. There are two issues with that:
> # LanguageIdentifier is deprecated in Tika.
> # It does not support CJK languages (and I suspect a lot of other languages -
> https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes),
> and it doesn’t even fail gracefully with them - in my experience Chinese was
> recognized as Italian.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)