[ https://issues.apache.org/jira/browse/NUTCH-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461770#comment-17461770 ]

Hudson commented on NUTCH-2449:
-------------------------------

ABORTED: Integrated in Jenkins build Nutch » Nutch-trunk #63 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/63/])
NUTCH-2449 Replace Tika LanguageIdentifier in language-identifier (#716) 
(github: 
[https://github.com/apache/nutch/commit/a9b50a7c7e0ab83865883bf87f2c98f1ce354388])
* (add) src/plugin/language-identifier/build-ivy.xml
* (edit) src/plugin/language-identifier/build.xml


> Usage of Tika LanguageIdentifier in language-identifier plugin
> --------------------------------------------------------------
>
>                 Key: NUTCH-2449
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2449
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin
>    Affects Versions: 1.13
>            Reporter: Yossi Tamari
>            Priority: Major
>             Fix For: 1.19
>
>
> The language-identifier plugin uses 
> org.apache.tika.language.LanguageIdentifier for extracting the language from 
> the document text. There are two issues with that:
> # LanguageIdentifier is deprecated in Tika.
> # It does not support CJK languages (and, I suspect, many other languages - 
> https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes),
>  and it does not even fail gracefully on them - in my experience, Chinese was 
> recognized as Italian.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
