[ 
https://issues.apache.org/jira/browse/TIKA-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13198415#comment-13198415
 ] 

Nick Burch commented on TIKA-855:
---------------------------------

I believe we're currently missing language profiles for those two languages, 
which would explain the detection issue. We probably need someone with a large 
corpus of text in each language to help generate them.
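
To illustrate why a missing profile would give a wrong answer rather than no 
answer, here is a toy sketch in Java. It is not Tika's implementation: the 
class ToyLanguageDetector, its trigram-overlap scoring and the sample 
sentences are all assumptions made up for this example. It only shows that a 
detector which picks the best match among the loaded profiles can never 
return a language it has no profile for.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical toy detector, NOT Tika's LanguageIdentifier.
class ToyLanguageDetector {
    // Language code -> trigram counts built from sample text.
    private final Map<String, Map<String, Integer>> profiles = new LinkedHashMap<>();

    // Count every overlapping 3-character sequence in the text.
    public static Map<String, Integer> trigrams(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase(Locale.ROOT);
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Build a profile from a sample; a real profile needs a large corpus.
    public void addProfile(String language, String sampleText) {
        profiles.put(language, trigrams(sampleText));
    }

    // Returns the loaded profile with the highest trigram overlap.
    // There is no "unknown" result: with zero overlap everywhere,
    // the first loaded profile wins.
    public String detect(String text) {
        Map<String, Integer> input = trigrams(text);
        String best = null;
        int bestScore = -1;
        for (Map.Entry<String, Map<String, Integer>> profile : profiles.entrySet()) {
            int score = 0;
            for (Map.Entry<String, Integer> e : input.entrySet()) {
                int inProfile = profile.getValue().getOrDefault(e.getKey(), 0);
                score += Math.min(e.getValue(), inProfile);
            }
            if (score > bestScore) {
                best = profile.getKey();
                bestScore = score;
            }
        }
        return best;
    }
}
```

In this sketch, Japanese input is reported as one of the loaded languages 
until a "ja" profile is added with addProfile, after which it is detected as 
"ja". Which wrong language comes back depends on the scoring and on the 
profiles that happen to be loaded.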
                
> Language Detection not working for Japanese and Chinese.
> --------------------------------------------------------
>
>                 Key: TIKA-855
>                 URL: https://issues.apache.org/jira/browse/TIKA-855
>             Project: Tika
>          Issue Type: Bug
>          Components: languageidentifier
>    Affects Versions: 1.0
>         Environment: Windows XP, Vista and Linux Ubuntu 11.10 using Sun Java 
> 6 and Oracle Java 7
>            Reporter: James Sullivan
>            Priority: Minor
>              Labels: Chinese, Japanese
>
> I have tried Tika 1.0 language detection (java -jar tika.jar -l 
> .\Japanese.txt) on several Japanese files (both PDF and text files) and it 
> consistently returns lt (Lithuanian???) instead of ja. I also tried on a 
> Chinese file which similarly incorrectly returned lt. Both English language 
> and French language detection worked correctly.

