For Chinese we need to create/get two profiles: Chinese Traditional and Chinese Simplified.
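
To make "profile" concrete, here is a rough sketch of the general idea: count overlapping character n-grams in a sample corpus, once per script variant. This is only an illustration, not Tika's actual profile format or API. The class name `NgramProfileSketch`, the method `buildProfile`, the n-gram length of 2, and the two tiny sample strings (the same phrase written in Traditional and in Simplified characters, given as Unicode escapes) are all assumptions made up for this example. Real profiles would need large corpora.

```java
import java.util.HashMap;
import java.util.Map;

public class NgramProfileSketch {

    // Count overlapping character n-grams in the text.
    // Works on code points so characters outside the BMP are not split.
    static Map<String, Integer> buildProfile(String text, int n) {
        int[] cps = text.codePoints().toArray();
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= cps.length; i++) {
            String gram = new String(cps, i, n);
            counts.merge(gram, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in samples, far too small for real use.
        // Traditional characters:
        String traditional = "\u8A9E\u8A00\u6AA2\u6E2C\u8A9E\u8A00";
        // The same phrase in Simplified characters:
        String simplified = "\u8BED\u8A00\u68C0\u6D4B\u8BED\u8A00";

        System.out.println("Traditional: " + buildProfile(traditional, 2));
        System.out.println("Simplified:  " + buildProfile(simplified, 2));
    }
}
```

The two samples share almost no n-grams even though they say the same thing, which is why a single Chinese profile is unlikely to serve both scripts well.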

Oleg

On Thu, Feb 2, 2012 at 6:13 AM, James Sullivan (Commented) (JIRA) <j...@apache.org> wrote:

>
> [ https://issues.apache.org/jira/browse/TIKA-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13198521#comment-13198521 ]
>
> James Sullivan commented on TIKA-855:
> -------------------------------------
>
> If it is just a missing language profile issue let me know what is needed
> as at least for Japanese I am aware of number of large publicly available
> corpora that might be suitable and may be able to help generate the
> profiles. However, it sounds like there might be more to it than just
> generating the profile...I have added this as feature request TIKA-856.
>
> > Language Detection not working for Japanese and Chinese.
> > --------------------------------------------------------
> >
> >                 Key: TIKA-855
> >                 URL: https://issues.apache.org/jira/browse/TIKA-855
> >             Project: Tika
> >          Issue Type: Bug
> >          Components: languageidentifier
> >    Affects Versions: 1.0
> >         Environment: Windows XP, Vista and Linux Ubuntu 11.10 using Sun Java 6 and Oracle Java 7
> >            Reporter: James Sullivan
> >            Assignee: Ken Krugler
> >            Priority: Minor
> >              Labels: Chinese, Japanese
> >
> > I have tried Tika 1.0 language detection (java -jar tika.jar -l
> > .\Japanese.txt) on several Japanese files (both PDF and text files) and it
> > consistently returns lt (Lithuanian???) instead of ja. I also tried on a
> > Chinese file which similarly incorrectly returned lt. Both English language
> > and French language detection worked correctly.
>
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA administrators:
> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>