[ https://issues.apache.org/jira/browse/OPENNLP-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Koji Sekiguchi updated OPENNLP-1197:
------------------------------------
    Description:
FeatureGeneratorUtil.tokenFeature() always classifies Japanese words as "lc" (lower case). This looks like a bug to me, because they are not lower-case letters. More generally, FeatureGeneratorUtil.tokenFeature() seems to handle only European/American languages.

For example, in a Japanese NER problem, typical token classes are as follows:
- DIGIT
- HIRA : あ, い, う, え, お etc.
- KATA : ア, イ, ウ, エ, オ etc.
- ALPHA : we don't need to distinguish lower/upper case
- OTHER

I think we could extend FeatureGeneratorUtil.tokenFeature() with the additional token classes mentioned above, but later on someone else from Asia may come along and ask for something similar. I'd like to make FeatureGeneratorUtil pluggable, but I don't have a concrete idea yet.

  was:
FeatureGeneratorUtil.tokenFeature() always classifies Japanese words as "lc" (lower case). This looks like a bug to me, because they are not lower-case letters. More generally, FeatureGeneratorUtil.tokenFeature() seems to handle only European/American languages.

For example, in a Japanese NER problem, typical token classes are as follows:
- DIGIT
- HIRA : あ, い, う, え, お etc.
- KATA : ア, イ, ウ, エ, オ etc.
- ALPHA : we don't distinguish lower/upper case
- OTHER

I think we could extend FeatureGeneratorUtil.tokenFeature() with the additional token classes mentioned above, but later on someone else from Asia may come along and ask for something similar. I'd like to make FeatureGeneratorUtil pluggable, but I don't have a concrete idea yet.
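
To make the proposal concrete, here is a minimal sketch of how the token classes listed in the description could be computed behind a small pluggable interface. Everything in it is an assumption for illustration: the names TokenClassifier and JapaneseTokenClassSketch are hypothetical and are not part of OpenNLP, kanji falls into OTHER only because the list above has no class for it, and returning OTHER for tokens that mix classes is just one possible policy.

```java
// Hypothetical sketch, not OpenNLP code: a pluggable token classifier and
// one implementation for the Japanese token classes listed in the issue.
final class JapaneseTokenClassSketch {

    /** Hypothetical extension point: maps a token to a coarse class label. */
    interface TokenClassifier {
        String classify(String token);
    }

    /** Classifier producing DIGIT, HIRA, KATA, ALPHA or OTHER. */
    static final TokenClassifier JAPANESE = new TokenClassifier() {
        @Override
        public String classify(String token) {
            if (token == null || token.isEmpty()) {
                return "OTHER";
            }
            String label = null;
            // Iterate by code point so supplementary characters are handled.
            for (int i = 0; i < token.length(); ) {
                int cp = token.codePointAt(i);
                String current = classOf(cp);
                if (label == null) {
                    label = current;
                } else if (!label.equals(current)) {
                    return "OTHER"; // assumption: mixed-class tokens are OTHER
                }
                i += Character.charCount(cp);
            }
            return label;
        }
    };

    /** Class of a single code point; kanji and punctuation end up in OTHER. */
    static String classOf(int cp) {
        if (Character.isDigit(cp)) {
            return "DIGIT"; // includes full-width digits
        }
        Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
        if (block == Character.UnicodeBlock.HIRAGANA) {
            return "HIRA";
        }
        if (block == Character.UnicodeBlock.KATAKANA) {
            return "KATA"; // the block also covers the prolonged sound mark
        }
        if (Character.isLetter(cp)
                && Character.UnicodeScript.of(cp) == Character.UnicodeScript.LATIN) {
            return "ALPHA"; // no lower/upper case distinction
        }
        return "OTHER";
    }

    public static void main(String[] args) {
        for (String token : new String[] {"あいう", "アイウ", "123", "OpenNLP", "東京"}) {
            System.out.println(token + " -> " + JAPANESE.classify(token));
        }
    }
}
```

With an interface like this, a language-specific classifier could in principle be selected by configuration rather than hard-coded, which is the kind of pluggability the description asks for. How it would actually be wired into FeatureGeneratorUtil is left open here.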
> FeatureGeneratorUtil.tokenFeature() always returns "lc" for Japanese words
> --------------------------------------------------------------------------
>
>                 Key: OPENNLP-1197
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1197
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Machine Learning
>    Affects Versions: 1.8.4
>            Reporter: Koji Sekiguchi
>            Priority: Major