Koji Sekiguchi created OPENNLP-1197:
---------------------------------------

             Summary: FeatureGeneratorUtil.tokenFeature() always returns "lc" 
for Japanese words
                 Key: OPENNLP-1197
                 URL: https://issues.apache.org/jira/browse/OPENNLP-1197
             Project: OpenNLP
          Issue Type: Bug
          Components: Machine Learning
    Affects Versions: 1.8.4
            Reporter: Koji Sekiguchi


FeatureGeneratorUtil.tokenFeature() always classifies Japanese words as "lc" 
(lower case). This looks like a bug to me, because they are not lower-case 
letters. More generally, FeatureGeneratorUtil.tokenFeature() seems to handle 
only European and American languages.

For example, in Japanese NER, typical token classes are as follows:

- DIGIT
- HIRA : あ, い, う, え, お etc.
- KATA : ア, イ, ウ, エ, オ etc.
- ALPHA : we don't distinguish lower/upper case
- OTHER
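
As a rough illustration of the classes listed above, here is a minimal sketch 
in plain Java. The class JapaneseTokenClassSketch and its tokenClass() method 
are hypothetical names, not part of OpenNLP, and the rule "every code point 
must belong to the class" is an assumption about how one might define them.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical helper for illustration only; not an OpenNLP class.
class JapaneseTokenClassSketch {

    // Returns DIGIT, HIRA, KATA or ALPHA when every code point of the token
    // belongs to that class, and OTHER for everything else (including kanji,
    // mixed tokens and the empty string).
    static String tokenClass(String token) {
        if (token.isEmpty()) {
            return "OTHER";
        }
        boolean digit = true, hira = true, kata = true, alpha = true;
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            i += Character.charCount(cp);
            UnicodeScript script = UnicodeScript.of(cp);
            digit &= Character.isDigit(cp);
            hira &= script == UnicodeScript.HIRAGANA;
            kata &= script == UnicodeScript.KATAKANA;
            // Upper and lower case are deliberately not distinguished.
            alpha &= script == UnicodeScript.LATIN && Character.isLetter(cp);
        }
        if (digit) return "DIGIT";
        if (hira) return "HIRA";
        if (kata) return "KATA";
        if (alpha) return "ALPHA";
        return "OTHER";
    }

    public static void main(String[] args) {
        for (String t : new String[] {"123", "あいう", "アイウ", "OpenNLP", "東京"}) {
            System.out.println(t + " -> " + tokenClass(t));
        }
    }
}
```

A real implementation would need extra care for characters shared between 
scripts; for example, the prolonged sound mark "ー" is not assigned to the 
Katakana script in Unicode, so this sketch would report a token containing 
it as OTHER.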

I think we could extend FeatureGeneratorUtil.tokenFeature() with the 
additional token classes listed above, but later on someone working with 
another Asian language may ask for something similar.

I'd like to make FeatureGeneratorUtil pluggable, but I don't yet have an idea 
of how to do it.
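
One possible shape for such a plug-in point, purely as a sketch: the token 
classification is pulled out behind a small interface so that each language 
can supply its own. Everything here (TokenClassifier, LOWER_CASE_ONLY, 
KANA_AWARE, features()) is a hypothetical name invented for illustration and 
does not exist in OpenNLP; the "lc"/"other" default is a simplified stand-in, 
not the actual tokenFeature() logic.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical design sketch; none of these types exist in OpenNLP.
class PluggableTokenClassSketch {

    // Assumed extension point: maps a token to a token-class label.
    interface TokenClassifier {
        String classify(String token);
    }

    // Simplified stand-in for a Latin-oriented default.
    static final TokenClassifier LOWER_CASE_ONLY = token ->
        !token.isEmpty() && token.codePoints().allMatch(Character::isLowerCase)
            ? "lc" : "other";

    // Script-aware classifier that falls back to the default for non-kana.
    static final TokenClassifier KANA_AWARE = token -> {
        if (!token.isEmpty() && token.codePoints()
                .allMatch(cp -> UnicodeScript.of(cp) == UnicodeScript.HIRAGANA)) {
            return "hira";
        }
        if (!token.isEmpty() && token.codePoints()
                .allMatch(cp -> UnicodeScript.of(cp) == UnicodeScript.KATAKANA)) {
            return "kata";
        }
        return LOWER_CASE_ONLY.classify(token);
    };

    // The feature generator depends only on the interface, so the
    // classifier can be swapped per language without touching this code.
    static List<String> features(String[] tokens, TokenClassifier classifier) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add("tc=" + classifier.classify(token));
        }
        return out;
    }
}
```

With this arrangement, a Japanese model could pass KANA_AWARE to features() 
while existing models keep the default, and further languages could add their 
own classifier without changing the shared code.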



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
