Toshiya,

While I'm a mentor of OpenNLP, I'm not that deep in the code; I'm
mostly here to help with process. However, I have done a good deal of
work on statistical segmentation of Japanese text.
I am presuming that by 'tokenization' for Japanese you are talking
about segmentation into words. I appreciate that there has to be a
tokenizer in the pipeline somewhere. However, it seems to me that it
should be possible to write a bit of code and incorporate an existing
segmentation component as an alternative to training a model for the
OpenNLP tokenizer. I also have to wonder whether a component built for
languages with whitespace will do a very good job of tokenizing
Japanese or Chinese just by training a different model. Perhaps Jörn
can shed some light on that; maybe others have used their own data to
experiment with it.

--benson

On Tue, Apr 5, 2011 at 7:36 AM, Toshiya TSURU <[email protected]> wrote:
> Thanks, Benson.
>
> The reason I'm looking for a Japanese model is to implement a
> practical tokenizer in RapidMiner.
>
> RapidMiner is a data-mining application that includes OpenNLP.
>
> In RapidMiner, OpenNLP is used for tokenizing document data. It
> works well for English content, but not for Japanese, because the
> models bundled with RapidMiner are English and German only.
>
> So I'm looking for one for Japanese tokenization.
>
> On Tuesday, April 5, 2011, Benson Margulies <[email protected]> wrote:
>> First of all, do you really need to train your own tokenizer? You
>> could use http://www.chasen.org/~taku/software/TinySegmenter/, or
>> ChaSen, or http://mecab.sourceforge.net/.
>>
>> I believe that there are corpora available that were used to train
>> MeCab, but I'm rusty on the subject.
>>
>> The '1982' Mainichi might be available, but a model trained from it
>> will work well for newspapers and not well at all for hiragana-heavy
>> informal text.
>>
>> If you have a special reason to want to train a model, you can create
>> training data by using one of the tokenizers above. Of course, your
>> accuracy will be somewhat less than what you start with. In our
>> experience, however, not so much less.
>>
>> On Tue, Apr 5, 2011 at 3:56 AM, Toshiya TSURU <[email protected]> wrote:
>>> Hi.
>>>
>>> I'm a newbie at language processing,
>>> so I'm wondering what kind of data is suitable for a training corpus.
>>>
>>> For English, what is the best training corpus?
>>>
>>> On Tue, Apr 5, 2011 at 4:16 PM, Jörn Kottmann <[email protected]> wrote:
>>>> On 4/5/11 8:25 AM, Toshiya TSURU wrote:
>>>>>
>>>>> Hi.
>>>>>
>>>>> I'm a software developer in Tokyo, Japan.
>>>>> I found that RapidMiner uses OpenNLP for its tokenization process.
>>>>>
>>>>> But the tokens produced by RapidMiner are strange,
>>>>> because there is no tokenizer model for Japanese.
>>>>>
>>>>> I've checked the page below, but no models for Japanese are listed:
>>>>> http://opennlp.sourceforge.net/models-1.5/
>>>>>
>>>>> How can I get a Japanese model?
>>>>> Or can I create one?
>>>>>
>>>> Currently we do not have support for Japanese, but
>>>> we would be happy to add it.
>>>>
>>>> Do you know a training corpus we could use?
>>>>
>>>> Jörn
>>>
>>> --
>>> Toshiya TSURU <[email protected]>
>>> http://twitter.com/turutosiya
>
> --
> Toshiya TSURU <[email protected]>
> http://twitter.com/turutosiya
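P.S. Benson's suggestion of bootstrapping training data from an
existing segmenter could be sketched roughly as below. This is a
minimal sketch, not a tested recipe: it assumes the input is
wakati-style output (one sentence per line, tokens separated by
spaces, as MeCab's -Owakati mode produces) and that the target is
OpenNLP's tokenizer training format, where token boundaries that carry
no whitespace in the raw text are marked with a <SPLIT> tag. Since
Japanese has no spaces between words, every token boundary becomes a
<SPLIT>. The function names here are my own.

```python
# Convert wakati-style segmenter output into OpenNLP tokenizer
# training lines. In OpenNLP's format, <SPLIT> marks a token
# boundary with no whitespace in the original text.

SPLIT = "<SPLIT>"

def to_opennlp_sample(segmented_line: str) -> str:
    """One space-separated, pre-segmented sentence ->
    one OpenNLP tokenizer training line."""
    tokens = segmented_line.split()
    # Japanese source text has no inter-word spaces, so every
    # boundary between adjacent tokens gets a <SPLIT> tag.
    return SPLIT.join(tokens)

def convert(lines):
    """Convert an iterable of segmented lines, skipping blanks."""
    return [to_opennlp_sample(line) for line in lines if line.strip()]

if __name__ == "__main__":
    # Example wakati-style segmentation of 私は東京の開発者です。
    segmented = "私 は 東京 の 開発 者 です 。"
    print(to_opennlp_sample(segmented))
    # 私<SPLIT>は<SPLIT>東京<SPLIT>の<SPLIT>開発<SPLIT>者<SPLIT>です<SPLIT>。
```

The converted file could then be fed to OpenNLP's tokenizer training
(e.g. the TokenizerTrainer command-line tool). As Benson notes, a
model trained this way can only approximate the segmenter that
produced the data, though in his experience the loss is small.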
