First of all, do you really need to train your own tokenizer? You could use TinySegmenter (http://www.chasen.org/~taku/software/TinySegmenter/), ChaSen, or MeCab (http://mecab.sourceforge.net/).
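
If you do end up training your own OpenNLP model, the pipeline is roughly: segment raw text with MeCab (something like `mecab -Owakati corpus.txt > wakati.txt`), rewrite its space-separated output into OpenNLP's <SPLIT> training format, and train a TokenizerME on that. Below is a minimal, untested sketch assuming OpenNLP 1.5's tokenizer API; the file names and class name are just placeholders.

    // Minimal sketch (untested): convert MeCab wakati output into
    // OpenNLP's tokenizer training format and train a Japanese model.
    // Assumes OpenNLP 1.5; file names are placeholders.
    import java.io.*;

    import opennlp.tools.tokenize.TokenSample;
    import opennlp.tools.tokenize.TokenSampleStream;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;

    public class TrainJaTokenizer {
        public static void main(String[] args) throws Exception {
            // Step 1: "今日 は 晴れ です 。" -> "今日<SPLIT>は<SPLIT>晴れ<SPLIT>です<SPLIT>。"
            // OpenNLP marks token boundaries that carry no whitespace in
            // the raw text with <SPLIT> tags; this assumes the raw
            // Japanese text itself contains no spaces.
            BufferedReader in = new BufferedReader(new InputStreamReader(
                    new FileInputStream("wakati.txt"), "UTF-8"));
            PrintWriter out = new PrintWriter(new OutputStreamWriter(
                    new FileOutputStream("ja-token.train"), "UTF-8"));
            for (String line = in.readLine(); line != null; line = in.readLine()) {
                String text = line.trim();
                if (text.length() > 0) {
                    out.println(text.replace(" ", "<SPLIT>"));
                }
            }
            in.close();
            out.close();

            // Step 2: train the maxent tokenizer on the converted samples.
            ObjectStream<String> lines = new PlainTextByLineStream(
                    new InputStreamReader(new FileInputStream("ja-token.train"), "UTF-8"));
            ObjectStream<TokenSample> samples = new TokenSampleStream(lines);
            TokenizerModel model = TokenizerME.train("ja", samples, false);

            OutputStream modelOut = new FileOutputStream("ja-token.bin");
            model.serialize(modelOut);
            modelOut.close();
        }
    }

The final `false` turns off the alphanumeric optimization, which is tuned for space-delimited scripts and probably doesn't help for Japanese. The resulting ja-token.bin should load the same way as the models on the 1.5 download page.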
I believe there are corpora available that were used to train MeCab, but I'm rusty on the subject. The '1982' Mainichi corpus might be available, but a model trained on it will work well for newspapers and not well at all for hiragana-heavy informal text. If you instead bootstrap training data from one of the tokenizers above, as in the sketch, your accuracy will be somewhat lower than that of the tokenizer you start from; in our experience, though, not much lower.

On Tue, Apr 5, 2011 at 3:56 AM, Toshiya TSURU <[email protected]> wrote:
> Hi.
>
> I'm a newbie at language processing, so I'm wondering what kind of data
> is suitable as a training corpus.
>
> For English, what is the best training corpus?
>
> On Tue, Apr 5, 2011 at 4:16 PM, Jörn Kottmann <[email protected]> wrote:
>> On 4/5/11 8:25 AM, Toshiya TSURU wrote:
>>>
>>> Hi.
>>>
>>> I'm a software developer in Tokyo, Japan.
>>> I found that RapidMiner uses OpenNLP for its tokenization process.
>>>
>>> But the tokens produced by RapidMiner are strange, because there is
>>> no tokenizer model for Japanese.
>>>
>>> I've checked the page below, but no models for Japanese are listed:
>>> http://opennlp.sourceforge.net/models-1.5/
>>>
>>> How can I get a Japanese model? Or can I create one?
>>>
>> Currently we do not have support for Japanese, but
>> we would be happy to add it.
>>
>> Do you know a training corpus we could use?
>>
>> Jörn
>>
>
> --
> Toshiya TSURU <[email protected]>
> http://twitter.com/turutosiya
