Thanks Benson.

> I am presuming that by 'tokenization' for Japanese you are talking
> about segmentation into words. I appreciate that there has to be a
> tokenizer in the pipeline somewhere

Yes.

> However, it seems to me that it should be possible to write a bit of
> code and incorporate an existing segmentation component as an
> alternative to training a model for the opennlp tokenizer

Yes. Because RapidMiner is written in Java, I have been looking for
alternatives that are also written in Java, and I found one: "Sen".
http://www.mlab.im.dendai.ac.jp/~yamada/ir/MorphologicalAnalyzer/Sen.html
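If a bit of glue code does turn out to be acceptable, I imagine the
adapter would look roughly like the sketch below. It implements
OpenNLP's Tokenizer interface and recovers character offsets by
scanning the original text. The segment() method is only a
placeholder, since I have not verified Sen's actual API:

    import java.util.ArrayList;
    import java.util.List;

    import opennlp.tools.tokenize.Tokenizer;
    import opennlp.tools.util.Span;

    // Sketch: adapt an external Japanese segmenter (e.g. Sen) to the
    // OpenNLP Tokenizer interface so RapidMiner can use it unchanged.
    public class JapaneseSegmenterAdapter implements Tokenizer {

        public String[] tokenize(String text) {
            return Span.spansToStrings(tokenizePos(text), text);
        }

        public Span[] tokenizePos(String text) {
            List<Span> spans = new ArrayList<Span>();
            int offset = 0;
            // Recover each token's character offsets by scanning
            // forward in the original text.
            for (String token : segment(text)) {
                int start = text.indexOf(token, offset);
                if (start < 0) {
                    continue; // segmenter normalized the surface form
                }
                spans.add(new Span(start, start + token.length()));
                offset = start + token.length();
            }
            return spans.toArray(new Span[spans.size()]);
        }

        // Placeholder: call Sen here. I believe it is something like
        // StringTagger.getInstance().analyze(text), but I have not
        // checked that signature.
        private List<String> segment(String text) {
            throw new UnsupportedOperationException("plug in Sen here");
        }
    }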
But at this time I would prefer not to write code myself (it may
introduce unexpected bugs) if there is another way to do it. That is
why I asked whether a model for the Japanese language already exists.

Can MeCab generate a model for OpenNLP?
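If it cannot do so directly, I suppose the training data could be
generated from MeCab's output instead. "mecab -O wakati" prints each
sentence as space-separated tokens, and OpenNLP's tokenizer training
format marks token boundaries that carry no real whitespace with
<SPLIT>, so a rough converter (the class and file names here are just
for illustration) might be:

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.InputStreamReader;
    import java.io.OutputStreamWriter;
    import java.io.PrintWriter;

    // Sketch: turn "mecab -O wakati" output (one space-separated
    // sentence per line) into OpenNLP tokenizer training data.
    public class MecabToOpenNlpTrain {

        public static void main(String[] args) throws Exception {
            BufferedReader in = new BufferedReader(new InputStreamReader(
                    new FileInputStream(args[0]), "UTF-8"));
            PrintWriter out = new PrintWriter(new OutputStreamWriter(
                    new FileOutputStream(args[1]), "UTF-8"));
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.length() == 0) {
                    continue;
                }
                // Japanese text has no spaces, so every token
                // boundary becomes a <SPLIT> marker.
                out.println(line.replaceAll("\\s+", "<SPLIT>"));
            }
            out.close();
            in.close();
        }
    }

The result could then presumably be fed to the 1.5 command-line
trainer, something like:

    bin/opennlp TokenizerTrainer -encoding UTF-8 -lang ja \
        -data ja-token.train -model ja-token.bin

though whether a model trained that way tokenizes Japanese well is
exactly the question you raise below.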
On Tue, Apr 5, 2011 at 8:43 PM, Benson Margulies <[email protected]> wrote:
> Toshiya,
>
> While I'm a mentor of opennlp, I'm not that deep in the code. I'm
> mostly here to help with process. However, I have done a good deal of
> work on statistical segmentation of Japanese text.
>
> I am presuming that by 'tokenization' for Japanese you are talking
> about segmentation into words. I appreciate that there has to be a
> tokenizer in the pipeline somewhere. However, it seems to me that it
> should be possible to write a bit of code and incorporate an existing
> segmentation component as an alternative to training a model for the
> opennlp tokenizer. I also have to wonder whether a component used for
> languages with whitespace will do a very good job at tokenizing
> Japanese or Chinese just by training a different model. Perhaps Jorn
> can shed some light on that; maybe others have used their own data to
> experiment with that.
>
> --benson
>
>
> On Tue, Apr 5, 2011 at 7:36 AM, Toshiya TSURU <[email protected]> wrote:
>> Thanks Benson.
>>
>> The reason I'm looking for a Japanese model is to implement a
>> practical tokenizer in RapidMiner.
>>
>> RapidMiner is data-mining software that includes OpenNLP within it.
>>
>> In RapidMiner, OpenNLP is used for tokenizing document data. It
>> works well for English content, but not for Japanese, because the
>> only models bundled with RapidMiner are English and German.
>>
>> So I'm looking for one for Japanese tokenization.
>>
>> On Tuesday, April 5, 2011, Benson Margulies <[email protected]> wrote:
>>> First of all, do you really need to train your own tokenizer? You
>>> could use http://www.chasen.org/~taku/software/TinySegmenter/, or
>>> Chasen, or http://mecab.sourceforge.net/.
>>>
>>> I believe that there are corpora available that were used to train
>>> mecab, but I'm rusty on the subject.
>>>
>>> The '1982' Mainichi might be available, but a model trained from it
>>> will work well for newspapers and not well at all for hiragana-heavy
>>> informal text.
>>>
>>> If you have a special reason to want to train a model, you can create
>>> training data by using one of the tokenizers above. Of course, your
>>> accuracy will be somewhat less than what you start with. In our
>>> experience, however, not so much less.
>>>
>>> On Tue, Apr 5, 2011 at 3:56 AM, Toshiya TSURU <[email protected]> wrote:
>>>> Hi.
>>>>
>>>> I'm a newbie at language processing, so I'm wondering what kind of
>>>> data is suitable as a training corpus.
>>>>
>>>> For English, what is the best training corpus?
>>>>
>>>> On Tue, Apr 5, 2011 at 4:16 PM, Jörn Kottmann <[email protected]> wrote:
>>>>> On 4/5/11 8:25 AM, Toshiya TSURU wrote:
>>>>>>
>>>>>> Hi.
>>>>>>
>>>>>> I'm a software developer in Tokyo, Japan.
>>>>>> I found that RapidMiner uses OpenNLP for its tokenization process.
>>>>>> But the tokens produced by RapidMiner are strange, because there
>>>>>> is no tokenizer model for Japanese.
>>>>>>
>>>>>> Although I've checked the page below, no model for Japanese can
>>>>>> be found there:
>>>>>> http://opennlp.sourceforge.net/models-1.5/
>>>>>>
>>>>>> How can I get a Japanese model? Or can I create one?
>>>>>>
>>>>> Currently we do not have support for Japanese, but
>>>>> we would be happy to add it.
>>>>>
>>>>> Do you know a training corpus we could use?
>>>>>
>>>>> Jörn
>>>>
>>>> --
>>>> Toshiya TSURU <[email protected]>
>>>> http://twitter.com/turutosiya
>>
>> --
>> Toshiya TSURU <[email protected]>
>> http://twitter.com/turutosiya

--
Toshiya TSURU <[email protected]>
http://twitter.com/turutosiya
