Toshiya,

While I'm a mentor of OpenNLP, I'm not that deep in the code. I'm
mostly here to help with process. However, I have done a good deal of
work on statistical segmentation of Japanese text.

I am presuming that by 'tokenization' for Japanese you are talking
about segmentation into words. I appreciate that there has to be a
tokenizer in the pipeline somewhere. However, it seems to me that it
should be possible to write a bit of code and incorporate an existing
segmentation component as an alternative to training a model for the
opennlp tokenizer. I also have to wonder whether a component used for
languages with whitespace will do a very good job at tokenizing
Japanese or Chinese just by training a different model.  Perhaps Jorn
can shed some light on that; maybe others have used their own data to
experiment with that.
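
For what it's worth, one way to bootstrap training data from an
existing segmenter: run text through `mecab -Owakati` (which prints
tokens separated by spaces) and rewrite each line into OpenNLP's
tokenizer training format, where a `<SPLIT>` tag marks a token
boundary that has no whitespace in the original text. In Japanese
that is essentially every boundary. A minimal sketch (the helper
name is mine, not part of either tool):

```python
def wakati_to_opennlp(line: str) -> str:
    """Convert one line of wakati-gaki output (tokens separated by
    spaces) into an OpenNLP tokenizer training line, joining the
    tokens with <SPLIT> markers since the source text has no
    whitespace between them."""
    tokens = line.split()
    return "<SPLIT>".join(tokens)

if __name__ == "__main__":
    # e.g. the output of: echo "私は東京に住んでいる" | mecab -Owakati
    print(wakati_to_opennlp("私 は 東京 に 住ん で いる"))
    # → 私<SPLIT>は<SPLIT>東京<SPLIT>に<SPLIT>住ん<SPLIT>で<SPLIT>いる
```

As Benson says below, accuracy will be bounded by whatever segmenter
produced the training data in the first place.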

--benson


On Tue, Apr 5, 2011 at 7:36 AM, Toshiya TSURU <[email protected]> wrote:
> Thanks Benson
>
> The reason I'm looking for a Japanese model is to put a
> practical tokenizer into RapidMiner.
>
> RapidMiner is a data-mining application that bundles OpenNLP.
>
> In RapidMiner, OpenNLP is used for tokenizing document data. It
> works well for English content, but not for Japanese, because the
> only models bundled with RapidMiner are English and German.
>
> So I'm looking for a model for Japanese tokenization.
>
> On Tuesday, April 5, 2011, Benson Margulies <[email protected]> wrote:
>> First of all, do you really need to train your own tokenizer? You
>> could use http://www.chasen.org/~taku/software/TinySegmenter/, or
>> Chasen, or http://mecab.sourceforge.net/.
>>
>> I believe that there are corpora available that were used to train
>> mecab, but I'm rusty on the subject.
>>
>> The '1982' Mainichi might be available, but a model trained from it
>> will work well for newspapers and not well at all for hiragana-heavy
>> informal text.
>>
>> If you have a special reason to want to train a model, you can create
>> training data by using one of the tokenizers above. Of course, your
>> accuracy will be somewhat less than what you start with. In our
>> experience, however, not so much less.
>>
>>
>> On Tue, Apr 5, 2011 at 3:56 AM, Toshiya TSURU <[email protected]> wrote:
>>> Hi.
>>>
>>> I'm a newbie at language processing, so I'm wondering what kind
>>> of data is suitable as a training corpus.
>>>
>>> For English, what is the best training corpus?
>>>
>>> On Tue, Apr 5, 2011 at 4:16 PM, Jörn Kottmann <[email protected]> wrote:
>>>> On 4/5/11 8:25 AM, Toshiya TSURU wrote:
>>>>>
>>>>> Hi.
>>>>>
>>>>> I'm a software developer in Tokyo, Japan.
>>>>> I found that RapidMiner uses OpenNLP for its tokenization process.
>>>>>
>>>>> But the tokens produced by RapidMiner are strange, because
>>>>> there is no tokenizer model for Japanese.
>>>>>
>>>>> I've checked the page below, but no model for Japanese is listed:
>>>>> http://opennlp.sourceforge.net/models-1.5/
>>>>>
>>>>> How can I get a Japanese model?
>>>>> Or can I create one?
>>>>>
>>>> Currently we do not have support for Japanese, but
>>>> we would be happy to add it.
>>>>
>>>> Do you know a training corpus we could use?
>>>>
>>>> Jörn
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Toshiya TSURU <[email protected]>
>>> http://twitter.com/turutosiya
>>>
>>
>
>
> --
> Toshiya TSURU <[email protected]>
> http://twitter.com/turutosiya
>
