First of all, do you really need to train your own tokenizer? You could use TinySegmenter (http://www.chasen.org/~taku/software/TinySegmenter/), ChaSen, or MeCab (http://mecab.sourceforge.net/).
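
If you do end up training your own OpenNLP model, the pipeline is roughly: segment raw text with MeCab (something like `mecab -Owakati corpus.txt > wakati.txt`), rewrite its space-separated output into OpenNLP's <SPLIT> training format, and train a TokenizerME on that. Below is a minimal, untested sketch assuming OpenNLP 1.5's tokenizer API; the file names and class name are just placeholders.

    // Minimal sketch (untested): convert MeCab wakati output into
    // OpenNLP's tokenizer training format and train a Japanese model.
    // Assumes OpenNLP 1.5; file names are placeholders.
    import java.io.*;

    import opennlp.tools.tokenize.TokenSample;
    import opennlp.tools.tokenize.TokenSampleStream;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;

    public class TrainJaTokenizer {
        public static void main(String[] args) throws Exception {
            // Step 1: "今日 は 晴れ です 。" -> "今日<SPLIT>は<SPLIT>晴れ<SPLIT>です<SPLIT>。"
            // OpenNLP marks token boundaries that carry no whitespace in
            // the raw text with <SPLIT> tags; this assumes the raw
            // Japanese text itself contains no spaces.
            BufferedReader in = new BufferedReader(new InputStreamReader(
                    new FileInputStream("wakati.txt"), "UTF-8"));
            PrintWriter out = new PrintWriter(new OutputStreamWriter(
                    new FileOutputStream("ja-token.train"), "UTF-8"));
            for (String line = in.readLine(); line != null; line = in.readLine()) {
                String text = line.trim();
                if (text.length() > 0) {
                    out.println(text.replace(" ", "<SPLIT>"));
                }
            }
            in.close();
            out.close();

            // Step 2: train the maxent tokenizer on the converted samples.
            ObjectStream<String> lines = new PlainTextByLineStream(
                    new InputStreamReader(new FileInputStream("ja-token.train"), "UTF-8"));
            ObjectStream<TokenSample> samples = new TokenSampleStream(lines);
            TokenizerModel model = TokenizerME.train("ja", samples, false);

            OutputStream modelOut = new FileOutputStream("ja-token.bin");
            model.serialize(modelOut);
            modelOut.close();
        }
    }

The final `false` turns off the alphanumeric optimization, which is tuned for space-delimited scripts and probably doesn't help for Japanese. The resulting ja-token.bin should load the same way as the models on the 1.5 download page.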
I believe there are corpora available that were used to train MeCab, but I'm rusty on the subject. The '1982' Mainichi corpus might be available, but a model trained on it will work well for newspapers and not well at all for hiragana-heavy informal text. If you instead bootstrap training data from one of the tokenizers above, as in the sketch, your accuracy will be somewhat lower than that of the tokenizer you start from; in our experience, though, not much lower.

On Tue, Apr 5, 2011 at 3:56 AM, Toshiya TSURU <[email protected]> wrote:
> Hi.
>
> I'm a newbie at language processing, so I'm wondering what kind of data
> is suitable as a training corpus.
>
> For English, what is the best training corpus?
>
> On Tue, Apr 5, 2011 at 4:16 PM, Jörn Kottmann <[email protected]> wrote:
>> On 4/5/11 8:25 AM, Toshiya TSURU wrote:
>>>
>>> Hi.
>>>
>>> I'm a software developer in Tokyo, Japan.
>>> I found that RapidMiner uses OpenNLP for its tokenization process.
>>>
>>> But the tokens produced by RapidMiner are strange, because there is
>>> no tokenizer model for Japanese.
>>>
>>> I've checked the page below, but no models for Japanese are listed:
>>> http://opennlp.sourceforge.net/models-1.5/
>>>
>>> How can I get a Japanese model? Or can I create one?
>>>
>> Currently we do not have support for Japanese, but
>> we would be happy to add it.
>>
>> Do you know a training corpus we could use?
>>
>> Jörn
>>
>
> --
> Toshiya TSURU <[email protected]>
> http://twitter.com/turutosiya
