Thanks Benson.

> I am presuming that by 'tokenization' for Japanese you are talking
> about segmentation into words. I appreciate that there has to be a
> tokenizer in the pipeline somewhere

Yes.

> However, it seems to me that it should be possible to write a bit of
> code and incorporate an existing segmentation component as an
> alternative to training a model for the opennlp tokenizer

Yes. Because RapidMiner is written in Java, I have been looking for
alternatives that are also written in Java, and I found one: "Sen".
http://www.mlab.im.dendai.ac.jp/~yamada/ir/MorphologicalAnalyzer/Sen.html
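If a bit of glue code does turn out to be acceptable, I imagine the
adapter would look roughly like the sketch below. It implements
OpenNLP's Tokenizer interface and recovers character offsets by
scanning the original text. The segment() method is only a
placeholder, since I have not verified Sen's actual API:

    import java.util.ArrayList;
    import java.util.List;

    import opennlp.tools.tokenize.Tokenizer;
    import opennlp.tools.util.Span;

    // Sketch: adapt an external Japanese segmenter (e.g. Sen) to the
    // OpenNLP Tokenizer interface so RapidMiner can use it unchanged.
    public class JapaneseSegmenterAdapter implements Tokenizer {

        public String[] tokenize(String text) {
            return Span.spansToStrings(tokenizePos(text), text);
        }

        public Span[] tokenizePos(String text) {
            List<Span> spans = new ArrayList<Span>();
            int offset = 0;
            // Recover each token's character offsets by scanning
            // forward in the original text.
            for (String token : segment(text)) {
                int start = text.indexOf(token, offset);
                if (start < 0) {
                    continue; // segmenter normalized the surface form
                }
                spans.add(new Span(start, start + token.length()));
                offset = start + token.length();
            }
            return spans.toArray(new Span[spans.size()]);
        }

        // Placeholder: call Sen here. I believe it is something like
        // StringTagger.getInstance().analyze(text), but I have not
        // checked that signature.
        private List<String> segment(String text) {
            throw new UnsupportedOperationException("plug in Sen here");
        }
    }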
But at this time I would prefer not to write code myself (it may
introduce unexpected bugs) if there is another way to do it. That is
why I asked whether a model for the Japanese language already exists.

Can MeCab generate a model for OpenNLP?
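If it cannot do so directly, I suppose the training data could be
generated from MeCab's output instead. "mecab -O wakati" prints each
sentence as space-separated tokens, and OpenNLP's tokenizer training
format marks token boundaries that carry no real whitespace with
<SPLIT>, so a rough converter (the class and file names here are just
for illustration) might be:

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.InputStreamReader;
    import java.io.OutputStreamWriter;
    import java.io.PrintWriter;

    // Sketch: turn "mecab -O wakati" output (one space-separated
    // sentence per line) into OpenNLP tokenizer training data.
    public class MecabToOpenNlpTrain {

        public static void main(String[] args) throws Exception {
            BufferedReader in = new BufferedReader(new InputStreamReader(
                    new FileInputStream(args[0]), "UTF-8"));
            PrintWriter out = new PrintWriter(new OutputStreamWriter(
                    new FileOutputStream(args[1]), "UTF-8"));
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.length() == 0) {
                    continue;
                }
                // Japanese text has no spaces, so every token
                // boundary becomes a <SPLIT> marker.
                out.println(line.replaceAll("\\s+", "<SPLIT>"));
            }
            out.close();
            in.close();
        }
    }

The result could then presumably be fed to the 1.5 command-line
trainer, something like:

    bin/opennlp TokenizerTrainer -encoding UTF-8 -lang ja \
        -data ja-token.train -model ja-token.bin

though whether a model trained that way tokenizes Japanese well is
exactly the question you raise below.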
On Tue, Apr 5, 2011 at 8:43 PM, Benson Margulies <[email protected]> wrote:
> Toshiya,
>
> While I'm a mentor of opennlp, I'm not that deep in the code. I'm
> mostly here to help with process. However, I have done a good deal of
> work on statistical segmentation of Japanese text.
>
> I am presuming that by 'tokenization' for Japanese you are talking
> about segmentation into words. I appreciate that there has to be a
> tokenizer in the pipeline somewhere. However, it seems to me that it
> should be possible to write a bit of code and incorporate an existing
> segmentation component as an alternative to training a model for the
> opennlp tokenizer. I also have to wonder whether a component used for
> languages with whitespace will do a very good job at tokenizing
> Japanese or Chinese just by training a different model. Perhaps Jorn
> can shed some light on that; maybe others have used their own data to
> experiment with that.
>
> --benson
>
>
> On Tue, Apr 5, 2011 at 7:36 AM, Toshiya TSURU <[email protected]> wrote:
>> Thanks Benson.
>>
>> The reason I'm looking for a Japanese model is to implement a
>> practical tokenizer in RapidMiner.
>>
>> RapidMiner is data-mining software that includes OpenNLP within it.
>>
>> In RapidMiner, OpenNLP is used for tokenizing document data. It
>> works well for English content, but not for Japanese, because the
>> only models bundled with RapidMiner are English and German.
>>
>> So I'm looking for one for Japanese tokenization.
>>
>> On Tuesday, April 5, 2011, Benson Margulies <[email protected]> wrote:
>>> First of all, do you really need to train your own tokenizer? You
>>> could use http://www.chasen.org/~taku/software/TinySegmenter/, or
>>> Chasen, or http://mecab.sourceforge.net/.
>>>
>>> I believe that there are corpora available that were used to train
>>> mecab, but I'm rusty on the subject.
>>>
>>> The '1982' Mainichi might be available, but a model trained from it
>>> will work well for newspapers and not well at all for hiragana-heavy
>>> informal text.
>>>
>>> If you have a special reason to want to train a model, you can create
>>> training data by using one of the tokenizers above. Of course, your
>>> accuracy will be somewhat less than what you start with. In our
>>> experience, however, not so much less.
>>>
>>> On Tue, Apr 5, 2011 at 3:56 AM, Toshiya TSURU <[email protected]> wrote:
>>>> Hi.
>>>>
>>>> I'm a newbie at language processing, so I'm wondering what kind of
>>>> data is suitable as a training corpus.
>>>>
>>>> For English, what is the best training corpus?
>>>>
>>>> On Tue, Apr 5, 2011 at 4:16 PM, Jörn Kottmann <[email protected]> wrote:
>>>>> On 4/5/11 8:25 AM, Toshiya TSURU wrote:
>>>>>>
>>>>>> Hi.
>>>>>>
>>>>>> I'm a software developer in Tokyo, Japan.
>>>>>> I found that RapidMiner uses OpenNLP for its tokenization process.
>>>>>> But the tokens produced by RapidMiner are strange, because there
>>>>>> is no tokenizer model for Japanese.
>>>>>>
>>>>>> Although I've checked the page below, no model for Japanese can
>>>>>> be found there:
>>>>>> http://opennlp.sourceforge.net/models-1.5/
>>>>>>
>>>>>> How can I get a Japanese model? Or can I create one?
>>>>>>
>>>>> Currently we do not have support for Japanese, but
>>>>> we would be happy to add it.
>>>>>
>>>>> Do you know a training corpus we could use?
>>>>>
>>>>> Jörn
>>>>
>>>> --
>>>> Toshiya TSURU <[email protected]>
>>>> http://twitter.com/turutosiya
>>
>> --
>> Toshiya TSURU <[email protected]>
>> http://twitter.com/turutosiya

--
Toshiya TSURU <[email protected]>
http://twitter.com/turutosiya
