Re: Asian Sentence Detector Models

wl-gao Wed, 21 Mar 2012 19:44:53 -0700

I am Chinese, living in Japan...

Sent from my iPod


On 2012/03/22, at 8:42, James Kosin <[email protected]> wrote:

> Don't worry,
> Korean at least has patterns that mark the end of a sentence, or really
> of a thought. Specific word endings signal that the thought is
> complete.
> 
> James
> 
> On 3/21/2012 7:37 PM, Jörn Kottmann wrote:
>> I don't know, I never worked with Asian languages,
>> but it would of course be nice to improve our support in this area.
>> Especially the basic tasks like sentence detection and
>> tokenization are of great interest for many.
>> 
>> Jörn
>> 
>> 
>> On 03/22/2012 12:22 AM, James Kosin wrote:
>>> Jorn,
>>> 
>>> If there isn't anything for Korean, I could put something together.
>>> Only problem would be getting free text.
>>> I can start looking if needed.
>>> 
>>> James
>>> 
>>> On 3/21/2012 2:38 PM, Jörn Kottmann wrote:
>>>> Here is a paper which describes Chinese sentence segmentation:
>>>> www.aclweb.org/anthology/P/P11/P11-2111.pdf
>>>> 
>>>> There they say that commas can be an end-of-sentence marker as well,
>>>> but they are ambiguous.
>>>> 
>>>> So we would need to add it as an eos char and
>>>> we should create a new feature generator.
>>>> 
>>>> Are there any free training data sets which could be used?
>>>> 
>>>> Jörn
>>>> 
>>>> 
>>>> On 03/21/2012 03:34 PM, Joern Kottmann wrote:
>>>>> Wikipedia says: "Languages like Japanese and Chinese have unambiguous
>>>>> sentence-ending markers."
>>>>> In this case we might be able to write a rule based sentence detector
>>>>> for these languages?
>>>>> 
>>>>> Jörn
>>>>> 
>>>>> On Wed, Mar 21, 2012 at 3:18 PM, [email protected] wrote:
>>>>> 
>>>>>     Hi
>>>>> 
>>>>>     There is a Thai model for the sentence detector. I don't know who
>>>>>     created it, but someone on the list knows and can point to some
>>>>>     article about it.
>>>>>     What I can say is that OpenNLP had to be customized to work with
>>>>>     Thai, including the EOS characters, which are ' ' and '\n'.
>>>>> 
>>>>> 
>>>>> http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/sentdetect/lang/th/SentenceContextGenerator.java?view=markup
>>>>> 
>>>>> 
>>>>> 
>>>>>     William
>>>>> 
>>>>> 
>>>>>     On Wed, Mar 21, 2012 at 8:05 AM, Jim - FooBar(); <[email protected]> wrote:
>>>>> 
>>>>>> Basically you need to know the punctuation signs indicating end of
>>>>>> sentence or find someone who does... then use regex to split the
>>>>>> sentences at those signs! It's not gonna be perfect - you may have to
>>>>>> pass it once or twice with your own eyes to make sure everything is ok
>>>>>> before training. Everything depends on the language and how ambiguous
>>>>>> its punctuation is.
>>>>>> 
>>>>>> 
>>>>>> Jim
>>>>>> 
>>>>>> On 20/03/12 18:38, Jairo Sarabia wrote:
>>>>>> 
>>>>>>> Hi all,
>>>>>>> 
>>>>>>> I see there are no sentence detector models for Asian languages in
>>>>>>> the OpenNLP repository, and I need them.
>>>>>>> I have to train sentence detector models for Chinese, Japanese and
>>>>>>> Korean, but I don't know these languages.
>>>>>>> How could I get training data files for these languages?
>>>>>>> 
>>>>>>> Thanks in advance!
>>>>>>> 
>>>>>>> Jairo Sarabia
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>> 
>
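
The rule-based idea raised in the thread (split wherever an unambiguous sentence-ending mark appears) can be sketched roughly as follows. This is only an illustration, not OpenNLP code: the class name CjkSentenceSplitter, the terminator set and the set of trailing closers are all assumptions made for this example. It covers only the full-stop style marks used in Chinese and Japanese; it does not treat the comma as a boundary, since the thread notes that commas are ambiguous, and it does not address Thai or Korean.

```java
import java.util.ArrayList;
import java.util.List;

public class CjkSentenceSplitter {
    // Assumed terminator set: ideographic full stop, fullwidth '!' and '?',
    // plus their ASCII forms. Commas are deliberately left out.
    private static final String TERMINATORS = "。！？!?";
    // Assumed closing quotes/brackets that stay attached to a preceding terminator.
    private static final String CLOSERS = "」』）”’";

    /** Splits text after each run of terminators (and any closers that follow). */
    public static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i++);
            current.append(c);
            if (TERMINATORS.indexOf(c) >= 0) {
                // Absorb repeated terminators ("！？") and trailing closers ("。」").
                while (i < text.length()
                        && (TERMINATORS.indexOf(text.charAt(i)) >= 0
                            || CLOSERS.indexOf(text.charAt(i)) >= 0)) {
                    current.append(text.charAt(i++));
                }
                flush(current, sentences);
            }
        }
        flush(current, sentences); // keep trailing text that has no terminator
        return sentences;
    }

    // Adds the buffered text as a sentence unless it is empty after trimming
    // ASCII whitespace, then clears the buffer.
    private static void flush(StringBuilder current, List<String> out) {
        String s = current.toString().trim();
        if (!s.isEmpty()) {
            out.add(s);
        }
        current.setLength(0);
    }

    public static void main(String[] args) {
        // Prints one sentence per line.
        for (String s : split("今天天气很好。我们去公园吧！你觉得呢？")) {
            System.out.println(s);
        }
    }
}
```

Output written one sentence per line like this could serve as a first draft of training data, but as noted in the thread it would still need a pass by someone who reads the language before training.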
