Hi,

Do Chinese and/or Japanese also use word endings to signal the end of a sentence or thought, or do they use punctuation?
Thanks,
James

On 3/21/2012 10:43 PM, wl-gao wrote:
> I am Chinese, living in Japan...
>
> Sent from my iPod
>
> On 2012/03/22, at 8:42, James Kosin <[email protected]> wrote:
>
>> Don't worry,
>> Korean at least has patterns to the end of a sentence or really a
>> thought... They have specific endings to the words that mark the end of
>> the thought.
>>
>> James
>>
>> On 3/21/2012 7:37 PM, Jörn Kottmann wrote:
>>> I don't know, I have never worked with Asian languages,
>>> but it would of course be nice to improve our support in this area.
>>> Especially the basic tasks like sentence detection and
>>> tokenization are of great interest for many.
>>>
>>> Jörn
>>>
>>>
>>> On 03/22/2012 12:22 AM, James Kosin wrote:
>>>> Jörn,
>>>>
>>>> If there isn't anything for Korean, I could put something together.
>>>> The only problem would be getting free text.
>>>> I can start looking if needed.
>>>>
>>>> James
>>>>
>>>> On 3/21/2012 2:38 PM, Jörn Kottmann wrote:
>>>>> Here is a paper which describes Chinese sentence segmentation:
>>>>> www.aclweb.org/anthology/P/P11/P11-2111.pdf
>>>>>
>>>>> There they say that commas can be an end-of-sentence marker as well,
>>>>> but they are ambiguous.
>>>>>
>>>>> So we would need to add it as an EOS char and
>>>>> we should create a new feature generator.
>>>>>
>>>>> Are there any free training data sets which could be used?
>>>>>
>>>>> Jörn
>>>>>
>>>>>
>>>>> On 03/21/2012 03:34 PM, Joern Kottmann wrote:
>>>>>> Wikipedia says: "Languages like Japanese and Chinese have unambiguous
>>>>>> sentence-ending markers."
>>>>>> In this case we might be able to write a rule-based sentence detector
>>>>>> for these languages?
>>>>>>
>>>>>> Jörn
>>>>>>
>>>>>> On Wed, Mar 21, 2012 at 3:18 PM, [email protected] wrote:
>>>>>>
>>>>>> Hi
>>>>>>
>>>>>> There is a Thai model for the sentence detector. I don't know who
>>>>>> created it, but someone from the list knows and can point to some
>>>>>> article about it. What I can say is that OpenNLP had to be
>>>>>> customized to work with Thai, including the EOS characters,
>>>>>> which are ' ' and '\n'.
>>>>>>
>>>>>> http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/sentdetect/lang/th/SentenceContextGenerator.java?view=markup
>>>>>>
>>>>>> William
>>>>>>
>>>>>> On Wed, Mar 21, 2012 at 8:05 AM, Jim - FooBar();
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> Basically you need to know the punctuation signs indicating end of
>>>>>>> sentence, or find someone who does... then use a regex to split the
>>>>>>> sentences at those signs! It's not gonna be perfect - you may have
>>>>>>> to pass over it once or twice with your own eyes to make sure
>>>>>>> everything is ok before training. Everything depends on the
>>>>>>> language and how ambiguous its punctuation is.
>>>>>>>
>>>>>>> Jim
>>>>>>>
>>>>>>> On 20/03/12 18:38, Jairo Sarabia wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I see there aren't any sentence detection models for Asian
>>>>>>>> languages in the OpenNLP repository, and I need them.
>>>>>>>> I have to train sentence detection models for Chinese, Japanese
>>>>>>>> and Korean, but I don't know these languages.
>>>>>>>> How could I get the training data files for these languages?
>>>>>>>>
>>>>>>>> Thanks in advance!
>>>>>>>>
>>>>>>>> Jairo Sarabia
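
For reference, a minimal rule-based splitter along the lines Jörn and Jim describe could look like the sketch below. It is plain Java; the class name is made up, and treating only 。！？ (plus an optional closing quote or bracket) as unambiguous terminators is an assumption for illustration, not a tested recipe for any of these languages.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Hypothetical helper class, for illustration only.
    public class CjkRuleSplitter {

        // Run of non-terminators, one or more CJK terminators,
        // then an optional closing quote/bracket.
        private static final Pattern EOS =
            Pattern.compile("[^。！？]*[。！？]+[」』）]?");

        public static List<String> split(String text) {
            List<String> sentences = new ArrayList<String>();
            Matcher m = EOS.matcher(text);
            int last = 0;
            while (m.find()) {
                sentences.add(m.group().trim());
                last = m.end();
            }
            // Keep any trailing text that has no terminator.
            if (last < text.length() && !text.substring(last).trim().isEmpty()) {
                sentences.add(text.substring(last).trim());
            }
            return sentences;
        }

        public static void main(String[] args) {
            String text = "今日は天気がいいです。散歩に行きますか？はい！";
            for (String s : split(text)) {
                System.out.println(s);
            }
        }
    }

Output of a quick pass over something like that could then be eyeballed and corrected before being used as training material, as Jim suggests; commas and other ambiguous marks would still need the statistical detector.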

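Once a model does exist, applying it would follow the usual OpenNLP 1.5 sentence detector API. This is only a sketch: the model file name zh-sent.bin is hypothetical, since no Chinese (or Japanese/Korean) model is published in the repository yet, and the input string is just a placeholder.

    import java.io.FileInputStream;
    import java.io.InputStream;

    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;

    public class SentenceDetectDemo {
        public static void main(String[] args) throws Exception {
            // "zh-sent.bin" is a placeholder file name.
            InputStream modelIn = new FileInputStream("zh-sent.bin");
            try {
                SentenceModel model = new SentenceModel(modelIn);
                SentenceDetectorME detector = new SentenceDetectorME(model);
                String[] sentences = detector.sentDetect("...中文文本...");
                for (String s : sentences) {
                    System.out.println(s);
                }
            } finally {
                modelIn.close();
            }
        }
    }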