Re: Asian Sentence Detector Models

James Kosin Wed, 21 Mar 2012 16:43:05 -0700

Don't worry.
Korean, at least, has patterns that mark the end of a sentence, or really
of a thought. Specific word endings signal that the thought is complete.


James

On 3/21/2012 7:37 PM, Jörn Kottmann wrote:
> I don't know, I never worked with Asian languages,
> but it would of course be nice to improve our support in this area.
> Especially the basic tasks like sentence detection and
> tokenization are of great interest for many.
>
> Jörn
>
>
> On 03/22/2012 12:22 AM, James Kosin wrote:
>> Jörn,
>>
>> If there isn't anything for Korean, I could put something together.
>> Only problem would be getting free text.
>> I can start looking if needed.
>>
>> James
>>
>> On 3/21/2012 2:38 PM, Jörn Kottmann wrote:
>>> Here is a paper which describes Chinese sentence segmentation:
>>> www.aclweb.org/anthology/P/P11/P11-2111.pdf
>>>
>>> There they say that commas can be an end-of-sentence marker as well,
>>> but they are ambiguous.
>>>
>>> So we would need to add it as an eos char and
>>> we should create a new feature generator.
>>>
>>> Are there any free training data sets which could be used?
>>>
>>> Jörn
>>>
>>>
>>> On 03/21/2012 03:34 PM, Joern Kottmann wrote:
>>>> Wikipedia says: "Languages like Japanese and Chinese have unambiguous
>>>> sentence-ending markers."
>>>> In this case we might be able to write a rule-based sentence detector
>>>> for these languages?
>>>>
>>>> Jörn
>>>>
>>>> On Wed, Mar 21, 2012 at 3:18 PM, [email protected] wrote:
>>>>
>>>>      Hi
>>>>
>>>>      There is a Thai model for the sentence detector. I don't know who
>>>>      created it, but someone on the list knows and can point to some
>>>>      article about it.
>>>>      What I can say is that OpenNLP had to be customized to work with
>>>>      Thai, including the EOS characters, which are ' ' and '\n'.
>>>>
>>>>
>>>> http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/sentdetect/lang/th/SentenceContextGenerator.java?view=markup
>>>>
>>>>
>>>>
>>>>      William
>>>>
>>>>
>>>>      On Wed, Mar 21, 2012 at 8:05 AM, Jim - FooBar(); <[email protected]> wrote:
>>>>
>>>>      >  Basically you need to know the punctuation signs indicating the
>>>>      >  end of a sentence, or find someone who does, then use a regex to
>>>>      >  split the sentences at those signs. It's not going to be perfect:
>>>>      >  you may have to go over it once or twice with your own eyes to
>>>>      >  make sure everything is OK before training. Everything depends on
>>>>      >  the language and how ambiguous its punctuation is.
>>>>      >
>>>>      >
>>>>      >  Jim
>>>>      >
>>>>      >  On 20/03/12 18:38, Jairo Sarabia wrote:
>>>>      >
>>>>      >>  Hi all,
>>>>      >>
>>>>      >>  I see there aren't Sentence Detector models for Asian languages
>>>>      >>  in the OpenNLP repository, and I need them.
>>>>      >>  I have to train Sentence Detector models for Chinese, Japanese
>>>>      >>  and Korean, but I don't know these languages.
>>>>      >>  How could I get training data files for these languages?
>>>>      >>
>>>>      >>  Thanks in advance!
>>>>      >>
>>>>      >>  Jairo Sarabia
>>>>      >>
>>>>      >>
>>>>      >
>>>>
>>>>
>>>
>
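
A minimal sketch of the rule-based approach discussed in the thread
(splitting at sentence-ending punctuation with a regex). It assumes the
text is Chinese or Japanese and that 。, ！ and ？ are the only
end-of-sentence marks. The names EOS_CHARS, CLOSERS and split_sentences
are hypothetical and not part of OpenNLP. The comma is left out because,
as the paper cited in the thread notes, it is ambiguous. Output like this
would still need a manual check before being used as training data.

```python
import re

# Assumed end-of-sentence marks for Chinese/Japanese text (an assumption,
# not an OpenNLP setting): ideographic full stop, fullwidth '!' and '?'.
EOS_CHARS = "。！？"

# Closing quotes/brackets that are kept attached to the sentence they end.
CLOSERS = "」』）”’"

# Either: any run of non-EOS characters, one or more EOS marks, optional closers;
# or: a trailing run of text that has no EOS mark at all.
_SENTENCE = re.compile(
    rf"[^{EOS_CHARS}]*[{EOS_CHARS}]+[{CLOSERS}]*|[^{EOS_CHARS}]+$"
)


def split_sentences(text):
    """Split text into sentences at the assumed EOS marks."""
    pieces = (m.group().strip() for m in _SENTENCE.finditer(text))
    return [p for p in pieces if p]


if __name__ == "__main__":
    # Writes one sentence per line, ready for a manual review pass.
    for sentence in split_sentences("今日は晴れです。明日は雨ですか？はい！"):
        print(sentence)
```

Text after the last mark is returned as a final, unterminated sentence
rather than dropped, so nothing is lost before the manual pass.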
