Re: Asian Sentence Detector Models

Jörn Kottmann Wed, 21 Mar 2012 16:37:58 -0700

I don't know, I have never worked with Asian languages,
but it would of course be nice to improve our support in this area.
Especially the basic tasks like sentence detection and
tokenization are of great interest to many.


Jörn


On 03/22/2012 12:22 AM, James Kosin wrote:

Jorn,

If there isn't anything for Korean, I could put something together.
Only problem would be getting free text.
I can start looking if needed.

James

On 3/21/2012 2:38 PM, Jörn Kottmann wrote:

Here is a paper which describes Chinese sentence segmentation:
www.aclweb.org/anthology/P/P11/P11-2111.pdf

There they say that commas can be an end-of-sentence marker as well,
but they are ambiguous.

So we would need to add it as an eos char and
we should create a new feature generator.
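
As a rough illustration of that idea, the sketch below (Python, not OpenNLP
code) treats the full-width comma as one more candidate eos char and emits a
few context features for each candidate. The function names, the feature set
and the window size are all hypothetical; they are not taken from the paper or
from OpenNLP's own feature generators.

```python
# Candidate end-of-sentence characters (assumed set). The full-width comma is
# included because it can end a sentence but is ambiguous.
EOS_CHARS = {"。", "！", "？", "，"}


def candidate_positions(text):
    """Return the index of every character that might end a sentence."""
    return [i for i, ch in enumerate(text) if ch in EOS_CHARS]


def context_features(text, pos, window=2):
    """Build simple string features around one candidate position.

    Hypothetical feature set: the candidate character itself, up to
    `window` characters on each side, and whether it is the last character.
    """
    left = text[max(0, pos - window):pos]
    right = text[pos + 1:pos + 1 + window]
    return [
        "eos=" + text[pos],
        "left=" + left,
        "right=" + right,
        "at_end=" + str(pos == len(text) - 1),
    ]
```

A classifier trained on labelled text would then decide, per candidate, whether
the comma really closes a sentence; the unambiguous marks could bypass the
classifier entirely.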

Are there any free training data sets which could be used?

Jörn


On 03/21/2012 03:34 PM, Joern Kottmann wrote:

Wikipedia says: "Languages like Japanese and Chinese have unambiguous
sentence-ending markers."
In this case we might be able to write a rule-based sentence detector
for these languages?
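
A minimal sketch of what such a rule-based detector could look like, written in
Python for brevity rather than against the OpenNLP API. It assumes that the
full-width marks 。 ！ ？ are the only sentence terminators and that a closing
quote or bracket right after a terminator belongs to the same sentence; both
are simplifying assumptions, and the names EOS_CHARS, CLOSERS and
split_sentences are made up for this example.

```python
import re

# Assumed terminator set: ideographic full stop, full-width "!" and "?".
EOS_CHARS = "。！？"
# Assumed closing quotes/brackets that may trail a terminator.
CLOSERS = "」』”’）"

# Either: text up to a run of terminators plus optional closers,
# or: trailing text that has no terminator at all.
_SENTENCE = re.compile(
    r"[^{eos}]*[{eos}]+[{closers}]*|[^{eos}]+$".format(eos=EOS_CHARS, closers=CLOSERS)
)


def split_sentences(text):
    """Split text after each terminator run; drop empty pieces."""
    pieces = (m.group().strip() for m in _SENTENCE.finditer(text))
    return [p for p in pieces if p]
```

For example, split_sentences("你好。再见") yields the two pieces "你好。" and
"再见"; text without a final terminator is kept as its own sentence.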

Jörn

On Wed, Mar 21, 2012 at 3:18 PM, [email protected] wrote:

     Hi

     There is a Thai model for the sentence detector. I don't know who
     created it, but someone on the list knows and can point to an article
     about it.
     What I can say is that OpenNLP had to be customized to work with Thai,
     including the EOS characters, which are ' ' and '\n'.


http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/sentdetect/lang/th/SentenceContextGenerator.java?view=markup


     William


     On Wed, Mar 21, 2012 at 8:05 AM, Jim - FooBar(); <[email protected]> wrote:

     >  Basically you need to know the punctuation signs indicating end of
     >  sentence, or find someone who does... then use a regex to split the
     >  sentences at those signs! It's not gonna be perfect - you may have to
     >  go over it once or twice with your own eyes to make sure everything is
     >  OK before training. Everything depends on the language and how
     >  ambiguous its punctuation is.
     >
     >
     >  Jim
     >
     >  On 20/03/12 18:38, Jairo Sarabia wrote:
     >
     >>  Hi all,
     >>
     >>  I see there aren't any Sentence Detector models for Asian languages
     >>  in the OpenNLP repository, and I need them.
     >>  I have to train Sentence Detector models for Chinese, Japanese and
     >>  Korean, but I don't know these languages.
     >>  How could I get training data files for these languages?
     >>
     >>  Thanks in advance!,
     >>
     >>  Jairo Sarabia
     >>
     >>
     >
